Rediscovering a little known fact about the t-test: algebraic, geometric, distributional and graphical considerations

Jennifer A. Sinnott, Steven MacEachern and Mario Peruggia
Department of Statistics, The Ohio State University, 1958 Neil Ave, Columbus, OH 43210
May 15, 2020
Abstract
We discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision about that hypothesis. To construct the test statistic for a point null hypothesis about a binomial proportion, a common recommendation is to act as if the null hypothesis is true. We argue that, on the surface, the one-sample t-test of a point null hypothesis about a Gaussian population mean does not appear to follow the recommendation. We show how simple algebraic manipulations of the usual t-statistic lead to an equivalent test procedure consistent with the recommendation, we provide geometric intuition regarding this equivalence, and we consider extensions to testing nested hypotheses in Gaussian linear models. We discuss an application to graphical residual diagnostics where the form of the test statistic makes a practical difference. We argue that these issues should be discussed in advanced undergraduate and graduate courses.
Keywords:
Binomial proportion; F-test; nested models; null hypothesis; orthogonal sum of squares decomposition; test statistic.

1 Introduction

Among the first procedures taught in an introductory statistics class are hypothesis testing and confidence interval estimation for a proportion (see, e.g., Moore et al., 2012). For example, a student may be given data on the sexes of a sample of $n$ babies born during a certain time period and be asked either to estimate the true proportion $p$ of babies born male and provide a confidence interval, or to test whether the proportion is equal to, for example, 0.5. (There is ample evidence that this proportion is larger than 0.5 in most of the world; see, e.g., Chao et al., 2019.) Typically, for large $n$, the distribution of the sample proportion is approximated by $\hat{p} \,\dot{\sim}\, N(p,\, p(1-p)/n)$, and two slightly different procedures are introduced. For estimation and confidence interval construction, $\hat{p}$ is commonly plugged into the variance formula, and a $100(1-\alpha)\%$ confidence interval is calculated as

  $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}.$   (1)

For testing $H_0: p = p_0$ for a pre-specified $p_0$, students are advised to act as though the null were true, and use the null to construct the test statistic. As a result, $p_0$ is plugged into the variance formula, producing the test statistic

  $\frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}.$   (2)

Although many different approaches to both testing and interval estimation have been proposed — and many commonly used statistical software packages do not use these formulas exactly — in the authors' experience, the above methods are still frequently taught for hand calculation in introductory statistics classes of various levels. For instance, Example 10.3.5 in Casella and Berger (2002) discusses precisely two test procedures based on test statistics that use $\hat{p}$ or $p_0$ to estimate the variance, commenting on their relative merits in terms of a comparison of their power functions. For further discussions of procedures used in the one-sample proportion setting, see, e.g., Agresti and Coull (1998) and Yang and Black (2019).

Also among the first procedures taught are estimation and hypothesis testing for the mean $\mu$ of a normal $N(\mu, \sigma^2)$ population with unknown variance $\sigma^2$. For example, a student may be given data on the heights of a random sample of U.S. women and be asked to estimate the true mean height, or test whether it is equal to some specified value. If our data consist of a random sample $Y_1, \ldots, Y_n$ from the $N(\mu, \sigma^2)$ population, then $\bar{Y} \sim N(\mu, \sigma^2/n)$, and a confidence interval is constructed analogously to (1), as

  $\bar{Y} \pm t_{n-1,\,\alpha/2}\, S/\sqrt{n}, \quad \text{where} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$   (3)

is the sample variance. (This follows from observing that $T := (\bar{Y} - \mu)/(S/\sqrt{n})$ has a $t$ distribution with $n-1$ degrees of freedom, a consequence of estimating $\sigma$ with $S$.) To test $H_0: \mu = \mu_0$ for a pre-specified $\mu_0$, we can, analogously to (2), invoke the null. When $H_0$ holds, we know $\mu = \mu_0$ but still need to estimate $\sigma^2$. Since $\mu$ is known, the most efficient estimator of $\sigma^2$ is

  $S_0^2 := \frac{1}{n}\sum_{i=1}^{n} (Y_i - \mu_0)^2.$

Our test statistic would thus be

  $T_0 := \frac{\bar{Y} - \mu_0}{S_0/\sqrt{n}}.$

But, of course, people do not use this test statistic! Instead, they construct a statistic that ignores the information that $\mu = \mu_0$ provided by $H_0$, and perform the standard one-sample $t$-test using the test statistic

  $T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}.$

At first glance, one might suspect that using this test statistic would be less efficient than using $T_0$, since its denominator has $n-1$ degrees of freedom rather than $n$. We are thus led to wonder why information provided by the null is discarded in constructing the one-sample $t$-test. In the remainder of the paper we clarify this question and present a more general perspective.

2 Equivalence of the two test procedures

The connection between the two methods proposed at the end of the previous section can be established from an algebraic and from a geometric point of view. We look at these two approaches separately.
2.1 Algebraic considerations

The first point to make is that any intuition that a test based on $T$ rather than $T_0$ could be more efficient is wrong: a tail-area test based on $T$ and one based on $T_0$ produce identical answers. This is because $T$ is a one-to-one, increasing function of $T_0$,

  $T = \frac{\sqrt{n-1}\, T_0}{\sqrt{n - T_0^2}},$   (4)

over the interval $(-\sqrt{n}, \sqrt{n})$, which is the set of possible values for $T_0$. Specifically, for any fixed $\alpha$, with $0 \le \alpha \le 1$, let $c_\alpha \ge 0$ be the critical value of a size $\alpha$ test based on $T_0$. The rejection region of this test is $R_{T_0} = \{ y = (y_1, \ldots, y_n)^T : |T_0(y)| \ge c_\alpha \}$. Because the transformation in Equation (4) is monotonically increasing on $[0, \sqrt{n})$, the set $R_T = \{ y = (y_1, \ldots, y_n)^T : |T(y)| \ge \sqrt{n-1}\, c_\alpha / \sqrt{n - c_\alpha^2} \}$ satisfies $R_T = R_{T_0}$. It follows that the test that rejects if and only if $|T(y)| \ge \sqrt{n-1}\, c_\alpha / \sqrt{n - c_\alpha^2}$ has the exact same rejection region (in sample space) as the test that rejects when $|T_0(y)| \ge c_\alpha$. The two tests must then have the same size and power function and are therefore equivalent.

As noted by a reviewer, a simple way to establish Equation (4) is to recognize that the one-sample $t$-test can be derived as a likelihood ratio test that rejects $H_0: \mu = \mu_0$ when the ratio

  $\lambda(Y) = \frac{\sup_{\sigma^2} L(\mu_0, \sigma^2 \mid Y)}{\sup_{\mu, \sigma^2} L(\mu, \sigma^2 \mid Y)}$

is small or, equivalently, when the ratio of sums of squares under the null and full model,

  $R = \frac{\sum_{j=1}^{n} (Y_j - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \bar{Y})^2},$   (5)

is large. This ratio can be expressed as

  $R = \frac{\sum_{j=1}^{n} (Y_j - \bar{Y})^2 + n(\bar{Y} - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \bar{Y})^2} = 1 + \frac{T^2}{n-1}$

or as

  $R = \frac{\sum_{j=1}^{n} (Y_j - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \mu_0)^2 - n(\bar{Y} - \mu_0)^2} = \frac{1}{1 - T_0^2/n}.$

The former expression leads to the standard $t$-test based on $T$, while the latter leads to the test based on $T_0$. Equating these two expressions yields the identity of Equation (4).

This relationship between $T$ and $T_0$, of course, is not new: for example, it arises substantively in Lehmann's approach for demonstrating that the one-sample $t$-test is a uniformly most powerful (UMP) unbiased test of $H_0: \mu = \mu_0$ vs. $H_A: \mu \ne \mu_0$ (Lehmann, 1986). The full details of the argument are best left to Lehmann, but, very briefly, for parameters in exponential family distributions, Lehmann's Theorem 1 in Chapter 5 gives a set of conditions about the form of a test statistic in relation to the family's sufficient statistics. When these conditions are satisfied, a test based on the test statistic is UMP unbiased. The set of conditions Lehmann provides is satisfied by $T_0$ rather than $T$, and the UMP unbiasedness of the $t$-test is then established by exhibiting that $T$ is a one-to-one function of $T_0$.

Interestingly, this equivalence does not seem to be widely known (at least based on our informal surveying of several colleagues). This is somewhat surprising. In fact, in addition to appearing in Lehmann's book, the algebraic equivalence of the test statistics is periodically mentioned in the literature (see, e.g., Lefante Jr and Shah, 1986; Good, 1986; Shah and Lefante Jr, 1987; Shah and Krishnamoorthy, 1993; LaMotte, 1994). Therefore, we believe students should be routinely exposed to this equivalence and given more insight into its essence, emphasizing not only the algebraic aspect, but also the geometric and distributional implications.
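Equation (4) and the two expressions for the ratio $R$ of Equation (5) are easy to check numerically. The following minimal sketch (our illustration, not part of the original paper; plain Python standard library, with variable names of our choosing) simulates a small sample and confirms the identities:

```python
import math
import random

random.seed(1)
n, mu0 = 12, 5.0
y = [random.gauss(5.4, 2.0) for _ in range(n)]  # the data need not follow H0

ybar = sum(y) / n
S2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # usual sample variance
S0sq = sum((yi - mu0) ** 2 for yi in y) / n        # "null" variance estimate

T = (ybar - mu0) / math.sqrt(S2 / n)
T0 = (ybar - mu0) / math.sqrt(S0sq / n)

# Equation (4): T is a one-to-one increasing function of T0 on (-sqrt(n), sqrt(n))
T_from_T0 = math.sqrt(n - 1) * T0 / math.sqrt(n - T0 ** 2)

# The ratio R of Equation (5), computed directly and via its two expressions
R = sum((yi - mu0) ** 2 for yi in y) / sum((yi - ybar) ** 2 for yi in y)

print(abs(T - T_from_T0) < 1e-9)                 # True
print(abs(R - (1 + T ** 2 / (n - 1))) < 1e-9)    # True
print(abs(R - 1 / (1 - T0 ** 2 / n)) < 1e-9)     # True
```

Because the map is monotone, any tail-area decision based on $T$ coincides with the corresponding decision based on $T_0$, as argued above.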
2.2 Geometric considerations

The second point is that the equivalence of $T$ and $T_0$ can be understood geometrically because they can both be viewed as trigonometric functions of the same angle, and it is possible to express any trigonometric function in terms of any other trigonometric function, up to sign. To see the geometric relationship, define the vectors $v = (Y_1 - \mu_0, Y_2 - \mu_0, \ldots, Y_n - \mu_0)^T$ and $\mathbf{1} = (1, 1, \ldots, 1)^T$. Then, the orthogonal projection of $v$ onto $\mathbf{1}$ is $u = (\bar{Y} - \mu_0)\mathbf{1}$, and the Pythagorean Theorem implies

  $\|v\|^2 = \|u\|^2 + \|v - u\|^2,$

i.e.,

  $\sum_{i=1}^{n} (Y_i - \mu_0)^2 = n(\bar{Y} - \mu_0)^2 + \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$

i.e.,

  $\text{SSTO} = \text{SST} + \text{SSE},$

where we introduce analysis of variance terminology, with SSTO, SST, and SSE indicating the Sums of Squares for Total, Treatment, and Error, respectively. Thus, if we define $\theta$ to be the angle between $\mathbf{1}$ and $v$, then

  $T_0^2 = n\,\frac{\text{SST}}{\text{SSTO}} = n\cos^2\theta \quad \text{and} \quad T^2 = (n-1)\,\frac{\text{SST}}{\text{SSE}} = (n-1)\cot^2\theta.$

These geometric relationships are illustrated in Figure 1. Using basic trigonometric expressions it is easy to derive the stated algebraic relationship between $T$ and $T_0$. In fact,

  $T^2 = (n-1)\cot^2\theta = (n-1)\,\frac{\cos^2\theta}{\sin^2\theta} = (n-1)\,\frac{\cos^2\theta}{1 - \cos^2\theta}.$

[Figure 1 about here: right triangle with hypotenuse $v$, legs $u$ and $v-u$, and angle $\theta$, where $a = \|v\| = \sqrt{\text{SSTO}}$, $b = a\cos\theta = \|u\| = \sqrt{\text{SST}}$, $c = a\sin\theta = \|v-u\| = \sqrt{\text{SSE}}$, $T_0^2 = n\,(b^2/a^2) = n\cos^2\theta$, and $T^2 = (n-1)\,(b^2/c^2) = (n-1)\cot^2\theta$.]

Figure 1: Geometric representation of the test statistics $T_0$ and $T$.

Substituting $\cos^2\theta = T_0^2/n$ into this expression and taking square roots on both sides (making sure the signs match, as they should) yields Equation (4).

3 Extensions to Gaussian linear models

3.1 Testing nested hypotheses

The results presented in the previous section are not specific to the $t$-test setting. In fact, constructing a test statistic by invoking the null hypothesis and constructing it in the "traditional" way produces equivalent test procedures across a range of linear models. This connection can be established by rewriting the two statistics as functions of different terms in the orthogonal decomposition of the sum of squares.

For instance, consider the standard linear model $Y = X\beta + \epsilon$, where $Y = (Y_1, \ldots, Y_n)^T$ is a vector of observations, $X$ is an $n \times p$ design matrix of rank $p < n$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression parameters, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is an error vector with elements $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Suppose we wish to test the hypothesis that the parameters in a certain subset of size $p_2$ are all zero. Without loss of generality we can assume that the parameters of interest are the last $p_2 < p$ and rewrite the model as

  $Y = X_1\beta_1 + X_2\beta_2 + \epsilon,$

where $X = [X_1 \mid X_2]$ and $\beta = (\beta_1^T, \beta_2^T)^T$, with $\beta_i$ of dimension $p_i$ for $i = 1, 2$, and $p_1 + p_2 = p$. The testing problem concerning the nested model can then be stated as $H_0: \beta_2 = 0$ vs. $H_A: \beta_2 \ne 0$.

Both the "traditional" and the "null hypothesis" testing procedures try to quantify the importance of the reduction in error sums of squares that ensues from entertaining the full model rather than the reduced model, but they differ in the comparison yardstick they use. The "traditional" procedure uses a yardstick based on the full model. The "null hypothesis" procedure uses a yardstick based on the reduced model with $\beta_2 = 0$.

Geometrically, the statistics arise from a sequence of projections.
Specifically, define

  $P_1 = X_1(X_1^T X_1)^{-1} X_1^T, \quad Q_1 = I - P_1,$

and

  $P = X(X^T X)^{-1} X^T, \quad Q = I - P.$

The matrix $P_1$ operates an orthogonal projection onto the space spanned by the columns of the reduced design matrix $X_1$, and the matrix $P$ operates an orthogonal projection onto the space spanned by the columns of the full design matrix $X$. Under the reduced model, the vector of predicted values is $\hat{Y}_R = P_1 Y$, the vector of residuals is $r_R = Y - \hat{Y}_R = Q_1 Y$, and the residual sum of squares is

  $\text{SSE}_R = Y^T Q_1^T Q_1 Y = Y^T Q_1 Y.$

Similarly, under the full model, the vector of predicted values is $\hat{Y}_F = P Y$, the vector of residuals is $r_F = Y - \hat{Y}_F = Q Y$, and the residual sum of squares is

  $\text{SSE}_F = Y^T Q Y.$

The reduction in sums of squares ensuing from fitting the larger model is given by

  $\text{SS}_{2|1} = \text{SSE}_R - \text{SSE}_F = Y^T (Q_1 - Q) Y = Y^T (P - P_1) Y.$

The "traditional" procedure compares $\text{SS}_{2|1}$ to $\text{SSE}_F$, the error sum of squares for the full model, while the "null hypothesis" procedure compares $\text{SS}_{2|1}$ to $\text{SSE}_R = \text{SS}_{2|1} + \text{SSE}_F$, the error sum of squares for the reduced model envisioned to hold under the null. After adjusting for the degrees of freedom of the various sums of squares, the resulting test statistics are

  $F_{\text{trad}} = \frac{\text{SS}_{2|1}/p_2}{\text{SSE}_F/(n-p)} \quad \text{and} \quad F_{\text{null}} = \frac{\text{SS}_{2|1}/p_2}{\text{SSE}_R/(n-p_1)} = \frac{\text{SS}_{2|1}/p_2}{(\text{SS}_{2|1} + \text{SSE}_F)/(n-p_1)},$

respectively.

The orthogonal decomposition at play in this setting is analogous to the one presented in Section 2 and is described in Figure 2, along with the relationships between its various elements. Algebraic and trigonometric manipulations similar to those outlined in Section 2 show that $F_{\text{trad}}$ is a one-to-one, increasing function of $F_{\text{null}}$ over $(0, (n-p_1)/p_2)$, the set of possible values for $F_{\text{null}}$:

  $F_{\text{trad}} = \frac{(n-p)\, F_{\text{null}}}{n - p_1 - p_2 F_{\text{null}}}.$   (6)

Thus, as in the case of the $t$-test, tail-area tests using $F_{\text{trad}}$ and $F_{\text{null}}$ are identical.
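As a concrete check of Equation (6), the sketch below (our illustration, assuming NumPy is available; the simulated design and variable names are ours) fits a full and a reduced linear model by least squares and verifies that $F_{\text{trad}}$ equals the stated monotone transformation of $F_{\text{null}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 30, 2, 2          # full design has p = p1 + p2 columns
p = p1 + p2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X1 = X[:, :p1]                # reduced design: first p1 columns
beta = np.array([1.0, 2.0, 0.5, -0.5])
y = X @ beta + rng.normal(size=n)

def sse(Xmat, y):
    """Residual sum of squares from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    r = y - Xmat @ coef
    return r @ r

sse_full = sse(X, y)
sse_red = sse(X1, y)
ss_2_1 = sse_red - sse_full   # reduction in error sum of squares

F_trad = (ss_2_1 / p2) / (sse_full / (n - p))
F_null = (ss_2_1 / p2) / (sse_red / (n - p1))

# Equation (6): F_trad as an increasing function of F_null
F_map = (n - p) * F_null / (n - p1 - p2 * F_null)
print(np.isclose(F_trad, F_map))  # True
```

The same monotone map also sends the critical value of one test to that of the other, so the two rejection regions coincide exactly, mirroring the t-test case.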
Note that, when $p = 1$, $p_1 = 0$, and $p_2 = 1$, the relationship between $F_{\text{trad}}$ and $F_{\text{null}}$ given in Equation (6) reduces to the relationship between $T^2$ and $T_0^2$ implied by Equation (4).

3.2 Distributional considerations

The implementation of either test procedure requires knowledge of the distribution of the corresponding test statistic under the null hypothesis. Using the notation introduced in Figure 2, standard distributional results imply that, under the null hypothesis,

  $b^2/\sigma^2 = \text{SS}_{2|1}/\sigma^2 \sim \chi^2_{p_2}, \qquad c^2/\sigma^2 = \text{SSE}_F/\sigma^2 \sim \chi^2_{n-p},$

with $b^2$ independent of $c^2$. Then,

  $F_{\text{trad}} = \frac{b^2/p_2}{c^2/(n-p)} \sim F_{p_2,\, n-p},$

as it is the ratio of two independent chi-square random variables divided by their degrees of freedom. Also,

  $\frac{p_2}{n-p_1}\, F_{\text{null}} = \frac{b^2}{b^2 + c^2} \sim \text{Beta}\!\left(\frac{p_2}{2}, \frac{n-p}{2}\right),$

as it is the ratio between a chi-square random variable and the sum of that chi-square random variable and an independent chi-square random variable.

[Figure 2 about here: right triangle with hypotenuse $r_R$, legs $r_R - r_F$ and $r_F$, and angle $\theta$, where $r_R$ and $r_F$ denote the residual vectors under the reduced and full models, $a = \|r_R\| = \sqrt{\text{SSE}_R}$, $b = a\cos\theta = \|r_R - r_F\| = \sqrt{\text{SS}_{2|1}}$, $c = a\sin\theta = \|r_F\| = \sqrt{\text{SSE}_F}$, $F_{\text{null}} = [(n-p_1)/p_2]\,(b^2/a^2) = [(n-p_1)/p_2]\cos^2\theta$, and $F_{\text{trad}} = [(n-p)/p_2]\,(b^2/c^2) = [(n-p)/p_2]\cot^2\theta$.]

Figure 2: Geometric representation of the decomposition of the sums of squares for testing a nested hypothesis in the general linear model.

3.3 An application to graphical residual diagnostics

While the test procedures based on $F_{\text{trad}}$ and $F_{\text{null}}$ produce identical inferences, the realized values of the test statistics are different. In this section we consider a situation in which, arguably, it is preferable to work with one of the two statistics rather than the other.

Residual plots are effective graphical devices for assessing the quality of the fit of a linear regression model and for detecting potential outliers. As noted in Section 9.4.1 of Weisberg (2014), a simple test for determining if observation $i$ is an outlier in a regression model that includes $p_1$ predictors is to include an additional predictor which is an indicator of the observation in question (i.e., a 0-1 vector whose only element equal to 1 is the $i$-th one) and to test if the regression coefficient of the indicator is equal to zero.

Assuming normal errors for the regression model and letting $p_2 = 1$, it is natural to cast this problem into the framework of Section 3.1 and compare the full model with $p = p_1 + p_2$ predictors (the original predictors and the indicator of observation $i$) and the nested model that omits the indicator variable. Observation $i$ is declared an outlier if the null hypothesis that the coefficient of its indicator variable is zero is rejected.

The traditional statistic for this problem is $F_{\text{trad}}$, which has an $F_{1,\, n-p}$ distribution under the null. The square root of $F_{\text{trad}}$ (with sign matching the sign of the regression residual for observation $i$) is the usual $t$ statistic for outlier detection described by Weisberg (2014). It is also a quantity known as the studentized residual for observation $i$, a normalized version of the raw residual computed using an estimate of the error variance that omits observation $i$ from the calculation.
Conceptually, this point of view is appealing because, if the null hypothesis were violated and observation $i$ were indeed an outlier, its inclusion in the calculation would inflate the estimate of the error variance.

On the other hand, as seen in Section 3.1, the same test could also be performed using the statistic $F_{\text{null}}$. The signed square root of $F_{\text{null}}$ turns out to be what is called the standardized residual for observation $i$, a normalized version of the raw residual computed using an estimate of the error variance that uses all observations, including observation $i$. This would be the natural calculation to perform if one were to assume that the null hypothesis were true. The deterministic relationship between studentized and standardized residuals mirrors, on the square root scale, the deterministic relationship between $F_{\text{trad}}$ and $F_{\text{null}}$ and is discussed in Weisberg (2014). Ultimately, because of the deterministic relationships relating $F_{\text{trad}}$, $F_{\text{null}}$, and the two residual test statistics, an outlier test based on any of these four statistics leads to the same decision.

Residual plots are often used to conduct an exploratory assessment of the fit of the regression model. In this type of analysis, the plots are scanned visually for the existence of identifiable patterns and idiosyncratic features that might reveal violations of the modeling assumptions. With regard to outlier detection specifically, plots of residuals vs. fitted values are inspected to reveal the presence of unusually large residuals. We argue that, owing to the nonlinearity of the transformation that relates standardized residuals to studentized residuals, a studentized residual plot is better suited than a standardized residual plot to achieve this goal.

We illustrate this point with an example based on a subset of the data on brain and body weights for 100 species of placental mammals reported in Sacher and Staffeldt (1974). Here, for the measurements on the 21 species of primates included in the data set, we consider the simple linear regression of the natural logarithm of brain weight on the natural logarithm of body weight. Standardized and studentized residual plots are presented in the top row of Figure 3. Two species stand out: Homo sapiens (with large positive residuals) and Gorilla gorilla (with large negative residuals). Both are flagged as outliers at the 0.05 level with respective p-values of 0.0034 and 0.0301 (unadjusted for multiplicity of comparisons).

The extent to which these two species outlie compared to the other 19 species is clearly different. As evidenced visually in both plots, the residual for Homo sapiens is further removed from the bulk of the residuals than the residual for Gorilla gorilla, and this impression is more notably accentuated in the studentized residual plot. This is due to the nonlinear relationship between standardized and studentized residuals, which causes the difference in absolute size between the two to increase monotonically as the absolute size of the standardized residual goes from 1 to infinity. In particular, comparing the two residual plots, the size of this difference becomes very noticeable when the absolute value of the standardized residual exceeds a value of about 2.5.

In our example, the absolute difference between studentized and standardized residuals is 0.6563 (very noticeable) for Homo sapiens, 0.2394 (noticeable) for Gorilla gorilla, and between 0.0011 and 0.0273 (hardly noticeable) for all other species. The displays in the bottom row of Figure 3, being based on $F_{\text{null}}$ and $F_{\text{trad}}$, which are the squared versions of the standardized and studentized residuals, emphasize even more the features just described. In summary, the displays based on the studentized residuals and on $F_{\text{trad}}$ can focus the analyst's attention on the most extreme cases more effectively than those based on the standardized residuals and on $F_{\text{null}}$.

[Figure 3 about here: four scatterplots of residuals vs. fitted values for the 21 primate species, with Homo sapiens and Gorilla gorilla labeled in each panel.]
Figure 3: Standardized and studentized residuals vs. fitted values for the primates data(top row) and their squared counterparts (bottom row).
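The deterministic mapping between the two kinds of residuals can be verified directly on simulated data (our sketch, assuming NumPy is available; it does not use the primates data, and the variable names are ours). The externally studentized residual, recomputed here from an explicit leave-one-out fit, matches the monotone transformation of the standardized residual implied by Equation (6) with $p_2 = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 25, 3                     # k design columns incl. intercept (p1 in the text)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)
y[0] += 4.0                      # perturb one observation to mimic an outlier

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
e = y - H @ y                           # raw residuals
h = np.diag(H)
sigma2 = e @ e / (n - k)                # variance estimate from all observations

i = 0
r_std = e[i] / np.sqrt(sigma2 * (1 - h[i]))   # standardized residual

# Externally studentized residual: variance estimated after deleting observation i
mask = np.arange(n) != i
coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
ri = y[mask] - X[mask] @ coef
sigma2_i = ri @ ri / (n - 1 - k)
r_stud = e[i] / np.sqrt(sigma2_i * (1 - h[i]))

# Monotone mapping between the two residuals (Equation (6) with p2 = 1)
r_map = r_std * np.sqrt((n - k - 1) / (n - k - r_std ** 2))
print(np.isclose(r_stud, r_map))  # True
```

Because the map stretches large residuals much more than small ones, the studentized version visually separates extreme cases from the bulk, which is the point made in the primates example above.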
4 Discussion

The idea of constructing a test statistic by pretending that the null hypothesis is true is routinely presented as a general guideline when using binomial data for testing the hypothesis that a population proportion is equal to a given value. Yet, this guideline is not followed, at least on the surface, when normal data are used to build the $t$-test for testing the hypothesis that the population mean is equal to a given value. As we noted in the paper, the $t$-test is actually equivalent to a procedure based on a test statistic derived by following the guideline, but making the connection requires a little algebra, and is, to our knowledge, not typically made in introductory statistics classes. We have also noted that the same considerations presented for the $t$-test extend to the use of the $F$-test for testing hypotheses concerning nested linear models with Gaussian errors.

So, we are left to speculate why, in the case of the $t$-test and of the $F$-test, the "traditional" procedure is preferred to the "null hypothesis" procedure. If a formal comparison is required, there is no clear distributional advantage of one approach over the other. For the comparison of nested linear models, under the null, the "traditional" procedure requires calculation of the tail area of an $F$ distribution and the "null hypothesis" procedure requires calculation of the tail area of a Beta distribution. If a power calculation has to be performed under some alternative, it can be based on the non-central $F$ distribution for the traditional procedure and on the Type I non-central Beta distribution for the "null hypothesis" procedure, again with no clear advantage of one approach over the other. Similar considerations apply to the case of the $t$-test.

An appealing aspect of the "traditional" procedures is that the $t$-statistic $T$ and the $F$-statistic $F_{\text{trad}}$ are both constructed as ratios of independent quantities. Because, in both cases, the decision rule is based on an assessment of the relative size of the numerator and denominator, it is conceivable that independence may have been a key factor in establishing the tradition, as an informal comparison of independent quantities is easier. Under the null, the denominators of the "null hypothesis" test statistics are more efficient estimators of variability (have more degrees of freedom) than their "traditional" counterparts. However, this gain in efficiency is offset by the dependence between numerator and denominator (see LaMotte, 1994, for a related discussion).

The fundamental question raised by the examples presented in this article concerns the role that the null hypothesis should play in the testing paradigm. By assumption, the null hypothesis is taken to be true in order to assess statistical significance, but to what extent should one rely on it to construct the test statistic? When confronted with a new statistical model and a new parameter of interest, it can be somewhat of an art to determine a good choice of test statistic. Three common "automatic" approaches for constructing test statistics from likelihoods privilege the null differently: score tests are typically built under the null; Wald tests are typically built under the alternative; and likelihood ratio tests compare the null and the alternative somewhat equally.

We can see this trait of likelihood ratio tests in the specific problem of testing a nested reduced model against the full model in the Gaussian linear model setting. There, the likelihood ratio test rejects the null hypothesis when the ratio

  $\lambda(Y, X) = \frac{\sup_{\beta_1, \sigma^2} L(\beta_1, \sigma^2 \mid Y, X_1)}{\sup_{\beta, \sigma^2} L(\beta, \sigma^2 \mid Y, X)}$

is small or, equivalently, when the ratio $\text{SSE}_R/\text{SSE}_F$ of the error sum of squares under the reduced (null) model and the full model is large, ultimately leading to the equivalent tests based on $F_{\text{null}}$ and $F_{\text{trad}}$. This structure of the likelihood ratio test for nested models was already noted for the special case presented in Section 2.1, when discussing the derivation of a test based on the ratio of Equation (5).

In addition to the basic guiding principles, other considerations may be at play when a certain tradition is established of preferring one form of a test procedure over another for a given problem. For the nested model comparison, we already noted one desirable feature exhibited by $F_{\text{trad}}$, namely that its numerator and denominator are independent. Another feature worth noting is that the denominator of $F_{\text{trad}}$ does not depend on the particular reduced model under consideration, while the denominator of $F_{\text{null}}$ does. Although this is not much of a computational burden, it is intuitively appealing to be able to use the same yardstick in the denominator when testing different nested models against the same full model. Further, the graphical example of Section 3.3 illustrates that when the value of the statistic itself is of interest, rather than the formal testing decision, there may be practical reasons for preferring the use of one statistic over the other.

In sum, while we do not have a conclusive explanation as to why certain traditions have established themselves as the standard of practice for specific problems, we believe that these issues, often overlooked, should be brought to the attention of statistics students in advanced undergraduate and graduate courses. The examples we discussed provide results for common testing problems that can be used to focus the in-class discussion.

Acknowledgments.
This material is based upon work supported by the National Science Foundation under Grants No. SES-1424481, No. DMS-1613110, and No. SES-1921523. The authors declare no conflicts of interest.

References
Agresti, A. and Coull, B. A. (1998). Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.

Casella, G. and Berger, R. (2002). Statistical Inference. Duxbury-Thomson Learning, second edition.

Chao, F., Gerland, P., Cook, A. R., and Alkema, L. (2019). Systematic assessment of the sex ratio at birth for all countries and estimation of national imbalances and regional reference levels. Proceedings of the National Academy of Sciences, 116(19):9303–9311.

Good, I. (1986). Comments, conjectures, and conclusions: C258. Editorial note on C257 regarding the t-test. Journal of Statistical Computation and Simulation, 25(3-4):296–297.

LaMotte, L. R. (1994). A note on the role of independence in t statistics constructed from linear statistics in regression models. The American Statistician, 48(3):238–240.

Lefante Jr, J. J. and Shah, A. K. (1986). C257. A note on the one-sample t-test. Journal of Statistical Computation and Simulation, 25(3-4):295–296.

Lehmann, E. L. (1986). Testing Statistical Hypotheses. John Wiley & Sons.

Moore, D. S., McCabe, G. P., and Craig, B. A. (2012). Introduction to the Practice of Statistics. W. H. Freeman.

Sacher, G. A. and Staffeldt, E. F. (1974). Relation of gestation time to brain weight for placental mammals: implications for the theory of vertebrate growth. The American Naturalist, 108(963):593–615.

Shah, A. K. and Krishnamoorthy, K. (1993). Testing means using hypothesis-dependent variance estimates. The American Statistician, 47(2):115–117.

Shah, A. K. and Lefante Jr, J. J. (1987). C293. A note on using a hypothesis-dependent variance estimate. Journal of Statistical Computation and Simulation, 28(4):347–349.

Weisberg, S. (2014). Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley.

Yang, S. and Black, K. (2019). Using the standard Wald confidence interval for a population proportion hypothesis test is a common mistake.