Why we (usually) don't have to worry about multiple comparisons
Andrew Gelman, Jennifer Hill, Masanao Yajima

July 13, 2009
Abstract
Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover, we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise.

Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p-values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.

Keywords: Bayesian inference, hierarchical modeling, multiple comparisons, Type S error, statistical significance

Researchers from nearly every social and physical science discipline have found themselves in the position of simultaneously evaluating many questions, testing many hypotheses, or comparing many point estimates. In program evaluation this arises, for instance, when comparing the impact of several different policy interventions, or comparing the status of social indicators (test scores, poverty rates, teen pregnancy rates) across multiple schools,

* We thank the participants at the NCEE/IES multiple comparisons workshop for helpful comments and the National Science Foundation, National Institutes of Health, and Columbia University Applied Statistics Center for financial support.
† Department of Statistics and Department of Political Science, Columbia University, New York, [email protected]
‡ Department of Humanities and Social Sciences, NYU Steinhardt, New York University, New York, [email protected]
§ Department of Statistics, University of California, Los Angeles, [email protected]

Multilevel models perform partial pooling: rather than adjusting for multiple comparisons by widening intervals (or, equivalently, adjusting the p-values corresponding to intervals of fixed width), they shift point estimates toward each other. In this way, multilevel estimates make comparisons appropriately more conservative, in the sense that intervals for comparisons are more likely to include zero. As a result we can say with confidence that those comparisons made with multilevel estimates are more likely to be valid. At the same time, this "adjustment" does not sap our power to detect true differences as many traditional methods do.

Rather than correcting for the problems that can arise when examining many comparisons (performing many significance tests), when we work within the Bayesian paradigm all of the relevant research questions can be represented as parameters in one coherent multilevel model. Simply put, rather than correcting for a perceived problem, we just build the multiplicity into the model from the start. This puts more of a burden on the model, and a key goal of this paper is to demonstrate the effectiveness of our procedure in realistic examples.

Sections 2 and 3 present the multiple comparisons problem from the classical and Bayesian perspectives, respectively. Both are described within the context of a common illustrative example, and then potential solutions are outlined. In Section 4, we bolster our argument against traditional multiple comparisons corrections through a series of small examples that illustrate several of the scenarios described above. Section 5 concludes.

In this section we walk through a relatively simple example, using data from a real study, to illustrate the issues involved in performing multiple comparisons from classical and multilevel perspectives.
We use data from the Infant Health and Development Program, an intervention that targeted premature and low-birth-weight infants and provided them with services such as home visits and intensive high-quality child care. The program was evaluated using an experiment in which randomization took place within site and birth weight group. The experimental design was actually slightly more complicated (as described in IHDP, 1990); we simplify here for expository purposes. In fact, for this first illustration we will assume that it was a simple randomized block experiment with eight sites as blocks.

In this context, we're not just interested in the overall treatment effect. Given that the composition of participating children was quite different across sites and that program implementation varied across sites as well, we would like to know for each site individually whether or not a statistically significant effect was present. However, we may be concerned that, in the process of conducting eight different significance tests, we are misperceiving our overall risk of making a false claim. This overall risk of error (formally, the probability that we have any rejections when the null hypothesis in fact holds) is sometimes referred to as the familywise error rate (Tukey, 1953). A similar problem arises if we are interested in comparing whether there are significant differences in treatment effects across sites.
A classical model fit to these data might look like:

y_i = Σ_{j=1}^{8} (γ_j S_{ji} + δ_j S_{ji} P_i) + ε_i,   ε_i ∼ N(0, σ²),

where y_i denotes student i's test score, S_{ji} is an indicator for living in site j, and P_i is an indicator for program status. (The actual analysis also included birth weight group as a predictor in this model, but we ignore this here for simplicity of exposition.) Although this may not be the most common way to specify this model, it is useful here because δ_j represents the treatment effect in the jth site and γ_j represents the average test score for the untreated in each site. This allows us to directly test the significance of each site effect.

For any given test of a null hypothesis, say H_j: δ_j = 0, versus an alternative, say H_jA: δ_j ≠ 0, there is a 5% chance of incorrectly rejecting H_j when in fact it is true. Of course, if we test two independent hypotheses at the same significance level (α = 0.05), then the probability that at least one of these tests yields an erroneous rejection rises to 1 − Pr(neither test yields an erroneous rejection of the null) = 1 − 0.95 · 0.95 = 0.0975 ≈ 0.10.

One of the most basic and historically most popular fixes to this problem is the Bonferroni correction. The Bonferroni correction adjusts the p-value at which a test is evaluated for significance based on the total number of tests being performed: the working p-value threshold is the original threshold divided by the number of tests. Implicitly, it assumes that these test statistics are independent. So in our current example an overall desired significance level of 0.05 would translate into individual tests each using a p-value threshold of 0.05/8 = 0.00625.

Figure 1: Treatment effect point estimates and 95% intervals across the eight Infant Health and Development Program sites. The left panel displays classical estimates from a linear regression. The middle panel displays the same point estimates as in the left panel but with confidence intervals adjusted to account for a Bonferroni correction. The right panel displays posterior means and 95% intervals for each of the eight site-specific treatment effects from a fitted multilevel model.

By reducing the p-value threshold needed to reject the null (or, equivalently, widening the uncertainty intervals), the number of claims of rejected null hypotheses will indeed decrease on average. While this reduces the number of false rejections, it also increases the number of instances in which the null is not rejected when in fact it should have been. Thus, the Bonferroni correction can severely reduce our power to detect an important effect.
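The arithmetic above can be sketched in a few lines. This is an illustrative calculation, not code from the paper, and it assumes independent tests, as the Bonferroni correction implicitly does:

```python
# Sketch: familywise error rate under independence, and the
# Bonferroni-adjusted per-test threshold, for the eight IHDP site tests.

def familywise_error_rate(alpha, n_tests):
    """P(at least one false rejection) if all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that bounds the familywise error rate at alpha."""
    return alpha / n_tests

print(round(familywise_error_rate(0.05, 2), 4))   # → 0.0975, matching the text
print(round(familywise_error_rate(0.05, 8), 4))   # → 0.3366 for the eight site tests
print(round(bonferroni_threshold(0.05, 8), 5))    # → 0.00625
```

Note that with eight independent tests, the chance of at least one false rejection is already about one in three, which is why the per-test threshold drops so sharply.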
Motivated by the shortcomings of the Bonferroni correction, researchers have proposed more sophisticated procedures. The goal of these methods typically is to reduce the familywise error rate (again, the probability of having at least one false positive) without unduly sacrificing power. A natural way to achieve this is by taking account of the dependence across tests. A variety of such corrections exist that rely upon bootstrapping methods or permutation tests (see, for example, Westfall and Young, 1993).

A more recent class of approaches to this problem focuses not on reducing the familywise error rate but instead on controlling the expected proportion of false positives, the "false discovery rate" or FDR (Benjamini and Hochberg, 1995). The rationale is that the researcher should be more worried about a situation in which many tests show up as statistically significant and an unknown proportion of these are erroneous than about a situation in which all but a few tests show up as insignificant. Controlling the false discovery rate rather than the familywise error rate leads to a testing procedure that is less conservative with respect to Type 1 error but more powerful in terms of detecting effects that are real. These tests sit squarely within the Type 1 paradigm. As with procedures to control the familywise error rate, the initial versions assumed independence across tests; however, procedures to control the FDR have also been extended to account for certain types of dependency across tests (Benjamini and Yekutieli, 2001).

Methods that control for the FDR may make particular sense in fields like genetics where one would expect to see a number of real effects amidst a vast quantity of zero effects, such as when examining the effect of a treatment on differential gene expression (Grant et al., 2005).
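As an illustration of this kind of step-up procedure, here is a minimal sketch of the Benjamini–Hochberg FDR rule; the p-values below are invented for illustration and are not the IHDP values:

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure for
# controlling the false discovery rate at level q. The p-values used
# in the example are made up for illustration.

def benjamini_hochberg(p_values, q=0.05):
    """Return (original) indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # find the largest rank k with p_(k) <= (k/m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p))  # → [0, 1]
```

The procedure compares the sorted p-values to an increasing sequence of thresholds, so it rejects more readily than Bonferroni when many small p-values cluster together.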
They may be less useful in social science applications where we are less likely to be testing thousands of hypotheses at a time and where there are less likely to be effects that are truly zero (or at least the distinction between zero and not-zero may be more blurry). For our IHDP example, using a standard procedure (the Simes procedure; see Benjamini and Hochberg, 1995) to control the FDR at a 0.05 level, the tests with the 6 smallest p-values would reject.

Still others might argue that in many situations there is no need to formally control for these error rates. Many of us are already used to informally performing appropriate calibrations. For instance, consider a researcher who presents a table of mean differences in pre-treatment variables across an experimental and a control group in which there is one statistically significant difference. The classical perspective would have us worry, if we performed 20 tests at a 0.05 significance level, that there is a 64% chance that at least one of these will yield a statistically significant result inappropriately. Thus, for instance, a Bonferroni correction could be performed to change the critical value to reflect a p-value of 0.05/20 = 0.0025. In practice, however, many researchers would simply note that we would expect to see at least one such deviation given a 0.05 significance level. Alternatively, research organizations sometimes recommend pre-specifying the tests that will be performed in an officially sanctioned analysis. In either case, the goal is to manage expectations in a way similar to a multiple comparisons correction but perhaps less strictly tied to familywise error rates.

Classical methods typically start with the assumption that the null hypothesis is true—an unhelpful starting point, as we discuss below. Moreover, we argue that classical procedures fail by insufficiently modeling the ensemble of parameters corresponding to the tests of interest (see Louis, 1984). We cannot hope to propose an optimal statistical method for all circumstances. Rather, we present an entirely different perspective on the issue and its implications and argue that, when viewed from a Bayesian perspective, many of these problems simply disappear.
The classical perspective worries primarily about Type 1 errors, and we argue that these should not be the focus of our concern. Suppose we've established the following two hypotheses regarding our site-specific treatment effects τ_j for j = 1, ..., J: H_j: τ_j = 0, and H_jA: τ_j ≠ 0. A primary concern from the classical multiple comparisons perspective is that we might erroneously accept H_jA when, in fact, H_j is true (Type 1 error). But do we ever believe that τ_j exactly equals zero? What is the practical importance of such a test? Similarly, a Type 1 error occurs when we mistakenly accept the H_jA that τ_j ≠ τ_k when in fact the H_j that τ_j = τ_k is true. Again, under what circumstances do we truly believe there are absolutely no differences between groups? There may be no practical differences, but this is a distinct point which we shall discuss shortly. Moreover, if true effects are zero, we don't want anything close to a 5% chance of finding statistically significant results.

A more serious concern might be that we claim τ_j > 0 when in fact τ_j < 0, finding a positive effect when in fact the effect is detrimental. A similar phenomenon occurs if we claim that τ_j > τ_k when in fact τ_j < τ_k, for instance claiming that a treatment effect is larger in Miami than in New York when in fact the reverse is true. These are both examples of what is referred to as "Type S" (sign) errors (Gelman and Tuerlinckx, 2000).

However, in policy analysis there is also a fair bit of concern about examples where the differences might actually be very close to zero: for example, comparing different educational interventions, none of which might be very effective. Here we would want to be thinking about "Type M" (magnitude) errors: saying that a treatment effect is near zero when it is actually large, or saying that it's large when it's near zero (Gelman and Tuerlinckx, 2000). In that setting, underpowered studies present a real problem because Type M errors are more likely when uncertainty is high. For instance, it is not uncommon in an underpowered study for a researcher to state that although his estimate is not statistically significantly different from 0, that could simply be a function of the overly large standard error. Ironically, however, large estimates are actually a byproduct of large standard errors.

Figure 2: Two error distributions with differing levels of uncertainty in a situation where there is no effect. The estimator with the sampling distribution with greater uncertainty (standard deviation equal to 3) has a greater probability of yielding a larger estimate than the estimator with the sampling distribution with smaller uncertainty (standard deviation equal to 1). The researcher is more likely to commit a Type M (magnitude) error when the standard deviation is large.

This property is illustrated in Figure 2, which displays two sampling distributions in a situation in which the true effect is zero (or very close to 0). It's clear from this plot that the estimator with the sampling distribution with greater uncertainty (standard deviation equal to 3) is much more likely to produce effect estimates that are large in magnitude than the estimator with relatively less uncertainty (standard deviation equal to 1). Thus, when we switch from examining main effects to subgroup effects, we automatically increase our probability of seeing large estimates and tricking ourselves into thinking that something is going on. Bayesian modeling helps here too, as we shall see below.
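The Type M phenomenon in Figure 2 can be checked with a small simulation. This sketch assumes a true effect of zero and the two standard deviations (1 and 3) used in the figure; the "large estimate" threshold of 2 is chosen arbitrarily for illustration:

```python
# Sketch of the Type M (magnitude) phenomenon: with a true effect of
# zero, an estimator with a larger standard error is far more likely
# to produce an estimate that looks large. Values are illustrative.
import random

random.seed(0)

def prob_large_estimate(se, threshold=2.0, n_sims=100_000):
    """Fraction of simulated estimates exceeding `threshold` in magnitude."""
    draws = (random.gauss(0.0, se) for _ in range(n_sims))
    return sum(abs(d) > threshold for d in draws) / n_sims

print(round(prob_large_estimate(se=1.0), 3))  # about 0.05: rarely looks large
print(round(prob_large_estimate(se=3.0), 3))  # about 0.5: often looks large
```

Conditioning on having seen a "large" estimate, the noisier estimator is thus roughly ten times more likely to have produced it, even though both are centered on zero.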
More strongly, we claim that when viewed within a Bayesian framework, many of these problems disappear or, in the case of Type S and Type M errors, are at least substantially ameliorated. We illustrate with a relatively simple multilevel model for a setting in which individuals in a common site experience the same effect on test scores, as in

y_i ∼ N(γ_{j[i]} + δ_{j[i]} P_i, σ_y²).

Here δ_{j[i]} is the parameter for the treatment effect corresponding to person i's site (indexed by j), and it is assigned its own distribution, for example

δ_j ∼ N(μ, σ_δ²).
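To make the generative structure concrete, here is a sketch that simulates data from a model of this form, with site intercepts that also vary by site. All parameter values here (μ, σ_δ, σ_y, the intercepts, and the sample sizes) are invented for illustration and are not estimates from the IHDP data:

```python
# Sketch: simulating data from a multilevel model of the form
# y_i ~ N(gamma_j[i] + delta_j[i] * P_i, sigma_y^2), with site-level
# treatment effects delta_j ~ N(mu, sigma_delta^2). All numbers are
# hypothetical, chosen only to illustrate the structure.
import random

random.seed(1)

J, n_per_site = 8, 50
mu, sigma_delta = 5.0, 2.0   # hypothetical effect distribution
sigma_y = 10.0               # hypothetical residual sd
gamma = [random.gauss(80.0, 5.0) for _ in range(J)]        # site intercepts
delta = [random.gauss(mu, sigma_delta) for _ in range(J)]  # site effects

data = []
for j in range(J):
    for _ in range(n_per_site):
        P = int(random.random() < 0.5)  # randomized treatment indicator
        y = random.gauss(gamma[j] + delta[j] * P, sigma_y)
        data.append((j, P, y))

print(len(data))  # → 400 simulated observations across the 8 sites
```

The key structural point is that the δ_j are themselves draws from a common distribution, which is exactly what licenses the partial pooling discussed next.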
We have also allowed the intercept, γ, to vary across sites in a similar manner. It does not seem a strong assumption to think of these as realizations from a common distribution, and this addition should strengthen our model. Additionally, our Bayesian analysis requires us to specify prior distributions for the parameters μ, σ_y, and σ_δ. However (particularly for this kind of simple model), it is not difficult to choose weakly informative priors (Gelman, 2006). Finally, we could (and should, in a real analysis) easily include other predictors in the model to increase our predictive power—most notably, the group-level intercept γ_j can be a predictor for the group-level treatment effect δ_j—but we have refrained from adding predictors in this example so we can focus on the primary issues.

Partial pooling.
Multilevel modeling can be thought of as a compromise between two extremes. One extreme, complete pooling, would assume the treatment effects are the same across all sites, that is, δ_j = δ for all j. The other extreme, no pooling, would estimate treatment effects separately for each site. The compromise found in the multilevel model is often referred to as partial pooling. Figure 1 graphically illustrates this compromise with a plot of the multilevel intervals next to the classical estimates and intervals (with and without Bonferroni corrections). The horizontal dashed line in each plot displays the complete pooling estimate. We also display a horizontal solid line at zero to quickly show which estimates would be considered statistically significant. This process leads to point estimates that are closer to each other (and to the "main effect" across all sites) than in the classical analysis. Rather than inflating our uncertainty estimates, which doesn't really reflect the information we have regarding the effect of the program, we have shifted the point estimates toward each other in ways that reflect the information we have. (More generally, if the model has group-level predictors, the inferences will be partially pooled toward the fitted group-level regression surface rather than toward a common mean.)

The intuition.
Why does partial pooling make sense at an intuitive level? Let's start from the basics. The only reason we have to worry about multiple comparisons issues is that we have uncertainty about our estimates. If we knew the true (population-average) treatment effect within each site, we wouldn't be making any probabilistic statements to begin with—we would just know the true sign and true magnitude of each (and certainly then whether or not each was really different from 0 or from the others). Classical inference in essence uses only the information in each site to get the treatment effect estimate in that site and the corresponding standard error.

A multilevel model, however, recognizes that this site-specific estimate is actually ignoring some important information—the information provided by the other sites. While still allowing for heterogeneity across sites, the multilevel model also recognizes that since all the sites are measuring the same phenomenon, it doesn't make sense to completely ignore what has been found in the other sites. Therefore each site-specific estimate gets "shrunk" or pulled toward the overall estimate (or, in a more general setting, toward a group-level regression fit). The greater the uncertainty in a site, the more it will get pulled toward the overall estimate. The less the uncertainty in a site, the more we trust that individual estimate and the less it gets shrunk.

To illustrate this point we ran our multilevel model on slightly altered versions of the dataset. In the first altered version we decreased the sample size in Site 3 from 138 to a random sample of 30; results are displayed in the center panel of Figure 3. In the second altered version we increased the sample size in Site 3 to 300 by bootstrapping the original observations in that site; results are displayed in the right panel of Figure 3. The left-most panel displays the original multilevel model results.
The key observation is that the shrinkage of the Site 3 treatment effect estimate changes drastically across these scenarios because the uncertainty of the estimate relative to that of the grand mean also changes drastically across these scenarios. Note, however, that the overall uncertainty increases in the right-most plot even though the sample size in Site 3 increases. That is because we increased the sample size while keeping the point estimate the same. This leads to greater certainty about the level of treatment effect heterogeneity across sites, and thus greater uncertainty about the overall mean.
The algebra.
Partial pooling tends to reduce the number of statistically significant comparisons. To see this algebraically, consider the estimate for the treatment effect in a single group in a simple normal-normal hierarchical model:

posterior E(θ_j) = [ (1/σ_θ²) μ + (1/σ_y²) ȳ_j ] / [ 1/σ_θ² + 1/σ_y² ].

The corresponding uncertainty for this estimate is

posterior sd(θ_j) = 1 / √(1/σ_θ² + 1/σ_y²).

The smaller the prior variance σ_θ², the more the posterior estimates for different groups are pooled toward a common mean value. At the same time, their posterior variances are shrunk toward zero, but much more slowly.
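These formulas, and the z-score shrinkage derived below, can be checked numerically. In this sketch the group means and the variances σ_y and σ_θ are arbitrary illustrative values:

```python
# Numerical check of the partial-pooling algebra for the normal-normal
# model; the group means and variances below are invented for illustration.
import math

def posterior_mean_sd(ybar, mu, sigma_y, sigma_theta):
    """Precision-weighted posterior mean and sd for one group's effect."""
    prec = 1 / sigma_theta**2 + 1 / sigma_y**2
    mean = (mu / sigma_theta**2 + ybar / sigma_y**2) / prec
    return mean, math.sqrt(1 / prec)

def zscore_shrinkage(sigma_y, sigma_theta):
    """Factor (< 1) multiplying the classical z-score of a comparison."""
    return 1 / math.sqrt(1 + sigma_y**2 / sigma_theta**2)

# Two groups with raw means 3 and -1, pooled toward mu = 0:
m_j, s_j = posterior_mean_sd(3.0, 0.0, sigma_y=1.0, sigma_theta=0.5)
classical_z = (3.0 - (-1.0)) / (math.sqrt(2) * 1.0)
partial_z = classical_z * zscore_shrinkage(1.0, 0.5)
print(round(m_j, 3), round(classical_z, 2), round(partial_z, 2))  # → 0.6 2.83 1.26
```

Note how the posterior mean (0.6) is pulled far toward μ = 0 while the comparison's z-score drops below the conventional significance cutoff.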
Figure 3: Comparison of results from the multilevel model using different versions of the data. The left panel displays results from the original data. The center panel displays results from a model fit to data where Site 3 has been reduced to a random sample of 30 from the original 138 observations in that site. The right panel displays results from a model fit to data in which Site 3 observations were bootstrapped to create a sample of size 300.

As a result, the z-score for any comparison—the difference in posterior means, divided by the posterior standard deviation for that difference—decreases, and statistically significant Bayesian comparisons become less likely. Algebraically:

posterior E(θ_j − θ_k) = [ σ_θ² / (σ_y² + σ_θ²) ] (ȳ_j − ȳ_k)

posterior sd(θ_j − θ_k) = √2 σ_y σ_θ / √(σ_y² + σ_θ²)

posterior z-score of θ_j − θ_k: [ (ȳ_j − ȳ_k) / (√2 σ_y) ] · [ 1 / √(1 + σ_y²/σ_θ²) ].

The first factor in this last expression is the z-score for the classical unpooled estimates; the second factor is the correction from partial pooling, a correction that is always less than 1 (that is, it reduces the z-score) and approaches zero as the group-level variance σ_θ² approaches zero; see Figure 4.

The actual adjustment to z-scores and significance levels is slightly more complicated because the variance parameters and the group-level mean are estimated from data, but the basic pattern holds, which is that posterior means are pulled together faster than posterior standard deviations decrease.

Figure 4: Shrinkage of the z-score for a comparison, θ_j − θ_k, as estimated using multilevel modeling as compared to classical inference, plotted against the variance ratio σ_θ²/σ_y². When the variance ratio is small—that is, when the groups are similar to each other—there is a lot of shrinkage.

Greenland and Robins (1991) make a similar argument about the advantages of partial pooling, going so far as to frame the multiple comparisons problem as an "opportunity to improve our estimates through judicious use of any prior information (in the form of model assumptions) about the ensemble of parameters being estimated. Unlike conventional multiple comparisons, EB [empirical Bayes] and Bayes approaches will alter and can improve point estimates and can provide more powerful tests and more precise (narrower) interval estimators."

Model fitting.
One barrier to more widespread use of multilevel models is that researchers aren't always sure how to fit such models. We often recommend fitting multilevel models in a fully Bayesian way using a software package such as Bugs (as described in detail in Gelman and Hill, 2007). However, many simple models can be fit quite well using packages that have been built in (or can be easily installed into) existing software packages. For instance, the model above can be fit easily in R, as

    ihdp.fit <- lmer(y ~ treatment + (1 + treatment | group))

Further functions exist in the arm package in R to help the user sample from the posterior distribution for each site-specific treatment effect (or any other parameter from the model, or functions thereof; see Gelman and Hill, 2007). Similar options for fitting the model are available in Stata and SAS as well (see Appendix C of Gelman and Hill, 2007).

If we want to make comparisons across two sites, say Site 1 and Site 3, we don't need to refit the model using different contrasts or perform any algebraic manipulations. All can be done using posterior simulations.
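Such a comparison reduces to counting posterior draws. In this sketch the draws are synthetic Gaussian stand-ins, with invented means and standard deviations, for the simulations one would extract from a fitted model:

```python
# Sketch: comparing two site effects via posterior simulations. The
# "draws" below are synthetic stand-ins (invented means and sds) for
# the simulations one would extract from a fitted multilevel model.
import random

random.seed(2)
n_draws = 4000
delta_1 = [random.gauss(8.0, 3.0) for _ in range(n_draws)]  # site 1 effect draws
delta_3 = [random.gauss(4.0, 3.0) for _ in range(n_draws)]  # site 3 effect draws

diff = [a - b for a, b in zip(delta_1, delta_3)]
prob = sum(d > 0 for d in diff) / n_draws
print(round(prob, 2))  # posterior probability that site 1's effect exceeds site 3's
```

The same counting trick works for any function of the parameters—differences, ratios, or rankings—without refitting the model.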
We explore our ideas on multilevel models and multiple comparisons through a series of examples that illustrate different scenarios in which multiple comparisons might arise as an issue.
This next example illustrates how these issues play out in a situation in which all pairwise comparisons across groups are potentially of interest. Figure 5 shows a graph that mimics one produced for a National Center for Education Statistics (NCES, 1997) report that ordered all states based on average scores on the National Assessment of Educational Progress (NAEP) fourth-grade mathematics test. Our version makes use of 2007 fourth-grade mathematics scores and performs the false discovery rate (FDR) correction currently used by NCES for this sort of problem (many comparisons). In the graph, statistically significant comparisons have been shaded. In theory, this plot allows us to answer questions such as, does North Carolina have higher average test scores than neighboring South Carolina? This information could be displayed better (Wainer, Hambleton, and Meara, 1999; Almond et al., 2000), and maybe should not be displayed at all (Wainer, 1986), but here our concern is with the formulation as a multiple comparisons problem.
Concerns with the classical multiple comparisons display.
Here is a situation in which most classical multiple comparisons adjustments, such as the FDR adjustment that was used, will not be appropriate, because we know ahead of time that the null hypothesis (zero average differences between states) is false, so there is no particular reason to worry about the Type 1 error rate. Any motivation for multiple comparisons then rests either on (a) wanting more than 95% of the 95% intervals to contain the true values, or (b) wanting a lower Type S error rate, in other words, minimizing the chance of, for instance, stating that New Jersey has higher average test scores than Pennsylvania when, in fact, the opposite is the case.

With regard to 95% intervals, we can do better using multilevel modeling, either on the raw state averages from any given year or, even better, expanding the model to include

[Figure 5 chart: "Comparisons of Average Mathematics Scale Scores for Grade 4 Public Schools in Participating Jurisdictions," with states ordered by average score and statistically significant pairwise comparisons shaded.]
Figure 5: Graph that mimics one produced for the National Center for Education Statistics (1997) report comparing average math test scores of students in different states. Shaded comparisons represent differences that are statistically significant after a false discovery rate (FDR) multiple comparisons correction. We argue that these multiple-comparisons adjustments make no sense, as they are based on the irrelevant model in which true population differences are exactly zero.

[Figure 6 plot: medians and 80% intervals for the state parameters theta.]

Figure 6: Summary of inferences for average math scores by state (see Figure 5) based on a fitted multilevel model. States have been ordered by increasing average test score in the raw data. This graph could be improved in many ways; actually, though, it portrays the comparisons fairly clearly while being nothing more than a piece of the default graphical output obtained by fitting the model in R and Bugs.

[...] state-level predictors and test scores from other years. If Type S error rates are a concern, then, again, a multilevel model will more directly summarize the information available.

The objection may be raised that although we know that the true differences cannot be exactly zero, what about the null hypothesis that they are nearly zero? Our reply is that this weaker version of classical hypothesis testing doesn't work here either. One way to see this is that the data used to create Figure 5 clearly reject either of these null hypotheses. But classical multiple comparisons procedures just plug along ignoring this information, widening the confidence intervals as new comparisons are added.
Multilevel model and corresponding display of comparisons.
As an alternative, we fit a multilevel model: y_j ∼ N(α_j, σ_j²), where j = 1, ..., J indexes the different states, and y_j is the average fourth grade mathematics score for the students who took the test in state j. The parameters α_j represent the true mean in each state, that is, the population average that would be obtained if all the students in the state were to take the test. We model these population averages with a normal distribution: α_j ∼ N(µ_α, σ_α²). Finally, we assign noninformative uniform prior distributions to the hyperparameters µ_α, σ_α, σ_j. (We recognize that this model could be improved, most naturally by embedding data from multiple years in a time series structure. The ability to include additional information in a reliable way is indeed a key advantage of multilevel models; however, here we chose a simple model because it uses no more information than was used in the published tables.)

We display the resulting estimates in Figure 6. For each pair of states j, k, we check whether 95% or more of the simulations show α_j > α_k. If so, that is, if α_j > α_k for at least 950 of the 1000 simulations, we can claim with 95% confidence that state j outperforms state k. We plot the results in Figure 7. States whose effects are significantly lower are shaded with light blue, those that are higher are shaded with darker blue, and those that are not significantly different are left white.

Compared to the classical multiple comparisons summaries in Figure 5, the multilevel estimates in Figure 7 are more informative, with more claims with confidence and fewer cells in the central region where comparisons are not statistically significant. The classical procedure overcorrects for multiple comparisons in this setting where the true differences between states are large. (At the other extreme, classical procedures undercorrect for multiple comparisons when true differences are small, as we discuss in Section 4.2.)

When there is evidence for a multiple comparisons problem, our procedure makes corrections.
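This comparison rule is easy to implement once one has a matrix of posterior simulation draws. A minimal sketch in Python (an illustrative re-implementation, not the R/Bugs code used for the figures; the array layout is our own assumption):

```python
import numpy as np

def confident_pairs(alpha_sims, level=0.95):
    """alpha_sims: array of shape (n_sims, J), posterior draws of the
    state means alpha_j. Entry (j, k) of the result is True when at
    least `level` of the draws have alpha_j > alpha_k."""
    n_sims, J = alpha_sims.shape
    frac = np.empty((J, J))
    for j in range(J):
        # fraction of draws in which state j beats each other state
        frac[j] = (alpha_sims[:, [j]] > alpha_sims).mean(axis=0)
    return frac >= level
```

With 1000 draws, `confident_pairs` flags exactly the pairs for which alpha_j > alpha_k in at least 950 of the simulations.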
When there is no evidence of a multiple comparisons problem, our procedure is similar to direct inference without a multiple comparisons correction. Rubin (1981) discusses an example (reprised in chapter 5 of Gelman et al., 2003) of a meta-analysis of randomized experiments of coaching for the Scholastic Aptitude Test (SAT) in eight high schools in New Jersey. This example is notable as one of the first fully Bayesian analyses of a hierarchical model and also because there was no evidence in the data of differences between the treatment effects in the different schools. (And, in fact, the total estimated effects are small.) The first two columns of numbers in Figure 8 give the data. Just to get a sense of the variation, the standard deviation of the eight school estimates is 10, which is of the same order as the standard errors.
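That back-of-the-envelope claim is easy to verify using the eight raw estimates as published in Rubin (1981):

```python
import numpy as np

# Raw treatment effect estimates y_j for the eight schools (Rubin, 1981).
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])

sd = y.std(ddof=1)   # sample standard deviation of the eight estimates
print(round(sd, 1))  # prints 10.4, the same order as the standard errors
```

So the raw spread of the estimates is comparable to their individual standard errors, consistent with the claim that the data show little evidence of real between-school variation.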
Classical and Bayesian analysis.
This is the sort of situation where one might worry about multiple comparisons. (In the actual data in Figure 8, none of the raw comparisons

[Figure 7 axis labels: the participating jurisdictions, ordered from Massachusetts down to Mississippi. Title: Comparisons of Average Mathematics Scale Scores for Grade 4 Public Schools in Participating Jurisdictions.]
Figure 7: Posterior inference of NAEP fourth grade mathematics test scores from 2007 by state, fit using a multilevel model. Compared to the classical multiple comparisons summaries in Figure 5, we have fewer cases of ambiguity, that is, more claims with confidence. This makes sense because the hierarchical model adapts to the setting, which in this case is a high variance ratio and thus little need to worry about noise in the comparisons.

School   Raw estimate of        Standard error of          Bayes posterior   Bayes posterior
         treatment effect, y_j  raw effect estimate, σ_yj  mean              sd
A        28                     15                         11                8
B        8                      10                         7                 6
C        …                      …                          …                 …
…

Figure 8: First two columns of numbers: data from the 8-schools experiment of Rubin (1981). A separate randomized experiment was conducted in each school, and regression analysis gave separate treatment effect estimates (labeled as y_j above) and standard errors (labeled as σ_yj). Effects are on the scale of points in the SAT-Verbal test (which was scored from 200 to 800). An effect of 8 points corresponds to approximately one additional test item correct. Last two columns: posterior mean and standard deviation of treatment effects, as estimated from a Bayesian multilevel model. The evidence is that the effects vary little between schools, hence the estimates are pooled strongly toward the common mean. None of the comparisons from the Bayesian inference are even close to statistically significant.

happen to be statistically significant, but as we discuss below, they could be in a replication of the study.)

The hierarchical Bayesian analysis of Rubin (1981) has no multiple comparisons problems, however. The group-level variance is estimated to be low (the marginal maximum likelihood or posterior mode estimate is zero, and the Bayesian analysis averages over the posterior distribution, which is largely below 10), and as a result the Bayes estimates are pooled strongly toward the common mean.

Simulation study with small effects.
To get further insight into this example, we perform repeated simulations of a world in which the true treatment effects in different schools come from a normal distribution with standard deviation 5 (a plausible estimate given the data in Figure 8). For each replication, we simulate eight true values θ_1, ..., θ_8 from this distribution, then simulate data y_1, ..., y_8 from the eight separate normal distributions corresponding to each θ_j. The standard deviations σ_j for each of these distributions are as given in Figure 8. Relative to the within-group standard deviations, the between-group standard deviation of 5 is small. We then performed both classical and hierarchical Bayesian analyses. For each analysis, we computed all (8 · 7)/2 = 28 pairwise comparisons.

Simulation study with large effects.
To get a sense of what happens when effects are more clearly distinguishable, we repeat the above simulation but assume the true treatment effects come from a distribution with standard deviation 10. This time, 12% of the classical comparisons are statistically significant, with 86% of these having the correct sign. From the Bayesian analysis, 3% of the comparisons are statistically significant, with 96% of these having the correct sign. Whether using classical multiple comparisons or Bayesian hierarchical modeling, the price to pay for more reliable comparisons is to claim confidence in fewer of them.
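The classical arm of these simulations can be sketched as follows (an illustrative Python re-implementation under our own assumptions, not the authors' code; the Bayesian arm, which requires fitting the hierarchical model to each simulated dataset, is omitted):

```python
import numpy as np

rng = np.random.default_rng(2009)

# Within-school standard errors from the 8-schools data (Rubin, 1981).
se = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
J = len(se)

def classical_sim(tau, n_reps=1000):
    """Draw true effects theta_j ~ N(0, tau^2) and data y_j ~ N(theta_j, se_j^2),
    then count which of the (8*7)/2 = 28 unadjusted pairwise comparisons come out
    statistically significant, and how often those have the correct sign."""
    n_sig = n_correct = 0
    for _ in range(n_reps):
        theta = rng.normal(0.0, tau, size=J)
        y = rng.normal(theta, se)
        for j in range(J):
            for k in range(j + 1, J):
                z = (y[j] - y[k]) / np.hypot(se[j], se[k])
                if abs(z) > 1.96:
                    n_sig += 1
                    n_correct += (y[j] - y[k]) * (theta[j] - theta[k]) > 0
    n_total = n_reps * J * (J - 1) // 2
    return n_sig / n_total, (n_correct / n_sig if n_sig else float("nan"))

for tau in (5.0, 10.0):
    rate, sign_ok = classical_sim(tau)
    print(f"tau = {tau}: {100 * rate:.1f}% significant, "
          f"{100 * sign_ok:.0f}% of those with the correct sign")
```

This sketch reproduces the qualitative pattern described above: with the larger between-group standard deviation, more comparisons are declared significant, and most of the significant ones have the correct sign.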
Rockoff (2004) and Kane, Rockoff, and Staiger (2007) analyzed a huge panel dataset of teachers and children from the New York City school system to assess the importance of factors such as educational background, training, and experience in determining the effectiveness of teachers. One of the findings was that variation in teacher "effects" (we are not interpreting these findings causally) on student grades was moderately large, about 0.15 standard deviations on a scale in which the standard deviation was calculated using test scores for all students of a given grade level in the system. The researchers, using an approach that approximates the fit from a multilevel model, learned from the scale of unexplained variation in teacher effects over time (the residual group-level variance) that teacher effects are important and often persistent and are largely not explained by background variables, except for a small improvement in performance during the first decade of a teacher's career.

More broadly, there has been an increasing push by certain school districts and policy advocates to use data like these to compare the "effectiveness" of individual teachers (or schools) to award merit pay or provide other incentives or sanctions. (Grudgingly) leaving the problematic causal issues aside (see Rubin et al., 2004, for a relevant discussion), we note that an outstanding methodological problem thus far has been that analyses used to make such comparisons rarely if ever address the extreme multiple comparisons problems involved. This study could have been set up as a multiple comparisons problem, trying to get appropriate p-values for comparing thousands of teachers or for distinguishing individual teacher effects from zero.
However, we know that there are true differences across teachers and that no teacher's effect is exactly zero (an individual teacher's effect could be negative or positive but will not be precisely zero), and we therefore should primarily be concerned with Type S and Type M errors. Therefore a multilevel model would be a far more appropriate choice for such analyses, and, in fact, this is essentially what Kane, Rockoff, and Staiger did.

In an analysis of data from 2000 participants in the U.S. adolescent health study, Kanazawa (2007) found that more attractive people were more likely to have girls, compared to the general population: 52% of the babies born to people rated "very attractive" were girls, compared to 44% girls born to other participants in the survey. The difference was statistically significant, with a t-value of 2.43. However, as discussed by Gelman (2007), this particular difference (most attractive versus all others) is only one of the many plausible comparisons that could be made with these data. Physical attractiveness in the survey used by this paper was measured on a five-point scale. The statistically significant comparison was between category 5 ("very attractive") and categories 1–4. Other possibilities include comparing categories 4–5 to categories 1–3 (thus comparing "attractive" people to others), or comparing 3–5 to 1–2, or comparing 2–5 to 1. Moreover, these data are from one of 3 potential survey waves, so there are 20 possible comparisons: 5 natural comparisons by 4 possible time summaries (wave 1, wave 2, wave 3, or the average). It is not a surprise that one of this set of comparisons comes up statistically significant.

In this study, classical multiple comparisons adjustments may not be such a bad idea, because actual sex ratio differences tend to be very small (typically less than 1 percentage point), and so the null hypothesis is approximately true here.
A simple Bonferroni correction based on our count of 20 possible comparisons would change the critical value from 0.05 to 0.05/20 = 0.0025, in which case the finding above (with p-value of 0.015) would not be statistically significant.

With a properly designed study, however, multiple comparisons adjustments would not be needed here either. To start with, a simple analysis (for example, linear regression of the proportion of girl births on the numerical attractiveness measure) should work fine. The predictor here is itself measured noisily (and is not even clearly defined), so it would probably be asking too much to look for any more finely grained patterns beyond a (possible) overall trend. More importantly, the sample size is simply too low here, given what we know from the literature on sex-ratio differences. From a classical perspective, an analysis based on 2000 people is woefully underpowered and has a high risk of both Type S and Type M errors.

Alternatively, a Bayesian analysis with a reasonably uninformative prior distribution (with heavy tails to give higher probability to the possibility of a larger effect) reveals the lack of information in the data (Gelman and Weakliem, 2009). In this analysis the probability that the effect is positive is only 58%, and the estimated effect size is well under 1 percentage point.

We build on our Infant Health and Development Program example from Section 2.1 to illustrate how a multi-site analysis could be expanded to accommodate subgroup effects as well. The most important moderator in the IHDP study was the birth-weight group designation.
In fact there was reason to believe that children in the lighter low-birth-weight (less than 2 kg) group might respond differently to the intervention than children in the heavier (more than 2 kg) low-birth-weight group. We expand our model to additionally allow for differences in treatment effects across birth-weight groups:

y_i ∼ N(γ_j[i] + δ^L_j[i] P_i (1 − B_i) + δ^H_j[i] P_i B_i, σ_y²).

Here the treatment effect corresponding to person i's site (indexed by j) depends on whether the child belongs to the lower low-birth-weight group (δ^L_j[i]) or the higher low-birth-weight group (δ^H_j[i]); P_i is the treatment indicator and B_i indicates membership in the heavier group. This time each of these sets of parameters gets its own distribution,

δ^L_j ∼ N(µ_L, σ²_δL)  and  δ^H_j ∼ N(µ_H, σ²_δH).

In this case we allow the treatment effects for the lower and higher low-birth-weight children to have a correlation ρ. Again we have allowed the intercept, γ_j, to vary across sites j and have specified prior distributions for the hyperparameters that should have little to no impact on our inferences.

Figure 9 plots some results from this model. The estimates for the lighter low-birth-weight group are quite volatile in the classical setting, where the information we have about the relationship between the sites is ignored. The Bonferroni correction serves only to reinforce our uncertainty about these estimates. On the other hand, the results from the Bayesian multilevel model for this group have been shrunk towards the main effect across groups and thus are less subject to the idiosyncrasies that can arise in small samples. (The sample sizes across these sites for this group range from 67 to 93.) The point estimates for the heavier low-birth-weight children are more stable for all analyses relative to those for the lighter low-birth-weight group, reflecting that there is generally less treatment heterogeneity for this group of children.
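The mean structure of this expanded model can be written directly as code. A hypothetical sketch (our own notation: j gives each child's site, P the treatment indicator, and B the indicator for the heavier group):

```python
import numpy as np

def treatment_mean(gamma, delta_L, delta_H, j, P, B):
    """Mean of y_i: site intercept plus a site-specific treatment effect
    that depends on the child's birth-weight group (lighter vs. heavier)."""
    gamma, delta_L, delta_H = map(np.asarray, (gamma, delta_L, delta_H))
    j, P, B = map(np.asarray, (j, P, B))
    # P * (1 - B) selects treated children in the lighter group,
    # P * B selects treated children in the heavier group.
    return gamma[j] + delta_L[j] * P * (1 - B) + delta_H[j] * P * B
```

For a treated child (P = 1) in the lighter group (B = 0) at site j, the mean reduces to gamma[j] + delta_L[j], as the model specification requires.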
The classical and corrected standard errors are larger than for the lighter low-birth-weight children, most likely because the sample sizes across sites for this group are slightly smaller (ranging from 35 to 51).

Overall, the results from the Bayesian multilevel analysis are the most stable, and they lead to substantively different conclusions than the classical analyses. None of the Bayesian 95% intervals even comes close to covering zero. This contrasts sharply with the results from the Bonferroni-adjusted classical intervals, all of which are quite wide (typically at least twice the width of the Bayesian intervals) and four of which actually include zero (the other four end quite close to zero).

Similar issues arise when researchers attempt to evaluate the impact of a program on many different outcomes. If you look at enough outcomes, eventually one of them will appear to demonstrate a positive and significant impact, just by chance. In theory, multilevel models can be extended to accommodate multiple outcomes as well. However, this often requires a bigger methodological and conceptual jump. The reason that multilevel models were such a natural fit in the examples described above is that all the estimated group effects
[Figure 9 panels: Classical Linear Regression; Classical Linear Regression with Bonferroni Correction; Multilevel Model, shown separately for the Lighter LBW and Heavier LBW groups; x-axis: site 1–8; y-axis: treatment effect estimate.]

Figure 9: Treatment effect point estimates and 95% intervals across the eight IHDP sites, now broken down by birth-weight group as well. The left panels display classical estimates from a linear regression. The middle panels display the same point estimates as in the left panels, but the confidence intervals have been adjusted to account for a Bonferroni correction. The right panels display 95% intervals and means from the posterior distributions for each of the eight site-specific treatment effects generated by fitting a multilevel model.

were estimates of the same phenomenon. Thus it was reasonable to assume that they were exchangeable. Basically this means that we could think of these estimates as random draws from the same distribution without any a priori knowledge that one should be bigger than another. If we had such information then it should be included in the model, and in more complicated settings it is not trivial to set up such a model.

On the other hand, there are some situations when modeling multiple outcomes simultaneously within a simple multilevel model might be fairly natural. For instance, sometimes researchers acquire several different measures of the same phenomenon (such as educational achievement, behavioral problems, or attitudes towards related issues). It is also common to measure the same attribute at several different time points over the course of the study. This might require a slightly more complicated modeling strategy to account for trends over time, but is otherwise a reasonable choice.
If many outcomes have been measured across several disparate domains, however, more effort may be needed to set up a suitable multilevel model.

We illustrate a simple example of multiple outcomes by returning to the IHDP data. This time we allow treatment effects to vary by site and type of test. We include eight different types of cognitive tests that were administered either at year 3 (PPVT-R, Stanford Binet), year 5 (PPVT-R, Wechsler Preschool and Primary Scale of Intelligence Revised verbal and performance subscales), or year 7 (PPVT-R, Wechsler Intelligence Scale for Children verbal and performance subscales). In this new formulation, test-specific individual-level outcomes (indexed by i) are allowed both site-specific (γ^site_j) and test-specific (γ^test_k) contributions, and the treatment effects, δ_l, are also allowed to vary by site and outcome. Here l indexes site × test (that is, j × k) combinations:

y_i ∼ N(µ + γ^site_j[i] + γ^test_k[i] + δ_l[i] P_i, σ_y²).

As with the previous models, the site- and test-specific intercepts, γ^site_j and γ^test_k respectively, are assumed to follow normal distributions, each with mean zero (since the model already includes a parameter for the overall mean, µ) and each with its own variance. What's more interesting here is the model for the δ_l:

δ_l ∼ N(δ + δ^site_j[l] + δ^test_k[l], σ_δ²).

Here the site-specific contributions to the treatment effects, δ^site_j[l], are simply modeled with a normal distribution with mean zero and a separate variance component. However, the test-specific contributions, δ^test_k, are allowed to vary systematically based on the age at which the test was taken and whether or not the test measures verbal skills:

δ^test_k ∼ N(φ_year5 I^year5_k + φ_year7 I^year7_k + φ_verbal I^verbal_k, σ²_δtest),

where the I's are indicators for whether test k was administered at year 5 or year 7 and whether it measures verbal skills. This last piece of the model increases the plausibility of the exchangeability assumption for the test scores.
We also simplify the model by first standardizing each of the test scores (to have mean 0 and standard deviation 1) within the sample.

The results from this model are displayed in Figure 10. Each row of the figure corresponds to the test score outcome from a different year (row 1 for age 3, row 2 for age 5, row 3 for age 8). Within each plot, 95% intervals are displayed for each site. The treatment effects are larger on average for tests taken directly as the intervention ended at age 3. Similar patterns appear across sites for each outcome; this phenomenon reflects an assumption that could be relaxed by including site-by-outcome interactions.

A comparison to a classical correction such as a Bonferroni adjustment is even more extreme in this setting, as illustrated in Figure 11. When taking all eight outcomes into consideration, the Bonferroni correction applied to the classical linear regression fit results in even more extreme uncertainty bounds than in our original example, because now there are 64 comparisons rather than simply eight. The multilevel model estimates are now additionally shrunk towards the grand mean across outcomes, after adjusting for differences in mean test scores attributable to the year they were administered and the type of test. This shrinkage causes the estimates to be more conservative than they previously were. However, the overall precision of the multilevel model results is vastly superior to that of the Bonferroni-adjusted intervals.
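The standardization step mentioned above is a one-liner; a minimal sketch (whether to use the sample or population scaling convention is our own assumption):

```python
import numpy as np

def standardize(scores):
    """Center and scale a vector of test scores to mean 0, sd 1 within the sample."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()
```

Putting all eight tests on a common standardized scale is what makes the shrinkage toward a grand mean across outcomes interpretable.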
Further complications
Harder problems arise when modeling multiple comparisons that have more structure. For example, suppose we have 5 outcome measures, 3 varieties of treatments, and subgroups classified by 2 sexes and 4 racial groups. We would not want to model this 2 × 4 × 3 × 5 structure of comparisons as a single batch of exchangeable groups.

[Figure 10 plot: one panel per test (ppvtr.36, iqsb.36; ppvtr.60, iqv.60, iqp.60; ppvtr.96, iqv.96, iqp.96); x-axis: site; y-axis: treatment effect estimate (sd units).]

Figure 10: Example of inference for multiple outcomes. Estimated 95% intervals and medians from the posterior distributions for each of 64 site-and-test-specific treatment effects in the Infant Health and Development Program analysis, obtained by fitting a multilevel model.

[Figure 11 plot: two panels, Classical Linear Regression with Bonferroni Correction and Multilevel Model; y-axis: treatment effect estimate (sd units).]
Figure 11: Comparison of treatment effects by site between the classical linear regression with Bonferroni correction and the multilevel model fit, for the multiple-outcomes analysis of the Infant Health and Development Program (see Figure 10). Both account for eight sites and eight outcomes, though only one outcome, the Stanford Binet IQ score at age 36 months, is displayed.
Multiple comparisons can indeed create problems, and it is useful to address the issue. However, statistical methods must be mapped to the applied setting. Classical Type 1 and Type 2 errors and false discovery rates are based on the idea that the true effect could really be zero (see Johnstone and Silverman, 2004, and Efron, 2006, for connections between these ideas and hierarchical Bayes methods). Effects that are truly zero (not just "small") can make sense in genetics (Efron and Tibshirani, 2002) but are less likely in social science or education research. We prefer to frame the issue in terms of Type S or Type M errors.

Therefore, when doing social science or program evaluation, we do not recommend classical methods that alter p-values or (equivalently) make confidence intervals wider. Instead, we prefer multilevel modeling, which shifts point estimates and their corresponding intervals closer to each other (that is, performs partial pooling) where necessary, especially when much of the variation in the data can be explained by noise. Thus fitting the multilevel model should have the positive externality of yielding more reliable estimates for individual groups.

We recognize that multilevel modeling can prove to be more of a challenge for complicated structures, and more research needs to be done in this area. However, we believe it is more worthwhile to invest research time and effort towards expanding the multilevel model framework than to invest in classical multiple comparisons adjustments that start from a perspective on the problem to which we do not adhere.

Applied researchers may balk at having to learn to fit a different kind of model. However, functions for fitting multilevel models are now available in many statistical software packages; therefore, implementing our suggestions should not be overly burdensome.
Moreover, multiple comparisons problems arise frequently in research studies in which participants have been clustered because of interest in examining differences across program sites, schools, cities, etc.; arguably, data from these types of studies should be fit using a multilevel model anyway, to correctly reflect the within-group correlation structure of the errors. Thus the multilevel model will not only yield better results than the simplest multiple comparisons corrections; it should also pose no greater burden than performing one of the fancier types of classical multiple comparisons corrections.

References
Almond, R. G., Lewis, C., Tukey, J. W., and Yan, D. (2000). Displays for comparing a given state to many others. American Statistician, 89–93.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 289–300.

Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 1165–1188.

Efron, B. (2006). Size, power, and false discovery rates. Technical report, Department of Statistics, Stanford University.

Efron, B., and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 311–319.

Efron, B., and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology, 70–86.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 515–533.

Gelman, A. (2007). Letter to the editor regarding some papers of Dr. Satoshi Kanazawa. Journal of Theoretical Biology, 597–599.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis, second edition. London: CRC Press.

Gelman, A., and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, A., and Stern, H. S. (2006). The difference between "significant" and "not significant" is not itself statistically significant. American Statistician, 328–331.

Gelman, A., and Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 373–390.

Gelman, A., and Weakliem, D. (2009). Of beauty, sex, and power: statistical challenges in estimating small effects. American Scientist, 310–316.

Genovese, C., and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society B, 499–517.

Grant, G. R., Liu, J., and Stoeckert, C. J. (2005). A practical false discovery rate approach to identifying patterns of differential expression in microarray data. Bioinformatics, 2684–2690.

Greenland, S., and Robins, J. M. (1991). Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology, 244–251.

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 609–619.

Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods. London: Chapman and Hall.

Infant Health and Development Program (1990). Enhancing the outcomes of low-birth-weight, premature infants: a multisite, randomized trial. Journal of the American Medical Association, 3035–3042.

James, W., and Stein, C. (1960). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, ed. J. Neyman, 361–380. Berkeley: University of California Press.

Johnstone, I., and Silverman, B. (2004). Needles and straw in haystacks: empirical Bayes approaches to thresholding a possibly sparse sequence. Annals of Statistics, 1594–1649.

Kanazawa, S. (2007). Beautiful parents have more daughters: a further implication of the generalized Trivers-Willard hypothesis. Journal of Theoretical Biology.

Kane, T. J., Rockoff, J. E., and Staiger, D. O. (2007). What does certification tell us about teacher effectiveness? Evidence from New York City. Economics of Education Review.

Krantz, D. H. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 1372–1381.

Louis, T. A. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association, 393–398.

National Center for Education Statistics (1997). 1996 NAEP comparisons of average scores for participating jurisdictions. Washington, D.C.: Government Printing Office.

Poole, C. (1991). Multiple comparisons? No problem! Epidemiology, 241–243.

Rockoff, J. (2004). The impact of individual teachers on student achievement: evidence from panel data. American Economic Review, Papers and Proceedings, May.

Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 377–401.

Rubin, D. B., Stuart, E. A., and Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 103–116.

Spiegelhalter, D., Thomas, A., Best, N., Gilks, W., and Lunn, D. (1994, 2002). BUGS: Bayesian inference using Gibbs sampling. MRC Biostatistics Unit, Cambridge, England.

Tukey, J. W. (1953). The problem of multiple comparisons. Mimeographed notes, Princeton University.

Wainer, H. (1986). Five pitfalls encountered when trying to compare states on their SAT scores.