Accounting for total variation and robustness in profiling health care providers
Lu Xia, Kevin He, Yanming Li, John D. Kalbfleisch∗
Department of Biostatistics, University of Michigan
1415 Washington Heights, Ann Arbor, MI 48109, USA
∗To whom correspondence should be addressed: jdkalbfl@umich.edu

Abstract
Monitoring outcomes of health care providers, such as patient deaths, hospitalizations and hospital readmissions, helps in assessing the quality of health care. We consider a large database on patients being treated at dialysis facilities in the United States, and the problem of identifying facilities with outcomes that are better than or worse than expected. Analyses of such data have commonly been based on random or fixed facility effects, which have shortcomings that can lead to unfair assessments. A primary issue is that they do not appropriately account for variation between providers that is outside the providers' control due, for example, to unobserved patient characteristics that vary between providers. In this article, we propose a smoothed empirical null approach that accounts for the total variation and adapts to different provider sizes. The linear model provides an illustration that extends easily to other nonlinear models for survival or binary outcomes, for example. The empirical null method is generalized to allow for some variation being due to quality of care. These methods are examined with numerical simulations and applied to the monitoring of survival in the dialysis facility data.
Keywords:
Health care provider profiling; Empirical null; Fixed effects; Random effects; Non-linear models; Standardized mortality ratio.
Introduction

In many instances, the quality of medical care differs considerably across providers. Large health care databases provide patient-level data that can be analyzed and summarized at the provider level (… and others, 2010). Similarly, the rate of hospital readmissions following a hospital discharge is important in monitoring hospitals and also other providers such as nursing homes and dialysis facilities, since it can be indicative of the degree of coordination of post-discharge care (He and others, 2013; Wish, 2014). Provider profiling based on clinical outcomes can help patients make decisions in choosing health care providers, and can also aid overseers and payers in identifying providers whose outcomes are worse or better than a normative standard, by signaling the need for further review or targeted quality improvement programs. The statistical analysis, along with further review, may lead to financial consequences and even suspension for providers with poor outcomes. Such evaluations require appropriate statistical methods that account for differences in patient characteristics and suitably adapt to provider size, to ensure that no specific groups of small or large providers are unfairly penalized in the assessment.

This paper is concerned with a large database assembled by the University of Michigan's Kidney Epidemiology and Cost Center and associated with its contracts with the Centers for Medicare & Medicaid Services (CMS). A similar database is available from the United States Renal Data System (USRDS), in the form of standard analysis files, for use in suitable research projects. The data consist of longitudinal information on over half a million dialysis patients with end-stage renal disease (ESRD), associated with over 6,000 dialysis facilities in the United States. These data are compiled from various sources within the CROWNWeb data system from the CMS as well as supplementary data from the Social Security master death file.
The CROWNWeb data include demographic and administrative data on patients and dialysis facilities along with ongoing reporting on treatment, the Medicare claims database, which can be used to determine hospitalizations, comorbidities, and other events, as well as data on kidney transplants in the United States. One important use of these data is to monitor dialysis facilities with respect to various quality measures, including mortality, hospitalization, and readmission. In this paper, we consider the evaluation of mortality over a four-year period, accomplished through the calculation of a standardized mortality ratio (SMR) for each dialysis facility. The SMR is a risk-adjusted measure which compares the observed number of deaths in a facility with the number of deaths that would be expected if the mortality rate for the center, taking account of observed patient characteristics, were the same as for a national norm (UM-KECC, 2018). The measure is typically adjusted for many covariates, including demographic characteristics, comorbidities at baseline, and sometimes comorbidities identified over time. Adjusting for covariates after admission of the patient to the facility must be done with care, since such comorbidities may reflect treatment choices.

Regression models for patient-level outcomes are typically used to adjust for observed patient characteristics. Methods are often based on hierarchical random effects models, in which the provider-specific effects are modeled as independent and identically distributed random variables. A typical analysis of these models is based on the empirical Bayes posterior distribution of the provider effect, where an individual provider effect is estimated by the posterior mean. We refer to this as the random-effects (RE) approach. As is well known, this analysis yields shrinkage estimates, i.e., the estimated provider effects are shrunk towards the overall average (Jones and Spiegelhalter, 2011).
Many authors have advocated this approach; see, for example, Normand and others (1997); Normand and Shahian (2007); Jones and Spiegelhalter (2011); Ohlssen and others (2007). An alternative approach, treating the provider effects as fixed, obtains the maximum likelihood estimates of the provider effects. We refer to this as the fixed-effects (FE) approach. In both approaches, it is common to use the estimate and its frequency or posterior distribution to assess the sharp null hypothesis that a provider effect is identical to a national norm. Much of the between-provider variation is typically outside the providers' control, due to inadequate risk adjustment for unobserved or observed covariates associated with the outcome that vary substantially between providers. Jones and Spiegelhalter (2011) and Kalbfleisch and others (2018) stressed that all sources of variation should be taken into account in profiling.

In this paper, we propose empirical null methods that take into account the total variation as a robustly modeled function of provider size. The empirical null was introduced in Efron (2004) and Efron (2010) to account for overdispersion of Z-scores, and Efron (2007) described methods to estimate a single empirical null distribution with an emphasis on controlling false discovery rates. However, these approaches, which estimate and refer to a single sampling distribution, do not apply immediately when providers vary markedly in size. In Kalbfleisch and Wolfe (2013) and He and others (2013), this shortcoming was partially addressed through stratification. In this article, we define an empirical null for each provider by smoothing the reference empirical null distributions. Our proposal provides a unified framework for provider profiling that encompasses various types of outcomes.
We also generalize this approach to situations where a provider should be held accountable for an externally specified proportion of the total variation that is due to the quality of care.

In Section 2, we consider issues associated with profiling in the linear model, where calculations are more easily done and there is immediate comparison with familiar tests. These methods would apply directly to a quality measure that is based on normally distributed variables. In Section 3, we propose the empirical null methods, with straightforward generalizations to other more complex models as well. In Section 4, we illustrate the empirical null methods through simulation studies and analysis of the ESRD dialysis facility data with respect to monitoring standardized mortality ratios. Section 5 includes some discussion and extensions.
Standard profiling methods based on the linear model
Let $i = 1, \cdots, N$ index providers and $j = 1, \cdots, n_i$ index patients within the $i$th provider. We consider an underlying linear regression model
$$Y^*_{ij} = \mu + \alpha_i + \beta^T X_{ij} + \epsilon_{ij}, \quad (1)$$
where $Y^*_{ij}$ represents the continuous outcome of interest with large values corresponding to poor outcomes, $\mu$ is the grand mean, $\alpha_i$ is the provider effect, and $X_{ij}$ and $\beta$ are vectors of patient characteristics and regression coefficients, respectively. Our primary goal is to identify providers whose outcomes are worse than expected.

In many instances, we assume that the $\alpha_i$'s are random and, conditional on $X_{ij}$,
$$\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2), \quad \text{independent of } \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma_w^2). \quad (2)$$
The parameters $\mu$, $\sigma_\alpha^2$, $\sigma_w^2$, and $\beta$ appear in the probability laws for all observations, and thus are estimated using all patient-level data. In what follows, we assume that the total numbers of providers, $N$, and patients, $\sum_i n_i$, are large (e.g., more than 500,000 patient records and over 6,000 dialysis facilities in our application), so that $\mu$, $\sigma_\alpha^2$, $\sigma_w^2$ and $\beta$ can be precisely estimated. Thus, we proceed without considering the variation associated with their estimates and replace the $Y^*_{ij}$'s with the risk-adjusted responses, $Y_{ij} = Y^*_{ij} - \hat\beta^T X_{ij}$, where $\hat\beta$, based on fixed effects for the $\alpha_i$'s, is consistent for $\beta$. Ignoring variability in the estimation of these structural parameters is justified in large healthcare databases. In much smaller applications, one should take into account the associated variability by considering exact sampling distributions, adjustment of empirical Bayes estimates, or fully Bayesian methods (Jones and Spiegelhalter, 2011).

In a fixed-effects (FE) analysis, we consider the provider effect $\alpha_i$ in (1) to be a constant with a constraint, usually $\sum_{i=1}^N n_i \alpha_i = 0$, for identifiability. The maximum likelihood estimator of $\alpha_i$ is $\hat\alpha^{FE}_i = \bar Y_i - \bar{\bar Y}$, where $\bar Y_i = \sum_{j=1}^{n_i} Y_{ij}/n_i$ and $\bar{\bar Y} = \sum_{i=1}^N (n_i \bar Y_i)/\sum_{i=1}^N n_i$.
As noted above, $\bar{\bar Y}$ is an accurate estimate of $\mu$, as is $\hat\sigma_w^2$ of $\sigma_w^2$. The FE Z-score for testing the sharp null of $\alpha_i = 0$ is
$$Z_{FE,i} = \frac{\hat\alpha^{FE}_i}{\hat\sigma_w/\sqrt{n_i}} \approx \frac{\sqrt{n_i}(\bar Y_i - \mu)}{\sigma_w}, \quad (3)$$
which has a $N(0,1)$ distribution under the null $\alpha_i = 0$. This can be used as the reference distribution to flag unusual providers. Then, one approach might be to flag the $i$th provider if $Z_{FE,i} > z_\rho$, where $z_\rho$ is the upper $\rho$th quantile of the standard normal distribution and $\rho$ is the nominal significance level of the related one-sided test. We use $\rho = 0.05$ throughout this article.

The usual random-effects (RE) analysis of (1) is based on the assumption (2) that $\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2)$, $i = 1, \cdots, N$, and that the provider effects $\alpha_i$ are independent of the patient characteristics $X_{ij}$, $j = 1, \cdots, n_i$. The latter is an important assumption in RE analysis that is rarely met or noted. Confounding between $\alpha_i$ and $X_{ij}$ can lead to substantially biased estimates of $\beta$ and can alter the estimates of the $\alpha_i$ (Kalbfleisch and Wolfe, 2013). One way to address such bias is to utilize the fixed-effects estimate of $\beta$, i.e., $\hat\beta$, to provide an offset. As described earlier, we can obtain the risk-adjusted response, $Y_{ij}$, and then estimate $\sigma_\alpha^2$, $\sigma_w^2$, $\mu$ and the $\alpha_i$'s assuming the model $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$.

An empirical Bayes approach gives the approximate posterior distribution of $\alpha_i$,
$$\alpha_i \mid \{Y_{ij}\} \sim N\!\left(R_i(\bar Y_i - \mu),\; R_i\,\sigma_w^2/n_i\right), \quad (4)$$
where $R_i = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_w^2/n_i)$. This approach yields an estimate $\hat\alpha^{RE}_i = \hat R_i(\bar Y_i - \bar{\bar Y})$ that is shrunk toward zero by $\hat R_i$. These RE estimates are conditionally biased; if the $i$th provider has true effect $\alpha_i$, $E(\hat\alpha^{RE}_i \mid \alpha_i) \approx R_i\alpha_i$. The corresponding RE Z-score for a test of $\alpha_i = 0$ from (4) is
$$Z_{RE,i} = \frac{\hat R_i(\bar Y_i - \bar{\bar Y})}{\sqrt{\hat R_i\,\hat\sigma_w^2/n_i}} = \sqrt{\hat R_i}\, Z_{FE,i} \approx \frac{\sqrt{R_i}(\bar Y_i - \mu)}{\sigma_w/\sqrt{n_i}}, \quad (5)$$
which has a posterior reference distribution $N(0,1)$; one flags the $i$th provider if $Z_{RE,i} > z_\rho$. Note that this reference $N(0,1)$ is based on the posterior distribution (4); the sampling distribution for $Z_{RE,i}$ given $\alpha_i = 0$ is approximately $N(0, R_i)$ rather than $N(0,1)$.

A third approach, which we refer to as FERE, refers the FE estimate to its marginal distribution under (1) and (2):
$$Z_{FERE,i} = \frac{\hat\alpha^{FE}_i}{\sqrt{\hat\sigma_\alpha^2 + \hat\sigma_w^2/n_i}} \approx \frac{\bar Y_i - \mu}{\sqrt{\sigma_\alpha^2 + \sigma_w^2/n_i}}. \quad (6)$$
Analogous to the FE and RE methods, one flags the $i$th provider if $Z_{FERE,i} > z_\rho$. This is not a test of the sharp null hypothesis $\alpha_i = 0$, but rather assesses whether $\bar Y_i$ could reasonably have arisen from the model (1).

FE and RE methods are often used in profiling medical providers and are most appropriate when all of the variation in provider effects is due to the quality of care. Both methods make reference to a sharp null hypothesis, $H_i: \alpha_i = 0$. The RE approach is often thought to account for the variation between providers. It has been promoted largely on the basis of the well-known result that, compared to FE, it improves the overall precision of estimating the provider effects by "borrowing information" from other providers (Jones and Spiegelhalter, 2011; Efron and Morris, 1973; Louis, 1991; Krumholz and others, 2011). Thus, the mean squared error (MSE) of the estimated provider effect is reduced by shrinkage, i.e., $E(\hat\alpha^{RE}_i - \alpha_i)^2 \le E(\hat\alpha^{FE}_i - \alpha_i)^2$. This result is important in some contexts, but not in profiling. Consider, for example, the conditional MSE,
$$\mathrm{MSE}_{met}(\alpha_i) = E\!\left[(\hat\alpha^{met}_i - \alpha_i)^2 \,\middle|\, \alpha_i\right]$$
for $met$ = FE, RE, plotted in Fig. 1, where $\sigma_\alpha^2 = 1$, $\sigma_w^2 = 5$ and $n_i = 100$. The reduction in the overall MSE by RE is achieved as an average of substantial losses for extreme values of $\alpha_i$ and modest gains for the more frequent values of $\alpha_i$ closer to 0. This is an example of an overall average missing important features. For profiling, it is the extreme values that are of primary interest.
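To make the three Z-scores concrete, the following sketch computes $Z_{FE}$, $Z_{RE}$ and $Z_{FERE}$ from provider means under model (1). This is a minimal illustration with our own function and variable names; the structural parameters are treated as known plug-in values, as the text justifies for large databases.

```python
import numpy as np

def profiling_z_scores(ybar, n, mu, sigma2_a, sigma2_w):
    """FE, RE and FERE Z-scores from provider means ybar with sizes n,
    treating the structural parameters (mu, sigma2_a, sigma2_w) as known."""
    alpha_fe = ybar - mu                       # fixed-effects estimate
    z_fe = alpha_fe / np.sqrt(sigma2_w / n)    # eq. (3)
    R = sigma2_a / (sigma2_a + sigma2_w / n)   # shrinkage factor
    z_re = np.sqrt(R) * z_fe                   # eq. (5)
    z_fere = alpha_fe / np.sqrt(sigma2_a + sigma2_w / n)  # eq. (6)
    return z_fe, z_re, z_fere
```

Note that $Z_{RE} = \sqrt{R_i}\,Z_{FE}$, so RE and FE flag nearly the same providers once $n_i$ is large ($R_i \to 1$), whereas $Z_{FERE}$ divides by the total standard deviation and grows much more slowly with $n_i$.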
For this reason among others, when the between-provider variation is entirely due to the quality of care, the FE estimates are preferred for profiling purposes.

Although the FE approach produces unbiased estimates of the provider effects, the FE Z-scores are substantially overdispersed compared to the standard normal reference distribution used in testing $\alpha_i = 0$ (Efron, 2004; Spiegelhalter and others, 2012). The marginal variance of the FE Z-score, assuming randomness in the $\alpha_i$'s, is
$$\mathrm{Var}(Z_{FE,i}) \approx \frac{\sigma_\alpha^2 + \sigma_w^2/n_i}{\sigma_w^2/n_i} = 1 + \frac{n_i\,\sigma_\alpha^2}{\sigma_w^2}, \quad (7)$$
which shows that the variability in FE Z-scores, especially among large providers, is much larger than under the reference $N(0,1)$ distribution. The FE approach does not take into account such unexplained variation and only assesses the sharp null hypothesis of a provider effect. A large sample size can translate into more accurate estimation of the provider effect, and any deviation in $\alpha_i$ from 0 will eventually be detected by FE as $n_i \to \infty$. As a result, the FE approach disproportionately flags large providers even when their provider effects are small and not clinically meaningful. Similarly, the marginal variance of $Z_{RE,i}$ increases linearly with $n_i$, and the same problem persists. With a large provider size $n_i$, FE and RE are nearly identical and neither takes into account the unexplained variation between providers.

Often, much of the variation in the $\alpha_i$'s is outside the providers' control and cannot be attributed to the quality of care. This can arise, for example, when there are unmeasured patient-level covariates, such as socio-economic status, genetic differences and comorbidities, that are predictive of the outcomes but vary substantially between providers. When most of the variation between providers is outside of the providers' control, the FERE approach is preferred, and its advantages have been discussed in the literature before, e.g., in the context of funnel plots, which illustrate how increasing provider size affects variation in the FE estimates (Jones and Spiegelhalter, 2011). Providers that give rise to a large value of $Z_{FERE,i}$ have extreme values with reference to the population of all providers, taking into account all sources of variation.

However, the usual FERE approach is not robust. When there exists a substantial proportion of providers with extreme outcomes, the between-provider variance $\sigma_\alpha^2$ is overestimated, which compromises the ability of the FERE approach to identify extreme providers. Although it is possible to develop robust estimates of $\sigma_\alpha^2$ (see, e.g., Koller, 2016) to improve the FERE method, generalizing them to non-linear models would be difficult.
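A quick simulation, with illustrative parameter values of our own choosing, shows the overdispersion in (7): the empirical variance of the FE Z-scores is close to $1 + n\sigma_\alpha^2/\sigma_w^2$ rather than 1.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5000, 100                   # providers and patients per provider
sigma2_a, sigma2_w = 1.0, 5.0      # between- and within-provider variances

alpha = rng.normal(0.0, np.sqrt(sigma2_a), N)             # random provider effects
ybar = alpha + rng.normal(0.0, np.sqrt(sigma2_w / n), N)  # provider means (mu = 0)
z_fe = ybar / np.sqrt(sigma2_w / n)                       # FE Z-scores

print(np.var(z_fe))  # close to 1 + n * sigma2_a / sigma2_w = 21, far from 1
```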
The empirical null methods

The empirical null was introduced in Efron (2004) and Efron (2010) in the context of multiple testing and controlling false discovery rates. The empirical null (EN) methods in this section are designed to address overdispersion for fair assessments of all providers, in a fashion that is robust to outliers. These methods estimate the total variation in the FE Z-scores, $Z_{FE,i}$, $i = 1, \ldots, N$, and assess extreme values with reference to the empirical null distributions.

For the large majority of providers under consideration, suppose that the model (1) holds, with $\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2)$. For now, we assume all providers have an equal number of patients, i.e., $n_i = n$ for $i = 1, \cdots, N$. The empirical null distribution is defined as the normal distribution with mean, $\hat\mu_M$, and variance, $\hat\sigma_M^2$, which are estimated robustly from the FE Z-scores $Z_{FE,i}$, $i = 1, \ldots, N$, so as to reduce or eliminate the impact of a relatively small number of outlying providers. This distribution is used in place of the reference $N(0,1)$ so that, for example, the $i$th provider is flagged as "worse than expected" if $Z_{FE,i} > \hat\mu_M + z_\rho\,\hat\sigma_M$.

To obtain the robust estimates $\hat\mu_M$ and $\hat\sigma_M^2$, we adapt the "MLE fitting" method (Efron, 2007). This is based on a mixture model in which a proportion $p_M$ of providers have Z-scores arising from the empirical null distribution, $N(\mu_M, \sigma_M^2)$, whereas the remainder, including any outliers, come from another non-null distribution. The target quantities $\mu_M$ and $\sigma_M^2$ are estimated by maximizing the mixture likelihood so that the estimates are robust to outliers. More specifically, we assume that the non-null distribution has support outside an interval $[A, B] \equiv [\mu^{(0)}_M - \zeta\,\sigma^{(0)}_M,\; \mu^{(0)}_M + \zeta\,\sigma^{(0)}_M]$, where $(\mu^{(0)}_M, \sigma^{(0)}_M)$ are reasonable initial estimates and $\zeta > 0$. Let $I = \{i: Z_{FE,i} \in [A,B]\}$, $N_1 = |I|$, $N_0 = N - N_1$,
$$\phi_{\mu,\sigma}(z) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^2\right\},$$
and $Q(\mu,\sigma) = \Phi\!\left(\frac{B-\mu}{\sigma}\right) - \Phi\!\left(\frac{A-\mu}{\sigma}\right)$, where $\Phi(\cdot)$ is the cumulative distribution function of $N(0,1)$. Let $\theta = p_M\,Q(\mu,\sigma)$. Then the likelihood based on the observed Z-scores is
$$L(\mu,\sigma,p) = \theta^{N_1}\,[1-\theta]^{N_0} \prod_{i\in I} \frac{\phi_{\mu,\sigma}(Z_{FE,i})}{Q(\mu,\sigma)}. \quad (8)$$
To obtain the MLE, $(\hat\mu_M, \hat\sigma_M, \hat p_M) = \mathrm{argmax}_{(\mu,\sigma,p)}\, L(\mu,\sigma,p)$, we proceed by computing the profile likelihood of $p_M$, which avoids solutions with $\hat p_M > 1$. Thus we define a grid of $K$ values for $p$ (e.g., [0.5, 1] in increments of 0.001), denoted $\{p^{(1)}, p^{(2)}, \cdots, p^{(K)}\}$. For each $k = 1, 2, \cdots, K$, compute $\ell^{(k)} = \max_{\mu,\sigma} L(\mu,\sigma,p^{(k)})$ and the corresponding maximizer $(\hat\mu^{(k)}, \hat\sigma^{(k)})$. This maximization step is over a 2-dimensional parameter and is very fast using existing optimization methods (e.g., Nelder and Mead, 1965). Finally, find $\tilde k = \mathrm{argmax}_k\, \ell^{(k)}$, and then $(\hat\mu_M, \hat\sigma_M, \hat p_M) = (\hat\mu^{(\tilde k)}, \hat\sigma^{(\tilde k)}, p^{(\tilde k)})$.

An important strength of the MLE fitting approach is that the likelihood (8) is only parametric within the interval $[A, B]$, and is free of specifications of the distribution of the outliers, all of which occur outside the interval. A robust M-estimation technique can be used to obtain the preliminary estimates (Huber, 1964, 1973; Andrews and others, 1972). We use the bi-weight function, which is the default in the SAS ROBUSTREG procedure (Chen, 2002). A stratified EN approach was previously implemented using robust M-estimation in Kalbfleisch and Wolfe (2013) and He and others (2013) for time-invariant profiling, and in Estes and others (2018) for time-dynamic profiling. The MLE fitting outlined above is preferable to robust M-estimation, since the latter can overestimate the variance when many outliers are present. We have also found that moderate values of $\zeta$ such as $\Phi^{-1}(0.9) = 1.28$ or $\Phi^{-1}(0.95) = 1.64$ help avoid including outliers while allowing good estimation of $\mu_M$ and $\sigma_M^2$.

Next we consider a more realistic setting where the sample size $n_i$ varies across providers. Since the total variability of the FE Z-scores depends on $n_i$, we first stratify providers into a few groups based on their sample sizes, as in Kalbfleisch and Wolfe (2013) and He and others (2013). For example, we may stratify the providers into three groups based on the tertiles of the sample sizes $\{n_i\}$. Within group $g$, we obtain the estimates $\hat\mu_{M,g}$ and $\hat\sigma_{M,g}^2$, and an empirical null distribution $N(\hat\mu_{M,g}, \hat\sigma_{M,g}^2)$, $g = 1, 2, 3$. Each health care provider is compared to the empirical null corresponding to its group.

As an illustration, consider the mortality data from the ESRD database introduced in Section 1, on patients treated at 6,363 dialysis facilities in the United States over four calendar years, from 2012 to 2015. More data descriptions can be found in Section 4.3. Facilities are stratified into three groups, small, medium and large facilities, using facility size tertiles. The failure rates are modeled using a Cox model with covariates measuring demographic variables and comorbidities at baseline. The SMR (see Section 1) is obtained for each facility, and each is converted to a one-sided Z-score based on a test of the sharp null hypothesis. These are used in an extension of the EN methods to non-linear models, as discussed in detail in Section 3.3. The estimated means of the one-sided Z-scores for the small, medium and large groups are slightly negative (−0.07 and −0.13 for the medium and large groups), and the corresponding variance estimates all exceed 1 and increase with facility size; see Fig. 2(a)-(c) for Z-score histograms along with the empirical null distributions. It can be seen that the empirical null distributions are more dispersed than the standard normal, and that the variation of Z-scores increases with facility size. We use the upper 5% quantiles of the empirical null distributions as critical values, and facilities with Z-scores larger than those of their corresponding groups are flagged as having poor outcomes. Fig. 2(e)-(f) visualize the stratum-specific critical values of Z-scores in the scatterplot (solid blue segments).

The stratified empirical null approach has some limitations, as evident from Fig. 2(e)-(f). First, the choice of three groups is arbitrary, and a different number of strata will result in changes in the list of flagged providers. A second, related problem is the discontinuity of the critical region at the stratum boundaries. Consequently, two providers near a boundary may have similar sizes and Z-scores, yet one may be flagged and the other not, due only to the arbitrary choice of boundaries. To overcome these issues, we model the mean and variance of the empirical null distributions as smooth functions of provider size, and use robust techniques to lessen the impact of potential outliers.

To estimate the regression of the variance on the provider size, we first obtain variance estimates in each of $G$ groups defined by quantiles of provider sample size. We then regress these variance estimates on the median provider size in each group. Specifically, we proceed as follows. Within the $g$th group, we use the local MLE fitting described in Section 3.1 to estimate the mean $\tilde z_g$ and variance $\tilde\sigma_g^2$ of the Z-scores, and let $\tilde m_g$ represent the median size in this group, $g = 1, \cdots, G$.
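The local MLE fitting applied to a group of Z-scores can be sketched as follows. This is a minimal sketch assuming NumPy/SciPy; the function name is ours, and the preliminary estimates use the median and MAD as a simple stand-in for the bi-weight M-estimation described above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mle_fitting(z, zeta=1.28, p_grid=np.arange(0.5, 1.0001, 0.001)):
    """Robust empirical-null estimates (mu_M, sigma_M, p_M) via the
    profile likelihood of p over a grid, as in eq. (8)."""
    # Preliminary robust location/scale (median/MAD stand in for M-estimation).
    mu0 = np.median(z)
    sigma0 = 1.4826 * np.median(np.abs(z - mu0))
    A, B = mu0 - zeta * sigma0, mu0 + zeta * sigma0

    inside = z[(z >= A) & (z <= B)]
    N1, N0 = inside.size, z.size - inside.size

    def neg_loglik(params, p):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        Q = norm.cdf((B - mu) / sigma) - norm.cdf((A - mu) / sigma)
        theta = p * Q
        if theta <= 0.0 or theta >= 1.0:
            return np.inf
        # log L = N1 log(theta) + N0 log(1 - theta) + sum_{i in I} [log phi - log Q]
        ll = N1 * np.log(theta) + N0 * np.log(1.0 - theta)
        ll += np.sum(norm.logpdf(inside, mu, sigma)) - N1 * np.log(Q)
        return -ll

    best = (-np.inf, mu0, sigma0, 1.0)
    for p in p_grid:  # profile likelihood over the grid of p values
        res = minimize(neg_loglik, x0=[mu0, np.log(sigma0)],
                       args=(p,), method="Nelder-Mead")
        if -res.fun > best[0]:
            best = (-res.fun, res.x[0], np.exp(res.x[1]), p)
    _, mu_M, sigma_M, p_M = best
    return mu_M, sigma_M, p_M
```

Because the likelihood only involves the counts $N_0$, $N_1$ and the Z-scores inside $[A, B]$, outliers influence the fit only through $N_0$, which is what makes the estimates robust.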
Focusing first on the variance estimates, we fit a regression model with the variance estimates $\{\tilde\sigma_g^2\}$ as the dependent and the median sizes $\{\tilde m_g\}$ as the independent variable. Based on our empirical studies and theoretical derivations, a linear regression of the form $\gamma_0 + \gamma_1 \tilde m_g$ will typically suffice. An iteratively re-weighted algorithm is used to estimate $(\gamma_0, \gamma_1)$. We set the initial estimates to the ordinary least squares estimates $(\hat\gamma_0^{(0)}, \hat\gamma_1^{(0)})$, then update with
$$(\hat\gamma_0^{(t+1)}, \hat\gamma_1^{(t+1)}) = \mathrm{argmin}_{(\gamma_0,\gamma_1)} \sum_{g=1}^G \omega_g^{(t)}\,(\tilde\sigma_g^2 - \gamma_0 - \gamma_1 \tilde m_g)^2$$
until convergence, where the weights are $\omega_g^{(t)} = N_g\,(\hat\gamma_0^{(t)} + \hat\gamma_1^{(t)} \tilde m_g)^{-2}$ and $N_g$ is the number of providers in the $g$th group. The final fitted variance value for each provider is denoted $\hat\sigma_i^2$, $i = 1, \cdots, N$. We include the weights in the regression to reduce the leverage of the variance estimates with large variability from large providers.

An important issue is the choice of the total number of groups $G$. A small $G$ will not provide sufficient information for estimating the regression coefficients precisely in a linear model, whereas a large $G$ results in a very small number of providers in each group and consequently unstable within-group variance estimates. From our experience with the SMR example, we find it satisfactory to choose $G$ such that there are 50 to 300 providers in each group. The estimated mean and variance functions under different choices of $G$ are fairly stable, as reported in the supplementary material, Section S2.

To estimate the mean Z-score as a function of provider size, we use a weighted smoothing technique such as a smoothing spline, B-spline or LOESS on the data $\{(\tilde z_g, \tilde m_g)\}_{g=1}^G$. The weight associated with the mean estimate $\tilde z_g$ in the $g$th group is inversely proportional to the variance estimate $\hat\sigma_g^2$ from the iteratively re-weighted algorithm above. One can easily find existing implementations of these methods, for example, smooth.spline() for smoothing splines in R.
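The iteratively re-weighted variance regression described above can be sketched as follows (a minimal sketch with our own function name; NumPy assumed).

```python
import numpy as np

def fit_variance_line(m, s2, n_g, tol=1e-8, max_iter=100):
    """Weighted linear fit of group variance estimates s2 on median sizes m.

    Weights omega_g = N_g / (gamma0 + gamma1 * m_g)^2 down-weight groups
    whose variance estimates are themselves highly variable."""
    X = np.column_stack([np.ones_like(m), m])
    # Initial ordinary least squares fit.
    gamma, *_ = np.linalg.lstsq(X, s2, rcond=None)
    for _ in range(max_iter):
        fitted = X @ gamma
        w = n_g / fitted**2                       # omega_g at current estimates
        Xw = X * w[:, None]
        # Weighted least squares via the normal equations (X'WX) gamma = X'W s2.
        gamma_new = np.linalg.solve(X.T @ Xw, Xw.T @ s2)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            gamma = gamma_new
            break
        gamma = gamma_new
    return gamma  # (gamma0, gamma1)
```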
For provider sizes that fall outside the range of $\{\tilde m_g\}_{g=1}^G$, one may extrapolate either with the estimated linear function or with a plateau, so that the mean function remains continuous and flat. The fitted Z-scores from the smoothing are denoted $\hat Z_i$, $i = 1, \cdots, N$. We propose this group-based smoothing instead of direct smoothing based on the original Z-scores because the MLE fitting within each group is more robust against potential outliers.

Examination of the $i$th provider may then proceed based on the individual empirical null distribution $N(\hat Z_i, \hat\sigma_i^2)$. Similar to the other methods, the $i$th provider is flagged as worse than expected if $Z_{FE,i} > \hat Z_i + z_\rho\,\hat\sigma_i$, for a one-sided test with nominal level $\rho$. Smoothing provides a better approximation to the total variance of the Z-scores and avoids the unfairness associated with simple stratification. When the normal linear model (1) holds for all providers, this approach gives almost identical results to the FERE approach. Fig. 2(e)-(f) show the critical line for flagging dialysis facilities with poor outcomes in the SMR data using the smoothed empirical null ($z_\rho = 1.64$ for the black dotted lines).
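Once the smoothed mean and variance functions are in hand, the flagging rule reduces to a one-line comparison (a hypothetical helper of our own; 1.645 is the upper 5% standard normal quantile).

```python
import numpy as np

def flag_providers(z_fe, z_hat, sigma2_hat, z_rho=1.645):
    """Flag provider i as 'worse than expected' if
    Z_FE,i > Zhat_i + z_rho * sigma_hat_i (one-sided test)."""
    return z_fe > z_hat + z_rho * np.sqrt(sigma2_hat)
```

For example, a provider with $Z_{FE} = 3$ against a null of $N(0, 1)$ is flagged, while the same Z-score of 1 against a more dispersed null $N(0, 4)$ is not.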
Extension to non-linear models

Many types of outcomes are monitored for quality of care, including patient hospitalizations, readmissions, transfusions and death events in the ESRD dialysis facility data. A hospital readmission measure within thirty days following a hospital discharge is based on a logistic model for binary outcomes, and 30-day mortality rates are analyzed in a similar manner. Hospitalizations are analyzed using a model for recurrent events. The empirical null approach can be readily extended to non-linear models and provides a unifying framework for profiling providers.

We continue with the SMR example, which assesses patient death events within a health care provider. For the $i$th provider, we denote the observed number of deaths by $O_i$. A two-stage modeling procedure is used to obtain the expected number of deaths under the assumption that patients of this provider have death events at the national average rate (UM-KECC, 2018). The following is the approach currently adopted by CMS in Dialysis Facility Compare.

In the first stage model, the hazard function of the $j$th patient in the $i$th provider is assumed to be $\lambda_{ij}(t) = \lambda_i(t)\exp\{X_{ij}^T\beta\}$, where $\lambda_i$ is a provider-specific baseline hazard, and $\beta$ represents the regression coefficients associated with the observed patient-level characteristics $X_{ij}$, such as age, gender, race, BMI and a selected set of comorbidities. This stratified Cox model is fitted to the national data in order to estimate $\beta$, with the estimate denoted $\hat\beta$. Stratification by providers is important to accurately estimate the within-provider effects of the covariates, $\beta$.

In the second stage, the "population-average" cumulative baseline hazard $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ is estimated through an unstratified Cox model with an offset, $X_{ij}^T\hat\beta$, obtained from the first stage.
Conditional on the patient characteristics $X_{ij}$ and the at-risk process $Y_{ij}(t)$, the expected number of events for the $j$th patient in the $i$th provider is calculated as $E_{ij} = \int_0^\tau Y_{ij}(t)\exp\{\hat\beta^T X_{ij}\}\,d\hat\Lambda_0(t)$, where $\tau$ is the maximum follow-up time. For the $i$th provider, the expected number of events is $E_i = \sum_{j=1}^{n_i} E_{ij}$, and the corresponding SMR is estimated by $\widehat{SMR}_i = O_i/E_i$. If $\widehat{SMR}_i > 1$ ($< 1$), the $i$th provider experiences more (fewer) deaths than expected under the national norm given the observed characteristics of its patients. Note that $E_{ij}$ is a sum of conditional expectations and is in fact the compensator in the martingale corresponding to the counting process for the individual.

A test of the sharp null hypothesis $H_i: SMR_i = 1$, where $SMR_i$ is the underlying SMR for the $i$th provider, can be obtained using a Poisson approximation whereby the number of events $O_i$ is assumed to follow a Poisson distribution with mean $E_i$. In this case, the one-sided mid p-value is $p_i = P(X = O_i)/2 + P(X > O_i)$, where $X \sim \mathrm{Poisson}(E_i)$. These can be converted to Z-scores using $Z_{FE,i} = \Phi^{-1}(1 - p_i)$. By this convention, large values of $Z_{FE,i}$ are associated with poor outcomes. Mid p-values are used to avoid difficulties in converting p-values to Z-scores. The Poisson-based p-values in this example are preferable to a normal approximation with Z-scores $(O_i - E_i)/\sqrt{E_i}$, since they are more accurate when the provider size is small.

These FE Z-scores are consequently used to construct empirical null distributions for profiling, as introduced in Sections 3.1 and 3.2. A similar approach applies directly to other standardized measures based on other regression models, such as hierarchical logistic regression for hospital readmission. Instead of converting p-values, FE Z-scores could also be based on Wald statistics from fixed-effects estimates of the provider effects and a test of the sharp null hypothesis.
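The Poisson mid p-value conversion described above can be sketched as follows (a minimal sketch assuming SciPy; the function name is ours).

```python
import numpy as np
from scipy.stats import norm, poisson

def smr_z_score(O, E):
    """One-sided mid p-value for H: SMR = 1 under O ~ Poisson(E),
    converted to a Z-score; large Z indicates poor outcomes."""
    # mid p = P(X = O)/2 + P(X > O); sf gives the strict upper tail
    p_mid = 0.5 * poisson.pmf(O, E) + poisson.sf(O, E)
    return norm.ppf(1.0 - p_mid)
```

The mid p-value averages over the point mass at the observed count, so the resulting Z-scores are approximately standard normal under the null even for small $E_i$, where the $(O_i - E_i)/\sqrt{E_i}$ approximation is poor.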
FE and RE methods analogous to those in the linear model are also sometimes used. RE methods tend to be complicated and subject to the same concerns as described in Section 2.2 for the linear model; they behave similarly to FE methods in large providers and result in unreasonably high flagging rates for large providers.

Similar to FERE, the empirical null approach presented above takes account of the total variation in the Z-scores. This is appropriate when most or all of the between-provider variation is due to incomplete risk adjustment, as opposed to the quality of care. Kalbfleisch and others (2018) suggest introducing a value, $\lambda$, that represents the proportion of the between-provider variance that is due to incomplete risk adjustment, and holding providers accountable for a proportion $1 - \lambda$ of the between-provider variation. Note that $\lambda$ cannot be estimated from the data and must be specified, based perhaps on expert opinion.

In the linear model (1), we can write $\alpha_i = \alpha_{1i} + \alpha_{2i}$, where $\alpha_{1i} \sim N(0, (1-\lambda)\sigma^2_\alpha)$ represents the portion of the effect due to the quality of care, whereas the independent effect $\alpha_{2i} \sim N(0, \lambda\sigma^2_\alpha)$ is variation that is outside the provider's control. In this case, it is natural to base profiling on an assessment of the hypothesis $H_\lambda: \alpha_{1i} = 0$. Under this hypothesis, the null distribution for $Z_{FE,i}$ is $N(0, (\lambda\sigma^2_\alpha + \sigma^2_w/n_i)/(\sigma^2_w/n_i))$. This is a natural generalization that connects the FE and FERE approaches discussed in Section 2, which correspond respectively to $\lambda = 0$ and $\lambda = 1$. Furthermore, for a general $0 \le \lambda \le 1$, the null distribution for $Z_{FE,i}$ can also be written as that of $\sqrt{1-\lambda}\, N(0,1) + \sqrt{\lambda}\, N(0, 1/(1-r_i))$, a sum of independent normal variables. Here, $r_i = \sigma^2_\alpha/(\sigma^2_\alpha + \sigma^2_w/n_i)$, the shrinkage factor defined earlier, is also referred to as the inter-unit reliability (IUR).

Analogous to the relaxation in the linear model above, we can extend the idea of decomposing the between-provider variance to the empirical null in non-linear models. Suppose we have obtained the empirical null distribution for a provider, $N(\hat{Z}_i, \hat\sigma^2_i)$. The IUR, in general, represents the proportion of the total variance taken up by the between-provider variance, and can be computed in non-linear models (He and others, 2019). Then, for the $i$th provider with an IUR equal to $r_i$, the variance to be allowed is
$$\hat\sigma^2_{\lambda,i} = \hat\sigma^2_i - r_i \hat\sigma^2_i (1-\lambda) = [1 - r_i(1-\lambda)]\hat\sigma^2_i. \qquad (9)$$
Additionally, if the FE Z-scores are computed as Wald statistics or asymptotically equivalent statistics by assuming fixed provider effects, the variance in the reference distribution $\hat\sigma^2_i$ can be approximated by $1/(1-r_i)$ when all between-provider variation is due to incomplete risk adjustment ($\lambda = 1$), and $\hat\sigma^2_{\lambda,i}$ in (9) can also be written as $\hat\sigma^2_{\lambda,i} = 1 - \lambda + \lambda\hat\sigma^2_i$. The new reference distribution allowing a proportion $\lambda$ of the between-provider variance is $N(\hat{Z}_i, \hat\sigma^2_{\lambda,i})$. Note that the incorporation of $\lambda$ simply changes the critical value of the test based on the empirical null.

More generally, we might elicit a prior distribution $f_\lambda(\lambda)$ for $\lambda$ that reflects experts' uncertainty about its value. In this case, we might gauge a provider's performance by comparing its FE Z-score to the marginal empirical null distribution. Since the distribution of the data does not depend on $\lambda$, the posterior distribution is the same as the prior. The marginal empirical null distribution has density $\int f_{EN}(z \mid \lambda) f_\lambda(\lambda)\, d\lambda$, where $f_{EN}(z \mid \lambda)$ is the density of $N(\hat{Z}_i, \hat\sigma^2_{\lambda,i})$.
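The variance adjustment in (9) and the evaluation of a tail probability under the marginal empirical null are both short computations. Below is a minimal Python sketch; the names are ours, and the Uniform(0, 1) prior for $\lambda$ is purely an illustrative stand-in for an elicited $f_\lambda$.

```python
import random
from statistics import NormalDist

def en_variance(sigma2_i: float, r_i: float, lam: float) -> float:
    """Equation (9): reference variance when a proportion lam of the
    between-provider variance is attributed to incomplete risk
    adjustment (lam = 1 allows all of it, lam = 0 none of it)."""
    return (1.0 - r_i * (1.0 - lam)) * sigma2_i

def marginal_null_tail(z: float, z_hat: float, sigma2_i: float,
                       r_i: float, draws: int = 50_000,
                       seed: int = 1) -> float:
    """P(Z > z) under the marginal empirical null, integrating over a
    prior on lambda by Monte Carlo; each draw contributes the exact
    conditional normal tail rather than a sampled Z value."""
    rng, nd = random.Random(seed), NormalDist()
    tail = 0.0
    for _ in range(draws):
        lam = rng.random()                        # lambda ~ Uniform(0, 1)
        sd = en_variance(sigma2_i, r_i, lam) ** 0.5
        tail += 1.0 - nd.cdf((z - z_hat) / sd)
    return tail / draws
```

A useful sanity check: when $r_i = 0$ the provider's Z-score carries no between-provider information, (9) reduces to $\hat\sigma^2_i$ for every $\lambda$, and the marginal null collapses to $N(\hat{Z}_i, \hat\sigma^2_i)$ regardless of the prior.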
This marginal distribution can be approximated with Monte Carlo methods that draw random samples from $f_\lambda(\lambda)$ and then from $f_{EN}(z \mid \lambda)$. Given $(\hat{Z}_i, \hat\sigma^2_i)$, the mean of the marginal distribution is simply $\hat{Z}_i$ and the variance is obtained by substituting $\lambda$ with its prior mean in (9). Depending on the prior, the marginal can have heavier tails than a normal distribution. Detailed discussion is presented in the Supplementary Material Section S3.

We first restrict all providers to have the same sample size, and compare the probability that providers give rise to a signal under different approaches. We assume the true model (1), with $\mu = 0$, $\beta = 0$, $\epsilon_{ij} \sim N(0, \sigma^2_w)$ and $\alpha_i \sim N(0, \sigma^2_\alpha)$ for $i \ge 2$, with $\alpha_1$ fixed at a value varying from 0 to 3.5. We simulate $N = 200$ providers of size $n_i = n$ for all $i$, where $n = 10$, 25, 50 or 100. Providers with Z-scores exceeding $\Phi^{-1}(0.95) = 1.64$ are flagged as worse than expected.

Fig. 3 shows the estimated probabilities of signaling provider 1 for the FE, RE, FERE and EN approaches. The FE approach flags provider 1 with the highest probabilities in all cases. As expected, the difference between FE and RE diminishes for a large sample size $n$. Without outlying providers, FERE and EN result in almost identical probabilities of signaling, reflecting their asymptotic equivalence in this setup. For a large sample size, e.g. $n = 100$, standard FE and RE methods signal provider 1 with moderate to high probability even for relatively small values of $\alpha_1$ that are well within the range of variation expected for $\alpha_i$ under the true model. On the other hand, FERE and EN allow for this variation and do not signal with high probability until $\alpha_1$ is in the tail of the distribution of provider effects ($\sigma_\alpha = 1$). It should be noted that the exact probability of flagging can be easily calculated and plotted for all methods except EN. We present the empirical probability in all cases to facilitate fair comparison.

To illustrate the robustness of EN compared to FERE, we also simulate $N = 3{,}000$ providers with 5% outliers. Still, we assume model (1) for the majority of providers, with $\mu = 0$, $\beta = 0$, $\sigma_w = 4$ and $\sigma_\alpha = 1$, except that $\alpha_1$ is fixed at a value varying from 0 to $3.5\sigma_\alpha$. Half of the outliers have provider effect $\alpha_i = 4\sigma_\alpha$ and the other half $\alpha_i = -4\sigma_\alpha$, well outside the center of the true distribution for the majority of provider effects. The sample size $n_i$ is simulated from a discrete uniform distribution for all providers except provider 1, whose sample size $n_1$ is set to 25, 50, 100 or 125. Fig. 4 shows that EN performs close to the case in which $\sigma_\alpha$ and $\sigma_w$ are known (black dashed line), and is more robust than FERE, which has lower flagging proportions due to over-estimation of the between-provider variance, especially when $n_1$ is large.

We consider a realistic situation where providers are of different sizes and simulate survival outcomes that mimic the SMR example in Section 3.3. A similar study where providers are of the same size is presented in the Supplementary Materials Section S1.

We generate $N = 2{,}000$ providers whose sample sizes are simulated from a discrete uniform distribution on integers with minimum 10. For the $j$th subject in the $i$th provider, the survival time $T_{ij}$ follows an exponential distribution with hazard $\lambda_{ij} = \lambda_0 \exp\{\alpha_i + X_{ij1}\beta_1 + X_{ij2}\beta_2\}$, where $\lambda_0$ is a constant baseline hazard, $\beta_1 = 1$, $\beta_2$ is negative, the provider effects are $\alpha_i \stackrel{iid}{\sim} N(0, \sigma^2_\alpha)$, and the covariates $X_{ij1}$ and $X_{ij2}$ are independent $N(0, 1)$ variables. The censoring time $C_{ij}$ follows an independent uniform distribution. For the smoothed empirical null, providers are stratified by size into $G$ groups, with $G$ taken between 5 and 60. Linear regression models are fitted to the group-wise variance estimates of the Z-scores, and weighted smoothing splines to the group-wise mean estimates. We repeat the simulation 500 times.

Fig. 5(a)-(b) show the estimated mean and variance functions of the Z-scores in one replication when the number of groups is $G = 20$. Results with different $G$ show no sensitivity to the selection of $G$ in the range 5 to 60 (Section S2 in the Supplementary Materials). Our proposed method captures the main features of the mean and variance functions while being smooth enough to provide consistent flagging rules within the considered range of provider size.

If one assumes that a proportion $\lambda$ of the between-provider variance is due to incomplete risk adjustment, then the methods of Section 3.4 can be used. Following the same simulation setup with survival outcomes above, we consider $\lambda = 0$, 0.5, 0.75 and 1, and modify the variance $\hat\sigma^2_{\lambda,i}$ in the reference distribution accordingly. Fig. 5(c) shows box plots of the proportion of times that providers are flagged in 500 replications, stratified into three groups by provider size. The FE approach, corresponding to $\lambda = 0$, flags a provider if its one-sided p-value is less than 0.05, which results in over 25% of the large providers and about 15% of the small providers being signaled. In contrast, the EN approach that allows all variation ($\lambda = 1$) has very stable flagging rates around 5% for all three groups. The EN approach with the relaxation factor $\lambda$ can be viewed as a hybrid of the FE analysis and the EN allowing the total variance, hence its flagging rates lie between the latter two. When $\lambda = 0.5$ or 0.75, the flagging rates increase somewhat with provider size, a feature inherited from FE.
The Standardized Mortality Ratio (SMR) is used as a measure of mortality to profile dialysis facilities at CMS. More details on the SMR can be found in Sections 1 and 3.3. Data were collected from 2012 to 2015, involving over half a million dialysis patients. In this analysis, we include 6,363 dialysis facilities with an expected number of deaths of 3.0 or more. The number of observed deaths ranges from 0 to 581, the number of expected deaths from 3.0 to 308.6, and facility sizes range from 6.9 to 1,569.8 patient-years. Facility size tertiles, defined by cut points at 156.9 and 302.8 patient-years, create three groups of small, medium and large facilities.

Fig. 2 shows the histograms of FE Z-scores by stratum, the distribution of facility size and the different flagging threshold lines for the stratified and the smoothed EN methods, and has been discussed in Section 3. Switching from the stratified EN to the smoothed version changes the flagging labels for some facilities. Nine facilities are flagged as "worse than expected" by the stratified EN but not the smoothed EN, and 18 facilities are flagged by the smoothed EN but not the stratified EN. A total of 367 facilities are flagged by both EN methods, and 5,968 facilities by neither. The FE approach results in 768 facilities being flagged (12.1% of the total number), with 231 (10.9%), 241 (11.4%) and 296 (14.0%) in the small, medium and large groups, respectively (percentages are with respect to the number of facilities in each group). The FE approach flags more large facilities than small ones, and the flagging rates exceed the target 5% by large margins, due at least in part to overdispersion. Using the smoothed EN, the flagging rates in the three groups are brought down to a more equitable level: 141 (6.6%), 109 (5.1%) and 123 (5.8%). Due to the possible existence of outliers, the empirical null based flagging rates are slightly larger than the target level.
Besides the comments in Section 3.4, one may also choose a $\lambda$ value that would result in a certain proportion of providers being flagged, especially when there are constraints on the resources made available for the review process or quality improvement program. This could also be accomplished by changing the nominal flagging rate $\rho$.

Generalization of the RE and FERE methods to non-linear models, such as the logistic model or the Cox proportional hazards model, is complicated. For example, methods developed for assessing hospital readmission rates were based on an RE analysis of a hierarchical logistic model (Horwitz and others, 2011), and entailed complicated bootstrap techniques to assess significance. The EN method, however, generalizes immediately as described in Section 3.3, and is applied to a more complicated logistic model for hospital readmissions of dialysis patients in He and others (2013).

In the EN approach, we have assumed that there are a large number of providers to be profiled and that the central part of the histogram of Z-scores is well described by a normal distribution. These two assumptions are satisfied in many applications. When the number of facilities is much smaller, it is important to take into account uncertainty in the estimation of the empirical null distribution. Also, in some instances, it may be useful to use a transformation of the basic measure in order to achieve approximate normality of the empirical null. In other cases, there may be situations where a non-normal distribution that incorporates skewness, for example, is more appropriate. These are areas where additional work is needed.

As noted earlier, an FE analysis to estimate $\beta$ and then use of an offset is one approach to correct for the confounding between covariates and provider effects in the RE method.
An alternative approach is to include both a within and a between regression coefficient in the model
$$Y_{ij} = \mu + \alpha_i + \beta_w^T(X_{ij} - \bar{X}_i) + \beta_b^T \bar{X}_i + \epsilon_{ij},$$
as described in Neuhaus and Kalbfleisch (1998). A feature which sometimes arises is that the between-provider regression coefficient $\beta_b \ne \beta_w$, in which case the within coefficient does not make a full adjustment for the covariates under consideration. This would arise, for example, if, even having adjusted for a variable like race, one found that there was still a difference between providers according to the racial mix that they treat. Such situations require careful consideration, and a discussion of the issues associated with this can be found in Kalbfleisch and others (2018).

Bayesian methods are computationally attractive and have often been used in profiling. These impose distributional assumptions on the random provider effects, often with hyperparameters in full Bayesian methods (Normand and others, 1997; Racz and Sedransk, 2010; Normand and Shahian, 2007). However, some confusion remains owing to non-unified criteria for identifying outlying providers based on posterior distributions. For instance, Racz and Sedransk (2010) assessed whether a pre-determined norm lies between posterior percentiles, e.g. 2.5% and 97.5%, of a provider effect in a hierarchical logistic regression model. Normand and others (1997) considered the posterior probability of the excess expected mortality (the difference between the expected mortality under a provider's own regression coefficients and that averaged over the provider-specific parameters) being larger than a benchmark, and the posterior probability of the adjusted mortality of a reference patient being greater than that for similar patients in all providers in the same sample. Often, the extremeness of the observed mortality has also been assessed based on the posterior predictive distribution through replications (Normand and Shahian, 2007).
Based on a hierarchical logistic regression model, Bayesian intervals can be constructed for standardized readmission measures via bootstrapping from a normal approximation to the posterior distribution of provider effects, in an approach similar to the RE methods discussed in this paper. As with RE methods, it is worthwhile, with a Bayesian approach, to take measures to ensure fair assessments of providers of all sizes.

A causal inference framework provides a promising but challenging approach to profiling health care providers. In general, the existence of unmeasured confounders poses difficulties in the inference on providers' performance. Spertus and others (2016) implemented augmented inverse probability weighting (Robins and others, 1994) and targeted maximum likelihood estimation (van der Laan and Rubin, 2006) under a causal inference framework for profiling, coupled with the elastic net for variable selection. Spertus and others (2016) also discussed using instrumental variables if one has strong reasons to assume a particular underlying causal mechanism. As is often the case for causal inference, caution is needed in connection with many commonly made but non-verifiable assumptions.

Software
The R code for implementing the empirical null with simulation examples in this paper has been made available at https://github.com/luxia-bios/Empirical-Null .

Supplementary Material
Supplementary material will be available online at http://biostatistics.oxfordjournals.org.
This work was supported in part by The Centers for Medicare and Medicaid Services [contract number HHSM-500-2008-000211], although the opinions expressed in this article do not necessarily reflect those of the CMS or the U.S. government.
Conflict of Interest:
None.
References
Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H. and Tukey, J. W. (1972). Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton University Press.

Chen, C. (2002). Robust regression and outlier detection with the ROBUSTREG procedure. Paper 265-27. In: Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference. Cary, NC: SAS Institute Inc.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association (465), 96–104.

Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics (4), 1351–1377.

Efron, B. (2010). Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge, UK: Cambridge University Press.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors – an empirical Bayes approach. Journal of the American Statistical Association (341), 117–130.

Estes, J. P., Nguyen, D. V., Chen, Y., Dalrymple, L. S., Rhee, C. M., Kalantar-Zadeh, K. and Şentürk, D. (2018). Time-dynamic profiling with application to hospital readmission among patients on dialysis. Biometrics (4), 1383–1394.

He, K., Kalbfleisch, J. D., Li, Y. and Li, Y. (2013). Evaluating hospital readmission rates in dialysis facilities; adjusting for hospital effects. Lifetime Data Analysis (4), 490–512.

He, K., Kalbfleisch, J. D., Yang, Y. and Fei, Z. (2019). Inter-unit reliability for nonlinear models. Statistics in Medicine (5), 844–854.

Horwitz, L., Partovian, C., Lin, Z., Herrin, J., Grady, J., Conover, M., Montague, J., Dillaway, C., Bartczak, K., Ross, J., Bernheim, S., Drye, E. and others (2011).

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics (1), 73–101.

Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics (5), 799–821.

Jarman, B., Pieter, D., van der Veen, A. A., Kool, R. B., Aylin, P., Bottle, A., Westert, G. P. and Jones, S. (2010). The hospital standardised mortality ratio: a powerful tool for Dutch hospitals to assess their quality of care? BMJ Quality & Safety (1), 9–13.

Jones, H. E. and Spiegelhalter, D. J. (2011). The identification of "unusual" health-care providers from a hierarchical model. The American Statistician (3), 154–163.

Kalbfleisch, J. D., He, K., Xia, L. and Li, Y. (2018). Does the inter-unit reliability (IUR) measure reliability? Health Services and Outcomes Research Methodology (3), 215–225.

Kalbfleisch, J. D. and Wolfe, R. A. (2013). On monitoring outcomes of medical providers. Statistics in Biosciences (2), 286–302.

Koller, M. (2016). robustlmm: an R package for robust estimation of linear mixed-effects models. Journal of Statistical Software (6), 1–24.

Krumholz, H. M., Lin, Z., Drye, E. E., Desai, M. M., Han, L. F., Rapp, M. T., Mattera, J. A. and Normand, S-L. T. (2011). An administrative claims measure suitable for profiling hospital performance based on 30-day all-cause readmission rates among patients with acute myocardial infarction. Circulation: Cardiovascular Quality and Outcomes (2), 243–252.

Louis, T. A. (1991). Assessing, accommodating, and interpreting the influences of heterogeneity. Environmental Health Perspectives, 215–222.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal (4), 308–313.

Neuhaus, J. M. and Kalbfleisch, J. D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics (2), 638–645.

Normand, S. T., Glickman, M. E. and Gatsonis, C. A. (1997). Statistical methods for profiling providers of medical care: issues and applications. Journal of the American Statistical Association (439), 803–814.

Normand, S. T. and Shahian, D. M. (2007). Statistical and clinical aspects of hospital outcomes profiling. Statistical Science (2), 206–226.

Ohlssen, D. I., Sharples, L. D. and Spiegelhalter, D. J. (2007). A hierarchical modelling framework for identifying unusual performance in health care providers. Journal of the Royal Statistical Society: Series A (Statistics in Society) (4), 865–890.

Racz, M. J. and Sedransk, J. (2010). Bayesian and frequentist methods for provider profiling using risk-adjusted assessments of medical outcomes. Journal of the American Statistical Association (489), 48–58.

Robins, J. M., Rotnitzky, A. and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association (427), 846–866.

Spertus, J. V., Normand, S-L., Wolf, R., Cioffi, M., Lovett, A. and Rose, S. (2016). Assessing hospital performance after percutaneous coronary intervention using big data. Circulation: Cardiovascular Quality and Outcomes (6), 659–669.

Spiegelhalter, D. J., Sherlaw-Johnson, C., Bardsley, M., Blunt, I., Wood, C. and Grigg, O. (2012). Statistical methods for healthcare regulation: rating, screening and surveillance. Journal of the Royal Statistical Society: Series A (Statistics in Society) (1), 1–47.

UM-KECC. (2018, July). Technical notes on the standardized mortality ratio (SMR) for the dialysis facility reports. Technical Report. https://dialysisdata.org/sites/default/files/content/Methodology/SMRDocumentation.pdf.

van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics (1). doi:10.2202/1557-4679.1043.

Wish, J. B. (2014). The role of 30-day readmission as a measure of quality. Clinical Journal of the American Society of Nephrology (3), 440–442.
Figure 1: Conditional and marginal MSEs of the FE and RE estimates of $\alpha_i$ in the linear model (1). Conditional MSEs are calculated conditional on the true value, $\alpha_i$. Here $\sigma_\alpha = 1$, $\sigma_w = 5$ and $n_i = 100$. The marginal MSE of the FE estimate coincides with its conditional MSE.

Figure 2: Histograms of FE Z-scores in the dialysis mortality data (SMR), stratified by facility size tertiles into (a) small, (b) medium, and (c) large facility groups. The smooth curves represent the standard normal and the empirical null distributions (approximately $N(-0.02, 1.23^2)$, $N(-0.07, 1.4^2)$ and $N(-0.13, 1.61^2)$ for the three groups). (d) Histogram of facility size in patient-years. (e) Scatter plot of FE Z-scores versus facility size along with flagging thresholds based on the stratified EN (solid blue lines) and the smoothed EN (dotted black lines). The two vertical grey lines separate facilities into three groups by the tertiles of facility size. The red square in (e) is magnified in (f).

Figure 3: Estimated probability of signaling provider 1 as worse than expected under the linear model (1), with all providers having sample size (a) $n = 10$, (b) $n = 25$, (c) $n = 50$, and (d) $n = 100$. The x-axis represents the fixed value of $\alpha_1$.

Figure 4: Estimated probability of signaling provider 1 as worse than expected under the linear model (1), with 5% outliers, $\mu = 0$, $\beta = 0$, $\sigma_w = 4$ and $\sigma_\alpha = 1$. Provider sample sizes are drawn from a discrete uniform distribution except for provider 1, with (a) $n_1 = 25$, (b) $n_1 = 50$, (c) $n_1 = 100$ and (d) $n_1 = 125$. Outlier effects have equal probability of taking values of $\pm 4$. The x-axis represents the fixed value of $\alpha_1$. Dashed black lines represent the estimated probability when the true parameters $\sigma_w$ and $\sigma_\alpha$ are known.

Figure 5: (a) Estimated mean function and (b) estimated variance function, by the smoothed empirical null, in one replication of the simulation with survival outcomes, with the number of groups $G = 20$. Black dots represent group-wise robust mean and variance estimates of Z-scores. (c) Boxplots of the empirical probability of signal summarized over 500 replications, with all providers stratified into three groups by provider size. A proportion $\lambda = 0$, 0.5, 0.75 or 1 of the between-provider variance is allowed in the empirical null.