Accounting for total variation and robustness in profiling health care providers
Lu Xia, Kevin He, Yanming Li, John D. Kalbfleisch∗
Department of Biostatistics, University of Michigan
1415 Washington Heights, Ann Arbor, MI 48109, USA
∗To whom correspondence should be addressed: jdkalbfl@umich.edu

Abstract
Monitoring outcomes of health care providers, such as patient deaths, hospitalizations and hospital readmissions, helps in assessing the quality of health care. We consider a large database on patients being treated at dialysis facilities in the United States, and the problem of identifying facilities with outcomes that are better than or worse than expected. Analyses of such data have commonly been based on random or fixed facility effects, which have shortcomings that can lead to unfair assessments. A primary issue is that they do not appropriately account for variation between providers that is outside the providers' control due, for example, to unobserved patient characteristics that vary between providers. In this article, we propose a smoothed empirical null approach that accounts for the total variation and adapts to different provider sizes. The linear model provides an illustration that extends easily to other nonlinear models for survival or binary outcomes, for example. The empirical null method is generalized to allow for some variation being due to quality of care. These methods are examined with numerical simulations and applied to the monitoring of survival in the dialysis facility data.
Keywords:
Health care provider profiling; Empirical null; Fixed effects; Random effects; Non-linear models; Standardized mortality ratio.
Introduction

In many instances, the quality of medical care differs considerably across providers. Large health care databases provide patient-level data that can be analyzed and summarized at the provider level (… and others, 2010). Similarly, the rate of hospital readmissions following a hospital discharge is important in monitoring hospitals and also other providers such as nursing homes and dialysis facilities, since it can be indicative of the degree of coordination of post-discharge care (He and others, 2013; Wish, 2014). Provider profiling based on clinical outcomes can help patients make decisions in choosing health care providers, and can also aid overseers and payers in identifying providers whose outcomes are worse or better than a normative standard, by signaling the need for further review or targeted quality improvement programs. The statistical analysis, along with further review, may lead to financial consequences and even suspension for providers with poor outcomes. Such evaluations require appropriate statistical methods that account for differences in patient characteristics and suitably adapt to provider size, to ensure that no specific groups of small or large providers are unfairly penalized in the assessment.

This paper is concerned with a large database assembled by the University of Michigan's Kidney Epidemiology and Cost Center and associated with its contracts with the Centers for Medicare & Medicaid Services (CMS). A similar database is available from the United States Renal Data System (USRDS), in the form of standard analysis files, for use in suitable research projects. The data consist of longitudinal information on over half a million dialysis patients with end-stage renal disease (ESRD), associated with over 6,000 dialysis facilities in the United States. These data are compiled from various sources within the CROWNWeb data system from the CMS as well as supplementary data from the Social Security master death file.
The CROWNWeb data include demographic and administrative data on patients and dialysis facilities along with ongoing reporting on treatment, the Medicare claims database, which can be used to determine hospitalizations, comorbidities, and other events, as well as data on kidney transplants in the United States. One important use of these data is to monitor dialysis facilities with respect to various quality measures, including mortality, hospitalization, and readmission. In this paper, we consider the evaluation of mortality over a four-year period, accomplished through the calculation of a standardized mortality ratio (SMR) for each dialysis facility. The SMR is a risk-adjusted measure which compares the observed number of deaths in a facility with the number of deaths that would be expected if the mortality rate for the center, taking account of observed patient characteristics, were the same as for a national norm (UM-KECC, 2018). The measure is typically adjusted for many covariates, including demographic characteristics, comorbidities at baseline, and sometimes comorbidities identified over time. Adjusting for covariates after admission of the patient to the facility must be done with care, since such comorbidities may reflect treatment choices.

Regression models for patient-level outcomes are typically used to adjust for observed patient characteristics. Methods are often based on hierarchical random effects models, in which the provider-specific effects are modeled as independent and identically distributed random variables. A typical analysis of these models is based on the empirical Bayes posterior distribution of the provider effect, where an individual provider effect is estimated by the posterior mean. We refer to this as the random-effects (RE) approach. As is well known, this analysis yields shrinkage estimates, i.e., the estimated provider effects are shrunk towards the overall average (Jones and Spiegelhalter, 2011).
Many authors have advocated this approach; see, for example, Normand and others (1997); Normand and Shahian (2007); Jones and Spiegelhalter (2011); Ohlssen and others (2007). An alternative approach, treating the provider effects as fixed, obtains the maximum likelihood estimates of the provider effects. We refer to this as the fixed-effects (FE) approach. In both approaches, it is common to use the estimate and its frequency or posterior distribution to assess the sharp null hypothesis that a provider effect is identical to a national norm. Much of the between-provider variation is typically outside the providers' control, due to inadequate risk adjustment for unobserved or observed covariates associated with the outcome that vary substantially between providers. Jones and Spiegelhalter (2011) and Kalbfleisch and others (2018) stressed that all sources of variation should be taken into account in profiling.

In this paper, we propose empirical null methods that take into account the total variation as a robustly modeled function of provider size. The empirical null was introduced in Efron (2004) and Efron (2010) to account for overdispersion of Z-scores, and Efron (2007) described methods to estimate a single empirical null distribution with an emphasis on controlling false discovery rates. However, these approaches, which estimate and refer to a single sampling distribution, do not apply immediately when providers vary markedly in size. In Kalbfleisch and Wolfe (2013) and He and others (2013), this shortcoming was partially addressed through stratification. In this article, we define an empirical null for each provider by smoothing the reference empirical null distributions. Our proposal provides a unified framework for provider profiling that encompasses various types of outcomes.
We also generalize this approach to situations where a provider should be held accountable for an externally specified proportion of the total variation that is due to the quality of care.

In Section 2, we consider issues associated with profiling in the linear model, where calculations are more easily done and there is immediate comparison with familiar tests. These methods would apply directly to a quality measure that is based on normally distributed variables. In Section 3, we propose the empirical null methods, with straightforward generalizations to other more complex models as well. In Section 4, we illustrate the empirical null methods through simulation studies and analysis of the ESRD dialysis facility data with respect to monitoring standardized mortality ratios. Section 5 includes some discussion and extensions.
Standard profiling methods based on the linear model
Let $i = 1, \cdots, N$ index providers and $j = 1, \cdots, n_i$ index patients within the $i$th provider. We consider an underlying linear regression model
$$Y^*_{ij} = \mu + \alpha_i + \beta^T X_{ij} + \epsilon_{ij}, \quad (1)$$
where $Y^*_{ij}$ represents the continuous outcome of interest with large values corresponding to poor outcomes, $\mu$ is the grand mean, $\alpha_i$ is the provider effect, and $X_{ij}$ and $\beta$ are vectors of patient characteristics and regression coefficients, respectively. Our primary goal is to identify providers whose outcomes are worse than expected.

In many instances, we assume that the $\alpha_i$'s are random and, conditional on $X_{ij}$,
$$\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2), \quad \text{independent of } \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma_w^2). \quad (2)$$
The parameters $\mu$, $\sigma_\alpha^2$, $\sigma_w^2$, and $\beta$ appear in the probability laws for all observations, and thus are estimated using all patient-level data. In what follows, we assume that the total numbers of providers, $N$, and patients, $\sum_i n_i$, are large (e.g., more than 500,000 patient records and over 6,000 dialysis facilities in our application), so that $\mu$, $\sigma_\alpha^2$, $\sigma_w^2$ and $\beta$ can be precisely estimated. Thus, we proceed without considering the variation associated with their estimates and replace the $Y^*_{ij}$'s with the risk-adjusted responses, $Y_{ij} = Y^*_{ij} - \hat\beta^T X_{ij}$, where $\hat\beta$, based on fixed effects for the $\alpha_i$'s, is consistent for $\beta$. Ignoring variability in the estimation of these structural parameters is justified in large healthcare databases. In much smaller applications, one should take into account the associated variability by considering exact sampling distributions, adjustment of empirical Bayes estimates, or fully Bayesian methods (Jones and Spiegelhalter, 2011).

In a fixed-effects (FE) analysis, we consider the provider effect $\alpha_i$ in (1) to be a constant with a constraint, usually $\sum_{i=1}^N n_i \alpha_i = 0$, for identifiability. The maximum likelihood estimator of $\alpha_i$ is $\hat\alpha^{FE}_i = \bar Y_i - \bar{\bar Y}$, where $\bar Y_i = \sum_{j=1}^{n_i} Y_{ij}/n_i$ and $\bar{\bar Y} = \sum_{i=1}^N (n_i \bar Y_i)/\sum_{i=1}^N n_i$.
As noted above, $\bar{\bar Y}$ is an accurate estimate of $\mu$, as is $\hat\sigma_w^2$ of $\sigma_w^2$. The FE Z-score for testing the sharp null of $\alpha_i = 0$ is
$$Z_{FE,i} = \frac{\hat\alpha^{FE}_i}{\hat\sigma_w/\sqrt{n_i}} \approx \frac{\sqrt{n_i}(\bar Y_i - \mu)}{\sigma_w}, \quad (3)$$
which has a $N(0,1)$ distribution under the null $\alpha_i = 0$. This can be used as the reference distribution to flag unusual providers. Then, one approach might be to flag the $i$th provider if $Z_{FE,i} > z_\rho$, where $z_\rho$ is the upper $\rho$th quantile of the standard normal distribution and $\rho$ is the nominal significance level of the related one-sided test. We use $\rho = 0.05$ throughout this article.

The usual random-effects (RE) analysis of (1) is based on the assumption (2) that $\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2)$, $i = 1, \cdots, N$, and that the provider effects $\alpha_i$ are independent of the patient characteristics $X_{ij}$, $j = 1, \cdots, n_i$. The latter is an important assumption in RE analysis that is rarely met or noted. Confounding between $\alpha_i$ and $X_{ij}$ can lead to substantially biased estimates of $\beta$ and can alter the estimates of the $\alpha_i$ (Kalbfleisch and Wolfe, 2013). One way to address such bias is to utilize the fixed-effects estimate of $\beta$, i.e., $\hat\beta$, to provide an offset. As described earlier, we can obtain the risk-adjusted response, $Y_{ij}$, and then estimate $\sigma_\alpha^2$, $\sigma_w^2$, $\mu$ and the $\alpha_i$'s assuming the model $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$.

An empirical Bayes approach gives the approximate posterior distribution of $\alpha_i$,
$$\alpha_i \mid \{Y_{ij}\} \sim N\!\left(R_i(\bar Y_i - \mu),\; R_i\,\sigma_w^2/n_i\right), \quad (4)$$
where $R_i = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_w^2/n_i)$. This approach yields an estimate $\hat\alpha^{RE}_i = \hat R_i(\bar Y_i - \bar{\bar Y})$ that is shrunk toward zero by $\hat R_i$. These RE estimates are conditionally biased; if the $i$th provider has true effect $\alpha_i$, $E(\hat\alpha^{RE}_i \mid \alpha_i) \approx R_i\alpha_i$. The corresponding RE Z-score for a test of $\alpha_i = 0$ from (4) is
$$Z_{RE,i} = \frac{\hat R_i(\bar Y_i - \bar{\bar Y})}{\sqrt{\hat R_i\,\hat\sigma_w^2/n_i}} = \sqrt{\hat R_i}\, Z_{FE,i} \approx \frac{\sqrt{R_i}(\bar Y_i - \mu)}{\sigma_w/\sqrt{n_i}}, \quad (5)$$
which has a posterior reference distribution $N(0,1)$; one flags the $i$th provider if $Z_{RE,i} > z_\rho$. Note that this reference $N(0,1)$ is based on the posterior distribution (4); the sampling distribution for $Z_{RE,i}$ given $\alpha_i = 0$ is approximately $N(0, R_i)$ rather than $N(0,1)$.

A third approach, which we refer to as FERE, refers the FE estimate to its marginal distribution under (1) and (2):
$$Z_{FERE,i} = \frac{\hat\alpha^{FE}_i}{\sqrt{\hat\sigma_\alpha^2 + \hat\sigma_w^2/n_i}} \approx \frac{\bar Y_i - \mu}{\sqrt{\sigma_\alpha^2 + \sigma_w^2/n_i}}. \quad (6)$$
Analogous to the FE and RE methods, one flags the $i$th provider if $Z_{FERE,i} > z_\rho$. This is not a test of the sharp null hypothesis $\alpha_i = 0$, but rather assesses whether $\bar Y_i$ could reasonably have arisen from the model (1).

FE and RE methods are often used in profiling medical providers and are most appropriate when all of the variation in provider effects is due to the quality of care. Both methods make reference to a sharp null hypothesis, $H_i: \alpha_i = 0$. The RE approach is often thought to account for the variation between providers. It has been promoted largely on the basis of the well-known result that, compared to FE, it improves the overall precision of estimating the provider effects by "borrowing information" from other providers (Jones and Spiegelhalter, 2011; Efron and Morris, 1973; Louis, 1991; Krumholz and others, 2011). Thus, the mean squared error (MSE) of the estimated provider effect is reduced by shrinkage, i.e., $E(\hat\alpha^{RE}_i - \alpha_i)^2 \le E(\hat\alpha^{FE}_i - \alpha_i)^2$. This result is important in some contexts, but not in profiling. Consider, for example, the conditional MSE,
$$\mathrm{MSE}_{met}(\alpha_i) = E\!\left[(\hat\alpha^{met}_i - \alpha_i)^2 \,\middle|\, \alpha_i\right]$$
for $met$ = FE, RE, plotted in Fig. 1, where $\sigma_\alpha^2 = 1$, $\sigma_w^2 = 5$ and $n_i = 100$. The reduction in the overall MSE by RE is achieved as an average of substantial losses for extreme values of $\alpha_i$ and modest gains for the more frequent values of $\alpha_i$ closer to 0. This is an example of an overall average missing important features. For profiling, it is the extreme values that are of primary interest.
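To make the three Z-scores concrete, the following sketch computes $Z_{FE}$, $Z_{RE}$ and $Z_{FERE}$ from provider means under model (1). This is a minimal illustration with our own function and variable names; the structural parameters are treated as known plug-in values, as the text justifies for large databases.

```python
import numpy as np

def profiling_z_scores(ybar, n, mu, sigma2_a, sigma2_w):
    """FE, RE and FERE Z-scores from provider means ybar with sizes n,
    treating the structural parameters (mu, sigma2_a, sigma2_w) as known."""
    alpha_fe = ybar - mu                       # fixed-effects estimate
    z_fe = alpha_fe / np.sqrt(sigma2_w / n)    # eq. (3)
    R = sigma2_a / (sigma2_a + sigma2_w / n)   # shrinkage factor
    z_re = np.sqrt(R) * z_fe                   # eq. (5)
    z_fere = alpha_fe / np.sqrt(sigma2_a + sigma2_w / n)  # eq. (6)
    return z_fe, z_re, z_fere
```

Note that $Z_{RE} = \sqrt{R_i}\,Z_{FE}$, so RE and FE flag nearly the same providers once $n_i$ is large ($R_i \to 1$), whereas $Z_{FERE}$ divides by the total standard deviation and grows much more slowly with $n_i$.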
For this reason among others, when the between-provider variation is entirely due to the quality of care, the FE estimates are preferred for profiling purposes.

Although the FE approach produces unbiased estimates of the provider effects, the FE Z-scores are substantially overdispersed compared to the standard normal reference distribution used in testing $\alpha_i = 0$ (Efron, 2004; Spiegelhalter and others, 2012). The marginal variance of the FE Z-score, assuming randomness in the $\alpha_i$'s, is
$$\mathrm{Var}(Z_{FE,i}) \approx \frac{\sigma_\alpha^2 + \sigma_w^2/n_i}{\sigma_w^2/n_i} = 1 + \frac{n_i\,\sigma_\alpha^2}{\sigma_w^2}, \quad (7)$$
which shows that the variability in FE Z-scores, especially among large providers, is much larger than under the reference $N(0,1)$ distribution. The FE approach does not take into account such unexplained variation and only assesses the sharp null hypothesis of a provider effect. A large sample size can translate into more accurate estimation of the provider effect, and any deviation in $\alpha_i$ from 0 will eventually be detected by FE as $n_i \to \infty$. As a result, the FE approach disproportionately flags large providers even when their provider effects are small and not clinically meaningful. Similarly, the marginal variance of $Z_{RE,i}$ increases linearly with $n_i$, and the same problem persists. With a large provider size $n_i$, FE and RE are nearly identical and neither takes into account the unexplained variation between providers.

Often, much of the variation in the $\alpha_i$'s is outside the providers' control and cannot be attributed to the quality of care. This can arise, for example, when there are unmeasured patient-level covariates, such as socio-economic status, genetic differences and comorbidities, that are predictive of the outcomes but vary substantially between providers. When most of the variation between providers is outside of the providers' control, the FERE approach is preferred, and its advantages have been discussed in the literature before, e.g., in the context of funnel plots, which illustrate how increasing provider size affects variation in the FE estimates (Jones and Spiegelhalter, 2011). Providers that give rise to a large value of $Z_{FERE,i}$ have extreme values with reference to the population of all providers, taking into account all sources of variation.

However, the usual FERE approach is not robust. When there exists a substantial proportion of providers with extreme outcomes, the between-provider variance $\sigma_\alpha^2$ is overestimated, which compromises the ability of the FERE approach to identify extreme providers. Although it is possible to develop robust estimates of $\sigma_\alpha^2$ (see, e.g., Koller, 2016) to improve the FERE method, generalizing them to non-linear models would be difficult.
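A quick simulation, with illustrative parameter values of our own choosing, shows the overdispersion in (7): the empirical variance of the FE Z-scores is close to $1 + n\sigma_\alpha^2/\sigma_w^2$ rather than 1.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5000, 100                   # providers and patients per provider
sigma2_a, sigma2_w = 1.0, 5.0      # between- and within-provider variances

alpha = rng.normal(0.0, np.sqrt(sigma2_a), N)             # random provider effects
ybar = alpha + rng.normal(0.0, np.sqrt(sigma2_w / n), N)  # provider means (mu = 0)
z_fe = ybar / np.sqrt(sigma2_w / n)                       # FE Z-scores

print(np.var(z_fe))  # close to 1 + n * sigma2_a / sigma2_w = 21, far from 1
```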
The empirical null methods

The empirical null was introduced in Efron (2004) and Efron (2010) in the context of multiple testing and controlling false discovery rates. The empirical null (EN) methods in this section are designed to address overdispersion for fair assessments of all providers, in a fashion that is robust to outliers. These methods estimate the total variation in the FE Z-scores, $Z_{FE,i}$, $i = 1, \ldots, N$, and assess extreme values with reference to the empirical null distributions.

For the large majority of providers under consideration, suppose that the model (1) holds, with $\alpha_i \overset{iid}{\sim} N(0, \sigma_\alpha^2)$. For now, we assume all providers have an equal number of patients, i.e., $n_i = n$ for $i = 1, \cdots, N$. The empirical null distribution is defined as the normal distribution with mean, $\hat\mu_M$, and variance, $\hat\sigma_M^2$, which are estimated robustly from the FE Z-scores $Z_{FE,i}$, $i = 1, \ldots, N$, so as to reduce or eliminate the impact of a relatively small number of outlying providers. This distribution is used in place of the reference $N(0,1)$ so that, for example, the $i$th provider is flagged as "worse than expected" if $Z_{FE,i} > \hat\mu_M + z_\rho\,\hat\sigma_M$.

To obtain the robust estimates $\hat\mu_M$ and $\hat\sigma_M^2$, we adapt the "MLE fitting" method (Efron, 2007). This is based on a mixture model in which a proportion $p_M$ of providers have Z-scores arising from the empirical null distribution, $N(\mu_M, \sigma_M^2)$, whereas the remainder, including any outliers, come from another non-null distribution. The target quantities $\mu_M$ and $\sigma_M^2$ are estimated by maximizing the mixture likelihood so that the estimates are robust to outliers. More specifically, we assume that the non-null distribution has support outside an interval $[A, B] \equiv [\mu^{(0)}_M - \zeta\,\sigma^{(0)}_M,\; \mu^{(0)}_M + \zeta\,\sigma^{(0)}_M]$, where $(\mu^{(0)}_M, \sigma^{(0)}_M)$ are reasonable initial estimates and $\zeta > 0$. Let $I = \{i: Z_{FE,i} \in [A,B]\}$, $N_1 = |I|$, $N_0 = N - N_1$,
$$\phi_{\mu,\sigma}(z) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^2\right\},$$
and $Q(\mu,\sigma) = \Phi\!\left(\frac{B-\mu}{\sigma}\right) - \Phi\!\left(\frac{A-\mu}{\sigma}\right)$, where $\Phi(\cdot)$ is the cumulative distribution function of $N(0,1)$. Let $\theta = p_M\,Q(\mu,\sigma)$. Then the likelihood based on the observed Z-scores is
$$L(\mu,\sigma,p) = \theta^{N_1}\,[1-\theta]^{N_0} \prod_{i\in I} \frac{\phi_{\mu,\sigma}(Z_{FE,i})}{Q(\mu,\sigma)}. \quad (8)$$
To obtain the MLE, $(\hat\mu_M, \hat\sigma_M, \hat p_M) = \mathrm{argmax}_{(\mu,\sigma,p)}\, L(\mu,\sigma,p)$, we proceed by computing the profile likelihood of $p_M$, which avoids solutions with $\hat p_M > 1$. Thus we define a grid of $K$ values for $p$ (e.g., [0.5, 1] in increments of 0.001), denoted $\{p^{(1)}, p^{(2)}, \cdots, p^{(K)}\}$. For each $k = 1, 2, \cdots, K$, compute $\ell^{(k)} = \max_{\mu,\sigma} L(\mu,\sigma,p^{(k)})$ and the corresponding maximizer $(\hat\mu^{(k)}, \hat\sigma^{(k)})$. This maximization step is over a 2-dimensional parameter and is very fast using existing optimization methods (e.g., Nelder and Mead, 1965). Finally, find $\tilde k = \mathrm{argmax}_k\, \ell^{(k)}$, and then $(\hat\mu_M, \hat\sigma_M, \hat p_M) = (\hat\mu^{(\tilde k)}, \hat\sigma^{(\tilde k)}, p^{(\tilde k)})$.

An important strength of the MLE fitting approach is that the likelihood (8) is only parametric within the interval $[A, B]$, and is free of specifications of the distribution of the outliers, all of which occur outside the interval. A robust M-estimation technique can be used to obtain the preliminary estimates (Huber, 1964, 1973; Andrews and others, 1972). We use the bi-weight function, which is the default in the SAS ROBUSTREG procedure (Chen, 2002). A stratified EN approach was previously implemented using robust M-estimation in Kalbfleisch and Wolfe (2013) and He and others (2013) for time-invariant profiling, and in Estes and others (2018) for time-dynamic profiling. The MLE fitting outlined above is preferable to robust M-estimation, since the latter can overestimate the variance when many outliers are present. We have also found that moderate values of $\zeta$ such as $\Phi^{-1}(0.9) = 1.28$ or $\Phi^{-1}(0.95) = 1.64$ help avoid including outliers while allowing good estimation of $\mu_M$ and $\sigma_M^2$.

Next we consider a more realistic setting where the sample size $n_i$ varies across providers. Since the total variability of the FE Z-scores depends on $n_i$, we first stratify providers into a few groups based on their sample sizes, as in Kalbfleisch and Wolfe (2013) and He and others (2013). For example, we may stratify the providers into three groups based on the tertiles of the sample sizes $\{n_i\}$. Within group $g$, we obtain the estimates $\hat\mu_{M,g}$ and $\hat\sigma_{M,g}^2$, and an empirical null distribution $N(\hat\mu_{M,g}, \hat\sigma_{M,g}^2)$, $g = 1, 2, 3$. Each health care provider is compared to the empirical null corresponding to its group.

As an illustration, consider the mortality data from the ESRD database introduced in Section 1, on patients treated at 6,363 dialysis facilities in the United States over four calendar years, from 2012 to 2015. More data descriptions can be found in Section 4.3. Facilities are stratified into three groups, small, medium and large facilities, using facility size tertiles. The failure rates are modeled using a Cox model with covariates measuring demographic variables and comorbidities at baseline. The SMR (see Section 1) is obtained for each facility, and each is converted to a one-sided Z-score based on a test of the sharp null hypothesis. These are used in an extension of the EN methods to non-linear models, as discussed in detail in Section 3.3. The estimated means of the one-sided Z-scores for the small, medium and large groups are slightly negative (−0.07 and −0.13 for the medium and large groups), and the corresponding variance estimates all exceed 1 and increase with facility size; see Fig. 2(a)-(c) for Z-score histograms along with the empirical null distributions. It can be seen that the empirical null distributions are more dispersed than the standard normal, and that the variation of Z-scores increases with facility size. We use the upper 5% quantiles of the empirical null distributions as critical values, and facilities with Z-scores larger than those of their corresponding groups are flagged as having poor outcomes. Fig. 2(e)-(f) visualize the stratum-specific critical values of Z-scores in the scatterplot (solid blue segments).

The stratified empirical null approach has some limitations, as evident from Fig. 2(e)-(f). First, the choice of three groups is arbitrary, and a different number of strata will result in changes in the list of flagged providers. A second, related problem is the discontinuity of the critical region at the stratum boundaries. Consequently, two providers near a boundary may have similar sizes and Z-scores, yet one may be flagged and the other not, due only to the arbitrary choice of boundaries. To overcome these issues, we model the mean and variance of the empirical null distributions as smooth functions of provider size, and use robust techniques to lessen the impact of potential outliers.

To estimate the regression of the variance on the provider size, we first obtain variance estimates in each of $G$ groups defined by quantiles of provider sample size. We then regress these variance estimates on the median provider size in each group. Specifically, we proceed as follows. Within the $g$th group, we use the local MLE fitting described in Section 3.1 to estimate the mean $\tilde z_g$ and variance $\tilde\sigma_g^2$ of the Z-scores, and let $\tilde m_g$ represent the median size in this group, $g = 1, \cdots, G$.
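The local MLE fitting applied to a group of Z-scores can be sketched as follows. This is a minimal sketch assuming NumPy/SciPy; the function name is ours, and the preliminary estimates use the median and MAD as a simple stand-in for the bi-weight M-estimation described above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mle_fitting(z, zeta=1.28, p_grid=np.arange(0.5, 1.0001, 0.001)):
    """Robust empirical-null estimates (mu_M, sigma_M, p_M) via the
    profile likelihood of p over a grid, as in eq. (8)."""
    # Preliminary robust location/scale (median/MAD stand in for M-estimation).
    mu0 = np.median(z)
    sigma0 = 1.4826 * np.median(np.abs(z - mu0))
    A, B = mu0 - zeta * sigma0, mu0 + zeta * sigma0

    inside = z[(z >= A) & (z <= B)]
    N1, N0 = inside.size, z.size - inside.size

    def neg_loglik(params, p):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        Q = norm.cdf((B - mu) / sigma) - norm.cdf((A - mu) / sigma)
        theta = p * Q
        if theta <= 0.0 or theta >= 1.0:
            return np.inf
        # log L = N1 log(theta) + N0 log(1 - theta) + sum_{i in I} [log phi - log Q]
        ll = N1 * np.log(theta) + N0 * np.log(1.0 - theta)
        ll += np.sum(norm.logpdf(inside, mu, sigma)) - N1 * np.log(Q)
        return -ll

    best = (-np.inf, mu0, sigma0, 1.0)
    for p in p_grid:  # profile likelihood over the grid of p values
        res = minimize(neg_loglik, x0=[mu0, np.log(sigma0)],
                       args=(p,), method="Nelder-Mead")
        if -res.fun > best[0]:
            best = (-res.fun, res.x[0], np.exp(res.x[1]), p)
    _, mu_M, sigma_M, p_M = best
    return mu_M, sigma_M, p_M
```

Because the likelihood only involves the counts $N_0$, $N_1$ and the Z-scores inside $[A, B]$, outliers influence the fit only through $N_0$, which is what makes the estimates robust.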
Focusing first on the variance estimates, we fit a regression model with the variance estimates $\{\tilde\sigma_g^2\}$ as the dependent and the median sizes $\{\tilde m_g\}$ as the independent variable. Based on our empirical studies and theoretical derivations, a linear regression of the form $\gamma_0 + \gamma_1 \tilde m_g$ will typically suffice. An iteratively re-weighted algorithm is used to estimate $(\gamma_0, \gamma_1)$. We set the initial estimates to the ordinary least squares estimates $(\hat\gamma_0^{(0)}, \hat\gamma_1^{(0)})$, then update with
$$(\hat\gamma_0^{(t+1)}, \hat\gamma_1^{(t+1)}) = \mathrm{argmin}_{(\gamma_0,\gamma_1)} \sum_{g=1}^G \omega_g^{(t)}\,(\tilde\sigma_g^2 - \gamma_0 - \gamma_1 \tilde m_g)^2$$
until convergence, where the weights are $\omega_g^{(t)} = N_g\,(\hat\gamma_0^{(t)} + \hat\gamma_1^{(t)} \tilde m_g)^{-2}$ and $N_g$ is the number of providers in the $g$th group. The final fitted variance value for each provider is denoted $\hat\sigma_i^2$, $i = 1, \cdots, N$. We include the weights in the regression to reduce the leverage of the variance estimates with large variability from large providers.

An important issue is the choice of the total number of groups $G$. A small $G$ will not provide sufficient information for estimating the regression coefficients precisely in a linear model, whereas a large $G$ results in a very small number of providers in each group and consequently unstable within-group variance estimates. From our experience with the SMR example, we find it satisfactory to choose $G$ such that there are 50 to 300 providers in each group. The estimated mean and variance functions under different choices of $G$ are fairly stable, as reported in the supplementary material, Section S2.

To estimate the mean Z-score as a function of provider size, we use a weighted smoothing technique such as a smoothing spline, B-spline or LOESS on the data $\{(\tilde z_g, \tilde m_g)\}_{g=1}^G$. The weight associated with the mean estimate $\tilde z_g$ in the $g$th group is inversely proportional to the variance estimate $\hat\sigma_g^2$ from the iteratively re-weighted algorithm above. One can easily find existing implementations of these methods, for example, smooth.spline() for smoothing splines in R.
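The iteratively re-weighted variance regression described above can be sketched as follows (a minimal sketch with our own function name; NumPy assumed).

```python
import numpy as np

def fit_variance_line(m, s2, n_g, tol=1e-8, max_iter=100):
    """Weighted linear fit of group variance estimates s2 on median sizes m.

    Weights omega_g = N_g / (gamma0 + gamma1 * m_g)^2 down-weight groups
    whose variance estimates are themselves highly variable."""
    X = np.column_stack([np.ones_like(m), m])
    # Initial ordinary least squares fit.
    gamma, *_ = np.linalg.lstsq(X, s2, rcond=None)
    for _ in range(max_iter):
        fitted = X @ gamma
        w = n_g / fitted**2                       # omega_g at current estimates
        Xw = X * w[:, None]
        # Weighted least squares via the normal equations (X'WX) gamma = X'W s2.
        gamma_new = np.linalg.solve(X.T @ Xw, Xw.T @ s2)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            gamma = gamma_new
            break
        gamma = gamma_new
    return gamma  # (gamma0, gamma1)
```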
For provider sizes that fall outside the range of $\{\tilde m_g\}_{g=1}^G$, one may extrapolate either with the estimated linear function or with a plateau, so that the mean function remains continuous and flat. The fitted Z-scores from the smoothing are denoted $\hat Z_i$, $i = 1, \cdots, N$. We propose this group-based smoothing instead of direct smoothing based on the original Z-scores because the MLE fitting within each group is more robust against potential outliers.

Examination of the $i$th provider may then proceed based on the individual empirical null distribution $N(\hat Z_i, \hat\sigma_i^2)$. Similar to the other methods, the $i$th provider is flagged as worse than expected if $Z_{FE,i} > \hat Z_i + z_\rho\,\hat\sigma_i$, for a one-sided test with nominal level $\rho$. Smoothing provides a better approximation to the total variance of the Z-scores and avoids the unfairness associated with simple stratification. When the normal linear model (1) holds for all providers, this approach gives almost identical results to the FERE approach. Fig. 2(e)-(f) show the critical line for flagging dialysis facilities with poor outcomes in the SMR data using the smoothed empirical null ($z_\rho = 1.64$ for the black dotted lines).
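Once the smoothed mean and variance functions are in hand, the flagging rule reduces to a one-line comparison (a hypothetical helper of our own; 1.645 is the upper 5% standard normal quantile).

```python
import numpy as np

def flag_providers(z_fe, z_hat, sigma2_hat, z_rho=1.645):
    """Flag provider i as 'worse than expected' if
    Z_FE,i > Zhat_i + z_rho * sigma_hat_i (one-sided test)."""
    return z_fe > z_hat + z_rho * np.sqrt(sigma2_hat)
```

For example, a provider with $Z_{FE} = 3$ against a null of $N(0, 1)$ is flagged, while the same Z-score of 1 against a more dispersed null $N(0, 4)$ is not.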
Extension to non-linear models

Many types of outcomes are monitored for quality of care, including patient hospitalizations, readmissions, transfusions and death events in the ESRD dialysis facility data. A hospital readmission measure within thirty days following a hospital discharge is based on a logistic model for binary outcomes, and 30-day mortality rates are analyzed in a similar manner. Hospitalizations are analyzed using a model for recurrent events. The empirical null approach can be readily extended to non-linear models and provides a unifying framework for profiling providers.

We continue with the SMR example, which assesses patient death events within a health care provider. For the $i$th provider, we denote the observed number of deaths by $O_i$. A two-stage modeling procedure is used to obtain the expected number of deaths under the assumption that patients of this provider have death events at the national average rate (UM-KECC, 2018). The following is the approach currently adopted by CMS in Dialysis Facility Compare.

In the first stage model, the hazard function of the $j$th patient in the $i$th provider is assumed to be $\lambda_{ij}(t) = \lambda_i(t)\exp\{X_{ij}^T\beta\}$, where $\lambda_i$ is a provider-specific baseline hazard, and $\beta$ represents the regression coefficients associated with the observed patient-level characteristics $X_{ij}$, such as age, gender, race, BMI and a selected set of comorbidities. This stratified Cox model is fitted to the national data in order to estimate $\beta$, with the estimate denoted $\hat\beta$. Stratification by providers is important to accurately estimate the within-provider effects of the covariates, $\beta$.

In the second stage, the "population-average" cumulative baseline hazard $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$ is estimated through an unstratified Cox model with an offset, $X_{ij}^T\hat\beta$, obtained from the first stage.
Conditional on the patient characteristics $X_{ij}$ and the at-risk process $Y_{ij}(t)$, the expected number of events for the $j$th patient in the $i$th provider is calculated as $E_{ij} = \int_0^\tau Y_{ij}(t)\exp\{\hat\beta^T X_{ij}\}\,d\hat\Lambda_0(t)$, where $\tau$ is the maximum follow-up time. For the $i$th provider, the expected number of events is $E_i = \sum_{j=1}^{n_i} E_{ij}$, and the corresponding SMR is estimated by $\widehat{SMR}_i = O_i/E_i$. If $\widehat{SMR}_i > 1$ ($< 1$), the $i$th provider experiences more (fewer) deaths than expected under the national norm given the observed characteristics of its patients. Note that $E_{ij}$ is a sum of conditional expectations and is in fact the compensator in the martingale corresponding to the counting process for the individual.

A test of the sharp null hypothesis $H_i: SMR_i = 1$, where $SMR_i$ is the underlying SMR for the $i$th provider, can be obtained using a Poisson approximation whereby the number of events $O_i$ is assumed to follow a Poisson distribution with mean $E_i$. In this case, the one-sided mid p-value is $p_i = P(X = O_i)/2 + P(X > O_i)$, where $X \sim \mathrm{Poisson}(E_i)$. These can be converted to Z-scores using $Z_{FE,i} = \Phi^{-1}(1 - p_i)$. By this convention, large values of $Z_{FE,i}$ are associated with poor outcomes. Mid p-values are used to avoid difficulties in converting p-values to Z-scores. The Poisson-based p-values in this example are preferable to a normal approximation with Z-scores $(O_i - E_i)/\sqrt{E_i}$, since they are more accurate when the provider size is small.

These FE Z-scores are consequently used to construct empirical null distributions for profiling, as introduced in Sections 3.1 and 3.2. A similar approach applies directly to other standardized measures based on other regression models, such as hierarchical logistic regression for hospital readmission. Instead of converting p-values, FE Z-scores could also be based on Wald statistics from fixed-effects estimates of the provider effects and a test of the sharp null hypothesis.
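The Poisson mid p-value conversion described above can be sketched as follows (a minimal sketch assuming SciPy; the function name is ours).

```python
import numpy as np
from scipy.stats import norm, poisson

def smr_z_score(O, E):
    """One-sided mid p-value for H: SMR = 1 under O ~ Poisson(E),
    converted to a Z-score; large Z indicates poor outcomes."""
    # mid p = P(X = O)/2 + P(X > O); sf gives the strict upper tail
    p_mid = 0.5 * poisson.pmf(O, E) + poisson.sf(O, E)
    return norm.ppf(1.0 - p_mid)
```

The mid p-value averages over the point mass at the observed count, so the resulting Z-scores are approximately standard normal under the null even for small $E_i$, where the $(O_i - E_i)/\sqrt{E_i}$ approximation is poor.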
FE and RE methods analogous to those in the linear model are also sometimes used. RE methods tend to be complicated and subject to the same concerns as described in Section 2.2 for the linear model; they behave similarly to FE methods in large providers and result in unreasonably high flagging rates for large providers.

Similar to FERE, the empirical null approach presented above takes account of the total variation in the Z-scores. This is appropriate when most or all of the between-provider variation is due to incomplete risk adjustment, as opposed to the quality of care. Kalbfleisch and others (2018) suggest introducing a value, $\lambda$, that represents the proportion of the between-provider variance that is due to incomplete risk adjustment, and holding providers accountable for a proportion $1 - \lambda$ of the between-provider variation. Note that $\lambda$ cannot be estimated from the data and must be specified, based perhaps on expert opinion.

In the linear model (1), we can write $\alpha_i = \alpha_{1i} + \alpha_{2i}$, where $\alpha_{1i} \sim N(0, (1-\lambda)\sigma^2_\alpha)$ represents the portion of the effect due to the quality of care, whereas the independent effect $\alpha_{2i} \sim N(0, \lambda\sigma^2_\alpha)$ is variation that is outside the provider's control. In this case, it is natural to base profiling on an assessment of the hypothesis $H_\lambda: \alpha_{1i} = 0$. Under this hypothesis, the null distribution for $Z_{FE,i}$ is $N(0, (\lambda\sigma^2_\alpha + \sigma^2_w/n_i)/(\sigma^2_w/n_i))$. This is a natural generalization that connects the FE and FERE approaches discussed in Section 2, which correspond respectively to $\lambda = 0$ and $\lambda = 1$. Furthermore, for a general $0 \le \lambda \le 1$, the null distribution for $Z_{FE,i}$ can also be written as that of $\sqrt{1-\lambda}\, N(0,1) + \sqrt{\lambda}\, N(0, 1/(1-r_i))$, a sum of independent normal variables. Here, $r_i = \sigma^2_\alpha/(\sigma^2_\alpha + \sigma^2_w/n_i)$, the shrinkage factor defined earlier, is also referred to as the inter-unit reliability (IUR).

Analogous to the relaxation in the linear model above, we can extend the idea of decomposing the between-provider variance to the empirical null in non-linear models. Suppose we have obtained the empirical null distribution for a provider, $N(\hat{Z}_i, \hat\sigma^2_i)$. The IUR, in general, represents the proportion of the total variance taken up by the between-provider variance, and can be computed in non-linear models (He and others, 2019). Then, for the $i$th provider with an IUR equal to $r_i$, the variance to be allowed is
$$\hat\sigma^2_{\lambda,i} = \hat\sigma^2_i - r_i \hat\sigma^2_i (1-\lambda) = [1 - r_i(1-\lambda)]\hat\sigma^2_i. \qquad (9)$$
Additionally, if the FE Z-scores are computed as Wald statistics or asymptotically equivalent statistics by assuming fixed provider effects, the variance in the reference distribution $\hat\sigma^2_i$ can be approximated by $1/(1-r_i)$ when all between-provider variation is due to incomplete risk adjustment ($\lambda = 1$), and $\hat\sigma^2_{\lambda,i}$ in (9) can also be written as $\hat\sigma^2_{\lambda,i} = 1 - \lambda + \lambda\hat\sigma^2_i$. The new reference distribution allowing a proportion $\lambda$ of the between-provider variance is $N(\hat{Z}_i, \hat\sigma^2_{\lambda,i})$. Note that the incorporation of $\lambda$ simply changes the critical value of the test based on the empirical null.

More generally, we might elicit a prior distribution $f_\lambda(\lambda)$ for $\lambda$ that reflects experts' uncertainty about its value. In this case, we might gauge a provider's performance by comparing its FE Z-score to the marginal empirical null distribution. Since the distribution of the data does not depend on $\lambda$, the posterior distribution is the same as the prior. The marginal empirical null distribution has density $\int f_{EN}(z \mid \lambda) f_\lambda(\lambda)\, d\lambda$, where $f_{EN}(z \mid \lambda)$ is the density of $N(\hat{Z}_i, \hat\sigma^2_{\lambda,i})$.
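The variance adjustment in (9) and the evaluation of a tail probability under the marginal empirical null are both short computations. Below is a minimal Python sketch; the names are ours, and the Uniform(0, 1) prior for $\lambda$ is purely an illustrative stand-in for an elicited $f_\lambda$.

```python
import random
from statistics import NormalDist

def en_variance(sigma2_i: float, r_i: float, lam: float) -> float:
    """Equation (9): reference variance when a proportion lam of the
    between-provider variance is attributed to incomplete risk
    adjustment (lam = 1 allows all of it, lam = 0 none of it)."""
    return (1.0 - r_i * (1.0 - lam)) * sigma2_i

def marginal_null_tail(z: float, z_hat: float, sigma2_i: float,
                       r_i: float, draws: int = 50_000,
                       seed: int = 1) -> float:
    """P(Z > z) under the marginal empirical null, integrating over a
    prior on lambda by Monte Carlo; each draw contributes the exact
    conditional normal tail rather than a sampled Z value."""
    rng, nd = random.Random(seed), NormalDist()
    tail = 0.0
    for _ in range(draws):
        lam = rng.random()                        # lambda ~ Uniform(0, 1)
        sd = en_variance(sigma2_i, r_i, lam) ** 0.5
        tail += 1.0 - nd.cdf((z - z_hat) / sd)
    return tail / draws
```

A useful sanity check: when $r_i = 0$ the provider's Z-score carries no between-provider information, (9) reduces to $\hat\sigma^2_i$ for every $\lambda$, and the marginal null collapses to $N(\hat{Z}_i, \hat\sigma^2_i)$ regardless of the prior.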
This marginal distribution can be approximated with Monte Carlo methods that draw random samples from $f_\lambda(\lambda)$ and then from $f_{EN}(z \mid \lambda)$. Given $(\hat{Z}_i, \hat\sigma^2_i)$, the mean of the marginal distribution is simply $\hat{Z}_i$ and the variance is obtained by substituting $\lambda$ with its prior mean in (9). Depending on the prior, the marginal can have heavier tails than a normal distribution. Detailed discussion is presented in the Supplementary Material Section S3.

We first restrict all providers to have the same sample size, and compare the probability that providers give rise to a signal under different approaches. We assume the true model (1), with $\mu = 0$, $\beta = 0$, $\epsilon_{ij} \sim N(0, \sigma^2_w)$ and $\alpha_i \sim N(0, \sigma^2_\alpha)$ for $i \ge 2$, with $\alpha_1$ fixed at a value varying from 0 to 3.5. We simulate $N = 200$ providers of size $n_i = n$ for all $i$, where $n = 10$, 25, 50 or 100. Providers with Z-scores exceeding $\Phi^{-1}(0.95) = 1.64$ are flagged as worse than expected.

Fig. 3 shows the estimated probabilities of signaling provider 1 for the FE, RE, FERE and EN approaches. The FE approach flags provider 1 with the highest probabilities in all cases. As expected, the difference between FE and RE diminishes for a large sample size $n$. Without outlying providers, FERE and EN result in almost identical probabilities of signaling, reflecting their asymptotic equivalence in this setup. For a large sample size, e.g. $n = 100$, standard FE and RE methods signal provider 1 with moderate to high probability even for relatively small values of $\alpha_1$ that are well within the range of variation expected for $\alpha_i$ under the true model. On the other hand, FERE and EN allow for this variation and do not signal with high probability until $\alpha_1$ is in the tail of the distribution of provider effects ($\sigma_\alpha = 1$). It should be noted that the exact probability of flagging can be easily calculated and plotted for all methods except EN. We present the empirical probability in all cases to facilitate fair comparison.

To illustrate the robustness of EN compared to FERE, we also simulate $N = 3{,}000$ providers with 5% outliers. Still, we assume model (1) for the majority of providers, with $\mu = 0$, $\beta = 0$, $\sigma_w = 4$ and $\sigma_\alpha = 1$, except that $\alpha_1$ is fixed at a value varying from 0 to $3.5\sigma_\alpha$. Half of the outliers have provider effect $\alpha_i = 4\sigma_\alpha$ and the other half $\alpha_i = -4\sigma_\alpha$, well outside the center of the true distribution for the majority of provider effects. The sample size $n_i$ is simulated from a discrete uniform distribution for all providers except provider 1, whose sample size $n_1$ is set to 25, 50, 100 or 125. Fig. 4 shows that EN performs close to the case in which $\sigma_\alpha$ and $\sigma_w$ are known (black dashed line), and is more robust than FERE, which has lower flagging proportions due to over-estimation of the between-provider variance, especially when $n_1$ is large.

We consider a realistic situation where providers are of different sizes and simulate survival outcomes that mimic the SMR example in Section 3.3. A similar study where providers are of the same size is presented in the Supplementary Materials Section S1.

We generate $N = 2{,}000$ providers whose sample sizes are simulated from a discrete uniform distribution on integers with minimum 10. For the $j$th subject in the $i$th provider, the survival time $T_{ij}$ follows an exponential distribution with hazard $\lambda_{ij} = \lambda_0 \exp\{\alpha_i + X_{ij1}\beta_1 + X_{ij2}\beta_2\}$, where $\lambda_0$ is a constant baseline hazard, $\beta_1 = 1$, $\beta_2$ is negative, the provider effects are $\alpha_i \stackrel{iid}{\sim} N(0, \sigma^2_\alpha)$, and the covariates $X_{ij1}$ and $X_{ij2}$ are independent $N(0, 1)$ variables. The censoring time $C_{ij}$ follows an independent uniform distribution. For the smoothed empirical null, providers are stratified by size into $G$ groups, with $G$ taken between 5 and 60. Linear regression models are fitted to the group-wise variance estimates of the Z-scores, and weighted smoothing splines to the group-wise mean estimates. We repeat the simulation 500 times.

Fig. 5(a)-(b) show the estimated mean and variance functions of the Z-scores in one replication when the number of groups is $G = 20$. Results with different $G$ show no sensitivity to the selection of $G$ in the range 5 to 60 (Section S2 in the Supplementary Materials). Our proposed method captures the main features of the mean and variance functions while being smooth enough to provide consistent flagging rules within the considered range of provider size.

If one assumes that a proportion $\lambda$ of the between-provider variance is due to incomplete risk adjustment, then the methods of Section 3.4 can be used. Following the same simulation setup with survival outcomes above, we consider $\lambda = 0$, 0.5, 0.75 and 1, and modify the variance $\hat\sigma^2_{\lambda,i}$ in the reference distribution accordingly. Fig. 5(c) shows box plots of the proportion of times that providers are flagged in 500 replications, stratified into three groups by provider size. The FE approach, corresponding to $\lambda = 0$, flags a provider if its one-sided p-value is less than 0.05, which results in over 25% of the large providers and about 15% of the small providers being signaled. In contrast, the EN approach that allows all variation ($\lambda = 1$) has very stable flagging rates around 5% for all three groups. The EN approach with the relaxation factor $\lambda$ can be viewed as a hybrid of the FE analysis and the EN allowing the total variance, hence its flagging rates lie between the latter two. When $\lambda = 0.5$ or 0.75, the flagging rates increase somewhat with provider size, a feature inherited from FE.
The Standardized Mortality Ratio (SMR) is used as a measure of mortality to profile dialysis facilities at CMS. More details on the SMR can be found in Sections 1 and 3.3. Data were collected from 2012 to 2015, involving over half a million dialysis patients. In this analysis, we include 6,363 dialysis facilities with an expected number of deaths of 3.0 or more. The number of observed deaths ranges from 0 to 581, the number of expected deaths from 3.0 to 308.6, and facility sizes range from 6.9 to 1,569.8 patient-years. Facility size tertiles, defined by cut points at 156.9 and 302.8 patient-years, create three groups of small, medium and large facilities.

Fig. 2 shows the histograms of FE Z-scores by stratum, the distribution of facility size and the different flagging threshold lines for the stratified and the smoothed EN methods, and has been discussed in Section 3. Switching from the stratified EN to the smoothed version changes the flagging labels for some facilities. Nine facilities are flagged as "worse than expected" by the stratified EN but not the smoothed EN, and 18 facilities are flagged by the smoothed EN but not the stratified EN. A total of 367 facilities are flagged by both EN methods, and 5,968 facilities by neither. The FE approach results in 768 facilities being flagged (12.1% of the total number), with 231 (10.9%), 241 (11.4%) and 296 (14.0%) in the small, medium and large groups, respectively (percentages are with respect to the number of facilities in each group). The FE approach flags more large facilities than small ones, and the flagging rates exceed the target 5% by large margins, due at least in part to overdispersion. Using the smoothed EN, the flagging rates in the three groups are brought down to a more equitable level: 141 (6.6%), 109 (5.1%) and 123 (5.8%). Due to the possible existence of outliers, the empirical null based flagging rates are slightly larger than the target level.
Besides the comments in Section 3.4, one may also choose a $\lambda$ value that would result in a certain proportion of providers being flagged, especially when there are constraints on the resources made available for the review process or quality improvement program. This could also be accomplished by changing the nominal flagging rate $\rho$.

Generalization of the RE and FERE methods to non-linear models, such as the logistic model or the Cox proportional hazards model, is complicated. For example, methods developed for assessing hospital readmission rates were based on an RE analysis of a hierarchical logistic model (Horwitz and others, 2011), and entailed complicated bootstrap techniques to assess significance. The EN method, however, generalizes immediately as described in Section 3.3, and is applied to a more complicated logistic model for hospital readmissions of dialysis patients in He and others (2013).

In the EN approach, we have assumed that there are a large number of providers to be profiled and that the central part of the histogram of Z-scores is well described by a normal distribution. These two assumptions are satisfied in many applications. When the number of facilities is much smaller, it is important to take into account uncertainty in the estimation of the empirical null distribution. Also, in some instances, it may be useful to use a transformation of the basic measure in order to achieve approximate normality of the empirical null. In other cases, there may be situations where a non-normal distribution that incorporates skewness, for example, is more appropriate. These are areas where additional work is needed.

As noted earlier, an FE analysis to estimate $\beta$ and then use of an offset is one approach to correct for the confounding between covariates and provider effects in the RE method.
An alternative approach is to include both a within and a between regression coefficient in the model
$$Y_{ij} = \mu + \alpha_i + \beta_w^T(X_{ij} - \bar{X}_i) + \beta_b^T \bar{X}_i + \epsilon_{ij},$$
as described in Neuhaus and Kalbfleisch (1998). A feature which sometimes arises is that the between-provider regression coefficient $\beta_b \ne \beta_w$, in which case the within coefficient does not make a full adjustment for the covariates under consideration. This would arise, for example, if, even having adjusted for a variable like race, one found that there was still a difference between providers according to the racial mix that they treat. Such situations require careful consideration, and a discussion of the issues associated with this can be found in Kalbfleisch and others (2018).

Bayesian methods are computationally attractive and have often been used in profiling. These impose distributional assumptions on the random provider effects, often with hyperparameters in full Bayesian methods (Normand and others, 1997; Racz and Sedransk, 2010; Normand and Shahian, 2007). However, some confusion remains owing to non-unified criteria for identifying outlying providers based on posterior distributions. For instance, Racz and Sedransk (2010) assessed whether a pre-determined norm lies between posterior percentiles, e.g. 2.5% and 97.5%, of a provider effect in a hierarchical logistic regression model. Normand and others (1997) considered the posterior probability of the excess expected mortality (the difference between the expected mortality under a provider's own regression coefficients and that averaged over the provider-specific parameters) being larger than a benchmark, and the posterior probability of the adjusted mortality of a reference patient being greater than that for similar patients in all providers in the same sample. Often, the extremeness of the observed mortality has also been assessed based on the posterior predictive distribution through replications (Normand and Shahian, 2007).
Based on a hierarchical logistic regression model, Bayesian intervals can be constructed for standardized readmission measures via bootstrapping from a normal approximation to the posterior distribution of provider effects, in an approach similar to the RE methods discussed in this paper. As with RE methods, it is worthwhile, with a Bayesian approach, to take measures to ensure fair assessments of providers of all sizes.

A causal inference framework provides a promising but challenging approach to profiling health care providers. In general, the existence of unmeasured confounders poses difficulties in the inference on providers' performance. Spertus and others (2016) implemented augmented inverse probability weighting (Robins and others, 1994) and targeted maximum likelihood estimation (van der Laan and Rubin, 2006) under a causal inference framework for profiling, coupled with the elastic net for variable selection. Spertus and others (2016) also discussed using instrumental variables if one has strong reasons to assume a particular underlying causal mechanism. As is often the case for causal inference, caution is needed in connection with many commonly made but non-verifiable assumptions.

Software
The R code for implementing the empirical null with simulation examples in this paper has been made available at https://github.com/luxia-bios/Empirical-Null .

Supplementary Material
Supplementary material will be available online at http://biostatistics.oxfordjournals.org.
This work was supported in part by The Centers for Medicare and Medicaid Services [contract number HHSM-500-2008-000211], although the opinions expressed in this article do not necessarily reflect those of the CMS or the U.S. government.
Conflict of Interest:
None.
References
Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H. and Tukey, J. W. (1972). Robust Estimates of Location: Survey and Advances. Princeton, NJ: Princeton University Press.

Chen, C. (2002). Robust regression and outlier detection with the ROBUSTREG procedure. Paper 265-27. In: Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference. Cary, NC: SAS Institute Inc.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association (465), 96–104.

Efron, B. (2007). Size, power and false discovery rates. The Annals of Statistics (4), 1351–1377.

Efron, B. (2010). Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge, UK: Cambridge University Press.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors – an empirical Bayes approach. Journal of the American Statistical Association (341), 117–130.

Estes, J. P., Nguyen, D. V., Chen, Y., Dalrymple, L. S., Rhee, C. M., Kalantar-Zadeh, K. and Şentürk, D. (2018). Time-dynamic profiling with application to hospital readmission among patients on dialysis. Biometrics (4), 1383–1394.

He, K., Kalbfleisch, J. D., Li, Y. and Li, Y. (2013). Evaluating hospital readmission rates in dialysis facilities; adjusting for hospital effects. Lifetime Data Analysis (4), 490–512.

He, K., Kalbfleisch, J. D., Yang, Y. and Fei, Z. (2019). Inter-unit reliability for nonlinear models. Statistics in Medicine (5), 844–854.

Horwitz, L., Partovian, C., Lin, Z., Herrin, J., Grady, J., Conover, M., Montague, J., Dillaway, C., Bartczak, K., Ross, J., Bernheim, S., Drye, E. and others (2011).

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics (1), 73–101.

Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics (5), 799–821.

Jarman, B., Pieter, D., van der Veen, A. A., Kool, R. B., Aylin, P., Bottle, A., Westert, G. P. and Jones, S. (2010). The hospital standardised mortality ratio: a powerful tool for Dutch hospitals to assess their quality of care? BMJ Quality & Safety (1), 9–13.

Jones, H. E. and Spiegelhalter, D. J. (2011). The identification of "unusual" health-care providers from a hierarchical model. The American Statistician (3), 154–163.

Kalbfleisch, J. D., He, K., Xia, L. and Li, Y. (2018). Does the inter-unit reliability (IUR) measure reliability? Health Services and Outcomes Research Methodology (3), 215–225.

Kalbfleisch, J. D. and Wolfe, R. A. (2013). On monitoring outcomes of medical providers. Statistics in Biosciences (2), 286–302.

Koller, M. (2016). robustlmm: an R package for robust estimation of linear mixed-effects models. Journal of Statistical Software (6), 1–24.

Krumholz, H. M., Lin, Z., Drye, E. E., Desai, M. M., Han, L. F., Rapp, M. T., Mattera, J. A. and Normand, S-L. T. (2011). An administrative claims measure suitable for profiling hospital performance based on 30-day all-cause readmission rates among patients with acute myocardial infarction. Circulation: Cardiovascular Quality and Outcomes (2), 243–252.

Louis, T. A. (1991). Assessing, accommodating, and interpreting the influences of heterogeneity. Environmental Health Perspectives, 215–222.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The Computer Journal (4), 308–313.

Neuhaus, J. M. and Kalbfleisch, J. D. (1998). Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics (2), 638–645.

Normand, S. T., Glickman, M. E. and Gatsonis, C. A. (1997). Statistical methods for profiling providers of medical care: issues and applications. Journal of the American Statistical Association (439), 803–814.

Normand, S. T. and Shahian, D. M. (2007). Statistical and clinical aspects of hospital outcomes profiling. Statistical Science (2), 206–226.

Ohlssen, D. I., Sharples, L. D. and Spiegelhalter, D. J. (2007). A hierarchical modelling framework for identifying unusual performance in health care providers. Journal of the Royal Statistical Society: Series A (Statistics in Society) (4), 865–890.

Racz, M. J. and Sedransk, J. (2010). Bayesian and frequentist methods for provider profiling using risk-adjusted assessments of medical outcomes. Journal of the American Statistical Association (489), 48–58.

Robins, J. M., Rotnitzky, A. and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association (427), 846–866.

Spertus, J. V., Normand, S-L., Wolf, R., Cioffi, M., Lovett, A. and Rose, S. (2016). Assessing hospital performance after percutaneous coronary intervention using big data. Circulation: Cardiovascular Quality and Outcomes (6), 659–669.

Spiegelhalter, D. J., Sherlaw-Johnson, C., Bardsley, M., Blunt, I., Wood, C. and Grigg, O. (2012). Statistical methods for healthcare regulation: rating, screening and surveillance. Journal of the Royal Statistical Society: Series A (Statistics in Society) (1), 1–47.

UM-KECC. (2018, July). Technical notes on the standardized mortality ratio (SMR) for the dialysis facility reports. Technical Report. https://dialysisdata.org/sites/default/files/content/Methodology/SMRDocumentation.pdf.

van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics (1). doi:10.2202/1557-4679.1043.

Wish, J. B. (2014). The role of 30-day readmission as a measure of quality. Clinical Journal of the American Society of Nephrology (3), 440–442.
Figure 1: Conditional and marginal MSEs of the FE and RE estimates of $\alpha_i$ in the linear model (1). Conditional MSEs are calculated conditional on the true value, $\alpha_i$. Here $\sigma_\alpha = 1$, $\sigma_w = 5$ and $n_i = 100$. The marginal MSE of the FE estimate coincides with its conditional MSE.

Figure 2: Histograms of FE Z-scores in the dialysis mortality data (SMR), stratified by facility size tertiles into (a) small, (b) medium, and (c) large facility groups. The smooth curves represent the standard normal and the empirical null distributions (approximately $N(-0.02, 1.23^2)$, $N(-0.07, 1.4^2)$ and $N(-0.13, 1.61^2)$ for the three groups). (d) Histogram of facility size in patient-years. (e) Scatter plot of FE Z-scores versus facility size along with flagging thresholds based on the stratified EN (solid blue lines) and the smoothed EN (dotted black lines). The two vertical grey lines separate facilities into three groups by the tertiles of facility size. The red square in (e) is magnified in (f).

Figure 3: Estimated probability of signaling provider 1 as worse than expected under the linear model (1), with all providers having sample size (a) $n = 10$, (b) $n = 25$, (c) $n = 50$, and (d) $n = 100$. The x-axis represents the fixed value of $\alpha_1$.

Figure 4: Estimated probability of signaling provider 1 as worse than expected under the linear model (1), with 5% outliers, $\mu = 0$, $\beta = 0$, $\sigma_w = 4$ and $\sigma_\alpha = 1$. Provider sample sizes are drawn from a discrete uniform distribution except for provider 1, with (a) $n_1 = 25$, (b) $n_1 = 50$, (c) $n_1 = 100$ and (d) $n_1 = 125$. Outlier effects have equal probability of taking values of $\pm 4$. The x-axis represents the fixed value of $\alpha_1$. Dashed black lines represent the estimated probability when the true parameters $\sigma_w$ and $\sigma_\alpha$ are known.

Figure 5: (a) Estimated mean function and (b) estimated variance function, by the smoothed empirical null, in one replication of the simulation with survival outcomes, with the number of groups $G = 20$. Black dots represent group-wise robust mean and variance estimates of Z-scores. (c) Boxplots of the empirical probability of signal summarized over 500 replications, with all providers stratified into three groups by provider size. A proportion $\lambda = 0$, 0.5, 0.75 or 1 of the between-provider variance is allowed in the empirical null.