Empirical Bayes methods for monitoring health care quality
1 Corresponding author. Address: P.O. Box 9604, 2300 RC Leiden, The Netherlands; phone: +31-71-5276826; fax: +31-71-5276799; email: [email protected]
Hans C. van Houwelingen, Ronald Brand
Department of Medical Statistics, Leiden University Medical Center, Leiden, The Netherlands
Thomas A. Louis
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, USA
Abstract
The paper discusses empirical Bayes methodology for repeated quality comparisons of health care institutions using data from the Dutch VOKS study that annually monitors the relative performance and quality of nearly all (about 140) Dutch gynecological centres with respect to different aspects (interventions and outcomes) of the childbirths taking place in these centres. This paper can be seen as an extension of the pioneering work of Thomas, Longford and Rolph [20] and Goldstein and Spiegelhalter [7]. First of all, this paper introduces a new simple crude estimate of the centre effect in a logistic regression setting. Next, a simple estimate of the expected percentile of a centre given all data and a measure of "rankability" of the centres based on the expected percentiles are presented. Finally, the temporal dimension is explored and methods are discussed to predict next year's performance.
1. Introduction
Quality comparison of health care and educational institutions has drawn much attention over the last years. Statisticians have picked up the issue in an early stage and the awareness has grown that such comparisons ask for proper statistical methodology. See the commentary by Gore [8], the very nice paper by Spiegelhalter [19], and the editorials of McKee [13] and McKee and Sheldon [14] in the British Medical Journal.

Statistically speaking, there are two stages in dealing with the comparison problem. First, the difference in patient mix between institutions should be taken into consideration and corrections should be made for those differences by proper regression models. That stage is relatively easy to carry out, although it can be hard to deal with the pitfalls of observational modelling such as selection bias. It yields some crude estimates of performance for all institutions with some measure of imprecision. The second, much harder, stage is to draw proper conclusions from these crude estimates. In the clinical literature, there is an irrepressible tendency to rank institutions and to produce so-called league tables as if the institutions were taking part in some sort of a competition. A nice example of such tables is given in the recent paper by Parry et al. [16].

The statistical community agrees that empirical Bayes models can be very useful here to give a more realistic description of the differences between the institutions. (However, it does not agree about the terminology; terms such as hierarchical model and multi-level model are used to describe the same model.) Thomas, Longford et al. [20] introduced the empirical Bayes models in this setting. The paper by Goldstein and Spiegelhalter [7] sketching the merits of the hierarchical model has been very influential and has found its way into practical applications such as Marshall and Spiegelhalter [12] and Leyland and Boddy [10]. Refinements and extensions are discussed in Normand, Glickman et al.
[15], Deely and Smith [4] and Daniels and Gatsonis [3]. DeLong et al. [5] compare fixed effects models with random effects models. Cox [2] mentions the problem in his broad overview of the current position of statistics.

This paper arises from a Dutch project on the quality comparison of gynecological units with respect to different aspects of childbirth, in which the Leiden Department of Medical Statistics has been involved for a long time. In this project all participating centres receive a yearly report on their performances. Hence, we have a longitudinal series of yearly comparisons as in Parry et al. [16]. As argued in Van Houwelingen [21], the interesting issue here is how well the relative position of an institution for next year can be predicted. If so, that asks for measures by the authorities to correct "bad predictions". If not, the whole exercise does not make very much sense.

The main novel issue of this paper is how to deal with longitudinal data in the context of "league tables". Along the line, a new crude estimate of the centre effect will be presented in line with the Yusuf-Peto approach in meta-analysis [22]. Furthermore, the ranking problem will be reviewed and a modification of the expected percentile, going back to Laird and Louis [9], see also Shen and Louis [18], will be advocated as a very suitable summary measure. Throughout the paper we will use the "classical" plug-in empirical Bayes methodology. The advantage is that it can easily be implemented in software packages such as SAS PROC MIXED, but we do realize that it ignores the uncertainty in the parameters of the mixing distribution and, therefore, may lead to overrating the differences between the institutions.

The paper is organized as follows: Section 2 gives more information about the Dutch VOKS project, such as the number of institutions and patients, the relevant outcomes and the explanatory variables that describe the patient mix.
Section 3 discusses crude estimates of the centre effects from a likelihood perspective. Section 4 reviews the classical empirical Bayes model and the ranking problem. In section 5 some results will be given from the VOKS study illustrating the issues of sections 3 and 4. The extension of the model to longitudinal data is given in section 6 and results for the VOKS data in section 7. Finally, a concluding discussion is given in section 8.
2. Description of the problem and the VOKS data
The VOKS study (Verloskundige Onderlinge Kwaliteits Spiegeling) annually monitors the relative performance and quality of nearly all (about 140) Dutch gynecological centres with respect to different aspects (interventions and outcomes) of the childbirths taking place in these centres. The project started in 1988. The data from the first years are not very reliable. After 1995 an intervention study was started that randomized the centres into a group that received yearly reports on their own performance and a group that is not informed. Since the intervention study might influence the results, we only consider the data for the five-year period 1991-95.

The infants are grouped into 4 subgroups: very preterm (25-32 weeks), preterm (32-37 weeks), at term (37-42 weeks) and postterm (42+ weeks). The annual sizes of these groups are of the order 1,500, 10,000, 80,000 and 6,000 respectively. This does not cover all childbirths in The Netherlands, because a substantial part of all deliveries takes place at the mother's home under guidance of a midwife or the family doctor.

In the study several measures of intervention and patient's outcome are considered. In this paper we concentrate on two dichotomous measures: the intervention Caesarian Section (CS) and the outcome Mortality within seven days (M). An indication of the crude CS and M rates in the four subgroups is given in Table 1.

The original purpose of the project was to report to all participating centres their relative performance after correction for explanatory variables at the level of the infant/mother. No explanatory variables at the centre level (such as centre size) were taken into account. Logistic regression models without centre effects were used to compute the expected "risk" for each centre. The explanatory variables describing the "patient mix" are

1. gestational age
2. birth weight
3. lethal congenital malformation
4. multiple pregnancy
5. fetal position
6. tension of the mother
7. planned delivery at the institution or referred by a midwife or family doctor in a late stage
8. duration of primary care
9. intra-uterine transport
10. ethnic minority
11. estimated mortality risk

In principle, all covariates are used in the modelling of CS. In the modelling of M only covariates 1-9 are used. Different functional models (inclusion and coding/transformation of covariates, inclusion of interactions) are used in the four subgroups. The same functional form of the model was used each year, but the coefficients were re-estimated each year. The difference between observed risk and expected risk as derived from the logistic regression model was taken as performance measure to assess the quality of an institution. The institutions were ranked on this criterion and received information about their own ranking on the different measures. In the light of [7] and our own research, the practice of ranking based on crude estimates has been abandoned. Empirical Bayes methodology has been introduced, using the original logistic regressions as starting point.
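As a concrete illustration, the stage-one risk model is an ordinary logistic regression without centre effects. The following is a minimal sketch of such a fit, by iteratively reweighted least squares on simulated toy data with a single covariate; all names are ours and the real VOKS models use the covariates listed above and subgroup-specific functional forms.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Logistic regression without centre effects, fitted by
    iteratively reweighted least squares (Newton-Raphson).
    X is assumed to include a constant column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                     # IRLS weights
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

# toy data: risk increases with a single covariate
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
X = np.column_stack([np.ones_like(x), x])
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x)))
y = (rng.random(2000) < p_true).astype(float)
beta = fit_logistic_irls(X, y)
print(beta.round(2))   # should be close to the true (-1.0, 0.5)
```

The fitted probabilities from such a model play the role of the expected "risk" per patient from which the expected counts per centre are derived.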
3. Obtaining crude estimates of the centre effects
We consider a dichotomous outcome Y_ijk that has been measured in centre i, in year j, on patient k (i = 1, ..., I, j = 1, ..., J, k = 1, ..., n_ij). (We use the term patient throughout in a broad sense. In the example, patient stands for the infant/mother combination.) We restrict attention to dichotomous outcomes and logistic regression, but the whole set-up easily carries over to other generalized linear models and survival analysis. The usual logistic regression model is

logit(P(Y_ijk = 1 | X_ijk, θ_ij)) = X_ijk'β + θ_ij   (3.1)

with
X_ijk the vector of covariates describing the patient's characteristics and including the constant term,
β the vector of regression coefficients describing the effect of the covariates and the intercept,
θ_ij the "effect" of the i-th centre in the j-th year, that is the ln(odds ratio) with respect to some overall mean.

If there are many centres involved and some centres have small sample sizes, fitting such models may lead to exploding centre effects. This could either be remedied by using a mixed model with random θ_ij's right from the start or by following a two-step procedure as used in Thomas et al. [20]. Mixed models could be fitted by software packages such as SAS GLIMMIX or MLwiN, but it is very time-consuming to find the best regression model in (3.1) in a mixed model context with many centres and many observations. We will follow the two-step strategy, partly because it is in line with the original analysis strategy of the VOKS project and also because it is easy to communicate. In the first stage, we estimate the regression coefficients β ignoring the centre effects θ_ij. In the sequel we concentrate completely on the θ_ij's and act as if the regression coefficients are known.
Define

i) O_ij = Σ_k Y_ijk
ii) p_ijk = P(Y_ijk = 1) = exp(X_ijk'β)/(1 + exp(X_ijk'β))   (3.2)
iii) E_ij = Σ_k p_ijk
iv) var_ij = Σ_k p_ijk (1 - p_ijk)

Obviously, a measure of performance of centre i in year j will somehow depend on the difference O_ij - E_ij. A popular measure, used and defended in Thomas et al. [20], used in Parry et al. [16], DeLong et al. [5] and, originally, in the VOKS study is

W_ij = (O_ij - E_ij)/n_ij   (3.3)

with standard error

(1/n_ij) √var_ij   (3.4)

However, this measure is not quite in line with the logistic regression model. It is easy to show that the Taylor expansion of the log-likelihood l_ij for θ_ij around θ_ij = 0 is given by

l_ij(θ_ij) ≈ l_ij(0) + (O_ij - E_ij) θ_ij - ½ var_ij θ_ij² + ...   (3.5)

That suggests to use

θ̂_ij = (O_ij - E_ij)/var_ij   (3.6)

as a crude measure of (the lack of) performance and to act as if

θ̂_ij ~ N(θ_ij, 1/var_ij)   (3.7)

A similar approximation was introduced in the field of meta-analysis by Yusuf and Peto [22]. It is an application of the notion of the (efficient) score statistic. See Cox and Hinkley [1, chapter 9] for more theoretical background. The equations (3.6) and (3.7) are used in a slightly different context by Louis and Bailey [11].

Notice that the approximate normality in (3.7) is not a statement about the sampling distribution of θ̂_ij, but one about the validity of using the Taylor expansion of the log-likelihood as function of θ_ij around 0. For the time being we will assume that the true centre effects θ_ij are relatively small. We will check the validity of this assumption for our data later on. In practice there is little difference between an analysis based on our θ̂_ij and one based on W_ij. Roughly speaking θ̂_ij ≈ W_ij/(p̄(1 - p̄)) with p̄ = average p. The whole analysis as presented in the sequel could also be based on W_ij. Using W_ij might be more appealing to the intuition of the practitioner while θ̂_ij appeals more to the theoretical mind. We prefer using θ̂_ij because it comes much closer to doing a full-scale Maximum Likelihood analysis for model (3.1). We also like the idea that the approximate normal distribution is not based on the Central Limit Theorem but on a Taylor expansion of the likelihood. From now on, we act as if the regression coefficients contained in the vector β are known, use (3.6) as our definition of performance measure and act as if (3.7) is true.
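The crude estimate (3.6) and its approximate standard error follow directly from the stage-one fitted probabilities. A minimal sketch with made-up numbers (function and variable names are ours):

```python
import numpy as np

def crude_centre_effect(y, p):
    """Crude centre effect per (3.2)-(3.7): theta_hat = (O - E)/var,
    with approximate variance 1/var, where O = observed number of events,
    E = sum of fitted probabilities and var = sum of p*(1-p)."""
    y = np.asarray(y, dtype=float)   # 0/1 outcomes of the centre's patients
    p = np.asarray(p, dtype=float)   # stage-one fitted probabilities
    O = y.sum()
    E = p.sum()
    v = (p * (1.0 - p)).sum()
    theta_hat = (O - E) / v
    se = 1.0 / np.sqrt(v)            # the s_i of section 4
    return theta_hat, se

# hypothetical small centre: 3 events among 10 deliveries,
# all with fitted risk 0.2, so O = 3, E = 2, var = 1.6
theta_hat, se = crude_centre_effect([1,1,1,0,0,0,0,0,0,0], [0.2]*10)
print(round(theta_hat, 4), round(se, 4))   # -> 0.625 0.7906
```

Note that a centre with few patients gets a large standard error, which is exactly what drives the shrinkage in the next section.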
4. Univariate empirical Bayes model
For the time being we restrict attention to one fixed year and suppress the year index j. We want to make inference about the individual effects θ_i for each centre knowing that we have estimates θ̂_i that are approximately N(θ_i, s_i²) (in the likelihood sense as discussed in section 3). This information can be graphically displayed as a sequence of 95%-confidence intervals

95%-CI = (θ̂_i ± 1.96 s_i)   (4.1)

Using the crude estimates may lead to misleading conclusions, especially if the standard errors s_i differ from centre to centre. Small centres tend to have large s_i and extreme estimates θ̂_i.

At this stage we do not consider explanatory variables at the centre level. Technically speaking that would not be very hard, but we stick to the simple situation that centre effects are only corrected for patient mix. We will come back to this point in the discussion.

As pointed out by Robinson [17], the empirical Bayes approach is very well suited for handling this kind of data and it gives a much more realistic idea of the unknown effects in many applications. The practical results of this approach might even convince those who have strong philosophical objections against declaring the unknown centre effects to be random. The empirical Bayes approach takes the centre effects θ_i to be random with some distribution G. The simplest model is G = N(µ, τ²). One might even use the simple moment estimators for µ and τ² found in DerSimonian and Laird [6], who consider the same model in the context of meta-analysis. However, it is not hard to obtain the Maximum Likelihood Estimates for µ and τ² from the marginal N(µ, s_i² + τ²)-distribution (using SAS PROC MIXED or similar software) or via a home-made EM-algorithm. (The MLE might not be the best estimator for τ². Alternatives are REML or some Bayesian estimator. The actual choice of the estimator is not essential in the sequel.) Given µ and τ², the posterior distribution of θ_i is

θ_i | θ̂_i ~ N(EBE_i, pv_i)   (4.2)

with

EBE_i = µ + (τ²/(τ² + s_i²)) (θ̂_i - µ)   (4.3)

the posterior mean, and

pv_i = τ² s_i²/(τ² + s_i²)   (4.4)

the posterior variance. The posterior mean is known as the empirical Bayes estimate (EBE) of the centre effect. The data can be graphically presented as a sequence of 95%-posterior probability intervals

95%-PPI = (EBE_i ± 1.96 √pv_i)   (4.5)

It is interesting to reflect for a while on the differences between the fixed effect model with its confidence intervals (4.1) and the random effect model with its posterior probability intervals (4.5). If all centres have the same standard error s_i, the two representations differ only in scale. The EBE's are closer to the common mean µ, but (4.1) and (4.5) reflect the same uncertainty about the ranking of the centres. If the centres have different standard errors s_i, the story is more complicated. If the standard error s_i is small, the EBE is close to the MLE θ̂_i and the posterior variance close to s_i². If s_i is large, the EBE is close to µ and the posterior variance close to τ². EBE's give better estimates of the true centre effects than the crude estimates. The posterior distribution summarizes our (lack of) knowledge about the centre effects. (Notice that the histogram of the EBE's does not estimate the mixing distribution G. See Shen and Louis [18] for a discussion about how to combine estimating G and ranking the centres in a single procedure.)

Ranking of centres according to quality is very popular. The simple approach, still used in Parry et al. [16], can be misleading because the result is very much affected by random variation and there is no room for the possibility that it is all pure random noise and ranking does not make sense at all. Ranking of the EBE's is not much better because it also ignores the uncertainty reflected by the posterior variances pv_i. Most authors agree that whatever way we want to rank the centres, it should be based on the posterior distributions (4.2). Goldstein and Spiegelhalter [7] simulate from the posterior distribution and compute the median rank and its Bayesian confidence interval by means of Markov Chain Monte Carlo (MCMC). Deely and Smith [4] advocate the use of a single number, such as P(θ_i < µ | data) or P(θ_i = min_j(θ_j) | data).

We think that the expected rank, as proposed by Laird and Louis [9], is a very suitable summary measure. It is defined as

ER_i = 1 + Σ_{j≠i} P(θ_j < θ_i | data) ≈ 1 + Σ_{j≠i} Φ((EBE_i - EBE_j)/√(pv_i + pv_j))   (4.6)

It is very close to the median rank of Goldstein and Spiegelhalter [7]. There are two extreme possibilities. If the standardized differences (EBE_i - EBE_j)/√(pv_i + pv_j) are very close to zero, that is if the posterior probability intervals show considerable overlap, the expected ranks ER_i ≈ (n+1)/2. If the standardized differences are very large, that is if there is hardly any overlap between the PPI's, the expected ranks get very close to the ranks based on the posterior means
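The posterior quantities (4.3)-(4.4) amount to the familiar precision-weighted shrinkage towards the overall mean. A sketch, with hypothetical values for µ, τ² and the s_i² (names are ours):

```python
import numpy as np

def eb_posterior(theta_hat, s2, mu, tau2):
    """Empirical Bayes posterior per (4.2)-(4.4):
    EBE_i = mu + tau2/(tau2 + s_i^2) * (theta_hat_i - mu),
    pv_i  = tau2 * s_i^2 / (tau2 + s_i^2)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    shrink = tau2 / (tau2 + s2)        # weight on the centre's own data
    ebe = mu + shrink * (theta_hat - mu)
    pv = tau2 * s2 / (tau2 + s2)
    return ebe, pv

# hypothetical: a large centre (small s^2) and a small centre (large s^2),
# both with the same crude estimate 0.6
ebe, pv = eb_posterior(theta_hat=[0.6, 0.6], s2=[0.01, 1.0],
                       mu=0.0, tau2=0.04)
print(ebe.round(3), pv.round(4))   # the small centre is pulled towards mu
```

The example makes the asymmetry visible: identical crude estimates, but the imprecise centre is shrunk almost completely to the mean while keeping a posterior variance close to τ².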
EBE_i. It is tempting to rank the centres on the basis of the expected ranks ER_i, as suggested in Laird and Louis [9] and implicitly done in the graphics of Goldstein and Spiegelhalter [7]. However, such a ranking hides the underlying uncertainties and, therefore, we prefer to use the expected ranks as such and not to transform them into ranks 1, ..., n. Ranks are a bit inconvenient in practice because, by definition, they depend heavily on the number of centres. Therefore, we propose to compute percentiles instead, defined by

PCER_i = 100 · ER_i / n   (4.7)

The symbol PCER stands for "PerCentiles based on Expected Ranks". This percentile can be interpreted as an estimate of the probability that the effect θ_i is smaller than the effect θ_j of a randomly selected centre. Here, selection is from all n centres, including an independent copy of centre i itself. It is easy to extend this notion to the whole population of all conceivable centres. That means that we want an estimate of the percentile 100·G(θ_i). A straightforward estimate is

EPC_i = 100 · E[G(θ_i) | data] = 100 · Φ((EBE_i - µ)/√(τ² + pv_i))   (4.8)

Here, the symbol EPC stands for "Expected Percentile". In the practical examples discussed below, there turned out to be hardly any difference between PCER and EPC. The latter has the advantage of a very simple computation. Therefore, we will base our presentation on EPC, but emphasize that there is very little conceptual difference between this new measure and the expected rank (ER) of Laird and Louis [9] and the median rank of Goldstein and Spiegelhalter [7].

The variance of EPC can be compared with the maximal variance of uniform percentiles (100²/12). The ratio can serve as a measure RA of "rankability". Formally, we define

RA = var(EPC)/(100²/12)   (4.9)

This measure describes the true variation of the percentiles on a 0-1 scale. If it is very small, the differences between the centres are completely obscured by the measurement error. We could define a similar and even simpler measure based on the variance τ² of the random effect and the total variation of the crude estimates. Here, we have to take account of the variability in the error variances s_i². Since the distribution of the error variances is rather skewed, we take the median as typical value. That leads to our measure of proportion true variation between centres

ρ = τ²/(τ² + median(s_i²))   (4.10)

It measures what part of the variation between the crude centre effects is due to true differences (as opposed to measurement error). It is the theoretical correlation between crude centre effects in different years if the true centre effect is constant over time. Therefore, we denote it by ρ.
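Under the plug-in normal posteriors, the expected rank (4.6) and the expected percentile (4.8) reduce to sums and evaluations of the normal distribution function. A sketch with three hypothetical centres (µ, τ² and the posteriors taken as known; all names are ours):

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def expected_ranks(ebe, pv):
    """Expected rank (4.6): ER_i = 1 + sum_{j != i} P(theta_j < theta_i | data)."""
    n = len(ebe)
    er = np.ones(n)
    for i in range(n):
        for j in range(n):
            if j != i:
                er[i] += Phi((ebe[i] - ebe[j]) / sqrt(pv[i] + pv[j]))
    return er

def expected_percentile(ebe, pv, mu, tau2):
    """EPC (4.8): 100 * Phi((EBE_i - mu) / sqrt(tau2 + pv_i))."""
    return np.array([100.0 * Phi((e - mu) / sqrt(tau2 + v))
                     for e, v in zip(ebe, pv)])

ebe = [-0.1, 0.0, 0.1]
pv = [0.01, 0.01, 0.01]
er = expected_ranks(ebe, pv)
epc = expected_percentile(ebe, pv, mu=0.0, tau2=0.04)
pcer = 100.0 * er / len(er)                 # (4.7)
ra = epc.var() / (100.0**2 / 12.0)          # rankability (4.9)
print(er.round(2), epc.round(1), round(ra, 2))
```

With the overlapping posteriors above, the expected ranks stay close to (n+1)/2 = 2 and the rankability is modest, illustrating the "considerable overlap" case discussed after (4.6).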
5. Annual results for the VOKS data.
As mentioned before we will consider the data from the years 1991-95 on the endpoints Caesarian Section and Mortality in the subgroups Very Preterm, Preterm, At Term and Postterm. For each subgroup×year combination we fit the mixture model N(µ, τ²) to the crude estimates of the centre effects defined by (3.6), using the likelihood of (3.7). Correlation between centre effects over the groups or over the years is ignored for the time being. The estimated parameters of the empirical Bayes model are very stable over the years. Table 1 gives some typical numbers for each of the subgroups.

Table 1 about here.

The fitted values of µ (not shown because they are not very relevant) and τ indicate that the true centre effects are indeed close to zero and the approximation (3.7) seems reasonable here. The variance of the random centre effects is largest in the first subgroup of Very Preterm infants for both outcomes. However, due to the small number of patients the variability is very large and the proportion true variation quite small. In the other subgroups the mortality rates are very low, the variance component and the proportion true variation are very small, so it seems that any attempt to discriminate between the centres on that outcome in these subgroups is hopeless. For the outcome Caesarian Section, the variance component is quite small, but the proportion true variation is much higher due to the high Caesarian Section rates in all these subgroups. The proportion true variation is maximal in the At Term subgroup due to the large number of patients.

To get more insight we consider two situations in more detail. First of all, the outcome Caesarian Section in the At Term group in 1995. That should be about the ideal case for comparison of the centres. Second, we consider the outcome Mortality in the subgroup Very Preterm in 1993, because in that year the variance component was the largest of all years.
That should be a borderline case for any sensible comparison of the centres.

Caesarian Section in the At Term group in 1995. There are 112 centres. The mean number of patients per centre is 695 and the overall CS rate is 16%. The confidence intervals of equation (4.1) are plotted in Figure 1.

Figure 1 about here.

The estimated parameters of the random effect distribution are µ = ... and τ² = .... The corresponding 95%-posterior probability intervals are shown in Figure 2 and the EPC is plotted versus the crude percentile in Figure 3.

Figure 2 and Figure 3 about here.

The measure of rankability is RA = 0.90, very close to the measure of proportion true variation ρ. The conclusion in this subgroup is that, thanks to the large sample sizes per centre, the centres can be compared quite well.

Mortality in the Very Preterm group in 1993. Figure 4 shows the EBE's and the EPC. First, it shows that the order is very much changed going from crude estimates to EBE's. Second, it shows that the vast majority of the centres have an expected percentile between 30% and 70%. There are a few (larger) centres that have an outspoken low risk.

Figures 5 and 6 about here.

The conclusion for this subgroup must be that, due to the small sample sizes per centre, it is very hard to make any inference on the relative quality of an individual centre, although there is a substantial (latent) variation between the centres reflected by the relatively large variance component τ².
6. Extension of the model to longitudinal data
The whole exercise of comparing centres in a particular year only makes sense if the results of that year carry over to the next year. If we have J annually repeated measures as in the VOKS data set, we can both check the correlation between the years and produce some estimate for the performance in year J+1 by using an appropriate mixed model for the repeated measures. As for the univariate case, the model can best be described as a two-stage model

θ̂_i | θ_i ~ MVN(θ_i, S_i)
θ_i ~ MVN(M, T)   (6.1)

The vector θ̂_i contains the crude estimates for centre i in each year. Conditional on the true centre effects θ_i, these estimates are multivariate normal with mean θ_i and diagonal covariance matrix S_i containing the conditional variances s_ij² of section 4. The true centre effects have a multivariate normal distribution with mean M containing the µ's of section 4 and covariance matrix T containing the annual variance components τ_j² on the diagonal and allowing correlation between the years. Fitting such a model to the data is not extremely hard. The EM algorithm is very well suited for fitting the two-stage model, but it is a bit slow. Standard software such as SAS PROC MIXED could be used as well. We will give some more detail in the next section. Missing data could occur if a centre joins the exercise in a later stage or drops out, but such missing data do not constitute a serious problem as long as they are missing at random (MAR). Fitting some model to the observed data gives an understanding of the correlations between the years and the (im)possibility to say anything about next year. If we really want to make any statement about the next year we need to extrapolate both M and T. The extrapolation of M is not crucial because the actual mean µ is not relevant in measures such as EPC or RA. However, it is crucial to extrapolate T.
Models for the covariance matrix that allow easy extrapolation are: the compound symmetry model (constant variance, equal correlations); the random regression coefficient model with θ_ij = z_j'B_i and z_j = (1, time_j)' constant over centres, leading to cov(θ_i) = Z cov(B_i) Z'; the pure stationary autoregressive model with correlations ρ_jk = ρ^|j-k|; or a combination of a random intercept and an autoregressive term. We will discuss some models in more detail in the next section. Once the model is fitted on the data from the years 1, ..., J and M and T are properly extrapolated, it is not hard to compute the distribution of θ_{i,J+1} given all data of centre i. It is normal with

E[θ_{i,J+1} | θ̂_i1, ..., θ̂_iJ] = µ_{J+1} + (T_{J+1,1}, ..., T_{J+1,J}) (T_{1:J,1:J} + S_i)^{-1} (θ̂_i1 - µ_1, ..., θ̂_iJ - µ_J)'

var[θ_{i,J+1} | θ̂_i1, ..., θ̂_iJ] = T_{J+1,J+1} - (T_{J+1,1}, ..., T_{J+1,J}) (T_{1:J,1:J} + S_i)^{-1} (T_{1,J+1}, ..., T_{J,J+1})'   (6.2)

Similar expressions hold if the past information is not complete. Having obtained these conditional predictive distributions, the predictive rankability of the centres for year J+1 can be further analysed in the way of section 5. We will demonstrate that in the next section.
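The prediction step (6.2) is ordinary multivariate normal conditioning. A sketch with a hypothetical compound-symmetry T extended to year J+1 (all inputs are made up and the function name is ours):

```python
import numpy as np

def predict_next_year(theta_hat, s2, M_full, T_full):
    """Conditional distribution of theta_{i,J+1} given the crude estimates
    theta_hat_{i1..iJ}, per (6.2).  M_full and T_full are the mean vector
    and covariance matrix over years 1..J+1 (T already extrapolated);
    s2 holds the crude-estimate variances s_{ij}^2."""
    J = len(theta_hat)
    T_obs = T_full[:J, :J] + np.diag(s2)   # covariance of the crude estimates
    c = T_full[J, :J]                      # cov(theta_{J+1}, theta_{1..J})
    w = np.linalg.solve(T_obs, np.asarray(theta_hat, float) - M_full[:J])
    mean = M_full[J] + c @ w
    pvar = T_full[J, J] - c @ np.linalg.solve(T_obs, c)
    return mean, pvar

# hypothetical: 5 observed years, compound symmetry with tau2=0.04, corr=0.8,
# a centre with consistently elevated crude estimates of 0.2
J = 5
tau2, rho = 0.04, 0.8
T_full = tau2 * (rho * np.ones((J + 1, J + 1)) + (1 - rho) * np.eye(J + 1))
M_full = np.zeros(J + 1)
mean, pvar = predict_next_year([0.2] * J, [0.02] * J, M_full, T_full)
print(round(mean, 4), round(pvar, 4))
```

The predicted mean lies between the overall mean 0 and the crude level 0.2, and the predictive variance is below τ²: the past years are informative but, because the correlation is below 1, the prediction is shrunk more strongly than a within-year EBE.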
7. Longitudinal results for the VOKS data

We fitted two models to the longitudinal data. Model I is the stationary autoregressive model with constant variance τ², correlations ρ^|j-k| (that is, T_jk = τ² ρ^|j-k|) and unstructured mean. The correlation parameter ρ is a bit hard to estimate. Given that parameter, estimation of the mean and variance parameters can easily be done by EM. So we estimated ρ by maximizing its profile likelihood derived from the EM algorithm for the other parameters. We did not model the mean value, but extrapolated it manually to the next year. Model II is the random coefficients model

θ_ij = A_i + B_i (year_j - year̄), (A_i, B_i) ~ N((µ_A, µ_B), Σ), with Σ containing σ_A², σ_AB and σ_B².

Again, this model is well suited for fitting by the EM-algorithm. The results of the model fit are given in Table 4.

Insert Table 4

The goodness-of-fit of both models is very similar; the random regression coefficient model fits slightly better, presumably because it adapts to the slowly increasing yearly variances in Table 3. On the other hand, extrapolations of model I are more robust than those of model II. Next, we compute the predictive distribution for 1996 for each centre using both models. As can be seen from formula (6.2), the predictive distribution does not only depend on the model we use for the true centre effects and the past crude estimates for each centre, but also on the precision of the crude estimates, which directly depends on the size of the centre. From the predictive distributions the expected percentile (4.8) is computed for each centre and the overall predicted rankability RA (4.9). Notice that the concept of the proportion true variation does not carry over easily to the prediction setting while the concept rankability does. The two prediction models are not completely identical. There is some shift in scale and location. The results for the better fitting model II are shown in Figure 7. The picture for model I looks very much the same. The Expected Percentiles according to the two models are very similar as can be seen in Figure 8.

Figures 7 and 8 about here.

The rankability in Figure 7 is RA = .... Comparing this with section 4, where we found RA = ...
8. Discussion
The results of this paper show once again that the empirical Bayes framework is very well suited for the analysis of quality comparison data. We think that our Expected Percentile measure (4.8) is very well suited to single out very extreme centres. It is related to, but more strict than, the simpler P(θ_i < µ), estimated by Φ((EBE_i - µ)/√pv_i), as proposed by Deely and Smith [4]. It is comforting to see the consistency over the years in the data analysis and the high correlation of the successive centre effects. That means that statistical analysis makes sense and can reveal interesting issues of the data.

One should be very careful in drawing firm public health conclusions from these data. Although there seems to be a general agreement that the quality comparison of centres should be corrected for differences in patient mix, all results are completely observational and there might be some simple unobserved variable at the patient level that explains everything. The role of explanatory variables at the centre level is more ambiguous. If one could show, for example, that smaller centres fare worse than big centres, that might explain the variation between the centres but it does not take away the concern about the centres with poor performance; it is only easier to pinpoint them.

We have chosen not to include explanatory variables at the centre level. As we said in section 4, it is not very hard to include explanatory variables V_1, ..., V_m at the centre level in the second stage of our analysis by a two-stage model like

θ̂_i | θ_i ~ N(θ_i, s_i²)
θ_i ~ N(Σ_l γ_l V_il, τ²)

and to replace the 95%-posterior probability intervals of (4.5) by the tolerance intervals for θ_i in that model. If the model fits well, the estimated centre effects will be about the same, but the uncertainties may be smaller and, therefore, the rankability may improve. Our approach is quite traditional.
The advantage is that in the first stage the centre effects are estimated by simple Observed - Expected type statistics and that in the second stage the analysis can be carried out by traditional software and the ranking problem can be discussed in terms of posterior mean and variance of the centre effects. The assumption that the centre effects were small enough to warrant quadratic Taylor expansions around zero seems not to be violated in the data analysed here. We admit that we did not check the validity of the normal distribution for the true centre effects, but we would be very surprised if that influenced the results, given that the variation of the mixing distribution is quite small.

The main drawback of our approach is that we just plug in the estimated parameters of the mixing distribution and that we have no simple method of accounting for the imprecision of the estimates. Since the main interest is in differences between centres, imprecision in the mean of the mixing distribution does not matter, but the uncertainty in the variance parameter τ² (and the correlation over time in sections 6 and 7) could have quite some impact on measures like the Expected Percentile (4.8) and the rankability RA. We conjecture that taking account of imprecision in the variance parameters would lead to more conservative answers in the sense that the Expected Percentiles get closer to 50% and the estimated rankabilities become lower. A primitive way out is to obtain a confidence interval for the variance parameter from the Maximum Likelihood analysis and to check how sensitive the answer is to variations within the confidence interval. Another alternative is bootstrapping, but that is not quite straightforward in the multilevel setting. Presumably, the best way is adopting a hierarchical Bayesian model and integrating out all uncertainties. Analytically, that could become cumbersome, but the Markov Chain Monte Carlo (MCMC) method as advocated by Goldstein and Spiegelhalter [7] can help to handle that.
The price to pay is the loss of being able to summarize the results in a few simple statistics. The choice between our more traditional approach and the MCMC approach is partly a matter of taste. We can imagine doing an MCMC analysis on the crude estimates of the centre effects, replacing our Maximum-Likelihood-by-EM approach, but we would hesitate to MCMC the whole mixed model, because we would like to be able to explain to our clinical counterparts why some centre gets a good or a bad ranking at the end.

A refinement of our approach is to connect the results for the different subgroups and to investigate the correlation between the centre effects for the four subgroups. That asks for doubly multivariate (Group × Year) models, or even more complicated models if we also link the different outcomes.

In summary, there is certainly room for further refinements of our model and our approach, and there is much work to do for the statisticians, provided we keep it digestible for the clinicians involved.

References
1. Cox DR, Hinkley DV, Theoretical Statistics, Chapman and Hall, London, 1974
2. Cox DR, The current position of statistics: A personal view, International Statistical Review, 65 (1997), 261-276
3. Daniels MJ, Gatsonis C, Hierarchical generalized linear models in the analysis of variations in health care utilization, Journal of the American Statistical Association, 94 (1999), 29-42
4. Deely JJ, Smith AFM, Quantitative refinements for comparisons of institutional performance, Journal of the Royal Statistical Society Series A, 161 (1998), 5-12
5. DeLong ER, Peterson ED, DeLong DM, et al., Comparing risk-adjustment methods for provider profiling, Statistics in Medicine, 16 (1997), 2645-2664
6. DerSimonian R, Laird N, Meta-analysis in clinical trials, Controlled Clinical Trials, 7 (1986), 177-188
7. Goldstein H, Spiegelhalter DJ, League tables and their limitations: Statistical issues in comparisons of institutional performance, Journal of the Royal Statistical Society Series A, 159 (1996), 385-409
8. Gore SM, Selection to medical school in Great Britain - League tables only help if uncertainty is properly looked at, British Medical Journal, 318 (1999), 938-939
9. Laird N, Louis T, Empirical Bayes ranking methods, Journal of Educational Statistics, 14 (1989), 29-46
10. Leyland AH, Boddy FA, League tables and acute myocardial infarction, Lancet, 351 (1998), 555-558
11. Louis TA, Bailey JK, Controlling error rates using prior information and marginal totals to select tumor sites, Journal of Statistical Planning and Inference, 24 (1990), 297-316
12. Marshall EG, Spiegelhalter DJ, Reliability of league tables of in vitro fertilisation clinics: retrospective analysis of live birth rates, British Medical Journal, 316 (1998), 1701-1704
13. McKee M, Indicators of clinical performance, British Medical Journal, 315 (1997), 142
14. McKee M, Sheldon T, Measuring performance in the NHS - Good that it's moved beyond money and activity but many problems remain, British Medical Journal, 316 (1998), 322
15. Normand SLT, Glickman ME, Gatsonis CA, Statistical methods for profiling providers of medical care: Issues and applications, Journal of the American Statistical Association, 92 (1997), 803-814
16. Parry GJ, Gould CR, McCabe CJ, et al., Annual league tables of mortality in neonatal intensive care units: longitudinal study, British Medical Journal, 316 (1998), 1931-1935
17. Robinson GK, That BLUP is a good thing: the estimation of random effects, Statistical Science, 6 (1991), 15-51
18. Shen W, Louis TA, Triple-goal estimates in two-stage hierarchical models, Journal of the Royal Statistical Society Series B, 60 (1998), 455-471
19. Spiegelhalter DJ, Surgical audit: statistical lessons from Nightingale and Codman, Journal of the Royal Statistical Society Series A, 162 (1999), 45-58
20. Thomas N, Longford NT, Rolph JE, Empirical Bayes methods for estimating hospital-specific mortality rates, Statistics in Medicine, 13 (1994), 889-903
21. Van Houwelingen HC, The future of biostatistics: Expecting the unexpected, Statistics in Medicine, 16 (1997), 2773-2784
22. Yusuf S, Peto R, Lewis J, Collins R, Sleight P, Beta blockade during and after myocardial infarction: an overview of the randomised trials, Progress in Cardiovascular Diseases, 27 (1985), 335-371

Legends
Table 1.  Typical values per year of the number of centres, the number of patients, the overall rates of the outcome considered and measures of between-centre variability for all subgroup × outcome combinations.

                                          Caesarian Section               Mortality
subgroup       number of   mean number    overall   (between-centre      overall   (between-centre
               centres     of patients    rate      variability)         rate      variability)
Very Preterm   105         15             0.45      0.30   0.17          0.25      0.25   0.15
Preterm        115         80             0.25      0.12   0.55          0.025     0.08   0.12
At Term        115         625            0.15      0.15   0.91          0.0035    0.04   0.07
Postterm       115         50             0.15      0.10   0.36          0.0040    0.15   0.02

Table 2.  The saturated longitudinal mixed model for Caesarian Section in the At Term group. The table shows means, variances and correlations of the true centre effects.

               91      92      93      94      95
means µ        0.06    0.06    0.06    0.06    0.06
variances

Table 3.  The saturated longitudinal mixed model for Mortality in the Very Preterm group. The table shows means, variances and correlations of the true centre effects.

               91      92      93      94      95
means µ        0.15    0.36    0.40    0.42    0.37
variances

Table 4.  Details of the two mixed models for Mortality in the group Very Preterm.

                                            model I:               model II:
                                            autoregressive model   random coefficients
                                            with free means        model: A + B*(year-90)
model fit
  number of parameters for the mean         5                      2
  number of parameters for the
  covariance matrix                         2                      3
  log-likelihood                            -410.30                -408.51
  AIC                                       -417.30                -413.51
parameters
  mean µ                                    0.27; 0.33; 0.34;      0.18 + 0.053*(year-90)
                                            0.36; 0.37
  variance                                                         A  B  AB: -0.23
prediction
  extrapolated mean in 1996                 0.40 (manually)        0.50
  extrapolated variance in 1996             0.25                   0.51
Figure 1.  95% Confidence Intervals for the true centre effects for the outcome CS in the At Term group in 1995. (Crude 95%-CI's, centres ranked on the crude estimate; vertical axis: ln(odds ratio).)

Figure 2.  95% Posterior Probability Intervals for the true centre effects for the outcome CS in the At Term group in 1995. (Centres ranked on the crude estimate; vertical axis: ln(odds ratio).)

Figure 3.  The relation between the Crude Percentile and the Expected Percentile for the outcome CS in the At Term group in 1995. (Axes: Crude Percentile versus Expected Percentile.)

Figure 4.  95% Confidence Intervals for the true centre effects for the outcome M in the Very Preterm group in 1993. (Crude 95%-CI's, centres ranked on the crude estimate; vertical axis: ln(odds ratio).)

Figure 5.  95% Posterior Probability Intervals for the true centre effects for the outcome M in the Very Preterm group in 1993. (Centres ranked on the crude estimate; vertical axis: ln(odds ratio).)

Figure 6.  The relation between the different percentiles in the Very Preterm group in 1993. PCRUDE stands for the percentile based on the crude estimate, PEBE for the percentile based on the posterior mean EBE, and EPC is the Expected Percentile.

Figure 7.  "Predicted" 95% probability intervals for the true centre effects for the outcome M in the Very Preterm group in 1996. (Centres ranked on the predicted mean; vertical axis: ln(odds ratio).)