Improving the Hosmer-Lemeshow Goodness-of-Fit Test in Large Models with Replicated Trials
Nikola Surjanovic and Thomas M. Loughin
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
Summary
The Hosmer-Lemeshow (HL) test is a commonly used global goodness-of-fit (GOF) test that assesses the quality of the overall fit of a logistic regression model. In this paper, we give results from simulations showing that the type 1 error rate (and hence power) of the HL test decreases as model complexity grows, provided that the sample size remains fixed and binary replicates are present in the data. We demonstrate that the generalized version of the HL test by Surjanovic et al. (2020) can offer some protection against this power loss. We conclude with a brief discussion explaining the behaviour of the HL test, along with some guidance on how to choose between the two tests.
Key Words: chi-squared test, generalized linear model, goodness-of-fit test, Hosmer-Lemeshow test, logistic regression
Introduction

Logistic regression models have gained a considerable amount of attention as a tool for estimating the probability of success of a binary response variable, conditioning on several explanatory variables. Researchers in health and medicine have used these models in a wide range of applications. One of many examples includes estimating the probability of hospital mortality for patients in intensive care units as a function of various covariates (Lemeshow et al., 1988).

Regardless of the application, it is always desirable to construct a model that fits the observed data well. One of several ways of assessing the quality of the fit of a model is with goodness-of-fit (GOF) tests (Bilder and Loughin, 2014). In general, GOF tests examine a null hypothesis that the structure of the fitted model is correct. They may additionally identify specific alternative models or deviations to test against, but this is not required. Global (omnibus) GOF tests are useful tools that allow one to assess the validity of a model without restricting the alternative hypothesis to a specific type of deviation.

An example of a well-known global GOF test for logistic regression is the Hosmer-Lemeshow (HL) test, introduced by Hosmer and Lemeshow (1980). The test statistic is a Pearson statistic that compares observed and expected event counts from data grouped according to ordered fitted values from the model. The decision rule for the test is based on comparing the test statistic to a chi-squared distribution with degrees of freedom that depend on the number of groups used to create the test statistic. The HL test is relatively easy to implement in statistical software, and since its creation, the HL test has become quite popular, particularly in fields such as biostatistics and the health sciences.

Despite its popularity, the HL test is known to have some drawbacks.
In both experimental and observational studies, it is possible to have data for which binary observations have the same explanatory variable patterns (EVPs). In this case, responses can be aggregated into binomial counts and trials. When there are observations with the same EVP present in the data, it is possible to obtain many different p-values depending on how the data are sorted (Bertolini et al., 2000). A related disadvantage of HL-type tests is that the test statistic depends on the way in which groups are formed, as remarked upon by Hosmer et al. (1997).

Notation

In what follows, we use the notation of Surjanovic et al. (2020). We let $Y \in \{0, 1\}$ denote a binary response variable associated with a $d$-dimensional covariate vector, $X \in \mathbb{R}^d$, where the first element of $X$ is equal to one. The pairs $(X_i, Y_i)$, $i = 1, \ldots, n$, denote a random sample, with each pair being distributed according to the joint distribution of $(X, Y)$. The observed values of $(X_i, Y_i)$ are denoted using lowercase letters as $(x_i, y_i)$.

In a logistic regression model with binary responses, one assumes that
$$\mathrm{E}(Y \mid X = x) = \pi(\beta^\top x) = \frac{\exp(\beta^\top x)}{1 + \exp(\beta^\top x)},$$
for some $\beta \in \mathbb{R}^d$. The likelihood function is
$$L(\beta) = \prod_{i=1}^n \pi(\beta^\top x_i)^{y_i} \left(1 - \pi(\beta^\top x_i)\right)^{1 - y_i}.$$
From this likelihood, a maximum likelihood estimate (MLE), $\hat{\beta}_n$, of $\beta$ is obtained.

The HL Test Statistic
To compute the HL test statistic, one partitions the observed data, $(x_i, y_i)$, into $G$ groups. Typically, the groups are created so that fitted values are similar within each group and the groups are of approximately equal size. To achieve this, a partition is defined by a collection of $G + 1$ interval endpoints, $-\infty = k_0 < k_1 < \cdots < k_{G-1} < k_G = \infty$. The $k_g$ often depend on the data, usually being set equal to the logits of equally-spaced quantiles of the fitted values, $\hat{\pi}_i = \pi(\hat{\beta}_n^\top x_i)$. We define $I_i^{(g)} = \mathbb{1}(k_{g-1} < \hat{\beta}_n^\top x_i \le k_g)$, $O_g = \sum_{i=1}^n y_i I_i^{(g)}$, $E_g = \sum_{i=1}^n \hat{\pi}_i I_i^{(g)}$, $n_g = \sum_{i=1}^n I_i^{(g)}$, and $\bar{\pi}_g = E_g / n_g$, where $\mathbb{1}(A)$ is the indicator function on a set $A$. With this notation, the number of observations in the $g$th group is represented by $n_g$, and $\bar{\pi}_g$ denotes the mean of the fitted values in this group. The HL test statistic is a quadratic form that is commonly written in summation form as
$$\widehat{C}_G = \sum_{g=1}^G \frac{(O_g - E_g)^2}{n_g \bar{\pi}_g (1 - \bar{\pi}_g)}.$$

When $G > d$, Hosmer and Lemeshow (1980) find that the HL test statistic is asymptotically distributed as a weighted sum of chi-squared random variables under the null hypothesis, after checking certain conditions of Theorem 5.1 in Moore and Spruill (1975). Precisely,
$$\widehat{C}_G \xrightarrow{d} \chi^2_{G-d} + \sum_{j=1}^{d} \lambda_j \chi^2_{1j}, \qquad (1)$$
with each $\chi^2_{1j}$ being a chi-squared random variable with 1 degree of freedom, and each $\lambda_j$ an eigenvalue of a certain matrix that depends on both the distribution of $X$ and $\beta_0$, the true value of $\beta$ under the null hypothesis. Hosmer and Lemeshow (1980) conclude through simulations that the right side of (1) is well approximated by a $\chi^2_{G-2}$ distribution in various settings.

The HL test statistic and the corresponding p-value both depend on the chosen number of groups, $G$. Typically, $G = 10$ groups are used, so that observations are partitioned into groups that are associated with "deciles of risk". Throughout this paper we use 10 groups, and therefore compare the HL test statistic to a chi-squared distribution with 8 degrees of freedom, but the results hold for more general choices of $G$.

The GHL Test Statistic
The GHL test introduced by Surjanovic et al. (2020) generalizes several GOF tests, allowing them to be applied to other generalized linear models. Tests that are generalized by the GHL test include the HL test (Hosmer and Lemeshow, 1980), the Tsiatis (Tsiatis, 1980) and generalized Tsiatis tests (Canary et al., 2016), and a version of the "full grouped chi-square" from Hosmer and Hjort (2002) with all weights equal to one. The test statistic is a quadratic form like $\widehat{C}_G$, but with important changes to the central matrix. The theory behind this test depends on the residual process, $R_n(u)$, $u \in \mathbb{R}$, defined in Stute and Zhu (2002). In the case of logistic regression,
$$R_n(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \left[ Y_i - \pi(\hat{\beta}_n^\top X_i) \right] \mathbb{1}(\hat{\beta}_n^\top X_i \le u),$$
a cumulative sum of residuals that are ordered according to the size of their corresponding fitted values. This process is transformed into a $G$-dimensional vector, $S_n$, which forms the basis of the HL and GHL test statistics, with
$$S_n = \left( R_n(k_1) - R_n(k_0), \ldots, R_n(k_G) - R_n(k_{G-1}) \right)^\top.$$
In order to approximate the variance of $S_n$, we need to define several matrices. Let
$$(G_n^*)_{gi} = I_i^{(g)}, \qquad V^{*1/2} = \mathrm{diag}\left( \left[ \pi(\beta^\top x_i)(1 - \pi(\beta^\top x_i)) \right]^{1/2} \right),$$
$i = 1, \ldots, n$, and $g = 1, \ldots, G$. Also, define $X^*$ to be the $n \times d$ matrix with $i$th row given by $x_i^\top$, and let $V_n^{*1/2}$ be the same as $V^{*1/2}$, but evaluated at the estimate $\hat{\beta}_n$ of $\beta$. Finally, define
$$\Sigma_n = \frac{1}{n} G_n^* \left( V_n^* - V_n^* X^* (X^{*\top} V_n^* X^*)^{-1} X^{*\top} V_n^* \right) G_n^{*\top} = \frac{1}{n} G_n^* V_n^{*1/2} \left( I_n - V_n^{*1/2} X^* (X^{*\top} V_n^* X^*)^{-1} X^{*\top} V_n^{*1/2} \right) V_n^{*1/2} G_n^{*\top}, \qquad (2)$$
where $I_n$ is the $n \times n$ identity matrix.

For logistic regression models, the GHL test statistic is then $X^2 = S_n^\top \Sigma_n^+ S_n$, where $\Sigma_n^+$ is the Moore-Penrose pseudoinverse of $\Sigma_n$.
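To make the construction of $S_n$ and $\Sigma_n$ concrete, the following Python sketch assembles the quadratic form above for logistic regression (the paper's own simulations use R). The function name is ours, and for brevity the demo evaluates the variance matrices at the true probabilities rather than at MLE fitted values, which a real implementation would use.

```python
import numpy as np

def ghl_statistic(X, y, pi_hat, group_idx, G):
    """GHL statistic S_n' pinv(Sigma_n) S_n, with Sigma_n built as in (2).

    group_idx[i] in {0, ..., G-1} gives the group of observation i."""
    n = len(y)
    Gmat = np.zeros((G, n))                     # (G*_n)_{gi} = I_i^{(g)}
    Gmat[group_idx, np.arange(n)] = 1.0
    S = Gmat @ (y - pi_hat) / np.sqrt(n)        # grouped, scaled residual sums
    v = pi_hat * (1.0 - pi_hat)                 # diagonal of V*_n
    VX = X * v[:, None]                         # V*_n X*
    XtVX_inv = np.linalg.inv(X.T @ VX)          # (X*' V*_n X*)^{-1}
    M = np.diag(v) - VX @ XtVX_inv @ VX.T       # V*_n - V*_n X* (...)^{-1} X*' V*_n
    Sigma = Gmat @ M @ Gmat.T / n
    stat = float(S @ np.linalg.pinv(Sigma) @ S)
    df = int(np.linalg.matrix_rank(Sigma))      # estimate of rank(Sigma)
    return stat, df

# small demo: equal-size groups ordered by probability, true pi in place of MLE
rng = np.random.default_rng(2)
n, d, G = 500, 4, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
pi = 1.0 / (1.0 + np.exp(-X @ np.r_[0.0, np.full(d - 1, 0.3)]))
y = rng.binomial(1, pi).astype(float)
group_idx = np.argsort(np.argsort(pi)) * G // n
stat, df = ghl_statistic(X, y, pi, group_idx, G)
print(stat, df)
```

Note that because the intercept column of $X^*$ lies in the column space removed by the generalized hat matrix, $\Sigma_n$ is singular and its estimated rank, used as the degrees of freedom, falls below $G$; this is why the pseudoinverse is required.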
Under certain conditions given by Surjanovic et al. (2020),
$$S_n^\top \Sigma_n^+ S_n \xrightarrow{d} \chi^2_\nu,$$
where $\nu = \mathrm{rank}(\Sigma)$, with $\Sigma$ a matrix defined in their paper. Since the rank of $\Sigma$ might be unknown, they use the rank of $\Sigma_n$ as an estimate. We use the same approach to estimating $\nu$, empirically finding that the estimated rank of $\Sigma_n$ is often equal to $G - 1$.

For logistic regression, the GHL test statistic coincides with the $X^2_w$ statistic from Hosmer and Hjort (2002) with all weights set equal to 1, when a particular grouping method is used, that is, when $G_n^*$ is the same for all methods. However, use of the GHL test is justified for a wide variety of GLMs and grouping procedures with a proper modification of $\Sigma_n$, as described by Surjanovic et al. (2020). For both the HL and GHL tests, we use the grouping method proposed by Surjanovic et al. (2020), which uses random interval endpoints, $k_g$, so that $\sum_{i=1}^n \hat{\pi}_i (1 - \hat{\pi}_i) I_i^{(g)}$ is roughly equal between groups. Further details of the implementation are provided in the supplementary material of their paper.

It is important to note that $\Sigma_n$ is a non-diagonal matrix that standardizes and accounts for correlations between the grouped residuals in the vector $S_n$. This can be seen from (2), which shows that $\Sigma_n$ contains a generalized hat matrix for logistic regression. In contrast, when written as a quadratic form, the central matrix of the HL test statistic is diagonal and does not account for the number of parameters in the model, $d$, when standardizing the grouped residuals. We expect this standardization to be very important when exact replicates are present, as the binomial responses might be more influential than sparse, individual binary responses.

It is extremely common to fit logistic regression models to data where multiple Bernoulli trials are observed at some or all EVPs, even when the underlying explanatory variables are continuous. As with any fitted model, a test of model fit would be appropriate, and the HL test would likely be a candidate in a typical problem.
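As a concrete illustration of the grouping rule and of $\widehat{C}_G$, here is a minimal Python sketch. The function names are ours, and the deterministic endpoint rule below is a simplified stand-in for the authors' randomized endpoints: it only balances $\sum \hat{\pi}_i(1-\hat{\pi}_i)$ approximately, by splitting the cumulative variance into equal shares.

```python
import numpy as np

def variance_balanced_groups(pi_hat, G=10):
    """Assign observations, ordered by fitted value, to G groups so that the
    within-group sums of pi_hat * (1 - pi_hat) are roughly equal."""
    order = np.argsort(pi_hat)
    w = pi_hat[order] * (1.0 - pi_hat[order])   # estimated Bernoulli variances
    cum = np.cumsum(w) / w.sum()                # cumulative variance share in (0, 1]
    grp_sorted = np.minimum((cum * G).astype(int), G - 1)
    group_idx = np.empty_like(order)
    group_idx[order] = grp_sorted
    return group_idx

def hl_statistic(y, pi_hat, group_idx, G=10):
    """HL statistic: sum over groups of (O_g - E_g)^2 / (n_g pibar_g (1 - pibar_g))."""
    C = 0.0
    for g in range(G):
        idx = group_idx == g
        O, E, n_g = y[idx].sum(), pi_hat[idx].sum(), idx.sum()
        pi_bar = E / n_g
        C += (O - E) ** 2 / (n_g * pi_bar * (1.0 - pi_bar))
    return C

rng = np.random.default_rng(1)
pi_hat = rng.uniform(0.1, 0.9, size=500)
y = rng.binomial(1, pi_hat).astype(float)
grp = variance_balanced_groups(pi_hat, G=10)
C = hl_statistic(y, pi_hat, grp, G=10)
print(C)  # in the HL test this is compared to a chi-square(8) distribution
```

For brevity the demo treats known probabilities as "fitted" values; in practice $\hat{\pi}_i$ would come from an MLE fit, which is what shifts the reference distribution from $\chi^2_G$ toward the $\chi^2_{G-2}$ approximation discussed above.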
It is therefore important to explore how the HL and GHL tests behave with large models when exact or near-replicates are present in the data.

Simulation Study

We compare the performance of the HL and GHL tests by performing a simulation study. Of particular interest is the rejection rate under the null, when the tests are applied to moderately large models that are fit to data with clusters or exact replicates in the covariate space.

In all settings, the true regression model is
$$\mathrm{logit}\left( \mathrm{E}(Y \mid X = x) \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{d-1} x_{d-1}, \qquad (3)$$
with $d$ taking values up to 25. Here, $\beta_0$ represents the intercept term. To produce replicates in the covariate space, $m \le n$ unique EVPs are drawn randomly from a $(d-1)$-dimensional distribution with marginal standard deviation $\sigma = 1$ for each simulation realization. At each EVP, $n/m$ replicate Bernoulli trials are then created, with probabilities determined by (3). In our simulation study, we fix $n = 500$ and select $m \in \{50, 100, 500\}$ so that the number of replicates at each EVP, $n/m$, is 10, 5, or 1, respectively. The slope coefficients are set equal to each other and scaled proportionally to $1/\sqrt{d-1}$, which results in fitted values that rarely fall outside a moderate interval, bounded away from 0 and 1. Similar simulations were run with $n = 100$, using smaller values of $d$ and $m$ than for $n = 500$. However, we focus on results for $n = 500$ because we are then able to increase the number of replicates per EVP, $n/m$, while still maintaining large enough $m$ so that it is possible to create ten viable groupings. In each simulation setting, 10,000 realizations are produced. All simulations are performed using R.

Figure 1 presents plots of the sample mean and variance of the HL and GHL test statistics against the number of variables in the model, separately for each $m$. An analogous plot of the estimated type 1 error rate of the tests against the number of variables is also presented. For the HL test, all three statistics show a clear decreasing pattern with increasing model size when replicates are present, with a sharper decrease when the number of replicates per EVP is larger. Since the estimated variance is not always twice the size of the mean, we can infer that the chi-squared approximation to the null distribution of the HL test statistic is not adequate in finite samples for these data structures. Simulation results with a sample size of $n = 100$ are not displayed, but are quite similar.

From the same figure, we see that the GHL test performs well in the settings considered. The estimated mean and variance of the test statistic stay close to the desired values of $G - 1 = 9$ and $2(G - 1) = 18$. We note that the GHL test can have an inflated type 1 error rate, particularly when it is applied to highly complex models. The models considered here are only moderately large, with $d$ small relative to both $n$ and $m$. If one wishes to use the GHL test to assess the quality of fit of larger models with only a moderate sample size, one should be wary of an inflated type 1 error rate that can become considerably large for complex models. A possible explanation for this is that estimating the off-diagonal elements of the matrix $\Sigma_n$ can potentially introduce a considerable amount of variance into the test statistic in small samples.

Recall from (1) that the asymptotic $\chi^2_{G-2}$ distribution for the HL test proposed by Hosmer and Lemeshow (1980) is based on a sum of chi-squares, where one term has $G - d$ degrees of freedom. We investigated whether maintaining $G = 10$ while increasing $d$ contributes to the phenomena we have observed. We set $G = 26$ and performed a similar simulation study. The adverse behavior of the HL statistic still persists despite this modification.

We also investigated the effect of near-replicate clustering in the covariate space. We fixed $n$ and $m$ as in the settings above, but added a small amount of random noise with marginal variance $\sigma_e^2$ to each replicate within the $m$ sampled vectors. The amount of clustering was controlled by varying $\sigma_e$, as shown in Figure 2. As expected, increasing $\sigma_e$ reduces the severity of the decreasing mean, variance and type 1 error rate for the HL test statistic. However, the pattern remains evident while $\sigma_e / \sigma$ remains small.

Discussion
The original HL test, developed by Hosmer and Lemeshow (1980), is a commonly-used test for logistic regression among researchers in biostatistics and the health sciences. Although its performance is well documented (Lemeshow and Hosmer, 1982; Hosmer et al., 1997; Hosmer and Hjort, 2002), we have identified an issue that does not seem to be well known. For moderately large logistic regression models fit to data with clusters or exact replicates in the covariate space, the null distribution of the HL test statistic can fail to be adequately represented by a chi-squared distribution in finite samples. Using the original chi-squared distribution with $G - 2$ degrees of freedom can then result in a considerably reduced type 1 error rate and, consequently, reduced power.

The GHL test, through its central matrix $\Sigma_n$, makes a form of correction to the HL test statistic by standardizing and by accounting for correlations between the grouped residuals that comprise the quadratic form in both the HL and GHL tests. This is evident from (2), which shows that $\Sigma_n$ contains the generalized hat matrix subtracted from an identity matrix. To empirically assess the behaviour of $\Sigma_n$, we varied $\sigma_e$ in the setup with replicates and added noise, described at the end of the previous section. For large $d$ and moderate $n$, both fixed, we found that the diagonal elements of $\Sigma_n$ tend to shrink, on average, as $\sigma_e$ decreases. In contrast, the elements of the HL central matrix remain roughly constant. Therefore, the GHL statistic seems to adapt to clustering or replicates in $X$, whereas the HL test statistic does not.

In logistic regression with exact replicates, grouped binary responses can be viewed as binomial responses that can be more influential.
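To make the replicate structure concrete, the following Python sketch generates data with $m$ unique EVPs each replicated $n/m$ times, optionally jittered into near-replicates, and then aggregates the binary responses at each EVP into binomial counts. The function names and the coefficient values (including the $0.5/\sqrt{d-1}$ scaling) are our own illustrative choices, not the paper's exact simulation settings.

```python
import numpy as np

def make_replicated_data(n, m, d, sigma_e=0.0, seed=0):
    """Draw m unique EVPs, replicate each n/m times, optionally add
    near-replicate noise with marginal sd sigma_e, and generate Bernoulli Y."""
    rng = np.random.default_rng(seed)
    evps = rng.normal(size=(m, d - 1))                   # covariate sd: sigma = 1
    Z = np.repeat(evps, n // m, axis=0)                  # n/m replicates per EVP
    if sigma_e > 0.0:
        Z = Z + rng.normal(scale=sigma_e, size=Z.shape)  # sigma_e = 0: exact replicates
    X = np.column_stack([np.ones(n), Z])
    beta = np.r_[0.0, np.full(d - 1, 0.5 / np.sqrt(d - 1))]  # hypothetical values
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y

def evp_summary(X, y):
    """Aggregate binary responses sharing an EVP into binomial (trials, successes)
    and report the average number of replicates per EVP, n/m."""
    uniq, inverse = np.unique(X, axis=0, return_inverse=True)
    trials = np.bincount(inverse)
    successes = np.bincount(inverse, weights=y).astype(int)
    return uniq, trials, successes, X.shape[0] / len(uniq)

X, y = make_replicated_data(n=500, m=50, d=11, sigma_e=0.0)
_, trials, successes, ratio = evp_summary(X, y)
print(ratio)  # n/m = 10: "considerable clustering" by the n/m >= 5 rule of thumb
```

With `sigma_e > 0` the rows become near-replicates, so the unique-row count jumps to $n$ and clustering must instead be judged with a clustering algorithm or low-dimensional plots, as suggested below.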
In this scenario, as $d$ increases for a fixed sample size $n$, the distribution of the regular HL test statistic diverges from a single chi-squared distribution, suggesting that the standardization offered by the central GHL matrix becomes increasingly important.

The real-life implications of the reduced type 1 error rate and power of the regular HL test are that in models with a considerable number of variables, provided that the data contain clusters or exact replicates, the HL test has limited ability to detect model misspecification. Failure to detect model misspecification can result in retaining an inadequate model, which is arguably worse than rejecting an adequate model due to an inflated type 1 error rate, particularly when logistic regression models are used to estimate probabilities from which life-and-death decisions might be made.

Our advice for choosing between the two GOF tests is displayed as a simple decision tree in Figure 3. The advice should be interpreted for $G = 10$ groups, the most commonly used number of groups. With large samples, provided that $m$ is sufficiently large compared to $d$, it should generally be safe to use the GHL test. Our simulations explored models with $d \le 25$, so some caution should be exercised if the GHL test is to be used with larger models. For small or moderate samples, such as when $n = 100$ or 500, it is important to identify whether there are clusters or exact replicates in the covariate space. One can compute the number of unique EVPs, $m$, and compare this number to the sample size, $n$. If $n/m \ge 5$, say, then there is a considerable amount of "clustering". For data without exact replicates, clusters can still be detected using one of many existing clustering algorithms, and the average distances between and within clusters can be compared. Informal plots of the $x_i$ projected onto a two- or three-dimensional space can also be used as an aid in this process.

If there is no evidence of clustering or replicates, the HL test should not be disturbed by this phenomenon. On the other hand, if there is a noticeable amount of clustering, and the regression model is not too large relative to both $n$ and $m$ (with $m$ here also representing the number of estimated clusters), then one can use the GHL test. In the worst-case scenario with a small sample size, clustering, and a large regression model, one can use both tests as an informal aid in assessing the quality of the fit of the model, recognizing that GHL may overstate the lack of fit, while HL may understate it. If the two tests agree, then this suggests that the decision is not influenced by the properties of the tests. When they disagree, conclusions should be drawn more tentatively.

Acknowledgements
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number RGPIN-2018-04868]. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [numéro de référence RGPIN-2018-04868].
References
G. Bertolini, R. D'Amico, D. Nardi, A. Tinazzi, and G. Apolone. One model, several results: the paradox of the Hosmer-Lemeshow goodness-of-fit test for the logistic regression model. Journal of Epidemiology and Biostatistics, 5(4):251-253, 2000.

Christopher R. Bilder and Thomas M. Loughin. Analysis of Categorical Data with R. Chapman and Hall/CRC, 2014.

Jana D. Canary, Leigh Blizzard, Ronald P. Barry, David W. Hosmer, and Stephen J. Quinn. Summary goodness-of-fit statistics for binary generalized linear models with noncanonical link functions. Biometrical Journal, 58(3):674-690, 2016.

David W. Hosmer and Nils L. Hjort. Goodness-of-fit processes for logistic regression: simulation results. Statistics in Medicine, 21(18):2723-2738, 2002.

David W. Hosmer and Stanley Lemeshow. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10):1043-1069, 1980.

David W. Hosmer, Trina Hosmer, Saskia Le Cessie, and Stanley Lemeshow. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16(9):965-980, 1997.

Stanley Lemeshow and David W. Hosmer. A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology, 115(1):92-106, 1982.

Stanley Lemeshow, Daniel Teres, Jill Spitz Avrunin, and Harris Pastides. Predicting the outcome of intensive care unit patients. Journal of the American Statistical Association, 83(402):348-356, 1988.

David S. Moore and Marcus C. Spruill. Unified large-sample theory of general chi-squared statistics for tests of fit. The Annals of Statistics, pages 599-616, 1975.

Winfried Stute and Li-Xing Zhu. Model checks for generalized linear models. Scandinavian Journal of Statistics, 29(3):535-545, 2002.

Nikola Surjanovic, Richard Lockhart, and Thomas M. Loughin. A generalized Hosmer-Lemeshow goodness-of-fit test for a family of generalized linear models. arXiv preprint arXiv:2007.11049, 2020.

Anastasios A. Tsiatis. A note on a goodness-of-fit test for the logistic regression model. Biometrika, 67(1):250-251, 1980.

Figure 1: Null simulation results. Solid red lines are approximate 95% CIs. Intervals are omitted for the type 1 error rate plot, but can be approximated by adding and subtracting 0.005 from the estimated rejection rate.

Figure 2: Example of clustering in the covariate space with two predictor variables, $X_1$ and $X_2$. Top left to bottom right: $\sigma_e = 0$, 0.001, 0.01, and 0.1, with $\sigma = 1$.

Figure 3: Decision tree offering guidance on how to choose between the two GOF tests when $G = 10$. In brief: with no replicates or clusters, use the HL test; with very large $n$, the GHL test is generally safe; with clustering and a small or moderate model, use the GHL test or both tests; with clustering and a large model in a small sample, try both tests, but proceed with caution.