Improving the Hosmer-Lemeshow Goodness-of-Fit Test in Large Models with Replicated Trials
Nikola Surjanovic and Thomas M. Loughin
Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
Summary
The Hosmer-Lemeshow (HL) test is a commonly used global goodness-of-fit (GOF) test that assesses the quality of the overall fit of a logistic regression model. In this paper, we give results from simulations showing that the type 1 error rate (and hence power) of the HL test decreases as model complexity grows, provided that the sample size remains fixed and binary replicates are present in the data. We demonstrate that the generalized version of the HL test by Surjanovic et al. (2020) can offer some protection against this power loss. We conclude with a brief discussion explaining the behaviour of the HL test, along with some guidance on how to choose between the two tests.
Key Words: chi-squared test, generalized linear model, goodness-of-fit test, Hosmer-Lemeshow test, logistic regression
Introduction

Logistic regression models have gained a considerable amount of attention as a tool for estimating the probability of success of a binary response variable, conditioning on several explanatory variables. Researchers in health and medicine have used these models in a wide range of applications. One of many examples includes estimating the probability of hospital mortality for patients in intensive care units as a function of various covariates (Lemeshow et al., 1988).

Regardless of the application, it is always desirable to construct a model that fits the observed data well. One of several ways of assessing the quality of the fit of a model is with goodness-of-fit (GOF) tests (Bilder and Loughin, 2014). In general, GOF tests examine a null hypothesis that the structure of the fitted model is correct. They may additionally identify specific alternative models or deviations to test against, but this is not required. Global (omnibus) GOF tests are useful tools that allow one to assess the validity of a model without restricting the alternative hypothesis to a specific type of deviation.

An example of a well-known global GOF test for logistic regression is the Hosmer-Lemeshow (HL) test, introduced by Hosmer and Lemeshow (1980). The test statistic is a Pearson statistic that compares observed and expected event counts from data grouped according to ordered fitted values from the model. The decision rule for the test is based on comparing the test statistic to a chi-squared distribution with degrees of freedom that depend on the number of groups used to create the test statistic. The HL test is relatively easy to implement in statistical software, and since its creation, the HL test has become quite popular, particularly in fields such as biostatistics and the health sciences.

Despite its popularity, the HL test is known to have some drawbacks.
In both experimental and observational studies, it is possible to have data for which binary observations have the same explanatory variable patterns (EVPs). In this case, responses can be aggregated into binomial counts and trials. When there are observations with the same EVP present in the data, it is possible to obtain many different p-values depending on how the data are sorted (Bertolini et al., 2000). A related disadvantage of HL-type tests is that the test statistic depends on the way in which groups are formed, as remarked upon by Hosmer et al. (1997).

Notation

In what follows, we use the notation of Surjanovic et al. (2020). We let $Y \in \{0, 1\}$ denote a binary response variable associated with a $d$-dimensional covariate vector, $X \in \mathbb{R}^d$, where the first element of $X$ is equal to one. The pairs $(X_i, Y_i)$, $i = 1, \ldots, n$, denote a random sample, with each pair being distributed according to the joint distribution of $(X, Y)$. The observed values of $(X_i, Y_i)$ are denoted using lowercase letters as $(x_i, y_i)$.

In a logistic regression model with binary responses, one assumes that
$$\mathrm{E}(Y \mid X = x) = \pi(\beta^\top x) = \frac{\exp(\beta^\top x)}{1 + \exp(\beta^\top x)},$$
for some $\beta \in \mathbb{R}^d$. The likelihood function is
$$L(\beta) = \prod_{i=1}^n \pi(\beta^\top x_i)^{y_i} \left(1 - \pi(\beta^\top x_i)\right)^{1 - y_i}.$$
From this likelihood, a maximum likelihood estimate (MLE), $\hat{\beta}_n$, of $\beta$ is obtained.

The HL Test Statistic
To compute the HL test statistic, one partitions the observed data, $(x_i, y_i)$, into $G$ groups. Typically, the groups are created so that fitted values are similar within each group and the groups are of approximately equal size. To achieve this, a partition is defined by a collection of $G + 1$ interval endpoints, $-\infty = k_0 < k_1 < \cdots < k_{G-1} < k_G = \infty$. The $k_g$ often depend on the data, usually being set equal to the logits of equally-spaced quantiles of the fitted values, $\hat{\pi}_i = \pi(\hat{\beta}_n^\top x_i)$. We define $I_i^{(g)} = \mathbb{1}(k_{g-1} < \hat{\beta}_n^\top x_i \le k_g)$, $O_g = \sum_{i=1}^n y_i I_i^{(g)}$, $E_g = \sum_{i=1}^n \hat{\pi}_i I_i^{(g)}$, $n_g = \sum_{i=1}^n I_i^{(g)}$, and $\bar{\pi}_g = E_g / n_g$, where $\mathbb{1}(A)$ is the indicator function on a set $A$. With this notation, the number of observations in the $g$th group is represented by $n_g$, and $\bar{\pi}_g$ denotes the mean of the fitted values in this group. The HL test statistic is a quadratic form that is commonly written in summation form as
$$\widehat{C}_G = \sum_{g=1}^G \frac{(O_g - E_g)^2}{n_g \bar{\pi}_g (1 - \bar{\pi}_g)}.$$

When $G > d$, Hosmer and Lemeshow (1980) find that the HL test statistic is asymptotically distributed as a weighted sum of chi-squared random variables under the null hypothesis, after checking certain conditions of Theorem 5.1 in Moore and Spruill (1975). Precisely,
$$\widehat{C}_G \xrightarrow{d} \chi^2_{G-d} + \sum_{j=1}^{d} \lambda_j \chi^2_{1j}, \qquad (1)$$
with each $\chi^2_{1j}$ being a chi-squared random variable with 1 degree of freedom, and each $\lambda_j$ an eigenvalue of a certain matrix that depends on both the distribution of $X$ and $\beta_0$, the true value of $\beta$ under the null hypothesis. Hosmer and Lemeshow (1980) conclude through simulations that the right side of (1) is well approximated by a $\chi^2_{G-2}$ distribution in various settings.

The HL test statistic and the corresponding p-value both depend on the chosen number of groups, $G$. Typically, $G = 10$ groups are used, so that observations are partitioned into groups that are associated with "deciles of risk". Throughout this paper we use 10 groups, and therefore compare the HL test statistic to a chi-squared distribution with 8 degrees of freedom, but the results hold for more general choices of $G$.

The GHL Test Statistic
The GHL test introduced by Surjanovic et al. (2020) generalizes several GOF tests, allowing them to be applied to other generalized linear models. Tests that are generalized by the GHL test include the HL test (Hosmer and Lemeshow, 1980), the Tsiatis (Tsiatis, 1980) and generalized Tsiatis tests (Canary et al., 2016), and a version of the "full grouped chi-square" from Hosmer and Hjort (2002) with all weights equal to one. The test statistic is a quadratic form like $\widehat{C}_G$, but with important changes to the central matrix. The theory behind this test depends on the residual process, $R_n(u)$, $u \in \mathbb{R}$, defined in Stute and Zhu (2002). In the case of logistic regression,
$$R_n(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \left[ Y_i - \pi(\hat{\beta}_n^\top X_i) \right] \mathbb{1}(\hat{\beta}_n^\top X_i \le u),$$
a cumulative sum of residuals that are ordered according to the size of their corresponding fitted values. This process is transformed into a $G$-dimensional vector, $S_n$, which forms the basis of the HL and GHL test statistics, with
$$S_n = \left( R_n(k_1) - R_n(k_0), \ldots, R_n(k_G) - R_n(k_{G-1}) \right)^\top.$$
In order to approximate the variance of $S_n$, we need to define several matrices. Let
$$(G_n^*)_{gi} = I_i^{(g)}, \qquad V^{*1/2} = \mathrm{diag}\left( \left[ \pi(\beta^\top x_i)(1 - \pi(\beta^\top x_i)) \right]^{1/2} \right),$$
$i = 1, \ldots, n$, and $g = 1, \ldots, G$. Also, define $X^*$ to be the $n \times d$ matrix with $i$th row given by $x_i^\top$, and let $V_n^{*1/2}$ be the same as $V^{*1/2}$, but evaluated at the estimate $\hat{\beta}_n$ of $\beta$. Finally, define
$$\Sigma_n = \frac{1}{n} G_n^* \left( V_n^* - V_n^* X^* (X^{*\top} V_n^* X^*)^{-1} X^{*\top} V_n^* \right) G_n^{*\top} = \frac{1}{n} G_n^* V_n^{*1/2} \left( I_n - V_n^{*1/2} X^* (X^{*\top} V_n^* X^*)^{-1} X^{*\top} V_n^{*1/2} \right) V_n^{*1/2} G_n^{*\top}, \qquad (2)$$
where $I_n$ is the $n \times n$ identity matrix.

For logistic regression models, the GHL test statistic is then $X^2 = S_n^\top \Sigma_n^+ S_n$, where $\Sigma_n^+$ is the Moore-Penrose pseudoinverse of $\Sigma_n$.
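To make the construction of $S_n$ and $\Sigma_n$ concrete, the following Python sketch assembles the quadratic form above for logistic regression (the paper's own simulations use R). The function name is ours, and for brevity the demo evaluates the variance matrices at the true probabilities rather than at MLE fitted values, which a real implementation would use.

```python
import numpy as np

def ghl_statistic(X, y, pi_hat, group_idx, G):
    """GHL statistic S_n' pinv(Sigma_n) S_n, with Sigma_n built as in (2).

    group_idx[i] in {0, ..., G-1} gives the group of observation i."""
    n = len(y)
    Gmat = np.zeros((G, n))                     # (G*_n)_{gi} = I_i^{(g)}
    Gmat[group_idx, np.arange(n)] = 1.0
    S = Gmat @ (y - pi_hat) / np.sqrt(n)        # grouped, scaled residual sums
    v = pi_hat * (1.0 - pi_hat)                 # diagonal of V*_n
    VX = X * v[:, None]                         # V*_n X*
    XtVX_inv = np.linalg.inv(X.T @ VX)          # (X*' V*_n X*)^{-1}
    M = np.diag(v) - VX @ XtVX_inv @ VX.T       # V*_n - V*_n X* (...)^{-1} X*' V*_n
    Sigma = Gmat @ M @ Gmat.T / n
    stat = float(S @ np.linalg.pinv(Sigma) @ S)
    df = int(np.linalg.matrix_rank(Sigma))      # estimate of rank(Sigma)
    return stat, df

# small demo: equal-size groups ordered by probability, true pi in place of MLE
rng = np.random.default_rng(2)
n, d, G = 500, 4, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
pi = 1.0 / (1.0 + np.exp(-X @ np.r_[0.0, np.full(d - 1, 0.3)]))
y = rng.binomial(1, pi).astype(float)
group_idx = np.argsort(np.argsort(pi)) * G // n
stat, df = ghl_statistic(X, y, pi, group_idx, G)
print(stat, df)
```

Note that because the intercept column of $X^*$ lies in the column space removed by the generalized hat matrix, $\Sigma_n$ is singular and its estimated rank, used as the degrees of freedom, falls below $G$; this is why the pseudoinverse is required.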
Under certain conditions given by Surjanovic et al. (2020),
$$S_n^\top \Sigma_n^+ S_n \xrightarrow{d} \chi^2_\nu,$$
where $\nu = \mathrm{rank}(\Sigma)$, with $\Sigma$ a matrix defined in their paper. Since the rank of $\Sigma$ might be unknown, they use the rank of $\Sigma_n$ as an estimate. We use the same approach to estimating $\nu$, empirically finding that the estimated rank of $\Sigma_n$ is often equal to $G - 1$.

For logistic regression, the GHL test statistic coincides with the $X^2_w$ statistic from Hosmer and Hjort (2002) with all weights set equal to 1, when a particular grouping method is used, that is, when $G_n^*$ is the same for all methods. However, use of the GHL test is justified for a wide variety of GLMs and grouping procedures with a proper modification of $\Sigma_n$, as described by Surjanovic et al. (2020). For both the HL and GHL tests, we use the grouping method proposed by Surjanovic et al. (2020), which uses random interval endpoints, $k_g$, so that $\sum_{i=1}^n \hat{\pi}_i (1 - \hat{\pi}_i) I_i^{(g)}$ is roughly equal between groups. Further details of the implementation are provided in the supplementary material of their paper.

It is important to note that $\Sigma_n$ is a non-diagonal matrix that standardizes and accounts for correlations between the grouped residuals in the vector $S_n$. This can be seen from (2), which shows that $\Sigma_n$ contains a generalized hat matrix for logistic regression. In contrast, when written as a quadratic form, the central matrix of the HL test statistic is diagonal and does not account for the number of parameters in the model, $d$, when standardizing the grouped residuals. We expect this standardization to be very important when exact replicates are present, as the binomial responses might be more influential than sparse, individual binary responses.

It is extremely common to fit logistic regression models to data where multiple Bernoulli trials are observed at some or all EVPs, even when the underlying explanatory variables are continuous. As with any fitted model, a test of model fit would be appropriate, and the HL test would likely be a candidate in a typical problem.
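As a concrete illustration of the grouping rule and of $\widehat{C}_G$, here is a minimal Python sketch. The function names are ours, and the deterministic endpoint rule below is a simplified stand-in for the authors' randomized endpoints: it only balances $\sum \hat{\pi}_i(1-\hat{\pi}_i)$ approximately, by splitting the cumulative variance into equal shares.

```python
import numpy as np

def variance_balanced_groups(pi_hat, G=10):
    """Assign observations, ordered by fitted value, to G groups so that the
    within-group sums of pi_hat * (1 - pi_hat) are roughly equal."""
    order = np.argsort(pi_hat)
    w = pi_hat[order] * (1.0 - pi_hat[order])   # estimated Bernoulli variances
    cum = np.cumsum(w) / w.sum()                # cumulative variance share in (0, 1]
    grp_sorted = np.minimum((cum * G).astype(int), G - 1)
    group_idx = np.empty_like(order)
    group_idx[order] = grp_sorted
    return group_idx

def hl_statistic(y, pi_hat, group_idx, G=10):
    """HL statistic: sum over groups of (O_g - E_g)^2 / (n_g pibar_g (1 - pibar_g))."""
    C = 0.0
    for g in range(G):
        idx = group_idx == g
        O, E, n_g = y[idx].sum(), pi_hat[idx].sum(), idx.sum()
        pi_bar = E / n_g
        C += (O - E) ** 2 / (n_g * pi_bar * (1.0 - pi_bar))
    return C

rng = np.random.default_rng(1)
pi_hat = rng.uniform(0.1, 0.9, size=500)
y = rng.binomial(1, pi_hat).astype(float)
grp = variance_balanced_groups(pi_hat, G=10)
C = hl_statistic(y, pi_hat, grp, G=10)
print(C)  # in the HL test this is compared to a chi-square(8) distribution
```

For brevity the demo treats known probabilities as "fitted" values; in practice $\hat{\pi}_i$ would come from an MLE fit, which is what shifts the reference distribution from $\chi^2_G$ toward the $\chi^2_{G-2}$ approximation discussed above.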
It is therefore important to explore how the HL and GHL tests behave with large models when exact or near-replicates are present in the data.

Simulation Study

We compare the performance of the HL and GHL tests by performing a simulation study. Of particular interest is the rejection rate under the null, when the tests are applied to moderately large models that are fit to data with clusters or exact replicates in the covariate space.

In all settings, the true regression model is
$$\mathrm{logit}\left( \mathrm{E}(Y \mid X = x) \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{d-1} x_{d-1}, \qquad (3)$$
with $d$ taking values up to 25. Here, $\beta_0$ represents the intercept term. To produce replicates in the covariate space, $m \le n$ unique EVPs are drawn randomly from a $(d-1)$-dimensional distribution with marginal standard deviation $\sigma = 1$ for each simulation realization. At each EVP, $n/m$ replicate Bernoulli trials are then created, with probabilities determined by (3). In our simulation study, we fix $n = 500$ and select $m \in \{50, 100, 500\}$ so that the number of replicates at each EVP, $n/m$, is 10, 5, or 1, respectively. The slope coefficients are set equal to each other and scaled proportionally to $1/\sqrt{d-1}$, which results in fitted values that rarely fall outside a moderate interval, bounded away from 0 and 1. Similar simulations were run with $n = 100$, using smaller values of $d$ and $m$ than for $n = 500$. However, we focus on results for $n = 500$ because we are then able to increase the number of replicates per EVP, $n/m$, while still maintaining large enough $m$ so that it is possible to create ten viable groupings. In each simulation setting, 10,000 realizations are produced. All simulations are performed using R.

Figure 1 presents plots of the sample mean and variance of the HL and GHL test statistics against the number of variables in the model, separately for each $m$. An analogous plot of the estimated type 1 error rate of the tests against the number of variables is also presented. For the HL test, all three statistics show a clear decreasing pattern with increasing model size when replicates are present, with a sharper decrease when the number of replicates per EVP is larger. Since the estimated variance is not always twice the size of the mean, we can infer that the chi-squared approximation to the null distribution of the HL test statistic is not adequate in finite samples for these data structures. Simulation results with a sample size of $n = 100$ are not displayed, but are quite similar.

From the same figure, we see that the GHL test performs well in the settings considered. The estimated mean and variance of the test statistic stay close to the desired values of $G - 1 = 9$ and $2(G - 1) = 18$. We note that the GHL test can have an inflated type 1 error rate, particularly when it is applied to highly complex models. The models considered here are only moderately large, with $d$ small relative to both $n$ and $m$. If one wishes to use the GHL test to assess the quality of fit of larger models with only a moderate sample size, one should be wary of an inflated type 1 error rate that can become considerably large for complex models. A possible explanation for this is that estimating the off-diagonal elements of the matrix $\Sigma_n$ can potentially introduce a considerable amount of variance into the test statistic in small samples.

Recall from (1) that the asymptotic $\chi^2_{G-2}$ distribution for the HL test proposed by Hosmer and Lemeshow (1980) is based on a sum of chi-squares, where one term has $G - d$ degrees of freedom. We investigated whether maintaining $G = 10$ while increasing $d$ contributes to the phenomena we have observed. We set $G = 26$ and performed a similar simulation study. The adverse behavior of the HL statistic still persists despite this modification.

We also investigated the effect of near-replicate clustering in the covariate space. We fixed $n$ and $m$ as in the settings above, but added a small amount of random noise with marginal variance $\sigma_e^2$ to each replicate within the $m$ sampled vectors. The amount of clustering was controlled by varying $\sigma_e$, as shown in Figure 2. As expected, increasing $\sigma_e$ reduces the severity of the decreasing mean, variance and type 1 error rate for the HL test statistic. However, the pattern remains evident while $\sigma_e / \sigma$ remains small.

Discussion
The original HL test, developed by Hosmer and Lemeshow (1980), is a commonly-used test for logistic regression among researchers in biostatistics and the health sciences. Although its performance is well documented (Lemeshow and Hosmer, 1982; Hosmer et al., 1997; Hosmer and Hjort, 2002), we have identified an issue that does not seem to be well known. For moderately large logistic regression models fit to data with clusters or exact replicates in the covariate space, the null distribution of the HL test statistic can fail to be adequately represented by a chi-squared distribution in finite samples. Using the original chi-squared distribution with $G - 2$ degrees of freedom can then result in a considerably reduced type 1 error rate and, consequently, reduced power.

The GHL test, through its central matrix $\Sigma_n$, makes a form of correction to the HL test statistic by standardizing and by accounting for correlations between the grouped residuals that comprise the quadratic form in both the HL and GHL tests. This is evident from (2), which shows that $\Sigma_n$ contains the generalized hat matrix subtracted from an identity matrix. To empirically assess the behaviour of $\Sigma_n$, we varied $\sigma_e$ in the setup with replicates and added noise, described at the end of the previous section. For large $d$ and moderate $n$, both fixed, we found that the diagonal elements of $\Sigma_n$ tend to shrink, on average, as $\sigma_e$ decreases. In contrast, the elements of the HL central matrix remain roughly constant. Therefore, the GHL statistic seems to adapt to clustering or replicates in $X$, whereas the HL test statistic does not.

In logistic regression with exact replicates, grouped binary responses can be viewed as binomial responses that can be more influential.
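To make the replicate structure concrete, the following Python sketch generates data with $m$ unique EVPs each replicated $n/m$ times, optionally jittered into near-replicates, and then aggregates the binary responses at each EVP into binomial counts. The function names and the coefficient values (including the $0.5/\sqrt{d-1}$ scaling) are our own illustrative choices, not the paper's exact simulation settings.

```python
import numpy as np

def make_replicated_data(n, m, d, sigma_e=0.0, seed=0):
    """Draw m unique EVPs, replicate each n/m times, optionally add
    near-replicate noise with marginal sd sigma_e, and generate Bernoulli Y."""
    rng = np.random.default_rng(seed)
    evps = rng.normal(size=(m, d - 1))                   # covariate sd: sigma = 1
    Z = np.repeat(evps, n // m, axis=0)                  # n/m replicates per EVP
    if sigma_e > 0.0:
        Z = Z + rng.normal(scale=sigma_e, size=Z.shape)  # sigma_e = 0: exact replicates
    X = np.column_stack([np.ones(n), Z])
    beta = np.r_[0.0, np.full(d - 1, 0.5 / np.sqrt(d - 1))]  # hypothetical values
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
    return X, y

def evp_summary(X, y):
    """Aggregate binary responses sharing an EVP into binomial (trials, successes)
    and report the average number of replicates per EVP, n/m."""
    uniq, inverse = np.unique(X, axis=0, return_inverse=True)
    trials = np.bincount(inverse)
    successes = np.bincount(inverse, weights=y).astype(int)
    return uniq, trials, successes, X.shape[0] / len(uniq)

X, y = make_replicated_data(n=500, m=50, d=11, sigma_e=0.0)
_, trials, successes, ratio = evp_summary(X, y)
print(ratio)  # n/m = 10: "considerable clustering" by the n/m >= 5 rule of thumb
```

With `sigma_e > 0` the rows become near-replicates, so the unique-row count jumps to $n$ and clustering must instead be judged with a clustering algorithm or low-dimensional plots, as suggested below.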
In this scenario, as $d$ increases for a fixed sample size $n$, the distribution of the regular HL test statistic diverges from a single chi-squared distribution, suggesting that the standardization offered by the central GHL matrix becomes increasingly important.

The real-life implications of the reduced type 1 error rate and power of the regular HL test are that in models with a considerable number of variables, provided that the data contain clusters or exact replicates, the HL test has limited ability to detect model misspecification. Failure to detect model misspecification can result in retaining an inadequate model, which is arguably worse than rejecting an adequate model due to an inflated type 1 error rate, particularly when logistic regression models are used to estimate probabilities from which life-and-death decisions might be made.

Our advice for choosing between the two GOF tests is displayed as a simple decision tree in Figure 3. The advice should be interpreted for $G = 10$ groups, the most commonly used number of groups. With large samples, provided that $m$ is sufficiently large compared to $d$, it should generally be safe to use the GHL test. Our simulations explored models with $d \le 25$, so some caution should be exercised if the GHL test is to be used with larger models. For small or moderate samples, such as when $n = 100$ or 500, it is important to identify whether there are clusters or exact replicates in the covariate space. One can compute the number of unique EVPs, $m$, and compare this number to the sample size, $n$. If $n/m \ge 5$, say, then there is a considerable amount of "clustering". For data without exact replicates, clusters can still be detected using one of many existing clustering algorithms, and the average distances between and within clusters can be compared. Informal plots of the $x_i$ projected onto a two- or three-dimensional space can also be used as an aid in this process.

If there is no evidence of clustering or replicates, the HL test should not be disturbed by this phenomenon. On the other hand, if there is a noticeable amount of clustering, and the regression model is not too large relative to both $n$ and $m$ (with $m$ here also representing the number of estimated clusters), then one can use the GHL test. In the worst-case scenario with a small sample size, clustering, and a large regression model, one can use both tests as an informal aid in assessing the quality of the fit of the model, recognizing that GHL may overstate the lack of fit, while HL may understate it. If the two tests agree, then this suggests that the decision is not influenced by the properties of the tests. When they disagree, conclusions should be drawn more tentatively.

Acknowledgements
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number RGPIN-2018-04868]. Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), [numéro de référence RGPIN-2018-04868].
References
G. Bertolini, R. D'Amico, D. Nardi, A. Tinazzi, and G. Apolone. One model, several results: the paradox of the Hosmer-Lemeshow goodness-of-fit test for the logistic regression model. Journal of Epidemiology and Biostatistics, 5(4):251-253, 2000.

Christopher R. Bilder and Thomas M. Loughin. Analysis of Categorical Data with R. Chapman and Hall/CRC, 2014.

Jana D. Canary, Leigh Blizzard, Ronald P. Barry, David W. Hosmer, and Stephen J. Quinn. Summary goodness-of-fit statistics for binary generalized linear models with noncanonical link functions. Biometrical Journal, 58(3):674-690, 2016.

David W. Hosmer and Nils L. Hjort. Goodness-of-fit processes for logistic regression: simulation results. Statistics in Medicine, 21(18):2723-2738, 2002.

David W. Hosmer and Stanley Lemeshow. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10):1043-1069, 1980.

David W. Hosmer, Trina Hosmer, Saskia Le Cessie, and Stanley Lemeshow. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16(9):965-980, 1997.

Stanley Lemeshow and David W. Hosmer. A review of goodness of fit statistics for use in the development of logistic regression models. American Journal of Epidemiology, 115(1):92-106, 1982.

Stanley Lemeshow, Daniel Teres, Jill Spitz Avrunin, and Harris Pastides. Predicting the outcome of intensive care unit patients. Journal of the American Statistical Association, 83(402):348-356, 1988.

David S. Moore and Marcus C. Spruill. Unified large-sample theory of general chi-squared statistics for tests of fit. The Annals of Statistics, pages 599-616, 1975.

Winfried Stute and Li-Xing Zhu. Model checks for generalized linear models. Scandinavian Journal of Statistics, 29(3):535-545, 2002.

Nikola Surjanovic, Richard Lockhart, and Thomas M. Loughin. A generalized Hosmer-Lemeshow goodness-of-fit test for a family of generalized linear models. arXiv preprint arXiv:2007.11049, 2020.

Anastasios A. Tsiatis. A note on a goodness-of-fit test for the logistic regression model. Biometrika, 67(1):250-251, 1980.

Figure 1: Null simulation results. Solid red lines are approximate 95% CIs. Intervals are omitted for the type 1 error rate plot, but can be approximated by adding and subtracting 0.005 from the estimated rejection rate.

Figure 2: Example of clustering in the covariate space with two predictor variables, $X_1$ and $X_2$. Top left to bottom right: $\sigma_e = 0$, 0.001, 0.01, and 0.1, with $\sigma = 1$.

Figure 3: Decision tree offering guidance on how to choose between the two GOF tests when $G = 10$. In brief: with no replicates or clusters, use the HL test; with very large $n$, the GHL test is generally safe; with clustering and a small or moderate model, use the GHL test or both tests; with clustering and a large model in a small sample, try both tests, but proceed with caution.