Firth's logistic regression with rare events: accurate effect estimates AND predictions?
Rainer Puhr, Georg Heinze, Mariana Nold, Lara Lusa, Angelika Geroldinger
May 12, 2016
Firth-type logistic regression has become a standard approach for the analysis of binary outcomes with small samples. Whereas it reduces the bias in maximum likelihood estimates of coefficients, bias towards 1/2 is introduced in the predicted probabilities. The stronger the imbalance of the outcome, the more severe is the bias in the predicted probabilities. We propose two simple modifications of Firth-type logistic regression resulting in unbiased predicted probabilities. The first corrects the predicted probabilities by a post-hoc adjustment of the intercept. The other is based on an alternative formulation of Firth-type estimation as an iterative data augmentation procedure. Our suggested modification consists in introducing an indicator variable which distinguishes between original and pseudo-observations in the augmented data. In a comprehensive simulation study these approaches are compared to other attempts to improve predictions based on Firth-type penalization and to other published penalization strategies intended for routine use. For instance, we consider a recently suggested compromise between maximum likelihood and Firth-type logistic regression. Simulation results are scrutinized both with regard to prediction and regression coefficients. Finally, the methods considered are illustrated and compared for a study on arterial closure devices in minimally invasive cardiac surgery.

Keywords: bias reduction; data augmentation; Jeffreys prior; penalized likelihood; sparse data
In logistic regression, Firth-type penalization [1] has gained increasing popularity as a method to reduce the small-sample bias of maximum likelihood (ML) coefficients. Penalizing the likelihood function by Jeffreys invariant prior not only removes the first-order term in the asymptotic bias expansion of ML estimates but also allows to compute reliable, finite estimates of coefficients in the case of separation, where ML estimation fails, see [2]. However, reducing the bias in the estimates of coefficients comes at the cost of introducing bias in the predicted probabilities, as will be illustrated later: whereas ML gives an average predicted probability equal to the observed proportion of events, Firth-type penalization biases the average predicted probability towards 1/2. This bias of predictions may be non-negligible if events are very rare or very common. The tempting approach to replace ML estimates by Firth-type penalized estimates whenever the number of observations is critically low, maybe indicated by the occurrence of separation, might then even be detrimental if one is not only interested in effect estimates but also in predicted probabilities. Thus, the present paper has two main objectives. First, we want to clarify how relevant the bias in predicted probabilities based on Firth-type penalization is in practice. We investigate the origin of the bias by simple theoretical considerations and empirically quantify it for realistic situations using simulations. Second, we suggest two simple modifications of Firth-type penalization to overcome the bias of predictions and compare them with alternative methods which were proposed for situations of rare events.

The bias in predicted probabilities based on Firth-type penalization already becomes apparent in the simple example of logistic regression with a single binary predictor.
Assume that we want to investigate the association between a binary outcome y and some binary risk factor x and observe the following counts:

          x = 0   x = 1
  y = 0      95       4
  y = 1       5       1

For 2x2-tables, predicted probabilities obtained by ML estimation are equal to the proportion of events in the two groups. Thus, we obtain ML predicted probabilities of 0.05 and 0.2 for x = 0 and x = 1, respectively, corresponding to an overall average predicted probability of 6/105 = 0.057. However, Firth-type penalization results in predicted probabilities of 0.054 and 0.25, respectively, and an average predicted probability of 0.064, i.e. in about 12% overestimation. This results from the implicit augmentation of cell counts by applying Firth-type penalization to this simple case. Here, Firth-type penalization is equivalent to ML estimation after adding a constant of 0.5 to each cell, cf. [2]. (Note that a 2x2-table constitutes the simplest case of a saturated model.) The four cell count modifications can be interpreted as four pseudo-observations, each with weight 0.5, which are added to the original data set. Since the pseudo-data have an event rate of 1/2, Firth-type penalization leads to overestimation of predicted probabilities in case of rare events.

The present paper proposes two simple, generally applicable modifications of Firth-type multivariable logistic regression in order to obtain unbiased average predicted probabilities. First, we consider a simple post-hoc adjustment of the intercept. This Firth-type logistic regression with intercept-correction (FLIC) does not alter the bias-corrected effect estimates. By excluding the intercept from the penalization, we do not have to trade accuracy of effect estimates for better predictions. The other approach achieves unbiased predicted probabilities by adjusting for an artificial covariate discriminating between original and pseudo-observations in the iterative weighting procedure mentioned above. In this way, this
Firth-type logistic regression with added covariate (FLAC) recalibrates the average predicted probability to the proportion of events in the original data.

In Section 3 we perform a comprehensive simulation study comparing the performance of FLIC and FLAC to the performance of other published methods aiming at accurate predictions. In particular, we would like to clarify whether the proposed modifications improve the accuracy of the predicted probabilities conditional on explanatory variables. Furthermore, in the case of FLAC, we investigate whether the unbiasedness of the predicted probability is paid for by an inflation of bias in effect estimates. As comparator methods we consider a compromise between ML and Firth-type logistic regression recently proposed by Elgmati et al., cf. [3], penalization by log-F(1,1) [4] or Cauchy priors [5], the "approximate Bayesian" and the "approximate unbiased" method suggested by King and Zeng, see [6], and ridge regression. These methods are introduced in Section 2. In Section 4, the performance of these methods is illustrated in a study comparing the use of arterial closure devices to conventional surgical access in minimally invasive cardiac surgery.

The logistic regression model P(y_i = 1 | x_i) = (1 + exp(-x_i b))^(-1) with i = 1, ..., N associates a binary outcome y_i in {0, 1} to a vector of covariate values x_i = (1, x_i1, ..., x_ip) using a (p + 1)-dimensional vector of regression parameters b = (b_0, b_1, ..., b_p)'. The maximum likelihood (ML) estimate b^_ML is given by the parameter vector maximizing the log-likelihood function l(b) and is usually derived by solving the score equations dl/db_r = 0 with r = 0, ..., p.
For ML estimates, the proportion of observed events is equal to the average predicted probability. This can be seen easily from the explicit form of the ML score functions dl/db_r = sum_i (y_i - pi_i) x_ir, where pi_i = (1 + exp(-x_i b))^(-1) denotes the predicted probability for the i-th observation.

Firth has shown that penalizing the likelihood function by the Jeffreys invariant prior removes the first-order term in the asymptotic bias of ML coefficient estimates, cf. [1] and [2]. Jeffreys invariant prior is given by |I(b)|^(1/2) = |X'WX|^(1/2) with I(b) the Fisher information matrix, X the design matrix and W the diagonal matrix diag(pi_i(1 - pi_i)). Estimates of coefficients b^_FL for Firth-type penalized logistic regression (FL) can be found by solving the corresponding modified score equations

    sum_{i=1}^{N} (y_i - pi_i + h_i (1/2 - pi_i)) x_ir = 0,   r = 0, ..., p,   (1)

where h_i is the i-th diagonal element of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2), see [2]. Equation (1) for r = 0 reveals that in general the average predicted probability in Firth-type penalized logistic regression is not equal to the observed proportion of events. Since the determinant of the Fisher information matrix is maximized for pi_i = 1/2, it is concluded that Firth-type penalization tends to push the predictions towards 1/2 compared to ML estimation. Thus, in the situation of rare events, Firth-type penalization is prone to overestimate predictions. We can gain more insight into the behaviour of bias of FL predictions by interpreting the modified score equations (1) as score equations for ML estimates for an augmented data set. This data set can be created by complementing each original observation i with two pseudo-observations weighted by h_i/2, with the same covariate values and with response values y = 0 and y = 1, respectively. Thus, FL estimates could be obtained by iteratively applying ML estimation to the augmented data.
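For the 2x2 example from the introduction, the data augmentation view above reduces to adding 0.5 to each cell, so both the ML and the FL predicted probabilities can be checked by hand. A minimal Python sketch (illustrative only; the analyses in this paper used the R package logistf):

```python
# 2x2 example: event counts and group sizes for x = 0 and x = 1
events = {0: 5, 1: 1}
totals = {0: 100, 1: 5}
n = sum(totals.values())  # 105 observations, 6 events overall

# ML: predicted probabilities equal the observed cell proportions
p_ml = {x: events[x] / totals[x] for x in (0, 1)}
avg_ml = sum(p_ml[x] * totals[x] for x in (0, 1)) / n

# Firth-type (FL): for the saturated 2x2 model this is ML after
# adding 0.5 to each of the four cells
p_fl = {x: (events[x] + 0.5) / (totals[x] + 1.0) for x in (0, 1)}
avg_fl = sum(p_fl[x] * totals[x] for x in (0, 1)) / n

print(p_ml, avg_ml)  # {0: 0.05, 1: 0.2}, 6/105 = 0.0571...
print(p_fl, avg_fl)  # {0: 0.0545..., 1: 0.25}, 0.0638... (~12% too high)
```

Averaging the FL predictions over the 105 original observations reproduces the roughly 12% overestimation of the event rate quoted in the introduction.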
Given that the trace of the hat matrix is always equal to p + 1, with p the number of explanatory variables, we see that the augmented data contain (p + 1)/2 more events than the original data. Consequently, the predicted probability by FL, averaged over the observations in the augmented data, is equal to (k + (p + 1)/2)/(N + p + 1) with k the number of events. This gives a very rough approximation of the predicted probability averaged over the original observations, being the closer to the true value the more homogeneous the diagonal elements of the hat matrix are. Unsurprisingly, the relative bias in the average predicted probability is larger for smaller numbers of events k and for larger numbers of parameters p, exactly in the same situations where the application of Firth-type logistic regression is indicated to reduce small-sample bias.

In the following, we suggest two simple modifications of Firth-type logistic regression which provide average predicted probabilities equal to the observed proportion of events, while preserving the ability to deal with separation.

First, we consider altering only the intercept of the Firth-type estimates such that the predicted probabilities become unbiased while keeping all other coefficients constant. In practice, estimates for this Firth-type logistic regression with intercept-correction (FLIC) can be derived as follows:

1. Determine the Firth-type estimates b^_FL.
2. Calculate the linear predictors eta^_i = b^_FL,1 x_i1 + ... + b^_FL,p x_ip, omitting the intercept.
3. Determine the ML estimate g^ of the intercept in the logistic model P(y_i = 1) = (1 + exp(-g - eta^_i))^(-1), containing only a single predictor eta^_i with regression coefficient equal to one.
4. The FLIC estimate b^_FLIC is then given by the Firth-type estimate b^_FL with the intercept replaced by g^, so b^_FLIC = (g^, b^_FL,1, ...
, b^_FL,p).

By definition, for FLIC the average predicted probability is equal to the proportion of observed events. For the example of the 2x2-table in the introduction, i.e. 100 subjects with 5 events for x = 0 and 5 subjects with one event for x = 1, FLIC yields predicted probabilities of approximately 0.049 and 0.228, respectively. Similarly as for Firth-type penalized logistic regression, we suggest to use profile (penalized) likelihood confidence intervals for the coefficients estimated by FLIC, except for the intercept, see [2]. Approximate Wald-type confidence intervals for the intercept can be derived from the covariance matrix of the model used to estimate the intercept, which contains the linear predictors as offset. This approximation was evaluated in the simulation study, see Section 3.

Second, we introduce Firth-type logistic regression with added covariate (FLAC). The basic idea is to discriminate between original and pseudo-observations in the alternative formulation of Firth-type estimation as iterative data augmentation procedure, which was described above. For instance, in the case of 2x2-tables, where FL amounts to ML estimation of an augmented table with each cell count increased by 0.5, FLAC estimates are obtained by a stratified analysis of the original 2x2-table and the pseudo-data, given by a 2x2-table with each cell count equal to 0.5. In the general case, FLAC estimates b^_FLAC can be obtained as follows:

1. Apply Firth-type logistic regression and calculate the diagonal elements h_i of the hat matrix.
2. Construct an augmented data set by stacking (i) the original observations weighted by 1, (ii) the original observations weighted by h_i/2 and (iii) the original observations with opposite response values and weighted by h_i/2.
3. Define an indicator variable g on this augmented data set, being equal to 0 for (i) and equal to 1 for (ii) and (iii).
4.
The FLAC estimates b^_FLAC are then obtained by ML estimation on the augmented data set adding g as covariate.

If one does not include the indicator variable g as additional covariate in the last step, ML estimation with augmented data will result in ordinary Firth-type estimates. For the example of the 2x2-table in the introduction, i.e. 100 subjects with 5 events for x = 0 and 5 subjects with one event for x = 1, FLAC yields predicted probabilities of approximately 0.052 and 0.168, respectively. For data augmentation in 2x2-tables the idea of discriminating between original and pseudo-observations was already explored in [7]. The average predicted probability by FLAC is equal to the proportion of observed events. Confidence intervals for coefficients estimated by FLAC can be deduced from the ML estimation on the augmented data in the last step. Again, we prefer profile likelihood over Wald confidence intervals.

One of the main aims of the simulation study in Section 3 is to investigate whether these adjustments of the average predicted probability are also reflected in improved predicted probabilities at the subject level. Besides ML and classical Firth-type estimation, we compared the performance of FLIC and FLAC to the performance of the following methods:

- A "weakened" Firth-type penalization (WF) is proposed by Elgmati et al., where the likelihood is penalized by |I(b)|^tau with weight tau between 0 (corresponding to ML estimation) and 1/2 (corresponding to Firth-type penalization), cf. [3]. This gives a compromise between accuracy of predictions, where Firth's method is outperformed by ML estimation, and the handling of stability and separation issues, which is a strength of Firth-type penalization. We chose the weight tau equal to 0.1 as recommended by Elgmati et al.
Similarly as for ordinary Firth-type penalization, WF estimates could be obtained by iteratively applying ML estimation to an augmented data set, in this case with an event rate of (k + tau(p + 1))/(N + 2 tau(p + 1)).

- Penalizing by log-F(1,1) priors (LF) amounts to multiplying the likelihood by prod_j e^(b_j/2)/(1 + e^(b_j)), where the product ranges over j in {1, ..., p}. This type of penalization is regarded a better choice of a "default prior" by Greenland and Mansournia [4] compared to methods such as Firth-type penalization or Cauchy priors. We follow their suggestion to omit the intercept from the penalization, resulting in an average predicted probability equal to the proportion of observed events. Quantitative explanatory variables are advised to be scaled in units that are contextually meaningful, cf. [4]. Unlike Jeffreys prior, the log-F(1,1) prior does not depend on the correlation between explanatory variables.

- Penalization by Cauchy priors (CP) in logistic regression is suggested as "default choice for routine applied use" by Gelman et al., see [5]. Unlike Greenland and Mansournia, they give an explicit recommendation for data preprocessing, which is also implemented in the corresponding R function bayesglm in the package arm. All explanatory variables are shifted to have a mean of 0. Binary variables are coded to have a range of 1 and all others are scaled to have a standard deviation of 0.5. Then, all explanatory variables are penalized by Cauchy priors with center 0 and scale 2.5. The intercept is assigned a weaker Cauchy prior, with center 0 and scale 10, which implicates that the average predicted probability is in general not equal, but very close, to the observed proportion of events.

- The approximate Bayesian method (AB) suggested by King and Zeng, cf. [6], takes as starting point small-sample bias corrected logistic regression coefficients, such as the coefficients b^_FL estimated by FL.
Predicted probabilities by AB, pi^_AB,i, are then obtained by averaging the predicted probabilities by FL over the posterior distribution of the coefficients,

    pi^_AB,i = int (1 + exp(-x_i b*))^(-1) f(b*) db*,   (2)

where f(b*) denotes the posterior density of the parameter b^_FL, approximated by a normal density with mean b^_FL and covariance matrix given by the inverse of the negative Hessian of l at b^_FL. Instead of deriving pi^_AB,i by numerical integration, we make use of the approximation pi^_AB,i = pi^_FL,i + C_i, where the correction factor C_i is given by (0.5 - pi^_FL,i) h_i, cf. [6]. In general, equality of average predicted probability and observed event rate does not hold for this method; the bias of the average predicted probability by AB is exactly twice as large as by Firth-type penalization. This follows from the fact that the sum over the C_i is equal to sum_i (pi^_FL,i - y_i) according to the modified score equation for the intercept in FL estimation. King and Zeng are aware that their estimator introduces bias into the predicted probabilities but claim that this is compensated by a reduction of the root mean squared error (RMSE).

- The approximate unbiased estimator (AU), also introduced by King and Zeng [6], can be understood as a counterpart to the approximate Bayesian method: using the notation introduced above, predicted probabilities by the approximate unbiased method are now defined as pi^_AU,i = pi^_FL,i - C_i. Similarly as we have seen that the bias in the average predicted probability by AB is twice as large as by FL, one can easily verify that the average predicted probability by AU is equal to the observed proportion of events, i.e. unbiased. Nevertheless, King and Zeng consider the AB preferable to the AU "in the vast majority of applications". One should be aware that the definition of the AU does not ensure that predicted probabilities fall inside the range of 0 to 1.

- In ridge regression (RR), the log-likelihood is penalized by the square of the Euclidean norm of the regression parameters, b_1^2 + ...
+ b_p^2, multiplied by a tuning parameter lambda, see [8]. Since the intercept is omitted from the penalty, the average predicted probability is equal to the proportion of observed events. Following Verweij and Van Houwelingen [9], in our simulation study the tuning parameter lambda was chosen by minimizing the penalized version of Akaike's Information Criterion, AIC = -2 l(b^) + 2 df_e, where the effective degrees of freedom df_e are given by the trace of the product of the Hessian of l and the inverse Hessian of the penalized log-likelihood l*, both evaluated at b^. Wald-type confidence intervals were deduced from the penalized variance-covariance matrix with fixed tuning parameter. RR was always performed on scaled explanatory variables with standard deviation equal to one, but results are reported on the original scale.

The following types of confidence intervals were chosen for the different methods: Wald-type confidence intervals were used for ML, CP and RR, and profile likelihood confidence intervals for WF, FL, FLAC and LF. Confidence intervals for FLIC were constructed as explained above.

The empirical performance of the methods introduced in Section 2 was evaluated in a comprehensive simulation study. Integral parts were the comparison of bias and RMSE of predicted probabilities and estimates of coefficients. Furthermore, we assessed discrimination, quantified by the c-statistic, and the calibration slope of the models [10]. Confidence intervals were evaluated with regard to power, length and coverage. Results on predictions are presented for all methods introduced above, ML, WF, FL, FLIC, FLAC, LF, CP, AU, AB and RR, whereas results on coefficient estimates only concern ML, WF, FL, FLIC, FLAC, LF, CP and RR. Whenever we are solely interested in coefficient estimates except for the intercept, results from FL and FLIC agree and are presented jointly.
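Before turning to the simulations, the FLIC and FLAC procedures of Section 2 can be sketched end-to-end on the 2x2 example. The numpy code below is a minimal illustration, not the paper's implementation (which builds on the R package logistf); it uses the fact, stated above, that in the saturated 2x2 case the FL augmentation reduces to pseudo-cells of 0.5, so the FL slope is the log odds ratio of the augmented table:

```python
import numpy as np

def wml(X, y, w, offset=None, n_iter=40):
    """Weighted ML logistic regression via Newton-Raphson."""
    if offset is None:
        offset = np.zeros(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + offset)))
        grad = X.T @ (w * (y - p))
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# grouped 2x2 data: covariate, response, cell count
x = np.array([0.0, 0.0, 1.0, 1.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([5.0, 95.0, 1.0, 4.0])

# FL slope for the saturated model: log odds ratio after adding 0.5 per cell
b1_fl = np.log((1.5 * 95.5) / (5.5 * 4.5))

# FLIC: keep the FL slope (as an offset), re-estimate the intercept by ML
gamma = wml(np.ones((4, 1)), y, w, offset=b1_fl * x)[0]
p_flic = 1.0 / (1.0 + np.exp(-(gamma + b1_fl * x)))

# FLAC: stack original data (g = 0) and 0.5-per-cell pseudo-data (g = 1),
# then plain ML with the indicator g as added covariate
X_aug = np.column_stack([np.ones(8),
                         np.concatenate([x, x]),
                         np.concatenate([np.zeros(4), np.ones(4)])])
y_aug = np.concatenate([y, y])
w_aug = np.concatenate([w, np.full(4, 0.5)])
b0, b1, b2 = wml(X_aug, y_aug, w_aug)
p_flac = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # original stratum: g = 0

# both methods recover the observed event rate 6/105 on average
avg_flic = float(np.sum(w * p_flic) / np.sum(w))
avg_flac = float(np.sum(w * p_flac) / np.sum(w))
print(p_flic[0], p_flic[2], avg_flic)  # approx 0.049, 0.228, 6/105
print(p_flac[0], p_flac[2], avg_flac)  # approx 0.052, 0.168, 6/105
```

The score equations for the intercept (FLIC) and for the original stratum (FLAC) force the weighted average of the predictions over the original observations to equal 6/105 exactly.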
Binary outcomes y_i were generated from the logistic model P(y_i | x_i1, ..., x_i10) = (1 + exp(-b_0 - b_1 x_i1 - ... - b_10 x_i10))^(-1), i = 1, ..., N, with ten explanatory variables x_i1, ..., x_i10. The joint distribution of the explanatory variables follows Binder, Sauerbrei and Royston [11] and includes continuous as well as categorical variables. Since the focus of our paper is on predictions, we have reduced the number of explanatory variables from seventeen [11] to ten and considered only linear effects.

First, we generated ten standard normal random variables z_ij ~ N(0, 1), j = 1, ..., 10, i = 1, ..., N, with correlation structure as listed in Table 5 in the Appendix. By applying the transformations described in Table 5 to the variables z_ij, four continuous, four binary and two ordinal variables x_ij were derived. The continuous variables were truncated at the third quartile plus five times the interquartile distance in each simulated dataset.

The coefficients of the four binary variables were set to 0.69, the coefficients of the two ordinal variables to 0.35. This corresponds to an odds ratio of 2 for binary variables and to an odds ratio of sqrt(2) for consecutive categories of ordinal variables. The coefficients of the four continuous variables were chosen such that the difference between the first and the sixth sextile of the empirical distribution function corresponds to an odds ratio of 2. Finally, the intercept b_0 was chosen such that a certain proportion of events was obtained. We considered all combinations of sample sizes N in {500, 1400, 3000} and population event rates of 1%, 2%, 5% and 10% such that the expected number of events was always greater than 10. In addition to the covariate effects described above, we considered effects all equal to zero and effects half the size of the coefficients stated above, referred to as an effect size a of 0, 0.5 and 1.
Finally, for each of these scenarios we also took into account effects of mixed signs, multiplying the coefficients b_j, j = 6, ..., 10, by -1. For each of these scenarios, 1,000 datasets were simulated and analysed by the methods described in Section 2. The sample sizes were chosen such that the proportion of datasets in which the Firth-type coefficients were all significantly different from zero (in scenarios with the lowest event rate and large, all-positive effects) increased with sample size.

For random variable generation, the function rmvnorm from package mvtnorm [12] was used in the case of multivariate normal distributions and the function rbinom from package base in the case of binomial distributions (R 3.0.3 [13]). ML and FL were estimated using the R package logistf [14]. This package was also used in the implementation of the algorithms for FLIC and FLAC as described in Section 2. For WF a self-modified version of logistf was used. RR was estimated by the R package rms [15], CP by the function bayesglm in the package arm [16], and LF was implemented as maximum likelihood analysis of the appropriately modified data set, see [4], page 3138.

For clarity, tables of results in this section are restricted to some selected scenarios with coefficients of mixed sign, i.e. sgn(b_j) = -1, j = 6, ..., 10. Results for scenarios with only positive coefficients were similar and are available from the authors upon request. In the text we refer to all scenarios if not stated otherwise. Separation was encountered in only a few of the simulated datasets per scenario. Although some methods can handle separation in data sets, we excluded these cases from analyses to retain comparability.

We begin by analysing the discrepancy between the estimated event rate, i.e. the average predicted probability, and the observed event rate for the methods WF, FL, CP and AB, see Table 1. All other methods (ML, FLIC, FLAC, LF, AU and RR) give average predicted probabilities equal to the observed event rate by construction.
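The "by construction" property is simply the intercept's ML score equation, sum_i (y_i - pi^_i) = 0, which holds for ML and for any method that leaves the intercept unpenalized. A quick numerical check on simulated data (the covariate setup below is illustrative, not the design of this study):

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 2000
# three standard normal covariates plus intercept (illustrative setup)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([-3.0, 0.5, -0.5, 1.0])  # rare-ish events
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# plain ML logistic regression via Newton-Raphson
beta = np.zeros(4)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]),
                            X.T @ (y - p))

p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
# intercept score equation forces mean(p_hat) == mean(y)
print(abs(p_hat.mean() - y.mean()))  # ~0, up to numerical precision
```

The same identity fails for FL, WF and AB precisely because their intercept score equations carry the extra h_i-dependent terms discussed in Section 2.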
Among all considered scenarios, the scenario with the smallest expected number of events and with all explanatory variables being unrelated to the outcome (effect size of zero) was associated with the largest relative bias of the event rate, amounting to 4% for WF, 19% for FL and 38% for AB. The bias of the estimated event rate for FL was about five times larger than for WF, and exactly half the size of the bias for AB. For CP, the difference between estimated and observed event rate can be considered negligible in all scenarios. For all methods, the RMSE was only slightly larger than the bias, i.e. the variance of the estimated event rates was small. These numbers show that in cases of rare events the systematic error in the estimated proportion of events cannot be ignored.

Table 1: Bias and RMSE (x100) of the estimated event rate relative to the observed event rate, for scenarios with small effect size (a = 0.5) and with coefficients of mixed sign. The methods not considered in the table (ML, FLIC, FLAC, LF, AU and RR) have zero bias and RMSE by construction.
Table 2 shows the bias and RMSE of the individual predicted probabilities. ML, FLIC, FLAC, LF, AU and RR give unbiased predicted probabilities; the fact that the respective numbers are not exactly equal to zero is due to sampling variability in the simulation. RR was associated with considerably lower RMSE than the other methods in all but one scenario; the second best performing method was either FLAC or, in some situations with large effect size, FLIC. Both FLAC and FLIC did not only correct the bias in predicted probabilities but also reduced the variance and in particular the RMSE in comparison to FL. CP resulted in smaller RMSE than LF, which still performed better than WF in all scenarios. The approximate Bayesian method AB performed worst with respect to bias and RMSE in almost all settings. For 17 simulation scenarios with smaller numbers of events, the approximate unbiased method AU yielded predicted probabilities outside of [0, 1]. With increasing sample size, event rate and effect size, differences between methods diminished.

All methods except for RR yielded mean calibration slopes smaller than 1 throughout all scenarios, indicating underestimation of small event probabilities and overestimation of larger ones (cf. Table 2). The method achieving the calibration slope closest to the optimal value of 1 was either FLAC or RR, depending on the scenario. Both FLIC and FLAC outperformed ordinary Firth-type estimation with respect to calibration. With increasing sample size and expected event rate, both the differences between methods and the distance to the optimal value of 1 decreased. Figure 1 investigates the bias and RMSE of predictions in relation to the size of the true linear predictor, exemplarily for one simulation scenario.
In line with the results on the calibration slope in Table 2, we find that RR strongly overestimates small event probabilities and underestimates large ones. A similar, but less pronounced, pattern is shown by FLAC predictions. Predictions by AU were close to unbiased over the whole range of predictions, but were associated with considerable variance, in particular for larger predictions. For better discriminability of methods, Figure 1 shows the bias and RMSE scaled by the standard error of proportions corresponding to the respective true linear predictors. Supplementary figure 2 is the unscaled version of Figure 1.
Figure 1: Bias and RMSE of predicted probabilities (scaled by the standard error of proportions sqrt(p(1 - p)/N), with p the probability corresponding to the true linear predictor) by true linear predictor, exemplarily for the scenario with N = 1400, the lowest event rate, large effect size (a = 1) and coefficients of mixed signs. For the calculation of bias and RMSE, the predicted probabilities were split into groups defined by quantiles of the true linear predictor. Cubic smoothing splines were then fitted to the derived bias and RMSE values in the groups. Upward directed ticks on the x-axis mark the deciles of the true linear predictor. See Supplementary figure 2 for an unscaled version of this plot.

The discriminative power of the models (in terms of c-indices) was evaluated with regard to a new, independently generated outcome drawn from the logistic model with the same covariate values as in the training data. Highest discrimination was achieved by either RR or AB, depending on the scenario, but there was little variation across methods, cf. Supplementary table 2. By construction, FL and FLIC give the same c-indices since adjusting the intercept does not change the order of predicted probabilities. The "optimal value" in Supplementary table 2 was obtained by calculating the c-index based on event probabilities from the true model.

Since the relation between predicted probabilities and linear predictors is non-linear with non-vanishing curvature for linear predictors smaller than 0, a reduced bias of predictions often comes at the cost of an increased bias of linear predictors, and vice versa. This effect is less pronounced if the variability of the estimator is small. For instance, RR, having the smallest RMSE of linear predictors among the investigated methods, performed well with regard to both linear predictors and predicted probabilities, cf. Supplementary table 3. Bias and RMSE of linear predictors were of largest absolute size for ML across all scenarios.
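The calibration slope referred to above is the slope from a logistic regression of the outcome on the linear predictor of the model under evaluation; a slope below 1 means small probabilities are underestimated and large ones overestimated. A minimal illustration with a perfectly calibrated predictor (the data-generating setup here is an assumption for demonstration, not the paper's simulation design):

```python
import numpy as np

def ml_logit(X, y, n_iter=30):
    # plain ML logistic regression via Newton-Raphson
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]),
                                X.T @ (y - p))
    return beta

rng = np.random.default_rng(42)
n = 50_000
eta = rng.normal(-3.0, 1.0, size=n)              # linear predictors under evaluation
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # outcome truly follows eta

# calibration model: logistic regression of y on eta
intercept, slope = ml_logit(np.column_stack([np.ones(n), eta]), y)
print(slope)  # close to 1: the predictor is well calibrated
```

Replacing eta by the linear predictor of an overfitted model would push the slope below 1, which is the pattern reported for all methods except RR.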
FL was least biased in almost all scenarios, but with considerably larger RMSE than RR. Again, with increasing number of expected events and effect size, differences between methods decreased.

Table 3 shows the absolute bias and RMSE of the standardized coefficients, averaged over all explanatory variables, omitting the intercept. For simulation scenarios with non-zero covariate effects, FL outperformed the other methods in most scenarios with respect to absolute bias, and even in unfavourable scenarios its average standardized bias remained small. Concerning the RMSE, FL ranked at least fourth place, almost always clearly outmatched by RR and throughout all scenarios slightly worse than CP and, interestingly, FLAC, but never worse than LF. ML and WF were associated with the largest and second largest RMSE throughout all simulation scenarios.

Table 2: Mean bias and RMSE (x100) of predicted probabilities, and mean and standard deviation (x100) of calibration slopes, for selected simulation scenarios with coefficients of mixed signs. (See Supplementary table 1 for further scenarios.)

Table 3: Absolute bias and RMSE (x100) of standardized coefficients, averaged over all explanatory variables (omitting the intercept), for some simulation scenarios with coefficients of mixed signs. (See Supplementary table 4 for further scenarios.)

Finally, the approximate Wald-type standard error for the intercept in FLIC estimation suggested in Section 2 was evaluated by comparing the corresponding confidence intervals to jackknife and bootstrap confidence intervals. Due to the computational burden, this comparison was restricted to the simulation scenarios with sample sizes N = 500 and N = 1400. In most simulation scenarios with non-zero covariate effects the approximate approach yielded the confidence bounds that most often excluded zero, but differences between methods were marginal, cf.
Supplementary table 5. The approximate Wald-type confidence interval was also the shortest. Especially in extreme situations, the coverage exceeded the nominal confidence level for all three methods.

For all methods, results on confidence intervals averaged over all coefficients, omitting the intercept, can be found in Supplementary table 6. Coverage was reasonable for all methods except for RR, ranging between . and across methods and scenarios. RR confidence intervals were overly conservative in simulation scenarios without any covariate effects, with coverage levels as high as . , and were too optimistic for scenarios with moderate or large effects. They were clearly shorter, with less power to exclude 0, compared to the other methods. In out of simulation scenarios CP and FLAC were among the four methods with smallest power, often combined with a slight conservatism. Though, in general, there was little difference in the behaviour of the confidence intervals among all methods except for RR.

In a retrospective study at the Department of Cardiothoracic Surgery of the University Hospital of the Friedrich-Schiller University Jena, the use of arterial closure devices (ACDs) in minimally invasive cardiac surgery was compared to conventional surgical access with regard to the occurrence of cannulation-site complications. Of the patients eligible for analysis,
16 (3.6%) encountered complications. About one fifth of surgeries ( cases) were performed with conventional surgical access to the groin vessels. The complication rate was . ( cases) for the conventional surgical access and . ( cases) for the ACD group. For the purpose of illustration, the analysis in the present paper was restricted to four adjustment variables selected by medical criteria: the logistic EuroSCORE, which estimates the risk of mortality in percent; the presence of previous cardiac operations (yes/no); the body mass index (BMI), for which five missing values were replaced by the mean BMI; and the presence of diabetes (yes/no). The aim of the study was twofold: first, to estimate the adjusted effect of the surgical access procedure on the occurrence of complications and, second, to quantify the risk for complications. Multivariable logistic regression models with the five explanatory variables type of surgical access, logistic EuroSCORE, previous cardiac operations, BMI and diabetes were estimated by ML, WF, FL, FLAC, LF, CP and RR. In addition, predicted probabilities were obtained by FLIC, AB and AU. All methods gave significant, large effect estimates for type of surgical access, ranging between an odds ratio of . for RR and of . for ML, accompanied by wide confidence intervals but with lower bounds always greater than 1, see Table 4. Given the small events-per-variable ratio of . and the small-sample bias-reducing property of FL, the coefficient estimates by FL might be preferable to the ML estimates in this situation, although the difference is not substantial. Coefficient estimates by WF, which can be regarded as a compromise between ML and FL, fell between the ones from ML and FL. None of the four adjustment variables showed a significant effect with the methods ML, WF, FL, FLAC and LF, but some of them did with CP and RR.
This is somewhat in contrast to our simulation results, where confidence intervals from RR were rather more likely to include 0 than confidence intervals from other methods, independent of the effect size. We explain this discrepancy by one of the assumptions of our simulation study, namely that in each scenario all explanatory variables had similar effect sizes. Counter to intuition, RR may induce bias away from zero in the effect estimates of irrelevant variables which are correlated with strong predictors.

Figure 2 a) shows the distribution of the predicted probabilities by method, separately for patients with and without complications. As expected, both FLIC and FLAC pushed the predicted probabilities towards zero, resulting in values clearly smaller than the FL counterparts. On the contrary, AB even increased the FL predictions in magnitude and supplied the largest individual predicted probabilities, both for the event ( ) and the non-event group ( . ). RR had the smallest range of predicted probabilities. The observed proportion of events was overestimated most severely with AB (by . ), with FL by , with WF by . and with CP by only . , see Figure 2 b).

Table 4: Odds ratios with 95% confidence intervals for type of surgical access, logistic EuroSCORE, previous cardiac operations, BMI and diabetes, as estimated by ML, WF, FL/FLIC, FLAC, LF, CP and RR.

The estimated event probabilities for patients with average covariate values (median logistic EuroSCORE of . , without previous cardiac operation, median BMI of . , suffering from diabetes) and with either ACD or conventional surgical access, shown in Figure 2 b), exhibit a similar pattern as the boxplots in Figure 2 a). Again, predicted probabilities by RR had the smallest variability. Predicted probabilities by WF fell between predicted probabilities by ML and FL, whereas for the conventional surgical access group FLIC and FLAC resulted in predicted probabilities smaller than the ones from both ML and FL. Discrimination in terms of cross-validated c-indices (see Figure 2) was best for AB with a c-index of . and poorest for AU with . . All considered methods gave rise to similar conclusions on the role of ACD use, being associated with a significantly lower complication rate than conventional surgical access. For ACD patients with average covariate values, the different methods predicted complication risks between . and , while for comparable conventional access patients predictions ranged between . and . .

Our simulation study shows that both our suggested methods, Firth-type logistic regression with intercept correction and Firth-type logistic regression with added covariate, efficiently improve on Firth-type predictions. The complete removal of the bias in the average predicted probability, which amounted to up to . in our simulations, was accompanied by more accurate individual predicted probabilities, as revealed by an RMSE of FLIC and FLAC predicted probabilities ranking at most fourth in out of simulation scenarios, only consistently outperformed by RR and never worse than FL. Fortunately, this improvement in predictions does not have to be paid for by a lower performance of effect estimation: while for FLIC the effect estimates are identical to those of FL, FLAC introduces some bias, but this is compensated by a decreased RMSE.
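The two proposals can be condensed into a short computational sketch. The following numpy code is a minimal illustration, not the authors' implementation: function names are ours, convergence handling is simplistic, and it assumes a full-rank design matrix whose first column is the intercept. It fits Firth-type logistic regression via the modified score, FLIC by re-estimating only the intercept by ML, and FLAC by ML on augmented data with a pseudo-observation indicator:

```python
import numpy as np

def _logistic_fit(X, y, w, use_firth, n_iter=200, tol=1e-10):
    # Newton-Raphson for (optionally Firth-penalized) weighted logistic regression.
    # Firth's modified score for logistic models augments the working residual by
    # h_i * (1/2 - pi_i), with h_i the diagonal of the weighted hat matrix.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = w * pi * (1.0 - pi)
        info = X.T @ (X * W[:, None])
        resid = w * (y - pi)
        if use_firth:
            h = np.einsum('ij,ij->i', (X * W[:, None]) @ np.linalg.inv(info), X)
            resid = resid + h * (0.5 - pi)
        step = np.linalg.solve(info, X.T @ resid)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def firth(X, y):
    return _logistic_fit(X, y, np.ones(len(y)), use_firth=True)

def flic(X, y):
    # FLIC: keep the Firth slopes, re-estimate only the intercept by ML,
    # which makes the average predicted probability equal the event rate.
    b = firth(X, y)
    off = X[:, 1:] @ b[1:]
    b0 = b[0]
    for _ in range(200):                      # 1-d Newton for the intercept
        pi = 1.0 / (1.0 + np.exp(-(b0 + off)))
        g = np.sum(y - pi)
        b0 += g / np.sum(pi * (1.0 - pi))
        if abs(g) < 1e-10:
            break
    return np.r_[b0, b[1:]]

def flac(X, y):
    # FLAC: ML on data augmented with pseudo-observations of weight h_i/2
    # (one copy with y_i, one with 1 - y_i), plus an indicator column that
    # distinguishes pseudo from original observations.
    b = firth(X, y)
    pi = 1.0 / (1.0 + np.exp(-(X @ b)))
    W = pi * (1.0 - pi)
    info = X.T @ (X * W[:, None])
    h = np.einsum('ij,ij->i', (X * W[:, None]) @ np.linalg.inv(info), X)
    n = len(y)
    Xa = np.vstack([np.c_[X, np.zeros(n)],
                    np.c_[X, np.ones(n)],
                    np.c_[X, np.ones(n)]])
    ya = np.r_[y, y, 1.0 - y]
    wa = np.r_[np.ones(n), h / 2.0, h / 2.0]
    return _logistic_fit(Xa, ya, wa, use_firth=False)  # last coefficient: indicator
```

By construction, the score equations force both FLIC and FLAC to reproduce the observed event rate as the average predicted probability on the original observations, which is precisely the property motivating the two modifications.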
Based on our simulation results, we slightly prefer FLAC over FLIC because of the lower RMSE of predictions. Confidence intervals for coefficient estimates for both FLIC and FLAC can be easily derived and performed reasonably well in all considered scenarios. Though, in the case of FLIC, the covariance between the intercept and the other coefficient estimates cannot be based on model output and, if required, would have to be estimated by resampling methods. If confidence intervals for predicted probabilities are needed, one could avoid the determination of a full covariance matrix by centering the explanatory variables at the values for which prediction is requested and re-estimating the model. The confidence interval of the intercept can then be interpreted as a confidence interval for the linear predictor. Finally, the two methods can be readily implemented using only some data manipulation steps, ML estimation and an implementation of FL, which is available in software packages such as SAS, R, Stata and Statistica ([17], [18], [19]).

Figure 2: a) Boxplots of predicted probabilities by method and occurrence of event. C-indices (×100), estimated using leave-one-out cross-validation: ML 64.68, WF 65.02, FL 66.05, FLIC 65.67, FLAC 64.99, LF 63.99, CP 64.36, AU 61.36, AB 68.23, RR 61.85. b) Rectangles and triangles give the predicted probabilities for patients with average covariate values (median logistic EuroSCORE of . , without previous cardiac operation, median BMI of . , suffering from diabetes) and with either ACD or conventional surgical access, respectively. Circles mark the average predicted probabilities. In both plots, the horizontal dashed line marks the observed proportion of events (3.6%).

The “weakened” Firth-type penalization, a compromise between ML and FL estimation, indeed performed better than FL with regard to bias and RMSE of predictions but was outperformed by most of the other methods, namely by FLIC, FLAC, LF, CP and RR. Moreover, reducing the bias in predicted probabilities compared to FL comes at the cost of enlarging the RMSE of effect estimates, as shown by our simulation results in line with intuition. Of course, WF depends essentially on the choice of the weight parameter τ, which was set to . as suggested by Elgmati et al., cf. [3]. Future investigations should clarify whether tuning the weight parameter, for instance by optimizing the cross-validated modified AIC (AIC_mod), as performed for ridge regression in our study, can make WF a more attractive option.

The approximate Bayesian method could not make up for the introduction of bias in predicted probabilities, which is exactly twice as large as in FL estimation, and also performed rather poorly with respect to RMSE. Our results even suggest its inferiority to the approximate unbiased method, which gives an average predicted probability equal to the proportion of events, often with smaller RMSE than AB. Thus, we could not confirm King and Zeng's recommendation to prefer AB over AU “in the vast majority of applications”.
Though, the discrepancy between their and our simulation results advises caution: it seems that neither of the two methods is superior to the other in most situations, but that the behaviour strongly depends on the setting. This was also emphasized by a spot-check simulation with balanced outcome, where AB showed lower RMSE than AU (results not shown). One disadvantage of AU is that predicted probabilities can fall outside the plausible range of 0 to 1. However, the question of deciding between AB and AU might not be a relevant one, since both methods were clearly outperformed by FLIC, FLAC, CP, LF and RR in all simulation scenarios with rare events. It should also be taken into account that with AB and AU predictions, the analytical relation between linear predictors and predicted probabilities is lost.

The three methods based on weakly informative priors, penalization by log-F(1,1) priors, by Cauchy priors and ridge regression, differ not only in the choice of the prior distributions but also in the strategy of data preprocessing. While Greenland and Mansournia [4] advocate the use of reasonable, study-independent multiples of SI units for the LF, Gelman et al. [5] suggest scaling continuous variables to have a standard deviation of 0.5 for CP. In the case of RR, we followed the widespread practice of standardizing variables to unit variance and choosing the tuning parameter by the penalized version of the AIC, see for instance [9]. Of course, it is a legitimate question to ask whether other combinations of strategies, for instance penalization by log-F(1,1) priors after stringent scaling of variables or even tuning of prior distribution parameters, might yield better performances, but this would go beyond the scope of this study.
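The RR strategy just described, standardized covariates with the ridge parameter chosen by a penalized AIC, can be sketched as follows. This is a minimal Python illustration under our own naming, not the implementation used in the study; the effective degrees of freedom use a trace formula in the spirit of [9], and the intercept is left unpenalized:

```python
import numpy as np

def ridge_logistic(X, y, lam, n_iter=100, tol=1e-10):
    # Newton-Raphson for logistic regression with penalty lam * sum(beta_j^2)
    # on all coefficients except the intercept (column 0 of X).
    p = X.shape[1]
    D = np.eye(p); D[0, 0] = 0.0            # do not penalize the intercept
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = pi * (1.0 - pi)
        H = X.T @ (X * W[:, None]) + 2.0 * lam * D
        g = X.T @ (y - pi) - 2.0 * lam * (D @ beta)
        step = np.linalg.solve(H, g)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def penalized_aic(X, y, lam):
    # AIC with effective degrees of freedom trace[(I + 2*lam*D)^{-1} I].
    beta = ridge_logistic(X, y, lam)
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    loglik = np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))
    D = np.eye(X.shape[1]); D[0, 0] = 0.0
    I = X.T @ (X * (pi * (1.0 - pi))[:, None])
    df = np.trace(np.linalg.solve(I + 2.0 * lam * D, I))
    return -2.0 * loglik + 2.0 * df

# Tuning: evaluate a grid of candidate values (covariates assumed standardized)
# and keep the lam minimizing the criterion, e.g.
# best_lam = min(lam_grid, key=lambda lam: penalized_aic(X, y, lam))
```

Increasing lam shrinks the non-intercept coefficients towards zero and reduces the effective degrees of freedom, which is exactly the bias-variance trade-off discussed above.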
Instead, we focused on readily available methods aimed at routine use in statistical applications.

Whenever one is willing to accept the introduction of bias towards zero for the sake of a small RMSE, RR turned out to be the method of choice, outperforming all other methods with respect to RMSE of coefficients in and of predictions in out of simulation scenarios. However, confidence intervals do not reach their nominal coverage levels because of the combination of bias and reduced variance. The naive approach of deducing Wald confidence intervals from the penalized covariance matrix, which was applied in our simulation study, cannot be recommended. In order to avoid these issues in the construction of confidence intervals, CP or FLAC, which provide substantially less biased effect estimates than RR with a reasonable RMSE, might be preferable. LF showed a similar performance pattern as CP but was slightly outperformed by CP with regard to the RMSE of coefficients as well as predicted probabilities throughout all simulation scenarios.

When comparing CP to FLAC, the main difference is the application of independent, univariate priors by CP, while a multivariate prior is employed by FLAC. In all methods which use independent priors (including also LF and RR), linear transformations of explanatory variables, such as scaling, interaction coding or transformations to achieve orthogonality, will affect predictions from the model. Therefore, bayesglm [16] includes an automatic preprocessing of the explanatory variables following [5]. This preprocessing leads to more stringent penalization of interaction effects than of main effects, which can be desirable in exploratory data analyses. By contrast, in FLAC, any linear transformations of explanatory variables (including different coding) will not affect predictions. Nevertheless, FLAC will apply more shrinkage in larger models, e.g., when interaction effects are included.
Thus, with limited data it will penalize complex models more than simple ones.

Another aspect in this comparison is the interpretation of adjusted regression coefficients. CP, as implemented in bayesglm, uses the same prior for the effect of an explanatory variable X_j, no matter whether in the model it is adjusted for another variable X_k or not. However, the interpretation of the corresponding regression coefficient β_j changes fundamentally by adjustment for X_k, in particular if X_j and X_k are correlated. Why then should the same prior be imposed, even if only weakly informative, irrespective of whether β_j is adjusted for β_k or not? In CP, LF or RR (with fixed λ) the prior definition is not adapted to such changes in interpretation. In FLAC this is implicitly captured by the Jeffreys prior, which involves covariances between the coefficients.

Summarizing, being interested in accurate effect estimates and predictions in the presence of rare events, we recommend to use
- RR if inference is not required and optimization of the RMSE of coefficients and predicted probabilities is given priority,
- FLAC or CP, which offer low variability and bias in effect estimates as well as predictions, if valid confidence intervals for effects are needed. We have a slight preference for FLAC because of the invariance properties outlined above.

Table 5: Structure of explanatory variables in the simulation study, following Binder, Sauerbrei and Royston [11]. Square brackets [...] indicate that the non-integer part of the argument is removed. 1{·} denotes the indicator function, taking value 1 if its argument is true and 0 otherwise. The explanatory variables are derived from correlated standard normal variables z_i: continuous variables of the form x_i = [10 z_i + 55], x_i = [max(0, 100 exp(z_i) − c)] and x_i = [max(0, 80 exp(z_i) − c)], binary variables of the form x_i = 1{z_i < c}, and ordinal variables of the form x_i = 1{z_i ≥ c_1} + 1{z_i ≥ c_2}, for various cutpoints c, c_1, c_2.
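The construction behind Table 5, correlated standard normal underlying variables transformed into continuous, binary and ordinal explanatory variables, can be sketched as follows. The correlation matrix and cutpoints below are our own illustrative placeholders, not the values of the study, and the function name is ours:

```python
import numpy as np

def draw_covariates(n, R, rng):
    # Draw correlated standard normal underlying variables z via a Cholesky
    # factor of R, then derive explanatory variables x by the transformation
    # types of Table 5 (thresholds here are illustrative only).
    L = np.linalg.cholesky(R)
    z = rng.standard_normal((n, R.shape[0])) @ L.T
    x1 = np.floor(10 * z[:, 0] + 55)                          # continuous, age-like
    x2 = (z[:, 1] < 0.5).astype(float)                        # binary (assumed cutpoint)
    x3 = (z[:, 2] >= -0.5).astype(float) + (z[:, 2] >= 0.5)   # ordinal with levels 0/1/2
    x4 = np.floor(np.maximum(0, 100 * np.exp(z[:, 3]) - 20))  # skewed continuous (assumed shift)
    return np.c_[x1, x2, x3, x4]

# illustrative correlation structure of the underlying variables
R = np.array([[1.0, 0.3, 0.2, 0.0],
              [0.3, 1.0, 0.1, 0.0],
              [0.2, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])
rng = np.random.default_rng(42)
X = draw_covariates(1000, R, rng)
```

Because the transformations are non-linear, the correlations among the resulting explanatory variables differ from those of the underlying normal variables, which is why Table 5 reports both.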
We thank Michael Schemper and Mohammad Ali Mansournia for helpful comments. The data on complications in cardiac surgery were kindly provided by Paulo A. Amorim, Alexandros Moschovas, Gloria Färber, Mahmoud Diab, Tobias Bünger and Torsten Doenst from the Department of Cardiothoracic Surgery at the University Hospital Jena. The research leading to these results has received funding from the Austrian Science Fund (FWF) within project I 2276 and from the Slovenian Research Agency (ARRS) within project N1-0035.
References

[1] Firth D. Bias Reduction of Maximum Likelihood Estimates. Biometrika 1993; 80(1):27–38, doi:10.2307/2336755.
[2] Heinze G, Schemper M. A Solution to the Problem of Separation in Logistic Regression. Statistics in Medicine 2002; 21(16):2409–2419, doi:10.1002/sim.1047.
[3] Elgmati E, Fiaccone R, Henderson R, Matthews J. Penalised Logistic Regression and Dynamic Prediction for Discrete-Time Recurrent Event Data. Lifetime Data Analysis 2015; 21(4):542–560, doi:10.1007/s10985-015-9321-4.
[4] Greenland S, Mansournia M. Penalization, Bias Reduction, and Default Priors in Logistic and Related Categorical and Survival Regressions. Statistics in Medicine 2015; 34(23):3133–3143, doi:10.1002/sim.6537.
[5] Gelman A, Jakulin A, Pittau M, Su YS. A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models. The Annals of Applied Statistics 2008; 2(4):1360–1383, doi:10.1214/08-AOAS191.
[6] King G, Zeng L. Logistic Regression in Rare Events Data. Political Analysis 2001; 9(2):137–163.
[7] Greenland S. Simpson's Paradox From Adding Constants in Contingency Tables as an Example of Bayesian Noncollapsibility. The American Statistician 2010; 64(4):340–344, doi:10.1198/tast.2010.10006.
[8] Le Cessie S, Van Houwelingen J. Ridge Estimators in Logistic Regression. Applied Statistics 1992; 41(1):191–201, doi:10.2307/2347628.
[9] Verweij P, Van Houwelingen J. Penalized Likelihood in Cox Regression. Statistics in Medicine 1994; 13(23-24):2427–2436, doi:10.1002/sim.4780132307.
[10] Steyerberg E, Vickers A, Cook N, Gerds T, Gonen M, Obuchowski N, Pencina M, Kattan M. Assessing the Performance of Prediction Models: a Framework for Traditional and Novel Measures. Epidemiology 2010; 21(1):128–138, doi:10.1097/EDE.0b013e3181c30fb2.
[11] Binder H, Sauerbrei W, Royston P. Multivariable Model-Building with Continuous Covariates: 1. Performance Measures and Simulation Design. Technical Report FDM-Preprint 105, University of Freiburg, Germany 2011.
[12] Genz A, Bretz F. Computation of Multivariate Normal and t Probabilities. Springer: Heidelberg, 2009, doi:10.1007/978-3-642-01689-9.
[13] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2014.
[14] Heinze G, Ploner M, Dunkler D, Southworth H. logistf: Firth's Bias Reduced Logistic Regression. http://tinyurl.com/fllogistf, R package version 1.22.
[15] Harrell F. rms: Regression Modeling Strategies. http://CRAN.R-project.org/package=rms, R package version 4.4-1.
[16] Gelman A, Su YS. arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. http://CRAN.R-project.org/package=arm, R package version 1.8-6.
[17] Heinze G, Ploner M. A SAS Macro, S-PLUS Library and R Package to Perform Logistic Regression without Convergence Problems. Technical Report 2, Medical University of Vienna, Austria 2004. URL http://tinyurl.com/fllogistfTR.
[18] Coveney J. FIRTHLOGIT: Stata Module to Calculate Bias Reduction in Logistic Regression. http://EconPapers.repec.org/RePEc:boc:bocode:s456948.
[19] Fijorek K, Sokolowski A. Separation-Resistant and Bias-Reduced Logistic Regression: STATISTICA Macro. Journal of Statistical Software, Code Snippets.