Point and interval estimation of exposure effects and interaction between the exposures based on logistic model for observational studies
Xiaoqin Wang, Weimin Ye and Li Yin
Department of Electronics, Mathematics and Natural Sciences, University of Gävle, SE-801 76, Gävle, Sweden
Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Box 281, SE-171 77, Stockholm, Sweden
* Corresponding author: Email: [email protected]
Summary
In observational studies with a dichotomous outcome of a population, researchers need to present the effects of exposures and the interaction between the exposures jointly in order to learn the relationship between the exposure effects and the interaction. In this article we study point and interval estimation of the exposure effects and the interaction based on a logistic model, where the exposure effects are measured by risk differences and the interaction is measured by the difference between risk differences. Using the approximate normal distribution of the maximum-likelihood (ML) estimate of the model parameters, we obtain an approximate non-normal distribution of the ML estimate of the exposure effects and the interaction. Using this distribution, we obtain the point estimate and confidence region of (exposure effect, interaction), as well as the point estimate and confidence interval of the interaction when the ML estimate of an exposure effect falls into a specified range. Our maximum-likelihood-based approach provides a simple but reliable method of interval estimation of exposure effects and the interaction.
Keywords: exposure effect; interaction between exposures; point estimate; interval estimate; logistic model
1. Introduction
Suppose one conducts a randomized trial to investigate the effect of two exposures $z_1$ and $z_2$ on an outcome $y$ of a certain population, where $z_1$, $z_2$ and $y$ are all dichotomous, namely $z_1 = 0, 1$, $z_2 = 0, 1$ and $y = 0, 1$. In the randomized trial, covariates are essentially unassociated with the exposures $(z_1, z_2)$ and thus are not confounders. Oftentimes, one measures the effect of $z_1$ when $z_2 = 0$ by the risk difference
$$\mathrm{TE1} = \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0)$$
and the interaction between $z_1$ and $z_2$ by the difference between risk differences [1--5]
$$\mathrm{INT} = \{\mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1)\} - \mathrm{TE1}.$$
Likewise, one measures the effect of $z_2$ when $z_1 = 0$ by the risk difference
$$\mathrm{TE2} = \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0)$$
and obtains another expression of the above interaction as
$$\mathrm{INT} = \{\mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0)\} - \mathrm{TE2}.$$
INT is also called the biological interaction because its expression in terms of risk differences has a close relationship with some well-known classifications of biological mechanisms [3, 4]. When presenting INT, one also presents TE1 and TE2 in order to learn the significance of INT relative to TE1 and TE2. For instance,
INT = 0.05 has rather different significance relative to TE1 = 0.01 versus TE1 = 0.5.
Now suppose one conducts an observational study in which one also wishes to study TE1, TE2 and INT. In the observational study, there exist confounders, which are associated with the outcome $y$ and the exposures $(z_1, z_2)$. The most common model used to adjust for confounding covariates in estimating the effect of $(z_1, z_2)$ is the logistic model. Most often, the logistic model is used to obtain the point estimate and confidence interval of an odds ratio as a measure of the exposure effects, as well as the point estimate and confidence interval of a ratio of odds ratios as a measure of the interaction. In recent years, the logistic model has also been used to obtain the point estimate and confidence interval of TE1 or TE2 as measures of exposure effects [6--9] and, in rare cases, of INT as a measure of interaction [10]. Notably, TE1, TE2 and INT are more interpretable than the odds ratio and the ratio of odds ratios. In those studies, the confidence interval of TE1, TE2 or INT is obtained by the normal approximation method or by the parametric / non-parametric bootstrap method. The normal approximation method is based on the approximate variance of the ML estimate of TE1, TE2 or INT. The bootstrap method generates bootstrap samples, uses them to obtain the bootstrap distribution of TE1, TE2 or INT, and then derives the bootstrap confidence interval.
However, little is seen in the literature that reports TE1, TE2 and INT jointly, e.g. a confidence region of (TE1 or TE2, INT), which takes into account the correlation between the estimates of (TE1 or TE2, INT). In this article, we use the logistic model to obtain point and interval estimates of (TE1 or TE2, INT). Instead of deriving the covariance matrix of the two-dimensional ML estimate of (TE1 or TE2, INT) or using the bootstrap method, we generate an approximate distribution of the ML estimate of (TE1 or TE2, INT) and then use this distribution to obtain the interval estimate of (TE1 or TE2, INT).
We present our method by studying the effect of (hospital type, cancer stage) on cancer survival.
2. Effect of (hospital type, cancer stage) on cancer survival in Sweden
2.1 Medical background and the data
In cancer treatment, an important question is which type of hospital, small or large, is superior for the treatment of cancers of early or advanced stage, where the size of a hospital is determined by the number of cancer patients treated there. In an observational study, researchers studied the effect of (hospital type, cancer stage) on one-year survival of cardia cancer patients [11]. Cardia cancer is highly malignant with a bad prognosis, and its one-year survival is a good measure of the performance of hospital type and the impact of cancer stage in treating such cancers.
The data were collected between 1988 and 1995 on 150 cardia cancer patients treated in hospitals located in central and northern Sweden. The hospitals were categorized into two types: large type ($z_1 = 1$) when treating more than 10 patients during the period 1988-1995, and small type ($z_1 = 0$) when treating 10 or fewer patients. Cancer stages were categorized into two categories: advanced stage ($z_2 = 1$) when the eight-level stage index took values larger than or equal to five, and early stage ($z_2 = 0$) when the index took values smaller than five. The exposures in this study were then $\mathbf{z} = (z_1, z_2)$ = (hospital type, cancer stage). The outcome of a patient was successful ($y = 1$) versus unsuccessful ($y = 0$) survival for one year after diagnosis.
In addition to the exposure $\mathbf{z}$, possible confounders were documented: age ($x_1$), gender ($x_2$), and geographic area ($x_3$). Age was continuous, but gender and geographic area were categorical. Gender was coded by $x_2$, with $x_2 = 0$ for female. Geographic area ($x_3$) was categorized into urban versus rural. Let $\mathbf{x} = (x_1, x_2, x_3)$ be the set of the documented covariates. The descriptive statistics of these covariates and exposures are given in Table 1. The complete data of the study are given as supplementary material.
Let $y(z_1, z_2)$ be the potential outcome of each patient in the population under exposure $(z_1, z_2)$, where $z_1 = 1, 0$ indicates large versus small hospital types while $z_2 = 1, 0$ indicates cancers of advanced versus early stages; see [12--16] for the framework of causal inference. We use $\mathrm{pr}\{y(z_1, z_2) = 1\}$ to denote the risk of $y(z_1, z_2) = 1$ of a patient under exposure $(z_1, z_2)$. Then the effect of the hospital type $z_1$ at the early cancer stage $z_2 = 0$ is measured by the risk difference
$$\mathrm{TE1} = \mathrm{pr}\{y(1, 0) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}.$$
The effect of the cancer stage $z_2$ at the small hospital type $z_1 = 0$ is measured by the risk difference
$$\mathrm{TE2} = \mathrm{pr}\{y(0, 1) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}.$$
The interaction between $z_1$ and $z_2$ is measured by the difference between risk differences
$$\mathrm{INT} = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(0, 1) = 1\}] - [\mathrm{pr}\{y(1, 0) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}] = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(0, 1) = 1\}] - \mathrm{TE1},$$
or equivalently,
$$\mathrm{INT} = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(1, 0) = 1\}] - [\mathrm{pr}\{y(0, 1) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}] = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(1, 0) = 1\}] - \mathrm{TE2}.$$
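As a small numerical illustration of these definitions, the sketch below computes TE1, TE2 and INT from four risks and checks that the two expressions of INT agree. The function name and the four risk values are hypothetical and chosen only for illustration; they are not estimates from the cardia cancer data.

```python
def risk_difference_measures(p00, p10, p01, p11):
    """Return (TE1, TE2, INT) from the four risks pr{y(z1, z2) = 1}.

    The argument pzw is the risk under exposure (z1, z2) = (z, w).
    """
    te1 = p10 - p00            # effect of z1 when z2 = 0
    te2 = p01 - p00            # effect of z2 when z1 = 0
    int_ = (p11 - p01) - te1   # difference between risk differences
    return te1, te2, int_


# Hypothetical risks, for illustration only.
te1, te2, int_ = risk_difference_measures(p00=0.60, p10=0.75, p01=0.10, p11=0.30)

# The alternative expression (p11 - p10) - TE2 gives the same interaction.
assert abs(int_ - ((0.30 - 0.75) - te2)) < 1e-12
```

With these illustrative risks the same INT is small relative to TE2 but sizeable relative to TE1, which is why the measures are reported together.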
In some medical applications, one aims at the effect of $z_1$ in stratum $z_2 = 1, 0$, namely,
$$\mathrm{TE}(z_2) = \mathrm{pr}\{y(z_1 = 1) = 1 \mid z_2\} - \mathrm{pr}\{y(z_1 = 0) = 1 \mid z_2\},$$
where $y(z_1)$ is the potential outcome of each patient in stratum $z_2$ under exposure $z_1$, and then considers the modification of the effect of $z_1$ by $z_2$, namely,
$$\mathrm{EM} = \mathrm{TE}(z_2 = 1) - \mathrm{TE}(z_2 = 0).$$
However, EM compares the effects of $z_1$ on different strata $z_2 = 1, 0$ and does not have a causal interpretation. Thus one cannot use EM to address the question of the illustrative example described in the previous subsection. The situation is similar for the modification of the effect of $z_2$ by $z_1$.
Because we can only observe the potential outcome of a patient under one of the four exposures $(z_1, z_2) = (0, 0), (0, 1), (1, 0), (1, 1)$, we need a certain assumption to allow for estimation of TE1, TE2 and INT [12--16]. In the medical context of this study, it is reasonable to assume that there is no confounder other than the documented covariates $\mathbf{x} = (x_1, x_2, x_3)$. We denote the risk of $y = 1$ in stratum $(z_1, z_2, \mathbf{x})$ by $\mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})$. Then the assumption states
$$\mathrm{pr}\{y(z_1, z_2) = 1 \mid \mathbf{x}\} = \mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x}),$$
which implies that the unobservable potential outcome $y(z_1, z_2)$ can be assessed through the observable outcome $y$. Then we have
$$\mathrm{pr}\{y(z_1, z_2) = 1\} = \sum_{\mathbf{x}} \mathrm{pr}\{y(z_1, z_2) = 1 \mid \mathbf{x}\}\, \mathrm{pr}(\mathbf{x}) = \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})\, \mathrm{pr}(\mathbf{x}).$$
Inserting this into the above formulas for TE1, TE2 and INT, we obtain
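$$\mathrm{TE1} = \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0, \mathbf{x})\, \mathrm{pr}(\mathbf{x}) - \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0, \mathbf{x})\, \mathrm{pr}(\mathbf{x}), \tag{1a}$$
$$\mathrm{TE2} = \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1, \mathbf{x})\, \mathrm{pr}(\mathbf{x}) - \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0, \mathbf{x})\, \mathrm{pr}(\mathbf{x}), \tag{1b}$$
$$\mathrm{INT} = \Big\{\sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1, \mathbf{x})\, \mathrm{pr}(\mathbf{x}) - \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1, \mathbf{x})\, \mathrm{pr}(\mathbf{x})\Big\} - \mathrm{TE1}$$
$$\phantom{\mathrm{INT}} = \Big\{\sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1, \mathbf{x})\, \mathrm{pr}(\mathbf{x}) - \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0, \mathbf{x})\, \mathrm{pr}(\mathbf{x})\Big\} - \mathrm{TE2}. \tag{1c}$$
In particular, in a randomized trial, we have $\mathrm{pr}(\mathbf{x}) = \mathrm{pr}(\mathbf{x} \mid z_1, z_2)$, which implies that
$$\sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})\, \mathrm{pr}(\mathbf{x}) = \sum_{\mathbf{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})\, \mathrm{pr}(\mathbf{x} \mid z_1, z_2) = \mathrm{pr}(y = 1 \mid z_1, z_2).$$
Inserting this into (1a), (1b) and (1c), we see that TE1, TE2 and INT given by (1a), (1b) and (1c) are the same as those for randomized trials given in the introduction of this article.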
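Formulas (1a)-(1c) amount to standardizing the stratum-specific risks over the covariate distribution. The following minimal sketch shows that computation for a single dichotomized covariate; the covariate levels, their probabilities and the stratum risks are all hypothetical values invented for illustration.

```python
def standardized_risk(stratum_risk, covariate_prob):
    """Sum over x of pr(y = 1 | z1, z2, x) * pr(x) for one fixed (z1, z2)."""
    return sum(stratum_risk[x] * covariate_prob[x] for x in covariate_prob)


# Hypothetical covariate distribution pr(x) and stratum risks pr(y = 1 | z1, z2, x).
covariate_prob = {"young": 0.4, "old": 0.6}
risk = {
    (0, 0): {"young": 0.70, "old": 0.50},
    (1, 0): {"young": 0.80, "old": 0.70},
    (0, 1): {"young": 0.20, "old": 0.10},
    (1, 1): {"young": 0.40, "old": 0.20},
}

std = {z: standardized_risk(risk[z], covariate_prob) for z in risk}

te1 = std[(1, 0)] - std[(0, 0)]            # formula (1a)
te2 = std[(0, 1)] - std[(0, 0)]            # formula (1b)
int_ = (std[(1, 1)] - std[(0, 1)]) - te1   # formula (1c), first expression
```

Every exposure level is weighted by the same pr(x), so differences in covariate mix between exposure groups do not enter the comparison.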
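The risk $\mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})$ of $y = 1$ in stratum $(z_1, z_2, \mathbf{x})$ is modeled by a logistic model. By likelihood ratio-based significance testing of the model parameters, we obtain
$$\log \frac{\mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})}{1 - \mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x})} = \alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 * z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 * x_1). \tag{2}$$
In this model, we include, in addition to one term for the exposure product $z_1 * z_2$, another term for the hospital type-age product $z_1 * x_1$, because of a somewhat small p-value, 0.20, for the significance test of $\theta_4 \neq 0$. Let $\pi = (\alpha, \beta_1, \beta_2, \beta_3, \theta_1, \theta_2, \theta_3, \theta_4)$ be the set of all model parameters. The ML estimate $\hat{\pi} = (\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4)$ and its approximate covariance matrix $\hat{\Sigma}$ (i.e. the inverse of the observed information) are given in Table 2. From model (2), we obtain
$$\mathrm{pr}(y = 1 \mid z_1, z_2, \mathbf{x}) = \frac{\exp\{\alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 * z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 * x_1)\}}{1 + \exp\{\alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 * z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 * x_1)\}}. \tag{3}$$
Inserting (3) into (1a)-(1c) and replacing the probability $\mathrm{pr}(\mathbf{x})$ by the proportion $\mathrm{prop}(\mathbf{x})$ of the covariates $\mathbf{x}$ in the sample, we obtain TE1, TE2 and INT as functions of $\pi$:
$$\mathrm{TE1}(\pi) = \sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\mathbf{x}) - \sum_{\mathbf{x}} \frac{\exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\mathbf{x}), \tag{4a}$$
$$\mathrm{TE2}(\pi) = \sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\mathbf{x}) - \sum_{\mathbf{x}} \frac{\exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\mathbf{x}), \tag{4b}$$
$$\mathrm{INT}(\pi) = \Big\{\sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\mathbf{x}) - \sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\mathbf{x})\Big\} - \mathrm{TE1}(\pi)$$
$$\phantom{\mathrm{INT}(\pi)} = \Big\{\sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\mathbf{x}) - \sum_{\mathbf{x}} \frac{\exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\mathbf{x})\Big\} - \mathrm{TE2}(\pi). \tag{4c}$$
Replacing $\pi$ in TE1($\pi$), TE2($\pi$) and INT($\pi$) by $\hat{\pi}$, we obtain the ML estimates $\widehat{\mathrm{TE1}} = \mathrm{TE1}(\hat{\pi})$, $\widehat{\mathrm{TE2}} = \mathrm{TE2}(\hat{\pi})$ and $\widehat{\mathrm{INT}} = \mathrm{INT}(\hat{\pi})$. Although $\hat{\pi}$ is biased for finite samples, $\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$ and $\widehat{\mathrm{INT}}$ are unbiased [17]. Clearly, $\hat{\pi}$ is consistent and so are $\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$ and $\widehat{\mathrm{INT}}$. With $\hat{\pi}$ given in Table 2, these ML estimates are equal to $\widehat{\mathrm{TE1}} = 0.12$, $\widehat{\mathrm{TE2}} = -0.48$ and $\widehat{\mathrm{INT}} = 0.01$.
We are going to obtain the confidence region of (TE1 or TE2, INT) by generating an approximate distribution of ($\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$,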
$\widehat{\mathrm{INT}}$). Let $p$ be a random variable which follows the normal distribution $N(\hat{\pi}, \hat{\Sigma})$, namely, $p \sim N(\hat{\pi}, \hat{\Sigma})$, where $\hat{\pi}$ and $\hat{\Sigma}$ are given in Table 2. This normal distribution is a good approximation to the distribution of the ML estimate $\hat{\pi}$, because the parameters of a logistic model have good asymptotic normality [18]. Replacing $\pi$ in {TE1($\pi$), TE2($\pi$), INT($\pi$)} by $p$ and then using the normal distribution of $p$, we generate the distribution of {TE1($p$), TE2($p$), INT($p$)}, which approximates the distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$). With this approximate distribution, we obtain the approximate distribution of ($\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$) and then the confidence region of (TE1 or TE2, INT).
Because {TE1($\pi$), TE2($\pi$), INT($\pi$)} is a smooth monotone bounded function of $\pi$ according to (4a)-(4c), the performance of $p$ in approximating $\hat{\pi}$ determines the performance of {TE1($p$), TE2($p$), INT($p$)} in approximating ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$). Exact and approximate distributions of the ML estimate of an exposure effect have been simulated in the setting of a large exposure effect, a large exposure-covariate interaction and strong confounding, and it was found that the exact and approximate distributions agree rather well, even in their tails [17].
This method of obtaining the confidence region of (TE1 or TE2, INT) is analogous to the common method of obtaining the confidence interval of an odds ratio. For instance, suppose that a coefficient $\beta$ of model (2) is a parameter of interest and we wish to obtain the 95 % confidence interval of the odds ratio $\exp(\beta)$. The estimate of $\beta$ has good asymptotic normality: the distribution of $\hat{\beta}$ is approximately normal with a variance estimate $\widehat{\mathrm{var}}(\hat{\beta})$. Then one generates an approximate distribution of $\exp(\hat{\beta})$ by using the approximate normal distribution of $\hat{\beta}$ and thus obtains the 95 % confidence interval of the odds ratio as $\exp\{\hat{\beta} \pm 1.96 \sqrt{\widehat{\mathrm{var}}(\hat{\beta})}\}$, where the number 1.96 is the 97.5th percentile of the standard normal distribution. This confidence interval reflects the asymmetry of the distribution of $\exp(\hat{\beta})$. This method has also been used to obtain confidence intervals of other measures of exposure effect such as the risk ratio [19].
In the next section, we shall describe the procedure of obtaining the approximate distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$) and various point and interval estimates of TE1, TE2 and INT.
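Formulas (4a)-(4c) can be evaluated by averaging the model-based risk (3) over the covariate values of the patients in the sample, which is the same as weighting each distinct covariate value by its sample proportion prop(x). The sketch below does this in plain Python. The function names, the covariate rows and the parameter values are hypothetical; they are not the Table 2 estimates or the study data.

```python
import math


def expit(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def model_risk(params, z1, z2, x):
    """Risk pr(y = 1 | z1, z2, x) under a model of the form (2)."""
    alpha, b1, b2, b3, t1, t2, t3, t4 = params
    x1, x2, x3 = x
    eta = (alpha + b1 * z1 + b2 * z2 + b3 * z1 * z2
           + t1 * x1 + t2 * x2 + t3 * x3 + t4 * z1 * x1)
    return expit(eta)


def effects(params, covariates):
    """Return (TE1, TE2, INT) as in (4a)-(4c).

    covariates holds one (x1, x2, x3) tuple per patient, so averaging over
    the rows weights each covariate value by its sample proportion.
    """
    n = len(covariates)
    avg = {(z1, z2): sum(model_risk(params, z1, z2, x) for x in covariates) / n
           for z1 in (0, 1) for z2 in (0, 1)}
    te1 = avg[1, 0] - avg[0, 0]
    te2 = avg[0, 1] - avg[0, 0]
    int_ = (avg[1, 1] - avg[0, 1]) - te1
    return te1, te2, int_


# Hypothetical covariate rows (age, gender, area) and parameter values.
covariates = [(55.0, 0, 1), (67.0, 1, 0), (72.0, 1, 1), (80.0, 0, 0)]
params = (3.0, 1.5, -2.0, 0.3, -0.04, 0.2, -0.1, -0.01)
te1, te2, int_ = effects(params, covariates)
```

Evaluating `effects` at the ML estimate of the parameters gives the point estimates; evaluating it at draws from the approximating normal distribution gives the approximate distribution described in the next section.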
3. Point and interval estimates of TE1, TE2 and INT
3.1 Approximate distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$) and the derived distributions
First, we draw $p$ from $N(\hat{\pi}, \hat{\Sigma})$. Second, we replace $\pi$ by $p$ in formulas (4a)-(4c) to get {TE1($p$), TE2($p$), INT($p$)}, which approximates ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$). We iterate the procedure, 1000 times in this article, to get 1000 sets of approximate values of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$). These 1000 sets form an approximate distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$).
All pairs ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$) in those 1000 sets form an approximate distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$). All $\widehat{\mathrm{TE1}}$ in those 1000 pairs form an approximate distribution of $\widehat{\mathrm{TE1}}$. All $\widehat{\mathrm{INT}}$ in those 1000 pairs form an approximate distribution of $\widehat{\mathrm{INT}}$. The three distributions are presented in Figure 1. Similarly, we obtain approximate distributions for ($\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$) and $\widehat{\mathrm{TE2}}$, which are presented in Figure 2.
Restricting $\widehat{\mathrm{TE1}}$ to a given range, all corresponding $\widehat{\mathrm{INT}}$ form an approximate conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into the given range. The ranges of $\widehat{\mathrm{TE1}}$ we consider are the terciles of $\widehat{\mathrm{TE1}}$, i.e. three consecutive ranges of $\widehat{\mathrm{TE1}}$ such that the probability of $\widehat{\mathrm{TE1}}$ occurring in each range is 1/3. The three conditional distributions are presented in Figure 3. Similarly, we obtain three approximate conditional distributions of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE2}}$ falls into its terciles, as presented in Figure 4.
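The two-step procedure above (draw from the normal approximation, then evaluate the effect functions) can be sketched as follows. To stay short and self-contained, the sketch uses a reduced four-parameter model without covariates, so the sums over covariates in (4a)-(4c) collapse to single terms. The parameter estimate and covariance matrix are hypothetical stand-ins for the Table 2 values, and the small Cholesky routine is only one of several ways to draw from a multivariate normal distribution.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L * L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_mvn(mean, chol, rng):
    """One draw from N(mean, chol * chol^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


def expit(t):
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))


def effects(p):
    """(TE1, TE2, INT) for a reduced model: logit = a + b1*z1 + b2*z2 + b3*z1*z2."""
    a, b1, b2, b3 = p
    p00, p10, p01, p11 = expit(a), expit(a + b1), expit(a + b2), expit(a + b1 + b2 + b3)
    te1 = p10 - p00
    return te1, p01 - p00, (p11 - p01) - te1


def simulate(estimate, cov, n_draws=1000, seed=0):
    """Approximate distribution of the ML estimates of (TE1, TE2, INT)."""
    rng = random.Random(seed)
    chol = cholesky(cov)
    return [effects(draw_mvn(estimate, chol, rng)) for _ in range(n_draws)]


# Hypothetical ML estimate and covariance matrix (diagonally dominant, hence valid).
estimate = [0.5, 0.6, -2.0, 0.1]
cov = [[0.30, -0.10, -0.08, 0.05],
       [-0.10, 0.35, 0.05, -0.10],
       [-0.08, 0.05, 0.30, -0.10],
       [0.05, -0.10, -0.10, 0.40]]

draws = simulate(estimate, cov, n_draws=1000, seed=1)
te1_int_pairs = [(te1, i) for te1, _, i in draws]   # approximate joint distribution
```

No data are resampled and no model is refitted: each draw only requires evaluating the effect functions, so 1000 iterations are inexpensive.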
Point estimates of TE1, TE2 and INT have been obtained in Section 2.4 and are equal to $\widehat{\mathrm{TE1}} = 0.12$, $\widehat{\mathrm{TE2}} = -0.48$ and $\widehat{\mathrm{INT}} = 0.01$. In most practical studies with interval estimates, it is sufficient to have confidence regions of (TE1, INT) and (TE2, INT) rather than a confidence volume of (TE1, TE2, INT).
We are going to use the distribution of ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$) obtained in Section 3.1 to obtain the $(1 - \alpha)$ confidence region of (TE1, INT). We can arbitrarily choose a confidence curve that partitions the whole region of (TE1, INT) into a confidence region in which ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$) occurs with probability $(1 - \alpha)$ and its complementary region in which ($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$) occurs with probability $\alpha$. Here we take the confidence region by imposing that the confidence curve is an ellipse and further requiring that the confidence region enclosed by the ellipse has the smallest area at the $(1 - \alpha)$ confidence level. First, we calculate {mean($\widehat{\mathrm{TE1}}$), mean($\widehat{\mathrm{INT}}$)}, the variances var($\widehat{\mathrm{TE1}}$) and var($\widehat{\mathrm{INT}}$), and the covariance cov($\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{INT}}$). Second, we use the normal distribution
$$N\left( \begin{pmatrix} \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix}, \begin{pmatrix} \mathrm{var}(\widehat{\mathrm{TE1}}) & \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) \\ \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) & \mathrm{var}(\widehat{\mathrm{INT}}) \end{pmatrix} \right)$$
to obtain the $(1 - \alpha)$ confidence curve for (TE1, INT) by the ellipse formula
$$\begin{pmatrix} \mathrm{TE1} - \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{INT} - \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix}^{T} \begin{pmatrix} \mathrm{var}(\widehat{\mathrm{TE1}}) & \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) \\ \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) & \mathrm{var}(\widehat{\mathrm{INT}}) \end{pmatrix}^{-1} \begin{pmatrix} \mathrm{TE1} - \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{INT} - \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix} = \chi^2_2(1 - \alpha),$$
where $\chi^2_2(1 - \alpha)$ is the $100(1 - \alpha)$th percentile of the central chi-square distribution with two degrees of freedom. The 95 % confidence region of (TE1, INT) is shown in Figure 1. In the same way, we obtain the 95 % confidence region of (TE2, INT), which is presented in Figure 2.
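The elliptical region can be computed directly from the simulated pairs. The sketch below estimates the mean and covariance of a list of (TE1, INT) draws and tests whether a candidate point lies inside the ellipse. It relies on the fact that a chi-square variable with two degrees of freedom is exponential with mean 2, so its percentile has the closed form $-2 \ln \alpha$. The function names and the four demonstration pairs are hypothetical.

```python
import math


def mean_and_cov(pairs):
    """Sample mean and covariance (var_x, cov_xy, var_y) of 2-D draws."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pairs) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Quadratic form (v - m)^T S^{-1} (v - m) for a 2 x 2 covariance S."""
    sxx, sxy, syy = cov
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_quantile_2df(level):
    """Percentile of the chi-square distribution with two degrees of freedom."""
    return -2.0 * math.log(1.0 - level)


def in_confidence_region(point, mean, cov, level=0.95):
    """True if the point lies on or inside the confidence ellipse."""
    return mahalanobis_sq(point, mean, cov) <= chi2_quantile_2df(level)


# Hypothetical (TE1, INT) draws, only to show the calls.
pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
mean, cov = mean_and_cov(pairs)
inside = in_confidence_region((0.6, 0.4), mean, cov)
```

Points on the boundary of the region are those where the quadratic form equals the percentile, which is the ellipse formula above.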
We are going to use the distribution of $\widehat{\mathrm{INT}}$ to obtain the $(1 - \alpha)$ confidence interval of INT. The confidence interval is not unique because the upper and lower confidence limits can be arbitrarily chosen such that the corresponding upper and lower confidence levels, denoted by $\alpha_u$ and $\alpha_l$ respectively, satisfy $\alpha_u + \alpha_l = \alpha$. Here we take the confidence interval by imposing the condition $\alpha_u = \alpha_l = 0.025$. The resulting 95 % confidence interval of INT is reported in Table 3a, together with the point estimate $\widehat{\mathrm{INT}} = 0.01$ obtained in Section 2.4.
We are going to use the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into a tercile of $\widehat{\mathrm{TE1}}$ to obtain the point estimate and $(1 - \alpha)$ confidence interval of INT over the tercile. The point estimate of INT is the mean of the conditional distribution. Over the lower tercile of $\widehat{\mathrm{TE1}}$, the point estimate of INT is 0.18. Over the middle tercile of $\widehat{\mathrm{TE1}}$, the point estimate (95 % confidence interval) of INT is 0.03 ($-0.20$, 0.21). The remaining confidence limits and the results over the upper tercile of $\widehat{\mathrm{TE1}}$ are reported in Table 3b. Similarly, over the lower tercile of $\widehat{\mathrm{TE2}}$, the point estimate of INT is 0.18; the point estimates and 95 % confidence intervals of INT over all three terciles of $\widehat{\mathrm{TE2}}$ are reported in Table 3c.
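The interval estimates of this section are percentiles and means of the simulated draws. The following sketch computes an equal-tailed interval ($\alpha_u = \alpha_l = \alpha/2$) and the tercile-wise point estimates and intervals of INT. The helper names, the linear-interpolation percentile rule and the synthetic (TE, INT) pairs are assumptions made for illustration; the article does not state which percentile rule was used.

```python
def percentile(sorted_vals, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def equal_tailed_interval(values, alpha=0.05):
    """Interval with alpha / 2 of the draws below and alpha / 2 above."""
    s = sorted(values)
    return percentile(s, 100.0 * alpha / 2.0), percentile(s, 100.0 * (1.0 - alpha / 2.0))


def tercile_summaries(pairs, alpha=0.05):
    """Split (TE, INT) draws into thirds by TE.

    Returns, for the lower, middle and upper third, the mean of INT and its
    equal-tailed interval within that third.
    """
    ordered = sorted(pairs)                  # ascending in TE
    n = len(ordered)
    cuts = [0, n // 3, 2 * n // 3, n]
    out = []
    for start, stop in zip(cuts, cuts[1:]):
        ints = [i for _, i in ordered[start:stop]]
        out.append((sum(ints) / len(ints), equal_tailed_interval(ints, alpha)))
    return out


# Synthetic draws in which INT decreases as TE increases (illustration only).
pairs = [(k / 300.0, 1.0 - k / 300.0) for k in range(300)]
marginal_interval = equal_tailed_interval([i for _, i in pairs])
summaries = tercile_summaries(pairs)
```

Each third of the ordered draws plays the role of one tercile of the ML estimate of TE1 or TE2, so the three summaries correspond to the three rows of Table 3b or 3c.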
4. Interpretation for various interval estimates of TE1, TE2 and INT
From Figures 1b, 1c and 2b, we cannot observe the correlation of $\widehat{\mathrm{INT}}$ with $\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$. The confidence interval of INT in Table 3a indicates possible values for INT. From Figure 1a, we see that $\widehat{\mathrm{INT}}$ is highly correlated with $\widehat{\mathrm{TE1}}$. The confidence region of (TE1, INT) indicates possible values for (TE1, INT) at the 95 % confidence level. From Figure 2a, we see a similar situation for (TE2, INT).
From Figures 3a-3c, we see that INT takes different values at different TE1. The confidence interval of INT over a tercile of $\widehat{\mathrm{TE1}}$ in Table 3b indicates possible values of INT at the 95 % confidence level when $\widehat{\mathrm{TE1}}$ falls in the tercile. Such confidence intervals allow us to perform a stratified analysis of INT over strata of $\widehat{\mathrm{TE1}}$. If the effect of large versus small hospital types for early stage cancer is $\widehat{\mathrm{TE1}} \le 0.105$, then we have an increase of the effect for advanced stage cancer, i.e. $\widehat{\mathrm{INT}} = 0.18$; the 95 % confidence interval and the corresponding results for $0.105 < \widehat{\mathrm{TE1}} \le 0.246$ and $\widehat{\mathrm{TE1}} > 0.246$ are given in Table 3b.
Similarly, from Figures 4a-4c and Table 3c, we see that INT takes different values at different TE2. If the effect of advanced versus early cancer stage for small hospital type is $\widehat{\mathrm{TE2}} \le -0.508$, then we have an increase of the effect for large hospital type, i.e. $\widehat{\mathrm{INT}} = 0.18$; the 95 % confidence interval and the corresponding results for $-0.508 < \widehat{\mathrm{TE2}} \le -0.374$ and $\widehat{\mathrm{TE2}} > -0.374$ are given in Table 3c. Although the TE2-INT relation is similar to the TE1-INT relation as revealed in the analysis above, they cannot be obtained from each other.
5. Discussion and conclusions
To learn the relationship between exposure effects and the interaction between the exposures, one needs to report both the exposure effects and the interaction. Because the ML estimates of the exposure effects and the interaction are highly correlated, one needs to report them jointly. In this article, we have obtained the point estimate and confidence region of (TE1, INT), those of (TE2, INT), and the point estimate and confidence interval of INT when the ML estimate of TE1 or TE2 falls into a specified range.
Because of its statistical advantages, the logistic model is ubiquitous in observational studies of exposure effects on dichotomous outcomes of populations. One major advantage of a logistic model is that the model parameters have good asymptotic normality: the normal distribution is a good approximate distribution for the ML estimate of the model parameters; see, e.g., [18]. One major disadvantage of the logistic model is the use of the odds ratio as the measure of the exposure effects; see the rich literature on non-collapsibility of the odds ratio [12, 20-24]. In this article, we have kept the advantage of the logistic model while avoiding the disadvantage by using the risk difference as the measure of the exposure effects and the difference between risk differences as the measure of the interaction. We have used the approximate normal distribution of the ML estimate of the model parameters to obtain an approximate non-normal distribution of the ML estimates of TE1, TE2 and INT and then their interval estimates, including confidence regions. This maximum-likelihood-based approach provides a simple but reliable method of interval estimation of TE1, TE2 and INT, which can be easily implemented with any software that generates normally distributed random numbers.
Two methods are available in the literature to calculate the confidence interval of TE1 (or TE2, or INT in rare cases) based on the logistic model [6-10], i.e. the normal approximation method and the bootstrap method, but they have not been used to calculate the confidence region of (TE1, INT). In the normal approximation method, one derives the approximate variance of the ML estimate of TE1 by using the delta method and then uses the variance to obtain the normal approximation confidence interval of TE1. To obtain the confidence region of (TE1, INT), however, one needs to derive the approximate covariance matrix of the ML estimate of (TE1, INT), which is tedious. In the bootstrap method, one generates bootstrap samples by a parametric or non-parametric bootstrap and then uses the bootstrap samples to obtain the bootstrap confidence interval of TE1. However, it is highly difficult to correct the finite-sample bias arising from the bootstrap sampling, particularly with a dichotomous outcome [25--27]. It is even more difficult to correct the bias for the bootstrap confidence region of (TE1, INT) [25--27]. In comparison, our method of obtaining interval estimates, including confidence regions, of TE1, TE2 and INT is based on the approximate distribution of their ML estimates and does not involve resampling of the data. Therefore our method does not belong to the category of resampling methods such as the bootstrap method.
Acknowledgement
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
References
1. Blot WJ and Day NE. Synergism and interaction: are they equivalent? American Journal of Epidemiology: 99--100.
2. Rothman KJ, Greenland S and Walker AM. Concepts of interaction. American Journal of Epidemiology: 467--470.
3. Rothman KJ, Greenland S and Lash TL. Modern Epidemiology (3rd edition). Lippincott Williams & Wilkins, Philadelphia; 2008.
4. Greenland S. Interactions in Epidemiology: Relevance, Identification and Estimation. Epidemiology: 14--17.
5. VanderWeele TJ. On the Distinction between Interaction and Effect Modification. American Journal of Epidemiology: 863--871.
6. McNutt LA, Wu C, Xue X and Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. American Journal of Epidemiology: 940--943.
7. Newcombe RG. A deficiency of the odds ratio as a measure of effect size. Statistics in Medicine: 4235--4240.
8. Greenland S. Model-based Estimation of Relative Risk and Other Epidemiologic Measures in Studies of Common Outcomes and Case-Control Studies. American Journal of Epidemiology: 301--305.
9. Austin PC. Absolute risk reductions, relative risks, relative risk reductions, and numbers needed to treat can be obtained from a logistic regression model. Journal of Clinical Epidemiology: 2--6.
10. Nie L, Chu H, Li F and Cole SR. Relative excess risk due to interaction: resampling-based confidence intervals. Epidemiology: 552--556.
11. Hansson LE, Ekstrom AM, Bergstrom R and Nyren O. Surgery for stomach cancer in a defined Swedish population: current practices and operative results. Swedish Gastric Cancer Study Group. The European Journal of Surgery, 787--975.
12. Greenland S, Robins JM and Pearl J. Confounding and collapsibility in causal inference. Statistical Science: 29--46.
13. Greenland S and Robins JM. Identifiability, exchangeability, epidemiological confounding. International Journal of Epidemiology: 413--419.
14. Rosenbaum PR and Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika: 41--55.
15. Rosenbaum PR. Observational Studies. Springer, New York; 1995.
16. Rubin DB, Wang X, Yin L and Zell E. Estimating the Effect of Treating Hospital Type on Cancer Survival in Sweden Using Principal Stratification. In The Handbook of Applied Bayesian Analysis, eds. T. O'Hagan and M. West, Oxford University Press, Oxford; 2009.
17. Wang X, Jin Y and Yin L. Point and Interval Estimations of Marginal Risk Difference by Logistic Model. Communications in Statistics: Theory and Methods.
18. Lindsey JK. Parametric Statistical Inference. Clarendon Press, Oxford; 1996.
19. Wang X, Jin Y and Yin L. Measuring and estimating treatment effect on dichotomous outcome of a population. Statistical Methods in Medical Research.
20. Gail MH, Wieand S and Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika: 431--444.
21. Guo J and Geng Z. Collapsibility of logistic regression coefficients. Journal of Royal Statistical Society B: 263--267.
22. Lee Y and Nelder JA. Conditional and marginal models: another view. Statistical Science: 219--228.
23. Austin PC. The performance of different propensity score method for estimating marginal odds ratio. Statistics in Medicine: 3078--3094.
24. Austin PC, Grootendorst P, Normand SLT and Anderson GM. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: A Monte Carlo study. Statistics in Medicine: 754--768.
25. Carpenter J and Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine: 1141--1164.
26. Davison AC and Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press, Cambridge; 1997.
27. Greenland S. Interval estimation by simulation as an alternative to and extension of confidence intervals.
International Journal of Epidemiology: 1389--1397.

Table 1: Descriptive statistics of the study population: one-year survivals / patient totals on levels of age, gender and geographic area for (hospital type, cancer stage)

                                  (Hospital type, Cancer stage)
                                  (Large, Advanced)  (Small, Advanced)  (Large, Early)  (Small, Early)
Age              <= median (67)   12/41              2/13               11/13           2/2
                 > median         11/34              2/23               12/18           3/6
Gender           female           4/22               1/8                3/5             0/1
                 male             19/53              3/28               20/26           5/7
Geographic area  rural            11/32              4/26               13/15           3/6
                 urban            12/43              0/10               10/16           2/2

Table 2: ML estimates and their approximate covariance matrix for the parameters of model (2)
Parameters: $\alpha$, $\beta_1$, $\beta_2$, $\beta_3$, $\theta_1$, $\theta_2$, $\theta_3$, $\theta_4$
Estimates: 3.92 ($\alpha$)

Table 3: Point estimates (50 and 95 % confidence intervals) of INT, obtained respectively from (3a) the marginal distribution of $\widehat{\mathrm{INT}}$, (3b) the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into a tercile, and (3c) the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE2}}$ falls into a tercile

(3a) Point estimate (50 and 95 % confidence intervals) of INT

(3b) Point estimate (50 and 95 % confidence intervals) of INT over terciles of $\widehat{\mathrm{TE1}}$
Tercile of $\widehat{\mathrm{TE1}}$    Point estimate
($-\infty$, 0.105]                     0.18
(0.105, 0.246]                         0.03
(0.246, $+\infty$)

(3c) Point estimate (50 and 95 % confidence intervals) of INT over terciles of $\widehat{\mathrm{TE2}}$
Tercile of $\widehat{\mathrm{TE2}}$    Point estimate
($-\infty$, $-0.508$]                  0.18
($-0.508$, $-0.374$]
($-0.374$, $+\infty$)

Figure 1: (1a) The scatterplot for the approximate distribution of the ML estimate of (TE1, INT) and the 95 % confidence region of (TE1, INT); the point estimate of (TE1, INT) is equal to (0.12, 0.01). (1b) Approximate distribution for the ML estimate of INT. (1c) Approximate distribution for the ML estimate of TE1.
Figure 2: (2a) The scatterplot for the approximate distribution of the ML estimate of (TE2, INT) and the 95 % confidence region of (TE2, INT); the point estimate of (TE2, INT) is equal to ($-0.48$, 0.01). (2b) Approximate distribution for the ML estimate of TE2.
Figure 3