Point and interval estimation of exposure effects and interaction between the exposures based on logistic model for observational studies

Abstract

In observational studies with a dichotomous outcome in a population, researchers need to present the effects of exposures and the interaction between the exposures jointly in order to learn the relationship between the exposure effects and the interaction. In this article we study point and interval estimation of the exposure effects and the interaction based on a logistic model, where the exposure effects are measured by risk differences and the interaction is measured by the difference between risk differences. Using the approximate normal distribution of the maximum-likelihood (ML) estimate of the model parameters, we obtain an approximate non-normal distribution of the ML estimate of the exposure effects and the interaction. Using this distribution, we obtain the point estimate and confidence region of (exposure effect, interaction), as well as the point estimate and confidence interval of the interaction when the ML estimate of an exposure effect falls into a specified range. Our maximum-likelihood-based approach provides a simple but reliable method of interval estimation of exposure effects and the interaction.


Xiaoqin Wang, Weimin Ye and Li Yin

Department of Electronics, Mathematics and Natural Sciences, University of Gävle, SE-801 76, Gävle, Sweden
Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Box 281, SE-171 77, Stockholm, Sweden

* Corresponding author: Email: [email protected]

Summary

Suppose one conducts a randomized trial to investigate the effect of two exposures $z_1$ and $z_2$ on an outcome $y$ in a certain population, where $z_1$, $z_2$ and $y$ are all dichotomous, namely $z_1 = 0, 1$, $z_2 = 0, 1$ and $y = 0, 1$. In the randomized trial, covariates are essentially unassociated with the exposures $(z_1, z_2)$ and thus are not confounders. Oftentimes, one measures the effect of $z_1$ when $z_2 = 0$ by the risk difference

$$\mathrm{TE1} = \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0)$$

and the interaction between $z_1$ and $z_2$ by the difference between risk differences [1--5]

$$\mathrm{INT} = \{\mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1)\} - \mathrm{TE1}.$$

Likewise, one measures the effect of $z_2$ when $z_1 = 0$ by the risk difference

$$\mathrm{TE2} = \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0)$$

and obtains another expression of the above interaction as

$$\mathrm{INT} = \{\mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1) - \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0)\} - \mathrm{TE2}.$$

INT is also called the biological interaction because its expression in terms of risk differences is closely related to some well-known classifications of biological mechanisms [3, 4]. When presenting INT, one also presents TE1 and TE2 in order to learn the significance of INT relative to TE1 and TE2. For instance, INT = 0.05 has rather different significance relative to TE1 = 0.01 versus TE1 = 0.5.

Now suppose one conducts an observational study in which one also wishes to study TE1, TE2 and INT. In the observational study, there exist confounders, which are associated with both the outcome $y$ and the exposures $(z_1, z_2)$. The most common model used to adjust for confounding covariates in estimating the effect of $(z_1, z_2)$ is the logistic model. Most often, the logistic model is used to obtain the point estimate and confidence interval of the odds ratio as a measure of the exposure effects, as well as the point estimate and confidence interval of the ratio of odds ratios as a measure of the interaction. In recent years, the logistic model has also been used to obtain the point estimate and confidence interval of TE1 or TE2 as measures of exposure effects [6--9] and, in rare cases, of INT as a measure of interaction [10]. Notably, TE1, TE2 and INT are more interpretable than the odds ratio and the ratio of odds ratios. In those studies, the confidence interval of TE1, TE2 or INT is obtained by the normal approximation method or the parametric / non-parametric bootstrap method. The normal approximation method is based on the approximate variance of the ML estimate of TE1, TE2 or INT. The bootstrap method generates bootstrap samples, uses them to obtain the bootstrap distribution of TE1, TE2 or INT, and then derives the bootstrap confidence interval.

However, little is seen in the literature that reports TE1, TE2 and INT jointly, e.g. a confidence region of (TE1 or TE2, INT), which takes into account the correlation between the estimates of (TE1 or TE2, INT). In this article, we use the logistic model to obtain point and interval estimates of (TE1 or TE2, INT). Instead of deriving the covariance matrix of the two-dimensional ML estimate of (TE1 or TE2, INT) or using the bootstrap method, we generate an approximate distribution of the ML estimate of (TE1 or TE2, INT) and then use the distribution to obtain the interval estimate of (TE1 or TE2, INT).
We present our method by studying the effect of (hospital type, cancer stage) on cancer survival.
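As a small illustration of the definitions of TE1, TE2 and INT given above, the following Python sketch computes their crude (unadjusted) counterparts from the one-year survivals / patient totals reported in Table 1 of this article, summing the two age strata for each exposure group. The variable names are illustrative only. Because these crude values ignore the confounders, they are not the model-based estimates studied in this article and need not agree with them.

```python
# Crude (unadjusted) risk differences from survivals / totals.
# Keys are (z1, z2) = (hospital type, cancer stage); each list holds the
# two age strata of Table 1 as (one-year survivors, patients).
counts = {
    (1, 1): [(12, 41), (11, 34)],  # large hospital, advanced stage
    (0, 1): [(2, 13), (2, 23)],    # small hospital, advanced stage
    (1, 0): [(11, 13), (12, 18)],  # large hospital, early stage
    (0, 0): [(2, 2), (3, 6)],      # small hospital, early stage
}


def crude_risk(cells):
    """Proportion of one-year survivors pooled over the given strata."""
    survivors = sum(s for s, _ in cells)
    patients = sum(n for _, n in cells)
    return survivors / patients


p = {z: crude_risk(cells) for z, cells in counts.items()}

te1 = p[(1, 0)] - p[(0, 0)]                  # effect of z1 when z2 = 0
te2 = p[(0, 1)] - p[(0, 0)]                  # effect of z2 when z1 = 0
int_via_te1 = (p[(1, 1)] - p[(0, 1)]) - te1  # first expression of INT
int_via_te2 = (p[(1, 1)] - p[(1, 0)]) - te2  # second expression of INT

print(f"crude TE1 = {te1:.3f}, crude TE2 = {te2:.3f}")
print(f"crude INT = {int_via_te1:.3f} (both expressions agree: "
      f"{abs(int_via_te1 - int_via_te2) < 1e-12})")
```

The two expressions of INT coincide algebraically, which the last line checks numerically.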

2. Effect of (hospital type, cancer stage) on cancer survival in Sweden

2.1 Medical background and the data

In cancer treatment, an important question is which type of hospital, small or large, is superior in treating cancers of early or advanced stage, where the size of a hospital is determined by the number of cancer patients treated there. In an observational study, researchers studied the effect of (hospital type, cancer stage) on one-year survival of cardia cancer patients [11]. Cardia cancer is highly malignant with a bad prognosis, and its one-year survival is a good measure of the performance of hospital type and the impact of cancer stage in treating such cancers.

The data were collected between 1988 and 1995 on 150 cardia cancer patients treated in hospitals located in central and northern Sweden. The hospitals were categorized into two types: large type ($z_1 = 1$) when treating more than 10 patients during the period 1988-1995, and small type ($z_1 = 0$) when treating 10 or fewer patients. Cancer stages were categorized into two categories: advanced stage ($z_2 = 1$) when the eight-level stage index took values larger than or equal to five, and early stage ($z_2 = 0$) when the index took values smaller than five. The exposures in this study were then $\boldsymbol{z} = (z_1, z_2)$ = (hospital type, cancer stage). The outcome of a patient was successful ($y = 1$) versus unsuccessful ($y = 0$) survival for one year after diagnosis.

In addition to the exposure $\boldsymbol{z}$, possible confounders were documented: age ($x_1$), gender ($x_2$), and geographic area ($x_3$). Age was continuous, but gender and geographic area were categorical. Let $x_2 = 1$ indicate male and $x_2 = 0$ female. Geographic area was categorized into urban versus rural and coded by the binary indicator $x_3$. Let $\boldsymbol{x} = (x_1, x_2, x_3)$ be the set of the documented covariates. The descriptive statistics of these covariates and exposures are given in Table 1. The complete data of the study are given as supplementary material.

Let $y(z_1, z_2)$ be the potential outcome of each patient in the population under exposure $(z_1, z_2)$, where $z_1 = 1, 0$ indicates large versus small hospital type and $z_2 = 1, 0$ indicates advanced versus early cancer stage; see [12--16] for the framework of causal inference. We use $\mathrm{pr}\{y(z_1, z_2) = 1\}$ to denote the risk of $y(z_1, z_2) = 1$ for a patient under exposure $(z_1, z_2)$. Then the effect of the hospital type $z_1$ at the early cancer stage $z_2 = 0$ is measured by the risk difference

$$\mathrm{TE1} = \mathrm{pr}\{y(1, 0) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}.$$

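The effect of the cancer stage $z_2$ at the small hospital type $z_1 = 0$ is measured by the risk difference

$$\mathrm{TE2} = \mathrm{pr}\{y(0, 1) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}.$$

The interaction between $z_1$ and $z_2$ is measured by the difference between risk differences

$$\mathrm{INT} = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(0, 1) = 1\}] - [\mathrm{pr}\{y(1, 0) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}] = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(0, 1) = 1\}] - \mathrm{TE1},$$

or equivalently,

$$\mathrm{INT} = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(1, 0) = 1\}] - [\mathrm{pr}\{y(0, 1) = 1\} - \mathrm{pr}\{y(0, 0) = 1\}] = [\mathrm{pr}\{y(1, 1) = 1\} - \mathrm{pr}\{y(1, 0) = 1\}] - \mathrm{TE2}.$$

In some medical applications, one aims at the effect of $z_1$ in stratum $z_2 = 1, 0$, namely,

$$\mathrm{TE}(z_2) = \mathrm{pr}\{y(z_1 = 1) = 1 \mid z_2\} - \mathrm{pr}\{y(z_1 = 0) = 1 \mid z_2\},$$

where $y(z_1)$ is the potential outcome of each patient in stratum $z_2$ under exposure $z_1$, and then considers the modification of the effect of $z_1$ by $z_2$, namely,

$$\mathrm{EM} = \mathrm{TE}(z_2 = 1) - \mathrm{TE}(z_2 = 0).$$

However, EM compares the effects of $z_1$ in different strata $z_2 = 1, 0$ and does not have a causal interpretation. Thus one cannot use EM to address the question of the illustrative example described in the previous subsection. The situation is similar for the modification of the effect of $z_2$ by $z_1$.

Because we can only observe the potential outcome of a patient under one of the four exposures $(z_1, z_2) = (0, 0), (0, 1), (1, 0), (1, 1)$, we need a certain assumption to allow for estimation of TE1, TE2 and INT [12--16]. In the medical context of this study, it is reasonable to assume that there is no confounder other than the documented covariates $\boldsymbol{x} = (x_1, x_2, x_3)$. We denote the risk of $y = 1$ in stratum $(z_1, z_2, \boldsymbol{x})$ by $\mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})$. Then the assumption states

$$\mathrm{pr}\{y(z_1, z_2) = 1 \mid \boldsymbol{x}\} = \mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x}),$$

which implies that the unobservable potential outcome $y(z_1, z_2)$ can be assessed through the observable outcome $y$. Then we have

$$\mathrm{pr}\{y(z_1, z_2) = 1\} = \sum_{\boldsymbol{x}} \mathrm{pr}\{y(z_1, z_2) = 1 \mid \boldsymbol{x}\}\, \mathrm{pr}(\boldsymbol{x}) = \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}).$$

Inserting this into the above formulas for TE1, TE2 and INT, we obtain

$$\mathrm{TE1} = \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}), \tag{1a}$$

$$\mathrm{TE2} = \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 0, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}), \tag{1b}$$

$$\begin{aligned}
\mathrm{INT} &= \Big\{\sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 0, z_2 = 1, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x})\Big\} - \mathrm{TE1} \\
&= \Big\{\sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 1, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1 = 1, z_2 = 0, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x})\Big\} - \mathrm{TE2}.
\end{aligned} \tag{1c}$$

In particular, in a randomized trial we have $\mathrm{pr}(\boldsymbol{x}) = \mathrm{pr}(\boldsymbol{x} \mid z_1, z_2)$, which implies that

$$\sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x}) = \sum_{\boldsymbol{x}} \mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})\, \mathrm{pr}(\boldsymbol{x} \mid z_1, z_2) = \mathrm{pr}(y = 1 \mid z_1, z_2).$$

Inserting this into (1a), (1b) and (1c), we see that TE1, TE2 and INT given by (1a), (1b) and (1c) are the same as those for randomized trials given in the introduction of this article.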

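The risk $\mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})$ of $y = 1$ in stratum $(z_1, z_2, \boldsymbol{x})$ is modeled by a logistic model. By likelihood-ratio-based significance testing of the model parameters, we obtain

$$\log \frac{\mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})}{1 - \mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x})} = \alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 x_1). \tag{2}$$

In this model we include, in addition to one term for the exposure product $z_1 z_2$, another term for the hospital type-age product $z_1 x_1$, because of a somewhat small p-value, 0.20, for the significance test of $\theta_4 \neq 0$. Let $\pi = (\alpha, \beta_1, \beta_2, \beta_3, \theta_1, \theta_2, \theta_3, \theta_4)$ be the set of all model parameters. The ML estimate $\hat{\pi} = (\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4)$ and its approximate covariance matrix $\hat{\Sigma}$ (i.e. the inverse of the observed information) are given in Table 2. From model (2), we obtain

$$\mathrm{pr}(y = 1 \mid z_1, z_2, \boldsymbol{x}) = \frac{\exp\{\alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 x_1)\}}{1 + \exp\{\alpha + \beta_1 z_1 + \beta_2 z_2 + \beta_3 (z_1 z_2) + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 (z_1 x_1)\}}. \tag{3}$$

Inserting (3) into (1a)-(1c) and replacing the probability $\mathrm{pr}(\boldsymbol{x})$ by the proportion $\mathrm{prop}(\boldsymbol{x})$ of the covariates $\boldsymbol{x}$ in the sample, we obtain TE1, TE2 and INT as functions of $\pi$:

$$\mathrm{TE1}(\pi) = \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\boldsymbol{x}), \tag{4a}$$

$$\mathrm{TE2}(\pi) = \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\boldsymbol{x}), \tag{4b}$$

$$\begin{aligned}
\mathrm{INT}(\pi) &= \Big\{\sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}{1 + \exp\{\alpha + \beta_2 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3\}}\, \mathrm{prop}(\boldsymbol{x})\Big\} - \mathrm{TE1}(\pi) \\
&= \Big\{\sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \beta_2 + \beta_3 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\boldsymbol{x}) - \sum_{\boldsymbol{x}} \frac{\exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}{1 + \exp\{\alpha + \beta_1 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_1\}}\, \mathrm{prop}(\boldsymbol{x})\Big\} - \mathrm{TE2}(\pi).
\end{aligned} \tag{4c}$$

Replacing $\pi$ in $\mathrm{TE1}(\pi)$, $\mathrm{TE2}(\pi)$ and $\mathrm{INT}(\pi)$ by $\hat{\pi}$, we obtain the ML estimates $\widehat{\mathrm{TE1}} = \mathrm{TE1}(\hat{\pi})$, $\widehat{\mathrm{TE2}} = \mathrm{TE2}(\hat{\pi})$ and $\widehat{\mathrm{INT}} = \mathrm{INT}(\hat{\pi})$. Although $\hat{\pi}$ is biased in finite samples, $\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$ and $\widehat{\mathrm{INT}}$ are unbiased [17]. Clearly, $\hat{\pi}$ is consistent and so are $\widehat{\mathrm{TE1}}$, $\widehat{\mathrm{TE2}}$ and $\widehat{\mathrm{INT}}$. With $\hat{\pi}$ given in Table 2, these ML estimates are equal to $\widehat{\mathrm{TE1}} = 0.12$, $\widehat{\mathrm{TE2}} = -0.48$ and $\widehat{\mathrm{INT}} = 0.01$.

We are going to obtain the confidence region of (TE1 or TE2, INT) by generating an approximate distribution of ($\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$). Let $p$ be a random variable that follows the normal distribution $N(\hat{\pi}, \hat{\Sigma})$, namely $p \sim N(\hat{\pi}, \hat{\Sigma})$, where $\hat{\pi}$ and $\hat{\Sigma}$ are given in Table 2. This normal distribution is a good approximation to the distribution of the ML estimate $\hat{\pi}$, because the parameters of a logistic model have good asymptotic normality [18]. Replacing $\pi$ in $\{\mathrm{TE1}(\pi), \mathrm{TE2}(\pi), \mathrm{INT}(\pi)\}$ by $p$ and then using the normal distribution of $p$, we generate the distribution of $\{\mathrm{TE1}(p), \mathrm{TE2}(p), \mathrm{INT}(p)\}$, which approximates the distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$. With this approximate distribution, we obtain the approximate distribution of ($\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$, $\widehat{\mathrm{INT}}$) and then the confidence region of (TE1 or TE2, INT).

Because $\{\mathrm{TE1}(\pi), \mathrm{TE2}(\pi), \mathrm{INT}(\pi)\}$ is a smooth monotone bounded function of $\pi$ according to (4a)-(4c), the performance of $p$ in approximating $\hat{\pi}$ determines the performance of $\{\mathrm{TE1}(p), \mathrm{TE2}(p), \mathrm{INT}(p)\}$ in approximating $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$. Exact and approximate distributions of the ML estimate of an exposure effect have been simulated in the setting of a large exposure effect, a large exposure-covariate interaction and strong confounding, and it was found that the exact and approximate distributions agree rather well even in their tails [17].

This method of obtaining the confidence region of (TE1 or TE2, INT) is analogous to the common method of obtaining the confidence interval of an odds ratio. For instance, suppose that a coefficient $\beta_j$ of model (2) is a parameter of interest and we wish to obtain the 95 % confidence interval of the odds ratio $\exp(\beta_j)$. The estimate $\hat{\beta}_j$ has good asymptotic normality: its distribution is approximately normal with a variance estimate $\widehat{\mathrm{var}}(\hat{\beta}_j)$. One then generates an approximate distribution of $\exp(\hat{\beta}_j)$ by using the approximate normal distribution of $\hat{\beta}_j$ and thus obtains the 95 % confidence interval of the odds ratio as $\exp\{\hat{\beta}_j \pm 1.96 \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_j)}\}$, where the number 1.96 is the 97.5th percentile of the standard normal distribution. This confidence interval reflects the asymmetry of the distribution of $\exp(\hat{\beta}_j)$. This method has also been used to obtain confidence intervals of other measures of exposure effect such as the risk ratio [19]. In the next section, we shall describe the procedure for obtaining the approximate distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$ and various point and interval estimates of TE1, TE2 and INT.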
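The mapping from the model parameters to TE1, TE2 and INT in (4a)-(4c) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the covariate matrix `X` and the parameter vector `pi_hat` below are made-up values for illustration (the actual estimates are those of Table 2), and the function names are hypothetical. Averaging over the sample rows of `X` plays the role of the sum over $\boldsymbol{x}$ weighted by prop($\boldsymbol{x}$).

```python
import numpy as np


def expit(t):
    """Inverse logit, exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + np.exp(-t))


def standardized_risk(pi, z1, z2, X):
    """Risk of y = 1 under exposure (z1, z2): model (3) averaged over the
    sample covariates. Columns of X are (x1, x2, x3)."""
    alpha, b1, b2, b3, t1, t2, t3, t4 = pi
    eta = (alpha + b1 * z1 + b2 * z2 + b3 * z1 * z2
           + t1 * X[:, 0] + t2 * X[:, 1] + t3 * X[:, 2]
           + t4 * z1 * X[:, 0])
    return expit(eta).mean()


def effects(pi, X):
    """Return (TE1, TE2, INT) as functions of pi, following (4a)-(4c)."""
    p00 = standardized_risk(pi, 0, 0, X)
    p10 = standardized_risk(pi, 1, 0, X)
    p01 = standardized_risk(pi, 0, 1, X)
    p11 = standardized_risk(pi, 1, 1, X)
    te1 = p10 - p00
    te2 = p01 - p00
    inter = (p11 - p01) - te1  # equals (p11 - p10) - te2
    return te1, te2, inter


# Hypothetical inputs, NOT the study data or the Table 2 estimates.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(67, 10, 150),   # x1: age
    rng.integers(0, 2, 150),   # x2: gender indicator
    rng.integers(0, 2, 150),   # x3: geographic area indicator
])
pi_hat = np.array([3.0, 0.5, -2.0, 0.1, -0.05, 0.2, -0.1, 0.0])

te1, te2, inter = effects(pi_hat, X)
print(f"TE1 = {te1:.3f}, TE2 = {te2:.3f}, INT = {inter:.3f}")
```

Evaluating `effects` at the ML estimate gives the point estimates; evaluating it at draws from the normal approximation gives the simulated distribution described in the text.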

3. Point and interval estimates of TE1, TE2 and INT

3.1 Approximate distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$ and the derived distributions

First, we draw $p$ from $p \sim N(\hat{\pi}, \hat{\Sigma})$. Second, we replace $\pi$ by $p$ in formulas (4a)-(4c) to get $\{\mathrm{TE1}(p), \mathrm{TE2}(p), \mathrm{INT}(p)\}$, which approximates $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$. We iterate the procedure, 1000 times in this article, to get 1000 sets of approximate values of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$. These 1000 sets form an approximate distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$.

All pairs $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$ in those 1000 sets form an approximate distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$. All $\widehat{\mathrm{TE1}}$ in those 1000 pairs form an approximate distribution of $\widehat{\mathrm{TE1}}$. All $\widehat{\mathrm{INT}}$ in those 1000 pairs form an approximate distribution of $\widehat{\mathrm{INT}}$. The three distributions are presented in Figure 1. Similarly, we obtain approximate distributions for $(\widehat{\mathrm{TE2}}, \widehat{\mathrm{INT}})$ and $\widehat{\mathrm{TE2}}$, which are presented in Figure 2.

Restricting $\widehat{\mathrm{TE1}}$ to a given range, all corresponding $\widehat{\mathrm{INT}}$ form an approximate conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into the given range. The ranges of $\widehat{\mathrm{TE1}}$ we consider are the terciles of $\widehat{\mathrm{TE1}}$, i.e. three consecutive ranges of $\widehat{\mathrm{TE1}}$ such that the probability of $\widehat{\mathrm{TE1}}$ occurring in each range is 1/3. The three conditional distributions are presented in Figure 3. Similarly, we obtain three approximate conditional distributions of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE2}}$ falls into the terciles, as presented in Figure 4.
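The simulation procedure of this subsection can be sketched in Python as below. All numerical inputs are assumptions made for illustration: `pi_hat`, `Sigma_hat` and the covariate matrix `X` are made-up stand-ins for the Table 2 estimates and the study data, and `effects` is a hypothetical helper that evaluates (4a)-(4c) by averaging over the sample covariates.

```python
import numpy as np


def expit(t):
    return 1.0 / (1.0 + np.exp(-t))


def effects(pi, X):
    """(TE1, TE2, INT) at parameter value pi, following (4a)-(4c)."""
    a, b1, b2, b3, t1, t2, t3, t4 = pi
    base = a + t1 * X[:, 0] + t2 * X[:, 1] + t3 * X[:, 2]
    p00 = expit(base).mean()
    p10 = expit(base + b1 + t4 * X[:, 0]).mean()
    p01 = expit(base + b2).mean()
    p11 = expit(base + b1 + b2 + b3 + t4 * X[:, 0]).mean()
    te1, te2 = p10 - p00, p01 - p00
    return te1, te2, (p11 - p01) - te1


# Hypothetical stand-ins for the data and for Table 2 (illustration only).
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(67, 10, 150),
                     rng.integers(0, 2, 150),
                     rng.integers(0, 2, 150)])
pi_hat = np.array([3.0, 0.5, -2.0, 0.1, -0.05, 0.2, -0.1, 0.0])
Sigma_hat = np.diag([0.25, 0.09, 0.09, 0.16, 1e-5, 0.04, 0.04, 1e-5])

# Step 1: draw p ~ N(pi_hat, Sigma_hat), 1000 times.
draws_pi = rng.multivariate_normal(pi_hat, Sigma_hat, size=1000)

# Step 2: map each draw through (4a)-(4c).
draws = np.array([effects(p, X) for p in draws_pi])
te1, te2, inter = draws.T  # marginal samples of TE1, TE2 and INT

# The pairs (te1, inter) form the joint sample; restricting te1 to a range
# with a boolean mask gives the conditional sample of INT on that range.
q1, q2 = np.quantile(te1, [1 / 3, 2 / 3])
inter_lower = inter[te1 <= q1]
print(draws.shape, np.corrcoef(te1, inter)[0, 1], inter_lower.size)
```

Only the fitted parameter estimate, its covariance matrix and the sample covariates are needed; the outcome data are not resampled.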

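Point estimates of TE1, TE2 and INT have been obtained in Section 2.4 and are equal to $\widehat{\mathrm{TE1}} = 0.12$, $\widehat{\mathrm{TE2}} = -0.48$ and $\widehat{\mathrm{INT}} = 0.01$. In most practical studies with interval estimates, it is sufficient to have confidence regions of (TE1, INT) and (TE2, INT) rather than a confidence volume of (TE1, TE2, INT).

We are going to use the distribution of $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$ obtained in Section 3.1 to obtain the $(1 - \alpha)$ confidence region of (TE1, INT). We can arbitrarily choose a confidence curve that partitions the whole region of (TE1, INT) into a confidence region in which $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$ occurs with probability $(1 - \alpha)$ and its complementary region in which $(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$ occurs with probability $\alpha$. Here we take the confidence region by imposing that the confidence curve is an ellipse and further requiring that the confidence region enclosed by the ellipse has the smallest area at the $(1 - \alpha)$ confidence level. First, we calculate the means $\{\mathrm{mean}(\widehat{\mathrm{TE1}}), \mathrm{mean}(\widehat{\mathrm{INT}})\}$, the variances $\mathrm{var}(\widehat{\mathrm{TE1}})$ and $\mathrm{var}(\widehat{\mathrm{INT}})$, and the covariance $\mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}})$. Second, we use the normal distribution

$$N\left( \begin{pmatrix} \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix},\ \begin{pmatrix} \mathrm{var}(\widehat{\mathrm{TE1}}) & \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) \\ \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) & \mathrm{var}(\widehat{\mathrm{INT}}) \end{pmatrix} \right)$$

to obtain the $(1 - \alpha)$ confidence curve for (TE1, INT) by the ellipse formula

$$\begin{pmatrix} \mathrm{TE1} - \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{INT} - \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix}^{T} \begin{pmatrix} \mathrm{var}(\widehat{\mathrm{TE1}}) & \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) \\ \mathrm{cov}(\widehat{\mathrm{TE1}}, \widehat{\mathrm{INT}}) & \mathrm{var}(\widehat{\mathrm{INT}}) \end{pmatrix}^{-1} \begin{pmatrix} \mathrm{TE1} - \mathrm{mean}(\widehat{\mathrm{TE1}}) \\ \mathrm{INT} - \mathrm{mean}(\widehat{\mathrm{INT}}) \end{pmatrix} = \chi^2_2(1 - \alpha),$$

where $\chi^2_2(1 - \alpha)$ is the $100(1 - \alpha)$th percentile of the central chi-square distribution with two degrees of freedom. The 95 % confidence region of (TE1, INT) is shown in Figure 1. In the same way, we obtain the 95 % confidence region of (TE2, INT), which is presented in Figure 2.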
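The elliptical confidence region can be sketched in Python as follows. The simulated pairs below are hypothetical bivariate normal draws standing in for the 1000 simulated values of the ML estimate of (TE1, INT); the function names are illustrative. For two degrees of freedom the chi-square percentile has the closed form $\chi^2_2(1 - \alpha) = -2 \ln \alpha$, which avoids the need for a statistics library.

```python
import numpy as np


def confidence_ellipse(samples, alpha=0.05):
    """Center, covariance matrix and chi-square threshold of the
    (1 - alpha) ellipse fitted to an (n, 2) array of simulated pairs."""
    center = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    # 100(1 - alpha)th percentile of chi-square with 2 degrees of freedom.
    threshold = -2.0 * np.log(alpha)
    return center, cov, threshold


def inside(points, center, cov, threshold):
    """True for each point whose quadratic form is within the ellipse."""
    d = points - center
    q = np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)
    return q <= threshold


# Hypothetical simulated pairs (TE1, INT); not the article's draws.
rng = np.random.default_rng(2)
samples = rng.multivariate_normal(
    [0.12, 0.01], [[0.010, -0.008], [-0.008, 0.010]], size=1000)

center, cov, threshold = confidence_ellipse(samples, alpha=0.05)
coverage = inside(samples, center, cov, threshold).mean()
print(f"threshold = {threshold:.3f}, share of draws inside = {coverage:.3f}")
```

The share of simulated pairs falling inside the ellipse gives a quick check of how well the elliptical shape suits the simulated distribution.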

We are going to use the distribution of $\widehat{\mathrm{INT}}$ to obtain the $(1 - \alpha)$ confidence interval of INT. The confidence interval is not unique because the upper and lower confidence limits can be arbitrarily chosen such that the corresponding upper and lower confidence levels, denoted by $\alpha_u$ and $\alpha_l$ respectively, satisfy $\alpha_u + \alpha_l = \alpha$. Here we take the confidence interval by imposing the condition $\alpha_u = \alpha_l = 0.025$. Then the 95 % confidence interval of INT is ( − $\widehat{\mathrm{INT}} = 0.01$ obtained in Section 2.4.

We are going to use the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into a tercile of $\widehat{\mathrm{TE1}}$ to obtain the point estimate and $(1 - \alpha)$ confidence interval of INT over the tercile. The point estimate of INT is the mean of the conditional distribution. Over the lower tercile of $\widehat{\mathrm{TE1}}$, the point estimate (95 % confidence interval) of INT is 0.18 ( − $\widehat{\mathrm{TE1}}$, the point estimate (95 % confidence interval) of INT is 0.03 (0.20, 0.21). Over the upper tercile of $\widehat{\mathrm{TE1}}$, the point estimate (95 % confidence interval) of INT is − − $\widehat{\mathrm{TE2}}$, the point estimate (95 % confidence interval) of INT are 0.18 ( − − − −
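The equal-tailed interval and the tercile-wise summaries described above can be sketched in Python as follows. The simulated values are hypothetical draws, generated here with a negative correlation between the two components purely as an assumption for illustration; the function names are likewise illustrative.

```python
import numpy as np


def percentile_interval(values, alpha=0.05):
    """Equal-tailed (1 - alpha) interval: alpha/2 cut off in each tail."""
    lower, upper = np.quantile(values, [alpha / 2, 1 - alpha / 2])
    return lower, upper


def tercile_summaries(te, inter, alpha=0.05):
    """Mean and equal-tailed interval of INT within each tercile of TE."""
    q1, q2 = np.quantile(te, [1 / 3, 2 / 3])
    masks = {
        "lower": te <= q1,
        "middle": (te > q1) & (te <= q2),
        "upper": te > q2,
    }
    return {
        name: (inter[m].mean(), *percentile_interval(inter[m], alpha))
        for name, m in masks.items()
    }


# Hypothetical simulated draws of (TE1, INT); not the article's values.
rng = np.random.default_rng(3)
te1, inter = rng.multivariate_normal(
    [0.12, 0.01], [[0.010, -0.008], [-0.008, 0.010]], size=1000).T

ci = percentile_interval(inter)
summaries = tercile_summaries(te1, inter)
print(f"marginal 95% interval of INT: ({ci[0]:.2f}, {ci[1]:.2f})")
for name, (mean, lo, hi) in summaries.items():
    print(f"{name} tercile: INT = {mean:.2f} ({lo:.2f}, {hi:.2f})")
```

Choosing unequal tail levels only requires passing different quantile levels to `np.quantile`, as long as the two tail levels add up to $\alpha$.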

4. Interpretation for various interval estimates of TE1, TE2 and INT

From Figures 1b, 1c and 2b, we cannot observe the correlation of $\widehat{\mathrm{INT}}$ with $\widehat{\mathrm{TE1}}$ or $\widehat{\mathrm{TE2}}$. The confidence interval of INT in Table 3a indicates possible values for INT, i.e. ( − $\widehat{\mathrm{INT}}$ is highly correlated with $\widehat{\mathrm{TE1}}$. The confidence region of (TE1, INT) indicates possible values for (TE1, INT) at the 95 % confidence level. From Figure 2a, we see a similar situation for (TE2, INT).

From Figures 3a-3c, we see that INT has different values at different TE1. The confidence interval of INT over a tercile of $\widehat{\mathrm{TE1}}$ in Table 3b indicates possible values of INT at the 95 % confidence level when $\widehat{\mathrm{TE1}}$ falls in the tercile. Such confidence intervals allow us to perform a stratified analysis of INT over strata of $\widehat{\mathrm{TE1}}$. If the effect of large versus small hospital types for early stage cancer is $\widehat{\mathrm{TE1}} \le 0.105$, then we have an increase of the effect for advanced stage cancer, i.e. $\widehat{\mathrm{INT}}$ (95 % confidence interval) = 0.18 ( − $\widehat{\mathrm{TE1}} > 0.246$.

Similarly, from Figures 4a-4c and Table 3c, we see that INT has different values at different TE2. If the effect of advanced versus early cancer stage for small hospital type is $\widehat{\mathrm{TE2}} \le -0.508$, then we have an increase of the effect for large hospital type, i.e. $\widehat{\mathrm{INT}}$ (95 % confidence interval) = 0.18 ( − $\widehat{\mathrm{TE2}} > -0.374$. Although the TE2-INT relation is similar to the TE1-INT relation as revealed in the analysis above, they cannot be obtained from each other.

5. Discussion and conclusions

To learn the relationship between exposure effects and the interaction between the exposures, one needs to report both the exposure effects and the interaction. Because the ML estimates of the exposure effects and the interaction are highly correlated, one needs to report them jointly. In this article, we have obtained the point estimate and confidence region of (TE1, INT), those of (TE2, INT), and the point estimate and confidence interval of INT when the ML estimate of TE1 or TE2 falls into a specified range.

Because of its statistical advantages, the logistic model is ubiquitous in observational studies of exposure effects on dichotomous outcomes of populations. One major advantage of a logistic model is that the model parameters have good asymptotic normality: the normal distribution is a good approximate distribution for the ML estimate of the model parameters; see, e.g., [18]. One major disadvantage of the logistic model is the use of the odds ratio as the measure of the exposure effects; see the rich literature on non-collapsibility of the odds ratio [20, 21, 12, 22-24]. In this article, we have kept the advantage of the logistic model while avoiding the disadvantage by using the risk difference as the measure of the exposure effects and the difference between risk differences as the measure of the interaction. We have used the approximate normal distribution of the ML estimate of the model parameters to obtain the approximate non-normal distribution of the ML estimates of TE1, TE2 and INT, and then their interval estimates including confidence regions. This maximum-likelihood-based approach provides a simple but reliable method of interval estimation of TE1, TE2 and INT, which can be easily implemented with any software that generates normally distributed random numbers.

Two methods are available in the literature to calculate the confidence interval of TE1 (or TE2, or INT in rare cases) based on the logistic model [6-10], i.e. the normal approximation method and the bootstrap method, but they have not been used to calculate the confidence region of (TE1, INT). In the normal approximation method, one derives the approximate variance of the ML estimate of TE1 by using the delta method and then uses the variance to obtain the normal approximation confidence interval of TE1. To obtain the confidence region of (TE1, INT), however, one needs to derive the approximate covariance matrix of the ML estimate of (TE1, INT), which is tedious. In the bootstrap method, one generates bootstrap samples by the parametric or non-parametric bootstrap and then uses the bootstrap samples to obtain the bootstrap confidence interval of TE1. However, it is highly difficult to correct the finite-sample bias arising from the bootstrap sampling, particularly with a dichotomous outcome [25--27]. It is even more difficult to correct the bias for the bootstrap confidence region of (TE1, INT) [25--27]. In comparison, our method of obtaining interval estimates, including confidence regions, of TE1, TE2 and INT is based on the approximate distribution of their maximum-likelihood estimates and does not involve resampling of the data. Therefore our method does not belong to the category of resampling methods such as the bootstrap method.

Acknowledgement

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Blot WJ and Day NE. Synergism and interaction: are they equivalent? American Journal of Epidemiology: 99–100.
2. Rothman KJ, Greenland S and Walker AM. Concepts of interaction. American Journal of Epidemiology: 467–470.
3. Rothman KJ, Greenland S and Lash TL. Modern Epidemiology (3rd edition). Lippincott Williams & Wilkins, Philadelphia; 2008.
4. Greenland S. Interactions in Epidemiology: Relevance, Identification and Estimation. Epidemiology: 14–17.
5. VanderWeele TJ. On the Distinction between Interaction and Effect Modification. American Journal of Epidemiology: 863–871.
6. McNutt LA, Wu C, Xue X and Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. American Journal of Epidemiology: 940–943.
7. Newcombe RG. A deficiency of the odds ratio as a measure of effect size. Statistics in Medicine: 4235–4240.
8. Greenland S. Model-based Estimation of Relative Risk and Other Epidemiologic Measures in Studies of Common Outcomes and Case-Control Studies. American Journal of Epidemiology: 301–305.
9. Austin PC. Absolute risk reductions, relative risks, relative risk reductions, and numbers needed to treat can be obtained from a logistic regression model. Journal of Clinical Epidemiology: 2–6.
10. Nie L, Chu H, Li F and Cole SR. Relative excess risk due to interaction: resampling-based confidence intervals. Epidemiology: 552–556.
11. Hansson LE, Ekstrom AM, Bergstrom R and Nyren O. Surgery for stomach cancer in a defined Swedish population: current practices and operative results. Swedish Gastric Cancer Study Group. The European Journal of Surgery, 787–975.
12. Greenland S, Robins JM and Pearl J. Confounding and collapsibility in causal inference. Statistical Science: 29–46.
13. Greenland S and Robins JM. Identifiability, exchangeability, epidemiological confounding. International Journal of Epidemiology: 413–419.
14. Rosenbaum PR and Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika: 41–55.
15. Rosenbaum PR. Observational Studies. Springer, New York; 1995.
16. Rubin DB, Wang X, Yin L and Zell E. Estimating the Effect of Treating Hospital Type on Cancer Survival in Sweden Using Principal Stratification. In The Handbook of Applied Bayesian Analysis, eds. T. O'Hagan and M. West, Oxford University Press, Oxford; 2009.
17. Wang X, Jin Y and Yin L. Point and Interval Estimations of Marginal Risk Difference by Logistic Model. Communications in Statistics: Theory and Methods.
18. Lindsey JK. Parametric Statistical Inference. Clarendon Press, Oxford; 1996.
19. Wang X, Jin Y and Yin L. Measuring and estimating treatment effect on dichotomous outcome of a population. Statistical Methods in Medical Research.
20. Gail MH, Wieand S and Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika: 431–444.
21. Guo J and Geng Z. Collapsibility of logistic regression coefficients. Journal of the Royal Statistical Society B: 263–267.
22. Lee Y and Nelder JA. Conditional and marginal models: another view. Statistical Science: 219–228.
23. Austin PC. The performance of different propensity score methods for estimating marginal odds ratio. Statistics in Medicine: 3078–3094.
24. Austin PC, Grootendorst P, Normand SLT and Anderson GM. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: A Monte Carlo study. Statistics in Medicine: 754–768.
25. Carpenter J and Bithell J. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Statistics in Medicine: 1141–1164.
26. Davison AC and Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press, Cambridge; 1997.
27. Greenland S. Interval estimation by simulation as an alternative to and extension of confidence intervals. International Journal of Epidemiology: 1389–1397.

Table 1: Descriptive statistics of the study population: one-year survivals / patient totals on levels of age, gender and geographic area for (Hospital type, Cancer stage)

                       (Hospital type, Cancer stage)
                       (Large, Advanced)  (Small, Advanced)  (Large, Early)  (Small, Early)
Age
  <= median (67)       12/41              2/13               11/13           2/2
  > median             11/34              2/23               12/18           3/6
Gender
  female               4/22               1/8                3/5             0/1
  male                 19/53              3/28               20/26           5/7
Geographic area
  rural                11/32              4/26               13/15           3/6
  urban                12/43              0/10               10/16           2/2

Table 2

ML estimates and their approximate covariance matrix for the parameters of model (2)

Parameters α β β β θ θ θ θ
Estimates 3.92 − − − − α − − − − − β − − − β − − − − β − − − − θ − − θ − − − − θ − − − − θ − − − − −

Table 3: Point estimates (50 and 95 % confidence intervals) of INT, obtained respectively from (3a) the marginal distribution of $\widehat{\mathrm{INT}}$, (3b) the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE1}}$ falls into the tercile, and (3c) the conditional distribution of $\widehat{\mathrm{INT}}$ when $\widehat{\mathrm{TE2}}$ falls into the tercile

(3a) Point estimate (50 and 95 % confidence intervals) of INT
− −

(3b) Point estimate (50 and 95 % confidence intervals) of INT over tercile of $\widehat{\mathrm{TE1}}$
Tercile of $\widehat{\mathrm{TE1}}$
( −∞ , 0.105]     0.18 ( −
(0.105, 0.246]    0.03 ( − −
(0.246, +∞ )      − − −

(3c) Point estimate (50 and 95 % confidence intervals) of INT over tercile of $\widehat{\mathrm{TE2}}$
Tercile of $\widehat{\mathrm{TE2}}$
( −∞ , − − ( − − − − ( − +∞ ) − − − −

Figure 1: (1a) The scatterplot for the approximate distribution of the ML estimate of (TE1, INT) and the 95 % confidence region of (TE1, INT); the point estimate of (TE1, INT) is equal to (0.12, 0.01). (1b) Approximate distribution for the ML estimate of INT. (1c) Approximate distribution for the ML estimate of TE1.

Figure 2: (2a) The scatterplot for the approximate distribution of the ML estimate of (TE2, INT) and the 95 % confidence region of (TE2, INT); the point estimate of (TE2, INT) is equal to (−0.48, 0.01). (2b) Approximate distribution for the ML estimate of TE2.

Figure 3