On Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) and Estimation for Causal Inference
Zhonghua Liu, Ting Ye, Baoluo Sun, Mary Schooling, Eric Tchetgen Tchetgen
Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong
Department of Statistics, The Wharton School of Business, University of Pennsylvania, Philadelphia, PA, USA
Department of Statistics and Applied Probability, National University of Singapore, Singapore
CUNY Graduate School of Public Health and Health Policy, New York, NY, USA
School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong
* [email protected]

October 2, 2020

Abstract

Background: Standard Mendelian randomization analysis can produce biased results if the genetic variant defining the instrumental variable (IV) is confounded and/or has a horizontal pleiotropic effect on the outcome of interest not mediated by the treatment.
Development: We provide novel identification conditions for the causal effect of a treatment in the presence of unmeasured confounding, by leveraging an invalid IV for which both the IV independence and exclusion restriction assumptions may be violated. The proposed Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; (ii) that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have, respectively, appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MR MiSTERI is shown to be particularly advantageous in the presence of pervasive heterogeneity of pleiotropic effects on the additive scale, a setting in which two recently proposed robust estimation methods, MR GxE and MR GENIUS, can be severely biased. For estimation, we propose a simple and consistent three-stage estimator that can be used as a preliminary estimator to a carefully constructed one-step-update estimator, which is guaranteed to be more efficient under the assumed model. In order to incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MR MaWII MiSTERI) approach for strengthened identification and improved accuracy. We have developed an R package MR-MiSTERI for public use of all proposed methods.
Application: We illustrate MR MiSTERI in an application using UK Biobank data to evaluate the causal relationship between body mass index and glucose, obtaining inferences that are robust to unmeasured confounding, leveraging many weak and potentially invalid candidate genetic IVs.
Conclusion: MaWII MiSTERI is shown to be robust to horizontal pleiotropy, violation of the IV independence assumption, and weak IV bias. Both simulation studies and real data analysis results demonstrate the robustness of the proposed MR MiSTERI methods.

Keywords: Treatment effect on the treated, Mendelian randomization, horizontal pleiotropy, invalid instrument, unmeasured confounding, UK Biobank, weak instrument.

Introduction
Mendelian randomization (MR) is an instrumental variable (IV) approach that uses genetic variants, for example, single-nucleotide polymorphisms (SNPs), to infer the causal effect of a modifiable treatment on a health outcome (Smith and Ebrahim 2003). MR has recently gained popularity in epidemiological studies because, under certain conditions, it can provide unbiased estimates of causal effects even in the presence of unmeasured exposure-outcome confounding. For example, findings from a recent MR analysis assessing the causal relationship between low-density lipoprotein cholesterol and coronary heart disease (Ference et al. 2017) in an observational study are consistent with the results of earlier randomized clinical trials (Scandinavian Simvastatin Survival Study Group 1994). For a SNP to be a valid IV, it must satisfy the following three core assumptions (Didelez and Sheehan 2007; Lawlor et al. 2008): (A1)
IV relevance: the SNP must be associated (not necessarily causally) with the exposure; (A2)
IV independence: the SNP must be independent of any unmeasured confounder of the exposure-outcome relationship; (A3)
Exclusion restriction: the SNP cannot have a direct effect on the outcome variable not mediated by the treatment, that is, no horizontal pleiotropic effect can be present.

The causal diagram in Figure 1(a) graphically represents the three core assumptions of a valid IV. It is well known that even when assumptions (A1)-(A3) hold for a given IV, the average causal effect of the treatment on the outcome cannot be point identified without an additional condition, the nature of which dictates the interpretation of the identified causal effect. Specifically, Angrist et al. (1996) proved that under (A1)-(A3) and a monotonicity assumption about the IV-treatment relationship, the so-called complier average treatment effect is nonparametrically identified. More recently, Wang and Tchetgen Tchetgen (2018) established identification of the population average causal effect under (A1)-(A3) and an additional assumption of no unmeasured common effect modifier of the association between the IV and the endogenous variable, and the treatment causal effect on the outcome. A special case of this assumption is that the association between the IV and the treatment variable is constant on the additive scale across values of the unmeasured confounder (Tchetgen Tchetgen et al. 2020). In a separate strand of work, Robins (1994) identified the effect of treatment on the treated (ETT) under (A1)-(A3) and a so-called no current-treatment value interaction assumption (A.4a), that the effect of treatment on the treated is constant on the additive scale across values of the IV. In contrast, Liu et al. (2020) established identification of the ETT under (A1)-(A3) and an assumption (A.4b) that the selection bias function, defined as the odds ratio association between the potential outcome under no treatment and the treatment variable, is constant across values of the IV.
Note that under the IV DAG in Figure 1(a), assumption (A1) is empirically testable while (A2) and (A3) cannot be refuted empirically without a different assumption being imposed (Glymour et al. 2012). Possible violation or near violation of assumption (A1), known as the weak IV problem, poses an important challenge in MR studies as the associations between a single SNP IV and complex traits can be fairly weak (Staiger and Stock 1997; Stock et al. 2002). But massive genotyped datasets have provided many such weak IVs. This has motivated a rich body of work addressing the weak IV problem under a many weak instruments framework, from a generalized method of moments perspective given individual-level data (Chao and Swanson 2005; Newey and Windmeijer 2009; Davies et al. 2015), and also from a summary-data perspective (Zhao et al. 2019a,b; Ye et al. 2019). Violation of assumption (A2) can occur due to population stratification, or when the selected SNP IV is in linkage disequilibrium (LD) with another genetic variant which has a direct effect on the outcome (Didelez and Sheehan 2007). Violations of (A3) can occur when the chosen SNP IV has a non-null direct effect on the outcome not mediated by the exposure, commonly referred to as horizontal pleiotropy, which is found to be widespread (Solovieff et al. 2013; Verbanck et al. 2018). A standard MR analysis (i.e. based on standard IV methods such as 2SLS) with an invalid IV that violates any of those three core assumptions might yield biased causal effect estimates.

Methods to address possible violations of (A2) or (A3) given a single candidate IV are limited. Two methods have recently emerged as potentially robust against such violation under certain conditions.

Figure 1: Directed acyclic graph (DAG) with an instrument (Z), an outcome (Y), a treatment (A) and unmeasured confounders (U). The left panel shows a valid Mendelian randomization study and the right panel shows violations of the IV independence and exclusion restriction assumptions.

The first, known as MR-GxE, assumes that one has observed an environmental factor, which interacts with the invalid IV in its additive effects on the treatment of interest, and that such interaction is both independent of any unmeasured confounder of the exposure-outcome relationship, and does not operate on the outcome in view (Spiller et al. 2019, 2020). In other words, MR-GxE essentially assumes that while the candidate SNP may not be a valid IV, its additive interaction with an observed covariate constitutes a valid IV which satisfies (A1)-(A3). In contrast, MR GENIUS relies on an assumption that the residual variance of the first-stage regression of the treatment on the candidate IV is heteroscedastic with respect to the candidate IV, i.e. the variance of the treatment depends on the IV, an assumption that may be viewed as a strengthening of the IV relevance assumption (Tchetgen Tchetgen et al. 2020). Interestingly, as noted by Tchetgen Tchetgen et al. (2020), existence of a GxE interaction that satisfies conditions (A1)-(A3) required by MR GxE, had such an E variable (independent of G) been observed, necessarily implies that the heteroscedasticity condition required by MR GENIUS holds even when the relevant E variable is not directly observed.
Furthermore, as it is logically possible for heteroscedasticity of the variance of the treatment to operate even in the absence of any GxE interaction, MR GENIUS can be valid in certain settings where MR GxE is not.

In this paper, we develop an alternative robust MR approach to estimating a causal effect of a treatment subject to unmeasured confounding, leveraging a potentially invalid IV which fails to fulfil either the IV independence or exclusion restriction assumptions. The proposed Mendelian Randomization Mixed-Scale Treatment Effect Robust Identification (MR MiSTERI) approach relies on (i) an assumption that the treatment effect does not vary with the invalid IV on the additive scale; (ii) that the selection bias due to confounding does not vary with the invalid IV on the odds ratio scale; and (iii) that the residual variance for the outcome is heteroscedastic and thus varies with the invalid IV. Although assumptions (i) and (ii) have, respectively, appeared in the IV literature, assumption (iii) has not; we formally establish that their conjunction can identify a causal effect even with an invalid IV subject to pleiotropy. MR MiSTERI is shown to be particularly advantageous in the presence of pervasive heterogeneity of pleiotropic effects on the additive scale, a setting in which both MR GxE and MR GENIUS can be severely biased whenever heteroscedastic first-stage residuals can fully be attributed to latent heterogeneity in the SNP-treatment association (cite GDS group recent paper comparing the methods). For estimation, we propose a simple and consistent three-stage estimator that can be used as a preliminary estimator to a carefully constructed one-step-update estimator, which is guaranteed to be more efficient under the assumed model. In order to incorporate multiple, possibly correlated and weak IVs, a common challenge in MR studies, we develop a MAny Weak Invalid Instruments (MaWII MR MiSTERI) approach for strengthened identification and improved accuracy.
Simulation study results show that our proposed MR methods give consistent estimates of the causal parameter and the selection bias parameter with nominal confidence interval coverage with an invalid IV, and the accuracy is further improved with multiple weak invalid IVs. For illustration, we apply our method to the UK Biobank data set to estimate the causal effect of body mass index (BMI) on average glucose level.

Methods
Suppose that we observe data O_i = (Z_i, A_i, Y_i) of size n drawn independently from a common population, where Z_i, A_i, Y_i denote a SNP IV, a modifiable treatment and a continuous outcome of interest for the i-th subject (1 ≤ i ≤ n), respectively. In order to simplify the presentation, we drop the sample index i and do not consider observed covariates at this stage, although all the conclusions continue to hold within strata defined by observed covariates. Let z, a, y denote the possible values that Z, A, Y could take, respectively. Let Y_{az} denote the potential outcome had, possibly contrary to fact, A and Z been set to a and z respectively, and let Y_a denote the potential outcome had A been set to a. We are interested in estimating the ETT defined as β(a) = E(Y_a − Y_0 | A = a).

To facilitate the exposition, consider the simple setting where both the treatment and the SNP IV are binary; then the ETT simplifies to β = E(Y_1 − Y_0 | A = 1). By the consistency assumption, we know that E(Y_1 | A = 1) = E(Y | A = 1). However, the expectation of the potential outcome Y_0 among the exposed subpopulation, E(Y_0 | A = 1), cannot empirically be observed due to possible unmeasured confounding of the exposure-outcome relationship. The following difference captures this confounding bias on the additive scale: Bias = E(Y_0 | A = 1) − E(Y_0 | A = 0), which is exactly zero only when exposed and unexposed groups are exchangeable on average (i.e. under no confounding) (Hernan and Robins 2020), and is otherwise not null. With this representation and the consistency assumption, we have

E(Y | A = 1) − E(Y | A = 0) = β + Bias.
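As a small numerical illustration of this decomposition (our own sketch, not part of the paper's analysis; all parameter values are hypothetical), one can simulate a confounded binary treatment and verify that the naive contrast recovers β + Bias rather than β:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 0.8                                        # hypothetical true ETT

U = rng.normal(size=n)                            # unmeasured confounder
A = (U + rng.normal(size=n) > 0).astype(float)    # treatment depends on U
Y0 = U + rng.normal(size=n)                       # treatment-free potential outcome
Y = beta * A + Y0                                 # observed outcome (consistency)

naive = Y[A == 1].mean() - Y[A == 0].mean()
bias = Y0[A == 1].mean() - Y0[A == 0].mean()      # E(Y0 | A=1) - E(Y0 | A=0)

# The naive contrast equals beta + Bias exactly, not beta:
print(naive - beta, bias)
```

The two printed quantities coincide: the naive treated-untreated contrast is inflated by exactly the confounding bias.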
This simple equation implies that one can only estimate the sum of the causal effect β and the confounding bias, but cannot tease them apart using the data (A, Y) only. With the availability of a binary candidate SNP IV Z that is possibly invalid, we can further stratify the population by Z to obtain, under consistency:

E(Y | A = 1, Z = z) − E(Y | A = 0, Z = z) = β(z) + Bias(z),

where z is equal to either 0 or 1, and β(z) = E(Y_a − Y_0 | A = a, Z = z) and Bias(z) = E(Y_0 | A = 1, Z = z) − E(Y_0 | A = 0, Z = z) denote the causal effect and bias in the stratified population with the IV taking value z. Note that there are only two equations but four unknown parameters: β(z), Bias(z), z = 0,
1. Therefore, the causal effect cannot be identified without imposing assumptions to reduce the total number of parameters to two.

Our first assumption extends Robins' (1994) no current-treatment value interaction assumption (A.4a), that the causal effect does not vary across the levels of the SNP IV, so that β(z) is constant as a function of z. Formally stated:

(B1) Homogeneous ETT assumption: E(Y_{a=1,z} − Y_{a=0,z} | A = 1, Z = z) = β.

It is important to note that this assumption does not imply the exclusion restriction assumption (A3); it is perfectly compatible with the presence of a direct effect of Z on Y (the direct arrow from Z to Y is present in Figure 1(b)), i.e. E(Y_{a=0,z=1} − Y_{a=0,z=0} | A = 1, Z = z) ≠ 0, which we accommodate.

In order to state our second core assumption, consider the following data generating mechanism for the treatment, which encodes the presence of unmeasured confounding by making the dependence between A and Y_0 explicit under a logistic model:

logit{Pr(A = 1 | Y_0 = y_0, Z = z)} = γ_0 + γy_0 + γ_z z + γ_{y_0 z} y_0 z.

This model can of course not be estimated directly from the observed data without any additional assumption, because it would require the potential outcome Y_0 to be observed both for the untreated (guaranteed by consistency) and the treated. Nevertheless, this model implies that the conditional (on Z) association between A and Y_0 on the odds ratio scale is

OR(Y_0 = y_0, A = 1 | Z = z) = exp(γy_0 + γ_{y_0 z} y_0 z).

Together, γ and γ_{y_0 z} encode the selection bias due to unmeasured confounding on the log-odds ratio scale. If both γ and γ_{y_0 z} are zero, then A and Y_0 are conditionally (on Z) independent or, equivalently, there is no residual confounding bias upon conditioning on Z. Our second identifying assumption formally encodes assumption (A.4b) of Liu et al.
(2020) of homogeneous odds ratio selection bias, γ_{y_0 z} = 0:

(B2) Homogeneous selection bias: OR(Y_0 = y_0, A = a | Z = z) = exp(γay_0).

Assumption (B2) states that the selection bias on the odds ratio scale is homogeneous across levels of the IV. Thus, this assumption allows for the presence of unmeasured confounding which, upon setting γ_{y_0 z} = 0, is assumed to be on average balanced with respect to the SNP IV (on the odds ratio scale). Following Liu et al. (2020), we have that under (B2)

E(Y_0 | A = a, Z = z) = E{Y_0 exp(γaY_0) | A = 0, Z = z} / E{exp(γaY_0) | A = 0, Z = z}.

Define ε_0 = Y_0 − E(Y_0 | A, Z), and suppose that ε_0 | A, Z ∼ N(0, σ²(Z)); then after some algebra we have

E(Y_0 | A = 1, Z = z) = E(Y_0 | A = 0, Z = z) + γσ²(z).

The selection bias on the odds ratio scale does not vary with the levels of the IV; however, in order to achieve identification, the bias term γσ²(Z) must depend on the IV. This observation motivates our third assumption:

(B3) Heteroscedasticity: σ²(Z) cannot be constant.

Assumption (B3) is empirically testable using existing statistical methods for detecting non-constant variance in regression models. For example, the Levene test has been used in genetic association studies and has found several SNPs associated with the variance of several phenotypes. The well-known FTO variant (rs7202116) was found to be associated with the phenotypic variability of body mass index (BMI) (P = 2.4E-10; N = 131,233) using a two-stage procedure (Yang et al. 2012): first regress BMI on the SNP genotype, then regress the squared residuals on the SNP genotype again to test for association of the SNP with the variance of BMI. Previous large-scale genetic association studies show that our assumption (B3) is scientifically plausible.
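The two-stage variance test described above can be sketched as follows (our own illustration with simulated data; the z-statistic uses a simplified, homoscedasticity-naive standard error and is not the exact procedure of Yang et al. 2012):

```python
import numpy as np

def variance_test(y, g):
    """Two-stage variance test in the spirit of Yang et al. (2012):
    regress the phenotype on genotype, then regress the squared
    residuals on genotype and return a large-sample z-statistic
    for the variance-effect slope (naive SE, illustration only)."""
    n = len(y)
    X = np.column_stack([np.ones(n), g])
    b1, *_ = np.linalg.lstsq(X, y, rcond=None)    # stage 1: mean model
    r2 = (y - X @ b1) ** 2
    b2, *_ = np.linalg.lstsq(X, r2, rcond=None)   # stage 2: variance model
    resid2 = r2 - X @ b2
    se = np.sqrt(np.linalg.inv(X.T @ X)[1, 1] * resid2.var())
    return b2[1] / se

rng = np.random.default_rng(1)
n = 50_000
g = rng.binomial(2, 0.3, size=n)                  # SNP genotype, MAF 0.3
y_het = rng.normal(scale=np.exp(0.1 * g))         # sigma^2 = exp(0.2*g): (B3) holds
y_hom = rng.normal(size=n)                        # constant variance: (B3) fails
```

On `y_het` the statistic is large, while on `y_hom` it behaves like a standard normal draw, so heteroscedasticity of the outcome with respect to the genotype is detectable in practice.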
Furthermore, one can generally expect the assumption to hold in the presence of pervasive heterogeneity of pleiotropic effects on the additive scale for the outcome, a setting in which both MR GxE and MR GENIUS can be severely biased. Under assumptions (B1)-(B3), we have

E(Y | A = 1, Z = z) − E(Y | A = 0, Z = z) = β + γσ²(z).

Note that σ²(z) is the variance of Y_0 given Z = z, and thus can be estimated without bias. For the binary IV Z, σ²(Z = 0) and σ²(Z = 1) are the variances of Y_0 within each IV stratum and can easily be estimated using the sample variance in each group. We will describe estimating procedures in the next section for more general settings where Z is not necessarily binary. Importantly, in the present case, we now have two equations and two unknown parameters, β and γ.

Denote D(Z = z) = E(Y | A = 1, Z = z) − E(Y | A = 0, Z = z). We therefore have the following main identification result.

Theorem 1.
Under assumptions (B1)-(B3), the selection bias parameter and the causal effect of interest are uniquely identified as follows:

γ = {D(Z = 1) − D(Z = 0)} / {σ²(Z = 1) − σ²(Z = 0)},

β = E{D(Z) − γσ²(Z)} = D(z) − γσ²(z), z = 0, 1.

Figure 2: Illustration of the identification condition with a binary treatment A and binary IV Z. The green line represents E(Y_0 | A = 0, Z = z) and its difference between Z = 1 and Z = 0 is θ_z. The red dashed line (parallel to the green line) represents the hypothetical setting when σ²(Z = z) is constant, and the red solid line represents the setting when σ²(Z = z) is not a constant function of Z. Note that b_0 and b_1 are treated as known.

In an observed sample, one can simply use the sample versions of unknown quantities to obtain estimates for β and γ. Standard errors can be deduced by a standard application of the multivariate delta method, or by using resampling techniques such as the jackknife or the bootstrap.

Remark 1.
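A minimal sketch of the Theorem 1 moment estimator for binary A and Z (our own illustration; the data are simulated from the implied observed-data law with hypothetical parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
beta, gamma = 0.8, 0.2                        # hypothetical true values

Z = rng.binomial(1, 0.3, size=n)              # binary (possibly invalid) IV
A = rng.binomial(1, 0.4 + 0.2 * Z, size=n)    # treatment associated with Z
s2 = np.exp(0.3 + 0.5 * Z)                    # sigma^2(Z), non-constant (B3)
# Observed-data law implied by (B1)-(B3):
# E(Y | A, Z) = beta*A + gamma*sigma^2(Z)*A + theta0 + theta_z*Z
mu = beta * A + gamma * s2 * A + 1.0 + 0.5 * Z
Y = rng.normal(mu, np.sqrt(s2))

def misteri_binary(Y, A, Z):
    """Closed-form Theorem 1 estimator for binary A and Z."""
    D, v = {}, {}
    for z in (0, 1):
        D[z] = Y[(A == 1) & (Z == z)].mean() - Y[(A == 0) & (Z == z)].mean()
        v[z] = Y[(A == 0) & (Z == z)].var()   # sigma^2(z) from the untreated
    g = (D[1] - D[0]) / (v[1] - v[0])
    b = np.mean([D[z] - g * v[z] for z in (0, 1)])  # unweighted average, z = 0, 1
    return b, g

b_hat, g_hat = misteri_binary(Y, A, Z)
```

The unweighted average over z = 0, 1 in the last step corresponds to the simple pooling option mentioned in Remark 1; an inverse-variance weighted average would be slightly more efficient.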
Note that, as stated in the theorem, the causal parameter β can be estimated based on data from either the Z = 0 or the Z = 1 group. In practical settings, in order to improve efficiency, one may take either the unweighted average of the two estimates as given in the theorem, or their inverse-variance weighted average. In fact, one may fit a saturated linear model for the variance, σ²(Z = z) = b_0 + b_1 z. We give a graphical illustration of our identification condition in Figure 2. By using the heteroscedasticity assumption (B3), we can identify the selection bias parameter γ and then the causal effect parameter β. Without assumption (B3), as shown by the red dashed line, we can only obtain β + γb_0 and thus cannot tease apart the causal effect and the selection bias.

The identification strategy in the previous section extends to the more general setting of a continuous treatment and a more general discrete IV, for example, a SNP IV taking values 0, 1, 2 corresponding to the number of minor alleles. To do so requires that we introduce the notion of a generalized conditional odds ratio function as a measure of the conditional association between a continuous treatment and a continuous outcome. To ground ideas, consider the following model for the treatment-free potential outcome:

Y_0 = E(Y_0 | A, Z) + σ(Z)ε,

where ε is independent standard normal; then one can show that the generalized conditional odds ratio function associated with Y_0 and A given Z, with (Y_0 = 0, A = 0) taken as reference values (Chen 2007), is given by

OR(y_0, a | z) = [f(y_0 | a, z) f(y_0 = 0 | a = 0, z)] / [f(y_0 | a = 0, z) f(y_0 = 0 | a, z)] = exp{[E(Y_0 | A = a, Z = z) − E(Y_0 | A = 0, Z = z)] y_0 / σ²(z)},

which by Assumption (B2) implies that E(Y_0 | A, Z) − E(Y_0 | A = 0, Z) = γσ²(Z)A. We therefore have

μ(a, z) = E(Y | A = a, Z = z) = βa + γσ²(z)a + E(Y_0 | A = 0, Z = z).
For practical interpretation, we assume A has been centered so that A = 0 represents the average treatment value in the study population. Assume the model E(Y_0 | A = 0, Z = z) = θ_0 + θ_z z and a log-linear model for σ²(Z), such that

μ(a, z) = βa + γaσ²(z) + θ_0 + θ_z z,
log(σ²(Z)) = η_0 + η_z Z.

The conditional distribution of Y given A and Z is

Y | A = a, Z = z ∼ N(μ(a, z), σ²(z)).

Then we can simply use standard maximum likelihood estimation (MLE) to obtain consistent and fully efficient estimates for all parameters under the assumed model. However, the likelihood may not be log-concave, and direct maximization of the likelihood function with respect to all parameters might not always converge. Instead, we propose a novel three-stage estimation procedure which provides a consistent but inefficient preliminary estimator of the unknown parameters. We then propose a carefully constructed one-step update estimator to obtain a consistent and fully efficient estimator. Our three-stage estimation method works as follows.
Stage 1
Fit the following linear regression using standard (weighted) least squares:

Y = θ_0 + θ_a A + θ_z Z + θ_{az} AZ + ε.

We obtain parameter estimates θ̂_0, θ̂_a, θ̂_z, θ̂_{az}, with Ê(Y_0 | A = 0, Z) = θ̂_0 + θ̂_z Z, and corresponding estimated residuals ε̂ = Y − θ̂_0 − θ̂_a A − θ̂_z Z − θ̂_{az} AZ.

Stage 2
Regress the squared residuals ε̂² on (1, Z) in a log-linear model and obtain parameter estimates η̂_0, η̂_z and σ̂²(Z) = exp(η̂_0 + η̂_z Z).

Stage 3
Regress Y − Ê(Y_0 | A = 0, Z) on A and Aσ̂²(Z) without an intercept term and obtain the two corresponding regression coefficient estimates β̂ and γ̂, respectively. Alternatively, we may also fit E(Y | A, Z) = βA + γAσ̂²(Z) + θ_0 + θ_z Z in Stage 3.

The proposed three-stage estimation procedure is computationally convenient, appears to always converge, and provides a consistent estimator for all parameters under the assumed model. Next we propose the following one-step update to obtain a fully efficient estimator. Consider the following normal model:

[Y − {βA + γA exp(η_0 + η_z Z) + θ_0 + θ_z Z}] / √exp(η_0 + η_z Z) ∼ N(0, 1).   (1)

Denote Θ = (β, γ, η_0, η_z, θ_0, θ_z) and denote the log-likelihood function for an individual as l(Θ; Y, A, Z). With a sample of size n, the log-likelihood function is

l_n(Θ) = Σ_{i=1}^n l_i(Θ; Y_i, A_i, Z_i).

The score function S_n(Θ) = ∂l_n(Θ)/∂Θ is defined as the first-order derivative of the log-likelihood function with respect to Θ, and we use i_n(Θ) = ∂²l_n(Θ)/∂Θ∂Θᵀ to denote the second-order derivative of the log-likelihood function. Suppose we have obtained a consistent estimator Θ̃⁽⁰⁾ of Θ using the three-stage estimating procedure; then one can show that a consistent and fully efficient estimator is given by the one-step update estimator

Θ̃ = Θ̃⁽⁰⁾ − {i_n(Θ̃⁽⁰⁾)}⁻¹ S_n(Θ̃⁽⁰⁾).   (2)

Remark 2. In biomedical studies, continuous outcomes of interest are often approximately normally distributed, and normal linear regression is routinely used by applied researchers. When the normality assumption is violated, a typical remedy is to apply a transformation to the outcome prior to modeling it, such as the log-transformation or the more general Box-Cox transformation, to achieve approximate normality. Alternatively, below we propose to model the error distribution as a more flexible Gaussian mixture, which is more robust than the Gaussian model.
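The three-stage procedure can be sketched as follows for a scalar IV (our own illustration with simulated data; parameter values and function names are hypothetical, and Stage 2 is implemented as a Newton solve of the log-linear moment equation):

```python
import numpy as np

def three_stage(Y, A, Z):
    """Sketch of the three-stage preliminary estimator for a scalar IV, assuming
    E(Y0 | A=0, Z) = theta0 + theta_z*Z and log sigma^2(Z) = eta0 + eta_z*Z."""
    n = len(Y)
    one = np.ones(n)
    # Stage 1: fit Y ~ 1 + A + Z + A*Z by least squares, keep squared residuals
    X1 = np.column_stack([one, A, Z, A * Z])
    th, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    r2 = (Y - X1 @ th) ** 2
    # Stage 2: fit E(r2 | Z) = exp(eta0 + eta_z*Z) by Newton's method
    X2 = np.column_stack([one, Z])
    eta = np.zeros(2)
    for _ in range(25):
        m = np.exp(X2 @ eta)
        eta += np.linalg.solve((X2 * m[:, None]).T @ X2, X2.T @ (r2 - m))
    s2 = np.exp(X2 @ eta)
    # Stage 3: regress Y - Ehat(Y0 | A=0, Z) on A and A*sigma^2(Z), no intercept
    X3 = np.column_stack([A, A * s2])
    bg, *_ = np.linalg.lstsq(X3, Y - (th[0] + th[2] * Z), rcond=None)
    return bg[0], bg[1], eta[1]                  # beta_hat, gamma_hat, eta_z_hat

# Data generated from the assumed model (hypothetical parameter values)
rng = np.random.default_rng(3)
n = 200_000
Z = rng.binomial(1, 0.3, size=n).astype(float)
A = rng.normal(size=n)
s2 = np.exp(0.3 + 0.5 * Z)
Y = 0.8 * A + 0.2 * A * s2 + 1.0 + 0.5 * Z + rng.normal(size=n) * np.sqrt(s2)

beta_hat, gamma_hat, eta_z_hat = three_stage(Y, A, Z)
```

This preliminary estimator could then be passed to the one-step update (2) for full efficiency; the sketch stops at the consistent three-stage output.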
Alternatively, as described in the Supplemental Materials, one may consider a semiparametric location-scale model which allows the distribution of the standardized residuals to remain unrestricted, a potentially more robust approach although computationally more demanding.

2.3 Estimation and Inference under Many Weak Invalid IVs

Weak identification bias is a salient issue that needs special attention when using genetic data to strengthen causal inference, as most genetic variants are only weakly associated with the traits. When many genetic variants are available, we recommend using the conditional maximum likelihood estimator (CMLE) based on (1), where Z is replaced with a multi-dimensional vector. Let Θ̂ be the solution to the corresponding score equations, i.e., S_n(Θ̂) = 0; let

I(Θ) = −E{∂²l(Θ; Y, A, Z)/∂Θ∂Θᵀ}

be the Fisher information matrix based on the conditional likelihood function for one observation. Let k be the total number of parameters in Θ, which is equal to 4 + 2p when Z is p-dimensional. When λ_min{nI(Θ)}/k → ∞, with λ_min{nI(Θ)} being the minimum eigenvalue of nI(Θ), we show in the supplementary materials (Theorem 3) that Θ̂ is consistent and asymptotically normal as the sample size goes to infinity, and satisfies

√n(Θ̂ − Θ) →_d N(0, {I(Θ)}⁻¹).   (3)

In other words, the CMLE is robust to weak identification bias and the usual inference procedure can be directly applied.

The key condition λ_min{nI(Θ)}/k → ∞ warrants more discussion. Note that the quantity λ_min{nI(Θ)}/k measures the ratio between the amount of information that a sample of size n carries about the unknown parameter and the number of parameters. In classic strong-identification scenarios where the minimum eigenvalue of I(Θ) is assumed to be bounded below by a constant, the condition simplifies to requiring that the number of parameters be small compared with the sample size, i.e., n/k → ∞.
However, when identification is weak, the minimum eigenvalue of I(Θ) can be small. In practice, the condition λ_min{nI(Θ)}/k → ∞ can be evaluated by κ̂ = λ_min{−i_n(Θ̂)}/k, which is the ratio between the minimum eigenvalue of the negative Hessian matrix and the number of parameters. We remark that the condition stated in the supplementary materials (Theorem 3) is more general and applies to general likelihood-based methods, which, for example, implies the condition for consistency of the profile likelihood estimator (MR.raps) in Zhao et al. (2019b). For valid statistical inference, a consistent estimator of the variance-covariance matrix of Θ̂ is simply the inverse of the negative Hessian matrix, {−i_n(Θ̂)}⁻¹, whose corresponding diagonal element estimates the variance of the treatment effect estimator β̂. Other variants of the CMLE can also be used; for example, the one-step iteration estimator Θ̃ in (2) is asymptotically equivalent to the CMLE Θ̂ as long as the initial estimator Θ̃⁽⁰⁾ is √n-consistent (Shao 2003).

Consider the general location-scale model

Y_0 = E(Y_0 | A, Z) + σ(Z)ε,

where ε ⊥ (A, Z). Under assumptions (B1)-(B3), the conditional mean function is given by

E(Y | A = a, Z = z) = βa + E(Y_0 | A = 0, Z = z) + ∂/∂(γa) ln ∫ exp[γaσ(Z = z)ε] dF(ε),

where F(·) denotes the cumulative distribution function of ε. The error distribution may be reasonably approximated by a Gaussian mixture model with enough components (Goodfellow et al. 2016). Specifically, let f(ε) = Σ_{k=1}^K π_k φ_k(ε), where φ_k(·) is the normal density with mean μ_k and variance δ_k², satisfying the constraints

Σ_{k=1}^K π_k = 1, E(ε) = Σ_{k=1}^K π_k μ_k = 0, Var(ε) = Σ_{k=1}^K π_k (δ_k² + μ_k²) = 1.
The conditional mean function with Gaussian mixture error is given by

E(Y | A = a, Z = z) = βa + E(Y_0 | A = 0, Z = z) + σ(Z = z) [Σ_{k=1}^K π_k ω_k μ_k + γaσ(Z = z) Σ_{k=1}^K π_k ω_k δ_k²] / Σ_{k=1}^K π_k ω_k,

where ω_k = exp(γaσ(Z = z)μ_k + δ_k²[γaσ(Z = z)]²/2). The conditional distribution is {Y − E(Y | A = a, Z = z)}/σ(Z = z) ∼ Σ_{k=1}^K π_k N(μ_k, δ_k²). In practice, estimation of (β, γ) may proceed under a user-specified integer value K < ∞, as well as parametric models for E(Y_0 | A = 0, Z = z) and σ(Z = z), via an alternating optimization algorithm which we describe in the Supplementary Materials.

Table 1: Simulation results for estimating the selection bias parameter γ and the ETT β using the one-step estimator approach (without covariates), based on 1000 Monte Carlo experiments with a range of sample sizes n. γ̂ is the average of the point estimates of γ, and bias is reported in percentage format; likewise for β̂. SE is the averaged standard error. SD is the sample standard deviation of the 1000 point estimates of γ and β. Coverage is the proportion of 95% confidence intervals containing the true parameter value among the 1000 experiments. We vary the value of η_z in σ²(Z) = exp(η_0 + η_z Z) to assess the impact of decreasing IV strength on the estimation results.

n     η_z    (β, γ)       β̂      Bias     SE     SD     Cover    γ̂      Bias     SE     SD     Cover
1e4   0.2    (0.8, 0.2)   0.806   0.74%    0.092  0.094  93.8%    0.196   -2.12%   0.074  0.075  94.6%
1e4   0.15   (0.8, 0.2)   0.788   -1.44%   0.124  0.126  95.6%    0.210   4.77%    0.103  0.104  95.6%
1e4   0.1    (0.8, 0.2)   0.778   -2.76%   0.197  0.210  96.2%    0.219   9.32%    0.168  0.181  96.0%
3e4   0.1    (0.8, 0.2)   0.789   -1.35%   0.104  0.106  96.0%    0.209   4.58%    0.089  0.090  95.8%
3e4   0.05   (0.8, 0.2)   0.788   -1.46%   0.226  0.222  97.0%    0.210   5.17%    0.199  0.195  97.4%
1e5   0.05   (0.8, 0.2)   0.797   -0.36%   0.113  0.120  95.2%    0.203   1.29%    0.099  0.105  95.2%
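As a quick numerical check of the mixture-mean formula (our own sketch; the mixture parameters are hypothetical and chosen to satisfy the mean-zero, unit-variance constraints), the closed-form tilted mean with weights ω_k agrees with a numerical derivative of the log moment generating function at t = γaσ(z):

```python
import numpy as np

# Hypothetical two-component mixture satisfying the constraints
pi  = np.array([0.3, 0.7])           # mixing weights, sum to 1
mu  = np.array([-0.7, 0.3])          # component means, sum_k pi_k*mu_k = 0
dl2 = np.array([0.51, 0.91])         # variances delta_k^2, sum_k pi_k*(dl2+mu^2) = 1
t = 0.4                              # plays the role of gamma*a*sigma(z)

# Closed form: sum_k pi_k*w_k*(mu_k + t*dl2_k) / sum_k pi_k*w_k,
# with w_k = exp(t*mu_k + dl2_k*t^2/2)
w = pi * np.exp(t * mu + dl2 * t**2 / 2)
closed = (w * (mu + t * dl2)).sum() / w.sum()

# Numerical check: d/dt ln E[exp(t*eps)] via central differences
def log_mgf(s):
    return np.log((pi * np.exp(s * mu + dl2 * s**2 / 2)).sum())

h = 1e-5
numeric = (log_mgf(t + h) - log_mgf(t - h)) / (2 * h)
```

Multiplying this tilted mean by σ(z) yields the adjustment term in the conditional mean expression above.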
Simulation Studies

In this section, we conduct simulation studies to evaluate the finite-sample performance of our one-step-update estimating method. Sample sizes of n = 10,000, 30,000, and 100,000 are considered, still considerably smaller than the UK Biobank study considered below for illustration, which comprises a couple of hundred thousand independent samples. If our method can produce good estimates in smaller sample settings, we expect it to perform even better in larger sample settings. The IV Z is generated from a binomial distribution with success probability 0.3 (the minor allele frequency). The treatment A is generated from the standard normal distribution N(0, 1). The outcome Y is generated from a normal distribution with mean and variance given by

E(Y | A, Z) = 0.8A + 0.2Aσ²(Z) + 1 + θ_z Z, σ²(Z) = exp(η_0 + η_z Z),

where (β, γ) = (0.8, 0.2) and η_z is set to 0.2, 0.15, 0.1, or 0.05. We generated in total 1000 such data sets and then applied our proposed CMLE to each of them. Results are summarized in Table 1. We found that when the sample size is 10,000, the bias and standard error of the causal effect and selection bias parameter estimates become larger as the IV strength η_z decreases from 0.2 to 0.1. The causal effect is less sensitive to the IV strength than the selection bias parameter γ. As we increase the sample size, the bias gets much smaller and becomes negligible at sample size 100,000. Confidence intervals achieve the nominal 95% coverage.

The second simulation study is designed to evaluate the finite-sample performance of the proposed three-stage estimator and the CMLE in the presence of many weak invalid IVs. The sample size is set to 100,000, and each IV Z_j, j = 1, ..., p, is still generated to take values 0, 1, 2 with minor allele frequency equal to 0.3. The treatment A is generated from the standard normal distribution. The outcome Y is generated from a normal distribution with mean and variance given by

E(Y | A, Z) = 0.8A + 0.2Aσ²(Z) + θ_0 + θ_z Σ_{j=1}^p Z_j, σ²(Z) = exp(η_z Σ_{j=1}^p Z_j).
Simulation results are presented in Table 2, where the standard error (SE) for the three-stage estimator is the bootstrap estimator approximated using 100 Monte Carlo replicates, and the SE for the CMLE is obtained from the inverse Hessian matrix discussed in Section 2.3. From the results in Table 2, the three-stage estimator has negligible bias when p = 20, but shows evidently larger bias when p grows to 50, which is also reflected in coverage probabilities falling below 95%. In contrast, the CMLE shows negligible bias in both scenarios. The standard errors for the CMLE are close to the standard deviations, and the CMLE attains nominal coverage in both simulation scenarios. These observations agree with our theoretical assessment that the CMLE is efficient and robust to many weak invalid IVs. In particular, it is consistent and asymptotically normal as long as the quantity κ̂ is reasonably large.

In a third simulation study, we evaluate the proposed Gaussian mixture approach. We similarly generate the treatment A from the standard normal distribution and the IV Z taking values in {0, 1, 2} with minor allele frequency 0.3. However, we generate Y = E(Y | A, Z) + σ(Z)ε, where the error ε is drawn from a two-component Gaussian mixture:

ε₁ ~ N(μ₁, δ₁²),  ε₂ ~ N(μ₂, δ₂²),  π₁ ~ Bernoulli(p),  π₂ = 1 − π₁,  ε = π₁ε₁ + π₂ε₂,

with fixed component means (μ₁, μ₂), where μ₁ < 0 and μ₂ = 0.4, and fixed standard deviations (δ₁, δ₂). Accordingly,

E(Y | A, Z) = 0.8A + (1 + α_z Z) + σ(Z) { Σ_{k=1}^2 π_k ω_k [μ_k + δ_k² γAσ(Z)] } / { Σ_{k=1}^2 π_k ω_k },  σ(Z) = exp(η₀ + η_z Z),

where ω_k = exp{γAσ(Z)μ_k + δ_k²[γAσ(Z)]²/2} and α_z, η₀ are fixed constants. We vary the value of η_z in the set {0.1, 0.25, 0.5} to assess the impact of IV strength. The results using the proposed Gaussian mixture method with K = 2 are summarized in Table 3. There is noticeable finite-sample bias and under-coverage when η_z = 0.1, especially for estimation of the selection bias parameter γ, but the performance improves with larger η_z or sample size.

We perform an additional set of simulations to assess the impact of model misspecification. Specifically, the error is generated from a mixture of two uniform distributions:

ε₁ ~ U(a₁, b₁),  ε₂ ~ U(a₂, b₂),  π₁ ~ Bernoulli(p),  π₂ = 1 − π₁,  ε = π₁ε₁ + π₂ε₂,

for fixed support endpoints (a₁, b₁) and (a₂, b₂), with b₁ = 0.87. In this case,

E(Y | A, Z) = 0.8A + (1 + α_z Z) + σ(Z) [ Σ_{k=1}^2 π_k {(b_k t − 1)exp(b_k t) − (a_k t − 1)exp(a_k t)} / {t²(b_k − a_k)} ] / [ Σ_{k=1}^2 π_k {exp(b_k t) − exp(a_k t)} / {t(b_k − a_k)} ],  σ(Z) = exp(η₀ + η_z Z),

where t ≡ t(A, Z; γ) = γAσ(Z). The Gaussian mixture procedure with K = 2 failed to converge in the weak IV setting where η_z = 0.1; we therefore only report results for η_z = 0.25 or 0.5 in Table 4. The absolute bias of the Gaussian mixture estimator becomes noticeably larger for the same values of (n, η_z), especially in estimation of γ. The average standard error at η_z = 0.25 is considerably larger than the Monte Carlo standard error, resulting in somewhat conservative confidence intervals.

Table 2: Simulation results for estimating the selection bias parameter γ = 0.2 and causal effect β = 0.8 with n = 100,000 and varying p. The third column is the averaged κ̂ = −λ_min{i_n(Θ̂)}/k, which enters the consistency and asymptotic normality condition for the CMLE and is preferably large. γ̂ is the averaged point estimate of γ, with bias reported in percentage format; likewise for β̂. SE is the averaged standard error, and SD is the sample standard deviation of the 1000 point estimates of γ and β. Coverage is the proportion of 95% confidence intervals containing the true parameter value among the 1000 experiments.

n     p    κ̂       Method    β̂       Bias     SE     SD     Cover    γ̂       Bias     SE     SD     Cover
1e5   20   15.55   3-stage   0.806    0.76%   0.034  0.034  94.1%    0.197   −1.49%   0.017  0.017  94.2%
                   CMLE      0.799   −0.14%   0.033  0.034  94.9%    0.201    0.28%   0.017  0.017  95.2%
1e5   50   4.51    3-stage   0.818    2.21%   0.038  0.037  91.9%    0.197   −1.62%   0.008  0.008  93.2%
                   CMLE      0.801    0.11%   0.036  0.036  95.3%    0.200    0.02%   0.007  0.007  94.6%
Table 3: Simulation results for estimating the confounding bias parameter γ = 0.2 and causal effect β = 0.8 under the Gaussian mixture error model, with varying sample size n and IV strength η_z.

n     η_z    β̂       Bias     SE     SD     Cover    γ̂       Bias     SE     SD     Cover
1e4   0.10   0.771   −3.64%   0.373  0.195  75.6%    0.256   27.78%   0.314  0.211  66.7%
      0.25   0.798   −0.29%   0.056  0.059  93.9%    0.205    2.39%   0.044  0.049  91.4%
      0.50   0.799   −0.12%   0.039  0.039  95.9%    0.201    0.69%   0.026  0.026  96.0%
3e4   0.10   0.780   −2.54%   0.106  0.084  77.0%    0.230   14.95%   0.090  0.101  69.2%
      0.25   0.799   −0.09%   0.032  0.033  95.1%    0.202    0.80%   0.025  0.027  92.9%
      0.50   0.799   −0.07%   0.023  0.022  95.4%    0.201    0.36%   0.015  0.015  95.8%
1e5   0.10   0.793   −0.86%   0.031  0.041  78.0%    0.210    4.96%   0.026  0.052  69.5%
      0.25   0.800    0.03%   0.018  0.018  94.7%    0.200    0.24%   0.014  0.015  93.5%
      0.50   0.800    0.03%   0.012  0.012  95.3%    0.200    0.14%   0.008  0.008  95.6%

Although simulations for the mixture model were only performed in the single-IV case, in which the approach appeared to become unreliable in the weak-IV setting, we expect the CMLE under mixture Gaussian error to exhibit similar robustness in many weak IV regimes as the simple Gaussian model explored in the previous simulation study.
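The diagnostic κ̂ = −λ_min{i_n(Θ̂)}/k reported in Table 2 depends only on the smallest eigenvalue of the observed Hessian of the log-likelihood. A minimal sketch of the computation, with a hypothetical 2×2 Hessian standing in for i_n(Θ̂) (the closed-form symmetric 2×2 eigenvalue formula replaces a general eigensolver):

```python
import math

def min_eigenvalue_sym2(m):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - radius

def kappa_hat(hessian, k):
    """kappa_hat = -lambda_min{i_n(Theta_hat)} / k, as defined in the Table 2
    caption; larger values indicate better-behaved CMLE asymptotics."""
    return -min_eigenvalue_sym2(hessian) / k

# Hypothetical observed Hessian of the log-likelihood (negative definite near the MLE).
hess = [[-40.0, 5.0], [5.0, -10.0]]
print(kappa_hat(hess, k=2))  # ~ 20.41
```

In practice one would apply the same formula to the k×k observed Hessian at the fitted Θ̂ using a numerical eigensolver.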
UK Biobank is a large-scale ongoing prospective cohort study that recruited around 500,000 participants aged 40-69 in 2006-2010. Participants provided biological samples, completed questionnaires, underwent assessments, and had nurse-led interviews. Follow-up is chiefly through cohort-wide linkages to National Health Service data, including electronic, coded death certificate, hospital, and primary care data (Sudlow et al. 2015). To control for population stratification, we restricted our analysis to participants with self-reported and genetically validated white British ancestry. For quality control, we also excluded participants with (1) excess relatedness (more than 10 putative third-degree relatives), (2) mismatched information on sex between genotyping and self-report, (3) sex chromosomes not XX or XY, or (4) poor-quality genotyping based on heterozygosity and missing rates.

Table 4: Simulation results for estimating the confounding bias parameter γ = 0.2 and causal effect β = 0.8 when the Gaussian mixture working model is fit to data generated with uniform mixture errors.

n     η_z     β̂       Bias     SE     SD     Cover    γ̂       Bias     SE     SD     Cover
1e4   0.25    0.779   −2.65%   0.515  0.125  98.0%    0.235   17.52%   0.479  0.113  97.5%
      0.50    0.800   −0.02%   0.047  0.052  92.9%    0.184   −7.84%   0.033  0.028  92.7%
3e4   0.25    0.793   −0.92%   0.069  0.067  98.6%    0.208    4.08%   0.057  0.046  98.2%
      0.50    0.800   −0.01%   0.027  0.030  92.6%    0.182   −8.98%   0.019  0.016  85.1%
1e5   0.25†   0.799   −0.12%   0.032  0.036  94.2%    0.197   −1.72%   0.026  0.021  98.6%
      0.50    0.801    0.10%   0.014  0.016  92.3%    0.181   −9.46%   0.010  0.008  55.2%
†: excludes 1 simulation replicate in which the Gaussian mixture procedure did not converge.

Our analysis used UK Biobank data with complete measurements of body mass index (BMI) and glucose levels. We further selected BMI-associated SNPs as potential IVs from the list provided by Sun et al. (2019). We found 7 SNPs (in Table 5) that are predictive of the residual variance of glucose levels in our Stage 2 regression (p < 0.01). The average BMI is 27.39 kg/m² (SD: 4.75 kg/m²), and the average glucose level is 5.12 mmol/L (SD: 1.21 mmol/L). We applied the proposed methods to these data with the goal of evaluating the causal relationship between BMI as treatment and glucose as outcome; analysis results are summarized in Table 5. For comparison purposes, we also include results for the standard two-stage least squares (TSLS) method implemented in the R package ivreg (Fox et al. 2020). The allele score for the 7 selected SNPs is defined as the signed sum of their minor alleles, where the sign of each SNP is determined by its regression coefficient in our Stage 2 regression. We also implemented the MR-GENIUS method (Tchetgen Tchetgen et al. 2020) using the same 7 SNPs; the causal effect of BMI on glucose was estimated as 0.046 (se: 0.01, p < 0.001). Crude regression analysis, simply regressing glucose on BMI, gives an estimate of 0.041 (p < 0.001), and including all 7 SNPs in a conventional regression together with BMI yields a similar estimate of 0.039 (p < 0.001). It is interesting to consider results obtained by selecting each SNP as a single candidate IV, as summarized in Table 5. For instance, for the SNP rs2176040, the corresponding causal effect estimate is β̂ = 0.041 (se: 0.014, p-value: 0.003), and the selection bias parameter is estimated as γ̂ = 0.0585 (se: 0.0099, p < 0.001). However, the standard TSLS method gives a causal effect estimate of −1.619 (se: 2.041, p-value: 0.428). These conflicting inferences may be due to SNP rs2176040 being an invalid IV, in which case TSLS likely yields biased results. We further combined 20 SNPs, 13 of which are weakly associated with the residual variance of glucose levels, and obtained using the CMLE a causal effect estimate of BMI on glucose of β̂ = 0.028 (se: 0.0025, p < 0.001) and a selection bias estimate of γ̂ = 0.009 (se: 0.0017, p < 0.001). This final analysis suggests a somewhat smaller treatment effect size than the other MR methods, while providing strong statistical evidence of a positive effect, with a relatively small but statistically significant amount of confounding bias detected.

In this paper, we have considered identification and inference about a causal effect in an observational study using a novel MR approach. MR MiSTERI leverages a possible association between candidate IV SNPs and the variance of the outcome in order to disentangle confounding bias from the causal effect of interest. Importantly, MR MiSTERI can recover unbiased inferences about the causal effect in view even when none of the candidate IV SNPs is a valid IV. Key assumptions on which the approach relies include no additive interaction between a candidate IV SNP and the treatment causal effect in the outcome model, and the requirement that the amount of selection bias (measured on the odds ratio scale) remains constant as a function of SNP values. Violation of either assumption may result in incorrect inferences about the treatment effect.
Although not pursued in this paper, robustness to such violations may be assessed via a sensitivity analysis in which the impact of various departures from the assumptions is explored by varying sensitivity parameters. Alternatively, as illustrated in our application, reporting inferences from a variety of MR methods, each relying on a different set of assumptions, provides a robustness check of empirical findings relative to the underlying assumptions needed for a causal interpretation of results. A notable robustness property of the MLE approach to MR MiSTERI is that, as we formally establish, it is robust to many weak IV bias, thus providing certain theoretical guarantees of reliable performance in common MR settings where one has a relatively large number of weak candidate IVs, many of which may be subject to pleiotropy. While the proposed methods require a correctly specified Gaussian model for the conditional distribution of Y given A and Z, this assumption can be relaxed. In fact, we also consider in this paper a Gaussian mixture model as a framework for assessing, and possibly correcting for, violations of this assumption.

Table 5: Analysis results for the UK Biobank data; (β̂**, se(β̂**)) are TSLS estimates computed with the R package ivreg.

SNP            β̂      se(β̂)   γ̂         se(γ̂)    β̂*     se(β̂*)   γ̂*        se(γ̂*)   β̂**      se(β̂**)
rs3736485      0.041  0.013    0.0001   0.0094   0.027  0.001     0.0031   0.0002    0.200   0.068
rs1558902      0.023  0.006    0.0128   0.0046   0.040  0.001     0.0025   0.0002    0.064   0.009
rs1808579      0.041  0.014    0.0001   0.0098   0.026  0.001     0.0042   0.0002    0.107   0.038
rs13021737     0.051  0.012   −0.0065   0.0084   0.036  0.001    −0.0043   0.0005    0.030   0.016
rs2176040      0.041  0.014    0.0585   0.0099   0.026  0.001     0.0009   0.0003   −1.619   2.041
rs10968576     0.047  0.016   −0.0040   0.0111   0.046  0.001    −0.0005   0.0004    0.079   0.025
rs10132280     0.030  0.013    0.0081   0.0094   0.062  0.004    −0.0228   0.0005    0.049   0.029
Allele score   0.030  0.003    0.0078   0.0021   0.033  0.005     0.0105   0.0002    0.089   0.009
Furthermore, in the Supplementary Materials, we briefly describe a semiparametric three-stage estimation approach under the location-scale model which allows the distribution of standardized residuals to remain unrestricted. It will be of interest to investigate robustness and efficiency properties of this more flexible estimator, which we plan to pursue in future work. It would likewise be of interest to investigate whether the proposed methods can be extended to the important setting of a binary outcome, which is of common occurrence in epidemiology and other health and social sciences.

Acknowledgements

References

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444-455.

Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak instruments. Econometrica, 73(5):1673-1692.

Chen, H. Y. (2007). A semiparametric odds ratio model for measuring association. Biometrics, 63(2):413-421.

Davies, N. M., von Hinke Kessler Scholder, S., Farbmacher, H., Burgess, S., Windmeijer, F., and Smith, G. D. (2015). The many weak instruments problem and Mendelian randomization. Statistics in Medicine, 34(3):454-468.

Didelez, V. and Sheehan, N. (2007). Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research, 16(4):309-330.

Ference, B., Kastelein, J., Ginsberg, H., Chapman, M., Nicholls, S., Ray, K., Packard, C., Laufs, U., Brook, R., Oliver-Williams, C., Butterworth, A., Danesh, J., Smith, G., Catapano, A., and Sabatine, M. (2017). Association of genetic variants related to CETP inhibitors and statins with lipoprotein levels and cardiovascular risk. JAMA - Journal of the American Medical Association, 318(10):947-956.

Fox, J., Kleiber, C., and Zeileis, A. (2020). ivreg: Two-Stage Least-Squares Regression with Diagnostics. R package version 0.5-0.

Glymour, M. M., Tchetgen Tchetgen, E. J., and Robins, J. M. (2012). Credible Mendelian randomization studies: Approaches for evaluating the instrumental variable assumptions. American Journal of Epidemiology, 175(4):332-339.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Hernan, M. A. and Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Lawlor, D. A., Harbord, R. M., Sterne, J. A. C., Timpson, N., and Davey Smith, G. (2008). Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine, 27(8):1133-1163.

Liu, L., Miao, W., Sun, B., Robins, J., and Tchetgen Tchetgen, E. (2020). Identification and inference for marginal average treatment effect on the treated with an instrumental variable. Statistica Sinica.

Newey, W. K. and Windmeijer, F. (2009). Generalized method of moments with many weak moment conditions. Econometrica, 77(3):687-719.

Robins, J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379-2412.

Scandinavian Simvastatin Survival Study Group (1994). Randomised trial of cholesterol lowering in 4444 patients with coronary heart disease: the Scandinavian Simvastatin Survival Study (4S). The Lancet, 344(8934):1383-1389.

Shao, J. (2003). Mathematical Statistics. Springer-Verlag New York.

Smith, G. D. and Ebrahim, S. (2003). 'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32(1):1-22.

Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M., and Smoller, J. W. (2013). Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics, 14:483-495.

Spiller, W., Hartwig, F. P., Sanderson, E., Smith, G. D., and Bowden, J. (2020). Interaction-based Mendelian randomization with measured and unmeasured gene-by-covariate interactions. medRxiv.

Spiller, W., Slichter, D., Bowden, J., and Davey Smith, G. (2019). Detecting and correcting for bias in Mendelian randomization analyses using gene-by-environment interactions. International Journal of Epidemiology, 48(3):702-712.

Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3):557-586.

Stock, J. H., Wright, J. H., and Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4):518-529.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.

Sun, D., Zhou, T., Heianza, Y., Li, X., Fan, M., Fonseca, V. A., and Qi, L. (2019). Type 2 diabetes and hypertension: a study on bidirectional causality. Circulation Research, 124(6):930-937.

Tchetgen Tchetgen, E. J., Sun, B., and Walter, S. (2020). The GENIUS approach to robust Mendelian randomization inference. Statistical Science.

Verbanck, M., Chen, C.-Y., Neale, B., and Do, R. (2018). Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics, 50(5):693-698.

Wang, L. and Tchetgen Tchetgen, E. (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):531-550.

Yang, J., Loos, R. J. F., Powell, J. E., Medland, S. E., Speliotes, E. K., Chasman, D. I., et al. (2012). FTO genotype is associated with phenotypic variability of body mass index. Nature, 490(7419):267-272.

Ye, T., Shao, J., and Kang, H. (2019). Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization. arXiv:1911.09802.

Zhao, Q., Chen, Y., Wang, J., and Small, D. S. (2019a). Powerful three-sample genome-wide design and robust statistical inference in summary-data Mendelian randomization. International Journal of Epidemiology.

Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. S. (2019b). Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Annals of Statistics.

Supplementary Materials

Proof of Theorem 1
Proof.
Under assumptions (B1)-(B3), we have

D(Z = z) = E(Y | A = 1, Z = z) − E(Y | A = 0, Z = z) = β + γσ(z). (S1)

Hence, we have the following two equations:

D(Z = 1) = β + γσ(Z = 1),
D(Z = 0) = β + γσ(Z = 0).

Solving these two equations simultaneously gives the result in Theorem 1.
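Solving the two equations in the proof in closed form gives γ = {D(Z=1) − D(Z=0)}/{σ(Z=1) − σ(Z=0)} and β = D(Z=0) − γσ(Z=0), which requires σ(Z=1) ≠ σ(Z=0), i.e., heteroscedasticity with respect to Z. A minimal numeric sketch (the σ values below are illustrative):

```python
import math

def solve_beta_gamma(d0, d1, sigma0, sigma1):
    """Solve D(z) = beta + gamma * sigma(z) at z = 0, 1 for (beta, gamma).
    Identification fails when sigma(1) == sigma(0) (homoscedasticity in Z)."""
    gamma = (d1 - d0) / (sigma1 - sigma0)
    beta = d0 - gamma * sigma0
    return beta, gamma

# Forward-generate D(z) from known (beta, gamma) and recover them.
beta_true, gamma_true = 0.8, 0.2
sigma0, sigma1 = math.exp(0.1), math.exp(0.3)   # illustrative sigma(z) values
d0 = beta_true + gamma_true * sigma0
d1 = beta_true + gamma_true * sigma1
print(solve_beta_gamma(d0, d1, sigma0, sigma1))  # recovers (0.8, 0.2) up to floating point
```

The closed form makes explicit why assumption (iii) (heteroscedasticity of the outcome residual in Z) is the source of identification.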
Estimation and Inference under Weak Identification
Consider the normal model (1). Let

I_n(Θ) = −E{∂² l_n(Θ) / ∂Θ ∂Θᵀ}

be the Fisher information matrix for a sample of size n, let λ_min{I_n(Θ)} be the minimum eigenvalue of I_n(Θ), and let tr(A) be the trace of a square matrix A. Condition 2 summarizes the regularity conditions.

Condition 2 (Regularity Conditions). For every o = (y, a, z) in the support of O = (Y, A, Z), the likelihood function, denoted f_Θ(o), is twice continuously differentiable in Θ and satisfies

(∂/∂Θ) ∫ ψ_Θ(o) dν = ∫ (∂/∂Θ) ψ_Θ(o) dν

for ψ_Θ(o) = f_Θ(o) and ψ_Θ(o) = ∂f_Θ(o)/∂Θ, where ν is a suitable measure; there is a positive ε such that the Fisher information matrix I_n(Γ) is positive definite for Γ: ‖Γ − Θ‖ < ε; and for any given Θ, there exist a positive number ε and a positive integrable function h such that

sup_{Γ: ‖Γ−Θ‖<ε} ‖∂² log f_Γ(o) / ∂Γ ∂Γᵀ‖ ≤ h(o)

for all o in the support of O, where ‖A‖ = √(tr(AᵀA)) for any matrix A.

Theorem 3 establishes the consistency and asymptotic normality of solutions to the score function.
Theorem 3. (a) Under Condition 2, and assuming that the observed data O_i, i = 1, ..., n, are independent and identically distributed, if k/[λ_min{nI(Θ)}] → 0, then the sequence of conditional maximum likelihood estimators {Θ̂_n} is consistent, i.e., Θ̂_n →p Θ. (b) Any consistent sequence Θ̃_n such that S_n(Θ̃_n) = 0 is asymptotically normal, and as n → ∞,

√n(Θ̃_n − Θ) →d N(0, {I(Θ)}⁻¹). (S2)

In fact, the general consistency condition in Theorem 3 can be written as λ_min{nI(Θ)} → ∞ and

tr[I_n⁻¹(Θ) E{S_n(Θ)S_n(Θ)ᵀ}] / λ_min{nI(Θ)} → 0. (S3)

When k = 1, writing E{S_n(Θ)S_n(Θ)ᵀ} = V₁ and I_n(Θ) = V₂, with V₁ = V₂ + O(p) and V₂ ≍ n‖γ‖², condition (S3) becomes V₁/V₂ = O{p/(n‖γ‖²)} → 0, which is the condition for consistency in Zhao et al. (2019b, Theorem 3.1).
Proof.
First, notice that the condition k/[λ_min{nI(Θ)}] → 0 implies λ_min{nI(Θ)} → ∞. Let B_n(c) = {Γ: ‖[I_n(Θ)]^{1/2}(Γ − Θ)‖ ≤ c} for c > 0, where ‖A‖ = √(tr(AᵀA)) for any matrix A. Since B_n(c) shrinks to Θ as n → ∞, the consistency result in Theorem 3(a) is implied by the fact that for any ε > 0, there exist c > 0 and n₀ such that

P(l_n(Γ) − l_n(Θ) < 0 for all Γ ∈ ∂B_n(c)) ≥ 1 − ε, n ≥ n₀, (S4)

and Θ̂_n is unique for n ≥ n₀, where ∂B_n(c) is the boundary of B_n(c). For Γ ∈ ∂B_n(c), a Taylor expansion gives

l_n(Γ) − l_n(Θ) = cλᵀ[I_n(Θ)]^{−1/2} S_n(Θ) + (c²/2) λᵀ[I_n(Θ)]^{−1/2} i_n(Γ*) [I_n(Θ)]^{−1/2} λ,

where λ = [I_n(Θ)]^{1/2}(Γ − Θ)/c satisfies ‖λ‖ = 1, i_n(Γ) = ∂²l_n(Γ)/∂Γ∂Γᵀ, and Γ* lies between Γ and Θ. Note that

E‖i_n(Γ*) − i_n(Θ)‖/n ≤ E max_{Γ∈B_n(c)} ‖i_n(Γ) − i_n(Θ)‖/n ≤ E max_{Γ∈B_n(c)} ‖i₁(Γ) − i₁(Θ)‖ → 0, (S5)

because (a) i₁(Γ) = ∂²l₁(Γ)/∂Γ∂Γᵀ is continuous in a neighborhood of Θ for any fixed observation; (b) B_n(c) shrinks to {Θ} since λ_min(I_n(Θ)) → ∞; and (c) for sufficiently large n, max_{Γ∈B_n(c)} ‖i₁(Γ) − i₁(Θ)‖ is bounded by an integrable function under Condition 2. By the strong law of large numbers, ‖n⁻¹ i_n(Θ) + I(Θ)‖ → 0 almost surely. These results imply that

l_n(Γ) − l_n(Θ) = cλᵀ[I_n(Θ)]^{−1/2} S_n(Θ) − (1 + o_p(1)) c²/2. (S6)

Note that max_λ {λᵀ[I_n(Θ)]^{−1/2}S_n(Θ)} = ‖I_n(Θ)^{−1/2}S_n(Θ)‖, attained at λ = I_n(Θ)^{−1/2}S_n(Θ)/‖I_n(Θ)^{−1/2}S_n(Θ)‖. Therefore, (S4) follows from (S6) and

P(cλᵀ[I_n(Θ)]^{−1/2}S_n(Θ) − c²/2 < 0 for all λ s.t. ‖λ‖ = 1)
= P(max_λ λᵀ[I_n(Θ)]^{−1/2}S_n(Θ) < c/2)
= P(‖I_n(Θ)^{−1/2}S_n(Θ)‖ < c/2)
= 1 − P(‖I_n(Θ)^{−1/2}S_n(Θ)‖ ≥ c/2)
≥ 1 − 4E‖I_n(Θ)^{−1/2}S_n(Θ)‖² / c²
= 1 − 4E tr{S_n(Θ)ᵀ I_n(Θ)⁻¹ S_n(Θ)} / c²
= 1 − 4 tr{I_n(Θ)⁻¹ E[S_n(Θ)S_nᵀ(Θ)]} / c²
≥ 1 − ε,

by choosing c such that c²/tr{I_n(Θ)⁻¹E[S_n(Θ)S_nᵀ(Θ)]} is large enough, which is possible because

c² / tr{I_n(Θ)⁻¹E[S_n(Θ)S_nᵀ(Θ)]}
≥ ‖[I_n(Θ)]^{1/2}(Γ − Θ)‖² / tr{I_n(Θ)⁻¹E[S_n(Θ)S_nᵀ(Θ)]}
≥ λ_min[I_n(Θ)] ‖Γ − Θ‖² / tr{I_n(Θ)⁻¹E[S_n(Θ)S_nᵀ(Θ)]}
= ‖Γ − Θ‖² / o(1).

For n ≥ n₀, Θ̂_n is unique because, by Condition 2, the Fisher information I_n(Γ) is positive definite in a neighborhood of Θ.

(b) Using the mean value theorem for vector-valued functions, we obtain

−S_n(Θ) = [∫₀¹ i_n(Θ + t(Θ̃ − Θ)) dt] (Θ̃ − Θ).

Using an argument similar to (S5), when Θ̃ − Θ = o_p(1),

n⁻¹ ‖i_n(Θ + t(Θ̃ − Θ)) − i_n(Θ)‖ →p 0.

Since n⁻¹ i_n(Θ) →p −I(Θ) and I_n(Θ) = nI(Θ),

−S_n(Θ) = −I_n(Θ)(Θ̃ − Θ) + o_p(‖I_n(Θ)(Θ̃ − Θ)‖).

This and Slutsky's theorem imply that √n(Θ̃ − Θ) is asymptotically equivalent to

√n [I_n(Θ)]⁻¹ S_n(Θ) = n^{−1/2} [I(Θ)]⁻¹ S_n(Θ) →d N_k(0, [I(Θ)]⁻¹)

by the central limit theorem for i.i.d. samples.
Estimation method under Gaussian mixture error

We describe the alternating optimization procedure with user-specified K and parametric models σ(Z = z; η) and E(Y | A = 0, Z = z; θ) ≡ μ(z; θ), indexed by finite-dimensional parameters η and θ, respectively.

(i) Initialization step: maximize the standard Gaussian location-scale model

[Y − {βA + γAσ(Z; η) + μ(Z; θ)}] / σ(Z; η) ~ N(0, 1)

over (β, θ, η, γ) by maximum likelihood estimation to obtain (β̃, θ̃, η̃, γ̃). Construct the standardized residuals

ε̃ = [Y − {β̃A + γ̃Aσ(Z; η̃) + μ(Z; θ̃)}] / σ(Z; η̃).

(ii) Maximize the Gaussian mixture location-scale model

ε̃ ~ Σ_{k=1}^K π_k N(μ_k, δ_k²)

with respect to (π₁, ..., π_K, μ₁, ..., μ_K, δ₁, ..., δ_K) using constrained maximum likelihood estimation to obtain (π̃₁, ..., π̃_K, μ̃₁, ..., μ̃_K, δ̃₁, ..., δ̃_K). Construct

ṽ = A σ²(Z; η̃) Σ_{k=1}^K π̃_k ω̃_k δ̃_k² / Σ_{k=1}^K π̃_k ω̃_k,
w̃ = σ(Z; η̃) [Σ_{k=1}^K π̃_k ω̃_k μ̃_k] / Σ_{k=1}^K π̃_k ω̃_k,

where ω̃_k = exp{γ̃Aσ(Z; η̃)μ̃_k + δ̃_k²[γ̃Aσ(Z; η̃)]²/2}.

(iii) Minimize n⁻¹ Σ_i {Y_i − βA_i − μ(Z_i; θ) − γṽ_i − w̃_i}² using least squares with offset w̃ to obtain new values (β̃, θ̃, γ̃), followed by regressing the squared residuals to obtain a new η̃. Construct the standardized residuals

ε̃ = [Y − {β̃A + μ(Z; θ̃) + γ̃ṽ + w̃}] / σ(Z; η̃)

based on the new values (β̃, θ̃, γ̃, η̃).

(iv) Iterate between steps (ii) and (iii) until the change in the log-likelihood evaluated at step (ii) falls below a user-specified tolerance level. In our simulations we iterate until |log LL_j − log LL_{j−1}| / |log LL_{j−1}| < 0.001 at the j-th iteration.

The asymptotic variance is estimated by standard M-estimation theory, stacking the estimating functions in steps (ii) and (iii) evaluated at the final parameter values at convergence.
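In step (ii) above, the tilted weights ω̃_k are the component moment-generating functions evaluated at t = γ̃Aσ(Z; η̃), and the mixture analogue of the tilted mean E[ε e^{tε}]/E[e^{tε}] has the closed form Σ_k π_k ω_k(μ_k + δ_k² t) / Σ_k π_k ω_k. A sketch checking this closed form against direct numerical integration (mixture parameters below are illustrative):

```python
import math

def tilt_terms(t, pis, mus, deltas):
    """Closed-form E[eps * exp(t*eps)] / E[exp(t*eps)] for a Gaussian mixture
    eps ~ sum_k pi_k N(mu_k, delta_k^2), using omega_k = exp(t*mu_k + delta_k^2 t^2 / 2)."""
    omegas = [math.exp(t * m + d * d * t * t / 2) for m, d in zip(mus, deltas)]
    num = sum(p * w * (m + d * d * t) for p, w, m, d in zip(pis, omegas, mus, deltas))
    den = sum(p * w for p, w in zip(pis, omegas))
    return num / den

def tilt_terms_numeric(t, pis, mus, deltas, lo=-10.0, hi=10.0, n=100000):
    """Brute-force check by midpoint-rule integration of the mixture density."""
    h = (hi - lo) / n
    num = den = 0.0
    for i in range(n):
        e = lo + (i + 0.5) * h
        dens = sum(p * math.exp(-0.5 * ((e - m) / d) ** 2) / (d * math.sqrt(2 * math.pi))
                   for p, m, d in zip(pis, mus, deltas))
        w = math.exp(t * e) * dens
        num += e * w * h
        den += w * h
    return num / den

pis, mus, deltas = [0.3, 0.7], [-0.9, 0.4], [0.5, 0.8]  # illustrative values
t = 0.25
closed = tilt_terms(t, pis, mus, deltas)
numeric = tilt_terms_numeric(t, pis, mus, deltas)
```

At t = 0 the tilt vanishes and the expression reduces to the mixture mean, which provides a quick sanity check on any implementation.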
Semiparametric three-stage estimation

Consider the general location-scale model

Y = E(Y | A, Z) + σ(Z)ε ≡ μ(A, Z) + σ(Z)ε,

where ε ⊥ (A, Z). Under assumptions (B1)-(B3), the conditional mean function is given by

μ(a, z) = βa + μ(0, z) + ∂/∂(γa) ln E{exp(γaσ(z)ε) | A = 0, Z = z}
        = βa + μ(0, z) + σ(z) E{ε exp(γaσ(z)ε)} / E{exp(γaσ(z)ε)}.

A semiparametric three-stage estimator of (β, γ) may be implemented via the following steps:

(i) Fit the regression model μ(a, z) = m_a(a) + m_z(z) + m_az(a, z), where 0 = m_z(0) = m_az(a, 0) = m_az(0, z) for identification, using a nonparametric method such as a generalized additive model. For example, if the support of Z is {0, 1, 2}, then the nonparametric model for the outcome is of the form μ(a, z) = m_a(a) + {γ₁z + γ₂z²} + {z m₁(a) + z² m₂(a)} with m₁(0) = m₂(0) = 0; a saturated model may be considered if A also has finite support. Let μ̂(a, z) denote the resulting estimator of the mean μ(a, z).

(ii) Using the residuals from (i), fit a nonparametric model for the conditional variance σ²(z). If the support of Z is {0, 1, 2}, then we can specify the saturated model σ²(z) = exp(η₀ + η₁z + η₂z²). Let σ̂(z) denote the resulting estimator of σ(z).

(iii) Define ε̂_j = {Y_j − μ̂(A_j, Z_j)} / σ̂(Z_j) and let

ĝ_i(γ) = Σ_{j=1}^n ε̂_j exp(γA_i σ̂(Z_i) ε̂_j) / Σ_{j=1}^n exp(γA_i σ̂(Z_i) ε̂_j), i = 1, ..., n.

The proposed semiparametric three-stage estimator of (β, γ) is

(β̂, γ̂) = argmin_{β,γ} Σ_i {Y_i − βA_i − μ̂(0, Z_i) − σ̂(Z_i) ĝ_i(γ)}².

We note that a potentially more efficient approach is to maximize the log-likelihood for the model using a kernel estimator for the density of the standardized residuals from step (iii). Specifically, let

δ̂_i(β, γ) = {Y_i − βA_i − μ̂(0, Z_i) − σ̂(Z_i) ĝ_i(γ)} / σ̂(Z_i)

and define

ℓ_n(β, γ) = Σ_{i=1}^n log f̂_h(δ̂_i(β, γ)),

where f̂_h(ε) = (nh)⁻¹ Σ_{i=1}^n K((ε − ε̂_i)/h) is a kernel density estimator with bandwidth h > 0. The alternative estimator of (β, γ) is given by (β̂, γ̂) = argmax_{β,γ} {ℓ_n(β, γ)}.
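The profile term ĝ_i(γ) in step (iii) is simply an exponentially tilted mean of the standardized residuals. A minimal sketch for a single observation, with hypothetical residuals and scale:

```python
import math

def g_hat(gamma, a_i, sigma_i, residuals):
    """g_hat_i(gamma) = sum_j eps_j exp(gamma*A_i*sigma(Z_i)*eps_j)
                        / sum_j exp(gamma*A_i*sigma(Z_i)*eps_j),
    an exponentially tilted mean of the standardized residuals eps_j."""
    t = gamma * a_i * sigma_i
    weights = [math.exp(t * e) for e in residuals]
    return sum(w * e for w, e in zip(weights, residuals)) / sum(weights)

# Hypothetical standardized residuals; at gamma = 0 the tilt vanishes and
# g_hat reduces to the plain mean of the residuals.
eps = [-1.2, -0.3, 0.1, 0.5, 0.9]
print(g_hat(0.0, a_i=1.0, sigma_i=1.1, residuals=eps))  # ~ 0.0 (plain mean of eps)
```

A positive tilt (γ·A_i·σ̂(Z_i) > 0) upweights the larger residuals, so ĝ_i(γ) increases monotonically in the tilt, mirroring how the selection bias term shifts the conditional mean of Y.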