[PDF] A Latent Mixture Model for Heterogeneous Causal Mechanisms in Mendelian Randomization

Abstract

Mendelian Randomization (MR) is a popular method in epidemiology and genetics that uses genetic variation as instrumental variables for causal inference. Existing MR methods usually assume most genetic variants are valid instrumental variables that identify a common causal effect. There is a general lack of awareness that this effect homogeneity assumption can be violated when there are multiple causal pathways involved, even if all the instrumental variables are valid. In this article, we introduce a latent mixture model MR-PATH that groups instruments that yield similar causal effect estimates together. We develop a Monte-Carlo EM algorithm to fit this mixture model, derive approximate confidence intervals for uncertainty quantification, and adopt a modified Bayesian Information Criterion (BIC) for model selection. We verify the efficacy of the Monte-Carlo EM algorithm, confidence intervals, and model selection criterion using numerical simulations. We identify potential mechanistic heterogeneity when applying our method to estimate the effect of high-density lipoprotein cholesterol on coronary heart disease and the effect of adiposity on type II diabetes.

Full PDF

AA Latent Mixture Model for HeterogeneousCausal Mechanisms in Mendelian

Randomization

Daniel IongDepartment of Statistics, University of Michigan, Ann ArborQingyuan ZhaoStatistical Laboratory, University of CambridgeYang ChenDepartment of Statistics, University of Michigan, Ann Arbor

Abstract

Mendelian Randomization (MR) is a popular method in epidemiology and geneticsthat uses genetic variation as instrumental variables for causal inference. Existing MRmethods usually assume most genetic variants are valid instrumental variables thatidentify a common causal eﬀect. There is a general lack of awareness that this eﬀecthomogeneity assumption can be violated when there are multiple causal pathwaysinvolved, even if all the instrumental variables are valid. In this article, we introduce alatent mixture model MR-PATH that groups instruments that yield similar causal eﬀectestimates together. We develop a Monte-Carlo EM algorithm to ﬁt this mixture model,derive approximate conﬁdence intervals for uncertainty quantiﬁcation, and adopt amodiﬁed Bayesian Information Criterion (BIC) for model selection. We verify theeﬃcacy of the Monte-Carlo EM algorithm, conﬁdence intervals, and model selectioncriterion using numerical simulations. We identify potential mechanistic heterogeneitywhen applying our method to estimate the eﬀect of high-density lipoprotein cholesterolon coronary heart disease and the eﬀect of adiposity on type II diabetes.

Keywords:

Causal inference, Instrumental variables, EM algorithm, Monte-Carlo sampling,HDL cholesterol, Diabetes 1 a r X i v : . [ s t a t . A P ] J u l Introduction

Mendelian randomization (MR) is a causal inference method that aims to estimate the causaleﬀect of a modiﬁable risk exposure on disease outcomes. MR is a special case of instrumentalvariable methods that have a long history in statistics and econometrics. The key insight ofMR is that genetic variants, usually in the form of single nucleotide polymorphisms (SNPs),are naturally randomised during conception and may serve as good instrumental variables formany epidemiological risk factors (Smith and Ebrahim, 2004; Didelez and Sheehan, 2007).As a study design, MR has been quickly gaining popularity among epidemiologists because ofits ability to give unbiased causal eﬀect estimates in the presence of unmeasured confoundingand the increasing availability of genome-wide association studies (GWAS) data.A key assumption of MR is that the genetic instrumental variables can only aﬀect theoutcome variables through the risk exposure under investigation. This is often referred toas the “exclusion restriction” or “no direct eﬀect” assumption in the instrumental variableliterature. With genetic variants as instruments, this assumption may be violated due toa genetic phenomenon called “pleiotropy”, meaning a single genotype can aﬀect multipleseemingly unrelated phenotypes. Recent empirical evidence and genetic theory suggest thatpleiotropy is pervasive for common traits (Boyle et al., 2017; Liu et al., 2019). This has leadto a burst of development of new statistical methods aiming to make MR studies robust todiﬀerent patterns of pleiotropy (Bowden et al., 2015; Kang et al., 2016; Zhao et al., 2020;Verbanck et al., 2018; Burgess et al., 2020; Qi and Chatterjee, 2019).To our knowledge, the vast majority of these robust MR methods still rely on the “eﬀecthomogeneity” assumption that the risk exposure has the same causal eﬀect for every indi-vidual. This assumption usually follows from assuming a linear structural equation modelcommonly used in the instrumental variable literature (Anderson and Rubin, 1949; Bow-den et al., 2017). However, this key assumption may be unrealistic when we use MR tostudy complex biological systems involving multiple mechanisms, as demonstrated in thenext example. 2 .1 Motivating example: The eﬀect of HDL cholesterol on coro-nary heart disease

The statistical model we develop in this article is motivated by a real world problem. Overthe last few decades, there has been a heated debate in cardiology on the role of high-density lipoproteins (HDL) in coronary heart disease (CHD) (Rader and Hovingh, 2014;Davey Smith and Phillips, 2020). Numerous observational studies have found a consistentinverse association between HDL cholesterol (HDL-C, amount of cholesterol carried in HDLparticles) and CHD, lending support to a theory that HDL plays a causally protective rolein atherogenesis (formation of fatty deposits in the arteries) through a biological mechanismcalled reverse cholesterol transport (HDL particles remove excessive cholesterol in peripheraltissues). This led to a once widely held belief among healthcare professionals and the generalpublic that HDL particles are the “good cholesterol”, as opposed to low-density lipoproteins(LDL) which are thought to be the “bad cholesterol”.However, the HDL hypothesis has received close scrutiny after several promising clinicaltrials raising HDL cholesterol through the

CETP inhibitors demonstrated no or only modestcardiovascular beneﬁt (Armitage et al., 2019). Moreover, a prominent MR study furtherchallenged the presumption that raising HDL-C will uniformly translate into reductions inrisk of CHD (Voight et al., 2012). Although there are lots of SNPs associated with HDL-C,many of them are also associated with LDL cholesterol and/or triglycerdes. Due to thisreason, Voight et al. (2012) based their main argument on a SNP in the

LIPG gene thatdoes not exhibit signiﬁcant association with LDL cholesterol and triglycerides, even thoughother genetic instruments showed varied associations with risk of CHD. This shows thatpleiotropy, arising from the multiple mechanisms involved in the synthesis and regulation ofblood lipids, poses a major challenge for using MR methods to study HDL.Using some of the latest large-scale GWAS datasets for HDL-C and CHD, we createda dataset to visualize and analyze the heterogeneity among potential genetic instrumentsfor HDL (Figure 1). Each point in this plot shows the reported associations of a SNP withHDL-C and CHD in the GWAS (with standard error bars). Figure 1a shows a straight lineacross the origin whose slope is obtained using a previous method called MR-RAPS that3

SNP association with HDL−C S N P a ss o c i a t i on w i t h CHD (a) MR-RAPS. −0.10−0.050.00 0.00 0.05 0.10 0.15

SNP association with HDL−C S N P a ss o c i a t i on w i t h CHD (b) Proposed mixture model.

Figure 1:

Scatterplot of HDL-CAD data . Left : Line/shaded region represents the causaleﬀect estimate ± one standard error from MR-RAPS. Right : Lines/shaded regions representheterogeneous causal eﬀect estimates ± one standard deviation from our proposed mixturemodel.assumes a homogeneous eﬀect of HDL-C on CHD (Zhao et al., 2020). If this assumptionholds, the slope in Figure 1a can be interpreted as the causal eﬀect of HDL-C on CHD.However, it is clear from Figure 1a that the slope estimated by MR-RAPS provides a poorﬁt to the scatterplot.In this paper, we propose to ﬁt this dataset using an alternative model where the SNPshave individual slopes that are drawn from a mixture distribution. This would be the case ifthere are several biological mechanisms involved in regulating HDL-C; see Section 3. In thisexample, MR-PATH selects two clusters (shown in Figure 1b) which provides a much betterﬁt to the data. See Section 7 for more detail about the data collection for this example andthe results of our model. There have been several attempts to develop MR methods that allow for heterogeneouscausal eﬀects. We are also not the ﬁrst to use mixture models for MR. The contamination ixture method proposed by Burgess et al. (2020) uses a two-component mixture modelto distinguish between valid and invalid instruments. Similarly, the MR-Mix method (Qiand Chatterjee, 2019) uses a four-component mixture model to identify one group of validinstruments and three groups of invalid instruments which have either direct eﬀects on boththe exposure and outcome, direct eﬀects on the outcome but no eﬀect on the exposure,or no eﬀect on both the exposure and outcome. However, the purpose of using mixturemodels in these approaches is not to identify diﬀerent mechanisms but rather to provide arealistic model for the invalid instruments. In particular, they assume that the majority ofthe instruments are valid and indicate an identical causal eﬀect.The only methods we are aware of that does not assume eﬀect homogeneity and attemptsto distinguish causal mechanisms are

GRAPPLE (Wang et al., 2020),

MR-Clust (Foleyet al., 2019) and

BESIDE-MR (Shapland et al., 2020).

GRAPPLE proposes to use the localmaximums of a robustiﬁed proﬁle likelihood function (Zhao et al., 2020) to discover multiplemechanisms. However,

GRAPPLE is only a visualization tool and does not attempt toexplicitly model the diﬀerent mechanisms.

MR-Clust works by constructing a mixture modelbased on SNP-speciﬁc Wald estimators. Similar to our proposed method,

MR-Clust does notmake further assumptions about the number of clusters and the structure of each cluster.However, a major limitation of

MR-Clust is its assumption that the SNP-speciﬁc Waldestimates are normally distributed, which is a poor approximation for weak instruments.

BESIDE-MR is another related method that uses Bayesian model averaging. Although

BESIDE-MR is initially motivated by averaging over the uncertainty in selecting the validinstruments, it can also be extended to allow for multiple clusters of instruments indicatingdiﬀerent causal eﬀects. However,

BESIDE-MR is not a full likelihood approach because ituses the proﬁle likelihood derived in Zhao et al. (2020) to eliminate the nuisance parametersrelated to the SNP-exposure eﬀects.Our paper makes two main contributions to this fast growing literature. First, there is ageneral lack of awareness that MR can be used to discover multiple biological mechanisms,partly due to the wide usage of the broad terminology “eﬀect heterogeneity” to refer to severaldiﬀerent phenomena—invalid instrument due to pleiotropy, eﬀect modiﬁcation/moderationby a covariate, and eﬀect heterogeneity due to diﬀerent causal mechanisms. In this article5e introduce the concept of mechanistic heterogeneity for the last phenomenon and showthat it can occur even if all the instruments are valid.Our second contribution is a transparent mixture model, which we call MR-PATH, tocapture the mechanistic heterogeneity and the adaptation of a Monte Carlo EM algorithmto ﬁt this model. Because our model is based on the SNP-exposure and SNP-outcomeassociations, it does not require the individual instruments to be strong. Since our MonteCarlo EM algorithm maximizes the full likelihood function for the latent mixture model, ithas all the beneﬁts of likelihood-based inference.The rest of the paper is organized as follows. In section 2, we give a brief review ofthe standard assumptions in MR. In section 3, we introduce the concept of mechanisticheterogeneity and in section 4 we propose to model it with MR-PATH. In section 5, wedescribe an Monte Carlo EM algorithm to ﬁt MR-PATH and discuss the relevant statisticalinference and model selection procedures. We then study the performance of our MonteCarlo EM algorithm with two simulation studies in section 6 and apply it to two real datadatasets (including the HDL-CHD example above) in section 7. We conclude the paper withsome ending remarks in section 8.

The goal of MR is to estimate the causal eﬀect of a risk exposure variable (X) on a dis-ease outcome variable (Y) with summary-level data from GWAS. In particular, we may beinterested in the causal eﬀect of HDL cholesterol (X) on the risk of coronary heart disease(Y). Regression analyses between X and Y are typically biased by unobserved confoundingvariables U . MR uses p genetic variants Z , . . . , Z p as instrumental variables to obtain anunbiased causal eﬀect estimate of X on Y. A genetic variant Z is said to be a valid instrument for estimating the causal eﬀect of X on Y if it satisﬁes the following assumptions: Assumption 2.1 (Relevance)

It is associated with the risk exposure, i.e. Corr ( Z, X ) (cid:54) = 0 . Assumption 2.2 (Independence)

It must be independent of any unmeasured confoundersthat are associated with both the exposure and outcome, i.e. Z | = U . YUZ × × Figure 2: A directed acyclic graph (DAG) illustrating the core assumptions for a validinstrument.

Assumption 2.3 (Exclusion restriction)

It aﬀects the outcome only through the risk ex-posure, i.e. Z | = Y | X . Figure 2 provides a graphical representation of these assumptions. The independenceassumption is guaranteed by Mendel’s law of random assortment of genes. The relevanceassumption is justiﬁed if genetic variants are chosen to be genome-wide signiﬁcant. Amongthe three assumptions, the ER assumption is the most problematic due to pleiotropy. As-sessment of the IV assumptions and sensitivity analysis in MR are discussed by Burgesset al. (2017). In the MR literature, it is common to assume the following linear structuralequation model for the exposure and outcome variables (Bowden et al., 2017): X = p (cid:88) i =1 θ X i Z i + η X U + E X , (1) Y = βX + p (cid:88) i =1 α i Z i + η Y U + E Y , (2)where η X and η Y are confounding eﬀects and E X and E Y are random noise terms actingon X and Y respectively. The causal eﬀect between X and Y is given by the parameter β . The true marginal association between X and Z i is given by θ X i . The direct eﬀectof Z i on Y is given by α i . Assumption 2.1 implies θ X i (cid:54) = 0 and assumption 2.2 implies Z , . . . , Z p | = U, E X , E Y . Under this model, the exclusion restriction assumption is violatedif α i (cid:54) = 0. In particular, this can occur if Z i aﬀects Y through a mechanism unrelated to X (see ﬁg. 3a for illustration). Plugging eq. (1) into eq. (2), we obtain Y = p (cid:88) i =1 (cid:16) βθ X i + α i (cid:17) Z i + ( η X + η Y ) U + ( E X + E Y ) = p (cid:88) i =1 θ Y i Z i + E (cid:48) Y , (3)7here θ Y i = βθ X i + α i gives the true marginal association between Y and Z i and E (cid:48) Y is arandom noise term that is independent of Z i . In summary-data MR, we usually observethe estimated SNP-exposure eﬀect ˆ θ X i , with standard error σ X i , and the estimated SNP-outcome eﬀect ˆ θ Y i , with standard error σ Y i , for SNP i = 1 , . . . , p . These estimated eﬀectsare typically computed from two diﬀerent samples using linear or logistic regression. If weassume the genetic variants Z , . . . , Z p satisfy the exclusion restriction assumption, in otherwords α = · · · = α p = 0, then the causal eﬀect parameter β can be estimated consistentlywith the inverse-variance-weighted estimator Burgess et al. (2013). A more robust approachis to perform error-in-variables regression of ˆ θ Y i on ˆ θ X i (Zhao et al., 2020). We adopt thisapproach in MR-PATH. If the exclusion restriction assumption is violated for some SNPs,then the causal eﬀect parameter β cannot be identiﬁed without further assumptions on α i .For example, Zhao et al. (2020) assumes that α i ∼ N (0 , τ ) for most genetic variants so thatthe direct eﬀects are balanced out. Instruments Z Pathway M Eﬀect of M on X Eﬀect of M on Y Wald estimandScenario 1 Z , , . . . , Z ,p M θ θ β βZ , , . . . , Z ,p M θ θ β + α β + α /θ Z , , . . . , Z ,p M θ θ β + α β + α /θ Scenario 2 Z , , . . . , Z ,p M θ θ β β Z , , . . . , Z ,p M θ θ β β Z , , . . . , Z ,p M θ θ β β Table 1: Ratio estimands using diﬀerent instruments in the two scenarios in Figure 3. Di-rected acyclic graphs in Figure 3 are interpreted as linear structural equation models.The assumption that the direct eﬀect α i is iid normally distributed does not take into8 , ... Z ,p Z , ... Z ,p Z , ... Z ,p M M M Xθ θ θ Yβα α U (a) Scenario 1: Multiple pathways of horizontalpleiotropy. Z , ... Z ,p M X θ Z , ... Z ,p M X θ Z , ... Z ,p M X θ X = X + X + X Yβ β β U (b) Scenario 2: Multiple mechanisms for theexposure X . Figure 3: Two scenarios of mechanistic heterogeneity.account the possibility that genetic variation often aﬀects phenotypic traits through separatebiological pathways. In this section we show that such behaviour may lead to a clusteringphenomenon where SNPs belonging to the same pathway would indicate similar causal eﬀectsin an MR analysis. This is what we call “mechanistic heterogeneity” in MR.

Consider Figure 3 which contains two scenarios of mechanistic heterogeneity that motivatethe latent mixture model. In both scenarios, genetic variants are grouped into three biologicalpathways, M , M , and M , that aﬀect the exposure X and outcome Y diﬀerently. In theﬁrst scenario (Figure 3a), all three pathways aﬀect X in the same way but have diﬀerentdirect eﬀects on Y . In particular, the ﬁrst pathway M does not have any direct eﬀect on Y not mediated by X , so the instruments Z , , . . . , Z ,p associated with it are all valid IVs.In the second scenario in Figure 3b, the three pathways aﬀect diﬀerent components of theexposure X which also have diﬀerent causal eﬀects on the outcome Y .9f we interpret the diagrams in Figure 3 as linear structural equations (like Figure 2 for (1)and (2)), we can derive the so-called Wald estimand (ratio of θ Y and θ X ) for each instrument(Table 1). In both scenarios in Figure 3, genetic instruments on the same pathway havethe same Wald estimand θ Y /θ X , while instruments across diﬀerent pathways generally havediﬀerent estimands.Therefore, we may reach completely diﬀerent conclusions when using instruments ondiﬀerent pathways in the MR analysis. In reality, mechanistic heterogeneity can be morecomplicated than the basic scenarios in Figure 3. For example, another pathway can aﬀectsome components of X and also have direct eﬀect on Y . It is also worthwhile to pointout that mechanistic heterogeneity can arise even when all the IVs are perfectly valid; anexample is Scenario 2 in Figure 3. The above introduction of mechanistic heterogeneity is entirely based on linear structuralequation models. Next we show that the same clustering phenomenon can also happenwhen there is nonlinearity. Consider the causal diagram in Figure 3a without the M → Y and M → Y edges, so all the instruments Z , , . . . , Z ,p are valid (satisfy Assumptions 2.1to 2.3). Suppose the variables satisfy the nonparametric structural equation model (NPSEM)with independent errors (Pearl, 2009) according to Figure 3a, so the counterfactuals of M k , k = 1 , , X , and Y can be deﬁned using the NPSEM. For example, we use Y ( X = 1) todenote the counterfactual outcome under the intervention X = 1. To simplify our illustrationbelow, we assume Z , , . . . , Z ,p , M , M , M , and X are all binary variables.It is well known that if the instrument Z k,j is valid and the counterfactuals of the exposure X satisﬁes the monotonicity assumption X ( Z k,j = 1) ≥ X ( Z k,j = 0), the Wald estimand forinstrument Z k,j is equal to the so-called local average treatment eﬀect, E [ Y ( X = 1) − Y ( X =0) | X ( Z k,j = 1) > X ( Z k,j = 0)] (Angrist et al., 1996). Notice that this interpretation of theIV analysis does not require linearity of the structural equation model. Suppose we furtherassume the eﬀect of Z on M is monotone, M k ( Z k,j = 1) ≥ M k ( Z k,j = 0) , and the eﬀect of M on X is monotone, X ( M k = 1) ≥ X ( M k = 0). Using the properties of counterfactuals10nd the fact in Figure 3a that Z k,j aﬀects X entirely through M k , we get X ( Z k,j = z ) = X ( Z k,j = z, M k = M k ( Z k,j = z )) = X ( M k = M k ( Z k,j = z )) . Thus, using the assumption that X and M k are binary, E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( Z k,j = 1) > X ( Z k,j = 0) (cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X (cid:0) M k = M k ( Z k,j = 1) (cid:1) > X (cid:0) M k = M k ( Z k,j = 0) (cid:1)(cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( M k = 1) > X ( M k = 0) , M k ( Z k,j = 1) > M k ( Z k,j = 0) (cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( M k = 1) > X ( M k = 0) (cid:3) . (4)The last equality above uses { M k ( Z k,j = 0) , M k ( Z k,j = 1) } | = { X ( M k = 0) , X ( M k = 1) , Y ( X = 0) , Y ( X = 1) } . This counterfactual independence follows from expressing the counterfactuals using theNPSEM and the assumption that the diﬀerent structural equations have independent er-rors.The signiﬁcance of (4) is that, if the counterfactuals of M and X satisfy the monotonicityassumption, the local average treatment eﬀect corresponding to Z k,j only depends on themechanism index k . This shows that the clustering of the Wald esimand in Section 3.1 notonly occurs in linear structural equation models but also in certain nonlinear models. Theseexamples demonstrate the importance of identifying mechanistic heterogeneity to correctlyinterpret MR studies. Motivated by the observations in the previous section, we propose a latent mixture modelto discover mechanistic heterogeneity using summary GWAS data. In essence, this modelassumes that each genetic variant has a speciﬁc causal eﬀect and the genetic variants onthe same biological pathway have similar variant-speciﬁc causal eﬀects and form clusters.The mean of each cluster corresponds to the Wald estimand of that pathway (last column11n table 1). These assumptions, along with standard assumptions for summary-data MRliterature, are introduced below.

Assumption 4.1 (Error-in-variables regression)

The observed instrument-exposure andinstrument-outcome associations are distributed as  ˆ θ X i ˆ θ Y i  indep. ∼ N (cid:16)  θ X i β i θ X i  ,  σ X i σ Y i  (cid:17) , i = 1 , . . . , p, (5) where σ X i , σ Y i are (ﬁxed) measurement errors. In this assumption, the variant-speciﬁc causal eﬀects are given by β i . The normalityassumption is justiﬁed because ˆ θ X i and ˆ θ Y i are typically linear (or logistic) regression coef-ﬁcients which are computed in GWAS with a large sample size. Independence between ˆ θ X i and ˆ θ Y i for each SNP is justiﬁed if they are from GWAS conducted with non-overlappingsamples. Independence of the estimated eﬀects across diﬀerent SNPs is a reasonable as-sumption if we select SNPs that are uncorrelated by using linkage disequilibrium clumping.Although independence between SNPs does not imply the estimated eﬀects are uncorrelated,the correlation between the estimated eﬀects are typically negligible (Zhao et al., 2020). Assumption 4.2 (Mixture model for mechanistic heterogeneity) Z i ∼ Categorical ( π , . . . , π K ) , (6) β i | Z i = k ∼ N ( µ k , σ k ) , k = 1 , . . . , K. (7)We assume a Gaussian mixture model for the variant-speciﬁc (latent) causal eﬀects β i toprobabilistically cluster genetic variants with similar causal eﬀects. An indicator variable forcluster membership of SNP i is given by Z i . We can compute the posterior distribution of Z i and β i to summarize our knowledge of these variant-speciﬁc latent variables based on data(see section 5.4). The K clusters represent diﬀerent causal mechanisms where the clustermeans µ k , for k = 1 , . . . , K , identify the average causal eﬀects for each mechanism. Thecluster proportions π k give us the proportion of genetic variants associated with each mecha-nism. The cluster variances σ k quantify uncertainty within each mechanism. Therefore, theparameters of interest in MR-PATH are given by ϕ = { π k , µ k , σ k : k = 1 , . . . , K } . The num-ber of clusters K is unknown but can be chosen based on heuristics and domain knowledgeor estimated from data—see section 5.5 for a model selection criterion for choosing K .12 Statistical inference for MR-PATH

In order to ﬁt MR-PATH to gain insight into mechanistic heterogeneity, we proceed bydiscussing three inference procedures. First, we describe our implementation of the EMalgorithm for obtaining a maximum likelihood estimate of ϕ and point out the challenges andour solutions for the expectation step. Second, we discuss approximate conﬁdence intervalsfor ϕ obtained by computing and inverting the observed information matrix. Lastly, wego into how K can be selected from data using a modiﬁed Bayesian Information criterion(BIC). Let D = { (ˆ θ X i , σ X i , ˆ θ Y i , σ Y i ) : i = 1 , . . . , p } denote the observed data and let Θ = { ( θ X i , β i , Z i ) : i = 1 , . . . , p } denote the set of latent variables. The Expectation-Maximization (EM) algorithm is an iterative procedure commonly usedto perform maximum likelihood estimation in latent variable models. To deﬁne the EMalgorithm for MR-PATH, we derive two important model quantities: the complete-data loglikelihood and the conditional posterior of the latent variables Θ , given parameters ϕ . Thelatter is used to compute the Q-function in the expectation step. Let φ ( · ; µ, σ ) denote thedensity of N ( µ, σ ). We make the additional model assumption that ˆ θ X i ∼ N ( m x , λ x ). Itfollows that the complete-data log likelihood for MR-PATH is given by l ( ϕ ; Θ ) := p (cid:88) i =1 l i ( ϕ ; θ X i , β i , Z i ) , (8)where l i ( ϕ ; θ X i , β i , Z i ) ∝ log φ ( θ X i ; m x , λ x ) + K (cid:88) k =1 Z ik (cid:2) log π k + log φ ( β i ; µ k , σ k ) (cid:3) (9)and Z ik = 1, if Z i = k , and 0 otherwise. The conditional posterior of the latent variablesgiven ϕ can be decomposed as P ( Θ | D , ϕ ) := p (cid:89) i =1 P ( β i , Z i , θ X i | ˆ θ X i , ˆ θ Y i , ϕ )= p (cid:89) i =1 K (cid:89) k =1 (cid:2) P ( Z i = k | β i , ϕ ) (cid:3) Z ik P ( β i | θ X i , ˆ θ Y i , ϕ ) P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) , (10)13here P ( Z i = k | β i , ϕ ) and P ( β i | θ X i , ˆ θ Y i , ϕ ) are available in closed-form and are given by P ( Z i = k | β i , ϕ ) = π k φ ( β i ; µ k , σ k ) (cid:80) Kj =1 π j φ ( β i ; µ j , σ j ) , (11) P ( β i | θ X i , ˆ θ Y i , ϕ ) = K (cid:88) k =1 π k φ ( β i ; ˜ µ ik , ˜ σ ik ); (12)where ˜ σ ik = (cid:16) σ k + θ X i σ Y i (cid:17) − , ˜ µ ik = ˜ σ ik (cid:16) ˆ θ Y i θ X i σ Y i + µ k σ k (cid:17) . Unfortunately, instead of having an analytical solution, the probability density P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ )is known only up to a multiplicative constant given by P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) ∝ P ( θ X i | ˆ θ X i , ϕ ) P (ˆ θ Y i | θ X i , ϕ )= φ ( θ X i ; m X i , λ X i ) K (cid:88) k =1 π k φ (ˆ θ Y i ; θ X i µ k , θ X i σ k + σ Y i ) , (13)where λ X i = (cid:16) σ X i + 1 λ x (cid:17) − , m X i = λ X i (cid:16) ˆ θ X i σ X i + m x λ x (cid:17) . The EM algorithm starts with initial values for the parameter estimates ϕ (0) . Eachiteration t = 1 , , . . . of the EM algorithm consists of an expectation step (E-step) anda maximization (M-step). The E-step involves computing the Q-function, deﬁned as theexpectation of the complete-data log likelihood with respect to the conditional posteriorof the latent variables given the previous iterations parameter estimates. Speciﬁcally, theE-step can be represented by Q ( ϕ, ϕ ( t − ) = E (cid:2) l ( ϕ ; Θ ) | D , ϕ ( t − (cid:3) , (14)where the expectation is taken with respect to P ( Θ | D , ϕ ). The M-step consists of com-puting an update of the parameter estimates for the current iteration as the value thatmaximizes the Q-function. In other words, the M-step involves ϕ ( t ) = arg max ϕ Q ( ϕ, ϕ ( t − ) . (15)14he EM algorithm guarantees the likelihood is non-decreasing across iterations—ascentproperty—and is theoretically guaranteed, under mild regularity conditions, to converge toa local optimum (Wu, 1983). In practice, a commonly used heuristic for determining the EMalgorithm has converged is when the increase in the Q-function from the previous iterationis less than a speciﬁed threshold. Unfortunately, since P ( Θ | D , ϕ ) is only known up to amultiplicative constant, the Q-function for MR-PATH cannot be computed analytically. Inthe next section, we discuss our implementation of a variant of the EM algorithm wherethe Q-function in the E-step is approximated using Monte-Carlo methods—the Monte-CarloEM (MC-EM) algorithm. The MC-EM algorithm allows us to perform maximum likelihood estimation in latent vari-able models where the Q-function cannot be computed analytically but can be approximatedusing Monte-Carlo methods. In section 5.2.1, we describe an importance sampling (IS)scheme for approximating the Q-function in MR-PATH. The main drawback of the MC-EMalgorithm is that Monte-Carlo error accrued from approximating the Q-function may causethe algorithm to not converge (Neath, 2013). More precise approximations of the Q-functionare needed as the M-step updates approach a local optimum. Furthermore, the MC-EM algo-rithm does not satisfy the ascent property of the vanilla EM algorithm. A simple solution tothese issues is to set the number of Monte-Carlo samples used to approximate the Q-function(MC sample size) to be very large or increase the MC sample size by a deterministic largeamount at each iteration. However, this may not be computationally feasible, thus creatinga trade-oﬀ between statistical consistency and computational eﬃciency. In section 5.2.2, wediscuss an automated data-driven procedure introduced by Caﬀo et al. (2005) that guaran-tees the ascent property is satisﬁed with high probability by assessing Monte-Carlo error atthe end of each iteration and increasing the MC sample size accordingly.15 .2.1 Monte-Carlo E-step using Importance Sampling

The decomposition of P ( Θ | D , ϕ ) in eq. (10) suggests that we can obtain importance samplesfrom P ( Θ | D , ϕ ) by ﬁrst drawing samples of { θ X i } from an importance (proposal) distribu-tion. Given importance samples of { θ X i } , we can obtain importance samples of { β i } and { Z i } by directly sampling from (cid:81) pi =1 P ( β i | θ X i , ˆ θ X i , ϕ ) and (cid:81) pi =1 P ( Z i | β i , ϕ ), which are available inclosed form in eq. (11) and eq. (12). Then, we can estimate the Q-function with a sum of thecomplete-data log-likelihood evaluated at the samples weighted by the importance weights.Choosing an importance distribution that yields an eﬃcient sampling procedure is a non-trivial task that depends on the form of the target distribution (see Tokdar and Kass (2010)for a review). In our case, the target distribution is (cid:81) pi =1 P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) so a sensiblechoice for an importance (proposal) distribution would be (cid:81) pi =1 P ( θ X i | ˆ θ X i , ϕ ), the posteriordistribution of { θ X i } given only { ˆ θ X i } instead of the full data { ˆ θ X i , ˆ θ Y i } which has a closedform given in eq. (13). This choice of importance distribution yields importance weights thatare bounded and have a ﬁnite variance.More precisely, let m t denote the number of desired importance samples at iteration t and˜ ϕ ( t − denote the MC-EM update from iteration t −

1. For each i = 1 , . . . , p and j = 1 , . . . , m t ,suppose θ jX i ∼ P ( θ X i | ˆ θ X i , ϕ ( t − ), β ji ∼ P ( β i | θ X i , ˆ θ Y i , ϕ ( t − ), and Z ji ∼ P ( Z i | β i , ϕ ( t − ). Fromeq. (13), we have that the unnormalized importance weights are given by w ji = P (ˆ θ Y i | θ jX i , ˜ ϕ ( t − ) = K (cid:88) k =1 ˜ π ( t − k φ (cid:16) ˆ θ Y i ; θ jX i ˜ µ ( t − k , ( θ jX i ) ˜ σ t − k + σ Y i (cid:17) (16)for j = 1 , . . . , M . It can be shown that 0 ≤ w ji ≤ (cid:0) πσ Y i (cid:1) − / (proof is provided in ap-pendix A.1). Let ¯ w ji = w ji / (cid:80) m t j =1 w ji be the normalized importance weights. The IS estimateof the Q-function at iteration t is given by˜ Q ( ϕ, ˜ ϕ ( t − ; m t ) = p (cid:88) i =1 m t (cid:88) j =1 ¯ w ji l ji ( ϕ ) , (17)where l ji ( ϕ ) = l i ( ϕ ; θ jX i , β ji , Z ji ) , and l i is deﬁned in eq. (9). Consequently, the MC-EM update of the model parameters at16teration t approximated with Monte-Carlo sample size m t is given by˜ ϕ ( t,m t ) = arg max ϕ ˜ Q ( ϕ, ˜ ϕ ( t − ; m t ) . At the end of each MC-EM iteration, the ascent-based MC-EM algorithm (Caﬀo et al.,2005) performs a hypothesis test to determine whether the Q-function has increased fromthe previous iteration. If there is not suﬃcient evidence that suggests the Q-function hasincreased, we reject the current iteration’s M-step and repeat the iteration with a largerMC sample size. In this section, we will describe this procedure concretely for our inferenceproblem.Deﬁne the change in the Q function at iteration t of MC-EM as∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) := Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) − Q ( ˜ ϕ ( t − , ˜ ϕ ( t − ) . (18)After obtaining ˜ ϕ ( t,m t ) in the M-step, we are interested in performing the following hy-pothesis test at a speciﬁed signiﬁcance level α : H : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) = 0 , (19) H A : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) > . If we reject H , then we accept ˜ ϕ ( t,m t ) and move on to the next iteration. If we fail toreject H , then we reject ˜ ϕ ( t,m t ) and repeat the current iteration with a larger Monte-Carlosample size. The change in the Q function in eq. (18) can be approximated using∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) ≈ p (cid:88) i =1 m t (cid:88) j =1 w ji Λ ( t ) ij (20)where Λ ( t ) ij = l ji ( ˜ ϕ ( t,m t ) ) − l ji ( ˜ ϕ ( t − ) . (21)It was shown in Caﬀo et al. (2005) that under H , √ m t ∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) d → N (0 , η ) (22)as m t → ∞ , where σ depends on the sampling procedure used. We can obtain an estimateof η , ˆ η , by computing the variance of the importance sampling estimate in eq. (20) (details17rovided in appendix A.2). Therefore, we can reject H with approximate signiﬁcance level α if ∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) − z α ˆ ηm t > z α is the (1 − α ) th quantile of the standard normal distribution. If we fail to reject H ,we repeat iteration t with a larger m t until we are able to reject H . Similarly, a convenientstopping criterion is obtained by testing H A : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) < (cid:15) at a speciﬁed signiﬁcance level γ and threshold (cid:15) . We can reject H with approximatesigniﬁcance level γ and determine MC-EM has converged if∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) + z γ ˆ ηm t < (cid:15). To quantify uncertainty of the parameter estimates obtained using the MC-EM algorithm,we adapt the method in (Louis, 1982) for computing the observed information matrix in theEM framework. Standard errors, and therefore approximate conﬁdence intervals, can thenbe obtained by inverting the observed information matrix. By a result presented in (Louis,1982), the observed information matrix at a point ϕ ∗ can be computed by I ( ϕ ∗ ) = (cid:110) E (cid:104) − ∂ l ( ϕ ) ∂ϕ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) − E (cid:104)(cid:16) ∂l ( ϕ ) ∂ϕ ∂l ( ϕ ) ∂ϕ T (cid:17)(cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) + E (cid:104) ∂l ( ϕ ) ∂ϕ (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) E (cid:104) ∂l ( ϕ ) ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105)(cid:111)(cid:12)(cid:12)(cid:12) ϕ = ϕ ∗ (23)where the dependence of D , Θ in the log-likelihood have been suppressed for notationalconvenience. Let ˆ ϕ denote the ﬁnal MC-EM parameter estimate and suppose { θ jX i , β ji , Z ji : j = 1 , . . . , M } are now importance samples from P ( θ X i , β i , Z i | D , ˆ ϕ ) with (normalized)weights { w ji } . The ﬁrst and third expectations in eq. (23) can be approximated by E (cid:104) − ∂ l ( ϕ ) ∂ϕ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ (cid:105) ≈ p (cid:88) i =1 M (cid:88) j =1 w ji ∂ l ji ( ϕ ) ∂ϕ∂ϕ T , and E (cid:104) ∂l ( ϕ ) ∂ϕ (cid:12)(cid:12)(cid:12) D , ϕ (cid:105) ≈ p (cid:88) i =1 M (cid:88) j =1 w ji ∂l ji ( ϕ ) ∂ϕ . E (cid:104)(cid:16) ∂l ( ϕ ) ∂ϕ ∂l ( ϕ ) ∂ϕ T (cid:17)(cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) = p (cid:88) i =1 E (cid:104)(cid:16) ∂∂ϕ l i ( ϕ ) (cid:17)(cid:16) ∂∂ϕ l i ( ϕ ) (cid:17) T (cid:105) + 2 (cid:88) i

To gain a better picture of our knowledge of each individual SNP, we can sample from P ( β i , Z i , θ X i | ˆ θ X i , ˆ θ Y i , ˆ ϕ )—the posterior distribution of variant-speciﬁc latent variables givenan MC-EM estimate of ϕ —by using the sampling/importance resampling (SIR) algorithm(Li, 2004). For example, P ( Z i = k | ˆ θ X i , ˆ θ Y i , ˆ ϕ )—the cluster membership probability of the i th SNP—quantiﬁes how certain we are that the i th SNP belongs to a certain cluster. Morespeciﬁcally, suppose we have M samples { β ji , Z ji , θ jX i : j = 1 , . . . , M } from our importancedistribution for the i th SNP. Then we can obtain samples of β i from P ( β i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) bysampling—with replacement—from { β ji } with probabilities proportional to the importanceweights given in eq. (16). Samples from P ( Z i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) and P ( θ X i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) can be ob-tained in a similar fashion. In particular, these samples can be used to construct credibleintervals of β i and compute cluster membership probabilities—see section 7 for examples. To select the number of clusters K , we use a modiﬁed Bayesian Information criterion (BIC)for latent variable models estimated using the EM algorithm adopted from Ibrahim et al.(2008). For MR-PATH, the standard BIC is typically deﬁned asBIC = − P ( D | ˆ ϕ ) + (3 K + 2 p ) log( p )19here ϕ is the MLE of ϕ and 3 K + 2 p is the dimension of our model. We replace themarginal density P ( D | ˆ ϕ ) with the readily available IS estimate of the Q-function at the ﬁnalMC-EM iteration from eq. (17). To verify the eﬃcacy of our statistical inference procedures, we perform two simulation stud-ies. The goal of the ﬁrst simulation study is to demonstrate that (1) the MC-EM algorithmgives parameter estimates that are close to the ground truth and (2) the approximate con-ﬁdence intervals we derive have desirable coverage probabilities. The goal of our secondsimulation study is to evaluate the accuracy of the modiﬁed BIC for selecting the number ofclusters K . In our ﬁrst simulation study, we generate simulated data from MR-PATH under variousparameter settings that mimic GWAS summary data used in practice. In each setting, wegenerate measurement errors as σ X i , σ Y i ∼ Inv. Gamma(9 , . λ x = 10 / √ p . We set λ x to be inversely related to the number of genetic variants p because a larger p correspondsto a less stringent criterion for ﬁltering the genetic variants. We vary the number of geneticvariants to be p = 50 , , , K true = 1 , ,

3. In MR,we do not expect the number of clusters to be greater than 3 and the number of ﬁlteredgenetic variants to be large. For each parameter setting, we ran the MC-EM algorithm with K = K true and computed the approximated conﬁdence intervals with 500 simulated data-sets and calculated the root mean squared error (RMSE) and 95% coverage probability. Asthe EM algorithm is sensitive to initial value speciﬁcation, for each repetition, we run theMC-EM algorithm 10 times with diﬀerent random initial values and report the results fromthe run with the largest complete-data log likelihood. Simulation results are presented inFigure 4. Note that as K increases, we choose mixture means to be closer in value, making20he estimation task more challenging.In this simulation study, the RMSE for each scenario does not exceed .09 and the coverageprobability for most scenarios do not deviate much from the desired 95%. Furthermore,we observe a trade-oﬀ between RMSE and coverage probability when ﬁltering out weakerinstruments (variants with ˆ θ X i ≈ p ) results inlower RMSE but coverage probabilities that deviate more from 95%. m s

250 500 750 1000 250 500 750 10000.0050.010 m s

250 500 750 1000 250 500 750 10000.650.750.850.95 (a)

K = 1 . µ = 0 . σ = 0 . p m s

250 500 750 1000 250 500 750 1000 250 500 750 10000.020.040.06 p m s

250 500 750 1000 250 500 750 1000 250 500 750 10000.650.750.850.95 (b)

K = 2 . π = 0 . µ = − . µ = 0 .

5, and σ = σ = 0 . p m s

250 500 750 1000 250 500 750 1000 250 500 750 10000.020.040.06 p m s

250 500 750 1000 250 500 750 1000 250 500 750 10000.650.750.850.95 (c)

K = 3 . π = π = 0 . µ = − . µ = 0, µ = 0 .

5, and σ = σ = σ = 0 . Figure 4:

Results from ﬁrst simulation study . For each parameter setting, we simulated N rep = 500 data-sets. Each row corresponds to a diﬀerent number of cluster K with theground truth in the sub-captions. Each plot corresponds to one of the mixture parameters π = ( π , . . . , π K ), µ = ( µ , . . . , µ K ), σ = ( σ , . . . , σ K ) in MR-PATH. Left : Plots of root meansquared error (RMSE) against the number of variants p for each mixture parameter. Right :Plots of 95% coverage probabilities against p for each mixture parameter. A horizontaldashed line is drawn at .95 (desired coverage prob.).21 .2 Model Selection with modiﬁed BIC In our second simulation study, we simulate data from MR-PATH with p = 50 or 250 and √ pλ x = 1 or 5. In each setting, we set the true number of clusters K true = 1,2, or 3. In eachreplication, we ran the MC-EM algorithm with K = 1 , , K that yields thelowest modiﬁed BIC value. Results for this simulation study are shown in table 2.From table 2, we observe that lower values for √ pλ x result in the largest decreases inthe accuracy of the modiﬁed BIC. For example, when p = 50 and K true = 2, the modiﬁedBIC chose the correct K

92% of the time when √ pλ x = 5 but only 71% of the time when √ pλ x = 1. We also notice that when √ pλ x = 1, the accuracy of the modiﬁed BIC decreaseswith p . A practical consequence of this observation is that the modiﬁed BIC is more likelyto choose the correct K with few strong instruments than with more weak instruments. We now return to the motivating example introduced in Section 1.1. The dataset beingused is created from several large-scale GWAS datasets for plasma lipids (HDL-C, LDL-C,triglycerides) (Teslovich et al., 2010; Willer et al., 2013), coronary heart disease (Nikpayet al., 2015), lipoprotein subfractions (Kettunen et al., 2016), and other cardiovascular dis-eases. We use the three-sample summary-data MR design described in Zhao et al. (2020)to preprocess and homogeneize the datasets. We ﬁrst select 151 independent SNPs (dis-tance ≥

10 mega base pairs, R ≤ .

001 in a reference panel) that are associated with atleast plasma lipid traits (deﬁned as the minimum p -value with HDL-C, LDL-C, and triglyc-erides less than 10 − ). We then obtain the GWAS associations of these SNPS with all theother cardiometabolic traits. For the purpose of this example, we will focus on 31 SNPsthat showed genome-wide signiﬁcant associations ( p -value ≤ × − ) with HDL-C in theselection GWAS.We apply the Monte Carlo EM algorithm developed in Section 5 to the 31 genetic in-struments and their associations with HDL-C in a separate dataset (Kettunen et al., 2016)22roportion p √ pλ x K true K = 1 K = 2 K = 350 1 1 .83 .17 02 0 0.71 0.293 0 .01 0.995 1 .98 0.02 02 0 0.92 0.083 0 .002 .998250 1 1 .76 0.24 02 0 0.67 0.333 0 .005 .9955 1 1 0 02 0 0.99 0.013 0 0 1Table 2: Results from second simulation study . For each setting, we simulated N rep = 500data-sets. For K true = 1, we set µ = 0 . σ = 0 .

1. For K true = 2, we set π = 0 . µ = ( − . , . σ = σ = 0 .

1. For K true = 3, we set π = π = 1 / µ = ( − . , , . σ = σ = σ = 0 .

05. The last column reports the proportion of repetitions for eachsetting where the modiﬁed BIC chose the corresponding K .23nd CHD. The modiﬁed BIC selects K = 2 clusters of SNPs. The larger cluster (ˆ π = 0 . µ = − . σ = 0 .

23) and the smaller cluster (ˆ π = 0 . µ = 0 .

14, ˆ σ = 0 . β j along side the cluster membership probabilities.To validate the mechanistic heterogeneity identiﬁed by MR-PATH, we generate a heatmapof the associations (z-scores) of the SNPs with lipoprotein subfraction traits (Kettunen et al.,2016). Most of the traits are named after their size (XS = extra small, S = small, M =medium, L = large, XL = extra large, XXL = double extra large), their lipoprotein class(HDL, IDL = intermediate-density lipoprotein, LDL, VLDL = very-low-density lipoprotein),and the measurement (C = total cholesterol, CE = cholesterol esters, FC = free cholesterol,L = total lipid, P = particle concentration, PL = phosolipids, TG = triglycerides). Othertraits including the mean diameter of HDL/LDL/VLDL particles (HDL-D/LDL-D/VLDL-D) and the concentration of ApoA1/ApoB (major protein component of HDL/LDL). Toaid visualization, the SNPs are ordered by their cluster membership probabilities and thelipoprotein subfractions are ordered by their density and size.The heatmap in Figure 6 shows that the SNP clusters found by the mixture model exhibitdiﬀerent patterns of association with the lipoprotein subfractions. Several SNPs in the ﬁrstcluster have strong inverse association with LDL-C and other LDL/VLDL subfraction traits.Therefore, the negative eﬀect of HDL-C on CHD sugested by the instruments in the ﬁrstcluster cluster may indeed be due to their pleiotropic eﬀect on LDL-C and ApoB-containinglipoproteins (see scenario 1 in Figure 3a). In contrast, several SNPs in the second cluster(rs1532085, rs588136, rs174546, rs7679) are inversely associated with the concentration ofsmall HDL particles, so they may be related to another mechanism that regulates the size ofHDL particles. Although the instruments in this cluster suggest a positive eﬀect of HDL-Con CHD, this may be explained by heterogeneous eﬀects of cholesterol contained in diﬀerentHDL subfractions (see scenario 2 in Figure 3a). An earlier univariable MR study indeed foundthat the concentration of small and medium HDL particles may have a negative eﬀect onCHD, while the large and extra large HDL particles seem to have no eﬀect (Zhao et al., 2019).24 r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s % P o s t e r i o r C r ed i b l e I n t e r v a l C l u s t e r m e m be r s h i p p r ob . Figure 5:

SNP-speciﬁc posterior quantities in HDL-CAD data . The SNPs are ordered bytheir posterior probability of belonging to cluster 2.

Top :

95% posterior credible intervals .Colored dashed lines are the estimated cluster means and x-marks are the posterior mediansfor each SNP.

Bottom : Posterior cluster membership probabilities bar plot . Vertical axis isposterior probability of belonging to cluster 2.To summarize, the heatmap provides some evidence that the clustering structure identiﬁedby our mixture model indeed corresponds to some distinct underlying mechanisms.

We now turn to a second example to illustrate the utility of MR-PATH. In this example, weare interested in possible mechanistic heterogeneity of the eﬀect of adiposity (as measuredby the body mass index, BMI) on type II diabetes (T2D). Following the same three-samplesummary-data MR design as describe in Section 7.1, we created a dataset of 60 SNPs fromtwo GWAS summary datasets for BMI (Akiyama et al., 2017; Locke et al., 2015) and onefor T2D (Mahajan et al., 2018). We then apply the Monte Carlo EM algorithm developed in25ection 5. The modiﬁed BIC selects K = 2 clusters of SNPs. The larger cluster (ˆ π = 0 . µ = 0 .

77, ˆ σ = 0 .

42) and the smaller cluster (ˆ π = 0 . µ = − .

4, ˆ σ = 1 . β i with the SNP association with peak blood insulin, which is available froman independent GWAS (Wood et al., 2017) (Figure 9). In fact, six out of the seven SNPsclassiﬁed into cluster 2 are strongly associated with peak blood insulin. This shows that thelarge negative eﬀect of this cluster is most likely due to horizontal pleiotropy (Figure 3a)instead of a genuine negative causal eﬀect of adiposity. The results we obatined here arebroadly consistent with other recent genetic studies that have identiﬁed SNPs with oppositeeﬀects on adiposity and type II diabetes and linked the “favorable adiposity” genes to insulinfunction and fat distribution (Ji et al., 2019). In this paper, we have formalized the notion of mechanistic heterogeneity in the context ofMR and showed that SNPs on the same biological pathway identify similar causal eﬀects.Diﬀerent pathways generally correspond to diﬀerent causal eﬀects, even if they are all validinstruments. Motivated by this observation, we introduced MR-PATH, an interpretablemixture model for from summary-level GWAS data that can provide valuable insights onmechanistic heterogenity.MR-PATH has several advantages over similar existing methods for MR. First, it relaxesthe eﬀect homogeneity assumption implicit in most existing MR methods so it is able toidentify multiple causal mechanisms. Second, MR-PATH does not require substantial domainknowledge since we use a data-driven approach to select the number of clusters. Lastly, MR-PATH is based on a full likelihood and is robust to weak instrument bias since we use anerror-in-variables approach to estimate the variant-speciﬁc causal eﬀects. We showed using26umerical simulations that our MC-EM algorithm gives parameter estimates that are closeto the ground truth and the corresponding approximate conﬁdence intervals have coverageprobabilities close to their true values. We also showed that the modiﬁed BIC criterion weused for selecting the number of clusters chose correctly most of the time.We demonstrated the utility of MR-PATH in modeling mechanistic heterogeneity inMR analysis by using it to investigate the causal mechanisms between HDL-C and CHDand between adioposity and type II diabetes. These examples reinforce the importance ofconsidering multiple causal mechanisms in MR analysis. Our ﬁndings are consistent withexisting genetic studies that use external data. For the HDL-C and CHD data set, MR-PATH identiﬁes a cluster with a positive average causal eﬀect which may be associated witha mechanism that regulates size of HDL particles. In our study of the role of adioposity ontype II diabetes, MR-PATH ﬁnds a cluster with a negative average causal eﬀect that is likelyattributed to horizontal pleiotropy.Since MR-PATH is a generative model for multiple causal mechanisms in MR, there aremany potential extensions that can be incorporated in future work. One such extension isto replace our univariate mixture model with a multivariate model to consider multiple riskexposures simultaneously. The multivariate version of MR-PATH can be used to accountfor the pleiotropic eﬀects of other lipoproteins in our HDL and CHD example. Anotherpossible extension is to allow for correlated SNPs by relaxing the independence assumptionin assumption 4.1.

Acknowledgement

We would like to thank Xuelu Wang for helpful comments on the type II diabetes example.27 eferences

Akiyama, M., Okada, Y., Kanai, M., Takahashi, A., Momozawa, Y., Ikeda, M., Iwata,N., Ikegawa, S., Hirata, M., Matsuda, K., et al. (2017). Genome-wide association studyidentiﬁes 112 new loci for body mass index in the japanese population.

Nature Genetics ,49(10):1458.Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation ina complete system of stochastic equations.

Annals of Mathematical Statistics , 20(1):46–63.Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identiﬁcation of causal eﬀects usinginstrumental variables.

Journal of the American Statistical Association , 91(434):444–455.Armitage, J., Holmes, M. V., and Preiss, D. (2019). Cholesteryl ester transfer proteininhibition for preventing cardiovascular events: Jacc review topic of the week.

Journal ofthe American College of Cardiology , 73(4):477–487.Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalidinstruments: eﬀect estimation and bias detection through egger regression.

InternationalJournal of Epidemiology , 44(2):512–525.Bowden, J., Del Greco M, F., Minelli, C., Davey Smith, G., Sheehan, N., and Thompson,J. (2017). A framework for the investigation of pleiotropy in two-sample summary datamendelian randomization.

Statistics in Medicine , 36(11):1783–1802.Boyle, E. A., Li, Y. I., and Pritchard, J. K. (2017). An expanded view of complex traits:from polygenic to omnigenic.

Cell , 169(7):1177–1186.Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone,C., McMahon, A., Morales, J., Mountjoy, E., Sollis, E., et al. (2019). The nhgri-ebigwas catalog of published genome-wide association studies, targeted arrays and summarystatistics 2019.

Nucleic acids research , 47(D1):D1005–D1012.Burgess, S., Bowden, J., Fall, T., Ingelsson, E., and Thompson, S. G. (2017). Sensitiv-28ty Analyses for Robust Causal Inference from Mendelian Randomization Analyses withMultiple Genetic Variants.

Epidemiology (Cambridge, Mass.) , 28(1):30–42.Burgess, S., Butterworth, A., and Thompson, S. G. (2013). Mendelian randomizationanalysis with multiple genetic variants using summarized data.

Genetic Epidemiology ,37(7):658–665.Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. M. (2020). A robust andeﬃcient method for Mendelian randomization with hundreds of genetic variants.

NatureCommunications , 11(1):376.Caﬀo, B. S., Jank, W., and Jones, G. L. (2005). Ascent-based Monte Carlo expectationmaximization.

Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,67(2):235–251.Davey Smith, G. and Phillips, A. N. (2020). Correlation without a cause: an epidemiologicalodyssey.

International Journal of Epidemiology , 49(1):4–14.Didelez, V. and Sheehan, N. (2007). Mendelian randomization as an instrumental variableapproach to causal inference.

Statistical Methods in Medical Research , 16(4):309–330.Foley, C. N., Kirk, P. D., and Burgess, S. (2019). MR-Clust: Clustering of genetic variants inMendelian randomization with similar causal estimates. bioRxiv , page 2019.12.18.881326.Ibrahim, J. G., Zhu, H., and Tang, N. (2008). Model Selection Criteria for Missing-DataProblems Using the EM Algorithm.

Journal of the American Statistical Association ,103(484):1648–1658.Ji, Y., Yiorkas, A. M., Frau, F., Mook-Kanamori, D., Staiger, H., Thomas, E. L., Atabaki-Pasdar, N., Campbell, A., Tyrrell, J., Jones, S. E., et al. (2019). Genome-wide andabdominal mri data provide evidence that a genetically determined favorable adiposityphenotype is characterized by lower ectopic liver fat and lower risk of type 2 diabetes,heart disease, and hypertension.

Diabetes , 68(1):207–219.29ang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimationwith some invalid instruments and its application to mendelian randomization.

Journalof the American Statistical Association , 111(513):132–144.Kettunen, J., Demirkan, A., W¨urtz, P., Draisma, H. H., Haller, T., Rawal, R., Vaarhorst,A., Kangas, A. J., Lyytik¨ainen, L.-P., Pirinen, M., et al. (2016). Genome-wide study forcirculating metabolites identiﬁes 62 loci and reveals novel systemic eﬀects of lpa.

Naturecommunications , 7(1):1–9.Li, K.-H. (2004). The Sampling/Importance Resampling Algorithm. In

Applied BayesianModeling and Causal Inference from IncompleteData Perspectives , Wiley Series in Proba-bility and Statistics, pages 265–276.Liu, X., Li, Y. I., and Pritchard, J. K. (2019). Trans eﬀects on gene expression can driveomnigenic inheritance.

Cell , 177(4):1022–1034.Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C.,Vedantam, S., Buchkovich, M. L., Yang, J., et al. (2015). Genetic studies of body massindex yield new insights for obesity biology.

Nature , 518(7538):197–206.Louis, T. A. (1982). Finding the Observed Information Matrix when Using the EM Algo-rithm.

Journal of the Royal Statistical Society. Series B (Methodological) , 44(2):226–233.Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W.,Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., et al. (2018). Fine-mappingtype 2 diabetes loci to single-variant resolution using high-density imputation and islet-speciﬁc epigenome maps.

Nature genetics , 50(11):1505–1513.Neath, R. C. (2013). On Convergence Properties of the Monte Carlo EM Algorithm. In

Advances in Modern Statistical Theory and Applications: A Festschrift in honor of Mor-ris L. Eaton , volume Volume 10 of

Collections , pages 43–62. Institute of MathematicalStatistics, Beachwood, Ohio, USA.Nikpay, M., Goel, A., Won, H.-H., Hall, L. M., Willenborg, C., Kanoni, S., Saleheen, D.,Kyriakou, T., Nelson, C. P., Hopewell, J. C., et al. (2015). A comprehensive 1000 genomes–30ased genome-wide association meta-analysis of coronary artery disease.

Nature Genetics ,47(10):1121.Pearl, J. (2009).

Causality . Cambridge University Press.Qi, G. and Chatterjee, N. (2019). Mendelian randomization analysis using mixture modelsfor robust and eﬃcient estimation of causal eﬀects.

Nature Communications , 10(1):1941.Rader, D. J. and Hovingh, G. K. (2014). Hdl and cardiovascular disease.

The Lancet ,384(9943):618–625.Shapland, C. Y., Zhao, Q., and Bowden, J. (2020). Proﬁle-likelihood Bayesian model averag-ing for two-sample summary data Mendelian randomization in the presence of horizontalpleiotropy. bioRxiv , page 2020.02.11.943712.Smith, G. D. and Ebrahim, S. (2004). Mendelian randomization: prospects, potentials, andlimitations.

International journal of epidemiology , 33(1):30–42.Teslovich, T. M., Musunuru, K., Smith, A. V., Edmondson, A. C., Stylianou, I. M., Koseki,M., Pirruccello, J. P., Ripatti, S., Chasman, D. I., Willer, C. J., et al. (2010). Biological,clinical and population relevance of 95 loci for blood lipids.

Nature , 466(7307):707–713.Tokdar, S. T. and Kass, R. E. (2010). Importance sampling: a review.

WIREs ComputationalStatistics , 2(1):54–60.Verbanck, M., Chen, C.-y., Neale, B., and Do, R. (2018). Detection of widespread horizontalpleiotropy in causal relationships inferred from mendelian randomization between complextraits and diseases.

Nature Genetics , 50(5):693–698.Voight, B. F., Peloso, G. M., Orho-Melander, M., Frikke-Schmidt, R., Barbalic, M., Jensen,M. K., Hindy, G., H´olm, H., Ding, E. L., Johnson, T., Schunkert, H., Samani, N. J.,Clarke, R., Hopewell, J. C., Thompson, J. F., Li, M., Thorleifsson, G., Newton-Cheh, C.,Musunuru, K., Pirruccello, J. P., Saleheen, D., Chen, L., Stewart, A. F. R., Schillert, A.,Thorsteinsdottir, U., Thorgeirsson, G., Anand, S., Engert, J. C., Morgan, T., Spertus,J., Stoll, M., Berger, K., Martinelli, N., Girelli, D., McKeown, P. P., Patterson, C. C.,31pstein, S. E., Devaney, J., Burnett, M.-S., Mooser, V., Ripatti, S., Surakka, I., Nieminen,M. S., Sinisalo, J., Lokki, M.-L., Perola, M., Havulinna, A., de Faire, U., Gigante, B.,Ingelsson, E., Zeller, T., Wild, P., de Bakker, P. I. W., Klungel, O. H., Maitland-van derZee, A.-H., Peters, B. J. M., de Boer, A., Grobbee, D. E., Kamphuisen, P. W., Deneer,V. H. M., Elbers, C. C., Onland-Moret, N. C., Hofker, M. H., Wijmenga, C., Verschuren,W. M. M., Boer, J. M. A., van der Schouw, Y. T., Rasheed, A., Frossard, P., Demissie,S., Willer, C., Do, R., Ordovas, J. M., Abecasis, G. R., Boehnke, M., Mohlke, K. L.,Daly, M. J., Guiducci, C., Burtt, N. P., Surti, A., Gonzalez, E., Purcell, S., Gabriel,S., Marrugat, J., Peden, J., Erdmann, J., Diemert, P., Willenborg, C., K¨onig, I. R.,Fischer, M., Hengstenberg, C., Ziegler, A., Buysschaert, I., Lambrechts, D., Van de Werf,F., Fox, K. A., El Mokhtari, N. E., Rubin, D., Schrezenmeir, J., Schreiber, S., Sch¨afer,A., Danesh, J., Blankenberg, S., Roberts, R., McPherson, R., Watkins, H., Hall, A. S.,Overvad, K., Rimm, E., Boerwinkle, E., Tybjaerg-Hansen, A., Cupples, L. A., Reilly,M. P., Melander, O., Mannucci, P. M., Ardissino, D., Siscovick, D., Elosua, R., Stefansson,K., O’Donnell, C. J., Salomaa, V., Rader, D. J., Peltonen, L., Schwartz, S. M., Altshuler,D., and Kathiresan, S. (2012). Plasma HDL cholesterol and risk of myocardial infarction:a mendelian randomisation study.

Lancet (London, England) , 380(9841):572–80.Wang, J., Zhao, Q., Bowden, J., Hemani, G., Smith, G. D., Small, D. S., and Zhang,N. R. (2020). Causal Inference for Heritable Phenotypic Risk Factors Using HeterogeneousGenetic Instruments. bioRxiv , page 2020.05.06.077982.Willer, C. J., Schmidt, E. M., Sengupta, S., Peloso, G. M., Gustafsson, S., Kanoni, S.,Ganna, A., Chen, J., Buchkovich, M. L., Mora, S., et al. (2013). Discovery and reﬁnementof loci associated with lipid levels.

Nature Genetics , 45(11):1274.Wood, A. R., Jonsson, A., Jackson, A. U., Wang, N., Van Leewen, N., Palmer, N. D.,Kobes, S., Deelen, J., Boquete-Vilarino, L., Paananen, J., et al. (2017). A genome-wide association study of ivgtt-based measures of ﬁrst-phase insulin secretion reﬁnes theunderlying physiology of type 2 diabetes variants.

Diabetes , 66(8):2296–2309.32u, C. F. J. (1983). On the Convergence Properties of the EM Algorithm.

The Annals ofStatistics , 11(1):95–103.Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. S. (in press 2020). Statisticalinference in two-sample summary-data Mendelian randomization using robust adjustedproﬁle score.

The Annals of Statistics .Zhao, Q., Wang, J., Miao, Z., Zhang, N., Hennessy, S., Small, D. S., and Rader, D. J. (2019).The role of lipoprotein subfractions in coronary artery disease: A Mendelian randomizationstudy. bioRxiv , page 691089. 33

Technical Appendix

A.1 Bounded Importance Sampling weights

Following eq. (13), the importance weights are given by w ji = P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) P ( θ X i | ˆ θ X i , ϕ ) ∝ P ( θ X i | ˆ θ X i , ϕ ) P (ˆ θ Y i | θ X i , ϕ )= K (cid:88) k =1 π k (cid:113) πθ X i σ k + σ Y i exp (cid:110) − θ X i σ k + σ Y i ) ( Y i − θ X i µ k ) (cid:111) ≤ (cid:113) πσ Y i , for i = 1 , . . . , p ; j = 1 , . . . , M . Therefore, the importance weights are bounded and havea ﬁnite variance. A.2 Importance sampling estimate of η An importance sampling estimate of η at iteration t in eq. (22) is given byˆ η = m t p (cid:88) i =1 (cid:110) m t (cid:88) j =1 w ji Λ ( t ) ij (cid:111) (cid:104) (cid:80) m t j =1 ( w ji Λ ( t ) ij ) ( (cid:80) m t j =1 w ji Λ ( t ) ij ) − (cid:80) m t j =1 ( w ji ) Λ ( t ) ij (cid:80) m t j =1 w ji Λ ( t ) ij + m t (cid:88) j =1 ( w ji ) (cid:105) , where Λ ( t ) ij is given in eq. (21). 34 LDL−DXXL−VLDL−TGXXL−VLDL−PLXXL−VLDL−PXXL−VLDL−LXL−VLDL−TGXL−VLDL−PLXL−VLDL−PXL−VLDL−LL−VLDL−TGL−VLDL−PLL−VLDL−PL−VLDL−LL−VLDL−FCL−VLDL−CEL−VLDL−CM−VLDL−TGM−VLDL−PLM−VLDL−PM−VLDL−LM−VLDL−FCM−VLDL−CEM−VLDL−CS−VLDL−TGS−VLDL−PLS−VLDL−PS−VLDL−LS−VLDL−FCXS−VLDL−TGXS−VLDL−PLXS−VLDL−PXS−VLDL−LLDL−DLDL−CAPOBL−LDL−PLL−LDL−PL−LDL−LL−LDL−FCL−LDL−CEL−LDL−CM−LDL−PLM−LDL−PM−LDL−LM−LDL−CEM−LDL−CS−LDL−PS−LDL−LS−LDL−CIDL−TGIDL−PLIDL−PIDL−LIDL−FCIDL−CHDL−DHDL−CAPOA1XL−HDL−TGXL−HDL−PLXL−HDL−PXL−HDL−LXL−HDL−FCXL−HDL−CEXL−HDL−CL−HDL−PLL−HDL−PL−HDL−LL−HDL−FCL−HDL−CEL−HDL−CM−HDL−PLM−HDL−PM−HDL−LM−HDL−FCM−HDL−CEM−HDL−CS−HDL−TGS−HDL−PS−HDL−L r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s −7 −5 −3 0 3 5 7 z−score Figure 6: A heatmap showing the z-scores of the SNPs’ associations with lipoprotein sub-fraction measurements. The SNPs are ordered by their posterior probability of belonging tocluster 2 as in Figure 5. The lipoprotein subfractions are ordered by their density and size.35

SNP association with BMI S N P a ss o c i a t i on w i t h T D Figure 7:

Scatterplot of BMI-T2D data with MC-EM parameter estimates . The coloredlines/shaded regions represent the estimated cluster means ± µ k ± ˆ σ k ). The points are colored according to the cluster with the highest posteriorcluster membership probability P ( Z i = k | ˆ θ X i , ˆ θ Y i ).36 r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s % P o s t e r i o r C r ed i b l e I n t e r v a l C l u s t e r m e m be r s h i p p r ob . Figure 8:

SNP-speciﬁc posterior quantities in BMI-T2D data . SNPs are ordered by theirposterior probability of belonging to cluster 2.

Top :

95% posterior credible intervals . Col-ored dashed lines are the estimated cluster means (same as Figure 7) and x-marks are theposterior medians for each SNP.

Bottom : Posterior cluster membership probabilities barplot . Vertical axis is posterior probability of belonging to cluster 2.37 s10906111rs2237892rs6444082rs7020996rs7903146rs7923837rs9368222

SNP−specific slope S N P a ss o c i a t i on w i t h pea k b l ood i n s u li n a a Figure 9:

Relationship of SNP-speciﬁc slope and the association with peak blood insulin . Thehorizontal axis is the posterior median of the SNP-speciﬁc slope β i with standard error bars.The vertical axis is the SNP association with peak blood insulin with standard error barsreported in an independent GWAS. The SNPs are colored according to the cluster with thehighest posterior cluster membership probability P ( Z i = k | ˆ θ X ii