A Latent Mixture Model for Heterogeneous Causal Mechanisms in Mendelian Randomization
AA Latent Mixture Model for HeterogeneousCausal Mechanisms in Mendelian
Randomization
Daniel IongDepartment of Statistics, University of Michigan, Ann ArborQingyuan ZhaoStatistical Laboratory, University of CambridgeYang ChenDepartment of Statistics, University of Michigan, Ann Arbor
Abstract
Mendelian Randomization (MR) is a popular method in epidemiology and geneticsthat uses genetic variation as instrumental variables for causal inference. Existing MRmethods usually assume most genetic variants are valid instrumental variables thatidentify a common causal effect. There is a general lack of awareness that this effecthomogeneity assumption can be violated when there are multiple causal pathwaysinvolved, even if all the instrumental variables are valid. In this article, we introduce alatent mixture model MR-PATH that groups instruments that yield similar causal effectestimates together. We develop a Monte-Carlo EM algorithm to fit this mixture model,derive approximate confidence intervals for uncertainty quantification, and adopt amodified Bayesian Information Criterion (BIC) for model selection. We verify theefficacy of the Monte-Carlo EM algorithm, confidence intervals, and model selectioncriterion using numerical simulations. We identify potential mechanistic heterogeneitywhen applying our method to estimate the effect of high-density lipoprotein cholesterolon coronary heart disease and the effect of adiposity on type II diabetes.
Keywords:
Causal inference, Instrumental variables, EM algorithm, Monte-Carlo sampling,HDL cholesterol, Diabetes 1 a r X i v : . [ s t a t . A P ] J u l Introduction
Mendelian randomization (MR) is a causal inference method that aims to estimate the causaleffect of a modifiable risk exposure on disease outcomes. MR is a special case of instrumentalvariable methods that have a long history in statistics and econometrics. The key insight ofMR is that genetic variants, usually in the form of single nucleotide polymorphisms (SNPs),are naturally randomised during conception and may serve as good instrumental variables formany epidemiological risk factors (Smith and Ebrahim, 2004; Didelez and Sheehan, 2007).As a study design, MR has been quickly gaining popularity among epidemiologists because ofits ability to give unbiased causal effect estimates in the presence of unmeasured confoundingand the increasing availability of genome-wide association studies (GWAS) data.A key assumption of MR is that the genetic instrumental variables can only affect theoutcome variables through the risk exposure under investigation. This is often referred toas the “exclusion restriction” or “no direct effect” assumption in the instrumental variableliterature. With genetic variants as instruments, this assumption may be violated due toa genetic phenomenon called “pleiotropy”, meaning a single genotype can affect multipleseemingly unrelated phenotypes. Recent empirical evidence and genetic theory suggest thatpleiotropy is pervasive for common traits (Boyle et al., 2017; Liu et al., 2019). This has leadto a burst of development of new statistical methods aiming to make MR studies robust todifferent patterns of pleiotropy (Bowden et al., 2015; Kang et al., 2016; Zhao et al., 2020;Verbanck et al., 2018; Burgess et al., 2020; Qi and Chatterjee, 2019).To our knowledge, the vast majority of these robust MR methods still rely on the “effecthomogeneity” assumption that the risk exposure has the same causal effect for every indi-vidual. This assumption usually follows from assuming a linear structural equation modelcommonly used in the instrumental variable literature (Anderson and Rubin, 1949; Bow-den et al., 2017). However, this key assumption may be unrealistic when we use MR tostudy complex biological systems involving multiple mechanisms, as demonstrated in thenext example. 2 .1 Motivating example: The effect of HDL cholesterol on coro-nary heart disease
The statistical model we develop in this article is motivated by a real world problem. Overthe last few decades, there has been a heated debate in cardiology on the role of high-density lipoproteins (HDL) in coronary heart disease (CHD) (Rader and Hovingh, 2014;Davey Smith and Phillips, 2020). Numerous observational studies have found a consistentinverse association between HDL cholesterol (HDL-C, amount of cholesterol carried in HDLparticles) and CHD, lending support to a theory that HDL plays a causally protective rolein atherogenesis (formation of fatty deposits in the arteries) through a biological mechanismcalled reverse cholesterol transport (HDL particles remove excessive cholesterol in peripheraltissues). This led to a once widely held belief among healthcare professionals and the generalpublic that HDL particles are the “good cholesterol”, as opposed to low-density lipoproteins(LDL) which are thought to be the “bad cholesterol”.However, the HDL hypothesis has received close scrutiny after several promising clinicaltrials raising HDL cholesterol through the
CETP inhibitors demonstrated no or only modestcardiovascular benefit (Armitage et al., 2019). Moreover, a prominent MR study furtherchallenged the presumption that raising HDL-C will uniformly translate into reductions inrisk of CHD (Voight et al., 2012). Although there are lots of SNPs associated with HDL-C,many of them are also associated with LDL cholesterol and/or triglycerdes. Due to thisreason, Voight et al. (2012) based their main argument on a SNP in the
LIPG gene thatdoes not exhibit significant association with LDL cholesterol and triglycerides, even thoughother genetic instruments showed varied associations with risk of CHD. This shows thatpleiotropy, arising from the multiple mechanisms involved in the synthesis and regulation ofblood lipids, poses a major challenge for using MR methods to study HDL.Using some of the latest large-scale GWAS datasets for HDL-C and CHD, we createda dataset to visualize and analyze the heterogeneity among potential genetic instrumentsfor HDL (Figure 1). Each point in this plot shows the reported associations of a SNP withHDL-C and CHD in the GWAS (with standard error bars). Figure 1a shows a straight lineacross the origin whose slope is obtained using a previous method called MR-RAPS that3
SNP association with HDL−C S N P a ss o c i a t i on w i t h CHD (a) MR-RAPS. −0.10−0.050.00 0.00 0.05 0.10 0.15
SNP association with HDL−C S N P a ss o c i a t i on w i t h CHD (b) Proposed mixture model.
Figure 1:
Scatterplot of HDL-CAD data . Left : Line/shaded region represents the causaleffect estimate ± one standard error from MR-RAPS. Right : Lines/shaded regions representheterogeneous causal effect estimates ± one standard deviation from our proposed mixturemodel.assumes a homogeneous effect of HDL-C on CHD (Zhao et al., 2020). If this assumptionholds, the slope in Figure 1a can be interpreted as the causal effect of HDL-C on CHD.However, it is clear from Figure 1a that the slope estimated by MR-RAPS provides a poorfit to the scatterplot.In this paper, we propose to fit this dataset using an alternative model where the SNPshave individual slopes that are drawn from a mixture distribution. This would be the case ifthere are several biological mechanisms involved in regulating HDL-C; see Section 3. In thisexample, MR-PATH selects two clusters (shown in Figure 1b) which provides a much betterfit to the data. See Section 7 for more detail about the data collection for this example andthe results of our model. There have been several attempts to develop MR methods that allow for heterogeneouscausal effects. We are also not the first to use mixture models for MR. The contamination ixture method proposed by Burgess et al. (2020) uses a two-component mixture modelto distinguish between valid and invalid instruments. Similarly, the MR-Mix method (Qiand Chatterjee, 2019) uses a four-component mixture model to identify one group of validinstruments and three groups of invalid instruments which have either direct effects on boththe exposure and outcome, direct effects on the outcome but no effect on the exposure,or no effect on both the exposure and outcome. However, the purpose of using mixturemodels in these approaches is not to identify different mechanisms but rather to provide arealistic model for the invalid instruments. In particular, they assume that the majority ofthe instruments are valid and indicate an identical causal effect.The only methods we are aware of that does not assume effect homogeneity and attemptsto distinguish causal mechanisms are
GRAPPLE (Wang et al., 2020),
MR-Clust (Foleyet al., 2019) and
BESIDE-MR (Shapland et al., 2020).
GRAPPLE proposes to use the localmaximums of a robustified profile likelihood function (Zhao et al., 2020) to discover multiplemechanisms. However,
GRAPPLE is only a visualization tool and does not attempt toexplicitly model the different mechanisms.
MR-Clust works by constructing a mixture modelbased on SNP-specific Wald estimators. Similar to our proposed method,
MR-Clust does notmake further assumptions about the number of clusters and the structure of each cluster.However, a major limitation of
MR-Clust is its assumption that the SNP-specific Waldestimates are normally distributed, which is a poor approximation for weak instruments.
BESIDE-MR is another related method that uses Bayesian model averaging. Although
BESIDE-MR is initially motivated by averaging over the uncertainty in selecting the validinstruments, it can also be extended to allow for multiple clusters of instruments indicatingdifferent causal effects. However,
BESIDE-MR is not a full likelihood approach because ituses the profile likelihood derived in Zhao et al. (2020) to eliminate the nuisance parametersrelated to the SNP-exposure effects.Our paper makes two main contributions to this fast growing literature. First, there is ageneral lack of awareness that MR can be used to discover multiple biological mechanisms,partly due to the wide usage of the broad terminology “effect heterogeneity” to refer to severaldifferent phenomena—invalid instrument due to pleiotropy, effect modification/moderationby a covariate, and effect heterogeneity due to different causal mechanisms. In this article5e introduce the concept of mechanistic heterogeneity for the last phenomenon and showthat it can occur even if all the instruments are valid.Our second contribution is a transparent mixture model, which we call MR-PATH, tocapture the mechanistic heterogeneity and the adaptation of a Monte Carlo EM algorithmto fit this model. Because our model is based on the SNP-exposure and SNP-outcomeassociations, it does not require the individual instruments to be strong. Since our MonteCarlo EM algorithm maximizes the full likelihood function for the latent mixture model, ithas all the benefits of likelihood-based inference.The rest of the paper is organized as follows. In section 2, we give a brief review ofthe standard assumptions in MR. In section 3, we introduce the concept of mechanisticheterogeneity and in section 4 we propose to model it with MR-PATH. In section 5, wedescribe an Monte Carlo EM algorithm to fit MR-PATH and discuss the relevant statisticalinference and model selection procedures. We then study the performance of our MonteCarlo EM algorithm with two simulation studies in section 6 and apply it to two real datadatasets (including the HDL-CHD example above) in section 7. We conclude the paper withsome ending remarks in section 8.
The goal of MR is to estimate the causal effect of a risk exposure variable (X) on a dis-ease outcome variable (Y) with summary-level data from GWAS. In particular, we may beinterested in the causal effect of HDL cholesterol (X) on the risk of coronary heart disease(Y). Regression analyses between X and Y are typically biased by unobserved confoundingvariables U . MR uses p genetic variants Z , . . . , Z p as instrumental variables to obtain anunbiased causal effect estimate of X on Y. A genetic variant Z is said to be a valid instrument for estimating the causal effect of X on Y if it satisfies the following assumptions: Assumption 2.1 (Relevance)
It is associated with the risk exposure, i.e. Corr ( Z, X ) (cid:54) = 0 . Assumption 2.2 (Independence)
It must be independent of any unmeasured confoundersthat are associated with both the exposure and outcome, i.e. Z | = U . YUZ × × Figure 2: A directed acyclic graph (DAG) illustrating the core assumptions for a validinstrument.
Assumption 2.3 (Exclusion restriction)
It affects the outcome only through the risk ex-posure, i.e. Z | = Y | X . Figure 2 provides a graphical representation of these assumptions. The independenceassumption is guaranteed by Mendel’s law of random assortment of genes. The relevanceassumption is justified if genetic variants are chosen to be genome-wide significant. Amongthe three assumptions, the ER assumption is the most problematic due to pleiotropy. As-sessment of the IV assumptions and sensitivity analysis in MR are discussed by Burgesset al. (2017). In the MR literature, it is common to assume the following linear structuralequation model for the exposure and outcome variables (Bowden et al., 2017): X = p (cid:88) i =1 θ X i Z i + η X U + E X , (1) Y = βX + p (cid:88) i =1 α i Z i + η Y U + E Y , (2)where η X and η Y are confounding effects and E X and E Y are random noise terms actingon X and Y respectively. The causal effect between X and Y is given by the parameter β . The true marginal association between X and Z i is given by θ X i . The direct effectof Z i on Y is given by α i . Assumption 2.1 implies θ X i (cid:54) = 0 and assumption 2.2 implies Z , . . . , Z p | = U, E X , E Y . Under this model, the exclusion restriction assumption is violatedif α i (cid:54) = 0. In particular, this can occur if Z i affects Y through a mechanism unrelated to X (see fig. 3a for illustration). Plugging eq. (1) into eq. (2), we obtain Y = p (cid:88) i =1 (cid:16) βθ X i + α i (cid:17) Z i + ( η X + η Y ) U + ( E X + E Y ) = p (cid:88) i =1 θ Y i Z i + E (cid:48) Y , (3)7here θ Y i = βθ X i + α i gives the true marginal association between Y and Z i and E (cid:48) Y is arandom noise term that is independent of Z i . In summary-data MR, we usually observethe estimated SNP-exposure effect ˆ θ X i , with standard error σ X i , and the estimated SNP-outcome effect ˆ θ Y i , with standard error σ Y i , for SNP i = 1 , . . . , p . These estimated effectsare typically computed from two different samples using linear or logistic regression. If weassume the genetic variants Z , . . . , Z p satisfy the exclusion restriction assumption, in otherwords α = · · · = α p = 0, then the causal effect parameter β can be estimated consistentlywith the inverse-variance-weighted estimator Burgess et al. (2013). A more robust approachis to perform error-in-variables regression of ˆ θ Y i on ˆ θ X i (Zhao et al., 2020). We adopt thisapproach in MR-PATH. If the exclusion restriction assumption is violated for some SNPs,then the causal effect parameter β cannot be identified without further assumptions on α i .For example, Zhao et al. (2020) assumes that α i ∼ N (0 , τ ) for most genetic variants so thatthe direct effects are balanced out. Instruments Z Pathway M Effect of M on X Effect of M on Y Wald estimandScenario 1 Z , , . . . , Z ,p M θ θ β βZ , , . . . , Z ,p M θ θ β + α β + α /θ Z , , . . . , Z ,p M θ θ β + α β + α /θ Scenario 2 Z , , . . . , Z ,p M θ θ β β Z , , . . . , Z ,p M θ θ β β Z , , . . . , Z ,p M θ θ β β Table 1: Ratio estimands using different instruments in the two scenarios in Figure 3. Di-rected acyclic graphs in Figure 3 are interpreted as linear structural equation models.The assumption that the direct effect α i is iid normally distributed does not take into8 , ... Z ,p Z , ... Z ,p Z , ... Z ,p M M M Xθ θ θ Yβα α U (a) Scenario 1: Multiple pathways of horizontalpleiotropy. Z , ... Z ,p M X θ Z , ... Z ,p M X θ Z , ... Z ,p M X θ X = X + X + X Yβ β β U (b) Scenario 2: Multiple mechanisms for theexposure X . Figure 3: Two scenarios of mechanistic heterogeneity.account the possibility that genetic variation often affects phenotypic traits through separatebiological pathways. In this section we show that such behaviour may lead to a clusteringphenomenon where SNPs belonging to the same pathway would indicate similar causal effectsin an MR analysis. This is what we call “mechanistic heterogeneity” in MR.
Consider Figure 3 which contains two scenarios of mechanistic heterogeneity that motivatethe latent mixture model. In both scenarios, genetic variants are grouped into three biologicalpathways, M , M , and M , that affect the exposure X and outcome Y differently. In thefirst scenario (Figure 3a), all three pathways affect X in the same way but have differentdirect effects on Y . In particular, the first pathway M does not have any direct effect on Y not mediated by X , so the instruments Z , , . . . , Z ,p associated with it are all valid IVs.In the second scenario in Figure 3b, the three pathways affect different components of theexposure X which also have different causal effects on the outcome Y .9f we interpret the diagrams in Figure 3 as linear structural equations (like Figure 2 for (1)and (2)), we can derive the so-called Wald estimand (ratio of θ Y and θ X ) for each instrument(Table 1). In both scenarios in Figure 3, genetic instruments on the same pathway havethe same Wald estimand θ Y /θ X , while instruments across different pathways generally havedifferent estimands.Therefore, we may reach completely different conclusions when using instruments ondifferent pathways in the MR analysis. In reality, mechanistic heterogeneity can be morecomplicated than the basic scenarios in Figure 3. For example, another pathway can affectsome components of X and also have direct effect on Y . It is also worthwhile to pointout that mechanistic heterogeneity can arise even when all the IVs are perfectly valid; anexample is Scenario 2 in Figure 3. The above introduction of mechanistic heterogeneity is entirely based on linear structuralequation models. Next we show that the same clustering phenomenon can also happenwhen there is nonlinearity. Consider the causal diagram in Figure 3a without the M → Y and M → Y edges, so all the instruments Z , , . . . , Z ,p are valid (satisfy Assumptions 2.1to 2.3). Suppose the variables satisfy the nonparametric structural equation model (NPSEM)with independent errors (Pearl, 2009) according to Figure 3a, so the counterfactuals of M k , k = 1 , , X , and Y can be defined using the NPSEM. For example, we use Y ( X = 1) todenote the counterfactual outcome under the intervention X = 1. To simplify our illustrationbelow, we assume Z , , . . . , Z ,p , M , M , M , and X are all binary variables.It is well known that if the instrument Z k,j is valid and the counterfactuals of the exposure X satisfies the monotonicity assumption X ( Z k,j = 1) ≥ X ( Z k,j = 0), the Wald estimand forinstrument Z k,j is equal to the so-called local average treatment effect, E [ Y ( X = 1) − Y ( X =0) | X ( Z k,j = 1) > X ( Z k,j = 0)] (Angrist et al., 1996). Notice that this interpretation of theIV analysis does not require linearity of the structural equation model. Suppose we furtherassume the effect of Z on M is monotone, M k ( Z k,j = 1) ≥ M k ( Z k,j = 0) , and the effect of M on X is monotone, X ( M k = 1) ≥ X ( M k = 0). Using the properties of counterfactuals10nd the fact in Figure 3a that Z k,j affects X entirely through M k , we get X ( Z k,j = z ) = X ( Z k,j = z, M k = M k ( Z k,j = z )) = X ( M k = M k ( Z k,j = z )) . Thus, using the assumption that X and M k are binary, E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( Z k,j = 1) > X ( Z k,j = 0) (cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X (cid:0) M k = M k ( Z k,j = 1) (cid:1) > X (cid:0) M k = M k ( Z k,j = 0) (cid:1)(cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( M k = 1) > X ( M k = 0) , M k ( Z k,j = 1) > M k ( Z k,j = 0) (cid:3) = E (cid:2) Y ( X = 1) − Y ( X = 0) | X ( M k = 1) > X ( M k = 0) (cid:3) . (4)The last equality above uses { M k ( Z k,j = 0) , M k ( Z k,j = 1) } | = { X ( M k = 0) , X ( M k = 1) , Y ( X = 0) , Y ( X = 1) } . This counterfactual independence follows from expressing the counterfactuals using theNPSEM and the assumption that the different structural equations have independent er-rors.The significance of (4) is that, if the counterfactuals of M and X satisfy the monotonicityassumption, the local average treatment effect corresponding to Z k,j only depends on themechanism index k . This shows that the clustering of the Wald esimand in Section 3.1 notonly occurs in linear structural equation models but also in certain nonlinear models. Theseexamples demonstrate the importance of identifying mechanistic heterogeneity to correctlyinterpret MR studies. Motivated by the observations in the previous section, we propose a latent mixture modelto discover mechanistic heterogeneity using summary GWAS data. In essence, this modelassumes that each genetic variant has a specific causal effect and the genetic variants onthe same biological pathway have similar variant-specific causal effects and form clusters.The mean of each cluster corresponds to the Wald estimand of that pathway (last column11n table 1). These assumptions, along with standard assumptions for summary-data MRliterature, are introduced below.
Assumption 4.1 (Error-in-variables regression)
The observed instrument-exposure andinstrument-outcome associations are distributed as ˆ θ X i ˆ θ Y i indep. ∼ N (cid:16) θ X i β i θ X i , σ X i σ Y i (cid:17) , i = 1 , . . . , p, (5) where σ X i , σ Y i are (fixed) measurement errors. In this assumption, the variant-specific causal effects are given by β i . The normalityassumption is justified because ˆ θ X i and ˆ θ Y i are typically linear (or logistic) regression coef-ficients which are computed in GWAS with a large sample size. Independence between ˆ θ X i and ˆ θ Y i for each SNP is justified if they are from GWAS conducted with non-overlappingsamples. Independence of the estimated effects across different SNPs is a reasonable as-sumption if we select SNPs that are uncorrelated by using linkage disequilibrium clumping.Although independence between SNPs does not imply the estimated effects are uncorrelated,the correlation between the estimated effects are typically negligible (Zhao et al., 2020). Assumption 4.2 (Mixture model for mechanistic heterogeneity) Z i ∼ Categorical ( π , . . . , π K ) , (6) β i | Z i = k ∼ N ( µ k , σ k ) , k = 1 , . . . , K. (7)We assume a Gaussian mixture model for the variant-specific (latent) causal effects β i toprobabilistically cluster genetic variants with similar causal effects. An indicator variable forcluster membership of SNP i is given by Z i . We can compute the posterior distribution of Z i and β i to summarize our knowledge of these variant-specific latent variables based on data(see section 5.4). The K clusters represent different causal mechanisms where the clustermeans µ k , for k = 1 , . . . , K , identify the average causal effects for each mechanism. Thecluster proportions π k give us the proportion of genetic variants associated with each mecha-nism. The cluster variances σ k quantify uncertainty within each mechanism. Therefore, theparameters of interest in MR-PATH are given by ϕ = { π k , µ k , σ k : k = 1 , . . . , K } . The num-ber of clusters K is unknown but can be chosen based on heuristics and domain knowledgeor estimated from data—see section 5.5 for a model selection criterion for choosing K .12 Statistical inference for MR-PATH
In order to fit MR-PATH to gain insight into mechanistic heterogeneity, we proceed bydiscussing three inference procedures. First, we describe our implementation of the EMalgorithm for obtaining a maximum likelihood estimate of ϕ and point out the challenges andour solutions for the expectation step. Second, we discuss approximate confidence intervalsfor ϕ obtained by computing and inverting the observed information matrix. Lastly, wego into how K can be selected from data using a modified Bayesian Information criterion(BIC). Let D = { (ˆ θ X i , σ X i , ˆ θ Y i , σ Y i ) : i = 1 , . . . , p } denote the observed data and let Θ = { ( θ X i , β i , Z i ) : i = 1 , . . . , p } denote the set of latent variables. The Expectation-Maximization (EM) algorithm is an iterative procedure commonly usedto perform maximum likelihood estimation in latent variable models. To define the EMalgorithm for MR-PATH, we derive two important model quantities: the complete-data loglikelihood and the conditional posterior of the latent variables Θ , given parameters ϕ . Thelatter is used to compute the Q-function in the expectation step. Let φ ( · ; µ, σ ) denote thedensity of N ( µ, σ ). We make the additional model assumption that ˆ θ X i ∼ N ( m x , λ x ). Itfollows that the complete-data log likelihood for MR-PATH is given by l ( ϕ ; Θ ) := p (cid:88) i =1 l i ( ϕ ; θ X i , β i , Z i ) , (8)where l i ( ϕ ; θ X i , β i , Z i ) ∝ log φ ( θ X i ; m x , λ x ) + K (cid:88) k =1 Z ik (cid:2) log π k + log φ ( β i ; µ k , σ k ) (cid:3) (9)and Z ik = 1, if Z i = k , and 0 otherwise. The conditional posterior of the latent variablesgiven ϕ can be decomposed as P ( Θ | D , ϕ ) := p (cid:89) i =1 P ( β i , Z i , θ X i | ˆ θ X i , ˆ θ Y i , ϕ )= p (cid:89) i =1 K (cid:89) k =1 (cid:2) P ( Z i = k | β i , ϕ ) (cid:3) Z ik P ( β i | θ X i , ˆ θ Y i , ϕ ) P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) , (10)13here P ( Z i = k | β i , ϕ ) and P ( β i | θ X i , ˆ θ Y i , ϕ ) are available in closed-form and are given by P ( Z i = k | β i , ϕ ) = π k φ ( β i ; µ k , σ k ) (cid:80) Kj =1 π j φ ( β i ; µ j , σ j ) , (11) P ( β i | θ X i , ˆ θ Y i , ϕ ) = K (cid:88) k =1 π k φ ( β i ; ˜ µ ik , ˜ σ ik ); (12)where ˜ σ ik = (cid:16) σ k + θ X i σ Y i (cid:17) − , ˜ µ ik = ˜ σ ik (cid:16) ˆ θ Y i θ X i σ Y i + µ k σ k (cid:17) . Unfortunately, instead of having an analytical solution, the probability density P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ )is known only up to a multiplicative constant given by P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) ∝ P ( θ X i | ˆ θ X i , ϕ ) P (ˆ θ Y i | θ X i , ϕ )= φ ( θ X i ; m X i , λ X i ) K (cid:88) k =1 π k φ (ˆ θ Y i ; θ X i µ k , θ X i σ k + σ Y i ) , (13)where λ X i = (cid:16) σ X i + 1 λ x (cid:17) − , m X i = λ X i (cid:16) ˆ θ X i σ X i + m x λ x (cid:17) . The EM algorithm starts with initial values for the parameter estimates ϕ (0) . Eachiteration t = 1 , , . . . of the EM algorithm consists of an expectation step (E-step) anda maximization (M-step). The E-step involves computing the Q-function, defined as theexpectation of the complete-data log likelihood with respect to the conditional posteriorof the latent variables given the previous iterations parameter estimates. Specifically, theE-step can be represented by Q ( ϕ, ϕ ( t − ) = E (cid:2) l ( ϕ ; Θ ) | D , ϕ ( t − (cid:3) , (14)where the expectation is taken with respect to P ( Θ | D , ϕ ). The M-step consists of com-puting an update of the parameter estimates for the current iteration as the value thatmaximizes the Q-function. In other words, the M-step involves ϕ ( t ) = arg max ϕ Q ( ϕ, ϕ ( t − ) . (15)14he EM algorithm guarantees the likelihood is non-decreasing across iterations—ascentproperty—and is theoretically guaranteed, under mild regularity conditions, to converge toa local optimum (Wu, 1983). In practice, a commonly used heuristic for determining the EMalgorithm has converged is when the increase in the Q-function from the previous iterationis less than a specified threshold. Unfortunately, since P ( Θ | D , ϕ ) is only known up to amultiplicative constant, the Q-function for MR-PATH cannot be computed analytically. Inthe next section, we discuss our implementation of a variant of the EM algorithm wherethe Q-function in the E-step is approximated using Monte-Carlo methods—the Monte-CarloEM (MC-EM) algorithm. The MC-EM algorithm allows us to perform maximum likelihood estimation in latent vari-able models where the Q-function cannot be computed analytically but can be approximatedusing Monte-Carlo methods. In section 5.2.1, we describe an importance sampling (IS)scheme for approximating the Q-function in MR-PATH. The main drawback of the MC-EMalgorithm is that Monte-Carlo error accrued from approximating the Q-function may causethe algorithm to not converge (Neath, 2013). More precise approximations of the Q-functionare needed as the M-step updates approach a local optimum. Furthermore, the MC-EM algo-rithm does not satisfy the ascent property of the vanilla EM algorithm. A simple solution tothese issues is to set the number of Monte-Carlo samples used to approximate the Q-function(MC sample size) to be very large or increase the MC sample size by a deterministic largeamount at each iteration. However, this may not be computationally feasible, thus creatinga trade-off between statistical consistency and computational efficiency. In section 5.2.2, wediscuss an automated data-driven procedure introduced by Caffo et al. (2005) that guaran-tees the ascent property is satisfied with high probability by assessing Monte-Carlo error atthe end of each iteration and increasing the MC sample size accordingly.15 .2.1 Monte-Carlo E-step using Importance Sampling
The decomposition of P ( Θ | D , ϕ ) in eq. (10) suggests that we can obtain importance samplesfrom P ( Θ | D , ϕ ) by first drawing samples of { θ X i } from an importance (proposal) distribu-tion. Given importance samples of { θ X i } , we can obtain importance samples of { β i } and { Z i } by directly sampling from (cid:81) pi =1 P ( β i | θ X i , ˆ θ X i , ϕ ) and (cid:81) pi =1 P ( Z i | β i , ϕ ), which are available inclosed form in eq. (11) and eq. (12). Then, we can estimate the Q-function with a sum of thecomplete-data log-likelihood evaluated at the samples weighted by the importance weights.Choosing an importance distribution that yields an efficient sampling procedure is a non-trivial task that depends on the form of the target distribution (see Tokdar and Kass (2010)for a review). In our case, the target distribution is (cid:81) pi =1 P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) so a sensiblechoice for an importance (proposal) distribution would be (cid:81) pi =1 P ( θ X i | ˆ θ X i , ϕ ), the posteriordistribution of { θ X i } given only { ˆ θ X i } instead of the full data { ˆ θ X i , ˆ θ Y i } which has a closedform given in eq. (13). This choice of importance distribution yields importance weights thatare bounded and have a finite variance.More precisely, let m t denote the number of desired importance samples at iteration t and˜ ϕ ( t − denote the MC-EM update from iteration t −
1. For each i = 1 , . . . , p and j = 1 , . . . , m t ,suppose θ jX i ∼ P ( θ X i | ˆ θ X i , ϕ ( t − ), β ji ∼ P ( β i | θ X i , ˆ θ Y i , ϕ ( t − ), and Z ji ∼ P ( Z i | β i , ϕ ( t − ). Fromeq. (13), we have that the unnormalized importance weights are given by w ji = P (ˆ θ Y i | θ jX i , ˜ ϕ ( t − ) = K (cid:88) k =1 ˜ π ( t − k φ (cid:16) ˆ θ Y i ; θ jX i ˜ µ ( t − k , ( θ jX i ) ˜ σ t − k + σ Y i (cid:17) (16)for j = 1 , . . . , M . It can be shown that 0 ≤ w ji ≤ (cid:0) πσ Y i (cid:1) − / (proof is provided in ap-pendix A.1). Let ¯ w ji = w ji / (cid:80) m t j =1 w ji be the normalized importance weights. The IS estimateof the Q-function at iteration t is given by˜ Q ( ϕ, ˜ ϕ ( t − ; m t ) = p (cid:88) i =1 m t (cid:88) j =1 ¯ w ji l ji ( ϕ ) , (17)where l ji ( ϕ ) = l i ( ϕ ; θ jX i , β ji , Z ji ) , and l i is defined in eq. (9). Consequently, the MC-EM update of the model parameters at16teration t approximated with Monte-Carlo sample size m t is given by˜ ϕ ( t,m t ) = arg max ϕ ˜ Q ( ϕ, ˜ ϕ ( t − ; m t ) . At the end of each MC-EM iteration, the ascent-based MC-EM algorithm (Caffo et al.,2005) performs a hypothesis test to determine whether the Q-function has increased fromthe previous iteration. If there is not sufficient evidence that suggests the Q-function hasincreased, we reject the current iteration’s M-step and repeat the iteration with a largerMC sample size. In this section, we will describe this procedure concretely for our inferenceproblem.Define the change in the Q function at iteration t of MC-EM as∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) := Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) − Q ( ˜ ϕ ( t − , ˜ ϕ ( t − ) . (18)After obtaining ˜ ϕ ( t,m t ) in the M-step, we are interested in performing the following hy-pothesis test at a specified significance level α : H : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) = 0 , (19) H A : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) > . If we reject H , then we accept ˜ ϕ ( t,m t ) and move on to the next iteration. If we fail toreject H , then we reject ˜ ϕ ( t,m t ) and repeat the current iteration with a larger Monte-Carlosample size. The change in the Q function in eq. (18) can be approximated using∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) ≈ p (cid:88) i =1 m t (cid:88) j =1 w ji Λ ( t ) ij (20)where Λ ( t ) ij = l ji ( ˜ ϕ ( t,m t ) ) − l ji ( ˜ ϕ ( t − ) . (21)It was shown in Caffo et al. (2005) that under H , √ m t ∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) d → N (0 , η ) (22)as m t → ∞ , where σ depends on the sampling procedure used. We can obtain an estimateof η , ˆ η , by computing the variance of the importance sampling estimate in eq. (20) (details17rovided in appendix A.2). Therefore, we can reject H with approximate significance level α if ∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) − z α ˆ ηm t > z α is the (1 − α ) th quantile of the standard normal distribution. If we fail to reject H ,we repeat iteration t with a larger m t until we are able to reject H . Similarly, a convenientstopping criterion is obtained by testing H A : ∆ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) < (cid:15) at a specified significance level γ and threshold (cid:15) . We can reject H with approximatesignificance level γ and determine MC-EM has converged if∆ ˜ Q ( ˜ ϕ ( t,m t ) , ˜ ϕ ( t − ) + z γ ˆ ηm t < (cid:15). To quantify uncertainty of the parameter estimates obtained using the MC-EM algorithm,we adapt the method in (Louis, 1982) for computing the observed information matrix in theEM framework. Standard errors, and therefore approximate confidence intervals, can thenbe obtained by inverting the observed information matrix. By a result presented in (Louis,1982), the observed information matrix at a point ϕ ∗ can be computed by I ( ϕ ∗ ) = (cid:110) E (cid:104) − ∂ l ( ϕ ) ∂ϕ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) − E (cid:104)(cid:16) ∂l ( ϕ ) ∂ϕ ∂l ( ϕ ) ∂ϕ T (cid:17)(cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) + E (cid:104) ∂l ( ϕ ) ∂ϕ (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) E (cid:104) ∂l ( ϕ ) ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105)(cid:111)(cid:12)(cid:12)(cid:12) ϕ = ϕ ∗ (23)where the dependence of D , Θ in the log-likelihood have been suppressed for notationalconvenience. Let ˆ ϕ denote the final MC-EM parameter estimate and suppose { θ jX i , β ji , Z ji : j = 1 , . . . , M } are now importance samples from P ( θ X i , β i , Z i | D , ˆ ϕ ) with (normalized)weights { w ji } . The first and third expectations in eq. (23) can be approximated by E (cid:104) − ∂ l ( ϕ ) ∂ϕ∂ϕ T (cid:12)(cid:12)(cid:12) D , ϕ (cid:105) ≈ p (cid:88) i =1 M (cid:88) j =1 w ji ∂ l ji ( ϕ ) ∂ϕ∂ϕ T , and E (cid:104) ∂l ( ϕ ) ∂ϕ (cid:12)(cid:12)(cid:12) D , ϕ (cid:105) ≈ p (cid:88) i =1 M (cid:88) j =1 w ji ∂l ji ( ϕ ) ∂ϕ . E (cid:104)(cid:16) ∂l ( ϕ ) ∂ϕ ∂l ( ϕ ) ∂ϕ T (cid:17)(cid:12)(cid:12)(cid:12) D , ϕ ∗ (cid:105) = p (cid:88) i =1 E (cid:104)(cid:16) ∂∂ϕ l i ( ϕ ) (cid:17)(cid:16) ∂∂ϕ l i ( ϕ ) (cid:17) T (cid:105) + 2 (cid:88) i To gain a better picture of our knowledge of each individual SNP, we can sample from P ( β i , Z i , θ X i | ˆ θ X i , ˆ θ Y i , ˆ ϕ )—the posterior distribution of variant-specific latent variables givenan MC-EM estimate of ϕ —by using the sampling/importance resampling (SIR) algorithm(Li, 2004). For example, P ( Z i = k | ˆ θ X i , ˆ θ Y i , ˆ ϕ )—the cluster membership probability of the i th SNP—quantifies how certain we are that the i th SNP belongs to a certain cluster. Morespecifically, suppose we have M samples { β ji , Z ji , θ jX i : j = 1 , . . . , M } from our importancedistribution for the i th SNP. Then we can obtain samples of β i from P ( β i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) bysampling—with replacement—from { β ji } with probabilities proportional to the importanceweights given in eq. (16). Samples from P ( Z i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) and P ( θ X i | ˆ θ X i , ˆ θ Y i , ˆ ϕ ) can be ob-tained in a similar fashion. In particular, these samples can be used to construct credibleintervals of β i and compute cluster membership probabilities—see section 7 for examples. To select the number of clusters K , we use a modified Bayesian Information criterion (BIC)for latent variable models estimated using the EM algorithm adopted from Ibrahim et al.(2008). For MR-PATH, the standard BIC is typically defined asBIC = − P ( D | ˆ ϕ ) + (3 K + 2 p ) log( p )19here ϕ is the MLE of ϕ and 3 K + 2 p is the dimension of our model. We replace themarginal density P ( D | ˆ ϕ ) with the readily available IS estimate of the Q-function at the finalMC-EM iteration from eq. (17). To verify the efficacy of our statistical inference procedures, we perform two simulation stud-ies. The goal of the first simulation study is to demonstrate that (1) the MC-EM algorithmgives parameter estimates that are close to the ground truth and (2) the approximate con-fidence intervals we derive have desirable coverage probabilities. The goal of our secondsimulation study is to evaluate the accuracy of the modified BIC for selecting the number ofclusters K . In our first simulation study, we generate simulated data from MR-PATH under variousparameter settings that mimic GWAS summary data used in practice. In each setting, wegenerate measurement errors as σ X i , σ Y i ∼ Inv. Gamma(9 , . λ x = 10 / √ p . We set λ x to be inversely related to the number of genetic variants p because a larger p correspondsto a less stringent criterion for filtering the genetic variants. We vary the number of geneticvariants to be p = 50 , , , K true = 1 , , 3. In MR,we do not expect the number of clusters to be greater than 3 and the number of filteredgenetic variants to be large. For each parameter setting, we ran the MC-EM algorithm with K = K true and computed the approximated confidence intervals with 500 simulated data-sets and calculated the root mean squared error (RMSE) and 95% coverage probability. Asthe EM algorithm is sensitive to initial value specification, for each repetition, we run theMC-EM algorithm 10 times with different random initial values and report the results fromthe run with the largest complete-data log likelihood. Simulation results are presented inFigure 4. Note that as K increases, we choose mixture means to be closer in value, making20he estimation task more challenging.In this simulation study, the RMSE for each scenario does not exceed .09 and the coverageprobability for most scenarios do not deviate much from the desired 95%. Furthermore,we observe a trade-off between RMSE and coverage probability when filtering out weakerinstruments (variants with ˆ θ X i ≈ p ) results inlower RMSE but coverage probabilities that deviate more from 95%. m s 250 500 750 1000 250 500 750 10000.0050.010 m s 250 500 750 1000 250 500 750 10000.650.750.850.95 (a) K = 1 . µ = 0 . σ = 0 . p m s 250 500 750 1000 250 500 750 1000 250 500 750 10000.020.040.06 p m s 250 500 750 1000 250 500 750 1000 250 500 750 10000.650.750.850.95 (b) K = 2 . π = 0 . µ = − . µ = 0 . 5, and σ = σ = 0 . p m s 250 500 750 1000 250 500 750 1000 250 500 750 10000.020.040.06 p m s 250 500 750 1000 250 500 750 1000 250 500 750 10000.650.750.850.95 (c) K = 3 . π = π = 0 . µ = − . µ = 0, µ = 0 . 5, and σ = σ = σ = 0 . Figure 4: Results from first simulation study . For each parameter setting, we simulated N rep = 500 data-sets. Each row corresponds to a different number of cluster K with theground truth in the sub-captions. Each plot corresponds to one of the mixture parameters π = ( π , . . . , π K ), µ = ( µ , . . . , µ K ), σ = ( σ , . . . , σ K ) in MR-PATH. Left : Plots of root meansquared error (RMSE) against the number of variants p for each mixture parameter. Right :Plots of 95% coverage probabilities against p for each mixture parameter. A horizontaldashed line is drawn at .95 (desired coverage prob.).21 .2 Model Selection with modified BIC In our second simulation study, we simulate data from MR-PATH with p = 50 or 250 and √ pλ x = 1 or 5. In each setting, we set the true number of clusters K true = 1,2, or 3. In eachreplication, we ran the MC-EM algorithm with K = 1 , , K that yields thelowest modified BIC value. Results for this simulation study are shown in table 2.From table 2, we observe that lower values for √ pλ x result in the largest decreases inthe accuracy of the modified BIC. For example, when p = 50 and K true = 2, the modifiedBIC chose the correct K 92% of the time when √ pλ x = 5 but only 71% of the time when √ pλ x = 1. We also notice that when √ pλ x = 1, the accuracy of the modified BIC decreaseswith p . A practical consequence of this observation is that the modified BIC is more likelyto choose the correct K with few strong instruments than with more weak instruments. We now return to the motivating example introduced in Section 1.1. The dataset beingused is created from several large-scale GWAS datasets for plasma lipids (HDL-C, LDL-C,triglycerides) (Teslovich et al., 2010; Willer et al., 2013), coronary heart disease (Nikpayet al., 2015), lipoprotein subfractions (Kettunen et al., 2016), and other cardiovascular dis-eases. We use the three-sample summary-data MR design described in Zhao et al. (2020)to preprocess and homogeneize the datasets. We first select 151 independent SNPs (dis-tance ≥ 10 mega base pairs, R ≤ . 001 in a reference panel) that are associated with atleast plasma lipid traits (defined as the minimum p -value with HDL-C, LDL-C, and triglyc-erides less than 10 − ). We then obtain the GWAS associations of these SNPS with all theother cardiometabolic traits. For the purpose of this example, we will focus on 31 SNPsthat showed genome-wide significant associations ( p -value ≤ × − ) with HDL-C in theselection GWAS.We apply the Monte Carlo EM algorithm developed in Section 5 to the 31 genetic in-struments and their associations with HDL-C in a separate dataset (Kettunen et al., 2016)22roportion p √ pλ x K true K = 1 K = 2 K = 350 1 1 .83 .17 02 0 0.71 0.293 0 .01 0.995 1 .98 0.02 02 0 0.92 0.083 0 .002 .998250 1 1 .76 0.24 02 0 0.67 0.333 0 .005 .9955 1 1 0 02 0 0.99 0.013 0 0 1Table 2: Results from second simulation study . For each setting, we simulated N rep = 500data-sets. For K true = 1, we set µ = 0 . σ = 0 . 1. For K true = 2, we set π = 0 . µ = ( − . , . σ = σ = 0 . 1. For K true = 3, we set π = π = 1 / µ = ( − . , , . σ = σ = σ = 0 . 05. The last column reports the proportion of repetitions for eachsetting where the modified BIC chose the corresponding K .23nd CHD. The modified BIC selects K = 2 clusters of SNPs. The larger cluster (ˆ π = 0 . µ = − . σ = 0 . 23) and the smaller cluster (ˆ π = 0 . µ = 0 . 14, ˆ σ = 0 . β j along side the cluster membership probabilities.To validate the mechanistic heterogeneity identified by MR-PATH, we generate a heatmapof the associations (z-scores) of the SNPs with lipoprotein subfraction traits (Kettunen et al.,2016). Most of the traits are named after their size (XS = extra small, S = small, M =medium, L = large, XL = extra large, XXL = double extra large), their lipoprotein class(HDL, IDL = intermediate-density lipoprotein, LDL, VLDL = very-low-density lipoprotein),and the measurement (C = total cholesterol, CE = cholesterol esters, FC = free cholesterol,L = total lipid, P = particle concentration, PL = phosolipids, TG = triglycerides). Othertraits including the mean diameter of HDL/LDL/VLDL particles (HDL-D/LDL-D/VLDL-D) and the concentration of ApoA1/ApoB (major protein component of HDL/LDL). Toaid visualization, the SNPs are ordered by their cluster membership probabilities and thelipoprotein subfractions are ordered by their density and size.The heatmap in Figure 6 shows that the SNP clusters found by the mixture model exhibitdifferent patterns of association with the lipoprotein subfractions. Several SNPs in the firstcluster have strong inverse association with LDL-C and other LDL/VLDL subfraction traits.Therefore, the negative effect of HDL-C on CHD sugested by the instruments in the firstcluster cluster may indeed be due to their pleiotropic effect on LDL-C and ApoB-containinglipoproteins (see scenario 1 in Figure 3a). In contrast, several SNPs in the second cluster(rs1532085, rs588136, rs174546, rs7679) are inversely associated with the concentration ofsmall HDL particles, so they may be related to another mechanism that regulates the size ofHDL particles. Although the instruments in this cluster suggest a positive effect of HDL-Con CHD, this may be explained by heterogeneous effects of cholesterol contained in differentHDL subfractions (see scenario 2 in Figure 3a). An earlier univariable MR study indeed foundthat the concentration of small and medium HDL particles may have a negative effect onCHD, while the large and extra large HDL particles seem to have no effect (Zhao et al., 2019).24 r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s % P o s t e r i o r C r ed i b l e I n t e r v a l C l u s t e r m e m be r s h i p p r ob . Figure 5: SNP-specific posterior quantities in HDL-CAD data . The SNPs are ordered bytheir posterior probability of belonging to cluster 2. Top : 95% posterior credible intervals .Colored dashed lines are the estimated cluster means and x-marks are the posterior mediansfor each SNP. Bottom : Posterior cluster membership probabilities bar plot . Vertical axis isposterior probability of belonging to cluster 2.To summarize, the heatmap provides some evidence that the clustering structure identifiedby our mixture model indeed corresponds to some distinct underlying mechanisms. We now turn to a second example to illustrate the utility of MR-PATH. In this example, weare interested in possible mechanistic heterogeneity of the effect of adiposity (as measuredby the body mass index, BMI) on type II diabetes (T2D). Following the same three-samplesummary-data MR design as describe in Section 7.1, we created a dataset of 60 SNPs fromtwo GWAS summary datasets for BMI (Akiyama et al., 2017; Locke et al., 2015) and onefor T2D (Mahajan et al., 2018). We then apply the Monte Carlo EM algorithm developed in25ection 5. The modified BIC selects K = 2 clusters of SNPs. The larger cluster (ˆ π = 0 . µ = 0 . 77, ˆ σ = 0 . 42) and the smaller cluster (ˆ π = 0 . µ = − . 4, ˆ σ = 1 . β i with the SNP association with peak blood insulin, which is available froman independent GWAS (Wood et al., 2017) (Figure 9). In fact, six out of the seven SNPsclassified into cluster 2 are strongly associated with peak blood insulin. This shows that thelarge negative effect of this cluster is most likely due to horizontal pleiotropy (Figure 3a)instead of a genuine negative causal effect of adiposity. The results we obatined here arebroadly consistent with other recent genetic studies that have identified SNPs with oppositeeffects on adiposity and type II diabetes and linked the “favorable adiposity” genes to insulinfunction and fat distribution (Ji et al., 2019). In this paper, we have formalized the notion of mechanistic heterogeneity in the context ofMR and showed that SNPs on the same biological pathway identify similar causal effects.Different pathways generally correspond to different causal effects, even if they are all validinstruments. Motivated by this observation, we introduced MR-PATH, an interpretablemixture model for from summary-level GWAS data that can provide valuable insights onmechanistic heterogenity.MR-PATH has several advantages over similar existing methods for MR. First, it relaxesthe effect homogeneity assumption implicit in most existing MR methods so it is able toidentify multiple causal mechanisms. Second, MR-PATH does not require substantial domainknowledge since we use a data-driven approach to select the number of clusters. Lastly, MR-PATH is based on a full likelihood and is robust to weak instrument bias since we use anerror-in-variables approach to estimate the variant-specific causal effects. We showed using26umerical simulations that our MC-EM algorithm gives parameter estimates that are closeto the ground truth and the corresponding approximate confidence intervals have coverageprobabilities close to their true values. We also showed that the modified BIC criterion weused for selecting the number of clusters chose correctly most of the time.We demonstrated the utility of MR-PATH in modeling mechanistic heterogeneity inMR analysis by using it to investigate the causal mechanisms between HDL-C and CHDand between adioposity and type II diabetes. These examples reinforce the importance ofconsidering multiple causal mechanisms in MR analysis. Our findings are consistent withexisting genetic studies that use external data. For the HDL-C and CHD data set, MR-PATH identifies a cluster with a positive average causal effect which may be associated witha mechanism that regulates size of HDL particles. In our study of the role of adioposity ontype II diabetes, MR-PATH finds a cluster with a negative average causal effect that is likelyattributed to horizontal pleiotropy.Since MR-PATH is a generative model for multiple causal mechanisms in MR, there aremany potential extensions that can be incorporated in future work. One such extension isto replace our univariate mixture model with a multivariate model to consider multiple riskexposures simultaneously. The multivariate version of MR-PATH can be used to accountfor the pleiotropic effects of other lipoproteins in our HDL and CHD example. Anotherpossible extension is to allow for correlated SNPs by relaxing the independence assumptionin assumption 4.1. Acknowledgement We would like to thank Xuelu Wang for helpful comments on the type II diabetes example.27 eferences Akiyama, M., Okada, Y., Kanai, M., Takahashi, A., Momozawa, Y., Ikeda, M., Iwata,N., Ikegawa, S., Hirata, M., Matsuda, K., et al. (2017). Genome-wide association studyidentifies 112 new loci for body mass index in the japanese population. Nature Genetics ,49(10):1458.Anderson, T. W. and Rubin, H. (1949). Estimation of the parameters of a single equation ina complete system of stochastic equations. Annals of Mathematical Statistics , 20(1):46–63.Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects usinginstrumental variables. Journal of the American Statistical Association , 91(434):444–455.Armitage, J., Holmes, M. V., and Preiss, D. (2019). Cholesteryl ester transfer proteininhibition for preventing cardiovascular events: Jacc review topic of the week. Journal ofthe American College of Cardiology , 73(4):477–487.Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalidinstruments: effect estimation and bias detection through egger regression. InternationalJournal of Epidemiology , 44(2):512–525.Bowden, J., Del Greco M, F., Minelli, C., Davey Smith, G., Sheehan, N., and Thompson,J. (2017). A framework for the investigation of pleiotropy in two-sample summary datamendelian randomization. Statistics in Medicine , 36(11):1783–1802.Boyle, E. A., Li, Y. I., and Pritchard, J. K. (2017). An expanded view of complex traits:from polygenic to omnigenic. Cell , 169(7):1177–1186.Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone,C., McMahon, A., Morales, J., Mountjoy, E., Sollis, E., et al. (2019). The nhgri-ebigwas catalog of published genome-wide association studies, targeted arrays and summarystatistics 2019. Nucleic acids research , 47(D1):D1005–D1012.Burgess, S., Bowden, J., Fall, T., Ingelsson, E., and Thompson, S. G. (2017). Sensitiv-28ty Analyses for Robust Causal Inference from Mendelian Randomization Analyses withMultiple Genetic Variants. Epidemiology (Cambridge, Mass.) , 28(1):30–42.Burgess, S., Butterworth, A., and Thompson, S. G. (2013). Mendelian randomizationanalysis with multiple genetic variants using summarized data. Genetic Epidemiology ,37(7):658–665.Burgess, S., Foley, C. N., Allara, E., Staley, J. R., and Howson, J. M. M. (2020). A robust andefficient method for Mendelian randomization with hundreds of genetic variants. NatureCommunications , 11(1):376.Caffo, B. S., Jank, W., and Jones, G. L. (2005). Ascent-based Monte Carlo expectationmaximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,67(2):235–251.Davey Smith, G. and Phillips, A. N. (2020). Correlation without a cause: an epidemiologicalodyssey. International Journal of Epidemiology , 49(1):4–14.Didelez, V. and Sheehan, N. (2007). Mendelian randomization as an instrumental variableapproach to causal inference. Statistical Methods in Medical Research , 16(4):309–330.Foley, C. N., Kirk, P. D., and Burgess, S. (2019). MR-Clust: Clustering of genetic variants inMendelian randomization with similar causal estimates. bioRxiv , page 2019.12.18.881326.Ibrahim, J. G., Zhu, H., and Tang, N. (2008). Model Selection Criteria for Missing-DataProblems Using the EM Algorithm. Journal of the American Statistical Association ,103(484):1648–1658.Ji, Y., Yiorkas, A. M., Frau, F., Mook-Kanamori, D., Staiger, H., Thomas, E. L., Atabaki-Pasdar, N., Campbell, A., Tyrrell, J., Jones, S. E., et al. (2019). Genome-wide andabdominal mri data provide evidence that a genetically determined favorable adiposityphenotype is characterized by lower ectopic liver fat and lower risk of type 2 diabetes,heart disease, and hypertension. Diabetes , 68(1):207–219.29ang, H., Zhang, A., Cai, T. T., and Small, D. S. (2016). Instrumental variables estimationwith some invalid instruments and its application to mendelian randomization. Journalof the American Statistical Association , 111(513):132–144.Kettunen, J., Demirkan, A., W¨urtz, P., Draisma, H. H., Haller, T., Rawal, R., Vaarhorst,A., Kangas, A. J., Lyytik¨ainen, L.-P., Pirinen, M., et al. (2016). Genome-wide study forcirculating metabolites identifies 62 loci and reveals novel systemic effects of lpa. Naturecommunications , 7(1):1–9.Li, K.-H. (2004). The Sampling/Importance Resampling Algorithm. In Applied BayesianModeling and Causal Inference from IncompleteData Perspectives , Wiley Series in Proba-bility and Statistics, pages 265–276.Liu, X., Li, Y. I., and Pritchard, J. K. (2019). Trans effects on gene expression can driveomnigenic inheritance. Cell , 177(4):1022–1034.Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C.,Vedantam, S., Buchkovich, M. L., Yang, J., et al. (2015). Genetic studies of body massindex yield new insights for obesity biology. Nature , 518(7538):197–206.Louis, T. A. (1982). Finding the Observed Information Matrix when Using the EM Algo-rithm. Journal of the Royal Statistical Society. Series B (Methodological) , 44(2):226–233.Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W.,Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., et al. (2018). Fine-mappingtype 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature genetics , 50(11):1505–1513.Neath, R. C. (2013). On Convergence Properties of the Monte Carlo EM Algorithm. In Advances in Modern Statistical Theory and Applications: A Festschrift in honor of Mor-ris L. Eaton , volume Volume 10 of Collections , pages 43–62. Institute of MathematicalStatistics, Beachwood, Ohio, USA.Nikpay, M., Goel, A., Won, H.-H., Hall, L. M., Willenborg, C., Kanoni, S., Saleheen, D.,Kyriakou, T., Nelson, C. P., Hopewell, J. C., et al. (2015). A comprehensive 1000 genomes–30ased genome-wide association meta-analysis of coronary artery disease. Nature Genetics ,47(10):1121.Pearl, J. (2009). Causality . Cambridge University Press.Qi, G. and Chatterjee, N. (2019). Mendelian randomization analysis using mixture modelsfor robust and efficient estimation of causal effects. Nature Communications , 10(1):1941.Rader, D. J. and Hovingh, G. K. (2014). Hdl and cardiovascular disease. The Lancet ,384(9943):618–625.Shapland, C. Y., Zhao, Q., and Bowden, J. (2020). Profile-likelihood Bayesian model averag-ing for two-sample summary data Mendelian randomization in the presence of horizontalpleiotropy. bioRxiv , page 2020.02.11.943712.Smith, G. D. and Ebrahim, S. (2004). Mendelian randomization: prospects, potentials, andlimitations. International journal of epidemiology , 33(1):30–42.Teslovich, T. M., Musunuru, K., Smith, A. V., Edmondson, A. C., Stylianou, I. M., Koseki,M., Pirruccello, J. P., Ripatti, S., Chasman, D. I., Willer, C. J., et al. (2010). Biological,clinical and population relevance of 95 loci for blood lipids. Nature , 466(7307):707–713.Tokdar, S. T. and Kass, R. E. (2010). Importance sampling: a review. WIREs ComputationalStatistics , 2(1):54–60.Verbanck, M., Chen, C.-y., Neale, B., and Do, R. (2018). Detection of widespread horizontalpleiotropy in causal relationships inferred from mendelian randomization between complextraits and diseases. Nature Genetics , 50(5):693–698.Voight, B. F., Peloso, G. M., Orho-Melander, M., Frikke-Schmidt, R., Barbalic, M., Jensen,M. K., Hindy, G., H´olm, H., Ding, E. L., Johnson, T., Schunkert, H., Samani, N. J.,Clarke, R., Hopewell, J. C., Thompson, J. F., Li, M., Thorleifsson, G., Newton-Cheh, C.,Musunuru, K., Pirruccello, J. P., Saleheen, D., Chen, L., Stewart, A. F. R., Schillert, A.,Thorsteinsdottir, U., Thorgeirsson, G., Anand, S., Engert, J. C., Morgan, T., Spertus,J., Stoll, M., Berger, K., Martinelli, N., Girelli, D., McKeown, P. P., Patterson, C. C.,31pstein, S. E., Devaney, J., Burnett, M.-S., Mooser, V., Ripatti, S., Surakka, I., Nieminen,M. S., Sinisalo, J., Lokki, M.-L., Perola, M., Havulinna, A., de Faire, U., Gigante, B.,Ingelsson, E., Zeller, T., Wild, P., de Bakker, P. I. W., Klungel, O. H., Maitland-van derZee, A.-H., Peters, B. J. M., de Boer, A., Grobbee, D. E., Kamphuisen, P. W., Deneer,V. H. M., Elbers, C. C., Onland-Moret, N. C., Hofker, M. H., Wijmenga, C., Verschuren,W. M. M., Boer, J. M. A., van der Schouw, Y. T., Rasheed, A., Frossard, P., Demissie,S., Willer, C., Do, R., Ordovas, J. M., Abecasis, G. R., Boehnke, M., Mohlke, K. L.,Daly, M. J., Guiducci, C., Burtt, N. P., Surti, A., Gonzalez, E., Purcell, S., Gabriel,S., Marrugat, J., Peden, J., Erdmann, J., Diemert, P., Willenborg, C., K¨onig, I. R.,Fischer, M., Hengstenberg, C., Ziegler, A., Buysschaert, I., Lambrechts, D., Van de Werf,F., Fox, K. A., El Mokhtari, N. E., Rubin, D., Schrezenmeir, J., Schreiber, S., Sch¨afer,A., Danesh, J., Blankenberg, S., Roberts, R., McPherson, R., Watkins, H., Hall, A. S.,Overvad, K., Rimm, E., Boerwinkle, E., Tybjaerg-Hansen, A., Cupples, L. A., Reilly,M. P., Melander, O., Mannucci, P. M., Ardissino, D., Siscovick, D., Elosua, R., Stefansson,K., O’Donnell, C. J., Salomaa, V., Rader, D. J., Peltonen, L., Schwartz, S. M., Altshuler,D., and Kathiresan, S. (2012). Plasma HDL cholesterol and risk of myocardial infarction:a mendelian randomisation study. Lancet (London, England) , 380(9841):572–80.Wang, J., Zhao, Q., Bowden, J., Hemani, G., Smith, G. D., Small, D. S., and Zhang,N. R. (2020). Causal Inference for Heritable Phenotypic Risk Factors Using HeterogeneousGenetic Instruments. bioRxiv , page 2020.05.06.077982.Willer, C. J., Schmidt, E. M., Sengupta, S., Peloso, G. M., Gustafsson, S., Kanoni, S.,Ganna, A., Chen, J., Buchkovich, M. L., Mora, S., et al. (2013). Discovery and refinementof loci associated with lipid levels. Nature Genetics , 45(11):1274.Wood, A. R., Jonsson, A., Jackson, A. U., Wang, N., Van Leewen, N., Palmer, N. D.,Kobes, S., Deelen, J., Boquete-Vilarino, L., Paananen, J., et al. (2017). A genome-wide association study of ivgtt-based measures of first-phase insulin secretion refines theunderlying physiology of type 2 diabetes variants. Diabetes , 66(8):2296–2309.32u, C. F. J. (1983). On the Convergence Properties of the EM Algorithm. The Annals ofStatistics , 11(1):95–103.Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. S. (in press 2020). Statisticalinference in two-sample summary-data Mendelian randomization using robust adjustedprofile score. The Annals of Statistics .Zhao, Q., Wang, J., Miao, Z., Zhang, N., Hennessy, S., Small, D. S., and Rader, D. J. (2019).The role of lipoprotein subfractions in coronary artery disease: A Mendelian randomizationstudy. bioRxiv , page 691089. 33 Technical Appendix A.1 Bounded Importance Sampling weights Following eq. (13), the importance weights are given by w ji = P ( θ X i | ˆ θ X i , ˆ θ Y i , ϕ ) P ( θ X i | ˆ θ X i , ϕ ) ∝ P ( θ X i | ˆ θ X i , ϕ ) P (ˆ θ Y i | θ X i , ϕ )= K (cid:88) k =1 π k (cid:113) πθ X i σ k + σ Y i exp (cid:110) − θ X i σ k + σ Y i ) ( Y i − θ X i µ k ) (cid:111) ≤ (cid:113) πσ Y i , for i = 1 , . . . , p ; j = 1 , . . . , M . Therefore, the importance weights are bounded and havea finite variance. A.2 Importance sampling estimate of η An importance sampling estimate of η at iteration t in eq. (22) is given byˆ η = m t p (cid:88) i =1 (cid:110) m t (cid:88) j =1 w ji Λ ( t ) ij (cid:111) (cid:104) (cid:80) m t j =1 ( w ji Λ ( t ) ij ) ( (cid:80) m t j =1 w ji Λ ( t ) ij ) − (cid:80) m t j =1 ( w ji ) Λ ( t ) ij (cid:80) m t j =1 w ji Λ ( t ) ij + m t (cid:88) j =1 ( w ji ) (cid:105) , where Λ ( t ) ij is given in eq. (21). 34 LDL−DXXL−VLDL−TGXXL−VLDL−PLXXL−VLDL−PXXL−VLDL−LXL−VLDL−TGXL−VLDL−PLXL−VLDL−PXL−VLDL−LL−VLDL−TGL−VLDL−PLL−VLDL−PL−VLDL−LL−VLDL−FCL−VLDL−CEL−VLDL−CM−VLDL−TGM−VLDL−PLM−VLDL−PM−VLDL−LM−VLDL−FCM−VLDL−CEM−VLDL−CS−VLDL−TGS−VLDL−PLS−VLDL−PS−VLDL−LS−VLDL−FCXS−VLDL−TGXS−VLDL−PLXS−VLDL−PXS−VLDL−LLDL−DLDL−CAPOBL−LDL−PLL−LDL−PL−LDL−LL−LDL−FCL−LDL−CEL−LDL−CM−LDL−PLM−LDL−PM−LDL−LM−LDL−CEM−LDL−CS−LDL−PS−LDL−LS−LDL−CIDL−TGIDL−PLIDL−PIDL−LIDL−FCIDL−CHDL−DHDL−CAPOA1XL−HDL−TGXL−HDL−PLXL−HDL−PXL−HDL−LXL−HDL−FCXL−HDL−CEXL−HDL−CL−HDL−PLL−HDL−PL−HDL−LL−HDL−FCL−HDL−CEL−HDL−CM−HDL−PLM−HDL−PM−HDL−LM−HDL−FCM−HDL−CEM−HDL−CS−HDL−TGS−HDL−PS−HDL−L r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s −7 −5 −3 0 3 5 7 z−score Figure 6: A heatmap showing the z-scores of the SNPs’ associations with lipoprotein sub-fraction measurements. The SNPs are ordered by their posterior probability of belonging tocluster 2 as in Figure 5. The lipoprotein subfractions are ordered by their density and size.35 SNP association with BMI S N P a ss o c i a t i on w i t h T D Figure 7: Scatterplot of BMI-T2D data with MC-EM parameter estimates . The coloredlines/shaded regions represent the estimated cluster means ± µ k ± ˆ σ k ). The points are colored according to the cluster with the highest posteriorcluster membership probability P ( Z i = k | ˆ θ X i , ˆ θ Y i ).36 r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s r s % P o s t e r i o r C r ed i b l e I n t e r v a l C l u s t e r m e m be r s h i p p r ob . Figure 8: SNP-specific posterior quantities in BMI-T2D data . SNPs are ordered by theirposterior probability of belonging to cluster 2. Top : 95% posterior credible intervals . Col-ored dashed lines are the estimated cluster means (same as Figure 7) and x-marks are theposterior medians for each SNP. Bottom : Posterior cluster membership probabilities barplot . Vertical axis is posterior probability of belonging to cluster 2.37 s10906111rs2237892rs6444082rs7020996rs7903146rs7923837rs9368222 SNP−specific slope S N P a ss o c i a t i on w i t h pea k b l ood i n s u li n a a Figure 9: Relationship of SNP-specific slope and the association with peak blood insulin . Thehorizontal axis is the posterior median of the SNP-specific slope β i with standard error bars.The vertical axis is the SNP association with peak blood insulin with standard error barsreported in an independent GWAS. The SNPs are colored according to the cluster with thehighest posterior cluster membership probability P ( Z i = k | ˆ θ X ii Relationship of SNP-specific slope and the association with peak blood insulin . Thehorizontal axis is the posterior median of the SNP-specific slope β i with standard error bars.The vertical axis is the SNP association with peak blood insulin with standard error barsreported in an independent GWAS. The SNPs are colored according to the cluster with thehighest posterior cluster membership probability P ( Z i = k | ˆ θ X ii , ˆ θ Y ii