Statistical Methods for cis-Mendelian Randomization
SStatistical Methods for cis-Mendelian Randomization
Apostolos Gkatzionis , ∗ , Stephen Burgess , , Paul J Newcombe MRC Biostatistics Unit, University of Cambridge, UK. MRC Integrative Epidemiology Unit, University of Bristol, UK. Department of Public Health and Primary Care, School of Clinical Medicine,University of Cambridge, UK.
Abstract
Mendelian randomization is the use of genetic variants to assess the existence of a causal re-lationship between a risk factor and an outcome of interest. In this paper we focus on Mendelianrandomization analyses with many correlated variants from a single gene region, and particu-larly on cis-Mendelian randomization studies which uses protein expression as a risk factor. Suchstudies must rely on a small, curated set of variants from the studied region; using all variantsin the region requires inverting an ill-conditioned genetic correlation matrix and results in nu-merically unstable causal effect estimates. We review methods for variable selection and causaleffect estimation in cis-Mendelian randomization, ranging from stepwise pruning and conditionalanalysis to principal components analysis, factor analysis and Bayesian variable selection. In asimulation study, we show that the various methods have a comparable performance in analyseswith large sample sizes and strong genetic instruments. However, when weak instrument bias issuspected, factor analysis and Bayesian variable selection produce more reliable inference thansimple pruning approaches, which are often used in practice. We conclude by examining twocase studies, assessing the effects of LDL-cholesterol and serum testosterone on coronary heartdisease risk using variants in the
HMGCR and
SHBG gene regions respectively.
Introduction
Mendelian randomization (MR) is the use of genetic data in order to assess the existence of acausal relationship between a modifiable risk factor and an outcome of interest (Davey Smith andEbrahim, 2003; Burgess and Thompson, 2015). It is an application of instrumental variablesanalysis in the field of genetic epidemiology, where genetic variants are used as instruments.The approach has received much attention in recent years and has been used to identify a largenumber of causal associations in the literature (Boef et al., 2015). For example, MR studies havestrengthened the evidence for a causal link between lipoprotein(a) and coronary heart disease(CHD) (Kamstrup et al., 2009), but have weakened the evidence for an effect of C-reactiveprotein on CHD risk (CCGC (CRP CHD Genetics Collaboration), 2011).A Mendelian randomization analysis requires a set of genetic variants that satisfy the in-strumental variable assumptions: genetic variants should be (i) associated with the risk factor ∗ Corresponding author. Email: [email protected] a r X i v : . [ q - b i o . Q M ] J a n f interest, (ii) independent of confounders of the risk factor-outcome association, and (iii) theyshould only influence the outcome through their effect on the risk factor (the no-pleiotropyassumption). Under these three assumptions, MR offers a framework for assessing whether therisk factor is causally related to the outcome; under additional modelling assumptions, one canalso estimate the magnitude of the risk factor-outcome causal effect. XG YU (cid:88) x x (1)(2) (3)
Figure 1: A causal diagram representation of the three assumptions of Mendelian randomization.Here, X represents the risk factor, Y the outcome, G the genetic instrument and U denotes con-founders of the X − Y relationship. Traditionally, Mendelian randomization studies are implemented using independent geneticvariants from across the whole genome. This is especially the case when studying complexpolygenic traits such as body mass index, cholesterol or blood pressure. Such genome-wideanalyses rely on published results from large-scale GWAS studies to identify large numbersof genetic regions associated with the risk factor studied. Genetic variants in each region arepruned for independence and only the variant with the smallest p-value, or a small number ofweakly correlated variants with small p-values are selected. Variants from different regions arethen combined to create a genome-wide set of instruments for MR, thus increasing the powerof the analysis to detect a causal relationship.Conversely to the more traditional type of genome-wide MR described above, recent yearshave seen an increased number of MR studies based on a single gene region, and particularlycis-Mendelian randomization studies (Schmidt et al., 2019; Gill et al., 2020). Cis-MR uses aprotein as a risk factor, and investigates the gene region encoding the studied protein in order toidentify variants that regulate protein expression. Mendelian randomization is then conductedusing these variants as instruments, in order to assess the effects of protein expression on oneor more disease outcomes. Cis-MR studies have grown in popularity because they can be usedfor drug target discovery and validation (Gill et al., 2020). Since proteins are the drug targetsof most medicines, an MR investigation of the effects of protein expression on a disease canprovide insights on whether a drug targeting the protein may reduce the risk of suffering fromthe disease.Cis-MR studies typically rely on a single gene region, and therefore have to select geneticinstruments among a potentially large number of correlated variants. Through the existing cis-MR literature, a variety of techniques have been employed to accomplish this task. However, todate no review or comparison of the different methods has been published, and applied cis-MRanalyses typically use some form of LD-pruning for variable selection. The aim of this paperis to provide an overview of statistical methods for cis-MR studies with summary-level data.After reviewing commonly used approaches such as LD-pruning and conditional analysis, wediscuss the use of principal components analysis (PCA), factor analysis and stochastic-searchvariable selection. The use of PCA was proposed by Burgess et al. (2017) for MR with correlated nstruments and can be directly extended to cis-MR. The use of factor analysis was recentlyproposed by Patel et al. (2020), building on similar methods for instrumental variables analysiswith individual-level data (Bai and Ng, 2002). Stochastic search can be implemented usingthe JAM algorithm, originally proposed by Newcombe et al. (2016) for fine-mapping denselygenotyped regions and recently adapted for MR by Gkatzionis et al. (2020). We discuss how toimplement each method in the context of cis-MR and compare their performance in simulationstudies.We also provide two real-data applications, investigating associations of variants in the SHBG and
HMGCR regions with coronary heart disease risk. The
HMGCR gene encodesthe protein HMG-coenzyme A reductase, which is known to play an important role in thecholesterol-biosynthesis pathway. Its inhibition with statins is known to reduce LDL-cholesterollevels, therefore statin treatment is often prescribed to individuals in high CHD risk. The
SHBG region encodes the protein sex-hormone binding globulin, and is known to contain geneticassociations with testosterone and other sex hormones; there is some uncertainty on whethertestosterone levels are associated with CHD risk, although a recent MR study has suggested nocausal relationship (Schooling et al., 2018).
An Overview of Cis-Mendelian Randomization
Drug development is a challenging and costly process with uncertain results, and the failure ratefor drugs in clinical development can be as high as 90% (Hay et al., 2014). Since proteins aretypically the targets of common drugs, cis-Mendelian randomization uses expression data fordruggable proteins as risk factors, in order to aid in drug target discovery and validation. Thewealth of available genetic data, combined with the ability to routinely implement MR analysesusing existing software (Walker et al., 2019; Yavorska and Burgess, 2017) makes cis-MR a usefulapproach to complement and inform drug development. This can be evidenced by the increasedsuccess rate of drug development programs with genomic support (Nelson et al., 2015).Cis-MR studies can be used in order to assess the suitability of proteins as potential drugtargets, predict and inform the results of clinical trials, or investigate repurposing opportunitiesand potential side-effects of existing drugs. For example, a Mendelian randomization analysisof LDL cholesterol and cardiovascular disease risk based on the
ACLY region (Ference et al.,2019) was recently used alongside a clinical trial (Ray et al., 2019) to assess the suitability ofthe
ACLY region as a potential drug target for lowering LDL-cholesterol. Similar analyses havebeen conducted using the
CETP region to instrument lipid fractions and blood pressure (Sofatet al., 2010), using the
IL6R region to assess the effect of interleukin 6 reduction on coronaryheart disease risk (The Interleukin-6 Receptor Mendelian Randomisation Analysis (IL6R MR)Consortium, 2012), and using the
HMGCR (Swerdlow et al., 2015) and
PCSK9 regions (Schmidtet al., 2017) to explore potential effects of LDL-cholesterol lowering treatments on body massindex and type 2 diabetes risk.Ideally, cis-MR should be conducted using protein expression data for the target proteinrisk factor (Gill et al., 2020). However, data on protein expression are not always available,and researchers often use either gene expression data for the associated gene region or geneticassociations with a downstream biomarker instead. For example, Swerdlow et al. (2015) usedLDL-cholesterol as a downstream risk factor in order to study the effects of HMGCR inhibitionon the risk of type 2 diabetes.Genetic variants for cis-MR are selected from a region containing the protein-encoding gene.The gene region is defined to contain the target gene, as well as variants within a few hundred housand base pairs on either side of the gene, in order to include possible cis-eQTL variants.Within the limits of a gene region, it is often possible to identify hundreds of correlated variantsmarginally associated with protein expression. Researchers must then select which of thesevariants to incorporate in a subsequent MR analysis; including all available variants is usuallyavoided as it can introduce numerical approximation errors in estimating the MR causal effect(Burgess et al., 2017). Some authors have suggested prioritizing variants based on functionalannotations; however, Schmidt et al. (2019) investigated this strategy in a simulation study andfound little to no difference in causal inferences conducted using functional variants comparedto non-coding variants. Therefore, the selection of genetic variants is usually performed basedon the strength of their associations with the risk factor, while accounting for LD correlationto reduce numerical approximation errors. In the next section, we review statistical approachesfor implementing this variable selection procedure.Finally, similar to standard Mendelian randomization analyses, genetic variants used forcis-MR must satisfy the instrumental variables assumptions. Violations of the instrumentalvariable assumptions are often a concern because they have the potential to devalidate tests ofcausal association between the risk factor and the outcome and to bias causal effect estimates.However, violations of the no-pleiotropy assumption are less likely to occur when studyingprotein expression, because proteins are causally upstream of metabolites and biomarkers, whichconstitute the most common risk factors used in traditional MR analyses (Swerdlow et al., 2016;Schmidt et al., 2019). Statistical Methods for cis-Mendelian Randomization
Notation
We now present various approaches for selecting genetic variants and computing causal effectestimates in cis-MR studies. Let X denote the risk factor of interest, Y denote the outcome, G , . . . , G P denote a set of P putative genetic instruments from a single gene region, and let U represent confounders of the risk factor-outcome association. Also, let θ denote the causaleffect of the risk factor on the outcome. Our task is to obtain a subset of the genetic variantsto use as instruments, an estimate ˆ θ of the causal effect parameter and an associated standarderror se(ˆ θ ).We work under the framework of two-sample summary-data MR and assume the availabilityof two sets of summary statistics, one for SNP-risk factor and one for SNP-outcome associations,obtained from different datasets. We use ˆ β Xj , ˆ β Y j to denote the marginal associations betweengenetic variant G j and the two traits, and ˆ σ Xj , ˆ σ Y j for the corresponding standard errors. Inaddition, p Xj denotes the p-value of association between variant G j and the risk factor. Finally,we assume that we have access to a reference dataset, such as 1000 genomes or the UK Biobank,from which to estimate correlations between genetic variants. Single-Variant MR
Some studies in the cis-MR literature have used only the variant with the smallest univariatep-value in a region as an instrument (e.g. Sofat et al., 2010; Swerdlow et al., 2015). With asingle genetic variant, the MR causal effect can be estimated using a ratio formula. This avoidsthe need to account for genetic correlations and to perform complex numerical tasks such asinverting the genetic covariance matrix. On the other hand, if the region contains multiple ariants with independent associations with the risk factor, a single-variant MR analysis willnot be able to model all the genetic effects in the region. This typically translates into a lossof power for the cis-MR analysis. Therefore, most recent cis-MR papers have utilised variableselection or dimension reduction approaches in order to include several variants per region. LD-Pruning
Perhaps the most common approach in the literature for selecting genetic variants for inclusioninto a cis-MR study is LD-pruning (see e.g. Dudbridge and Newcombe, 2015; Schmidt et al.,2019). Given a set of candidate genetic variants G , . . . , G P , the stepwise pruning procedureiterates between: • Selecting the variant with the smallest p-value and including it in the analysis. • Identifying all variants whose correlation with the previously selected variant exceeds acorrelation threshold ρ and removing them from further consideration.This is repeated until all the candidate variants have either been selected for inclusion or dis-carded. Alternatively, the process may be stopped when there are no remaining variants withp-values lower than a significance threshold τ , in order to guard against the inclusion of weakinstruments into the analysis. The threshold τ is often taken to be the GWAS significancethreshold τ = 5 × − , although there is some evidence that using more relaxed thresholdsmay be beneficial (Dudbridge, 2013).Once LD-pruning is implemented and a suitable set of genetic instruments is selected, theMR causal effect can be estimated using an inverse variance weighted (IVW) estimate. If thecorrelation threshold ρ is small enough, the selected variants can be considered approximatelyindependent and the standard IVW estimate,ˆ θ ivw = (cid:80) Pj =1 ˆ β Xj ˆ β Y j ˆ σ − Y j (cid:80) Pj =1 ˆ β Xj ˆ σ − Y j se(ˆ θ ivw ) = 1 (cid:80) Pj =1 ˆ β Xj ˆ σ − Y j can be used. This approach is typically used in genome-wide MR analyses. However, forsingle-region MR, the use of a small correlation threshold may result in the exclusion of causalvariants, especially when studying gene regions with multiple independent associations with therisk factor. On the other hand, if a larger threshold is used, the standard IVW estimate willunderestimate the causal standard error, since it does not adjust for correlations. In this case itis better to use a modified version of of the IVW estimator that takes genetic correlations intoaccount (Burgess et al., 2016): ˆ θ cor = (cid:16) ˆ β TX Ω − ˆ β X (cid:17) − (cid:16) ˆ β TX Ω − ˆ β Y (cid:17) (1)se(ˆ θ cor ) = (cid:16) ˆ β TX Ω − ˆ β X (cid:17) − (2)where Ω is (an estimate of) the genetic covariance matrix: Ω i,j = Cov( G i , G j ). The estimator(1) can be motivated from the literature on meta-analysis. It is sometimes referred to asthe generalized least squares method (Schmidt et al., 2019) because it can be derived usinggeneralized linear regression to model the genetic associations with the outcome in terms of thegenetic associations with the risk factor (Burgess et al., 2016, 2017). he correlated-instruments IVW estimator requires the genetic covariance matrix Ω to becomputed and inverted, a process which can become numerically unstable if highly correlatedvariants are used (Burgess et al., 2017). This is the main reason why cis-MR analyses cannotuse all available variants within a region. Numerical instabilities are less likely to occur whenthe IVW estimator is used in combination with pruning, since the pairwise correlations ofvariants selected by LD-pruning are all lower than the specified correlation threshold ρ , althoughproblems can still occur if the genetic covariance matrix is near-singular.A potential criticism for the LD-pruning approach is that there is no consensus in theliterature on how to select the correlation threshold. Implementations of the method usingdifferent correlation thresholds can give rise to different results, especially when large thresholdsare used (Burgess et al., 2017). To address this issue, it has been proposed that stepwise pruningis implemented with a variety of pruning thresholds, as a form of sensitivity analysis (Schmidtet al., 2019). In addition, stepwise methods for variable selection have been criticised for beingunstable and getting stuck in local maxima of the model space (Hocking, 1976; Tibshirani,1996; Miller, 2002) and more flexible variable selection approaches are often preferred in someapplications of biostatistics (Newcombe et al., 2016; Benner et al., 2016; Asimit et al., 2019),although it is not clear whether the added flexibility provides a substantial benefit in Mendelianrandomization studies. Conditional Analysis
Conditional analysis is a process similar to LD-pruning, except variant selection is based ongenetic associations with the risk factor conditional on any previously selected variants insteadof univariate marginal p-values. Perhaps the most established algorithm for conditional analysisis the Conditional and Joint (CoJo) algorithm proposed in Yang et al. (2012). The algorithmstarts by selecting the genetic variant with the smallest univariate p-value, and then in eachiteration it iterates between the following steps: • Calculate the p-values of all candidate genetic variants conditional on any variants thathave already been selected in previous steps. As with stepwise pruning, the p-values areonly computed for genetic variants whose correlations with previously selected variants donot exceed a pre-specified correlation threshold ρ . • Select the variant with the smallest conditional p-value, provided that its p-value is smallerthan a significance threshold τ . • Regress the risk factor on all previously selected variants and compute the multivariatep-values for each selected variant. • If any of the p-values in the multivariate model is larger than the significance threshold τ ,drop the variant with the largest p-value from the model.This variable selection algorithm is run until it can no longer add or remove any variants.Upon finishing, it provides a set of genetic variants which can subsequently be used to con-duct Mendelian randomization. A causal effect estimate based on the selected variants is thenobtained using correlated-instruments IVW (1)-(2).Yang et al. (2012) provide details on how to compute the conditional and joint-model p-values using marginal summary statistics. The fact that the algorithm only requires summary-level data, along with its implementation as part of the GCTA software, have contributed toconditional analysis becoming a popular option for selecting variants from single genetic regions. n practice, conditional analysis often has a similar performance to LD-pruning (Dudbridgeand Newcombe, 2015). Both approaches rely on a correlation threshold parameter ρ and canproduce inconsistent results for different values of the correlation parameter, especially for largevalues (e.g. ρ ≥ . Principal Component Analysis
Principal components analysis (PCA) is widely used in GWAS studies to adjust for populationstratification. In the context of correlated-instruments summary-data MR, the use of princi-pal components was proposed by Burgess et al. (2017). The method aims to identify linearcombinations of genetic variants that are orthogonal to each other and explain as much of thevariance in the genetic dataset as possible. These linear combinations are then used as geneticinstruments to assess the causal relationship between the risk factor and outcome, using anIVW estimate.In Burgess et al. (2017), principal components analysis is applied to a transformed versionΨ of the covariance matrix, with elementsΨ ij = ˆ β Xi ˆ β Xj ˆ σ − Y i ˆ σ − Y j ρ ij = ˆ β Xi ˆ β Xj ˆ σ − Y i ˆ σ − Y j Ω ij where ρ ij = Cor( G i , G j ). The matrix Ψ is preferred over the untransformed matrix of geneticcorrelations because its entries depend on the precisions ˆ σ − Y j of univariate estimates and thus, iftwo genetic variants are almost perfectly correlated, the variant with the highest precision willbe prioritized.Principal components analysis uses an eigendecomposition for the matrix Ψ and identifiesthe diagonal matrix of eigenvalues Λ = ( λ ii ) and the corresponding matrix of eigenvectors W so that Ψ = W T Λ W . By transforming the summary statistics¯ β X = W T ˆ β X , ¯ β Y = W T ˆ β Y and the genetic covariance matrix ¯Ω = W T Ω W one can rewrite the IVW estimate (1) in terms of these transformed vectors:ˆ θ cor = (cid:0) ¯ β TX ¯Ω − ¯ β X (cid:1) − (cid:0) ¯ β TX ¯Ω − ¯ β Y (cid:1) (3)Evaluating (3) is subject to the same numerical difficulties as (1). However, one can constructa numerically stable approximation for ˆ θ cor by using only the first k principal components. If˜ W is the sub-matrix of W containing only the first k columns and ˜ β X = ˜ W T ˆ β X , ˜ β Y = ˜ W T ˆ β Y ,˜Ω = ˜ W T Ω ˜ W , then the causal effect estimate based on the first k principal components isˆ θ cor ≈ (cid:16) ˜ β TX ˜Ω − ˜ β X (cid:17) − (cid:16) ˜ β TX ˜Ω − ˜ β Y (cid:17) The number k of principal components to use is a tuning parameter for the method. In practice,it is often specified so that the selected principal components explain a specific proportion ofvariation in the genetic dataset. The criterion used in Burgess et al. (2017) is to select principalcomponents that explain either 99% or 99 .
9% of the variation in the data. he JAM Algorithm Joint analysis of marginal summary statistics (JAM) is a Bayesian stochastic-search variableselection algorithm that was introduced by Newcombe et al. (2016) for the purpose of statisticalfine-mapping from summary GWAS results. The algorithm aims to identify genetic variantsrobustly and independently associated with a trait of interest, among a large set of correlatedcandidate variants. JAM only requires the availability of GWAS summary-level data, andgenetic correlations which can be estimated from an external reference dataset. Using thesesummary data, JAM identifies sets of genetic variants that are most plausibly associated withthe trait of interest.The algorithm uses linear regression to model trait X in terms of genetic variants G =( G , . . . , G P ): X ∼ N (cid:0) Gb, σ X (cid:1) (4)where b denotes the vector of multivariate associations between each genetic variant and thetrait (in contrast to β X , β Y , which represent marginal associations). In order to implement thisregression using summary statistics, JAM employs a transformation z = G T X , for which (4)implies that z ∼ N (cid:0) G T Gb, G T Gσ X (cid:1) (5)Assuming Hardy-Weinberg equilibrium and that the trait measurements and genetic data havebeen centered, each element z j of the vector z can be approximated using the marginal summarystatistics ˆ β Xj and effect allele frequencies f j , z j = 2 f j (1 − f j ) N ˆ β Xj where N is the sample size from which the G-X associations are inferred. In practice, anadditional Cholesky decomposition is implemented to avoid modelling under the correlated errorstructure of (5); we refer to Newcombe et al. (2016) for the technical details. In addition, notethat G T G is ( N − (cid:96) ( z, G ref | b, σ X ).A similar likelihood is used to conduct the joint analysis step in Yang et al. (2012). JAMemploys Bayesian analysis instead, and uses conjugate normal-inverse-gamma g-priors p ( b, σ X )to facilitate Bayesian inference for the genetic associations with the trait.If the trait studied is binary, z can be derived by mapping univariate log-odds ratios tothe univariate effects that would have been estimated if the binary trait was modelled by linearregression, as has been done in other linear-based summary data algorithms (Vilhjalmsson et al.,2015).The likelihood (5) relies on a fixed set of genetic variants. For variable selection, we assumethat the likelihood has been conditioned on a specific set of variants γ and use a Beta-Binomialprior p ( γ ) for the model coefficient in order to obtain the posterior p ( γ, b, σ X | z, G ref ) ∝ (cid:96) ( z, G ref | b, σ X , γ ) p ( b, σ X | γ ) p ( γ ) (6)This posterior is difficult to evaluate analytically, but JAM implements stochastic-search modelselection via a reversible-jump Markov Chain Monte Carlo (RJMCMC) algorithm to samplefrom (6) and identify the most suitable sets of genetic variants. The stochastic search procedurestarts from a set γ (0) containing only the variant with the smallest p-value. Then in eachiteration, given a current set of genetic variants γ ( k ) , JAM randomly proposes a new set γ ( k +1) by either Adding a new variant to the current set γ ( k ) . • Removing a variant from γ ( k ) . • Swapping a variant in γ ( k ) for a variant not in γ ( k ) .The algorithm then decides whether to accept the new set by evaluating how well it fits thetrait, according to the posterior (6). The process is repeated for a large number of iterations,and the various sets of variants are assigned posterior probabilities according to how often theywere visited. The stochastic search procedure allows JAM to evaluate a wide range of differentmodels and efficiently explore the parameter space of genetic configurations.For Mendelian randomization, the above procedure can be used to identify variants stronglyand independently associated with the risk factor of interest (Gkatzionis et al., 2020). An IVWcausal effect estimate can then be obtained for each set of variants visited by JAM, and thesemodel-specific estimates can be combined into an aggregate causal effect estimate by modelaveraging.Since it jointly models all available genetic variants and can account for genetic correlations,JAM is naturally suited for cis-MR. The algorithm discards variants which are associated withthe risk factor only through their correlation with other present variants, and its posteriormodels include variants independently associated with the risk factor. JAM is not completelyfree of numerical issues when implemented on near-collinear variants, but these issues can beovercome by employing pruning as a first step before calling the algorithm. An advantage overthe LD-pruning approach is that JAM is more robust to the choice of pruning thresholds, sinceit implements a second layer of variable selection after pruning is finished. The pruning step canbe implemented using a high correlation threshold (e.g. ρ = 0 .
9) and no significance threshold.As an alternative, JAM can avoid the need for pruning by using a ridge term for the geneticcovariance matrix to make its inversion more stable. The choice of a pruning threshold is thenreplaced by the tuning of the ridge term parameter. Here we focus on implementations usingpruning as a first step and do not explore the use of a ridge term.Outside of the context of cis-MR, Gkatzionis et al. (2020) extended JAM to handle violationsof the no-pleiotropy assumption when working with multiple independent genetic variants, byaugmenting its variable selection with a heterogeneity loss function to penalize and downweightpleiotropic variants. Furthermore, Jiang et al. (2020) proposed a hierarchical version of thealgorithm that is useful for multivariable MR, as well as transcriptome-wide association studies.
Factor-based Methods
Recently, Patel et al. (2020) proposed the use of factor analysis for MR with correlated weakinstruments. The authors model the genetic variants in terms of a factor model, G = Λ f + (cid:15) f (7)where f is a vector of k < P latent factors, Λ is a matrix of factor loadings and (cid:15) f is a vectorof idiosyncratic errors. The matrix Λ can be estimated using the first k eigenvectors in theeigendecomposition of the (rescaled) genetic covariance matrix G T G . This covariance matrixand its eigendecomposition can be approximated if one has access to a reference dataset. Thenumber k of latent factors can be selected either empirically, as is common in factor analysis,or in a more disciplined way, for example using the factor penalization method of Bai and Ng(2002). he factors are then used as genetic instruments in a standard MR model: X = b TX f + (cid:15) X Y = b TY f + (cid:15) Y β Y = θβ X where (cid:15) X , (cid:15) Y are mean-zero error terms. An estimate of the causal effect parameter θ can beobtained by limited-information maximum likelihood (LIML), minimizing the criterionˆ g ( θ ) = 1 P ˆΛ T ˆ D Y ˆ β Y − θ P ˆΛ T ˆ D X ˆ β X where ˆΛ is the estimated matrix of factor loadings and ˆ D X , ˆ D Y are the diagonal matrices whosediagonal terms are the sample variances of each genetic instrument; these can be approximatedfrom the reference dataset. Minimization of (a weighted version of) ˆ g ( θ ) yields the factor LIML(F-LIML) estimate.Patel et al. (2020) describe conditions that guarantee the consistency and asymptotic nor-mality of the F-LIML estimator. Unfortunately, these conditions are often violated under a weakinstruments framework and F-LIML may suffer from weak instrument bias. In this scenario,Patel et al. (2020) propose a range of hypothesis tests, using the factors as genetic instruments,in order to assess the existence of a causal effect. These tests are derived from the instrumentalvariables literature and adapted to work with summary-level data. They include the Anderson-Rubin (Anderson and Rubin, 1949), Lagrange multiplier (Kleibergen, 2002) and conditionallikelihood ratio (Moreira, 2003) tests. The various tests are more robust to weak instrumentbias than the F-LIML estimator, and have higher power than the principal components methodunder a weak instruments setting. Simulation Study
Simulation Design
To compare the various cis-MR approaches, we implemented a simulation study. We generateddata under the following simulation model: X = P (cid:88) j =1 b Xj G j + α X U + (cid:15) X (8) Y = θX + α Y U + (cid:15) Y (9) U, (cid:15) X , (cid:15) Y ∼ N (0 , σ ) independently of each other. (10)Our simulation design was informed by the real-data applications in the next section. Weobtained genetic variants from two gene regions: the SHBG region, which encodes the proteinsex-hormone binding globulin and is known to be associated with sex hormone traits, and the
HMGCR region, which encodes the protein HMG coenzyme-A reductase, plays an importantrole in the metabolic pathway that produces cholesterol and is the drug target of statins. Werefer to the real-data applications for more information about these regions.
SHBG and
HMGCR represent two distinctly different genetic correlation structures; correlation heat-maps for thetwo regions are provided in Figures 2 and 3 below. By basing simulations on two different egions we guarded against spurious observations on comparative performance of the methodsthat may arise due to subtle properties of the correlation structure in any one region.We defined the two gene regions as spanning 100Kb pairs on either side of the correspondinggene and used all variants in these regions for which individual-level data were available in theUK Biobank and the minor allele frequencies were higher than 1%. This resulted in P = 717genetic variants for the SHBG region and P = 590 variants for the HMGCR region. Weextracted individual-level genetic observations for the selected variants from the UK Biobank,using a sample of N ref = 367643 non-related individuals of European origin. Aiming to replicatea two-sample Mendelian randomization analysis, we then bootstrapped the UK Biobank dataand obtained two genetic matrices G , G of sizes N , N respectively. Using the genetic matrix G , we simulated risk factor measurements according to (8) and used them to obtain G − X summary statistics. Using the matrix G and (8)-(9), we simulated outcome measurements andused them to generate G − Y summary statistics. Finally, the reference correlation matrix wascomputed using the entire UK Biobank dataset.To generate risk factor measurements, we used information from our real-data applicationsand the literature. For the SHBG region, we assumed that six independent genetic effectswere present in the region. This was inspired by Coviello et al. (2012) who investigated theassociations of
SHBG variants with concentration of the SHBG protein, and suggested thatas many as nine independent signals may be present in the region. Three of the nine causalvariants identified by Coviello et al. (2012) were not included in our dataset, as they were locatedmore than 100Kb pairs away from the
SHBG gene, therefore we used the remaining six variantsin our simulations. For these six variants, the effects of the risk-increasing allele on the riskfactor were generated according to b Xj ∼ | N (0 , . | + 0 .
1. This created a pattern of univariateSNP-exposure associations similar to that we observed in our real-data application.For the
HMGCR region, we also assumed the existence of six variants independently as-sociated with the risk factor. The six causal signals were placed at genetic variants used byFerence et al. (2019) to construct a genetic score for LDL-cholesterol based on the
HMGCR region. Effects were generated according to b Xj ∼ | N (0 , . | + 0 .
03 for the risk-increasingallele; again, this closely resembled the univariate summary statistics obtained in our real-dataapplication.We considered three different values for the causal effect parameter: θ = 0 , . , .
1. Weset the G − X sample size to be N = 10000, which is fairly small for modern MR standardsbut may be reasonable for a cis-MR study using protein expression data for the risk factor. Insupplementary material, we report simulation results using a larger sample size of N = 100000.The sample size for SNP-outcome associations was set equal to N = 180000, similar to that ofthe CARDIoGRAMplusC4D dataset (CARDIoGRAMplusC4D Consortium, 2015) used in ourreal-data applications. Finally, we assumed that all the genetic variants are valid instrumentsand did not generate pleiotropic effects.One of our goals in the simulations was to assess the effect of weak instruments bias onthe performance of the various methods. This form of bias can be of particular concern incis-MR studies, as focusing on a single region means that usually there will be far fewer geneticinstruments to use. A common diagnostic for weak instrument bias is to compute the F statisticfor the regression of the risk factor on all the genetic variants; a rule-of-thumb is that F ≤ SHBG region (Jin et al., 2012) and 2% for the
HMGCR region (Krauss et al., SHBG regionand 35 for the
HMGCR region. In this scenario, the effect of weak instrument bias on theperformance of cis-MR methods should be small.In our second scenario, we reduced the proportion of variation in the risk factor that isexplained by the genetic variants to 10% of its value in the previous scenario. For the
SHBG region, we assumed that the genetic variants only explain v G = 0 .
3% of variation in the riskfactor, while for the
HMGCR region, the genetic variants explained 0 .
2% of variation. Thisresulted in a “weak instruments” scenario in which the “oracle” MR analysis had an average Fstatistic of 6 . SHBG region and 4 . HMGCR region.For each scenario, the simulation was replicated 1000 times per region and per value of thecausal effect.
Methods
In our simulations we implemented the following cis-MR methods: • Single-instrument MR using only the minimum p-value variant in the region. • LD-pruning. • Principal components analysis. • The JAM algorithm. • The F-LIML algorithm. • The factor-based conditional likelihood ratio (CLR) test.The performance of the conditional method is often quite similar to that of stepwise pruning(Dudbridge and Newcombe, 2015), hence the method was not implemented. A brief comparisonbetween conditional analysis, stepwise pruning and PCA can be found in Burgess et al. (2017).Many of the above methods depend on additional tuning parameters. To assess the sensitiv-ity of each method to the specification of its tuning parameter(s), we used a range of differentvalues. For stepwise pruning, we set a correlation threshold of ρ = 0 . , . , . , . , .
9. For thesignificance threshold, we use the standard GWAS threshold ( τ = 5 × − ) in simulations with“strong instruments”; with “weak instruments” there were simulations in which no variantspassed the GWAS significance threshold, so we used a lower value ( τ = 10 − ) instead. Forthe PCA method, we used principal components that explained k = 0 .
99 or k = 0 .
999 of thevariation in the genetic data. For JAM, we implemented a pre-pruning step with a correlationthreshold of ρ = 0 . , . , . , .
95 and no significance threshold - note that JAM does not needa significance threshold for the pre-pruning step because its variable selection discards variantsweakly associated with the risk factor anyway. The algorithm was run for 1 million iterationsin each instance. Finally for the F-LIML method, we allowed the algorithm to determine thenumber of latent factors to use.The simulations were implemented in the statistical software R . Stepwise pruning and correlated-instruments IVW were implemented manually. To implement the PCA approach, we used codeprovided in the appendix of Burgess et al. (2017). The JAM algorithm was implemented usingthe R package R2BGLiMS . The F-LIML algorithm and the summary-data CLR test were im-plemented using code provided to us by the authors of the relevant publication (Patel et al.,2020). Stepwise pruning was the fastest algorithm to implement and the JAM algorithm wasthe slowest, although the differences were small on an absolute scale and none of the methodsrequired more than a few seconds to run for each dataset. he various MR methods in our simulations were subject to two sources of bias. The firstsource is weak instrument bias, which is known to act towards the direction of the observationalassociation in one-sample MR analyses and towards the null in two-sample MR (Burgess et al.,2011, 2016). As discussed earlier, weak instrument bias should be a concern in our secondsimulation scenario, where the F statistic is below 10. Note however that even in the firstscenario, weak instrument bias may still affect methods using many nuisance variants to obtaina causal effect estimate, as using such variants reduces the value of the F statistic.Second, the performance of various methods in our simulations can be affected by numericalissues, in particular related to inverting the genetic covariance matrix G T G . Inaccurate estima-tion of ( G T G ) − may lead to spurious increases or decreases in genetic correlations. This formof bias can be detected by computing the condition number of the genetic covariance matrix,but its direction is not straightforward to assess. It is more likely to affect methods using highlycorrelated variants, such as stepwise pruning with high correlation thresholds. Results
Simulation results are reported in Tables 1-2. For each method and each value of the causaleffect parameter, we report median causal effect estimates and estimated standard errors. Forsimulations with θ = 0 we also report the empirical Type I error rates, defined as the proportionof simulations in which the method rejected the null causal hypothesis. For simulations with θ (cid:54) = 0 we also report the empirical coverage of confidence intervals and the empirical power toreject the null causal hypothesis. All methods had very high power in the “strong instruments”scenario with θ = 0 .
1, and very low power in the “weak instruments” scenario with θ = 0 . ρ values, exhibiting bias towards the null. Thebias was more pronounced for θ = 0 . θ = 0 . θ = 0 suggests that any issues in their performanceare due to weak instrument bias, which acts towards the null and hence would only affectsimulations with θ (cid:54) = 0. Accordingly, any biases observed for θ (cid:54) = 0 were towards the null.For LD-pruning, large values of ρ make the inclusion of weak instruments more likely andthe numerical computation and inversion of the correlation matrix more challenging. For smallvalues of ρ the method is more robust to the inclusion of weak instruments. On the otherhand, when the genetic region studied contains multiple causal signals, using a small correlationthreshold may discard some of the causal variants from the analysis. In our simulations, thistranslated into a fairly small increase in causal standard errors and a decrease in the method’spower to detect a causal effect. We also note that in the simulations of Table 1, LD-pruningwas implemented after discarding genetic variants whose p-values did not reach genome-widesignificance. This is a “best-case scenario” for the method in terms of avoiding weak instrumentbias; in practice, pruning is often implemented using less stringent thresholds (Dudbridge, 2013),and the effects of weak instrument bias can be more severe.As expected, an MR analysis using only the variant with the smallest p-value in the regionwas unbiased, but had larger standard errors and lower power than other methods. This was = 0 θ = 0 . θ = 0 . θ se (ˆ θ ) Type I ˆ θ se (ˆ θ ) Cov Power ˆ θ se (ˆ θ ) Cov SHBG
RegionTop SNP —– 0.000 0.019 0.051 0.050 0.019 0.956 0.731 0.096 0.020 0.927Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 -0.001 0.015 0.063 0.049 0.015 0.954 0.914 0.099 0.016 0.938 k = 0 .
999 0.000 0.014 0.050 0.048 0.014 0.941 0.933 0.096 0.015 0.935JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.014 0.045 0.047 0.014 0.956 0.933 0.094 0.015 0.933F-LIML —– 0.000 0.014 0.066 0.051 0.015 0.942 0.936 0.100 0.016 0.953CLR —– —– —– 0.054 —– —– 0.947 0.926 —– —– 0.958
HMGCR
RegionTop SNP —– 0.001 0.018 0.048 0.049 0.019 0.953 0.776 0.099 0.019 0.924Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.017 0.049 0.049 0.017 0.949 0.830 0.099 0.018 0.930 k = 0 .
999 0.000 0.016 0.052 0.048 0.017 0.953 0.826 0.098 0.017 0.934JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.001 0.017 0.035 0.047 0.018 0.957 0.777 0.095 0.018 0.939F-LIML —– 0.000 0.017 0.064 0.050 0.017 0.945 0.850 0.101 0.019 0.946CLR —– —– —– 0.055 —– —– 0.947 0.834 —– —– 0.945
Table 1: Simulation results for the “strong instruments” scenario.14 ore pronounced for the
SHBG region and less so for the
HMGCR region, since genetic corre-lation were stronger in the
HMGCR region and using only a single variant could partly accountfor the effects of other variants through correlation. We also note that in our simulations, allcausal variants had G − X effects in the same direction and of similar magnitude, and there wereno heterogeneous effects towards the outcome. This is a best-case scenario for single-variantanalysis; differences between it and other methods are likely to be larger in practice.The performance of the PCA method was similar for k = 99% and k = 99 .
9% and was quiteaccurate. With a causal effect of θ = 0 .
1, the algorithm exhibited a small reduction in coveragedue to weak instrument bias but still performed better than LD-pruning. In general, the effectof weak instruments bias on the algorithm is more pronounced for larger values of the tuningparameter k . Here, the standard values of k = 99% and k = 99 .
9% worked reasonably well in allsimulation scenarios. The power to reject the null causal hypothesis was greater for k = 99 . SHBG region.The JAM algorithm exhibited similar performance to PCA, with a small attenuation ofcausal effect estimates for θ = 0 .
1. JAM requires a correlation threshold to be specified forthe pruning step before running Bayesian variable selection, but the algorithm’s performancewas quite robust to the value of that tuning parameter, certainly more so than that of LD-pruning. Its empirical power was slightly higher than PCA for the
SHBG region, but lower forhe
HMGCR region.The F-LIML method does not depend on a tuning parameter, as it can automatically deter-mine the number of latent factors to use. Compared to JAM and PCA, the algorithm yieldedslightly more accurate causal effect estimates and slightly better calibrated confidence intervalsfor θ = 0 . θ = 0. The latter issue was addressedby the conditional likelihood ratio test, at the expense of no causal effect estimates and a slightlylower power than F-LIML.Table 2 reports simulation results from the “weak instruments” scenario. Weak instrumentsbias had a much higher impact in these simulations, with several methods facing attenuationof their causal effect estimates. As expected, any bias occurred only for θ (cid:54) = 0 and was towardsthe null, while all methods also had increased standard errors and low power.Top-SNP analysis, LD-pruning, PCA and JAM all suffered from weak instrument bias inthis scenario. The magnitude of bias was similar for these methods. For pruning and PCA,the bias caused poor coverage properties for confidence intervals and Type I error rates abovenominal levels. JAM only selected a small number of genetic variants (it selected an averageof 1 . = 0 θ = 0 . θ = 0 . θ se (ˆ θ ) Type I ˆ θ se (ˆ θ ) Cov ˆ θ se (ˆ θ ) Cov Power SHBG
RegionTop SNP —– -0.002 0.051 0.045 0.033 0.052 0.926 0.072 0.054 0.891 0.259Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.040 0.050 0.035 0.041 0.924 0.068 0.042 0.876 0.392 k = 0 .
999 -0.002 0.033 0.035 0.027 0.034 0.902 0.054 0.036 0.736 0.363JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 -0.002 0.070 0.005 0.035 0.071 0.984 0.071 0.076 0.968 0.096F-LIML —– -0.003 0.037 0.183 0.051 0.039 0.782 0.101 0.041 0.802 0.667CLR —– —– —– 0.041 —– —– 0.939 —– —– 0.948 0.404
HMGCR
RegionTop SNP —– -0.001 0.051 0.055 0.037 0.054 0.939 0.077 0.054 0.916 0.309Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.003 0.047 0.054 0.040 0.049 0.946 0.080 0.051 0.919 0.359 k = 0 .
999 0.001 0.043 0.054 0.033 0.045 0.936 0.067 0.046 0.854 0.303JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.064 0.011 0.040 0.070 0.985 0.079 0.071 0.975 0.131F-LIML —– 0.002 0.047 0.165 0.051 0.050 0.859 0.105 0.054 0.862 0.524CLR —– —– —– 0.049 —– —– 0.955 —– —– 0.952 0.331
Table 2: Simulation results for the “weak instruments” scenario.16 ariant MR analysis can be reliable in simulations where weak instrument bias is of lesserconcern, but should be used with caution in cis-MR analyses where weak instrument bias issuspected. Even when LD-pruning is used, MR causal effect estimates should be computed andreported for a range of different correlation thresholds as a form of sensitivity analysis. BetweenJAM, PCA and F-LIML, the differences were small in the “strong instruments” simulation.With weak instruments, F-LIML provided accurate causal effect estimates but poor uncertaintyquantification, JAM yielded biased causal effect estimates, reasonable confidence intervals butlow power, the CLR test offered nominal confidence intervals and decent power at the expenseof no point estimates and PCA was the method most affected by weak instruments bias. Atheoretical advantage of JAM compared to the other approaches is that JAM’s variable selectiongives an indication of which genetic variants are more likely to be causally associated with therisk factor, although it is not clear whether this additional information would be useful in anMR study where the objective is to perform inference about the X − Y causal effect. Finally,it may be possible to improve the performance of some of the methods by removing very weakvariants (e.g. with p > .
05) from the analysis prior to implementing the methods, but we havenot considered that here.
Additional Simulations
In supplementary material, we report results from a range of additional simulation scenarios.Specifically, we conducted three additional simulations, on which we comment here briefly.First, we used a larger sample size of N = 100000 from which to obtain risk factor-outcome summary statistics. This could be reminiscent of a cis-MR analysis using a down-stream biomarker as a proxy for protein expression, and obtaining G − X summary data froma large-scale GWAS. Our results (Supplementary Table 1) suggested that the various methodsperform similarly in this case and issues such as weak instrument bias and robustness to thechoice of the pruning threshold are less concerning with large sample sizes.In the second simulation, we experimented with different numbers of causal variants. Weused the SHBG region, simulated under the “strong instruments” scenario, and assumed thatthe region contains either one or three genetic variants causally associated with the risk factor.The performance of the various methods (Supplementary Table 2) was similar to that observedin our baseline simulations with six causal variants, suggesting that the number of causal signalsin the region has little impact on their performance relative to each other, except for the single-variant approach which did expectedly better in simulations with only one causal variant.In our third supplementary simulation, we assessed the impact of only having access to asmall reference dataset on the various methods. In each iteration, we bootstrapped the UKBiobank data in order to obtain a small reference sample of size N ref = 1000 and implementedcis-MR using that reference dataset (Supplementary Table 3). The JAM algorithm was the mostsensitive method to this change. The algorithm selected more variants on average, exhibitedslight differences in its performance for different values of the correlation threshold and per-formed poorly for ρ ≥ .
9. In such datasets, we recommend using a more stringent correlationthreshold for the algorithm. The other methods were less affected in comparison.We refer to the supplement for more details on these simulations. pplications Causal Effect of LDL-cholesterol on CHD Risk Using Variants inthe
HMGCR
Region
Introduction
We now compare the various cis-MR methods in two real-data applications. In the first ap-plication, we use Mendelian randomization to investigate the relationship between low densitylipoprotein (LDL) cholesterol concentration and the risk of coronary heart disease (CHD) usinggenetic variants from the
HMGCR region. The
HMGCR region is located in chromosome 5and encodes the protein HMG coenzyme-A reductase. The protein plays an important rolein the metabolic pathway that produces cholesterol, and its inhibition by statins is a commontreatment to reduce LDL-cholesterol levels.This analysis is performed mainly for illustrative purposes: the causal connection betweenLDL-cholesterol and CHD risk has already been explored in several papers in the literature.Genome-wide MR studies of the effect of LDL-cholesterol on CHD risk have been performedby Linsel-Nitschke et al. (2008); Waterworth et al. (2010); Burgess et al. (2014); Holmes et al.(2014); Allara et al. (2019) among others. In the field of cis-MR, Swerdlow et al. (2015) consid-ered associations between
HMGCR variants and a range of biomarkers in order to investigatewhether statin treatment increases the risk of type 2 diabetes. Ference et al. (2016) comparedthe effects on type 2 diabetes for variants in the
HMGCR and
PCSK9 regions, while Ferenceet al. (2019) compared the LDL-CHD causal effects obtained using variants in the
HMGCR and
ACLY regions in order to explore the suitability of
ACLY as a drug target. The
HMGCR region has also been used as an applied example in methodological papers (Schmidt et al., 2019).
Datasets and Methods
Since a clear link between HMGCR protein expression and lowering LDL-cholesterol has been es-tablished, we used LDL-cholesterol as a risk factor instead of protein expression. We estimatedgenetic associations between variants in the
HMGCR region and LDL-cholesterol using datafrom the UK Biobank. Genetic associations were computed based on a sample of N = 349795unrelated individuals of European ancestry. Associations with coronary heart disease risk wereobtained from the CARDIoGRAMplusC4D consortium (CARDIoGRAMplusC4D Consortium,2015), based on a sample of N = 184305 individuals. Finally, genetic correlations were com-puted using individual-level data for 367643 unrelated Europeans from the UK Biobank (thesample size for genetic associations with LDL-cholesterol was slightly smaller than the referencesample due to missing values).We used a wider region than in our simulations and extracted information on variants within500Kb pairs from the HMGCR gene. In total, 2915 variants were present in both the UKBiobank and the CARDIoGRAMplusC4D dataset. Of those 2915 variants, 20 were discardedbecause they either had missing associations with LDL-cholesterol in the UK Biobank datasetor concerned multiple alleles on the same locus that were not detected in both datasets. Wedid not discard variants with low effect allele frequencies. Our analysis was therefore based on P = 2895 genetic variants. In Figure 2 we visualize the genetic correlations in the HMGCR region, and present a Manhattan plot of genetic association with LDL-cholesterol levels.We then conducted cis-MR using the minimum p-value variant, LD-pruning, principal com-ponents analysis, the JAM algorithm and the factor-based methods. The various methods wereimplemented using a range of different parameter specifications, similar to those used for our
HMGCR region. Right: Manhattan plot of p-values for G − X associations. simulation study. For stepwise pruning, we used the values ρ = 0 . , . , . , . , . τ = 5 × − . Of the 2895 variantsin the region, 1424 had p-values below the GWAS threshold. For the PCA method, we usedprincipal components that explained k = 99% or k = 99 .
9% of variation in the genetic data. ForJAM, we implemented a pre-pruning step with a correlation threshold of ρ = 0 . , . , . , . Results
Results of our analysis are presented in Table 3. We report causal effect estimates, standarderrors and 95% confidence intervals obtained by each method. The reported estimates representlog-odds ratios of increase in CHD risk per standard deviation increase in LDL-cholesterol levels.Genetically elevated LDL-cholesterol concentration based on variants in the
HMGCR regionis known to be associated with increased risk of coronary heart disease Cholesterol TreatmentTrialists (CTT) Collaboration (2010); Ference et al. (2019). This was confirmed by the prun-ing, JAM, F-LIML and CLR methods. For example, the JAM algorithm using a pre-pruningthreshold of ρ = 0 . .
355 (odds ratio 1 . . , . .
423 for k = 99% and 0 .
448 for k = 99 .
9% and the null causal hypothesis was rejected on bothoccasions. ethod ˆ θ se (ˆ θ ) 95% CITop-SNP —– 0.555 0.167 (0.228 , 0.882)Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 -0.030 0.026 (-0.081 , 0.020) k = 0 .
999 0.000 0.022 (-0.043 , 0.042)JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.358 0.117 (0.129 , 0.586)F-LIML —– 0.501 0.165 (0.178 , 0.824)CLR —– —– —– (0.169 , 0.839)
Table 3: Cis-MR analysis of the effect of LDL-cholesterol on CHD risk, using genetic variants in the
HMGCR region.
The remaining methods were more consistent in their results. LD-pruning exhibited theusual attenuation towards the null for ρ = 0 . ρ = 0 .
9, as a result of numerical errors. Bias due to numerical errors is more likely to be aconcern here than in our simulation study, due to the larger number of variants used. For theJAM algorithm, there were small differences between implementations for ρ = 0 . − Causal Effect of Testosterone on CHD Risk Using Variants inthe
SHBG
Region
Introduction
In our second application, we apply Mendelian randomization using variants in the
SHBG generegion in order to assess the causal relationship between testosterone levels and coronary heartdisease risk. The
SHBG region is located in chromosome 17 and encodes sex-hormone bindingglobulin, a protein that inhibits the function of sex hormones such as testosterone and estradiol.The region has been shown to contain strong genetic associations with testosterone levels, aswell as a number of sex hormone traits (Jin et al., 2012; Coviello et al., 2012; Schooling et al.,2018; Ruth et al., 2020). In addition, previous research has suggested that the region is likelyto contain several genetic variants independently associated with testosterone; Coviello et al.(2012) claimed that as many as nine independent signals may be present. This implies that anaive approach using only the variant with the smallest p-value would underestimate the geneticcontributions to testosterone levels. The causal relationship between sex hormone traits andCHD risk using variants in the
SHBG region has been studied by Burgess et al. (2017) and chooling et al. (2018) with evidence mostly suggesting no causal relationship. Here, we aimedto replicate the analysis of Burgess et al. (2017) using larger datasets. Datasets and Methods
We used serum testosterone levels as the exposure for our MR analysis. We obtained geneticassociations with testosterone using summary-level data from the Neale Lab website . Thesesummary data were derived from the UK Biobank, using a sample of N = 312102 unrelatedindividuals of European descent. We defined the genetic region to include genetic variantswithin 500Kb pairs on either side of the SHBG gene.As in the previous application, we obtained genetic associations with CHD risk from theCARDIoGRAMplusC4D dataset (CARDIoGRAMplusC4D Consortium, 2015), using a sampleof N = 184305 individuals. A total of 3053 genetic variants were present in both datasets andwere therefore included in our analysis. We did not conduct separate analyses on males andfemales, since we did not have access to sex-specific associations with CHD risk. Finally, weused the UK Biobank as a reference dataset from which to extract LD correlations betweengenetic variants. Figure 3 presents a Manhattan plot of associations with testosterone, as wellas the genetic correlations in the region. Figure 3: Left: genetic correlations in the
SHBG region. Right: Manhattan plot of p-values for G − X associations. We then implemented the various cis-MR methods in order to select genetic instruments andassess whether
SHBG variants are causally associated with CHD risk. We used the same settingas in the
HMGCR application for the tuning parameters of each method. The stepwise pruningmethod was implemented using only variants with GWAS-significant associations with serumtestosterone levels. A total of 1156 variants had p-values lower than the 5 × − threshold.The remaining methods were implemented using all 3053 variants in the region. https://docs.google.com/spreadsheets/d/1kvPoupSzsSFBNSztMzl04xMoSC3Kcx3CrjVf4yBmESU/edit?ts=5b5f17db esults Method ˆ θ se (ˆ θ ) 95% CITop-SNP —– -0.071 0.036 (-0.141 , -0.001)Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 -0.052 0.029 (-0.109 , 0.004) k = 0 .
999 -0.018 0.024 (-0.065 , 0.028)JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 -0.035 0.024 (-0.081 , 0.012)F-LIML —– -0.037 0.025 (-0.087 , 0.013)CLR —– —– —– (-0.087 , 0.014)
Table 4: Mendelian randomization analysis of the effect of serum testosterone levels on CHD risk,using genetic variants in the
SHBG region.
Table 4 reports causal effect estimates, standard errors and 95% confidence intervals foreach method. Once again, estimates are reported in the log-odds ratio scale and representincreases in CHD risk per standard deviation increase in serum testosterone levels. All methodswith the exception of top-SNP and LD-pruning at 0 . . SHBG region, although point estimates wereconsistently in the risk-decreasing direction.In this example, the performance of the various methods was less seriously affected by weakinstruments bias because there was apparently no causal relationship between serum testosteronelevels and CHD risk. This is consistent with our simulation design, where no bias was observedunder the “null causal effect” scenario.A notable inconsistency was that an MR analysis using only the genetic variant with thesmallest p-value in the region suggested a risk-decreasing, statistically significant causal effect.This may be suggestive of heterogeneity in evidence from different variants within the
SHBG region. Therefore, this example empirically demonstrates the pitfalls of using simplistic singlevariant analyses when in fact multiple signals exist within a region. The top variant in ouranalysis was rs1799941, which is known to be associated with testosterone levels (Ruth et al.,2020).The LD-pruning method was rather inconclusive in this application. Implementations witha low correlation threshold suggested no causal effect. However, the method suggested an effectin the risk-decreasing direction for ρ = 0 . ρ = 0 . − ignals in the region, confirming claims by Coviello et al. (2012). The F-LIML estimate and theCLR confidence interval were in line with the results obtained by JAM.Our results obtained in this application are similar to those reported in Burgess et al. (2017).Similar to the results reported in Table 4, pruning estimates in Burgess et al. (2017) suggestedno causal effect for small correlation thresholds bt were unstable for large ρ (in fact, they weremore unstable than in our results, possibly due to the smaller sample sizes used in that paper).Estimates from the PCA method were in the risk-decreasing direction but did not achievestatistical significance. Overall, both Burgess et al. (2017) and our analysis suggested no causalassociation between testosterone and CHD risk based on variants in the SHBG region; furtherevidence of no causality was provided by Schooling et al. (2018), which also used data from theUK Biobank.
Discussion
Cis-Mendelian randomization is emerging as a widely applicable approach to support drug de-velopment and inform clinical trials. In this manuscript we have outlined and compared a rangeof methods from the cis-MR literature to select variants from a genetic region of interest andestimate the MR causal effect. In particular, we have considered MR using only the minimump-value variant, and also LD-pruning, which is one of the most commonly used approaches inpractice, principal components analysis, factor-based methods and stochastic-search variable se-lection via the JAM algorithm. We compared the various methods across a range of simulationscenarios, which were based on real genetic correlation structures of two different gene regionsdrawn from the UK Biobank. We also assessed the performance of the different methods intwo cis-MR case studies, investigating the effect of LDL-cholesterol on CHD risk using variantsin the
HMGCR region and the effect of testosterone on CHD risk using variants in the
SHBG region.We have found that LD-pruning can be reliable when using large sample sizes and stronggenetic instruments, but can be susceptible to weak instrument bias. Moreover, results obtainedusing LD-pruning can be sensitive to the correlation threshold used, and using a high correlationthreshold can result in numerical instabilities. Based on these results, we would thereforerecommend that when the pruning method is used, it is implemented using a range of correlationthresholds as a form of sensitivity analysis.Although not completely free from weak instrument bias, methods such as PCA, F-LIML,the JAM algorithm and the conditional likelihood ratio test were less affected by such bias thanLD-pruning. JAM was also more robust to the choice of its tuning parameter than pruning,while PCA and F-LIML achieved good performance in a range of different scenarios with thedefault values for their tuning parameters. Therefore, we recommend that some of these methodsare implemented as sensitivity tools in applied cis-MR analyses.Compared to each other, JAM, PCA, F-LIML and CLR exhibited similar performance insimulations with strong genetic instruments. With weak instruments, the CLR test was leastaffected by weak instruments bias and provided a valid test for the causal null hypothesis, albeitno point estimates. F-LIML obtained accurate causal effect estimates but yielded confidence in-tervals with poor coverage, while JAM had biased causal effect estimates but better uncertaintyquantification. The principal components IVW approach was affected by weak instruments biasto a greater extent than the other methods, although our real-data applications suggest thatPCA can still perform well if very weak variants are removed from the analysis prior to itsimplementation. he use of multiple methods as a form of sensitivity analysis can increase the reliabilityof results of a cis-MR study. This can be further reinforced by triangulating evidence fromdifferent study designs (Gill et al., 2020). This includes colocalization, an approach which hasbeen shown to yield benefits when used in conjunction with MR to prioritize proteins for drugdevelopment (Zheng et al., 2019). Here, we have only focused on Mendelian randomization andnot considered alternative approaches.Our analysis has a number of limitations. Our simulations were by no means exhaustive;for example, they were based on only two genetic regions. In addition, our simulations havenot considered other forms of bias that may arise in cis-MR applications, such as selectionbias, population stratification or pleiotropy. Bias due to pleiotropy is worth discussing in moredetail, as it is one of the main concerns in many MR studies. Pleiotropy is less of a concernin cis-MR compared to traditional MR studies (Schmidt et al., 2019), but would arise if dif-ferent variants from the same region were involved in different biological pathways relevant tothe outcome, resulting in heterogeneous causal effect estimates. When pleiotropy does occur,standard algorithms for pleiotropy-robust MR (including the JAM algorithm) can be used toaddress it. However, care must be taken when extrapolating the assumptions made by suchalgorithms to the framework of many correlated genetic variants. For example, suppose that aregion contains a single pleiotropic variant, but that variant is strongly correlated with half ofthe other variants in the region. This will induce bias in the marginal causal effect estimates ofthe correlated variants, and hence pleiotropy-robust methods requiring more than 50% of thegenetic variants in the region to be valid instruments will be biased.Cis-Mendelian randomization is an active area of research, both methodological and applied.As the field develops further, it is likely that new statistical approaches will be created toaid applied investigators in their work. By providing an overview and evaluation of existingmethods, we hope that our current paper will contribute to that direction. Acknowledgements
We would like to thank Ashish Patel for many helpful discussions and comments. This workwas supported by the UK Medical Research Council (Core Medical Research Council Biostatis-tics Unit Funding Code: MC UU 00002/7). Apostolos Gkatzionis and Paul Newcombe weresupported by a Medical Research Council Methodology Research Panel grant (Grant NumberRG88311). Stephen Burgess was supported by a Sir Henry Dale Fellowship jointly funded bythe Wellcome Trust and the Royal Society (Grant Number 204623/Z/16/Z). This research hasbeen conducted using the UK Biobank Resource under Application Number 30931.
References
Allara, E., G. Morani, P. Carter, A. Gkatzionis, V. Zuber, C. N. Foley, J. M. Rees, A. M. Mason,S. Bell, D. Gill, S. Lindstrom, A. S. Butterworth, E. Di Angelantonio, J. Peters, S. Burgess,and INVENT consortium (2019). Genetic determinants of lipids and cardiovascular diseaseoutcomes: A wide-angled Mendelian randomization investigation.
Circulation. Genomic andprecision medicine 12 (12), e002711.Anderson, T. W. and H. Rubin (”1949, 3). Estimation of the parameters of a single equation ina complete system of stochastic equations.
Annals of Mathematical Statistics 20 (1), 46–63. simit, J. L., D. B. Rainbow, M. D. Fortune, N. F. Grinberg, L. S. Wicker, and C. Wallace(2019). Stochastic search and joint fine-mapping increases accuracy and identifies previouslyunreported associations in immune-mediated diseases. Nature Communications 10 .Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models.
Econometrica 70 (1), 191–221.Benner, C., C. C. A. Spencer, A. S. Havulinna, V. Salomaa, S. Ripatti, and M. Pirinen (2016).FINEMAP: efficient variable selection using summary data from genome-wide associationstudies.
Bioinformatics 32 (10), 1493–1501.Boef, A. G. C., O. M. Dekkers, and S. le Cessie (2015). Mendelian randomization studies: areview of the approaches used and the quality of reporting.
International Journal of Epi-demiology 44 (2), 496–511.Burgess, S., N. M. Davies, and S. G. Thompson (2016). Bias due to participant overlap intwo-sample Mendelian randomization.
Genetic Epidemiology 40 (7), 597–608.Burgess, S., F. Dudbridge, and S. G. Thompson (2016). Combining information on multiple in-strumental variables in Mendelian randomization: comparison of allele score and summarizeddata methods.
Statistics in Medicine 35 (11), 1880–1906.Burgess, S., D. F. Freitag, H. Khan, D. N. Gorman, and S. G. Thompson (2014, 10). Usingmultivariable Mendelian randomization to disentangle the causal effects of lipid fractions.
PLOS ONE 9 (10), 1–8.Burgess, S. and S. G. Thompson (2015).
Mendelian randomization: methods for using geneticvariants in causal estimation . Chapman & Hall, Boca Raton, FL.Burgess, S., S. G. Thompson, and CRP CHD Genetics Collaboration (2011, 03). Avoidingbias from weak instruments in Mendelian randomization studies.
International Journal ofEpidemiology 40 (3), 755–764.Burgess, S., V. Zuber, E. Valdes-Marquez, B. B. Sun, and J. C. Hopewell (2017). Mendelianrandomization with fine-mapped genetic data: Choosing from large numbers of correlatedinstrumental variables.
Genetic Epidemiology 41 (4), 714–725.CARDIoGRAMplusC4D Consortium (2015). A comprehensive 1000 Genomes-based genome-wide association meta-analysis of coronary artery disease.
Nature Genetics 47 (10), 1121–1130.CCGC (CRP CHD Genetics Collaboration) (2011). Association between C reactive protein andcoronary heart disease: Mendelian randomisation analysis based on individual participantdata.
British Medical Journal 342 .Cholesterol Treatment Trialists (CTT) Collaboration (2010). Efficacy and safety of more in-tensive lowering of LDL cholesterol: a meta-analysis of data from 170 000 participants in 26randomised trials.
The Lancet 376 (9753), 1670–1681.Coviello, A. D., R. Haring, M. Wellons, D. Vaidya, T. Lehtim¨aki, S. Keildson, and Other Authors(2012). A genome-wide association meta-analysis of circulating sex hormone–binding globulinreveals multiple loci implicated in sex steroid hormone regulation.
PLOS Genetics 8 (7), 1–12. avey Smith, G. and S. Ebrahim (2003). ‘Mendelian randomization’: can genetic epidemiologycontribute to understanding environmental determinants of disease? International Journalof Epidemiology 32 (1), 1 – 22.Dudbridge, F. (2013, 03). Power and predictive accuracy of polygenic risk scores.
PLOSGenetics 9 (3), 1–17.Dudbridge, F. and P. J. Newcombe (2015). Accuracy of gene scores when pruning markers bylinkage disequilibrium.
Human Heredity 80 (4), 178 – 186.Ference, B. A., K. K. Ray, A. L. Catapano, T. B. Ference, S. Burgess, D. R. Neff, C. Oliver-Williams, A. M. Wood, A. S. Butterworth, E. Di Angelantonio, J. Danesh, J. J. P. Kastelein,and S. J. Nicholls (2019). Mendelian Randomization Study of ACLY and CardiovascularDisease.
New England Journal of Medicine 380 (11), 1033–1042.Ference, B. A., J. G. Robinson, R. D. Brook, A. L. Catapano, M. J. Chapman, D. R. Neff,S. Voros, R. P. Giugliano, G. Davey Smith, S. Fazio, and M. S. Sabatine (2016). Variation inPCSK9 and HMGCR and risk of cardiovascular disease and diabetes.
New England Journalof Medicine 375 (22), 2144–2153.Gill, D., M. K. Georgakis, V. M. Walker, A. F. Schmidt, A. Gkatzionis, D. F. Freitag, C. Finan,A. D. Hingorani, J. C. M. Howson, S. Burgess, D. I. Swerdlos, M. Dichgans, B. M. Psaty,G. Davey Smith, M. V. Holmes, R. A. Scott, J. Zheng, and N. M. Davies (2020). Mendelianrandomization for studying the effects of perturbing drug targets: a practical guide. Inpreparation.Gkatzionis, A., S. Burgess, D. V. Conti, and P. J. Newcombe (2020). Bayesian variable selectionwith a pleiotropic loss function in Mendelian randomization. bioRxiv .Hay, M., D. W. Thomas, J. L. Craighead, C. Economides, and J. Rosenthal (2014). Clinicaldevelopment success rates for investigational drugs.
Nature Biotechnology 32 , 40–51.Hocking, R. R. (1976). A biometrics invited paper. the analysis and selection of variables inlinear regression.
Biometrics 32 (1), 1–49.Holmes, M. V., F. W. Asselbergs, T. M. Palmer, F. Drenos, M. B. Lanktree, C. P. Nelson,and Other Authors (2014, 01). Mendelian randomization of blood lipids for coronary heartdisease.
European Heart Journal 36 (9), 539–550.Jiang, L., S. Xu, N. Mancuso, P. J. Newcombe, and D. V. Conti (2020). A hierarchical approachusing marginal summary statistics for multiple intermediates in a Mendelian randomizationor transcriptome analysis. bioRxiv .Jin, G., J. Sun, S.-T. Kim, J. Feng, Z. Wang, S. Tao, Z. Chen, L. Purcell, S. Smith, W. B. Isaacs,R. S. Rittmaster, S. L. Zheng, L. D. Condreay, and J. Xu (2012). Genome-wide associationstudy identifies a new locus JMJD1C at 10q21 that may influence serum androgen levels inmen.
Human Molecular Genetics 21 (23), 5222–5228.Kamstrup, P. R., A. Tybjærg-Hansen, R. Steffensen, and B. G. Nordestgaard (2009). Geneticallyelevated lipoprotein(a) and increased risk of myocardial infarction.
JAMA 301 (22), 2331–2339. leibergen, F. (2002). Pivotal statistics for testing structural parameters in instrumental vari-ables regression. Econometrica 70 (5), 1781–1803.Krauss, R. M., L. M. Mangravite, J. D. Smith, M. W. Medina, D. Wang, X. Guo, M. J. Rieder,J. A. Simon, S. B. Hulley, D. Waters, M. Saad, P. T. Williams, K. D. Taylor, H. Yang, D. A.Nickerson, and J. I. Rotter (2008). Variation in the 3-hydroxyl-3-methylglutaryl coenzymeA reductase gene is associated with racial differences in low-density lipoprotein cholesterolresponse to simvastatin treatment.
Circulation 117 (12), 1537–1544.Linsel-Nitschke, P., A. G¨otz, J. Erdmann, I. Braenne, P. Braund, C. Hengstenberg, and OtherAuthors (2008, 08). Lifelong reduction of LDL-cholesterol related to a common variant in theldl-receptor gene decreases the risk of coronary artery disease—a Mendelian randomisationstudy.
PLOS ONE 3 (8), 1–9.Miller, A. J. (2002).
Subset Selection in Regression . Monographs on Statistics and AppliedProbability. London: Chapman and Hall.Moreira, M. J. (2003). A conditional likelihood ratio test for structural models.
Economet-rica 71 (4), 1027–1048.Nelson, M. R., H. Tipney, J. L. Painter, J. Shen, P. Nicoletti, Y. Shen, A. Floratos, P. C. Sham,M. J. Li, J. Wang, L. R. Cardon, J. C. Whittaker, and P. Sanseau (2015). The support ofhuman genetic evidence for approved drug indications.
Nature Genetics 47 (8), 856–860.Newcombe, P. J., D. V. Conti, and S. Richardson (2016). JAM: A scalable Bayesian frameworkfor joint analysis of marginal SNP effects.
Genetic Epidemiology 40 (3), 188 – 201.Patel, A., S. Burgess, D. Gill, and P. J. Newcombe (2020). Inference with many correlated weakinstruments and summary statistics. arXiv:2005.01765.Ray, K. K., H. E. Bays, A. L. Catapano, N. D. Lalwani, L. T. Bloedon, L. R. Sterling, P. L.Robinson, and C. M. Ballantyne (2019). Safety and efficacy of bempedoic acid to reduce LDLcholesterol.
New England Journal of Medicine 380 (11), 1022–1032.Ruth, K. S., F. R. Day, D. J. T. Jessica Tyrrel and, A. R. Wood, A. Mahajan, R. N. Beaumont,L. Wittemans, S. Martin, A. S. Busch, A. M. Erzurumluoglu, B. Hollis, T. A. O’Mara, TheEndometrial Cancer Association Consortium, M. I. McCarthy, C. Langenberg, D. F. Easton,N. J. Wareham, S. Burgess, A. Murray, K. K. Ong, T. M. Frayling, and J. R. B. Perry (2020).Using human genetics to understand the disease impacts of testosterone in men and women.
Nature Medicine 26 (2), 252–258.Schmidt, A. F., C. Finan, M. Gordillo-Mara˜n´on, F. W. Asselbergs, D. F. Freitag, R. S. Patel,B. Tyl, S. Chopade, R. Faraway, M. Zwierzyna, and A. D. Hingorani (2019). Genetic drugtarget validation using Mendelian randomization. bioRxiv .Schmidt, A. F., D. I. Swerdlow, M. V. Holmes, R. S. Patel, Z. Fairhurst-Hunter, D. M. Lyall,and Other Authors (2017). PCSK9 genetic variants and risk of type 2 diabetes: a Mendelianrandomisation study.
The Lancet Diabetes and Endocrinology 5 (2), 97–105. chooling, C. M., S. Luo, S. L. A. Yeung, D. J. Thompson, S. Karthikeyan, T. R. Bolton,A. M. Mason, E. Ingelsson, and S. Burgess (2018). Genetic predictors of testosterone andtheir associations with cardiovascular disease and risk factors: A Mendelian randomizationinvestigation. International Journal of Cardiology 267 , 171 – 176.Sofat, R., A. D. Hingorani, L. Smeeth, S. E. Humphries, P. J. Talmud, J. Cooper, and OtherAuthors (2010). Separating the mechanism-based and off-target actions of cholesteryl estertransfer protein inhibitors with cetp gene polymorphisms.
Circulation 121 (1), 52–62.Swerdlow, D. I., K. B. Kuchenbaecker, S. Shah, R. Sofat, M. V. Holmes, J. White, J. S. Mindell,M. Kivimaki, E. J. Brunner, J. C. Whittaker, J. P. Casas, and A. D. Hingorani (2016, 06).Selecting instruments for Mendelian randomization in the wake of genome-wide associationstudies.
International Journal of Epidemiology 45 (5), 1600–1616.Swerdlow, D. I., D. Preiss, K. B. Kuchenbaecker, M. V. Holmes, J. E. L. Engmann, and OtherAuthors (2015). HMG-coenzyme A reductase inhibition, type 2 diabetes, and bodyweight:evidence from genetic analysis and randomised trials.
The Lancet 385 (9965), 351–361.The Interleukin-6 Receptor Mendelian Randomisation Analysis (IL6R MR) Consortium (2012).The interleukin-6 receptor as a target for prevention of coronary heart disease: a Mendelianrandomisation analysis.
The Lancet 379 (9822), 1214–1224.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of the RoyalStatistical Society. Series B (Methodological) 58 (1), 267–288.Vilhjalmsson, B. J., J. Yang, H. K. Finucane, A. Gusev, S. Lindstr¨om, S. Ripke, G. Genovese,P.-R. Loh, G. Bhatia, R. Do, T. Hayeck, H.-H. Won, Schizophrenia Working Group of the Psy-chiatric Genomics Consortium, DRIVE study, S. Kathiresan, M. Pato, C. Pato, R. Tamimi,E. Stahl, N. Zaitlen, B. Pasaniuc, G. Belbin, E. E. Kenny, M. H. Schierup, P. De Jager, N. A.Patsopoulos, S. McCarroll, M. Daly, S. Purcell, D. Chasman, B. Neale, M. Goddard, P. M.Visscher, P. Kraft, N. Patterson, and A. L. Price (2015). Modeling linkage disequilibriumincreases accuracy of polygenic risk scores.
The American Journal of Human Genetics 97 (4),576 – 592.Walker, V. M., N. M. Davies, G. Hemani, J. Zheng, P. C. Haycock, T. R. Gaunt, G. Davey Smith,and R. M. Martin (2019). Using the MR-Base platform to investigate risk factors and drugtargets for thousands of phenotypes.
Wellcome Open Research .Waterworth, D. M., S. L. Ricketts, K. Song, L. Chen, J. H. Zhao, S. Ripatti, and Other Authors(2010). Genetic variants influencing circulating lipid levels and risk of coronary artery disease.
Arteriosclerosis, Thrombosis, and Vascular Biology 30 (11), 2264–2276.Yang, J., T. Ferreira, A. P. Morris, S. E. Medland, G. I. of ANthropometric Traits (GI-ANT) Consortium, D. G. Replication, M. analysis (DIAGRAM) Consortium, P. A. F. Mad-den, A. C. Heath, N. G. Martin, G. W. Montgomery, M. N. Weedon, R. J. Loos, T. M.Frayling, M. I. McCarthy, J. N. Hirschhorn, M. E. Goddard, and P. M. Visscher (2012). Con-ditional and joint multiple-SNP analysis of GWAS summary statistics identifies additionalvariants influencing complex traits.
Nature Genetics 44 (4), 369–375.Yavorska, O. O. and S. Burgess (2017). MendelianRandomization: an R package for performingMendelian randomization analyses using summarized data.
International Journal of Epi-demiology 46 (6), 1734–1739. heng, J., V. Haberland, D. Baird, V. Walker, P. Haycock, A. Gutteridge, T. G. Richardson,J. Staley, B. Elsworth, S. Burgess, B. B. Sun, J. Danesh, H. Runz, J. C. Maranville, H. M.Martin, J. Yarmolinsky, C. Laurin, M. V. Holmes, J. Liu, K. Estrada, L. McCarthy, M. Hurle,D. Waterworth, M. R. Nelson, A. S. Butterworth, G. D. Smith, G. Hemani, R. A. Scott, andT. R. Gaunt (2019). Phenome-wide Mendelian randomization mapping the influence of theplasma proteome on complex diseases. bioRxiv . upplementary Material This file contains Supplementary material for the manuscript “Statistical Methods for cis-Mendelian randomizaton”.
Proportion of Genetic Variation Explained
To fix the proportion v G of variation in the risk factor explained by the genetic variants, weadjusted the residual variance σ in our simulations accordingly. We first specified a value for v G and generated SNP-risk factor associations β Xj as described in the simulation design sectionof the manuscript. The value of σ was then set equal to σ = 1 − v G v G N ref β TX G Tref G ref β X where G ref is the reference genetic matrix. This can be justified by recalling that in oursimulation design, X = P (cid:88) j =1 β Xj G j + α X U + (cid:15) X with U, (cid:15) X ∼ N (0 , σ ). Taking variances, we have that Var( X ) = Var( Gβ X )+2 σ and therefore, v G = Var( Gβ X )Var( X ) = Var( Gβ X )Var( Gβ X ) + 2 σ = N ref β TX G T Gβ X N ref β TX G T Gβ X + 2 σ which yields the previous expression for σ . A similar formula was derived in the appendix ofYang et al. (2012). Additional Simulations - Large G-X Sample Size
To augment the analysis presented in the paper, we compared the various cis-MR methods in arange of additional simulations. This section contains simulation results for a simulation using alarger sample size of N = 100000 to compute summary-level risk factor-outcome associations.This scenario may represent an applied analysis using a downstream biomarker as a proxy forprotein expression, as this would allow researchers to obtain summary-level data from existinglarge-scale GWAS studies whose sample size often ranges in the hundreds of thousands. Thiswas the case for the two real-data applications presented in the main part of our manuscript.We conducted simulations both for the SHBG and for the
HMGCR region, with six causalvariants per region, as in our main simulations. For brevity, we only considered the “stronginstruments” scenario, where v G = 3% for the SHBG region and v G = 2% for the HMGCR region. Otherwise, the simulations reported here were set up in the same way as those reportedin the main part of our paper. Results are reported in Table 5.A notable difference in the results compared to our main simulations was that a larger samplesize resulted in stronger instruments. A tenfold increase in the sample size resulted in a similar,tenfold increase in the values of the F statistics. For a regression using only the six causalvariants, the average F statistic was 514 for the
SHBG and 340 for the
HMGCR region. As aresult, the various cis-MR methods were less affected by weak instruments bias and performedquite well. The performance of the stepwise pruning method was was much more consistent fordifferent values of ρ than in the corresponding simulation with N = 10000. PCA and JAM = 0 θ = 0 . θ = 0 . θ se (ˆ θ ) Type I ˆ θ se (ˆ θ ) Cov Power ˆ θ se (ˆ θ ) Cov SHBG
RegionTop SNP —– -0.001 0.019 0.063 0.050 0.020 0.951 0.710 0.101 0.020 0.952Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.015 0.038 0.051 0.015 0.950 0.909 0.100 0.016 0.957 k = 0 .
999 0.000 0.014 0.040 0.051 0.015 0.944 0.939 0.100 0.015 0.959JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.013 0.038 0.051 0.014 0.952 0.951 0.100 0.014 0.960F-LIML —– -0.001 0.014 0.046 0.051 0.015 0.952 0.919 0.100 0.015 0.961CLR —– —– —– 0.044 —– —– 0.949 0.919 —– —– 0.958
HMGCR
RegionTop SNP —– 0.000 0.019 0.055 0.050 0.019 0.950 0.738 0.100 0.020 0.951Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.017 0.047 0.050 0.017 0.956 0.815 0.100 0.018 0.947 k = 0 .
999 0.000 0.017 0.049 0.050 0.017 0.956 0.819 0.100 0.018 0.948JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.017 0.049 0.050 0.017 0.951 0.825 0.099 0.018 0.942F-LIML —– 0.000 0.017 0.049 0.051 0.017 0.952 0.825 0.100 0.018 0.949CLR —– —– —– 0.049 —– —– 0.947 0.823 —– —– 0.947
Table 5: Simulation results for the “strong instruments” scenario and a large G − X sample size( N = 100000). 31 ere both unaffected by weak instrument bias for θ = 0 .
1. Confidence intervals constructedusing these methods practically attained nominal coverage, and so did the confidence intervalsbased on F-LIML and CLR. A small reduction in power for single-SNP analysis and LD-pruningwith low correlation thresholds were the only real difference between the various methods. Thiswas the case both for the
SHBG and for the
HMGCR region.These results imply that weak instrument bias is less likely to affect analyses based on large G − X sample sizes, and stepwise pruning does not underperform other methods in such analyses. Additional Simulations - Fewer Causal Variants
In our second set of additional simulations, we modified the number of causal variants that werepresent in each region. Focusing on the
SHBG region, we considered two additional scenarios.First, we assumed the existence of only a single causal variant in the region, placing thecausal signal at the variant that had the smallest univariate p-value in our real-data application(rs1799941). That variant was assumed to have an effect of β Xj = 0 .
38 on the risk factor, thesame as that observed in our real SHBG-testosterone dataset.In the second scenario, we generated three independent genetic effects on the risk factor. Theeffects were placed at genetic variants suggested as independently causal by Jin et al. (2012),who analysed genetic associations of variants in the
SHBG region with serum testosterone levels.Genetic effects for the causal variants were drawn randomly according to β Xj ∼ | N (0 , . | + 0 . v G = 3% and a G − X sample size of N = 10000. Simulation results arereported in Table 6.The results closely resembled the ones for our original simulation scenario with “stronginstruments” and six causal variants. All methods were quite accurate for θ = 0. When θ (cid:54) = 0, stepwise pruning was subject to weak instrument bias to some extent, especially for largecorrelation thresholds. The performance of the method depended on the correlation thresholdused. PCA and JAM were more consistent in terms of their tuning parameters and weregenerally quite accurate, while factor-based methods were even more accurate, with a smallinflation of Type I error rates under the null for F-LIML.Most methods performed slightly better with one than with three causal variants, but thedifferences in their performance were small and it seems that the number of causal signals inthe region should have little impact on the choice of which cis-MR method to use. An exceptionrelates to the use of the top-SNP approach, which was expectedly quite accurate when only asingle causal variant existed in the region. With three causal variants, top-SNP analysis wassubject to the same issues as in our original simulations (namely larger standard errors andlower power than other methods), but to a lesser degree. Additional Simulations - Small Reference Dataset
In our third set of additional simulations, we wanted to assess the impact of the referencedataset on the various MR methods. In particular, we assessed whether the use of a smallreference dataset, which would imply inaccurate genetic correlation estimates, affects some cis-MR methods worse than others. We therefore repeated the simulations from the main part ofour paper, but bootstrapped the rows of the UK Biobank matrix and only selected N ref = 1000individuals from which to compute genetic correlations, instead of using the entire UK Biobank = 0 θ = 0 . θ = 0 . θ se (ˆ θ ) Type I ˆ θ se (ˆ θ ) Cov Power ˆ θ se (ˆ θ ) CovOne Causal VariantTop SNP —– 0.001 0.013 0.046 0.051 0.014 0.952 0.966 0.100 0.014 0.937Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.015 0.053 0.050 0.016 0.946 0.882 0.099 0.016 0.924 k = 0 .
999 0.001 0.014 0.051 0.048 0.014 0.952 0.922 0.096 0.015 0.919JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.001 0.013 0.046 0.051 0.014 0.953 0.966 0.100 0.014 0.938F-LIML —– 0.001 0.014 0.061 0.051 0.015 0.944 0.936 0.100 0.016 0.942CLR —– —– —– 0.047 —– —– 0.951 0.928 —– —– 0.955Three Causal VariantsTop SNP —– 0.000 0.016 0.051 0.050 0.016 0.956 0.852 0.098 0.017 0.934Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.015 0.051 0.049 0.015 0.941 0.884 0.096 0.016 0.932 k = 0 .
999 0.000 0.014 0.047 0.047 0.014 0.944 0.905 0.095 0.015 0.925JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.013 0.044 0.049 0.014 0.946 0.948 0.097 0.014 0.929F-LIML —– 0.000 0.014 0.060 0.050 0.014 0.932 0.933 0.100 0.016 0.949CLR —– —– —– 0.049 —– —– 0.940 0.919 —– —– 0.959
Table 6: Simulation results for the
SHBG region with either 1 or 3 causal variants.33 = 0 θ = 0 . θ = 0 . θ se (ˆ θ ) Type I ˆ θ se (ˆ θ ) Cov Power ˆ θ se (ˆ θ ) Cov SHBG
RegionTop SNP —– -0.001 0.019 0.039 0.048 0.019 0.941 0.676 0.095 0.020 0.917Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 0.000 0.015 0.041 0.049 0.015 0.939 0.883 0.097 0.015 0.922 k = 0 .
999 0.000 0.014 0.052 0.048 0.014 0.925 0.904 0.094 0.015 0.919JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 0.000 0.014 0.093 0.047 0.014 0.888 0.885 0.091 0.014 0.843F-LIML —– 0.000 0.014 0.052 0.050 0.014 0.921 0.907 0.099 0.016 0.929CLR —– —– —– 0.040 —– —– 0.929 0.895 —– —– 0.933
HMGCR
RegionTop SNP —– -0.001 0.018 0.047 0.048 0.018 0.951 0.753 0.099 0.019 0.921Pruning ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 . k = 0 .
99 -0.001 0.017 0.047 0.048 0.017 0.950 0.816 0.098 0.018 0.921 k = 0 .
999 0.000 0.016 0.051 0.047 0.017 0.949 0.814 0.097 0.017 0.922JAM ρ = 0 . ρ = 0 . ρ = 0 . ρ = 0 .
95 -0.001 0.017 0.150 0.044 0.017 0.824 0.759 0.092 0.018 0.791F-LIML —– -0.001 0.016 0.052 0.049 0.017 0.941 0.825 0.101 0.019 0.945CLR —– —– —– 0.049 —– —– 0.945 0.810 —– —– 0.942
Table 7: Simulation results for the “strong instruments” scenario, with a small reference sample( N ref = 1000). dataset of N bb = 367643 individuals. In practice, using small datasets as reference data is notuncommon: for example, the 1000 genomes dataset is commonly used for this purpose. Theuse of a small reference dataset does not systematically bias the estimated genetic covariancematrix, as would be the case with population stratification.We used both genetic regions and the “strong instruments” scenario. Again, the simulationdesign was identical to those reported in the main part of the paper, except using a smallerreference dataset. Results are reported in Table 7.The JAM algorithm proved to be more sensitive to the reference dataset. The algorithmadjusts marginal SNP-trait associations using the reference genetic correlations. By using aninaccurate correlation pattern, the algorithm’s adjustment was adversely affected and this re-sulted in selecting more genetic variants than in runs with a larger reference dataset. This wasespecially the case for large values of the correlation threshold, in which case the pre-pruningstep only discards a small number of variants before running JAM. For example, for the SHBG region with θ = 0 and ρ = 0 .
95, the algorithm had a posterior model size of 9 .
27 comparedto 3 .
75 when the entire UK Biobank was used as a reference dataset. This meant that the lgorithm was more susceptible to weak instrument bias, made JAM causal effect estimatesslightly more variable for larger ρ values and increased Type I error rates. For smaller valuesof the correlation threshold the algorithm’s performance was affected less.The performance of LD-pruning also attenuated for large values of its correlation threshold,but this attenuation was in line with what we observed in our original simulations, using a largereference dataset. Likewise, principal components analysis and factor-based methods seemedfairly robust to using a smaller reference dataset.values and increased Type I error rates. For smaller valuesof the correlation threshold the algorithm’s performance was affected less.The performance of LD-pruning also attenuated for large values of its correlation threshold,but this attenuation was in line with what we observed in our original simulations, using a largereference dataset. Likewise, principal components analysis and factor-based methods seemedfairly robust to using a smaller reference dataset.