Comparison between instrumental variable and mediation-based methods for reconstructing causal gene networks in yeast
CComparison between instrumental variable and mediation-basedmethods for reconstructing causal gene networks in yeast
Adriaan-Alexander Ludl and Tom Michoel ∗ November 20, 2020
Computational Biology Unit, Department of Informatics, University of Bergen, PO Box 7803, 5020Bergen, Norway ∗ Corresponding author, email: [email protected]
Abstract
Causal gene networks model the flow of information within a cell. Reconstructing causal networksfrom omics data is challenging because correlation does not imply causation. When genomics andtranscriptomics data from a segregating population are combined, genomic variants can be used to ori-ent the direction of causality between gene expression traits. Instrumental variable methods use a localexpression quantitative trait locus (eQTL) as a randomized instrument for a gene’s expression level,and assign target genes based on distal eQTL associations. Mediation-based methods additionally re-quire that distal eQTL associations are mediated by the source gene. A detailed comparison betweenthese methods has not yet been conducted, due to the lack of a standardized implementation of differ-ent methods, the limited sample size of most multi-omics datasets, and the absence of ground-truthnetworks for most organisms. Here we used Findr, a software package providing uniform implemen-tations of instrumental variable, mediation, and coexpression-based methods, a recent dataset of 1,012segregants from a cross between two budding yeast strains, and the YEASTRACT database of knowntranscriptional interactions to compare causal gene network inference methods. We found that causalinference methods result in a significant overlap with the ground-truth, whereas coexpression did notperform better than random. A subsampling analysis revealed that the performance of mediation sat-urates at large sample sizes, due to a loss of sensitivity when residual correlations become significant.Instrumental variable methods on the other hand contain false positive predictions, due to genomiclinkage between eQTL instruments. Instrumental variable and mediation-based methods also havecomplementary roles for identifying causal genes underlying transcriptional hotspots. Instrumentalvariable methods correctly predicted
STB5 targets for a hotspot centred on the transcription factor
STB5 , whereas mediation failed due to Stb5p auto-regulating its own expression. Mediation sug-gests a new candidate gene,
DNM1 , for a hotspot on Chr XII, whereas instrumental variable methodscould not distinguish between multiple genes located within the hotspot. In conclusion, causal infer-ence from genomics and transcriptomics data is a powerful approach for reconstructing causal genenetworks, which could be further improved by the development of methods to control for residualcorrelations in mediation analyses and genomic linkage and pleiotropic effects from transcriptionalhotspots in instrumental variable analyses. 1 a r X i v : . [ q - b i o . M N ] N ov Introduction
Causal gene networks model the flow of information from genotype to phenotype within a cell orwhole organism [1–4]. Reconstructing causal networks from omics data is challenging because cor-relation does not imply causation. However, when genomics and transcriptomics data from a largenumber of individuals in a segregating population are combined, genomic variants can be used toorient the direction of causality between gene expression traits. This is based on the fact that allelesare randomly segregated during meiosis and genotypes remain fixed during an individual’s lifetime,such that genomic variants act as causal anchors from which all arrows are directed outward [1, 2, 5].Moreover, local and distal expression quantitative trait locus (eQTL) associations have biologicallydistinct interpretations, because genomic variation at regulatory DNA elements leads to altered tran-scription of nearby genes by cis -acting, epigenetic mechanisms, whereas distal associations must beintermediated by trans -acting factors [6, 7]These principles are combined in different ways in two classes of causal inference methods that usegenomic variants as causal anchors: instrumental variable (known as
Mendelian randomization ingenetic epidemiology) or mediation -based [8].
Mediation infers the direction of causality between twotraits that are statistically associated to the same genomic variant by testing whether the associationbetween the variant and one of the traits is mediated by the other trait, in which case there must be acausal relation from the mediating trait to the other one [9, 10]. Mediation does not require that one ofthe traits has a “preferential” relation to the genomic variant (as in cis or trans ). However, mediationfails in the presence of high measurement noise or hidden confounders, such as common upstreamfactors coregulating both traits, where it rejects true interactions (i.e. reports false negatives) [11]. Instrumental variable or Mendelian randomization methods assume that the genomic variant actsas a randomized “instrument” for one of the traits, similar to the random assignment of individualsto treatment groups in randomized controlled trials, such that a statistical association between thevariant and the second trait is evidence for a causal relation from the first to the second trait. Therandom group assignment, in genetics the random segregation of alleles, ensures that causal effectscan be detected even in the presence of confounding. However, instrumental variable methods fail ifthere are pathways from the variant to the second trait other than through the first trait (pleiotropiceffects) [12–14].A detailed comparison between these two approaches requires a standardized implementation wherepre-processing (e.g. data normalization) and post-processing (e.g. multiple testing correction) arehandled uniformly. Previously, we developed
Findr , a computationally efficient software package im-plementing six likelihood ratio tests that can be combined in multiple ways to reconstruct instrumentalvariable as well as mediation-based causal gene networks [11].
Findr expresses the result of each testas a posterior probability (one minus the local false discovery rate), allowing tests to be combined bythe usual rules of probability theory [10]. This results in causal network inference methods that arerepresentative for the broader field. For instance, the implementation of the mediation-based methodin
Findr is identical to the method of Chen et al. [10], which had its roots in the “likelihood-basedcausal model selection” (LCMS) procedure of Schadt et al. [9]. The Causal Inference Test (CIT)software [15, 16] is another implementation of an LCMS-based mediation method, which combinesstatistical tests using omnibus p -values and FDR estimates. We found previously that it results insimilar inferences as the mediation-based method implemented in Findr [11]. Instrumental variablemethods on the other hand are based on genetic associations, for which
Findr uses categorical re-gression of gene expression profiles on genotype values, similar to for instance the ANOVA option inMatrix-eQTL [17]. 2sing simulated data from the DREAM5 Systems Genetics challenge [18, 19], we found previouslythat instrumental variable methods generally outperformed mediation-based methods in terms of areaunder the precision-recall curve, and that the performance of mediation-based methods decreased withincreasing sample size, due to increased statistical significance of confounding effects [11]. However,at that time, no real-world dataset with sufficient sample size as well as an accurate ground-truthnetwork of causal interactions was available to test these predictions in a real biological system.Fortuitously, a dataset has now become available of genomic variation and gene expression data inmore than 1,000 segregants from a cross between two strains of budding yeast, a popular eukary-otic model organism [20]. By learning networks from these data, and comparing against the wealthof transcriptional regulatory interactions and other functional validation data available for buddingyeast [21], a thorough benchmarking of methods for reconstructing causal gene networks has becomepossible. cis -eQTLs
Using the data on expression quantitative trait loci (eQTLs) from [20], we selected the strongest cis acting eQTLs for 2884 genes. The eQTLs were ranked in descending order according to the absolutevalue of the correlation coefficient between scaled expression levels and marker genotype ( r , obtainedfrom [20, Source data 4]), and for each gene the highest ranked eQTL was retained. Among theselected eQTLs 2044 occured once, 337 eQTLs were strongest for two genes, 44 were strongest forthree genes, 6 were strongest for four genes, 2 were strongest for five genes. We used the inference methods implemented in
Findr [11]. The source code is available at https://github.com/lingfeiwang/findr . The test P only uses gene expression data. For the other tests( P , P , P , P ), we used the genotype and gene expression data from [20] (see section 2.7 below fordetails) with cis -eQTLs as causal anchors for the inference tests. Composite tests are obtained byelement-wise multiplication of the matrices containing the results of individual tests. The Precision-Recall curves and area under the curve (AUPR) for interactions predicted by a given testwere computed using the scikit-learn package [22] and three ground-truth matrices (see Data section2.7). Recall is equivalent to the true positive rate (TPR), i.e. the number of true positive predictionsas a fraction of all known positive interactions in the network. Precision or positive predictive valueis 1 − FDR where FDR is the global false discovery rate.AUPR-ratio or fold-change is the AUPR divided by the theoretical value for random predictions on agiven ground truth. The latter is obtained as the precision for random predictions given by prec random = N E /( N R ∗ N T ) where N R is the number of regulating genes, N T is the number of target genes, N E is thenumber of edges, i.e. the number of ones in the ground-truth adjacency matrix.3 .4 Subsampling We performed subsampling on the segregants to evaluate the change in performance of our inferencemethods on various sample sizes. Four subsamples of randomly selected segregants were drawn forthe following sizes: 10, 100, 200, 400, 600, 800 and 1,000. The inference methods were run on eachsample. We report the average AUPR and its statistical standard deviation over the four subsamplesin Fig. 4.
We computed the covariance matrix of the genotypes at the retained eQTLs for all 1,012 segregants(Fig. 5). The rows and columns of the matrix were reordered according to the genome position of theeQTL, the ordering algorithm is described below in 2.7.
Findr posterior probability matrices were thresholded to obtain discrete networks with an expectedFDR target value as described previously [10, 11]: because for each interaction the local false dis-covery rate is given by fdr = − p , where p is the posterior probability value obtained by the test, theexpected FDR of a network consisting of all interactions with p ≥ p th is the average of the local fdrof the retained interactions. We determined p th as the threshold that gave the greatest expected FDRbelow the target value (5 or 10 %). We counted the number of targets for each source gene whose p > p th . The inferred regulatory relationships for the thresholds reported in Tab. 2 for the causal tests ( P P , P , P P , P ) and scripts to reproduce the analysis are provided in the repository https://github.com/michoel-lab/FindrCausalNetworkInferenceOnYeast . Running all Findr inference tests on thedata from [20] takes about 10 to 15 seconds on a typical desktop computer. We used gene expression data for 5,720 genes and genotypes for a panel of 1,012 segregants fromcrosses of one laboratory strain (BY) and a wine strain (RM) from [20]. Batch and optical density(OD) effects, as given by the covariates provided in [20], were removed from the expression datausing categorical regression, as implemented in the statsmodels python package [23]. The paper alsoprovides data on expression quantitative trait loci (eQTLs) that was used to select the strongest cis -eQTLs, as well as a file with annotations to the 102 hotspots that they identified.For validation we used networks of known transcriptional regulatory interactions in yeast (S. cere-visiae) from Y
EASTRACT [21]. Regulation matrices were obtained from . We retrieved the full ground-truth matrices containing all reported inter-actions of the following types from the Y
EASTRACT website:
DNA binding evidence was used as the“Binding”, expression evidence including TFs acting as activators and those acting as inhibitors wasused as the “Expression”,
DNA binding and expression evidence was used as the “Binding & Expres-sion”. Self regulation was removed from all ground truths. The numbers of regulators, targets andinteractions for these three ground-truth networks are shown in Tab. 1.4round-Truth Network N R N T N E N sE Binding 90 5,151 19,099 28Binding&Expression 80 3,394 5,680 24Expression 113 5,369 92,646 77
Table 1:
Properties of the Y
EASTRACT ground-truth networks. N R is the number of regulating genes, N T isthe number of target genes, N E is the number of edges excluding self-edges, N sE is the number of self-edges.Data was retrieved from Y EASTRACT [21].
Annotations of the yeast genome were used to map gene names to their identifiers and order themaccording to the position of their causal anchor (eQTL) along the full genome, first by chromosomeand then by position along the chromosome. The sorting algorithm places mitochondrial genes first(when present) and orders the chromosomes according to the numerical value of the roman numerals.We used the gff3 file (
Saccharomyces_cerevisiae.R64-1-1.83.gff3.gz ) from the Ensembldatabase (release 83, December 2015), [24], which is the version used by [20]. The file is accessibleat ftp://ftp.ensembl.org/pub/release-83/gff3/saccharomyces_cerevisiae/ . We used the software
Findr [11] to reconstruct causal and non-causal gene networks in yeast froma dataset of genomic variation and expression data for 5,720 genes in 1,012 segregants from a crossbetween two strains of budding yeast [20]. 2,884 genes had an associated genomic causal anchor,here defined as the variant most strongly associated to the gene and present in the list of genome-widesignificant eQTLs whose confidence interval (of variable size) overlaps with an interval covering thegene, 1,000 bp upstream and 200 bp downstream of the gene position [20].
Findr implements sixlikelihood ratio (LLR) tests between triplets ( E , A , B ) , where A and B are genes, and E is the causalanchor for A . For each test i , Findr outputs the posterior probability P i of the selected hypothesis beingtrue (Fig. 1A). These posterior probabilities can then be combined to obtain the posterior probabilitiesof various compound hypotheses being true. Here we considered four causal tests and one non-causaltest to reconstruct directed gene networks:• Mediation.
Mediation-based approaches infer a causal interaction A → B if gene B is statisticallyassociated to the causal anchor E , and the association is abolished after conditioning on gene A [9,10,16]. In Findr this is accomplished by the compound hypothesis that test 2 and 3 are bothtrue, i.e. by the posterior probability P × P . Mediation can distinguish true positive (TP) fromtrue negative (TN) causal interactions in the absence of hidden confounders, but will report afalse negative (FN) if a real causal interaction is confounded by a hidden factor (Fig. 1B, row1), due to a collider effect [10, 11].• Instrumental variables without pleiotropy.
Instrumental variable approaches assume that thecausal anchor E acts as a randomized instrument for gene A , and, in their simplest form, infer5 igure 1: A. Likelihood ratio (LLR) tests implemented in Findr. E is a causal anchor of gene A . Arrowsin a hypothesis indicate directed regulatory relations. Genes A and B each follow a normal distribution, whosemean depends additively on its regulator(s), as determined in the corresponding hypothesis. The dependencyis categorical on genotypes and linear on gene expression levels. The undirected line represents a multi-variatenormal distribution between the relevant variables. In each test, either the null or the alternative hypothesis isselected, as shown. Figure ©2017 Wang, Michoel, reproduced by permission from [11] under Creative Com-mons Attribution License. B. Causal model selection with Findr.
By combining the posterior probabilities P i of the selected hypothesis for test i being true, Findr determines whether coexpressed genes A and B areconnected by a causal A → B relation. (Row 1) In the absence of hidden confounders ( H ), mediation-basedcausal inference, combining Findr tests 2 and 3, correctly identifies true positive (TP; correlation due to causal A → B relation) and true negative (TN; correlation without causal A → B relation) models. However, it reportsa false negative (FN) if the causal relation is affected by a hidden confounder. (Row 2) If the causal anchor is“exclusive” to gene A , then the instrumental variable method based on Findr test 2 correctly identifies TP andTN models, even in the presence of hidden confounding. However, it reports a false positive (FP) if the asso-ciation between E and B is due to other paths than through A (pleiotropy). (Row 3) An instrumental variablemethod that combines Findr tests 2 and 5 correctly identifies a true negative if the correlation between A and B is entirely due to a pleitotropic effect of E , but will still report a false positive if there is an additional effectfrom a hidden confounder. (Row 4) An instrumental variable method based on the compound hypothesis thattest 4 is true, or test 2 and test 5 are true, reports a TP for causal relations where E → A → B is not the only pathfrom E to B , with or without confounding, but will report a FP if the true causal relation is B → A (or absent). a causal interaction A → B if gene B is statistically associated to the causal anchor E , i.e. by theposterior probability P that test 2 is true. Instrumental variables can distinguish true positivefrom true negative causal interactions even in the presence of hidden confounders, but willreport a false positive (FP) if there are other pathways than through A that cause a statisticalassociation between E and B (pleiotropy) (Fig. 1B, row 2).• Instrumental variables with perfect pleiotropy.
To address the problem of pleiotropy, we can ad-ditionally require that genes A and B are not independent after conditioning on E , accomplished6y the compound hypothesis that test 2 and 5 are both true, i.e. by the posterior probability P × P . This correctly identifies a true negative if E explains all of the correlation between A and B , but will still result in a false positive if there is a hidden confounder (Fig. 1B, row 3).• Instrumental variables with partial pleiotropy.
To overcome the problem of FP predictions inthe “confounded pleiotropy” situation, we introduced test 4 in
Findr , which tests whether gene B is not independent of E and A simultaneously, and found empirically that the combination P = ( P P + P ) performs better than P × P alone [11]. In particular, it identifies a TP forcausal A → B relations even in the presence of alternative E → B paths and hidden confounding,at the expense of FP predictions when the relation is reversed or absent (Fig. 1B, row 4).• Coexpression.
As a basic reference, we reconstructed a gene network based on coexpressionalone, using
Findr test 0. Note that the posterior probability P is not symmetric ( P ( A → B ) ≠ P ( B → A ) ), because it is estimated from the observed distribution of LLR test statistics for each A separately [11].To illustrate the differences between coexpression, instrumental variable, and mediation-based genenetworks, we considered the sub-networks inferred between the 2,884 genes that had a causal anchor(i.e. the sub-network where the probability of an edge can be estimated for both edge directions).As expected, the coexpression network ( P ) is largely symmetric (Fig. 2, left), whereas the causalinstrumental variable ( P , Fig. 2, center) and mediation-based ( P P , Fig. 2, right) networks show aclear asymmetric structure with some genes having a very large number of high-confidence targets.These genes correspond to transcriptional hotspots, regions of the genome with a large, genome-wideeffect on gene expression [20]. The overall structure of the causal networks appears consistent withthe general considerations above. The overall signal (strength of posterior probabilities) is weakerin the mediation-based network, consistent with an increased false negative rate (Fig. 2, right). Onthe other hand, the instrumental variable network appears to have a genomic structure, where nearbygenes are mutually connected and have a similar target profile (Fig. 2, middle). This could be due togenomic linkage between causal anchors: if two genes A and A ′ share the same or highly correlatedinstruments E and E ′ , then their predicted target sets would also be very similar, and probably includea large proportion of false positive predictions for either gene. We assessed the performance of networks predicted by Findr on three ground-truth networks of tran-scriptional regulatory interactions in yeast, where targets of a transcription factor (TF) are definedby TF-DNA binding interactions (“Binding” network), differential expression upon TF perturbation(“Expression” network), or the intersection of them (“Binding & Expression” network) (see Methodsand Table 1). The precision-recall curves for the four causal inference methods showed the charac-teristic peak of high precision at low recall indicative of an enrichment of true positives among thepredictions with highest posterior probabilities, and confirmed by increased area under the precision-recall curve (AUPR) compared to random predictions (Figure 3). This was markedly the case forthe Binding & Expression ground-truth, with AUPR more than 1.3 times higher than random. Thisis consistent with the notion that genes that are bound by a TF as well as differentially expressedupon TF perturbation are more likely to be real TF targets, that is, that the Binding & Expressionground-truth is of higher quality than the others. Differences between causal inference methods were7 igure 2:
Matrices of predicted gene interactions.
These square matrices represent the interactions between2884 genes with causal anchors (eQTLs), posterior probability values are color coded. Vertical bands corre-spond to hotspots.
Left : The correlation based test P . Center : The instrumental variable test P . Right : Themediation test P P . The genes are ordered according to the position of their causal anchor in the full yeastgenome. See Sup. Fig. S1 for the instrumental variable tests P P and P .Figure 3: Performance of causal inference on Y
EASTRACT ground truths.
Precision-recall curves for fourcausal inference methods ( P P , P , P P , P ) and one coexpression method ( P ) are shown for the Binding (left),Binding and Expression (center) and Expression (right) ground-truth networks. The insets show the area underthe precision-recall curves (AUPR) as the the “fold change” relative to the baseline performance for randompredictions. The horizontal red line shows the baseline performance for random predictions and is used asreference for AUPR fold change (insets). The network inference methods are described in Section 3.1. modest, with instrumental variable methods ( P , P P , P ) showing somewhat better performance thanthe mediation-based method ( P P ) on the Binding and Binding & Expression ground-truths, and viceversa on the Expression ground-truth (Figure 3). In contrast to the causal inference methods, thecoexpression-based method ( P ) did not show any improvement over random predictions. This isnot surprising. An unbiased evaluation of 35 diverse methods for network inference from expressionalone did not find any improvement over random predictions on a comparable ground-truth networkfor yeast [25]. 8 .3 The performance of mediation saturates at large sample size The availability of more than 1,000 segregants in the genotype and gene expression dataset allowedus to evaluate the performance of network inference across sample sizes by random subsampling ofthe data. The clearest pattern was again observed for the Binding & Expression ground-truth, con-sisting of the most reliable known transcriptional regulatory interactions, where the three instrumentalvariable methods ( P , P P , P ) showed a monotonous increase in AUPR with increasing sample size(Fig. 4). The mediation based method ( P P ) initially showed a similar performance as the instru-mental variable methods, but saturated above 400 samples when accounting for statistical error anddropped from having the highest to the lowest average performance of all causal inference methods.The same pattern is also observed on the Binding ground-truth, albeit in a less pronounced way,presumably due to lower AUPR values relative to random predictions for all methods.These results are consistent with previous work on simulated data, where we observed a decreasein performance with increasing sample size for mediation-based methods [11]. There we showedthat hidden confounders and measurement noise can lead to a residual correlation between the causalanchor E and a target gene B after adjusting for the regulatory gene A (cf. Fig. 1). At sufficiently largesample size, this residual correlation becomes significantly different from zero and thereby leads to afalse negative prediction.Sample size showed little effect on the coexpression method P for sample sizes larger than 400 forall ground truths. This is consistent with the notion that estimates of correlations will stabilize aroundtheir true values at smaller sample sizes than estimates of causal effects. Figure 4:
Performance of causal inference across sample sizes.
AUPR fold change values for four causalinference methods ( P P , P , P P , P ) and one coexpression method ( P ) (see Section 3.1) at various samplesizes for the Binding (left), Binding & Expression (center) and Expression (right) ground-truth networks. Foursamples were randomly drawn from the expression data and evaluated with each test. Error bars represent thestandard deviation across the four subsets. The horizontal grey line indicates the level of random predictions.The fold change is relative to the baseline performance for random predictions. Next, we assessed the extent to which instrumental variable methods are affected by genomic link-age between causal anchors which would lead to false positive predictions due to (real or apparent)pleiotropic effects (cf. Fig. 1). For instance for the P method, if two genes in the same genomic9eighbourhood have causal anchors with strongly correlated genotype values, the method would pre-dict them to have similar sets of target genes.To perform the analysis, we first truncated the posterior probability values in order to obtain discrete,directed networks. Thresholds were determined to obtain networks with an expected FDR ≤
5% (forthe instrumental variable methods) or ≤
10% (for the mediation-based method) (see Methods section).The larger FDR value for mediation was chosen to counterbalance its increased false negative rate,and resulted in posterior probability thresholds that were comparable between all methods (Table 2).Test p th FDR N R N T N E ρ P P P P P P Table 2:
Properties of thresholded predicted networks.
We report the thresholds ( p th ) used to select signifi-cant interactions for the four causal inference methods, the corresponding global False Discovery Rate (FDR),as well as descriptors for the resulting networks: N R is the number of regulating genes, N T is the number of tar-get genes, N E is the number of edges, and ρ is the filling ratio of the adjacency matrix, i.e. the ratio of non-zeroand zero values in the thresholded matrices. Despite the similar posterior probability thresholds, the instrumental variable networks are aroundten times more densely connected than those obtained by the mediation-based method (Table 2); adifference that cannot be explained by the lower sensitivity of the latter alone. We show that in theinstrumental variable networks, high interaction counts occur in blocks that roughly follow the struc-ture of the causal-anchor genotype covariance, whereas they occur more in spikes in the mediationnetwork (Fig. 5). This becomes apparent when plotting the number of targets for each regulatory gene(i.e. each gene with a significant cis -eQTL) versus its position on the genome. In instrumental vari-able methods, the genomic causal anchor is used as a “proxy” for the regulatory gene. Hence, if thecausal anchor genotypes of two genes within the same locus are correlated due to genomic linkage,then their target sets will unavoidably be similar as well, resulting in the pattern observed in Fig. 5.In mediation-based methods, the expression profile of the regulatory gene is used as the mediator (intest 3, cf. Fig. 1A), and hence a target set will be specific to a regulator, even when its causal anchoris correlated or shared with other genes.
Regions of the genome that are statistically associated with variation in expression of a high number ofgenes (the peaks in Fig. 5B) are called transcriptional “hotspots”, and finding the causal genes under-lying a hotspot is an important problem in quantitative genetics [26]. Albert et al. [20] identified 102hotspot loci using their data, and developed a fine-mapping strategy to narrow the confidence intervalsfor the hotspot locations. Overlaying the hotspot markers (median bootstrap hotspot locations [20])with the P target counts in the 5% FDR network (Fig. 5B) showed good consistency, as expected;37 of those hotspot markers were in our list of causal anchors (i.e. strongest local eQTL for at leastone gene). Albert et al. [20] defined candidate causal hotspot genes as the genes located within thefine-mapped hotspot regions, and for 26 hotspots they obtained three or fewer candidate genes. Here10 igure 5: Hotspots and genotype covariance. A.
The counts of significant interactions for the mediation-basedmethod P P at FDR below 10%, with annotations for eight regulatory genes with more than 1,000 targets. B. The counts of significant interactions for the instrumental variable method P at FDR below 5% (in grey),and the number of non-zero effects for 102 hotspot markers from [20] (in black); the subset of these hotspotsthat are also a causal anchor (i.e. the strongest local eQTL for at least one gene) for the Findr analysis aremarked in green and are also indicated by diamonds at the top of the panel. Interaction count plots for the otherinstrumental variable methods are in Supp. Fig. S3. C. The diagonal of the genotype covariance matrix for the2884 causal anchor eQTLs. Genes are ordered along the horizontal axis according to the position of their causalanchor in the yeast genome. we illustrate how causal gene network inference can contribute to the identification of causal hotspotgenes using two representative examples
STB5 and
DNM1 . STB5 is a transcription activator of multidrug resistance genes [27], and the only gene located in onehotspot region on chromosome VIII. The hotspot marker, chrVIII:459310_C/G, is located 11 bp up-stream from
STB5 , and is the causal anchor for
STB5 and for no other genes (Fig. 5B and Fig. 6 Left).The instrumental variable method P predicted 131 targets at FDR below 5% for STB5 , which arestrongly enriched for
STB5 targets in the Binding (hypergeometric p -value 2 . ⋅ − ) and Binding& Expression (hypergeometric p -value 1 . ⋅ − ) ground-truth networks. This suggests that when ahotpspot can be confidently mapped to a single gene, instrumental variable methods predict biolog-ically plausible target sets confirming the candidate causal hotspot gene. In contrast, the mediation-based method P P predicted only nine STB5 targets at FDR below 10%, with no enrichment in theground-truth networks. A possible mechanism that could explain the loss of sensitivity of mediationin this case was already suggested by Albert et al. [20]:
STB5 does not show allele-specific expres-sion, but carries protein-altering variants between the two yeast strains that were crossed, suggestingthat the causal variants in this hotspot act by directly altering Stb5p protein activity; moreover Stb5pis predicted to target its own promoter in Y
EASTRACT . Taken together, this leads to a model wheretranscription of
STB5 is a noisy measurement of Stb5p level, that does not block the path from the11rotein-altering variants to
STB5 target genes via Stb5p protein level (Supp. Fig. S4). Hence condi-tioning on
STB5 transcription in P P does not remove the association between these variants and thetarget genes completely, resulting in false negative predictions by a process similar to the measurementnoise model studied in [11]. DNM1 is a gene located near a hotspot region on chromosome XII, and is among the genes withhighest target count in the mediation-based network (Fig. 5A, Fig. 6 Right). The hotspot marker isalso the causal anchor of
DNM1 , which is located 11,363 bp downstream of this marker and outsidethe hotspot region mapped by Albert et al.
Comparison with the target counts in the instrumentalvariable network, which closely follow the genotype covariance pattern, shows that
DNM1 is thegene in this region that retains the most targets by far in the mediation network ( P , 2910 targets; P P , 1421 targets; Fig. 6 Right). This is particularly true when compared with two of the fourcandidate causal genes of Albert et al. that also have a local eQTL within the hotspot region, YLL007C (also known as
LMO1 ) ( P , 2846 targets, P P , 139 targets) and MMM1 ( P , 3610 targets; P P , 8targets). Based on the high specificity of the P P test, we conjecture that DNM1 is a more likelycausal gene for this hotspot than
LMO1 or MMM1 . Functional analysis in this case does not helpto distinguish between these candidates, because Dnm1p and Mmm1p are both essential proteinsfor the maintenance of mitochondrial morphology [28], and Lmo1p is a signaling protein involved inmitophagy [29]. However, deletion of
DNM1 and
MMM1 results in distinct mitochondrial phenotypes[28], and hence this prediction is experimentally testable in principle.
Reconstructing transcriptional regulatory networks from transcriptomics data has been a major re-search focus in computational biology during the last 20 years. Existing methods span the entirerange of correlation, mutual information, regression, Bayesian networks, random forest, and othermachine learning methods, as well as meta-methods combining multiple of these aproaches [19]. Yet,performance on eukaryotic gene expression data has been disappointing, with overlap between pre-dicted and known networks generally not better than random predictions [19]. To some extent, thisis due to the lack of reliable ground-truth data. For instance, there is little overlap between the twomost common high-throughput experimental techniques for measuring regulatory interactions, map-ping TF-DNA binding sites using ChIP-sequencing and measuring differential expression after TFdeletion or overexpression [30], see also Table 1. Exceptions to this rule are the transcriptional net-works controlling early development in multi-cellular organisms, which are mapped in exquisite detailin some model organisms [31]. When conventional network inference methods are applied to devel-opmental transcriptome data, good performance is in fact observed [32]. Nevertheless the problem ofreconstructing signalling transcriptional networks from observational expression data remains, and akey missing ingredient in existing approaches is the directionality of the edges. Without additionalinformation, any association inferred from transcriptomics data alone is essentially symmetric.Causal inference is designed to reconstruct truly directed networks, by integrating genomics and tran-scriptomics data based on general principles of quantitative trait locus analysis [1]. The publication ofa dataset of more than 1,000 yeast segregants has allowed us to demonstrate that causal inference in-deed results in directed networks with strong, non-random overlap with networks of known transcrip-12 igure 6:
Details of predicted targets in the vicinity of two genes.
We show the local structure at two genes:
STB5 with eQTL chrVIII:459310_C/G (left) and
DNM1 with eQTL chrXII:136527_T/C (right) . The top row shows genotype covariance in the vicinity of the eQTL (red line) for the gene, in the region where the covarianceis greater than 0 .
8. The middle row shows number of targets predicted by P (instrumental variables) at FDRbelow 5%. The bottom row shows number of targets predicted by P P (mediation) at FDR below 10%. Thehorizontal axis gives the position along the chromsome of the eQTL corresponding to each gene. Genes areannotated with their short name where available. Note that data points overlap in genotype covariance and in P for some genes because they share the same eQTL and that P P gives no targets on certain genes. tional interactions. Moreover, the overlap was highest for the most reliable ground-truth that combinedtwo sources of experimental evidence (DNA binding and response to perturbation). Although 1,012samples for an organism with around 6,000 genes may seem a large number, our analysis also showsthat there is no sign yet that performance is saturating as a function of sample size. Causal inferenceindeed requires larger sample sizes than coexpression-based methods, because it relies on more subtlepatterns in the data, something that was already apparent in early considerations of causal inferencein this context [5].Although the integration of genomics and transcriptomics data addresses the key shortcoming pertain-ing to lack of directionality in network inference when using transcriptomics data alone, importantlimitations remain. Apart from those already discussed at length in this paper—low sensitivity dueto hidden confounders for mediation-based methods, and increased false positive rate due to genomiclinkage for instrumental variable methods—another fundamental problem remains: transcriptionalregulation is, for the most part, carried out by proteins. Hence, causal interactions inferred from ge-13omics and transcriptomics data are by definition indirect. If an intermediate, unmeasured protein C (e.g., the protein product of A ) lies on the path from gene A to gene B , that is, A also mediatesassociations between C and local eQTLs for A , then this does not affect the causal inference for theinteraction A → B (Supp. Fig. S2). However, variation in transcription level of a transcription factor(or other regulatory protein) does not always translate to equal variation in protein level, and viceversa . For instance, Albert et al. [20] found several protein-altering variants in candidate causal genesmapped to hotspot regions that did not have any local eQTLs. In such cases, our methods wouldwrongly assign the trans -associated target genes to a gene with local eQTL (if one exists), and missthe non-varying (at transcription level) causal gene. This limitation can only be addressed by integrat-ing another layer of information—proteomics data, which is not yet available in comparable samplesizes.The methods implemented in Findr and analyzed in this paper are broadly representative of the currentstate-of-the-art for causal inference from genomics and transcriptomics data. Nevertheless, some ideashave been proposed recently that we did not evaluate here. For instance, in addition to the statisticaltests implemented in
Findr , Badsha and Fu [33] propose to also use a causal anchor for the target gene B to obtain evidence for a causal interaction A → B . However, including this test requires limiting theanalysis to interactions where both source and target genes have a significant eQTL. Yang et al. [34]on the other hand propose to address the hidden confounder problem in mediation by adjusting forselected surrogate variables (e.g. principal components). However, such variables are necessarilycomposed of combinations of genes and it is challenging to ensure that they only represent commonparents and no common children of an A → B interaction (which would introduce false positives ifconditioned upon, see Supp. Fig. S2). It will be of interest to include these developments in futurecomparisons. Causal inference is in essence a hypothesis-driven approach: the causal diagrams in Figure 1 encodeprior knowledge and assumptions of how genotypes, genes, and unknown confounding factors influ-ence each other. Based on these diagrams, we can make certain predictions about the patterns weexpect to find in the data, such as the relative sensitivity and specificity of mediation versus instru-mental variable methods, the different situations where each method will be successful or not, etc. It isgratifying to see these predictions confirmed using real data, strengthening significantly our previousfindings on simulated data [11].The hypothesis-driven nature of causal inference lies in between the use of biophysical models ofgene regulation and the application of unsupervised machine learning methods for reconstructinggene regulatory networks. Biophysical approaches attempt to include quantitative models of TF-DNAinteractions into the network inference process [35, 36], but are hampered by a lack of resolution inomics data (due to both noise within a sample, and limited sampling density). Unsupervised machinelearning approaches search for non-random patterns within the data, but without prior specificationof the type of pattern being sought, and they lack the ability to identify truly directed associations.Supervised or semi-supervised methods could potentially overcome this limitation, but require dataof known interactions for training, which is sparse for non-model organisms [37].The agreement between theoretical predictions and empirical results indicates that we have a correctunderstanding of how causal gene network inference algorithms work, and how to interpret results interms of what type of interactions these algorithms do and do not identify, albeit without any reference14o the underlying biophysical mechanisms.
We conclude this paper by sharing practical recommendations for researchers wanting to apply causalinference methods for the integration of genomics and transcriptomics data.In general, we recommend instrumental variable over mediation-based methods, as their increasedsensitivity tends to outweigh the higher specificity of mediation-based methods. The saturation ofperformance of mediation-based methods with increasing sample sizes is particularly worrying, al-though for most current datasets the point where performance saturates is probably not yet reached.We found limited differences between instrumental variable methods. In the absence of any ground-truth data to evaluate results, we would generally recommend to use the P P method, because it willremove at least the most obvious cases of pleiotropy from P , while having an easier interpretationthan the P method.The main weakness of instrumental variable methods is their susceptibility to false positive predictionsdue to genomic linkage. This is a particular concern in data from experimental crosses or inbredorganisms, where linkage blocks are large. However, also in human data it has been found that around10% of non-redundant local eQTLs are associated to the expression of multiple nearby genes [38]. Asillustrated, mediation-based causal inference and manual analysis of gene function can sometimes beused to resolve linkage of causal anchors.In conclusion, causal inference from genomics and transcriptomics data is a more powerful approachfor reconstructing causal gene networks than using transcriptomics data alone, which could be fur-ther improved by the inclusion of additional layers of omics data and the development of methodsto control for or find signal in residual correlations among genes in mediation analyses, and to re-solve genomic linkage and pleiotropic effects from transcriptional hotspots in instrumental variableanalyses. References [1] Jansen RC and Nap JP. Genetical genomics: the added value from segregation.
Trends inGenetics :388–391 (2001).[2] Rockman MV. Reverse engineering the genotype–phenotype map with natural genetic variation. Nature :738–744 (2008).[3] Schadt EE. Molecular networks as sensors and drivers of common human diseases.
Nature :218–223 (2009).[4] Boyle EA, Li YI and Pritchard JK. An expanded view of complex traits: from polygenic toomnigenic.
Cell :1177–1186 (2017).[5] Li Y et al.
Critical reasoning on causal inference in genome-wide linkage and association studies.
Trends in Genetics :493–498 (2010).[6] Albert FW and Kruglyak L. The role of regulatory variation in complex traits and disease. Nature Reviews Genetics :197–212 (2015).157] Pai AA, Pritchard JK and Gilad Y. The genetic and mechanistic basis for variation in generegulation. PLoS Genet :e1004857 (2015).[8] Hemani G, Tilling K and Smith GD. Orienting the causal relationship between impreciselymeasured traits using GWAS summary data. PLoS Genetics :e1007081 (2017).[9] Schadt EE et al. An integrative genomics approach to infer causal associations between geneexpression and disease.
Nature Genetics :710–717 (2005).[10] Chen LS, Emmert-Streib F and Storey JD. Harnessing naturally randomized transcription toinfer regulatory relationships among genes. Genome Biology :R219 (2007).[11] Wang L and Michoel T. Efficient and accurate causal inference with hidden confounders fromgenome-transcriptome variation data. PLoS Computational Biology :e1005703 (2017).[12] Davey Smith G and Ebrahim S. ‘mendelian randomization’: can genetic epidemiology con-tribute to understanding environmental determinants of disease? International Journal of Epi-demiology :1–22 (2003).[13] Lawlor DA et al. Mendelian randomization: using genes as instruments for making causalinferences in epidemiology.
Statistics in Medicine :1133–1163 (2008).[14] Davey Smith G and Hemani G. Mendelian randomization: genetic anchors for causal inferencein epidemiological studies. Human Molecular Genetics :R89–R98 (2014).[15] Millstein J et al. Disentangling molecular relationships with a causal inference test.
BMCGenetics :23 (2009).[16] Millstein J, Chen GK and Breton CV. cit: hypothesis testing software for mediation analysis ingenomic applications. Bioinformatics :2364–2365 (2016).[17] Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics :1353–1358 (2012).[18] Pinna A et al. Simulating systems genetics data with SysGenSIM.
Bioinformatics :2459–2462(2011).[19] Marbach D et al. Wisdom of crowds for robust gene network inference.
Nature Methods :796–804 (2012).[20] Albert FW et al. Genetics of trans-regulatory variation in gene expression.
Elife :e35471(2018).[21] Monteiro PT et al. YEASTRACT+: a portal for cross-species comparative genomics of tran-scription regulation in yeasts.
Nucleic Acids Research :D642–D649 (2020).[22] Pedregosa F et al. Scikit-learn: Machine learning in Python.
Journal of Machine LearningResearch :2825–2830 (2011).[23] Seabold S and Perktold J. statsmodels: Econometric and statistical modeling with python. In (2010).[24] Yates AD et al. Ensembl 2020.
Nucleic Acids Research :D682–D688 (2019).1625] Marbach D et al. Predictive regulatory models in drosophila melanogaster by integrative infer-ence of transcriptional networks.
Genome Research :1334–1349 (2012).[26] Rockman MV and Kruglyak L. Genetics of global gene expression. Nature Reviews Genetics :862–872 (2006).[27] Kasten M and Stillman D. Identification of the saccharomyces cerevisiae genes STB1–STB5encoding Sin3p binding proteins. Molecular and General Genetics MGG :376–386 (1997).[28] Otsuga D et al.
The dynamin-related GTPase, Dnm1p, controls mitochondrial morphology inyeast.
The Journal of Cell Biology :333–349 (1998).[29] Schmitz HP et al.
Identification of Dck1 and Lmo1 as upstream regulators of the small GTPaseRho5 in Saccharomyces cerevisiae.
Molecular Microbiology :306–324 (2015).[30] Cusanovich DA et al. The functional consequences of variation in transcription factor binding.
PLoS Genetics :e1004226 (2014).[31] MacArthur S et al. Developmental roles of 21 drosophila transcription factors are determinedby quantitative differences in binding to an overlapping set of thousands of genomic regions.
Genome Biol :R80 (2009).[32] Joshi A, Beck Y and Michoel T. Multi-species network inference improves gene regulatory net-work reconstruction for early embryonic development in Drosophila. Journal of ComputationalBiology :253–265 (2015).[33] Badsha M and Fu AQ. Learning causal biological networks with the principle of Mendelianrandomization. Frontiers in Genetics :460 (2019).[34] Yang F et al. Identifying cis-mediators for trans-eqtls across many human tissues using genomicmediation analysis.
Genome research :1859–1871 (2017).[35] Bussemaker HJ, Foat BC and Ward LD. Predictive modeling of genome-wide mRNA expres-sion: from modules to molecules. Annu Rev Biophys Biomol Struct :329–347 (2007).[36] Äijö T and Bonneau R. Biophysically motivated regulatory network inference: progress andprospects. Human heredity :62–77 (2016).[37] Maetschke SR et al. Supervised, semi-supervised and unsupervised inference of gene regulatorynetworks.
Briefings in Bioinformatics :195–211 (2014).[38] Tong P, Monahan J and Prendergast JG. Shared regulatory sites are abundant in the humangenome and shed light on genome evolution and disease pleiotropy. PLoS genetics :e1006673(2017). 17 upplementary Information Method p th FDR P P P P P P P Table S1:
FDR thresholds.
The thresholds ( p th ) reported here were used to select significant interactions forthe methods shown in figure 5 and S3.Figure S1: Matrices of predicted gene interactions.
These square matrices represent the interactions be-tween 2884 genes with causal anchors (eQTLs), posterior probability values are color coded. Vertical bandscorrespond to hotspots.
Left : The instrumental variable test with partial pleiotropy P . Right : The instrumentalvariable test with perfect pleiotropy P P . The genes are ordered according to the position of their causal anchorin the full yeast genome. Definitions of the tests are given in the Methods section. This figure complementsFig. 2 in the main text. igure S2: Comparison of causal models.
These graphical models illustrate possible interactions betweengenes. The hidden confounder H would be a common parent of A and B as shown also in Fig. 1[B]. Theupper row shows a possible ground truth, the lower row shows how this scenario would be classified by Findr . Left : A common child C of A and B does not affect the predictions of Findr regardless of the presence of acommon parent H . Right : An intermediary node C could mediate the interaction between A and B , regardlessof a common parent H .Figure S3: Hotspots and genotype covariance. A and B show the counts of significant interactions for twoinference methods. Genes are ordered along the horizontal axis according to the position of their causal anchorin the full yeast genome. A : instrumental variables with perfect pleiotropy ( P P ) at FDR 5%. B : instrumentalvariables with partial pleiotropy ( P ) at FDR 5%. The thresholds used are reported in Tab. S1. C : The diagonalof the genotype covariance matrix for the 2884 eQTLs. igure S4: Hypothetical model for the
STB5 hotspot.
Stb5p protein level is determined by
STB5 transcriptionlevel and the genotype of one or more protein-altering variants E , and in turn affects STB5 transcription level byan auto-regulatory loop. Expression of
STB5 target genes Y is determined by STB5 transcription only throughStb5p level. Even in the absence of any hidden confounders,
STB5 transcription does not block the path between E and Y , and unless the correlation between STB5 transcription and Stb5p level is perfect (no biological orexperiment noise), conditioning on
STB5 transcription level will not remove the statistical association between E and Y . This model is consistent with the observed lack of allele-specific expression of STB5 [20], and withthe fact that the instrumental variable method P correctly identifies target genes with Stb5p binding sites, butthe mediation-based method P P does not.does not.