Combining exome and gene expression datasets in one graphical model of disease to empower the discovery of disease mechanisms
Aziz M. Mezlini, Fabio Fuligni, Adam Shlien, Anna Goldenberg
Department of Computer Science, University of Toronto, and Hospital for Sick Children, Toronto
September 1, 2015
∗ Corresponding author: [email protected]
Abstract
Identifying genes associated with complex human diseases is one of the main challenges of human genetics and computational medicine. To answer this question, millions of genetic variants get screened to identify a few of importance. To increase the power of identifying genes associated with diseases and to account for other potential sources of protein function aberrations, we propose a novel factor-graph-based model, where much of the biological knowledge is incorporated through factors and priors. Our extensive simulations show that our method has superior sensitivity and precision compared to variant-aggregating and differential expression methods. Our integrative approach was able to identify important genes in breast cancer, identifying genes that had coding aberrations in some patients and regulatory abnormalities in others, emphasizing the importance of data integration to explain the disease in a larger number of patients.
The majority of human diseases are complex, arising due to a multitude of factors. Identifying these factors is critical to understanding diseases and improving healthcare, yet it is a very difficult computational problem. Taking just one modality of data, such as the human genome, already presents computational difficulties. First, there are millions of common variants called Single Nucleotide Polymorphisms (SNPs). Computationally, these data can be represented as a matrix of variables (SNPs, usually taking on values of 0, 1, 2) measured for each patient and control, for hundreds to thousands of individuals. The goal is to select features relevant to the disease, knowing that only a handful of features among these millions are actually relevant to the outcome (phenotype or disease). In the field of human genetics, this problem has been primarily addressed by univariate testing [14], where each variant is tested for a significant difference in the distribution of values among patients vs healthy individuals. Due to a substantial multiple hypothesis correction problem, many studies have been underpowered to find any variants associated with the disease. While other, more complex methods, such as regularized regression [11, 24], were proposed to solve this problem, the biotechnological leap forward has presented new challenges.

In the last five years, sequencing has become commonplace due to the lowering cost of the available biotechnology. While many more variants are now covered, which makes it more likely that the actual causal variant is among the sequenced ones, the computational problem has become more complex in many ways. First, the number of variants has increased, while the number of samples has decreased (due to the cost of sequencing). Second, rare variants are now available in addition to the previously available common variants.
Rare variants may be present in only a handful of patients in the study, and thus the standard approach to analyzing rare variants is to pool them together, creating one meta-variable per gene. These meta-variables are then tested for association with the disease status in the same way as the common variants were. There are several such methods in broad use today, including burden tests such as CAST [15] (collapsing all variants within genes). There are also methods testing the variance component, such as C-alpha [16] and SKAT [23], which are more robust to the presence of neutral and protective variants.

The major problem with the above approaches is that they only work with coding variants, i.e. variants that are in the exome protein-coding sequence. While those variants are indeed more likely to be harmful, disrupting the creation of a healthy protein, recent studies made it abundantly clear that regulatory variation (e.g. genetic variants in regions between genes) plays an enormous role [3]. These variants are covered by whole-genome sequencing (WGS) technology. However, WGS provides only a partial answer: it is currently too expensive to sequence a large number of patients, so the sample sizes remain too small compared to the number of variants. Thus, the most common way WGS data is used is by restricting the analysis to a few novel variants (appearing just in that patient). The rest of the sequence is essentially wasted. However, even if there were enough patients with whole-genome sequence available, these data would not be able to conclusively explain the abnormalities in protein quantities, since there are many reasons for those changes, including epigenetic modifications that are not part of the DNA sequence.
It is thus imperative to go beyond DNA to identify genes and proteins that are associated with the disease.

Outside of the field of genetics, there has been much work in identifying genes that are related to disease through differential gene expression (DGE) [2]. These studies have again used a small number of patients, and the approaches were mostly univariate (with a few exceptions through gene collections known as pathways or networks [9]). Gene expression in itself is very noisy and hard to replicate, so few of the gene signatures identified through DGE have been adopted in the clinic.

The two data sources, coding variants and gene expression, are mostly complementary, responsible for different kinds of protein aberrations. We thus propose to combine these two data sources to a) improve the power of detecting genes associated with the disease; b) implicate proteins that have been affected in the population in a variety of ways, rather than solely through the DNA sequence. To this end, we propose a biologically motivated hierarchical factor graph model which efficiently combines these two sources of data. We use a variety of regularizing factors and prior knowledge, such as variant harmfulness and gene interactions, as priors to increase the likelihood of identifying the variants correctly. To our knowledge, this is the first work that takes into account the complementarity of exome and gene expression data sources in a principled way. It is also the first method that integrates external annotations, such as variant harmfulness and gene interaction information, within the inference of the model. Our extensive simulations confirm that our method has superior sensitivity and precision compared to methods that combine rare variants. We have tested our approach on a large breast cancer dataset as a proof of concept and found that our method is able to identify important breast cancer genes.
Interestingly, we find genes that have mutations in some patients and gene expression aberrations in other patients, indicating that our method is able to effectively explain the disease in more patients.

Our method jointly models the outcome (disease/no disease) and genotypes in a hierarchical graphical model. The predictor variables are the exome variants and mRNA expression levels for each person (patient or healthy individual). Figure 1 shows a factor graph corresponding to our model.
Modeling coding aberrations specific to a single gene.
For every coding variant i observed in at least one patient within a given gene in the exome data, there is an indicator variable X_i capturing whether that variant affects the function of the protein. The value of X_i is inferred by our model. We use harmfulness predictions (as provided) as priors over the X_i variables.

For any considered individual and gene, there are hidden variables Y_i = X_i · g_i, where the genotype input g_i for the i-th variant is encoded as 0, 1 or 2 (indicating respectively homozygous reference, heterozygous, homozygous alternative). Y_i indicates whether the i-th variant is functional in the given individual and whether it is in a homozygous or heterozygous state.

To aggregate all functional variants affecting the same gene in the same individual, we define the quality aberration variable Q ∈ {0, 0.5, 1}. Its values respectively indicate a normally functioning, a partially dysfunctional (e.g. haploinsufficiency) and a fully dysfunctional/inactivated protein. The probability distribution of Q depends on the state of the Y variables, and the relationship is described by the factor ϕ(Q, Y) = P(Q | Y). For more details, see Supplementary Table 1.

Modeling gene expression aberration specific to a gene.
The gene expression data is pre-processed by applying quantile normalization and correcting for batch effects if necessary. We then use log(a + x) to transform the data, where a is a small positive constant. If e is the vector of log-transformed expression levels for a particular gene, we robustly standardize it using the median absolute deviation [13]:

ẽ = (e − median(e)) / (K · median(|e − median(e)|)),   K = 1/Φ⁻¹(3/4) ≈ 1.4826   (1)

The interval thresholds are provided by the user as input to our method (see Figure 1). Variable S is a categorical variable indicating which interval of expression levels (ẽ) can have an effect on the phenotype. S has 5 categories: no expression level is aberrant, only the very low levels are aberrant, only the low levels are aberrant, only the high levels are aberrant, only the very high levels are aberrant. Since all values are the result of robust standardization, the intervals can be defined independently of the genes (for example, ẽ below a negative threshold for the low levels and ẽ above a positive threshold for the high levels).

A is a variable indicating mRNA expression level (quantity) aberration. We have A ∈ {0, 0.5, 1} (standing for no aberration, mild dysregulation, strong dysregulation, respectively). The distribution of A depends on the distribution of the S variable and on the expression level in the gene and individual considered (see the description of the ϕ factor in Supplementary Table 2).

Finally, the quality aberration Q from the exome data and the quantity aberration A from the expression data are combined in an additive fashion to estimate the total functional disruption incurred by the studied gene in any given individual. The graphical representation of a single gene model is shown on the right panel of Figure 1.

Whole genome model.
The main goal of the model is to distinguish genes that are relevant to the disease from those that are not. For a given gene i, H_i is equal to 1 if the gene is relevant to the phenotype and 0 otherwise. There are two types of priors used for the H variables:

1. The gene network prior is encoded in the factors Θ connecting each pair of genes known to interact from existing protein-protein interaction networks. These factors make two interacting genes more likely to both be relevant to a given phenotype than a random pair of genes that do not interact. They also help to bring interacting genes into the set of phenotypically relevant genes if one or both of them are not marginally significant. We use asymmetric directed factors parametrized by the degree of the target gene in the network to account for the large discrepancy in degree distributions for different genes (see Suppl. Materials 1.3).

2. The sparsity prior is encoded in two ways: a) a uniform small prior over all genes indicating that it is unlikely for any random gene to be involved in the phenotype; b) a regularization factor ρ encouraging situations where only a few H_i are active at once.

For every studied individual, there is one G variable per gene indicating whether that gene has a functional importance in that particular individual. G is binary and depends on Q, A and H as described by the factor: ϕ(G = 1, H, Q, A) = P(G = 1 | H, Q, A) = H · min(1, A + Q). Thus, a gene can only be relevant in a particular individual if the gene is shown to be relevant to the phenotype in general and if that particular individual has suspected aberrations in their exome variants or in their level of expression for that gene.

Finally, the phenotype (disease/no disease) is given as an input. It relates to the number of active G variables in each patient: a patient is likely to have some affected genes, while a healthy individual should have few or no affected genes.
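As a concrete illustration, the per-individual gene factor can be sketched as a small function. This is a minimal sketch, assuming Q and A take values in {0, 0.5, 1} as described above; the combined aberration A + Q is capped at one so the result remains a valid probability.

```python
def p_gene_active(h: int, q: float, a: float) -> float:
    """P(G = 1 | H, Q, A): a gene is active in an individual only if it is
    relevant to the phenotype (H = 1) AND that individual carries a coding
    (Q) or expression (A) aberration in the gene."""
    # Q and A each lie in {0, 0.5, 1}; cap their sum at 1 so the
    # returned value is a valid probability.
    return h * min(1.0, q + a)

# A relevant gene with two mild aberrations is fully implicated:
p_gene_active(1, 0.5, 0.5)  # 1.0
# An irrelevant gene is never active, regardless of aberrations:
p_gene_active(0, 1.0, 1.0)  # 0.0
```

Note how the multiplicative H term implements the "only if relevant to the phenotype" condition, while the additive Q + A term lets coding and expression evidence substitute for one another.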
The expected number of affected genes for patients and the tolerated number of affected genes in the healthy population depend on the disease and its complexity and are therefore taken as parameters by our method (more on the corresponding factor ϕ in Supplementary equation (2)).

Inference
We use loopy belief propagation, modified for efficiency, to jointly infer the marginal distributions of the unobserved variables in our graphical model. All computed messages are normalized and kept in log space for numerical stability. We used message damping (with a fixed damping parameter α). G and H variables are the fastest to change since these variables are tightly correlated. Therefore, every time we update the messages for the X, Y, Q and A variables, we update the messages between G, H and phenotype variables multiple times or until their local convergence.

Some of the factors connect a large number of nodes to one node. For example: the ϕ factor connecting G variables to the phenotype, the regularization factor ρ connecting the H variables together, and the ϕ factor connecting Y nodes to a Q node. Since in each of these examples the factor mainly depends on the sum of the variables, the computation becomes feasible by estimating the distribution of the sum of variables and then using it as an intermediate variable. We use the Poisson binomial distribution to model the sum of Bernoulli variables. In a recent review of Poisson binomial estimation methods [8], the complexity of the methods presented is at least O(n²), which is inefficient for message computation, especially for a large number of variables n. Here we develop a divide-and-conquer approach for computing the Poisson binomial in O(n log²(n)). This is based on the simple observation that the sum of two Poisson binomial variables Z₁ and Z₂ is a Poisson binomial variable Z, and the distribution of Z is the convolution of the two vectors representing the distributions of Z₁ and Z₂ (by analogy to a product of two polynomials).
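A minimal sketch of this divide-and-conquer computation is shown below, using NumPy. Note that `np.convolve` is quadratic per merge; reaching the stated asymptotic bound in practice would require an FFT-based convolution such as `scipy.signal.fftconvolve` for the larger merges.

```python
import numpy as np

def poisson_binomial_pmf(ps):
    """PMF of the sum of independent Bernoulli(p_i) variables.

    Split the variables in half, compute each half's Poisson binomial
    distribution recursively, and merge by convolution: the distribution
    of a sum of independent variables is the convolution of their PMFs,
    by analogy to a product of two polynomials.
    """
    if len(ps) == 1:
        p = ps[0]
        return np.array([1.0 - p, p])  # base case: one Bernoulli variable
    mid = len(ps) // 2
    left = poisson_binomial_pmf(ps[:mid])
    right = poisson_binomial_pmf(ps[mid:])
    return np.convolve(left, right)

pmf = poisson_binomial_pmf([0.1, 0.5, 0.9])
# pmf[k] = P(sum = k); e.g. P(sum = 0) = 0.9 * 0.5 * 0.1 = 0.045
```

The returned vector has length n + 1 and can be used directly as the intermediate sum variable when computing messages through the high-degree factors.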
Messages to/from cardinality potentials were previously computed with the same complexity using a very similar approach [19], but it was not previously formulated as a solution for estimating Poisson binomial distributions. After convergence of the loopy belief propagation, our method returns the posterior marginal probabilities for the H and G variables, along with the list of most relevant aberrations (coding variants or expression levels) in each affected individual according to the model.

Since there are very few complex diseases with known gene aberrations, but there are very good genome simulators of disease, particularly of rare variants, we first analyze our performance using an extensive set of simulations. We then showcase our method on a real breast cancer dataset. But first, we list the sources of our biological priors.

Harmfulness predictions are used as priors over the X_i variables indicating whether the coding variant i affects protein function. These predictions are obtained using public methods and libraries. We used ANNOVAR [20] for variant annotation. We then used Polyphen2 scores as priors over the nonsynonymous variants. Variants annotated as nonsense or loss-of-function (LoF) are systematically assigned a high prior (such as 0.99). For now, we ignored variants that are not annotated as nonsynonymous or nonsense. The use of harmfulness predictions is optional; since they are only used as priors, their absence does not usually have a large impact on the final result (see below).

We used the BioGrid human protein-protein interaction network [4] as a prior for gene connectivity.
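The prior-assignment rule above can be sketched as follows. This is a minimal sketch: the 0.99 LoF prior and the use of Polyphen2 scores for missense variants follow the text, while the function name and the annotation labels are hypothetical placeholders for whatever the annotation pipeline emits.

```python
def harmfulness_prior(annotation, polyphen2_score=None):
    """Prior P(X_i = 1) that coding variant i damages the protein.

    Nonsense/loss-of-function variants get a high fixed prior;
    nonsynonymous (missense) variants use their Polyphen2 score;
    all other variants are ignored by the model (prior of None).
    """
    if annotation in ("nonsense", "loss_of_function"):
        return 0.99
    if annotation == "nonsynonymous" and polyphen2_score is not None:
        return polyphen2_score
    return None  # variant not used by the model

harmfulness_prior("nonsense")             # 0.99
harmfulness_prior("nonsynonymous", 0.73)  # 0.73
harmfulness_prior("synonymous")           # None
```

Because these values enter the model only as priors over X_i, a missing or uncertain score simply leaves the corresponding X_i closer to its uninformative default rather than breaking the inference.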
Overview of the simulation process.
For simulating coding variants, we used the European population model with the optimal parameters as in [12]. We considered the single nucleotide variants (SNVs) with selection coefficient greater than s = 0.001 as the deleterious SNVs and the remaining SNVs as neutral, as recommended in [17]. We then sampled harmfulness scores from the real Polyphen score distribution computed on known deleterious variants and neutral variants as in [17]. The generated missense SNVs were partitioned among the 6900 genes present in both the BioGrid annotation and the healthy gene expression dataset we used as background for the distribution of gene levels. For every individual and gene, we randomly decided with rate λ whether there will be a perturbation of expression level in that individual for the considered gene.

The procedure above was done on a population level. To simulate a disease scenario, we selected a number of genes that are close to each other on the protein-protein interaction network as the causal disease mechanism. For every individual, we counted the number of harmful mutations and the number of expression perturbations within the disease mechanism, and we considered that sum as indicative of the phenotype. The disease cases were then selected to be the top 3000 individuals according to that sum (we suppose the disease prevalence in the population is 0.3%).

For the expression data, we started from a healthy expression dataset (GEO: GSE9006) as a background signal of expression. For every individual selected (patient or control), we generate gene expression levels by sampling with noise from the real data and then applying the gene expression perturbations simulated earlier. The magnitude of the change in expression was uniformly sampled from an interval bounded below by 2σ, where σ is the standard deviation of the considered gene's background expression signal.
The sign of the expression change is chosen randomly for each gene.

We varied several simulation parameters to check the robustness of our method. First, we varied the sample size: we selected n patients from the 3000 affected individuals and n controls from the rest of the population. Second, we varied the number of causal genes. Finally, we varied the proportion λ of aberrations related to expression versus coding variants.

We characterize the performance of our method by how well the disease mechanism was reconstructed (sensitivity, precision, N-Precision) and compare our results to those of four rare-variant aggregation methods, CAST [7], C-ALPHA [22], SKAT [23] and RWAS [18], along with differential expression and the dmGWAS method [10], which uses networks in gene detection. From the output of our method, we consider a gene a positive (disease associated) if its marginal probability is larger than 0.5. For all the other methods except dmGWAS, gene p-values were used as the criterion for ranking and calling genes. A gene was considered a positive if its associated p-value was lower than the significance threshold (after multiple hypothesis correction). dmGWAS returns fixed lists of genes; we use the highest scoring list as the positive gene calls.

Figure 2-A,B,C shows that our method has higher sensitivity to the causal genes than any other method considered and that it is often the only method capable of detecting the disease mechanism in many realistic scenarios. In particular, the N-Precision measure (how likely the true gene is to be ranked within the top N genes) in Figure 2-D,E,F shows that it is not a problem of a stringent p-value threshold for the other methods, but that our performance is higher for the task of prioritizing candidate genes in general, and that the causal genes are systematically ranked higher by our method (the criterion used to determine positive gene calls is marginals ≥ 0.5; see Figure 2). We can see that the performance of most methods drops considerably when the contribution of the coding variation is only 20% (i.e. aberrations in expression levels constitute 80% of the causes) instead of 50%. This is particularly interesting because the ratio of 20% to 80% for coding and regulatory aberrations, respectively, is a realistic scenario as mentioned in the literature [1].

Figure 2: Sensitivity and N-Precision as a function of the simulation parameters, evaluated for our method (in pink) and compared to other variant aggregation methods and to differential expression. The sample size is varied from 50 to 400 on the x-axis of each single graph (the sample size is equal to the number of cases, which is equal to the number of controls). The effect of changing P and λ is shown in the different columns.
P is the number of genes simulated to be causal and λ is the proportion of simulated aberrations that are regulatory (expression) as opposed to coding (exome). Sensitivity is the proportion of simulated causal genes that are detected by the method. N-Precision is computed based on the top N = P genes reported by each method and is equal to the proportion of true causal genes among those.

Our method takes multiple biological sources of information and annotations as inputs in order to enhance the power to detect genes relevant to the phenotype. Figure 3 shows how much each of these additional inputs contributes to the power of the method. Specifically, it shows that running our method without using the gene network or without harmfulness predictions results in only a small drop in performance. However, running our method without expression data does have a noticeable impact. This is expected since half the disease signal is contained in the expression data in this particular set of simulations. Still, even running without expression, our method largely outperforms all other approaches, as shown by comparing Figure 3 and Figure 2-A,D. It is only when we use our method without any of these additional inputs that we see a substantial drop in performance, making the performance of our method on par with the other methods.

We also checked that our method is still able to reconstruct the disease mechanism when there is a large discrepancy between the contributions of the causal genes (see Supplementary Figure 9) and is robust to changes in the regularization parameter (Supplementary Figure 10).

Figure 3: Contribution of each additional input to performance. P = 10 and the expression versus SNVs causality (λ) is set to 50% (both have equal contribution to the phenotype on average). Each bar describes the results of 10 simulations.

We used exome sequencing and gene expression data (RNA-Seq) for 826 breast cancer patients from 4 subtypes (238 Luminal A, 359 Luminal B, 148 Basal, 81 HER2+). All data was downloaded from The Cancer Genome Atlas (TCGA).
The data for 20,000 genes was available and was used in the method directly, without gene pre-selection. We analysed all pairs of breast cancer subtypes (using the first subtype in each pair as patients and the second as controls, and vice versa) and report the genes with marginals > 0.5. We balance the number of patients and controls by subsampling from the subtype with the larger sample size. For the luminal subtypes, the recurrent genes were PIK3CA, CDH1 and GATA3, with luminal A patients more often having mutations in CDH1, MAP3K1 and MLL3, and luminal B patients more often having TP53 and GATA3 mutations, when comparing the two subtypes to each other. The HER2 subtype has the smallest sample size and yielded only one gene with marginal above the 0.5 threshold.

In this paper we propose an integrative graphical model framework for combining exome and gene expression data to increase the power of detecting genes associated with complex human diseases. The main contribution of the paper is the principled approach to combining these two data sources in a biologically sensible way. While genetic variants have been used to predict gene expression (eQTL analysis), combining these data as two complementary sources has not been done to date. We made sure to incorporate additional biological information, such as variant harmfulness and gene interactions, as priors. This again is an innovation, since gene networks have mostly been used to either pre-select or combine the identified variants, whereas we propose to use them as part of the inference process. Additionally, using networks as priors, rather than in a post-hoc analysis, allows the method to be less sensitive to the inherent noise present in the network.

We demonstrated that even given a small sample size, our method is able to outperform several other widely used methods (using either gene expression or exome alone). Our simulations show that the method is not too sensitive to any of the priors and that the combination of this information gives a great boost to gene identification. The breast cancer analysis showed that we can recover known breast cancer genes.
We have to note that it is important to take great care when using gene expression data in cancer. Expression data in cancer represents the downstream effect of many genes through their interactions. It is not unusual to observe a significant gene expression change in thousands of genes, the majority being a downstream effect rather than the driver (e.g. inflammation, drug response, etc.). Additionally, and more importantly, there is large heterogeneity in gene expression in cancer: many patients within the same subtype will appear to have aberrant expression. These variations are of unknown cause. It is desirable to remove the broad variability patterns while preserving consistent changes across a few patients. In our work, we used quantile normalization to address this point, but we will be exploring this question further to establish a cancer-specific pre-processing framework.

We will also further investigate the translational impact of this work, analyzing the posterior over genes for each patient individually. This will help to assess the heterogeneity in cancer in general, but more so, to potentially assess the ability of each patient to respond to mutation-targeting drugs.

References

[1] A. Adeyemo and C. Rotimi. Genetic variants associated with complex human diseases show wide variation across multiple populations.
Public Health Genomics, 13(2):72–79, 2009.

[2] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11(10):R106, 2010.

[3] Ewan Birney, John A. Stamatoyannopoulos, Anindya Dutta, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 2007.

[4] Andrew Chatr-Aryamontri, Bobby-Joe Breitkreutz, Rose Oughtred, Lorrie Boucher, Sven Heinicke, Daici Chen, Chris Stark, Ashton Breitkreutz, Nadine Kolas, Lara O'Donnell, Teresa Reguly, et al., and Mike Tyers. The BioGRID interaction database: 2015 update. Nucleic Acids Research, 43(Database issue):D470–D478, 2015.

[5] Matthew J. Ellis, Li Ding, Dong Shen, Richard K. Luo, Elaine R. Mardis, et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature, 486(7403):353–360, 2012.

[6] Christopher Greenman, Philip Stephens, et al., and Michael R. Stratton. Patterns of somatic mutation in human cancer genomes. Nature, 446(7132):153–158, 2007.

[7] Thomas J. Hoffmann, Nicholas J. Marini, and John S. Witte. Comprehensive approach to analyzing rare genetic variants. PLoS ONE, 5(11):9, 2010.

[8] Yili Hong. On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59:41–51, 2013.

[9] Peilin Jia and Zhongming Zhao. Network-assisted analysis to prioritize GWAS results: principles, methods and perspectives. Human Genetics, 133(2):125–138, 2014.

[10] Peilin Jia, Siyuan Zheng, Jirong Long, Wei Zheng, and Zhongming Zhao. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics, 27(1):95–102, 2011.

[11] Seyoung Kim and Eric P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.

[12] Gregory V. Kryukov, Alexander Shpunt, John A. Stamatoyannopoulos, and Shamil R. Sunyaev. Power of deep, all-exon resequencing for discovery of human trait genes. Proceedings of the National Academy of Sciences of the United States of America, 106(10):3871–3876, 2009.

[13] Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4):764–766, 2013.

[14] Mark I. McCarthy, Gonçalo R. Abecasis, Lon R. Cardon, David B. Goldstein, Julian Little, John P. A. Ioannidis, and Joel N. Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5):356–369, 2008.

[15] Stephan Morgenthaler and William G. Thilly. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research, 615(1-2):28–56, 2007.

[16] Benjamin M. Neale, Manuel A. Rivas, Benjamin F. Voight, David Altshuler, Bernie Devlin, Marju Orho-Melander, Sekar Kathiresan, Shaun M. Purcell, Kathryn Roeder, and Mark J. Daly. Testing for an unusual distribution of rare variants. PLoS Genetics, 7(3):e1001322, 2011.

[17] Alkes L. Price, Gregory V. Kryukov, Paul I. W. de Bakker, Shaun M. Purcell, Jeff Staples, Lee-Jen Wei, and Shamil R. Sunyaev. Pooled association tests for rare variants in exon-resequencing studies. American Journal of Human Genetics, 86(6):832–838, 2010.

[18] Jae Hoon Sul, Buhm Han, Dan He, and Eleazar Eskin. An optimal weighted aggregated association test for identification of rare variants involved in common diseases. Genetics, 188(1):181–188, 2011.

[19] Daniel Tarlow, Kevin Swersky, Richard S. Zemel, Ryan P. Adams, and Brendan J. Frey. Fast exact inference for recursive cardinality models. Uncertainty in Artificial Intelligence, pages 825–834, 2012.

[20] Kai Wang, Mingyao Li, and Hakon Hakonarson. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16):e164, 2010.

[21] Xin-Xin Wang, Liya Fu, Xuan Li, Xiao Wu, Zhengmao Zhu, Li Fu, and Jin-Tang Dong. Somatic mutations of the mixed-lineage leukemia 3 (MLL3) gene in primary breast cancers. Pathology Oncology Research, 17(2):429–433, 2011.

[22] Testing for an unusual distribution of rare variants: Supplementary Methods. PLoS Genetics, 7(3):1–7, 2011.

[23] Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics, 89(1):82–93, 2011.

[24] Xiaolin Yang, Seyoung Kim, and Eric P. Xing. Heterogeneous multitask learning with joint sparsity constraints. Advances in Neural Information Processing Systems, 23:1–9, 2009.