Babak Alipanahi
University of Waterloo
Publication
Featured research published by Babak Alipanahi.
Science | 2015
Hui Y. Xiong; Babak Alipanahi; Leo J. Lee; Hannes Bretschneider; Daniele Merico; Ryan K. C. Yuen; Yimin Hua; Serge Gueroussov; Hamed Shateri Najafabadi; Timothy R. Hughes; Quaid Morris; Yoseph Barash; Adrian R. Krainer; Nebojsa Jojic; Stephen W. Scherer; Benjamin J. Blencowe; Brendan J. Frey
Predicting defects in RNA splicing
Most eukaryotic messenger RNAs (mRNAs) are spliced to remove introns. Splicing generates uninterrupted open reading frames that can be translated into proteins. Splicing is often highly regulated, generating alternative spliced forms that code for variant proteins in different tissues. RNA-binding proteins that bind specific sequences in the mRNA regulate splicing. Xiong et al. develop a computational model that predicts splicing regulation for any mRNA sequence (see the Perspective by Guigó and Valcárcel). They use this to analyze more than half a million mRNA splicing sequence variants in the human genome. They are able to identify thousands of known disease-causing mutations, as well as many new disease candidates, including 17 new autism-linked genes. Science, this issue 10.1126/science.1254806; see also p. 124
A model predicts how thousands of disease-linked nucleotide variants affect messenger RNA splicing.
INTRODUCTION Advancing whole-genome precision medicine requires understanding how gene expression is altered by genetic variants, especially those that are far outside of protein-coding regions. We developed a computational technique that scores how strongly genetic variants affect RNA splicing, a critical step in gene expression whose disruption contributes to many diseases, including cancers and neurological disorders. A genome-wide analysis reveals tens of thousands of variants that alter splicing and are enriched with a wide range of known diseases. Our results provide insight into the genetic basis of spinal muscular atrophy, hereditary nonpolyposis colorectal cancer, and autism spectrum disorder.
RATIONALE We used “deep learning” computer algorithms to derive a computational model that takes as input DNA sequences and applies general rules to predict splicing in human tissues. Given a test variant, which may be up to 300 nucleotides into an intron, our model can be used to compute a score for how much the variant alters splicing. The model is not biased by existing disease annotations or population data and was derived in such a way that it can be used to study diverse diseases and disorders and to determine the consequences of common, rare, and even spontaneous variants.
RESULTS Our technique is able to accurately classify disease-causing variants and provides insights into the role of aberrant splicing in disease. We scored more than 650,000 DNA variants and found that disease-causing variants have higher scores than common variants and even those associated with disease in genome-wide association studies (GWAS). Our model predicts substantial and unexpected aberrant splicing due to variants within introns and exons, including those far from the splice site. For example, among intronic variants that are more than 30 nucleotides away from any splice site, known disease variants alter splicing nine times as often as common variants; among missense exonic disease variants, those that least affect protein function are more than five times as likely as other variants to alter splicing. Autism has been associated with disrupted splicing in brain regions, so we used our method to score variants detected using whole-genome sequencing data from individuals with and without autism. Genes with high-scoring variants include many that have previously been linked with autism, as well as new genes with known neurodevelopmental phenotypes. Most of the high-scoring variants are intronic and cannot be detected by exome analysis techniques. When we scored clinical variants in spinal muscular atrophy and colorectal cancer genes, up to 94% of variants found to alter splicing using minigene reporters were correctly classified.
CONCLUSION In the context of precision medicine, causal support for variants independent of existing whole-genome variant studies is greatly needed. Our computational model was trained to predict splicing from DNA sequence alone, without using disease annotations or population data. Consequently, its predictions are independent of and complementary to population data, GWAS, expression-based quantitative trait loci (QTL), and functional annotations of the genome. As such, our technique greatly expands the opportunities for understanding the genetic determinants of disease.
“Deep learning” reveals the genetic origins of disease. A computational system mimics the biology of RNA splicing by correlating DNA elements with splicing levels in healthy human tissues. The system can scan DNA and identify damaging genetic variants, including those deep within introns. This procedure has led to insights into the genetics of autism, cancers, and spinal muscular atrophy.
To facilitate precision medicine and whole-genome annotation, we developed a machine-learning technique that scores how strongly genetic variants affect RNA splicing, whose alteration contributes to many diseases. Analysis of more than 650,000 intronic and exonic variants revealed widespread patterns of mutation-driven aberrant splicing. Intronic disease mutations that are more than 30 nucleotides from any splice site alter splicing nine times as often as common variants, and missense exonic disease mutations that have the least impact on protein function are five times as likely as others to alter splicing. We detected tens of thousands of disease-causing mutations, including those involved in cancers and spinal muscular atrophy. Examination of intronic and exonic variants found using whole-genome sequencing of individuals with autism revealed misspliced genes with neurodevelopmental phenotypes. Our approach provides evidence for causal variants and should enable new discoveries in precision medicine.
Nature | 2013
Hong Han; Manuel Irimia; P. Joel Ross; Hoon-Ki Sung; Babak Alipanahi; Laurent David; Azadeh Golipour; Mathieu Gabut; Iacovos P. Michael; Emil N. Nachman; Eric T. Wang; Dan Trcka; Tadeo Thompson; Dave O’Hanlon; Valentina Slobodeniuc; Nuno L. Barbosa-Morais; Christopher B. Burge; Jason Moffat; Brendan J. Frey; Andras Nagy; James Ellis; Jeffrey L. Wrana; Benjamin J. Blencowe
Previous investigations of the core gene regulatory circuitry that controls the pluripotency of embryonic stem (ES) cells have largely focused on the roles of transcription, chromatin and non-coding RNA regulators. Alternative splicing represents a widely acting mode of gene regulation, yet its role in regulating ES-cell pluripotency and differentiation is poorly understood. Here we identify the muscleblind-like RNA binding proteins, MBNL1 and MBNL2, as conserved and direct negative regulators of a large program of cassette exon alternative splicing events that are differentially regulated between ES cells and other cell types. Knockdown of MBNL proteins in differentiated cells causes switching to an ES-cell-like alternative splicing pattern for approximately half of these events, whereas overexpression of MBNL proteins in ES cells promotes differentiated-cell-like alternative splicing patterns. Among the MBNL-regulated events is an ES-cell-specific alternative splicing switch in the forkhead family transcription factor FOXP1 that controls pluripotency. Consistent with a central and negative regulatory role for MBNL proteins in pluripotency, their knockdown significantly enhances the expression of key pluripotency genes and the formation of induced pluripotent stem cells during somatic cell reprogramming.
Nature Genetics | 2014
Mohammed Uddin; Kristiina Tammimies; Giovanna Pellecchia; Babak Alipanahi; Pingzhao Hu; Z. B. Wang; Dalila Pinto; Lynette Lau; Thomas Nalpathamkalam; Christian R. Marshall; Benjamin J. Blencowe; Brendan J. Frey; Daniele Merico; Ryan K. C. Yuen; Stephen W. Scherer
A universal challenge in genetic studies of autism spectrum disorders (ASDs) is determining whether a given DNA sequence alteration will manifest as disease. Among different population controls, we observed, for specific exons, an inverse correlation between exon expression level in brain and burden of rare missense mutations. For genes that harbor de novo mutations predicted to be deleterious, we found that specific critical exons were significantly enriched in individuals with ASD relative to their siblings without ASD (P < 1.13 × 10⁻³⁸; odds ratio (OR) = 2.40). Furthermore, our analysis of genes with high exonic expression in brain and low burden of rare mutations demonstrated enrichment for known ASD-associated genes (P < 3.40 × 10⁻¹¹; OR = 6.08) and ASD-relevant fragile-X protein targets (P < 2.91 × 10⁻¹⁵⁷; OR = 9.52). Our results suggest that brain-expressed exons under purifying selection should be prioritized in genotype-phenotype studies for ASD and related neurodevelopmental conditions.
npj Genomic Medicine | 2016
Dimitri J. Stavropoulos; Daniele Merico; Rebekah Jobling; Sarah Bowdin; Nasim Monfared; Bhooma Thiruvahindrapuram; Thomas Nalpathamkalam; Giovanna Pellecchia; Ryan K. C. Yuen; Michael J. Szego; Robin Z. Hayeems; Randi Zlotnik Shaul; Michael Brudno; Marta Girdea; Brendan J. Frey; Babak Alipanahi; Sohnee Ahmed; Riyana Babul-Hirji; Ramses Badilla Porras; Melissa T. Carter; Lauren Chad; Ayeshah Chaudhry; David Chitayat; Soghra Jougheh Doust; Cheryl Cytrynbaum; Lucie Dupuis; Resham Ejaz; Leona Fishman; Andrea Guerin; Bita Hashemi
The standard of care for first-tier clinical investigation of the aetiology of congenital malformations and neurodevelopmental disorders is chromosome microarray analysis (CMA) for copy-number variations (CNVs), often followed by gene(s)-specific sequencing searching for smaller insertion–deletions (indels) and single-nucleotide variant (SNV) mutations. Whole-genome sequencing (WGS) has the potential to capture all classes of genetic variation in one experiment; however, the diagnostic yield for mutation detection of WGS compared to CMA, and other tests, needs to be established. In a prospective study we utilised WGS and comprehensive medical annotation to assess 100 patients referred to a paediatric genetics service and compared the diagnostic yield versus standard genetic testing. WGS identified genetic variants meeting clinical diagnostic criteria in 34% of cases, representing a fourfold increase in diagnostic rate over CMA alone (8%; P value=1.42E−05) and a more than twofold increase over CMA plus targeted gene sequencing (13%; P value=0.0009). WGS identified all rare clinically significant CNVs that were detected by CMA. In 26 patients, WGS revealed indel and missense mutations presenting in a dominant (63%) or a recessive (37%) manner. We found four subjects with mutations in at least two genes associated with distinct genetic disorders, including two cases harbouring a pathogenic CNV and SNV. When considering medically actionable secondary findings in addition to primary WGS findings, 38% of patients would benefit from genetic counselling. Clinical implementation of WGS as a primary test will provide a higher diagnostic yield than conventional genetic testing and potentially reduce the time required to reach a genetic diagnosis.
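The fold increases quoted in this abstract follow directly from the reported diagnostic yields. The short Python sketch below only reproduces that arithmetic; the dictionary keys and the function name `fold_increase` are illustrative and are not taken from the study.

```python
# Diagnostic yields (percent of the 100 patients) as reported in the abstract.
yields_percent = {
    "WGS": 34.0,
    "CMA": 8.0,
    "CMA + targeted gene sequencing": 13.0,
}


def fold_increase(new_yield, baseline_yield):
    """Return how many times larger new_yield is than baseline_yield."""
    if baseline_yield <= 0:
        raise ValueError("baseline yield must be positive")
    return new_yield / baseline_yield


wgs_vs_cma = fold_increase(
    yields_percent["WGS"], yields_percent["CMA"]
)
wgs_vs_cma_targeted = fold_increase(
    yields_percent["WGS"], yields_percent["CMA + targeted gene sequencing"]
)

print(f"WGS vs CMA alone: {wgs_vs_cma:.2f}-fold")
print(f"WGS vs CMA + targeted sequencing: {wgs_vs_cma_targeted:.2f}-fold")
```

The ratios come out at roughly 4.25 and 2.6, matching the "fourfold" and "more than twofold" wording. The P values in the abstract come from the study's own statistical test, which this sketch does not attempt to reproduce.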
Bioinformatics | 2009
Babak Alipanahi; Xin Gao; Emre Karakoc; Logan Donaldson; Ming Li
Motivation: Picking peaks from experimental NMR spectra is a key unsolved problem for automated NMR protein structure determination. Such a process is a prerequisite for resonance assignment, nuclear Overhauser enhancement (NOE) distance restraint assignment, and structure calculation tasks. Manual or semi-automatic peak picking, currently the predominant approach in NMR labs, is tedious, time-consuming and costly. Results: We introduce new ideas, including noise-level estimation, component forming and sub-division, singular value decomposition (SVD)-based peak picking, and peak pruning and refinement. PICKY is developed as an automated peak picking method. Unlike previous research on peak picking, we provide a systematic study of the proposed method. PICKY is tested on 32 real 2D and 3D spectra of eight target proteins, and achieves an average of 88% recall and 74% precision. PICKY is efficient: on average it takes 15.7 s to process an NMR spectrum. More important than these numbers, PICKY actually works in practice. We feed peak lists generated by PICKY to IPASS for resonance assignment, feed the IPASS assignment to SPARTA for fragment generation, and feed the SPARTA fragments to FALCON for structure calculation. This results in high-resolution structures of several proteins, for example, TM1112, at 1.25 Å. Availability: PICKY is available upon request. The peak lists of PICKY can be easily loaded by SPARKY to enable a better interactive strategy for rapid peak picking.
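For readers unfamiliar with how recall and precision apply to peak lists, the Python sketch below shows one plausible way to score a picked list against a reference list. It is not PICKY's evaluation procedure: the greedy nearest-neighbour matching, the single Euclidean tolerance `tol`, the function names, and the example coordinates are all assumptions made for illustration.

```python
from math import dist


def match_peaks(picked, reference, tol):
    """Count reference peaks that have an unused picked peak within tol.

    Greedy matching: each picked peak may be paired with at most one
    reference peak. This matching rule is an illustrative assumption.
    """
    unused = list(picked)
    matched = 0
    for ref in reference:
        # Picked peaks that are close enough to this reference peak.
        candidates = [p for p in unused if dist(p, ref) <= tol]
        if candidates:
            best = min(candidates, key=lambda p: dist(p, ref))
            unused.remove(best)
            matched += 1
    return matched


def precision_recall(picked, reference, tol):
    """Precision = matched / picked, recall = matched / reference."""
    matched = match_peaks(picked, reference, tol)
    precision = matched / len(picked) if picked else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical 2D peak positions (made-up chemical shifts in ppm).
reference = [(8.10, 120.5), (7.45, 115.2), (9.02, 128.0), (6.80, 110.1)]
picked = [(8.11, 120.4), (7.44, 115.3), (5.00, 100.0)]

precision, recall = precision_recall(picked, reference, tol=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practice the tolerance would normally be set per dimension, since proton and nitrogen chemical shifts have very different scales; a single Euclidean radius is used here only to keep the sketch short.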
Proceedings of the IEEE | 2016
Michael K. K. Leung; Andrew Delong; Babak Alipanahi; Brendan J. Frey
In this paper, we provide an introduction to machine learning tasks that address important problems in genomic medicine. One of the goals of genomic medicine is to determine how variations in the DNA of individuals can affect the risk of different diseases, and to find causal explanations so that targeted therapies can be designed. Here we focus on how machine learning can help to model the relationship between DNA and the quantities of key molecules in the cell, with the premise that these quantities, which we refer to as cell variables, may be associated with disease risks. Modern biology allows high-throughput measurement of many such cell variables, including gene expression, splicing, and proteins binding to nucleic acids, which can all be treated as training targets for predictive models. With the growing availability of large-scale data sets and advanced computational techniques such as deep learning, researchers can help to usher in a new era of effective genomic medicine.
npj Genomic Medicine | 2016
Ryan K. C. Yuen; Daniele Merico; Hongzhi Cao; Giovanna Pellecchia; Babak Alipanahi; Bhooma Thiruvahindrapuram; Xin Tong; Yuhui Sun; Dandan Cao; Tao Zhang; Xueli Wu; Xin Jin; Ze Zhou; Xiaomin Liu; Thomas Nalpathamkalam; Susan Walker; Jennifer L. Howe; Z. B. Wang; Jeffrey R. MacDonald; Ada J. S. Chan; Lia D’Abate; Eric Deneault; Michelle T. Siu; Kristiina Tammimies; Mohammed Uddin; Mehdi Zarrei; Mingbang Wang; Yingrui Li; Jun Wang; Jian Wang
De novo mutations (DNMs) are important in autism spectrum disorder (ASD), but so far analyses have mainly focused on the ~1.5% of the genome encoding genes. Here, we performed whole-genome sequencing (WGS) of 200 ASD parent–child trios and characterised germline and somatic DNMs. We confirmed that the majority of germline DNMs (75.6%) originated from the father, and these increased significantly with paternal age only (P=4.2×10⁻¹⁰). However, when clustered DNMs (those within 20 kb) were found in ASD, not only did they mostly originate from the mother (P=7.7×10⁻¹³), but they could also be found adjacent to de novo copy number variations where the mutation rate was significantly elevated (P=2.4×10⁻²⁴). By comparing with DNMs detected in controls, we found a significant enrichment of predicted damaging DNMs in ASD cases (P=8.0×10⁻⁹; odds ratio=1.84), of which 15.6% (P=4.3×10⁻³) and 22.5% (P=7.0×10⁻⁵) were non-coding or genic non-coding, respectively. The non-coding elements most enriched for DNM were untranslated regions of genes, regulatory sequences involved in exon-skipping and DNase I hypersensitive regions. Using microarrays and a novel outlier detection test, we also found aberrant methylation profiles in 2/185 (1.1%) of ASD cases. These same individuals carried independently identified DNMs in the ASD-risk and epigenetic genes DNMT3A and ADNP. Our data begin to characterise different genome-wide DNMs, and highlight the contribution of non-coding variants to the aetiology of ASD.
G3: Genes, Genomes, Genetics | 2015
Daniele Merico; Mehdi Zarrei; Gregory Costain; Lucas Ogura; Babak Alipanahi; Matthew J. Gazzellone; Nancy J. Butcher; Bhooma Thiruvahindrapuram; Thomas Nalpathamkalam; Eva W.C. Chow; Danielle M. Andrade; Brendan J. Frey; Christian R. Marshall; Stephen W. Scherer; Anne S. Bassett
Chromosome 22q11.2 microdeletions impart a high but incomplete risk for schizophrenia. Possible mechanisms include genome-wide effects of DGCR8 haploinsufficiency. In a proof-of-principle study to assess the power of this model, we used high-quality, whole-genome sequencing of nine individuals with 22q11.2 deletions and extreme phenotypes (schizophrenia, or no psychotic disorder at age >50 years). The schizophrenia group had a greater burden of rare, damaging variants impacting protein-coding neurofunctional genes, including genes involved in neuron projection (nominal P = 0.02, joint burden of three variant types). Variants in the intact 22q11.2 region were not major contributors. Restricting to genes affected by a DGCR8 mechanism tended to amplify between-group differences. Damaging variants in highly conserved long intergenic noncoding RNA genes also were enriched in the schizophrenia group (nominal P = 0.04). The findings support the 22q11.2 deletion model as a threshold-lowering first hit for schizophrenia risk. If applied to a larger and thus better-powered cohort, this appears to be a promising approach to identify genome-wide rare variants in coding and noncoding sequence that perturb gene networks relevant to idiopathic schizophrenia. Similarly designed studies exploiting genetic models may prove useful to help delineate the genetic architecture of other complex phenotypes.
Journal of Bioinformatics and Computational Biology | 2011
Babak Alipanahi; Xin Gao; Emre Karakoc; Shuai Cheng Li; Frank J. Balbach; Guangyu Feng; Logan W. Donaldson; Ming Li
Error-tolerant backbone resonance assignment is the cornerstone of the NMR structure determination process. Although a variety of assignment approaches have been developed, none works sufficiently well on noisy, fully automatically picked peaks to enable the subsequent automatic structure determination steps. We have designed an integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins. IPASS employs probabilistic spin system typing based on chemical shifts and secondary structure predictions. Furthermore, IPASS extracts connectivity information from the inter-residue information and the (automatically picked) ¹⁵N-edited NOESY peaks, which are then used to fix reliable fragments. When applied to automatically picked peaks for real proteins, IPASS achieves an average precision and recall of 82% and 63%, respectively. In contrast, the next best method, MARS, achieves an average precision and recall of 77% and 36%, respectively. The assignments generated by IPASS are then fed into our protein structure calculation system, FALCON-NMR, to determine the 3D structures without human intervention. The final models have backbone RMSDs of 1.25 Å, 0.88 Å, 1.49 Å, and 0.67 Å to the reference native structures for proteins TM1112, CASKIN, VRAR, and HACS1, respectively. The web server is publicly available at http://monod.uwaterloo.ca/nmr/ipass.
Journal of Bioinformatics and Computational Biology | 2010
Yuzhong Zhao; Babak Alipanahi; Shuai Cheng Li; Ming Li
Accurate determination of protein secondary structure from chemical shift information is a key step in NMR tertiary structure determination. Relatively little work has been done on this subject. There needs to be a systematic investigation of algorithms that are (a) robust for large datasets; (b) easily extendable to new (dynamic) databases; and (c) approaching the limit of accuracy. We introduce new approaches that use the k-nearest neighbor algorithm for the basic prediction and the BCJR algorithm to smooth the predictions and to combine predictions based on chemical shifts with predictions based on sequence information only. Our new system, SUCCES, improves on the accuracy of all existing methods on a large dataset of 805 proteins (86% Q3 accuracy, and 92.6% accuracy when the boundary residues are ignored), and it is easily extendable to any new dataset without requiring any new training. The software is publicly available at http://monod.uwaterloo.ca/nmr/succes.
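The Q3 figures above are per-residue accuracies over three secondary-structure states. The Python sketch below shows how such a score could be computed, with and without boundary residues. The abstract does not define "boundary residues", so the rule used here (a residue whose true state differs from that of an adjacent residue) is an assumption, as are the function name `q3_accuracy` and the example strings.

```python
def q3_accuracy(predicted, actual, ignore_boundaries=False):
    """Fraction of residues whose predicted 3-state label matches the truth.

    Labels are single characters, e.g. 'H' (helix), 'E' (strand), 'C' (coil).
    With ignore_boundaries=True, residues at a change of true state are
    skipped; this boundary rule is an illustrative assumption.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(actual)
    correct = 0
    total = 0
    for i in range(n):
        if ignore_boundaries:
            left_differs = i > 0 and actual[i - 1] != actual[i]
            right_differs = i < n - 1 and actual[i + 1] != actual[i]
            if left_differs or right_differs:
                continue  # skip residues adjacent to a state change
        total += 1
        if predicted[i] == actual[i]:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical 10-residue example with one error at a helix boundary.
actual = "CCHHHHEEEC"
predicted = "CCCHHHEEEC"

q3_all = q3_accuracy(predicted, actual)
q3_core = q3_accuracy(predicted, actual, ignore_boundaries=True)
print(f"Q3 (all residues): {q3_all:.2f}")
print(f"Q3 (boundaries ignored): {q3_core:.2f}")
```

Because prediction errors tend to cluster where one structural element ends and the next begins, excluding those residues usually raises the score, which is consistent with the gap between the two accuracies reported in the abstract.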