Paolo Ribeca
Institute for Animal Health
Publication
Featured research published by Paolo Ribeca.
Nature | 2012
Sarah Djebali; Carrie A. Davis; Angelika Merkel; Alexander Dobin; Timo Lassmann; Ali Mortazavi; Andrea Tanzer; Julien Lagarde; Wei Lin; Felix Schlesinger; Chenghai Xue; Georgi K. Marinov; Jainab Khatun; Brian A. Williams; Chris Zaleski; Joel Rozowsky; Maik Röder; Felix Kokocinski; Rehab F. Abdelhamid; Tyler Alioto; Igor Antoshechkin; Michael T. Baer; Nadav S. Bar; Philippe Batut; Kimberly Bell; Ian Bell; Sudipto Chakrabortty; Xian Chen; Jacqueline Chrast; Joao Curado
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
Nature | 2013
Tuuli Lappalainen; Michael Sammeth; Marc R. Friedländer; Peter A. C. 't Hoen; Jean Monlong; Manuel A. Rivas; Mar Gonzàlez-Porta; Natalja Kurbatova; Thasso Griebel; Pedro G. Ferreira; Matthias Barann; Thomas Wieland; Liliana Greger; M. van Iterson; Jonas Carlsson Almlöf; Paolo Ribeca; Irina Pulyakhina; Daniela Esser; Thomas Giger; Andrew Tikhonov; Marc Sultan; G. Bertier; Daniel G. MacArthur; Monkol Lek; Esther Lizano; Henk P. J. Buermans; Ismael Padioleau; Thomas Schwarzmayr; Olof Karlberg; Halit Ongen
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of messenger RNA and microRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project—the first uniformly processed high-throughput RNA-sequencing data from multiple human populations with high-quality genome sequences. We discover extremely widespread genetic variation affecting the regulation of most genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on the cellular mechanisms of regulatory and loss-of-function variation, and allows us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
Nature Methods | 2012
Santiago Marco-Sola; Michael Sammeth; Roderic Guigó; Paolo Ribeca
Because of ever-increasing throughput requirements of sequencing data, most existing short-read aligners have been designed to focus on speed at the expense of accuracy. The Genome Multitool (GEM) mapper can leverage string matching by filtration to search the alignment space more efficiently, simultaneously delivering precision (performing fully tunable exhaustive searches that return all existing matches, including gapped ones) and speed (being several times faster than comparable state-of-the-art tools).
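The idea of string matching by filtration mentioned in this abstract can be illustrated with a minimal sketch. It is not the GEM implementation: it assumes substitutions only (no gaps), uses the classic pigeonhole argument (a pattern split into k+1 pieces must contain at least one piece that matches exactly whenever there are at most k mismatches), and the function name `filtration_search` is hypothetical.

```python
def filtration_search(pattern: str, text: str, max_mismatches: int) -> list[int]:
    """Return every start position where `pattern` occurs in `text`
    with at most `max_mismatches` substitutions (illustrative only)."""
    m = len(pattern)
    pieces = max_mismatches + 1
    if m < pieces:
        raise ValueError("pattern is too short for this many mismatches")

    # Split the pattern into `pieces` non-empty contiguous seeds.
    bounds = [i * m // pieces for i in range(pieces + 1)]

    # Filtration step: exact seed hits nominate candidate start positions.
    candidates = set()
    for k in range(pieces):
        lo, hi = bounds[k], bounds[k + 1]
        seed = pattern[lo:hi]
        pos = text.find(seed)
        while pos != -1:
            start = pos - lo
            if 0 <= start <= len(text) - m:
                candidates.add(start)
            pos = text.find(seed, pos + 1)

    # Verification step: count mismatches only at the candidate positions.
    hits = []
    for start in sorted(candidates):
        window = text[start:start + m]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits
```

Because every true match must contain at least one exact seed, the search stays exhaustive while the expensive verification runs only on a small subset of positions.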
PLOS ONE | 2012
Thomas Derrien; Jordi Estellé; Santiago Marco Sola; David G. Knowles; Emanuele Raineri; Roderic Guigó; Paolo Ribeca
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
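As a rough illustration of the concept, the sketch below assumes mappability at a position is taken as the reciprocal of the number of times the k-mer starting there occurs in the reference. It counts exact matches on a single strand only, so it does not reproduce the mapping-based algorithm or the mismatch handling described in the abstract; the function name `kmer_mappability` is hypothetical.

```python
from collections import Counter


def kmer_mappability(genome: str, k: int) -> list[float]:
    """Per-position mappability, assumed here to be 1 / (k-mer frequency).

    Simplifications: exact matches only, forward strand only.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and the genome length")

    # One k-mer per start position.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)

    # A unique k-mer scores 1.0; a k-mer seen n times scores 1/n.
    return [1.0 / counts[kmer] for kmer in kmers]
```

Under this assumption, positions scoring 1.0 are uniquely mappable, while low scores flag repeats where read placement is ambiguous.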
PLOS ONE | 2012
Sarah Djebali; Julien Lagarde; Philipp Kapranov; Vincent Lacroix; Christelle Borel; Jonathan M. Mudge; Cédric Howald; Sylvain Foissac; Catherine Ucla; Jacqueline Chrast; Paolo Ribeca; David Martin; Ryan R. Murray; Xinping Yang; Lila Ghamsari; Chenwei Lin; Ian Bell; Erica Dumais; Jorg Drenkow; Michael L. Tress; Josep Lluís Gelpí; Modesto Orozco; Alfonso Valencia; Nynke L. van Berkum; Bryan R. Lajoie; Marc Vidal; John A. Stamatoyannopoulos; Philippe Batut; Alexander Dobin; Jennifer Harrow
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Nature Communications | 2015
Tyler Alioto; Ivo Buchhalter; Sophia Derdak; Barbara Hutter; Matthew Eldridge; Eivind Hovig; Lawrence E. Heisler; Timothy Beck; Jared T. Simpson; Laurie Tonon; Anne Sophie Sertier; Ann Marie Patch; Natalie Jäger; Philip Ginsbach; Ruben M. Drews; Nagarajan Paramasivam; Rolf Kabbe; Sasithorn Chotewutmontri; Nicolle Diessl; Christopher Previti; Sabine Schmidt; Benedikt Brors; Lars Feuerbach; Michael Heinold; Susanne Gröbner; Andrey Korshunov; Patrick Tarpey; Adam Butler; Jonathan Hinton; David Jones
As whole-genome sequencing for cancer genome analysis becomes a clinical tool, a full understanding of the variables affecting sequencing analysis output is required. Here using tumour-normal sample pairs from two different types of cancer, chronic lymphocytic leukaemia and medulloblastoma, we conduct a benchmarking exercise within the context of the International Cancer Genome Consortium. We compare sequencing methods, analysis pipelines and validation methods. We show that using PCR-free methods and increasing sequencing depth to ∼100× shows benefits, as long as the tumour:control coverage ratio remains balanced. We observe widely varying mutation call rates and low concordance among analysis pipelines, reflecting the artefact-prone nature of the raw data and lack of standards for dealing with the artefacts. However, we show that, using the benchmark mutation set we have created, many issues are in fact easy to remedy and have an immediate positive impact on mutation detection accuracy.
Bioinformatics | 2010
Emanuele Raineri; Paolo Ribeca; Luis Serrano; Tobias Maier
MOTIVATION: Molecular chaperones prevent the aggregation of their substrate proteins and thereby ensure that they reach their functional native state. The bacterial GroEL/ES chaperonin system is understood in great detail on a structural, mechanistic and functional level; its interactors in Escherichia coli have been identified and characterized. However, a long-standing question in the field is: What makes a protein a chaperone substrate? RESULTS: Here we identify, using a bioinformatics-based approach, a simple set of quantities that characterize the GroEL-substrate proteome. We define three novel parameters differentiating GroEL interactors from other cellular proteins: lower rate of evolution, hydrophobicity and aggregation propensity. Combining them with other known features into a simple Bayesian predictor allows us to identify known homologous and heterologous GroEL substrate proteins. We discuss our findings in relation to established mechanisms of protein folding and evolutionary buffering by chaperones.
Bioinformatics | 2008
Paolo Ribeca; Emanuele Raineri
BACKGROUND: The computation of the statistical properties of motif occurrences has an obviously relevant application: patterns that are significantly over- or under-represented in genomes or proteins are interesting candidates for biological roles. However, the problem is computationally hard; as a result, virtually all the existing motif finders use fast but approximate scoring functions, in spite of the fact that they have been shown to produce systematically incorrect results. A few interesting exact approaches are known, but they are very slow and hence not practical in the case of realistic sequences. RESULTS: We give an exact solution, solely based on deterministic finite-state automata (DFA), to the problem of finding the whole relevant part of the probability distribution function of a simple-word motif in a homogeneous (biological) sequence. Out of that, the z-value can always be computed, while the P-value can be obtained either when it is not too extreme with respect to the number of floating-point digits available in the implementation, or when the number of pattern occurrences is moderately low. In particular, the time complexity of the algorithms for Markov models of moderate order (0 ≤ m ≤ 2) is far better than that of Nuel, which was the fastest similar exact algorithm known to date; in many cases, even approximate methods are outperformed. CONCLUSIONS: DFA are a standard tool of computer science for the study of patterns; previous works in biology propose algorithms involving automata, but there they are used, respectively, as a first step to write a generating function, or to build a finite Markov-chain imbedding (FMCI). In contrast, we directly rely on DFA to perform the calculations; thus we manage to obtain an algorithm which is both easily interpretable and efficient. This approach can be used for exact statistical studies of very long genomes and protein sequences, as we illustrate with some examples on the scale of the human genome.
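To give a flavour of how a DFA can drive an exact occurrence-count calculation, here is a minimal sketch. It is not the authors' algorithm: it assumes an order-0 model (independent, identically distributed letters), counts overlapping occurrences, works in plain floating point, and the names `build_dfa` and `occurrence_distribution` are hypothetical.

```python
from collections import defaultdict


def build_dfa(word: str, alphabet: str) -> list[dict[str, int]]:
    """KMP-style DFA: state s means the last s letters read equal word[:s]."""
    m = len(word)
    if m == 0:
        raise ValueError("word must be non-empty")
    # fail[j] = length of the longest proper border of word[:j].
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and word[i] != word[k]:
            k = fail[k]
        if word[i] == word[k]:
            k += 1
        fail[i + 1] = k
    dfa: list[dict[str, int]] = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            if state < m and word[state] == ch:
                row[ch] = state + 1          # extend the current match
            elif state == 0:
                row[ch] = 0                  # nothing matched yet
            else:
                row[ch] = dfa[fail[state]][ch]  # fall back to a shorter border
        dfa.append(row)
    return dfa


def occurrence_distribution(word: str, probs: dict[str, float], n: int) -> list[float]:
    """P(number of overlapping occurrences of `word` = c) in a random
    sequence of length n with independent letters drawn from `probs`."""
    m = len(word)
    dfa = build_dfa(word, "".join(probs))
    current = {(0, 0): 1.0}  # (DFA state, occurrences so far) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in current.items():
            for ch, q in probs.items():
                new_state = dfa[state][ch]
                # Entering the accepting state means one more occurrence.
                nxt[(new_state, count + (new_state == m))] += p * q
        current = nxt
    result = [0.0] * (max(n - m + 1, 0) + 1)
    for (_, count), p in current.items():
        result[count] += p
    return result
```

The dynamic programme tracks only the automaton state and the running count, so the sequence itself is never enumerated; extending the sketch to higher-order Markov models would require enlarging the state with the preceding letters.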
Genome Biology | 2016
Federico Abascal; André Corvelo; Fernando Cruz; José Luis Villanueva-Cañas; Anna Vlasova; Marina Marcet-Houben; Begoña Martínez-Cruz; Jade Yu Cheng; Pablo Prieto; Víctor Quesada; Javier Quilez; Gang Li; F. García; Miriam Rubio-Camarillo; Leonor Frias; Paolo Ribeca; Salvador Capella-Gutiérrez; Jose Manuel Rodriguez; Francisco Câmara; Ernesto Lowy; Luca Cozzuto; Ionas Erb; Michael L. Tress; Jose Luis Rodriguez-Ales; Jorge Ruiz-Orera; Ferran Reverter; Mireia Casas-Marce; Laura Soriano; Javier R. Arango; Sophia Derdak
Background: Genomic studies of endangered species provide insights into their evolution and demographic history, reveal patterns of genomic erosion that might limit their viability, and offer tools for their effective conservation. The Iberian lynx (Lynx pardinus) is the most endangered felid and a unique example of a species on the brink of extinction. Results: We generate the first annotated draft of the Iberian lynx genome and carry out genome-based analyses of lynx demography, evolution, and population genetics. We identify a series of severe population bottlenecks in the history of the Iberian lynx that predate its known demographic decline during the 20th century and have greatly impacted its genome evolution. We observe drastically reduced rates of weak-to-strong substitutions associated with GC-biased gene conversion and increased rates of fixation of transposable elements. We also find multiple signatures of genetic erosion in the two remnant Iberian lynx populations, including a high frequency of potentially deleterious variants and substitutions, as well as the lowest genome-wide genetic diversity reported so far in any species. Conclusions: The genomic features observed in the Iberian lynx genome may hamper short- and long-term viability through reduced fitness and adaptive potential. The knowledge and resources developed in this study will boost the research on felid evolution and conservation genomics and will benefit the ongoing conservation and management of this emblematic species.
BMC Genomics | 2017
Bernardo Rodríguez-Martín; Emilio Palumbo; Santiago Marco-Sola; Thasso Griebel; Paolo Ribeca; Graciela Alonso; Alberto Rastrojo; Begoña Aguado; Roderic Guigó; Sarah Djebali
Background: Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene rather than junction level assessment. Results: Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on this data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which were recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which have the potential to encode a chimeric protein for which we hypothesized a new role.
Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved. Conclusions: ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimeras, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.