Jakob Hull Havgaard
University of Copenhagen
Publications
Featured research published by Jakob Hull Havgaard.
Bioinformatics | 2007
Elfar Torarinsson; Jakob Hull Havgaard; Jan Gorodkin
Motivation: An apparent paradox in computational RNA structure prediction is that many methods require, in advance, a multiple alignment of a set of related sequences when searching for a common structure between them. However, such a multiple alignment is hard to obtain even for few sequences with low sequence similarity without simultaneously folding and aligning them. Furthermore, it is of interest to conduct a multiple alignment of RNA sequence candidates found from searching as few as two genomic sequences. Results: Here, based on the PMcomp program, we present a global multiple alignment program, foldalignM, which performs especially well on few sequences with low sequence similarity, and is comparable in performance with state-of-the-art programs in general. In addition, it can cluster sequences based on sequence and structure similarity and output a multiple alignment for each cluster. Furthermore, preliminary results with local datasets indicate that the program is useful for post-processing foldalign pairwise scans. Availability: The program foldalignM is implemented in Java and is, along with some accompanying Perl scripts, available at http://foldalign.ku.dk/
Bioinformatics | 2005
Jakob Hull Havgaard; Rune B. Lyngsø; Gary D. Stormo; Jan Gorodkin
Motivation: Searching for non-coding RNA (ncRNA) genes and structural RNA elements (eleRNA) are major challenges in gene finding today as these often are conserved in structure rather than in sequence. Even though the number of available methods is growing, it is still of interest to pairwise detect two genes with low sequence similarity, where the genes are part of a larger genomic region. Results: Here we present such an approach for pairwise local alignment which is based on foldalign and the Sankoff algorithm for simultaneous structural alignment of multiple sequences. We include the ability to conduct mutual scans of two sequences of arbitrary length while searching for common local structural motifs of some maximum length. This drastically reduces the complexity of the algorithm. The scoring scheme includes structural parameters corresponding to those available for free energy as well as for substitution matrices similar to RIBOSUM. The new foldalign implementation is tested on a dataset where the ncRNAs and eleRNAs have sequence similarity <40% and where the ncRNAs and eleRNAs are energetically indistinguishable from the surrounding genomic sequence context. The method is tested in two ways: (1) its ability to find the common structure between the genes only and (2) its ability to locate ncRNAs and eleRNAs in a genomic context. In case (1), it makes sense to compare with methods like Dynalign, and the performances are very similar, but foldalign is substantially faster. The structure prediction performance for a family is typically around 0.7 using the Matthews correlation coefficient. In case (2), the algorithm is successful at locating RNA families with an average sensitivity of 0.8 and a positive predictive value of 0.9 using a BLAST-like hit selection scheme. Availability: The program is available online at http://foldalign.kvl.dk/
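The abstract above reports sensitivity, positive predictive value and the Matthews correlation coefficient (MCC) without spelling out how they are computed. The following is a minimal sketch, not the authors' evaluation code: it assumes structures are given in dot-bracket notation without pseudoknots, scores at the base-pair level, and approximates the MCC by the geometric mean of sensitivity and PPV, an approximation often used for RNA structure evaluation. The function names are hypothetical.

```python
from math import sqrt


def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def structure_metrics(predicted, reference):
    """Compare predicted and reference structures at the base-pair level."""
    pred = pairs_from_dot_bracket(predicted)
    ref = pairs_from_dot_bracket(reference)
    tp = len(pred & ref)   # pairs predicted and present in the reference
    fp = len(pred - ref)   # pairs predicted but absent from the reference
    fn = len(ref - pred)   # reference pairs that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    # Assumption: MCC approximated as sqrt(sensitivity * PPV), which is
    # reasonable because true negatives vastly outnumber the other counts.
    mcc = sqrt(sensitivity * ppv)
    return sensitivity, ppv, mcc


if __name__ == "__main__":
    sens, ppv, mcc = structure_metrics("(((......)))", "((((....))))")
    print(f"sensitivity={sens:.2f} PPV={ppv:.2f} MCC~{mcc:.2f}")
```

In this toy comparison the prediction recovers three of the four reference pairs and adds no false pairs, so it loses sensitivity while PPV stays at 1.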
Genome Biology | 2007
Jan Gorodkin; Susanna Cirera; Jakob Hedegaard; Michael J. Gilchrist; Frank Panitz; Claus Jørgensen; Karsten Scheibye-Knudsen; Troels Arvin; Steen Lumholdt; Milena Sawera; Trine Green; Bente Nielsen; Jakob Hull Havgaard; Carina Rosenkilde; Jun-Jun Wang; Heng Li; Ruiqiang Li; Bin Liu; Songnian Hu; Wei Dong; Wei Li; Jun Qing Yu; Jian Wang; Hans-Henrik Stærfeldt; Rasmus Wernersson; Lone Madsen; Bo Thomsen; Henrik Hornshøj; Zhan Bujie; Xuegang Wang
Background: Knowledge of the structure of gene expression is essential for mammalian transcriptomics research. We analyzed a collection of more than one million porcine expressed sequence tags (ESTs), of which two-thirds were generated in the Sino-Danish Pig Genome Project and one-third are from public databases. The Sino-Danish ESTs were generated from one normalized and 97 non-normalized cDNA libraries representing 35 different tissues and three developmental stages. Results: Using the Distiller package, the ESTs were assembled to roughly 48,000 contigs and 73,000 singletons, of which approximately 25% have a high confidence match to UniProt. Approximately 6,000 new porcine gene clusters were identified. Expression analysis based on the non-normalized libraries resulted in the following findings. The distribution of cluster sizes is scaling invariant. Brain and testes are among the tissues with the greatest number of different expressed genes, whereas tissues with more specialized function, such as developing liver, have fewer expressed genes. There are at least 65 high confidence housekeeping gene candidates and 876 cDNA library-specific gene candidates. We identified differential expression of genes between different tissues, in particular brain/spinal cord, and found patterns of correlation between genes that share expression in pairs of libraries. Finally, there was remarkable agreement in expression between specialized tissues according to Gene Ontology categories. Conclusion: This EST collection, the largest to date in pig, represents an essential resource for annotation, comparative genomics, assembly of the pig genome sequence, and further porcine transcription studies.
Nucleic Acids Research | 2005
Jakob Hull Havgaard; Rune B. Lyngsø; Jan Gorodkin
Foldalign is a Sankoff-based algorithm for making structural alignments of RNA sequences. Here, we present a web server for making pairwise alignments between two RNA sequences, using the recently updated version of foldalign. The server can be used to scan two sequences for a common structural RNA motif of limited size, or the entire sequences can be aligned locally or globally. The web server offers a graphical interface, which makes it simple to make alignments and manually browse the results. The web server can be accessed at .
Trends in Biotechnology | 2010
Jan Gorodkin; Ivo L. Hofacker; Elfar Torarinsson; Zizhen Yao; Jakob Hull Havgaard; Walter L. Ruzzo
Growing recognition of the numerous, diverse and important roles played by non-coding RNA in all organisms motivates better elucidation of these cellular components. Comparative genomics is a powerful tool for this task and is arguably preferable to any high-throughput experimental technology currently available, because evolutionary conservation highlights functionally important regions. Conserved secondary structure, rather than primary sequence, is the hallmark of many functionally important RNAs, because compensatory substitutions in base-paired regions preserve structure. Unfortunately, such substitutions also obscure sequence identity and confound alignment algorithms, which complicates analysis greatly. This paper surveys recent computational advances in this difficult arena, which have enabled genome-scale prediction of cross-species conserved RNA elements. These predictions suggest that a wealth of these elements indeed exist.
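The point about compensatory substitutions can be made concrete with a toy example. The sketch below is purely illustrative and is not taken from the review: the two sequences, the shared hairpin structure and the function names are invented, and only canonical and G-U wobble pairs are counted as valid. It shows how two sequences can keep every base pair of a hairpin while sharing little sequence identity.

```python
# Watson-Crick pairs plus G-U wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairs_from_dot_bracket(structure):
    """Return the list of (i, j) base pairs encoded in a dot-bracket string."""
    stack, pairs = [], []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.append((stack.pop(), pos))
    return pairs


def compensatory_summary(seq_a, seq_b, structure):
    """Return (sequence identity, number of compensatory base-pair changes)."""
    if not (len(seq_a) == len(seq_b) == len(structure)):
        raise ValueError("sequences and structure must have equal length")
    identity = sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    compensatory = 0
    for i, j in pairs_from_dot_bracket(structure):
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        both_valid = pair_a in CANONICAL and pair_b in CANONICAL
        both_changed = seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]
        if both_valid and both_changed:
            compensatory += 1
    return identity, compensatory


if __name__ == "__main__":
    # Hypothetical hairpins: every stem position differs, the loop is identical.
    identity, compensatory = compensatory_summary(
        "GGCAGAAAUGCC", "CCGUGAAAACGG", "((((....))))"
    )
    print(f"identity={identity:.2f}, compensatory pairs={compensatory}")
```

Here all four stem pairs are preserved through double substitutions, yet only the four loop positions are identical, which is the situation that confounds purely sequence-based alignment.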
Bioinformatics | 2009
Bogumil Kaczkowski; Elfar Torarinsson; Kristin Reiche; Jakob Hull Havgaard; Peter F. Stadler; Jan Gorodkin
MicroRNAs (miRNAs) are a group of small, approximately 21 nt long, riboregulators inhibiting gene expression at a post-transcriptional level. Their most distinctive structural feature is the foldback hairpin of their precursor pre-miRNAs. Even though each pre-miRNA deposited in miRBase has its secondary structure already predicted, little is known about the patterns of structural conservation among pre-miRNAs. We address this issue by clustering the human pre-miRNA sequences based on pairwise sequence and secondary structure alignment using FOLDALIGN, followed by global multiple alignment of the obtained clusters by WAR. As a result, the common secondary structure was successfully determined for four FOLDALIGN clusters: the RF00027 structural family of the Rfam database and three clusters with previously undescribed consensus structures. Availability: http://genome.ku.dk/resources/mirclust
Intelligent Systems in Molecular Biology | 2007
Frank Panitz; Henrik Stengaard; Henrik Hornshøj; Jan Gorodkin; Jakob Hedegaard; Susanna Cirera; Bo Thomsen; Lone Madsen; Anette Høj; Rikke K. Vingborg; Bujie Zahn; Xuegang Wang; Xuefei Wang; Rasmus Wernersson; Claus B. Jørgensen; Karsten Scheibye-Knudsen; Troels Arvin; Steen Lumholdt; Milena Sawera; Trine Green; Bente Nielsen; Jakob Hull Havgaard; Søren Brunak; Merete Fredholm; Christian Bendixen
Motivation: Single nucleotide polymorphism (SNP) analysis is an important means to study genetic variation. A fast and cost-efficient approach to identify large numbers of novel candidates is the SNP mining of large-scale sequencing projects. The increasing availability of sequence trace data in public repositories makes it feasible to evaluate SNP predictions on the DNA chromatogram level. MAVIANT, a platform-independent Multipurpose Alignment VIewing and Annotation Tool, provides DNA chromatogram and alignment views and facilitates evaluation of predictions. In addition, it supports direct manual annotation, which is immediately accessible and can be easily shared with external collaborators. Results: Large-scale SNP mining of polymorphisms based on porcine EST sequences yielded more than 7900 candidate SNPs in coding regions (cSNPs), which were annotated relative to the human genome. Non-synonymous SNPs were analyzed for their potential effect on the protein structure/function using the PolyPhen and SIFT prediction programs. Predicted SNPs and annotations are stored in a web-based database. Using MAVIANT, SNPs can be visually verified based on the DNA sequencing traces. A subset of candidate SNPs was selected for experimental validation by resequencing and genotyping. This study provides a web-based DNA chromatogram and contig browser that facilitates the evaluation and selection of candidate SNPs, which can be applied as genetic markers for genome-wide genetic studies. Availability: The stand-alone version of the MAVIANT program for local use is freely available under GPL license terms at http://snp.agrsci.dk/maviant. Supplementary information: Supplementary data are available at Bioinformatics online.
BMC Genomics | 2014
Christian Anthon; Hakim Tafer; Jakob Hull Havgaard; Bo Thomsen; Jakob Hedegaard; Stefan E. Seemann; Sachin Pundhir; Stephanie Kehr; Sebastian Bartschat; Mathilde Nielsen; Rasmus Oestergaard Nielsen; Merete Fredholm; Peter F. Stadler; Jan Gorodkin
Background: Annotating mammalian genomes for noncoding RNAs (ncRNAs) is nontrivial since far from all ncRNAs are known and the computational models are resource demanding. Currently, the human genome holds the best mammalian ncRNA annotation, a result of numerous efforts by several groups. However, a more direct strategy is desired for the increasing number of sequenced mammalian genomes of which some, such as the pig, are relevant as disease models and production animals. Results: We present a comprehensive annotation of structured RNAs in the pig genome. Combining sequence and structure similarity search as well as class-specific methods, we obtained a conservative set with a total of 3,391 structured RNA loci of which 1,011 and 2,314, respectively, hold strong sequence and structure similarity to structured RNAs in existing databases. The RNA loci cover 139 cis-regulatory element loci, 58 lncRNA loci, 11 conflicts of annotation, and 3,183 ncRNA genes. The ncRNA genes comprise 359 miRNAs, 8 ribozymes, 185 rRNAs, 638 snoRNAs, 1,030 snRNAs, 810 tRNAs and 153 ncRNA genes not belonging to the aforementioned classes. When running the pipeline on a local shuffled version of the genome, we obtained no matches at the highest confidence level. Additional analysis of RNA-seq data from a pooled library from 10 different pig tissues added another 165 miRNA loci, yielding an overall annotation of 3,556 structured RNA loci. This annotation represents our best effort at making an automated annotation. To further enhance the reliability, 571 of the 3,556 structured RNAs were manually curated by methods depending on the RNA class while 1,581 were declared as pseudogenes. We further created a multiple alignment of pig against 20 representative vertebrates, from which RNAz predicted 83,859 de novo RNA loci with conserved RNA structures. 528 of the RNAz predictions overlapped with the homology-based annotation or novel miRNAs. We further present a substantial synteny analysis which includes 1,004 lineage-specific de novo RNA loci and 4 ncRNA loci in the known annotation specific for Laurasiatheria (pig, cow, dolphin, horse, cat, dog, hedgehog). Conclusions: We have obtained one of the most comprehensive annotations for structured ncRNAs of a mammalian genome, which is likely to play central roles in both health modelling and production. The core annotation is available in Ensembl 70 and the complete annotation is available at http://rth.dk/resources/rnannotator/susscr102/version1.02.
Bioinformatics | 2017
Milad Miladi; Alexander Junge; Fabrizio Costa; Stefan E. Seemann; Jakob Hull Havgaard; Jan Gorodkin; Rolf Backofen
Motivation: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account. Results: Here, we present RNAscClust, the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments. Availability and Implementation: RNAscClust is available at http://www.bioinf.uni-freiburg.de/Software/RNAscClust. Supplementary information: Supplementary data are available at Bioinformatics online.
Current Protocols in Bioinformatics | 2012
Jakob Hull Havgaard; Simranjeet Kaur; Jan Gorodkin
This unit describes how to use Foldalign and FoldalignM to make structural alignments of non-protein-coding RNA (ncRNA). These tools can be used to find new ncRNAs, to find the structure of novel ncRNAs, and to improve alignments for known ncRNAs. Curr. Protoc. Bioinform. 39:12.11.1-12.11.15.