Diogo Pratas | Researchain

Publication

Featured research published by Diogo Pratas.

Nucleic Acids Research | 2012

GReEn: a tool for efficient compression of genome resequencing data

Armando J. Pinho; Diogo Pratas; Sara P. Garcia

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, runs faster, and achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.

Bioinformatics | 2014

MFCompress: a compression tool for FASTA and multi-FASTA data.

Armando J. Pinho; Diogo Pratas

Motivation: The data deluge phenomenon is becoming a serious problem in most genomic centers. To alleviate it, general purpose tools, such as gzip, are used to compress the data. However, although pervasive and easy to use, these tools fall short when the intention is to reduce the data as much as possible, for example, for medium- and long-term storage. A number of algorithms have been proposed for the compression of genomics data, but unfortunately only a few of them have been made available as usable and reliable compression tools. Results: In this article, we describe one such tool, MFCompress, specially designed for the compression of FASTA and multi-FASTA files. In comparison to gzip and applied to multi-FASTA files, MFCompress can provide additional average compression gains of almost 50%, i.e. it potentially doubles the available storage, although at the cost of some more computation time. On highly redundant datasets, and in comparison with gzip, 8-fold size reductions have been obtained. Availability: Both source code and binaries for several operating systems are freely available for non-commercial use at http://bioinformatics.ua.pt/software/mfcompress/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

IEEE Signal Processing Workshop on Statistical Signal Processing | 2011

Bacteria DNA sequence compression using a mixture of finite-context models

Armando J. Pinho; Diogo Pratas; Paulo Jorge S. G. Ferreira

The ability of finite-context models to compress DNA sequences has been demonstrated in some recent works. In this paper, we further explore this line, proposing a compression method based on eight finite-context models, with orders from two to sixteen, whose probabilities are averaged using weights calculated through a recursive procedure. The method was tested on a total of 2,338 sequences belonging to bacterial genomes, with sizes ranging from 1,286 to 13,033,779 bases, showing better compression results than the state-of-the-art XM DNA coding algorithm and also faster operation.
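The idea of averaging several finite-context models with recursively updated weights can be illustrated with a small sketch. This is not the authors' implementation: the class and function names, the additive smoothing constant `alpha`, the forgetting factor `gamma`, the three example orders, and the specific update rule (each weight raised to `gamma`, multiplied by the probability its model gave to the symbol that occurred, then normalized) are all assumptions chosen for illustration. The function returns the ideal code length in bits per base rather than producing a bitstream.

```python
import math

ALPHABET = "ACGT"


class FiniteContextModel:
    """Order-k model: counts of each base seen after each k-base context."""

    def __init__(self, order, alpha=1.0):
        self.order = order      # context length, must be >= 1
        self.alpha = alpha      # additive smoothing pseudo-count (assumed)
        self.counts = {}        # context string -> [count_A, count_C, count_G, count_T]

    def probabilities(self, history):
        ctx = history[-self.order:]
        c = self.counts.get(ctx, (0, 0, 0, 0))
        total = sum(c) + len(ALPHABET) * self.alpha
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol_index):
        ctx = history[-self.order:]
        self.counts.setdefault(ctx, [0, 0, 0, 0])[symbol_index] += 1


def mixture_bits(sequence, orders=(2, 4, 8), gamma=0.9):
    """Average ideal code length (bits per base) under a weighted model mixture."""
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, base in enumerate(sequence):
        s = ALPHABET.index(base)
        history = sequence[max(0, i - max_order):i]
        preds = [m.probabilities(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * pr[s] for w, pr in zip(weights, preds))
        total_bits += -math.log2(p)
        # Assumed recursive rule: reward models that predicted this symbol well.
        weights = [(w ** gamma) * pr[s] for w, pr in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, s)
    return total_bits / len(sequence)
```

Calling `mixture_bits("ACGT" * 200)` yields a value far below the 2 bits per base of a uniform code, because every model quickly learns the repeating pattern. A real compressor would feed the mixture probabilities to an arithmetic coder and would bound the memory used by the high-order contexts.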

Bioinformatics | 2015

Three minimal sequences found in Ebola virus genomes and absent from human DNA

Raquel M. Silva; Diogo Pratas; Luísa Castro; Armando J. Pinho; Paulo Jorge S. G. Ferreira

Motivation: Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available; thus, novel diagnostic tools and druggable targets are needed. Results: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example. Availability and Implementation: EAGLE is freely available for non-commercial purposes at http://bioinformatics.ua.pt/software/eagle. Contact: [email protected]; [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.
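The core question in this abstract, which short sequences occur in a pathogen genome but nowhere in a host genome, can be sketched with plain set operations on k-mers. The code below is an illustrative toy, not EAGLE: the function names are hypothetical, everything is held in memory, and reverse complements are ignored, which a practical tool working on a full human genome could not afford to do.

```python
def kmers(sequence, k):
    """Set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shortest_absent_kmers(target, reference, k_min=1, k_max=14):
    """Find the smallest k for which some k-mer of `target` is missing from `reference`.

    Returns (k, sorted list of such k-mers), or None if no k in the range qualifies.
    """
    for k in range(k_min, k_max + 1):
        absent = kmers(target, k) - kmers(reference, k)
        if absent:
            return k, sorted(absent)
    return None


# Toy example: "GG" occurs in the target but never in the reference.
result = shortest_absent_kmers("ACGGTA", "ACGTACGTTGCA")
print(result)  # (2, ['GG'])
```

Searching from the smallest k upward mirrors the abstract's focus on the shortest such sequences; for genome-scale inputs the sets would need to be replaced by a compact index.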

PACBB | 2011

Compressing the Human Genome Using Exclusively Markov Models

Diogo Pratas; Armando J. Pinho

Models that rely exclusively on the Markov property, usually known as finite-context models, can model DNA sequences without considering mechanisms that take direct advantage of exact and approximate repeats. These models provide probability estimates that depend on the recent past of the sequence and have been used for data compression. In this paper, we investigate some properties of the finite-context models and we use these properties in order to improve the compression. The results are presented using the human genome as example.

Information-an International Interdisciplinary Journal | 2016

A Survey on Data Compression Methods for Biological Sequences

Morteza Hosseini; Diogo Pratas; Armando J. Pinho

The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge: it reduces storage space and processing costs, along with speeding up data transmission. In this paper, we provide a comprehensive survey of existing compression approaches that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper to the approaches proposed for the compression of different file formats, such as FASTA, as well as FASTQ and SAM/BAM, which contain quality scores and metadata, in addition to the biological sequences. Then, we present a comparison of the performance of several methods, in terms of compression ratio, memory usage and compression/decompression time. Finally, we present some suggestions for future research on biological data compression.

Scientific Reports | 2015

An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Diogo Pratas; Raquel M. Silva; Armando J. Pinho; Paulo Jorge S. G. Ferreira

Species evolution is indirectly registered in their genomic structure. The emergence of, and advances in, sequencing technology have provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.

IEEE Transactions on Information Theory | 2013

A Compression Model for DNA Multiple Sequence Alignment Blocks

L. M. O. de Matos; Diogo Pratas; Armando J. Pinho

A particularly voluminous dataset in molecular genomics, known as whole genome alignments, has gained considerable importance over the last years. In this paper, we propose a compression modeling approach for the multiple sequence alignment (MSA) blocks, which make up most of these datasets. Our method is based on a mixture of finite-context models. Contrary to other recent approaches, it addresses both the DNA bases and gap symbols at once, better exploring the existing correlations. For comparison with previous methods, our algorithm was tested on the multiz28way dataset. On average, it attained 0.94 bits per symbol, approximately 7% better than the previous best, for a similar computational complexity. We also tested the model on the most recent dataset, multiz46way. In this dataset, which contains alignments of 46 different species, our compression model achieved an average of 0.72 bits per MSA block symbol.

Data Compression Conference | 2016

Efficient Compression of Genomic Sequences

Diogo Pratas; Armando J. Pinho; Paulo Jorge S. G. Ferreira

The number of genomic sequences is growing substantially. Besides discarding part of the data, the only efficient possibility for coping with this trend is data compression. We present an efficient compressor for genomic sequences, allowing both reference-free and referential compression. This compressor uses a mixture of context models of several orders, according to two model classes: reference and target. A new type of context model, which is capable of tolerating substitution errors, is introduced. For ensuring flexibility regarding hardware specifications, the compressor uses cache-hashes in high order models. The results show additional compression gains over several specific top tools in different levels of redundancy. The implementation is available at http://bioinformatics.ua.pt/software/geco/.

BMC Research Notes | 2014

XS: a FASTQ read simulator

Diogo Pratas; Armando J. Pinho; João M. O. S. Rodrigues

Background: The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. Findings: We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). Conclusions: XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.
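To make the notion of a reference-free FASTQ simulator concrete, the sketch below generates the three FASTQ components (headers, DNA sequences and quality scores) with separate functions. It is not XS and does not reproduce its models or options: the function names, the `sim_read_` header pattern, the uniform base and quality distributions, and the Phred+33 quality range of 2 to 40 are all assumptions made for illustration.

```python
import random


def make_header(index):
    """Header line for read number `index` (hypothetical naming pattern)."""
    return f"@sim_read_{index}"


def make_bases(rng, length):
    """Bases drawn uniformly and independently; real reads are not this simple."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def make_qualities(rng, length, min_q=2, max_q=40):
    """Phred+33 encoded qualities drawn uniformly from [min_q, max_q]."""
    return "".join(chr(33 + rng.randint(min_q, max_q)) for _ in range(length))


def simulate_fastq(n_reads, read_length, seed=0):
    """Return `n_reads` four-line FASTQ records as a single string."""
    rng = random.Random(seed)  # fixed seed makes the output reproducible
    records = []
    for i in range(1, n_reads + 1):
        records.append(
            f"{make_header(i)}\n"
            f"{make_bases(rng, read_length)}\n"
            "+\n"
            f"{make_qualities(rng, read_length)}\n"
        )
    return "".join(records)


if __name__ == "__main__":
    print(simulate_fastq(n_reads=2, read_length=20), end="")
```

Keeping the three generators separate makes it straightforward to swap in a more realistic model for one component, for example position-dependent quality decay, without touching the others.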