Jonas S. Almeida

University of Alabama at Birmingham

Publication

Featured research published by Jonas S. Almeida.

Bioinformatics | 2004

Comparative evaluation of word composition distances for the recognition of SCOP relationships

Susana Vinga; Rodrigo Gouveia-Oliveira; Jonas S. Almeida

MOTIVATION Alignment-free metrics were recently reviewed by the authors but have not until now been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed there. It also presents a new distance definition between protein sequences, the W-metric, which bridges alignment metrics, such as scores produced by the Smith-Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and information content. RESULTS The comparative study reported here used the SCOP/ASTRAL protein structure hierarchical database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic curves. Although alignment methods resulted in very good classification accuracy at family and superfamily levels, alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is smaller, such as for recognition of fold or class relationships. This observation justifies their advantageous use to pre-filter homologous proteins, since word statistics techniques are computed much faster than the alignment methods. AVAILABILITY All MATLAB code used to generate the data is available upon request from the authors. Additional material available at http://bioinformatics.musc.edu/wmetric
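
To make the idea of an L-tuple composition distance concrete, the following is a minimal Python sketch, not the authors' MATLAB code. It counts overlapping words of length L and compares two sequences by the Euclidean distance between their count vectors. The standardized variant shown here divides each squared difference by a per-word variance estimated from a reference set, which is one common reading of "Standard Euclidean Distance" and is an assumption, as are all function names and the toy sequences.

```python
from collections import Counter
from math import sqrt


def word_counts(seq, L):
    """Count overlapping L-tuples (words of length L) in seq."""
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


def euclidean_distance(x, y, L):
    """Euclidean distance between the L-tuple count vectors of x and y."""
    cx, cy = word_counts(x, L), word_counts(y, L)
    words = set(cx) | set(cy)
    return sqrt(sum((cx[w] - cy[w]) ** 2 for w in words))


def word_variances(seqs, L):
    """Sample variance of each word count over a reference set (assumed helper)."""
    counts = [word_counts(s, L) for s in seqs]
    words = set().union(*counts)
    n = len(counts)
    variances = {}
    for w in words:
        values = [c[w] for c in counts]
        mean = sum(values) / n
        variances[w] = sum((v - mean) ** 2 for v in values) / (n - 1)
    return variances


def standardized_euclidean_distance(x, y, L, variances):
    """Euclidean distance with each squared difference scaled by the word variance.

    Words with zero or unknown variance are skipped (an illustrative choice).
    """
    cx, cy = word_counts(x, L), word_counts(y, L)
    total = 0.0
    for w in set(cx) | set(cy):
        var = variances.get(w, 0.0)
        if var > 0:
            total += (cx[w] - cy[w]) ** 2 / var
    return sqrt(total)


# Hypothetical toy protein fragments, for illustration only.
reference = ["MKVLAAGIV", "MKVLSAGIV", "GGSGGSGGS", "MKTLAAGLV"]
variances = word_variances(reference, 2)
print(euclidean_distance(reference[0], reference[1], 2))
print(standardized_euclidean_distance(reference[0], reference[2], 2, variances))
```

Because both distances need only one pass over each sequence to build the counts, they are cheap compared with dynamic-programming alignment, which is the property the abstract relies on when it suggests word statistics as a pre-filter.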

International Conference on Semantic Systems | 2013

Linked cancer genome atlas database

Muhammad Saleem; Shanmukha S. Padmanabhuni; Axel-Cyrille Ngonga Ngomo; Jonas S. Almeida; Stefan Decker; Helena F. Deus

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional pilot project to create an atlas of genetic mutations responsible for cancer. One of the aims of this project is to develop an infrastructure for making the cancer related data publicly accessible, to enable cancer researchers anywhere around the world to make and validate important discoveries. However, data in TCGA are organized as text archives in a set of directories. Devising bioinformatics applications to analyse such data is still challenging, as it requires downloading very large archives and parsing the relevant text files in order to collect the critical co-variates necessary for analysis. Furthermore, the various types of experimental results are not connected biologically, i.e., in order to truly exploit the data in the genome-wide context in which the TCGA project was devised, the data needs to be converted into a structured representation and made publicly available for remote querying and virtual integration. In this work, we address these issues by RDFizing data from TCGA and linking its elements to the Linked Open Data (LOD) Cloud. The outcome is the largest LOD data source (to the best of our knowledge), comprising over 30 billion triples. This data source can be exploited through publicly available SPARQL endpoints, thus providing an easy-to-use, time-efficient, and scalable solution to accessing TCGA. We also describe showcases which are enabled by the new linked data representation of TCGA presented in this paper.

Algorithms for Molecular Biology | 2012

Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

Susana Vinga; Alexandra M. Carvalho; Alexandre P. Francisco; Luís M. S. Russo; Jonas S. Almeida

Background: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
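
The suffix property stated in the abstract can be checked with a few lines of Python. This is a minimal sketch for DNA on the unit square, not the paper's implementation: the corner assignment, the starting point at the centre, the use of the maximum coordinate difference as the distance, and the example sequences are all assumptions made for illustration.

```python
# Corner of the unit square assigned to each nucleotide (an assumed convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading all of seq.

    Each symbol moves the current point halfway towards that symbol's corner.
    """
    x, y = start
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def max_coordinate_distance(p, q):
    """Largest absolute difference between the coordinates of p and q."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


# Two hypothetical sequences with different prefixes and a shared 7-long suffix.
shared_suffix = "GATTACA"
a = "ACGTTGCA" + shared_suffix
b = "TTTT" + shared_suffix

L = len(shared_suffix)
distance = max_coordinate_distance(cgr(a), cgr(b))
print(distance, 2 ** -L, distance <= 2 ** -L)
```

Every shared symbol halves the gap between the two points, so after L shared symbols the gap is at most 2^-L regardless of what came before. This is why the position of a point encodes the recent history of the sequence, which is the behaviour that lets CGR coordinates stand in for suffix-based data structures.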

Genome Biology | 2017

Alignment-free sequence comparison: benefits, applications, and tools

Andrzej Zielezinski; Susana Vinga; Jonas S. Almeida; Wojciech M. Karlowski

Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use in their own research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.

Archives of Pathology & Laboratory Medicine | 2016

Computational Pathology: A Path Ahead

David N. Louis; Michael Feldman; Alexis B. Carter; Anand S. Dighe; John D. Pfeifer; Lynn Bry; Jonas S. Almeida; Joel H. Saltz; Jonathan Braun; John E. Tomaszewski; John R. Gilbertson; John H. Sinard; Georg K. Gerber; Stephen J. Galli; Jeffrey A. Golden; Michael J. Becich

CONTEXT We define the scope and needs within the new discipline of computational pathology, a discipline critical to the future of both the practice of pathology and, more broadly, medical practice in general. OBJECTIVE To define the scope and needs of computational pathology. DATA SOURCES A meeting was convened in Boston, Massachusetts, in July 2014 prior to the annual Association of Pathology Chairs meeting, and it was attended by a variety of pathologists, including individuals highly invested in pathology informatics as well as chairs of pathology departments. CONCLUSIONS The meeting made recommendations to promote computational pathology, including clearly defining the field and articulating its value propositions; asserting that the value propositions for health care systems must include means to incorporate robust computational approaches to implement data-driven methods that aid in guiding individual and population health care; leveraging computational pathology as a center for data interpretation in modern health care systems; stating that realizing the value proposition will require working with institutional administrations, other departments, and pathology colleagues; declaring that a robust pipeline should be fostered that trains and develops future computational pathologists, for those with both pathology and nonpathology backgrounds; and deciding that computational pathology should serve as a hub for data-related research in health care systems. The dissemination of these recommendations to pathology and bioinformatics departments should help facilitate the development of computational pathology.

Journal of Biomedical Semantics | 2014

TopFed: TCGA tailored federated query processing and linking to LOD.

Muhammad Saleem; Shanmukha S Padmanabhuni; Axel-Cyrille Ngonga Ngomo; Aftab Iqbal; Jonas S. Almeida; Stefan Decker; Helena F. Deus

Background: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such large datasets is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis. Methods: We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and further proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. Having the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these SPARQL endpoints by proposing a TCGA tailored federated SPARQL query processing engine named TopFed. Results: We compare TopFed with a well established federation engine, FedX, in terms of source selection and query execution time by using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall), with query execution time about one third that of FedX. Conclusion: With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing.

Journal of Pathology Informatics | 2014

Mining genome sequencing data to identify the genomic features linked to breast cancer histopathology.

Zheng Ping; Gene P. Siegal; Jonas S. Almeida; Stuart J. Schnitt; Dejun Shen

Background: Genetics and genomics have radically altered our understanding of breast cancer progression. However, the genomic basis of various histopathologic features of breast cancer is not yet well-defined. Materials and Methods: The Cancer Genome Atlas (TCGA) is an international database containing a large collection of human cancer genome sequencing data. cBioPortal is a web tool developed for mining these sequencing data. We performed mining of TCGA sequencing data in an attempt to characterize the genomic features correlated with breast cancer histopathology. We first assessed the quality of the TCGA data using a group of genes with known alterations in various cancers. Both genome-wide gene mutation and copy number changes as well as a group of genes with a high frequency of genetic changes were then correlated with various histopathologic features of invasive breast cancer. Results: Validation of TCGA data using a group of genes with known alterations in breast cancer suggests that the TCGA has accurately documented the genomic abnormalities of multiple malignancies. Further analysis of TCGA breast cancer sequencing data shows that accumulation of specific genomic defects is associated with higher tumor grade, larger tumor size and receptor negativity. Distinct groups of genomic changes were found to be associated with the different grades of invasive ductal carcinoma. The mutator role of the TP53 gene was validated by genomic sequencing data of invasive breast cancer and TP53 mutation was found to play a critical role in defining high tumor grade. Conclusions: Data mining of the TCGA genome sequencing data is an innovative and reliable method to help characterize the genomic abnormalities associated with histopathologic features of invasive breast cancer.

Proteome Science | 2012

Serum profiling by MALDI-TOF mass spectrometry as a diagnostic tool for domoic acid toxicosis in California sea lions

Benjamin A. Neely; Jennifer L. Soper; Denise J. Greig; Kevin P Carlin; Elizabeth G. Favre; Frances M. D. Gulland; Jonas S. Almeida; Michael G. Janech

Background: There are currently no reliable markers of acute domoic acid toxicosis (DAT) for California sea lions. We investigated whether patterns of serum peptides could diagnose acute DAT. Serum peptides were analyzed by MALDI-TOF mass spectrometry from 107 sea lions (acute DAT n = 34; non-DAT n = 73). Artificial neural networks (ANN) were trained using MALDI-TOF data. Individual peaks and neural networks were qualified using an independent test set (n = 20). Results: No single peak was a good classifier of acute DAT, and ANN models were the best predictors of acute DAT. Performance measures for a single median ANN were: sensitivity, 100%; specificity, 60%; positive predictive value, 71%; negative predictive value, 100%. When 101 ANNs were combined and allowed to vote for the outcome, the performance measures were: sensitivity, 30%; specificity, 100%; positive predictive value, 100%; negative predictive value, 59%. Conclusions: These results suggest that MALDI-TOF peptide profiling and neural networks can perform either as a highly sensitive (100% negative predictive value) or a highly specific (100% positive predictive value) diagnostic tool for acute DAT. This also suggests that machine learning directed by populations of predictive models offers the ability to modulate the predictive effort into a specific type of error.
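
The four performance measures quoted in the abstract all follow from the confusion counts on the test set. The Python sketch below works through the arithmetic. The abstract does not report the counts or the class split of the 20 test animals, so the counts used here are an assumption: they are one set of values, based on 10 DAT and 10 non-DAT cases, that reproduces the reported percentages. The `vote` helper and its threshold are likewise hypothetical and only illustrate how a stricter voting rule trades sensitivity for specificity.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts for a binary diagnostic test."""
    tp: int  # DAT animals called DAT
    fp: int  # non-DAT animals called DAT
    tn: int  # non-DAT animals called non-DAT
    fn: int  # DAT animals called non-DAT

    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        return self.tn / (self.tn + self.fp)

    def ppv(self):
        return self.tp / (self.tp + self.fp)

    def npv(self):
        return self.tn / (self.tn + self.fn)


def vote(predictions, threshold):
    """Call DAT only if the fraction of models voting DAT reaches threshold."""
    return sum(predictions) / len(predictions) >= threshold


def report(name, c):
    print(
        f"{name}: sensitivity {c.sensitivity():.0%}, specificity {c.specificity():.0%}, "
        f"PPV {c.ppv():.0%}, NPV {c.npv():.0%}"
    )


# Assumed counts (10 DAT, 10 non-DAT), chosen to match the reported percentages.
single_ann = Confusion(tp=10, fp=4, tn=6, fn=0)
voted_anns = Confusion(tp=3, fp=0, tn=10, fn=7)

report("single median ANN", single_ann)
report("101 ANNs voting", voted_anns)
```

Under these assumed counts, the single network misses no DAT case but raises four false alarms, while the ensemble raises no false alarms but misses seven cases. That contrast is what the conclusion refers to as steering the predictive effort towards one type of error.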

BMC Bioinformatics | 2014

Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data

Charles Cole; Konstantinos Krampis; Konstantinos Karagiannis; Jonas S. Almeida; William J. Faison; Mona Motwani; Quan Wan; Anton Golikov; Yang Pan; Vahan Simonyan; Raja Mazumder

Background: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. Results: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). Conclusions: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.

BMC Bioinformatics | 2011

S3QL: A distributed domain specific language for controlled semantic integration of life sciences data

Helena F. Deus; Miriã C Correa; Romesh Stanislaus; Maria Miragaia; Wolfgang Maass; Hermínia de Lencastre; Ronan Fox; Jonas S. Almeida

Background: The value and usefulness of data increase when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutia such as provenance and access control. We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data. Results: Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases. Conclusions: S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.
