Publications


Featured research published by Jonas Dias.


EDBT/ICDT Workshops | 2013

Capturing and querying workflow runtime provenance with PROV: a practical approach

Flavio Costa; Vítor Silva; Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; Marta Mattoso

Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its particular approach to execute workflows and to capture and manage their provenance data. Due to the large scale of experiments, it may be unviable to analyze provenance data only after the end of the execution. A single experiment may demand weeks to run, even in high performance computing environments. Thus, scientists need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and take actions before it ends (i.e. workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during the execution. This database is agnostic to the SWfMS and workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of the scientific data.
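
To illustrate the idea of runtime provenance that can be queried while the workflow is still executing, the sketch below builds a minimal, hypothetical PROV-inspired relational store and runs a monitoring query against it. The schema, table, and column names are assumptions for illustration only, not the model defined in the paper.

```python
import sqlite3

# Minimal, hypothetical PROV-like provenance store: one row per activity execution.
# This is an illustrative schema, not the one defined in the paper.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE activity (
        id INTEGER PRIMARY KEY,
        workflow TEXT,
        name TEXT,
        status TEXT,            -- 'RUNNING', 'FINISHED', 'FAILED'
        start_time TEXT,
        end_time TEXT
    )
""")
conn.executemany(
    "INSERT INTO activity (workflow, name, status, start_time, end_time) VALUES (?, ?, ?, ?, ?)",
    [
        ("phylo", "align", "FINISHED", "2013-03-01 10:00", "2013-03-01 10:42"),
        ("phylo", "build_tree", "RUNNING", "2013-03-01 10:43", None),
        ("phylo", "build_tree", "FAILED", "2013-03-01 10:43", "2013-03-01 10:50"),
    ],
)

# Runtime monitoring query: progress per activity while the workflow is still executing.
for row in conn.execute("""
    SELECT name, status, COUNT(*) AS activations
    FROM activity
    WHERE workflow = 'phylo'
    GROUP BY name, status
    ORDER BY name, status
"""):
    print(row)
```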


Concurrency and Computation: Practice and Experience | 2013

Chiron: a parallel engine for algebraic scientific workflows

Eduardo S. Ogasawara; Jonas Dias; Vítor Silva; Fernando Chirigati; Daniel de Oliveira; Fábio Porto; Patrick Valduriez; Marta Mattoso

Large-scale scientific experiments based on computer simulations are typically modeled as scientific workflows, which eases the chaining of different programs. These scientific workflows are defined, executed, and monitored by scientific workflow management systems (SWfMS). As these experiments manage large amounts of data, it becomes critical to execute them in high-performance computing environments, such as clusters, grids, and clouds. However, few SWfMS provide parallel support. The ones that do so are usually labor-intensive for workflow developers and have limited primitives to optimize workflow execution. To address these issues, we developed a workflow algebra to specify and enable the optimization of parallel execution of scientific workflows. In this paper, we show how the workflow algebra is efficiently implemented in Chiron, an algebra-based parallel scientific workflow engine. Chiron has a unique native distributed provenance mechanism that enables runtime queries in a relational database. We developed two studies to evaluate the performance of our algebraic approach implemented in Chiron; the first study compares Chiron with different approaches, whereas the second one evaluates the scalability of Chiron. By analyzing the results, we conclude that Chiron is efficient in executing scientific workflows, with the benefits of declarative specification and runtime provenance support.
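
As a rough illustration of the algebraic view of workflows, the sketch below models activities as operators applied to relations of tuples. The operator helpers (map_op, filter_op, reduce_op) are hypothetical stand-ins written for this example, not Chiron's actual operators or API.

```python
from typing import Any, Callable, Dict, List

Relation = List[Dict[str, Any]]  # a set of tuples consumed and produced by activities

# Hypothetical algebraic operators in the spirit of the workflow algebra
# (Chiron's real operators and engine are not reproduced here).
def map_op(activity: Callable[[Dict[str, Any]], Dict[str, Any]], rel: Relation) -> Relation:
    # 1:1 operator: each input tuple yields one output tuple.
    return [activity(t) for t in rel]

def filter_op(predicate: Callable[[Dict[str, Any]], bool], rel: Relation) -> Relation:
    # keeps only tuples satisfying the predicate.
    return [t for t in rel if predicate(t)]

def reduce_op(activity: Callable[[Relation], Dict[str, Any]], rel: Relation, key: str) -> Relation:
    # n:1 per group: aggregates tuples sharing the same key.
    groups: Dict[Any, Relation] = {}
    for t in rel:
        groups.setdefault(t[key], []).append(t)
    return [activity(g) for g in groups.values()]

# Example dataflow: simulate -> discard failed runs -> aggregate per model.
inputs = [{"model": "m1", "param": p} for p in range(4)]
ran = map_op(lambda t: {**t, "score": t["param"] * 0.5, "ok": t["param"] != 2}, inputs)
kept = filter_op(lambda t: t["ok"], ran)
summary = reduce_op(lambda g: {"model": g[0]["model"], "best": max(t["score"] for t in g)}, kept, "model")
print(summary)
```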


IEEE International Conference on eScience | 2011

Optimizing Phylogenetic Analysis Using SciHmm Cloud-based Scientific Workflow

Kary A. C. S. Ocaña; Daniel de Oliveira; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso

Phylogenetic analysis and multiple sequence alignment (MSA) are closely related bioinformatics fields. Phylogenetic analysis makes extensive use of MSA in the construction of phylogenetic trees, which are used to infer the evolutionary relationships between homologous genes. These bioinformatics experiments are usually modeled as scientific workflows. There are many alternative workflows that use different MSA methods to conduct phylogenetic analysis, and each one can produce an MSA of different quality. Scientists have to explore which MSA method is the most suitable for their experiments. However, workflows for phylogenetic analysis are both computationally and data intensive, and they may run sequentially for weeks. Although there are many approaches that parallelize these workflows, exploring all MSA methods may become a burdensome and expensive task. If scientists knew the most adequate MSA method a priori, it would save time and money. To optimize the phylogenetic analysis workflow, we propose in this paper SciHmm, a bioinformatics scientific workflow based on profile hidden Markov models (pHMMs) that aims at determining the most suitable MSA method for a phylogenetic analysis before executing the phylogenetic workflow. SciHmm is executed in parallel in a cloud environment using the SciCumulus middleware. The results demonstrate that optimizing a phylogenetic analysis using SciHmm considerably reduces its total execution time (by up to 80%) and also improves the quality of the biological results. In addition, the parallel execution of SciHmm demonstrates that this kind of bioinformatics workflow is well suited to execution in the cloud.


Future Generation Computer Systems | 2013

Performance evaluation of parallel strategies in public clouds: A study with phylogenomic workflows

Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; João Carlos de A. R. Gonçalves; Fernanda Araujo Baião; Marta Mattoso

Data analysis is an exploratory process that demands high performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that aims at producing phylogenomic trees based on an input set of protein sequences of genomes to infer evolutionary relationships among living organisms. SciPhylomics can benefit from parallel processing techniques provided by existing approaches such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability features. In this paper, we present a performance evaluation of SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of parallelizing data for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable for execution in the cloud despite its need for high performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, which provides a first indication that these cloud execution results can be extrapolated to other classes of experiments.


Workflows in Support of Large-Scale Science | 2011

Supporting dynamic parameter sweep in adaptive and user-steered workflow

Jonas Dias; Eduardo S. Ogasawara; Daniel de Oliveira; Fábio Porto; Alvaro L. G. A. Coutinho; Marta Mattoso

Large-scale experiments in computational science are complex to manage. Due to their exploratory nature, several iterations are needed to evaluate a large space of parameter combinations. Scientists analyze partial results and dynamically interfere in the next steps of the simulation. Scientific workflow management systems can execute those experiments by providing process management, distributed execution, and provenance data. However, supporting scientists in complex exploratory processes involving dynamic workflows is still a challenge. Features such as user steering of workflows, to track, evaluate, and adapt the execution, need to be designed to support iterative methods. We provide an approach to support dynamic parameter sweeps, in which scientists can use the results obtained in a slice of the parameter space to improve the remainder of the execution. We propose new control structures to enable adaptive and user-steered workflows that support iterative methods using dynamic mechanisms. We evaluate our approach using a proof-of-concept workflow (the Lanczos algorithm), and the results show up to 78% of the execution time saved.
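
A minimal sketch of the dynamic-sweep idea follows: evaluate one slice of the parameter space, inspect partial results, then prune the remaining slices instead of running them exhaustively. The objective function and pruning rule are made up for illustration and stand in for the paper's control structures and user steering.

```python
import random

def simulate(param: float) -> float:
    # Stand-in for a costly simulation run; here just a noisy quadratic.
    return -(param - 3.0) ** 2 + random.uniform(-0.1, 0.1)

# Full parameter space split into slices, so partial results can steer the sweep.
space = [i * 0.5 for i in range(20)]                      # 0.0, 0.5, ..., 9.5
pending = [space[i:i + 5] for i in range(0, len(space), 5)]

best_param, best_score = None, float("-inf")
evaluated = 0

while pending:
    current = pending.pop(0)
    for p in current:
        score = simulate(p)
        evaluated += 1
        if score > best_score:
            best_param, best_score = p, score
    # "Steering" step: prune parameters in the remaining slices that are far
    # from the current best region (a toy rule standing in for user steering).
    pending = [[p for p in s if abs(p - best_param) <= 2.0] for s in pending]
    pending = [s for s in pending if s]

print(f"best parameter: {best_param}, score: {best_score:.2f}, runs: {evaluated} of {len(space)}")
```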


Brazilian Symposium on Bioinformatics | 2012

Exploring Molecular Evolution Reconstruction Using a Parallel Cloud Based Scientific Workflow

Kary A. C. S. Ocaña; Daniel de Oliveira; Felipe Horta; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso

Recent studies of evolution at the molecular level address two important issues: reconstruction of the evolutionary relationships between species and investigation of the forces of the evolutionary process. Both issues experienced explosive growth in the last two decades due to the massive generation of genomic data and to novel statistical methods and computational approaches to process and analyze this large volume of data. Most experiments in molecular evolution are based on compute-intensive simulations preceded by other computation tools and post-processed by computing validators. All these tools can be modeled as scientific workflows to improve experiment management while capturing provenance data. However, these evolutionary analysis experiments are very complex and may execute for weeks, so the workflows need to be executed in parallel in High Performance Computing (HPC) environments such as clouds. Clouds are being adopted for bioinformatics experiments due to characteristics such as elasticity and availability, and they are evolving into HPC environments. In this paper, we introduce SciEvol, a bioinformatics scientific workflow for molecular evolution reconstruction that aims at inferring evolutionary relationships (i.e. detecting positive Darwinian selection) in genomic data. SciEvol is designed and implemented to execute in parallel in clouds using the SciCumulus workflow engine. Our experiments show that SciEvol can help scientists by enabling the reconstruction of evolutionary relationships using the cloud environment. Results show performance improvements of up to 94.64% in execution time compared to the sequential execution, which drops from around 10 days to 12 hours.


International Conference on Cloud Computing | 2011

A Performance Evaluation of X-Ray Crystallography Scientific Workflow Using SciCumulus

Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; Fernanda Araujo Baião; Marta Mattoso

X-ray crystallography is an important field due to its role in drug discovery and its relevance in bioinformatics experiments of comparative genomics, phylogenomics, evolutionary analysis, ortholog detection, and three-dimensional structure determination. Managing these experiments is a challenging task due to the orchestration of legacy tools and the management of several variations of the same experiment. Workflows can model a coherent flow of activities that are managed by scientific workflow management systems (SWfMS). Due to the huge number of variations of the workflow to be explored (parameters, input data), it is often necessary to execute X-ray crystallography experiments in High Performance Computing (HPC) environments. Cloud computing is well known for its scalable and elastic HPC model. In this paper, we present a performance evaluation of the X-ray crystallography workflow defined by the PC4 (Provenance Challenge series). The workflow was executed using the SciCumulus middleware on the Amazon EC2 cloud environment. SciCumulus is a layer for SWfMS that offers support for the parallel execution of scientific workflows in cloud environments with provenance mechanisms. Our results reinforce the benefits (total execution time versus monetary cost) of parallelizing the X-ray crystallography workflow using SciCumulus. The results show a consistent way to execute X-ray crystallography workflows that need HPC using cloud computing. The evaluated workflow shares features with many scientific workflows, and the approach can be applied to other experiments.


International Conference on Big Data | 2013

Algebraic dataflows for big data analysis

Jonas Dias; Eduardo S. Ogasawara; Daniel de Oliveira; Fábio Porto; Patrick Valduriez; Marta Mattoso

Analyzing big data requires the support of dataflows with many activities to extract and explore relevant information from the data. Recent approaches such as Pig Latin propose a high-level language to model such dataflows. However, the dataflow execution is typically delegated to a MapReduce implementation such as Hadoop, which does not follow an algebraic approach and thus cannot take advantage of the optimization opportunities of the Pig Latin algebra. In this paper, we propose an approach for big data analysis based on algebraic workflows, which yields optimization and parallel execution of activities and supports user steering using provenance queries. We illustrate how a big data processing dataflow can be modeled using the algebra. Through an experimental evaluation using real datasets and the execution of the dataflow with Chiron, an engine that supports our algebra, we show that our approach yields performance gains of up to 19.6% from algebraic optimizations in the dataflow and up to 39.1% of time saved in a user steering scenario.
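
One kind of optimization an algebraic representation exposes is activity reordering, for example pushing a cheap filter ahead of an expensive transformation so that fewer tuples reach it. The toy, self-contained sketch below illustrates that effect in plain Python; it is not Chiron's optimizer and the plans shown are invented for this example.

```python
import time

# Synthetic dataflow input: 100k tuples, only 10% of which are valid.
data = [{"id": i, "valid": i % 10 == 0} for i in range(100_000)]

def expensive_transform(t):
    # Stand-in for a costly activity applied to each tuple.
    return {**t, "value": sum(range(200))}

def run(plan):
    # Interpret a tiny plan: an ordered list of ("map"|"filter", function) steps.
    result = data
    for op, fn in plan:
        result = [fn(t) for t in result] if op == "map" else [t for t in result if fn(t)]
    return result

naive = [("map", expensive_transform), ("filter", lambda t: t["valid"])]
optimized = [("filter", lambda t: t["valid"]), ("map", expensive_transform)]  # filter pushed down

for name, plan in [("naive", naive), ("optimized", optimized)]:
    start = time.perf_counter()
    out = run(plan)
    print(name, len(out), f"{time.perf_counter() - start:.3f}s")
```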


Future Generation Computer Systems | 2013

Designing a parallel cloud based comparative genomics workflow to improve phylogenetic analyses

Kary A. C. S. Ocaña; Daniel de Oliveira; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso

Over the last years, comparative genomics analyses have become more compute-intensive due to the explosive number of available genome sequences. Comparative genomics analysis is an important a priori step for experiments in various bioinformatics domains. This analysis can be used to enhance the performance and quality of experiments in areas such as evolution and phylogeny. A common phylogenetic analysis makes extensive use of Multiple Sequence Alignment (MSA) in the construction of phylogenetic trees, which are used to infer evolutionary relationships between homologous genes. Each phylogenetic analysis aims at exploring several different MSA methods to verify which execution produces trees with the best quality. This phylogenetic exploration may run for weeks, even when executed in High Performance Computing (HPC) environments. Although there are many approaches that model and parallelize phylogenetic analysis as scientific workflows, exploring all MSA methods becomes a complex and expensive task. If scientists could determine a priori the most adequate MSA method to use in the phylogenetic analysis, they would save time and, in some cases, financial resources. Comparative genomics analyses play an important role in optimizing phylogenetic analysis workflows. In this paper, we extend the SciHmm scientific workflow, aimed at determining the most suitable MSA method, to use it in a phylogenetic analysis. SciHmm uses SciCumulus, a cloud workflow execution engine, for parallel execution. Experimental results show that using SciHmm considerably reduces the total execution time of the phylogenetic analysis (up to 80%). Experiments also show that trees built with the MSA program selected by SciHmm presented higher quality than the others, as expected. In addition, the parallel execution of SciHmm shows that this kind of bioinformatics workflow has an excellent cost/benefit ratio when executed in cloud environments.


Future Generation Computer Systems | 2015

Data-centric iteration in dynamic workflows

Jonas Dias; Gabriel M. Guerra; Fernando A. Rochinha; Alvaro L. G. A. Coutinho; Patrick Valduriez; Marta Mattoso

Dynamic workflows are scientific workflows that support computational science simulations, typically using dynamic processes based on runtime scientific data analyses. They require the ability to adapt the workflow, at runtime, based on user input and dynamic steering. Supporting data-centric iteration is an important step towards dynamic workflows because user interaction with workflows is iterative. However, current support for iteration in scientific workflows is static and does not allow for changing data at runtime. In this paper, we propose a solution based on algebraic operators and a dynamic execution model to enable workflow adaptation based on user input and dynamic steering. We introduce the concept of iteration lineage, which makes provenance data management consistent with dynamic iterative workflow changes. Lineage enables scientists to interact with workflow data and configuration at runtime through an API that triggers steering. We evaluate our approach using a novel, real large-scale workflow for uncertainty quantification on a 640-core cluster. The results show execution time savings from 2.5 to 24 days compared to non-iterative workflow execution. We verify that the maximum overhead introduced by our iterative model is less than 5% of execution time, and our proposed steering algorithms are very efficient, running in less than 1 millisecond in the worst-case scenario.

Highlights: Algebraic operators support data-centric iteration in dynamic workflows. Runtime data lineage, a concept inspired by provenance, enables dynamic loops. Two algorithms support runtime adaptation of the workflow based on user input. A real-life experiment for uncertainty quantification in the oil and gas domain. A novel iterative workflow for uncertainty quantification is steered by users.
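
A minimal sketch of the iteration-lineage and steering idea follows. The IterationRecord/Lineage structures and the steering_hook function are hypothetical stand-ins invented for this example, not the data model or API described in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Toy record of what happened in each iteration, loosely inspired by the
# paper's notion of iteration lineage.
@dataclass
class IterationRecord:
    iteration: int
    config: Dict[str, float]
    residual: float

@dataclass
class Lineage:
    records: List[IterationRecord] = field(default_factory=list)

    def add(self, it: int, config: Dict[str, float], residual: float) -> None:
        # Store a copy of the configuration so later steering does not rewrite history.
        self.records.append(IterationRecord(it, dict(config), residual))

def steering_hook(it: int, residual: float, config: Dict[str, float]) -> None:
    # Stand-in for user steering: adapt the configuration at runtime
    # based on partial results (here, enlarge the step if convergence is slow).
    if residual > 1.0 and it >= 3:
        config["step"] *= 1.5

# Iterative computation: drive a value toward a target, allowing the
# configuration to change between iterations.
config = {"step": 0.1, "tolerance": 1e-3}
lineage = Lineage()
x, target, it = 0.0, 10.0, 0

while abs(target - x) > config["tolerance"] and it < 200:
    x += config["step"] * (target - x)           # one "activity" of the loop
    residual = abs(target - x)
    lineage.add(it, config, residual)
    steering_hook(it, residual, config)           # runtime adaptation point
    it += 1

print(f"converged after {it} iterations; step ended at {config['step']:.3f}")
print("first three lineage records:", lineage.records[:3])
```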

Collaboration


Dive into Jonas Dias's collaborations.

Top Co-Authors

Daniel de Oliveira (Federal University of Rio de Janeiro)
Eduardo S. Ogasawara (Centro Federal de Educação Tecnológica de Minas Gerais)
Marta Mattoso (French Institute for Research in Computer Science and Automation)
Kary A. C. S. Ocaña (Federal University of Rio de Janeiro)
Vítor Silva (Federal University of Rio de Janeiro)
Felipe Horta (Federal University of Rio de Janeiro)
Flavio Costa (Federal University of Rio de Janeiro)
Fábio Porto (École Polytechnique Fédérale de Lausanne)
Alvaro L. G. A. Coutinho (Federal University of Rio de Janeiro)
Fernanda Araujo Baião (Universidade Federal do Estado do Rio de Janeiro)