Kary A. C. S. Ocaña
Federal University of Rio de Janeiro
Publication
Featured research published by Kary A. C. S. Ocaña.
Grid Computing | 2012
Daniel de Oliveira; Kary A. C. S. Ocaña; Fernanda Araujo Baião; Marta Mattoso
In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring the use of parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm in which resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high throughput computing, clouds are already being used as HPC environments where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open, yet important, challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since many different criteria must be taken into account and the elasticity of the cloud must be explored to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for the parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability, and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
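One way to picture a 3-objective cost model like the one described is as a weighted sum of makespan, failure risk, and monetary cost. The sketch below is a minimal illustration under that assumption, not SciCumulus's actual heuristic; the weights, VM types, and estimates are invented.

```python
def schedule_cost(est_seconds, failure_prob, usd_per_hour,
                  w_time=0.5, w_rel=0.3, w_cost=0.2):
    """Weighted sum of makespan, failure risk, and monetary cost (lower is better)."""
    monetary = est_seconds / 3600.0 * usd_per_hour
    return w_time * est_seconds + w_rel * failure_prob + w_cost * monetary

def pick_vm(candidates):
    # candidates: {vm_type: (est_seconds, failure_prob, usd_per_hour)}
    return min(candidates, key=lambda vm: schedule_cost(*candidates[vm]))

vms = {"small": (7200, 0.05, 0.10),   # slow, cheap, less reliable
       "large": (1800, 0.02, 0.45)}   # fast, pricier, more reliable
print(pick_vm(vms))  # → large
```

The weights stand in for the restrictions scientists impose before execution; shifting weight onto `w_cost` would steer the scheduler toward cheaper machines.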
EDBT/ICDT Workshops | 2013
Flavio Costa; Vítor Silva; Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; Marta Mattoso
Scientific workflows are commonly used to model and execute large-scale scientific experiments. They represent key resources for scientists and are enacted and managed by Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own approach to executing workflows and to capturing and managing their provenance data. Due to the large scale of experiments, it may be unviable to analyze provenance data only after the execution ends. A single experiment may take weeks to run, even in high performance computing environments. Thus, scientists need to monitor the experiment during its execution, and this can be done through provenance data. Runtime provenance analysis allows scientists to monitor workflow execution and to take actions before it completes (i.e., workflow steering). This provenance data can also be used to fine-tune the parallel execution of the workflow dynamically. We use the PROV data model as a basic framework for modeling and providing runtime provenance as a database that can be queried even during execution. This database is agnostic to the SWfMS and workflow engine. We show the benefits of representing and sharing runtime provenance data for improving experiment management as well as the analysis of the scientific data.
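A provenance database that is queryable at runtime makes monitoring queries like the one below possible. This is a minimal sketch over an invented SQLite schema; the actual PROV-based schema the authors use differs, and the table and activity names here are assumptions.

```python
import sqlite3

# Hypothetical minimal provenance store: one row per activity activation.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE activation (activity TEXT, status TEXT, elapsed REAL)")
db.executemany("INSERT INTO activation VALUES (?, ?, ?)",
               [("mafft", "FINISHED", 120.0),
                ("mafft", "RUNNING", 95.0),
                ("raxml", "FAILED", 30.0)])

# Runtime steering query: spot failures before the workflow ends.
failed = db.execute(
    "SELECT activity, COUNT(*) FROM activation "
    "WHERE status = 'FAILED' GROUP BY activity").fetchall()
print(failed)  # → [('raxml', 1)]
```

Because the store is an ordinary database, a scientist (or the engine itself) can issue such queries mid-run and decide, say, to abort or reconfigure a failing branch.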
Concurrency and Computation: Practice and Experience | 2012
Daniel de Oliveira; Eduardo S. Ogasawara; Kary A. C. S. Ocaña; Fernanda Araujo Baião; Marta Mattoso
Many of the existing large-scale scientific experiments modeled as scientific workflows are compute-intensive. Some scientific workflow management systems already explore parallel techniques, such as parameter sweep and data fragmentation, to improve performance. In those systems, computing resources are used to accomplish many computational tasks in high performance environments, such as multiprocessor machines or clusters. Meanwhile, cloud computing provides scalable and elastic resources that can be instantiated on demand during the course of a scientific experiment, without requiring its users to acquire expensive infrastructure or to configure many pieces of software. In fact, because of these advantages some scientists have already adopted the cloud model in their scientific experiments. However, this model also raises many challenges. When scientists are executing scientific workflows that require parallelism, it is hard to decide a priori the amount of resources to use and how long they will be needed, because the allocation of these resources is elastic and based on demand. In addition, scientists have to manage new aspects such as the initialization of virtual machines and the impact of data staging. SciCumulus is a middleware that manages the parallel execution of scientific workflows in cloud environments. In this paper, we introduce an adaptive approach for executing parallel scientific workflows in the cloud. This approach adapts itself according to the availability of resources during workflow execution. It checks the available computational power and dynamically tunes the workflow activity size to achieve better performance. Experimental evaluation showed the benefits of parallelizing scientific workflows using the adaptive approach of SciCumulus, which yielded performance improvements of up to 47.1%.
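Dynamically tuning activity size to the observed computational power can be pictured as follows. The function, its parameters, and the clamping policy are illustrative assumptions, not the actual SciCumulus algorithm.

```python
def tune_activity_size(pending_tasks, available_slots, min_size=1, max_size=64):
    """Group pending tasks into activations sized to the capacity observed now."""
    if available_slots <= 0:
        return max_size                             # nothing free: batch large
    size = -(-pending_tasks // available_slots)     # ceiling division
    return max(min_size, min(max_size, size))       # clamp to sane bounds

print(tune_activity_size(100, 10))   # → 10
print(tune_activity_size(1000, 10))  # → 64 (clamped)
```

The point of the adaptation is that when more virtual machines become available mid-run, `available_slots` grows and activations shrink, exposing more parallelism.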
IEEE International Conference on eScience | 2011
Kary A. C. S. Ocaña; Daniel de Oliveira; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso
Phylogenetic analysis and multiple sequence alignment (MSA) are closely related bioinformatics fields. Phylogenetic analysis makes extensive use of MSA in the construction of phylogenetic trees, which are used to infer the evolutionary relationships between homologous genes. These bioinformatics experiments are usually modeled as scientific workflows. There are many alternative workflows that use different MSA methods to conduct phylogenetic analysis, and each one can produce an MSA of different quality. Scientists have to explore which MSA method is the most suitable for their experiments. However, workflows for phylogenetic analysis are both compute- and data-intensive, and they may run sequentially for weeks. Although there are many approaches that parallelize these workflows, exploring all MSA methods may become a burdensome and expensive task. If scientists knew the most adequate MSA method a priori, it would save time and money. To optimize the phylogenetic analysis workflow, we propose in this paper SciHmm, a bioinformatics scientific workflow based on profile hidden Markov models (pHMMs) that aims at determining the most suitable MSA method for a phylogenetic analysis prior to executing the phylogenetic workflow. SciHmm is also executed in parallel in a cloud environment using the SciCumulus middleware. The results demonstrate that optimizing a phylogenetic analysis using SciHmm considerably reduces its total execution time (by up to 80%), and that the resulting biological results are of higher quality. In addition, the parallel execution of SciHmm demonstrates that this kind of bioinformatics workflow is well suited to execution in the cloud.
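The selection step SciHmm automates amounts to ranking MSA methods by a quality score and electing the best one before the expensive phylogenetic workflow runs. A toy sketch of that election, assuming scores (e.g. derived from pHMM log-odds) have already been computed; the method names are real MSA tools but the scores are made up:

```python
def best_msa_method(scores):
    """Elect the MSA method with the highest quality score."""
    return max(scores, key=scores.get)

scores = {"mafft": 0.91, "muscle": 0.87, "clustalw": 0.78,
          "kalign": 0.83, "probcons": 0.89}
print(best_msa_method(scores))  # → mafft
```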
Future Generation Computer Systems | 2013
Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; João Carlos de A. R. Gonçalves; Fernanda Araujo Baião; Marta Mattoso
Data analysis is an exploratory process that demands high performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that aims at producing phylogenomic trees, based on an input set of protein sequences of genomes, to infer evolutionary relationships among living organisms. SciPhylomics can benefit from parallel processing techniques provided by existing approaches such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability features. In this paper, we present a performance evaluation of SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of parallelizing data for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable for execution in the cloud despite its need for high performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, providing first insights that these cloud execution results can be extrapolated to other classes of experiments.
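The MapReduce-style data parallelism described above can be illustrated with a toy map/reduce pair that groups protein sequences by gene family so each group can be aligned and analyzed independently. This is not the authors' Hadoop job; the record format and family labels are assumptions for illustration.

```python
def map_phase(records):
    # Emit (gene_family, [sequence]) pairs, one per input record.
    for family, seq in records:
        yield family, [seq]

def reduce_phase(pairs):
    # Merge sequence lists by gene family: one group per parallel task.
    grouped = {}
    for family, seqs in pairs:
        grouped.setdefault(family, []).extend(seqs)
    return grouped

records = [("COG1", "MKVA"), ("COG2", "MLAT"), ("COG1", "MKIA")]
groups = reduce_phase(map_phase(records))
print(sorted(groups))  # → ['COG1', 'COG2']
```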
Nucleic Acids Research | 2007
Alberto M. R. Dávila; Pablo N. Mendes; Glauber Wagner; Diogo A. Tschoeke; Rafael R. C. Cuadrat; Felipe Liberman; Luciana Matos; Thiago S. Satake; Kary A. C. S. Ocaña; Omar Triana; Sérgio Manuel Serra da Cruz; Henrique Jucá; Juliano C. Cury; Fabrício Nogueira da Silva; Guilherme A. Geronimo; Margarita Ruiz; Eduardo Ruback; Floriano P. Silva; Christian M. Probst; Edmundo Carlos Grisard; Marco Aurélio Krieger; Samuel Goldenberg; Maria Cláudia Cavalcanti; Milton Ozório Moraes; Maria Luiza Machado Campos; Marta Mattoso
ProtozoaDB (http://www.biowebdb.org/protozoadb) is being developed to initially host both genomics and post-genomics data from Plasmodium falciparum, Entamoeba histolytica, Trypanosoma brucei, T. cruzi and Leishmania major, but will hopefully host other protozoan species as more genomes are sequenced. It is based on the Genomics Unified Schema and offers a modern Web-based interface for user-friendly data visualization and exploration. This database is not intended to duplicate other similar efforts, such as GeneDB, PlasmoDB, TcruziDB or even TDRtargets, but to complement them by providing further analyses with an emphasis on distant similarities (HMM-based) and phylogeny-based annotations, including orthology analysis. ProtozoaDB will be progressively linked to the above-mentioned databases, focusing on performing a multi-source dynamic combination of information through advanced interoperable Web tools such as Web services. Providing Web services will also allow third-party software to retrieve and use data from ProtozoaDB in automated pipelines (workflows) or other interoperable Web technologies, promoting better information reuse and integration. We also expect ProtozoaDB to catalyze the development of local and regional bioinformatics capabilities (research and training), and therefore promote and enhance scientific advancement in developing countries.
Brazilian Symposium on Bioinformatics | 2012
Kary A. C. S. Ocaña; Daniel de Oliveira; Felipe Horta; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso
Recent studies of evolution at the molecular level address two important issues: reconstruction of the evolutionary relationships between species and investigation of the forces of the evolutionary process. Both issues experienced explosive growth in the last two decades due to the massive generation of genomic data and to novel statistical methods and computational approaches to process and analyze this large volume of data. Most experiments in molecular evolution are based on compute-intensive simulations preceded by other computation tools and post-processed by computing validators. All these tools can be modeled as scientific workflows to improve experiment management while capturing provenance data. However, these evolutionary analysis experiments are very complex and may execute for weeks, so these workflows need to be executed in parallel in High Performance Computing (HPC) environments such as clouds. Clouds are being adopted for bioinformatics experiments due to characteristics such as elasticity and availability, and are evolving into HPC environments. In this paper, we introduce SciEvol, a bioinformatics scientific workflow for molecular evolution reconstruction that aims at inferring evolutionary relationships (i.e., detecting positive Darwinian selection) in genomic data. SciEvol is designed and implemented to execute in parallel on clouds using the SciCumulus workflow engine. Our experiments show that SciEvol can help scientists by enabling the reconstruction of evolutionary relationships using the cloud environment. Results show performance improvements of up to 94.64% in execution time compared to sequential execution, which drops from around 10 days to 12 hours.
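Detecting positive Darwinian selection is commonly framed in terms of the dN/dS (omega) ratio from codon substitution models, where omega > 1 indicates positive selection. The toy classifier below illustrates that standard criterion only; it is not part of SciEvol itself, and the input rates are invented.

```python
def selection_class(dn, ds, eps=1e-9):
    """Classify a gene by its omega = dN/dS ratio."""
    omega = dn / max(ds, eps)   # guard against division by zero
    if omega > 1.0:
        return "positive"       # positive (Darwinian) selection
    if omega < 1.0:
        return "purifying"      # negative selection
    return "neutral"

print(selection_class(0.8, 0.2))  # → positive
```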
International Conference on Cloud Computing | 2011
Daniel de Oliveira; Kary A. C. S. Ocaña; Eduardo S. Ogasawara; Jonas Dias; Fernanda Araujo Baião; Marta Mattoso
X-ray crystallography is an important field due to its role in drug discovery and its relevance in bioinformatics experiments of comparative genomics, phylogenomics, evolutionary analysis, ortholog detection, and three-dimensional structure determination. Managing these experiments is a challenging task due to the orchestration of legacy tools and the management of several variations of the same experiment. Workflows can model a coherent flow of activities that are managed by scientific workflow management systems (SWfMS). Due to the huge number of variations of the workflow to be explored (parameters, input data), it is often necessary to execute X-ray crystallography experiments in High Performance Computing (HPC) environments. Cloud computing is well known for its scalable and elastic HPC model. In this paper, we present a performance evaluation for the X-ray crystallography workflow defined by the PC4 (Provenance Challenge series). The workflow was executed using the SciCumulus middleware on the Amazon EC2 cloud environment. SciCumulus is a layer for SWfMS that offers support for the parallel execution of scientific workflows in cloud environments with provenance mechanisms. Our results reinforce the benefits (total execution time versus monetary cost) of parallelizing the X-ray crystallography workflow using SciCumulus. The results show a consistent way to execute X-ray crystallography workflows that need HPC using cloud computing. The evaluated workflow shares features with many scientific workflows, so the approach can be applied to other experiments.
Future Generation Computer Systems | 2013
Kary A. C. S. Ocaña; Daniel de Oliveira; Jonas Dias; Eduardo S. Ogasawara; Marta Mattoso
Over the last few years, comparative genomics analyses have become more compute-intensive due to the explosive growth in the number of available genome sequences. Comparative genomics analysis is an important a priori step for experiments in various bioinformatics domains. This analysis can be used to enhance the performance and quality of experiments in areas such as evolution and phylogeny. A common phylogenetic analysis makes extensive use of Multiple Sequence Alignment (MSA) in the construction of phylogenetic trees, which are used to infer evolutionary relationships between homologous genes. Each phylogenetic analysis aims at exploring several different MSA methods to verify which execution produces trees with the best quality. This phylogenetic exploration may run for weeks, even when executed in High Performance Computing (HPC) environments. Although there are many approaches that model and parallelize phylogenetic analysis as scientific workflows, exploring all MSA methods is a complex and expensive task to perform. If scientists could determine a priori the most adequate MSA method to use in the phylogenetic analysis, it would save time and, in some cases, financial resources. Comparative genomics analyses play an important role in optimizing phylogenetic analysis workflows. In this paper, we extend the SciHmm scientific workflow, aimed at determining the most suitable MSA method, to use it in a phylogenetic analysis. SciHmm uses SciCumulus, a cloud workflow execution engine, for parallel execution. Experimental results show that using SciHmm considerably reduces the total execution time of the phylogenetic analysis (by up to 80%). Experiments also show that trees built with the MSA program elected by SciHmm were of higher quality than the remaining ones, as expected. In addition, the parallel execution of SciHmm shows that this kind of bioinformatics workflow has an excellent cost/benefit ratio when executed in cloud environments.
Advances and Applications in Bioinformatics and Chemistry | 2015
Kary A. C. S. Ocaña; Daniel de Oliveira
Today’s genomic experiments have to process so-called “biological big data”, which is now reaching the size of terabytes and petabytes. Processing this huge amount of data may require weeks or months if scientists use only their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied to reduce the total processing time and to ease the management, treatment, and analysis of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units (GPUs) requires scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic literature review surveying the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that scientists can consider when running their genomic experiments in order to benefit from parallelism techniques and HPC capabilities.
Collaboration
French Institute for Research in Computer Science and Automation