Devarshi Ghoshal
Indiana University Bloomington
Publications
Featured research published by Devarshi Ghoshal.
Proceedings of the Second International Workshop on Data Intensive Computing in the Clouds | 2011
Devarshi Ghoshal; Richard Shane Canon; Lavanya Ramakrishnan
The scientific community is exploring the suitability of cloud infrastructure for handling High Performance Computing (HPC) applications. The goal of Magellan, a project funded through DOE ASCR, is to investigate the potential role of cloud computing in addressing the computing needs of the Department of Energy's Office of Science, especially for mid-range computing and data-intensive applications that are not served by existing DOE centers today. Prior work has shown that applications with significant communication or I/O tend to perform poorly in virtualized cloud environments, yet the I/O characteristics of such environments are still poorly understood. This paper presents our results from benchmarking I/O performance on different cloud and HPC platforms to identify the major bottlenecks in existing infrastructure. We compare I/O performance using the IOR benchmark on two cloud platforms, Amazon and the Magellan cloud testbed, and analyze the performance of the available storage options and of different instance types across multiple availability zones. Finally, we perform large-scale tests to analyze the variability in I/O patterns over time and region. Our results highlight the overhead and variability in I/O performance on both public and private cloud solutions, and can help applications make effective choices among the different storage options.
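As an illustration of the kind of comparison described above, the following sketch scripts IOR runs over several candidate storage mounts and prints the reported bandwidths. It is a minimal sketch only: the mount points, process count, and block/transfer sizes are hypothetical, not the paper's configuration.

```python
"""Run the IOR benchmark against several storage options (illustrative sketch)."""
import subprocess

# Hypothetical mount points for the storage options under comparison,
# e.g. instance-local disk vs. network block store vs. shared file system.
STORAGE_OPTIONS = {
    "local": "/mnt/local/ior.dat",
    "block": "/mnt/ebs/ior.dat",
    "shared": "/global/scratch/ior.dat",
}

for name, path in STORAGE_OPTIONS.items():
    # IOR flags: -a POSIX (I/O API), -b per-task block size, -t transfer
    # size, -F file-per-process, -i repetitions, -o target file.
    cmd = ["mpirun", "-np", "8", "ior",
           "-a", "POSIX", "-b", "1g", "-t", "4m",
           "-F", "-i", "3", "-o", path]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"== {name} ==")
    print(result.stdout)  # IOR reports read/write bandwidth per iteration
```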
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Peng Chen; Beth Plale; You-Wei Cheah; Devarshi Ghoshal; Scott Jensen; Yuan Luo
Visualization facilitates the understanding of scientific data through both exploration and explanation of the visualized data. Provenance also contributes to the understanding of data by recording the contributing factors behind a result. The visualization of provenance, although supported in existing workflow management systems, generally focuses on small to medium sized provenance data and lacks techniques for dealing with large, highly complex provenance graphs. This paper discusses visualization techniques developed for the exploration and explanation of provenance, including a layout algorithm, visual styling, graph abstraction techniques, and a graph matching algorithm, to deal with this complexity. We demonstrate the techniques on two extensively analyzed case studies, each involving provenance captured and used over a three-year project: the first covers provenance from a satellite imagery ingest processing pipeline, the second provenance from a large-scale computer network testbed.
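To make the graph abstraction idea concrete, here is a minimal sketch, not the paper's algorithm, that collapses a large fan-out of structurally identical provenance nodes into a single annotated summary node; the node names are invented.

```python
"""Collapse repeated, structurally identical provenance nodes (sketch)."""
import networkx as nx

G = nx.DiGraph()
# A process deriving 1000 tiles from one satellite image: far too many
# nodes to draw individually, but all structurally identical.
for i in range(1000):
    G.add_edge("ingest_process", f"tile_{i}")   # generation edge
    G.add_edge(f"tile_{i}", "mosaic_process")   # use edge

# Abstraction: rewrite every tile node to one summary node annotated with
# the count, shrinking the drawable graph from 1002 nodes to 3.
tiles = {n for n in G.nodes if n.startswith("tile_")}
A = nx.DiGraph()
for u, v in G.edges:
    u2 = "tiles [x1000]" if u in tiles else u
    v2 = "tiles [x1000]" if v in tiles else v
    if u2 != v2:
        A.add_edge(u2, v2)

print(A.number_of_nodes(), "nodes after abstraction")  # -> 3
```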
EDBT/ICDT Workshops | 2013
Devarshi Ghoshal; Beth Plale
As new data products of research increasingly become the output of complex processes, the lineage of the resulting products takes on greater importance as a description of the processes that contributed to the result. Without adequate description of data products, their reuse is diminished. The act of instrumenting an application for provenance capture is burdensome, however. This paper explores the option of deriving provenance from existing log files, an approach that substantially reduces the instrumentation task but raises questions about sifting through huge amounts of information for what may or may not be complete provenance. We study the trade-off between ease of capture and provenance completeness, and show that under some circumstances capture through logs can yield high quality provenance.
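As a toy illustration of the approach, not the paper's system, the sketch below mines "used" and "generated" edges from log lines and flags artifacts whose origin the log never records; the log format and regular expressions are hypothetical.

```python
"""Derive provenance edges from an application log (illustrative sketch)."""
import re

LOG = """\
2013-02-01 10:03:11 INFO reader: opened input /data/raw/run42.dat
2013-02-01 10:04:55 INFO filter: wrote output /data/clean/run42.dat
2013-02-01 10:05:02 INFO plot: opened input /data/clean/run42.dat
2013-02-01 10:05:40 INFO plot: wrote output /data/figs/run42.png
"""

reads = re.compile(r"(\w+): opened input (\S+)")
writes = re.compile(r"(\w+): wrote output (\S+)")

used, generated = [], []
for line in LOG.splitlines():
    if m := reads.search(line):
        used.append((m[1], m[2]))        # (process, artifact): "used" edge
    elif m := writes.search(line):
        generated.append((m[2], m[1]))   # (artifact, process): "generated by"

# Completeness check: artifacts used but never generated mark gaps in the
# recovered provenance (here, the raw input with no recorded origin).
missing = {a for _, a in used} - {a for a, _ in generated}
print("inputs with no recorded origin:", missing)
```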
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Devarshi Ghoshal; Lavanya Ramakrishnan
Scientific applications are increasingly using cloud resources for their data analysis workflows. However, managing data effectively and efficiently over these resources is challenging due to the myriad storage choices with different performance-cost trade-offs, complex application choices, and the complexity associated with elasticity and failure rates. The explosion in scientific data, coupled with the unique characteristics of cloud environments, requires a more flexible and robust distributed data management solution than those currently in existence. This paper describes the design and implementation of FRIEDA, a Flexible Robust Intelligent Elastic Data Management framework that coordinates data in a transient cloud environment while taking specific application characteristics into account. Additionally, we describe a range of data management strategies and show the benefit of flexible data management schemes in cloud environments. We study two distinct scientific applications, from bioinformatics and image analysis, to understand the effectiveness of such a framework.
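The following sketch contrasts two such data management strategies in miniature, static pre-partitioning versus dynamic dispatch; it is a hypothetical illustration, not FRIEDA's actual API.

```python
"""Two data placement strategies for cloud workers (illustrative sketch)."""
import itertools
import queue

files = [f"sample_{i}.fastq" for i in range(10)]   # hypothetical inputs
workers = ["vm-a", "vm-b", "vm-c"]

# Strategy 1: static pre-partitioning. Cheap to plan, but one slow or
# failed VM leaves its whole share of the data unprocessed.
static_plan = {w: [] for w in workers}
for f, w in zip(files, itertools.cycle(workers)):
    static_plan[w].append(f)
print("static plan:", static_plan)

# Strategy 2: dynamic dispatch. Each worker pulls the next file when free,
# tolerating heterogeneous VM performance and transient failures.
work = queue.Queue()
for f in files:
    work.put(f)

def next_file_for(worker: str):
    try:
        return work.get_nowait()
    except queue.Empty:
        return None                      # no work left for this worker

print("vm-a pulls:", next_file_for("vm-a"))
```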
International Conference on Computational Science | 2011
Devarshi Ghoshal; Sreesudhan R. Ramkumar; Arun Chauhan
Speculative software parallelism has gained renewed interest recently as a mechanism to leverage multiple cores on emerging architectures. Two major mechanisms have been used to implement speculation-based parallelism in software: software transactional memory and speculative threads. We propose a third mechanism based on checkpoint-restart, which recent developments in checkpoint-restart technology have made an attractive alternative. The approach has the potential to combine the conceptual simplicity of transactional memory with the flexibility of speculative threads. Since many checkpoint-restart systems work with large distributed-memory programs, this provides an automatic way to perform distributed speculation over clusters. Additionally, since checkpoint-restart systems are primarily designed for fault tolerance, using the same system for speculation can also provide fault tolerance within speculative execution when it is embedded in large-scale applications where fault tolerance is desirable. In this paper we use a series of micro-benchmarks to study the relative performance of a speculative system based on the DMTCP checkpoint-restart system and compare it against a thread-level speculation system. We highlight the relative merits of each approach and draw lessons that could guide future development of speculative systems.
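The rollback mechanism at the heart of this approach can be sketched as follows using DMTCP's command-line tools; the application binary and the validation step are hypothetical placeholders, so this illustrates the idea rather than reproducing the system evaluated in the paper.

```python
"""Speculation via checkpoint-restart, driven through DMTCP (sketch)."""
import glob
import subprocess

# Run the application under DMTCP control (assumes a DMTCP coordinator is
# already running, e.g. started with dmtcp_coordinator).
app = subprocess.Popen(["dmtcp_launch", "./compute", "--input", "data.in"])

# At the speculation point, take a checkpoint: this is the rollback target.
subprocess.run(["dmtcp_command", "--checkpoint"], check=True)

# ... the application continues down the speculative path ...
app.wait()

def speculation_was_correct() -> bool:
    # Hypothetical validation: compare the outcome against the value the
    # speculative path assumed.
    return False

if not speculation_was_correct():
    # Misspeculation: restart from the checkpoint image, discarding all
    # speculative work performed after the checkpoint was taken.
    ckpt = sorted(glob.glob("ckpt_*.dmtcp"))[-1]
    subprocess.run(["dmtcp_restart", ckpt], check=True)
```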
Workflows in Support of Large-Scale Science | 2013
Devarshi Ghoshal; Arun Chauhan; Beth Plale
Data provenance is the lineage of an artifact or object. Provenance can provide a basis upon which data can be regenerated, and can be used to determine the quality of both the process and the provenance itself. Provenance capture from workflows consists of capturing data dependencies as a workflow executes. We propose a layered provenance model that identifies and stores provenance at different granularities statically, by analyzing the source code of programs, and we use this model to capture provenance from both workflows and the modules within them. This paper contributes a static compile-time analysis methodology, built on a logical layered provenance model, that converts workflow provenance from black box to white box, so that the precise mapping between the inputs and outputs of a task can be known.
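As a toy example of such static analysis, not the paper's compiler methodology, the sketch below uses Python's ast module to map each output of a code fragment back to the fragment's true inputs, turning an opaque task into a white box.

```python
"""Statically derive input-to-output dependencies from source (sketch)."""
import ast

TASK_SRC = """
c = a + b
d = c * 2
e = b - 1
"""

deps = {}  # variable -> set of task inputs it was computed from
for stmt in ast.parse(TASK_SRC).body:
    if isinstance(stmt, ast.Assign):
        target = stmt.targets[0].id
        used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        # Expand through earlier assignments so 'd' maps back to the true
        # inputs {a, b} rather than to the intermediate 'c'.
        expanded = set()
        for v in used:
            expanded |= deps.get(v, {v})
        deps[target] = expanded

print(deps)  # {'c': {'a', 'b'}, 'd': {'a', 'b'}, 'e': {'b'}}
```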
IEEE International Conference on Cloud Computing Technology and Science | 2014
Lavanya Ramakrishnan; Devarshi Ghoshal; Valerie Hendrix; Eugen Feller; Pradeep Kumar Mantha; Christine Morin
Infrastructure as a Service (IaaS) clouds provide a composable environment that is attractive for mid-range, high-throughput, and data-intensive scientific workloads. However, the flexibility of IaaS clouds presents unique challenges for storage and data management in these environments. Users currently rely on manual and/or ad hoc methods for storage selection, storage configuration, and data management. We address these challenges through a novel approach to storage and data life-cycle management in FRIEDA (Flexible Robust Intelligent Elastic Data Management), an application-specific storage and data management framework for composable infrastructure environments.
IEEE International Conference on Cloud Engineering | 2014
Devarshi Ghoshal; Lavanya Ramakrishnan
Clouds are increasingly being used to run data-intensive scientific applications, which need performance, scalability, and reliability. These properties can be hard to achieve in cloud environments, and intelligent strategies are required to obtain them on cloud platforms. In this paper, we propose a set of pipelining strategies to effectively utilize provisioned cloud resources. Our experiments on the ExoGENI cloud testbed demonstrate the effectiveness of our approach in increasing performance and reducing failures.
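One such strategy, overlapping data staging with computation so a provisioned node never sits idle waiting on transfers, can be sketched as follows; this is a hypothetical illustration, not the paper's implementation.

```python
"""Pipeline data staging with computation (illustrative sketch)."""
import queue
import threading
import time

staged = queue.Queue(maxsize=2)   # bounded: staging runs at most 2 ahead

def stage(files):
    for f in files:
        time.sleep(0.1)           # stand-in for a network transfer
        staged.put(f)
    staged.put(None)              # sentinel: no more work

def compute():
    while (f := staged.get()) is not None:
        time.sleep(0.2)           # stand-in for processing a staged file
        print("processed", f)

files = [f"chunk_{i}" for i in range(5)]
t = threading.Thread(target=stage, args=(files,))
t.start()
compute()                         # overlaps with staging in the background
t.join()
```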
International Provenance and Annotation Workshop | 2014
Devarshi Ghoshal; Arun Chauhan; Beth Plale
Application benchmarks are critical to establishing the performance of a new system or library, but benchmarking a system can be tricky, and reproducing a benchmark result trickier still. Provenance can help. Referencing benchmarks and their results on similar platforms for collective comparison and evaluation requires capturing provenance about the benchmark execution process, the programs involved, and the results generated. In this paper we define a formal model of benchmark applications and the required provenance, describe an implementation of the model that employs compile-time static and runtime provenance capture, and quantify data quality in the context of benchmarks. Our results show that through a mix of compile-time and runtime provenance capture, we can enable higher quality benchmark regeneration.
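The runtime half of such capture might look like the sketch below, which records enough about a run to compare or regenerate it later; the benchmark command and record fields are hypothetical and do not reflect the paper's formal model.

```python
"""Record runtime provenance for a benchmark run (illustrative sketch)."""
import hashlib
import json
import platform
import subprocess
import time

cmd = ["./bench", "--size", "1024"]   # hypothetical benchmark binary

start = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
record = {
    "command": cmd,
    "binary_sha256": hashlib.sha256(open(cmd[0], "rb").read()).hexdigest(),
    "host": platform.node(),
    "platform": platform.platform(),
    "wall_seconds": time.time() - start,
    "exit_code": result.returncode,
    "stdout": result.stdout,
}
# The compile-time half (not shown) would add source versions, build flags,
# and library dependencies captured during the build.
with open("bench_provenance.json", "w") as f:
    json.dump(record, f, indent=2)
```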
International Conference on e-Science | 2014
Quan Zhou; Devarshi Ghoshal; Beth Plale
Data provenance is the lineage of a digital artifact or object. Its capture in workflow-controlled distributed applications is well studied, but less is known about the quality of provenance captured solely through existing control infrastructures, i.e., the middleware frameworks used for high-throughput computing. We study the completeness of provenance in the case where information is available only from the middleware layer, using WorkQueue to validate our model. Our evaluation shows that provenance captured from a middleware framework is sufficient to represent the existence of output data and to trace certain failures independent of the application semantics. We show the method's limitations as well.
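To illustrate what is derivable at the middleware layer alone, the invented example below reconstructs failure taint purely from task records of the kind a framework like WorkQueue maintains; it is not the paper's model.

```python
"""Provenance from middleware-level task records only (illustrative sketch)."""

# What the middleware sees for each task, with no application semantics:
# command string, declared inputs/outputs, and exit status (records made up).
tasks = [
    {"cmd": "align ref.fa s1.fq > s1.bam", "in": ["ref.fa", "s1.fq"],
     "out": ["s1.bam"], "exit": 0},
    {"cmd": "align ref.fa s2.fq > s2.bam", "in": ["ref.fa", "s2.fq"],
     "out": ["s2.bam"], "exit": 1},
    {"cmd": "merge s1.bam s2.bam > all.bam", "in": ["s1.bam", "s2.bam"],
     "out": ["all.bam"], "exit": 0},
]

# Derivable at this layer: which outputs came from failed tasks, and which
# downstream results are tainted by an upstream failure.
failed_outputs = {o for t in tasks if t["exit"] != 0 for o in t["out"]}

for t in tasks:
    tainted = [i for i in t["in"] if i in failed_outputs]
    if tainted:
        # The merge output is traceable to the failed alignment without
        # knowing what 'align' or 'merge' actually do.
        print(f"{t['out']} depends on failed artifacts: {tainted}")
```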
Collaboration
Devarshi Ghoshal has collaborated with researchers at the French Institute for Research in Computer Science and Automation (Inria).