Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Daniel Crawl is active.

Publication


Featured research published by Daniel Crawl.


Workflows in Support of Large-Scale Science | 2009

Kepler + Hadoop: a general architecture facilitating data-intensive applications in scientific workflow systems

Jianwu Wang; Daniel Crawl; Ilkay Altintas

MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open-source implementation, Hadoop, support parallel processing on large datasets with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance management. Meanwhile, scientific workflow management systems, e.g., Kepler, Taverna, Triana, and Pegasus, have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that enables users to compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, e.g., graphical user interface and component reuse and sharing, are very complementary to those of MapReduce. Using the presented Hadoop components in Kepler, scientists can easily utilize MapReduce in their domain-specific problems and connect them with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach via a word count use case.
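
For readers unfamiliar with the model, the sketch below illustrates the map and reduce phases using the same word count use case the paper validates. The functions are generic Python stand-ins for the pattern, not the actual Kepler actors or the Hadoop API.

```python
from collections import defaultdict

# Engine-agnostic sketch of the MapReduce model that the Kepler+Hadoop
# components expose as workflow actors. Names are illustrative only.

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    splits = ["kepler runs hadoop", "hadoop runs mapreduce"]
    pairs = [p for doc in splits for p in map_phase(doc)]
    print(reduce_phase(pairs))  # {'kepler': 1, 'runs': 2, ...}
```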


Ecological Informatics | 2010

Workflows and extensions to the Kepler scientific workflow system to support environmental sensor data access and analysis

Derik Barseghian; Ilkay Altintas; Matthew Jones; Daniel Crawl; Nathan Potter; James Gallagher; Peter Cornillon; Mark Schildhauer; Elizabeth T. Borer; Eric W. Seabloom; Parviez R. Hosseini

Environmental sensor networks are now commonly being deployed within environmental observatories and as components of smaller-scale ecological and environmental experiments. Effectively using data from these sensor networks presents technical challenges that are difficult for scientists to overcome, severely limiting the adoption of automated sensing technologies in environmental science. The Realtime Environment for Analytical Processing (REAP) is an NSF-funded project to address the technical challenges related to accessing and using heterogeneous sensor data from within the Kepler scientific workflow system. Using distinct use cases in terrestrial ecology and oceanography as motivating examples, we describe workflows and extensions to Kepler to stream and analyze data from observatory networks and archives. We focus on the use of two newly integrated data sources in Kepler: DataTurbine and OPeNDAP. Integrated access to both near real-time data streams and data archives from within Kepler facilitates both simple data exploration and sophisticated analysis and modeling with these data sources.
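
As a rough illustration of the core idea of uniform access to near-real-time streams and archives, the sketch below puts a simulated streaming source and a simulated archive behind similar read interfaces. The classes and their readings are hypothetical stand-ins, not the DataTurbine or OPeNDAP bindings described in the paper.

```python
import random
import time

class SensorStream:
    """Simulates a DataTurbine-style near-real-time streaming source."""
    def read_latest(self):
        return {"t": time.time(), "temp_c": 15 + random.random() * 10}

class SensorArchive:
    """Simulates an OPeNDAP-style archive of historical records."""
    def __init__(self, records):
        self.records = records
    def read_range(self, start, end):
        return [r for r in self.records if start <= r["t"] <= end]

def mean_temp(records):
    # Same downstream analysis step works on either data source.
    return sum(r["temp_c"] for r in records) / len(records)

if __name__ == "__main__":
    stream = SensorStream()
    window = [stream.read_latest() for _ in range(5)]
    print("stream mean:", mean_temp(window))
    archive = SensorArchive(window)
    past = archive.read_range(window[0]["t"], window[-1]["t"])
    print("archive mean:", mean_temp(past))
```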


International Provenance and Annotation Workshop | 2008

A Provenance-Based Fault Tolerance Mechanism for Scientific Workflows

Daniel Crawl; Ilkay Altintas

Capturing provenance information in scientific workflows is useful not only for determining data dependencies, but also for a wide range of queries including fault tolerance and usage statistics. As collaborative scientific workflow environments provide users with reusable shared workflows, collecting and using provenance data in a generic way that can serve multiple data and computational models becomes vital. This paper presents a method for capturing data value and control dependencies for provenance collection in the Kepler scientific workflow system. It also describes how the information collected based on these dependencies can be used in a fault tolerance framework under different models of computation.
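
The sketch below shows the general principle behind provenance-based fault tolerance: each step's output is recorded, so a re-run replays completed steps from the provenance store instead of re-executing them. This is a minimal Python illustration of the idea, not the Kepler implementation.

```python
# Minimal sketch: provenance-driven replay. A step re-executes only
# if no recorded output exists for it, so a failed run can resume.

provenance = {}  # step name -> recorded output

def run_step(name, func, *inputs):
    """Execute a step unless provenance already holds its output."""
    if name in provenance:
        return provenance[name]       # replay from provenance
    result = func(*inputs)
    provenance[name] = result         # record the data dependency
    return result

def workflow():
    a = run_step("load", lambda: [1, 2, 3])
    b = run_step("square", lambda xs: [x * x for x in xs], a)
    return run_step("total", sum, b)

print(workflow())  # first run executes every step
print(workflow())  # second run replays entirely from provenance
```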


Workflows in Support of Large-Scale Science | 2011

Provenance for MapReduce-based data-intensive workflows

Daniel Crawl; Jianwu Wang; Ilkay Altintas

MapReduce has been widely adopted by many business and scientific applications for data-intensive processing of large datasets. There are increasing efforts to support the MapReduce programming model and the Hadoop environment in workflows and workflow systems, including our work on a higher-level programming model for MapReduce within the Kepler Scientific Workflow System. However, to date, the provenance of MapReduce-based workflows and its effects on workflow execution performance have not been studied in depth. In this paper, we present an extension to our earlier work on MapReduce in Kepler to record the provenance of MapReduce workflows created using the Kepler+Hadoop framework. In particular, we present: (i) a data model that is able to capture provenance inside a MapReduce job as well as the provenance for the workflow that submitted it; (ii) an extension to the Kepler+Hadoop architecture to record provenance using this data model on MySQL Cluster; (iii) a programming interface to query the collected information; and (iv) an evaluation of the scalability of collecting and querying this provenance information using two scenarios with different characteristics.
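
To make the data-model idea concrete, the sketch below records task-level inputs and outputs keyed by a workflow run and answers a simple lineage query across the job boundary. It uses in-memory SQLite and an invented schema purely for illustration; the paper's implementation uses MySQL Cluster and its own data model.

```python
import sqlite3

# Illustrative schema: one table links each MapReduce task's inputs and
# outputs back to the submitting workflow run, so lineage queries can
# cross from workflow-level into job-internal provenance.

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE provenance (
    run_id TEXT, task TEXT, input TEXT, output TEXT)""")

def record(run_id, task, inp, out):
    conn.execute("INSERT INTO provenance VALUES (?, ?, ?, ?)",
                 (run_id, task, inp, out))

def derived_from(output):
    """Query: which inputs directly produced this output?"""
    rows = conn.execute(
        "SELECT input FROM provenance WHERE output = ?", (output,))
    return [r[0] for r in rows]

record("run-1", "map-0", "split-0", "inter-0")
record("run-1", "reduce-0", "inter-0", "part-0")
print(derived_from("part-0"))  # ['inter-0']
```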


International Conference on Conceptual Structures | 2014

Workflow as a Service in the Cloud: Architecture and Scheduling Algorithms

Jianwu Wang; Prakashan Korambath; Ilkay Altintas; James W. Davis; Daniel Crawl

As more and more workflow systems adopt the cloud as their execution environment, it becomes increasingly challenging to efficiently manage the various workflows, virtual machines (VMs) and workflow executions on VM instances. To make the system scalable and easy to extend, we design a Workflow as a Service (WFaaS) architecture with independent services. A core part of the architecture is how to efficiently respond to continuous workflow requests from users and schedule their executions in the cloud. Based on different optimization targets, we propose four heuristic workflow scheduling algorithms for the WFaaS architecture, and analyze the differences and best uses of the algorithms in terms of performance, cost and the price/performance ratio via experimental studies.
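
As one hedged illustration of the kind of heuristic the paper studies, the sketch below greedily picks the cheapest VM type whose scaled runtime still meets a request's deadline. The VM types, prices, and speedups are invented, and this is not one of the paper's four algorithms.

```python
# Toy greedy scheduler: cheapest VM type that satisfies the deadline.
# All VM parameters below are invented for illustration.

VM_TYPES = [
    {"name": "small",  "price_per_hr": 0.10, "speedup": 1.0},
    {"name": "medium", "price_per_hr": 0.25, "speedup": 2.2},
    {"name": "large",  "price_per_hr": 0.60, "speedup": 4.5},
]

def schedule(base_runtime_hr, deadline_hr):
    """Return the cheapest VM whose scaled runtime fits the deadline."""
    feasible = [v for v in VM_TYPES
                if base_runtime_hr / v["speedup"] <= deadline_hr]
    if not feasible:
        return None  # no single VM type meets the deadline
    return min(feasible,
               key=lambda v: v["price_per_hr"] * base_runtime_hr / v["speedup"])

print(schedule(base_runtime_hr=4.0, deadline_hr=2.0))  # -> medium
```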


International Provenance and Annotation Workshop | 2010

Understanding Collaborative Studies through Interoperable Workflow Provenance

Ilkay Altintas; Manish Kumar Anand; Daniel Crawl; Shawn Bowers; Adam Belloum; Paolo Missier; Bertram Ludäscher; Carole A. Goble; Peter M. A. Sloot

The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance related to a single run, mostly executed by a single user. However, a scientific discovery is often the result of methodical execution of many scientific workflows with many datasets produced at different times by one or more users. Further, to promote and facilitate exchange of information between multiple workflow systems supporting provenance, the Open Provenance Model (OPM) has been proposed by the scientific workflow community. In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps to answer collaborative queries, e.g., identifying combined workflows and contributions of users collaborating on a project based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable use case scenario, where different stages of the workflow are executed in three different workflow environments - Kepler, Taverna, and WSVLAM. Through this use case, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.
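
The sketch below conveys the flavor of such a collaborative query: an OPM-style graph records which user generated each artifact, and a traversal lists every user who contributed to a final product. The graph, users, and artifact names are invented, and this is not the QLP syntax.

```python
# Toy OPM-style provenance graph: artifact -> (generating user,
# upstream artifacts it was derived from). All entries are invented.

edges = {
    "raw":    ("alice", []),
    "clean":  ("bob",   ["raw"]),
    "model":  ("carol", ["clean"]),
    "report": ("alice", ["model", "clean"]),
}

def contributors(artifact, seen=None):
    """Collect all users the artifact transitively depends on."""
    seen = set() if seen is None else seen
    user, upstream = edges[artifact]
    seen.add(user)
    for parent in upstream:
        contributors(parent, seen)
    return seen

print(sorted(contributors("report")))  # ['alice', 'bob', 'carol']
```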


Computing in Science and Engineering | 2014

Big Data Applications using Workflows for Data Parallel Computing

Jianwu Wang; Daniel Crawl; Ilkay Altintas; Weizhong Li

In the Big Data era, workflow systems need to embrace data parallel computing techniques for efficient data analysis and analytics. Here, the authors present an easy-to-use, scalable approach to build and execute Big Data applications using actor-oriented modeling in data parallel computing. They use two bioinformatics use cases for next-generation sequencing data analysis to verify the feasibility of their approach.
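
As a minimal illustration of the data-parallel pattern applied to sequencing data, the sketch below fans independent reads out across worker processes; gc_content is a hypothetical stand-in for a real analysis step, not one of the paper's bioinformatics tools.

```python
from multiprocessing import Pool

def gc_content(read):
    """Fraction of G/C bases in one sequencing read (toy analysis step)."""
    return sum(base in "GC" for base in read) / len(read)

if __name__ == "__main__":
    # Each read is processed independently, so the work parallelizes
    # cleanly across processes, the essence of data parallelism.
    reads = ["ACGTACGT", "GGGCCCAA", "ATATATAT", "CGCGCGTA"]
    with Pool(processes=2) as pool:
        print(pool.map(gc_content, reads))
```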


International Conference on Conceptual Structures | 2012

A Framework for Distributed Data-Parallel Execution in the Kepler Scientific Workflow System

Jianwu Wang; Daniel Crawl; Ilkay Altintas

Distributed Data-Parallel (DDP) patterns such as MapReduce have become increasingly popular as solutions to facilitate data-intensive applications, resulting in a number of systems supporting DDP workflows. Yet, applications or workflows built using these patterns are usually tightly coupled with the underlying DDP execution engine they select. We present a framework for distributed data-parallel execution in the Kepler scientific workflow system that enables users to easily switch between different DDP execution engines. We describe a set of DDP actors based on DDP patterns, and directors for DDP workflow executions within the presented framework. We demonstrate how DDP workflows can be easily composed in the Kepler graphical user interface through the reuse of these DDP actors and directors, and how the generated DDP workflows can be executed in different distributed environments. Via a bioinformatics use case, we discuss the usability of the proposed framework and validate its execution scalability.
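
The core idea, decoupling the workflow definition from the DDP engine that runs it, can be sketched as a strategy pattern: the same workflow executes on whichever engine the director binds at run time. The toy engines below are stand-ins for the Hadoop and Stratosphere bindings, not the framework's actual actors or directors.

```python
# Strategy-pattern sketch: one workflow definition, swappable engines.

class LocalEngine:
    """Runs a map pattern sequentially in-process."""
    def run_map(self, func, data):
        return [func(x) for x in data]

class ThreadedEngine:
    """Runs the same map pattern on a thread pool."""
    def run_map(self, func, data):
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor() as pool:
            return list(pool.map(func, data))

def ddp_workflow(engine, data):
    """Same workflow, executed by whichever engine is chosen."""
    return engine.run_map(lambda x: x * x, data)

print(ddp_workflow(LocalEngine(), [1, 2, 3]))     # [1, 4, 9]
print(ddp_workflow(ThreadedEngine(), [1, 2, 3]))  # [1, 4, 9]
```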


Fundamenta Informaticae | 2013

Approaches to Distributed Execution of Scientific Workflows in Kepler

Marcin Płóciennik; Tomasz Żok; Ilkay Altintas; Jianwu Wang; Daniel Crawl; David Abramson; F. Imbeaux; B. Guillerminet; M. López-Caniego; Isabel Campos Plasencia; Wojciech Pych; Paweł Ciecieląg; Bartek Palak; Michal Owsiak; Yann Frauel

The Kepler scientific workflow system enables the creation, execution and sharing of workflows across a broad range of scientific and engineering disciplines while also facilitating remote and distributed execution of workflows. In this paper, we present and compare different approaches to distributed execution of workflows using the Kepler environment, including a distributed data-parallel framework using Hadoop and Stratosphere, and Cloud and Grid execution using the Serpens, Nimrod/K and Globus actors. We also present real-life applications in computational chemistry, bioinformatics and computational physics to demonstrate the usage of the different distributed computing capabilities of Kepler in executable workflows. We further analyze the differences between the approaches and provide guidance on their application.


International Conference on Conceptual Structures | 2015

Towards an Integrated Cyberinfrastructure for Scalable Data-driven Monitoring, Dynamic Prediction and Resilience of Wildfires

Ilkay Altintas; Jessica Block; Raymond A. de Callafon; Daniel Crawl; Charles Cowart; Amarnath Gupta; Mai H. Nguyen; Hans-Werner Braun; Jürgen P. Schulze; Michael J. Gollner; Arnaud Trouvé; Larry Smarr

Wildfires are critical for ecosystems in many geographical regions. However, urbanization in these environments is shifting the ecological balance into a new dynamic that is producing the biggest fires in history. Wildfire wind speeds and directions change in an instant, and first responders can only be effective if they take action as quickly as the conditions change. What is lacking in disaster management today is a system that integrates real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers before, during and after a wildfire. As a first example of such an integrated system, the WIFIRE project is building an end-to-end cyberinfrastructure for real-time and data-driven simulation, prediction and visualization of wildfire behavior. This paper summarizes the approach and early results of the WIFIRE project to integrate networked observations, e.g., heterogeneous satellite data and real-time remote sensor data, with computational techniques in signal processing, visualization, modeling and data assimilation to provide a scalable, technological, and educational solution for monitoring weather patterns to predict a wildfire's rate of spread.

Collaboration


Dive into Daniel Crawl's collaborations.

Top Co-Authors

Ilkay Altintas
University of California

Jianwu Wang
University of Maryland

Mai H. Nguyen
University of California

Pierre Mouallem
North Carolina State University

Alok Singh
University of California

Jessica Block
University of California

Mladen A. Vouk
North Carolina State University

Arie Shoshani
Lawrence Berkeley National Laboratory