Publication


Featured research published by Daniel Q. Duffy.


International Journal of Geographical Information Science | 2017

A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce

Zhenlong Li; Fei Hu; John L. Schnase; Daniel Q. Duffy; Tsengdar Lee; Michael K. Bowen; Chaowei Yang

Climate observations and model simulations are producing vast amounts of array-based spatiotemporal data. Efficient processing of these data is essential for assessing global challenges such as climate change, natural disasters, and diseases. This is challenging not only because of the large data volume, but also because of the intrinsic high-dimensional nature of geoscience data. To tackle this challenge, we propose a spatiotemporal indexing approach to efficiently manage and process big climate data with MapReduce in a highly scalable environment. Using this approach, big climate data are stored directly in a Hadoop Distributed File System in their original, native file format. A spatiotemporal index is built to bridge the logical array-based data model and the physical data layout, which enables fast data retrieval when performing spatiotemporal queries. Based on the index, a data-partitioning algorithm is applied to enable MapReduce to achieve high data locality as well as balance the workload. The proposed indexing approach is evaluated using the National Aeronautics and Space Administration (NASA) Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. The experimental results show that the index can significantly accelerate querying and processing (~10× speedup compared to the baseline test using the same computing cluster), while keeping the index-to-data ratio small (0.0328%). The applicability of the indexing approach is demonstrated by a climate anomaly detection deployed on a NASA Hadoop cluster. This approach is also able to support efficient processing of general array-based spatiotemporal data in various geoscience domains without special configuration on a Hadoop cluster.
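
To make the indexing idea concrete, here is a minimal, illustrative Python sketch (not the authors' implementation) of an index that maps (variable, time step, spatial tile) keys to byte ranges in the original files and groups matching ranges by storage host to approximate the data-locality-aware partitioning described above:

```python
# Illustrative sketch only: a tiny spatiotemporal index that maps
# (variable, time step, spatial tile) to the byte range holding that tile
# in the original climate file, so a query reads only the ranges it needs.
from collections import defaultdict, namedtuple

IndexEntry = namedtuple("IndexEntry", "file_path byte_offset byte_length host")

class SpatiotemporalIndex:
    def __init__(self):
        # key: (variable, time_step, tile_id) -> IndexEntry
        self._entries = {}

    def add(self, variable, time_step, tile_id, entry):
        self._entries[(variable, time_step, tile_id)] = entry

    def query(self, variable, time_range, tile_ids):
        """Return the index entries covering a spatiotemporal query."""
        t0, t1 = time_range
        return [e for (v, t, tile), e in self._entries.items()
                if v == variable and t0 <= t <= t1 and tile in tile_ids]

    def partition_by_host(self, entries):
        """Group byte ranges by the node that stores them, so map tasks can be
        scheduled where the data live (the locality idea, not Hadoop's API)."""
        groups = defaultdict(list)
        for e in entries:
            groups[e.host].append(e)
        return groups
```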


Bulletin of the American Meteorological Society | 2016

A Global Repository for Planet-Sized Experiments and Observations

Dean N. Williams; V. Balaji; Luca Cinquini; Sebastien Denvil; Daniel Q. Duffy; Ben Evans; Robert D. Ferraro; Rose Hansen; Michael Lautenschlager; Claire Trenham

Working across U.S. federal agencies, international agencies, and multiple worldwide data centers, and spanning seven international network organizations, the Earth System Grid Federation (ESGF) allows users to access, analyze, and visualize data using a globally federated collection of networks, computers, and software. Its architecture employs a system of geographically distributed peer nodes that are independently administered yet united by common federation protocols and application programming interfaces (APIs). The full ESGF infrastructure has now been adopted by multiple Earth science projects and allows access to petabytes of geophysical data, including the Coupled Model Intercomparison Project (CMIP) output used by the Intergovernmental Panel on Climate Change assessment reports. Data served by ESGF include not only model output (i.e., CMIP simulation runs) but also observational data from satellites and instruments, reanalyses, and generated images. Metadata summarize basic information...
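
As an illustration of programmatic access to a federation like this, the sketch below queries an ESGF index node's search REST API; the node URL, facet names, and response handling are assumptions based on commonly documented usage and may differ by node and API version:

```python
# Hedged sketch of querying the ESGF federated search API over HTTP.
# The base URL and facets below are illustrative; consult the ESGF search
# documentation for the facets supported by a given index node.
import requests

def esgf_search(base_url="https://esgf-node.llnl.gov/esg-search/search", **facets):
    """Query an ESGF index node and return the matching dataset records."""
    params = {"format": "application/solr+json", "limit": 10}
    params.update(facets)
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()
    # Solr-style JSON: the dataset records live under response -> docs.
    return response.json()["response"]["docs"]

# Example usage: search for CMIP surface air temperature datasets.
# docs = esgf_search(project="CMIP5", variable="tas", latest="true")
```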


ISPRS International Journal of Geo-Information | 2018

Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data

Fei Hu; Mengchao Xu; Jingchao Yang; Yanshou Liang; Kejin Cui; Michael M. Little; Christopher S. Lynnes; Daniel Q. Duffy; Chaowei Yang

Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address these challenges, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis, including containers tailored to geospatial data. For example, Rasdaman was developed to handle raster data, and GeoSpark/SpatialHadoop were enhanced from Spark/Hadoop to handle vector data. However, few studies have systematically compared and evaluated the features and performance of these popular data containers. This paper provides a comprehensive evaluation of six popular data containers (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling multi-dimensional, array-based geospatial raster datasets. Their architectures, technologies, capabilities, and performance are compared and evaluated from two perspectives: (a) system design and architecture (distributed architecture, logical data model, physical data model, and data operations); and (b) practical use experience and performance (data preprocessing, data uploading, query speed, and resource consumption). Four major conclusions are offered: (1) none of the data containers except ClimateSpark has good support for the HDF data format used in this paper, requiring time- and resource-consuming data preprocessing to load the data; (2) SciDB, Rasdaman, and MongoDB handle small/medium volumes of data query well, whereas Spark and ClimateSpark can handle large volumes of data with stable resource consumption; (3) SciDB and Rasdaman provide mature array-based data operation and analytical functions, while the others lack these functions for users; and (4) SciDB, Spark, and Hive have better support for user-defined functions (UDFs) to extend the system capability.
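
A minimal, hypothetical benchmarking harness in the spirit of this comparison might time the same query against each container's client; the callables here are stand-ins, not real container bindings:

```python
# Hypothetical benchmarking harness: each "container" is represented by a
# zero-argument callable that runs the same spatiotemporal query, and we
# record the average wall-clock latency for each.
import time

def benchmark(query_runners, repetitions=3):
    """query_runners: dict mapping container name -> zero-argument callable."""
    results = {}
    for name, run_query in query_runners.items():
        timings = []
        for _ in range(repetitions):
            start = time.perf_counter()
            run_query()   # e.g., issue the query via the container's own client
            timings.append(time.perf_counter() - start)
        results[name] = sum(timings) / len(timings)
    return results

# Example usage (with stand-in callables):
# print(benchmark({"SciDB": lambda: ..., "ClimateSpark": lambda: ...}))
```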


IEEE Geoscience and Remote Sensing Magazine | 2016

Big Data Challenges in Climate Science: Improving the next-generation cyberinfrastructure

John L. Schnase; Tsengdar J. Lee; Chris A. Mattmann; Christopher Lynnes; Luca Cinquini; Paul Ramirez; Andrew F. Hart; Dean N. Williams; Duane E. Waliser; Pamela Rinsland; W. Phillip Webster; Daniel Q. Duffy; Mark McInerney; Glenn S. Tamkin; Gerald Potter; Laura Carriere

The knowledge we gain from research in climate science depends on the generation, dissemination, and analysis of high-quality data. This work comprises technical practice as well as social practice, both of which are distinguished by their massive scale and global reach. As a result, the amount of data involved in climate research is growing at an unprecedented rate. Some examples of the types of activities that increasingly require an improved cyberinfrastructure for dealing with large amounts of critical scientific data are climate model intercomparison (CMIP) experiments; the integration of observational data and climate reanalysis data with climate model outputs, as seen in the Observations for Model Intercomparison Projects (Obs4MIPs), Analysis for Model Intercomparison Projects (Ana4MIPs), and Collaborative Reanalysis Technical Environment-Intercomparison Project (CREATE-IP) activities; and the collaborative work of the Intergovernmental Panel on Climate Change (IPCC). This article provides an overview of some of climate science's big data problems and the technical solutions being developed to advance data publication, climate analytics as a service, and interoperability within the Earth System Grid Federation (ESGF), which is the primary cyberinfrastructure currently supporting global climate research activities.


IEEE Conference on Mass Storage Systems and Technologies | 2011

The NASA Center for Climate Simulation Data Management System

John L. Schnase; William P. Webster; Lynn Parnell; Daniel Q. Duffy

The NASA Center for Climate Simulation (NCCS) plays a lead role in meeting the computational and data management requirements of climate modeling and data assimilation. Scientific data services are becoming an important part of the NCCS mission. The NCCS Data Management System (DMS) is a key element of the NCCS's technological response to this expanding role. In DMS, we are using the Integrated Rule-Oriented Data System (iRODS) to combine disparate data collections into a federated platform upon which various data services can be implemented. Work to date has demonstrated the effectiveness of iRODS in managing a large-scale collection of observational data, managing model output data in a cloud computing context, and managing NCCS-hosted data products that are published through community-defined services such as the Earth System Grid (ESG). Plans call for staged operational adoption of iRODS in the NCCS.
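
As a hedged sketch of how a federated iRODS collection can be browsed and tagged from a script, the example below assumes the python-irodsclient package (an assumption; it is not named in the paper) and uses placeholder host, zone, and path values:

```python
# Sketch assuming the python-irodsclient package; method names may vary
# slightly across client versions. Host, zone, and paths are placeholders.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.nasa.gov", port=1247,
                  user="scientist", password="secret", zone="nccsZone") as session:
    # Browse a (hypothetical) federated collection of model output.
    collection = session.collections.get("/nccsZone/home/scientist/merra")
    for obj in collection.data_objects:
        print(obj.name, obj.size)

    # Attach a metadata attribute-value pair to one data object so that
    # downstream services can discover it by query.
    data_object = session.data_objects.get("/nccsZone/home/scientist/merra/t2m_1980.nc")
    data_object.metadata.add("project", "MERRA")
```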


IEEE Conference on Mass Storage Systems and Technologies | 2005

Beyond the storage area network: data intensive computing in a distributed environment

Daniel Q. Duffy; N. Acks; V. Noga; T. Schardt; J.P. Gary; B. Fink; Ben Kobler; M. Donovan; J. McElvaney; K. Kamischke

NASA Earth and space science applications are currently utilizing geographically distributed computational platforms. These codes typically require more compute cycles and generate more data than any other applications currently supported by NASA. Furthermore, with the development of a leadership-class SGI system at NASA Ames (Project Columbia), NASA has created an agency-wide computational resource. This resource is heavily employed by Earth and space science users, resulting in large amounts of data. The management of this data in a distributed environment requires a significant amount of effort from the users. This paper defines the approach taken to create an enabling infrastructure to help users easily access and move data across distributed computational resources. Specifically, this paper discusses the approach taken to create a wide area storage area network (SAN) using the SGI CXFS file system over standard TCP/IP. In addition, an emerging technology test bed initiative is described to show how NASA is creating an environment to continually evaluate new technology for data intensive computing.


ASME 2015 International Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Microsystems collocated with the ASME 2015 13th International Conference on Nanochannels, Microchannels, and Minichannels | 2015

Thermodynamic Characterization of a Direct Water Cooled Server Rack Running Synthetic and Real High Performance Computing Work Loads

Lynn Parnell; Garrison Vaughan; John H. Thompson; Daniel Q. Duffy; Louis Capps; Mark E. Steinke; Vinod Kamath

High performance computing server racks are being engineered to contain significantly more processing capability within the same computer room footprint year after year. The processor density within a single rack is becoming high enough that traditional, inefficient air-cooling of servers is inadequate to sustain HPC workloads. Experiments that characterize the performance of a direct water-cooled server rack in an operating HPC facility are described in this paper. Performance of the rack is reported for a range of cooling water inlet temperatures, flow rates, and workloads that include actual and worst-case synthetic benchmarks. Power and temperature measurements of all processors and memory components in the rack were made while extended benchmark tests were conducted throughout the range of cooling variables allowed within an operational HPC facility. Synthetic benchmark results were compared with those obtained on a single server of the same design that had been characterized thermodynamically. Neither actual nor synthetic benchmark performance was affected during the course of the experiments, varying less than 0.13 percent. Power consumption change in the rack was minimal for the entire excursion of coolant temperatures and flow rates. Establishing the characteristics of such a highly energy efficient server rack in situ is critical to determine how the technology might be integrated into an existing heterogeneous, hybrid-cooled computing facility, i.e., a facility that includes some servers that are air cooled as well as some that are direct water cooled.
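
For context, the rate of heat removed by the rack's water loop can be estimated with the standard single-phase relation below; this is a textbook relation used here for illustration, and the flow rate and temperature rise shown are placeholder values, not measurements from the paper:

\dot{Q} = \dot{m}\, c_p\, \Delta T, \qquad \text{e.g. } \dot{m} = 0.5\ \mathrm{kg/s},\ c_p \approx 4.18\ \mathrm{kJ/(kg\,K)},\ \Delta T = 10\ \mathrm{K} \;\Rightarrow\; \dot{Q} \approx 20.9\ \mathrm{kW}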


Computers & Geosciences | 2018

ClimateSpark: An In-Memory Distributed Computing Framework for Big Climate Data Analytics

Fei Hu; Chaowei Yang; John L. Schnase; Daniel Q. Duffy; Mengchao Xu; Michael K. Bowen; Tsengdar Lee; Weiwei Song

The unprecedented growth of climate data creates new opportunities for climate studies, yet big climate data pose a grand challenge for climatologists to manage and analyze efficiently. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. A chunked data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations, and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multi-dimensional, array-based datasets in various geoscience domains.
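
The kind of interactive spatiotemporal query the abstract describes could look roughly like the PySpark sketch below; the table layout, column names, and Parquet source are hypothetical stand-ins rather than ClimateSpark's actual ClimateRDD interface:

```python
# Hedged PySpark sketch of a spatiotemporal aggregation issued through Spark SQL.
# The input path, schema, and table name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("climate-query-sketch").getOrCreate()

# Assume a DataFrame of decoded chunks with columns: variable, ts, lat, lon, value.
df = spark.read.parquet("/data/merra_chunks.parquet")
df.createOrReplaceTempView("climate")

# Monthly mean 2-m temperature over a continental-scale bounding box.
monthly_mean = spark.sql("""
    SELECT variable, date_trunc('month', ts) AS month, AVG(value) AS mean_value
    FROM climate
    WHERE variable = 'T2M'
      AND lat BETWEEN 30 AND 50
      AND lon BETWEEN -130 AND -60
    GROUP BY variable, date_trunc('month', ts)
    ORDER BY month
""")
monthly_mean.show()
```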


International Conference on Big Data | 2015

Strategic roadmap for the Earth System Grid Federation

Dean N. Williams; Michael Lautenschlager; V. Balaji; Luca Cinquini; Cecelia DeLuca; Sebastien Denvil; Daniel Q. Duffy; Benjamin J. K. Evans; Robert D. Ferraro; Martin Juckes; Claire Trenham

This article describes the Earth System Grid Federation (ESGF) mission and an international integration strategy for data, database and computational architecture, and stable infrastructure highlighted by the authors (the ESGF Executive Committee). These highlights are key developments needed over the next five to seven years in response to large-scale national and international climate community projects that depend on ESGF for success. Quality assurance and baseline performance, from laptop to high-performance computing, characterize available and potential data streams and strategies. These are required for interactive data collections to remedy gaps in handling enormous international federated climate data archives. Appropriate cyber security ensures protection of data according to projects but still allows access and portability for different ESGF and individual groups and users. A timeline and plan for forecasting interoperable tools takes ESGF from a federated database archive to a robust virtual laboratory and concludes the article.


IEEE International Conference on Cloud Computing Technology and Science | 2014

Climate Analytics as a Service

John L. Schnase; Daniel Q. Duffy; Mark McInerney; W. Phillip Webster; Tsengdar J. Lee

Exascale computing, big data, and cloud computing are driving the evolution of large-scale information systems toward a model of data-proximal analysis. In response, we are developing a concept of climate analytics as a service (CAaaS) that represents a convergence of data analytics and archive management. With this approach, high-performance compute–storage implemented as an analytic system is part of a dynamic archive comprising both static and computationally realized objects. It is a system whose capabilities are framed as behaviors over a static data collection, but in which queries cause results to be created rather than found and retrieved. Those results can be the product of a complex analysis, but, importantly, they can also be tailored responses to the simplest of requests. NASA's MERRA Analytic Service and associated Climate Data Services Application Programming Interface provide a real-world example of climate analytics delivered as a service in this way. Our experiences reveal several advantages to this approach, not the least of which is an orders-of-magnitude time reduction in the data-assembly task common to many scientific workflows.
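
A purely hypothetical sketch of a data-proximal analytics request is shown below; the endpoint, parameters, and response shape are illustrative only and do not describe the actual MERRA Analytic Service or Climate Data Services API:

```python
# Hypothetical "climate analytics as a service" client: ask the service to
# compute an average near the data and return only the small result, instead
# of transferring the raw archive subset to the user.
import requests

def server_side_average(base_url, variable, start, end, bbox):
    """Request a server-side spatial/temporal average; all names are illustrative."""
    payload = {
        "operation": "average",
        "variable": variable,          # e.g. "T2M"
        "time_range": [start, end],    # e.g. ["1980-01-01", "1989-12-31"]
        "bbox": bbox,                  # [west, south, east, north]
    }
    response = requests.post(f"{base_url}/analytics", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```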

Collaboration


Dive into Daniel Q. Duffy's collaborations.

Top Co-Authors

John L. Schnase (Goddard Space Flight Center)
Glenn S. Tamkin (Goddard Space Flight Center)
John H. Thompson (Goddard Space Flight Center)
Mark McInerney (Goddard Space Flight Center)
Chaowei Yang (George Mason University)
Dean N. Williams (Lawrence Livermore National Laboratory)
Fei Hu (George Mason University)
Denis Nadeau (Goddard Space Flight Center)
Luca Cinquini (Jet Propulsion Laboratory)