Publication


Featured research published by David Donofrio.


IEEE Computer | 2009

Energy-Efficient Computing for Extreme-Scale Science

David Donofrio; Leonid Oliker; John Shalf; Michael F. Wehner; Chris Rowen; Jens Krueger; Shoaib Kamil; Marghoob Mohiyuddin

A many-core processor design for high-performance systems draws from embedded computing's low-power architectures and design processes, providing a radical alternative to cluster solutions.


IEEE International Conference on High Performance Computing Data and Analytics | 2014

Abstract machine models and proxy architectures for exascale computing

James A. Ang; Richard F. Barrett; R.E. Benner; D. Burke; Cy P. Chan; Jeanine Cook; David Donofrio; Simon D. Hammond; Karl Scott Hemmert; Suzanne M. Kelly; H. Le; Vitus J. Leung; David Resnick; Arun Rodrigues; John Shalf; Dylan T. Stark; Didem Unat; Nicholas J. Wright

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they open up opportunities for application performance analysis, system software development, and hardware optimization. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
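
To make the distinction concrete, the sketch below treats an abstract machine model as a purely structural description and a proxy architecture as that same description with speeds and capacities attached. The field names and numbers are illustrative assumptions, not the models or parameters defined in the paper.

```python
from dataclasses import dataclass

@dataclass
class AbstractMachineModel:
    """Structural view only: which components exist and how they are organized."""
    core_types: tuple        # e.g. ("latency-optimized", "throughput-optimized")
    memory_levels: tuple     # e.g. ("HBM", "DDR")
    on_chip_network: str     # e.g. "mesh"

@dataclass
class ProxyArchitecture(AbstractMachineModel):
    """The same model with parameters that elucidate speeds and capacities."""
    cores_per_type: dict         # core type -> core count
    mem_capacity_gb: dict        # memory level -> capacity in GB
    mem_bandwidth_gbs: dict      # memory level -> bandwidth in GB/s
    peak_tflops: float           # aggregate peak compute

# Illustrative (made-up) parameter values for one proxy instance.
proxy = ProxyArchitecture(
    core_types=("latency", "throughput"),
    memory_levels=("HBM", "DDR"),
    on_chip_network="mesh",
    cores_per_type={"latency": 4, "throughput": 64},
    mem_capacity_gb={"HBM": 16, "DDR": 128},
    mem_bandwidth_gbs={"HBM": 1000.0, "DDR": 100.0},
    peak_tflops=1.0,
)

# With parameters attached, a developer can ask quantitative questions,
# e.g. how many bytes/flop each memory level can sustain.
for level in proxy.memory_levels:
    ratio = proxy.mem_bandwidth_gbs[level] / (proxy.peak_tflops * 1000.0)
    print(f"{level}: {ratio:.2f} bytes/flop")
```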


IEEE Conference on Mass Storage Systems and Technologies | 2012

NANDFlashSim: Intrinsic latency variation aware NAND flash memory system modeling and simulation at microarchitecture level

Myoungsoo Jung; Ellis Herbert Wilson; David Donofrio; John Shalf; Mahmut T. Kandemir

As NAND flash memory becomes popular in diverse areas ranging from embedded systems to high performance computing, exposing and understanding flash memory's performance, energy consumption, and reliability becomes increasingly important. Moreover, with an increasing trend towards multiple-die, multiple-plane architectures and high speed interfaces, high performance NAND flash memory systems are expected to continue to scale. This scaling should further reduce costs and thereby widen proliferation of devices based on the technology. However, when designing NAND flash-based devices, making decisions about the optimal system configuration is non-trivial because NAND flash is sensitive to a large number of parameters, and some parameters exhibit significant latency variations. Such parameters include architectural choices such as multi-die and multi-plane organization, diverse node technologies, and a host of factors that affect performance, energy consumption, and reliability. Unfortunately, no public-domain tools exist for high-fidelity, microarchitecture-level NAND flash memory simulation to assist with making such decisions. Therefore, we introduce NANDFlashSim: a latency-variation-aware, detailed, and highly configurable NAND flash simulation model. NANDFlashSim implements a detailed timing model for operations in sixteen state-of-the-art NAND flash operation mode combinations. In addition, NANDFlashSim models the energy and reliability of NAND flash memory based on statistics. From our comprehensive experiments using NANDFlashSim, we found that 1) most read cases were unable to leverage the highly-parallel internal architecture of NAND flash regardless of the NAND flash operation mode, 2) the main source of this performance bottleneck is I/O bus activity, not NAND flash activity itself, 3) multi-level-cell NAND flash provides lower I/O bus resource contention than single-level-cell NAND flash, but the resource contention becomes a serious problem as the number of dies increases, and 4) employing many dies rather than many planes promises better performance in disk-friendly real workloads. The simulator can be downloaded from http://www.cse.psu.edu/~mqj5086/nfs.
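
As a rough illustration of finding 2), the toy model below (not NANDFlashSim itself, and with made-up timing constants) shows how a single shared I/O bus serializes page transfers even when array reads on multiple dies overlap.

```python
# All timing constants below are illustrative assumptions, not measured values.
T_ARRAY_READ_US = 50.0   # reading one page from the cells into a die's page register
T_BUS_XFER_US = 25.0     # transferring one page over the single shared I/O bus

def read_latency_us(num_pages: int, num_dies: int) -> float:
    """Rough lower bound on latency for reading num_pages striped across num_dies."""
    pages_per_die = -(-num_pages // num_dies)        # ceiling division
    array_bound = pages_per_die * T_ARRAY_READ_US    # dies read their pages in parallel
    bus_bound = num_pages * T_BUS_XFER_US            # but every page crosses one bus
    # With perfect pipelining, whichever resource saturates dominates total latency.
    return max(array_bound + T_BUS_XFER_US, bus_bound + T_ARRAY_READ_US)

for dies in (1, 2, 4, 8):
    print(f"{dies} dies: {read_latency_us(num_pages=32, num_dies=dies):.0f} us")
# Beyond a couple of dies the bus bound dominates, so adding dies stops helping reads.
```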


IEEE International Conference on High Performance Computing Data and Analytics | 2012

Accelerating analysis of void space in porous materials on multicore and GPU platforms

Richard L. Martin; Prabhat; David Donofrio; James A. Sethian; Maciej Haranczyk

Developing computational tools that enable discovery of new materials for energy-related applications is a challenge. Crystalline porous materials are a promising class of materials that can be used for oil refinement, hydrogen or methane storage as well as carbon dioxide capture. Selecting optimal materials for these important applications requires analysis and screening of millions of potential candidates. Recently, we proposed an automatic approach based on the Fast Marching Method (FMM) for performing analysis of void space inside materials, a critical step preceding expensive molecular dynamics simulations. This breakthrough enables unsupervised, high-throughput characterization of large material databases. The algorithm has three steps: (1) calculation of the cost-grid which represents the structure and encodes the occupiable positions within the void space; (2) using FMM to segment out patches of the void space in the grid of (1), and find how they are connected to form either periodic channels or inaccessible pockets; and (3) generating blocking spheres that encapsulate the discovered inaccessible pockets and are used in subsequent molecular simulations. In this work, we expand upon our original approach through (A) replacement of the FMM-based approach with a more computationally efficient flood fill algorithm; and (B) parallelization of all steps in the algorithm, including a GPU implementation of the most computationally expensive step, the cost-grid generation. We report the acceleration achievable in each step and in the complete application, and discuss the implications for high-throughput material screening.
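
The sketch below illustrates step (2) using the flood-fill variant mentioned in (A): it labels connected patches of occupiable voxels on a periodic grid and flags patches that connect to their own periodic image as channels. It is a minimal illustration assuming a boolean occupiability grid, not the authors' implementation.

```python
from collections import deque
import numpy as np

def label_void_space(occupiable: np.ndarray):
    """Label connected void patches on a periodic grid; report which ones wrap.

    occupiable: boolean 3D array, True where a probe can sit (the cost-grid outcome).
    Returns (labels, is_channel), where is_channel[k] is True when patch k connects to
    its own periodic image, i.e. forms a periodic channel rather than a closed pocket.
    """
    nx, ny, nz = occupiable.shape
    labels = -np.ones(occupiable.shape, dtype=int)
    is_channel = []
    for seed in zip(*np.nonzero(occupiable)):
        if labels[seed] != -1:
            continue
        label = len(is_channel)
        labels[seed] = label
        offsets = {seed: (0, 0, 0)}   # lattice image at which each voxel was reached
        wraps = False
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            ox, oy, oz = offsets[(x, y, z)]
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                px, py, pz = x + dx, y + dy, z + dz
                cell = (px % nx, py % ny, pz % nz)          # wrap around the unit cell
                if not occupiable[cell]:
                    continue
                off = (ox + px // nx, oy + py // ny, oz + pz // nz)
                if cell not in offsets:
                    offsets[cell] = off
                    labels[cell] = label
                    queue.append(cell)
                elif offsets[cell] != off:
                    wraps = True       # reached the same voxel in a different image
        is_channel.append(wraps)
    return labels, is_channel

# Tiny demo: a pore spanning the cell along x is a channel; a lone cavity is a pocket.
grid = np.zeros((4, 4, 4), dtype=bool)
grid[:, 1, 1] = True
grid[2, 3, 3] = True
print(label_void_space(grid)[1])   # [True, False]
```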


Neuron | 2016

High-Performance Computing in Neuroscience for Data-Driven Discovery, Integration, and Dissemination

Kristofer E. Bouchard; James B. Aimone; Miyoung Chun; Thomas Dean; Michael Denker; Markus Diesmann; David Donofrio; Loren M. Frank; Narayanan Kasthuri; Christof Koch; Oliver Ruebel; Horst D. Simon; Friedrich T. Sommer; Prabhat

Opportunities offered by new neuro-technologies are threatened by lack of coherent plans to analyze, manage, and understand the data. High-performance computing will allow exploratory analysis of massive datasets stored in standardized formats, hosted in open repositories, and integrated with simulations.


International Conference on Parallel Architectures and Compilation Techniques | 2015

NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures

Jie Zhang; David Donofrio; John Shalf; Mahmut T. Kandemir; Myoungsoo Jung

Thanks to massive parallelism in modern Graphics Processing Units (GPUs), emerging data processing applications in GPU computing exhibit ten-fold speedups compared to CPU-only systems. However, this GPU-based acceleration is limited in many cases by the significant data movement overheads and inefficient memory management for host-side storage accesses. To address these shortcomings, this paper proposes a non-volatile memory management unit (NVMMU) that reduces the file data movement overheads by directly connecting the Solid State Disk (SSD) to the GPU. We implemented our proposed NVMMU on real hardware with commercially available GPU and SSD devices by considering different types of storage interfaces and configurations. In this work, NVMMU unifies two discrete software stacks (one for the SSD and the other for the GPU) in two major ways. While a new interface provided by our NVMMU directly forwards file data between the GPU runtime library and the I/O runtime library, it supports non-volatile direct memory access (NDMA) that pairs those GPU and SSD devices via physically shared system memory blocks. This unification in turn can eliminate unnecessary user/kernel-mode switching, improve memory management, and remove data copy overheads. Our evaluation results demonstrate that NVMMU can reduce the overheads of file data movement by 95% on average, improving overall system performance by 78% compared to a conventional IOMMU approach.
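
The toy sketch below only illustrates the data-path difference the abstract describes, counting per-chunk copies on the conventional path (SSD to host buffer, then host buffer to GPU memory) versus a unified path through a shared memory block; the function names and chunk size are assumptions, not the NVMMU interface.

```python
CHUNK = 4 << 20   # assumed 4 MiB transfer granularity

def conventional_copies(file_bytes: int) -> int:
    """SSD -> host buffer (I/O stack), then host buffer -> GPU memory (GPU stack)."""
    chunks = -(-file_bytes // CHUNK)
    return 2 * chunks

def unified_copies(file_bytes: int) -> int:
    """SSD -> shared memory block that both the I/O and GPU runtimes can address."""
    chunks = -(-file_bytes // CHUNK)
    return 1 * chunks

size = 1 << 30   # a 1 GiB file
print("conventional data copies:", conventional_copies(size))
print("unified (NDMA-style) copies:", unified_copies(size))
```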


International Conference on Computer Design | 2015

OpenNVM: An open-sourced FPGA-based NVM controller for low level memory characterization

Jie Zhang; Gieseo Park; Mustafa Shihab; David Donofrio; John Shalf; Myoungsoo Jung

Accurate characterization of real device samples is essential for understanding the true potential of the emerging non-volatile memories (NVMs) and identifying their optimal placement in the memory hierarchy. Even though NVM devices are now available from different manufacturers, the lack of an appropriate NVM controller and evaluation platform in the public domain is the main challenge in extracting empirical data from these real devices. In this paper, we present OpenNVM, an open-sourced, highly configurable FPGA-based evaluation/characterization platform for various NVM technologies. Through OpenNVM, this work reveals important low-level NVM characteristics, including i) static and dynamic latency disparity, ii) error rate variation, iii) power consumption behavior, and iv) the interrelationship between frequency and NVM operational current. In addition, we also examine state-of-the-art write-once-memory (WOM) codes on a real NVM device and study diverse system-level performance impacts based on our findings. All FPGA source code and detailed information about our hardware design are ready to be open-sourced and downloaded for free.
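
For readers unfamiliar with WOM codes, the snippet below shows the classic Rivest-Shamir code, which stores two successive 2-bit values in three cells while only ever programming cells from 0 to 1; it is background illustration, not the specific codes evaluated in the paper.

```python
# First-write and second-write codebooks for the 2-bits-in-3-cells WOM code.
FIRST  = {0b00: 0b000, 0b01: 0b001, 0b10: 0b010, 0b11: 0b100}
SECOND = {0b00: 0b111, 0b01: 0b110, 0b10: 0b101, 0b11: 0b011}

def decode(cells: int) -> int:
    """At most one set cell means a first-generation word; otherwise second-generation."""
    if bin(cells).count("1") <= 1:
        return {v: k for k, v in FIRST.items()}[cells]
    return {v: k for k, v in SECOND.items()}[cells]

def second_write(cells: int, value: int) -> int:
    """Overwrite the stored value while only programming cells from 0 to 1."""
    if decode(cells) == value:
        return cells                      # value unchanged: leave the cells alone
    target = SECOND[value]
    assert cells & ~target == 0           # no cell ever needs to be cleared
    return target

cells = FIRST[0b10]                       # first write stores 10 in cells 010
cells = second_write(cells, 0b11)         # second write stores 11 without an erase
assert decode(cells) == 0b11
```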


ACM Transactions on Storage | 2016

NANDFlashSim: High-Fidelity, Microarchitecture-Aware NAND Flash Memory Simulation

Myoungsoo Jung; Wonil Choi; Shuwen Gao; Ellis Herbert Wilson; David Donofrio; John Shalf; Mahmut T. Kandemir

As the popularity of NAND flash expands in arenas from embedded systems to high-performance computing, a high-fidelity understanding of its specific properties becomes increasingly important. Further, with the increasing trend toward multiple-die, multiple-plane architectures and high-speed interfaces, flash memory systems are expected to continue to scale and cheapen, resulting in their broader proliferation. However, when designing NAND-based devices, making decisions about the optimal system configuration is nontrivial, because flash is sensitive to a number of parameters and suffers from inherent latency variations, and no available tools suffice for studying these nuances. The parameters include the architectures, such as multidie and multiplane, diverse node technologies, bit densities, and cell reliabilities. Therefore, we introduce NANDFlashSim, a high-fidelity, latency-variation-aware, and highly configurable NAND-flash simulator, which implements a detailed timing model for 16 state-of-the-art NAND operations. Using NANDFlashSim, we notably discover the following. First, regardless of the operation, reads fail to leverage internal parallelism. Second, MLC provides lower I/O bus contention than SLC, but contention becomes a serious problem as the number of dies increases. Third, many-die architectures outperform many-plane architectures for disk-friendly workloads. Finally, employing a high-performance I/O bus or an increased page size does not enhance energy savings. Our simulator is available at http://nfs.camelab.org.


Network on Chip Architectures | 2014

OpenSoC Fabric: On-Chip Network Generator: Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric

Farzad Fatollahi-Fard; David Donofrio; George Michelogiannakis; John Shalf

Recent advancements in technology scaling have sparked a trend towards greater integration, with large-scale chips containing thousands of processors connected to memories and other I/O devices using non-trivial network topologies. Software simulation suffers from long execution times or reduced accuracy in such complex systems, whereas hardware RTL development is too time-consuming. We present OpenSoC Fabric, a parameterizable and powerful on-chip network generator for evaluating future large-scale chip multiprocessors and SoCs. OpenSoC Fabric leverages a new hardware DSL, Chisel, which contains powerful abstractions provided by its base language, Scala, and generates both software (C++) and hardware (Verilog) models from a single code base. This is in contrast to other readily available tools, which typically provide either software or hardware models, but not both. The OpenSoC Fabric infrastructure is modeled after existing state-of-the-art simulators, offers a large and powerful collection of configuration options, is open-source, and uses object-oriented design and functional programming to make functionality extension as easy as possible.
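
OpenSoC Fabric itself is written in Chisel, so the sketch below is only a language-neutral illustration of the kind of parameterization such a generator exposes: a small mesh configuration plus dimension-ordered (XY) routing. The parameter names are assumptions, not OpenSoC Fabric's actual configuration options.

```python
from dataclasses import dataclass

@dataclass
class MeshConfig:
    cols: int = 4                # routers per row
    rows: int = 4                # routers per column
    flit_bits: int = 64          # flit width
    virtual_channels: int = 2    # virtual channels per input port

def xy_route(cfg: MeshConfig, src: int, dst: int) -> list:
    """Router IDs visited by dimension-ordered (X then Y) routing from src to dst."""
    sx, sy = src % cfg.cols, src // cfg.cols
    dx, dy = dst % cfg.cols, dst // cfg.cols
    path, x, y = [src], sx, sy
    while x != dx:
        x += 1 if dx > x else -1
        path.append(y * cfg.cols + x)
    while y != dy:
        y += 1 if dy > y else -1
        path.append(y * cfg.cols + x)
    return path

cfg = MeshConfig(cols=4, rows=4, flit_bits=128)
print(xy_route(cfg, src=0, dst=15))   # [0, 1, 2, 3, 7, 11, 15]
```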


International Symposium on Performance Analysis of Systems and Software | 2016

OpenSoC Fabric: On-chip network generator

Farzad Fatollahi-Fard; David Donofrio; George Michelogiannakis; John Shalf

As technology scaling continues, on-chip networks are expected to remain important in future many-core chips due to the increased parallelism and, therefore, communication. However, designing and evaluating large-scale on-chip networks is a nontrivial task given the poor scalability of software simulation for thousands of cores and the intense effort required to develop hardware RTL. In this paper, we describe OpenSoC Fabric. OpenSoC Fabric is a comprehensive on-chip network generator written in Chisel. Chisel generates both C++ and Verilog models from a single code base and has a development effort comparable to functional programming. We describe the internal architecture of OpenSoC Fabric and its powerful list of configuration parameters. We then compare OpenSoC Fabric against pre-validated state-of-the-art simulators using both the generated C++ models and the generated Verilog models running on FPGAs.

Collaboration


Dive into David Donofrio's collaborations.

Top Co-Authors

John Shalf, Lawrence Berkeley National Laboratory
Leonid Oliker, Lawrence Berkeley National Laboratory
George Michelogiannakis, Lawrence Berkeley National Laboratory
Farzad Fatollahi-Fard, Lawrence Berkeley National Laboratory
Mahmut T. Kandemir, Pennsylvania State University
Michael F. Wehner, Lawrence Berkeley National Laboratory
Prabhat, Lawrence Berkeley National Laboratory