Jeffrey A. Daily
Pacific Northwest National Laboratory
Publication
Featured research published by Jeffrey A. Daily.
2013 Workshop on Modeling and Simulation of Cyber-Physical Energy Systems (MSCPES) | 2013
Jason C. Fuller; Selim Ciraci; Jeffrey A. Daily; Andrew R. Fisher; Matthew L. Hauer
New smart grid technologies and concepts, such as dynamic pricing, demand response, dynamic state estimation, and wide area monitoring, protection, and control, are expected to require considerable communication resources. As the cost of retrofit can be high, future power grids will require the integration of high-speed, secure connections with legacy communication systems, while still providing adequate system control and security. The co-simulation of communication and power systems will become more important as the two systems become more interrelated. This paper will discuss ongoing work at Pacific Northwest National Laboratory to create a flexible, high-speed power and communication system co-simulator for smart grid applications. The framework for the software will be described, including architecture considerations for modular, high performance computing and large-scale scalability (serialization, load balancing, partitioning, cross-platform support, etc.). The current simulator supports the ns-3 (telecommunications) and GridLAB-D (distribution systems) simulators. A test case using the co-simulator, utilizing a transactive demand response system created for the Olympic Peninsula and AEP gridSMART demonstrations, requiring two-way communication between distributed and centralized market devices, will be used to demonstrate the value and intended purpose of the co-simulation environment.
BMC Bioinformatics | 2016
Jeffrey A. Daily
Background: Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations.
Results: A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence, a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar’s ‘striped’ approach. Rognes’s SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail’s prefix scan implementation is generally the fastest, faster even than Farrar’s ‘striped’ approach; however, the opal library is faster for single-threaded applications. The software library is designed for 64-bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license.
Conclusions: Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
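For readers unfamiliar with the terms in this abstract, the following is a minimal scalar sketch of Smith-Waterman local alignment scoring and of the GCUPS metric. It is not Parasail's code: it uses no SIMD or striping, it assumes a simple linear gap penalty rather than an affine one, and the function names and scoring values (`smith_waterman_score`, `gcups`, `match`, `mismatch`, `gap`) are hypothetical choices made for illustration.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty, assumed)."""
    prev = [0] * (len(target) + 1)  # previous row of the scoring table
    best = 0
    for q in query:
        curr = [0]  # column 0 is always zero for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # zero floor makes it local
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def gcups(query_len, database_residues, seconds):
    """Billions of table cells updated per second for one search."""
    return query_len * database_residues / seconds / 1e9
```

Each table cell depends on its upper, left, and diagonal neighbours, which is the dependency that vectorized approaches such as the striped layout have to work around.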
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Abhinav Vishnu; Jeffrey A. Daily; Bruce J. Palmer
The Cray Gemini Interconnect has been recently introduced as a next generation network architecture for building multi-petaflop supercomputers. Cray XE6 systems including LANL Cielo, NERSC Hopper, and the proposed NCSA Blue-Waters, as well as the Cray XK6 ORNL Titan leverage the Gemini Interconnect as their primary interconnection network. At the same time, programming models such as the Message Passing Interface (MPI) and Partitioned Global Address Space (PGAS) models such as Unified Parallel C (UPC) and Co-Array Fortran (CAF) have become available on these systems. Global Arrays is a popular PGAS model used in a variety of application domains including hydrodynamics, chemistry and visualization. Global Arrays uses Aggregate Remote Memory Copy Interface (ARMCI) as the communication runtime system for Remote Memory Access (RMA) communication. This paper presents a design, implementation and performance evaluation of scalable and high performance communication ARMCI on Cray Gemini. The design space is explored and time-space complexities of communication protocols for one-sided communication primitives such as contiguous and uniformly non-contiguous datatypes, atomic memory operations (AMOs) and memory synchronization are presented. An implementation of the proposed design (referred to as ARMCI-Gemini) demonstrates the efficacy on communication primitives, application kernels such as LU decomposition and applications such as Smooth Particle Hydrodynamics (SPH).
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Jeffrey A. Daily; Abhinav Vishnu; Bruce J. Palmer; Hubertus J. J. van Dam; Darren J. Kerbyson
Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on the Remote Memory Access (RMA) communication in PGAS models which typically includes get, put, and atomic memory operations. We perform an in-depth exploration of design alternatives based on MPI. These alternatives include using a semantically-matching interface such as MPI-RMA, as well as not-so-intuitive interfaces such as MPI two-sided with a combination of multi-threading and dynamic process management. With an in-depth exploration of these alternatives and their shortcomings, we propose a novel design which is facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement the asynchronous progress ranks approach and other approaches within the Communication Runtime for Exascale which is a communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. The utility of our proposed PR-based approach is demonstrated by a 2.17x speedup on 1008 processors over the other MPI-based designs.
Modeling, Analysis, and Simulation on Computer and Telecommunication Systems | 2014
Selim Ciraci; Jeffrey A. Daily; Khushbu Agarwal; Jason C. Fuller; Laurentiu D. Marinovici; Andrew R. Fisher
The ongoing modernization of power grids consists of integrating them with communication networks in order to achieve robust and resilient control of grid operations. To understand the operation of the new smart grid, one approach is to use simulation software. Unfortunately, current power grid simulators at best utilize inadequate approximations to simulate communication networks, if at all. Cooperative simulation of specialized power grid and communication network simulators promises to more accurately reproduce the interactions of real smart grid deployments. However, co-simulation is a challenging problem. A co-simulation must manage the exchange of information, including the synchronization of simulator clocks, between all simulators while maintaining adequate computational performance. This paper describes two new conservative algorithms for reducing the overhead of time synchronization, namely Active Set Conservative and Reactive Conservative. We provide a detailed analysis of their performance characteristics with respect to the current state of the art including both conservative and optimistic synchronization algorithms. In addition, we provide guidelines for selecting the appropriate synchronization algorithm based on the requirements of the co-simulation. The newly proposed algorithms are shown to achieve as much as 14% and 63% improvement in performance, respectively, over the existing conservative algorithm.
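As background for the synchronization problem described here, the sketch below shows a generic baseline conservative scheme: in every round, no simulator may advance past the earliest pending event of any simulator. It is not the Active Set Conservative or Reactive Conservative algorithm from the paper, and `ToySimulator` and `run_conservative` are hypothetical names invented for this illustration.

```python
class ToySimulator:
    """Stand-in for a power or network simulator with a sorted event queue."""

    def __init__(self, name, event_times):
        self.name = name
        self.pending = sorted(event_times)
        self.processed = []

    def next_event_time(self):
        # Infinity signals that this simulator has nothing left to do.
        return self.pending[0] if self.pending else float("inf")

    def advance_to(self, granted):
        # Process every event at or before the granted time, never beyond it.
        while self.pending and self.pending[0] <= granted:
            self.processed.append(self.pending.pop(0))


def run_conservative(simulators):
    """Advance all simulators in lock step; return the granted times."""
    grants = []
    while True:
        # Global minimum over all simulators: one synchronization per round.
        granted = min(s.next_event_time() for s in simulators)
        if granted == float("inf"):
            break
        grants.append(granted)
        for s in simulators:
            s.advance_to(granted)
    return grants
```

In this baseline every simulator takes part in every round, even when it has no event near the granted time, which illustrates the kind of synchronization overhead the abstract refers to.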
Computing Frontiers | 2011
Nawab Ali; Sriram Krishnamoorthy; Mahantesh Halappanavar; Jeffrey A. Daily
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
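The parity idea underlying this line of work can be illustrated with the classic single-checksum case: an extra row stores column-wise sums, so any one lost data row can be rebuilt from the survivors. This is a minimal sketch of that textbook idea only. It does not show the paper's generalized Cartesian distributions, correlated-failure handling, or graph-matching placement, and the function names are hypothetical.

```python
def add_checksum_row(matrix):
    """Return a copy of the matrix with a column-sum parity row appended."""
    checksum = [sum(col) for col in zip(*matrix)]
    return [row[:] for row in matrix] + [checksum]


def recover_row(encoded, lost):
    """Rebuild data row `lost` from the parity row and the surviving rows."""
    checksum = encoded[-1]
    survivors = [row for i, row in enumerate(encoded[:-1]) if i != lost]
    # Lost entry = parity minus the sum of the surviving entries per column.
    return [
        checksum[j] - sum(row[j] for row in survivors)
        for j in range(len(checksum))
    ]
```

Integer data is used so that recovery is exact; with floating-point data the reconstructed row would be subject to rounding error.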
International Journal of Parallel Programming | 2013
Nawab Ali; Sriram Krishnamoorthy; Mahantesh Halappanavar; Jeffrey A. Daily
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Jeffrey A. Daily; Sriram Krishnamoorthy; Anantharaman Kalyanaraman
The field of bioinformatics and computational biology is experiencing a data revolution: experimental techniques to procure data have increased in throughput, improved in accuracy, and reduced in cost. This has spurred an array of high-profile sequencing and data generation projects. While the data repositories represent untapped reservoirs of rich information critical for scientific breakthroughs, the analytical software tools that are needed to analyze large volumes of such sequence data have significantly lagged behind in their capacity to scale. In this paper, we address homology detection, which is a fundamental problem in large-scale sequence analysis with numerous applications. We present a scalable framework to conduct large-scale optimal homology detection on massively parallel supercomputing platforms. Our approach employs distributed memory work stealing to effectively parallelize optimal pairwise alignment computation tasks. Results on 120,000 cores of the Hopper Cray XE6 supercomputer demonstrate strong scaling and up to 2.42 × 10^7 optimal pairwise sequence alignments computed per second (PSAPS), the highest reported in the literature.
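Work stealing, mentioned in this abstract, can be pictured with a small single-process toy: each worker runs tasks from the head of its own queue, and an idle worker takes a task from the tail of another worker's queue. This sketch is only an illustration of the general idea, not the paper's distributed-memory implementation. The round-based loop and the "steal from the most loaded worker" policy are assumptions made here for determinism.

```python
from collections import deque


def run_work_stealing(queues):
    """Simulate rounds in which each worker runs one task, stealing if idle."""
    queues = [deque(q) for q in queues]
    done = [[] for _ in queues]  # tasks completed by each worker, in order
    while any(queues):
        for worker, own in enumerate(queues):
            if not own:
                # Assumed policy: pick the currently most loaded worker.
                victim = max(range(len(queues)), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue  # nothing left anywhere to steal
                own.append(queues[victim].pop())  # steal from the tail
            done[worker].append(own.popleft())    # run from the head
    return done
```

Starting from `[[1, 2, 3, 4], []]`, the second worker begins with no tasks and ends up completing half of them by stealing.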
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015
Nathan R. Tallent; Abhinav Vishnu; Hubertus J. J. van Dam; Jeffrey A. Daily; Darren J. Kerbyson; Adolfy Hoisie
Two trends suggest network contention for one-sided messages is poised to become a performance problem that concerns application developers: an increased interest in one-sided programming models and a rising ratio of hardware threads to network injection bandwidth. Often it is difficult to reason about when one-sided tasks decrease or increase network contention. We present effective and portable techniques for diagnosing the causes and severity of one-sided message contention. To detect that a message is affected by contention, we maintain statistics representing instantaneous network resource demand. Using lightweight measurement and modeling, we identify the portion of a message's latency that is due to contention and whether contention occurs at the initiator or target. We attribute these metrics to program statements in their full static and dynamic context. We characterize contention for an important computational chemistry benchmark on InfiniBand, Cray Aries, and IBM Blue Gene/Q interconnects. We pinpoint the sources of contention, estimate their severity, and show that when message delivery time deviates from an ideal model, there are other messages contending for the same network links. With a small change to the benchmark, we reduce contention by 50% and improve total runtime by 20%.
IEEE Congress on Services | 2008
Jared M. Chase; Karen L. Schuchardt; George Chin; Jeffrey A. Daily; Timothy D. Scheibe
Numerical simulators are frequently used to assess future risks, support remediation and monitoring program decisions, and assist in design of specific remedial actions with respect to groundwater contaminants. Due to the complexity of the subsurface environment and uncertainty in the models, many alternative simulations must be performed, each producing data that is typically postprocessed and analyzed before deciding on the next set of simulations. Though parts of the process are readily amenable to automation through scientific workflow tools, the larger "research workflow" is not supported by current tools. We present a detailed use case for subsurface modeling, describe the use case in terms of workflow structure, briefly summarize a prototype that seeks to facilitate the overall modeling process, and discuss the many challenges for building such a comprehensive environment.