Network


Kei Davis's latest external collaborations at the country level.

Hotspot


Research topics in which Kei Davis is active.

Publication


Featured research published by Kei Davis.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Entering the petaflop era: the architecture and performance of Roadrunner

Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho

Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture---the Cell BE---and on multi-core processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.


Conference on High Performance Computing (Supercomputing) | 2005

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

Roberto Gioiosa; José Carlos Sancho; Song Jiang; Fabrizio Petrini; Kei Davis

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5µs; and it supports incremental and full checkpoints with minimal overhead (less than 6% with full checkpointing to disk performed as frequently as once per minute).
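
To make the incremental idea concrete, the following user-space C sketch (not TICK itself, which works transparently inside the kernel) write-protects a region of memory, lets the write-fault handler record which pages are dirtied, and saves only those pages at the next checkpoint. The region size, file name, and handler strategy are illustrative assumptions, and the sketch relies on Linux permitting mprotect() from a signal handler.

```c
/* User-space sketch of incremental checkpointing via dirty-page tracking.
 * Illustrative only; TICK performs this inside the kernel. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 64
static long pagesz;
static unsigned char *region;     /* memory being checkpointed         */
static int dirty[NPAGES];         /* pages written since the last save */

static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)ctx;
    uintptr_t a = (uintptr_t)si->si_addr, base = (uintptr_t)region;
    if (a < base || a >= base + (uintptr_t)(NPAGES * pagesz)) {
        signal(sig, SIG_DFL);     /* a genuine fault: let it crash */
        return;
    }
    size_t page = (a - base) / (size_t)pagesz;
    dirty[page] = 1;                                        /* remember it    */
    mprotect(region + page * pagesz, (size_t)pagesz,
             PROT_READ | PROT_WRITE);                       /* retry succeeds */
}

static void checkpoint(FILE *img)
{
    for (size_t i = 0; i < NPAGES; i++) {
        if (!dirty[i]) continue;                  /* incremental: skip clean pages */
        fseek(img, (long)(i * pagesz), SEEK_SET);
        fwrite(region + i * pagesz, 1, (size_t)pagesz, img);
        dirty[i] = 0;
    }
    fflush(img);
    mprotect(region, NPAGES * pagesz, PROT_READ); /* catch future writes again */
}

int main(void)
{
    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    FILE *img = fopen("ckpt.img", "w+");
    region[0] = 1;                 /* dirties page 0 via the fault handler */
    region[5 * pagesz + 7] = 2;    /* dirties page 5 */
    checkpoint(img);               /* writes only pages 0 and 5 */
    fclose(img);
    return 0;
}
```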


Conference on High Performance Computing (Supercomputing) | 2006

Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

José Carlos Sancho; Kevin J. Barker; Darren J. Kerbyson; Kei Davis

The design and implementation of a high performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting maximum application performance given limited network resources is overlapping computation with communication, which partially or entirely hides communication delays. While this approach is not new, there are few studies that quantify the potential benefit of such overlapping for large-scale production scientific codes. We address this with an empirical method, combined with a network model, to quantify the potential overlap in several codes and examine the possible performance benefit. Our results demonstrate, for the codes examined, that a high potential tolerance to network latency and bandwidth exists because of a high degree of potential overlap. Moreover, our results indicate that there is often no need to use fine-grained communication mechanisms to achieve this benefit, since the major source of potential overlap is independent work: computation that does not depend on pending messages. This allows for a potentially significant relaxation of network requirements without a consequent degradation of application performance.
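
The overlap pattern being quantified can be illustrated with a short C/MPI sketch (hypothetical, not taken from the instrumented codes): nonblocking sends and receives are posted first, independent work proceeds while the messages are in flight, and the wait is deferred until dependent work actually needs the incoming data.

```c
/* Minimal illustration of communication/computation overlap in MPI.
 * The "independent work" loop stands in for computation that does not
 * depend on pending messages; the ring exchange and buffer size are
 * arbitrary choices for the example. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(N * sizeof *sendbuf);
    double *recvbuf = malloc(N * sizeof *recvbuf);
    double *local   = malloc(N * sizeof *local);
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; local[i] = i; }

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Request req[2];

    /* 1. Post the exchange without blocking. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* 2. Independent work: touches only local data, so it can proceed
     *    while the messages are in flight. */
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += local[i] * local[i];

    /* 3. Wait only when dependent work actually needs the data. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    for (int i = 0; i < N; i++) sum += recvbuf[i];

    printf("rank %d: %f\n", rank, sum);
    free(sendbuf); free(recvbuf); free(local);
    MPI_Finalize();
    return 0;
}
```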


IEEE Computer | 2009

Using Performance Modeling to Design Large-Scale Systems

Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho

A methodology for accurately modeling large applications explores the performance of ultrascale systems at different stages in their life cycle, from early design through production use.
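
As a rough illustration only (the paper's models are application-specific and validated against measurement), analytic models of this kind typically express predicted runtime as a scaled compute term plus a latency-and-bandwidth communication term. The C sketch below uses entirely assumed parameter values.

```c
/* Toy analytic performance model:
 * predicted runtime = serial compute time / ranks
 *                   + messages * (latency + bytes / bandwidth).
 * All parameter values are placeholders, not measurements. */
#include <stdio.h>

struct machine { double latency_s, bandwidth_Bps; };
struct app     { double serial_compute_s, msgs_per_rank, bytes_per_msg; };

static double predict_runtime(struct machine m, struct app a, int nranks)
{
    double t_comp = a.serial_compute_s / nranks;                 /* ideal scaling */
    double t_comm = a.msgs_per_rank
                    * (m.latency_s + a.bytes_per_msg / m.bandwidth_Bps);
    return t_comp + t_comm;
}

int main(void)
{
    struct machine m = { 2e-6, 2e9 };            /* 2 us latency, 2 GB/s: assumed */
    struct app     a = { 1200.0, 200, 1 << 20 }; /* assumed application profile   */
    for (int p = 64; p <= 65536; p *= 4)
        printf("%6d ranks: %.2f s predicted\n", p, predict_runtime(m, a, p));
    return 0;
}
```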


International Parallel and Distributed Processing Symposium | 2005

Current practice and a direction forward in checkpoint/restart implementations for fault tolerance

José Carlos Sancho; Fabrizio Petrini; Kei Davis; Roberto Gioiosa; Song Jiang

Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user-level or system-level. User-level implementations are relatively easy to implement and portable, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault tolerant - and ultimately autonomic - large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.


Conference on High Performance Computing (Supercomputing) | 2004

A Performance and Scalability Analysis of the BlueGene/L Architecture

Kei Davis; Adolfy Hoisie; Greg Johnson; Darren J. Kerbyson; Michael Lang; Scott Pakin; Fabrizio Petrini

Based on a set of measurements made on the 512-node 500 MHz prototype and early results on a 2048-node 700 MHz BlueGene/L machine at IBM Watson, we present a performance and scalability analysis of the architecture, from low-level characteristics to large-scale applications. In addition, we present predictions using our models for the performance of two representative applications from the ASC workload on the full BlueGene/L configuration of 64K nodes. We have compared the measured values for several of the benchmarks in our suite against the numbers predicted by our performance models; in general, the prediction errors were relatively low. A comparison between the performance of BlueGene/L and that of ASCI Q, the largest supercomputer in the US, is also presented, based on our predictive performance models.


International Symposium on Signal Processing and Information Technology | 2004

Analysis of system overhead on parallel computers

Roberto Gioiosa; Fabrizio Petrini; Kei Davis; Fabien Lebaillif-Delamare

Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, typically comprising commodity compute nodes, ranging in size up to thousands of processors, with each node hosting an instance of the operating system (OS). Recent studies [E. Hendriks (2002), F. Petrini et al. (2003)] have shown that even minimal intrusion by the OS on user applications, e.g. a slowdown of user processes of less than 1.0% on each OS instance, can result in a dramatic performance degradation (50% or more) when the user applications are executed on thousands of processors. The contribution of this paper is the explication and demonstration, by way of a case study, of a methodology for analyzing and evaluating the impact of system activity (all software and hardware other than user applications) on application performance. Our methodology has three major components: 1) a set of simple benchmarks to quickly measure and identify the impact of intrusive system events; 2) a kernel-level profiling tool, Oprofile, to characterize all relevant events and their sources; and 3) a kernel module that provides timing information for in-depth modeling of the frequency and duration of each relevant event and determines which sources have the greatest impact on performance (and are therefore the most important to eliminate). The paper provides a collection of experimental results obtained on a state-of-the-art dual AMD Opteron cluster running GNU/Linux 2.6.5. While our work has been performed on this specific OS, we argue that our contribution readily generalizes to other open source and commercial operating systems.
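
The paper's benchmark suite and kernel tooling are not reproduced here; as a hedged sketch of the first component only, a fixed-work microbenchmark of the kind below times the same small computation many times, and iterations that take much longer than the minimum point to interference from system activity. All sizes and thresholds are arbitrary choices for illustration.

```c
/* Fixed-work noise probe: time a constant amount of computation repeatedly;
 * iterations far above the minimum indicate OS/system interference. */
#include <stdio.h>
#include <time.h>

#define ITERS 100000
#define WORK  20000

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static double samples[ITERS];
    volatile double x = 1.0;               /* volatile: keep the work loop alive */

    for (int i = 0; i < ITERS; i++) {
        double t0 = now_s();
        for (int j = 0; j < WORK; j++) x = x * 1.0000001 + 1e-9;  /* fixed work */
        samples[i] = now_s() - t0;
    }

    double min = samples[0], max = samples[0], sum = 0.0;
    for (int i = 0; i < ITERS; i++) {
        if (samples[i] < min) min = samples[i];
        if (samples[i] > max) max = samples[i];
        sum += samples[i];
    }
    long noisy = 0;
    for (int i = 0; i < ITERS; i++)
        if (samples[i] > 2.0 * min) noisy++;   /* crude interference detector */

    printf("min %.3g s  mean %.3g s  max %.3g s  noisy iterations: %ld\n",
           min, sum / ITERS, max, noisy);
    return 0;
}
```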


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination

Xuechen Zhang; Kei Davis; Song Jiang

A cluster of data servers and a parallel file system are often used to provide high-throughput I/O service to parallel programs running on a compute cluster. To exploit I/O parallelism, parallel file systems stripe file data across the data servers. While this practice is effective in serving asynchronous requests, it may break individual programs' spatial locality, which can seriously degrade I/O performance when the data servers concurrently serve synchronous requests from multiple I/O-intensive programs. In this paper we propose a scheme, IOrchestrator, to improve the I/O performance of multi-node storage systems by orchestrating I/O services among programs when such inter-data-server coordination is dynamically determined to be cost effective. We have implemented IOrchestrator in the PVFS2 parallel file system. Our experiments with representative parallel benchmarks show that IOrchestrator can significantly improve the I/O performance (by up to a factor of 2.5) delivered by a cluster of data servers servicing concurrently running parallel programs. Notably, we have not observed any scenarios in which the use of IOrchestrator causes substantial performance degradation.


International Parallel and Distributed Processing Symposium | 2012

iTransformer: Using SSD to Improve Disk Scheduling for High-performance I/O

Xuechen Zhang; Kei Davis; Song Jiang

The parallel data accesses inherent to large-scale data-intensive scientific computing require that data servers handle very high I/O concurrency. Concurrent requests from different processes or programs to a hard disk can cause disk-head thrashing between different disk regions, resulting in unacceptably low I/O performance. Current storage systems either rely on the disk scheduler at each data server, or use SSDs as storage, to minimize this negative performance effect. However, the ability of the scheduler to alleviate this problem by scheduling requests in memory is limited by concerns such as long disk access times and the potential loss of dirty data on system failure. Meanwhile, SSDs are too expensive to be widely used as the major storage device in the HPC environment. We propose iTransformer, a scheme that employs a small SSD to schedule requests for the data on disk. Being less space-constrained than more expensive DRAM, the SSD allows iTransformer to buffer larger amounts of dirty data before writing them back to disk, or to prefetch a larger volume of data in a batch into the SSD. In both cases high disk efficiency can be maintained even for concurrent requests. Furthermore, the scheme allows requests to be scheduled in the background, hiding the cost of random disk access behind the serving of process requests. Finally, because the SSD is non-volatile, concerns about the quantity of buffered dirty data are obviated. We have implemented iTransformer in the Linux kernel and tested it on a large cluster running PVFS2. Our experiments show that iTransformer can improve the I/O throughput of the cluster by 35% on average for MPI-IO benchmarks with various data access patterns.


International Parallel and Distributed Processing Symposium | 2009

Making resonance a common case: A high-performance implementation of collective I/O on parallel file systems

Xuechen Zhang; Song Jiang; Kei Davis

Collective I/O is a widely used technique to improve I/O performance in parallel computing. It can be implemented as a client-based or as a server-based scheme. The client-based implementation is more widely adopted in MPI-IO software such as ROMIO because of its independence from the storage system configuration and its greater portability. However, existing implementations of client-side collective I/O do not consider the actual pattern of file striping over multiple I/O nodes in the storage system. This can cause a large number of requests for non-sequential data at the I/O nodes, substantially degrading I/O performance. Investigating the surprisingly high I/O throughput achieved when there is an accidental match between a particular request pattern and the data striping pattern on the I/O nodes, we reveal the resonance phenomenon as the cause. Exploiting readily available information on data striping from the metadata server in popular file systems such as PVFS2 and Lustre, we design a new collective I/O implementation technique, named resonant I/O, that makes resonance a common case. Resonant I/O rearranges requests from multiple MPI processes according to the presumed data layout on the disks of the I/O nodes so that non-sequential access of disk data can be turned into sequential access, significantly improving I/O performance without compromising the independence of a client-based implementation. We have implemented our design in ROMIO. Our experimental results on small- and medium-scale clusters show that the scheme can increase the I/O throughput of commonly used parallel I/O benchmarks such as mpi-io-test and ior-mpi-io by up to 157% over the existing implementation in ROMIO, with no scenario exhibiting significantly decreased performance.
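
The ROMIO implementation is not reproduced here, but the core layout arithmetic can be sketched in C: with round-robin striping, the server holding a file offset is (offset / stripe_size) mod num_servers, so grouping requests by server and sorting each group by offset turns scattered accesses into per-server sequential runs. The striping parameters and request list below are assumptions for illustration.

```c
/* Sketch of layout-aware request reordering for striped files:
 * bucket requests by destination server, then sort each bucket by offset
 * so each server sees a sequential access stream. */
#include <stdio.h>
#include <stdlib.h>

#define STRIPE_SIZE  (64 * 1024)   /* assumed stripe unit  */
#define NUM_SERVERS  4             /* assumed server count */

struct request { long offset, length; };

static int server_of(long offset)
{
    return (int)((offset / STRIPE_SIZE) % NUM_SERVERS);
}

static int by_offset(const void *a, const void *b)
{
    long d = ((const struct request *)a)->offset
           - ((const struct request *)b)->offset;
    return (d > 0) - (d < 0);
}

int main(void)
{
    /* Requests from several processes, interleaved in arrival order. */
    struct request reqs[] = {
        { 0 * STRIPE_SIZE, STRIPE_SIZE }, { 5 * STRIPE_SIZE, STRIPE_SIZE },
        { 2 * STRIPE_SIZE, STRIPE_SIZE }, { 7 * STRIPE_SIZE, STRIPE_SIZE },
        { 4 * STRIPE_SIZE, STRIPE_SIZE }, { 1 * STRIPE_SIZE, STRIPE_SIZE },
    };
    int n = (int)(sizeof reqs / sizeof reqs[0]);

    struct request bucket[NUM_SERVERS][8];
    int count[NUM_SERVERS] = {0};
    for (int i = 0; i < n; i++) {
        int s = server_of(reqs[i].offset);          /* which server holds it */
        bucket[s][count[s]++] = reqs[i];
    }
    for (int s = 0; s < NUM_SERVERS; s++) {
        qsort(bucket[s], count[s], sizeof(struct request), by_offset);
        printf("server %d:", s);
        for (int i = 0; i < count[s]; i++)
            printf(" [%ld,+%ld)", bucket[s][i].offset, bucket[s][i].length);
        printf("\n");
    }
    return 0;
}
```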

Collaboration


Dive into Kei Davis's collaborations.

Top Co-Authors

Song Jiang (Wayne State University)
Darren J. Kerbyson (Pacific Northwest National Laboratory)
José Carlos Sancho (Los Alamos National Laboratory)
Kevin J. Barker (Los Alamos National Laboratory)
Federico Bassetti (Los Alamos National Laboratory)
Daniel J. Quinlan (Lawrence Livermore National Laboratory)
Adolfy Hoisie (Pacific Northwest National Laboratory)
Scott Pakin (Los Alamos National Laboratory)