Publication


Featured research published by Andy Yoo.


Job Scheduling Strategies for Parallel Processing | 2003

SLURM: Simple Linux Utility for Resource Management

Andy Yoo; Morris A. Jette; Mark Grondona

This paper describes a new cluster resource management system called Simple Linux Utility for Resource Management (SLURM). SLURM, initially developed for large Linux clusters at Lawrence Livermore National Laboratory (LLNL), is a simple cluster manager that can scale to thousands of processors. SLURM is designed to be flexible and fault-tolerant and can be ported to clusters of different sizes and architectures with minimal effort. We expect SLURM to benefit both users and system architects by providing a simple, robust, and highly scalable parallel job execution environment for their cluster systems.
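
As a rough illustration of what a cluster resource manager does at its core, the toy Python scheduler below tracks free nodes, queues jobs, and launches each job once its request fits. This is a minimal sketch of the general idea only; the class, the strict-FIFO policy, and the sizes are invented here and do not reflect SLURM's actual design.

    from collections import deque

    class ToyScheduler:
        def __init__(self, total_nodes):
            self.free_nodes = total_nodes
            self.queue = deque()          # pending (job_id, nodes_wanted)
            self.running = {}             # job_id -> nodes allocated

        def submit(self, job_id, nodes_wanted):
            self.queue.append((job_id, nodes_wanted))
            self.schedule()

        def schedule(self):
            # Strict FIFO: launch jobs while the head of the queue fits.
            while self.queue and self.queue[0][1] <= self.free_nodes:
                job_id, n = self.queue.popleft()
                self.free_nodes -= n
                self.running[job_id] = n
                print(f"launch {job_id} on {n} nodes")

        def complete(self, job_id):
            self.free_nodes += self.running.pop(job_id)
            self.schedule()               # queued jobs may now fit

    sched = ToyScheduler(total_nodes=16)
    sched.submit("a", 8)      # launches immediately
    sched.submit("b", 12)     # waits: only 8 nodes free
    sched.complete("a")       # frees 8 nodes; "b" launches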


Conference on High Performance Computing (Supercomputing) | 2005

A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

Andy Yoo; Edmond Chow; Keith Henderson; Will McLendon; Bruce Hendrickson

Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
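
The following single-process Python sketch (our illustration, not the paper's MPI implementation) mimics the 2D edge partitioning: edges are bucketed into blocks of a pr x pc grid by source and destination stripe, and a level-synchronous BFS expands the frontier one grid row of blocks at a time, which is what caps communication in the real distributed setting. The function names and the tiny example graph are made up.

    def build_blocks(edges, n, pr, pc):
        br, bc = n // pr, n // pc          # stripe sizes (assume divisible)
        blocks = [[{} for _ in range(pc)] for _ in range(pr)]
        for u, v in edges:
            blocks[u // br][v // bc].setdefault(u, []).append(v)
        return blocks

    def bfs_2d(blocks, n, pr, pc, root):
        br = n // pr
        dist = [-1] * n
        dist[root] = 0
        frontier, level = {root}, 0
        while frontier:
            nxt = set()
            for u in frontier:
                for j in range(pc):        # sweep grid row u // br only
                    for v in blocks[u // br][j].get(u, ()):
                        if dist[v] == -1:
                            dist[v] = level + 1
                            nxt.add(v)
            frontier, level = nxt, level + 1
        return dist

    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
    blocks = build_blocks(edges, n=4, pr=2, pc=2)
    print(bfs_2d(blocks, n=4, pr=2, pc=2, root=0))   # [0, 1, 1, 2]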


Conference on High Performance Computing (Supercomputing) | 2002

An Empirical Performance Evaluation of Scalable Scientific Applications

Jeffrey S. Vetter; Andy Yoo

We investigate the scalability, architectural requirements, and performance characteristics of eight scalable scientific applications. Our analysis is driven by empirical measurements using statistical and tracing instrumentation for both communication and computation. Based on these measurements, we refine our analysis into precise explanations of the factors that influence performance and scalability for each application; we distill these factors into common traits and overall recommendations for both users and designers of scalable platforms. Our experiments demonstrate that some traits, such as improvements in the scaling and performance of MPI's collective operations, will benefit most applications. We also find specific characteristics of some applications that limit performance. For example, one application's intensive use of a 64-bit floating-point divide instruction, which has high latency and is not pipelined on the POWER3, limits the performance of that application's primary computation.
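
The POWER3 divide result cannot be reproduced here, but the Python snippet below shows the flavor of such an empirical measurement: timing an array divide against the algebraically equivalent multiply by a reciprocal. In interpreted Python the hardware gap is largely masked by interpreter overhead, so treat this only as a template for the kind of comparison the paper performs on compiled scientific codes.

    import timeit

    setup = "xs = [float(i) for i in range(1, 100_001)]"
    div = timeit.timeit("[x / 3.0 for x in xs]", setup=setup, number=50)
    mul = timeit.timeit("[x * (1.0 / 3.0) for x in xs]", setup=setup, number=50)
    print(f"divide: {div:.3f}s  multiply: {mul:.3f}s  ratio: {div / mul:.2f}")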


IEEE Computer | 2008

Hardware Technologies for High-Performance Data-Intensive Computing

Maya Gokhale; Jonathan D. Cohen; Andy Yoo; William Marcus Miller; Arpith C. Jacob; Craig D. Ulmer; Roger A. Pearce

Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Experiments with three benchmarks suggest that emerging hardware technologies can significantly boost performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.


Conference on High Performance Computing (Supercomputing) | 2004

Coscheduling in Clusters: Is It a Viable Alternative?

Gyu Sang Choi; Jin-Ha Kim; Deniz Ersoz; Andy Yoo; Chita R. Das

In this paper, we conduct an in-depth evaluation of a broad spectrum of scheduling alternatives for clusters: the widely used batch scheduling, local scheduling, gang scheduling, all prior communication-driven coscheduling algorithms (Dynamic Coscheduling (DCS), Spin Block (SB), Periodic Boost (PB), and Coordinated Coscheduling (CC)), and a newly proposed HYBRID coscheduling algorithm, on a 16-node, Myrinet-connected Linux cluster. Performance and energy measurements using several NAS, LLNL, and ANL benchmarks on the Linux cluster provide several interesting conclusions. First, although batch scheduling is currently used in most clusters, all blocking-based coscheduling techniques, such as SB, CC, and HYBRID, as well as gang scheduling, can provide much better performance even on a dedicated cluster platform. Second, in contrast to some prior studies, we observe that blocking-based schemes like SB and HYBRID can provide better performance than spin-based techniques like PB on a Linux platform. Third, the proposed HYBRID scheduling provides the best performance-energy behavior and can be implemented on any cluster with little effort. All these results suggest that blocking-based coscheduling techniques are viable candidates for use in clusters, offering significant performance-energy benefits.
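
For intuition, here is a minimal Python sketch of the spin-block (SB) receive heuristic the paper evaluates: poll for an incoming message for a bounded interval, hoping the sender is coscheduled and the reply is imminent, then give up the CPU and block. The threshold value and the queue standing in for the Myrinet messaging layer are our own illustrative choices.

    import queue
    import time

    def spin_block_recv(q, spin_s=0.0005):
        # Spin phase: poll without releasing the CPU for up to spin_s.
        deadline = time.perf_counter() + spin_s
        while time.perf_counter() < deadline:
            try:
                return q.get_nowait()
            except queue.Empty:
                pass
        # Block phase: yield the processor until a message arrives.
        return q.get()

    q = queue.Queue()
    q.put("reply")
    print(spin_block_recv(q))   # arrives during the spin phase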


Symposium on Code Generation and Optimization | 2003

METRIC: Tracking Down Inefficiencies in the Memory Hierarchy via Binary Rewriting

Jaydeep Marathe; Frank Mueller; Tushar Mohan; Bronis R. de Supinski; Sally A. McKee; Andy Yoo

We present METRIC, an environment for determining memory inefficiencies by examining data traces. METRIC is designed to alter the performance behavior of applications that are mostly constrained by their latency to resolve memory references. We make four primary contributions. First, we present methods to extract partial data traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial data traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial data traces. By exploiting summarized memory metrics, by-reference metrics, and cache evictor information, we can pinpoint the sources of performance problems. Fourth, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several experiments, resulting in up to 40% lower miss ratios.
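
The constant-space representation for regular references can be pictured with a small sketch: fold a strided address stream into (base, stride, count) triples instead of logging every reference. Real METRIC handles nested loop structures and irregular accesses; this Python fragment handles only a single flat strided run and is our illustration, not the tool's code.

    def compress(addresses):
        runs = []                            # [base, stride, count] triples
        for a in addresses:
            if runs:
                base, stride, count = runs[-1]
                if count == 1:               # second element fixes the stride
                    runs[-1] = [base, a - base, 2]
                    continue
                if a == base + stride * count:
                    runs[-1][2] = count + 1  # address extends the run
                    continue
            runs.append([a, 0, 1])           # irregular jump: start a new run
        return runs

    # A stride-8 sweep of five addresses collapses to a single triple.
    print(compress([0x1000 + 8 * i for i in range(5)]))   # [[4096, 8, 5]]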


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

A scalable eigensolver for large scale-free graphs using 2D graph partitioning

Andy Yoo; Allison H. Baker; Roger A. Pearce; Van Emden Henson

Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead in the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement over the original on a state-of-the-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, which is, to the best of our knowledge, the largest scale-free graph analyzed by any in-memory parallel eigensolver.
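
To see where the MatVec fits, consider power iteration, the simplest eigensolver: each step needs only y = Ax, so any speedup of that product under a 2D partition carries over to the whole solve. The sketch below performs the MatVec block by block over a 2-by-2 partition as a stand-in for the paper's 2D distribution; the matrix, the sizes, and the use of plain power iteration (rather than the library solver the paper extends) are illustrative assumptions.

    import numpy as np

    def matvec_2d(blocks, x, bs):
        # y_i += A_ij @ x_j for every block (i, j); the per-block-row
        # partial sums mirror the row/column reductions of a genuinely
        # distributed 2D MatVec.
        y = np.zeros(len(blocks) * bs)
        for i, row in enumerate(blocks):
            for j, A_ij in enumerate(row):
                y[i * bs:(i + 1) * bs] += A_ij @ x[j * bs:(j + 1) * bs]
        return y

    def power_iteration(blocks, bs, iters=200):
        n = len(blocks) * bs
        x = np.ones(n) / np.sqrt(n)
        for _ in range(iters):
            y = matvec_2d(blocks, x, bs)
            x = y / np.linalg.norm(y)
        return x @ matvec_2d(blocks, x, bs)   # Rayleigh quotient

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    blocks = [[A[:2, :2], A[:2, 2:]],
              [A[2:, :2], A[2:, 2:]]]
    print(round(power_iteration(blocks, bs=2), 4))   # dominant eigenvalue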


Conference on High Performance Computing (Supercomputing) | 2003

Identifying and Exploiting Spatial Regularity in Data Memory References

Tushar Mohan; Bronis R. de Supinski; Sally A. McKee; Frank Mueller; Andy Yoo; Martin Schulz

The growing processor/memory performance gap causes the performance of many codes to be limited by memory accesses. If known to exist in an application, strided memory accesses forming streams can be targeted by optimizations such as prefetching, relocation, remapping, and vector loads. Undetected, they can be a significant source of memory stalls in loops. Existing stream-detection mechanisms either require special hardware, which may not gather statistics for subsequent analysis, or are limited to compile-time detection of array accesses in loops. Formally, little treatment has been accorded to the subject; the concept of locality fails to capture the existence of streams in a program's memory accesses. The contributions of this paper are as follows. First, we define spatial regularity as a means to discuss the presence and effects of streams. Second, we develop measures to quantify spatial regularity, and we design and implement an on-line, parallel algorithm to detect streams - and hence regularity - in running applications. Third, we use examples from real codes and common benchmarks to illustrate how derived stream statistics can be used to guide the application of profile-driven optimizations. Overall, we demonstrate the benefits of our novel regularity metric as an instrument to detect potential for code optimizations affecting memory performance.
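
The on-line detection idea can be sketched in a few lines: per memory-reference site, remember the last address and the last observed stride, and report a stream once the stride repeats a few times. The confirmation threshold, the site label, and the trace below are invented for illustration; the paper's parallel algorithm is more elaborate.

    from collections import defaultdict

    state = defaultdict(lambda: {"last": None, "stride": None, "hits": 0})

    def observe(site, addr, threshold=3):
        # One update per executed memory reference at `site`.
        s = state[site]
        if s["last"] is not None:
            stride = addr - s["last"]
            if stride == s["stride"]:
                s["hits"] += 1
                if s["hits"] == threshold:   # stride confirmed: a stream
                    print(f"site {site}: stream with stride {stride}")
            else:
                s["stride"], s["hits"] = stride, 0
        s["last"] = addr

    for i in range(6):                       # a stride-4 array walk
        observe("load@0x4008", 0x1000 + 4 * i)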


ACM Transactions on Programming Languages and Systems | 2007

METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies

Jaydeep Marathe; Frank Mueller; Tushar Mohan; Sally A. McKee; Bronis R. de Supinski; Andy Yoo

With the diverging improvements in CPU speeds and memory access latencies, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use this to generate detailed source-code correlated statistics including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
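
A toy version of the offline simulation step, under assumptions of ours (a single direct-mapped level with 8 sets and 64-byte lines, rather than METRIC's incremental multi-level simulation): replay an access trace through the cache model, count misses, and record which reference's line was evicted, which is the raw material for evictor statistics.

    def simulate(trace, sets=8, line=64):
        cache = {}                        # set index -> (tag, owning ref)
        misses, evictors = 0, {}
        for ref, addr in trace:           # (source reference, address)
            idx = (addr // line) % sets
            tag = addr // (line * sets)
            if cache.get(idx, (None, None))[0] != tag:
                misses += 1
                old = cache.get(idx)
                if old is not None:       # record who evicted whom
                    evictors[old[1]] = evictors.get(old[1], 0) + 1
                cache[idx] = (tag, ref)
        return misses, evictors

    # Two sweeps over 16 cache lines thrash an 8-set cache: every access
    # misses, and "a[i]" shows up as its own evictor.
    trace = [("a[i]", 64 * i) for i in range(16)] * 2
    print(simulate(trace))                # (32, {'a[i]': 24})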


IEEE International Conference on High Performance Computing, Data, and Analytics | 1999

A Gang-Scheduling System for ASCI Blue-Pacific

José E. Moreira; Hubertus Franke; Waiman Chan; Liana L. Fong; Morris A. Jette; Andy Yoo

The ASCI Blue-Pacific machines are large parallel systems comprising thousands of processors. We are currently developing and testing a gang-scheduling job control system for these machines that exploits space- and time-sharing in the presence of dedicated communication devices. Our initial experience with this system indicates that, though applications pay a small overhead, overall system performance as measured by average job queue and response times improves significantly. This gang-scheduling system is planned for deployment into production mode during 1999 at Lawrence Livermore National Laboratory.
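
Gang scheduling's space- and time-sharing can be pictured as an Ousterhout-style matrix with one row per time slot and one column per processor; all tasks of a job land in one row so they run, and can communicate, simultaneously. The placement policy, slot count, and jobs in this Python sketch are illustrative assumptions, not the ASCI system's actual policy.

    def gang_schedule(jobs, processors, slots):
        # matrix[t][p] = job occupying processor p during time slot t.
        matrix = [[None] * processors for _ in range(slots)]
        for name, width in jobs:          # width = processors required
            for row in matrix:            # first slot with enough free columns
                free = [c for c, task in enumerate(row) if task is None]
                if len(free) >= width:
                    for c in free[:width]:
                        row[c] = name     # whole gang shares one time slot
                    break
            else:
                print(f"{name}: no slot wide enough")
        return matrix

    for row in gang_schedule([("A", 3), ("B", 2), ("C", 4)],
                             processors=4, slots=2):
        print(row)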

Collaboration


Dive into Andy Yoo's collaborations.

Top Co-Authors

Sally A. McKee, Chalmers University of Technology
Frank Mueller, North Carolina State University
Keith Henderson, Lawrence Livermore National Laboratory
Bronis R. de Supinski, Lawrence Livermore National Laboratory
Morris A. Jette, Lawrence Livermore National Laboratory
Chita R. Das, Pennsylvania State University
Gyu Sang Choi, Pennsylvania State University
Jin-Ha Kim, Pennsylvania State University
Martin Schulz, Lawrence Livermore National Laboratory