
Publications


Featured research published by Laura Carrington.


Conference on High Performance Computing (Supercomputing) | 2002

A Framework for Performance Modeling and Prediction

Allan Snavely; Laura Carrington; Nicole Wolter; Jesús Labarta; Rosa M. Badia; Avi Purkayastha

Cycle-accurate simulation is far too slow for modeling the expected performance of full parallel applications on large HPC systems, and simply running an application on a system and observing wallclock time reveals nothing about why the application performs as it does (and is in any case impossible on yet-to-be-built systems). Here we present a framework for performance modeling and prediction that is faster than cycle-accurate simulation, more informative than simple benchmarking, and shown to be useful for performance investigations in several dimensions.


International Symposium on Performance Analysis of Systems and Software | 2010

PEBIL: Efficient static binary instrumentation for Linux

Michael A. Laurenzano; Mustafa M. Tikir; Laura Carrington; Allan Snavely

Binary instrumentation facilitates the insertion of additional code into an executable in order to observe or modify the executable's behavior. There are two main approaches to binary instrumentation: static and dynamic binary instrumentation. In this paper we present a static binary instrumentation toolkit for Linux on the x86/x86_64 platforms, PEBIL (PMaC's Efficient Binary Instrumentation Toolkit for Linux). PEBIL is similar to other toolkits in terms of how additional code is inserted into the executable. However, it is designed with the primary goal of producing efficient-running instrumented code. To this end, PEBIL uses function-level code relocation in order to insert large but fast control structures. Furthermore, the PEBIL API provides tool developers with the means to insert lightweight hand-coded assembly rather than relying solely on the insertion of instrumentation functions. These features enable the implementation of efficient instrumentation tools with PEBIL. The overhead introduced for basic block counting by PEBIL is on average 65% of the overhead of Dyninst, 41% of the overhead of Pin, 15% of the overhead of DynamoRIO, and 8% of the overhead of Valgrind.
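To illustrate what basic block counting amounts to at runtime, here is a minimal Python sketch (not PEBIL's API, and not actual binary rewriting): a counter is incremented at the entry of each block of a toy program, which is the effect the statically inserted instrumentation code has when the rewritten executable runs. The block names and program are made up for illustration.

```python
# Minimal sketch (not PEBIL's API): what injected basic-block counting
# amounts to at runtime. A real tool rewrites the binary so that each
# block entry increments a counter; here the "program" is a toy control
# flow and the counters live in a plain dictionary.
from collections import defaultdict

block_counts = defaultdict(int)          # one counter per basic block id

def enter_block(block_id):
    """Stand-in for the lightweight counter increment a static
    instrumenter would splice in at the top of each basic block."""
    block_counts[block_id] += 1

def toy_program(n):
    enter_block("entry")
    total = 0
    for i in range(n):
        enter_block("loop_body")
        if i % 2 == 0:
            enter_block("even_branch")
            total += i
    enter_block("exit")
    return total

if __name__ == "__main__":
    toy_program(10)
    for block, count in sorted(block_counts.items()):
        print(f"{block:12s} executed {count} times")
```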


IEEE International Symposium on Workload Characterization | 2001

Modeling application performance by convolving machine signatures with application profiles

Allan Snavely; Nicole Wolter; Laura Carrington

This paper presents a performance modeling methodology that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on a class of High Performance Computing benchmarks. The method yields insight into the factors that affect performance on single-processor and parallel computers.
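The convolution idea can be sketched as follows: predicted runtime is the application's operation counts (its profile, gathered by tracing) weighted by the target machine's measured per-operation costs (its signature, gathered by simple probe benchmarks). The Python sketch below uses made-up numbers and a deliberately simple non-overlapping combination of compute, memory, and communication time; it is an illustration of the idea, not the paper's actual model.

```python
# Illustrative sketch of the convolution idea: predicted runtime is the
# application's operation counts (its profile) weighted by the target
# machine's measured per-operation costs (its signature). All numbers
# below are made up for illustration.

# Application profile: counts of work, gathered by tracing the application.
app_profile = {
    "flops":          4.0e12,   # floating-point operations
    "mem_bytes_l1":   6.0e12,   # bytes served by cache-friendly accesses
    "mem_bytes_main": 8.0e11,   # bytes served from main memory
    "mpi_msgs":       2.0e5,    # point-to-point messages
    "mpi_bytes":      3.0e10,   # total bytes communicated
}

# Machine signature: rates and latencies measured by simple probe benchmarks.
machine_signature = {
    "flops":          2.0e10,   # flop/s per core
    "mem_bytes_l1":   5.0e10,   # B/s from cache
    "mem_bytes_main": 8.0e9,    # B/s from main memory
    "mpi_latency":    2.0e-6,   # s per message
    "mpi_bandwidth":  1.0e9,    # B/s per link
}

def predict_runtime(profile, sig):
    compute = profile["flops"] / sig["flops"]
    memory = (profile["mem_bytes_l1"] / sig["mem_bytes_l1"]
              + profile["mem_bytes_main"] / sig["mem_bytes_main"])
    comm = (profile["mpi_msgs"] * sig["mpi_latency"]
            + profile["mpi_bytes"] / sig["mpi_bandwidth"])
    # Simplest possible combination: assume the phases do not overlap.
    return compute + memory + comm

print(f"predicted runtime: {predict_runtime(app_profile, machine_signature):.1f} s")
```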


Future Generation Computer Systems | 2006

A performance prediction framework for scientific applications

Laura Carrington; Allan Snavely; Nicole Wolter

This work presents the results of ongoing investigations in the development of a performance modeling framework, developed by the Performance Modeling and Characterization (PMaC) Lab at the San Diego Supercomputer Center. The framework is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on benchmarks and scientific applications. This paper focuses on one such capability, using sensitivity studies to further understand the observed and anticipated effects of both the architecture and the application on predicted runtime.
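A sensitivity study of this kind can be sketched by perturbing one machine parameter at a time in a simple model and observing how the predicted runtime responds. The profile, signature values, and model below are illustrative stand-ins, not the framework's actual inputs.

```python
# Sketch of a sensitivity study on a simple performance model: scale one
# machine parameter at a time and see how the predicted runtime responds.
# The profile, signature, and model are illustrative stand-ins.

app = {"flops": 4.0e12, "mem_bytes": 8.0e11, "mpi_bytes": 3.0e10}
machine = {"flop_rate": 2.0e10, "mem_bw": 8.0e9, "net_bw": 1.0e9}

def runtime(m):
    return (app["flops"] / m["flop_rate"]
            + app["mem_bytes"] / m["mem_bw"]
            + app["mpi_bytes"] / m["net_bw"])

base = runtime(machine)
for param in machine:
    for factor in (0.5, 2.0):
        perturbed = dict(machine, **{param: machine[param] * factor})
        change = (runtime(perturbed) - base) / base * 100
        print(f"{param:10s} x{factor:<4} -> runtime change {change:+6.1f}%")
```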


European Conference on Parallel Processing | 2009

PSINS: An Open Source Event Tracer and Execution Simulator for MPI Applications

Mustafa M. Tikir; Michael A. Laurenzano; Laura Carrington; Allan Snavely

As today's supercomputers grow exponentially in processor count, the applications that run on these systems scale to larger processor counts. The majority of these applications use the Message Passing Interface (MPI); a trace of their MPI communication events is an important input to tools that visualize, simulate for performance modeling, or enable tuning of parallel applications. We introduce an efficient, accurate and flexible trace-driven performance modeling and prediction tool, PMaC's Open Source Interconnect and Network Simulator (PSINS), for MPI applications. A principal feature of PSINS is its usability for applications that scale up to large processor counts. PSINS generates compact and tractable event traces for fast and efficient simulations while producing accurate performance predictions. PSINS was incorporated into PMaC's automated performance prediction framework and used to model three applications from the High Performance Computing Modernization Program's (HPCMP) Technology Insertion 2009 (TI-09) application suite.
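A trace-driven replay can be sketched in a highly simplified form: each rank's trace is a list of compute and send events, and communication cost is charged as latency plus bytes over bandwidth. This is an illustration of the approach, not PSINS's event format or network models; real simulators also match sends with receives and model contention.

```python
# Highly simplified trace-driven replay in the spirit of an MPI event
# simulator: each rank's trace is a list of (event, payload) records and
# communication cost is charged as latency + bytes/bandwidth. Real tools
# match sends with receives and model contention; this sketch does not.

LATENCY = 2.0e-6      # seconds per message (illustrative)
BANDWIDTH = 1.0e9     # bytes per second (illustrative)

# Per-rank traces: ("compute", seconds) or ("send", bytes).
traces = {
    0: [("compute", 0.40), ("send", 8_000_000), ("compute", 0.10)],
    1: [("compute", 0.35), ("send", 8_000_000), ("compute", 0.20)],
}

def replay(trace):
    clock = 0.0
    for kind, payload in trace:
        if kind == "compute":
            clock += payload
        elif kind == "send":
            clock += LATENCY + payload / BANDWIDTH
    return clock

per_rank = {rank: replay(t) for rank, t in traces.items()}
print("per-rank predicted times:", per_rank)
print("predicted application time:", max(per_rank.values()))
```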


International Parallel and Distributed Processing Symposium | 2012

Modeling Power and Energy Usage of HPC Kernels

Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; Allan Snavely

Compute-intensive kernels make up the majority of execution time in HPC applications. Therefore, many of the power draw and energy consumption traits of HPC applications can be characterized in terms of the power draw and energy consumption of these constituent kernels. Given that power- and energy-related constraints have emerged as major design impediments for exascale systems, it is crucial to develop a greater understanding of how kernels behave in terms of power and energy when subjected to different compiler-based optimizations and different hardware settings. In this work, we develop CPU and DIMM power and energy models for three extensively utilized HPC kernels by training artificial neural networks. These networks are trained using empirical data gathered on the target architecture. The models take kernel-specific compiler-based optimization parameters and hardware tunables as inputs and make predictions for the power draw rate and energy consumption of system components. The resulting power draw and energy usage predictions have an absolute error rate that averages less than 5.5% for three important kernels: matrix multiplication (MM), stencil computation, and LU factorization.
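The modeling idea can be sketched by fitting a small neural-network regressor that maps tunables and hardware settings to power draw. The sketch below assumes scikit-learn and NumPy are available, and it trains on a synthetic placeholder function rather than the paper's measurements.

```python
# Sketch of the modeling idea: train a small neural network that maps
# kernel tunables and hardware settings to measured power draw. Assumes
# scikit-learn and NumPy are available; the "measurements" here are a
# synthetic placeholder, not data from the paper.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Features: [cpu_freq_ghz, tile_size, unroll_factor] for each training run.
X = rng.uniform([1.2, 8, 1], [2.6, 128, 8], size=(200, 3))

# Placeholder "measured" CPU power in watts (a made-up smooth function plus noise).
y = 30 + 25 * X[:, 0] + 0.05 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 1.0, 200)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0),
)
model.fit(X, y)

# Predict power for an unseen configuration: 2.0 GHz, tile 64, unroll 4.
print("predicted power (W):", float(model.predict([[2.0, 64, 4]])[0]))
```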


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

High-frequency simulations of global seismic wave propagation using SPECFEM3D_GLOBE on 62K processors

Laura Carrington; D. Komatitsch; Michael A. Laurenzano; Mustafa M. Tikir; David Michéa; N. Le Goff; Allan Snavely; John Tromp

SPECFEM3D_GLOBE is a spectral-element application enabling the simulation of global seismic wave propagation in 3D anelastic, anisotropic, rotating and self-gravitating Earth models at unprecedented resolution. A fundamental challenge in global seismology is to model the propagation of waves with periods between 1 and 2 seconds, the highest-frequency signals that can propagate clear across the Earth. These waves help reveal the 3D structure of the Earth's deep interior and can be compared to seismographic recordings. We broke the 2-second barrier using the 62K-processor Ranger system at TACC. Indeed we broke the barrier using just half of Ranger, reaching a period of 1.84 seconds with a sustained 28.7 Tflops on 32K processors. We obtained similar results on the XT4 Franklin system at NERSC and the XT4 Kraken system at the University of Tennessee, Knoxville, while a similar run on the 28K-processor Jaguar system at ORNL, which has better memory bandwidth per processor, sustained 35.7 Tflops (a higher flops rate) with a shortest period of 1.94 seconds. Thus we have enabled a powerful new tool for seismic wave simulation, one that operates in the same frequency regimes as nature; in seismology there is no need to pursue much smaller periods because higher-frequency signals do not propagate across the entire globe. We employed performance modeling methods to identify performance bottlenecks and worked through issues of parallel I/O and scalability. Improved mesh design and numbering results in excellent load balancing and few cache misses. The primary achievements are not just the scalability and high teraflops number, but a historic step towards understanding the physics and chemistry of the Earth's interior at unprecedented resolution.


International Conference on Parallel Processing | 2011

Auto-tuning for energy usage in scientific applications

Ananta Tiwari; Michael A. Laurenzano; Laura Carrington; Allan Snavely

The power wall has become a dominant impeding factor in the realm of exascale system design. It is therefore important to understand how to create software that minimizes power usage while maintaining satisfactory levels of performance. This work uses existing software and hardware facilities to tune applications to minimize several combined measures of power and performance. The tuning is done with respect to software-level performance-related tunables and processor clock frequency. These tunable parameters are explored via an offline search to find the parameter combinations that are optimal with respect to performance (or delay, D), energy (E), energy×delay (E×D) and energy×delay×delay (E×D²). These searches are employed on a parallel application that solves Poisson's equation using stencils. We show that the parameter configuration that minimizes energy consumption can save, on average, 5.4% energy with a performance loss of 4% when compared to the configuration that minimizes runtime.
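The offline search can be sketched as a simple enumeration: for each (clock frequency, tunable) configuration, obtain a runtime and average power, compute D, E, E×D and E×D², and keep the minimizer of each metric. The measurement function and values below are illustrative placeholders, not measurements from the paper.

```python
# Sketch of the offline search: enumerate configurations, compute the
# delay (D), energy (E), E*D and E*D^2 metrics for each, and report the
# configuration that minimizes each metric. The runtimes and power draws
# below are illustrative placeholders.
from itertools import product

frequencies_ghz = [1.6, 2.0, 2.4, 2.8]
unroll_factors = [1, 2, 4]

def measure(freq, unroll):
    """Placeholder for a real (runtime, avg_power) measurement of one run."""
    runtime = 100.0 / (freq * (1 + 0.1 * unroll))       # seconds, made up
    power = 40.0 + 20.0 * freq + 0.5 * unroll           # watts, made up
    return runtime, power

results = []
for freq, unroll in product(frequencies_ghz, unroll_factors):
    d, p = measure(freq, unroll)
    e = d * p                                            # joules
    results.append({"freq": freq, "unroll": unroll,
                    "D": d, "E": e, "ExD": e * d, "ExD2": e * d * d})

for metric in ("D", "E", "ExD", "ExD2"):
    best = min(results, key=lambda r: r[metric])
    print(f"min {metric:4s}: freq={best['freq']} GHz, unroll={best['unroll']}")
```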


International Conference on Cloud and Green Computing | 2012

Green Queue: Customized Large-Scale Clock Frequency Scaling

Ananta Tiwari; Michael A. Laurenzano; Joshua Peraza; Laura Carrington; Allan Snavely

We examine the scalability of a set of techniques related to Dynamic Voltage-Frequency Scaling (DVFS) on HPC systems to reduce the energy consumption of scientific applications through an application-aware analysis and runtime framework, Green Queue. Green Queue supports making CPU clock frequency changes in response to intra-node and inter-node observations about application behavior. Our intra-node approach reduces CPU clock frequency, and therefore power consumption, when the CPU lacks computational work due to inefficient data movement. Our inter-node approach reduces clock frequencies for MPI ranks that lack computational work. We investigate these techniques on a set of large scientific applications on 1024 cores of Gordon, an Intel Sandy Bridge-based supercomputer at the San Diego Supercomputer Center. Our optimal intra-node technique showed an average measured energy savings of 10.6% and a maximum of 21.0% over regular application runs. Our optimal inter-node technique showed an average of 17.4% and a maximum of 31.7% energy savings.
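The two decision ideas can be sketched as simple rules: lower the clock for phases dominated by memory traffic (intra-node) and for ranks that spend most of their time waiting in MPI (inter-node). The thresholds, frequencies, and observations below are illustrative, not Green Queue's actual policy.

```python
# Sketch of the two decision ideas (thresholds and frequencies are
# illustrative, not Green Queue's actual policy): lower the clock when a
# phase is dominated by memory traffic (intra-node) or when a rank is
# mostly waiting in MPI (inter-node).

FREQ_MAX_GHZ = 2.6
FREQ_LOW_GHZ = 1.8

def intra_node_freq(bytes_per_flop):
    """Memory-bound phases gain little from a fast clock."""
    return FREQ_LOW_GHZ if bytes_per_flop > 0.5 else FREQ_MAX_GHZ

def inter_node_freq(mpi_wait_fraction):
    """Ranks that mostly wait on communication can run slower."""
    return FREQ_LOW_GHZ if mpi_wait_fraction > 0.4 else FREQ_MAX_GHZ

# Example: per-phase and per-rank observations gathered from a prior run.
phases = {"stencil_sweep": 0.9, "dense_solve": 0.1}          # bytes per flop
ranks = {0: 0.05, 1: 0.55, 2: 0.60}                          # MPI wait fraction

for name, bpf in phases.items():
    print(f"phase {name:14s} -> {intra_node_freq(bpf)} GHz")
for rank, wait in ranks.items():
    print(f"rank {rank} -> {inter_node_freq(wait)} GHz")
```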


International Conference on Parallel Processing | 2011

Reducing energy usage with memory and computation-aware dynamic frequency scaling

Michael A. Laurenzano; Mitesh R. Meswani; Laura Carrington; Allan Snavely; Mustafa M. Tikir; Stephen W. Poole

Over the life of a modern supercomputer, the energy cost of running the system can exceed the cost of the original hardware purchase. This has driven the community to attempt to understand and minimize energy costs wherever possible. Towards these ends, we present an automated, fine-grained approach to selecting per-loop processor clock frequencies. The clock frequency selection criterion is established through a combination of lightweight static analysis and runtime tracing that automatically acquires application signatures: characterizations of the patterns of execution of each loop in an application. This application characterization is matched with one of a series of benchmark loops, which have been run on the target system and probe it in various ways. These benchmarks form a covering set, a machine characterization of the expected power consumption and performance traits of the machine over the space of execution patterns and clock frequencies. The frequency that confers the optimal behavior in terms of power-delay product for the benchmark that most closely resembles each application loop is the one chosen for that loop. The set of tools that implements this scheme is fully automated, built on top of freely available open source software, and uses an inexpensive power measurement apparatus. We use these tools to show a measured, system-wide energy savings of up to 7.6% on an 8-core Intel Xeon E5530 and 10.6% on a 32-core AMD Opteron 8380 (a Sun X4600 node) across a range of workloads.
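The per-loop selection can be sketched as a nearest-neighbor lookup: characterize each application loop by a small signature, find the closest probe benchmark in a pre-built table of per-frequency delay and power measurements, and choose the frequency that minimized power×delay for that benchmark. The table and signatures below are illustrative placeholders, not data from the paper.

```python
# Sketch of the per-loop selection idea: match each application loop's
# signature to the closest probe benchmark, then pick the clock frequency
# that minimized power*delay for that benchmark. The benchmark table and
# loop signatures below are illustrative placeholders.

# Benchmark table: signature (fraction of memory-bound work) ->
# {frequency_ghz: (relative_delay, avg_power_watts)} measured offline.
benchmarks = {
    0.1: {1.8: (1.40, 55.0), 2.4: (1.05, 75.0), 2.8: (1.00, 90.0)},
    0.5: {1.8: (1.15, 52.0), 2.4: (1.05, 70.0), 2.8: (1.00, 85.0)},
    0.9: {1.8: (1.02, 50.0), 2.4: (1.01, 68.0), 2.8: (1.00, 82.0)},
}

def choose_frequency(loop_signature):
    # Nearest benchmark by signature distance.
    nearest = min(benchmarks, key=lambda s: abs(s - loop_signature))
    table = benchmarks[nearest]
    # Frequency minimizing the power-delay product for that benchmark.
    return min(table, key=lambda f: table[f][0] * table[f][1])

app_loops = {"loop_A": 0.85, "loop_B": 0.15}   # memory-bound fractions
for loop, sig in app_loops.items():
    print(f"{loop}: run at {choose_frequency(sig)} GHz")
```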

Collaboration


Dive into Laura Carrington's collaborations.

Top Co-Authors

Allan Snavely, University of California

Ananta Tiwari, San Diego Supercomputer Center

Mustafa M. Tikir, San Diego Supercomputer Center

Joshua Peraza, University of California

Pietro Cicotti, University of California

William A. Ward, United States Department of Defense

Adam Jundt, San Diego Supercomputer Center

Nicole Wolter, University of California

Martin Schulz, Lawrence Livermore National Laboratory