
Publications


Featured research published by Harold W. Cain.


high performance computer architecture | 2001

An architectural evaluation of Java TPC-W

Harold W. Cain; Ravi Rajwar; Morris Marden; Mikko H. Lipasti

The use of the Java programming language for implementing server-side application logic is increasing in popularity, yet very little is known about the architectural requirements of this emerging commercial workload. We present a detailed characterization of the Transaction Processing Council's TPC-W web benchmark, implemented in Java. The TPC-W benchmark is designed to exercise the web server and transaction processing system of a typical e-commerce web site. We have implemented TPC-W as a collection of Java servlets, and present an architectural study detailing the memory system and branch predictor behavior of the workload. We also evaluate the effectiveness of a coarse-grained multithreaded processor at increasing system throughput using TPC-W and other commercial workloads. We measure system throughput improvements from 8% to 41% for a two-context processor, and 12% to 60% for a four-context uniprocessor over a single-threaded uniprocessor, despite decreased branch prediction accuracy and cache hit rates.


programming language design and implementation | 2006

Accurate, efficient, and adaptive calling context profiling

Xiaotong Zhuang; Mauricio J. Serrano; Harold W. Cain; Jong-Deok Choi

Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant profiling, this approach dramatically reduces overhead while preserving profile accuracy. We first demonstrate the drawbacks of previously proposed calling context profiling mechanisms. We show that a low-overhead solution using sampled stack-walking alone is less than 50% accurate, based on degree of overlap with a complete calling-context tree. We also show that a static bursting approach collects a highly accurate profile, but causes an unacceptable application slowdown. Our adaptive solution achieves an 85% degree of overlap and provides 88% hot-edge coverage when using a 0.1 hot-edge threshold, while dramatically reducing overhead compared to the static bursting approach.
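
A concrete picture may help here. The fragment below is a minimal sketch of a calling-context tree with a burst-usefulness check, assuming invented names (CCTNode, cct_enter, BURST_LIMIT); it illustrates the idea, not the authors' implementation.

    #include <stdbool.h>
    #include <stdlib.h>

    #define MAX_CHILDREN 8     /* illustrative fixed fan-out */
    #define BURST_LIMIT  64    /* a context this hot no longer needs bursting */

    /* One node of the calling-context tree: a call site plus its children. */
    typedef struct CCTNode {
        void *call_site;                        /* return address of the call */
        unsigned long count;                    /* times this context entered */
        struct CCTNode *children[MAX_CHILDREN];
    } CCTNode;

    /* Find or create the child of `parent` for call site `site`. */
    static CCTNode *cct_child(CCTNode *parent, void *site)
    {
        for (int i = 0; i < MAX_CHILDREN; i++) {
            if (parent->children[i] == NULL) {  /* first free slot: new child */
                CCTNode *n = calloc(1, sizeof *n);
                n->call_site = site;
                parent->children[i] = n;
                return n;
            }
            if (parent->children[i]->call_site == site)
                return parent->children[i];
        }
        return parent;  /* fan-out exhausted; a real CCT grows dynamically */
    }

    /* Called on method entry while a burst is active. */
    CCTNode *cct_enter(CCTNode *cur, void *site)
    {
        CCTNode *n = cct_child(cur, site);
        n->count++;
        return n;
    }

    /* The adaptive part: stop profiling contexts that are already hot;
       a sampled stack walk later re-enables bursting at a fresh node. */
    bool burst_still_useful(const CCTNode *n)
    {
        return n->count < BURST_LIMIT;
    }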


international symposium on computer architecture | 2013

Robust architectural support for transactional memory in the Power architecture

Harold W. Cain; Maged M. Michael; Brad Frey; Cathy May; Derek Edward Williams; Hung Q. Le

On the twentieth anniversary of the original publication [10], following ten years of intense activity in the research literature, hardware support for transactional memory (TM) has finally become a commercial reality, with HTM-enabled chips currently or soon-to-be available from many hardware vendors. In this paper we describe architectural support for TM added to a future version of the Power ISA™. Two imperatives drove the development: the desire to complement our weakly-consistent memory model with a more friendly interface to simplify the development and porting of multithreaded applications, and the need for robustness beyond that of some early implementations. In the process of commercializing the feature, we had to resolve some previously unexplored interactions between TM and existing features of the ISA, for example, translation shootdown, interrupt handling, atomic read-modify-write primitives, and our weakly consistent memory model. We describe these interactions and the overall architecture, and discuss the motivation and rationale for our choices of architectural semantics, beyond what is typically found in reference manuals.
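
The shape of the resulting programming interface can be seen in the GCC built-ins for Power HTM (available with -mhtm on POWER8 and later). The skeleton below follows the lock-elision pattern documented in the GCC manual rather than any code from the paper; is_locked, fallback_lock, and fallback_unlock are placeholder names.

    /* Compile with: gcc -mhtm ... (POWER8 or later). */
    #include <htmintrin.h>

    extern int  is_locked(void);       /* placeholder: test the fallback lock */
    extern void fallback_lock(void);   /* placeholder: slow-path acquisition  */
    extern void fallback_unlock(void);

    void run_atomically(void (*body)(void))
    {
        if (__builtin_tbegin(0)) {
            /* Transactional path: abort if someone holds the fallback lock,
               so transactions never run concurrently with a lock holder. */
            if (is_locked())
                __builtin_tabort(0);
            body();
            __builtin_tend(0);
        } else {
            /* Transaction failed (conflict, capacity, interrupt, ...).
               A production version would retry a few times before this. */
            fallback_lock();
            body();
            fallback_unlock();
        }
    }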


international symposium on computer architecture | 2004

Memory Ordering: A Value-Based Approach

Harold W. Cain; Mikko H. Lipasti

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
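
The replay mechanism is simple enough to sketch in simulator-style C. The fragment below is an illustrative model only, with invented names (LoadEntry, mem_read, squash_and_refetch): loads retire through a plain FIFO, only loads flagged by a filter heuristic are re-executed, and a value mismatch forces a squash.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint64_t addr;    /* effective address of the load               */
        uint64_t value;   /* value obtained when the load first executed */
        bool     replay;  /* set by a filter heuristic, e.g. the address */
                          /* matched an invalidation while in flight     */
    } LoadEntry;

    extern uint64_t mem_read(uint64_t addr);   /* placeholder: committed memory  */
    extern void     squash_and_refetch(void);  /* placeholder: pipeline recovery */

    /* Retire one load from the head of a plain FIFO load queue.  There is
       no associative address search: ordering violations are caught by
       comparing values, not by snooping a CAM. */
    void retire_load(const LoadEntry *ld)
    {
        if (ld->replay) {
            uint64_t now = mem_read(ld->addr);  /* re-execute in program order */
            if (now != ld->value)
                squash_and_refetch();  /* the early execution saw a stale value */
        }
        /* otherwise the heuristic proved re-execution is unnecessary */
    }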


international symposium on microarchitecture | 2001

Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing

Milo M. K. Martin; Daniel J. Sorin; Harold W. Cain; Mark D. Hill; Mikko H. Lipasti

This paper explores the interaction of value prediction with thread-level parallelism techniques, including multithreading and multiprocessing, where correctness is defined by a memory consistency model. Value prediction subtly interacts with the memory consistency model by allowing data-dependent instructions to be reordered. We find that predicting a value and later verifying that the value eventually calculated is the same as the value predicted is not always sufficient. We present an example of a multithreaded pointer manipulation that can generate a surprising and erroneous result when value prediction is implemented without considering memory consistency correctness. We show that this problem can occur with real software, and we discuss how to apply existing techniques to eliminate the problem in both sequentially consistent systems and systems that obey relaxed memory consistency models.
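
The flavor of the counterexample is easy to reconstruct (this sketch is illustrative, not the paper's exact figure): one thread publishes a node through a shared pointer while the other value-predicts that pointer and performs the dependent load early.

    struct node { int data; };
    struct node A = { 0 };
    struct node *head = 0;            /* A is not yet published */

    /* Thread 1: initialize the node, then publish it. */
    void producer(void)
    {
        A.data = 42;                  /* (1) write the payload   */
        head   = &A;                  /* (2) publish the pointer */
    }

    /* Thread 2, as executed by a value-predicting core.  The two
       statements appear here in the order the hardware performs them. */
    void consumer(void)
    {
        /* The core predicts head == &A, so the dependent load issues early: */
        int d = A.data;               /* may execute before (1): reads 0     */
        struct node *p = head;        /* later reads &A, i.e. after (2)      */
        /* Verifying the prediction (p == &A) succeeds, so the reordering is
           never detected.  Yet p == &A with d == 0 is forbidden under
           sequential consistency: any interleaving in which the pointer
           load sees (2) must also make the data load see (1). */
        (void)p; (void)d;
    }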


acm symposium on parallel algorithms and architectures | 2002

Verifying sequential consistency using vector clocks

Harold W. Cain; Mikko H. Lipasti

We present an algorithm for dynamically verifying that the execution of a multithreaded program is sequentially consistent. The algorithm uses a vector-timestamp logical time mechanism to construct and verify the acyclic nature of an execution's constraint graph.
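
A minimal sketch of the vector-timestamp machinery, under assumed names and structure: each memory operation receives a vector clock derived from the operations that must precede it, and any constraint-graph edge that points backward in vector time exposes a cycle, i.e. a sequential consistency violation.

    #include <stdbool.h>

    #define NPROC 4

    typedef struct { unsigned t[NPROC]; } VClock;

    /* Timestamp of a new operation by processor p, given the clocks of all
       operations that must precede it (program order, coherence order). */
    VClock ts_succ(int p, const VClock *preds, int npreds)
    {
        VClock v = {{0}};
        for (int i = 0; i < npreds; i++)
            for (int j = 0; j < NPROC; j++)
                if (preds[i].t[j] > v.t[j]) v.t[j] = preds[i].t[j];
        v.t[p]++;                       /* this operation advances p's clock */
        return v;
    }

    /* An edge u -> v of the constraint graph keeps the graph acyclic only
       if v is NOT at-or-before u in vector time. */
    bool edge_ok(const VClock *u, const VClock *v)
    {
        for (int j = 0; j < NPROC; j++)
            if (v->t[j] > u->t[j]) return true;  /* v strictly ahead somewhere */
        return false;                            /* v <= u: edge closes a cycle */
    }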


Concurrency and Computation: Practice and Experience | 2002

A callgraph‐based search strategy for automated performance diagnosis

Harold W. Cain; Barton P. Miller; Brian J. N. Wylie

We introduce a new technique for automated performance diagnosis, using the program's callgraph. We discuss our implementation of this diagnosis technique in the Paradyn Performance Consultant. Our implementation includes the new search strategy and new dynamic instrumentation to resolve pointer‐based dynamic call sites at run‐time. We compare the effectiveness of our new technique to the previous version of the Performance Consultant for several sequential and parallel applications. Our results show that the new search method performs its search while inserting dramatically less instrumentation into the application, resulting in reduced application perturbation and consequently a higher degree of diagnosis accuracy.
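
The search strategy can be caricatured in a few lines (invented names; Paradyn's internals differ): starting from the root of the callgraph, only the callees of functions already found to be expensive are ever instrumented, which keeps instrumentation confined to one path at a time.

    #include <stdbool.h>

    #define MAX_CALLEES 16

    typedef struct Func {
        const char  *name;
        struct Func *callees[MAX_CALLEES];   /* NULL-terminated */
    } Func;

    /* Placeholder for dynamic instrumentation of a single function. */
    extern double measure_time(const Func *f);

    /* Recursively search for bottlenecks, only ever instrumenting the
       callees of functions already shown to be expensive. */
    void diagnose(const Func *f, double threshold)
    {
        for (int i = 0; i < MAX_CALLEES && f->callees[i]; i++) {
            double t = measure_time(f->callees[i]);  /* instrument one callee */
            if (t > threshold)                       /* expensive: descend    */
                diagnose(f->callees[i], threshold);
            /* cheap callees are never instrumented below this point */
        }
    }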


international conference on parallel architectures and compilation techniques | 2003

Redeeming IPC as a performance metric for multithreaded programs

Kevin M. Lepak; Harold W. Cain; Mikko H. Lipasti

Recent work has shown that multithreaded workloads running in execution-driven, full-system simulation environments cannot use instructions per cycle (IPC) as a valid performance metric due to nondeterministic program behavior. Unfortunately, invalidating IPC as a performance metric introduces its own host of difficulties: special workload setup, consideration of cold-start and end-effects, statistical methodologies leading to increased simulation bandwidth, and workload-specific, higher-level metrics to measure performance. We explore the nondeterminism problem in multithreaded programs, describe a method to eliminate nondeterminism across simulations of different experimental machine models, and demonstrate the suitability of this methodology for performing architectural performance analysis, thus redeeming IPC as a performance metric for multithreaded programs.
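
One way to picture the determinism mechanism, as a loose sketch under assumed names rather than the paper's implementation: record the winner of every shared-memory race in a baseline simulation, then enforce the same outcomes when simulating other machine models, so that all models execute comparable instruction paths.

    #include <stdbool.h>

    #define MAX_RACES 1024

    /* Outcomes of shared-memory races, logged by the baseline run. */
    static int  race_winner[MAX_RACES];
    static bool replaying = false;    /* false: record; true: enforce */

    /* Called by the simulator when `thread` is about to win race `race_id`
       (e.g. acquire a contended lock or write a contended line first). */
    bool race_may_proceed(int race_id, int thread)
    {
        if (!replaying) {
            race_winner[race_id] = thread;     /* record: first arrival wins */
            return true;
        }
        /* Replay on a different machine model: only the recorded winner may
           proceed; the simulator retries the loser later, so both models
           follow the same instruction path and IPC becomes comparable. */
        return race_winner[race_id] == thread;
    }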


international symposium on computer architecture | 2006

Conditional Memory Ordering

Christoph von Praun; Harold W. Cain; Jong-Deok Choi; Kyung Dong Ryu

Conventional relaxed memory ordering techniques follow a proactive model: at a synchronization point, a processor makes its own updates to memory available to other processors by executing a memory barrier instruction, ensuring that recent writes have been ordered with respect to other processors in the system. We show that this model leads to superfluous memory barriers in programs with acquire-release style synchronization, and present a combined hardware/software synchronization mechanism called conditional memory ordering (CMO) that reduces memory ordering overhead. CMO is demonstrated on a lock algorithm that identifies those dynamic lock/unlock operations for which memory ordering is unnecessary, and speculatively omits the associated memory ordering instructions. When ordering is required, this algorithm relies on a hardware mechanism for initiating a memory ordering operation on another processor. Based on evaluation using a software-only CMO prototype, we show that CMO avoids memory ordering operations for the vast majority of dynamic acquire and release operations across a set of multithreaded Java workloads, leading to significant speedups for many. However, performance improvements in the software prototype are hindered by the high cost of remote memory ordering. Using empirical data, we construct an analytical model demonstrating the benefits of a combined hardware-software implementation.
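
The lock-based instance of CMO can be approximated in software (a much-simplified analogue using C11 atomics; the last_owner bookkeeping is invented for this sketch): acquire-side and release-side fences are skipped when the lock is reacquired by the thread that last released it, since program order already orders that thread's accesses.

    #include <stdatomic.h>

    typedef struct {
        atomic_int held;        /* 0 = free, 1 = held         */
        int        last_owner;  /* thread id of last releaser */
    } cmo_lock;

    void cmo_acquire(cmo_lock *lk, int self)
    {
        while (atomic_exchange_explicit(&lk->held, 1, memory_order_relaxed))
            ;  /* spin (sketch only) */
        if (lk->last_owner != self)
            atomic_thread_fence(memory_order_acquire);  /* ordering needed */
        /* else: the same thread ran the last critical section, so program
           order already orders its accesses and the fence is skipped */
    }

    void cmo_release(cmo_lock *lk, int self)
    {
        lk->last_owner = self;
        /* CMO also defers the release-side fence: if a different processor
           acquires next, the paper's hardware mechanism sends a remote
           memory-ordering request to perform the sync on our behalf. */
        atomic_store_explicit(&lk->held, 0, memory_order_relaxed);
    }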


european conference on parallel processing | 2000

A Callgraph-Based Search Strategy for Automated Performance Diagnosis

Harold W. Cain; Barton P. Miller; Brian J. N. Wylie

We introduce a new technique for automated performance diagnosis, using the program’s callgraph. We discuss our implementation of this diagnosis technique in the Paradyn Performance Consultant. Our implementation includes the new search strategy and new dynamic instrumentation to resolve pointer-based dynamic call sites at run-time. We compare the effectiveness of our new technique to the previous version of the Performance Consultant for several sequential and parallel applications. Our results show that the new search method performs its search while inserting dramatically less instrumentation into the application, resulting in reduced application perturbation and consequently a higher degree of diagnosis accuracy.
