Kevin M. Lepak
University of Wisconsin-Madison
Publication
Featured research published by Kevin M. Lepak.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2005
Weiping Liao; Lei He; Kevin M. Lepak
Performance and power are two primary design issues for systems ranging from server computers to handhelds. Performance is affected by both temperature and supply voltage because of the temperature and voltage dependence of circuit delay. Furthermore, as semiconductor technology scales down, leakage power's exponential dependence on temperature and supply voltage becomes significant. Therefore, future design studies call for temperature- and voltage-aware performance and power modeling. In this paper, we study microarchitecture-level temperature- and voltage-aware performance and power modeling. We present a leakage power model with temperature and voltage scaling, and show that leakage and total energy vary by 38% and 24%, respectively, between 65°C and 110°C. We study thermal runaway induced by the interdependence between temperature and leakage power, and demonstrate that without temperature-aware modeling, underestimation of leakage power may lead to the failure of thermal controls, and overestimation of leakage power may result in excessive performance penalties of up to 5.24%. All of these studies underscore the necessity of temperature-aware power modeling. Furthermore, we study optimal voltage scaling for best performance with dynamic power and thermal management under different packaging options. We show that dynamic power and thermal management allows designs to target the common-case thermal scenario among benchmarks and improves performance by 6.59% compared to designs targeted at the worst-case thermal scenario without dynamic power and thermal management. Additionally, we show that the optimal Vdd for the best performance may not be the largest Vdd allowed by the given packaging platform, and that advanced cooling techniques can improve throughput significantly.
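The temperature/leakage feedback loop described in this abstract can be illustrated with a small fixed-point iteration. This is only a sketch under stated assumptions, not the paper's model: the exponential leakage form, the lumped thermal resistance, and every parameter value (`p_leak_ref`, `beta`, `r_thermal`, `t_limit`, and so on) are hypothetical and chosen purely for illustration.

```python
import math

def steady_state_temperature(p_dynamic, r_thermal, t_ambient=45.0,
                             p_leak_ref=10.0, t_ref=65.0, beta=0.02,
                             t_limit=150.0, max_iters=1000, tol=1e-6):
    """Iterate T = T_amb + R * (P_dyn + P_leak(T)) until it settles.

    All parameters are hypothetical. Leakage is assumed to grow
    exponentially with temperature: P_leak(T) = P_ref * exp(beta * (T - T_ref)).
    Returns the steady-state temperature in degrees C, or None if the
    temperature exceeds t_limit (treated here as thermal runaway) or the
    iteration does not converge within max_iters.
    """
    t = t_ambient
    for _ in range(max_iters):
        p_leak = p_leak_ref * math.exp(beta * (t - t_ref))
        t_next = t_ambient + r_thermal * (p_dynamic + p_leak)
        if t_next > t_limit:
            return None  # leakage and temperature keep feeding each other
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
    return None

# A low thermal resistance (better cooling) settles to a stable temperature,
# while a high one lets the feedback loop diverge.
stable = steady_state_temperature(p_dynamic=40.0, r_thermal=0.5)
runaway = steady_state_temperature(p_dynamic=40.0, r_thermal=2.0)
```

With these illustrative numbers, the first call converges and the second is reported as runaway, which mirrors the qualitative point that packaging and cooling determine whether the temperature/leakage loop stabilizes.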
international symposium on computer architecture | 2000
Kevin M. Lepak; Mikko H. Lipasti
Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions, or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to an MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.
architectural support for programming languages and operating systems | 2002
Kevin M. Lepak; Mikko H. Lipasti
Recent work has shown that silent stores--stores which write a value matching the one already stored at the memory location--occur quite frequently and can be exploited to reduce memory traffic and improve performance. This paper extends the definition of silent stores to encompass sets of stores that change the value stored at a memory location, but only temporarily, and subsequently return a previous value of interest to the memory location. The stores that cause the value to revert are called temporally silent stores. We redefine multiprocessor sharing to account for temporal silence and show that in the limit, up to 45% of communication misses in scientific and commercial applications can be eliminated by exploiting values that change only temporarily. We describe a practical mechanism that detects temporally silent stores and removes the coherence traffic they cause in conventional multiprocessors. We find that up to 42% of communication misses can be eliminated with a simple extension to the MESI protocol. Further, we examine application and operating system code to provide insight into the temporal silence phenomenon and characterize temporal silence by examining value frequencies and dynamic instruction distances between temporally silent pairs. These studies indicate that the operating system is involved heavily in temporal silence, in both commercial and scientific workloads, and that while detectable synchronization primitives provide substantial contributions, significant opportunity exists outside these references.
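The distinction between silent and temporally silent stores can be sketched on a simple store trace. This is a minimal illustration, not the detection mechanism from the paper: the function name `classify_stores`, the trace format, and the choice to remember only the single value held just before the most recent change are all assumptions made for the example.

```python
def classify_stores(trace):
    """Label each (address, value) store in a trace.

    'silent'            : writes the value already at the address.
    'temporally_silent' : changes the value, but back to the value the
                          address held just before its most recent change
                          (a simplifying assumption: only one prior value
                          per address is remembered).
    'ordinary'          : everything else, including the first store seen
                          to an address, whose prior contents are unknown.
    """
    current = {}   # address -> current value
    previous = {}  # address -> value held before the most recent change
    labels = []
    for addr, value in trace:
        if addr in current and current[addr] == value:
            labels.append("silent")
        elif addr in previous and previous[addr] == value:
            labels.append("temporally_silent")
            previous[addr] = current[addr]
            current[addr] = value
        else:
            labels.append("ordinary")
            if addr in current:
                previous[addr] = current[addr]
            current[addr] = value
    return labels

# Hypothetical lock word: initialized free (0), acquired (1), released (0),
# then redundantly written as free again.
LOCK = 0x1000
labels = classify_stores([(LOCK, 0), (LOCK, 1), (LOCK, 0), (LOCK, 0)])
```

In this trace the release store reverts the lock word to its earlier value and is labeled temporally silent, while the final redundant write is an ordinary silent store.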
IEEE Transactions on Computers | 2001
Kevin M. Lepak; Gordon B. Bell; Mikko H. Lipasti
Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining the initial efforts at predicting load instruction outcomes, with the balance of the effort examining the value locality of either all register-writing instructions or a focused subset of them. Surprisingly, there has been very little published characterization of or effort to exploit the value locality of data words stored to memory by computer programs. This paper presents such a characterization, including detailed source-level analysis of the causes of silent stores, proposes both memory-centric (based on message passing) and producer-centric (based on program structure) prediction mechanisms for stored data values, introduces the concept of silent stores and new definitions of multiprocessor false sharing based on these observations, and suggests new techniques for aligning cache coherence protocols and microarchitectural store handling techniques to exploit the value locality of stores. We find that realistic implementations of these techniques can significantly reduce multiprocessor data bus traffic and are more effective at reducing address bus traffic than the addition of Exclusive state to an MSI coherence protocol. We also show that squashing of silent stores can provide uniprocessor speedups greater than the addition of store-to-load forwarding.
design automation conference | 2001
Kevin M. Lepak; Irwan Luwandi; Lei He
For multiple coupled RLC nets, we formulate the min-area simultaneous shield insertion and net ordering (SINO/NB-v) problem to satisfy the given noise bound. We develop an efficient and conservative model to compute the peak noise, and apply the noise model to a simulated-annealing (SA) based algorithm for the SINO/NB-v problem. Extensive and accurate experiments show that the SA-based algorithm is efficient, and always achieves solutions satisfying the given noise bound. It uses up to 71% and 30% fewer shields when compared to a greedy based shield insertion algorithm and a separated shield insertion and net ordering algorithm, respectively. To the best of our knowledge, it is the first work that presents an in-depth study on the min-area SINO problem under an explicit noise constraint.
international conference on parallel architectures and compilation techniques | 2003
Kevin M. Lepak; Harold W. Cain; Mikko H. Lipasti
Recent work has shown that multithreaded workloads running in execution-driven, full-system simulation environments cannot use instructions per cycle (IPC) as a valid performance metric due to nondeterministic program behavior. Unfortunately, invalidating IPC as a performance metric introduces its own host of difficulties: special workload setup, consideration of cold-start and end-effects, statistical methodologies leading to increased simulation bandwidth, and workload-specific, higher-level metrics to measure performance. We explore the nondeterminism problem in multithreaded programs, describe a method to eliminate nondeterminism across simulations of different experimental machine models, and demonstrate the suitability of this methodology for performing architectural performance analysis, thus redeeming IPC as a performance metric for multithreaded programs.
international conference on parallel architectures and compilation techniques | 2000
Gordon B. Bell; Kevin M. Lepak; Mikko H. Lipasti
The recent discovery that many store instructions are silent creates new opportunities for computer architects. A silent store does not change the state of the system because it writes a value that already exists at the write address, and can safely be eliminated from the dynamic instruction stream. We analyze silent stores in several benchmarks in the context of their high-level source code and explain why they occur. We also introduce the concept of critical silent stores and show that their removal is sufficient for eliminating avoidable writebacks. Finally, we show that frequently occurring stores are highly likely to be silent and that selectively squashing them can drastically reduce the total number of silent stores. This paper explores and illuminates several aspects of store value locality.
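The definition of a silent store given in this abstract (a store that writes a value already present at the write address) can be sketched as a pass over a store trace. This is an illustrative sketch only: the function name `count_silent_stores`, the trace format, and the decision to treat the first store to an unseen address as non-silent are assumptions, not details from the paper.

```python
def count_silent_stores(trace, memory=None):
    """Count stores whose value equals the value already at the address.

    trace  : iterable of (address, value) pairs in program order.
    memory : optional dict giving initial memory contents (hypothetical
             input format); addresses absent from it are treated as
             unknown, so the first store to them is counted as non-silent.
    Returns (number of silent stores, final memory contents).
    """
    memory = {} if memory is None else dict(memory)
    silent = 0
    for addr, value in trace:
        if addr in memory and memory[addr] == value:
            silent += 1           # state unchanged: the store could be squashed
        else:
            memory[addr] = value  # the store actually changes memory
    return silent, memory

trace = [(0x10, 1), (0x10, 1), (0x20, 5), (0x10, 2), (0x20, 5)]
silent, final_memory = count_silent_stores(trace)
```

Here the second store to `0x10` and the second store to `0x20` rewrite existing values, so two of the five stores are silent.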
ACM Sigarch Computer Architecture News | 2001
Harold W. Cain; Kevin M. Lepak; Mikko H. Lipasti
We present the design of a PowerPC-based simulation infrastructure for architectural research. Our infrastructure uses an execution-driven out-of-order processor timing simulator from the SimpleScalar tool set. While porting SimpleScalar to the PowerPC architecture, we would like to remain compatible with other versions of SimpleScalar. We accomplish this by performing dynamic binary translation of the PowerPC instruction set architecture to the SimpleScalar instruction set architecture, and by mapping the PowerPC architectural state onto the SimpleScalar register set. Using this infrastructure, we execute unmodified PowerPC binaries on an out-of-order processor timing simulator which implements the SimpleScalar architecture. We describe and investigate trade-offs in the translation of some complex PowerPC instructions and advocate adoption of speculative decode to optimize instruction translations for the common case. We find that simple decode predictors can reach better than 90% accuracy for guiding speculative decode.
international symposium on performance analysis of systems and software | 2005
Kevin M. Lepak; Mikko H. Lipasti
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e., predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques.
Archive | 2002
Mikko H. Lipasti; Harold W. Cain; Kevin M. Lepak