Ed Grochowski | Researchain

Publication

Featured research published by Ed Grochowski.

international symposium on computer architecture | 2005

Mitigating Amdahl's Law through EPI Throttling

Murali Annavaram; Ed Grochowski; John Paul Shen

This paper is motivated by three recent trends in computer design. First, chip multi-processors (CMPs) with increasing numbers of CPU cores per chip are becoming common. Second, multi-threaded software that can take advantage of CMPs will soon become prevalent. Due to the nature of the algorithms, these multi-threaded programs inherently will have phases of sequential execution; Amdahl's law dictates that the speedup of such parallel programs will be limited by the sequential portion of the computation. Finally, increasing levels of on-chip integration coupled with a slowing rate of reduction in supply voltage make power consumption a first-order design constraint. Given this environment, our goal is to minimize the execution times of multi-threaded programs containing nontrivial parallel and sequential phases, while keeping the CMP's total power consumption within a fixed budget. In order to mitigate the effects of Amdahl's law, in this paper we make a compelling case for varying the amount of energy expended to process instructions according to the amount of available parallelism. Using the equation Power = Energy per instruction (EPI) × Instructions per second (IPS), we propose that during phases of limited parallelism (low IPS) the chip multi-processor will spend more EPI; similarly, during phases of higher parallelism (high IPS) the chip multi-processor will spend less EPI; in both scenarios power is fixed. We evaluate the performance benefits of an EPI throttle on an asymmetric multiprocessor (AMP) prototyped from a physical 4-way Xeon SMP server. Using a wide range of multi-threaded programs, we show a 38% wall clock speedup on an AMP compared to a standard SMP that uses the same power. We also measure the supply current on a 4-way SMP server while running the multi-threaded programs and use the measured data as input to a software simulator that implements a more flexible EPI throttle. The results from the measurement-driven simulation show performance benefits comparable to the AMP prototype. We analyze the results from both techniques, explain why and when an EPI throttle works well, and conclude with a discussion of the challenges in building practical EPI throttles.
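
The relation between power, EPI, and IPS in the abstract above can be illustrated with a small worked example. This is only a sketch: the function name `epi_budget`, the 100 W budget, and the two IPS values are hypothetical figures chosen for illustration and are not taken from the paper.

```python
def epi_budget(power_watts: float, ips: float) -> float:
    """Energy per instruction (joules) affordable at a fixed power budget.

    Rearranges power = EPI * IPS into EPI = power / IPS.
    """
    if ips <= 0:
        raise ValueError("ips must be positive")
    return power_watts / ips


# Hypothetical numbers, for illustration only.
POWER_BUDGET_W = 100.0

# Sequential phase: few instructions per second, so each one may cost more.
sequential_epi = epi_budget(POWER_BUDGET_W, 2e9)

# Parallel phase: many instructions per second, so each one must cost less.
parallel_epi = epi_budget(POWER_BUDGET_W, 8e9)

print(f"sequential phase: {sequential_epi * 1e9:.1f} nJ per instruction")
print(f"parallel phase:   {parallel_epi * 1e9:.1f} nJ per instruction")
```

With power held constant, a fourfold increase in IPS leaves one quarter of the energy for each instruction, which is the trade-off an EPI throttle exploits.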

international conference on computer design | 2004

Best of both latency and throughput

Ed Grochowski; Ronny Ronen; John Paul Shen; Hong Wang

This paper describes the tradeoff between latency performance and throughput performance in a power-constrained environment. We show that the key to achieving both excellent latency performance as well as excellent throughput performance is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. We survey four techniques for achieving variable energy per instruction: voltage/frequency scaling, asymmetric cores, variable-size cores, and speculation control. We estimate the potential range of energies obtainable by each technique and conclude that a combination of asymmetric cores and voltage/frequency scaling offers the most promising approach to design a chip-level multiprocessor that can achieve both excellent latency performance and excellent throughput performance.

international symposium on microarchitecture | 2009

Larrabee: A Many-Core x86 Architecture for Visual Computing

Larry Seiler; Doug Carmean; Eric Sprangle; Tom Forsyth; Pradeep Dubey; Stephen Junkins; Adam T. Lake; Robert D. Cavin; Roger Espasa; Ed Grochowski; Toni Juan; Michael Abrash; Jeremy Sugerman; Pat Hanrahan

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads.

high-performance computer architecture | 2002

Microarchitectural simulation and control of di/dt-induced power supply voltage variation

Ed Grochowski; David Ayers; Vivek Tiwari

As the power consumption of modern high-performance microprocessors increases beyond 100 W, power becomes an increasingly important design consideration. This paper presents a novel technique to simulate power supply voltage variation as a result of varying activity levels within the microprocessor when executing typical software. The voltage simulation capability may be added to existing microarchitecture simulators that determine the activities of each functional block on a clock-by-clock basis. We then discuss how the same technique can be implemented in logic on the microprocessor die to enable real-time computation of current consumption and power supply voltage. When used in a feedback loop, this logic makes it possible to control the microprocessor's activities to reduce demands on the power delivery system. With on-die voltage computation and di/dt control, we show that a significant reduction in power supply voltage variation may be achieved with little performance loss or average power increase.
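
One way to picture a per-clock voltage simulation of this kind is sketched below. It rests on assumptions that the abstract does not spell out: per-cycle current is modeled as a weighted sum of functional-block activities, and the supply voltage is obtained by convolving that current with an impulse response of the power delivery network. The function names, block weights, and impulse-response values are all hypothetical.

```python
from typing import List, Sequence


def cycle_current(activity: Sequence[float], block_current: Sequence[float]) -> float:
    """Current drawn in one clock: sum of block activity times per-block current (assumed model)."""
    return sum(a * c for a, c in zip(activity, block_current))


def simulate_voltage(currents: Sequence[float],
                     impulse_response: Sequence[float],
                     vdd: float) -> List[float]:
    """Per-clock supply voltage: nominal vdd minus the droop from past and present current.

    The droop at clock t is the discrete convolution of the current trace with
    the (assumed) impulse response of the power delivery network.
    """
    voltages = []
    for t in range(len(currents)):
        droop = 0.0
        for k, h in enumerate(impulse_response):
            if t - k < 0:
                break  # no current history before the first clock
            droop += h * currents[t - k]
        voltages.append(vdd - droop)
    return voltages


# Hypothetical example: two functional blocks, three clocks.
block_current = [6.0, 4.0]                        # amperes at full activity
activities = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]  # idle, then fully active
currents = [cycle_current(a, block_current) for a in activities]
voltages = simulate_voltage(currents, impulse_response=[0.01, 0.005], vdd=1.0)
print(voltages)
```

In a feedback arrangement, a controller could compare each computed voltage against a threshold and throttle or inject activity accordingly; that control policy is not modeled here.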

high-performance computer architecture | 2002

Memory latency-tolerance approaches for Itanium processors: out-of-order execution vs. speculative precomputation

Perry H. Wang; Hong Wang; Jamison D. Collins; Ed Grochowski; Ralph-Michael Kling; John Paul Shen

The performance of in-order execution Itanium™ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both OOO and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching for L1 misses, whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however, only SP is effective on delinquent loads found in loop control code.

international conference on computer design | 1989

Issues in the implementation of the i486 cache and bus

Ed Grochowski; K. Shoemaker

To meet its performance goal of executing most instructions in a single clock, the i486 microprocessor uses a cache memory that is integrated on the same silicon die as the processor. The integrated cache is capable of performing one memory read or write each clock cycle. The size and organization of the cache were selected based on available silicon area and on the results of trace-driven simulation. An external bus was devised featuring burst data transfers to quickly fill cache lines and provisions to ensure cache coherency.

international symposium on computer architecture | 2014

Improving the energy efficiency of big cores

Kenneth Czechowski; Victor W. Lee; Ed Grochowski; Ronny Ronen; Ronak Singhal; Richard W. Vuduc; Pradeep Dubey

Traditionally, architectural innovations designed to boost single-threaded performance incur overhead costs which significantly increase power consumption. In many cases the increase in power exceeds the improvement in performance, resulting in a net increase in energy consumption. Thus, it is reasonable to assume that modern attempts to improve single-threaded performance will have a negative impact on energy efficiency. This has led to the belief that “Big Cores” are inherently inefficient. To the contrary, we present a study which finds that the increased complexity of the core microarchitecture in recent generations of the Intel® Core™ processor has reduced both the time and energy required to run various workloads. Moreover, taking out the impact of process technology changes, our study still finds that the architecture and microarchitecture changes (such as the increase in SIMD width, the addition of the frontend caches, and the enhancement to the out-of-order execution engine) account for a 1.2x improvement in energy efficiency for these processors. This paper provides real-world examples of how architectural innovations can mitigate inefficiencies associated with “Big Cores” (for example, micro-op caches obviate the costly decode of complex x86 instructions), resulting in a core architecture that is both high performance and energy efficient. It also contributes to the understanding of how microarchitecture affects performance, power and energy efficiency by modeling the relationship between them.

high-performance computer architecture | 2007

Implications of Device Timing Variability on Full Chip Timing

Murali Annavaram; Ed Grochowski; Paul Reed

As process technologies continue to scale, the magnitude of within-die device parameter variations is expected to increase and may lead to significant timing variability. This paper presents a quantitative evaluation of how low-level device timing variations impact the timing at the functional block level. We evaluate two types of timing variations: random and systematic variations. The study introduces random and systematic timing variations to several functional blocks in the Intel® Core™ Duo microprocessor design database and measures the resulting timing margins. The primary conclusion of this research is that, as a result of combining two probability distributions (the distribution of the random variation and the distribution of path timing margins), functional block timing margins degrade non-linearly with increasing variability.

international parallel and distributed processing symposium | 2012

Performance Benefits of Heterogeneous Computing in HPC Workloads

Victor W. Lee; Ed Grochowski; Robert Geva

Chip multi-processors (CMPs) with increasing numbers of processor cores are now becoming widely available. To take advantage of many-core CMPs, applications must be parallelized. However, due to the nature of the algorithm or programming model, some parts of the application remain serial. According to Amdahl's law, the speedup of a parallel application is limited by the amount of serial execution it has. For a CMP with many cores, this can be a serious limitation. To take full advantage of the increasing number of cores, one must try to reduce the execution time of the serial portion of a parallel program. However, rewriting an application takes time, and often the return on the effort invested may not justify parallelizing every part of the program. Heterogeneous many-core CMP design is one possible solution to support massively parallel execution and to provide reasonable single-thread performance. In this paper, we use a simple spreadsheet model to evaluate homogeneous and heterogeneous CMP designs using execution profiles of real HPC applications. Evaluated on 12 parallel HPC applications, we show that heterogeneous CMPs can outperform homogeneous CMPs by up to 1.35× with an average speedup of 1.06× when both the heterogeneous CMPs and homogeneous CMPs are constrained to use the same power budget. Our study found that heterogeneous CMPs can provide a performance benefit even when the serial portion of execution is as little as 2% of total run time. This suggests heterogeneous computing can help mitigate the effect of leaving some portions of an application unparallelized because of return-on-investment concerns about programming effort.
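
The Amdahl's-law reasoning in this abstract can be sketched with a toy model. The code below is not the paper's spreadsheet model. It assumes that small cores have unit performance, that one big core runs `big_perf` times faster and costs the power of `big_cost` small cores, and that the big core also helps during parallel phases. All parameter names and values are hypothetical.

```python
def homogeneous_time(serial_fraction: float, n_small: int) -> float:
    """Normalized run time on n_small identical cores of unit performance (Amdahl's law)."""
    return serial_fraction + (1.0 - serial_fraction) / n_small


def heterogeneous_time(serial_fraction: float, n_small: int,
                       big_perf: float, big_cost: int) -> float:
    """Normalized run time when big_cost small cores are traded for one big core.

    Assumption: the serial phase runs on the big core alone, and the parallel
    phase uses the remaining small cores plus the big core.
    """
    remaining_small = n_small - big_cost
    if remaining_small < 0:
        raise ValueError("big core exceeds the power budget")
    parallel_throughput = remaining_small + big_perf
    return serial_fraction / big_perf + (1.0 - serial_fraction) / parallel_throughput


# Hypothetical configuration: budget of 16 small cores, 2% serial execution,
# one big core that is 2x faster and costs as much power as 4 small cores.
t_homo = homogeneous_time(0.02, 16)
t_hetero = heterogeneous_time(0.02, 16, big_perf=2.0, big_cost=4)
print(f"speedup of heterogeneous over homogeneous: {t_homo / t_hetero:.3f}x")
```

Varying `serial_fraction`, `big_perf`, and `big_cost` shows how sensitive the comparison is to these assumptions; the figures reported in the abstract come from measured execution profiles, not from this toy model.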

international symposium on physical design | 2008