
Publication


Featured research published by Michael Karl Gschwind.


IEEE Micro | 2006

Synergistic Processing in Cell's Multicore Architecture

Michael Karl Gschwind; Harm Peter Hofstee; Brian Flachs; M. Hopkins; Y. Watanabe; T. Yamazaki

Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism. The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principle of combining leading-edge architecture and compiler optimizations. These design decisions have enabled the Cell BE to deliver unprecedented supercomputer-class compute power for consumer applications.
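
To make the scalar-on-SIMD idea concrete, here is a minimal sketch (not from the paper) using GCC vector extensions as a portable stand-in for the SPU's actual intrinsics; on the SPU, a scalar value simply occupies the preferred slot of a 128-bit register:

```c
/* Sketch only: scalar and SIMD operations share one wide datapath.
   GCC vector extensions stand in for the SPU's spu_intrinsics.h. */
typedef float v4sf __attribute__((vector_size(16)));

/* SIMD path: four single-precision multiply-adds per operation. */
static inline v4sf simd_madd(v4sf a, v4sf b, v4sf c) {
    return a * b + c;
}

/* Scalar path on the same datapath: the scalar lives in element 0
   (the "preferred slot"); the other elements are don't-cares. */
static inline float scalar_madd(float a, float b, float c) {
    v4sf va = {a, 0, 0, 0}, vb = {b, 0, 0, 0}, vc = {c, 0, 0, 0};
    v4sf r = simd_madd(va, vb, vc);
    return r[0];                        /* extract the preferred slot */
}
```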


international symposium on microarchitecture | 2012

The IBM Blue Gene/Q Compute Chip

Ruud A. Haring; Martin Ohmacht; Thomas W. Fox; Michael Karl Gschwind; David L. Satterfield; Krishnan Sugavanam; Paul W. Coteus; Philip Heidelberger; Matthias A. Blumrich; Robert W. Wisniewski; Alan Gara; George Liang-Tai Chiu; Peter A. Boyle; Norman H. Christ; Changhoan Kim

Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space-efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architecture and design of the Compute chip, which combines processors, memory, and communication functions on a single chip.


international conference on parallel architectures and compilation techniques | 2005

Optimizing Compiler for the CELL Processor

Alexandre E. Eichenberger; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; J.C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; Peng Zhao; Michael Karl Gschwind

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process anywhere from two double-precision floating-point values to sixteen byte-wide values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code over the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar code on SIMD units, automatic generation of SIMD code, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
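
The automatic-SIMD-generation technique can be pictured with a minimal sketch, again using GCC vector extensions; the function names are hypothetical and the vectorized version assumes 16-byte-aligned arrays:

```c
#include <stddef.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Original scalar loop: one element per iteration. */
void saxpy_scalar(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Hand-written equivalent of compiler-generated SIMD code: four
   elements per iteration plus a scalar epilogue. Assumes x and y
   are 16-byte aligned. */
void saxpy_simd(float *y, const float *x, float a, size_t n) {
    v4sf va = {a, a, a, a};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf vx = *(const v4sf *)&x[i];
        v4sf vy = *(v4sf *)&y[i];
        *(v4sf *)&y[i] = va * vx + vy;
    }
    for (; i < n; i++)                      /* leftover elements */
        y[i] = a * x[i] + y[i];
}
```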


IEEE Transactions on Computers | 2001

Dynamic binary translation and optimization

Kemal Ebcioglu; Erik R. Altman; Michael Karl Gschwind; Sumedh W. Sathaye

We describe a VLIW architecture designed specifically as a target for dynamic compilation of an existing instruction set architecture. This design approach offers the simplicity and high performance of statically scheduled architectures, achieves compatibility with an established architecture, and makes use of dynamic adaptation. Thus, the original architecture is implemented using dynamic compilation, a process we refer to as DAISY (Dynamically Architected Instruction Set from Yorktown). The dynamic compiler exploits runtime profile information to optimize translations so as to extract instruction level parallelism. This paper reports different design trade-offs in the DAISY system and their impact on final system performance. The results show high degrees of instruction parallelism with reasonable translation overhead and memory usage.
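
A minimal sketch of the control loop at the heart of such a dynamic binary translator (not DAISY's actual code; all types and helper functions here are hypothetical):

```c
#include <stdint.h>

typedef uint64_t guest_addr;
typedef guest_addr (*translated_fn)(void *cpu_state);

/* Hypothetical translator-runtime helpers. */
translated_fn tcache_lookup(guest_addr pc);       /* NULL if untranslated */
translated_fn translate_fragment(guest_addr pc);  /* emit native VLIW code */
void tcache_insert(guest_addr pc, translated_fn code);

/* Translate guest fragments on first use, cache them, and chain
   execution from one translated fragment to the next. */
void dispatch_loop(void *cpu_state, guest_addr pc) {
    for (;;) {
        translated_fn code = tcache_lookup(pc);
        if (!code) {                      /* first execution: translate */
            code = translate_fragment(pc);
            tcache_insert(pc, code);
        }
        pc = code(cpu_state);   /* run fragment; returns next guest PC */
    }
}
```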


international symposium on microarchitecture | 2002

Optimizing pipelines for power and performance

Viji Srinivasan; David M. Brooks; Michael Karl Gschwind; Pradip Bose; Victor Zyuban; Philip N. Strenski; Philip G. Emma

During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
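
A toy numerical sketch of this style of optimization, with all constants invented rather than taken from the paper's energy models, so the printed optimum is illustrative only:

```c
/* Sweep pipeline depth: clock period = logic-per-stage + latch overhead
   (in FO4), stall cycles grow with depth, power grows with latch count
   and frequency; report the depth maximizing a BIPS^3/W-style metric. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double total_logic_fo4  = 160.0; /* total logic depth (invented) */
    const double latch_fo4        = 3.0;   /* per-stage latch overhead     */
    const double hazard_per_stage = 0.03;  /* extra CPI per pipeline stage */
    double best_metric = 0.0, best_period = 0.0;
    int best_stages = 0;

    for (int stages = 4; stages <= 50; stages++) {
        double period = total_logic_fo4 / stages + latch_fo4; /* FO4/cycle */
        double freq   = 1.0 / period;
        double cpi    = 1.0 + hazard_per_stage * stages;
        double bips   = freq / cpi;                 /* performance         */
        double power  = stages * freq;              /* latch/clock power   */
        double metric = pow(bips, 3.0) / power;     /* BIPS^3 per watt     */
        if (metric > best_metric) {
            best_metric = metric;
            best_period = period;
            best_stages = stages;
        }
    }
    printf("optimum: %d stages, clock period %.1f FO4\n",
           best_stages, best_period);
    return 0;
}
```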


IEEE Computer | 2000

Dynamic and transparent binary translation

Michael Karl Gschwind; Erik R. Altman; Sumedh W. Sathaye; Paul Ledak; David Appenzeller

High-frequency design and instruction-level parallelism (ILP) are important for high-performance microprocessor implementations. The Binary-translation Optimized Architecture (BOA), an implementation of the IBM PowerPC family, combines binary translation with dynamic optimization. The authors use these techniques to simplify the hardware by bridging a semantic gap between the PowerPC's reduced instruction set and even simpler hardware primitives. Processors like the Pentium Pro and Power4 have tried to achieve high frequency and ILP by implementing a cracking scheme in hardware: an instruction decoder in the pipeline generates multiple micro-operations that can then be scheduled out of order. BOA relies on an alternative software approach to decompose complex operations and to generate schedules, and thus offers significant advantages over purely static compilation approaches. This article explains BOA's translation strategy, detailing system issues and architecture implementation.
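
A minimal sketch of software cracking (not BOA's actual code; the emit_* helpers are hypothetical), using the PowerPC load-with-update instruction as an example:

```c
#include <stdint.h>

/* Hypothetical micro-op emitters of the software translator. */
void emit_load(int rt, int ra, int16_t disp);  /* rt = MEM[ra + disp] */
void emit_addi(int rt, int ra, int16_t imm);   /* rt = ra + imm       */

/* Crack  lwzu rt, disp(ra)  (load word and update) into two simple
   micro-ops:  load rt, disp(ra)  followed by  addi ra, ra, disp.  */
void crack_lwzu(int rt, int ra, int16_t disp) {
    emit_load(rt, ra, disp);    /* micro-op 1: the load itself    */
    emit_addi(ra, ra, disp);    /* micro-op 2: the address update */
}
```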


computing frontiers | 2006

Chip multiprocessing and the cell broadband engine

Michael Karl Gschwind

Chip multiprocessing has become an exciting new direction for system designers to deliver increased performance by exploiting CMOS scaling. We discuss key design decisions facing the system architect of a chip multiprocessor and describe how these choices were made in the design of the Cell Broadband Engine. An important decision is whether to base system performance on thread-level parallelism alone, or to complement thread-level parallelism with other forms of parallelism. Depending on workload characteristics, providing parallelism at the processor core level may increase overall system efficiency. Parallelism is also key to utilizing available memory bandwidth more efficiently, by overlapping and interleaving multiple accesses to system memory. By interleaving the access streams of multiple threads, memory-level parallelism can be increased to allow better memory interface utilization. In addition, compute-transfer parallelism (CTP) offers a new form of parallelism to initiate memory transfers under software control without stalling the requesting thread. We describe how the Cell Broadband Engine uses parallelism at all levels of the system abstraction to deliver a quantum leap in application performance, and how the Cell Synergistic Memory Flow engine exploits compute-transfer parallelism by providing efficient block transfer capabilities.
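
A minimal sketch of compute-transfer parallelism via double-buffered DMA on an SPE, using the MFC intrinsics from spu_mfcio.h; the chunk size, addresses, and process_chunk are hypothetical:

```c
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (invented) */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(char *data, int len);   /* hypothetical compute step */

/* Stream nchunks of data from effective address ea through the SPE's
   local store, overlapping each DMA with computation on the previous
   chunk. The buffer index doubles as the MFC tag-group id. */
void stream(uint64_t ea, int nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);          /* prime buffer 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                          /* start next DMA */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1u << cur);              /* wait for current */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);  /* compute while next DMA runs */
        cur = nxt;
    }
}
```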


International Journal of Parallel Programming | 2007

The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

Michael Karl Gschwind

As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores achieve this performance by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to be memory-latency tolerant using software pipelining techniques on the SPE.
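
A minimal sketch (not the paper's example) of software pipelining to tolerate load latency; heavy_op stands in for a hypothetical long-latency computation:

```c
#include <stddef.h>

float heavy_op(float x);   /* hypothetical long-latency computation */

/* The load for iteration i+1 is issued before the computation on
   iteration i, so memory access and computation overlap on an
   in-order core such as the SPE. */
void pipelined(float *dst, const float *src, size_t n) {
    if (n == 0) return;
    float cur = src[0];                  /* prologue: first load        */
    for (size_t i = 0; i + 1 < n; i++) {
        float nxt = src[i + 1];          /* issue next load early ...   */
        dst[i] = heavy_op(cur);          /* ... while this one computes */
        cur = nxt;
    }
    dst[n - 1] = heavy_op(cur);          /* epilogue: last iteration    */
}
```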


Proceedings of the Seventh International Workshop on Hardware/Software Codesign (CODES'99) (IEEE Cat. No.99TH8450) | 1999

Instruction set selection for ASIP design

Michael Karl Gschwind

We describe an approach to application-specific processor design based on an extendible microprocessor core. Core-based design allows application-specific instruction processors to be derived from a common base architecture at low non-recurring engineering cost. The result of this application-specific customization of a common base architecture is a set of families of related and largely compatible processors. These families can share support tools and even binary-compatible code written for the common base architecture, while critical code portions are customized using the application-specific instruction-set extensions. We describe a hardware/software co-design methodology that can be used with this design approach. The presented approach uses the processor core to allow early evaluation of ASIP design options using rapid prototyping techniques. We demonstrate this approach with two case studies, based on the implementation and evaluation of application-specific processor extensions for Prolog program execution and for memory prefetching in vector and matrix operations.
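
A minimal sketch of how such an instruction-set extension might surface in source code; the builtin, feature macro, and tag layout are all hypothetical, loosely inspired by the Prolog case study:

```c
#include <stdint.h>

#define TAG_BITS 3
#define TAG_MASK ((1u << TAG_BITS) - 1)

/* Extract the type tag of a tagged Prolog term. When the core has
   been extended with a custom tag-check instruction (exposed through
   a hypothetical compiler builtin), use it; otherwise fall back to
   the base architecture's mask sequence. */
static inline unsigned term_tag(uintptr_t term) {
#ifdef __ASIP_HAS_TAGCHK                    /* hypothetical feature macro */
    return __builtin_asip_tagchk(term);     /* single custom instruction  */
#else
    return (unsigned)(term & TAG_MASK);     /* portable fallback          */
#endif
}
```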


IEEE Transactions on Computers | 2004

Integrated analysis of power and performance for pipelined microprocessors

Victor Zyuban; David M. Brooks; Viji Srinivasan; Michael Karl Gschwind; Pradip Bose; Philip N. Strenski; Philip G. Emma

Choosing the pipeline depth of a microprocessor is one of the most critical design decisions that an architect must make in the concept phase of a microprocessor design. To be successful in today's cost/performance marketplace, modern CPU designs must effectively balance both performance and power dissipation. The choice of pipeline depth and target clock frequency has a critical impact on both of these metrics. We describe an optimization methodology based on both analytical models and detailed simulations for power and performance as a function of pipeline depth. Our results for a set of SPEC2000 applications show that, when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of our energy models. Finally, we discuss the potential risks in design quality for overly aggressive or conservative choices of pipeline depth.
