Scott A. Mahlke | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Scott A. Mahlke is active.

Explore More

Publication

Featured researches published by Scott A. Mahlke.

international symposium on computer architecture | 1991

IMPACT: an architectural framework for multiple-instruction-issue processors

Pohua P. Chang; Scott A. Mahlke; William Y. Chen; Nancy J. Warter; Wen-mei W. Hwu

The performance of multiple-instruction-issue processors can be severely limited by the compiler’s ability to generate efficient code for concurrent hardware. In the IMPACT project, we have developed IMPACT-I, a highly optimizing C compiler to exploit instruction level concurrency. The optimization capabilities of the IMPACT-I C compiler are summarized in this paper. Using the IMPACT-I C compiler, we ran experiments to analyze the performance of multiple-instruction-issue processors executing some important non-numerical programs. The

Software - Practice and Experience | 1991

Using profile information to assist classic code optimizations

Pohua P. Chang; Scott A. Mahlke; Wen-mei W. Hwu

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile‐based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile‐based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile‐based code optimizer significantly improves the performance of production C programs that have already been optimized by a high‐quality global code optimizer.

international symposium on microarchitecture | 2003

Processor acceleration through automated instruction set customization

Nathan Clark; Hongtao Zhong; Scott A. Mahlke

Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or co-processors), and the corresponding instructions, are added to a baseline processor to meet the critical computational demands of a target application. The central challenge with this approach is the large degree of human effort required to identify and create the custom hardware units, as well as porting the application to the extended processor. In this paper, we present the design of a system to automate the instruction set customization process. A dataflow graph design space exploration engine efficiently identifies profitable computation subgraphs from which to create custom hardware, without artificially constraining their size or shape. The system also contains a compiler subgraph matching framework that identifies opportunities to exploit and generalize the hardware to support more computation graphs. We demonstrate the effectiveness of this system across a range of application domains and study the applicability of the custom hardware across the domain.

programming language design and implementation | 2008

Orchestrating the execution of stream programs on multicore platforms

Manjunath Kudlur; Scott A. Mahlke

While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.

international symposium on computer architecture | 1995

A comparison of full and partial predicated execution support for ILP processors

Scott A. Mahlke; Richard E. Hank; James E. McCormick; David I. August; Wen-mei W. Hwu

One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 30% improvement.

international symposium on microarchitecture | 2004

Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization

Nathan Clark; Manjunath Kudlur; Hyunchul Park; Scott A. Mahlke; Krisztian Flautner

Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial amount of engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy to transparent customization of the core computation capabilities of the processor without changing its instruction set. A congurable array of function units is added to the baseline processor that enables the acceleration of a wide range of data flow subgraphs. To exploit the array, the microarchitecture performs subgraph identification at run-time, replacing them with new microcode instructions to configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.

high-performance computer architecture | 2006

BulletProof: a defect-tolerant CMP switch architecture

Kypros Constantinides; Stephen M. Plaza; Jason A. Blome; Bin Zhang; Valeria Bertacco; Scott A. Mahlke; Todd M. Austin; Michael Orshansky

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naive triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.

international symposium on computer architecture | 2005

An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Nathan Clark; Jason A. Blome; Michael L. Chu; Scott A. Mahlke; Stuart David Biles; Krisztian Flautner

Instruction set customization is an effective way to improve processor performance. Critical portions of application data-flow graphs are collapsed for accelerated execution on specialized hardware. Collapsing dataflow subgraphs will compress the latency along critical paths and reduces the number of intermediate results stored in the register file. While custom instructions can be effective, the time and cost of designing a new processor for each application is immense. To overcome this roadblock, this paper proposes a flexible architectural framework to transparently integrate custom instructions into a general-purpose processor. Hardware accelerators are added to the processor to execute the collapsed subgraphs. A simple microarchitectural interface is provided to support a plug-and-play model for integrating a wide range of accelerators into a pre-designed and verified processor core. The accelerators are exploited using an approach of static identification and dynamic realization. The compiler is responsible for identifying profitable subgraphs, while the hardware handles discovery, mapping, and execution of compatible subgraphs. This paper presents the design of a plug-and-play transparent accelerator system arid evaluates the cost/performance implications of the design.

international symposium on microarchitecture | 2007

Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures

Michael L. Chu; Rajiv A. Ravindran; Scott A. Mahlke

The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51 % versus data-incognizant partitioning, and has an overall speedup average of 30% over a single core processor.

architectural support for programming languages and operating systems | 1994

Dynamic memory disambiguation using the memory conflict buffer

David M. Gallagher; William Y. Chen; Scott A. Mahlke; John C. Gyllenhaal; Wen-mei W. Hwu

To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This paper introduces a simple hardware mechanism, referred to as the memory conflict buffer, which facilitates static code scheduling in the presence of memory store/load dependences. Correct program execution is ensured by the memory conflict buffer and repair code provided by the compiler. With this addition, significant speedup over an aggressive code scheduling model can be achieved for both non-numerical and numerical programs.

Explore More