
Publication


Featured research published by Mary W. Hall.


IEEE Computer | 1996

Maximizing multiprocessor performance with the SUIF compiler

Mary W. Hall; Jennifer-Ann M. Anderson; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Edouard Bugnion; Monica S. Lam

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three fourths of the NAS and SPECfp95 benchmark programs.
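
To make the loop-level parallelization concrete, here is a minimal sketch of the kind of array-based loop nest such a compiler targets. The OpenMP pragma stands in for the compiler's automatically generated parallel code; SUIF emitted calls to its own runtime rather than OpenMP, so this is illustration only.

```c
/* Illustrative only: the kind of array-based loop a parallelizing
 * compiler such as SUIF targets. The OpenMP pragma stands in for the
 * compiler's automatically generated parallel code. */
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N][N], b[N][N];

    /* Each iteration of the i-loop touches disjoint rows of a, so a
     * compiler can prove the loop parallel and distribute iterations
     * across processors. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0.5 * (b[i][j] + a[i][j]);

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}
```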


SIGPLAN Notices | 1994

SUIF: an infrastructure for research on parallelizing and optimizing compilers

Robert P. Wilson; Robert S. French; Christopher S. Wilson; Saman P. Amarasinghe; Jennifer-Ann M. Anderson; Steven W. K. Tjiang; Shih-Wei Liao; Chau-Wen Tseng; Mary W. Hall; Monica S. Lam; John L. Hennessy

Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts.

SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use.

Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper.
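
A minimal sketch of the kernel-plus-passes organization the abstract describes, under the assumption that a pass is just a function over a kernel-defined IR. All type and function names here (Instr, mk_const, fold) are hypothetical stand-ins, not SUIF's actual API.

```c
/* Hypothetical sketch of a kernel-plus-passes design in the spirit of
 * the abstract. The "kernel" defines the IR and its constructors, and
 * each pass is a function over that IR. */
#include <stdio.h>

typedef enum { OP_CONST, OP_ADD } OpKind;

typedef struct Instr {
    OpKind kind;
    int value;               /* payload for OP_CONST */
    struct Instr *lhs, *rhs; /* operands for OP_ADD */
} Instr;

/* Kernel: constructor for IR nodes (toy fixed-size pool). */
static Instr *mk_const(int v) {
    static Instr pool[16];
    static int n;
    Instr *i = &pool[n++];
    i->kind = OP_CONST;
    i->value = v;
    return i;
}

/* A pass built on the kernel: constant folding over an expression. */
static int fold(const Instr *i) {
    if (i->kind == OP_CONST) return i->value;
    return fold(i->lhs) + fold(i->rhs);
}

int main(void) {
    Instr add = { OP_ADD, 0, mk_const(2), mk_const(3) };
    printf("folded: %d\n", fold(&add)); /* prints 5 */
    return 0;
}
```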


International Conference on Supercomputing | 2002

The architecture of the DIVA processing-in-memory chip

Jeffrey Draper; Jacqueline Chame; Mary W. Hall; Craig S. Steele; Tim Barrett; Jeff LaCoss; John J. Granacki; Jaewook Shin; Chun Chen; Chang Woo Kang; Ihn Kim; Gokhan Daglikoca

The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory (PIM) chips as smart-memory co-processors to a conventional microprocessor. We have recently fabricated prototype DIVA PIMs. These chips represent the first smart-memory devices designed to support virtual addressing and capable of executing multiple threads of control. In this paper, we describe the prototype PIM architecture. We emphasize three unique features of DIVA PIMs, namely, the memory interface to the host processor, the 256-bit wide datapaths for exploiting on-chip bandwidth, and the address translation unit. We present detailed simulation results on eight benchmark applications. When just a single PIM chip is used, we achieve an average speedup of 3.3X over host-only execution, due to lower memory stall times and increased fine-grain parallelism. These 1-PIM results suggest that a PIM-based architecture with many such chips yields significantly higher performance than a multiprocessor of a similar scale and at a much reduced hardware cost.


Conference on High Performance Computing (Supercomputing) | 1995

Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler

Mary W. Hall; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Monica S. Lam

This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array and scalar variables, and symbolic analysis of array subscripts. The interprocedural analysis framework is designed to provide analysis results nearly as precise as full inlining but without its associated costs. Experimentation with this system shows that it is capable of detecting coarser granularity of parallelism than previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of code, frequently requiring modifications to array data structures such as privatization and reduction transformations. Measurements from several standard benchmark suites demonstrate that an integrated combination of interprocedural analyses can substantially advance the capability of automatic parallelization technology.
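
The two transformations named above, scalar privatization and reduction recognition, can be illustrated on a small loop. The OpenMP clauses below stand in for what the interprocedural compiler proves and applies automatically; this is a sketch, not SUIF output.

```c
/* Illustrative sketch of privatization and reduction recognition.
 * The OpenMP annotations stand in for what the compiler derives. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = i * 0.001;

    /* t is privatizable: every iteration writes it before reading it,
     * so each processor can keep its own copy. sum is a reduction:
     * the updates commute, so per-processor partial sums can be
     * combined after the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double t = x[i] * x[i]; /* private temporary */
        sum += t;               /* recognized reduction */
    }

    printf("sum = %f\n", sum);
    return 0;
}
```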


International Parallel and Distributed Processing Symposium | 2009

A scalable auto-tuning framework for compiler optimization

Ananta Tiwari; Chun Chen; Jacqueline Chame; Mary W. Hall; Jeffrey K. Hollingsworth

We describe a scalable and general-purpose framework for auto-tuning compiler-generated code. We combine Active Harmony's parallel search backend with the CHiLL compiler transformation framework to generate in parallel a set of alternative implementations of computation kernels and automatically select the best-performing one. The resulting system achieves performance of compiler-generated code comparable to the fully automated version of the ATLAS library for the tested kernels. Performance for various kernels is 1.4 to 3.6 times faster than the native Intel compiler without search. Our search algorithm simultaneously evaluates different combinations of compiler optimizations and converges to solutions in only a few tens of search steps.
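
A minimal serial sketch of the empirical-search idea: generate a few variants of a kernel (here, unroll factors, a hypothetical stand-in for CHiLL's transformation space), time each, and keep the fastest. The actual framework evaluates candidates in parallel under Active Harmony's search algorithms over a far richer space.

```c
/* Minimal sketch of empirical auto-tuning: time a kernel at several
 * tuning-parameter values and keep the fastest. Serial, for
 * exposition only. */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static double a[N];

/* One tunable variant: array sum with a given unroll factor. */
static double run_variant(int unroll) {
    double s = 0.0;
    if (unroll == 4)
        for (int i = 0; i < N; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    else if (unroll == 2)
        for (int i = 0; i < N; i += 2)
            s += a[i] + a[i + 1];
    else
        for (int i = 0; i < N; i++)
            s += a[i];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Empirical search: evaluate each candidate and keep the best. */
    int factors[] = { 1, 2, 4 };
    int best = 1;
    double best_time = 1e30;
    for (int k = 0; k < 3; k++) {
        clock_t t0 = clock();
        volatile double s = run_variant(factors[k]);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        (void)s;
        if (dt < best_time) { best_time = dt; best = factors[k]; }
    }
    printf("best unroll factor: %d (%.4f s)\n", best, best_time);
    return 0;
}
```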


Symposium on Code Generation and Optimization | 2005

Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Chun Chen; Jacqueline Chame; Mary W. Hall

This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, matrix multiply and Jacobi relaxation. For matrix multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve performance comparable to the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.
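
The matrix-multiply case study centers on loop tiling for cache. A hedged sketch of the transformed code: TILE is the kind of parameter a compiler model would narrow to a few candidates and empirical search would then select; the value 32 is an arbitrary illustrative choice, not a result from the paper.

```c
/* Sketch of a tiled matrix multiply of the kind the case study tunes.
 * TILE is the tunable parameter; 32 is an illustrative value only. */
#include <stdio.h>

#define N 256
#define TILE 32

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* Tiling keeps a TILE x TILE block of the iteration space's data
     * resident in cache across the inner loops. */
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int k = kk; k < kk + TILE; k++) {
                    double aik = A[i][k];
                    for (int j = 0; j < N; j++)
                        C[i][j] += aik * B[k][j];
                }

    printf("C[0][0] = %f\n", C[0][0]); /* expect N * 1.0 * 2.0 = 512 */
    return 0;
}
```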


Computer Languages | 1993

A methodology for procedure cloning

Keith D. Cooper; Mary W. Hall; Ken Kennedy

Procedure cloning is an interprocedural transformation where the compiler creates specialized copies of procedure bodies. The compiler divides incoming calls between the original procedure and its copies. By carefully partitioning the calls, the compiler ensures that each clone inherits an environment that allows for better code optimization. This paper presents a three-phase algorithm for deciding when to clone a procedure. The algorithm seeks to avoid unnecessary code growth by considering how the information exposed by cloning will be used during optimization. We present a set of assumptions that bound both the algorithm's running time and code expansion.
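
A hand-written before-and-after illustration of the transformation: a procedure whose behavior depends on an incoming flag, and two clones specialized for call sites that always pass the same value, letting the optimizer delete the branch and the dead arm. A cloning compiler performs this split automatically.

```c
/* Illustration of procedure cloning: specialized copies of a procedure
 * for different classes of call sites. Hand-written to show the effect. */
#include <stdio.h>

/* Original: behavior depends on the runtime flag. */
static double scale(double x, int use_double) {
    return use_double ? x * 2.0 : x * 0.5;
}

/* Clone for call sites that always pass use_double == 1: the branch
 * and the dead arm disappear. */
static double scale_double(double x) { return x * 2.0; }

/* Clone for call sites that always pass use_double == 0. */
static double scale_half(double x) { return x * 0.5; }

int main(void) {
    printf("%f %f %f\n", scale(3.0, 1), scale_double(3.0), scale_half(3.0));
    return 0;
}
```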


Programming Language Design and Implementation | 2002

A compiler approach to fast hardware design space exploration in FPGA-based systems

Byoungro So; Mary W. Hall; Pedro C. Diniz

The current practice of mapping computations to custom hardware implementations requires programmers to assume the role of hardware designers. In tuning the performance of their hardware implementation, designers manually apply loop transformations such as loop unrolling. For example, loop unrolling is used to expose instruction-level parallelism at the expense of more hardware resources for concurrent operator evaluation. Because unrolling also increases the amount of data a computation requires, too much unrolling can lead to a memory-bound implementation where resources are idle. To negotiate inherent hardware space-time trade-offs, designers must engage in an iterative refinement cycle, at each step manually applying transformations and evaluating their impact. This process is not only error-prone and tedious but also prohibitively expensive given the large search spaces and long synthesis times. This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in application-specific hardware designs. Our approach uses synthesis estimation techniques to quantitatively evaluate alternate designs for a loop nest computation. We have implemented this design space exploration algorithm in the context of a compilation and synthesis system called DEFACTO, and present results of this implementation on five multimedia kernels. Our algorithm derives an implementation that closely matches the performance of the fastest design in the design space, and among implementations with comparable performance, selects the smallest design. We search on average only 0.3% of the design space. This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer.
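
The unrolling trade-off the paper explores can be seen on a small loop: the unrolled body exposes independent operations that synthesis could evaluate concurrently, at the cost of replicated operator hardware and higher data bandwidth. Plain C, for exposition only; DEFACTO operates on the compiler's internal representation.

```c
/* Illustration of the space-time trade-off behind unrolling for
 * hardware: the unrolled body exposes four independent multiply-adds
 * that could be evaluated concurrently, at the cost of four times the
 * operator resources. */
#include <stdio.h>

#define N 16

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }

    /* Rolled: one multiply-add per loop iteration. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] + 1.0;

    /* Unrolled by 4: four independent multiply-adds per iteration,
     * more parallelism but more hardware and data demand. */
    for (int i = 0; i < N; i += 4) {
        c[i]     = a[i]     * b[i]     + 1.0;
        c[i + 1] = a[i + 1] * b[i + 1] + 1.0;
        c[i + 2] = a[i + 2] * b[i + 2] + 1.0;
        c[i + 3] = a[i + 3] * b[i + 3] + 1.0;
    }

    printf("c[5] = %f\n", c[5]);
    return 0;
}
```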


Symposium on Code Generation and Optimization | 2005

Superword-Level Parallelism in the Presence of Control Flow

Jaewook Shin; Mary W. Hall; Jacqueline Chame

In this paper, we describe how to extend the concept of superword-level parallelization (SLP), used for multimedia extension architectures, so that it can be applied in the presence of control flow constructs. Superword-level parallelization involves identifying scalar instructions in a large basic block that perform the same operation, and, if dependences do not prevent it, combining them into a superword operation on a multi-word object. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. We describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We discuss other extensions to SLP to address common features of real multimedia codes. We present automatically-generated performance results on 8 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.97X to 15.07X as compared to both sequential execution and SLP alone.
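
The key enabling step is if-conversion: rewriting the branch as a select so every iteration executes the same operation sequence and adjacent iterations can be packed into superword operations. A minimal sketch in C; a real SLP compiler would apply this on its IR and emit compare-and-blend SIMD instructions.

```c
/* Illustration of the if-conversion step that makes SLP applicable in
 * the presence of control flow: the branch becomes a select, so every
 * iteration performs the same operations and adjacent iterations can
 * be packed into superword (SIMD) operations. */
#include <stdio.h>

#define N 16

int main(void) {
    int a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = i - 8;

    /* Original form: the branch blocks packing across iterations. */
    for (int i = 0; i < N; i++) {
        if (a[i] < 0) b[i] = -a[i];
        else          b[i] = a[i];
    }

    /* If-converted form: a select in every lane, which a compiler can
     * map to a compare plus blend over multi-word registers. */
    for (int i = 0; i < N; i++)
        b[i] = (a[i] < 0) ? -a[i] : a[i];

    printf("b[0] = %d, b[15] = %d\n", b[0], b[15]);
    return 0;
}
```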


Proceedings of the IEEE | 1993

The ParaScope parallel programming environment

Keith D. Cooper; Mary W. Hall; Robert T. Hood; Ken Kennedy; Kathryn S. McKinley; John M. Mellor-Crummey; Linda Torczon; Scott K. Warren

The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, is described. It includes a collection of tools that use global program analysis to help users develop and debug parallel programs. The focus is on ParaScope's compilation system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. A project aimed at extending ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language for use with both distributed-memory and shared-memory parallel computers, is described.
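
The timing-dependent errors the debugging system reports can be shown with a one-line example: an unsynchronized read-modify-write of shared state inside a parallel loop. OpenMP is used below purely for brevity; ParaScope itself targeted shared-memory Fortran programs.

```c
/* Illustration of the kind of data race such a debugging system is
 * built to detect: an unsynchronized read-modify-write of shared state
 * in a parallel loop. Compiled with OpenMP enabled, the final count is
 * timing-dependent and usually less than 100000. */
#include <stdio.h>

int main(void) {
    int count = 0;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        count++;   /* race: concurrent increments can be lost */

    /* A correct version would use an atomic update or a reduction. */
    printf("count = %d (expected 100000)\n", count);
    return 0;
}
```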

Collaboration


Dive into Mary W. Hall's collaborations.

Top Co-Authors

Jacqueline Chame, University of Southern California
Pedro C. Diniz, University of Southern California
Byoungro So, University of Southern California
Saman P. Amarasinghe, Massachusetts Institute of Technology
Heidi E. Ziegler, University of Southern California
Jaewook Shin, Argonne National Laboratory
Robert F. Lucas, University of Southern California