Publication


Featured research published by Shih-Wei Liao.


IEEE Computer | 1996

Maximizing multiprocessor performance with the SUIF compiler

Mary W. Hall; Jennifer-Ann M. Anderson; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Edouard Bugnion; Monica S. Lam

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three-fourths of the NAS and SPECfp95 benchmark programs.
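
The following minimal sketch (invented for illustration, not taken from the paper) shows the kind of array-based loop nest such a parallelizer targets; the OpenMP pragma stands in for the parallel code a SUIF-style compiler would generate after proving the iterations independent.

    /* Hypothetical array-based numerical loop. No iteration reads data
       written by another, which is the dependence freedom an automatic
       parallelizer must prove before running the outer loop across
       processors. OpenMP here is illustrative, not the paper's output. */
    #define N 1024

    void scale_add(double a[N][N], const double b[N][N], double s)
    {
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += s * b[i][j];
    }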


ACM SIGPLAN Notices | 1994

SUIF: an infrastructure for research on parallelizing and optimizing compilers

Robert P. Wilson; Robert S. French; Christopher S. Wilson; Saman P. Amarasinghe; Jennifer-Ann M. Anderson; Steven W. K. Tjiang; Shih-Wei Liao; Chau-Wen Tseng; Mary W. Hall; Monica S. Lam; John L. Hennessy

Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts.

SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use.

Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper.
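
As a loose sketch of the kernel-plus-toolkit structure described above (all types and functions here are hypothetical; SUIF's real interfaces differ), a pass is essentially a function over the kernel's intermediate representation:

    /* Hypothetical kernel-and-pass skeleton; not the SUIF API. The
       "kernel" defines the IR node type and how passes are invoked;
       a "pass" is a function that walks and inspects the IR. */
    #include <stdio.h>

    typedef enum { IR_LOOP, IR_ASSIGN, IR_CALL } ir_kind;

    typedef struct ir_node {
        ir_kind kind;
        struct ir_node *next;     /* kernel-defined linkage */
    } ir_node;

    typedef void (*compiler_pass)(ir_node *program);

    static void count_loops_pass(ir_node *program)
    {
        int loops = 0;
        for (ir_node *n = program; n != NULL; n = n->next)
            if (n->kind == IR_LOOP)
                loops++;
        printf("loops found: %d\n", loops);
    }

    static void run_pass(compiler_pass pass, ir_node *program)
    {
        pass(program);            /* the kernel structures how passes run */
    }

    int main(void)
    {
        ir_node assign = { IR_ASSIGN, NULL };
        ir_node loop   = { IR_LOOP, &assign };
        run_pass(count_loops_pass, &loop);
        return 0;
    }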


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1999

SUIF Explorer: an interactive and interprocedural parallelizer

Shih-Wei Liao; Amer Diwan; Robert P. Bosch; Anwar M. Ghuloum; Monica S. Lam

The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques.

This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
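
A small invented example of the privatization the case studies mention: the scalar t is written before it is read in every iteration, so each processor can keep a private copy and the loop parallelizes.

    /* Hypothetical privatization example (not from the paper): 't' is
       written before it is read in each iteration, so privatizing it
       removes the only cross-iteration dependence. */
    #define N 1000

    void smooth(const double x[N], double y[N])
    {
        double t;
    #pragma omp parallel for private(t)
        for (int i = 1; i < N - 1; i++) {
            t = 0.5 * (x[i - 1] + x[i + 1]);  /* write before read */
            y[i] = t * t;
        }
    }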


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2001

Blocking and array contraction across arbitrarily nested loops using affine partitioning

Amy W. Lim; Shih-Wei Liao; Monica S. Lam

Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning. Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible. By bringing the full generality of affine partitioning to bear on the problem, our locality algorithm can find more contractable arrays than previously possible. This paper also generalizes the concept of blocking and shows that affine partitioning allows the benefits of blocking to be realized in arbitrarily nested loops. Experimental results on a number of benchmarks and a complete multigrid application in aeronautics indicate that affine partitioning is effective in practice.
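
An invented before/after sketch of array contraction (under the loop fusion such frameworks perform): once the producer and consumer loops are fused, each t[i] lives within a single iteration, so the temporary array contracts to a scalar.

    /* Hypothetical array-contraction example, not from the paper.
       Before: an N-element temporary carries values between loops.
       After fusion: each element is produced and consumed in the
       same iteration, so the array contracts to one scalar. */
    #define N 4096

    void before(const double a[N], double b[N])
    {
        double t[N];                    /* N-element temporary */
        for (int i = 0; i < N; i++)
            t[i] = 2.0 * a[i];
        for (int i = 0; i < N; i++)
            b[i] = t[i] + 1.0;
    }

    void after(const double a[N], double b[N])
    {
        for (int i = 0; i < N; i++) {
            double t = 2.0 * a[i];      /* contracted to a scalar */
            b[i] = t + 1.0;
        }
    }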


Conference on High Performance Computing (Supercomputing) | 1995

Detecting Coarse-Grain Parallelism Using an Interprocedural Parallelizing Compiler

Mary W. Hall; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Monica S. Lam

This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array and scalar variables, and symbolic analysis of array subscripts. The interprocedural analysis framework is designed to provide analysis results nearly as precise as full inlining but without its associated costs. Experimentation with this system shows that it is capable of detecting coarser granularity of parallelism than previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of code, frequently requiring modifications to array data structures such as privatization and reduction transformations. Measurements from several standard benchmark suites demonstrate that an integrated combination of interprocedural analyses can substantially advance the capability of automatic parallelization technology.
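
To make the coarse-grain claim concrete, here is an invented loop that spans a procedure call: recognizing the accumulation into sum as a reduction, across the call, is what lets the whole outer loop run in parallel.

    /* Hypothetical coarse-grain loop spanning a call (not from the
       paper). Proving the loop parallel requires analyzing row_norm
       and recognizing the += accumulation as a reduction. */
    #define N 512

    static double row_norm(const double row[N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += row[j] * row[j];
        return s;
    }

    double total_norm(const double a[N][N])
    {
        double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += row_norm(a[i]);      /* reduction across a call */
        return sum;
    }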


Symposium on Code Generation and Optimization | 2006

Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Shih-Wei Liao; Zhaohui Du; Gansha Wu; Guei-Yuan Lueh

Multicore processors are about to become prevalent in the PC world. Meanwhile, over 90% of the computing cycles are estimated to be consumed by streaming media applications (Rixner et al., 1998). Although stream programming exposes parallelism naturally, we found that achieving high performance on multiprocessors is challenging. Therefore, we develop a parallel compiler for the Brook streaming language with aggressive data and computation transformations. First, we formulate fifteen Brook stream operators in terms of systems of inequalities. Our compiler optimizes the modeled operators to improve memory footprint and performance. Second, the stream computation including both kernels and operators is mapped to the affine partitioning model by modeling each kernel as an implicit loop nest over stream elements. Note that our general abstraction is not limited to Brook. Our modeling and transformations yield high performance on uniprocessors as well. The geometric mean of speedups is 4.7 on ten streaming applications on a Xeon. On multiprocessors, we show that exploiting the standard intra-kernel data parallelism is inferior to our general modeling. The former yields a speedup of 1.5 for ten applications on a 4-way Xeon, while the latter achieves a speedup of 6.4 over the same baseline. We show that our compiler effectively reduces memory footprint, exploits parallelism, and circumvents phase-ordering issues.
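
A rough sketch of the modeling idea (names are invented and this is plain C, not Brook syntax): an elementwise kernel is treated as an implicit loop nest over stream elements, which exposes it to the affine-partitioning machinery.

    /* Hypothetical illustration of mapping a streaming kernel to an
       implicit loop nest; not Brook syntax. */
    #define LEN 100000

    /* What a Brook-style elementwise kernel expresses... */
    static float saxpy_kernel(float x, float y, float alpha)
    {
        return alpha * x + y;
    }

    /* ...and the implicit loop nest the compiler models it as. */
    void saxpy_stream(const float x[LEN], float y[LEN], float alpha)
    {
        for (int i = 0; i < LEN; i++)   /* implicit loop over elements */
            y[i] = saxpy_kernel(x[i], y[i], alpha);
    }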


ACM Transactions on Programming Languages and Systems | 2005

Interprocedural parallelization analysis in SUIF

Mary W. Hall; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Monica S. Lam

As shared-memory multiprocessor systems become widely available, there is an increasing need for tools to simplify the task of developing parallel programs. This paper describes one such tool, the automatic parallelization system in the Stanford SUIF compiler. This article represents the culmination of a several-year research effort aimed at making parallelizing compilers significantly more effective. We have developed a system that performs full interprocedural parallelization analyses, including array privatization analysis, array reduction recognition, and a suite of scalar data-flow analyses including symbolic analysis. These analyses collaborate in an integrated fashion to exploit coarse-grain parallel loops, computationally intensive loops that can execute on multiple processors independently with no cross-processor synchronization or communication. The system has successfully parallelized large interprocedural loops of over a thousand lines of code completely automatically from sequential applications.

This article provides a comprehensive description of the analyses in the SUIF system. We also present extensive empirical results on four benchmark suites, showing the contribution of individual analysis techniques both in executing more of the computation in parallel, and in increasing the granularity of the parallel computations. These results demonstrate the importance of interprocedural array data-flow analysis, array privatization and array reduction recognition; a third of the programs spend more than 50% of their execution time in computations that are parallelized with these techniques. Overall, these results indicate that automatic parallelization can be effective on sequential scientific computations, but only if the compiler incorporates all of these analyses.
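
An invented example of the array privatization these analyses target: scratch is completely written before it is read in every outer iteration, so each processor can work on its own copy.

    /* Hypothetical array-privatization example (not from the paper):
       'scratch' is fully written before being read in each outer
       iteration, so a private per-processor copy makes the loop
       parallel despite the shared-looking array. */
    #define N 256

    void transform(double a[N][N])
    {
        double scratch[N];
    #pragma omp parallel for private(scratch)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                scratch[j] = a[i][j] * a[i][j];   /* write first */
            for (int j = 0; j < N; j++)
                a[i][j] = scratch[j] + 1.0;       /* then read */
        }
    }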


Languages and Compilers for Parallel Computing | 1995

Interprocedural Analysis for Parallelization

Mary W. Hall; Brian R. Murphy; Saman P. Amarasinghe; Shih-Wei Liao; Monica S. Lam

This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array and scalar variables, and symbolic analysis of array subscripts. The interprocedural analysis framework is designed to provide analysis results nearly as precise as full inlining but without its associated costs. Experimentation with this system on programs from standard benchmark suites demonstrates that an integrated combination of interprocedural analyses can substantially advance the capability of automatic parallelization technology.
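
An invented sketch of why interprocedural analysis (rather than full inlining) matters here: each call writes only row i of the matrix, so the iterations are independent, but proving that requires summarizing the callee's array accesses.

    /* Hypothetical example, not from the paper: the loop is parallel
       because each call writes a disjoint row, a fact an
       interprocedural summary of init_row can establish without
       inlining it. */
    #define N 512

    static void init_row(double row[N], double v)
    {
        for (int j = 0; j < N; j++)
            row[j] = v;
    }

    void init_matrix(double a[N][N])
    {
    #pragma omp parallel for
        for (int i = 0; i < N; i++)
            init_row(a[i], (double)i);  /* writes a[i][0..N-1] only */
    }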


International Symposium on Microarchitecture | 1996

Multiprocessors from a software perspective

Saman P. Amarasinghe; Jennifer-Ann M. Anderson; Christopher S. Wilson; Shih-Wei Liao; Brian R. Murphy; Robert S. French; Monica S. Lam; Mary W. Hall

Like many architectural techniques that originated with mainframes, the use of multiple processors in a single computer is becoming popular in workstations and even personal computers. Multiprocessors constitute a significant percentage of recent workstation sales, and highly affordable multiprocessor personal computers are available in local computer stores. Once again, we find ourselves in a familiar situation: hardware is ahead of software. Because of the complexity of parallel programming, multiprocessors today are rarely used to speed up individual applications. Instead, they usually function as cycle-servers that achieve increased system throughput by running multiple tasks simultaneously. Automatic parallelization by a compiler is a particularly attractive approach to software development for multiprocessors, as it enables ordinary sequential programs to take advantage of the multiprocessor hardware without user involvement. This article looks to the future by examining some of the latest research results in automatic parallelization technology.


Symposium on Code Generation and Optimization | 2010

Taming hardware event samples for FDO compilation

Dehao Chen; Neil Vachharajani; Robert Hundt; Shih-Wei Liao; Vinodha Ramasamy; Paul Yuan; Wenguang Chen; Weiming Zheng

Feedback-directed optimization (FDO) is effective in improving application runtime performance, but has not been widely adopted due to the tedious dual-compilation model, the difficulties in generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet, hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead.
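
A deliberately simplified, invented sketch of the flow-conservation idea behind turning noisy per-block samples into edge profiles; the paper's actual approach (supervised learning plus smoothing algorithms) is far more sophisticated.

    /* Invented toy example: at a two-way branch, the outgoing edge
       counts must sum to the branch block's count. Noisy successor
       samples are rescaled to satisfy that constraint, which is the
       simplest form of the smoothing discussed above. */
    #include <stdio.h>

    int main(void)
    {
        double branch_bb = 1000.0;               /* samples at branch block */
        double then_bb = 640.0, else_bb = 310.0; /* noisy successor samples */

        /* Rescale so the two edges obey flow conservation. */
        double total = then_bb + else_bb;
        double then_edge = branch_bb * (then_bb / total);
        double else_edge = branch_bb - then_edge;

        printf("then edge: %.1f, else edge: %.1f\n", then_edge, else_edge);
        return 0;
    }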

Collaboration


Dive into Shih-Wei Liao's collaboration.

Top Co-Authors

Saman P. Amarasinghe

Massachusetts Institute of Technology
