Is this you? Create Your Porfile

Byoungro So

University of Southern California

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Byoungro So is active.

Explore More

Publication

Featured researches published by Byoungro So.

international conference on parallel architectures and compilation techniques | 2005

Optimizing Compiler for the CELL Processor

Alexandre E. Eichenberger; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; J.C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; Peng Zhao; Michael Karl Gschwind

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double precision floating points up to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high quality codes over a wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.

programming language design and implementation | 2002

A compiler approach to fast hardware design space exploration in FPGA-based systems

Byoungro So; Mary W. Hall; Pedro C. Diniz

The current practice of mapping computations to custom hardware implementations requires programmers to assume the role of hardware designers. In tuning the performance of their hardware implementation, designers manually apply loop transformations such as loop unrolling. designers manually apply loop transformations. For example, loop unrolling is used to expose instruction-level parallelism at the expense of more hardware resources for concurrent operator evaluation. Because unrolling also increases the amount of data a computation requires, too much unrolling can lead to a memory bound implementation where resources are idle. To negotiate inherent hardware space-time trade-offs, designers must engage in an iterative refinement cycle, at each step manually applying transformations and evaluating their impact. This process is not only error-prone and tedious but also prohibitively expensive given the large search spaces and with long synthesis times. This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. We present a compiler algorithm that automatically explores the large design spaces resulting from the application of several program transformations commonly used in application-specific hardware designs. Our approach uses synthesis estimation techniques to quantitatively evaluate alternate designs for a loop nest computation. We have implemented this design space exploration algorithm in the context of a compilation and synthesis system called DEFACTO, and present results of this implementation on five multimedia kernels. Our algorithm derives an implementation that closely matches the performance of the fastest design in the design space, and among implementations with comparable performance, selects the smallest design. We search on average only 0.3% of the design space. This technology thus significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for a human designer.

design automation conference | 2003

Using estimates from behavioral synthesis tools in compiler-directed design space exploration

Byoungro So; Pedro C. Diniz; Mary W. Hall

This paper considers the role of performance and area estimates from behavioral synthesis in design space exploration. We have developed a compilation system that automatically maps high-level algorithms written in C to application-specific designs for Field Programmable Gate Arrays (FPGAs), through collaboration between parallelizing compiler technology and high-level synthesis tools. Using several code transformations, the compiler optimizes a design to increase parallelism and utilization of external memory bandwidth, and selects the best design among a set of candidates. Performance and area estimates from behavioral synthesis provide feedback to the compiler to guide this selection. Estimates can be derived far more quickly (up to several orders of magnitude faster) than full synthesis and place-and-route, thus allowing the compiler to consider many more designs than would otherwise be practical. In this paper, we examine the accuracy of the estimates from behavioral synthesis as compared to the fully synthesized designs for a collection of 209 designs for five multimedia kernels. Though the estimates are not completely accurate, our results show that the same design would be selected by the design space exploration algorithm, whether we use estimates or actual results from place-and-route, because it favors smaller designs and only increases complexity when the benefit is significant.

field-programmable custom computing machines | 2002

Coarse-grain pipelining on multiple FPGA architectures

Heidi E. Ziegler; Byoungro So; Mary W. Hall; Pedro C. Diniz

Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.

symposium on code generation and optimization | 2004

Custom data layout for memory parallelism

Byoungro So; Mary W. Hall; Heidi E. Ziegler

We describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses in modern architectures, where multiple memory banks can simultaneously feed one or more functional units. We do not use a fixed data layout, but rather select application-specific layouts according to access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout yields results as good as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe greater than a 75% reduction in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, as compared to using a single memory with no unrolling.

compiler construction | 2004

Increasing the Applicability of Scalar Replacement

Byoungro So; Mary W. Hall

This paper describes an algorithm for scalar replacement, which replaces repeated accesses to an array element with a scalar temporary. The element is accessed from a register rather than memory, thereby eliminating unnecessary memory accesses. A previous approach to this problem combines scalar replacement with a loop transformation called unroll-and-jam, whereby outer loops in a nest are unrolled, and the resulting duplicate inner loop bodies are fused together. The effect of unroll-and-jam is to bring opportunities for scalar replacement into inner loop bodies. In this paper, we describe an alternative approach that can exploit reuse opportunities across multiple loops in a nest, and without requiring unroll-and-jam. We also use this technique to eliminate unnecessary writes back to memory. The approach described in this paper is particularly well-suited to architectures with large register files and efficient mechanisms for register-to-register transfer. From our experimental results mapping 5 multimedia kernels to an FPGA platform, assuming 32 registers, we observe a 58 to 90 percent of reduction in memory accesses and speedup 2.34 to 7.31 over original programs.

Microprocessors and Microsystems | 2005

Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system

Pedro C. Diniz; Mary W. Hall; Joonseok Park; Byoungro So; Heidi E. Ziegler

Abstract The DEFACTO compilation and synthesis system is capable of automatically mapping computations expressed in high-level imperative programming languages as C to FPGA-based systems. DEFACTO combines parallelizing compiler technology with behavioral VHDI, synthesis tools to guide the application of high-level compiler transformations in the search of high-quality hardware designs. In this article we illustrate the effectiveness of this approach in automatically mapping several kernel codes to an FPGA quickly and correctly. We also present a detailed example of the comparison of the performance of an automatically generated design against a manually generated implementation of the same computation. The design-space-exploration component of DEFACTO is able to explore a large number of designs for a particular computation that would otherwise be impractical for any designers.

IEEE Transactions on Parallel and Distributed Systems | 2000

Evaluating automatic parallelization in SUIF

Sungdo Moon; Byoungro So; Mary W. Hall

This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results.

international conference on supercomputing | 1998

Measuring the effectiveness of automatic parallelization in SUIF

Byoungro So; Sungdo Moon; Mary W. Hall

This paper presents both an experiment and a system for inserting run-time dependence and privatization testing. The goal of the experiment is to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by state-of-theart parallelizing compiler technology. We perform run-time testing of data accessed within all candidate loops not parallelized by the compiler to identify which of these loops could safely execute in parallel for the given program input. This system extends the Lazy Privatizing Doall (LPD) test to simultaneously instrument multiple loops in a nest. Using the results of interprocedural array dataflow analysis, we avoid unnecessary instrumentation of arrays with compile-time provable dependences or loops nested inside outer parallelized loops. We have implemented the system in the Stanford SUIF compiler and have measured programs in three benchmark suites.

languages and compilers for parallel computing | 2003

Search Space Properties for Mapping Coarse-Grain Pipelined FPGA Applications

Heidi E. Ziegler; Mary W. Hall; Byoungro So

This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. In previous work, we described a compiler algorithm that optimizes individual loop nests, expressed in C, to derive an efficient FPGA implementation. In this paper, we describe a global optimization strategy that maps multiple loop nests to a coarse-grain pipelined FPGA implementation. The global optimization algorithm automatically transforms the computation to incorporate explicit communication and data reorganization between pipeline stages, and uses metrics to guide design space exploration to consider the impact of communication and to achieve balance between producer and consumer data rates across pipeline stages. We illustrate the components of the algorithm with a case study, a machine vision kernel.

Explore More