Heidi E. Ziegler
University of Southern California
Publication
Featured research published by Heidi E. Ziegler.
international parallel processing symposium | 1999
Kiran Bondalapati; Pedro C. Diniz; Phillip Duncan; John J. Granacki; Mary W. Hall; Rajeev Jain; Heidi E. Ziegler
The lack of high-level design tools hampers the widespread adoption of adaptive computing systems. Application developers have to master a wide range of functions, from the high-level architecture design, to the timing of actual control and data signals. In this paper we describe DEFACTO, an end-to-end design environment aimed at bridging the gap in tools for adaptive computing by bringing together parallelizing compiler technology and synthesis techniques.
field-programmable custom computing machines | 2002
Heidi E. Ziegler; Byoungro So; Mary W. Hall; Pedro C. Diniz
Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.
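The coarse-grain pipelining described above can be sketched in miniature. This toy model is an illustration only: the stage functions and the row-at-a-time channel are invented for the sketch, not taken from the paper's machine vision program.

```python
from collections import deque

# Hypothetical two-stage image pipeline: stage 1 thresholds each row,
# stage 2 sums each thresholded row. Stages exchange one row at a time
# through a FIFO, mimicking a coarse-grain pipeline across FPGAs.

def stage1_threshold(row, cutoff=128):
    return [1 if p >= cutoff else 0 for p in row]

def stage2_row_sum(row):
    return sum(row)

def run_pipeline(image):
    fifo = deque()          # inter-stage channel at row granularity
    results = []
    for row in image:       # producer: stage 1
        fifo.append(stage1_threshold(row))
    while fifo:             # consumer: stage 2
        results.append(stage2_row_sum(fifo.popleft()))
    return results

image = [[0, 200, 255], [130, 10, 90]]
print(run_pipeline(image))  # [2, 1]
```

In an actual FPGA mapping the two stages would run concurrently on separate devices; the point of the sketch is only the row-granularity hand-off between them.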
symposium on code generation and optimization | 2004
Byoungro So; Mary W. Hall; Heidi E. Ziegler
We describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses in modern architectures, where multiple memory banks can simultaneously feed one or more functional units. We do not use a fixed data layout, but rather select application-specific layouts according to access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout yields results as good as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe greater than a 75% reduction in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, as compared to using a single memory with no unrolling.
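A small sketch of why access-pattern-specific layouts beat a fixed cyclic layout. The skewed layout below is a generic textbook device chosen for illustration; it is not DEFACTO's actual layout algorithm.

```python
# Compare bank conflicts for cyclic vs custom (skewed) layouts when a
# kernel reads one COLUMN of a row-major 4x4 array with 4 memory banks.

ROWS, COLS, BANKS = 4, 4, 4

def cyclic_bank(r, c):
    return (r * COLS + c) % BANKS       # fixed cyclic layout

def skewed_bank(r, c):
    return (r + c) % BANKS              # access-pattern-aware layout

def conflicts(bank_fn, accesses):
    banks = [bank_fn(r, c) for r, c in accesses]
    return len(banks) - len(set(banks)) # simultaneous hits on one bank

column0 = [(r, 0) for r in range(ROWS)] # one column read per cycle
print(conflicts(cyclic_bank, column0))  # 3: all four elements share a bank
print(conflicts(skewed_bank, column0))  # 0: one element per bank
```

Under the cyclic layout every element of column 0 lands in bank 0, serializing the four reads; the skewed layout spreads them across all four banks so they can be fetched in one parallel access.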
Microprocessors and Microsystems | 2005
Pedro C. Diniz; Mary W. Hall; Joonseok Park; Byoungro So; Heidi E. Ziegler
The DEFACTO compilation and synthesis system is capable of automatically mapping computations expressed in high-level imperative programming languages such as C to FPGA-based systems. DEFACTO combines parallelizing compiler technology with behavioral VHDL synthesis tools to guide the application of high-level compiler transformations in the search for high-quality hardware designs. In this article we illustrate the effectiveness of this approach in automatically mapping several kernel codes to an FPGA quickly and correctly. We also present a detailed comparison of the performance of an automatically generated design against a manually generated implementation of the same computation. The design-space-exploration component of DEFACTO is able to explore a large number of designs for a particular computation that would otherwise be impractical for any designer.
design automation conference | 2003
Heidi E. Ziegler; Mary W. Hall; Pedro C. Diniz
In this paper, we describe a set of compiler analyses and an implementation that automatically map a sequential and un-annotated C program into a pipelined implementation, targeted for an FPGA with multiple external memories. For this purpose, we extend array data-flow analysis techniques from parallelizing compilers to identify pipeline stages, required inter-pipeline stage communication, and opportunities to find a minimal program execution time by trading communication overhead with the amount of computation overlap in different stages. Using the results of this analysis, we automatically generate application-specific pipelined FPGA hardware designs. We use a sample image processing kernel to illustrate these concepts. Our algorithm finds a solution in which transmitting a row of an array between pipeline stages per communication instance leads to a speedup of 1.76 over an implementation that communicates the entire array at once.
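The communication-granularity tradeoff can be modeled with a toy schedule: send the whole array after stage 1 finishes, or send each row as it is produced and overlap the two stages. The cycle counts here are made up for illustration and do not reproduce the paper's 1.76 result.

```python
# Toy schedule model for row-granularity vs whole-array communication.
N_ROWS, T1, T2 = 64, 10, 10   # rows, cycles per row in each stage
C_ROW, C_BULK = 2, 64         # per-row vs whole-array transfer cost

def bulk_time():
    return N_ROWS * T1 + C_BULK + N_ROWS * T2   # no overlap at all

def pipelined_time():
    finish2 = 0
    for k in range(N_ROWS):
        ready = (k + 1) * T1 + C_ROW            # row k produced and sent
        finish2 = max(ready, finish2) + T2      # stage 2 consumes row k
    return finish2

print(bulk_time(), pipelined_time())             # 1344 652
print(round(bulk_time() / pipelined_time(), 2))  # 2.06x from overlap
```

Finer-grained communication pays the per-row transfer cost more often but lets the consumer stage start long before the producer finishes, which is exactly the overlap-versus-overhead balance the analysis searches for.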
field programmable gate arrays | 2005
Heidi E. Ziegler; Mary W. Hall
This paper presents a set of measurements which characterize the design space for automatically mapping high-level algorithms consisting of multiple loop nests, expressed in C, onto an FPGA. We extend a prior compiler algorithm that derived optimized FPGA implementations for individual loop nests. We focus on the space-time tradeoffs associated with sharing constrained chip area among multiple computations represented by an asynchronous pipeline. Intermediate results are communicated on chip; a communication analysis derives the required communication automatically. Other analyses and transformations, also associated with parallelizing compiler technology, are used to perform high-level optimization of the designs. We vary the amount of parallelism in individual loop nests with the goal of deriving an overall design that makes the most effective use of chip resources. We describe several heuristics for automatically searching the space and a set of metrics for evaluating and comparing designs. From results obtained through an automated process, we demonstrate that heuristics derived through sophisticated compiler analysis are the most effective at navigating this complex search space, particularly for more complex applications.
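A greedy heuristic of the general kind such a search might use can be sketched as follows. This exact policy, and the per-nest cost numbers, are invented for the sketch; they are not the paper's heuristics or data.

```python
# Hypothetical greedy search: repeatedly double the unroll factor of the
# pipeline's slowest loop nest while the chip-area budget allows it.

# (cycles per iteration at unroll=1, area per unroll increment) per nest
NESTS = {"nest_a": (100, 10), "nest_b": (60, 8), "nest_c": (40, 5)}
AREA_BUDGET = 50

def search():
    unroll = {n: 1 for n in NESTS}
    area = sum(a for _, a in NESTS.values())    # baseline design area
    while True:
        # pipeline throughput is limited by the slowest stage
        slow = max(NESTS, key=lambda n: NESTS[n][0] / unroll[n])
        cost = NESTS[slow][1] * unroll[slow]    # area to double that stage
        if area + cost > AREA_BUDGET:
            return unroll
        unroll[slow] *= 2
        area += cost

print(search())
```

The heuristic captures the balance argument in the abstract: area spent unrolling a fast stage is wasted, because the asynchronous pipeline runs no faster than its slowest stage.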
languages and compilers for parallel computing | 2005
Heidi E. Ziegler; Priyadarshini L. Malusare; Pedro C. Diniz
Configurable architectures, with multiple independent on-chip RAM modules, offer the unique opportunity to exploit inherent parallel memory accesses in a sequential program by not only tailoring the number and configuration of the modules in the resulting hardware design but also the accesses to them. In this paper we explore the possibility of array replication for loop computations that is beyond the reach of traditional privatization and parallelization analyses. We present a compiler analysis that identifies portions of array variables that can be temporarily replicated within the execution of a given loop iteration, enabling the concurrent execution of statements or even non-perfectly nested loops. For configurable architectures where array replication is essentially free in terms of execution time, this replication enables not only parallel execution but also reduces or even eliminates memory contention. We present preliminary experiments applying the proposed technique to hardware designs for commercially available FPGA devices.
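The safety condition behind the replication analysis can be shown with index sets standing in for the paper's array-section analysis. The set representation is a simplification invented for this sketch.

```python
# Within one loop iteration, array elements that both statements READ
# but neither WRITES can be copied into a second on-chip RAM module so
# the statements execute concurrently without memory contention.

def replicable(read_s1, read_s2, writes):
    # shared read-only portion => safe to replicate for this iteration
    return (read_s1 & read_s2) - writes

reads_stmt1 = {0, 1, 2, 3}
reads_stmt2 = {2, 3, 4, 5}
writes_iter = {3}

print(sorted(replicable(reads_stmt1, reads_stmt2, writes_iter)))  # [2]
```

Element 3 is excluded because it is written within the iteration, so a copy could go stale; element 2 is read by both statements and never written, so each statement can be served from its own RAM module.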
languages and compilers for parallel computing | 2003
Heidi E. Ziegler; Mary W. Hall; Byoungro So
This paper describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. In previous work, we described a compiler algorithm that optimizes individual loop nests, expressed in C, to derive an efficient FPGA implementation. In this paper, we describe a global optimization strategy that maps multiple loop nests to a coarse-grain pipelined FPGA implementation. The global optimization algorithm automatically transforms the computation to incorporate explicit communication and data reorganization between pipeline stages, and uses metrics to guide design space exploration to consider the impact of communication and to achieve balance between producer and consumer data rates across pipeline stages. We illustrate the components of the algorithm with a case study, a machine vision kernel.
acm symposium on parallel algorithms and architectures | 1998
Christopher Ho; Heidi E. Ziegler; Michel Dubois
In cache-coherent non-uniform memory access (CC-NUMA) multiprocessors, the amount of memory consumed by the directories grows with the number of memory blocks and the number of processors. In this paper, we propose to eliminate the cost of directories by storing directory entries in the same memory used for the data that we keep coherent. A memory block stores either data when the block is uncached, or directory information when the block is cached. Only one extra bit per memory block is required to distinguish whether it contains data or directory information. We call this approach "in-memory directories". The trade-off is that the new protocol consumes more network bandwidth and has slightly higher latencies. We quantify this trade-off using a subset of programs in the SPLASH-2 benchmarks. The evaluations show that in-memory directories are competitive with dedicated full-mapped directories and are a viable alternative to SCI for scalable multiprocessor systems. We also propose variations to the basic in-memory protocol, which reduce its memory access overhead.
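The core idea, one tag bit selecting between data and directory state, can be modeled in a few lines. This is a heavily simplified sketch (read-only sharing, no invalidations or write-backs), not the paper's full protocol.

```python
# Minimal model of an in-memory directory: each memory block holds either
# data (when uncached) or a sharer bit-vector (when cached), with one tag
# bit telling the two apart.

class MemoryBlock:
    def __init__(self, data):
        self.is_directory = False   # the one extra bit per block
        self.payload = data         # data, or a sharer bit-vector

    def cache_read(self, proc_id, caches):
        if not self.is_directory:
            caches[proc_id] = self.payload    # serve data from memory;
            self.is_directory = True          # block now holds the
            self.payload = 1 << proc_id       # directory (sharer bits)
        else:
            self.payload |= 1 << proc_id      # record the new sharer;
            # data now lives only in caches, so fetch it from a sharer
            caches[proc_id] = next(iter(caches.values()))

blk, caches = MemoryBlock(data=42), {}
blk.cache_read(0, caches)
blk.cache_read(2, caches)
print(blk.is_directory, bin(blk.payload), caches[2])  # True 0b101 42
```

Once a block is cached, its memory location is free to store the sharer set, which is why the scheme needs no dedicated directory storage; the bandwidth cost the abstract mentions comes from fetching the data from a sharer instead of memory.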
field-programmable logic and applications | 2004
Heidi E. Ziegler
Configurable systems offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages, where the use of customized pipelines exploits the inherent parallelism of the application. In this research, we describe a set of program analyses and an implementation that automatically map a sequential and un-annotated C program into a pipelined implementation targeted to an FPGA with multiple external memories. This research describes an automated approach to hardware design space exploration, through a collaboration between parallelizing compiler technology and high-level synthesis tools. In previous work, we described a compiler algorithm that optimizes individual loop nests, expressed in C, to derive an efficient FPGA implementation. In this research, we describe a global optimization strategy that maps multiple loop nests to a coarse-grain pipelined FPGA implementation.