Publication


Featured research published by Nastaran Baradaran.


ACM Transactions on Design Automation of Electronic Systems | 2008

A compiler approach to managing storage and memory bandwidth in configurable architectures

Nastaran Baradaran; Pedro C. Diniz

Configurable architectures offer the unique opportunity of realizing hardware designs tailored to the specific data and computational patterns of an application code. Customizing the storage structures is becoming increasingly important in mitigating the continuing gap between memory latencies and internal computing speeds. In this article we describe and evaluate a compiler algorithm that maps the arrays of a loop-based computation to internal storage structures, either RAM blocks or discrete registers. Our objective is to minimize the overall execution time while considering the capacity and bandwidth constraints of the storage resources. The novelty of our approach lies in creating a single framework that combines high-level compiler techniques with lower-level scheduling information for mapping the data. We illustrate the benefits of our approach for a set of image/signal processing kernels using a Xilinx Virtex™ Field-Programmable Gate Array (FPGA). Our algorithm leads to faster designs compared to the state-of-the-art custom data layout mapping technique, in some instances using less storage. When compared to hand-coded designs, our results are comparable in terms of execution time and resources, but are derived in a minute fraction of the design time.


Design, Automation and Test in Europe | 2005

A Register Allocation Algorithm in the Presence of Scalar Replacement for Fine-Grain Configurable Architectures

Nastaran Baradaran; Pedro C. Diniz

The aggressive application of scalar replacement to array references substantially reduces the number of memory operations at the expense of a possibly very large number of registers. We describe a register allocation algorithm that assigns registers to scalar replaced array references along the critical paths of a computation, in many cases exploiting the opportunity for concurrent memory accesses. Experimental results, for a set of image/signal processing code kernels, reveal that the proposed algorithm leads to a substantial reduction in the number of execution cycles for the corresponding hardware implementation on a contemporary field-programmable-gate-array (FPGA) when compared to other greedy allocation algorithms, in some cases, using even fewer registers.


Field-Programmable Technology | 2004

Compiler reuse analysis for the mapping of data in FPGAs with RAM blocks

Nastaran Baradaran; Joonseok Park; Pedro C. Diniz

Contemporary configurable architectures have dedicated internal functional units such as multipliers, high-capacity storage RAM, and even CAM blocks. These RAM blocks allow the implementations to cache data to be reused in the near future, thereby avoiding the latency of external memory accesses. We present a data allocation algorithm that utilizes the RAM blocks in the presence of a limited number of hardware registers. This algorithm, based on a compiler data reuse analysis, determines which data should be cached in the internal RAM blocks and when. The preliminary results, for a set of image/signal processing kernels targeting a Xilinx Virtex™ FPGA device, reveal that despite the increased latency of accessing data in RAM blocks, designs that use them require fewer configurable resources than designs that exclusively use registers, while attaining comparable and in some cases even better performance.


International Parallel and Distributed Processing Symposium | 2003

ECO: an empirical-based compilation and optimization system

Nastaran Baradaran; Jacqueline Chame; Chun Chen; Pedro C. Diniz; Mary W. Hall; Yoon-Ju Lee; Bing Liu; Robert F. Lucas

In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Due to the growing complexity of accurate performance prediction, our system incorporates empirical techniques to execute variants of code segments with representative data on the target architecture. In this paper, we discuss how empirical techniques and performance modeling can be effectively combined. We also discuss the role of historical information from prior runs, and programmer specifications supporting run-time adaptation. These techniques can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2004

Extending the applicability of scalar replacement to multiple induction variables

Nastaran Baradaran; Pedro C. Diniz; Joonseok Park

Scalar replacement or register promotion uses scalar variables to save data that can be reused across loop iterations, leading to a reduction of the number of memory operations at the expense of a possibly large number of registers. In this paper we present a compiler data reuse analysis capable of uncovering and exploiting reuse opportunities for array references that exhibit Multiple-Induction-Variable (MIV) subscripts, beyond the reach of current data reuse analysis techniques. We present experimental results of the application of scalar replacement to a sample set of kernel codes targeting a programmable hardware computing device, a Field-Programmable Gate Array (FPGA). The results show that, for memory-bound designs, scalar replacement alone leads to speedups ranging from 2x to 6x at the expense of an increase in the FPGA design area in the range of 6x to 20x.


Field-Programmable Logic and Applications | 2006

Memory Parallelism Using Custom Array Mapping to Heterogeneous Storage Structures

Nastaran Baradaran; Pedro C. Diniz

Configurable architectures offer the unique opportunity of customizing the storage allocation to meet specific applications' needs. In this paper we describe a compiler approach to map the arrays of a loop-based computation to internal memories of a configurable architecture with the objective of minimizing the overall execution time. We present an algorithm that considers the data access patterns of the arrays along the critical path of the computation as well as the available storage and memory bandwidth. We demonstrate experimental results of the application of this approach for a set of kernel codes when targeting a Field-Programmable Gate Array (FPGA). The results reveal that our algorithm outperforms naive and custom data layouts for these kernels by an average of 33% and 15% in terms of execution time, while taking into account the available hardware resources.


Field-Programmable Technology | 2005

Compiler-directed design space exploration for caching and prefetching data in high-level synthesis

Nastaran Baradaran; Pedro C. Diniz

Emerging computing architectures exhibit a rich variety of controllable storage resources. Allocation and management of these resources critically affect the performance of data-intensive applications. In this paper we describe a synergistic collaboration between compiler data dependence analysis and execution modeling techniques to explore the application of data caching and software prefetching for hardware designs in high-level synthesis. We describe a design space exploration algorithm that selects between data caching and prefetching of array references along the critical paths of the computation with the objective of minimizing the overall execution time, while meeting the architecture's storage and bandwidth constraints. We present preliminary results of the application of the algorithm for a set of image/signal processing kernels on a commercial FPGA. The high precision of our execution model (94% on average) results in the selection of the fastest design in every case.


Field-Programmable Logic and Applications | 2004

Data Reuse in Configurable Architectures with RAM Blocks

Nastaran Baradaran; Joonseok Park; Pedro C. Diniz

The heterogeneity of modern configurable devices makes the problem of mapping computations to them increasingly complex. Due to the large number of possibilities for partitioning the data among storage modules, these architectures allow for a much richer memory structure. One general goal in managing this memory is to minimize the number of external memory accesses. A classic technique for reducing this number is to keep reusable data as close to the processor as possible. In order to do so one needs to have a good idea as to if and when the data in a specific memory location is going to be reused. General compilers are capable of detecting the reuse as well as applying different techniques in order to exploit this reuse. In this paper we describe how to utilize a compiler reuse analysis to map the data to a configurable system, aiming at minimizing the number of external memory accesses. Our target architecture is a system with an external memory, a limited number of internal registers, and a fixed number and capacity of internal RAM blocks.


Archive | 2007

Compiler directed data management for configurable architectures with heterogeneous memory structures

Pedro C. Diniz; Nastaran Baradaran


Field-Programmable Logic and Applications | 2004

Data Reuse in Configurable Architectures with RAM Blocks: Extended Abstract.

Nastaran Baradaran; Joonseok Park; Pedro C. Diniz

Collaboration


Dive into Nastaran Baradaran's collaborations.

Top Co-Authors

Pedro C. Diniz, University of Southern California
Joonseok Park, Information Sciences Institute
Bing Liu, University of Southern California
Jacqueline Chame, University of Southern California
Robert F. Lucas, University of Southern California
Yoon-Ju Lee, University of Southern California