
Publication


Featured research published by Miriam Leeser.


field programmable gate arrays | 2001

Algorithmic transformations in the implementation of K-means clustering on reconfigurable hardware

Mike Estlick; Miriam Leeser; James Theiler; John J. Szymanski

In mapping the k-means algorithm to FPGA hardware, we examined algorithm-level transforms that dramatically increased the achievable parallelism. We apply the k-means algorithm to multi-spectral and hyper-spectral images, which have tens to hundreds of channels of data per pixel. K-means is an iterative algorithm that assigns to each pixel a label indicating which of K clusters the pixel belongs to, and it is a common solution to the segmentation of multi-dimensional data. The standard software implementation of k-means uses floating-point arithmetic and Euclidean distances. Floating-point arithmetic and the multiplication-heavy Euclidean distance calculation are fine on a general-purpose processor, but they carry large area and speed penalties when implemented on an FPGA. To get the best performance from k-means on an FPGA, the algorithm must be transformed to eliminate these operations. We examined the effects of using two other distance measures, Manhattan and Max, that do not require multipliers, as well as the effects of using fixed precision and truncated bit widths in the algorithm. It is important to explore algorithm-level transforms and tradeoffs when mapping an algorithm to reconfigurable hardware: a direct translation of the standard software implementation of k-means would use FPGA hardware resources very inefficiently, and analysis of the algorithm and data is necessary for a more efficient implementation. Our resulting implementation exhibits approximately a 200-times speedup over a software implementation.
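The multiplier-free distance measures discussed in this abstract can be illustrated with a small software sketch. This is not the paper's hardware design; the function name and data layout are illustrative, and the point is only that Manhattan and Max distances need adders and comparators where Euclidean distance needs multipliers.

```python
def assign_labels(pixels, centers, metric="manhattan"):
    """Assign each pixel the index of its nearest cluster center.

    pixels:  list of per-pixel channel vectors (lists of floats)
    centers: list of K cluster-center vectors
    """
    labels = []
    for p in pixels:
        best_k, best_d = 0, float("inf")
        for k, c in enumerate(centers):
            diffs = [abs(pi - ci) for pi, ci in zip(p, c)]
            if metric == "euclidean":    # multipliers needed in hardware
                d = sum(x * x for x in diffs)  # compare squared distances
            elif metric == "manhattan":  # adders only: sum of |differences|
                d = sum(diffs)
            elif metric == "max":        # comparators only: largest |difference|
                d = max(diffs)
            else:
                raise ValueError(metric)
            if d < best_d:
                best_k, best_d = k, d
        labels.append(best_k)
    return labels
```

On well-separated data the three metrics often agree on labels, which is why the cheaper measures are viable substitutes in the clustering loop.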


field programmable logic and applications | 2002

A Library of Parameterized Floating-Point Modules and Their Use

Pavle Belanovic; Miriam Leeser

We present a parameterized floating-point library for use with reconfigurable hardware. Our format is both general and flexible. All IEEE formats are a subset of our format, as are all previously published floating-point formats for reconfigurable hardware. We have developed a library of fully parameterized hardware modules for format control, arithmetic operations and conversion to and from any fixed-point format. The format converters allow for hybrid implementations that combine both fixed and floating-point calculations. This permits the designer to choose between the increased range of floating-point and the increased precision of fixed-point within the same application. We illustrate the use of this library with a hybrid implementation of the K-means clustering algorithm applied to multispectral satellite images.
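The idea of a format parameterized by exponent and fraction widths, with IEEE formats as one instantiation, can be sketched in software. This is a toy encoder, not the library's actual module interface; the function name and the tuple encoding are illustrative, and subnormals, rounding modes, and overflow are ignored.

```python
import math

def to_custom_float(x, exp_bits, frac_bits):
    """Encode x as (sign, biased exponent, fraction) in an arbitrary
    (exp_bits, frac_bits) format; IEEE single precision is (8, 23).
    Subnormals and overflow are ignored in this sketch."""
    sign = 1 if x < 0 else 0
    x = abs(x)
    bias = (1 << (exp_bits - 1)) - 1
    if x == 0:
        return sign, 0, 0
    e = math.floor(math.log2(x))                  # unbiased exponent
    frac = round((x / 2 ** e - 1.0) * (1 << frac_bits))  # hidden leading 1
    if frac == 1 << frac_bits:                    # rounding carried out
        frac, e = 0, e + 1
    return sign, e + bias, frac
```

Shrinking `exp_bits`/`frac_bits` trades range and precision for hardware area, which is the tradeoff the parameterized library exposes to the designer.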


field programmable gate arrays | 2004

An FPGA implementation of the two-dimensional finite-difference time-domain (FDTD) algorithm

Wang Chen; Panos Kosmas; Miriam Leeser; Carey M. Rappaport

Understanding and predicting electromagnetic behavior is increasingly important in modern technology. The Finite-Difference Time-Domain (FDTD) method is a powerful computational electromagnetics technique for modelling electromagnetic space. The 3D FDTD buried-object-detection forward model is emerging as a useful application in mine detection and other subsurface sensing areas. However, the computation of this model is complex and time consuming. Implementing this algorithm in hardware greatly increases its computational speed and widens its use in many other areas. We present an FPGA implementation to speed up the pseudo-2D FDTD algorithm, a simplified version of the 3D FDTD model; the pseudo-2D model can be upgraded to 3D with limited modification of the structure. We implement the pseudo-2D FDTD model for layered media and complete boundary conditions on an FPGA. The reconfigurable hardware design is about 24 times faster than a software implementation on a 3.0 GHz PC. The speedup is due to pipelining, parallelism, use of fixed-point arithmetic, and careful memory architecture design.
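The leapfrog structure that makes FDTD so amenable to pipelined hardware can be seen in a minimal software sketch of the generic 2-D TMz Yee scheme. This is not the paper's pseudo-2D fixed-point design: units are normalized, the media are homogeneous, boundaries are simply left un-updated, and the grid sizes and source placement are illustrative.

```python
def fdtd_2d(nx, ny, steps, courant=0.5):
    """Leapfrog update of the 2-D TMz Yee scheme in normalized units.
    Ez lives on the full grid; Hx and Hy sit on staggered half-grids."""
    Ez = [[0.0] * ny for _ in range(nx)]
    Hx = [[0.0] * (ny - 1) for _ in range(nx)]
    Hy = [[0.0] * ny for _ in range(nx - 1)]
    for _ in range(steps):
        # H update: spatial differences of Ez (curl of E)
        for i in range(nx):
            for j in range(ny - 1):
                Hx[i][j] -= courant * (Ez[i][j + 1] - Ez[i][j])
        for i in range(nx - 1):
            for j in range(ny):
                Hy[i][j] += courant * (Ez[i + 1][j] - Ez[i][j])
        # E update: spatial differences of H (curl of H), interior points only
        for i in range(1, nx - 1):
            for j in range(1, ny - 1):
                Ez[i][j] += courant * ((Hy[i][j] - Hy[i - 1][j])
                                       - (Hx[i][j] - Hx[i][j - 1]))
        Ez[nx // 2][ny // 2] += 1.0   # hard point source at the grid center
    return Ez
```

Every cell update reads only nearest neighbors, so the inner loops map naturally onto a deep pipeline with streaming memory accesses, which is where the reported hardware speedup comes from.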


international symposium on microarchitecture | 1997

Division and square root: choosing the right implementation

Peter Soderquist; Miriam Leeser

Floating-point support has become a mandatory feature of new microprocessors due to the prevalence of business, technical, and recreational applications that use these operations. Spreadsheets, CAD tools, and games, for instance, typically feature floating-point-intensive code. Over the past few years, the leading architectures have incorporated several generations of floating-point units (FPUs). However, while addition and multiplication implementations have become increasingly efficient, support for division and square root has remained uneven. The design community has reached no consensus on the type of algorithm to use for these two functions, and the quality and performance of implementations vary widely. This situation originates in skepticism about the importance of division and square root and an insufficient understanding of the design alternatives. Quantifying what constitutes good performance is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications. Even if we accept this doctrine at face value, implementing division and square root involves much more than relative latencies. We must also consider area, throughput, complexity, and the interaction with other operations. This article explores the various trade-offs involved and illuminates the consequences of different design choices, enabling designers to make informed decisions.
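One family of design alternatives surveyed here, the multiplicative methods, builds division out of the FPU's existing multiplier. A minimal software sketch of Newton-Raphson reciprocal refinement shows the structure; the linear seed below stands in for the small lookup table real hardware would use, and the function name and iteration count are illustrative.

```python
def reciprocal_nr(b, iterations=4):
    """Newton-Raphson reciprocal of b in [1, 2): x <- x * (2 - b*x).
    Division a/b is then computed as a * (1/b), using only multiplies
    and subtractions; each iteration roughly doubles the correct bits."""
    assert 1.0 <= b < 2.0
    x = 1.4571 - 0.5 * b          # linear seed (stands in for a lookup table)
    for _ in range(iterations):
        x = x * (2.0 - b * x)     # quadratic convergence
    return x
```

The quadratic convergence is why a few iterations suffice, but each iteration's two multiplies are dependent, so latency is paid serially through the multiplier pipeline.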


ACM Computing Surveys | 1996

Area and performance tradeoffs in floating-point divide and square-root implementations

Peter Soderquist; Miriam Leeser

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields small area requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints, and modest performance on divide and square root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.
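Goldschmidt's algorithm, one of the multiplicative methods named above, can be sketched in a few lines. This is a software illustration only, assuming a pre-normalized denominator; real implementations run in fixed iteration counts with carefully sized intermediate precision.

```python
def goldschmidt_divide(a, b, iterations=6):
    """Goldschmidt division: scale numerator and denominator by the same
    correction factor until the denominator converges to 1, at which point
    the numerator holds a/b."""
    assert 1.0 <= b < 2.0            # assume a normalized denominator
    n, d = a, b
    for _ in range(iterations):
        f = 2.0 - d                  # correction factor
        n, d = n * f, d * f          # the two multiplies are independent
    return n
```

Unlike Newton-Raphson, the two multiplies per step are independent of each other, which pipelines well; the cost is that rounding errors accumulate in the numerator, so hardware versions need guard bits and a final rounding step.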


IEEE Transactions on Very Large Scale Integration Systems | 2000

HML, a novel hardware description language and its translation to VHDL

Yanbing Li; Miriam Leeser

We present hardware ML (HML), an innovative hardware description language (HDL) based on the functional programming language SML. Features of HML not found in other HDLs include polymorphic types and advanced type checking and type inference techniques. We have implemented an HML type checker and a translator for automatically generating VHDL from HML descriptions. We generate a synthesizable subset of VHDL and automatically infer types and interfaces. This paper gives an overview of HML and discusses the translation from HML to VHDL and the type inference process.


field-programmable custom computing machines | 2006

Advanced Components in the Variable Precision Floating-Point Library

Xiaojun Wang; Sherman Braganza; Miriam Leeser

Optimal reconfigurable hardware implementations may require the use of arbitrary floating-point formats that do not necessarily conform to IEEE specified sizes. We have previously presented a variable precision floating-point library for use with reconfigurable hardware, and have recently added three advanced components to it: floating-point division, floating-point square root, and floating-point accumulation. These advanced components use algorithms that are well suited to FPGA implementation and exhibit a good tradeoff between area, latency, and throughput. The floating-point format of our library is both general and flexible. All IEEE formats, including the 64-bit double-precision format, are a subset of our format, as are all previously published floating-point formats for reconfigurable hardware. The generic floating-point format supported by all of our library components makes it easy and convenient to create a pipelined, custom datapath with the optimal bitwidth for each operation. Our library can be used to achieve more parallelism and less power dissipation than adhering to a standard format. To further increase parallelism and reduce power dissipation, our library also supports hybrid fixed- and floating-point operations in the same design. The division and square root designs are based on table lookup and Taylor series expansion, and make use of memories and multipliers embedded on the FPGA chip. The iterative accumulator uses the library addition module along with buffering and control logic to achieve performance similar to that of the addition module by itself. All are fully pipelined designs with clock speeds comparable to those of other library components, to aid the designer in implementing fast, complex, pipelined designs.
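The table-lookup-plus-Taylor-expansion approach mentioned for division and square root can be sketched for square root in software. This is an illustration of the general technique, not the library's actual design: the table size, input range, and function name are assumptions, and a real datapath would store the table entries in embedded memory rather than recompute them.

```python
def sqrt_taylor(x, table_bits=6):
    """Approximate sqrt(x) for x in [1, 4) by table lookup plus a
    first-order Taylor correction, mimicking a memory + multiplier datapath."""
    assert 1.0 <= x < 4.0
    step = 3.0 / (1 << table_bits)        # table spacing over [1, 4)
    idx = int((x - 1.0) / step)           # high-order bits address the table
    x0 = 1.0 + (idx + 0.5) * step         # table would store sqrt(x0), 1/(2*sqrt(x0))
    s0 = x0 ** 0.5
    return s0 + (x - x0) / (2.0 * s0)     # sqrt(x) ~ sqrt(x0) + (x - x0) / (2*sqrt(x0))
```

The residual error shrinks quadratically with the table spacing, so modest on-chip memories combined with one multiply-add reach useful precision without an iterative loop.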


IEEE Design & Test of Computers | 1998

Rothko: a three-dimensional FPGA

Miriam Leeser; Waleed Meleis; Mankuan Michael Vai; Silviu Chiricescu; Weidong Xu; Paul M. Zavracky

Using transferred circuits and metal interconnections placed between layers of active devices anywhere on the chip, Rothko aims to solve the utilization, routing, and delay problems of existing FPGA architectures. Experimental implementations have demonstrated important performance advantages.


acm multimedia | 1997

Optimizing the data cache performance of a software MPEG-2 video decoder

Peter Soderquist; Miriam Leeser

Multimedia functionality has become an established component of core computer workloads. MPEG-2 video decoding represents a particularly important and computationally demanding application example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of this and other multimedia algorithms. However, memory subsystem deficiencies have now become the major barrier to increased performance, partly as a consequence of this improved CPU performance. Decoding MPEG-2 video data in software makes significant bandwidth demands on memory subsystems, demands that are seriously aggravated by cache inefficiencies. Conventional data caches generate many times more cache-memory traffic than required, at best double the minimum necessary to support decoding. Improving efficiency requires understanding the behavior of the decoder and the composition of its data set. We provide an analysis of the memory and cache behavior of software MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive handling of different data types can reduce traffic by 50 percent or more.


ACM Transactions on Reconfigurable Technology and Systems | 2010

VFloat: A Variable Precision Fixed- and Floating-Point Library for Reconfigurable Hardware

Xiaojun Wang; Miriam Leeser

Optimal reconfigurable hardware implementations may require the use of arbitrary floating-point formats that do not necessarily conform to IEEE specified sizes. We present a variable precision floating-point library (VFloat) that supports general floating-point formats including IEEE standard formats. Most previously published floating-point formats for use with reconfigurable hardware are subsets of our format. Custom datapaths with optimal bitwidths for each operation can be built using the variable precision hardware modules in the VFloat library, enabling a higher level of parallelism. The VFloat library includes three types of hardware modules for format control, arithmetic operations, and conversions between fixed-point and floating-point formats. The format conversions allow for hybrid fixed- and floating-point operations in a single design. This gives the designer control over a large number of design possibilities including format as well as number range within the same application. In this article, we give an overview of the components in the VFloat library and demonstrate their use in an implementation of the K-means clustering algorithm applied to multispectral satellite images.

Collaboration


Dive into Miriam Leeser's collaborations.

Top Co-Authors

Xin Fang
Northeastern University

Geoffrey Brown
Indiana University Bloomington

James Theiler
Los Alamos National Laboratory