Montek Singh
University of North Carolina at Chapel Hill
Publications
Featured research published by Montek Singh.
Computer Graphics Forum | 2005
Justin Hensley; Thorsten Scheuermann; Greg Coombe; Montek Singh; Anselmo Lastra
We introduce a technique to rapidly generate summed-area tables using graphics hardware. Summed-area tables, originally introduced by Crow, provide a way to filter arbitrarily large rectangular regions of an image in a constant amount of time. Our algorithm for generating summed-area tables, similar to a technique used in scientific computing called recursive doubling, allows the generation of a summed-area table in O(log n) time. We also describe a technique to mitigate the precision requirements of summed-area tables. The ability to calculate and use summed-area tables at interactive rates enables numerous interesting rendering effects. We present several possible applications. First, the use of summed-area tables allows real-time rendering of interactive, glossy environmental reflections. Second, we present glossy planar reflections with varying blurriness dependent on a reflected object’s distance to the reflector. Third, we show a technique that uses a summed-area table to render glossy transparent objects. The final application demonstrates an interactive depth-of-field effect using summed-area tables.
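Below is a minimal NumPy sketch of the two ideas in this abstract: building a summed-area table with O(log n) recursive-doubling passes per axis, and filtering an arbitrary rectangle in constant time from four table lookups. The function names and the CPU-side formulation are illustrative assumptions; the paper's GPU shader implementation and its precision-mitigation technique are not reproduced.

```python
import numpy as np

def summed_area_table(img):
    """Build a summed-area table with log2(n) recursive-doubling passes per
    axis: each pass adds the value 2^i entries away, mirroring the
    GPU-friendly formulation described above."""
    sat = img.astype(np.float64)
    for axis in (0, 1):
        step = 1
        while step < sat.shape[axis]:
            shifted = np.zeros_like(sat)
            if axis == 0:
                shifted[step:, :] = sat[:-step, :]
            else:
                shifted[:, step:] = sat[:, :-step]
            sat = sat + shifted
            step *= 2
    return sat

def box_sum(sat, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] in constant time from four corner lookups."""
    total = sat[y1, x1]
    if y0 > 0:
        total -= sat[y0 - 1, x1]
    if x0 > 0:
        total -= sat[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += sat[y0 - 1, x0 - 1]
    return total

img = np.arange(16, dtype=np.float64).reshape(4, 4)
sat = summed_area_table(img)
assert box_sum(sat, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

A box filter of any size then costs the same four lookups, which is what makes the interactive glossy-reflection and depth-of-field effects feasible.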
IEEE Transactions on Very Large Scale Integration Systems | 2007
Montek Singh; Steven M. Nowick
An asynchronous pipeline style is introduced for high-speed applications, called MOUSETRAP. The pipeline uses standard transparent latches and static logic in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple structure is combined with an efficient and highly-concurrent event-driven protocol between adjacent stages. Post-layout SPICE simulations of a ten-stage pipeline with a 4-bit wide datapath indicate throughputs of 2.1-2.4 GHz in a 0.18-μm TSMC CMOS process. Similar results were obtained when the datapath width was extended to 16 bits. This performance is competitive even with that of wave pipelines, without the accompanying problems of complex timing and much design effort. Additionally, the new pipeline gracefully and robustly adapts to variable speed environments. The pipeline stages are extended to fork and join structures, to handle more complex system architectures.
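The sketch below is a simplified, software-only model of the MOUSETRAP handshake described above: a stage's transparent latch is enabled exactly when its 'done' phase matches the phase of the following stage (the single XNOR-style gate per stage), and the phase toggles whenever a new item is captured. Stage, signal, and function names are illustrative assumptions, and circuit-level timing is not modeled.

```python
# Simplified behavioral model of a MOUSETRAP-style FIFO (transition signaling).
def mousetrap_step(data, done, ack_out, source):
    """Advance every stage once. Stage i's latch is transparent when its
    'done' phase equals the phase already seen by stage i+1 (the XNOR)."""
    n = len(data)
    for i in reversed(range(n)):                       # output end first
        downstream_done = done[i + 1] if i + 1 < n else ack_out[0]
        enable = (done[i] == downstream_done)          # single-gate latch control
        if i == 0:
            has_input = bool(source)
        else:
            has_input = (done[i - 1] != done[i])       # predecessor holds new data
        if enable and has_input:
            data[i] = source.pop(0) if i == 0 else data[i - 1]
            done[i] = not done[i]                      # request transition forward

def drain(data, done, ack_out, sink):
    """Always-ready environment: consume whatever the last stage offers."""
    if done[-1] != ack_out[0]:
        sink.append(data[-1])
        ack_out[0] = done[-1]                          # acknowledge with a transition

stages = 4
data, done, ack_out = [None] * stages, [False] * stages, [False]
source, sink = list(range(8)), []
while len(sink) < 8:
    drain(data, done, ack_out, sink)
    mousetrap_step(data, done, ack_out, source)
assert sink == list(range(8))                          # items arrive in FIFO order
```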
International Symposium on Advanced Research in Asynchronous Circuits and Systems | 2000
Montek Singh; Steven M. Nowick
This paper introduces several new asynchronous pipeline designs which offer high throughput as well as low latency. The designs target dynamic datapaths, both dual-rail and single-rail. The new pipelines are latch-free and therefore are particularly well-suited for fine-grain pipelining, i.e., where each pipeline stage is only a single gate deep. The pipelines employ new control structures and protocols aimed at reducing the handshaking delay, the principal impediment to achieving high throughput in asynchronous pipelines. As a test vehicle, a 4-bit FIFO was designed using 0.6-μm technology. The results of careful HSPICE simulations of the FIFO designs are very encouraging. The dual-rail designs deliver a throughput of up to 860 million data items per second. This performance represents an improvement by a factor of 2 over a widely-used comparable approach by T.E. Williams (1991). The new single-rail designs deliver a throughput of up to 1208 million data items per second.
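For readers unfamiliar with the dual-rail datapaths mentioned above, the short sketch below shows the encoding itself: each bit travels on a true/false wire pair, an all-zero spacer separates successive data items, and completion detection checks that every pair has exactly one rail asserted. Function names are illustrative; the paper's pipeline control circuits are not modeled.

```python
# Dual-rail encoding and completion detection (conceptual sketch).
SPACER = (0, 0)                                    # the all-zero spacer phase

def encode(bits):
    """Encode 0/1 values as dual-rail pairs: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(1, 0) if b else (0, 1) for b in bits]

def is_complete(word):
    """True when every bit pair has exactly one rail asserted (in hardware,
    an OR per pair combined through a C-element tree)."""
    return all((t ^ f) == 1 for t, f in word)

def decode(word):
    assert is_complete(word), "word is a spacer or still in transition"
    return [t for t, f in word]

word = encode([1, 0, 1, 1])
assert is_complete(word) and decode(word) == [1, 0, 1, 1]
assert not is_complete([SPACER] * 4)               # spacer phase: no data yet
```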
International Conference on Computer Design | 2001
Montek Singh; Steven M. Nowick
A new asynchronous pipeline design is introduced for high-speed applications. The pipeline uses simple transparent latches in its datapath, and small latch controllers consisting of only a single gate per pipeline stage. This simple stage structure is combined with an efficient transition-signaling protocol between stages. Initial pre-layout HSPICE simulations of a 10-stage FIFO with a 16-bit-wide datapath indicate a throughput of 3.51 GHz in 0.25-μm CMOS, using a conservative process. This performance is competitive even with that of wave pipelines, without the accompanying problems of complex timing and much design effort. Additionally, the new pipeline gracefully and robustly adapts to variable-speed environments. The stage implementations are extended to fork and join structures, to handle more complex system architectures.
IEEE Design & Test of Computers | 2011
Steven M. Nowick; Montek Singh
Pipelining is a key element of high-performance design. Distributed synchronization is at the same time one of the key strengths and one of the major difficulties of asynchronous pipelining. It automatically provides elasticity and on-demand power consumption. This tutorial provides an overview of the best-in-class asynchronous pipelining methods that can be used to fully exploit the advantages of this design style, covering both static and dynamic logic implementations.
Pattern Recognition | 1997
Montek Singh; Amitabha Chatterjee; Santanu Chaudhury
This paper presents a genetic algorithm for solving the problem of structural shape matching. Both sequential and parallel versions of the algorithm have been presented. The genetic operators (reproduction, crossover, and mutation) have been constructed for this specific problem. A new variation of the crossover operator, called the color crossover, is presented. This operator has resulted in significant improvement in runtime and algorithm efficiency. Parallelization has been achieved using an “island” model, with several subpopulations and occasional migration. A complete framework for an object recognition system using this genetic algorithm has been presented. Encouraging experimental results have been obtained.
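The skeleton below gives a generic genetic-algorithm loop in the spirit of this approach: a population of candidate model-to-scene assignments evolved with selection, crossover, and mutation, plus an "island" loop with occasional migration of the best individual. The toy fitness table, the plain one-point crossover, and the migration policy are placeholder assumptions; the paper's color crossover operator is not reproduced here.

```python
import random

MODEL_NODES, SCENE_NODES = 6, 10                         # illustrative problem size

def fitness(assignment, score):
    # score[(m, s)]: compatibility of model node m with scene node s (toy table)
    return sum(score.get((m, s), 0.0) for m, s in enumerate(assignment))

def crossover(a, b):
    cut = random.randrange(1, len(a))                    # plain one-point crossover
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.1):
    return [random.randrange(SCENE_NODES) if random.random() < rate else g
            for g in ind]

def evolve(pop, score, generations=10):
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(ind, score), reverse=True)
        parents = pop[: len(pop) // 2]                   # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(len(pop) - len(parents))]
        pop = parents + children
    return pop

random.seed(1)
score = {(m, m): 1.0 for m in range(MODEL_NODES)}        # toy compatibility table
islands = [[[random.randrange(SCENE_NODES) for _ in range(MODEL_NODES)]
            for _ in range(30)] for _ in range(4)]
for _ in range(10):                                      # island model
    islands = [evolve(pop, score) for pop in islands]
    best = max((ind for pop in islands for ind in pop),
               key=lambda ind: fitness(ind, score))
    for pop in islands:
        pop[-1] = list(best)                             # occasional migration
print(max(fitness(ind, score) for pop in islands for ind in pop))
```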
IEEE Transactions on Very Large Scale Integration Systems | 2007
Montek Singh; Steven M. Nowick
This paper introduces a high-throughput asynchronous pipeline style, called high-capacity (HC) pipelines, targeted to datapaths that use dynamic logic. This approach includes a novel highly-concurrent handshake protocol, with fewer synchronization points between neighboring pipeline stages than almost all existing asynchronous dynamic pipelining approaches. Furthermore, the dynamic pipelines provide 100% buffering capacity, without explicit latches, by means of separate pullup and pulldown control for each pipeline stage: neighboring stages can store distinct data items, unlike almost all existing latchless dynamic asynchronous pipelines. As a result, very high throughput is obtained. Fabricated first-in, first-out (FIFO) designs, in 0.18-μm technology, were fully functional over a wide range of supply voltages (1.2 to over 2.5 V), exhibiting a corresponding range of throughputs from 1.0-2.4 giga items/s. In addition, an experimental finite-impulse response (FIR) filter chip was designed and fabricated with IBM Research, whose speed-critical core used an HC pipeline. The HC pipeline exhibited throughputs up to 1.8 giga items/s, and the overall filter achieved 1.32 giga items/s, thus obtaining 15% higher throughput and 50% lower latency than the fastest previously-reported synchronous FIR filter, also designed at IBM Research.
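The toy model below illustrates only the buffering claim above: in a conventional latchless dynamic pipeline, a stalled stage's neighbor must hold a spacer (reset) phase, so at most every other stage stores a distinct item, whereas HC stages each hold one. The function and flag names are illustrative, and the handshake protocol itself is not modeled.

```python
def fill_stalled_pipeline(n_stages, full_buffering):
    """How many distinct data items a stalled n-stage pipeline can hold."""
    occupied = [False] * n_stages
    for i in reversed(range(n_stages)):            # fill from the output end
        neighbor_full = (i + 1 < n_stages) and occupied[i + 1]
        if full_buffering or not neighbor_full:    # HC: every stage may hold data;
            occupied[i] = True                     # conventional: needs a spacer next door
    return sum(occupied)

assert fill_stalled_pipeline(10, full_buffering=True) == 10   # 100% buffering (HC-style)
assert fill_stalled_pipeline(10, full_buffering=False) == 5   # alternating data/spacer
```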
IEEE Transactions on Very Large Scale Integration Systems | 2000
Montek Singh; Steven M. Nowick
A new asynchronous pipeline scheme (called LP_w), and two new pipelined asynchronous adder implementations, are introduced for high-throughput applications such as DSPs for multimedia processing. The pipeline scheme is targeted to dynamic datapaths. A novelty of the approach is that it uses decoupled control for pull-up and pull-down stacks. The adders are pipelined at the gate level and achieve very high throughput: 930-1023 million additions per second in a 0.6-μm CMOS process. These results are expected to scale to several gigaoperations per second in more modern technologies.
International Conference on Computer-Aided Design | 2006
Manoj Kumar Ampalam; Montek Singh
This paper introduces a novel approach to efficiently implement several useful architectural features in asynchronous application-specific ICs (ASICs). These features include speculation, preemption, and eager evaluation, which have so far only been available on CPUs, and have not been adequately investigated for custom ASICs. For the efficient implementation of the new architectural features, a radically new approach inspired by Sproull's counterflow pipelines (1994) is proposed. The key idea is to allow special commands, called anti-tokens, to be propagated in a direction opposite to that of data, allowing certain computations to be killed before they are completed, if their results are no longer required. The net impact is a significant improvement in the throughput of a certain class of systems - e.g., those involving conditional computation - where a bottleneck pipeline stage can often be preempted if its result is determined to be no longer needed. Experimental results indicate that our approach can improve the system throughput by a factor of up to 2.2×, along with an energy savings of up to 27%.
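The sketch below is a conceptual model of the anti-token idea: data tokens advance one stage per step, anti-tokens travel backward from the output, and a pair that meets (or would cross) annihilates, so the preempted computation never produces a result. The step granularity, injection schedules, and function names are illustrative simplifications of the circuit-level protocol.

```python
def simulate(n_stages, token_schedule, antitoken_schedule, steps=20):
    """token_schedule / antitoken_schedule map a step number to an injected label."""
    tokens, antitokens, completed, killed = {}, {}, [], []
    for step in range(steps):
        if step in token_schedule:                       # data enters at stage 0
            tokens[0] = token_schedule[step]
        if step in antitoken_schedule:                   # kill request enters at the output
            antitokens[n_stages - 1] = antitoken_schedule[step]
        for s in sorted(tokens):                         # meeting or crossing => annihilate
            meet = s if s in antitokens else (s + 1 if s + 1 in antitokens else None)
            if meet is not None:
                killed.append(tokens.pop(s))
                antitokens.pop(meet)
        tokens = {s + 1: v for s, v in tokens.items()}   # tokens flow forward
        completed += [v for s, v in tokens.items() if s == n_stages]
        tokens = {s: v for s, v in tokens.items() if s < n_stages}
        antitokens = {s - 1: v for s, v in antitokens.items() if s >= 1}  # anti-tokens flow back
    return completed, killed

done, dead = simulate(8, {0: "A", 3: "B"}, {2: "kill"})
assert dead == ["A"] and done == ["B"]   # the anti-token preempts the first item it meets
```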
Symposium on Asynchronous Circuits and Systems | 2002
Montek Singh; Jose A. Tierno; Alexander V. Rylyakov; Sergey V. Rylov; Steven M. Nowick
A high-throughput, low-latency digital finite impulse response (FIR) filter has been designed for use in partial-response maximum-likelihood (PRML) read channels of modern disk drives. The filter is a hybrid synchronous-asynchronous design. The speed-critical portion of the filter is designed as a high-performance asynchronous pipeline sandwiched between synchronous input and output portions, making it possible for the entire filter to be embedded within a clocked system. A novel feature of the filter is that the degree of pipelining is dynamically variable, depending upon the input data rate. This feature is critical in obtaining a very low filter latency throughout the range of operating frequencies. The filter is a ten-tap six-bit FIR filter, fabricated in a 0.18-μm CMOS process. Resulting chips were fully functional over a wide range of supply voltages, and exhibited throughputs of over 1.3 giga-items/s and latencies of 2-5 clock cycles. Interestingly, the filter throughput was limited by the synchronous portion of the chip; the internal asynchronous pipeline was estimated to be capable of significantly higher throughputs, around 1.8 giga-items/s. More importantly, the adaptively pipelined nature of the filter allows it to offer a worst-case latency of only 10 ns, which is half the worst-case latency of the best previously reported comparable fully-synchronous implementation by Rylov et al.
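As a functional reference for the FIR computation the chip implements, y[n] = Σ_k b[k]·x[n−k], the short sketch below computes the same ten-tap filter in plain Python. The tap values are placeholders, and the chip's six-bit quantization, pipelining, and read-channel adaptation are not modeled.

```python
def fir(x, b):
    """Direct-form FIR: each output is a dot product of the last len(b) inputs."""
    y, history = [], [0] * len(b)
    for sample in x:
        history = [sample] + history[:-1]                 # shift in the newest sample
        y.append(sum(coeff * h for coeff, h in zip(b, history)))
    return y

b = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]                        # ten illustrative taps
x = [1] + [0] * 11                                        # unit impulse
assert fir(x, b)[:10] == b                                # impulse response equals the taps
```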