Gayatri Mehta | Researchain

Publication

Featured research published by Gayatri Mehta.

ACM Transactions on Embedded Computing Systems | 2006

Reducing power while increasing performance with SuperCISC

Raymond R. Hoare; Dara Kusic; Gayatri Mehta; Joshua Fazekas; John Foster

Multiprocessor Systems on Chips (MPSoCs) have become a popular architectural technique to increase performance. However, MPSoCs may lead to undesirable power consumption characteristics for computing systems that have strict power budgets, such as PDAs, mobile phones, and notebook computers. This paper presents the super-complex instruction-set computing (SuperCISC) Embedded Processor Architecture and, in particular, investigates performance and power consumption of this device compared to traditional processor architecture-based execution. SuperCISC is a heterogeneous, multicore processor architecture designed to exceed performance of traditional embedded processors while maintaining a reduced power budget compared to low-power embedded processors. At the heart of the SuperCISC processor is a multicore VLIW (Very Large Instruction Word) containing several homogeneous execution cores/functional units. In addition, complex and heterogeneous combinational hardware function cores are tightly integrated to the core VLIW engine, providing an opportunity for improved performance and reduced energy consumption. Our SuperCISC processor core has been synthesized for both a 90-nm Stratix II Field Programmable Gate Array (FPGA) and a 160-nm standard cell Application-Specific Integrated Circuit (ASIC) fabrication process from OKI, each operating at approximately 167 MHz for the VLIW core. We examine several reasons for speedup and power improvement through the SuperCISC architecture, including predicated control flow, cycle compression, and a reduction in arithmetic power consumption, which we call power compression. Finally, testing our SuperCISC processor with multimedia and signal-processing benchmarks, we show how the SuperCISC processor can provide performance improvements ranging from 7X to 160X with an average of 60X, while also providing orders of magnitude of power improvements for the computational kernels. The power improvements for our benchmark kernels range from just over 40X to over 400X, with an average savings exceeding 130X. By combining these power and performance improvements, our total energy improvements all exceed 1000X. As these savings are limited to the computational kernels of the applications, which often consume approximately 90% of the execution time, we expect our savings to approach the ideal application improvement of 10X.
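
The closing "ideal application improvement of 10X" is consistent with an Amdahl's-law bound: if roughly 90% of execution time sits in the accelerated kernels, the remaining 10% caps the whole-application gain at 1 / (1 - 0.9) = 10X. The sketch below illustrates that arithmetic only. The Amdahl model and the function name amdahl_speedup are assumptions added for illustration, not taken from the paper; the 90% fraction and the 7X, 60X, and 160X kernel speedups are the figures quoted in the abstract.

```python
def amdahl_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Whole-application speedup when only the kernels are accelerated.

    kernel_fraction: share of original run time spent in the kernels (0..1).
    kernel_speedup:  factor by which the kernels alone get faster (> 0).
    Assumes a simple Amdahl's-law model; this is an illustrative assumption,
    not the evaluation method used by the authors.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    unaccelerated = 1.0 - kernel_fraction          # time left untouched
    accelerated = kernel_fraction / kernel_speedup  # time after speedup
    return 1.0 / (unaccelerated + accelerated)


if __name__ == "__main__":
    # Kernel speedups quoted in the abstract: minimum, average, maximum.
    for s in (7.0, 60.0, 160.0):
        print(f"kernel {s:>5.0f}X -> application {amdahl_speedup(0.9, s):.2f}X")
    # As the kernel speedup grows without bound, the result approaches 10X.
    print(f"upper bound      -> {1.0 / (1.0 - 0.9):.1f}X")
```

Under this model, even the largest quoted kernel speedup stays just below 10X at the application level, which is why the abstract describes 10X as the ideal the savings approach rather than a figure they reach.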

International Parallel and Distributed Processing Symposium | 2006

Design space exploration for low-power reconfigurable fabrics

Gayatri Mehta; Raymond R. Hoare; Justin Stander

This paper presents a parameterizable, coarse-grained, reconfigurable fabric model that attempts to maintain field programmable gate array (FPGA)-like programmability and computer aided design (CAD), with application specific integrated circuit (ASIC)-like power characteristics for digital signal processing (DSP) style applications. Using this model, architectural design space decisions are explored in order to define an energy-efficient fabric. The impact on energy and performance due to the variation of different parameters such as data width and interconnection flexibility has been studied. The multiplexer cardinality usage has also been studied by mapping some of the signal processing applications onto the fabric. The results point to the use of power-optimized 32-bit-wide computational elements interconnected by low-cardinality multiplexers such as 4:1 multiplexers.

IEEE Transactions on Circuits and Systems II: Express Briefs | 2006

A VLIW Processor With Hardware Functions: Increasing Performance While Reducing Power

Raymond R. Hoare; Dara Kusic; Justin Stander; Gayatri Mehta; Joshua Fazekas

This brief presents a heterogeneous multicore embedded processor architecture designed to exceed performance of traditional embedded processors while reducing the power consumed compared to low-power embedded processors. At the heart of this architecture is a multicore very large instruction word (VLIW) containing homogeneous execution cores/functional units. Additionally, heterogeneous combinational hardware function cores are tightly integrated to the VLIW core, providing an opportunity for improved performance and reduced energy consumption. Our processor has been synthesized for both a 90-nm Stratix II field-programmable gate array and a 160-nm cell-based application-specific integrated circuit from Oki, each operating at a core frequency of 167 MHz. For selected multimedia and signal processing benchmarks, we show how this processor provides kernel performance improvements averaging 179X over an Intel StrongARM and 36X over an Intel XScale, leading to application speedups averaging 30X over StrongARM and 10X over XScale.

ACM Transactions on Design Automation of Electronic Systems | 2009

Interconnect customization for a hardware fabric

Gayatri Mehta; Justin Stander; Mustafa Baz; Brady Hunsaker

This article describes several multiplexer-based interconnection strategies designed to improve energy consumption of stripe-based coarse-grain reconfigurable fabrics. Application requirements for the architecture as well as two dense subgraphs are extracted from a suite of signal and image processing benchmarks. These statistics are used to drive the strategy of the composition of multiplexer-based interconnect. The article compares interconnects that are fully connected between stripes, those with a cardinality of 8:1 to 4:1, and extensions that provide a 5:1 cardinality, limited 6:1 cardinality, and hybrids between 5:1 and 3:1 cardinalities. Additionally, dedicated vertical routes are considered replacing some computational units with dedicated pass-gates. Using a fabric interconnect model (FIM) written in XML, we demonstrate that fabric instances and mappers can be automatically generated using a Web-based design flow. Upon testing these instances, we found that using an 8:1 cardinality interconnect with 33% of the computational units replaced with dedicated pass-gates provided the best energy versus mappability tradeoff, resulting in a 50% energy improvement over fully connected rows and 20% energy improvement over an 8:1 cardinality interconnect without dedicated vertical routes.

International Parallel and Distributed Processing Symposium | 2007

Interconnect Customization for a Coarse-grained Reconfigurable Fabric

Gayatri Mehta; Justin Stander; Mustafa Baz; Brady Hunsaker

This paper describes several system-level interconnection strategies for a coarse-grained reconfigurable fabric designed for low-energy hardware acceleration. A small, representative sub-graph for signal and image processing applications is used to predict the success of mapping larger applications onto the fabric device with these different interconnection strategies, which include 32:1, 8:1, 5:1, 4:1, 3553:1 (3:1, 5:1, 5:1, 3:1) and 355:1 (3:1, 5:1, 5:1) cardinalities. Three mapping techniques are presented and used to complete mappings onto several of these fabric instances including a mixed integer linear programming technique, a constraint programming approach, and a greedy heuristic. We present results for area (in number of required rows), power, delay, and energy as well as run times for mapping a set of signal and image processing benchmarks onto each of these interconnects. Our results indicate that the 5:1 interconnect provides the best overall results and does not require any more hardware resources than the baseline 4:1 technique. When compared with other implementation strategies, the reconfigurable fabric energy consumption, using 5:1-based interconnect, is within 5-10X of a direct ASIC implementation, is 10X better than a Virtex-II Pro FPGA and is 100X better than an Intel XScale processor.

International Parallel and Distributed Processing Symposium | 2008

Reducing energy by exploring heterogeneity in a coarse-grain fabric

Gayatri Mehta; Colin J. Ihrig

This paper explores the impact of heterogeneity on energy consumption in a stripe-based coarse-grain fabric architecture. We examine the benefit of replacing 25-50% of functional blocks with dedicated vertical routes in the fabric. Additionally, we reduce the number of operations supported by the functional units from 23 to 16, 10 and 8. To assist in testing and examining the impact of these different architectures on energy consumption, an automation process was created to generate fabric instances based on a fabric instance model (FIM) written in XML. The FIM is also used as an input parameter to our heuristic mapper in order to program a particular fabric instance. Upon testing these instances, we found that the fabric with ALUs supporting 10 operations and using an 8:1 interconnect with 33% of the functional units replaced with dedicated pass gates provided the best energy versus mappability tradeoff, resulting in a 32% energy improvement and a 47% area savings over the baseline fabric with ALUs supporting 23 operations and using an 8:1 interconnect without dedicated vertical routes.

IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum | 2010

An architectural space exploration tool for domain specific reconfigurable computing

Gayatri Mehta

In this paper, we describe a design space exploration (DSE) tool for domain specific reconfigurable computing where the needs of the applications drive the construction of the device architecture. The tool has been developed to automate the design space case studies, which allows application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some of the edge-detection benchmarks from the image processing domain for our case studies. We compare the energy consumption of the architecture selected from manual design space case studies with the architectural solution selected by the design space exploration tool. The architecture selected by the DSE tool consumes approximately 9% less energy on average than the best candidate from the manual design space case studies. The fabric architecture selected from the manual design case studies and the one selected by the tool were synthesized on a 130-nm cell-based ASIC fabrication process from IBM. We compare the energy of the benchmarks implemented onto the fabric with other hardware and software implementations. Both fabric architectures (manual and tool) yield energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA and 2016X better than an Intel XScale processor.

ACM Transactions on Reconfigurable Technology and Systems | 2013

UNTANGLED: A Game Environment for Discovery of Creative Mapping Strategies

Gayatri Mehta; Carson Crawford; Xiaozhong Luo; Natalie Parde; Krunalkumar Patel; Brandon Rodgers; Anil Kumar Sistla; Anil Yadav; Marc Reisner

The problem of creating efficient mappings of dataflow graphs onto specific architectures (i.e., solving the place and route problem) is incredibly challenging. The difficulty is especially acute in the area of Coarse-Grained Reconfigurable Architectures (CGRAs) to the extent that solving the mapping problem may remove a significant bottleneck to adoption. We believe that the next generation of mapping algorithms will exhibit pattern recognition, the ability to learn from experience, and identification of creative solutions, all of which are human characteristics. This manuscript describes our game UNTANGLED, developed and fine-tuned over the course of a year to allow us to capture and analyze human mapping strategies. It also describes our results to date. We find that the mapping problem can be crowdsourced very effectively, that players can outperform existing algorithms, and that successful player strategies share many elements in common. Based on our observations and analysis, we make concrete recommendations for future research directions for mapping onto CGRAs.

Microelectronics Systems Education | 2013

UNTANGLED - An interactive mapping game for engineering education

Gayatri Mehta; Xiaozhong Luo; Natalie Parde; Krunalkumar Patel; Brandon Rodgers; Anil Kumar Sistla

Retaining students poses a huge challenge in the field of engineering, as many students become discouraged while working on their degrees and switch majors or leave school entirely. Our key observation is that it is extremely important to introduce students to real-world problems early on in their studies. Too often, students become confused and dissatisfied by abstract theories in their early engineering courses, and fail to see any practical importance to what they are learning. This paper presents the idea of using an interactive game, UNTANGLED, to introduce real-world problems related to chip architecture and design in the early stages of engineering education, thus generating enthusiasm and helping students connect the theories they learn in classes to applicable problems. We believe that doing so will help elevate future engineering student retention rates.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2013

Data-Driven Mapping Using Local Patterns

Gayatri Mehta; Krunal Kumar Patel; Natalie Parde; Nancy S. Pollard

The problem of mapping a data flow graph onto a reconfigurable architecture has been difficult to solve quickly and optimally. Anytime algorithms have the potential to meet both goals by generating a good solution quickly and improving that solution over time, but they have not been shown to be practical for mapping. The key insight of this paper is that mapping algorithms based on search trees can be accelerated using a database of examples of high quality mappings. The depth of the search tree is reduced by placing patterns of nodes rather than single nodes at each level. The branching factor is reduced by placing patterns only in arrangements present in a dictionary constructed from examples. We present two anytime algorithms that make use of patterns and dictionaries: Anytime A* and Anytime Multiline Tree Rollup. We compare these algorithms to simulated annealing and to results from human mappers playing the online game UNTANGLED. The anytime algorithms outperform simulated annealing and the best game players in the majority of cases, and the combined results from all algorithms provide an informative comparison between architecture choices.
