Dinesh C. Suresh
University of California, Riverside
Publication
Featured research published by Dinesh C. Suresh.
Languages, Compilers, and Tools for Embedded Systems | 2003
Dinesh C. Suresh; Walid A. Najjar; Frank Vahid; Jason R. Villarreal; Greg Stitt
Loops constitute the most frequently executed segments of programs and are therefore the best candidates for hardware/software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and support combined function and loop profiling. One tool relies on an instruction-set simulator and can therefore be augmented with simulation of architecture and micro-architecture features, while the other is based on compile-time instrumentation in gcc and therefore incurs very little slowdown compared to the original program. We use the profiling results to identify the compute core of each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. We also study the potential speedup that can be achieved using a configurable system-on-a-chip, consisting of a CPU embedded on an FPGA, as an example application of these tools to hardware/software partitioning.
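As an illustrative aside (not the authors' tools; all loop names and counts below are invented), the loop-centric profiling idea can be sketched as a per-loop iteration counter inserted by instrumentation, from which the dominant "compute core" falls out directly:

```python
# Minimal sketch of loop-centric profiling: count iterations per loop
# and report which loop dominates execution, mimicking what
# compile-time instrumentation would record.
from collections import Counter

loop_counts = Counter()

def profile_loop(name):
    """Hypothetical instrumentation hook: record one iteration of a loop."""
    loop_counts[name] += 1

def blur(pixels):
    for p in pixels:            # hot loop: runs once per pixel
        profile_loop("blur.pixel")

def parse_header(fields):
    for f in fields:            # cold loop: runs a handful of times
        profile_loop("parse.field")

blur(range(10000))
parse_header(range(8))

total = sum(loop_counts.values())
cores = {name: n / total for name, n in loop_counts.items()}
hot = max(cores, key=cores.get)   # candidate loop for hardware mapping
```

Here the `blur.pixel` loop accounts for nearly all iterations, so it would be the candidate for mapping to hardware.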
Design Automation for Embedded Systems | 2002
Jason R. Villarreal; Dinesh C. Suresh; Greg Stitt; Frank Vahid; Walid A. Najjar
We examine the energy and performance benefits that can be obtained by re-mapping frequently executed loops from a microprocessor to reconfigurable logic. We present a design flow that finds critical software loops automatically and manually re-implements them in configurable logic by expressing them in SA-C, a C language variant supporting a dataflow computation model and designed to specify and map DSP applications onto reconfigurable logic. We apply this design flow to several examples from the MediaBench benchmark suite and report the energy and performance improvements.
Applied Reconfigurable Computing | 2006
Dinesh C. Suresh; Zhi Guo; Betul Buyukkurt; Walid A. Najjar
Virus detection at the router level is rapidly gaining in importance. Hardware-based implementations have the advantage of speed and can hence support high throughput. In this paper, we describe an FPGA-based implementation of Bloom-filter virus detection code that is compiled from native C to VHDL and mapped onto a Virtex XC2V8000 FPGA. Our results show that a single engine tailored for handling virus signatures of length eight bytes can achieve a throughput of 18.6 Gbps while occupying only 8% of the FPGA area.
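As a software-side illustration only (the paper's engine is generated hardware; the sizes and signature below are invented), the underlying Bloom-filter scan over 8-byte windows of a payload looks roughly like this:

```python
# Minimal Bloom filter for fixed-length (8-byte) signature matching,
# illustrating the per-offset scan of a payload. A "no" answer is
# definite; a "yes" may be a false positive that needs exact checking.
import hashlib

M = 1 << 16          # bit-array size (hypothetical; hardware sizing differs)
K = 4                # number of hash functions

def _hashes(sig: bytes):
    # Derive K indices from one digest (a common software shortcut).
    d = hashlib.sha256(sig).digest()
    return [int.from_bytes(d[4*i:4*i+4], "big") % M for i in range(K)]

class Bloom:
    def __init__(self):
        self.bits = bytearray(M // 8)
    def add(self, sig: bytes):
        for h in _hashes(sig):
            self.bits[h // 8] |= 1 << (h % 8)
    def maybe_contains(self, sig: bytes) -> bool:
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in _hashes(sig))

bloom = Bloom()
bloom.add(b"EVILSIG1")                 # invented 8-byte virus signature
payload = b"xxEVILSIG1yy"
hits = [i for i in range(len(payload) - 7)
        if bloom.maybe_contains(payload[i:i+8])]
```

Hardware versions parallelize this by checking many offsets per cycle, which is what makes the multi-Gbps throughput possible.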
ACM Transactions on Embedded Computing Systems | 2009
Dinesh C. Suresh; Banit Agrawal; Jun Yang; Walid A. Najjar
Reducing the power consumption of computing devices has gained much attention recently. Much research has focused on reducing power consumption in off-chip buses, as they consume a significant fraction of total power. Since bus power consumption is proportional to switching activity, reducing bus switching is an effective way to reduce bus power. While numerous techniques exist for reducing power in address buses, only a handful of techniques have been proposed for data-bus power reduction, where frequent value encoding (FVE) is the best existing scheme for reducing transition activity on data buses. In this article, we propose improved frequent-value data-bus encoding techniques aimed at further reducing switching activity and, hence, power consumption. We propose three new schemes and five new variations to exploit bit-wise temporal and spatial locality in data-bus values. Our techniques use just one external control signal and capture bit-wise locality to efficiently encode data values. For all the embedded and SPEC applications we tested, the overall average switching reduction is 53% over unencoded data and 10% more than the conventional FVE scheme.
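A toy model (simplified; the published schemes add bit-wise extensions and explicit control signaling, and the table sizing here is invented) shows why frequent-value encoding reduces switching: repeated values become one-hot codes, and consecutive one-hot words differ in at most two bit positions:

```python
# Toy frequent-value encoding on a 32-bit bus: values hitting a small
# MRU table are sent as a one-hot index code; misses go out verbatim.
# Switching activity = Hamming distance between consecutive bus words.
WIDTH = 32

def popcount(x):
    return bin(x).count("1")

def encode(values, table_size=8):
    table = []                        # most-recently-used frequent values
    bus_words = []
    for v in values:
        if v in table:
            idx = table.index(v)
            bus_words.append(1 << idx)        # one-hot code for a hit
            table.remove(v)
            table.insert(0, v)
        else:
            bus_words.append(v)               # miss: raw value on the bus
            table.insert(0, v)
            del table[table_size:]
    return bus_words

def switching(words):
    prev, total = 0, 0
    for w in words:
        total += popcount(prev ^ w)
        prev = w
    return total

stream = [0xDEADBEEF, 0xDEADBEEF, 0x0, 0xDEADBEEF] * 10
saved = switching(stream) - switching(encode(stream))
```

On this synthetic stream the raw bus toggles 504 bit lines in total, versus 85 after encoding; real workloads exhibit the same frequent-value locality to a lesser degree.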
Compilers, Architecture, and Synthesis for Embedded Systems | 2003
Dinesh C. Suresh; Banit Agrawal; Jun Yang; Walid A. Najjar; Laxmi N. Bhuyan
Reducing the power consumption of computing devices has gained much attention recently. Much research has focused on reducing power consumption in off-chip buses, as they consume a significant fraction of total power. Since bus power consumption is proportional to switching activity, reducing bus switching is an effective way to reduce bus power. While numerous techniques exist for reducing power in address buses, only a handful of techniques have been proposed for data-bus power reduction, where Frequent Value Encoding (FVE) is the best existing scheme for reducing transition activity on data buses. In this paper, we propose improved frequent-value data-bus encoding techniques aimed at further reducing switching activity and, hence, power consumption. We propose three new schemes and five new variations to exploit bit-wise temporal and spatial locality in data-bus values. Our technique does not use an additional external control signal and captures bit-wise locality to efficiently encode data values. For all the embedded and SPEC applications we tested, the overall average switching reduction is 53% over unencoded data and 11% more than the conventional FVE scheme.
IEEE Transactions on Computers | 2009
Dinesh C. Suresh; Banit Agrawal; Jun Yang; Walid A. Najjar
Off-chip buses account for a significant portion of total system power in embedded systems, and much research has focused on reducing their power consumption. While numerous techniques exist for reducing power in address buses, only a handful have been proposed for off-chip data-bus power reduction. In this paper, we propose two novel data-bus encoding schemes to reduce power consumption in data buses. The first scheme, the Variable-Length Value Encoder (VALVE), is capable of detecting and encoding variable lengths of repeated bit patterns in the data. The second, the Tunable Bus Encoder (TUBE), encodes repetition in contiguous as well as noncontiguous bit positions of data values. Both schemes require just one external control signal to encode data values. TUBE is the first proposed hardware-based bus encoding scheme capable of detecting and encoding both contiguous and noncontiguous bit patterns of varying widths. Experimental evaluation on a large set of benchmarks shows an energy reduction of 58 percent and 60 percent on average for VALVE and TUBE, respectively. We evaluate the performance penalty incurred due to codec delay and find it to be 0.45 percent of total program execution time. We also quantify the hardware overhead in terms of area, delay, and energy consumption: in 0.18 μm technology, VALVE and TUBE require a modest area of 0.0486 mm² and 0.0521 mm², respectively.
International Conference on Computer Design | 2005
Dinesh C. Suresh; Banit Agrawal; Walid A. Najjar; Jun Yang
We propose the Variable-Length Value Encoding (VALVE) technique to reduce power consumption in off-chip data buses. While past research has focused on encoding fixed-length data values to reduce transition activity on data buses, our proposed scheme is capable of detecting and encoding variable-length bit patterns in data values. VALVE requires no prior knowledge of the input data and uses just one external control signal. We evaluate the proposed scheme on a large spectrum of benchmarks and achieve an energy reduction of 58% on average, and up to 75% on some benchmarks. We also analyze the performance penalty incurred due to codec delay, which is found to be 0.45% of total program execution time. We find that VALVE requires a minimal area of 0.0486 mm², which can easily be implemented within a memory controller.
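A round-trip sketch of the variable-length idea (not the published codec; code assignment and the single control signal in the real scheme are more involved, and the addresses below are invented): when only the low-order bits of a value differ from the previous bus word, it suffices to transmit that variable-length suffix, leaving the high-order bus lines untouched so they do not switch.

```python
# Encode each value as (k, suffix): k low-order bits that changed
# relative to the previous word, plus those bits themselves. The
# decoder keeps the untouched high-order bits from its previous word.
def encode(values):
    pairs, prev = [], 0
    for v in values:
        k = (prev ^ v).bit_length()            # changed low-order bits
        pairs.append((k, v & ((1 << k) - 1)))  # send only the suffix
        prev = v
    return pairs

def decode(pairs):
    values, prev = [], 0
    for k, suffix in pairs:
        prev = (prev & ~((1 << k) - 1)) | suffix
        values.append(prev)
    return values

# Four addresses sharing a 28-bit prefix: after the first word, at most
# two low-order bits are ever re-driven.
stream = [0x1000_0001, 0x1000_0002, 0x1000_0003, 0x1000_0000]
pairs = encode(stream)
```

Address-like data streams show exactly this pattern, which is why variable-length matching beats fixed-length value tables on them.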
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2005
Dinesh C. Suresh; Walid A. Najjar; Jun Yang
Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places computation-intensive loops in a scratch-pad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution switches dynamically between the instructions in the traditional instruction cache and those in the K-store via inserted jump instructions. The necessary jump instructions add 0.038% on average to the total dynamic instruction count. We compare the performance and energy consumption of the K-store with those of a conventional instruction cache of equal size. When used in lieu of an 8KB, 4-way set-associative instruction cache, K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, K-store maps the frequent code into a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.
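A back-of-envelope model (all per-access energies and the hot-loop fraction below are assumptions for illustration, not figures from the paper) shows why fetching hot loops from a tag-less scratch-pad saves energy even after accounting for the inserted jumps:

```python
# Scratch-pad SRAM needs no tag comparison or way selection, so a fetch
# there costs less than a set-associative cache access. Illustrative
# relative energies only.
E_CACHE = 1.0      # relative energy per 4-way cache fetch (assumed)
E_SPM   = 0.5      # relative energy per scratch-pad fetch (assumed)

def fetch_energy(n_fetches, hot_fraction, jump_overhead=0.00038):
    """Total fetch energy with hot loops in the scratch-pad."""
    hot = n_fetches * hot_fraction          # fetches served by scratch-pad
    cold = n_fetches - hot                  # fetches served by the cache
    jumps = n_fetches * jump_overhead       # extra inserted jump fetches
    return hot * E_SPM + cold * E_CACHE + jumps * E_CACHE

baseline = 1_000_000 * E_CACHE
with_kstore = fetch_energy(1_000_000, hot_fraction=0.7)
saving = 1 - with_kstore / baseline
```

With these assumed numbers the saving is about 35%; the jump overhead term is negligible, matching the intuition that a 0.038% instruction-count increase barely dents the benefit.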
IEEE International Conference on High Performance Computing, Data, and Analytics | 2003
Dinesh C. Suresh; Jun Yang; Chuanjun Zhang; Banit Agrawal; Walid A. Najjar
Power consumption has become an important issue for modern processors, and the off-chip buses consume a considerable amount of total power [9,7]. One effective way to reduce power is to reduce overall bus switching activity, since power is proportional to it. To date, the most effective technique for reducing switching activity on data buses is Frequent Value Encoding (FVE), which exploits the abundant frequent-value locality on off-chip data buses. In this paper, we propose a technique that exploits value locality overlooked by FVE. We found that a significant number of non-frequent values, not captured by FVE, share common high-order bits. We therefore propose to extend the current FVE scheme to take bit-wise frequent values into consideration. On average, our technique reduces switching activity by 48%. The average energy saving we achieve is 44.8%, which is 8% better than FVE.
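The observation can be made concrete with a small synthetic example (table contents and values are invented; this only illustrates the locality being measured, not the encoding itself): many values that miss a whole-word frequent-value table still share their high-order bits with a table entry, so treating the halves independently captures extra locality.

```python
# Count whole-word table hits versus misses whose upper 16 bits still
# match a table entry, on a pointer-like value stream.
TABLE = [0x0804_9000, 0x0000_0010]      # hypothetical frequent values

def half_hits(values, table):
    whole = upper = 0
    uppers = {v >> 16 for v in table}
    for v in values:
        if v in table:
            whole += 1
        elif (v >> 16) in uppers:
            upper += 1                   # miss whose upper half still hits
    return whole, upper

# Pointers into one region: the upper 16 bits are all 0x0804.
stream = [0x0804_9000, 0x0804_9004, 0x0804_9008, 0x0000_0010, 0x0804_900C]
whole, upper = half_hits(stream, TABLE)
```

In this stream more values hit only in their upper half than hit whole-word, which is exactly the locality a bit-wise extension of FVE can convert into reduced switching.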
Computer Science - Research and Development | 2010
Dinesh C. Suresh; Roy Ju; Michael Lai; Mei Ye
In a multi-core system, while the processor core pipelines and local caches are replicated in each core, other resources, such as shared cache and memory bus, are shared across all cores in a processor. Running multiple copies of memory intensive applications on different cores often leads to poor scaling, because these shared resources can become bottlenecks to throughput performance. Such issues have been traditionally studied under the design and evaluation of processors, platforms, and operating systems. We have identified a set of compiler optimizations that have measurable impact on the scaling of applications on multi-core systems and evaluated them based on the standard rate run of the SPEC CPU2006 benchmark suite, where throughput is measured by running multiple copies of a program in a multi-core and multi-processor system. We have also collected data and analyzed how these compiler optimizations affect the utilization and behaviors of the shared resources. Through our experimental results, we show that conventional compiler optimizations can play an important role in improving the scaling of running memory intensive application threads on multi-core systems.