Srinivas Katkoori | Researchain

Publication

Featured research published by Srinivas Katkoori.

IEEE Transactions on Evolutionary Computation | 2006

A genetic algorithm for the design space exploration of datapaths during high-level synthesis

Vyas Krishnan; Srinivas Katkoori

High-level synthesis comprises interdependent tasks such as scheduling, allocation, and module selection. For today's very large-scale integration (VLSI) designs, the cost of solving the combined scheduling, allocation, and module selection problem by exhaustive search is prohibitive. However, to meet design objectives, an extensive design space exploration is often critical to obtaining superior designs. We present a framework for efficient design space exploration during high-level synthesis of datapaths for data-dominated applications. The framework uses a genetic algorithm (GA) to concurrently perform scheduling and allocation with the aim of finding schedules and module combinations that lead to superior designs while considering user-specified latency and area constraints. The GA uses a multichromosome representation to encode datapath schedules and module allocations and efficient heuristics to minimize functional and storage area costs, while minimizing circuit latencies. The framework provides the flexibility to perform resource-constrained scheduling, time-constrained scheduling, or a combination of the two, using a simple and fast list-scheduling technique. A graded penalty function is used as an objective function in evaluating the quality of designs to enable the GA to quickly reach areas of the search space where designs meeting user-specified criteria are most likely to be found. Since GAs are population-based search heuristics, a unique feature of our framework is its ability to offer a large number of alternative datapath designs, all of which meet design specifications but differ in module, register, and interconnect configurations. Many experiments on well-known benchmarks show the effectiveness of our approach.
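
The resource-constrained list scheduling mentioned above can be illustrated with a minimal sketch. It is not the paper's scheduler: the function name `list_schedule`, the single-step operation delay, and the use of dictionary order as the priority list are all assumptions made for illustration.

```python
def list_schedule(ops, deps, resources):
    """Resource-constrained list scheduling (illustrative sketch).

    ops:       {op name: resource type}, in priority order (assumed)
    deps:      {op name: [predecessor op names]}
    resources: {resource type: number of available units}, each count >= 1
    Returns {op name: control step}; every op is assumed to take one step.
    """
    schedule = {}
    step = 0
    while len(schedule) < len(ops):
        step += 1
        free = dict(resources)  # units available in this control step
        # Ready ops: unscheduled, with all predecessors done in earlier steps.
        ready = [o for o in ops if o not in schedule
                 and all(schedule.get(p, step) < step for p in deps.get(o, []))]
        for o in ready:
            if free.get(ops[o], 0) > 0:
                free[ops[o]] -= 1
                schedule[o] = step
    return schedule


ops = {"a": "mul", "b": "mul", "c": "add"}
deps = {"c": ["a", "b"]}
print(list_schedule(ops, deps, {"mul": 1, "add": 1}))  # {'a': 1, 'b': 2, 'c': 3}
```

With two multipliers instead of one, `a` and `b` share step 1 and `c` moves to step 2, which is the kind of area/latency trade-off a design space exploration would weigh.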

IEEE Design & Test of Computers | 1995

Profile-driven behavioral synthesis for low-power VLSI systems

Nand Kumar; Srinivas Katkoori; Leo Rader; Ranga Vemuri

We present a profile-driven approach to behavior-level synthesis. In this approach, event activities related to various operations and carriers in the behavioral specification are measured by simulating the description using user-supplied profiling stimuli. These event activities are then used during the synthesis process to estimate the switching activity in the design being synthesized. Overall switching activity estimation is based on modulating the average intrinsic switching activities of the synthesis library modules using the event activities. This estimate is used to select a module set and a schedule which, besides meeting the area and clock-speed constraints, would minimize the switching activity in the design. Experimental results for a number of examples show that the switching activity estimated during synthesis deviates by less than 10% on average from the actual switching activity measured after completing synthesis. The same profile-driven approach is applied to estimate the total amount of capacitance that would switch in the design when the given stimuli are applied. Again, experimental results show that, on average, the estimated switched capacitance deviates from the actual measured value by about 12%.
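
One possible reading of "modulating" intrinsic activities by event activities is a simple product summed over operations. The sketch below assumes exactly that; the abstract does not give the formula, and the function name, module names, and numbers are hypothetical.

```python
def estimate_switching(binding, intrinsic, event_activity):
    """Estimate total switching activity (illustrative, assumed model).

    binding:        {operation: library module it is bound to}
    intrinsic:      {module: average intrinsic switching activity}
    event_activity: {operation: profiled event activity, e.g. in [0, 1]}
    Assumption: each module's intrinsic activity is scaled by the event
    activity of the operation bound to it, and the results are summed.
    """
    total = 0.0
    for op, module in binding.items():
        total += intrinsic[module] * event_activity[op]
    return total


# Hypothetical library data and profile results.
intrinsic = {"add8": 40.0, "mul8": 300.0}
events = {"op1": 0.5, "op2": 0.25}
print(estimate_switching({"op1": "add8", "op2": "mul8"}, intrinsic, events))  # 95.0
```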

IEEE Transactions on Evolutionary Computation | 2010

Customizable FPGA IP Core Implementation of a General-Purpose Genetic Algorithm Engine

Pradeep Fernando; Srinivas Katkoori; Didier Keymeulen; Ricardo Salem Zebulum; Adrian Stoica

Hardware implementation of genetic algorithms (GAs) is gaining importance, as genetic algorithms can be effectively used as an optimization engine for real-time applications (e.g., evolvable hardware). In this work, we report the design of an IP core that implements a general-purpose GA engine which has been successfully synthesized and verified on a Xilinx Virtex II Pro FPGA device (XC2VP30). The placed and routed IP core has an area utilization of only 16% and a clock period of 2.2 ns (~450 MHz). The GA core can be customized in terms of the population size, number of generations, cross-over and mutation rates, and the random number generator seed. The GA engine can be tailored to a given application by interfacing with the application-specific fitness evaluation module as well as the required storage memory (to store the current and new populations). The core is soft in nature, i.e., a gate-level netlist is provided which can be readily integrated with the user's system.
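
A software analogue may help show how the listed parameters interact with a pluggable fitness module. This is only a sketch, not the IP core's design: `run_ga`, the 16-bit chromosome, binary tournament selection, single-point crossover, and the default values are all assumptions.

```python
import random


def run_ga(fitness, chrom_bits=16, pop_size=32, generations=64,
           crossover_rate=0.75, mutation_rate=0.05, seed=1):
    """Tiny GA over fixed-width bit strings stored as ints (illustrative)."""
    rng = random.Random(seed)  # the seed makes a run reproducible
    pop = [rng.getrandbits(chrom_bits) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # Binary tournament selection (assumed; not named in the abstract).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1
            if rng.random() < crossover_rate:
                # Single-point crossover: high bits from p1, low bits from p2.
                mask = (1 << rng.randrange(1, chrom_bits)) - 1
                child = (p1 & ~mask) | (p2 & mask)
            if rng.random() < mutation_rate:
                child ^= 1 << rng.randrange(chrom_bits)  # flip one bit
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)  # keep the best seen so far
    return best


def ones(x):
    """Example fitness module: count of set bits."""
    return bin(x).count("1")


print(run_ga(ones, seed=3))
```

The fitness function is passed in as an argument, mirroring the way the core is described as interfacing with an application-specific fitness evaluation module.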

IEEE Transactions on Very Large Scale Integration Systems | 2009

A Framework for Power-Gating Functional Units in Embedded Microprocessors

Soumyaroop Roy; Nagarajan Ranganathan; Srinivas Katkoori

Power gating is a technique commonly used for leakage reduction in integrated circuits. In microprocessors, power gating is implemented by using sleep transistors to selectively deactivate circuit modules that remain idle for sustained periods of time during program execution. In this work, we develop a new framework for power gating the functional units in embedded system microprocessors without degradation in performance. The proposed framework includes an efficient algorithm for idle time estimation, appropriate insertion of sleep instructions within the code, and a method for reactivating the sleeping units only when needed without the use of wakeup instructions. We introduce the notion of loop hierarchy trees (LHTs) to represent the partial ordering of the nested loops within the program. From the control flow graph (CFG) representation of the source program, a forest of LHTs is constructed and is used to identify the maximal subgraphs representing the long idle periods for the functional units. For each subgraph thus identified, a sleep instruction is introduced in the program with a list of corresponding functional units to be deactivated. When an instruction is decoded, the functional units needed for that instruction are automatically activated by the control unit such that the units are ready before the instruction reaches the execute stage. This eliminates the need for wakeup instructions to be inserted into the object code reducing the overheads. In our implementation, the ARM processor architecture was modified and resynthesized to include power gating by developing a CMOS cell library of functional units with the above capabilities. Experimental results are reported for a set of 12 benchmarks chosen from the MiBench suite, which indicate that, on average, our technique reduces the leakage energy in functional units by 31.1% for integer benchmarks and 26.8% for floating-point benchmarks.

International Conference on VLSI Design | 2003

Resource allocation and binding approach for low leakage power

Chandramouli Gopalakrishnan; Srinivas Katkoori

We propose a leakage power minimization approach based on multi-threshold CMOS (MTCMOS) technology. A clique partitioning-based resource allocation and binding algorithm is presented, which maximizes the idle periods of modules in the data-path. Modules with significant idle times are selectively bound to MTCMOS instances. We developed a parameterizable MTCMOS component library, characterized with respect to sleep transistor width. Using this characterization, the leakage power-delay trade-off is analyzed and optimal sleep transistor widths are identified. For three well-known HLS benchmarks, we obtain an average leakage power reduction of 22.44%. The main disadvantage of MTCMOS technology is performance degradation. We present a performance recovery technique based on multi-cycling and introduction of slack. With this technique, the performance penalty reduces to as low as 14.28%. We obtain an average leakage power reduction of 17.46% after performance recovery. The average area overhead incurred due to the introduction of MTCMOS modules is 10.21%. Results are presented for 0.18 μm CMOS technology.
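
For readers unfamiliar with clique partitioning in binding, a generic greedy version can be sketched as follows. It is not the paper's algorithm (which targets maximal idle periods); the compatibility rule (same operation type, different control step) and the name `greedy_binding` are assumptions.

```python
def greedy_binding(ops):
    """Greedy clique partitioning of a compatibility graph (illustrative).

    ops: {operation name: (operation type, control step)}
    Two operations are assumed compatible, i.e. able to share one module,
    when they have the same type and run in different control steps.
    Returns a list of groups; each group is bound to one module.
    """
    cliques = []
    for name in sorted(ops, key=lambda n: ops[n][1]):  # visit by control step
        kind, step = ops[name]
        for clique in cliques:
            # Join the first group whose members are all compatible.
            if all(ops[m][0] == kind and ops[m][1] != step for m in clique):
                clique.append(name)
                break
        else:
            cliques.append([name])  # no compatible group: new module
    return cliques


ops = {"a": ("mul", 1), "b": ("mul", 1), "c": ("mul", 2), "d": ("add", 2)}
print(greedy_binding(ops))  # [['a', 'c'], ['b'], ['d']]
```

In this toy result the module holding only `b` is idle in step 2, which is the kind of idle period a leakage-aware binder would try to lengthen and assign to an MTCMOS instance.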

ACM Transactions on Design Automation of Electronic Systems | 2004

Power minimization algorithms for LUT-based FPGA technology mapping

Hao Li; Srinivas Katkoori; Wai-Kei Mak

We study the technology mapping problem for LUT-based FPGAs targeting power minimization. The problem has previously been proved to be NP-hard. Therefore, we present an efficient heuristic algorithm to generate low-power mapping solutions. The key idea is to compute and select low-power K-feasible cuts by an efficient incremental network flow computation method. Experimental results show that our algorithm reduces power consumption as well as area over the best algorithms reported in the literature. In addition, we present an extension to compute depth-optimal low-power mappings. Compared with Cutmap, a depth-optimal mapper with simultaneous area minimization, we achieve a 14% power savings on average without any depth penalty.
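
To make the notion of a K-feasible cut concrete, the sketch below enumerates cuts bottom-up on a small network. Note that this is plain cut enumeration, swapped in for illustration; the paper computes cuts with an incremental network flow method, which is not shown here. The name `enumerate_cuts` and the example network are hypothetical.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Enumerate K-feasible cuts of every node (illustrative sketch).

    fanins: {node: [fanin nodes]}, listed in topological order;
            primary inputs map to an empty list.
    Returns {node: set of frozensets}; each frozenset has at most k nodes
    (the trivial cut {node} is always included).
    """
    cuts = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}  # trivial cut
        if ins:
            # Combine one cut from each fanin and keep the small unions.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts


net = {"a": [], "b": [], "c": [], "g": ["a", "b"], "h": ["g", "c"]}
print(enumerate_cuts(net, 2)["h"])  # contains {'h'} and {'g', 'c'}
```

With `k=3` the cut `{a, b, c}` also becomes feasible for `h`, meaning `g` and `h` could be covered by a single 3-input LUT.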

IEEE Transactions on Very Large Scale Integration Systems | 2004

Ant colony system application to macrocell overlap removal

Stelian Alupoaei; Srinivas Katkoori

We present a novel macrocell overlap removal algorithm, based on the ant colony optimization metaheuristic. The procedure generates a feasible placement from a relative placement with overlaps produced by some placement algorithms such as quadratic programming and force-directed placement. It uses the concept of ant colonies, a set of agents that work together to improve an existing solution. Each ant in the colony will generate a placement based on the relative positions of the cells and feedback information about the best placements generated by previous colonies. The solution of each ant is improved by using a local optimization procedure which reduces the unused space. The worst-case runtime is O(n^3), but the average runtime can be reduced to O(n^2).

IEEE Transactions on Computers | 2011

State-Retentive Power Gating of Register Files in Multicore Processors Featuring Multithreaded In-Order Cores

Soumyaroop Roy; Nagarajan Ranganathan; Srinivas Katkoori

In this work, we investigate state-retentive power gating of register files for leakage reduction in multicore processors supporting multithreading. In an in-order core, when a thread gets blocked due to a memory stall, the corresponding register file can be placed in a low leakage state through power gating for leakage reduction. When the memory stall gets resolved, the register file is activated for being accessed again. Since the contents of the register file are not lost and restored on wakeup, this is referred to as state-retentive power gating of register files. While state-retentive power gating in single cores has been studied in the literature, it is being investigated for multicore architectures for the first time in this work. We propose specific techniques to implement state-retentive power gating for three different multicore processor configurations based on the multithreading model: 1) coarse-grained multithreading, 2) fine-grained multithreading, and 3) simultaneous multithreading. The proposed techniques can be implemented as design extensions within the control units of the in-order cores. Each technique uses two different modes of leakage states: low leakage savings with low wake-up latency, and high leakage savings with high wake-up latency. The overhead due to wake-up latency is completely avoided in two techniques while it is hidden for the most part in the third approach, either by overlapping the wake-up process with the thread context switching latency or by executing instructions from other threads ready for execution. The proposed techniques were evaluated through simulations with multiprogrammed workloads comprised of SPEC 2000 integer benchmarks. Experimental results show that in an 8-core processor executing 64 threads, the average leakage savings were 42 percent in coarse-grained multithreading, while they were between seven percent and eight percent for fine-grained and simultaneous multithreading.

IEEE Transactions on Very Large Scale Integration Systems | 2001

LUT-based FPGA technology mapping for power minimization with optimal depth

Hao Li; Wai-Kei Mak; Srinivas Katkoori

In this paper, we study the technology mapping problem for LUT-based FPGAs targeting power minimization. We present the PowerMap algorithm to generate a mapping solution to minimize power consumption while keeping the delay optimal. We compute min-height K-feasible cuts for critical nodes to optimize the depth and compute min-weight K-feasible cuts for noncritical nodes to minimize the power consumption of the mapping solution. We have implemented PowerMap in C and tested it on a number of MCNC benchmark circuits. Compared to FlowMap, a delay-optimal mapper, our algorithm reduces the power consumption by 17.8% and uses 9.4% fewer LUTs without any depth penalty.

IEEE Computer Society Annual Symposium on VLSI | 2002

Force-directed scheduling for dynamic power optimization

Suvodeep Gupta; Srinivas Katkoori

We present a latency-constrained scheduling algorithm to optimize a design for dynamic power. Usage of forces to model power is motivated by the force-directed scheduling (FDS) heuristic proposed by Paulin and Knight (1989). Given a dataflow graph (DFG) and an input data environment, we profile the DFG with representative data streams. Our algorithm reduces dynamic power by reducing switched capacitance inside resources. The switched capacitance of combinations among DFG operations, which could share a resource, and the probability of selecting such a combination, are evaluated. Switched capacitance inside a module is modeled as the spring constant k and the probability of selecting the corresponding combination is modeled as the displacement x, in the force equation F=kx. Thus, a force is associated with each feasible combination corresponding to its power cost. Due to numerous possibilities, we obtain a distribution of forces whose mean, standard deviation, and skew are used to make a power-optimal scheduling decision. Compared to original FDS, our algorithm shows average power savings of 16.4% for the same throughput at the cost of a nominal area overhead.
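
The force model F=kx and the summary statistics of the resulting distribution can be sketched as below. How the paper turns these statistics into a scheduling decision is not stated in the abstract, so that step is omitted; `force_stats`, the population (rather than sample) statistics, and the input values are assumptions.

```python
import statistics


def force_stats(combinations):
    """Compute forces F = k * x and their distribution statistics.

    combinations: list of (k, x) pairs, where k is the switched capacitance
                  of a feasible combination and x is the probability of
                  selecting it.
    Returns (forces, mean, standard deviation, skew), using population
    statistics (an assumption).
    """
    forces = [k * x for k, x in combinations]
    mean = statistics.mean(forces)
    std = statistics.pstdev(forces)
    # Skewness: third central moment divided by the cube of the std. dev.
    skew = 0.0 if std == 0 else (
        sum((f - mean) ** 3 for f in forces) / (len(forces) * std ** 3))
    return forces, mean, std, skew


# Hypothetical capacitances with equal selection probabilities.
forces, mean, std, skew = force_stats([(2.0, 0.5), (4.0, 0.5), (6.0, 0.5)])
print(forces, mean)  # [1.0, 2.0, 3.0] 2.0
```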