Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Srimat T. Chakradhar is active.

Publication


Featured research published by Srimat T. Chakradhar.


Architectural Support for Programming Languages and Operating Systems | 2012

Tarazu: optimizing MapReduce on heterogeneous clusters

Faraz Ahmad; Srimat T. Chakradhar; Anand Raghunathan; T. N. Vijaykumar

Data center-scale clusters are evolving towards heterogeneous hardware for power, cost, differentiated price-performance, and other reasons. MapReduce is a well-known programming model for processing large amounts of data on data center-scale clusters. Most MapReduce implementations have been designed and optimized for homogeneous clusters. Unfortunately, these implementations perform poorly on heterogeneous clusters (e.g., on a 90-node cluster that contains 10 Xeon-based servers and 80 Atom-based servers, Hadoop performs worse than on 10-node Xeon-only or 80-node Atom-only homogeneous sub-clusters for many of our benchmarks). This poor performance remains despite previously proposed optimizations related to management of straggler tasks. In this paper, we address MapReduce's poor performance on heterogeneous clusters. Our first contribution is to show that the poor performance is due to two key factors: (1) the non-intuitive effect that MapReduce's built-in load balancing results in excessive and bursty network communication during the Map phase, and (2) the intuitive effect that the heterogeneity amplifies load imbalance in the Reduce computation. Our second contribution is Tarazu, a suite of optimizations to improve MapReduce performance on heterogeneous clusters. Tarazu consists of (1) Communication-Aware Load Balancing of Map computation (CALB) across the nodes, (2) Communication-Aware Scheduling of Map computation (CAS) to avoid bursty network traffic, and (3) Predictive Load Balancing of Reduce computation (PLB) across the nodes. Using the above 90-node cluster, we show that Tarazu significantly improves performance over a baseline of Hadoop with straightforward tuning for hardware heterogeneity.
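
The load-balancing intent behind CALB can be conveyed with a small sketch. This is not Tarazu's code; the function name and the proportional-split heuristic are illustrative assumptions. The idea is simply to assign Map work in proportion to each node's measured throughput so that fast nodes do not pull in so much remote work that the shuffle network becomes the bottleneck.

```python
# Hypothetical sketch of communication-aware Map load balancing:
# place Map tasks in proportion to each node's observed Map throughput.

def split_map_tasks(num_tasks, node_throughputs):
    """Return the number of Map tasks to place on each node,
    proportional to its observed tasks-per-second throughput."""
    total = sum(node_throughputs.values())
    shares = {n: t / total for n, t in node_throughputs.items()}
    alloc = {n: int(num_tasks * s) for n, s in shares.items()}
    # Hand out any rounding remainder to the fastest nodes first.
    leftover = num_tasks - sum(alloc.values())
    for n in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        alloc[n] += 1
    return alloc

# Example: 90 tasks over one Xeon-class node and two Atom-class nodes.
print(split_map_tasks(90, {"xeon-1": 8.0, "atom-1": 1.0, "atom-2": 1.0}))
```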


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 1993

A transitive closure algorithm for test generation

Srimat T. Chakradhar; Vishwani D. Agrawal; Steven G. Rothweiler

A transitive-closure-based test generation algorithm is presented. A test is obtained by determining signal values that satisfy a Boolean equation derived from the neural network model of the circuit incorporating necessary conditions for fault activation and path sensitization. The algorithm is a sequence of two main steps that are repeatedly executed: transitive closure computation and decision-making. A key feature of the algorithm is that dependences derived from the transitive closure are used to reduce ternary relations to binary relations that in turn dynamically update the transitive closure. The signals are either determined from the transitive closure or are enumerated until the Boolean equation is satisfied. Experimental results on the ISCAS 1985 and the combinational parts of ISCAS 1989 benchmark circuits are presented to demonstrate efficient test generation and redundancy identification. Results on four state-of-the-art production VLSI circuits are also presented.
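
To make the central operation concrete, here is a plain transitive-closure computation over an implication graph of signal assignments. It is an illustrative sketch, not the paper's implementation: the graph encoding and the Floyd-Warshall-style closure are my own simplifications of the idea that chains of implications expose forced values and contradictions.

```python
# Illustrative sketch: closure of an implication graph over signal assignments.

def transitive_closure(reach):
    """reach[i][j] is True if assignment i implies assignment j.
    Returns the closure: i implies j via any chain of implications."""
    n = len(reach)
    closure = [row[:] for row in reach]
    for k in range(n):
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = True
    return closure

# Example: a=1 implies b=1, and b=1 implies c=0, so a=1 implies c=0.
g = [[False, True, False],
     [False, False, True],
     [False, False, False]]
print(transitive_closure(g)[0][2])  # True
```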


International Conference on VLSI Design | 2004

Tamper resistance mechanisms for secure embedded systems

Srivaths Ravi; Anand Raghunathan; Srimat T. Chakradhar

Security is a concern in the design of a wide range of embedded systems. Extensive research has been devoted to the development of cryptographic algorithms that provide the theoretical underpinnings of information security. Functional security mechanisms, such as security protocols, suitably employ these mathematical primitives in order to achieve the desired security objectives. However, functional security mechanisms alone cannot ensure security, since most embedded systems present attackers with an abundance of opportunities to observe or interfere with their implementation, and hence to compromise their theoretical strength. This paper surveys various tamper or attack techniques, and explains how they can be used to undermine or weaken security functions in embedded systems. Tamper-resistant design refers to the process of designing a system architecture and implementation that is resistant to such attacks. We outline approaches that have been proposed to design tamper-resistant embedded systems, with examples drawn from recent commercial products.


International Conference on VLSI Design | 2004

On-chip networks: a scalable, communication-centric embedded system design paradigm

Jörg Henkel; Wayne H. Wolf; Srimat T. Chakradhar

As chip complexity grows, a design productivity boost is expected from reuse of large parts and blocks of previous designs, with the design effort largely invested in the new parts. More and more processor cores and large, reusable components are being integrated on a single silicon die, but reuse of the communication infrastructure has been difficult. Buses and point-to-point connections, which have been the main means of connecting components on a chip to date, will not result in a scalable platform architecture for the billion-transistor chip era. Buses can cost-efficiently connect a few tens of components; point-to-point connections between communication partners are practical for even fewer. As more and more components are integrated on a single silicon die, performance bottlenecks of long, global wires preclude reuse of buses. Therefore, scalable on-chip communication infrastructure is playing an increasingly dominant role in system-on-chip designs. With the super-abundance of cheap, function-specific IP cores, design effort will focus on the weakest link: efficient on-chip communication. Future on-chip communication infrastructure will overcome the limits of bus-based systems by providing higher bandwidth and higher flexibility and by solving the clock skew problem on large chips. It may, however, present new problems: higher power consumption of the communication infrastructure and harder-to-predict performance patterns. Solutions to these problems may result in a complete overhaul of SoC design methodologies into a communication-centric design style. The envisioning of upcoming problems and possible benefits has led to intensified research in the field of what are called NoCs: networks on chips. The term NoC is used in a broad sense, encompassing the hardware communication infrastructure, the middleware and operating system communication services, and a design methodology and tools to map applications onto a network on chip. This paper discusses trends in system-on-chip design, critiques problems and opportunities of the NoC paradigm, summarizes research activities, and outlines several directions for future research.
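
A back-of-the-envelope count of interconnect links illustrates the scaling argument made above. The numbers and the rough link models are mine, not the paper's; they only show why full point-to-point wiring stops scaling first while a mesh-style NoC grows linearly with the number of components.

```python
# Rough link-count comparison (illustrative assumptions, not from the paper):
# full point-to-point needs a link per component pair, a shared bus needs one
# tap per component, and a 2D mesh NoC needs roughly two links per router.
for n in (8, 32, 128):
    p2p = n * (n - 1) // 2   # one dedicated link per pair of components
    bus = n                  # one tap per component on a shared medium
    mesh = 2 * n             # ~2 links per router in a 2D mesh (ignoring edges)
    print(f"{n:4d} cores: p2p={p2p:5d}  bus={bus:4d}  mesh~{mesh:4d}")
```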


Design Automation Conference | 2013

Analysis and characterization of inherent application resilience for approximate computing

Vinay K. Chippa; Srimat T. Chakradhar; Kaushik Roy; Anand Raghunathan

Approximate computing is an emerging design paradigm that enables highly efficient hardware and software implementations by exploiting the inherent resilience of applications to inexactness in their computations. Previous work in this area has demonstrated the potential for significant energy and performance improvements, but largely consists of ad hoc techniques that have been applied to a small number of applications. Taking approximate computing closer to mainstream adoption requires (i) a deeper understanding of inherent application resilience across a broader range of applications, (ii) tools that can quantitatively establish the inherent resilience of an application, and (iii) methods to quickly assess the potential of various approximate computing techniques for a given application. We make two key contributions in this direction. Our primary contribution is the analysis and characterization of inherent application resilience present in a suite of 12 widely used applications from the domains of recognition, data mining, and search. Based on this analysis, we present several new insights into the nature of resilience and its relationship to various key application characteristics. To facilitate our analysis, we propose a systematic framework for Application Resilience Characterization (ARC) that (a) partitions an application into resilient and sensitive parts and (b) characterizes the resilient parts using approximation models that abstract a wide range of approximate computing techniques. We believe that the key insights we present can help shape further research in the area of approximate computing, while automatic resilience characterization frameworks such as ARC can greatly aid designers in the adoption of approximate computing.
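
The characterization step can be pictured with a minimal sketch, under my own assumptions rather than ARC's actual framework: wrap a resilient kernel in a simple error-injection model of increasing aggressiveness and record how application output quality degrades.

```python
# Toy resilience-characterization sketch: sweep an approximation model's
# aggressiveness over a kernel and measure the resulting quality loss.
import random

def approximate(kernel, error_rate):
    """Wrap a numeric kernel so each result is perturbed with probability
    error_rate -- a crude stand-in for an approximation model."""
    def wrapped(x):
        y = kernel(x)
        if random.random() < error_rate:
            y *= 1.0 + random.uniform(-0.1, 0.1)   # inject bounded error
        return y
    return wrapped

def characterize(kernel, inputs, quality, rates=(0.0, 0.05, 0.1, 0.2)):
    """Return (error_rate, quality_loss) pairs for a resilient kernel."""
    exact = [kernel(x) for x in inputs]
    profile = []
    for r in rates:
        approx_kernel = approximate(kernel, r)
        approx = [approx_kernel(x) for x in inputs]
        profile.append((r, quality(exact, approx)))
    return profile

# Example with a toy kernel and a mean relative-error quality metric.
rel_err = lambda a, b: sum(abs(x - y) / (abs(x) + 1e-9) for x, y in zip(a, b)) / len(a)
print(characterize(lambda x: x * x, [float(i) for i in range(1, 100)], rel_err))
```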


International Symposium on Computer Architecture | 2010

A dynamically configurable coprocessor for convolutional neural networks

Srimat T. Chakradhar; Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi

Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition, and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNNs. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications, and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on the fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates a high-level network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled, and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
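
The three kinds of parallelism named in the abstract map onto distinct loop levels of a convolutional layer, which a toy reference implementation makes visible. This is not the coprocessor's microprogram; the layer shapes and naming are illustrative only.

```python
# Toy convolutional layer showing where each type of parallelism lives.
import numpy as np

def conv_layer(inputs, weights):
    """inputs: (in_feat, H, W); weights: (out_feat, in_feat, K, K)."""
    out_feat, in_feat, K, _ = weights.shape
    H, W = inputs.shape[1] - K + 1, inputs.shape[2] - K + 1
    out = np.zeros((out_feat, H, W))
    for o in range(out_feat):                 # inter-output parallelism
        for i in range(in_feat):              # intra-output parallelism
            for y in range(H):
                for x in range(W):            # parallelism within one convolution
                    window = inputs[i, y:y + K, x:x + K]
                    out[o, y, x] += np.sum(window * weights[o, i])
    return out

print(conv_layer(np.ones((2, 6, 6)), np.ones((3, 2, 3, 3))).shape)  # (3, 4, 4)
```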


International Symposium on Microarchitecture | 2013

Quality programmable vector processors for approximate computing

Swagath Venkataramani; Vinay K. Chippa; Srimat T. Chakradhar; Kaushik Roy; Anand Raghunathan

Approximate computing leverages the intrinsic resilience of applications to inexactness in their computations, to achieve a desirable trade-off between efficiency (performance or energy) and acceptable quality of results. To broaden the applicability of approximate computing, we propose quality programmable processors, in which the notion of quality is explicitly codified in the HW/SW interface, i.e., the instruction set. The ISA of a quality programmable processor contains instructions associated with quality fields to specify the accuracy level that must be met during their execution. We show that this ability to control the accuracy of instruction execution greatly enhances the scope of approximate computing, allowing it to be applied to larger parts of programs. The micro-architecture of a quality programmable processor contains hardware mechanisms that translate the instruction-level quality specifications into energy savings. Additionally, it may expose the actual error incurred during the execution of each instruction (which may be less than the specified limit) back to software. As a first embodiment of quality programmable processors, we present the design of Quora, an energy efficient, quality programmable vector processor. Quora utilizes a 3-tiered hierarchy of processing elements that provide distinctly different energy vs. quality trade-offs, and uses hardware mechanisms based on precision scaling with error monitoring and compensation to facilitate quality programmable execution. We evaluate an implementation of Quora with 289 processing elements in 45nm technology. The results demonstrate that leveraging quality-programmability leads to 1.05×–1.7× savings in energy for virtually no loss (< 0.5%) in application output quality, and 1.18×–2.1× energy savings for modest impact (<2.5%) on output quality. Our work suggests that quality programmable processors are a significant step towards bringing approximate computing to the mainstream.
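
One way to picture an instruction-level quality field is precision scaling on the operands, which is one of the mechanisms the abstract mentions. The encoding below is my own illustration, not Quora's ISA: the quality field sets how many low-order operand bits an add may drop, and the incurred error is returned so software can observe it.

```python
# Illustrative quality-programmable add via precision scaling (not Quora's ISA).

def quality_add(a, b, quality_bits):
    """Add two integers after truncating quality_bits low-order bits of each
    operand. Returns (approximate_result, incurred_error)."""
    mask = ~((1 << quality_bits) - 1)
    approx = (a & mask) + (b & mask)
    return approx, (a + b) - approx

result, err = quality_add(1234, 5678, quality_bits=4)
print(result, err)   # approximate sum and the error actually incurred
```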


Design Automation Conference | 2010

Scalable effort hardware design: exploiting algorithmic resilience for energy efficiency

Vinay K. Chippa; Debabrata Mohapatra; Anand Raghunathan; Kaushik Roy; Srimat T. Chakradhar

Algorithms from several interesting application domains exhibit the property of inherent resilience to “errors” from extrinsic or intrinsic sources, offering entirely new avenues for performance and power optimization by relaxing the conventional requirement of exact (numerical or Boolean) equivalence between the specification and the hardware implementation. We propose scalable effort hardware design as an approach to tap this reservoir of algorithmic resilience and translate it into highly efficient hardware implementations. The basic tenet of the scalable effort design approach is to identify mechanisms at each level of design abstraction (circuit, architecture, and algorithm) that can be used to vary the computational effort expended towards generation of the correct (exact) result, and to expose them as control knobs in the implementation. These scaling mechanisms can be utilized to achieve improved energy efficiency while maintaining an acceptable (and often near-identical) level of quality of the overall result. A second major tenet of the scalable effort design approach is that fully exploiting the potential of algorithmic resilience requires synergistic cross-layer optimization of scaling mechanisms identified at different levels of design abstraction. We have implemented an energy-efficient SVM classification chip based on the proposed scalable effort design approach. We present results from post-layout simulations and demonstrate that scalable effort hardware can achieve large energy reductions (1.2X-2.2X with no impact on classification accuracy, and 2.2X-4.1X with modest reductions in accuracy) across various datasets. Our results also establish that cross-layer optimization leads to much improved energy vs. quality trade-offs compared to each of the individual techniques.
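
An algorithm-level "effort knob" of the kind described above can be sketched for SVM classification: evaluate only a fraction of the support vectors, trading a small amount of accuracy for fewer multiply-accumulates. The selection heuristic and names are my own assumptions, not the chip's actual mechanism.

```python
# Sketch of an algorithm-level effort knob for SVM-style classification.
import numpy as np

def svm_decision(x, support_vecs, alphas, bias, effort=1.0):
    """effort in (0, 1]: fraction of support vectors actually evaluated.
    Evaluates only the largest-weight support vectors at reduced effort."""
    n = max(1, int(len(alphas) * effort))
    order = np.argsort(-np.abs(alphas))[:n]
    score = sum(alphas[i] * np.dot(support_vecs[i], x) for i in order) + bias
    return 1 if score >= 0 else -1

rng = np.random.default_rng(0)
sv, al = rng.normal(size=(50, 8)), rng.normal(size=50)
x = rng.normal(size=8)
# Full effort vs. quarter effort on the same input.
print(svm_decision(x, sv, al, 0.1, effort=1.0),
      svm_decision(x, sv, al, 0.1, effort=0.25))
```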


ACM Transactions on Embedded Computing Systems | 2006

A design methodology for application-specific networks-on-chip

Jiang Xu; Wayne H. Wolf; Jörg Henkel; Srimat T. Chakradhar

With the help of HW/SW codesign, systems-on-chip (SoCs) can effectively reduce cost, improve reliability, and produce versatile products. The growing complexity of SoC designs makes on-chip communication subsystem design as important as computation subsystem design. While a number of codesign methodologies have been proposed for on-chip computation subsystems, much work is still needed for on-chip communication subsystems. This paper proposes application-specific networks-on-chip (ASNoC) and its design methodology, which is applied to two high-performance SoC applications. The methodology (1) can automatically generate an optimized ASNoC for different applications, (2) can generate a corresponding distributed shared memory along with an ASNoC, (3) can use both recorded and statistical communication traces for cycle-accurate performance analysis, (4) is based on a standardized network component library and floorplan to estimate power and area, (5) adapts an industrial-grade network modeling and simulation environment, OPNET, which makes the methodology ready to use, and (6) can be easily integrated into current HW/SW codesign flows. Using the methodology, ASNoCs are generated for an H.264 HDTV decoder SoC and a Smart Camera SoC. ASNoC and 2D mesh networks-on-chip are compared in detail in terms of performance, power, and area. The comparison results show that ASNoC provides substantial improvements in power, performance, and cost compared to 2D mesh networks-on-chip. In the H.264 HDTV decoder SoC, ASNoC uses 39% less power, 59% less silicon area, 74% less metal area, 63% less switch capacity, and 69% less interconnection capacity to achieve 2X the performance of a 2D mesh network-on-chip.


High Performance Distributed Computing | 2011

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

Vignesh T. Ravi; Michela Becchi; Gagan Agrawal; Srimat T. Chakradhar

Driven by the emergence of GPUs as a major player in high performance computing and the rapidly growing popularity of cloud environments, GPU instances are now being offered by cloud providers. The use of GPUs in a cloud environment, however, is still at an early stage, and the challenge of making the GPU a true shared resource in the cloud has not yet been addressed. This paper presents a framework that enables applications executing within virtual machines to transparently share one or more GPUs. Our contributions are twofold: we extend an open source GPU virtualization software to include efficient GPU sharing, and we propose solutions to the conceptual problem of GPU kernel consolidation. In particular, we introduce a method for computing the affinity score between two or more kernels, which provides an indication of potential performance improvements upon kernel consolidation. In addition, we explore molding as a means to achieve efficient GPU sharing even in the case of kernels with high or conflicting resource requirements. We use these concepts to develop an algorithm for efficiently mapping a set of kernels onto a pair of GPUs. We extensively evaluate our framework using eight popular GPU kernels and two Fermi GPUs. We find that even when contention is high our consolidation algorithm is effective in improving throughput, and that the runtime overhead of our framework is low.
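
The consolidation idea can be illustrated with a small sketch: score how well two kernels would share a GPU from how complementary their resource usage is, then greedily pair the highest-affinity kernels across the two GPUs. The scoring formula, resource model, and kernel names are my own illustrative assumptions, not the paper's affinity metric or algorithm.

```python
# Hypothetical affinity-based kernel consolidation onto two GPUs.
from itertools import combinations

def affinity(k1, k2):
    """Kernels are dicts of fractional resource use (compute, memory bandwidth).
    Pairs whose combined demand stays under 1.0 per resource score higher."""
    score = 0.0
    for r in ("compute", "mem_bw"):
        score += 1.0 - max(0.0, k1[r] + k2[r] - 1.0)   # penalize oversubscription
    return score

def pair_onto_two_gpus(kernels):
    """Greedily pick the two highest-affinity disjoint pairs, one per GPU."""
    pairs = sorted(combinations(kernels, 2),
                   key=lambda p: affinity(kernels[p[0]], kernels[p[1]]),
                   reverse=True)
    placed, plan = set(), []
    for a, b in pairs:
        if len(plan) == 2:
            break
        if a not in placed and b not in placed:
            plan.append((a, b))
            placed.update((a, b))
    return plan

kernels = {
    "bfs":     {"compute": 0.3, "mem_bw": 0.8},
    "gemm":    {"compute": 0.9, "mem_bw": 0.3},
    "stencil": {"compute": 0.4, "mem_bw": 0.6},
    "sort":    {"compute": 0.5, "mem_bw": 0.5},
}
print(pair_onto_two_gpus(kernels))
```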

Collaboration


Dive into Srimat T. Chakradhar's collaborations.

Top Co-Authors
Yi Yang

Princeton University
