Publication


Featured research published by Srihari Cadambi.


International Symposium on Computer Architecture | 2010

A dynamically configurable coprocessor for convolutional neural networks

Srimat T. Chakradhar; Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi

Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing. Two key observations drive the design of a new architecture for CNNs. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications, and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on the fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates a high-abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
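To make the three kinds of parallelism concrete, the sketch below writes out a reference convolutional layer as plain loop nests, with each loop labeled by the parallelism it exposes. This is only an illustration of the workload structure described in the abstract, not the paper's coprocessor; all names, types and sizes are invented.

```cuda
// Reference CNN layer: Y[o] = sum_i conv2d(X[i], K[o][i]); Y assumed zero-initialized.
#include <vector>

void conv_layer(const std::vector<std::vector<float>>& X,               // nIn input maps, each H*W
                const std::vector<std::vector<std::vector<float>>>& K,  // nOut x nIn kernels, each k*k
                std::vector<std::vector<float>>& Y,                     // nOut output maps
                int H, int W, int k) {
  const int nOut = (int)Y.size(), nIn = (int)X.size();
  const int oH = H - k + 1, oW = W - k + 1;
  for (int o = 0; o < nOut; ++o)          // inter-output parallelism: independent output features
    for (int i = 0; i < nIn; ++i)         // intra-output parallelism: inputs combined into one output
      for (int y = 0; y < oH; ++y)
        for (int x = 0; x < oW; ++x) {
          float acc = 0.f;
          for (int u = 0; u < k; ++u)     // parallelism within a single convolution operation
            for (int v = 0; v < k; ++v)
              acc += X[i][(y + u) * W + (x + v)] * K[o][i][u * k + v];
          Y[o][y * oW + x] += acc;
        }
}
```

Which of the three loop levels deserves the hardware's parallel resources depends on nOut, nIn and k, which is exactly the per-layer, per-application variation the dynamically configurable architecture exploits.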


Application-Specific Systems, Architectures and Processors | 2009

A Massively Parallel Coprocessor for Convolutional Neural Networks

Murugan Sankaradas; Venkata Jakkula; Srihari Cadambi; Srimat T. Chakradhar; Igor Durdanovic; Eric Cosatto; Hans Peter Graf

We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words into every memory operation, and leverage the algorithm's simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply-accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.
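The word-packing idea is easy to see in software. The hedged sketch below packs two 16-bit fixed-point operands into each 32-bit memory word so that one memory transaction feeds two multiply-accumulates; the names and data formats are assumptions, not the paper's exact encoding.

```cuda
#include <cstdint>

// Pack two 16-bit fixed-point values into one 32-bit memory word.
static inline uint32_t pack2(int16_t lo, int16_t hi) {
  return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

// One packed fetch per operand stream yields two MACs per loop iteration,
// doubling the effective memory bandwidth relative to unpacked 16-bit reads.
int32_t packed_mac(const uint32_t* x, const uint32_t* w, int n, int32_t acc) {
  for (int i = 0; i < n; ++i) {
    int16_t x0 = (int16_t)(x[i] & 0xFFFF), x1 = (int16_t)(x[i] >> 16);
    int16_t w0 = (int16_t)(w[i] & 0xFFFF), w1 = (int16_t)(w[i] >> 16);
    acc += (int32_t)x0 * w0 + (int32_t)x1 * w1;   // two MACs per memory word fetched
  }
  return acc;
}
```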


Design, Automation and Test in Europe | 2006

Power Analysis of Mobile 3D Graphics

Bren Mochocki; Kanishka Lahiri; Srihari Cadambi

The world of 3D graphics, until recently restricted to high-end workstations and game consoles, is rapidly expanding into the domain of mobile platforms such as cellular phones and PDAs. Even as the mobile chip market is poised to exceed production of 500 million chips per year, incorporating 3D graphics in handhelds poses several serious challenges to the hardware designer. Compared with other platforms, graphics on handhelds must contend with limited energy supplies and lower computing horsepower. Nevertheless, images must still be rendered at high quality since handheld screens are typically held closer to the observer's eye, making imperfections and approximations very noticeable. In this paper, we provide an in-depth quantitative analysis of the power consumption of mobile 3D graphics pipelines. We analyze the effects of various 3D graphics factors such as resolution, frame rate, level of detail, lighting and texture maps on power consumption. We demonstrate that significant imbalance exists across the workloads of different graphics pipeline stages. In addition, we illustrate how this imbalance may vary dynamically, depending on the characteristics of the graphics application. Based on this observation, we identify and compare the benefits of candidate dynamic voltage and frequency scaling (DVFS) schemes for mobile 3D graphics pipelines. In our experiments we observe that DVFS for mobile 3D graphics reduces energy by as much as 50%.
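A first-order model (a textbook approximation, not taken from the paper) shows why scaling down a lightly loaded pipeline stage saves energy. Dynamic power is roughly

$$P_{\text{dyn}} \approx \alpha C V^2 f,$$

and for a fixed amount of work the runtime scales as $t \propto 1/f$, so per-frame dynamic energy scales as $E = P_{\text{dyn}}\, t \propto V^2$. If a stage can run at half frequency with the supply dropped from, say, 1.2 V to 0.9 V (illustrative numbers), dynamic energy falls by $1 - (0.9/1.2)^2 \approx 44\%$, the same order as the up-to-50% savings reported above.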


International Symposium on Microarchitecture | 2012

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Haicheng Wu; Gregory Frederick Diamos; Srihari Cadambi; Sudhakar Yalamanchili

Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, significant challenges arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce the data footprint, cutting down data movement throughout the GPU and CPU memory hierarchy, and ii) enlarge the compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types: i) fine-grained thread-to-thread dependences, ii) medium-grained thread-block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the micro-benchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
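The sketch below shows the transformation for the simplest of the three dependence classes, a fine-grained thread-to-thread dependence. It is a hand-written CUDA illustration of what fusion buys, not Kernel Weaver's actual output; the kernels and names are invented.

```cuda
// Unfused: the intermediate array c makes a full round trip through global memory.
__global__ void add(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];            // producer writes c
}

__global__ void square(const float* c, float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] = c[i] * c[i];            // consumer reads c back
}

// Fused: the intermediate stays in a register, removing one global-memory
// write and one read per element and enlarging the compiler's optimization scope.
__global__ void add_square(const float* a, const float* b, float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float c = a[i] + b[i];                  // intermediate never leaves the register file
    d[i] = c * c;
  }
}
```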


Field-Programmable Custom Computing Machines | 2009

A Massively Parallel FPGA-Based Coprocessor for Support Vector Machines

Srihari Cadambi; Igor Durdanovic; Venkata Jakkula; Murugan Sankaradass; Eric Cosatto; Srimat T. Chakradhar; Hans Peter Graf

We present a massively parallel FPGA-based coprocessor for Support Vector Machines (SVMs), a machine learning algorithm whose applications include recognition tasks such as learning scenes, situations and concepts, and reasoning tasks such as analyzing the recognized scenes and semantics. The coprocessor architecture, targeted at both SVM training and classification, is based on clusters of vector processing elements (VPEs) operating in single-instruction multiple-data (SIMD) mode to take advantage of the large amounts of data parallelism in the application. We use the FPGA's DSP elements as parallel multiply-accumulators (MACs), a core computation in SVMs. A key feature of the architecture is that it is customized to low-precision arithmetic, which permits one DSP unit to perform two or more MACs in parallel. Low precision also reduces the required number of parallel off-chip memory accesses by packing multiple data words onto the FPGA-memory bus. We have built a prototype using an off-the-shelf PCI-based FPGA card with a Xilinx Virtex 5 FPGA and 1 GB DDR2 memory. For SVM training, we observe application-level end-to-end computation speeds of over 9 billion multiply-accumulates per second (GMACs). For SVM classification, using data packing, the application speed increases to 14 GMACs. The FPGA-based system is about 20x faster than a dual 2.2 GHz Opteron CPU, and dissipates around 10 W of power.
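The "two or more MACs per DSP" claim rests on a standard bit-field trick for low-precision unsigned data: pack two operands into one wide multiplier input with a guard band, multiply once, and slice two partial products out of the result. The sketch below shows only the arithmetic idea in software (the paper maps it onto the FPGA's DSP fabric); the widths and names are assumptions.

```cuda
#include <cstdint>

// Computes p0 = a0*b and p1 = a1*b with a single wide multiply.
// The 24-bit guard band works because a0*b < 2^16 <= 2^24, so the
// two products occupy disjoint bit fields of the result.
void dual_mul(uint8_t a0, uint8_t a1, uint8_t b, uint32_t* p0, uint32_t* p1) {
  uint64_t packed = (uint64_t)a0 | ((uint64_t)a1 << 24);  // two operands, one multiplier input
  uint64_t prod   = packed * (uint64_t)b;                 // one hardware multiply
  *p0 = (uint32_t)(prod & 0xFFFFFF);                      // low field:  a0 * b
  *p1 = (uint32_t)(prod >> 24);                           // high field: a1 * b
}
```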


ACM Symposium on Parallel Algorithms and Architectures | 2010

Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Michela Becchi; Surendra Byna; Srihari Cadambi; Srimat T. Chakradhar

In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E, and as such do not have shared memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5 GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32] and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel's data is already located in GPU memory due to prior decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
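A minimal sketch of the data-aware decision, with invented names throughout: the runtime tracks where the valid copy of each buffer lives and routes a kernel call to the device that already holds most of its operand bytes, deferring transfers until a kernel actually needs data on the other device.

```cuda
#include <cstddef>

enum Device { CPU, GPU };

struct Buffer {
  void*  host_ptr;
  void*  dev_ptr;
  size_t bytes;
  Device location;   // where the current valid copy resides
};

// Move the computation to the data: pick the device holding the larger
// share of the operands' bytes, even if that device runs this kernel slower.
Device schedule(const Buffer* const* args, int nargs) {
  size_t on_gpu = 0, total = 0;
  for (int i = 0; i < nargs; ++i) {
    total += args[i]->bytes;
    if (args[i]->location == GPU) on_gpu += args[i]->bytes;
  }
  return (2 * on_gpu >= total) ? GPU : CPU;
}
```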


Design Automation Conference | 2006

Signature-based workload estimation for mobile 3D graphics

Bren Mochocki; Kanishka Lahiri; Srihari Cadambi; X. Sharon Hu

Until recently, most 3D graphics applications had been regarded as too computationally intensive for devices other than desktop computers and gaming consoles. This notion is rapidly changing due to the improving screen resolutions and computing capabilities of mass-market handheld devices such as cellular phones and PDAs. As the mobile 3D gaming industry is poised to expand, significant innovations are required to provide users with a high-quality 3D experience under the limited processing, memory and energy budgets that are characteristic of the mobile domain. Energy-saving schemes such as dynamic voltage and frequency scaling (DVFS), as well as system-level power and performance optimization methods for mobile devices, require accurate and fast workload prediction. In this paper, we address the problem of workload prediction for mobile 3D graphics. We propose and describe a signature-based estimation technique for predicting 3D graphics workloads. By analyzing a gaming benchmark, we show that monitoring specific parameters of the 3D pipeline provides better prediction accuracy than conventional approaches. We describe how signatures capture such parameters concisely to make accurate workload predictions. Signature-based prediction is computationally efficient because, first, signatures are compact, and second, they do not require elaborate model evaluations. Thus, they are amenable to efficient, real-time prediction. A fundamental difference between signatures and standard history-based predictors is that signatures capture previous outcomes as well as the cause that led to each outcome, and use both to predict future outcomes. We illustrate the utility of the signature-based workload estimation technique by using it as a basis for DVFS in 3D graphics pipelines.
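A hedged sketch of the mechanism: a signature is a small vector of quantized pipeline parameters observed during a frame, and a table maps each signature (the cause) to the workload it last produced (the outcome). The specific parameters and table policy below are assumptions for illustration.

```cuda
#include <array>
#include <map>

using Signature = std::array<int, 3>;   // e.g., quantized {triangle count, avg. area, texel fetches}

struct SignaturePredictor {
  std::map<Signature, double> table;    // signature -> last observed workload (cycles)

  // Predict the next frame's workload from the outcome previously recorded
  // for the same cause; fall back (e.g., to the last frame) for unseen signatures.
  double predict(const Signature& s, double fallback) const {
    auto it = table.find(s);
    return it != table.end() ? it->second : fallback;
  }

  // After the frame completes, record both the cause and its outcome.
  void update(const Signature& s, double cycles) { table[s] = cycles; }
};
```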


International Parallel and Distributed Processing Symposium | 2012

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Haicheng Wu; Gregory Frederick Diamos; Jin Wang; Srihari Cadambi; Sudhakar Yalamanchili; Srimat T. Chakradhar

Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-core-count architectures that potentially offer substantial improvements in throughput for these applications. However, significant challenges arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes a set of compiler optimizations to address these challenges. Inspired in part by loop fusion/fission optimizations in the scientific computing community, we propose kernel fusion and kernel fission. Kernel fusion fuses the code bodies of two GPU kernels to i) eliminate redundant operations across dependent kernels, ii) reduce data movement between GPU registers and GPU memory, iii) reduce data movement between GPU memory and CPU memory, and iv) improve the spatial and temporal locality of memory references. Kernel fission partitions a kernel into segments such that segment computations and data transfers between the GPU and host CPU can be overlapped. Fusion and fission can also be applied concurrently to a set of kernels. We empirically evaluate the benefits of fusion/fission on relational algebra operators drawn from the TPC-H benchmark suite. All kernels are implemented in CUDA and the experiments are performed on NVIDIA Fermi GPUs. In general, we observed data throughput improvements ranging from 13.1% to 41.4% for the SELECT operator and queries Q1 and Q21 of the TPC-H benchmark suite. We present key insights, lessons learned, and opportunities for further improvements.
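Kernel fission's overlap benefit can be shown with ordinary CUDA streams. The sketch below splits the input into segments so the copies of one segment overlap the kernel running on another; it illustrates the mechanism only, not the paper's compiler transformation, and the kernel is a stand-in. Host buffers would need to be pinned (cudaMallocHost) for the copies to be truly asynchronous.

```cuda
__global__ void work(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;     // stand-in for a relational operator
}

void fissioned_launch(const float* h_in, float* h_out, float* d_in, float* d_out,
                      int n, int nseg) {                 // assumes nseg divides n
  cudaStream_t s[2];
  cudaStreamCreate(&s[0]);
  cudaStreamCreate(&s[1]);
  int seg = n / nseg;
  for (int i = 0; i < nseg; ++i) {
    cudaStream_t st = s[i % 2];
    int off = i * seg;
    // Copies and compute for different segments overlap across the two streams.
    cudaMemcpyAsync(d_in + off, h_in + off, seg * sizeof(float),
                    cudaMemcpyHostToDevice, st);
    work<<<(seg + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, seg);
    cudaMemcpyAsync(h_out + off, d_out + off, seg * sizeof(float),
                    cudaMemcpyDeviceToHost, st);
  }
  cudaDeviceSynchronize();
  cudaStreamDestroy(s[0]);
  cudaStreamDestroy(s[1]);
}
```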


International Conference on Parallel Architectures and Compilation Techniques | 2010

A programmable parallel accelerator for learning and classification

Srihari Cadambi; Abhinandan Majumdar; Michela Becchi; Srimat T. Chakradhar; Hans Peter Graf

For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general-purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads and find that their computationally intensive portions can be formulated as matrix or vector operations that generate large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max/min, and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing, where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
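The compute-then-reduce pattern MAPLE accelerates looks like the following in scalar form: a primary matrix/vector operation produces a stream of intermediates that a secondary reduction consumes immediately, so the intermediates never need to be stored. This is a software analogue of the structure described in the abstract, not MAPLE's instruction set; the function is invented for illustration.

```cuda
#include <cfloat>

// Returns the row of M (nrows x ncols) whose dot product with v is largest.
// On MAPLE the rows would be spread across PE groups and the max taken by the
// in-memory reduction units; here the reduction is simply folded into the loop.
int matvec_argmax(const float* M, const float* v, int nrows, int ncols) {
  float best = -FLT_MAX;
  int best_row = 0;
  for (int r = 0; r < nrows; ++r) {
    float dot = 0.f;                               // intermediate value, consumed immediately
    for (int c = 0; c < ncols; ++c)
      dot += M[r * ncols + c] * v[c];              // primary op: matrix-vector MACs
    if (dot > best) { best = dot; best_row = r; }  // secondary op: streaming max
  }
  return best_row;
}
```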


Design Automation Conference | 2002

A fast, inexpensive and scalable hardware acceleration technique for functional simulation

Srihari Cadambi; Chandra S Mulpuri; Pranav Ashar

We introduce a novel approach to accelerating functional simulation. The key attributes of our approach are high performance, low cost, scalability and low turnaround time (TAT). We achieve speedups of 25 to 2000x over zero-delay event-driven simulation and 75 to 1000x over cycle-based simulation on benchmark and industrial circuits, while maintaining the cost, scalability and TAT advantages of simulation. Owing to these attributes, we believe that such an approach has the potential for very wide deployment as a replacement or enhancement for existing simulators. Our technology relies on a VLIW-like virtual simulation processor (SimPLE) mapped to a single FPGA on an off-the-shelf PCI board. Primarily responsible for the speed are (i) parallelism in the processor architecture, (ii) a high pin count on the FPGA enabling large instruction bandwidth, and (iii) a high-speed (124 MHz on Xilinx Virtex-II) single-FPGA implementation of the processor with regularity-driven, efficient place and route. Companion to the processor is the very fast SimPLE compiler, which achieves compilation rates of 4 million gates/hour. To simulate the netlist, the compiled instructions are streamed through the FPGA along with the simulation vectors. This architecture plugs naturally into any existing HDL simulation environment. We have a working prototype based on a commercially available PCI-based FPGA board.
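A software analogue of compiled functional simulation makes the approach concrete: the netlist is compiled once into a linear, levelized instruction stream (operands are always ready before use), and each simulation cycle just streams those instructions through an evaluation loop. SimPLE does this across many parallel function units per VLIW word; the scalar loop and encoding below are invented for illustration.

```cuda
#include <cstdint>
#include <vector>

enum Op : uint8_t { AND_G, OR_G, XOR_G, NOT_G };

struct Instr { Op op; uint32_t a, b, dst; };   // operand/result slots in value memory

// One simulation cycle: evaluate the compiled, levelized netlist.
void simulate_cycle(const std::vector<Instr>& prog, std::vector<uint8_t>& val) {
  for (const Instr& it : prog) {               // SimPLE issues many such ops per VLIW word
    switch (it.op) {
      case AND_G: val[it.dst] = val[it.a] & val[it.b]; break;
      case OR_G:  val[it.dst] = val[it.a] | val[it.b]; break;
      case XOR_G: val[it.dst] = val[it.a] ^ val[it.b]; break;
      case NOT_G: val[it.dst] = val[it.a] ^ 1u;        break;  // values stored as 0/1
    }
  }
}
```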
