Sai Rahul Chalamalasetti

Publication

Featured research published by Sai Rahul Chalamalasetti.

field programmable gate arrays | 2013

An FPGA memcached appliance

Sai Rahul Chalamalasetti; Kevin T. Lim; Mitch Wright; Alvin AuYoung; Parthasarathy Ranganathan; Martin Margala

Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard server CPUs. In this paper, we demonstrate the design of an FPGA-based Memcached appliance. We take Memcached, a complex software system, and implement its core functionality on an FPGA. By leveraging the FPGA's design and utilizing its customizable logic to create a specialized appliance, we are able to tightly integrate networking, compute, and memory. This integration allows us to overcome many of the bottlenecks found in standard servers. Our design provides performance on par with baseline servers, but consumes only 9% of the power of the baseline. Scaled out, we see benefits at the data center level, substantially improving the performance-per-dollar while improving energy efficiency by 3.2X to 10.9X.

adaptive hardware and systems | 2009

MORA - An Architecture and Programming Model for a Resource Efficient Coarse Grained Reconfigurable Processor

Sai Rahul Chalamalasetti; Sohan Purohit; Martin Margala; Wim Vanderbauwhede

This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator are presented.

international symposium on performance analysis of systems and software | 2012

Evaluating FPGA-acceleration for real-time unstructured search

Sai Rahul Chalamalasetti; Martin Margala; Wim Vanderbauwhede; Mitch Wright; Parthasarathy Ranganathan

Emerging data-centric workloads that operate on and harvest useful insights from large amounts of unstructured data require corresponding new data-centric system architecture optimizations. In particular, with the growing importance of power and cooling costs, a key challenge for such future designs is to achieve increased performance at high energy efficiency. At the same time, recent trends towards better support for reconfigurable logic enable the use of energy-efficient accelerators. Combining these trends, in this paper, we examine the applicability of acceleration in future data-centric system architectures. We focus on an important class of data-centric workloads, real-time unstructured search, or information filtering, where large collections of documents are scored against specific topic profiles, and present an FPGA-based implementation to accelerate such workloads. Our implementation, based on the GiDEL PROCStar IV board using Altera Stratix IV FPGAs, demonstrates excellent performance and energy efficiency, 20 to 40 times better than baseline server systems for typical usage scenarios. Our results also highlight interesting insights for the design of accelerators in future data-centric systems.
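As a rough software illustration of what scoring documents against a topic profile can look like, the sketch below sums per-term weights over a document's tokens and keeps documents that reach a threshold. The weighted-sum scoring rule, the names `score_document` and `filter_stream`, and the threshold are assumptions made for illustration; the abstract does not specify the scoring function or the FPGA implementation.

```python
from collections import Counter


def score_document(tokens, profile):
    """Score one document against a topic profile.

    `profile` maps a term to a weight (hypothetical representation).
    The score is the weighted count of profile terms in the document.
    """
    counts = Counter(tokens)
    return sum(weight * counts[term] for term, weight in profile.items())


def filter_stream(documents, profile, threshold):
    """Return the ids of documents whose score reaches the threshold.

    `documents` is an iterable of (doc_id, tokens) pairs.
    """
    return [
        doc_id
        for doc_id, tokens in documents
        if score_document(tokens, profile) >= threshold
    ]
```

For example, with a profile of `{"fpga": 2.0, "memcached": 1.5}`, a document containing "fpga" twice scores 4.0 under this assumed rule.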

reconfigurable computing and fpgas | 2008

Power-Efficient High Throughput Reconfigurable Datapath Design for Portable Multimedia Devices

Sohan Purohit; Sai Rahul Chalamalasetti; Martin Margala; Pasquale Corsonello

This paper presents new power efficient high throughput data paths for portable multimedia devices. The various data paths provide support for dense arithmetic operations. This work provides the performance evaluation for a library of reconfigurable data path elements (Processing Elements) previously proposed and presents two new processing element architectures to be part of power efficient portable, multimedia processing systems. The performance results show that the proposed designs will provide a higher efficiency in power and area consumption compared to the previously suggested and commercial solutions, and could prove highly beneficial for the target domain of multimedia operations on portable systems.

field programmable logic and applications | 2014

High level programming framework for FPGAs in the data center

Oren Segal; Martin Margala; Sai Rahul Chalamalasetti; Mitch Wright

Heterogeneous computing offers a promising solution for energy efficient computing in the data center. FPGA based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high performance computing devices is the difficulty in programming them. OpenCL was meant to address the difficulties and the non-uniformity related to programming heterogeneous devices. Unfortunately, because of its complexity, it sets the bar high for many software programmers, preventing them from directly benefiting from the computing power and energy efficiency that OpenCL and heterogeneous computing have to offer. This work presents an effort to bridge the gap by extending an existing Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and increased ease of programmability. We run several real world algorithms to assess the performance of the APARAPI framework on both a low end and a high end system. On the low end and high end systems respectively, we find up to 78-80 percent power reduction and 4.8X-5.3X speed increase running NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for a K-Means MapReduce algorithm running on top of the Hadoop framework and APARAPI.

IEEE Transactions on Very Large Scale Integration Systems | 2013

Design and Evaluation of High-Performance Processing Elements for Reconfigurable Systems

Sohan Purohit; Sai Rahul Chalamalasetti; Martin Margala; Wim Vanderbauwhede

In this paper, we present the design and evaluation of two new processing elements for reconfigurable computing. We also present a circuit-level implementation of the data paths in static and dynamic design styles to explore the various performance-power tradeoffs involved. When implemented in IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies ranging over 1 GHz both for static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating point processing element (FPPE) using a 24-b variant of the proposed data paths is also presented. The full dynamic implementation of the FPPE shows that it operates at a frequency of 1 GHz with 6.5-mW average power consumption. Comparison with competing architectures shows that the FPPE provides two orders of magnitude higher throughput. Furthermore, to evaluate its feasibility as a soft-processing solution, we also map the floating point unit onto the Virtex 4 and 5 devices, and observe that the unit requires less than 1% of the total logic slices, while utilizing only around 4% of the DSP blocks available. When compared against popular field-programmable-gate-array-based floating point units, our design on Virtex 5 showed significantly lower resource utilization, while achieving comparable peak operating frequency.

field-programmable logic and applications | 2009

A low cost reconfigurable soft processor for multimedia applications: Design synthesis and programming model

Sai Rahul Chalamalasetti; Wim Vanderbauwhede; Sohan Purohit; Martin Margala

This paper presents an FPGA implementation of a low cost 8-bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. This paper presents an investigation of the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model, which allows low level programming. Throughput results for popular benchmarks coded using the programming model and cycle accurate simulator are presented.

IEEE Transactions on Very Large Scale Integration Systems | 2013

Throughput/Resource-Efficient Reconfigurable Processor for Multimedia Applications

Sohan Purohit; Sai Rahul Chalamalasetti; Martin Margala; Wim Vanderbauwhede

This brief presents the implementation and evaluation of an 8-bit adaptable processor core to be part of the power-throughput-area efficient multimedia oriented reconfigurable architecture reconfigurable array. The design of the processor core was custom implemented in IBM's 90 nm CMOS technology and occupies 0.115 mm² silicon area with approximately 70% area utilized by core circuits. The processor shows a peak throughput performance of 75 MOPS/mW. Benchmarking results show estimated throughputs of 9.5, 21.36, 39.78, 170.88, and 4.54 MSamples/s for variants of 2-D discrete cosine transform (DCT), 4 × 4 H.264 integer transform, and 2-D discrete wavelet transform, respectively. Our analysis shows that the proposed design provides approximately 4-8 times higher throughput for 2-D DCT when compared against popular architectures.

International Journal of Reconfigurable Computing | 2012

Throughput analysis for a high-performance FPGA-accelerated real-time search application

Wim Vanderbauwhede; Sai Rahul Chalamalasetti; Martin Margala

We propose an FPGA design for the relevancy computation part of a high-throughput real-time search application. The application matches terms in a stream of documents against a static profile, held in off-chip memory. We present a mathematical analysis of the throughput of the application and apply it to the problem of scaling the Bloom filter used to discard nonmatches.
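To make the role of the Bloom filter concrete, here is a minimal software sketch, assuming a simple bit-array filter with SHA-256-derived hash positions placed in front of a profile lookup. The class `BloomFilter`, the function `score_with_prefilter`, the hashing scheme, and the additive scoring are all hypothetical choices for illustration and are not taken from the paper's FPGA design.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(term)
        )


def score_with_prefilter(tokens, profile, bloom):
    """Score tokens against a profile, skipping lookups the filter rules out.

    Returns (score, lookups), where lookups counts the profile accesses
    that were actually performed (the stand-in for off-chip memory reads).
    """
    score = 0.0
    lookups = 0
    for token in tokens:
        if not bloom.might_contain(token):
            continue  # definite non-match: no profile access needed
        lookups += 1
        score += profile.get(token, 0.0)  # false positives contribute 0
    return score, lookups
```

Because a Bloom filter never reports a stored term as absent, the score is unchanged by the prefilter; only the number of profile lookups drops. Sizing `num_bits` and `num_hashes` trades filter memory against the false-positive rate, which is the kind of scaling question the abstract refers to.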

adaptive hardware and systems | 2010

Low overhead soft error detection and correction scheme for reconfigurable pipelined data paths

Sohan Purohit; Sai Rahul Chalamalasetti; Martin Margala

In this paper, we describe a novel scheme for radiation hardening of high performance pipelined architectures and data paths. The proposed technique uses a local ground bus decoupled from the global ground using an additional pull down device, to detect a transient error. Combining the detector output with duplicated pipeline registers enables an instruction execution through the data path to be repeated as soon as the error is detected. The detector outputs from various stages in a pipelined data path are manipulated to maintain correctness of data in the event of a transient error detection and corresponding instruction roll back. The proposed technique is extremely effective for errors of different pulse widths and comes without the extra cost of error checking codes, watchdog processors, and logic core duplication as used by other techniques in the literature. Our scheme provides 100% radiation hardening over all process corners with only 9.7% and 21.73% area and power overhead, respectively, with the delay overhead being masked out by the pipeline stages used in modern high performance data path architectures.