Pallav Kumar Baruah
Sri Sathya Sai University
Publications
Featured research published by Pallav Kumar Baruah.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2007
M. K. Velamati; Arun Kumar; Naresh Jayam; Ganapathy Senthilkumar; Pallav Kumar Baruah; Raghunath Sharma; Shakti Kapoor; Ashok Srinivasan
The Cell is a heterogeneous multi-core processor, which has eight coprocessors, called SPEs. The SPEs can access a common shared main memory through DMA, and each SPE can directly operate on a small distinct local store. An MPI implementation can use each SPE as if it were a node for an MPI process. In this paper, we discuss the efficient implementation of collective communication operations for intra-Cell MPI, both for cores on a single chip, and for a Cell blade. While we have implemented all the collective operations, we describe in detail the following: barrier, broadcast, and reduce. The main contributions of this work are (i) describing our implementation, which achieves low latencies and high bandwidths using the unique features of the Cell, and (ii) comparing different algorithms, and evaluating the influence of the architectural features of the Cell processor on their effectiveness.
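The collectives described above follow the standard MPI interface. As a rough illustration only (assuming any conforming MPI library; the intra-Cell implementation itself is not shown), a program exercising barrier, broadcast, and reduce might look like this:

/* Minimal sketch: exercising the three collectives discussed above
 * (barrier, broadcast, reduce) through the standard MPI interface.
 * Assumes a conforming MPI library; the intra-Cell implementation
 * details from the paper are not reproduced here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int value = (rank == 0) ? 42 : 0;

    MPI_Barrier(MPI_COMM_WORLD);                 /* synchronize all processes   */
    MPI_Bcast(&value, 1, MPI_INT, 0,             /* root 0 broadcasts its value */
              MPI_COMM_WORLD);

    int sum = 0;
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM,
               0, MPI_COMM_WORLD);               /* root 0 collects the sum     */

    if (rank == 0)
        printf("sum over %d processes = %d\n", nprocs, sum);

    MPI_Finalize();
    return 0;
}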
International Conference on Conceptual Structures | 2007
Arun Kumar; Ganapathy Senthilkumar; Murali Krishna; Naresh Jayam; Pallav Kumar Baruah; Raghunath Sharma; Ashok Srinivasan; Shakti Kapoor
The Cell Broadband Engine™ is a heterogeneous multi-core architecture developed by IBM, Sony and Toshiba. It has eight computation-intensive cores (SPEs) with a small local memory, and a single PowerPC core. The SPEs have a total peak performance of 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision. Therefore, the Cell has good potential for high performance computing. But the unconventional architecture makes it difficult to program. We propose an implementation of the core features of MPI as a solution to this problem. This can enable a large class of existing applications to be ported to the Cell. Our MPI implementation attains bandwidth up to 6.01 GB/s, and latency as small as 0.41 μs. The significance of our work is in demonstrating the effectiveness of intra-Cell MPI, consequently enabling the porting of MPI applications to the Cell with minimal effort.
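For context, the core point-to-point features mentioned above are the usual blocking send and receive calls. The sketch below is a generic illustration against any standard MPI library; the message contents and sizes are arbitrary, and nothing Cell-specific is shown:

/* Minimal sketch of the core point-to-point calls referred to above,
 * using blocking MPI_Send/MPI_Recv between two ranks. Assumes a
 * standard MPI library; message contents are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}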
ACM Symposium on Parallel Algorithms and Architectures | 2007
Arun Kumar; Naresh Jayam; Ashok Srinivasan; Ganapathy Senthilkumar; Pallav Kumar Baruah; Shakti Kapoor; Murali Krishna; Raghunath Sarma
The Cell Broadband Engine™ is a new heterogeneous multi-core processor from IBM, Sony, and Toshiba. It contains eight co-processors, called Synergistic Processing Elements (SPEs), which operate directly on distinct 256 KB local stores, and also have access to a shared 512 MB to 2 GB main memory. The combined peak speed of the SPEs is 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision. There is, therefore, much interest in using the Cell BE™ for high performance computing applications. However, the unconventional architecture of the SPEs, in particular their local stores, creates some programming challenges. We describe our implementation of certain core features of MPI, such as blocking point-to-point calls and collective communication calls, which can help meet these challenges by enabling a large class of MPI applications to be ported to the Cell BE™ processor. This implementation views each SPE as a node for an MPI process. We store the application data in main memory in order to avoid being limited by the local store size. The local store is abstracted in the library and thus hidden from the application with respect to MPI calls. We have achieved bandwidth up to 6.01 GB/s and latency as low as 0.41 µs on the ping-pong test. The contribution of this work lies in (i) demonstrating that the Cell BE™ has good potential for running intra-Cell BE™ MPI applications, (ii) enabling such applications to be ported to the Cell BE™ with minimal effort, and (iii) evaluating the performance impact of different design choices.
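The ping-pong test referred to above can be sketched as follows. The buffer size and iteration count are arbitrary illustrative choices, and the timing only approximates what a careful benchmark would do:

/* Sketch of a ping-pong test of the kind used for the measurements above.
 * Rank 0 sends a buffer to rank 1, which echoes it back; half of the
 * average round-trip time approximates the one-way cost. Small messages
 * approximate latency, large ones approximate bandwidth. Buffer size and
 * iteration count here are arbitrary illustrative choices. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int nbytes = 1 << 20;              /* 1 MB message */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way = dt / (2.0 * iters);
        printf("one-way time: %g s, bandwidth: %g GB/s\n",
               one_way, nbytes / one_way / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}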
International Conference on Cloud Computing | 2011
Telidevara Aditya; Pallav Kumar Baruah; Ravi Mukkamala
With the increasing growth of cloud computing and the resulting outsourcing of data, concerns about data integrity, security, and privacy are also on the rise. Among these, evidence of data integrity -- that the data is tamper-evident and up-to-date -- seems to be of immediate concern. While several integrity techniques currently exist, most result in significant storage overhead at the data owner site. For clients with large data sets, these are not viable solutions. In this paper, we propose a space-efficient alternative -- data integrity using Bloom filters. We describe the basic method and discuss different alternatives for implementing the scheme based on the trust/threat model and the processing/storage capacity at the server and the client. For one of these alternatives, we present a detailed analysis and experimental results. These results are compared with traditional cryptographic hash functions such as SHA-1 and MD5. The Bloom filter implementations are shown to be highly space-efficient at the expense of additional computational overhead. To overcome this bottleneck, we have implemented the schemes on multiprocessor systems. The multicore implementations significantly reduce the execution time. Our results clearly demonstrate the feasibility and efficacy of employing Bloom filters to enforce data integrity for outsourced data in cloud environments.
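To make the space/computation trade-off concrete, here is a minimal Bloom filter sketch. The hash construction (a seeded FNV-1a style hash), the filter size, and the number of hash functions are illustrative assumptions, not the construction used in the paper:

/* Minimal Bloom filter sketch to illustrate the space efficiency
 * discussed above. The hash scheme, filter size, and hash count are
 * illustrative assumptions only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FILTER_BITS (1u << 20)          /* 1 Mbit filter (128 KB)    */
#define NUM_HASHES  4                   /* hash functions per item   */

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a style hash, seeded so we can derive several hash functions. */
static uint64_t hash64(const void *data, size_t len, uint64_t seed)
{
    const uint8_t *p = data;
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static void bloom_add(const void *data, size_t len)
{
    for (uint64_t k = 0; k < NUM_HASHES; k++) {
        uint64_t bit = hash64(data, len, k) % FILTER_BITS;
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 1 if the item is possibly present, 0 if definitely absent. */
static int bloom_query(const void *data, size_t len)
{
    for (uint64_t k = 0; k < NUM_HASHES; k++) {
        uint64_t bit = hash64(data, len, k) % FILTER_BITS;
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}

int main(void)
{
    const char *block = "data-block-0001:checksum";
    bloom_add(block, strlen(block));
    printf("query: %d\n", bloom_query(block, strlen(block)));
    return 0;
}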
International Parallel and Distributed Processing Symposium | 2009
C.D. Sudheer; T. Nagaraju; Pallav Kumar Baruah; Ashok Srinivasan
The Cell is a heterogeneous multicore processor that has attracted much attention in the HPC community. The bulk of the computational workload on the Cell processor is carried by eight co-processors called SPEs. The SPEs are connected to each other and to main memory by a high speed bus called the Element Interconnect Bus (EIB), which is capable of 204.8 GB/s. However, access to the main memory is limited by the performance of the Memory Interface Controller (MIC) to 25.6 GB/s. It is, therefore, advantageous for the algorithms to be structured such that SPEs communicate directly between themselves over the EIB, and make less use of memory. We show that the actual bandwidth obtained for inter-SPE communication is strongly influenced by the assignment of threads to SPEs (thread-SPE affinity) in many realistic communication patterns. We identify the bottlenecks to optimal performance and use this information to determine good affinities for common communication patterns. Our solutions improve performance by up to a factor of two over the default assignment. We also discuss the optimization of affinity on a Cell blade consisting of two Cell processors, and provide a software tool to help with this. Our results will help Cell application developers choose good affinities for their applications.
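As a toy illustration of why the assignment matters, thread-SPE affinity can be viewed as a permutation mapping logical MPI ranks to physical SPEs. The snippet below compares two hypothetical permutations for a ring exchange using a crude physical-distance proxy; the actual EIB behavior analyzed in the paper depends on ring direction and ramp ordering, not on this simplified metric:

/* Toy model of thread-SPE affinity: logical rank r runs on physical SPE
 * perm[r]. For a ring exchange (rank r talks to rank r+1), the placement
 * determines which physical SPE pairs carry the traffic. The "distance"
 * below is only a crude stand-in for real EIB behavior. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_SPES 8

int main(void)
{
    /* Hypothetical affinity choices: identity vs. an interleaved order. */
    int identity[NUM_SPES]    = {0, 1, 2, 3, 4, 5, 6, 7};
    int interleaved[NUM_SPES] = {0, 2, 4, 6, 1, 3, 5, 7};
    int *perms[2] = {identity, interleaved};

    for (int p = 0; p < 2; p++) {
        int total = 0;
        for (int r = 0; r < NUM_SPES; r++) {
            int a = perms[p][r];
            int b = perms[p][(r + 1) % NUM_SPES];
            total += abs(a - b);         /* crude physical-distance proxy */
        }
        printf("permutation %d: total ring distance %d\n", p, total);
    }
    return 0;
}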
International Conference on Information Technology: New Generations | 2012
Tummalapalli J. V. R. K. M. K. Sayi; R. K. N. Sai Krishna; Ravi Mukkamala; Pallav Kumar Baruah
With the increasing cost of maintaining IT centers, organizations are looking into outsourcing their storage and computational needs to a cloud server. However, such outsourcing has also raised the more serious issue of data privacy. In this paper, we summarize our work in privacy-preserving data outsourcing. In particular, we discuss the use of vertical fragmentation of a relation so that the fragment assigned to the cloud server contains maximum data without violating privacy. Here, privacy is expressed in terms of a set of confidentiality constraints. We represent the confidentiality constraints as a graph whose nodes are the attributes and whose edges represent pairwise confidentiality constraints. We then two-color the acyclic portion of this graph, using heuristics to eliminate cycles and complete the coloring of all nodes. We are currently extending the work to multiple relations and to constraints over more than two attributes (i.e., triplets, quadruplets, etc.) instead of just pairs.
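A small sketch of the two-coloring step may help. The attribute names and pairwise constraints below are hypothetical, and the cycle-elimination heuristics are omitted; a BFS two-coloring of the acyclic constraint graph splits the attributes into two fragments:

/* Sketch of the two-coloring step described above: attributes are nodes,
 * pairwise confidentiality constraints are edges, and a BFS two-coloring
 * of the (acyclic) constraint graph splits attributes into two fragments.
 * The attributes and constraints are hypothetical examples. */
#include <stdio.h>

#define N 5                                  /* number of attributes */

int main(void)
{
    const char *attr[N] = {"name", "dob", "zip", "diagnosis", "insurer"};
    int adj[N][N] = {0};
    int edges[3][2] = {{0, 3}, {1, 2}, {2, 3}};  /* confidentiality pairs */
    int color[N];

    for (int e = 0; e < 3; e++) {
        adj[edges[e][0]][edges[e][1]] = 1;
        adj[edges[e][1]][edges[e][0]] = 1;
    }
    for (int i = 0; i < N; i++) color[i] = -1;

    /* BFS two-coloring of each connected component. */
    for (int s = 0; s < N; s++) {
        if (color[s] != -1) continue;
        int queue[N], head = 0, tail = 0;
        color[s] = 0;
        queue[tail++] = s;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < N; v++) {
                if (adj[u][v] && color[v] == -1) {
                    color[v] = 1 - color[u];     /* opposite fragment */
                    queue[tail++] = v;
                }
            }
        }
    }

    for (int i = 0; i < N; i++)
        printf("%-10s -> %s\n", attr[i],
               color[i] == 0 ? "cloud fragment" : "owner fragment");
    return 0;
}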
International Symposium on Parallel and Distributed Processing and Applications | 2007
Murali Krishna; Arun Kumar; Naresh Jayam; Ganapathy Senthilkumar; Pallav Kumar Baruah; Raghunath Sharma; Shakti Kapoor; Ashok Srinivasan
The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multicore processor, with the bulk of the computational workload meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, and all the SPEs also have access to a shared 512 MB to 2 GB main memory through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates some programming challenges. We have provided an implementation of core features of MPI for the Cell to help deal with this. This implementation views each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, using the rendezvous protocol, which makes MPI communication for long messages efficient. We further present experimental results on the Cell hardware, where our implementation demonstrates good performance, such as throughput up to 6.01 GB/s and latency as low as 0.65 µs on the ping-pong test. This shows that it is possible to efficiently implement MPI calls even on the simple SPE cores.
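From the application's point of view, synchronous mode corresponds to MPI_Ssend, which completes only after the matching receive has started; the rendezvous protocol described above is what implements this underneath. A minimal usage sketch, with an arbitrary buffer size:

/* Sketch of synchronous-mode communication from the application side:
 * MPI_Ssend completes only after the matching receive has started, which
 * is what the rendezvous protocol discussed above implements underneath.
 * Buffer size is an arbitrary example of a "long" message. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1 << 22;                    /* 4M doubles */
    double *buf = malloc(n * sizeof *buf);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < n; i++) buf[i] = (double)i;
        MPI_Ssend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        printf("rank 0: synchronous send complete\n");
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: received %d doubles\n", n);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}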
Mobile Data Management | 2012
R. K. N. Sai Krishna; Tummalapalli J. V. R. K. M. K. Sayi; Ravi Mukkamala; Pallav Kumar Baruah
With the growing demand for data-sensitive applications employing mobile devices, such as mobile clinics in remote villages and remote sensors collecting sensitive data, there is a need for a new architectural paradigm for mobile data management. Typically, these mobile devices have limited storage and processing capabilities, and operate in unreliable environments, leading to possible loss of valuable data if not properly managed. Often, the collected data is periodically offloaded to a remote server such as a cloud. However, such offloading may lead to violation of privacy if the network/server cannot be fully trusted. While encrypting the data prior to offloading appears to be a solution to this problem, it is computationally intensive and infeasible when mobile devices are employed. In this paper, we propose a partial-encryption scheme that takes into account both the privacy (confidentiality) constraints of the data being collected and the limitations of the mobile devices. The scheme employs vertical and horizontal fragmentation to determine those parts that need to be encrypted and those that can be sent in the clear. The privacy constraints are represented as a constraint graph, and a two-coloring of this graph is used to identify the portions of the data that need to be encrypted. Any cycles in the constraint graph are handled using heuristics. The approach is effective in integrating unsecured wireless/Internet links with untrusted yet inexpensive cloud storage servers, while using capacity-constrained mobile devices to manage sensitive data.
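The sketch below shows how a given two-coloring might be applied on the device: fields colored sensitive are transformed before offloading and the rest are sent in the clear. The field list, the coloring, and the XOR stand-in for a cipher are placeholders only, not the scheme's actual cryptography:

/* Sketch of the partial-encryption idea above: given a coloring of the
 * constraint graph, fields marked sensitive are encrypted before
 * offloading and the rest are sent in clear. The field list, coloring,
 * and XOR "cipher" are placeholders; a real deployment would use a
 * proper cipher on the device. */
#include <stdio.h>
#include <string.h>

#define NFIELDS 4

/* Placeholder transformation: NOT real encryption, just marks the step. */
static void encrypt_field(char *field, unsigned char key)
{
    for (size_t i = 0; field[i]; i++)
        field[i] ^= key;
}

int main(void)
{
    const char *names[NFIELDS] = {"patient", "village", "symptom", "reading"};
    char record[NFIELDS][32]   = {"A. Rao", "Puttaparthi", "fever", "38.4"};
    int  sensitive[NFIELDS]    = {1, 0, 1, 0};   /* from the two-coloring */

    for (int i = 0; i < NFIELDS; i++) {
        if (sensitive[i])
            encrypt_field(record[i], 0x5A);      /* encrypt before offload */
        printf("%-8s : %s (%s)\n", names[i],
               sensitive[i] ? "<ciphertext>" : record[i],
               sensitive[i] ? "encrypted" : "clear");
    }
    return 0;
}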
Information Reuse and Integration | 2012
Tummalapalli J. V. R. K. M. K. Sayi; R. K. N. Sai Krishna; Ravi Mukkamala; Pallav Kumar Baruah
With increasing opportunities for cheaper outsourcing of data, more and more organizations are seriously considering this option to reduce storage and processing costs. However, it has also given rise to the possibility of security and privacy violations of data in outsourced environments. In this paper, we look at the privacy aspect, often referred to as data confidentiality. Our solution partitions the data into horizontal and vertical fragments so that only those fragments that do not violate the privacy constraints are outsourced, while the rest are retained by the owner. The primary objective of the partitioning algorithm is to maximize the size of the outsourced fragment. Since obtaining optimal fragments that satisfy the privacy constraints is NP-hard, we suggest the use of clustering algorithms to provide near-optimal solutions. We provide a proof of correctness for the proposed algorithm, illustrate the scheme with an example, and show its efficacy.
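As a rough sketch of a near-optimal heuristic in this spirit (not the clustering algorithm from the paper), attributes can be added greedily to the outsourced fragment, skipping any attribute whose addition would place all attributes of some confidentiality constraint together in that fragment. The attributes, constraints, and ordering below are illustrative:

/* Greedy fragment-assignment sketch: add attributes to the outsourced
 * fragment one by one, keeping an attribute local whenever adding it
 * would place every attribute of some confidentiality constraint in the
 * outsourced fragment. Attributes and constraints are illustrative. */
#include <stdio.h>

#define NATTR 5
#define NCON  3
#define CSIZE 2

int main(void)
{
    const char *attr[NATTR] = {"name", "dob", "zip", "salary", "dept"};
    int constraints[NCON][CSIZE] = {{0, 3}, {1, 2}, {0, 1}};  /* forbidden together */
    int outsourced[NATTR] = {0};

    for (int a = 0; a < NATTR; a++) {
        int violates = 0;
        for (int c = 0; c < NCON && !violates; c++) {
            int contains_a = 0, others_out = 1;
            for (int k = 0; k < CSIZE; k++) {
                if (constraints[c][k] == a)
                    contains_a = 1;
                else if (!outsourced[constraints[c][k]])
                    others_out = 0;
            }
            if (contains_a && others_out)
                violates = 1;        /* adding 'a' would expose constraint c */
        }
        outsourced[a] = !violates;
    }

    for (int a = 0; a < NATTR; a++)
        printf("%-7s -> %s\n", attr[a],
               outsourced[a] ? "outsourced fragment" : "local fragment");
    return 0;
}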
Grid Computing | 2012
Ajith Padyana; C.D. Sudheer; Pallav Kumar Baruah; Ashok Srinivasan
Compute-intensive tasks in high-end high performance computing (HPC) systems often generate large amounts of data, especially floating-point data, that need to be transmitted over the network. Although computation speeds are very high, the overall performance of these applications is affected by the data transfer overhead. Moreover, as data sets are growing rapidly in size, bandwidth limitations pose a serious bottleneck in several scientific applications. Fast floating-point compression can ameliorate these bandwidth limitations: if data is compressed well, then the amount of data transferred is reduced. This reduction in data transfer time comes at the expense of the additional computation required for compression and decompression. It is important for compression and decompression rates to be greater than the network bandwidth; otherwise, it is faster to transmit uncompressed data directly. Accelerators such as Graphics Processing Units (GPUs) provide substantial computational power. In this paper, we show that the computational power of GPUs can be harnessed to provide sufficiently fast compression and decompression for this approach to be effective for data produced by many practical applications. In particular, we use Holt's exponential smoothing algorithm from time series analysis, and encode the difference between its predictions and the actual data. This yields a lossless compression scheme. We show that it can be implemented efficiently on GPUs to provide an effective compression scheme for the purpose of saving on data transfer overheads. The primary contribution of this work lies in demonstrating the potential of floating-point compression in reducing the I/O bandwidth bottleneck on modern hardware for important classes of scientific applications.
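A sketch of the predictor stage may clarify the approach: Holt's double exponential smoothing forecasts each value, and the bitwise difference between forecast and actual value is what a back-end encoder stores. The smoothing parameters and sample data below are illustrative, and the paper's GPU kernels and residual encoder are not reproduced:

/* Sketch of the predictor stage of the scheme above: Holt's (double)
 * exponential smoothing forecasts each value, and the XOR of the
 * forecast's bits with the actual value's bits is the residual to be
 * encoded. Good forecasts yield residuals with many leading zero bits.
 * Parameters and data are illustrative only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

int main(void)
{
    const double alpha = 0.5, beta = 0.5;       /* smoothing parameters */
    double data[8] = {1.00, 1.02, 1.05, 1.07, 1.10, 1.12, 1.15, 1.17};

    double level = data[0], trend = data[1] - data[0];

    for (int t = 2; t < 8; t++) {
        double predicted = level + trend;                    /* forecast x_t */
        uint64_t residual = bits_of(data[t]) ^ bits_of(predicted);

        /* Count leading zero bits as a proxy for how compactly a back-end
           encoder could store this residual. */
        int lz = 0;
        for (int b = 63; b >= 0 && !((residual >> b) & 1); b--) lz++;
        printf("t=%d predicted=%.4f actual=%.4f leading zero bits=%d\n",
               t, predicted, data[t], lz);

        /* Holt update with the actual value, so the decoder (which
           recovers actual values) stays in sync and the scheme is lossless. */
        double prev_level = level;
        level = alpha * data[t] + (1.0 - alpha) * (level + trend);
        trend = beta * (level - prev_level) + (1.0 - beta) * trend;
    }
    return 0;
}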