Souvik Bhattacherjee | Researchain

Publication

Featured research published by Souvik Bhattacherjee.

very large data bases | 2015

Principles of dataset versioning: exploring the recreation/storage tradeoff

Souvik Bhattacherjee; Amit Chavan; Silu Huang; Amol Deshpande; Aditya G. Parameswaran

The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
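The two extremes of the storage-recreation trade-off can be illustrated with a toy version graph. The sketch below is not the paper's formulation or heuristics; it assumes a hypothetical graph in which a dummy "root" node connects to every version (the cost of storing that version in full), deltas are undirected, and a single number per edge stands for both its storage cost and its recreation cost. A minimum spanning tree then gives the least total storage, while a shortest-path tree gives the fastest recreation of each version.

```python
import heapq

def build_graph(edges):
    """Undirected adjacency list from (u, v, cost) triples."""
    graph = {}
    for u, v, cost in edges:
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    return graph

def storage_plan(graph, root, mode):
    """Grow a spanning tree from `root` and report its costs.

    mode="mst": Prim's algorithm (key = edge cost), minimal total storage.
    mode="spt": Dijkstra's algorithm (key = path cost), minimal recreation cost.
    Returns (total_storage, {version: recreation_cost}).
    """
    dist = {root: 0}           # cost of the tree path from root to each node
    parent_cost = {}           # node -> cost of the edge that stores it
    done = set()
    heap = [(0, root, None, 0)]  # (key, node, parent, edge cost)
    while heap:
        _, node, par, cost = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if par is not None:
            parent_cost[node] = cost
            dist[node] = dist[par] + cost
        for nbr, w in graph[node]:
            if nbr not in done:
                key = w if mode == "mst" else dist[node] + w
                heapq.heappush(heap, (key, nbr, node, w))
    storage = sum(parent_cost.values())
    recreation = {v: dist[v] for v in parent_cost}
    return storage, recreation

# Hypothetical costs: full copies cost 100, deltas between versions cost 10 or 25.
EDGES = [("root", "v1", 100), ("root", "v2", 100), ("root", "v3", 100),
         ("v1", "v2", 10), ("v2", "v3", 10), ("v1", "v3", 25)]
graph = build_graph(EDGES)
print(storage_plan(graph, "root", "mst"))
print(storage_plan(graph, "root", "spt"))
```

With these illustrative numbers, the minimum spanning tree stores one full version plus two deltas (storage 120, worst-case recreation 120), whereas the shortest-path tree stores every version in full (storage 300, recreation 100 each).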

high performance embedded architectures and compilers | 2011

High throughput data redundancy removal algorithm with scalable performance

Souvik Bhattacherjee; Ankur Narang; Vikas K. Garg

The ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, web pages, stock markets, medical records and other domains has triggered worldwide research in data-intensive computing. A key requirement here involves removing redundancy from data, as this enhances the compute efficiency of downstream data processing. These application domains have an intense need for high-throughput data deduplication for huge volumes of data flowing at the rate of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm can perform complete deduplication in 255s on a 16-core Intel Xeon 5570 architecture. This gives a throughput of around 2M records/s. For 2048-byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm for both multi-core Power6 and Intel Xeon 5570 architectures.

mobile data management | 2015

Predictive Caching Framework for Mobile Wireless Networks

Sourav Dutta; Ankur Narang; Souvik Bhattacherjee; Ananda Swarup Das; Dilip Krishnaswamy

With the increasing popularity of Netflix, Yahoo! Video, etc., interactive multimedia services such as video-on-demand (VoD) provide an interesting and rich field of research. The advent of smarter wireless devices has sharply increased the demand for such services over wireless connectivity. However, personalizing the service to individual user needs and reducing latency while maintaining low operational costs is a challenging problem. In this paper, we propose an efficient VoD system for wireless mobile devices, based on a novel caching algorithm, the Intelligent Network Caching Algorithm (INCA), which uses an analytics-driven look-ahead scheme for both prefetch and replacement policies to deliver higher performance. This enables enhanced Quality of Experience (QoE) for users with limited infrastructural changes and low operational cost. Alongside, we develop a theoretical formulation of the QoE optimization problem that lies at the intersection of MPC (Markov Predictive Control) and MDP (Markov Decision Process). Empirical analysis over realistic user video query logs demonstrates a better cache hit rate and QoE with low prefetch bandwidth, compared to existing caching schemes.

conference on information and knowledge management | 2010

Real-time memory efficient data redundancy removal algorithm

Vikas K. Garg; Ankur Narang; Souvik Bhattacherjee

Data-intensive computing has become a central theme in the research community and industry. There is an ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc. Removing redundancy in the data is an important problem, as it improves resource and compute efficiency for downstream processing of the massive (1 billion to 10 billion records) datasets. In application domains such as IR, stock markets, telecom and others, there is a strong need for real-time data redundancy removal (referred to as DRR) of enormous amounts of data flowing at the rate of 1 GB/s or more. Real-time scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm can perform complete de-duplication in 255s on a 16-core Intel Xeon 5570 architecture, with in-memory execution. This gives a throughput of 2M records/s. For 6 billion records, our parallel algorithm can perform complete de-duplication in less than 4.5 hours, using 6 cores of an Intel Xeon 5570, with disk-based execution. This gives a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with an increasing number of cores and increasing data size.
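The throughput figures quoted in this abstract follow directly from the record counts and run times it gives. The short check below redoes that arithmetic; the function name is only illustrative, and because the disk-based run is reported as taking less than 4.5 hours, the second figure is a lower bound.

```python
def records_per_second(records, seconds):
    """Average throughput for a run that processes `records` in `seconds`."""
    return records / seconds

# In-memory run: 500 million records in 255 s.
in_memory = records_per_second(500e6, 255)

# Disk-based run: 6 billion records in (at most) 4.5 hours.
disk_based = records_per_second(6e9, 4.5 * 3600)

print(f"in-memory:  {in_memory / 1e6:.2f} M records/s")
print(f"disk-based: {disk_based / 1e3:.0f} K records/s")
```

The results, roughly 1.96 million and 370 thousand records per second, match the rounded values of 2M and 370K records/s stated above.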

statistical and scientific database management | 2014

PStore: an efficient storage framework for managing scientific data

Souvik Bhattacherjee; Amol Deshpande; Alan Sussman

In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data generated by scientific simulations. PStore consists of two modules, a data ingestion module and a query processing module, that respectively address two of the key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. The query processing module, on the other hand, is in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on dicing (also called range) queries. PStore provides a suite of compression schemes that leverage, and in some cases extend, existing techniques to provide support for diverse scientific simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of compression, and hides expensive disk latencies for long-running range queries by exploiting chunk prefetching. In addition, we parallelize the query processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model CWRF [5]. We use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
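To make the idea of a dicing (range) query over chunked array data concrete, here is a minimal sketch. It is not PStore's two-level scheme and ignores compression and prefetching; it assumes a hypothetical array split into regular fixed-size chunks and simply lists which chunks a half-open query box touches, which is the set a query processor would need to fetch.

```python
from itertools import product

def chunks_for_range(lo, hi, chunk_shape):
    """Chunk coordinates touched by the half-open box [lo, hi).

    lo, hi and chunk_shape are equal-length tuples, one entry per dimension.
    """
    per_dim = [
        range(l // c, (h - 1) // c + 1)   # first and last chunk index in this dimension
        for l, h, c in zip(lo, hi, chunk_shape)
    ]
    return list(product(*per_dim))

# Hypothetical 3D array with 16 x 8 x 4 chunks; query cells [10,30) x [0,8) x [5,6).
print(chunks_for_range((10, 0, 5), (30, 8, 6), (16, 8, 4)))
```

For this example the query spans two chunks along the first dimension and one along each of the others, so two chunks are returned.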

international conference of distributed computing and networking | 2010

Parallelization of the Lanczos algorithm on multi-core platforms

Souvik Bhattacherjee; Abhijit Das

In this paper, we report our parallel implementations of the Lanczos sparse linear system solving algorithm over large prime fields, on a multi-core platform. We employ several load-balancing methods suited to these platforms. We have carried out process-level and thread-level parallel implementations under two different arithmetic libraries, and the best speedup obtained is 6.57 on eight cores. To the best of our knowledge, no implementation of the Lanczos algorithm on a multi-core platform has previously been reported in the literature. Moreover, we seem to have achieved significantly larger speedup compared to all previously reported implementations of this algorithm.
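The reported speedup of 6.57 on eight cores can be put in context with two standard derived quantities: parallel efficiency (speedup divided by core count) and the Karp-Flatt experimentally determined serial fraction. The sketch below only restates the abstract's numbers in those terms; the metrics are not taken from the paper itself.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)

speedup, cores = 6.57, 8
print(f"efficiency:      {parallel_efficiency(speedup, cores):.3f}")
print(f"serial fraction: {karp_flatt(speedup, cores):.3f}")
```

This gives an efficiency of about 0.82 and an implied serial fraction of about 0.03.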

european conference on parallel processing | 2010

Parallel exact time series motif discovery

Ankur Narang; Souvik Bhattacherjee

Time series motifs are an integral part of diverse data mining applications including classification, summarization and near-duplicate detection. They are used across a wide variety of domains such as image processing, bioinformatics, medicine, extreme weather prediction, the analysis of web log and customer shopping sequences, the study of XML query access patterns, electroencephalograph interpretation and entomological telemetry data mining. Exact motif discovery in soft real-time over 100K time series is a challenging problem. We present novel parallel algorithms for soft real-time exact motif discovery on multi-core architectures. Experimental results on a large-scale P6 SMP system, using real-life and synthetic time series data, demonstrate the scalability of our algorithms and their ability to discover motifs in soft real-time. To the best of our knowledge, this is the first such work on parallel scalable soft real-time exact motif discovery.
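As a point of reference for what exact motif discovery computes, the sketch below is a serial brute-force baseline, not the parallel algorithms of this paper. It assumes the common definition in which the motif of a collection of equal-length time series is the pair with the smallest Euclidean distance; the quadratic number of comparisons is what makes the problem hard at the scale of 100K series.

```python
import math
from itertools import combinations

def exact_motif(series):
    """Return (distance, i, j) for the closest pair of equal-length series."""
    best = (math.inf, None, None)
    for (i, a), (j, b) in combinations(enumerate(series), 2):
        d = math.dist(a, b)        # Euclidean distance (Python 3.8+)
        if d < best[0]:
            best = (d, i, j)
    return best

# Tiny illustrative collection: series 0 and 2 are nearly identical.
data = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.1],
    [5.0, 5.0, 5.0],
]
print(exact_motif(data))
```

Here the closest pair is series 0 and series 2, at a distance of about 0.1.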

conference on innovative data systems research | 2015