
Publication


Featured research published by Aniket Chakrabarti.


International World Wide Web Conference | 2015

Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing

Aniket Chakrabarti; Srinivasan Parthasarathy

All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure of interest. When the number of points or the dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index-based approaches. BayesLSH is used for pruning the candidate space and computing approximate similarity, whereas BayesLSHLite can only prune the candidates; similarity must be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much on quality. The loss in quality is higher, however, in the BayesLSH variant, where no explicit data representation is available, only a hash sketch, and similarity has to be estimated approximately. In this work we revisit the LSH problem from a frequentist setting and formulate sequential tests for the composite hypothesis (similarity greater than or less than a threshold) that can be leveraged by such LSH algorithms to adaptively and aggressively prune candidates. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed, using a fixed-width sequential confidence interval generation technique. We compare these novel variants with the SPRT variant and the BayesLSH/BayesLSHLite variants and show that they can provide tighter qualitative guarantees over BayesLSH/BayesLSHLite -- a state-of-the-art approach -- while being up to 2.1x faster than a traditional SPRT and 8.8x faster than AllPairs.
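To make the sequential-testing idea concrete, here is a minimal sketch of a vanilla Wald SPRT used to prune one candidate pair from a stream of LSH hash agreements. It assumes minhash-style hashing, where the probability that a single hash agrees equals the pair's (Jaccard) similarity; the thresholds, error rates, and function name are illustrative, and the paper's novel variants go beyond this baseline test.

```python
import math

def sprt_prune(hash_agreements, t0=0.65, t1=0.75, alpha=0.05, beta=0.05):
    """Sequentially decide whether a candidate pair's similarity is below t0
    (prune) or above t1 (keep), scanning hash agreements one at a time.

    hash_agreements: iterable of 0/1 values, 1 if the pair agrees on a hash.
    Under minhash-style LSH, P(agreement) equals the Jaccard similarity, so
    the agreements are Bernoulli(s) draws and Wald's SPRT applies directly.
    Returns "keep", "prune", or "undecided" if the stream runs out first.
    """
    accept_h1 = math.log((1 - beta) / alpha)   # upper decision boundary
    accept_h0 = math.log(beta / (1 - alpha))   # lower decision boundary
    llr = 0.0
    for agree in hash_agreements:
        if agree:
            llr += math.log(t1 / t0)
        else:
            llr += math.log((1 - t1) / (1 - t0))
        if llr >= accept_h1:
            return "keep"      # similarity likely >= t1
        if llr <= accept_h0:
            return "prune"     # similarity likely <= t0
    return "undecided"         # fall back to an exact comparison
```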


ACM Transactions on Knowledge Discovery From Data | 2015

A Bayesian Perspective on Locality Sensitive Hashing with Extensions for Kernel Methods

Aniket Chakrabarti; Venu Satuluri; Atreya Srivathsan; Srinivasan Parthasarathy

Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. In order to reduce the number of candidates to search, locality-sensitive hashing (LSH) based indexing methods are very effective. However, most such methods only use LSH for the first phase of similarity search—that is, efficient indexing for candidate generation. In this article, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search—performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms are able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, in terms of both accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x to 20x for a wide variety of datasets. We also extend the BayesLSH algorithm for kernel methods—in which the similarity between two data objects is defined by a kernel function. Since the embedding of data points in the transformed kernel space is unknown, algorithms such as AllPairs, which rely on building an inverted index structure for fast similarity search, do not work with kernel functions. Exhaustive search across all possible pairs is also not an option since the dataset can be huge and computing the kernel values for each pair can be prohibitive. We propose K-BayesLSH, an all-pairs similarity search algorithm for kernel functions. K-BayesLSH leverages a recently proposed idea—kernelized locality sensitive hashing (KLSH)—for hash bit computation and candidate generation, and uses the aforementioned BayesLSH idea for candidate pruning and similarity estimation. We ran a broad spectrum of experiments on a variety of datasets drawn from different domains and with distinct kernels and found a speedup of 2x to 7x over vanilla KLSH.
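A minimal sketch of the central Bayesian step, assuming minhash-style hashing (each compared hash agrees with probability equal to the true similarity) and a uniform Beta(1, 1) prior: the posterior over similarity after some comparisons is a Beta distribution, and a candidate is pruned when the posterior probability of exceeding the threshold is small. Function and parameter names are hypothetical; the full BayesLSH algorithm also decides when to stop hashing and supports other priors.

```python
from scipy.stats import beta

def bayes_prune(agreements, total, threshold=0.7, eps=0.05):
    """Decide whether to prune a candidate pair given `agreements` matching
    hashes out of `total` compared.

    Assumes each hash comparison is a Bernoulli(s) trial in the true
    similarity s (as with minhash and Jaccard similarity) and a uniform
    Beta(1, 1) prior on s, so the posterior is
    Beta(agreements + 1, total - agreements + 1).
    Returns (prune, posterior_mean_similarity).
    """
    a, b = agreements + 1, total - agreements + 1
    p_above = 1.0 - beta.cdf(threshold, a, b)   # P(s > threshold | data)
    posterior_mean = a / (a + b)                 # point estimate of s
    return p_above < eps, posterior_mean         # prune if unlikely to qualify
```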


Conference on Information and Knowledge Management | 2016

Topological Graph Sketching for Incremental and Scalable Analytics

Bortik Bandyopadhyay; David Fuhry; Aniket Chakrabarti; Srinivasan Parthasarathy

We propose a novel, scalable, and principled graph sketching technique based on minwise hashing of local neighborhoods. For an n-node graph with e edges (e >> n), we incrementally maintain in real time a minwise neighbor-sampled subgraph using k hash functions in O(n × k) memory, with the limit user-configurable via the parameter k. Symmetrization- and similarity-based techniques can recover from these data structures a significant portion of the original graph. We present a theoretical analysis of the minwise sampling strategy and also derive unbiased estimators for important graph properties such as triangle count and neighborhood overlap. We perform an extensive empirical evaluation of our graph sketch and its derivatives on a wide variety of real-world graph data sets drawn from different application domains, using important large-network analysis algorithms: local and global clustering coefficient, PageRank, and local graph sparsification. With bounded memory, the quality of results using the sketch representation is competitive against baselines which use the full graph, and the computational performance is often better. Our framework is flexible and configurable to be leveraged by numerous other graph analytics algorithms, potentially reducing the information mining time on large streamed graphs for a variety of applications.
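A minimal sketch, under the assumption of a streamed undirected edge list, of maintaining k minwise hash values per node and estimating neighborhood overlap (Jaccard similarity of neighbor sets) from sketch agreement; class and method names are illustrative, and the paper's triangle-count estimators and symmetrization step are not shown.

```python
import random

class MinwiseGraphSketch:
    """Maintain k minwise hash values of each node's neighbor set, updated
    incrementally as edges stream in; O(n * k) memory for n seen nodes."""

    def __init__(self, k=32, seed=0):
        rng = random.Random(seed)
        # One (a, b) pair per hash function for h_i(x) = (a*x + b) mod p.
        self.p = (1 << 61) - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(k)]
        self.sketch = {}   # node -> list of k current minimum hash values

    def _update(self, node, neighbor):
        mins = self.sketch.setdefault(node, [float("inf")] * len(self.hashes))
        for i, (a, b) in enumerate(self.hashes):
            h = (a * neighbor + b) % self.p
            if h < mins[i]:
                mins[i] = h

    def add_edge(self, u, v):
        # Undirected edge: v joins N(u) and u joins N(v).
        self._update(u, v)
        self._update(v, u)

    def neighborhood_overlap(self, u, v):
        """Estimate of Jaccard(N(u), N(v)) from sketch agreement."""
        su, sv = self.sketch[u], self.sketch[v]
        return sum(a == b for a, b in zip(su, sv)) / len(su)
```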


International Joint Conference on Artificial Intelligence | 2017

Fast Change Point Detection on Dynamic Social Networks

Yu Wang; Aniket Chakrabarti; David Sivakoff; Srinivasan Parthasarathy

A number of real-world problems in many domains (e.g., sociology, biology, political science, and communication networks) can be modeled as dynamic networks, with nodes representing entities of interest and edges representing interactions among the entities at different points in time. A common representation for such models is the snapshot model, where a network is defined at logical time stamps. An important problem under this model is change point detection. In this work we devise an effective and efficient three-step approach for detecting change points in dynamic networks under the snapshot model. Our algorithm achieves up to 9X speedup over the state-of-the-art while improving quality on both synthetic and real world networks.
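The abstract does not spell out the three steps, so the following is only a generic snapshot-model baseline for comparison: score each consecutive pair of snapshots by the Jaccard distance of their edge sets and flag time stamps whose score is an outlier relative to a trailing window. All names and thresholds are hypothetical; this is not the paper's algorithm.

```python
import numpy as np

def detect_change_points(snapshots, window=5, z_thresh=3.0):
    """Flag snapshots whose edge set differs unusually from the previous one.

    `snapshots` is a list of edge sets (sets of (u, v) tuples), one per
    logical time stamp. For each consecutive pair we compute the Jaccard
    distance between edge sets, then flag time stamps whose distance is more
    than `z_thresh` standard deviations above the mean of the trailing
    `window` distances. A generic baseline, not the paper's three-step method.
    """
    dists = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        union = len(prev | curr)
        dists.append(1.0 - len(prev & curr) / union if union else 0.0)

    change_points = []
    for t in range(window, len(dists)):
        hist = np.array(dists[t - window:t])
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and (dists[t] - mu) / sigma > z_thresh:
            change_points.append(t + 1)   # index into `snapshots`
    return change_points
```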


International Conference on Conceptual Structures | 2016

D-STHARk: Evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures

Sávyo Toledo; Danilo Melo; Guilherme Andrade; Fernando Mourão; Aniket Chakrabarti; Renato Ferreira; Srinivasan Parthasarathy; Leonardo C. da Rocha

The emergence of applications that must efficiently handle growing amounts of data has stimulated the development of new computing architectures with several Processing Units (PUs), such as CPU cores, graphics processing units (GPUs), and the Intel Xeon Phi (MIC). Aiming to better exploit these architectures, recent works focus on proposing novel runtime environments that offer a variety of methods for scheduling tasks dynamically on different PUs. A main limitation of such proposals is the constrained system configurations usually adopted to tune and test them, since setting up more complete and diversified evaluation environments is costly. In this context, we present D-STHARk, a GUI tool for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. D-STHARk provides a complete simulated execution environment that allows evaluating dynamic scheduling strategies on simulated applications and hybrid architectures. We evaluate our tool by simulating the dynamic scheduling strategies presented in [3], using the same architecture and application. D-STHARk was able to reach the same conclusions originally reported by the authors. Moreover, we performed an experiment varying the number of coprocessors, which had not been verified previously due to the lack of real architectures, showing that energy consumption may be reduced while keeping the same performance.
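For intuition about what such a simulator schedules, here is a toy sketch of dynamic, earliest-available-PU task dispatch over heterogeneous processing units with different speeds; it illustrates the general idea only and is not D-STHARk's simulation engine or the strategies from [3].

```python
import heapq

def simulate_dynamic_schedule(task_costs, pu_speeds):
    """Toy simulation of dynamic task scheduling on heterogeneous PUs.

    task_costs: list of task work amounts (arbitrary units).
    pu_speeds:  dict mapping a PU name (e.g. "cpu0", "gpu0", "mic0") to its
                relative speed; a task of cost c runs for c / speed time.
    Each task is dispatched to the PU that becomes free first, mimicking a
    dynamic runtime rather than a static partition. Returns the makespan and
    the per-PU assignment. Illustrative only.
    """
    ready = [(0.0, pu) for pu in pu_speeds]   # (time PU becomes free, PU)
    heapq.heapify(ready)
    assignment = {pu: [] for pu in pu_speeds}

    for tid, cost in enumerate(task_costs):
        free_at, pu = heapq.heappop(ready)       # earliest available PU
        finish = free_at + cost / pu_speeds[pu]  # heterogeneous run time
        assignment[pu].append(tid)
        heapq.heappush(ready, (finish, pu))

    makespan = max(t for t, _ in ready)
    return makespan, assignment
```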


European Conference on Machine Learning | 2016

Improving Locality Sensitive Hashing Based Similarity Search and Estimation for Kernels

Aniket Chakrabarti; Bortik Bandyopadhyay; Srinivasan Parthasarathy

We present a novel data embedding that significantly reduces the estimation error of the locality sensitive hashing (LSH) technique when used in a reproducing kernel Hilbert space (RKHS). Efficient and accurate kernel approximation techniques either involve the kernel principal component analysis (KPCA) approach or the Nyström approximation method. In this work we show that extant LSH methods in this space suffer from a bias problem that, moreover, is difficult to estimate a priori. Consequently, the LSH estimate of a kernel differs from that of the KPCA/Nyström approximation. We provide a theoretical rationale for this bias, which is also confirmed empirically. We propose an LSH algorithm that can reduce this bias, and consequently our approach can match the KPCA or Nyström methods' estimation accuracy while retaining the traditional benefits of LSH. We evaluate our algorithm on a wide range of real-world image datasets (for which kernels are known to perform well) and show the efficacy of our algorithm using a variety of principled evaluations, including mean estimation error, KL divergence, and the Kolmogorov-Smirnov test.
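For reference, this is the standard sign-random-projection (SimHash) estimator whose behavior in kernel-induced spaces the paper analyzes: similarity is recovered from the fraction of agreeing hash bits via the collision probability 1 - θ/π. The bias-reducing embedding proposed in the paper is not reproduced here; names and sizes are illustrative.

```python
import numpy as np

def srp_hash(x, projections):
    """Sign random projection (SimHash) bits for a vector x."""
    return (projections @ x) >= 0

def lsh_cosine_estimate(x, y, projections):
    """Standard LSH estimate of cosine similarity from hash-bit agreement.

    With sign random projections, P(bit agrees) = 1 - theta / pi, where
    theta is the angle between x and y; inverting that relation gives the
    estimate below. This is the baseline estimator, not the paper's
    bias-corrected embedding.
    """
    hx, hy = srp_hash(x, projections), srp_hash(y, projections)
    agree_frac = np.mean(hx == hy)
    return np.cos(np.pi * (1.0 - agree_frac))

# Example: 256 random hyperplanes for 64-dimensional explicit features.
rng = np.random.default_rng(0)
P = rng.standard_normal((256, 64))
x, y = rng.standard_normal(64), rng.standard_normal(64)
est = lsh_cosine_estimate(x, y, P)
```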


Web Science | 2017

Hierarchical Change Point Detection on Dynamic Networks

Yu Wang; Aniket Chakrabarti; David Sivakoff; Srinivasan Parthasarathy

This paper studies change point detection on networks with community structures. It proposes a framework that can detect both local and global changes in networks efficiently and, importantly, can clearly distinguish the two types of changes. The framework design is generic, and several state-of-the-art change point detection algorithms fit within it. Experiments on both synthetic and real-world networks show that this framework can accurately detect changes while achieving up to 800X speedup.
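A rough sketch of the local-versus-global split, assuming a per-snapshot edge-set representation and any single-sequence detector (such as the baseline sketched earlier): run the detector on the whole network and within each community, then report community-only detections as local changes. The aggregation rule and names are hypothetical and do not reflect the paper's framework.

```python
def hierarchical_change_points(snapshots, communities, detect):
    """Run a change point detector globally and within each community.

    snapshots:   list of edge sets (sets of (u, v) tuples) per time stamp.
    communities: dict mapping community id -> set of member nodes.
    detect:      any detector taking a list of edge sets and returning
                 change-point indices (e.g. detect_change_points above).
    A time stamp flagged on the whole network is reported as global; one
    flagged only inside a few communities is reported as local to them.
    """
    global_cps = set(detect(snapshots))
    local_hits = {}   # time stamp -> list of communities flagging it
    for cid, members in communities.items():
        sub = [set((u, v) for (u, v) in snap if u in members and v in members)
               for snap in snapshots]
        for t in detect(sub):
            local_hits.setdefault(t, []).append(cid)

    local_only = {t: cids for t, cids in local_hits.items()
                  if t not in global_cps and len(cids) < len(communities) / 2}
    return sorted(global_cps), local_only
```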


International Conference on Parallel Processing | 2017

A Pareto Framework for Data Analytics on Heterogeneous Systems: Implications for Green Energy Usage and Performance

Aniket Chakrabarti; Srinivasan Parthasarathy; Christopher Stewart

Distributed algorithms for data analytics partition their input data across many machines for parallel execution. At scale, it is likely that some machines will perform worse than others because they are slower, power constrained, or dependent on undesirable, dirty energy sources. It is challenging to balance analytics workloads across heterogeneous machines because the algorithms are sensitive to statistical skew in data partitions. A skewed partition can slow down the whole workload or degrade the quality of results. Sizing partitions in proportion to each machine's performance may introduce or further exacerbate skew. In this paper, we propose a scheme that controls the statistical distribution of each partition and sizes partitions according to the heterogeneity of the computing environment. We model heterogeneity as a multi-objective optimization, with the objectives being functions for execution time and dirty energy consumption. We use stratification to control skew. Experiments show that our computational heterogeneity-aware (Het-Aware) partitioning strategy speeds up workloads by up to 51% over the stratified partitioning baseline. We also have a heterogeneity- and energy-aware (Het-Energy-Aware) partitioning scheme which is slower than the Het-Aware solution but can lower the dirty energy footprint by up to 26%. For some analytic tasks, there is also a significant qualitative benefit when using such partitioning strategies.
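A minimal sketch of heterogeneity-aware, stratified partition sizing, assuming records carry a stratum label and each machine has a precomputed performance weight: every stratum is split across machines in proportion to the weights, so partition sizes differ while per-partition distributions stay similar. The weight selection via the multi-objective (time versus dirty energy) model is not shown; all names are illustrative.

```python
import random
from collections import defaultdict

def stratified_heterogeneous_partition(records, stratum_of, machine_weights, seed=0):
    """Partition `records` across machines proportionally to their weights
    while keeping every stratum's share roughly equal in all partitions.

    records:         list of data records.
    stratum_of:      function mapping a record to its stratum label.
    machine_weights: dict machine -> relative performance weight (a faster
                     machine gets a proportionally larger partition).
    Illustrative sketch of stratification-based, heterogeneity-aware sizing.
    """
    rng = random.Random(seed)
    total_w = sum(machine_weights.values())
    machines = list(machine_weights)

    strata = defaultdict(list)
    for r in records:
        strata[stratum_of(r)].append(r)

    parts = {m: [] for m in machines}
    for items in strata.values():
        rng.shuffle(items)
        start = 0
        for i, m in enumerate(machines):
            # Proportional slice of this stratum for machine m; the last
            # machine absorbs any rounding remainder.
            share = machine_weights[m] / total_w
            end = len(items) if i == len(machines) - 1 else start + round(share * len(items))
            parts[m].extend(items[start:end])
            start = end
    return parts
```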


International Conference on Systems for Energy-Efficient Built Environments | 2016

Robust Anomaly Detection for Large-Scale Sensor Data

Aniket Chakrabarti; Manish Marwah; Martin F. Arlitt

Large-scale sensor networks are ubiquitous nowadays. An important objective of deploying sensors is to detect anomalies in the monitored system or infrastructure, which allows remedial measures to be taken to prevent failures, inefficiencies, and security breaches. Most existing sensor anomaly detection methods are local, i.e., they do not capture the global dependency structure of the sensors, nor do they perform well in the presence of missing or erroneous data. In this paper, we propose an anomaly detection technique for large-scale sensor data that leverages relationships between sensors to improve robustness even when data is missing or erroneous. We develop a probabilistic graphical model-based global outlier detection technique that represents a sensor network as a pairwise Markov Random Field and uses graphical model inference to detect anomalies. We show that our model is more robust than local models and detects anomalies with 90% accuracy even when 50% of the sensors are erroneous. We also build a synthetic graphical model generator that preserves statistical properties of a real data set to test our outlier detection technique at scale.
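A small sketch of the pairwise-MRF idea, assuming binary anomaly labels per sensor, a unary term driven by each sensor's deviation score, and a pairwise term that penalizes disagreement between correlated sensors; inference here is plain iterated conditional modes (ICM) rather than the inference used in the paper, and all parameters are illustrative.

```python
import numpy as np

def mrf_anomaly_labels(zscores, edges, lam=1.0, coupling=0.5, iters=10):
    """Label sensors as anomalous (1) or normal (0) via approximate MAP
    inference on a pairwise Markov Random Field.

    zscores: per-sensor deviation scores (|reading - expected| / std); a
             missing or clearly erroneous reading can be given a score of 0
             so its label is driven by its neighbors instead.
    edges:   list of (i, j) index pairs of correlated sensors (MRF edges).
    The unary energy prefers label 1 when the deviation is large; the
    pairwise energy charges `coupling` per disagreeing edge, smoothing out
    isolated noisy readings.
    """
    z = np.asarray(zscores, dtype=float)
    n = len(z)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    labels = (z > lam).astype(int)   # greedy initialization from unary terms
    for _ in range(iters):
        for i in range(n):
            # Energy of labeling sensor i normal (0) vs anomalous (1),
            # given its own deviation and its neighbors' current labels.
            e0 = z[i] / lam + coupling * sum(labels[j] != 0 for j in nbrs[i])
            e1 = lam + coupling * sum(labels[j] != 1 for j in nbrs[i])
            labels[i] = 0 if e0 <= e1 else 1
    return labels
```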


International Conference on Computer Communications | 2016

Green- and Heterogeneity-Aware Partitioning for Data Analytics

Aniket Chakrabarti; Srinivasan Parthasarathy; Christopher Stewart

Distributed algorithms for analytics partition their input data across many machines for parallel execution. At scale, it is likely that some machines will perform worse than others because they are slower, power constrained, or dependent on undesirable, dirty energy sources. It is even more challenging to balance analytics workloads because the algorithms are sensitive to statistical skew across data partitions, not just partition size. Here, we propose a lightweight framework that controls the statistical distribution of each partition and sizes partitions according to the heterogeneity of the environment. We model heterogeneity as a multi-objective optimization, with the objectives being functions for execution time and dirty energy consumption. We use stratification to control data skew and to model heterogeneity-aware partitioning. We then discover Pareto-optimal partitioning strategies. We built our partitioning framework atop Redis and measured its performance on data mining workloads with realistic data sets. Our framework simultaneously achieved a 34% reduction in time and a 21% reduction in dirty energy usage for a popular webgraph compression algorithm using 8 partitions.
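As an illustration of the Pareto step, the sketch below filters candidate partitioning configurations down to the non-dominated ones given estimated (time, dirty energy) costs; the cost models and candidate generation are left abstract, and the example numbers are hypothetical.

```python
def pareto_front(candidates):
    """Return the Pareto-optimal subset of (time, dirty_energy, config) tuples.

    A candidate is kept if no other candidate is at least as good on both
    objectives and strictly better on one. The cost models that produce the
    (time, energy) estimates are not shown here.
    """
    front = []
    for t, e, cfg in candidates:
        dominated = any(t2 <= t and e2 <= e and (t2 < t or e2 < e)
                        for t2, e2, _ in candidates)
        if not dominated:
            front.append((t, e, cfg))
    return front

# Example with hypothetical (estimated time, dirty energy) costs.
strategies = [(120.0, 40.0, "perf-weighted"),
              (150.0, 25.0, "green-weighted"),
              (160.0, 45.0, "uniform")]
print(pareto_front(strategies))   # "uniform" is dominated and dropped
```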
