Vibhor Rastogi
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Vibhor Rastogi.
international world wide web conferences | 2013
Nilesh N. Dalvi; Anirban Dasgupta; Ravi Kumar; Vibhor Rastogi
In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed, reliability that determines the users error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user--question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art.
very large data bases | 2011
Vibhor Rastogi; Nilesh N. Dalvi; Minos N. Garofalakis
There have been several recent advancements in Machine Learning community on the Entity Matching (EM) problem. However, their lack of scalability has prevented them from being applied in practical settings on large real-life datasets. Towards this end, we propose a principled framework to scale any generic EM algorithm. Our technique consists of running multiple instances of the EM algorithm on small neighborhoods of the data and passing messages across neighborhoods to construct a global solution. We prove formal properties of our framework and experimentally demonstrate the effectiveness of our approach in scaling EM algorithms.
international conference on data engineering | 2013
Vibhor Rastogi; Ashwin Machanavajjhala; Laukik Vilas Chitnis; Anish Das Sarma
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting d the diameter of the graph, and n the number of nodes in the largest component, all prior techniques for map-reduce either require a linear, Θ(d), number of rounds, or a quadratic, Θ (n|V| + |E|), communication per round. We propose here two efficient map-reduce algorithms: (i) Hash-Greater-to-Min, which is a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V | + |E|) communication per round, and (ii) Hash-to-Min, which is a novel algorithm, provably finishing in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel, but not tight, and it is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
knowledge discovery and data mining | 2012
Kedar Bellare; Suresh Iyengar; Aditya G. Parameswaran; Vibhor Rastogi
In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or non-duplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs). To address this, a recent paper [1] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst-case.n Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sub-linear label complexity (under certain distributional assumptions). Our algorithm uses as a black-box any active learning module that minimizes 0-1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black-box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
international world wide web conferences | 2012
Aditya Pal; Vibhor Rastogi; Ashwin Machanavajjhala; Philip Bohannon
Often an interesting true value such as a stock price, sports score, or current temperature is only available via the observations of noisy and potentially conflicting sources. Several techniques have been proposed to reconcile these conflicts by computing a weighted consensus based on source reliabilities, but these techniques focus on static values. When the real-world entity evolves over time, the noisy sources can delay, or even miss, reporting some of the real-world updates. This temporal aspect introduces two key challenges for consensus-based approaches: (i) due to delays, the mapping between a sources noisy observation and the real-world update it observes is unknown, and (ii) missed updates may translate to missing values for the consensus problem, even if the mapping is known. To overcome these challenges, we propose a formal approach that models the history of updates of the real-world entity as a hidden semi-Markovian process (HSMM). The noisy sources are modeled as observations of the hidden state, but the mapping between a hidden state (i.e. real-world update) and the observation (i.e. source value) is unknown. We propose algorithms based on Gibbs Sampling and EM to jointly infer both the history of real-world updates as well as the unknown mapping between them and the source values. We demonstrate using experiments on real-world datasets how our history-based techniques improve upon history-agnostic consensus-based approaches.
symposium on cloud computing | 2014
Raimondas Kiveris; Silvio Lattanzi; Vahab S. Mirrokni; Vibhor Rastogi; Sergei Vassilvitskii
Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on traditional MapReduce architecture for large scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees, and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.
international world wide web conferences | 2013
Nilesh N. Dalvi; Vibhor Rastogi; Anirban Dasgupta; Anish Das Sarma; Tamas Sarlos
In this paper, we consider the problem of devising blocking schemes for entity matching. There is a lot of work on blocking techniques for supporting various kinds of predicates, e.g. exact matches, fuzzy string-similarity matches, and spatial matches. However, given a complex entity matching function in the form of a Boolean expression over several such predicates, we show that it is an important and non-trivial problem to combine the individual blocking techniques into an efficient blocking scheme for the entity matching function, a problem that has not been studied previously.n In this paper, we make fundamental contributions to this problem. We consider an abstraction for modeling complex entity matching functions as well as blocking schemes. We present several results of theoretical and practical interest for the problem. We show that in general, the problem of computing the optimal blocking strategy is NP-hard in the size of the DNF formula describing the matching function. We also present several algorithms for computing the exact optimal strategies (with exponential complexity, but often feasible in practice) as well as fast approximation algorithms. We experimentally demonstrate over commercially used rule-based matching systems over real datasets at Yahoo!, as well as synthetic datasets, that our blocking strategies can be an order of magnitude faster than the baseline methods, and our algorithms can efficiently find good blocking strategies.
knowledge discovery and data mining | 2011
Nilesh N. Dalvi; Ravi Kumar; Ashwin Machanavajjhala; Vibhor Rastogi
Given an unknown set of objects embedded in the Euclidean plane and a nearest-neighbor oracle, how to estimate the set size and other properties of the objects? In this paper we address this problem. We propose an efficient method that uses the Voronoi partitioning of the space by the objects and a nearest-neighbor oracle. Our method can be used in the hidden web/databases context where the goal is to estimate the number of certain objects of interest. Here, we assume that each object has a geographic location and the nearest-neighbor oracle can be realized by applications such as maps, local, or store-locator APIs. We illustrate the performance of our method on several real-world datasets.
ACM Transactions on Knowledge Discovery From Data | 2013
Kedar Bellare; Suresh Iyengar; Aditya G. Parameswaran; Vibhor Rastogi
In entity matching, a fundamental issue while training a classifier to label pairs of entities as either duplicates or nonduplicates is the one of selecting informative training examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0--1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more nonduplicate pairs than duplicate pairs). To address this, a recent paper [Arasu et al. 2010] proposes to maximize recall of the classifier under the constraint that its precision should be greater than a specified threshold. However, the proposed technique requires the labels of all n input pairs in the worst case.n Our main result is an active learning algorithm that approximately maximizes recall of the classifier while respecting a precision constraint with provably sublinear label complexity (under certain distributional assumptions). Our algorithm uses as a black box any active learning module that minimizes 0--1 loss. We show that label complexity of our algorithm is at most log n times the label complexity of the black box, and also bound the difference in the recall of classifier learnt by our algorithm and the recall of the optimal classifier satisfying the precision constraint. We provide an empirical evaluation of our algorithm on several real-world matching data sets that demonstrates the effectiveness of our approach.
neural information processing systems | 2012
Nilesh N. Dalvi; Aditya G. Parameswaran; Vibhor Rastogi