Anish Das Sarma | Researchain

Publication

Featured research published by Anish Das Sarma.

international world wide web conferences | 2007

Detecting near-duplicates for web crawling

Gurmeet Singh Manku; Arvind Jain; Anish Das Sarma

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
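As a rough illustration of the two ideas in this abstract, the sketch below computes a Charikar-style f-bit fingerprint and checks whether two fingerprints differ in at most k bit positions. It is a simplified sketch, not the paper's system: the function names, the use of MD5 as the per-token hash, the unweighted tokens, and the linear scan in `near_duplicates` are all assumptions made for the example (the abstract describes a dedicated technique for the lookup step, which is not reproduced here).

```python
import hashlib


def simhash(tokens, f=64):
    """Charikar-style fingerprint: one bit per position, set when more
    token hashes have a 1 than a 0 at that position (requires f <= 128)."""
    mask = (1 << f) - 1
    counts = [0] * f
    for tok in tokens:
        # MD5 is an arbitrary hash choice for this sketch; keep the low f bits.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest(), "big") & mask
        for i in range(f):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(f):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query, fingerprints, k=3):
    """Naive linear scan: stored fingerprints within k bits of the query."""
    return [fp for fp in fingerprints if hamming(query, fp) <= k]
```

For example, two pages whose token lists differ only in a small advertisement block would be expected to produce fingerprints with a small Hamming distance, so `near_duplicates(simhash(new_page_tokens), stored_fingerprints)` would flag the earlier page; `new_page_tokens` and `stored_fingerprints` are hypothetical names.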

international conference on data engineering | 2006

Working Models for Uncertain Data

Anish Das Sarma; Omar Benjelloun; Alon Y. Halevy; Jennifer Widom

This paper explores an inherent tension in modeling and querying uncertain data: simple, intuitive representations of uncertain data capture many application requirements, but these representations are generally incomplete, meaning that standard operations over the data may result in unrepresentable types of uncertainty. Complete models are theoretically attractive, but they can be nonintuitive and more complex than necessary for many applications. To address this tension, we propose a two-layer approach to managing uncertain data: an underlying logical model that is complete, and one or more working models that are easier to understand, visualize, and query, but may lose some information. We explore the space of incomplete working models, place several of them in a strict hierarchy based on expressive power, and study their closure properties. We describe how the two-layer approach is being used in our prototype DBMS for uncertain data, and we identify a number of interesting open problems to fully realize the approach.

international conference on management of data | 2008

Bootstrapping pay-as-you-go data integration systems

Anish Das Sarma; Xin Dong; Alon Y. Halevy

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort to create a mediated schema and semantic mappings from the data sources to the mediated schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-go approach to integration. With that approach, a system starts with very few (or inaccurate) semantic mappings, and these mappings are improved over time as deemed necessary. This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, including 50-800 data sources, and show that our system is able to produce high-quality answers with no human intervention.

very large data bases | 2008

Databases with uncertainty and lineage

Omar Benjelloun; Anish Das Sarma; Alon Y. Halevy; Martin Theobald; Jennifer Widom

This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality (data-minimal and lineage-minimal) and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. We also show how ULDBs enable a new approach to query processing in probabilistic databases. Finally, we describe the current state of the Trio system, our implementation of ULDBs under development at Stanford.

international conference on management of data | 2014

Fusing data with correlations

Ravali Pochampally; Anish Das Sarma; Xin Luna Dong; Alexandra Meliou; Divesh Srivastava

Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data from erroneous data in order to create a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority or at least a certain number of sources may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.

international conference on management of data | 2012

Finding related tables

Anish Das Sarma; Lujun Fang; Nitin Gupta; Alon Y. Halevy; Hongrae Lee; Fei Wu; Reynold S. Xin; Cong Yu

We consider the problem of finding related tables in a large corpus of heterogeneous tables. Detecting related tables provides users a powerful tool for enhancing their tables with additional data and enables effective reuse of available public data. Our first contribution is a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union. Our second contribution is a set of algorithms for detecting related tables that can be either unioned or joined. We describe a set of experiments that demonstrate that our algorithms produce highly related tables. We also show that we can often improve the results of table search by pulling up tables that are ranked much lower based on their relatedness to top-ranked tables. Finally, we describe how to scale up our algorithms and show the results of running them on a corpus of over a million tables extracted from Wikipedia.

international conference on data engineering | 2012

Fuzzy Joins Using MapReduce

Foto N. Afrati; Anish Das Sarma; David Menestrina; Aditya G. Parameswaran; Jeffrey D. Ullman

Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so that a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.
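To make the single-round constraint concrete, here is a small in-memory simulation of one possible Hamming-distance fuzzy join. It is a sketch under stated assumptions, not necessarily one of the paper's algorithms as published: it uses a segment-splitting scheme (cut each string into d + 1 segments, so any two strings within distance d must agree on at least one segment), assumes equal-length input strings, and simulates the shuffle with a dictionary. All function names are hypothetical. The reducer emits a pair only for the first segment on which the two strings agree, which is how each output pair ends up being produced by exactly one reduce task.

```python
from collections import defaultdict
from itertools import combinations


def split(s, parts):
    """Cut s into `parts` contiguous segments of near-equal length."""
    n = len(s)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(parts)]


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_phase(strings, d):
    """Emit ((segment index, segment text), string) for each of d + 1 segments."""
    for s in strings:
        for i, seg in enumerate(split(s, d + 1)):
            yield (i, seg), s


def reduce_phase(key, values, d):
    """Compare strings sharing a segment; emit each qualifying pair once."""
    i, _ = key
    for a, b in combinations(sorted(values), 2):
        if hamming(a, b) > d:
            continue
        sa, sb = split(a, d + 1), split(b, d + 1)
        # Only the reducer for the first agreeing segment outputs the pair.
        first = next(j for j in range(d + 1) if sa[j] == sb[j])
        if first == i:
            yield a, b


def fuzzy_join(strings, d):
    """Simulate one MapReduce round: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in map_phase(set(strings), d):
        groups[key].append(value)
    pairs = []
    for key, values in groups.items():
        pairs.extend(reduce_phase(key, values, d))
    return sorted(pairs)
```

In this sketch every string is sent to d + 1 reducers, so the communication cost grows with d, while the reducer cost depends on how many strings share a segment. That is one instance of the communication versus reducer-cost tradeoff the abstract mentions.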

web search and data mining | 2011

Dynamic relationship and event discovery

Anish Das Sarma; Alpa Jain; Cong Yu

This paper studies the problem of dynamic relationship and event discovery. A large body of previous work on relation extraction focuses on discovering predefined and static relationships between entities. In contrast, we aim to identify temporally defined (e.g., co-bursting) relationships that are not predefined by an existing schema, and we identify the underlying time-constrained events that lead to these relationships. The key challenges in identifying such events include discovering and verifying dynamic connections among entities, and consolidating binary dynamic connections into events consisting of a set of entities that are connected during a given time period. We formalize this problem and introduce an efficient end-to-end pipeline as a solution. In particular, we introduce two formal notions, global temporal constraint cluster and local temporal constraint cluster, for detecting dynamic events. We further design efficient algorithms for discovering such events from a large graph of dynamic relationships. Finally, detailed experiments on real data show the effectiveness of our proposed solution.

web search and data mining | 2010

Ranking mechanisms in twitter-like forums

Anish Das Sarma; Atish Das Sarma; Sreenivas Gollapudi; Rina Panigrahy

We study the problem of designing a mechanism to rank items in forums by making use of user reviews such as thumb and star ratings. We compare mechanisms where forum users rate individual posts with mechanisms where the user is asked to perform a pairwise comparison and state which post is better. The main metric used to evaluate a mechanism is the ranking accuracy versus the cost of reviews, where the cost is measured as the average number of reviews used per post. We show that for many reasonable probability models, there is no thumb (or star) based ranking mechanism that can produce approximately accurate rankings with a bounded number of reviews per item. On the other hand, we provide a review mechanism based on pairwise comparisons which achieves approximate rankings with bounded cost. We have implemented a system, shoutvelocity, which is a Twitter-like forum in which items (i.e., tweets in Twitter) are rated by using comparisons. For each new item, the user who posts the item is required to compare two previous entries. This ensures that over a sequence of n posts, we get at least n comparisons, requiring one review per item on average. Our mechanism uses this sequence of comparisons to obtain a ranking estimate. It ensures that every item is reviewed at least once and that winning entries are reviewed more often to obtain better estimates of top items.

international conference on database theory | 2010

Synthesizing view definitions from data

Anish Das Sarma; Aditya G. Parameswaran; Hector Garcia-Molina; Jennifer Widom

Given a database instance and a corresponding view instance, we address the view definitions problem (VDP): Find the most succinct and accurate view definition, when the view query is restricted to a specific family of queries. We study the tradeoffs among succinctness, level of approximation, and the family of queries through algorithms and complexity results. For each family of queries, we address three variants of the VDP: (1) Determine whether an exact view definition exists and, if so, find it. (2) Find the best view definition, i.e., one as close to the input view instance as possible, and as succinct as possible. (3) Find an approximate view definition that satisfies an input approximation threshold, and is as succinct as possible.
