Nilesh N. Dalvi | Researchain

Publication

Featured researches published by Nilesh N. Dalvi.

very large data bases | 2004

Efficient query evaluation on probabilistic databases

Nilesh N. Dalvi; Dan Suciu

We describe a framework for supporting arbitrarily complex SQL queries with “uncertain” predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can efficiently compute most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm.
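As a rough illustration of the Monte-Carlo idea mentioned in this abstract, the sketch below estimates the probability of a Boolean query by sampling possible worlds. It is a naive sampler, not the algorithm from the paper. The tuple-independent model, the tables `R`, `S`, `T`, the query shape, and all function names are assumptions made for the example.

```python
import random

def sample_world(table, rng):
    """Keep each tuple independently with its marginal probability."""
    return {t for t, p in table.items() if rng.random() < p}

def query_holds(r, s, t):
    """Boolean query: is there a pair (x, y) with R(x), S(x, y) and T(y)?"""
    return any(x in r and y in t for (x, y) in s)

def estimate_probability(R, S, T, trials=20000, seed=0):
    """Fraction of sampled worlds in which the query is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        world = (sample_world(R, rng), sample_world(S, rng), sample_world(T, rng))
        if query_holds(*world):
            hits += 1
    return hits / trials

# Hypothetical tables: each tuple maps to its probability of being present.
R = {"a": 0.5}
S = {("a", "b"): 0.8}
T = {"b": 0.5}
estimate = estimate_probability(R, S, T)
```

With a single derivation the exact answer is the product 0.5 × 0.8 × 0.5 = 0.2, so the estimate should land close to that value. The sampling error shrinks only with the square root of the number of trials, which is one reason exact methods are preferred when a query admits them.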

international conference on data engineering | 2007

Efficient Top-k Query Evaluation on Probabilistic Data

Christopher Ré; Nilesh N. Dalvi; Dan Suciu

Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which computes and ranks efficiently the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run in parallel several Monte-Carlo simulations, one for each candidate answer, and approximate each probability only to the extent needed to compute correctly the top-k answers.
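The idea of refining each probability only as far as the ranking requires can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: each candidate answer is assumed to be a hypothetical zero-argument sampler returning True or False, the intervals use a plain Hoeffding bound with no correction across rounds, and the names `top_k`, `batch`, and `delta` are invented for the example.

```python
import math
import random

def hoeffding_radius(n, delta):
    """Half-width of a Hoeffding confidence interval after n Bernoulli samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def top_k(samplers, k, delta=0.01, batch=200, max_rounds=500):
    """Return (top-k names, samples drawn per candidate)."""
    names = list(samplers)
    if not 0 < k < len(names):
        raise ValueError("k must satisfy 0 < k < number of candidates")
    hits = {c: 0 for c in names}
    counts = {c: 0 for c in names}
    active = set(names)   # candidates that still need refinement
    ranked = names
    for _ in range(max_rounds):
        for c in names:   # fixed order keeps runs reproducible
            if c in active:
                hits[c] += sum(samplers[c]() for _ in range(batch))
                counts[c] += batch
        low, high = {}, {}
        for c in names:
            mean = hits[c] / counts[c]
            radius = hoeffding_radius(counts[c], delta)
            low[c], high[c] = mean - radius, mean + radius
        ranked = sorted(names, key=lambda c: low[c], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        weakest_top = min(low[c] for c in top)
        strongest_rest = max(high[c] for c in rest)
        if weakest_top >= strongest_rest:
            return top, counts   # top-k is separated from the rest
        # Refine only candidates whose interval crosses the contested region.
        active = {c for c in names
                  if low[c] < strongest_rest and high[c] > weakest_top}
    return ranked[:k], counts    # sample budget exhausted: best guess

# Hypothetical candidates with hidden probabilities, simulated by coin flips.
rng = random.Random(0)
probs = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}
samplers = {name: (lambda p=p: rng.random() < p) for name, p in probs.items()}
top, counts = top_k(samplers, k=2)
```

In this setup the clearly strong and clearly weak candidates stop being sampled after the first batch, while candidates near the cut-off receive further batches until their intervals separate.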

knowledge discovery and data mining | 2004

Adversarial classification

Nilesh N. Dalvi; Pedro M. Domingos; Sumit Sanghai; Deepak Verma

Essentially all data mining algorithms assume that the data-generating process is independent of the data miner's activities. However, in many domains, including spam detection, intrusion detection, fraud detection, surveillance and counter-terrorism, this is far from the case: the data is actively manipulated by an adversary seeking to make the classifier produce false negatives. In these domains, the performance of a classifier can degrade rapidly after it is deployed, as the adversary learns to defeat it. Currently the only solution to this is repeated, manual, ad hoc reconstruction of the classifier. In this paper we develop a formal framework and algorithms for this problem. We view classification as a game between the classifier and the adversary, and produce a classifier that is optimal given the adversary's optimal strategy. Experiments in a spam detection domain show that this approach can greatly outperform a classifier learned in the standard way, and (within the parameters of the problem) automatically adapt the classifier to the adversary's evolving manipulations.

symposium on principles of database systems | 2007

Management of probabilistic data: foundations and challenges

Nilesh N. Dalvi; Dan Suciu

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.

international conference on management of data | 2003

The Piazza peer data management project

Igor Tatarinov; Zachary G. Ives; Jayant Madhavan; Alon Y. Halevy; Dan Suciu; Nilesh N. Dalvi; Xin Dong; Yana Kadiyska; Gerome Miklau; Peter Mork

A major problem in today's information-driven world is that sharing heterogeneous, semantically rich data is incredibly difficult. Piazza is a peer data management system that enables sharing heterogeneous data in a distributed and scalable way. Piazza assumes the participants to be interested in sharing data, and willing to define pairwise mappings between their schemas. Then, users formulate queries over their preferred schema, and a query answering system recursively expands any mappings relevant to the query, retrieving data from other peers. In this paper, we provide a brief overview of the Piazza project, including our work on developing mapping languages and query reformulation algorithms, assisting the users in defining mappings, indexing, and enforcing access control over shared data.

international conference on management of data | 2005

MYSTIQ: a system for finding more answers by using probabilities

Jihad Boulos; Nilesh N. Dalvi; Bhushan Mandhani; Shobhit Mathur; Christopher Ré; Dan Suciu

MystiQ is a system that uses probabilistic query semantics [3] to find answers in large numbers of data sources of less than perfect quality. There are many reasons why the data originating from many different sources may be of poor quality, and therefore difficult to query: the same data item may have different representations in different sources; the schema alignments needed by a query system are imperfect and noisy; different sources may contain contradictory information, and, in particular, their combined data may violate some global integrity constraints; fuzzy matches between objects from different sources may return false positives or negatives. Even in such an environment, users sometimes want to ask complex, structurally rich queries, using query constructs typically found in SQL queries: joins, subqueries, existential/universal quantifiers, aggregate and group-by queries. For example, scientists may use such queries to query multiple scientific data sources, or a law enforcement agency may use them to find rare associations from multiple data sources. If standard query semantics were applied to such queries, all but the most trivial queries would return an empty answer.

Communications of The ACM | 2009

Probabilistic databases: diamonds in the dirt

Nilesh N. Dalvi; Christopher Ré; Dan Suciu

Treasures abound from hidden facts found in imprecise data sets.

international conference on data engineering | 2006

Robust Cardinality and Cost Estimation for Skyline Operator

Surajit Chaudhuri; Nilesh N. Dalvi; Raghav Kaushik

Incorporating the skyline operator inside the relational engine requires solving the cardinality estimation and the cost estimation problems, hitherto unaddressed. We propose robust techniques to estimate the cardinality and the computational cost of Skyline, and through an empirical comparison, show that our technique is substantially more effective than traditional approaches. Finally, we show through an implementation in Microsoft SQL Server that skyline queries can substantially benefit from our techniques.

very large data bases | 2011

Automatic wrappers for large scale web extraction

Nilesh N. Dalvi; Ravi Kumar; Mohamed A. Soliman

We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web-scale, with accuracy unattained with existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.

symposium on principles of database systems | 2007

The dichotomy of conjunctive queries on probabilistic structures

Nilesh N. Dalvi; Dan Suciu

We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.
