Mingxi Wu | Researchain

Publication

Featured research published by Mingxi Wu.

IEEE Transactions on Knowledge and Data Engineering | 2007

Conditional Anomaly Detection

Xiuyao Song; Mingxi Wu; Chris Jermaine; Sanjay Ranka

When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software can soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.

knowledge discovery and data mining | 2007

Statistical change detection for multi-dimensional data

Xiuyao Song; Mingxi Wu; Chris Jermaine; Sanjay Ranka

This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the density test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test statistic that is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection.

knowledge discovery and data mining | 2006

Outlier detection by sampling with accuracy guarantees

Mingxi Wu; Christopher Jermaine

An effective approach to detecting anomalous points in a data set is distance-based outlier detection. This paper describes a simple sampling algorithm to efficiently detect distance-based outliers in domains where each and every distance computation is very expensive. Unlike any existing algorithms, the sampling algorithm requires a fixed number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. The experimental study on two expensive domains as well as ten additional real-life datasets demonstrates both the efficiency and effectiveness of the sampling algorithm in comparison with the state-of-the-art algorithm and the reliability of the accuracy guarantees.
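
To illustrate the general idea of sampling in distance-based outlier detection, here is a minimal Python sketch. It is not the algorithm from the paper and carries none of its accuracy guarantees. It assumes Euclidean points, at least two of them, and hypothetical parameters `k` and `sample_size`; each point is scored by the distance to its k-th nearest neighbour within a fixed-size random sample, so the number of distance computations per point is bounded.

```python
import math
import random

def sampled_knn_scores(points, k=3, sample_size=20, seed=0):
    """Score each point by the distance to its k-th nearest neighbour
    within a random sample of the other points (larger = more outlying).
    Illustrative only; names and defaults are assumptions."""
    rng = random.Random(seed)
    scores = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        sample = rng.sample(others, min(sample_size, len(others)))
        # Only len(sample) distance computations per point, never all n - 1.
        dists = sorted(math.dist(p, points[j]) for j in sample)
        scores.append(dists[min(k, len(dists)) - 1])
    return scores

# Toy data: a tight 6 x 5 cluster plus one far-away point at index 30.
points = [(x / 10, y / 10) for x in range(6) for y in range(5)] + [(100.0, 100.0)]
scores = sampled_knn_scores(points)
most_outlying = scores.index(max(scores))
```

In this toy data the isolated point receives the largest score, because at most one member of any sample can be far from a cluster point.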

knowledge discovery and data mining | 2009

A LRT framework for fast spatial anomaly detection

Mingxi Wu; Xiuyao Song; Chris Jermaine; Sanjay Ranka; John G. Gums

Given a spatial data set placed on an n x n grid, our goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. We develop algorithms that, given any user-supplied arbitrary likelihood function, conduct a likelihood ratio hypothesis test (LRT) over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed this process, we develop methods to prune rectangles without computing their associated LRT statistics.
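
As a rough illustration of scoring every rectangle with a likelihood ratio statistic, the Python sketch below enumerates all rectangles by brute force. It makes several assumptions that do not come from the abstract: a Poisson model with per-cell observed counts and positive expected baselines, and a log-likelihood ratio that is only positive when the rate inside the rectangle exceeds the rate outside. The pruning methods described above are not shown.

```python
import math

def prefix_sums(grid):
    """2-D prefix sums so that any rectangle sum costs O(1)."""
    n = len(grid)
    ps = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            ps[i + 1][j + 1] = grid[i][j] + ps[i][j + 1] + ps[i + 1][j] - ps[i][j]
    return ps

def rect_sum(ps, r1, c1, r2, c2):
    """Sum over rows r1..r2 and columns c1..c2, inclusive."""
    return ps[r2 + 1][c2 + 1] - ps[r1][c2 + 1] - ps[r2 + 1][c1] + ps[r1][c1]

def _term(c, b):
    # c * log(c / b), with the convention 0 * log(0) = 0.
    return c * math.log(c / b) if c > 0 else 0.0

def poisson_llr(c_in, b_in, c_tot, b_tot):
    """Log-likelihood ratio for 'higher rate inside' under a Poisson model."""
    c_out, b_out = c_tot - c_in, b_tot - b_in
    if b_in <= 0 or b_out <= 0 or c_in / b_in <= c_out / b_out:
        return 0.0
    return _term(c_in, b_in) + _term(c_out, b_out) - _term(c_tot, b_tot)

def top_rectangles(counts, baselines, top=3):
    """Score all O(n^4) rectangles and return the highest-scoring ones."""
    n = len(counts)
    pc, pb = prefix_sums(counts), prefix_sums(baselines)
    c_tot = rect_sum(pc, 0, 0, n - 1, n - 1)
    b_tot = rect_sum(pb, 0, 0, n - 1, n - 1)
    scored = []
    for r1 in range(n):
        for r2 in range(r1, n):
            for c1 in range(n):
                for c2 in range(c1, n):
                    llr = poisson_llr(rect_sum(pc, r1, c1, r2, c2),
                                      rect_sum(pb, r1, c1, r2, c2),
                                      c_tot, b_tot)
                    scored.append((llr, (r1, c1, r2, c2)))
    scored.sort(reverse=True)
    return scored[:top]

# Toy 5 x 5 grid: uniform baseline, with an elevated 2 x 2 block of counts.
counts = [[1] * 5 for _ in range(5)]
for r in (1, 2):
    for c in (1, 2):
        counts[r][c] = 10
baselines = [[1] * 5 for _ in range(5)]
best_llr, best_rect = top_rectangles(counts, baselines)[0]
```

On this toy grid the top-ranked rectangle is exactly the elevated block, reported as `(row1, col1, row2, col2)`.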

ACM Transactions on Database Systems | 2011

The Monte Carlo database system: Stochastic analysis close to the data

Ravi Jampani; Fei Xu; Mingxi Wu; Luis Leopoldo Perez; Christopher Jermaine; Peter J. Haas

The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses. In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
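
The idea of querying a "random" relation with SQL and analyzing the distribution of the result can be sketched naively in Python with the standard `sqlite3` module. This is not MCDB and does not use tuple-bundle processing: it simply regenerates the whole table for every Monte Carlo trial. The table name `demand`, the Gaussian demand model, and the function name are assumptions made for illustration.

```python
import random
import sqlite3
import statistics

def monte_carlo_query(customers, sql, trials=200, seed=0):
    """Run `sql` once per simulated realization of the random relation.

    customers: list of (name, mean, std_dev) parameters for a hypothetical
    Gaussian model of each customer's demand.
    Returns the list of query results, one per trial.
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demand (cust TEXT, qty REAL)")
    results = []
    for _ in range(trials):
        # Instantiate one possible world of the random relation.
        conn.execute("DELETE FROM demand")
        conn.executemany(
            "INSERT INTO demand VALUES (?, ?)",
            [(name, rng.gauss(mean, sd)) for name, mean, sd in customers],
        )
        results.append(conn.execute(sql).fetchone()[0])
    conn.close()
    return results

customers = [("a", 10.0, 1.0), ("b", 20.0, 2.0)]
totals = monte_carlo_query(customers, "SELECT SUM(qty) FROM demand", trials=500)
mean_total = statistics.mean(totals)
spread = statistics.stdev(totals)
```

The list `totals` is an empirical sample from the distribution of the query result, from which a mean, spread, or quantiles can be estimated.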

very large data bases | 2009

Guessing the extreme values in a data set: a Bayesian method and its applications

Mingxi Wu; Chris Jermaine

For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, Top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.

ACM Transactions on Knowledge Discovery From Data | 2010

A Model-Agnostic Framework for Fast Spatial Anomaly Detection

Mingxi Wu; Chris Jermaine; Sanjay Ranka; Xiuyao Song; John G. Gums

Given a spatial dataset placed on an n × n grid, our goal is to find the rectangular regions within which subsets of the dataset exhibit anomalous behavior. We develop algorithms that, given any user-supplied arbitrary likelihood function, conduct a likelihood ratio hypothesis test (LRT) over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed this process, we develop methods to prune rectangles without computing their associated LRT statistics.

ACM Transactions on Database Systems | 2015

Workload-Driven Antijoin Cardinality Estimation

Florin Rusu; Zixuan Zhuang; Mingxi Wu; Chris Jermaine

Antijoin cardinality estimation is among a handful of problems that has eluded accurate efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation—a seemingly related problem—the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that (unlike existent estimators) provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. When put together, these ideas form the basis of an efficient antijoin cardinality estimator satisfying the strict requirements of a query optimizer, as shown by the extensive experimental results over synthetically-generated as well as massive TPC-H data.

international conference on data engineering | 2010

Surrogate ranking for very expensive similarity queries

Fei Xu; Ravi Jampani; Mingxi Wu; Chris Jermaine; Tamer Kahveci

We consider the problem of similarity search in applications where the cost of computing the similarity between two records is very expensive, and the similarity measure is not a metric. In such applications, comparing even a tiny fraction of the database records to a single query record can be orders of magnitude slower than reading the entire database from disk, and indexing is often not possible. We develop a general-purpose, statistical framework for answering top-k queries in such databases, when the database administrator is able to supply an inexpensive surrogate ranking function that substitutes for the actual similarity measure. We develop a robust method that learns the relationship between the surrogate function and the similarity measure. Given a query, we use Bayesian statistics to update the model by taking into account the observed partial results. Using the updated model, we construct bounds on the accuracy of the result set obtained via the surrogate ranking. Our experiments show that our models can produce useful bounds for several real-life applications.
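
The basic filter-and-refine pattern behind surrogate ranking can be sketched in Python as follows. This sketch covers only the candidate-selection step; it does not learn the relationship between the two functions or produce the Bayesian accuracy bounds described above. The function names, the `budget` parameter, and the toy similarity functions are all assumptions.

```python
import heapq

def surrogate_top_k(query, records, surrogate, similarity, k, budget):
    """Return k records, evaluating the expensive `similarity` on only
    `budget` candidates chosen by the cheap `surrogate` score."""
    # Cheap pass: rank every record with the surrogate function.
    candidates = heapq.nlargest(budget, records, key=lambda r: surrogate(query, r))
    # Expensive pass: the true similarity is computed for `budget` records only.
    return heapq.nlargest(k, candidates, key=lambda r: similarity(query, r))

expensive_calls = []

def similarity(q, r):
    """Stand-in for a costly measure; records each call for inspection."""
    expensive_calls.append(r)
    return -abs(q - r)

def surrogate(q, r):
    """Coarse, cheap approximation: compare values at a resolution of 10."""
    return -abs(q // 10 - r // 10)

records = list(range(100))
result = surrogate_top_k(50, records, surrogate, similarity, k=3, budget=30)
```

Here the expensive function is called on 30 of the 100 records. Whether the returned set matches the true top-k depends on how well the surrogate tracks the real measure, which is the gap the accuracy bounds in the paper are meant to quantify.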

international conference on management of data | 2008