Publication


Featured research published by Chris Jermaine.


IEEE Transactions on Knowledge and Data Engineering | 2007

Conditional Anomaly Detection

Xiuyao Song; Mingxi Wu; Chris Jermaine; Sanjay Ranka

When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software can soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.
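
To make the conditional idea concrete, here is a minimal Python sketch that scores each point by the conditional log-density of its indicator ("result") attributes given its environmental attributes. It assumes a single joint Gaussian fit to the data rather than the EM-trained mixture models the paper proposes, and the function name conditional_anomaly_scores is illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def conditional_anomaly_scores(X_env, X_ind):
    # X_env: (n, d) environmental attributes; X_ind: (n, k) result attributes.
    # Fit one joint Gaussian, then score each point by the conditional
    # log-density of its result values given its environmental values;
    # low scores are candidate conditional anomalies. (Simplified stand-in
    # for the paper's mixture models learned with EM.)
    X = np.hstack([X_env, X_ind])
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d = X_env.shape[1]
    A = S[d:, :d] @ np.linalg.inv(S[:d, :d])   # regression of result on environment
    cond_cov = S[d:, d:] - A @ S[:d, d:]       # conditional covariance
    scores = []
    for e, i in zip(X_env, X_ind):
        cond_mean = mu[d:] + A @ (e - mu[:d])
        scores.append(multivariate_normal.logpdf(i, cond_mean, cond_cov))
    return np.array(scores)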


Knowledge Discovery and Data Mining | 2007

Statistical change detection for multi-dimensional data

Xiuyao Song; Mingxi Wu; Chris Jermaine; Sanjay Ranka

This paper deals with detecting change of distribution in multi-dimensional data sets. For a given baseline data set and a set of newly observed data points, we define a statistical test called the density test for deciding if the observed data points are sampled from the underlying distribution that produced the baseline data set. We define a test statistic that is strictly distribution-free under the null hypothesis. Our experimental results show that the density test has substantially more power than the two existing methods for multi-dimensional change detection.
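
The sketch below illustrates the general shape of a density-based change test in Python: fit a kernel density estimate to the baseline, score the newly observed points by their average log-density, and calibrate the statistic against same-sized subsets of the baseline. This is only a stand-in for the idea; the paper's density test and its distribution-free statistic are constructed differently.

import numpy as np
from scipy.stats import gaussian_kde

def density_change_test(baseline, observed, n_null=200, seed=0):
    # baseline, observed: arrays of shape (n, d) and (m, d).
    rng = np.random.default_rng(seed)
    kde = gaussian_kde(baseline.T)                    # density estimate of the baseline
    stat = np.mean(np.log(kde(observed.T) + 1e-300))  # avg log-density of the new points
    # Null distribution: the same statistic for same-sized subsets of the baseline.
    m = len(observed)
    null = [np.mean(np.log(kde(baseline[rng.choice(len(baseline), m, replace=False)].T) + 1e-300))
            for _ in range(n_null)]
    p_value = np.mean(np.array(null) <= stat)         # small p-value suggests a change
    return stat, p_value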


Very Large Data Bases | 2007

The partitioned exponential file for database storage management

Chris Jermaine; Edward Omiecinski; Wai Gen Yee

The rate of increase in hard disk storage capacity continues to outpace the rate of decrease in hard disk seek time. This trend implies that the value of a seek is increasing exponentially relative to the value of storage.

With this trend in mind, we introduce the partitioned exponential file (PE file), which is a generic storage manager that can be customized for many different types of data (e.g., numerical, spatial, or temporal). The PE file is intended for use in environments with intense update loads and concurrent, analytic queries. Such an environment may be found, for example, in long-running scientific applications which can produce petabytes of data. For example, the proposed Large Synoptic Survey Telescope [36] will produce 50–100 petabytes of observational, scientific data over its multi-year lifetime. This database will never be taken off-line, so bursty update loads of tens of terabytes per day must be handled concurrently with data analysis. In the PE file, data are organized as a series of on-disk sorts with a careful, global organization. Because the PE file relies heavily on sequential I/O, only a fraction of a disk seek is required for a typical record insertion or retrieval.

In addition to describing the PE file, we also detail a set of benchmarking experiments for T1SM, which is a PE file customized for use with multi-attribute data records ordered on a single numerical attribute. In our benchmarking, we implement and test many competing data organizations that can be used to index and store such data, such as the B+-Tree, the LSM-Tree, the Buffer Tree, the Stepped Merge Method, and the Y-Tree. As expected, no organization is the best over all benchmarks, but our experiments show that T1SM is the best choice in many situations, suggesting that it is the best overall. Specifically, T1SM performs exceptionally well in the case of a heavy query workload that must be handled concurrently with an intense insertion stream. Our experiments show that T1SM (and its close cousin, the T2SM storage manager for spatial data) can handle very heavy mixed workloads of this type, and still maintain acceptably small query latencies.
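
As a rough sketch of the underlying idea (writing data as a series of sorted runs so that inserts and merges stay sequential), here is a toy in-memory Python class. It is not the PE file's partitioned, on-disk organization; the class name, buffer size, and in-memory "runs" are all illustrative.

import bisect

class SortedRunStore:
    """Toy illustration of storing data as a series of sorted runs (the general
    idea the PE file builds on), not the PE file's partitioned on-disk layout.
    Inserts go to an in-memory buffer; when full, the buffer is flushed as a
    new sorted run, so writes are sequential. Lookups probe every run."""

    def __init__(self, buffer_capacity=1024):
        self.buffer_capacity = buffer_capacity
        self.buffer = {}          # recent writes, held in memory
        self.runs = []            # list of sorted (key, value) lists, oldest first

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            self.runs.append(sorted(self.buffer.items()))   # sequential "flush"
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in reversed(self.runs):                     # newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None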


International Conference on Data Engineering | 2006

Closest-Point-of-Approach Join for Moving Object Histories

Subramanian Arumugam; Chris Jermaine

In applications that produce a large amount of data describing the paths of moving objects, there is a need to ask questions about the interaction of objects over a long recorded history. In this paper, we consider the problem of computing joins over massive moving object histories. The particular join that we study is the Closest-Point-Of-Approach join, which asks: Given a massive moving object history, which objects approached within a distance ‘d’ of one another? We carefully consider several relatively obvious strategies for computing the answer to such a join, and then propose a novel, adaptive join algorithm which naturally alters the way in which it computes the join in response to the characteristics of the underlying data.
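
A brute-force version of the join is easy to state, and is the baseline an adaptive algorithm improves on. The Python sketch below assumes each history is a synchronized list of positions sampled at the same time steps, with linear motion between samples; the names cpa_join and min_pairwise_distance and the input format are illustrative.

import numpy as np
from itertools import combinations

def min_pairwise_distance(p0, p1, q0, q1):
    # Closest approach of two objects moving linearly over one time step:
    # minimize ||(p0-q0) + t*((p1-p0)-(q1-q0))|| for t in [0, 1].
    r0 = np.asarray(p0, float) - np.asarray(q0, float)
    rv = (np.asarray(p1, float) - np.asarray(p0, float)) - (np.asarray(q1, float) - np.asarray(q0, float))
    denom = rv @ rv
    t = 0.0 if denom == 0 else np.clip(-(r0 @ rv) / denom, 0.0, 1.0)
    return np.linalg.norm(r0 + t * rv)

def cpa_join(histories, d):
    # histories: {object_id: [position at step 0, 1, 2, ...]}, equal lengths.
    # Returns the pairs of objects that ever come within distance d, by naive
    # nested-loop scanning of every pair and every time step.
    result = set()
    for a, b in combinations(histories, 2):
        pa, pb = histories[a], histories[b]
        for k in range(len(pa) - 1):
            if min_pairwise_distance(pa[k], pa[k+1], pb[k], pb[k+1]) <= d:
                result.add((a, b))
                break
    return result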


International Conference on Management of Data | 2004

Online maintenance of very large random samples

Chris Jermaine; Abhijit Pol; Subramanian Arumugam

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a sample is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.
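
For contrast with the disk-based setting the paper targets, here is the classic in-memory baseline: single-pass reservoir sampling, which maintains a true uniform random sample without replacement over a stream. This is only the textbook algorithm, not the paper's techniques for samples that are gigabytes or terabytes in size.

import random

def reservoir_sample(stream, k, rng=random):
    # After processing n >= k items, the reservoir is a uniform random sample
    # of size k, without replacement, of everything seen so far.
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, n)      # inclusive on both ends
            if j < k:
                reservoir[j] = item    # replace a random slot with probability k/(n+1)
    return reservoir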


International Conference on Data Engineering | 2006

New Sampling-Based Estimators for OLAP Queries

Ruoming Jin; Leonid Glimcher; Chris Jermaine; Gagan Agrawal

One important way in which sampling for approximate query processing in a database environment differs from traditional applications of sampling is that in a database, it is feasible to collect accurate summary statistics from the data in addition to the sample. This paper describes a set of sampling-based estimators for approximate query processing that make use of simple summary statistics to greatly increase the accuracy of sampling-based estimators. Our estimators are able to give tight probabilistic guarantees on estimation accuracy. They are suitable for low or high dimensional data, and work with categorical or numerical attributes. Furthermore, the information used by our estimators can easily be gathered in a single pass, making them suitable for use in a streaming environment.
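
The following Python sketch shows the plain scale-up estimator such work starts from: estimate a SUM aggregate from a uniform sample and attach a CLT-style confidence interval. The paper's estimators sharpen exactly this kind of estimate by folding in summary statistics; the function below is only the unassisted baseline, and its name and parameters are illustrative.

import math

def estimate_sum(population_size, sample_values, confidence_z=1.96):
    # Scale the sample mean up to the population and attach a normal-theory
    # confidence interval (z = 1.96 for roughly 95% coverage).
    n = len(sample_values)
    mean = sum(sample_values) / n
    var = sum((v - mean) ** 2 for v in sample_values) / (n - 1)
    est = population_size * mean
    half_width = confidence_z * population_size * math.sqrt(var / n)
    return est, (est - half_width, est + half_width)

# Example: estimate a SUM over a 1,000,000-row table from a sampled column.
# est, (lo, hi) = estimate_sum(1_000_000, sampled_column_values)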


International Conference on Management of Data | 2014

A comparison of platforms for implementing and running very large scale machine learning algorithms

Zhuhua Cai; Zekai J. Gao; Shangyu Luo; Luis Leopoldo Perez; Zografoula Vagena; Chris Jermaine

We describe an extensive benchmark of platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set, but cannot find an existing implementation and thus must roll her own ML code. We have carefully chosen a set of five ML implementation tasks that involve learning relatively complex, hierarchical models. We completed those tasks on four different computational platforms, and using 70,000 hours of Amazon EC2 compute time, we carefully compared running times, tuning requirements, and ease-of-programming of each.


Knowledge Discovery and Data Mining | 2009

A LRT framework for fast spatial anomaly detection

Mingxi Wu; Xiuyao Song; Chris Jermaine; Sanjay Ranka; John G. Gums

Given a spatial data set placed on an n x n grid, our goal is to find the rectangular regions within which subsets of the data set exhibit anomalous behavior. We develop algorithms that, given any user-supplied arbitrary likelihood function, conduct a likelihood ratio hypothesis test (LRT) over each rectangular region in the grid, rank all of the rectangles based on the computed LRT statistics, and return the top few most interesting rectangles. To speed this process, we develop methods to prune rectangles without computing their associated LRT statistics.
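
To make the setup concrete, the Python sketch below scores every axis-aligned rectangle of a count grid with a likelihood ratio statistic, assuming independent Poisson cell counts (one common choice of likelihood; the framework accepts any user-supplied likelihood). It is the brute-force scan that pruning methods are designed to avoid, and the function names are illustrative.

import numpy as np

def poisson_llr(c, b, C, N):
    # Log-likelihood ratio for "the rate inside the rectangle differs from the
    # rate outside", for a rectangle of b cells and count c out of N cells and
    # total count C, under an independent-Poisson-counts model.
    def xlogx(x, r):
        return 0.0 if x == 0 else x * np.log(x / r)
    return xlogx(c, b) + xlogx(C - c, N - b) - xlogx(C, N)

def top_rectangles(grid, k=5):
    # Exhaustively score every axis-aligned rectangle of an n x m count grid
    # and return the k highest-scoring ones (brute force; no pruning).
    grid = np.asarray(grid, dtype=float)
    n, m = grid.shape
    C, N = grid.sum(), n * m
    P = np.zeros((n + 1, m + 1))            # 2-D prefix sums: O(1) rectangle counts
    P[1:, 1:] = grid.cumsum(0).cumsum(1)
    scored = []
    for r1 in range(n):
        for r2 in range(r1, n):
            for c1 in range(m):
                for c2 in range(c1, m):
                    c = P[r2+1, c2+1] - P[r1, c2+1] - P[r2+1, c1] + P[r1, c1]
                    b = (r2 - r1 + 1) * (c2 - c1 + 1)
                    if b < N:               # skip the rectangle covering the whole grid
                        scored.append((poisson_llr(c, b, C, N), (r1, c1, r2, c2)))
    return sorted(scored, reverse=True)[:k]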


International Conference on Management of Data | 2005

Relational confidence bounds are easy with the bootstrap

Abhijit Pol; Chris Jermaine

Statistical estimation and approximate query processing have become increasingly prevalent applications for database systems. However, approximation is usually of little use without some sort of guarantee on estimation accuracy, or confidence bound. Analytically deriving probabilistic guarantees for database queries over sampled data is a daunting task, not suitable for the faint of heart, and certainly beyond the expertise of the typical database system end-user. This paper considers the problem of incorporating into a database system a powerful plug-in method for computing confidence bounds on the answer to relational database queries over sampled or incomplete data. This statistical tool, called the bootstrap, is simple enough that it can be used by a database programmer with a rudimentary mathematical background, but general enough that it can be applied to almost any statistical inference problem. Given the power and ease-of-use of the bootstrap, we argue that the algorithms presented for supporting the bootstrap should be incorporated into any database system which is intended to support analytic processing.
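
For readers unfamiliar with the tool itself, here is the basic percentile bootstrap in a few lines of Python: resample the observed sample with replacement, recompute the statistic each time, and read the bounds off the empirical quantiles. This is only the generic plug-in idea, not the database-side algorithms the paper proposes.

import random

def bootstrap_bounds(sample, statistic, confidence=0.95, n_resamples=1000, rng=random):
    # Percentile bootstrap: the spread of the statistic across resamples
    # approximates its sampling variability.
    stats = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(len(sample))]
        stats.append(statistic(resample))
    stats.sort()
    lo = stats[int((1 - confidence) / 2 * n_resamples)]
    hi = stats[int((1 + confidence) / 2 * n_resamples) - 1]
    return lo, hi

# Example: a 95% bound on the mean of a sampled column.
# lo, hi = bootstrap_bounds([3, 7, 2, 9, 4, 6], lambda xs: sum(xs) / len(xs))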


Very Large Data Bases | 2008

Reference-based indexing for metric spaces with costly distance measures

Jayendra Venkateswaran; Tamer Kahveci; Chris Jermaine; Deepak Lachwani

We consider the problem of similarity search in databases with costly metric distance measures. Given limited main memory, our goal is to develop a reference-based index that reduces the number of comparisons in order to answer a query. The idea in reference-based indexing is to select a small set of reference objects that serve as a surrogate for the other objects in the database. We consider novel strategies for selection of references and assigning references to database objects. For dynamic databases with frequent updates, we propose two incremental versions of the selection algorithm. Our experimental results show that our selection and assignment methods far outperform competing methods.
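
The core mechanism is easy to sketch: precompute each object's distance to a small set of references, then use the triangle inequality to lower-bound the query distance and skip most of the costly distance evaluations. The Python below picks references uniformly at random, whereas better selection and assignment strategies are precisely what the paper studies; all names are illustrative.

import random

def build_index(objects, distance, num_refs=3, rng=random):
    # Choose reference objects (here: uniformly at random) and precompute each
    # object's distance to every reference.
    refs = rng.sample(objects, num_refs)
    ref_dists = [[distance(x, r) for r in refs] for x in objects]
    return refs, ref_dists

def range_search(q, radius, objects, refs, ref_dists, distance):
    # Triangle inequality: |d(q, r) - d(x, r)| is a lower bound on d(q, x), so
    # an object can be pruned without evaluating the costly distance at all.
    q_dists = [distance(q, r) for r in refs]
    results = []
    for x, xd in zip(objects, ref_dists):
        lower = max(abs(qd - d) for qd, d in zip(q_dists, xd))
        if lower <= radius and distance(q, x) <= radius:
            results.append(x)
    return results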

Collaboration


Dive into Chris Jermaine's collaborations.

Top Co-Authors

Mingxi Wu
University of Florida

Edward Omiecinski
Georgia Institute of Technology

Fei Xu
University of Florida