Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Christopher Jermaine is active.

Publication


Featured research published by Christopher Jermaine.


International Conference on Management of Data | 2008

MCDB: a Monte Carlo approach to managing uncertain data

Ravi Jampani; Fei Xu; Mingxi Wu; Luis Leopoldo Perez; Christopher Jermaine; Peter J. Haas

To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty model to be dynamically parameterized according to the current state of the database. We introduce MCDB, a system for managing uncertain data that is based on a Monte Carlo approach. MCDB represents uncertainty via VG functions, which are used to pseudorandomly generate realized values for uncertain attributes. VG functions can be parameterized on the results of SQL queries over parameter tables that are stored in the database, facilitating what-if analyses. By storing parameters, and not probabilities, and by estimating, rather than exactly computing, the probability distribution over possible query answers, MCDB avoids many of the limitations of prior systems. For example, MCDB can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles. To achieve good performance, MCDB uses novel query processing techniques, executing a query plan exactly once, but over tuple bundles instead of ordinary tuples. Experiments indicate that our enhanced functionality can be obtained with acceptable overheads relative to traditional systems.
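The Monte Carlo workflow described above can be illustrated with a small sketch (in plain Python rather than MCDB's SQL interface; the `vg_income` function and its parameter table are hypothetical, invented for illustration): parameters, not probabilities, are what gets stored; a VG function pseudorandomly turns them into realized values; and repeated instantiation yields an estimated distribution over query answers, from which functionals such as means, variances, and quantiles can be read off.

```python
import random
import statistics

# Hypothetical VG ("variable generation") function: given parameters drawn
# from a parameter table, pseudorandomly generate a realized value for an
# uncertain attribute (here, a customer's unknown income).
def vg_income(mean, stddev, rng):
    return rng.gauss(mean, stddev)

# Parameter "table": parameters, not probabilities, are stored.
params = [(52000, 8000), (61000, 9500), (47000, 7200)]

def one_monte_carlo_instance(rng):
    # Instantiate one possible database and run the (trivial) aggregate
    # query SELECT SUM(income) over it.
    return sum(vg_income(m, s, rng) for m, s in params)

rng = random.Random(42)
answers = [one_monte_carlo_instance(rng) for _ in range(1000)]

# Arbitrary functionals of the estimated query-answer distribution:
est_mean = statistics.mean(answers)
est_sd = statistics.stdev(answers)
q95 = sorted(answers)[int(0.95 * len(answers))]
```

Note that MCDB's tuple-bundle execution amortizes this loop over a single query-plan pass; the sketch instead re-runs the "query" per instance for clarity.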


IEEE Transactions on Computers | 2002

Efficient data allocation over multiple channels at broadcast servers

Wai Gen Yee; Shamkant B. Navathe; Edward Omiecinski; Christopher Jermaine

Broadcast is a scalable way of disseminating data because broadcasting an item satisfies all outstanding client requests for it. However, because the transmission medium is shared, individual requests may have high response times. In this paper, we show how to minimize the average response time given multiple broadcast channels by optimally partitioning data among them. We also offer an approximation algorithm that is less complex than the optimal algorithm and show that its performance is near-optimal for a wide range of parameters. Finally, we briefly discuss the extensibility of our work with two simple, yet seldom researched extensions, namely, handling varying-sized items and generating single-channel schedules.
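The underlying optimization can be sketched as follows (a minimal illustration under simplifying assumptions, not the paper's optimal or approximation algorithm): with unit-sized items broadcast cyclically, a client's expected wait for an item on a channel holding n items is n/2 broadcast slots, so a group of items costs (group size / 2) times the sum of its access probabilities, and restricting attention to contiguous groups of the probability-sorted item list admits a simple dynamic program.

```python
from functools import lru_cache

def partition_cost(probs, k):
    """Minimum average response time for items with access probabilities
    `probs` (sorted descending, unit-sized) partitioned over k channels."""
    n = len(probs)
    prefix = [0.0]
    for p in probs:
        prefix.append(prefix[-1] + p)

    @lru_cache(maxsize=None)
    def best(start, channels):
        if channels == 1:
            size = n - start
            return (size / 2) * (prefix[n] - prefix[start])
        # Try every split point, leaving enough items for remaining channels.
        return min(
            ((end - start) / 2) * (prefix[end] - prefix[start])
            + best(end, channels - 1)
            for end in range(start + 1, n - channels + 2)
        )

    return best(0, k)

# Hot items isolated on their own channel enjoy a short broadcast cycle:
cost = partition_cost((0.5, 0.3, 0.1, 0.05, 0.05), 2)
```

Here placing the two hottest items alone on one channel beats any other split, which is the intuition behind skew-aware channel allocation.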


International Conference on Management of Data | 2010

The DataPath system: a data-centric analytic processing engine for large data warehouses

Subi Arumugam; Alin Dobra; Christopher Jermaine; Niketan Pansare; Luis Leopoldo Perez

Since the 1970s, database systems have been compute-centric. When a computation needs the data, it requests the data, and the data are pulled through the system. We believe that this is problematic for two reasons. First, requests for data naturally incur high latency as the data are pulled through the memory hierarchy, and second, it makes it difficult or impossible for multiple queries or operations that are interested in the same data to amortize the bandwidth and latency costs associated with their data access.

In this paper, we describe a purely push-based, research prototype database system called DataPath. DataPath is data-centric. In DataPath, queries do not request data. Instead, data are automatically pushed onto processors, where they are then processed by any interested computation. We show experimentally on a multi-terabyte benchmark that this basic design principle makes for a very lean and fast database system.
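The push-based principle can be caricatured in a few lines (a toy sketch, not DataPath's engine; the two "queries" are hypothetical): a single scan pushes each block of tuples to every computation currently interested in that table, so concurrent queries amortize the cost of data access instead of each pulling the data separately.

```python
# One shared scan pushes each block to all interested computations.
def shared_scan(blocks, consumers):
    for block in blocks:
        for consume in consumers:
            consume(block)

# Two hypothetical "queries" sharing the same scan of one table:
total = {"sum": 0, "count": 0}

def q_sum(block):       # SELECT SUM(x) ...
    total["sum"] += sum(block)

def q_count(block):     # SELECT COUNT(*) ...
    total["count"] += len(block)

blocks = [[1, 2, 3], [4, 5], [6]]
shared_scan(blocks, [q_sum, q_count])
```

The data cross the memory hierarchy once, and both computations consume them in place, which is the amortization the abstract argues pull-based execution forgoes.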


Knowledge Discovery and Data Mining | 2006

Outlier detection by sampling with accuracy guarantees

Mingxi Wu; Christopher Jermaine

An effective approach to detecting anomalous points in a data set is distance-based outlier detection. This paper describes a simple sampling algorithm to efficiently detect distance-based outliers in domains where each and every distance computation is very expensive. Unlike any existing algorithms, the sampling algorithm requires a fixed number of distance computations and can return good results with accuracy guarantees. The most computationally expensive aspect of estimating the accuracy of the result is sorting all of the distances computed by the sampling algorithm. The experimental study on two expensive domains as well as ten additional real-life datasets demonstrates both the efficiency and effectiveness of the sampling algorithm in comparison with the state-of-the-art algorithm, and the reliability of the accuracy guarantees.
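The core idea can be sketched in a few lines (a minimal illustration of sample-based distance scoring, not the paper's exact algorithm or its accuracy-guarantee machinery): score each point by its distance to its nearest neighbor within a small random sample, so only about len(data) × sample_size distances are ever computed, a fixed budget regardless of how expensive each individual distance is.

```python
import math
import random

def sampled_outlier_scores(data, sample_size, dist, rng):
    """Score each point by its nearest-neighbor distance within a random
    sample; larger scores suggest distance-based outliers."""
    sample = rng.sample(data, sample_size)
    scores = []
    for x in data:
        score = min(dist(x, s) for s in sample if s is not x)
        scores.append(score)
    return scores

def euclidean(a, b):
    return math.dist(a, b)

rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)
data = cluster + [outlier]
scores = sampled_outlier_scores(data, sample_size=20, dist=euclidean, rng=rng)
# The far-away point should receive the largest score.
```

Sorting the computed distances (cheap relative to the distance computations themselves) is what the paper uses to attach accuracy guarantees to such scores.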


ACM Transactions on Database Systems | 2008

Scalable approximate query processing with the DBO engine

Christopher Jermaine; Subramanian Arumugam; Abhijit Pol; Alin Dobra

This article describes query processing in the DBO database system. Like other database systems designed for ad hoc analytic processing, DBO is able to compute the exact answers to queries over a large relational database in a scalable fashion. Unlike any other system designed for analytic processing, DBO can constantly maintain a guess as to the final answer to an aggregate query throughout execution, along with statistically meaningful bounds for the guess's accuracy. As DBO gathers more and more information, the guess gets more and more accurate, until it is 100% accurate when the query is completed. This allows users to stop the execution as soon as they are happy with the query accuracy, and thus encourages exploratory data analysis.
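The online-estimation idea can be sketched for the single-table case (a minimal illustration of online aggregation generally, not DBO's multi-table join machinery): scan a randomly permuted table, treat the tuples seen so far as a uniform random sample to estimate SUM(value), and attach a CLT-based confidence bound that tightens as the scan proceeds and collapses to zero at completion.

```python
import math
import random
import statistics

def running_sum_estimates(values, z=1.96):
    """Yield (estimate, half-width of ~95% confidence bound) after each
    tuple of a randomly permuted table, estimating SUM(values)."""
    n = len(values)
    seen = []
    out = []
    for v in values:
        seen.append(v)
        k = len(seen)
        est = n * statistics.fmean(seen)
        if 2 <= k < n:
            # Finite-population correction: the bound must reach 0 at k = n.
            se = n * math.sqrt(statistics.variance(seen) / k) \
                   * math.sqrt((n - k) / n)
        else:
            se = 0.0 if k == n else float("inf")
        out.append((est, z * se))
    return out

rng = random.Random(7)
table = [rng.uniform(0, 100) for _ in range(500)]
rng.shuffle(table)
estimates = running_sum_estimates(table)
final_est, final_bound = estimates[-1]
```

A user watching `estimates` could stop as soon as the bound is tight enough; running to the end yields the exact sum with a zero-width bound, mirroring DBO's start-to-finish guarantee.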


International Conference on Management of Data | 2013

Simulation of database-valued Markov chains using SimSQL

Zhuhua Cai; Zografoula Vagena; Luis Leopoldo Perez; Subramanian Arumugam; Peter J. Haas; Christopher Jermaine

This paper describes the SimSQL system, which allows for SQL-based specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified VG functions to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.
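The database-valued Markov chain idea can be sketched as follows (a toy illustration, not SimSQL's SQL surface; the `vg_next_value` transition is hypothetical): the "database" at each time step is a table of per-entity values, and a VG function generates step t+1's table parametrized by the contents of step t's table, chaining one stochastic database into the next.

```python
import random

# Hypothetical VG function: the previous step's stored value parametrizes
# the generation of the next one (a toy mean-reverting random walk).
def vg_next_value(prev_value, rng):
    return 0.9 * prev_value + rng.gauss(0, 1)

def simulate_chain(initial_table, steps, rng):
    """Simulate a Markov chain whose state is an entire table."""
    chain = [dict(initial_table)]
    for _ in range(steps):
        prev = chain[-1]
        chain.append({k: vg_next_value(v, rng) for k, v in prev.items()})
    return chain

rng = random.Random(1)
chain = simulate_chain({"alice": 100.0, "bob": -50.0}, steps=30, rng=rng)
```

In SimSQL the per-step tables are stochastic tables queried with standard SQL, and Monte Carlo replication of whole chains like this one supports the Bayesian machine-learning workloads the paper targets.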


ACM Transactions on Database Systems | 2006

The Sort-Merge-Shrink join

Christopher Jermaine; Alin Dobra; Subramanian Arumugam; Shantanu Joshi; Abhijit Pol

One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm called the Sort-Merge-Shrink (SMS) Join for computing the answer to such a query over large, disk-based input tables. The key innovation of the SMS join is that if the input data are clustered in a statistically random fashion on disk, then at all times, the join provides an online, statistical estimator for the eventual answer to the query as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy, or run the algorithm to completion with a total time requirement that is not much longer than that of other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into main memory.


Very Large Data Bases | 2009

Turbo-charging estimate convergence in DBO

Alin Dobra; Christopher Jermaine; Florin Rusu; Fei Xu

DBO is a database system that utilizes randomized algorithms to give statistically meaningful estimates for the final answer to a multi-table, disk-based query from start to finish during query execution. However, DBO's time-'til-utility (or TTU; that is, the time until DBO can give a useful estimate) can be overly large, particularly in the case that many database tables are joined in a query, or in the case that a join query includes a very selective predicate on one or more of the tables, or when the data are skewed. In this paper, we describe Turbo DBO, which is a prototype database system that can answer multi-table join queries in a scalable fashion, just like DBO. However, Turbo DBO often has a much lower TTU than DBO. The key innovation of Turbo DBO is that it makes use of novel algorithms that look for and remember partial match tuples in a randomized fashion. These are tuples that satisfy some of the Boolean predicates associated with the query, and can possibly be grown into tuples that actually contribute to the final query result at a later time.


International Conference on Management of Data | 2005

A disk-based join with probabilistic guarantees

Christopher Jermaine; Alin Dobra; Subramanian Arumugam; Shantanu Joshi; Abhijit Pol

One of the most common operations in analytic query processing is the application of an aggregate function to the result of a relational join. We describe an algorithm for computing the answer to such a query over large, disk-based input tables. The key innovation of our algorithm is that at all times, it provides an online, statistical estimator for the eventual answer to the query, as well as probabilistic confidence bounds. Thus, a user can monitor the progress of the join throughout its execution and stop the join when satisfied with the estimate's accuracy, or run the algorithm to completion with a total time requirement that is not much longer than other common join algorithms. This contrasts with other online join algorithms, which either do not offer such statistical guarantees or can only offer guarantees so long as the input data can fit into core memory.


ACM Transactions on Database Systems | 2011

The Monte Carlo database system: Stochastic analysis close to the data

Ravi Jampani; Fei Xu; Mingxi Wu; Luis Leopoldo Perez; Christopher Jermaine; Peter J. Haas

The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses.

In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.

Collaboration


Dive into Christopher Jermaine's collaborations.

Top Co-Authors

Fei Xu

University of Florida

Mingxi Wu

University of Florida

Edward Omiecinski

Georgia Institute of Technology