Luis Leopoldo Perez
Rice University
Publication
Featured research published by Luis Leopoldo Perez.
ACM Transactions on Database Systems | 2011
Ravi Jampani; Fei Xu; Mingxi Wu; Luis Leopoldo Perez; Christopher Jermaine; Peter J. Haas
The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses. In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
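The random-relation idea can be illustrated with a small, hypothetical Python sketch (the relation, the growth model, and the query here are invented for illustration and are not MCDB's actual interface): each Monte Carlo iteration instantiates one possible world of the random relation and re-runs the query, and the distribution of answers is then summarized rather than a single number returned.

```python
import random
import statistics

# Stored (deterministic) data: (customer, this_year_spend).
stored = [("alice", 120.0), ("bob", 80.0), ("carol", 200.0)]

def random_relation(seed):
    """One possible world: next-year spend is modeled as this year's
    spend times a Gaussian growth factor (a made-up stochastic model)."""
    rng = random.Random(seed)
    return [(name, spend * rng.gauss(1.05, 0.2)) for name, spend in stored]

def query(rel):
    """Plays the role of: SELECT SUM(next_year_spend) FROM random_relation"""
    return sum(spend for _, spend in rel)

# Monte Carlo: run the same query over many sampled database instances
# and report the distribution of the answer.
results = [query(random_relation(seed)) for seed in range(1000)]
print(round(statistics.mean(results), 1), "+/-", round(statistics.stdev(results), 1))
```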
International Conference on Data Engineering | 2017
Shangyu Luo; Zekai J. Gao; Michael N. Gubanov; Luis Leopoldo Perez; Christopher Jermaine
As data analytics has become an important application for modern data management systems, a new category of data management system has appeared recently: the scalable linear algebra system. In this paper, we argue that a parallel or distributed database system is actually an excellent platform upon which to build such functionality. Most relational systems already have support for cost-based optimization, which is vital to scaling linear algebra computations, and it is well known how to make relational systems scale. We show that by making just a few changes to a parallel/distributed relational database system, such a system can be a competitive platform for scalable linear algebra. Taken together, our results should at least raise the possibility that brand new systems designed from the ground up to support scalable linear algebra are not absolutely necessary, and that such systems could instead be built on top of existing relational technology. Our results also suggest that if scalable linear algebra is to be added to a modern dataflow platform such as Spark, it should be added on top of the system's more structured (relational) data abstractions, rather than being constructed directly on top of the system's raw dataflow operators.
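The core claim, that relational machinery maps naturally onto linear algebra, can be sketched in a few lines of Python (a toy illustration under my own sparse encoding, not the system from the paper): store each matrix as a relation of (row, col, value) tuples, and matrix multiply becomes a join on the shared index followed by a grouped sum.

```python
from collections import defaultdict

# Each sparse matrix is a relation of (row, col, value) tuples.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]   # [[1, 2], [3, 0]]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # [[4, 0], [5, 6]]

def matmul_relational(A, B):
    """C = A * B as: JOIN on A.col = B.row,
    then GROUP BY (A.row, B.col) with SUM(A.value * B.value)."""
    C = defaultdict(float)
    for ai, ak, av in A:
        for bk, bj, bv in B:
            if ak == bk:                 # the join predicate
                C[(ai, bj)] += av * bv   # the grouped aggregate
    return dict(C)

print(matmul_relational(A, B))
```

A real engine would, of course, use hash or sort-merge joins and a cost-based plan instead of this nested loop; the point is only that the operation decomposes into standard relational operators.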
International Conference on Data Mining | 2013
Zhuhua Cai; Chris Jermaine; Zografoula Vagena; Dionysios Logothetis; Luis Leopoldo Perez
In this paper, we consider the problem of imputation (recovering missing values) in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified, inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at O(nm²) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately sized, high-dimensional data sets. In this paper, we propose a very simple alternative to the GMRF called the pairwise Gaussian random field, or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p pairs. Using this structure, it is easy to perform simultaneous inference and imputation with the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear in p, where p will typically be much smaller than the number of entries in an m×m covariance or precision matrix.
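A toy sketch of the PGRF idea in Python (my assumptions: unit-weight Gaussian factors of the form exp(-(x_i - x_j)²/2) on each user-supplied pair, and a simple fixed-point iteration rather than the paper's full inference procedure): under such a model, the conditional mean of a missing dimension is just the average of its graph neighbors, so imputation costs time roughly linear in the number of pairs p.

```python
pairs = [(0, 1), (1, 2), (2, 3)]   # user-supplied suspected-correlation pairs
x = [10.0, None, None, 16.0]       # one record; None marks a missing value

# Build the adjacency structure implied by the pairs.
neighbors = {}
for i, j in pairs:
    neighbors.setdefault(i, []).append(j)
    neighbors.setdefault(j, []).append(i)

missing = [i for i, v in enumerate(x) if v is None]
for i in missing:
    x[i] = 0.0                     # arbitrary initial guess

# Fixed-point iteration: each missing dimension is repeatedly set to
# the mean of its neighbors (its conditional mean under unit factors).
for _ in range(200):
    for i in missing:
        x[i] = sum(x[j] for j in neighbors[i]) / len(neighbors[i])

print(x)
```

On this chain the iteration converges to x[1] = 12 and x[2] = 14, the values that smoothly interpolate the two observed endpoints.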
International Conference on Management of Data | 2017
Zekai J. Gao; Shangyu Luo; Luis Leopoldo Perez; Chris Jermaine
We describe BUDS, a declarative language for succinctly and simply specifying the implementation of large-scale machine learning algorithms on a distributed computing platform. The types supported in BUDS (vectors, arrays, etc.) are simply logical abstractions useful for programming, and do not correspond to the actual implementation. In fact, BUDS automatically chooses the physical realization of these abstractions in a distributed system, taking into account the characteristics of the data. Likewise, there are many available implementations of the abstract operations offered by BUDS (matrix multiplies, transposes, Hadamard products, etc.). These are tightly coupled with the physical representation, and in BUDS the implementations are co-optimized along with the representation. All of this allows the BUDS compiler to automatically perform deep optimizations of the user's program and generate efficient implementations.
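A hypothetical Python miniature of the logical/physical separation described above (not the actual BUDS compiler; the representation-choice rule and its threshold are invented for illustration): the program is written against an abstract vector, a "compiler" step picks a dense or sparse physical layout from the data's characteristics, and the implementation of the dot-product operation is then chosen to match that layout.

```python
def choose_representation(values, sparsity_threshold=0.5):
    """'Compile' a logical vector: pick a physical layout from the data."""
    nonzero = sum(1 for v in values if v != 0.0)
    if nonzero / len(values) < sparsity_threshold:
        return ("sparse", {i: v for i, v in enumerate(values) if v != 0.0})
    return ("dense", list(values))

def dot(a, b):
    """Dot product; the implementation is coupled to the representations."""
    (kind_a, data_a), (kind_b, data_b) = a, b
    if kind_a == "dense" and kind_b == "dense":
        return sum(x * y for x, y in zip(data_a, data_b))
    if kind_a == "sparse" and kind_b == "dense":
        return sum(v * data_b[i] for i, v in data_a.items())   # skip zeros
    if kind_a == "dense" and kind_b == "sparse":
        return dot(b, a)
    return sum(v * data_b.get(i, 0.0) for i, v in data_a.items())

u = choose_representation([0.0, 0.0, 0.0, 5.0])   # mostly zero -> sparse
v = choose_representation([1.0, 2.0, 3.0, 4.0])   # mostly nonzero -> dense
print(u[0], v[0], dot(u, v))                      # prints: sparse dense 20.0
```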
International Conference on Management of Data | 2008
Ravi Jampani; Fei Xu; Mingxi Wu; Luis Leopoldo Perez; Christopher Jermaine; Peter J. Haas
International Conference on Management of Data | 2010
Subi Arumugam; Alin Dobra; Christopher Jermaine; Niketan Pansare; Luis Leopoldo Perez
International Conference on Management of Data | 2013
Zhuhua Cai; Zografoula Vagena; Luis Leopoldo Perez; Subramanian Arumugam; Peter J. Haas; Christopher Jermaine
International Conference on Management of Data | 2014
Zhuhua Cai; Zekai J. Gao; Shangyu Luo; Luis Leopoldo Perez; Zografoula Vagena; Chris Jermaine
Very Large Data Bases | 2010
Subi Arumugam; Fei Xu; Ravi Jampani; Christopher Jermaine; Luis Leopoldo Perez; Peter J. Haas
International Conference on Management of Data | 2008
Florin Rusu; Fei Xu; Luis Leopoldo Perez; Mingxi Wu; Ravi Jampani; Christopher Jermaine; Alin Dobra