
Publication


Featured research published by Carlos Garcia-Alvarado.


International Conference on Management of Data | 2014

Orca: a modular query optimizer architecture for big data

Mohamed A. Soliman; Lyublena Antova; Venkatesh Raghavan; Amr El-Helw; Zhongxian Gu; Entong Shen; George Constantin Caragea; Carlos Garcia-Alvarado; Foyzur Rahman; Michalis Petropoulos; Florian Waas; Sivaramakrishnan Narayanan; Konstantinos Krikellas; Rhonda Baldwin

The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer. In this paper, we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture. In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.


Data Warehousing and OLAP | 2010

Relational versus non-relational database systems for data warehousing

Carlos Ordonez; Il-Yeol Song; Carlos Garcia-Alvarado

Relational database systems have been the dominant technology to manage and analyze large data warehouses. Moreover, the ER model, the standard in database design, has a close relationship with the relational model. Recently, there has been a surge of alternative technologies for large-scale analytic processing, most of which are not based on the relational model. Among these proposals, distributed file systems together with MapReduce have become strong competitors to relational database systems for analyzing large data sets, exploiting parallel processing. Moreover, there is progress on using MapReduce to evaluate relational queries. With that motivation in mind, this panel will compare the pros and cons of each technology for data warehousing and will identify research issues, considering practical aspects like ease of use, programming flexibility, and cost, as well as technical aspects like data modeling, storage, hardware, scalability, query processing, fault tolerance, and data mining.


Distributed and Parallel Databases | 2014

PCA for large data sets with parallel data summarization

Carlos Ordonez; Naveen Mohanam; Carlos Garcia-Alvarado

Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in statistical analysis, and it requires a demanding number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass over a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, and quadratic time on dimensionality (in RAM), spending most of the time on data set summarization, even though SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than both the R statistical package and PCA computed with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.
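
As a rough illustration of the one-pass idea, the sketch below accumulates summarization matrices (the count n, the linear sum L, and the quadratic sum Q) over blocks of rows and then solves a small SVD in memory. It is written in Python with NumPy for readability; the function name and block-based input are assumptions for illustration, not the paper's implementation, which runs inside a DBMS with user-defined aggregations and MKL/LAPACK.

```python
import numpy as np

def pca_one_pass(blocks, k):
    """Illustrative one-pass PCA based on summarization matrices.

    `blocks` is any iterable of row chunks (e.g. partitions that could be
    scanned in parallel); only the count n, the linear sum L and the
    quadratic sum Q are accumulated, so the data is read exactly once.
    The small d x d covariance matrix is then factorized with SVD in RAM.
    """
    n, L, Q = 0, None, None
    for block in blocks:                        # single scan over the data
        block = np.asarray(block, dtype=float)
        n += block.shape[0]
        L = block.sum(axis=0) if L is None else L + block.sum(axis=0)
        Q = block.T @ block if Q is None else Q + block.T @ block

    mean = L / n
    cov = Q / n - np.outer(mean, mean)          # covariance from n, L, Q
    U, s, _ = np.linalg.svd(cov)                # SVD of a d x d matrix in RAM
    return U[:, :k], s[:k]                      # top-k components and variances

# Example (synthetic data split into 8 chunks):
# X = np.random.rand(100_000, 10)
# components, variances = pca_one_pass(np.array_split(X, 8), k=2)
```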


International Conference on Data Mining | 2008

Efficient Distance Computation Using SQL Queries and UDFs

Sasi K. Pitchaimalai; Carlos Ordonez; Carlos Garcia-Alvarado

Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known K-means clustering algorithm. We present SQL query optimizations and a scalar UDF to compute Euclidean distance. We experimentally evaluate performance and scalability of our proposed SQL queries and UDF with large data sets on a modern DBMS. We benchmark distance computation on two important data mining techniques: clustering and classification. In general, UDFs are faster than SQL queries because they are executed in main memory. Data set size is the main factor impacting performance, followed by data set dimensionality.
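
The core of the technique is the arithmetic of squared Euclidean distance, which can be rewritten so that most of the work is a single matrix product. The hypothetical Python sketch below shows that expansion and a nearest-centroid assignment for K-means; it does not reproduce the paper's SQL queries or scalar UDF, only the same computation in a different notation.

```python
import numpy as np

def squared_euclidean(X, C):
    """All pairwise squared Euclidean distances between points X (n x d)
    and centroids C (k x d), using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    so that most of the work is one matrix product."""
    x2 = (X * X).sum(axis=1)[:, None]           # ||x||^2, column vector
    c2 = (C * C).sum(axis=1)[None, :]           # ||c||^2, row vector
    return x2 - 2.0 * (X @ C.T) + c2            # n x k distance matrix

def kmeans_assign(X, C):
    """Nearest-centroid assignment, the step that dominates each K-means pass."""
    return squared_euclidean(X, C).argmin(axis=1)

# Example:
# X = np.random.rand(1000, 4); C = np.random.rand(3, 4)
# labels = kmeans_assign(X, C)
```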


Conference on Information and Knowledge Management | 2010

OLAP-based query recommendation

Carlos Garcia-Alvarado; Zhibo Chen; Carlos Ordonez

Query recommendation is an invaluable tool for enabling users to speed up their searches. In this paper, we present algorithms for generating query suggestions, assuming no previous knowledge of the collection. We developed an online OLAP algorithm that generates query suggestions based on the frequency of the keywords in the selected documents and the correlation between the keywords in the collection. In addition, performance and scalability experiments on these algorithms are presented as proof of their feasibility. We also present sampling as an additional approach for improving performance by using approximate results. We show that valid recommendations result from keyword combinations generated using the correlations between the keywords. The online OLAP algorithm is also compared with the well-known Apriori algorithm and is found to be faster only when simple computations are performed on smaller collections with few keywords. On the other hand, the OLAP algorithm showed more stable behavior across collections and allows more complex policies during aggregation and term combination. Additionally, sampling improved running time without a significant change in the suggested queries, and proved to be an accurate alternative with a few small samples.
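
A minimal sketch of frequency- and correlation-based suggestion is given below. It counts keyword co-occurrences over the documents that match the current query and returns the most correlated non-query terms; the scoring and data structures are simplified assumptions, not the paper's online OLAP algorithm or its aggregation policies.

```python
from collections import Counter
from itertools import combinations

def suggest_queries(documents, query_terms, top_k=5):
    """Suggest keywords that frequently co-occur with the current query.

    `documents` is an iterable of keyword lists; only documents matching
    at least one query term are considered, their keyword pairs are
    aggregated, and the non-query terms most correlated with the query
    are returned as suggestions.
    """
    query_terms = set(query_terms)
    cooc = Counter()
    for doc in documents:
        terms = set(doc)
        if not (query_terms & terms):
            continue                               # keep only matching documents
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1

    scores = Counter()
    for (a, b), c in cooc.items():                 # correlate pairs with the query
        if a in query_terms and b not in query_terms:
            scores[b] += c
        elif b in query_terms and a not in query_terms:
            scores[a] += c
    return [term for term, _ in scores.most_common(top_k)]

# Example:
# docs = [["olap", "cube", "sql"], ["olap", "udf"], ["cube", "lattice"]]
# suggest_queries(docs, ["olap"])   # -> ["cube", "sql", "udf"]
```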


International Conference on Management of Data | 2009

Fast and dynamic OLAP exploration using UDFs

Zhibo Chen; Carlos Ordonez; Carlos Garcia-Alvarado

OLAP is a set of database exploration techniques for efficiently retrieving multiple sets of aggregations from a large dataset. Generally, these techniques have either involved the use of an external OLAP server or required the dataset to be exported to a specialized OLAP tool for more efficient processing. In this work, we show that OLAP techniques can be performed within a modern DBMS, without external servers or exported datasets, using standard SQL queries and UDFs. The main challenge of such an approach is that SQL and UDFs are not as flexible as the C language for exploring the OLAP lattice, and therefore it is more difficult to develop optimizations. We compare three different ways of performing OLAP exploration: plain SQL queries, a UDF implementing a lattice structure, and a UDF programming the star cube structure. We demonstrate how such methods can be used to efficiently explore typical OLAP datasets.
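
To make the lattice idea concrete, the sketch below enumerates every subset of the cube dimensions and computes a group-by sum for each, which is the work the SQL-only and UDF-based variants organize in different ways. The dictionary-based representation and the column names in the example are illustrative assumptions, not the paper's UDF code.

```python
from itertools import combinations

def cube_lattice(rows, dimensions, measure):
    """Compute every cuboid of the OLAP lattice with plain group-by logic.

    Each subset of `dimensions` is one node of the lattice; for each node
    the measure is summed per group.
    """
    cube = {}
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):    # one cuboid per subset
            agg = {}
            for row in rows:                        # rows are dicts column -> value
                key = tuple(row[d] for d in dims)
                agg[key] = agg.get(key, 0) + row[measure]
            cube[dims] = agg
    return cube

# Example with hypothetical columns:
# rows = [{"store": "A", "month": "Jan", "sales": 10},
#         {"store": "A", "month": "Feb", "sales": 5},
#         {"store": "B", "month": "Jan", "sales": 7}]
# cube_lattice(rows, dimensions=("store", "month"), measure="sales")
```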


International Conference on Management of Data | 2010

Keyword search across databases and documents

Carlos Garcia-Alvarado; Carlos Ordonez

Given the continuous growth of databases and the abundance of diverse files in modern IT environments, there is a pressing need to integrate keyword search over heterogeneous information sources. A particular case in which such integration is needed occurs when a collection of documents (e.g., word processing documents, spreadsheets, text files, and so on) is derived directly from a central database, and both repositories are independently updated. Finding hidden relationships between documents and databases is difficult, given the loose connection between them. The problem is especially complicated when database integration techniques must be extended to handle semi-structured data (i.e., documents). Our research focuses on exploiting a relational database system for integrating and exploring complex interrelationships between a database and a collection of potentially related documents. We focus on the discovery and ranking of keyword links (relationships) at different granularity levels between a database schema and a collection of documents. We adapt, extend, and combine information retrieval techniques inside the DBMS, and provide algorithms for efficient exploration of the discovered relationships between a collection of documents and a DBMS. We experimentally show that our system can discover, query, and rank complex relationships between a database and its surrounding documents.
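
A deliberately simple sketch of keyword-link discovery follows: a link is recorded whenever a column value in some table also appears as a term in a document, and links are counted so they can later be ranked. The in-memory dictionaries stand in for relational tables and inverted structures; the paper's granularity levels and in-DBMS algorithms are not reproduced here.

```python
def keyword_links(tables, documents):
    """Record a link whenever a column value also appears as a document term.

    `tables` maps table names to lists of row dicts; `documents` maps a
    document id to its set of terms. Links are counted per
    (table, column, document) and returned sorted for ranking.
    """
    links = {}
    for tname, rows in tables.items():
        for row in rows:
            for col, value in row.items():
                for term in str(value).lower().split():
                    for doc_id, terms in documents.items():
                        if term in terms:
                            key = (tname, col, doc_id)
                            links[key] = links.get(key, 0) + 1
    return sorted(links.items(), key=lambda kv: -kv[1])

# Example:
# tables = {"customer": [{"name": "acme corp", "city": "houston"}]}
# documents = {"report.txt": {"acme", "houston", "invoice"}}
# keyword_links(tables, documents)
```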


Web Information and Data Management | 2008

Information retrieval from digital libraries in SQL

Carlos Garcia-Alvarado; Carlos Ordonez

Information retrieval techniques have traditionally been exploited outside of relational database systems, due to storage overhead, the complexity of programming them inside the database system, and their slow performance in SQL implementations. This project supports the idea that searching and querying digital libraries with information retrieval models inside relational database systems can be performed with optimized SQL queries and User-Defined Functions. In our research, we propose several techniques divided into two phases: storing and retrieving. The storing phase covers document pre-processing, stop-word removal, and term extraction, and the retrieval phase is implemented with three fundamental IR models: the popular Vector Space Model, the Okapi Probabilistic Model, and the Dirichlet Prior Language Model. We conduct experiments using article abstracts from the DBLP bibliography and the ACM Digital Library. We evaluate several query optimizations, compare the on-demand and static weighting approaches, and study performance with conjunctive and disjunctive queries under the three ranking models. Our prototype proved to have linear scalability and satisfactory performance on medium-sized document collections. Our implementation of the Vector Space Model is competitive with the two other models.
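
Of the three ranking models, the Vector Space Model is the easiest to illustrate. The sketch below computes tf-idf weights and ranks documents by their dot product with the query, the same arithmetic the paper expresses as optimized SQL over term and document tables; Okapi and Dirichlet scoring, and all storage-phase details, are omitted, and the in-memory dictionaries are an assumption for illustration.

```python
import math
from collections import Counter

def vector_space_rank(docs, query):
    """Rank documents against a query with tf-idf weights (Vector Space Model).

    `docs` maps a document id to its list of terms; `query` is a list of
    terms. Scores are the dot product of tf-idf vectors, normalized by
    the document vector length so long documents are not favored.
    """
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    idf = {t: math.log(n / df[t]) for t in df}       # inverse document frequency

    q = Counter(query)
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        dot = sum(q[t] * tf[t] * idf.get(t, 0.0) ** 2 for t in q)
        norm = math.sqrt(sum((tf[t] * idf[t]) ** 2 for t in tf)) or 1.0
        scores[doc_id] = dot / norm
    return sorted(scores, key=scores.get, reverse=True)

# Example:
# docs = {"d1": ["query", "optimizer", "cost"], "d2": ["storage", "engine"]}
# vector_space_rank(docs, ["query", "optimizer"])   # -> ["d1", "d2"]
```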


Data Warehousing and OLAP | 2012

Query processing on cubes mapped from ontologies to dimension hierarchies

Carlos Garcia-Alvarado; Carlos Ordonez

Text columns commonly extend core information stored as atomic values in a relational database, creating a need to explore and summarize text data. OLAP cubes can precisely accomplish such tasks. However, cubes have been overlooked as a mechanism not only for capturing text summarizations, but also for representing and exploring the hierarchical structure of an ontology. In this paper, we focus on exploiting cubes to compute multidimensional aggregations on classified documents stored in a DBMS (keyword frequency, document count, document class frequency, and so on). We propose CUBO (CUBed Ontologies), a novel algorithm that efficiently manipulates the hierarchy behind an ontology. Our algorithm is optimized to compute the desired summarizations without having to search all possible dimension combinations, exploiting the sparseness of the document classification frequency matrix. Experiments on large text data sets show that CUBO can explore more dimension combinations faster than a standard cube algorithm, especially when the cube has a large number of dimensions. CUBO was developed entirely inside a DBMS, using SQL queries and extensibility features.
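
The basic operation behind mapping an ontology onto dimension hierarchies is rolling counts up the class hierarchy, sketched below with a simple parent map. The hypothetical classes in the example are only for illustration, and CUBO's pruning of sparse dimension combinations inside the DBMS is not reproduced.

```python
def rollup_ontology(class_counts, parent):
    """Propagate per-class document counts up an ontology hierarchy.

    `parent` maps every ontology class to its parent (None at the root);
    `class_counts` holds counts for the classes documents were assigned
    to. Each count is added to all ancestors, so every node ends up with
    the aggregate of its subtree.
    """
    totals = dict(class_counts)
    for cls, count in class_counts.items():
        node = parent.get(cls)
        while node is not None:                   # walk up to the root
            totals[node] = totals.get(node, 0) + count
            node = parent.get(node)
    return totals

# Example with a hypothetical three-level ontology:
# parent = {"neural networks": "machine learning",
#           "machine learning": "computer science",
#           "computer science": None}
# rollup_ontology({"neural networks": 12, "machine learning": 3}, parent)
# -> {"neural networks": 12, "machine learning": 15, "computer science": 15}
```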


ACM Transactions on Knowledge Discovery From Data | 2014

Bayesian Variable Selection in Linear Regression in One Pass for Large Datasets

Carlos Ordonez; Carlos Garcia-Alvarado; Veerabhadaran Baladandayuthapani

Bayesian models are generally computed with Markov Chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large datasets. On the other hand, variable selection remains a challenging problem due to its combinatorial search space, for which Bayesian models are a promising solution. In this work, we study how to accelerate Bayesian model computation for variable selection in linear regression. We propose a fast Gibbs sampler algorithm, a widely used MCMC method, that incorporates several optimizations. We use a Zellner prior for the regression coefficients, an improper prior on the variance, and a conjugate prior Gaussian distribution, which enable dataset summarization in one pass, thus exploiting an augmented set of sufficient statistics. Thereafter, the algorithm iterates in main memory. Sufficient statistics are indexed with a sparse binary vector to efficiently compute matrix projections based on the selected variables. The probabilities of discovered variable subsets, obtained by selecting or discarding each variable, are stored in a hash table for fast retrieval in later iterations. We study how to integrate our algorithm into a Database Management System (DBMS), exploiting aggregate User-Defined Functions for parallel data summarization and stored procedures to manipulate matrices with arrays. An experimental evaluation with real datasets assesses accuracy and time performance, comparing our DBMS-based algorithm with the R package. Our algorithm is shown to produce accurate results, scale linearly on dataset size, and run orders of magnitude faster than the R package.
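
A compact sketch of the one-pass idea is shown below: centered sufficient statistics (X'X, X'y, y'y, n) are computed in a single scan, after which a Gibbs-style sampler over the inclusion vector works entirely in memory, caching the marginal likelihood of each visited subset in a hash table. The Zellner g-prior marginal likelihood used here and the update scheme are simplified assumptions and do not reproduce the paper's exact priors or DBMS integration.

```python
import numpy as np

def gibbs_variable_selection(X, y, g=None, iters=2000, seed=0):
    """Sketch of Gibbs-style variable selection with a Zellner g-prior.

    Sufficient statistics are computed once, in a single pass over the
    data; afterwards every iteration works only on these small in-memory
    matrices and a hash table of visited subsets.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    g = g if g is not None else float(n)           # unit-information prior

    # One pass over the data: centered sufficient statistics.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    XtX, Xty, yty = Xc.T @ Xc, Xc.T @ yc, yc @ yc

    def log_marginal(gamma):
        """Log marginal likelihood (up to a constant) of the subset gamma."""
        k = int(gamma.sum())
        if k == 0:
            return 0.0
        idx = np.flatnonzero(gamma)
        A, b = XtX[np.ix_(idx, idx)], Xty[idx]
        r2 = b @ np.linalg.solve(A, b) / yty       # R^2 from sufficient stats
        return (0.5 * (n - 1 - k) * np.log1p(g)
                - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2)))

    gamma = np.zeros(d, dtype=int)
    counts = np.zeros(d)                           # inclusion frequencies
    cache = {}                                     # hash table of visited subsets
    for _ in range(iters):
        for j in range(d):                         # update one indicator at a time
            logs = []
            for v in (0, 1):
                gamma[j] = v
                key = gamma.tobytes()
                if key not in cache:
                    cache[key] = log_marginal(gamma)
                logs.append(cache[key])
            p1 = 1.0 / (1.0 + np.exp(logs[0] - logs[1]))
            gamma[j] = int(rng.random() < p1)
        counts += gamma
    return counts / iters                          # posterior inclusion probabilities
```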

Collaboration


Dive into Carlos Garcia-Alvarado's collaborations.

Top Co-Authors

Venkatesh Raghavan

Worcester Polytechnic Institute
