Guadalupe Canahuate | Researchain

Publication

Featured researches published by Guadalupe Canahuate.

International Conference on Data Engineering | 2014

A tunable compression framework for bitmap indices

Gheorghi Guzun; Guadalupe Canahuate; David Chiu; Jason Sawin

Bitmap indices are widely used for large read-only repositories in data warehouses and scientific databases. Their binary representation allows for the use of bitwise operations and specialized run-length compression techniques. Due to a trade-off between compression and query efficiency, bitmap compression schemes are aligned using a fixed encoding length size (typically the word length) to avoid explicit decompression during query time. In general, smaller encoding lengths provide better compression, but require more decoding during query execution. However, when the difference in size is considerable, it is possible for smaller encodings to also provide better execution time. We posit that a tailored encoding length for each bit vector will provide better performance than a one-size-fits-all approach. We present a framework that optimizes compression and query efficiency by allowing bitmaps to be compressed using variable encoding lengths while still maintaining alignment to avoid explicit decompression. Efficient algorithms are introduced to process queries over bitmaps compressed using different encoding lengths. An input parameter controls the aggressiveness of the compression providing the user with the ability to tune the tradeoff between space and query time. Our empirical study shows this approach achieves significant improvements in terms of both query time and compression ratio for synthetic and real data sets. Compared to 32-bit WAH, VAL-WAH produces up to 1.8× smaller bitmaps and achieves query times that are 30% faster.
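
To illustrate the compression side of this trade-off, here is a minimal Python sketch of word-aligned run-length encoding with a configurable segment length. It is not the paper's VAL-WAH implementation: the token layout, the names `encode`, `decode`, and `encoded_size_bits`, and the cost model of one flag bit per token are assumptions made for illustration, and fill counters are treated as unbounded.

```python
def encode(bits, seg):
    """Run-length encode a list of 0/1 values in aligned groups of `seg` bits.

    Returns a list of tokens: ("fill", bit, group_count) or ("literal", group).
    """
    tokens = []
    for i in range(0, len(bits), seg):
        group = bits[i:i + seg]
        group = group + [0] * (seg - len(group))  # pad the final group with zeros
        if all(b == group[0] for b in group):
            fill_bit = group[0]
            # Extend the previous fill if it has the same bit value.
            if tokens and tokens[-1][0] == "fill" and tokens[-1][1] == fill_bit:
                tokens[-1] = ("fill", fill_bit, tokens[-1][2] + 1)
            else:
                tokens.append(("fill", fill_bit, 1))
        else:
            tokens.append(("literal", tuple(group)))
    return tokens


def decode(tokens, seg, n):
    """Expand tokens back into the first `n` bits."""
    bits = []
    for token in tokens:
        if token[0] == "fill":
            bits.extend([token[1]] * (seg * token[2]))
        else:
            bits.extend(token[1])
    return bits[:n]


def encoded_size_bits(tokens, seg):
    """Assumed cost model: every token occupies seg payload bits plus 1 flag bit."""
    return len(tokens) * (seg + 1)


# Toy bit vector: mostly zeros with two set bits close together.
example = [0] * 20 + [1, 0, 1] + [0] * 40
for seg in (7, 31):
    toks = encode(example, seg)
    print(seg, len(toks), encoded_size_bits(toks, seg))
```

On this toy input the 7-bit segments need 32 bits in total while the 31-bit segments need 64, in line with the abstract's remark that smaller encoding lengths tend to compress better. The sketch says nothing about query time, which is the other half of the trade-off.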

Very Large Data Bases | 2016

Hybrid query optimization for hard-to-compress bit-vectors

Gheorghi Guzun; Guadalupe Canahuate

Bit-vectors are widely used for indexing and summarizing data due to their efficient processing in modern computers. Sparse bit-vectors can be further compressed to reduce their space requirement. Special compression schemes based on run-length encoders have been designed to avoid explicit decompression and minimize the decoding overhead during query execution. Moreover, highly compressed bit-vectors can exhibit a faster query time than the non-compressed ones. However, for hard-to-compress bit-vectors, compression does not speed up queries and can add considerable overhead. In these cases, bit-vectors are often stored verbatim (non-compressed). On the other hand, queries are answered by executing a cascade of bit-wise operations involving indexed bit-vectors and intermediate results. Often, even when the original bit-vectors are hard to compress, the intermediate results become sparse. It could be feasible to improve query performance by compressing these bit-vectors as the query is executed. In this scenario, it would be necessary to operate verbatim and compressed bit-vectors together. In this paper, we propose a hybrid framework where compressed and verbatim bitmaps can coexist and design algorithms to execute queries under this hybrid model. Our query optimizer is able to decide at run time when to compress or decompress a bit-vector. Our heuristics show that the applications using higher-density bitmaps can benefit from using this hybrid model, improving both their query time and memory utilization.
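
The idea of switching representation during a cascade of operations can be sketched as follows. This is only an illustration under stated assumptions: the `Verbatim` and `Sparse` classes, the `and_op` and `adapt` helpers, and the fixed density `threshold` are hypothetical, and a set of bit positions stands in for the run-length compressed form used in the paper. The paper's optimizer makes its decision with its own criteria, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class Verbatim:
    bits: int  # bit i is set when row i qualifies
    n: int     # number of rows


@dataclass(frozen=True)
class Sparse:
    positions: FrozenSet[int]  # stand-in for a compressed bit-vector
    n: int


BitVector = Union[Verbatim, Sparse]


def density(v: BitVector) -> float:
    ones = bin(v.bits).count("1") if isinstance(v, Verbatim) else len(v.positions)
    return ones / v.n


def to_sparse(v: Verbatim) -> Sparse:
    return Sparse(frozenset(i for i in range(v.n) if (v.bits >> i) & 1), v.n)


def to_verbatim(v: Sparse) -> Verbatim:
    bits = 0
    for i in v.positions:
        bits |= 1 << i
    return Verbatim(bits, v.n)


def adapt(v: BitVector, threshold: float) -> BitVector:
    """Pick the representation for an intermediate result (assumed heuristic)."""
    d = density(v)
    if isinstance(v, Verbatim) and d < threshold:
        return to_sparse(v)
    if isinstance(v, Sparse) and d >= threshold:
        return to_verbatim(v)
    return v


def and_op(a: BitVector, b: BitVector, threshold: float = 0.25) -> BitVector:
    """AND two bit-vectors in any mix of representations, then adapt the result."""
    if isinstance(a, Verbatim) and isinstance(b, Verbatim):
        out: BitVector = Verbatim(a.bits & b.bits, a.n)
    elif isinstance(a, Sparse) and isinstance(b, Sparse):
        out = Sparse(a.positions & b.positions, a.n)
    else:
        s, v = (a, b) if isinstance(a, Sparse) else (b, a)
        out = Sparse(frozenset(i for i in s.positions if (v.bits >> i) & 1), a.n)
    return adapt(out, threshold)
```

With 20 rows, ANDing the verbatim vector of even rows with the verbatim vector of multiples of three leaves four set bits (density 0.2), so under the assumed threshold of 0.25 the intermediate result is switched to the sparse form before the next operation.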

Trans. Large-Scale Data- and Knowledge-Centered Systems | 2014

Slicing the Dimensionality: Top-k Query Processing for High-Dimensional Spaces

Gheorghi Guzun; Joel E. Tosado; Guadalupe Canahuate

Top-k (preference) queries are used in several domains to retrieve the set of \(k\) tuples that most closely match a given query. For high-dimensional spaces, evaluation of top-k queries is expensive, as data and space partitioning indices perform worse than sequential scan. An alternative approach is the use of sorted lists to speed up query evaluation. This approach extends performance gains over sequential scan to about ten dimensions. However, the data sets for which preference queries are considered are often high-dimensional. In this paper, we explore the use of bit-sliced indices (BSI) to encode the attributes or score lists and perform top-k queries over high-dimensional data using bit-wise operations. Our approach does not require sorting or random access to the index. Additionally, bit-sliced indices require less space than other types of indices. The size of the bit-sliced index (without using compression) for a normalized data set with 3 decimals is 60 times smaller than the size of sorted lists. Furthermore, our experimental evaluation shows that the use of BSI for top-k query processing is more efficient than Sequential Scan for high-dimensional data. When compared to Sequential Top-k Algorithm (STA), BSI is one order of magnitude faster.
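
A small Python sketch may help show how a bit-sliced index supports scoring and top-k selection with bit-wise operations only. It assumes non-negative integer attribute values, stores each slice as a Python integer, sums two attributes with a ripple-carry adder over slices, and selects the top k with a standard most-significant-slice-first scan. The function names are hypothetical, and the paper's own algorithm may differ in its details.

```python
def build_bsi(values, n_bits):
    """Slice j holds bit j of every value; bit `row` of a slice maps to that row."""
    slices = []
    for j in range(n_bits):
        s = 0
        for row, v in enumerate(values):
            if (v >> j) & 1:
                s |= 1 << row
        slices.append(s)
    return slices


def add_bsi(a, b):
    """Add two BSIs slice by slice (ripple-carry), using only bit-wise operations."""
    out, carry = [], 0
    for j in range(max(len(a), len(b))):
        x = a[j] if j < len(a) else 0
        y = b[j] if j < len(b) else 0
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    if carry:
        out.append(carry)
    return out


def top_k(slices, n_rows, k):
    """Return the rows holding the k largest values (more than k rows on ties)."""
    greater = 0                    # rows already known to be in the answer
    equal = (1 << n_rows) - 1      # rows still tied on the slices seen so far
    for s in reversed(slices):     # most significant slice first
        candidate = greater | (equal & s)
        count = bin(candidate).count("1")
        if count > k:
            equal &= s
        elif count < k:
            greater = candidate
            equal &= ~s
        else:
            equal &= s
            break
    result = greater | equal
    return [row for row in range(n_rows) if (result >> row) & 1]


# Two toy attributes; the score of each row is the sum of its attribute values.
attr_a = [3, 1, 4, 1, 5]
attr_b = [2, 7, 1, 8, 2]
score = add_bsi(build_bsi(attr_a, 3), build_bsi(attr_b, 4))
print(top_k(score, n_rows=5, k=2))
```

Neither sorting nor random access to individual rows is needed: every step is a bit-wise operation over whole slices, which is the property the abstract highlights.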

Knowledge and Information Systems | 2016

Performance evaluation of word-aligned compression methods for bitmap indices

Gheorghi Guzun; Guadalupe Canahuate

Bitmap indices are a widely used scheme for large read-only repositories in data warehouses and scientific databases. This binary representation allows the use of bit-wise operations for fast query processing and is typically compressed using run-length encoding techniques. Most bitmap compression techniques are aligned using a fixed encoding length (32 or 64 bits) to avoid explicit decompression during query time. Many have been proposed to extend or enhance word-aligned hybrid (WAH) compression. This paper presents a comparative study of four bitmap compression techniques: WAH, PLWAH, CONCISE, and EWAH. Experiments are targeted to identify the conditions under which each method should be applied and quantify the overhead incurred during query processing. Performance in terms of compression ratio and query time is evaluated over synthetically generated bitmap indices, and results are validated over bitmap indices generated from real data sets. Different query optimizations are explored, query time estimation formulas are defined, and the conditions under which one method should be preferred over another are formalized.

International Database Engineering and Applications Symposium | 2016

A Two-Phase MapReduce Algorithm for Scalable Preference Queries over High-Dimensional Data

Gheorghi Guzun; Guadalupe Canahuate; David Chiu

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions in order to determine an overall score for each of the objects across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents the pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these high-dimensional preference queries efficiently in a centralized environment. As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment, in terms of index size, network traffic, and execution time of preference (top-k) queries over high-dimensional data. We implemented three MapReduce algorithms for processing aggregations and top-k queries over the BSI index: a baseline algorithm using a tree reduction of the slices, a group-slice algorithm, and an optimized two-phase algorithm that uses bit-slice mapping. The implementations are on top of Apache Spark using vertical and horizontal data partitioning. The bit-slice mapping approach is shown to outperform the baseline MapReduce implementations by virtue of using a reduced-size index and offering better control over task granularity and load balancing.

Scientific Reports | 2018

Investigation of radiomic signatures for local recurrence using primary tumor texture analysis in oropharyngeal head and neck cancer patients

Hesham Elhalawani; Aasheesh Kanwar; Abdallah S.R. Mohamed; Aubrey L. White; James Zafereo; Andrew J. Wong; Joel E. Berends; Shady AboHashem; Bowman Williams; Jeremy M. Aymard; Subha Perni; Jay A. Messer; Ben Warren; Bassem Youssef; Pei Yang; M.A.M. Meheissen; M. Kamal; B. Elgohari; Rachel B. Ger; Carlos E. Cardenas; Xenia Fave; L Zhang; Dennis Mackin; G. Elisabeta Marai; David M. Vock; Guadalupe Canahuate; Stephen Y. Lai; G. Brandon Gunn; Adam S. Garden; David I. Rosenthal

Radiomics is a “big data” approach that applies advanced image refining/data characterization algorithms to generate imaging features that can quantitatively classify tumor phenotypes in a non-invasive manner. We hypothesize that certain textural features of oropharyngeal cancer (OPC) primary tumors will have statistically significant correlations to patient outcomes such as local control. Patients from an IRB-approved database dispositioned to (chemo)radiotherapy for locally advanced OPC were included in this retrospective series. Pretreatment contrast CT scans were extracted, and radiomics-based analysis of gross tumor volume of the primary disease (GTVp) was performed using imaging biomarker explorer (IBEX) software that runs on the Matlab platform. The data set was randomly divided into a training data set and a test and tuning holdback data set. Machine learning methods were applied to yield a radiomic signature consisting of features with minimal overlap and maximum prognostic significance. The radiomic signature was adapted to discriminate patients, in concordance with other key clinical prognosticators. 465 patients were available for analysis. A signature composed of 2 radiomic features from pre-therapy imaging was derived, based on the Intensity Direct and Neighbor Intensity Difference methods. Analysis of resultant groupings showed robust discrimination of recurrence probability, and Kaplan-Meier-estimated local control rate (LCR) differences between “favorable” and “unfavorable” clusters were noted.

Scientific Data | 2017

Matched computed tomography segmentation and demographic data for oropharyngeal cancer radiomics challenges

Hesham Elhalawani; Abdallah S.R. Mohamed; Aubrey L. White; James Zafereo; Andrew J. Wong; Joel E. Berends; Shady AboHashem; Bowman Williams; Jeremy M. Aymard; Aasheesh Kanwar; Subha Perni; Crosby D. Rock; Luke Cooksey; Shauna Campbell; Yao Ding; Stephen Y. Lai; Elisabeta G. Marai; David M. Vock; Guadalupe Canahuate; John Freymann; Keyvan Farahani; Jayashree Kalpathy-Cramer; Clifton D. Fuller

Cancers arising from the oropharynx have been increasingly studied in the past few years, as they are now epidemic domestically. These tumors are treated with definitive (chemo)radiotherapy, and have local recurrence as a primary mode of clinical failure. Recent data suggest that ‘radiomics’, or extraction of image texture analysis to generate mineable quantitative data from medical images, can reflect phenotypes for various cancers. Several groups have shown that developed radiomic signatures, in head and neck cancers, can be correlated with survival outcomes. This data descriptor defines a repository for head and neck radiomic challenges, executed via a Kaggle in Class platform, in partnership with the MICCAI society 2016 annual meeting. These public challenges were designed to leverage radiomics and/or machine learning workflows to discriminate HPV phenotype in one challenge (HPV status challenge) and to identify patients who will develop a local recurrence in the primary tumor volume in the second one (Local recurrence prediction challenge) in a segmented, clinically curated anonymized oropharyngeal cancer (OPC) data set.

International Database Engineering and Applications Symposium | 2014

Optimizing query execution for variable-aligned length compression of bitmap indices

Ryan Slechta; Jason Sawin; Ben McCamish; David Chiu; Guadalupe Canahuate

Indexing is a fundamental mechanism for efficient data access. Recently, we proposed the Variable-Aligned Length (VAL) bitmap index encoding framework, which generalizes the commonly used word-aligned compression techniques. VAL is a variable-aligned compression framework that allows columns of a bitmap to be compressed using different encoding lengths. This flexibility creates a tunable compression that balances the trade-off between space and query processing time. The variable format of VAL presents several unique opportunities for query optimization. In this paper we explore multiple algorithms to optimize both point queries and range queries in VAL. In particular, we propose a dynamic encoding-length translation heuristic to process point queries. For range queries, we propose several column orderings based on the bitmap metadata: largest segment length first (lsf), column size (size), and weighted size (ws). In our empirical study over both real and synthetic data sets, we show that our dynamic translation selection scheme produces query execution times only 3.5% below the optimal. We also found that the weighted size column ordering significantly and consistently outperforms other ordering techniques. Finally, we show that the algorithms scale to data sets that are row-ordered.

Scientific Reports | 2017

Conditional survival analysis of patients with locally advanced laryngeal cancer: Construction of a dynamic risk model and clinical nomogram

Tommy Sheu; David M. Vock; Abdallah S.R. Mohamed; Neil D. Gross; Collin F. Mulcahy; Mark E. Zafereo; G. Brandon Gunn; Adam S. Garden; Parag R. Sevak; Jack Phan; Jan S. Lewin; Steven J. Frank; Beth M. Beadle; William H. Morrison; Stephen Y. Lai; Katherine A. Hutcheson; G. Elisabeta Marai; Guadalupe Canahuate; Merrill S. Kies; Adel K. El-Naggar; Randal S. Weber; David I. Rosenthal; Clifton D. Fuller

Conditional survival (CS), the survival beyond a pre-defined time interval, can identify periods of higher mortality risk for patients with locally advanced laryngeal cancer who face treatment-related toxicity and comorbidities related to alcohol and smoking in the survivorship setting. Using Weibull regression modeling, we analyzed retrospectively abstracted data from 638 records of patients who received radiation to identify prognostic factors for overall survival (OS) and recurrence-free survival (RFS) for the first 3 years of survival and for OS conditional upon 3 years of survival. The CS was iteratively calculated, stratifying on variables that were statistically significant on multivariate regression. Predictive nomograms were generated. The median total follow-up time was 175 months. The 3- and 6-year actuarial overall survival (OS) was 68% (95% confidence interval [CI] 65–72%) and 49% (CI 45–53%). The 3-year conditional overall survival (COS) at 3 years was 72% (CI 65–74%). Black patients had worse COS over time. Nodal disease was significantly associated with recurrence, but after 3 years, the 3-year conditional RFS converged for all nodal groups. In conclusion, the CS analysis in this patient cohort identified subgroups and time intervals that may represent opportunities for intervention.
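
As a rough consistency check on these figures, conditional survival can be written as a ratio of two unconditional survival probabilities, where \(S(t)\) is the probability of surviving to time \(t\). The calculation below is back-of-the-envelope arithmetic on the rounded point estimates quoted in the abstract, not the model-based estimate from the paper.

```latex
\[
\mathrm{CS}(t \mid s) = \frac{S(s+t)}{S(s)},
\qquad
\mathrm{COS}(3 \mid 3) \approx \frac{S(6)}{S(3)} = \frac{0.49}{0.68} \approx 0.72
\]
```

The ratio agrees with the reported 3-year conditional overall survival of 72%.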

International Conference on Big Data | 2015

Scalable preference queries for high-dimensional data using map-reduce

Gheorghi Guzun; Joel E. Tosado; Guadalupe Canahuate

Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions in order to determine an overall score for each of the objects across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents the pre-computation of the global scores. Executing this type of query is particularly challenging for high-dimensional data. Recently, bit-sliced indices (BSI) were proposed to answer these preference queries efficiently in a non-distributed environment for data with hundreds of dimensions. As MapReduce and key-value stores proliferate as the preferred methods for analyzing big data, we set out to evaluate the performance of BSI in a distributed environment, in terms of index size, network traffic, and execution time of preference (top-k) queries, over data with thousands of dimensions. Indexing is implemented on top of Apache Spark for both column and row stores and is shown to outperform Hive, when running on MapReduce and on Tez, for top-k (preference) queries.
