Mohamed Y. Eltabakh
Worcester Polytechnic Institute
Publications
Featured research published by Mohamed Y. Eltabakh.
Very Large Data Bases | 2011
Mohamed Y. Eltabakh; Yuanyuan Tian; Fatma Ozcan; Rainer Gemulla; Aljoscha Krettek; John McPherson
Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that some set of files are related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault tolerance properties of Hadoop are retained. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization. We conducted a detailed study of joins and sessionization in the context of log processing, a common use case for Hadoop, and propose efficient map-only algorithms that exploit colocated data partitions. In our experiments, we observed that CoHadoop outperforms both plain Hadoop and previous work. In particular, our approach not only performs better than repartition-based algorithms, but also outperforms map-only algorithms that exploit data partitioning but not colocation.
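To make the colocation idea concrete, the following is a minimal Python sketch of a locator table: files that carry the same locator hint are placed on the same set of data nodes, while unhinted files fall back to random placement. The class and method names (LocatorTable, place_file) are illustrative assumptions, not CoHadoop's actual HDFS-level implementation.

import random

class LocatorTable:
    """Sketch of CoHadoop's locator idea: files that share a locator
    are placed on the same node set (hypothetical simplification)."""

    def __init__(self, nodes, replication=3):
        self.nodes = nodes              # all data nodes in the cluster
        self.replication = replication
        self.locator_to_nodes = {}      # locator id -> chosen node set

    def place_file(self, filename, locator=None):
        if locator is None:
            # no hint: fall back to Hadoop-style random placement
            return random.sample(self.nodes, self.replication)
        if locator not in self.locator_to_nodes:
            # the first file with this locator fixes the node set
            self.locator_to_nodes[locator] = random.sample(self.nodes, self.replication)
        # later files with the same locator land on the same nodes
        return self.locator_to_nodes[locator]

table = LocatorTable(nodes=[f"node{i}" for i in range(10)])
print(table.place_file("log_part_01", locator="cust-partition-1"))
print(table.place_file("ref_part_01", locator="cust-partition-1"))  # same node set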
Extending Database Technology | 2009
Mohamed Y. Eltabakh; Walid G. Aref; Ahmed K. Elmagarmid; Mourad Ouzzani; Yasin N. Silva
Annotations play a key role in understanding and curating databases. Annotations may represent comments, descriptions, lineage information, among other things. Annotation management is a vital mechanism for sharing knowledge and building an interactive and collaborative environment among database users and scientists. What makes it challenging is that annotations can be attached to database entities at various granularities, e.g., at the table, tuple, column, or cell level, or more generally, to any subset of cells that results from a select statement. Therefore, simple comment fields in tuples would not work because of the combinatorial nature of the annotations. In this paper, we present extensions to current database management systems to support annotations. We propose storage schemes to efficiently store annotations at multiple granularities, i.e., at the table, tuple, column, and cell levels. Compared to storing the annotations with the individual cells, the proposed schemes achieve more than an order-of-magnitude reduction in storage and up to 70% savings in query execution time. We define types of annotations that inherit different behaviors. Through these types, users can specify, for example, whether or not an annotation is continuously applied over newly inserted data and whether or not an annotation is archived when the base data is modified. These annotation types raise several storage and processing challenges that are addressed in the paper. We propose declarative ways to add, archive, query, and propagate annotations. The proposed mechanisms are realized through extensions to standard SQL. We implemented the proposed functionalities inside PostgreSQL with an easy-to-use Excel-based front-end graphical interface.
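As an illustration of why a single comment field per tuple does not scale, the sketch below stores one annotation record per granularity (table, column, tuple, or cell set) and propagates the matching annotations to a cell at query time. It is a hypothetical in-memory model, not the paper's PostgreSQL storage schemes or SQL extensions.

class AnnotationStore:
    """Conceptual sketch: one annotation record can cover a whole table,
    a column, a tuple, or a set of cells, instead of duplicating the
    text in every cell (hypothetical, not the paper's storage scheme)."""

    def __init__(self):
        self.annotations = []  # (table, rows, cols, text); None means "all"

    def annotate(self, table, text, rows=None, cols=None):
        self.annotations.append((table, rows, cols, text))

    def annotations_for(self, table, row, col):
        """Propagate annotations to a single cell at query time."""
        out = []
        for t, rows, cols, text in self.annotations:
            if t != table:
                continue
            if rows is not None and row not in rows:
                continue
            if cols is not None and col not in cols:
                continue
            out.append(text)
        return out

store = AnnotationStore()
store.annotate("genes", "curated by lab A")                                # table level
store.annotate("genes", "unit is kDa", cols={"mass"})                      # column level
store.annotate("genes", "suspect measurement", rows={42}, cols={"mass"})   # cell level
print(store.annotations_for("genes", 42, "mass"))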
IEEE Transactions on Knowledge and Data Engineering | 2007
Rafae Bhatti; Arjmand Samuel; Mohamed Y. Eltabakh; Haseeb Amjad; Arif Ghafoor
Policy-based management for federated healthcare systems has recently gained increasing attention due to strict privacy and disclosure rules. Although the work on privacy languages and enforcement mechanisms, such as Hippocratic databases, has advanced our understanding of designing privacy-preserving policies for healthcare databases, the need to integrate these policies in a practical healthcare framework is becoming acute. Additionally, although most work in this area has been organization oriented, dealing with the exchange of information between healthcare organizations (such as referrals), the requirements for the emerging area of personal healthcare information management have so far not been adequately addressed. These shortcomings arise from the lack of a sophisticated policy specification language and enforcement architecture that can capture the requirements for 1) the integration of privacy and disclosure policies with well-known healthcare standards used in the industry in order to specify the precise requirements of a practical healthcare system and 2) the provision of ubiquitous healthcare services to patients using the same infrastructure that enables federated healthcare management for organizations. In this paper, we have designed a policy-based system to mitigate these concerns. First, we have designed our disclosure and privacy policies by using a requirements specification based on a set of use cases for the Clinical Document Architecture (CDA) standard proposed by the community. Second, we present a context-aware policy specification language, which allows encoding of CDA-based requirement use cases into privacy and disclosure policy rules. We have shown that our policy specification language is effective in terms of handling a variety of expressive constraints on CDA-encoded document contents. Our language enables specification of privacy-aware access control for federated healthcare information across organizational boundaries, whereas the use of contextual constraints allows the incorporation of user and environment context in the access control mechanism for personal healthcare information management. Moreover, the declarative syntax of the policy rules makes the policy adaptable to changes in privacy regulations or patient preferences. We also present an enforcement architecture for the federated healthcare framework proposed in this paper.
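The sketch below illustrates the general shape of a context-aware disclosure rule: a role, a document section, and a context predicate that must hold at access time, with deny as the default. The rule structure and field names are assumptions for illustration only and do not reproduce the paper's policy specification language.

# Hedged sketch of context-aware disclosure rules; structure is illustrative.
rules = [
    {"role": "referring_physician",
     "section": "lab_results",
     "permit": True,
     "context": lambda ctx: ctx["purpose"] == "referral"},   # contextual constraint
    {"role": "patient",
     "section": "*",
     "permit": True,
     "context": lambda ctx: True},                            # patients see their own record
]

def is_permitted(role, section, ctx):
    """Evaluate the first matching rule whose context predicate holds."""
    for rule in rules:
        if rule["role"] == role and rule["section"] in (section, "*"):
            if rule["context"](ctx):
                return rule["permit"]
    return False  # default deny

print(is_permitted("referring_physician", "lab_results", {"purpose": "referral"}))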
International Conference on Data Engineering | 2008
Mohamed Y. Eltabakh; Mourad Ouzzani; Walid G. Aref; Ahmed K. Elmagarmid; Yasin N. Laura-Silva; Muhammad U. Arshad; David E. Salt; Ivan Baxter
We demonstrate bdbms, an extensible database engine for biological databases. bdbms started from the observation that database technology has not kept pace with the specific requirements of biological databases and that several needed key functionalities are not supported at the engine level. While bdbms aims at supporting several of these functionalities, this demo focuses on: (1) annotation and provenance management, including storage, indexing, querying, and propagation; (2) local tracking of dependencies and derivations among data items; and (3) update authorization to support data curation. We demonstrate how bdbms enables biologists to manipulate their databases, annotations, and derivation information in a unified database system, using the Purdue Ionomics Information Management System (PiiMS) as a case study.
International Conference on Data Engineering | 2006
Mohamed Y. Eltabakh; Ramy Eltarras; Walid G. Aref
Many evolving database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the class of supported indexes to include disk-based versions of a wide variety of space-partitioning trees, e.g., disk-based trie variants, quadtree variants, and kd-trees. This paper presents a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Several index types are realized inside PostgreSQL, facilitated by rapid SP-GiST instantiations. Challenges, experiences, and performance issues are addressed in the paper. Performance comparisons are conducted from within PostgreSQL to compare the update and search performance of SP-GiST-based indexes against the B+-tree and the R-tree for string, point, and line segment data sets. Interesting results that highlight the potential performance gains of SP-GiST-based indexes are presented in the paper.
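For readers unfamiliar with space-partitioning trees, the following is a tiny point-quadtree insert routine, the kind of unbalanced, decomposition-based structure that SP-GiST generalizes. It is an illustrative in-memory sketch, not the disk-based PostgreSQL realization described in the paper.

class QuadNode:
    """Minimal point quadtree: each split partitions space into four quadrants
    (illustrative only; SP-GiST instantiations are disk-based)."""

    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)
        self.point = None        # payload held by a leaf
        self.children = None     # four quadrants once the leaf splits

    def insert(self, x, y):
        if self.children is None and self.point is None:
            self.point = (x, y)                      # empty leaf: store the point
            return
        if self.children is None:                    # occupied leaf: split it
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            self.children = [QuadNode(x0, y0, mx, my), QuadNode(mx, y0, x1, my),
                             QuadNode(x0, my, mx, y1), QuadNode(mx, my, x1, y1)]
            old = self.point
            self.point = None
            self._child_for(*old).insert(*old)       # push the old point down
        self._child_for(x, y).insert(x, y)           # route the new point

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

root = QuadNode(0, 0, 100, 100)
for p in [(10, 10), (80, 20), (60, 70)]:
    root.insert(*p)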
Extending Database Technology | 2008
Mohamed Y. Eltabakh; Wing-Kai Hon; Rahul Shah; Walid G. Aref; Jeffrey Scott Vitter
Run-Length Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + |p| + T/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m log_B(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order-of-magnitude reduction in storage, while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
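As a small illustration of the compressed representation being indexed, the sketch below run-length encodes a string and enumerates the suffixes of its run sequence, which is the kind of material a String-B-tree-style index organizes. This is only a conceptual Python sketch; the SBC-tree itself is an external-memory, two-level structure with the I/O bounds stated above.

def rle_encode(s):
    """Run-length encode a string into (char, run_length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def rle_decode(runs):
    return "".join(c * n for c, n in runs)

seq = "aaaabbbccaaa"
runs = rle_encode(seq)
print(runs)                       # [('a', 4), ('b', 3), ('c', 2), ('a', 3)]
assert rle_decode(runs) == seq

# The suffixes of the run sequence are what a String-B-tree-style index would
# organize; the SBC-tree adds a second level (a 3-sided range structure) so
# substring queries run over the compressed form without decompressing it.
suffixes = [runs[i:] for i in range(len(runs))]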
International Conference on Management of Data | 2014
Dongqing Xiao; Mohamed Y. Eltabakh
In this paper, we address the challenges that arise from the growing scale of annotations in scientific databases. On one hand, end-users and scientists are incapable of analyzing and extracting knowledge from the large number of reported annotations, e.g., one tuple may have hundreds of annotations attached to it over time. On the other hand, current annotation management techniques fall short in providing advanced processing over the annotations beyond just propagating them to end-users. To address this limitation, we propose the InsightNotes system, a summary-based annotation management engine for relational databases. InsightNotes integrates data mining and summarization techniques into annotation management in novel ways, with the objective of creating and reporting concise representations (summaries) of the raw annotations. We propose an extended summary-aware query processing engine for efficient manipulation and propagation of the annotation summaries in the query pipeline. We introduce several optimizations for the creation, maintenance, and zoom-in processing over the annotation summaries. InsightNotes is implemented on top of an existing annotation management system, within which it is experimentally evaluated using real-world datasets. The results illustrate significant performance gains from the proposed techniques and optimizations (up to 100x in some operations) compared to the naive approaches.
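The toy function below conveys the flavor of an annotation summary: instead of returning hundreds of raw annotations, it reports a count plus the most frequent keywords. It is an assumed, simplified stand-in for the mining and summarization techniques InsightNotes actually applies.

from collections import Counter
import re

def summarize_annotations(annotations, top_k=3):
    """Toy summary: report an annotation count and the top keywords rather
    than propagating every raw annotation (illustrative only)."""
    words = Counter()
    for text in annotations:
        words.update(re.findall(r"[a-z]+", text.lower()))
    top = [w for w, _ in words.most_common(top_k)]
    return {"count": len(annotations), "keywords": top}

notes = ["value re-measured in 2013", "re-measured after calibration",
         "calibration drift suspected", "verified by lab B"]
print(summarize_annotations(notes))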
Very Large Data Bases | 2016
Hai Liu; Dongqing Xiao; Pankaj Didwania; Mohamed Y. Eltabakh
Big data infrastructures are increasingly supporting datasets that are relatively structured. These datasets are full of correlations among their attributes, which, if managed in systematic ways, would enable optimization opportunities that would otherwise be missed. Unlike relational databases, in which discovering and exploiting correlations in query optimization have been extensively studied, in big data infrastructures such important data properties and their utilization have been mostly abandoned. The key reason is that domain experts may know many correlations, but with a degree of uncertainty (fuzziness or softness). Since the data is big, it is very challenging to validate such correlations, judge their worthiness, and devise strategies for utilizing them in query optimization. Existing techniques for exploiting soft correlations in RDBMSs, e.g., BHUNT, CORDS, and CM, are heavily tailored towards optimizing factors inherent in relational databases, e.g., predicate selectivity and random I/O accesses of secondary indexes, which are issues not applicable to big data infrastructures, e.g., Hadoop. In this paper, we propose the EXORD system to fill this gap by exploiting the data's correlations in big data query optimization. EXORD supports two types of correlations: hard correlations, which are guaranteed to hold for all data records, and soft correlations, which are expected to hold for most, but not all, data records. We introduce a new three-phase approach for (1) validating and judging the worthiness of soft correlations, (2) selecting and preparing the soft correlations for deployment by specially handling the violating data records, and (3) deploying and exploiting the correlations in query optimization. We propose a novel cost-benefit model for adaptively selecting the most beneficial soft correlations w.r.t. a given query workload while minimizing the introduced overhead. We show the complexity of this problem (NP-hard) and propose a heuristic to solve it efficiently in polynomial time. EXORD can be integrated with various state-of-the-art big data query optimization techniques, e.g., indexing and partitioning. The EXORD prototype is implemented as an extension to the Hive engine on top of Hadoop. The experimental evaluation shows the potential of EXORD in achieving more than 10x speedup while introducing minimal storage overhead.
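The following sketch mimics the spirit of the validation phase for a soft correlation: check a candidate rule over the data, collect the violating records into an exception list, and accept the correlation only if the violation ratio is small. The function name and threshold are illustrative assumptions, not EXORD's actual cost-benefit model.

def validate_soft_correlation(records, predicate, max_violation_ratio=0.01):
    """Validation sketch: keep a soft correlation only if few records violate it;
    the violators go into an exception list so later query rewrites stay correct."""
    violations = [r for r in records if not predicate(r)]
    ratio = len(violations) / max(len(records), 1)
    return ratio <= max_violation_ratio, violations

# Hypothetical correlation: shipping happens within 7 days of ordering.
orders = [{"order_day": d, "ship_day": d + 2} for d in range(1000)]
orders.append({"order_day": 5, "ship_day": 40})   # one violating record

ok, exceptions = validate_soft_correlation(
    orders, lambda r: r["ship_day"] - r["order_day"] <= 7)
print(ok, len(exceptions))   # True, 1 -> usable with a one-record exception list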
Very Large Data Bases | 2014
Chuan Lei; Zhongfang Zhuang; Elke A. Rundensteiner; Mohamed Y. Eltabakh
This demonstration presents the Redoop infrastructure, the first full-fledged MapReduce framework with native support for recurring big data queries. Recurring queries, repeatedly executed for long periods of time over evolving high-volume data, have become a bedrock component in most large-scale data analytic applications. Redoop is a comprehensive extension to Hadoop that pushes the support and optimization of recurring queries into Hadoop's core functionality. While backward compatible with regular MapReduce jobs, Redoop achieves an order of magnitude better performance than Hadoop for recurring workloads. Redoop employs innovative window-aware optimization techniques for such recurring workloads, including adaptive window-aware data partitioning, cache-aware task scheduling, and inter-window caching mechanisms. We will demonstrate Redoop's capabilities on a compute cluster against real-life workloads, including click-stream and sensor data analysis.
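The sketch below shows the basic payoff of inter-window caching for a recurring aggregate: per-pane partial results are computed once and reused by every overlapping window, so only the newly arrived pane is processed in the next run. It is a conceptual Python example, not Redoop's task scheduler or partitioning logic.

from collections import defaultdict

class PaneCache:
    """Inter-window caching sketch for a recurring count query: each pane's
    partial count is computed once and shared by all windows covering it."""

    def __init__(self):
        self.pane_counts = {}

    def pane_count(self, pane_id, events):
        if pane_id not in self.pane_counts:               # compute once, then cache
            self.pane_counts[pane_id] = len(events[pane_id])
        return self.pane_counts[pane_id]

    def window_count(self, panes, events):
        return sum(self.pane_count(p, events) for p in panes)

events = defaultdict(list)
for t in range(100):
    events[t // 10].append(t)          # pane id = 10-second bucket of the timestamp

cache = PaneCache()
print(cache.window_count([0, 1, 2], events))   # first window: three panes computed
print(cache.window_count([1, 2, 3], events))   # next window: only pane 3 is new work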
Symposium on Large Spatial Databases | 2013
Dongqing Xiao; Mohamed Y. Eltabakh
With the increasing complexity and wide diversity of spatio-temporal applications, the query processing requirements over spatio-temporal data go beyond the traditional query types, e.g., range, kNN, and aggregation queries along with their variants. Most applications require support for evaluating powerful spatio-temporal pattern queries (STPQs) that form higher-order correlations and compositions of sequences of events to infer real-world semantics of importance to the targeted application. STPQs can be supported neither by traditional spatio-temporal databases (STDBs) nor by modern complex-event-processing (CEP) systems. While the former lack the expressiveness and processing capabilities for handling such complex sequence pattern queries, the latter mostly focus on the time dimension as the driving dimension, and hence lack the power of the special-purpose processing technologies established in STDBs over the past decades. In this paper, we propose an efficient and scalable spatio-temporal engine for complex pattern queries (STEPQ). STEPQ has several innovative features and ideas that will open up research in the area of integrating spatio-temporal databases and complex event processing.
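As a minimal illustration of a spatio-temporal pattern, the sketch below checks whether a trajectory visits a sequence of regions in temporal order. The function and region encoding are assumptions for exposition; STEPQ's pattern language and evaluation engine are far more expressive.

def matches_pattern(trajectory, steps):
    """Sketch of a spatio-temporal sequence pattern: the trajectory must visit
    each region in order, with non-decreasing timestamps (illustrative only)."""
    i = 0
    for t, x, y in sorted(trajectory):          # (time, x, y) samples, time-ordered
        region = steps[i]
        if region["xmin"] <= x <= region["xmax"] and region["ymin"] <= y <= region["ymax"]:
            i += 1                               # this step matched; look for the next
            if i == len(steps):
                return True
    return False

trip = [(1, 2, 2), (5, 10, 11), (9, 25, 24)]
pattern = [{"xmin": 0, "ymin": 0, "xmax": 5, "ymax": 5},       # start near the depot
           {"xmin": 20, "ymin": 20, "xmax": 30, "ymax": 30}]   # later enter zone B
print(matches_pattern(trip, pattern))   # True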