Zuhair Khayyat

King Abdullah University of Science and Technology

Publication

Featured research published by Zuhair Khayyat.

european conference on computer systems | 2013

Mizan: a system for dynamic load balancing in large-scale graph processing

Zuhair Khayyat; Karim Awara; Amani A. AlOnazi; Hani Jamjoom; Dan Williams; Panos Kalnis

Pregel [23] was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that, especially for highly dynamic workloads, Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

international conference on management of data | 2015

BigDansing: A System for Big Data Cleansing

Zuhair Khayyat; Ihab F. Ilyas; Alekh Jindal; Samuel Madden; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Si Yin

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

very large data bases | 2015

Lightning fast and space efficient inequality joins

Zuhair Khayyat; William Lucia; Meghna Singh; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Panos Kalnis

Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there has been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B+-tree, R*-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usually very slow. In this paper, we introduce fast inequality join algorithms. We put columns to be joined in sorted arrays and we use permutation arrays to encode positions of tuples in one sorted array w.r.t. the other sorted array. In contrast to sort-merge join, we use space-efficient bit-arrays that enable optimizations, such as Bloom filter indices, for fast computation of the join results. We have implemented a centralized version of these algorithms on top of PostgreSQL, and a distributed version on top of Spark SQL. We have compared against well-known optimization techniques for inequality joins and show that our solution is more scalable and several orders of magnitude faster.
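
The idea of sorted arrays, a permutation array, and a bit-array can be illustrated with a small sketch. This is not the authors' algorithm as published: it is a simplified, single-machine self-join for the two conditions `a.x < b.x` and `a.y > b.y`, it uses a plain `bytearray` instead of a packed bit-array, and it omits the Bloom filter index mentioned in the abstract. The function name `inequality_self_join` and the `(x, y)` tuple layout are assumptions made for illustration.

```python
from bisect import bisect_right


def inequality_self_join(rows):
    """Return index pairs (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y.

    rows is a list of (x, y) tuples. Illustrative sketch only.
    """
    n = len(rows)

    # Sorted array on x, plus a permutation array mapping each tuple
    # to its position in that sorted order.
    order_x = sorted(range(n), key=lambda i: rows[i][0])
    xs = [rows[i][0] for i in order_x]
    pos_x = [0] * n
    for p, i in enumerate(order_x):
        pos_x[i] = p

    # Second sorted array: tuples visited in ascending y.
    order_y = sorted(range(n), key=lambda i: rows[i][1])

    # bits[p] == 1 means the tuple at x-position p has a strictly smaller y
    # than the tuple currently being probed.
    bits = bytearray(n)
    result = []

    k = 0
    while k < n:
        # Group tuples with equal y so that ties never match (strict >).
        end = k
        while end < n and rows[order_y[end]][1] == rows[order_y[k]][1]:
            end += 1
        group = order_y[k:end]

        for i in group:
            # First x-position holding a strictly larger x than rows[i].x.
            start = bisect_right(xs, rows[i][0])
            for p in range(start, n):
                if bits[p]:
                    result.append((i, order_x[p]))

        # Only after probing, mark this group as "seen" for later tuples.
        for i in group:
            bits[pos_x[i]] = 1
        k = end

    return result
```

For example, `inequality_self_join([(1, 5), (2, 3), (3, 4), (2, 6)])` yields the pairs `(0, 1)`, `(0, 2)`, and `(3, 2)`. The inner scan over the bit-array is still linear in the worst case; the sketch only shows how sorting plus a permutation array replaces a nested loop over raw tuples.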

international conference on management of data | 2016

Rheem: Enabling Multi-Platform Task Execution

D. Agrawal; Lamine Ba; Laure Berti-Equille; Sanjay Chawla; Ahmed K. Elmagarmid; Hossam M. Hammady; Yasser Idris; Zoi Kaoudi; Zuhair Khayyat; Sebastian Kruse; Mourad Ouzzani; Paolo Papotti; Jorge Arnulfo Quiané-Ruiz; Nan Tang; Mohammed Javeed Zaki

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases Rheem, a framework that provides multi-platform task execution for such applications. It features a three-layer data processing abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of Rheem by using real-world scenarios from three different applications, namely, machine learning, data cleaning, and data fusion.

very large data bases | 2017

A survey and experimental comparison of distributed SPARQL engines for very large RDF data

Ibrahim Abdelaziz; Razen Harbi; Zuhair Khayyat; Panos Kalnis

Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.

ieee international conference on high performance computing data and analytics | 2016

Scalemine: scalable parallel frequent subgraph mining in a single large graph

Ehab Abdelhamid; Ibrahim Abdelaziz; Panos Kalnis; Zuhair Khayyat; Fuad Jamour

Frequent Subgraph Mining is an essential operation for graph analytics and knowledge extraction. Due to its high computational cost, parallel solutions are necessary. Existing approaches suffer from either load imbalance or high communication and synchronization overheads. In this paper we propose ScaleMine, a novel parallel frequent subgraph mining system for a single large graph. ScaleMine introduces a novel two-phase approach. The first phase is approximate; it quickly identifies subgraphs that are frequent with high probability, while collecting various statistics. The second phase computes the exact solution by employing the results of the approximation to achieve good load balance, prune the search space, generate efficient execution plans, and guide intra-task parallelism. Our experiments show that ScaleMine scales to 8,192 cores on a Cray XC40 (12× more than competitors), supports graphs with one billion edges (10× larger than competitors), and is at least an order of magnitude faster than existing solutions.

very large data bases | 2017

Errata for Lightning Fast and Space Efficient Inequality Joins (PVLDB 8(13): 2074-2085)

Zuhair Khayyat; William Lucia; Meghna Singh; Mourad Ouzzani; Paolo Papotti; Jorge-Arnulfo Quiané-Ruiz; Nan Tang; Panos Kalnis

This is in response to recent feedback from some readers, which requires some clarifications regarding our IEJoin algorithm published in [1]. The feedback revolves around four points: (1) a typo in our illustrating example of the join process; (2) a naming error for the index used by our algorithm to improve the bit array scan; (3) the sort order used in our algorithms; and (4) a missing explanation on how duplicates are handled by our self join algorithm.

Archive | 2016

Popular Graph Algorithms on Giraph

Sherif Sakr; Faisal Moeen Orakzai; Ibrahim Abdelaziz; Zuhair Khayyat

The PageRank [47] algorithm is a commonly used mechanism for identifying the significance or the authority of vertices in a graph.
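
As a rough illustration of how PageRank assigns significance to vertices, the following is a minimal power-iteration sketch in plain Python. It is not taken from the chapter and does not use the Giraph API; the function name `pagerank`, the adjacency-dictionary input, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    graph maps each vertex to a list of its out-neighbors. Every vertex,
    including those with no outgoing edges, must appear as a key.
    """
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not graph[v])

        # Each vertex sends an equal share of its rank along every out-edge,
        # similar in spirit to message passing in a vertex-centric model.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:
            out = graph[v]
            if out:
                share = rank[v] / len(out)
                for u in out:
                    incoming[u] += share

        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }

    return rank
```

Calling `pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})` gives vertex `c`, which has two incoming edges, a higher score than vertex `b`, which has none. The scores sum to 1 after every iteration.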

Archive | 2016

Advanced Giraph Programming

Sherif Sakr; Faisal Moeen Orakzai; Ibrahim Abdelaziz; Zuhair Khayyat

In this section, we discuss alternative approaches to improve the flexibility of graph algorithms in Giraph and to improve their performance.

Archive | 2016

Related Large-Scale Graph Processing Systems

Sherif Sakr; Faisal Moeen Orakzai; Ibrahim Abdelaziz; Zuhair Khayyat

In practice, the introduction of Google’s Pregel system, followed by Apache Giraph as its open source realization, has inspired the development of various other large-scale graph processing systems.
