Ibrahim Abdelaziz
King Abdullah University of Science and Technology
Publications
Featured research published by Ibrahim Abdelaziz.
Very Large Data Bases | 2016
Razen Harbi; Ibrahim Abdelaziz; Panos Kalnis; Nikos Mamoulis; Yasser Ebrahim; Majed Sahli
State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase, leading to a high startup cost. Apriori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system, which addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, which distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support the fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems become online; and (3) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in subseconds.
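To make the initial partitioning step concrete, here is a minimal Python sketch (not AdPart's actual code) of the idea the abstract describes: triples are assigned to workers by hashing their subject, so all triples sharing a subject are co-located and subject-star joins can run fully in parallel without communication.

```python
# Sketch of subject-hash partitioning for RDF triples (illustrative only;
# not AdPart's implementation). Triples sharing a subject land on the same
# worker, so star joins on subjects need no inter-worker communication.
from collections import defaultdict

def partition_by_subject(triples, num_workers):
    """Assign each (subject, predicate, object) triple to a worker by hash(subject)."""
    workers = defaultdict(list)
    for s, p, o in triples:
        workers[hash(s) % num_workers].append((s, p, o))
    return workers

triples = [
    ("alice", "knows",   "bob"),
    ("alice", "worksAt", "KAUST"),
    ("bob",   "knows",   "carol"),
]
parts = partition_by_subject(triples, num_workers=4)
# Both "alice" triples fall on the same worker, so a star query such as
# { alice knows ?x . alice worksAt ?y } is evaluated locally on that worker.
for worker_id, ts in sorted(parts.items()):
    print(worker_id, ts)
```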
Very Large Data Bases | 2015
Razen Harbi; Ibrahim Abdelaziz; Panos Kalnis; Nikos Mamoulis
Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication and it is performed in an initial, data pre-processing phase. Therefore, the resulting partitions are static and do not adapt to changes in the query workload; as a result, existing systems are unable to consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system which addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically minimizes the startup cost, while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds. In this demonstration, the audience can use AdHash's graphical interface to verify its performance superiority over state-of-the-art distributed RDF systems.
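The adaptive part can be pictured with a toy sketch: count how often query patterns force remote data access, replicate the hottest ones, and evict under a replication budget. The class name, threshold, and eviction rule below are invented for illustration and are not AdHash's actual policy.

```python
# Toy illustration of workload-adaptive replication with eviction
# (names and thresholds are hypothetical; this is not AdHash's code).
from collections import Counter

class HotPatternCache:
    def __init__(self, budget, hot_threshold=3):
        self.budget = budget              # max number of replicated patterns
        self.hot_threshold = hot_threshold
        self.counts = Counter()           # how often each pattern caused communication
        self.replicated = set()

    def record_access(self, pattern):
        self.counts[pattern] += 1
        if self.counts[pattern] >= self.hot_threshold and pattern not in self.replicated:
            self._replicate(pattern)

    def _replicate(self, pattern):
        if len(self.replicated) >= self.budget:
            # Evict the least frequently accessed replicated pattern.
            coldest = min(self.replicated, key=lambda p: self.counts[p])
            self.replicated.discard(coldest)
        self.replicated.add(pattern)

cache = HotPatternCache(budget=2)
for q in ["?x knows ?y", "?x knows ?y", "?x knows ?y", "?x worksAt ?z"]:
    cache.record_access(q)
print(cache.replicated)   # the frequently accessed pattern is now replicated
```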
Very Large Data Bases | 2015
Ibrahim Abdelaziz; Razen Harbi; Semih Salihoglu; Panos Kalnis; Nikos Mamoulis
A growing number of applications require combining SPARQL queries with generic graph search on RDF data. However, the lack of procedural capabilities in SPARQL makes it inappropriate for graph analytics. Moreover, RDF engines focus on SPARQL query evaluation whereas graph management frameworks perform only generic graph computations. In this work, we bridge the gap by introducing SPARTex, an RDF analytics framework based on the vertex-centric computation model. In SPARTex, user-defined vertex-centric programs can be invoked from SPARQL as stored procedures. SPARTex allows the execution of a pipeline of graph algorithms without the need for multiple reads/writes of input data and intermediate results. We use a cost-based optimizer for minimizing the communication cost. SPARTex evaluates queries that combine SPARQL and generic graph computations orders of magnitude faster than existing RDF engines. We demonstrate a real system prototype of SPARTex running on a local cluster using real and synthetic datasets. SPARTex has a real-time graphical user interface that allows participants to write regular SPARQL queries, use our proposed SPARQL extension to declaratively invoke graph algorithms, or combine and pipeline both SPARQL querying and generic graph analytics.
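The pipelining idea can be sketched generically: a graph computation produces per-vertex properties that feed a declarative filter directly, with no intermediate materialization. The sketch below is only an analogy in plain Python; SPARTex itself exposes this through its SPARQL extension and stored procedures.

```python
# Illustrative sketch of pipelining a generic graph computation with a
# query-style filter, keeping intermediate results in memory
# (an analogy for the idea above, not SPARTex's API).
graph = {"a": ["b", "c"], "b": ["c"], "c": []}   # adjacency list

def out_degree(g):
    """A stand-in 'graph algorithm' producing a per-vertex property."""
    return {v: len(nbrs) for v, nbrs in g.items()}

def select(props, predicate):
    """A stand-in declarative filter over the computed vertex properties."""
    return [v for v, val in props.items() if predicate(val)]

# Pipeline: the algorithm's output feeds the filter directly, with no
# intermediate writes of input data or results.
hubs = select(out_degree(graph), lambda d: d >= 2)
print(hubs)   # ['a']
```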
IEEE Transactions on Parallel and Distributed Systems | 2017
Ibrahim Abdelaziz; Razen Harbi; Semih Salihoglu; Panos Kalnis
Modern applications require sophisticated analytics on RDF graphs that combine structural queries with generic graph computations. Existing systems support either declarative SPARQL queries, or generic graph processing, but not both. We bridge the gap by introducing Spartex, a versatile framework for complex RDF analytics. Spartex extends SPARQL to seamlessly combine generic graph algorithms (e.g., PageRank, Shortest Paths, etc.) with SPARQL queries. Spartex builds on existing vertex-centric graph processing frameworks, such as GraphLab or Pregel. It implements a generic SPARQL operator as a vertex-centric program that interprets SPARQL queries and executes them efficiently using a built-in optimizer. In addition, any graph algorithm implemented in the underlying vertex-centric framework can be executed in Spartex. We present various scenarios where our framework significantly simplifies the implementation of complex RDF data analytics programs. We demonstrate that Spartex scales to datasets with billions of edges, and show that our core SPARQL engine is at least as fast as the state-of-the-art specialized RDF engines. For complex analytical tasks that combine generic graph processing with SPARQL, Spartex is at least an order of magnitude faster than existing alternatives.
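For readers unfamiliar with the vertex-centric (Pregel-style) model the abstract builds on, here is a minimal PageRank written in that style in plain Python. It is a generic illustration of the kind of user-defined program such frameworks run, not Spartex's code, and it omits refinements such as dangling-node handling.

```python
# Minimal Pregel-style vertex-centric PageRank: in each superstep every
# vertex sends its rank share to its neighbors, then updates its own rank
# from the messages it received. Illustrative only; not Spartex's API.
def pagerank_vertex_centric(graph, iterations=10, d=0.85):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):                # one "superstep" per iteration
        messages = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():          # each vertex sends rank/out_degree
            if nbrs:
                share = rank[v] / len(nbrs)
                for u in nbrs:
                    messages[u] += share
        rank = {v: (1 - d) / n + d * messages[v] for v in graph}
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pagerank_vertex_centric(graph))
```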
Very Large Data Bases | 2017
Ibrahim Abdelaziz; Razen Harbi; Zuhair Khayyat; Panos Kalnis
Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
Ehab Abdelhamid; Ibrahim Abdelaziz; Panos Kalnis; Zuhair Khayyat; Fuad Jamour
Frequent Subgraph Mining is an essential operation for graph analytics and knowledge extraction. Due to its high computational cost, parallel solutions are necessary. Existing approaches suffer either from load imbalance or from high communication and synchronization overheads. In this paper, we propose ScaleMine, a novel parallel frequent subgraph mining system for a single large graph. ScaleMine introduces a novel two-phase approach. The first phase is approximate; it quickly identifies subgraphs that are frequent with high probability, while collecting various statistics. The second phase computes the exact solution by employing the results of the approximation to achieve good load balance, prune the search space, generate efficient execution plans, and guide intra-task parallelism. Our experiments show that ScaleMine scales to 8,192 cores on a Cray XC40 (12× more than competitors), supports graphs with one billion edges (10× larger than competitors), and is at least an order of magnitude faster than existing solutions.
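The approximate-then-exact pattern can be shown with a drastically simplified sketch: estimate pattern frequencies from a sample, then spend exact counting effort only on the candidates the sample suggests. The sketch mines single-edge label "patterns" rather than general subgraphs and uses a naive support count, so it illustrates the two-phase structure only, not ScaleMine's actual mining algorithm.

```python
# Much-simplified illustration of a two-phase (approximate, then exact)
# frequent pattern search; real frequent subgraph mining uses general
# subgraph patterns and proper support metrics.
import random
from collections import Counter

edges = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "cites", "d"),
         ("d", "knows", "e"), ("e", "cites", "a"), ("a", "knows", "c")]
tau = 2   # support threshold

# Phase 1: estimate pattern frequencies from a random sample of the graph.
sample = random.sample(edges, k=len(edges) // 2)
estimated = Counter(label for _, label, _ in sample)
candidates = {p for p, c in estimated.items() if c >= 1}   # likely-frequent patterns

# Phase 2: exact counting, restricted to the candidates found in phase 1.
exact = Counter(label for _, label, _ in edges if label in candidates)
frequent = [p for p, c in exact.items() if c >= tau]
print(frequent)
```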
International Conference on Management of Data | 2017
Essam Mansour; Ibrahim Abdelaziz; Mourad Ouzzani; Ashraf Aboulnaga; Panos Kalnis
There has been a proliferation of datasets available as interlinked RDF data accessible through SPARQL endpoints. This has led to the emergence of various applications in life science, distributed social networks, and the Internet of Things that need to integrate data from multiple endpoints. We will demonstrate Lusail, a system that supports the need of emerging applications to access tens to hundreds of geo-distributed datasets. Lusail is a geo-distributed graph engine for querying linked RDF data. Lusail delivers outstanding performance using (i) a novel locality-aware query decomposition technique that minimizes the intermediate data to be accessed by the subqueries, and (ii) selectivity-awareness and parallel query execution to reduce network latency and to increase parallelism. During the demo, the audience will be able to query actually deployed RDF endpoints as well as large synthetic and real benchmarks that we have deployed in the public cloud. The demo will also show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.
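The locality-aware decomposition idea can be sketched as grouping a query's triple patterns by the endpoint that actually holds their matching instances, so co-located patterns are shipped together and joined at the source. The endpoint URLs and patterns below are hypothetical, and Lusail's real decomposition is considerably more sophisticated.

```python
# Illustrative sketch of decomposing triple patterns into per-endpoint
# subqueries based on where matching data lives (hypothetical endpoints;
# not Lusail's implementation).
from collections import defaultdict

# Which endpoint holds instances matching each triple pattern
# (assumed known or probed in advance).
pattern_location = {
    "?drug :interactsWith ?protein": "http://endpoint-A/sparql",
    "?protein :encodedBy ?gene":     "http://endpoint-A/sparql",
    "?gene :studiedIn ?paper":       "http://endpoint-B/sparql",
}

subqueries = defaultdict(list)
for pattern, endpoint in pattern_location.items():
    subqueries[endpoint].append(pattern)

# Patterns co-located at endpoint A are sent together as one subquery, so
# their join happens at the source instead of shipping both result sets.
for endpoint, patterns in subqueries.items():
    print(endpoint, "<-", " . ".join(patterns))
```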
Symposium on Cloud Computing | 2017
Amedeo Sapio; Ibrahim Abdelaziz; Marco Canini; Panos Kalnis
Many data center applications nowadays rely on distributed computation models like MapReduce and Bulk Synchronous Parallel (BSP) for data-intensive computation at scale [4]. These models scale by leveraging the partition/aggregate pattern, where data and computations are distributed across many worker servers, each performing part of the computation. A communication phase is needed each time workers need to synchronize the computation and, finally, to produce the final output. In these applications, the network communication costs can be one of the dominant scalability bottlenecks, especially in the case of multi-stage or iterative computations [1]. The advent of flexible networking hardware and expressive data plane programming languages has produced networks that are deeply programmable [2]. This creates the opportunity to co-design distributed systems with their network layer, which can offer substantial performance benefits. A possible use of this emerging technology is to execute the logic traditionally associated with the application layer in the network itself. Given that in the above-mentioned applications the intermediate results are necessarily exchanged through the network, it is desirable to offload part of the aggregation task to the network to reduce the traffic and lessen the work of the servers. However, these programmable networking devices typically have very stringent constraints on the number and type of operations that can be performed at line rate. Moreover, packet processing at high speed requires a very fast memory, such as TCAM or SRAM, which is expensive and usually available only in small capacities.
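To make the partition/aggregate pattern concrete, here is a toy Python emulation (not the system described above, which targets programmable switches): workers emit partial aggregates and an intermediate hop folds them together before the coordinator sees them, shrinking the traffic that crosses the last link.

```python
# Toy emulation of in-network aggregation for the partition/aggregate pattern.
# Purely illustrative; real data-plane programs face strict limits on memory
# and per-packet operations, as the text notes.
def worker(shard):
    # Each worker computes a partial aggregate over its data shard.
    return sum(shard)

def switch_aggregate(partials):
    # The logic offloaded to the network: fold many partial results into one.
    return sum(partials)

shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [worker(s) for s in shards]      # 3 messages leave the workers
combined = switch_aggregate(partials)       # 1 message reaches the coordinator
print(partials, "->", combined)             # [6, 9, 30] -> 45
```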
International Conference on Data Engineering | 2017
Ibrahim Abdelaziz; Essam Mansour; Mourad Ouzzani; Ashraf Aboulnaga; Panos Kalnis
Applications in life sciences, decentralized social networks, Internet of Things, and statistical linked dataspaces integrate data from multiple decentralized RDF graphs via SPARQL queries. Several approaches have been proposed to optimize query processing over a small number of heterogeneous data sources by utilizing schema information. In the case of schema similarity and interlinks among sources, these approaches cause unnecessary data retrieval and communication, leading to poor scalability and response time. This paper addresses these limitations and presents Lusail, a system for scalable and efficient SPARQL query processing over decentralized graphs. Lusail achieves scalability and low query response time through various optimizations at compile and run times. At compile time, we use a novel locality-aware query decomposition technique that maximizes the number of query triple patterns sent together to a source based on the actual location of the instances satisfying these triple patterns. At run time, we use selectivity-awareness and parallel query execution to reduce network latency and to increase parallelism by delaying the execution of subqueries expected to return large results. We evaluate Lusail using real and synthetic benchmarks, with data sizes up to billions of triples on an in-house cluster and a public cloud. We show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.
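The run-time optimization mentioned above can be pictured with a small ordering sketch: subqueries expected to be selective execute first, while those expected to return large results are delayed so that bindings from the earlier ones can constrain them. The cardinality estimates below are assumed to be given; this is not how Lusail derives or uses them internally.

```python
# Sketch of selectivity-aware scheduling: run subqueries with small expected
# results first, delay the expensive ones (estimates are assumed inputs here;
# not Lusail's actual cost model).
subqueries = [
    {"subquery": "patterns at endpoint B", "estimated_rows": 2_000_000},
    {"subquery": "patterns at endpoint A", "estimated_rows": 300},
    {"subquery": "patterns at endpoint C", "estimated_rows": 12_000},
]

# Order by estimated result size: selective subqueries run (and ship results)
# first; the large one runs last, once earlier bindings can prune it.
execution_order = sorted(subqueries, key=lambda q: q["estimated_rows"])
for q in execution_order:
    print(q["subquery"], q["estimated_rows"])
```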
Very Large Data Bases | 2018
Fuad Jamour; Ibrahim Abdelaziz; Panos Kalnis
Existing RDF engines follow one of two design paradigms: relational or graph-based. Such engines are typically designed for specific hardware architectures, mainly CPUs, and are not easily portable to new architectures. Porting an existing engine to a different architecture (e.g., many-core architectures) entails an almost complete redesign from scratch. We explore sparse matrix algebra as a third paradigm for designing a portable, scalable, and efficient RDF engine. We demonstrate MAGiQ, a matrix algebra approach for evaluating complex SPARQL queries over large RDF datasets. MAGiQ represents an RDF graph as a sparse matrix, and translates SPARQL queries to matrix algebra programs. MAGiQ effortlessly takes advantage of the existing rich software infrastructure for processing sparse matrices, which is optimized for many architectures (e.g., CPUs, GPUs, distributed platforms). This demo motivates the adoption of matrix algebra in RDF graph processing by showing MAGiQ's performance with different matrix algebra backend engines. MAGiQ, using a GPU, is orders of magnitude faster in solving complex queries on a billion-edge graph than state-of-the-art RDF systems.
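The matrix-algebra view can be sketched briefly: represent each predicate as a sparse boolean adjacency matrix over the entities, so a two-hop pattern such as "?x :knows ?y . ?y :worksAt ?z" becomes a sparse matrix product. The snippet below uses SciPy as a stand-in backend and toy data; MAGiQ's actual query translation and backend engines are more general.

```python
# Sketch of evaluating a two-hop SPARQL pattern as a sparse matrix product
# (illustrative; not MAGiQ's implementation or data model).
import numpy as np
from scipy.sparse import csr_matrix

entities = {"alice": 0, "bob": 1, "kaust": 2}
n = len(entities)

def adjacency(pairs):
    """One sparse boolean adjacency matrix per predicate."""
    rows = [entities[s] for s, _ in pairs]
    cols = [entities[o] for _, o in pairs]
    return csr_matrix((np.ones(len(pairs)), (rows, cols)), shape=(n, n))

knows    = adjacency([("alice", "bob")])
works_at = adjacency([("bob", "kaust")])

# ?x :knows ?y . ?y :worksAt ?z  ->  one sparse matrix multiplication
reachable = knows @ works_at
for x, z in zip(*reachable.nonzero()):
    print(x, "->", z)   # alice (0) -> kaust (2)
```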