Publication


Featured research published by Ashraf Aboulnaga.


Very Large Data Bases | 2012

ReStore: reusing results of MapReduce jobs

Iman Elghandour; Ashraf Aboulnaga

Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.
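
To make the reuse idea concrete, here is a minimal Python sketch of a result repository keyed by a canonical job signature (input paths plus a serialized operator plan): before a job runs, the workflow is rewritten to read any matching materialized output instead of recomputing it. The class and field names are hypothetical illustrations, not ReStore's actual implementation.

```python
# Minimal sketch of the intermediate-result reuse idea, assuming a hypothetical
# repository keyed by a canonical job signature (inputs + operator plan).
import hashlib
import json

class ResultRepository:
    """Maps job signatures to paths of materialized outputs in the DFS."""
    def __init__(self):
        self._store = {}  # signature -> output path

    @staticmethod
    def signature(input_paths, operator_plan):
        key = json.dumps({"inputs": sorted(input_paths), "plan": operator_plan},
                         sort_keys=True)
        return hashlib.sha256(key.encode()).hexdigest()

    def lookup(self, input_paths, operator_plan):
        return self._store.get(self.signature(input_paths, operator_plan))

    def register(self, input_paths, operator_plan, output_path):
        self._store[self.signature(input_paths, operator_plan)] = output_path

def rewrite_workflow(jobs, repo):
    """Replace jobs whose output is already materialized with a read of that output."""
    rewritten = []
    for job in jobs:  # job: dict with 'inputs', 'plan', 'output'
        cached = repo.lookup(job["inputs"], job["plan"])
        if cached is not None:
            # Downstream jobs read the cached result instead of recomputing it.
            job = dict(job, output=cached, skip_execution=True)
        rewritten.append(job)
    return rewritten

repo = ResultRepository()
repo.register(["/data/logs"], "FILTER>GROUP", "/tmp/job1_out")   # hypothetical paths
jobs = [{"inputs": ["/data/logs"], "plan": "FILTER>GROUP", "output": "/tmp/job2_out"}]
print(rewrite_workflow(jobs, repo))  # the job is rewritten to read /tmp/job1_out
```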


International Conference on Data Engineering | 2006

XSEED: Accurate and Fast Cardinality Estimation for XPath Queries

Ning Zhang; M. Tamer Özsu; Ashraf Aboulnaga; Ihab F. Ilyas

We propose XSEED, a synopsis of path queries for cardinality estimation that is accurate, robust, efficient, and adaptive to memory budgets. XSEED starts from a very small kernel, and then incrementally updates information of the synopsis. With such an incremental construction, a synopsis structure can be dynamically configured to accommodate different memory budgets. Cardinality estimation based on XSEED can be performed very efficiently and accurately. Extensive experiments on both synthetic and real data sets show that even with less memory, XSEED could achieve accuracy that is an order of magnitude better than that of other synopsis structures. The cardinality estimation time is under 2% of the actual querying time for a wide range of queries in all test cases.
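
As a rough illustration of how a small path synopsis can drive cardinality estimation, the following toy Python sketch stores per-edge element counts for (parent tag, child tag) pairs and estimates a simple downward path by chaining average fan-outs. It is a simplified stand-in, not the published XSEED structure or its estimation algorithm.

```python
# Toy path-synopsis sketch: record, for each (parent_tag, child_tag) edge, how many
# parent and child elements exist, and estimate simple path cardinalities from
# average fan-outs. Counts below are made up for illustration.
from collections import defaultdict

class PathSynopsis:
    def __init__(self):
        # (parent_tag, child_tag) -> [parent_count, child_count]
        self.edges = defaultdict(lambda: [0, 0])

    def add_edge_stats(self, parent_tag, child_tag, parent_count, child_count):
        self.edges[(parent_tag, child_tag)][0] += parent_count
        self.edges[(parent_tag, child_tag)][1] += child_count

    def estimate(self, path_tags, root_count=1):
        """Estimate cardinality of /t0/t1/.../tn by chaining average fan-outs."""
        estimate = root_count
        for parent, child in zip(path_tags, path_tags[1:]):
            parents, children = self.edges.get((parent, child), (0, 0))
            if parents == 0:
                return 0.0
            estimate *= children / parents  # average fan-out per parent element
        return estimate

syn = PathSynopsis()
syn.add_edge_stats("dblp", "article", 1, 50000)
syn.add_edge_stats("article", "author", 50000, 120000)
print(syn.estimate(["dblp", "article", "author"]))  # ~120000
```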


Conference on Information and Knowledge Management | 2008

Modeling and exploiting query interactions in database systems

Mumtaz Ahmad; Ashraf Aboulnaga; Shivnath Babu; Kamesh Munagala

The typical workload in a database system consists of a mixture of multiple queries of different types, running concurrently and interacting with each other. Hence, optimizing performance requires reasoning about query mixes and their interactions, rather than considering individual queries or query types. In this paper, we show the significant impact that query interactions can have on workload performance. We present a new approach based on planning experiments and statistical modeling to capture the impact of query interactions. This approach requires no prior assumptions about the internal workings of the database system or the nature or cause of query interactions, making it portable across systems. As a concrete demonstration of the potential of capturing, modeling, and exploiting query interactions, we develop a novel interaction-aware query scheduler that targets report-generation workloads in Business Intelligence (BI) settings. Under certain assumptions, the schedule found by this scheduler is within a constant factor of optimal. An experimental evaluation with TPC-H queries on IBM DB2 demonstrates that our scheduler consistently outperforms (up to 4x) conventional schedulers that do not account for query interactions.
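
The following Python sketch illustrates the experiment-driven flavor of this approach under simplifying assumptions: sample query mixes, measure a performance metric for each mix, and fit a statistical model over mix composition with pairwise interaction features, without looking inside the DBMS. The run_mix function is a hypothetical stand-in for running a mix on a real system.

```python
# Hedged sketch: sample query mixes, measure a metric per mix, and fit a model
# that predicts the metric from mix composition alone (no DBMS internals).
import numpy as np

def run_mix(mix_counts):
    # Placeholder for "run this mix on the DBMS and measure completion time".
    # A fake response keeps the sketch self-contained.
    weights = np.array([1.0, 2.5, 4.0])                # pretend per-type base costs
    interaction = 0.3 * mix_counts[0] * mix_counts[2]  # a positive interaction
    return float(weights @ mix_counts + interaction + np.random.normal(0, 0.5))

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(200):                     # planned sampling experiments
    mix = rng.integers(0, 5, size=3)     # counts of 3 query types in the mix
    X.append(mix)
    y.append(run_mix(mix))

# Quadratic feature expansion lets a linear fit capture pairwise interactions.
X = np.array(X, dtype=float)
features = np.hstack([X,
                      X[:, [0]] * X[:, [1]],
                      X[:, [0]] * X[:, [2]],
                      X[:, [1]] * X[:, [2]],
                      np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(features, np.array(y), rcond=None)
print("learned coefficients:", np.round(coef, 2))
```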


International Conference on Data Engineering | 2013

Scalable maximum clique computation using MapReduce

Jingen Xiang; Cong Guo; Ashraf Aboulnaga

We present a scalable and fault-tolerant solution for the maximum clique problem based on the MapReduce framework. The key contribution that enables us to effectively use MapReduce is a recursive partitioning method that partitions the graph into several subgraphs of similar size. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant.
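
A single-process Python sketch of the two ingredients named in the abstract is shown below, under simplifying assumptions: a partitioning step that gives each vertex the subgraph induced by its higher-ordered neighbors (so partitions can be searched independently, as map tasks would), and a basic branch-and-bound clique search per partition. It illustrates the logic only; the paper's MapReduce implementation and balanced recursive partitioning are not reproduced here.

```python
# Single-process illustration of logic that would run as MapReduce tasks.
def partitions(adj):
    order = sorted(adj)                       # any total order on vertices
    rank = {v: i for i, v in enumerate(order)}
    for v in order:
        later = {u for u in adj[v] if rank[u] > rank[v]}
        yield v, {u: adj[u] & later for u in later}

def max_clique(adj, current=frozenset(), candidates=None, best=frozenset()):
    if candidates is None:
        candidates = set(adj)
    if len(current) + len(candidates) <= len(best):   # bound: cannot beat best
        return best
    if not candidates:
        return max(best, current, key=len)
    v = next(iter(candidates))
    # Branch 1: include v (candidates shrink to v's neighbors).
    best = max_clique(adj, current | {v}, candidates & adj[v], best)
    # Branch 2: exclude v.
    best = max_clique(adj, current, candidates - {v}, best)
    return best

graph = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}, 5: {1}}
best = frozenset()
for v, sub in partitions(graph):                      # "map" over partitions
    best = max(best, max_clique(sub) | {v}, key=len)  # "reduce": keep the largest
print(sorted(best))                                   # [1, 2, 3, 4]
```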


Very Large Data Bases | 2004

Automated statistics collection in DB2 UDB

Ashraf Aboulnaga; Peter J. Haas; Mokhtar Kandil; Sam Lightstone; Guy M. Lohman; Volker Markl; Ivan Popivanov; Vijayshankar Raman

The use of inaccurate or outdated database statistics by the query optimizer in a relational DBMS often results in a poor choice of query execution plans and hence unacceptably long query processing times. Configuration and maintenance of these statistics has traditionally been a time-consuming manual operation, requiring that the database administrator (DBA) continually monitor query performance and data changes in order to determine when to refresh the statistics values and when and how to adjust the set of statistics that the DBMS maintains. In this paper we describe the new Automated Statistics Collection (ASC) component of IBM DB2 Universal Database (DB2 UDB). This autonomic technology frees the DBA from the tedious task of manually supervising the collection and maintenance of database statistics. ASC monitors both the update-delete-insert (UDI) activities on the data and query feedback (QF), i.e., the results of the queries that are executed on the data. ASC uses these two sources of information to automatically decide which statistics to collect and when to collect them. This combination of UDI-driven and QF-driven autonomic processes ensures that the system can handle unforeseen queries while also ensuring good performance for frequent and important queries. We present the basic concepts, architecture, and key implementation details of ASC in DB2 UDB, and present a case study showing how the use of ASC can speed up a query workload by orders of magnitude without requiring any DBA intervention.
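
A toy Python sketch of the two triggers described above follows, with made-up thresholds: flag a table for statistics refresh when its UDI churn is large relative to its size, or when query feedback shows large cardinality estimation errors. The class, thresholds, and numbers are illustrative assumptions, not DB2's ASC logic.

```python
# Illustrative sketch of UDI-driven and QF-driven triggers with assumed thresholds.
UDI_CHURN_THRESHOLD = 0.20      # assumed: 20% of rows modified since last collection
QF_ERROR_THRESHOLD = 4.0        # assumed: 4x estimation error triggers a refresh

class StatsAdvisor:
    def __init__(self):
        self.udi_counts = {}     # table -> rows modified since last collection
        self.table_sizes = {}    # table -> approximate cardinality
        self.feedback = {}       # table -> list of (estimated, actual) pairs

    def record_udi(self, table, rows_modified, table_size):
        self.udi_counts[table] = self.udi_counts.get(table, 0) + rows_modified
        self.table_sizes[table] = table_size

    def record_query_feedback(self, table, estimated, actual):
        self.feedback.setdefault(table, []).append((estimated, actual))

    def tables_needing_stats(self):
        stale = set()
        for table, churn in self.udi_counts.items():
            if churn / max(self.table_sizes.get(table, 1), 1) > UDI_CHURN_THRESHOLD:
                stale.add(table)
        for table, pairs in self.feedback.items():
            worst = max(max(e, a) / max(min(e, a), 1) for e, a in pairs)
            if worst > QF_ERROR_THRESHOLD:
                stale.add(table)
        return stale

advisor = StatsAdvisor()
advisor.record_udi("orders", rows_modified=300_000, table_size=1_000_000)
advisor.record_query_feedback("lineitem", estimated=1_000, actual=80_000)
print(advisor.tables_needing_stats())   # {'orders', 'lineitem'} (order may vary)
```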


Very Large Data Bases | 2011

Interaction-aware scheduling of report-generation workloads

Mumtaz Ahmad; Ashraf Aboulnaga; Shivnath Babu; Kamesh Munagala

The typical workload in a database system consists of a mix of multiple queries of different types that run concurrently. Interactions among the different queries in a query mix can have a significant impact on database performance. Hence, optimizing database performance requires reasoning about query mixes rather than considering queries individually. Current database systems lack the ability to do such reasoning. We propose a new approach based on planning experiments and statistical modeling to capture the impact of query interactions. Our approach requires no prior assumptions about the internal workings of the database system or the nature and cause of query interactions, making it portable across systems. To demonstrate the potential of modeling and exploiting query interactions, we have developed a novel interaction-aware query scheduler for report-generation workloads. Our scheduler, called QShuffler, uses two query scheduling algorithms that leverage models of query interactions. The first algorithm is optimized for workloads where queries are submitted in large batches. The second algorithm targets workloads where queries arrive continuously, and scheduling decisions have to be made online. We report an experimental evaluation of QShuffler using TPC-H workloads running on IBM DB2. The evaluation shows that QShuffler, by modeling and exploiting query interactions, can consistently outperform (up to 4x) query schedulers in current database systems.
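
The sketch below shows, in Python and under assumed numbers, the kind of greedy interaction-aware decision such a scheduler can make in the batch setting: given a model that predicts the cost of a query mix, admit the queued query type whose admission yields the cheapest predicted mix. The cost model here is a hypothetical stand-in, not QShuffler's learned model.

```python
# Toy interaction-aware scheduling decision with a stand-in mix cost model.
from collections import Counter

def predicted_mix_cost(mix):
    # Hypothetical model: "heavy" queries interact badly with each other.
    base = {"light": 1.0, "medium": 3.0, "heavy": 8.0}
    cost = sum(base[t] * n for t, n in mix.items())
    cost += 5.0 * mix.get("heavy", 0) * max(mix.get("heavy", 0) - 1, 0)  # interaction
    return cost

def schedule_next(current_mix, queued_types):
    """Pick the queued query type whose admission minimizes predicted cost."""
    best_type, best_cost = None, float("inf")
    for qtype in set(queued_types):
        candidate = Counter(current_mix)
        candidate[qtype] += 1
        cost = predicted_mix_cost(candidate)
        if cost < best_cost:
            best_type, best_cost = qtype, cost
    return best_type

running = Counter({"heavy": 1, "light": 2})
queue = ["heavy", "medium", "light"]
print(schedule_next(running, queue))   # admits "light": avoids a heavy-heavy interaction
```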


International Conference on Data Engineering | 2008

Database systems on virtual machines: How much do you lose?

Umar Farooq Minhas; Jitendra Yadav; Ashraf Aboulnaga; Kenneth Salem

Virtual machine technologies offer simple and practical mechanisms to address many manageability problems in database systems. For example, these technologies allow for server consolidation, easier deployment, and more flexible provisioning. Therefore, database systems are increasingly being run on virtual machines. This offers many opportunities for researchers in self-managing database systems, but it is also important to understand the cost of virtualization. In this paper, we present an experimental study of the overhead of running a database workload on a virtual machine. We show that the average overhead is less than 10%, and we present details of the different causes of this overhead. Our study shows that the manageability benefits of virtualization come at an acceptable cost.


International Congress on Big Data | 2013

Towards Cloud-Based Analytics-as-a-Service (CLAaaS) for Big Data Analytics in the Cloud

Farhana H. Zulkernine; Patrick Martin; Ying Zou; Michael Anthony Bauer; Femida Gwadry-Sridhar; Ashraf Aboulnaga

Data analytics has proven its importance in knowledge discovery and decision support across different data and application domains. Big data analytics poses a serious challenge in terms of the necessary hardware and software resources. Cloud technology today offers a promising solution to this challenge by enabling ubiquitous and scalable provisioning of computing resources. However, further challenges remain to be addressed, such as the availability of the required analytic software for various application domains, estimation and subscription of the necessary resources for an analytic job or workflow, management of data in the cloud, and the design, verification, and execution of analytic workflows. We present a taxonomy for analytic workflow systems to highlight the important features in existing systems. Based on the taxonomy and a study of existing analytic software and systems, we propose the conceptual architecture of CLoud-based Analytics-as-a-Service (CLAaaS), a big data analytics service provisioning platform in the cloud. We outline the features that are important for CLAaaS as a service provisioning system, such as user- and domain-specific customization and assistance, collaboration, a modular architecture for scalable deployment, and Service Level Agreements.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Case study of scientific data processing on a cloud using hadoop

Chen Zhang; Hans De Sterck; Ashraf Aboulnaga; Haig Djambazian; Robert Sladek

With the increasing popularity of cloud computing, Hadoop has become a widely used open source cloud computing framework for large scale data processing. However, few efforts have been made to demonstrate the applicability of Hadoop to real-world application scenarios in fields other than server-side computations such as web indexing. In this paper, we use the Hadoop cloud computing framework to develop a user application that allows processing of scientific data on clouds. We describe a simple extension to Hadoop's MapReduce that allows it to handle scientific data processing problems with arbitrary input formats and explicit control over how the input is split. We use this approach to develop a Hadoop-based cloud computing application that processes sequences of microscope images of live cells, and we test its performance. We also discuss how the approach can be generalized to more complicated scientific data processing problems.
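
The pattern described above can be illustrated with a small single-process Python sketch: treat each scientific input file (for example, a microscope image) as an indivisible record, control splitting explicitly by grouping whole files into splits, and run a map function per split. The directory name and process_image function are hypothetical placeholders.

```python
# Single-process sketch of explicit split control: whole files are grouped into
# splits so a binary image is never cut mid-file, and a map function handles each split.
import os

def make_splits(input_dir, files_per_split=4):
    """Group whole files into splits of a fixed size."""
    files = sorted(os.path.join(input_dir, f) for f in os.listdir(input_dir))
    return [files[i:i + files_per_split] for i in range(0, len(files), files_per_split)]

def process_image(path):
    # Stand-in for the real per-image analysis (segmentation, tracking, ...).
    return path, os.path.getsize(path)

def map_split(split):
    """What a single map task would do with its split of image files."""
    return [process_image(path) for path in split]

if __name__ == "__main__":
    input_dir = "cell_images"              # hypothetical input directory
    if os.path.isdir(input_dir):
        for split in make_splits(input_dir):
            for path, size in map_split(split):
                print(path, size)
```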


International Conference on Autonomic Computing | 2011

A Bayesian approach to online performance modeling for database appliances using Gaussian models

Muhammad Bilal Sheikh; Umar Farooq Minhas; Omar Zia Khan; Ashraf Aboulnaga; Pascal Poupart; David J. Taylor

In order to meet service level agreements (SLAs) and to maintain peak performance for database management systems (DBMS), database administrators (DBAs) need to implement policies for effective workload scheduling, admission control, and resource provisioning. Accurately predicting response times of DBMS queries is necessary for a DBA to effectively achieve these goals. This task is particularly challenging due to the fact that a database workload typically consists of many concurrently running queries and an accurate model needs to capture their interactions. Additional challenges are introduced when DBMSes are run in dynamic cloud computing environments, where workload, data, and physical resources can change frequently, on-the-fly. Building an efficient and highly accurate online DBMS performance model that is robust in the face of changing workloads, data evolution, and physical resource allocations is still an unsolved problem. In this work, our goal is to build such an online performance model for database appliances using an experiment-driven modeling approach. We use a Bayesian approach and build novel Gaussian models that take into account the interaction among concurrently executing queries and predict response times of individual DBMS queries. A key feature of our modeling approach is that the models can be updated online in response to new queries or data, or changing resource allocations. We experimentally demonstrate that our models are accurate and effective -- our best models have an average prediction error of 16.3% in the worst case.
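
As a hedged illustration of the general modeling idea, the Python sketch below fits a Bayesian linear-Gaussian model from query-mix features to response time, with a conjugate update that can be applied online as new observations arrive. The features, priors, and noise level are assumptions for illustration, not the paper's exact Gaussian models.

```python
# Bayesian linear-Gaussian model with an online conjugate update, as a sketch of
# predicting query response times from concurrent-mix features.
import numpy as np

class OnlineGaussianModel:
    def __init__(self, n_features, prior_var=10.0, noise_var=1.0):
        self.noise_var = noise_var
        self.precision = np.eye(n_features) / prior_var   # posterior precision
        self.b = np.zeros(n_features)                      # precision-weighted mean

    def update(self, x, y):
        """Incorporate one observation (mix features x, measured response time y)."""
        x = np.asarray(x, dtype=float)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * y / self.noise_var

    def predict(self, x):
        """Posterior predictive mean and variance of the response time."""
        x = np.asarray(x, dtype=float)
        cov = np.linalg.inv(self.precision)
        mean = x @ cov @ self.b
        var = self.noise_var + x @ cov @ x
        return mean, var

model = OnlineGaussianModel(n_features=3)
model.update([2, 1, 0], 12.0)   # e.g. counts of 3 query types running alongside
model.update([0, 1, 2], 30.0)
print(model.predict([1, 1, 1]))
```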

Collaboration


Dive into Ashraf Aboulnaga's collaborations.

Top Co-Authors

Kenneth Salem

University of Waterloo

Marco Serafini

Qatar Computing Research Institute
