Mostafa Bamha
University of Orléans
Publications
Featured research published by Mostafa Bamha.
Parallel Processing Letters | 2003
Mostafa Bamha; Matthieu Exbrayat
Most standard parallel join algorithms try to overcome data skew with a relatively static approach: the way they distribute data (and then computation) over nodes depends on a redistribution scheme (hashing or range partitioning) that is fixed before the actual join begins. In contrast, our approach pre-scans the data in order to choose an efficient join method for each value of the join attribute. This approach has already proved efficient, both theoretically and practically, in our previous papers. In this paper we introduce a new pipelined version of our frequency-adaptive join algorithm. Pipelining offers flexible strategies for resource allocation while avoiding unnecessary disk input/output of intermediate join results when computing multi-join queries. We present a detailed version of the algorithm and a cost analysis based on the BSP (Bulk Synchronous Parallel) model, showing that for multi-join queries the pipelined algorithm achieves noticeable improvements over its non-pipelined parallel predecessor while guaranteeing perfect load-balancing properties.
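The pre-scan idea can be sketched as follows. This is a toy single-process illustration, not the paper's BSP algorithm; the function name, the `threshold` parameter and the two-way strategy split are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def frequency_adaptive_join(r, s, threshold=2):
    """Toy sketch: pick a per-value join strategy from a frequency pre-scan.

    r, s: lists of (key, payload) tuples. Values whose combined frequency
    exceeds `threshold` would, in the real algorithm, be routed to a
    dedicated skew-handling strategy; here we only separate the two paths
    to illustrate the pre-scan."""
    freq = Counter(k for k, _ in r) + Counter(k for k, _ in s)
    high = {k for k, c in freq.items() if c > threshold}   # skewed values

    # Low-frequency values: a standard hash join.
    s_index = defaultdict(list)
    for k, v in s:
        if k not in high:
            s_index[k].append(v)
    result = [(k, rv, sv) for k, rv in r if k not in high for sv in s_index[k]]

    # High-frequency values: handled separately (stand-in for the adaptive path).
    s_high = defaultdict(list)
    for k, v in s:
        if k in high:
            s_high[k].append(v)
    result += [(k, rv, sv) for k, rv in r if k in high for sv in s_high[k]]
    return result
```

Whatever the split, the union of the two paths yields exactly the join result; only the routing of each attribute value changes.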
IEEE International Conference on High Performance Computing, Data, and Analytics | 2009
M. Al Hajj Hassan; Mostafa Bamha
Owing to the fast development of network technologies, executing parallel programs on distributed systems that connect heterogeneous machines has become feasible, but challenges remain: workload imbalance in such an environment may stem not only from uneven load distribution among machines, as in homogeneous parallel systems, but also from a distribution that is ill-suited to the characteristics of each machine. In this paper, we present a new parallel join algorithm for heterogeneous distributed architectures based on efficient dynamic data distribution and task allocation, which makes it insensitive to data skew and ensures perfect balancing properties during all stages of join computation. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model. We show that our algorithm guarantees optimal complexity and near-linear speed-up while reducing communication and disk input/output costs to a minimum.
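The contrast with uniform distribution can be illustrated by a minimal sketch that splits tuples in proportion to hypothetical per-machine capacity weights. The function name and the largest-remainder rounding are illustrative choices, not the paper's dynamic allocation scheme.

```python
def capacity_weighted_split(tuples, capacities):
    """Toy sketch: partition `tuples` among heterogeneous nodes in
    proportion to their relative `capacities` (hypothetical speed weights),
    rather than uniformly as a homogeneous hash partitioning would."""
    total = sum(capacities)
    n = len(tuples)
    # Target number of tuples per node (floor), then hand out the remainder.
    targets = [n * c // total for c in capacities]
    for i in range(n - sum(targets)):
        targets[i % len(targets)] += 1
    parts, start = [], 0
    for t in targets:
        parts.append(tuples[start:start + t])
        start += t
    return parts
```

A node that is three times faster receives roughly three times as many tuples, so all nodes finish their local work at about the same time.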
Database and Expert Systems Applications | 2005
Mostafa Bamha
The development of scalable parallel database systems requires efficient algorithms for the join operation, the most frequent and expensive operation in relational database systems. The join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures. In this paper, we present a new parallel algorithm for join and multi-join operations on distributed architectures based on an efficient semi-join computation technique. This algorithm is proved to have optimal complexity and deterministic, perfect load balancing. Its tradeoff between balancing overhead and speed-up is analyzed using the BSP cost model, which predicts negligible join product skew and a linear speed-up. The algorithm improves on our fa_join and sfa_join algorithms by reducing their communication and synchronization costs to a minimum while offering the same load-balancing properties, even for highly skewed data.
International Conference on Conceptual Structures | 2014
Mohamad Al Hajj Hassan; Mostafa Bamha; Frédéric Loulergue
For over a decade, MapReduce has been a prominent programming model for handling vast amounts of raw data in large-scale systems. This model ensures scalability, reliability and availability with reasonable query processing time. However, these large-scale systems still face some challenges: data skew, task imbalance, and high disk I/O and redistribution costs can have disastrous effects on performance. In this paper, we introduce the MRFA-Join algorithm: a new frequency-adaptive algorithm, based on the MapReduce programming model and a randomised key redistribution approach, for join processing of large-scale datasets. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of join computation. These properties have been confirmed by a series of experiments.
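Randomised key redistribution for a MapReduce join can be sketched in miniature as follows. This toy map/reduce pair assumes the set of skewed keys is already known from a frequency scan, and all names (`salted_map`, `n_salts`) are hypothetical; it is not MRFA-Join itself.

```python
import random
from collections import defaultdict

def salted_map(r, s, skewed, n_salts=3, seed=0):
    """Toy sketch of randomised key redistribution for a MapReduce join.

    Keys in `skewed` are spread over `n_salts` reduce keys: each R-tuple
    picks one salt at random, while each matching S-tuple is replicated to
    every salt so that no join pair is lost. Returns the shuffled groups,
    mapping each reduce key (key, salt) to its (R-values, S-values)."""
    rng = random.Random(seed)
    groups = defaultdict(lambda: ([], []))
    for k, v in r:
        salt = rng.randrange(n_salts) if k in skewed else 0
        groups[(k, salt)][0].append(v)
    for k, v in s:
        for salt in (range(n_salts) if k in skewed else [0]):
            groups[(k, salt)][1].append(v)
    return groups

def salted_reduce(groups):
    """Local join inside each reduce group; the union over groups is the join."""
    return [(k, rv, sv)
            for (k, _), (rs, ss) in groups.items()
            for rv in rs for sv in ss]
```

A skewed key's work is thus shared by up to `n_salts` reducers instead of overloading one, at the price of replicating the smaller side's tuples for that key.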
International Conference on Computational Science | 2005
Mostafa Bamha; Gaétan Hains
The semi-join is the most widely used technique for optimizing the evaluation of complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high, owing to data skew and to the high cost of communication in distributed architectures. In this paper we present a parallel equi-semi-join algorithm for shared-nothing machines. The performance of this algorithm is analyzed using the BSP cost model and is proved to have asymptotically optimal complexity and perfect load balancing, even for highly skewed data. This guarantees unlimited scalability in all situations for this key algorithm.
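The semi-join operation itself is simple to state; the sketch below shows R ⋉ S in a few lines (a toy centralized version, not the parallel algorithm analyzed in the paper).

```python
def semi_join(r, s):
    """Toy sketch of the equi-semi-join R ⋉ S: keep only the R-tuples whose
    join-attribute value also appears in S. In the distributed setting only
    the (small) set of S's join values travels over the network, which is
    why semi-joins cut communication for complex queries."""
    s_keys = {k for k, _ in s}
    return [(k, v) for k, v in r if k in s_keys]
```

Only the keys of S are needed to filter R, so the tuples of R that cannot contribute to the join are never shipped.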
Database and Expert Systems Applications | 2000
Mostafa Bamha; Gaétan Hains
Join is an expensive and frequently used operation whose parallelization is highly desirable. However, the effectiveness of parallel joins depends on the ability to divide the load evenly among processors, and data skew can have a disastrous effect on performance. Although many skew-handling algorithms have been proposed, they remain generally inefficient for multi-joins owing to join product skew and to costly, often unnecessary, redistribution and communication. A parallel join algorithm called fa_join was introduced in an earlier paper with deterministic and near-perfect balancing properties. Despite its advantages, fa_join is sensitive to the correlation of the attribute-value distributions in the two relations. We present here an improved version of the algorithm, called sfa_join, with a symmetric treatment of both relations. Its predictably low join-product and attribute-value skew makes it suitable for repeated use in multi-join operations. Its performance is analyzed theoretically and experimentally, confirming its linear speed-up and its superiority over fa_join.
Database and Expert Systems Applications | 1999
Mostafa Bamha; Fadila Bentayeb; Gaétan Hains
The problem of maintaining materialized views has recently been the object of increased research activity, mainly because of applications related to data warehousing. Many sequential view maintenance algorithms have been developed in the literature. If the view is defined by a relational expression involving join operators, the cost of re-evaluating the view, even incrementally, may be unacceptable. Moreover, when views are materialized, parallelism can provide the processing power needed for view maintenance. In this paper, we present a new parallel join algorithm based on partial duplication of data and a new parallel view maintenance algorithm in which views can involve multi-joins. The performance of these algorithms is analyzed using the scalable and portable BSP cost model, which predicts a near-linear speed-up.
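The incremental-maintenance idea for a join view can be illustrated with a toy sketch: for a view V = R ⋈ S, an insertion ΔR into R only requires computing ΔR ⋈ S and appending it, rather than re-evaluating the whole join. Function names are hypothetical and all parallel aspects are omitted.

```python
from collections import defaultdict

def hash_join(r, s):
    """Plain hash join on the first component of each tuple."""
    idx = defaultdict(list)
    for k, v in s:
        idx[k].append(v)
    return [(k, rv, sv) for k, rv in r for sv in idx[k]]

def maintain_view(view, delta_r, s):
    """Toy sketch of incremental maintenance for a view V = R JOIN S:
    on insertions delta_r into R, only delta_r JOIN S is computed and
    appended, instead of re-evaluating the whole join."""
    return view + hash_join(delta_r, s)
```

The maintained view is identical to the view recomputed from scratch, but only the delta is joined.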
ACM Symposium on Applied Computing | 2010
M. Al Hajj Hassan; Mostafa Bamha
The semi-join is the most widely used technique for optimizing the evaluation of complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high, owing to data skew and to the high cost of communication in distributed architectures. Internet search engines need to process vast amounts of raw data every day; systems that manage such data must therefore address scalability, reliability and availability while keeping query processing time reasonable. Hadoop and Google's File System are examples of such systems. In this paper, we present a new algorithm, based on the Map-Reduce-Merge model and distributed histograms, for processing semi-join operations on such systems. A cost analysis of this algorithm shows that our approach is insensitive to data skew while reducing communication and disk input/output costs to a minimum.
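The role of distributed histograms in a semi-join can be sketched as follows: each partition first ships only its set of join-attribute values, and full tuples move only for keys in the intersection. This is a toy illustration; the paper's Map-Reduce-Merge algorithm and its cost model are not reproduced here.

```python
def histogram_semi_join(r_parts, s_parts):
    """Toy sketch of a histogram-driven semi-join: each partition emits
    only its set of join-attribute values (its 'histogram'), the sets are
    intersected, and only R-tuples whose key survives the intersection
    are kept; full tuples of non-matching keys never move."""
    r_hist = set().union(*({k for k, _ in p} for p in r_parts))
    s_hist = set().union(*({k for k, _ in p} for p in s_parts))
    matching = r_hist & s_hist
    return [t for p in r_parts for t in p if t[0] in matching]
```

Because histograms are small compared to the relations, this pre-exchange costs little while sparing the network the tuples that could never join.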
International Conference on Software and Data Technologies | 2006
M. Al Hajj Hassan; Mostafa Bamha
SQL queries involving join and group-by operations are frequently used in many decision-support applications. In these applications the input relations are usually very large, so parallelizing these queries is highly desirable in order to obtain acceptable response times. The main drawbacks of existing parallel algorithms for this kind of query are that they are very sensitive to data skew and incur expensive communication and Input/Output costs when evaluating the join operation. In this paper, we present an algorithm that minimizes communication cost by performing the group-by operation before redistribution, so that only tuples that will appear in the join result are redistributed. In addition, it evaluates the query without materializing the result of the join operation, thus reducing the Input/Output cost of intermediate join results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a near-linear speed-up even for highly skewed data.
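Pushing the group-by below the redistribution step can be sketched in miniature: each node aggregates its local tuples per join key and discards keys known not to join, so only one compact pair per surviving key is shipped. SUM as the aggregate and all names here are illustrative assumptions, not the paper's algorithm.

```python
from collections import defaultdict

def early_aggregate_then_join(r, s_keys):
    """Toy sketch of pushing the group-by below the join: locally
    aggregate R per join key and keep only keys that can join (s_keys,
    e.g. known from a semi-join or histogram step), so a single
    (key, aggregate) pair per surviving key is redistributed."""
    agg = defaultdict(int)
    for k, x in r:
        if k in s_keys:          # tuples that cannot join are never sent
            agg[k] += x          # SUM as the example aggregate
    return dict(agg)
```

Shipping one pair per key instead of every tuple is what shrinks the communication volume, and the join result never needs to be materialized to finish the aggregation.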
Database and Expert Systems Applications | 2017
Khanh-Chuong Duong; Mostafa Bamha; Arnaud Giacometti; Dominique Haoyuan Li; Arnaud Soulet; Christel Vrain
Mining frequent itemsets in large datasets has received much attention in recent years, often relying on MapReduce programming models. Many well-known frequent itemset mining (FIM) algorithms have been parallelized in MapReduce frameworks, such as Parallel Apriori, Parallel FP-Growth and Dist-Eclat. However, most of this work focuses on work partitioning and/or load balancing, and the resulting algorithms do not scale because they rest on assumptions about available memory. A challenge in designing parallel FIM algorithms is thus finding ways to guarantee that the data structures used during mining always fit in the local memory of the processing nodes during all computation steps.