Mohamed Helmy Khafagy

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mohamed Helmy Khafagy is active.

Explore More

Publication

Featured researches published by Mohamed Helmy Khafagy.

ieee international conference on cloud computing technology and science | 2011

OCSS: Ontology Cloud Storage System

Haytham Tawfeek Al Feel; Mohamed Helmy Khafagy

Cloud computing is considered a booming trend in the world of information technology which depends on the idea of computing on demand. Cloud computing platform is a set of scalable data servers, providing computing and storage services. The cloud storage is a relatively basic and widely applied service which can provide users with stable, massive data storage space. Our research concerns with searching in content of different kind of files in the cloud based on ontology, this approach resolves the weaknesses that existed in Google File System that depends on metadata. In this paper, we are proposing new cloud storage architecture based on ontology that can store and retrieve files in the cloud based on its content. Our new architecture was tested on Cloud Storage Simulator and the result shows that the new architecture has better scalability, fault tolerance and performance.

international conference on informatics and systems | 2014

JOMR: Multi-join optimizer technique to enhance map-reduce job

Mina Samir Shanoda; Samah Senbel; Mohamed Helmy Khafagy

Map-Reduce is a programming model and execution an environment developed by Google to process very large amounts of data. Query optimizer is needed to find more efficient plans for declarative SQL query. In classic database: join algorithms are optimized to execute the entire query result, but they ignore the importance of tables order especially in multi-join query. But we can see that the orders for tables are an important factor to get the best performance of a query plan and will be very effective in performance when join tables content huge number of rows in addition to more than one join operation. In this paper we proposed a new technique called JOMR (Join Order In Map-Reduce) that optimizes and enhances Map-Reduce job. This technique uses enhanced parallel Travel Salesman Problem (TSP) using Map-Reduce for improving the performance of query plans according to change the order for join tables. Also we build a cost model that supports our algorithm to find best join order. We will focus on Hive especially multi-join query and our experiments result for JOMR algorithm proving the effectiveness of our query optimizer and this performance is improved more when increasing the number of join and size of data.

ieee international conference on cloud computing technology and science | 2012

Distributed Ontology Cloud Storage System

Mohamed Helmy Khafagy; Haytham Tawfeek Al Feel

There are dramatically increasing interests from both academic and industry in the trend of cloud computing. Cloud computing depends on the idea of computing on demand that provide, support and delivery of computing services with stable and large data space. Our research concerns with improving the searching process in the cloud storage via avoiding the bottleneck in central ontology cloud storage system since all data chunks in the cloud must be indexed by a master ontology server.The contribution of this paper proposes new cloud storage architecture based distributed ontology as one of the main semantic technologies. This architecture provides better scalability, fault tolerance and enhanced performance for searching in the cloud storage avoiding the central bottleneck.

Journal of Computational Science | 2017

Exploiting coarse-grained reused-based opportunities in Big Data multi-query optimization

Radhya Sahal; Mohamed Helmy Khafagy; Fatma A. Omara

Abstract Multi-query optimization in Big Data becomes a promising research direction due to the popularity of massive data analytical systems (e.g., MapReduce and Flink). The multi-query is translated into jobs. These jobs are routinely submitted with similar tasks to the underling Big Data analytical systems. These similar tasks are considered complicated and computation overhead. Therefore, there are some existing techniques that have been proposed for exploiting sharing tasks in Big Data multi-query optimization (e.g., MRShare and Relaxed MRShare). These techniques are heavily tailored relaxed optimizing factors of fine-grained reused-based opportunities. In accordance with Big Data multi-query optimization, the existing fine-grained techniques are only concerned with equal tuples size and uniform data distribution. These issues are not applicable to the real-world distributed applications which depend on coarse-grained reused-based opportunities, such as non-equal tuples size and non-uniform data distribution. These two issues receive more-attention in Big Data multi-query optimization, to minimize the data read from or written back to Big Data infrastructures (e.g., Hadoop). In this paper, Multi-Query Optimization using Tuple Size and Histogram (MOTH) system has been proposed to consider the granularity of the reused-based opportunities. The proposed MOTH system exploits the coarse-grained of the fully and partially reused-based opportunities among queries with considering non-equal tuples size and non-uniform data distribution to avoid repeated computations. According to the proposed MOTH system, a combined technique has been introduced for estimating the coarse-grained reused-based opportunities horizontally and vertically. The horizontal estimation of non-equal tuples size has been done by extracting metadata in column-level, while the vertical estimation of non-uniform data distribution is concerned with using pre-computed histogram in row-level. In addition, the MOTH system estimates the coarse-grained reused-based opportunities with considering slow storage (i.e., limited physical resources or fewer allocated virtualized resources) to produce the accurate estimation of the reused results costs. Then, a cost-based heuristic algorithm has been introduced to select the best reused-based opportunity and generate an efficient multi-query execution plan. Because the partial reused-based opportunities have been considered, extra computations are needed to retrieve the non-derived results. Also, a partial reused-based optimizer has been tailored and added to the proposed MOTH system to reformulate the generated multi-query plan to improve the shared partial queries. According to the experimental results of the proposed MOTH system using TPC-H benchmark, it is found that multi-query execution time has been reduced by considering the granularity of the reused results.

Future Generation Computer Systems | 2019

An improved technique for increasing availability in Big Data replication

Mostafa R. Kaseb; Mohamed Helmy Khafagy; Ihab A. Ali; E. M. Saad

Abstract Big Data represents a major challenge for the performance of the cloud computing storage systems. Some distributed file systems (DFS) are widely used to store big data, such as Hadoop Distributed File System (HDFS), Google File System (GFS) and others. These DFS replicate and store data as multiple copies to provide availability and reliability, but they increase storage and resources consumption. In a previous work (Kaseb, Khafagy, Ali, & Saad, 2018), we built a Redundant Independent Files (RIF) system over a cloud provider (CP), called CPRIF, which provides HDFS without replica, to improve the overall performance through reducing storage space, resources consumption, operational costs and improved the writing and reading performance. However, RIF suffers from limited availability, limited reliability and increased data recovery time. In this paper, we overcome the limitations of the RIF system by giving more chances to recover a lost block (availability) and the ability of the system to keep working the presence of a lost block (reliability) with less computation (time overhead). As well as keeping the benefits of storage and resources consumption attained by RIF compared to other systems. We call this technique “High Availability Redundant Independent Files” (HARIF), which is built over CP; called CPHARIF. According to the experimental results of the HARIF system using the TeraGen benchmark, it is found that the execution time of recovering data, availability and reliability using HARIF have been improved as compared with RIF. Also, the stored data size and resources consumption with HARIF system is reduced compared to the other systems. The Big Data storage is saved and the data writing and reading are improved.

world conference on information systems and technologies | 2018

Redundant Independent Files (RIF): A Technique for Reducing Storage and Resources in Big Data Replication

Mostafa R. Kaseb; Mohamed Helmy Khafagy; Ihab A. Ali; E. M. Saad

Most of cloud computing storage systems widely use a distributed file system (DFS) to store big data, such as Hadoop Distributed File System (HDFS) and Google File System (GFS). Therefore, the DFS depends on replicate data and stores it as multiple copies, to achieve high reliability and availability. On the other hand, that technique increases storage and resources consumption.

Journal of Grid Computing | 2018

iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Radhya Sahal; Marwah Nihad; Mohamed Helmy Khafagy; Fatma A. Omara

Query optimization in Big Data becomes a promising research direction due to the popularity of massive data analytical systems such as Hadoop system. The query optimization is getting hard to efficiently execute JOIN queries on top of Hadoop query language, Hive, over limited Big Data storages. According to our previous work, HiveQL Optimization for JOIN query over Multi-session Environment (HOME) system has been introduced over Hadoop system to improve its performance by storing the intermediate results to avoid repeated computations. Time overheads and Big Data storages limitation are considered the main drawback of the HOME system, especially in the case of using additional physical storages or renting extra virtualized storages. In this paper, an index-based system for reusing data called indexing HiveQL Optimization for JOIN over Multi-session Big Data Environment (iHOME) is proposed to overcome HOME overheads by storing only the indexes of the joined rows instead of storing the full intermediate results directly. Moreover, the proposed iHOME system addresses eight cases of JOIN queries which classified into three groups; Similar-to-iHOME, Compute-on-iHOME, and Filter-of-iHOME. According to the experimental results of the iHOME system using TPC-H benchmark, it is found that the execution time of eight JOIN queries using iHOME on Hive has been reduced. Also, the stored data size in the iHOME system is reduced relative to the HOME system, as well as, the Big Data storage is saved. So, by increasing stored data size, the iHOME system guarantees the space scalability and overcomes the storage limitation.

Iete Technical Review | 2018

Optimizing Join in HIVE Star Schema Using Key/Facts Indexing

Hussien Sh. Abdel Azez; Mohamed Helmy Khafagy; Fatma A. Omara

ABSTRACT These days Big Data represents complex and an important issue for the extraction/retrieval of information due to the fact that its analysis requires massive computation power. In addition, database star schema can be considered as one of the complicated data models due to the use of joining queries heavily for information extraction and reports generation, which demands scanning for a large amount of data (tera, peta, zeta bytes, etc.). On the other hand, HIVE is considered one of the essential and efficient Big Data SQL-based tools built on the top of Hadoop as a translator from SQL queries into Map/Reduce tasks. In addition, using data indexing techniques with join queries could improve /speed up HIVE join query tasks execution especially in a star schema. According to the work in this paper, Key/Facts indexing methodology was introduced to materialize the star schema and inject a simple index for data. Based on this, Key/Facts indexing methodology SQL queries’ execution time in HIVE improved without changing HIVE framework. TPC-H benchmark was used in order to estimate the performance of Key/Facts methodology. Experimental results prove that Key/Facts methodology out-performs traditional HIVE join execution time. Also, Key/Facts performance is improved by increasing the data size. Generally, Key/Facts can be considered one of the suitable methodologies for Big Data analysis.

International journal of scientific and engineering research | 2015