Imad Rahal | Researchain

Publication

Featured research published by Imad Rahal.

Journal of Information & Knowledge Management | 2004

A Scalable Vertical Model for Mining Association Rules

Imad Rahal; Dongmei Ren; William Perrizo

Association rule mining (ARM) is the data-mining process for finding all association rules in datasets matching user-defined measures of interest such as support and confidence. Usually, ARM proceeds by mining all frequent itemsets (a step known to be very computationally intensive), from which rules are then derived in a straightforward manner. In general, mining all frequent itemsets prunes the search space by using the downward closure (or anti-monotonicity) property of support, which states that no itemset can be frequent unless all of its subsets are frequent. A large number of papers have addressed the problem of ARM, but not many of them have focused on scalability over very large datasets (i.e. when datasets contain a very large number of transactions). In this paper, we propose a new model for representing data and mining frequent itemsets that is based on the P-tree technology for compression and faster logical operations over vertically structured data and on set enumeration trees for fast itemset enumeration. Experimental results presented hereinafter show big improvements for our approach over large datasets when compared to other contemporary approaches in the literature.
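
The two ideas in this abstract, vertically structured data and downward-closure pruning, can be illustrated with a small sketch. It is not the P-tree model from the paper: as a stand-in assumption, each item is stored as an uncompressed Python integer used as a bit vector over transaction ids, and candidates are generated level by level rather than with a set enumeration tree. The function names are hypothetical.

```python
from itertools import combinations


def vertical_bitmaps(transactions):
    """Map each item to an int whose bit i is set when transaction i contains it."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def popcount(bits):
    """Number of set bits, i.e. the support count of a bit vector."""
    return bin(bits).count("1")


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    bitmaps = vertical_bitmaps(transactions)
    # Level 1: frequent single items with their bit vectors.
    current = {
        frozenset([item]): bits
        for item, bits in bitmaps.items()
        if popcount(bits) >= min_support
    }
    result = {}
    while current:
        for itemset, bits in current.items():
            result[itemset] = popcount(bits)
        next_level = {}
        for a, b in combinations(list(current), 2):
            candidate = a | b
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Downward closure: skip the candidate unless every subset
            # obtained by dropping one item is itself frequent.
            if any(candidate - {x} not in current for x in candidate):
                continue
            # Support counting is a single bitwise AND, with no rescan of the data.
            bits = current[a] & current[b]
            if popcount(bits) >= min_support:
                next_level[candidate] = bits
        current = next_level
    return result


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
supports = frequent_itemsets(transactions, min_support=3)
```

With a minimum support of 3, every single item and every pair is frequent in this toy dataset, while the triple {a, b, c} appears in only two transactions and is rejected.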

Knowledge and Information Systems | 2015

Efficient clustering-based source code plagiarism detection using PIY

Tony Ohmann; Imad Rahal

Vast amounts of information available online make plagiarism increasingly easy to commit, and this is particularly true of source code. The traditional approach of detecting copied work in a course setting is manual inspection. This is not only tedious but also typically misses code plagiarized from outside sources or even from an earlier offering of the course. Systems to automatically detect source code plagiarism exist but tend to focus on small submission sets. One such system that has become the standard in automated source code plagiarism detection is Measure of Software Similarity (MOSS) (Schleimer et al., in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, ACM, San Diego, 2003). In this work, we present an approach called Program It Yourself (PIY), which is empirically shown to outperform MOSS in detection accuracy. By utilizing parallel processing and data clustering, PIY is also capable of maintaining detection accuracy and reasonable runtimes even when using extremely large data repositories.

International Conference on Data Mining | 2007

WC-Clustering: Hierarchical Clustering Using the Weighted Confidence Affinity Measure

Baoying Wang; Imad Rahal

Market-basket data analysis is an important problem that has been well addressed in the literature, especially in the context of finding associations among items in large groups of transactions. Recently, there have been many attempts at clustering market-basket data. However, most of those market-basket clustering methods belong to partitional clustering, which requires at least one input parameter (e.g., the minimum intra-cluster similarity or the desired number of clusters). In this paper, we propose WC-clustering, a hierarchical clustering approach using vertical data structures. In order to minimize the impact of low support items, we devise a weighted confidence (WC) affinity function to calculate the similarity between clusters (or itemsets). Our experimental results show that WC-clustering produces much more compact results than Apriori and that the proposed weighted confidence affinity measure is more accurate than other contemporary affinity measures in the literature.

International Conference on Data Mining | 2008

Parallel Hierarchical Clustering on Market Basket Data

Baoying Wang; Qin Ding; Imad Rahal

Data clustering has proven to be a promising data mining technique. Recently, there have been many attempts at clustering market-basket data. In this paper, we propose a parallelized hierarchical clustering approach on market-basket data (PH-Clustering), which is implemented using MPI. Based on the analysis of the major clustering steps, we adopt a partial local and partial global approach to decrease the computation time while keeping communication time to a minimum. Load balancing is considered throughout, especially at the data partitioning stage. Our experimental results demonstrate that PH-Clustering is substantially faster than sequential clustering. The larger the data size, the more significant the speedup when the number of processors is large. Our results also show that the number of items has more impact on the performance of PH-Clustering than the number of transactions.

International Journal of Data Mining, Modelling and Management | 2011

Parallel hierarchical clustering using weighted confidence affinity

Baoying Wang; Imad Rahal; Aijuan Dong

There have been many attempts at clustering categorical data such as market basket datasets. However, most categorical clustering approaches belong to partitional clustering, which requires at least one input parameter (e.g., the minimum intra-cluster similarity or the desired number of clusters). In this paper, we propose a parallelised hierarchical clustering approach for categorical data (PH-clustering) using vertical data structures. In order to minimise the impact of low support items, we devise a weighted confidence (WC) affinity function to compute the similarity between clusters. Based on our analysis of the major clustering steps, we adopt a partial local and partial global approach to reduce computation time as well as to keep network communication to a minimum. Load balance issues are addressed, especially during the data partitioning phase. Our experimental results on standardised market basket data show that the proposed weighted confidence affinity measure is more accurate than other contemporary affinity measures in the literature and that our parallel clustering approach provides magnitudes of time improvement over sequential clustering, especially over larger data sizes. Our results also indicate that the number of items/attributes in the dataset has a more drastic impact on performance than the number of transactions/tuples.

Journal of Information & Knowledge Management | 2014

Source Code Plagiarism Detection Using Biological String Similarity Algorithms

Imad Rahal; Colin Wielga

Source code plagiarism is easy to commit but difficult to catch. Many approaches have been proposed in the literature to automate its detection; however, there is little consensus on what works best. In this paper, we propose two new measures for determining the accuracy of a given technique and describe an approach to convert code files into strings which can then be compared for similarity in order to detect plagiarism. We then evaluate several string comparison techniques, heavily utilised in the area of biological sequence alignment, and compare their performance on a large collection of student source code containing various types of plagiarism. Experimental results show that the compared techniques succeed in matching a plagiarised file to its original files upwards of 90% of the time. Finally, we propose a modification for these algorithms that drastically improves their runtimes with little or no effect on accuracy. Even though the ideas presented herein are applicable to most programming languages, we focus on a case study pertaining to an introductory-level Visual Basic programming course offered at our institution.
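
As a rough illustration of comparing code-derived strings with a sequence alignment technique, the sketch below scores two strings with Smith-Waterman local alignment. The abstract does not name the algorithms or the code-to-string conversion the authors used, so the choice of Smith-Waterman, the scoring parameters, and the normalisation in `similarity` are all assumptions made for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Score scaled to [0, 1] by the best score the shorter sequence could reach.

    This normalisation is a hypothetical choice for the sketch, not the paper's measure.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shorter)


# Each character stands for one token of a (hypothetically) tokenised source file.
original = "XXABCYY"
suspect = "ZZABCWW"
shared_score = smith_waterman_score(original, suspect)
```

Only two rows of the dynamic-programming table are kept, so memory grows with the length of one string instead of the product of both lengths.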

International Journal of Bioinformatics Research and Applications | 2008

CARIBIAM: Constrained Association Rules using Interactive Biological IncrementAl Mining

Imad Rahal; Riad M. Rahhal; Baoying Wang; William Perrizo

This paper analyses annotated genome data by applying a very central data-mining technique known as Association Rule Mining (ARM) with the aim of discovering rules and hypotheses capable of yielding deeper insights into this type of data. In the literature, ARM has been noted for producing an overwhelming number of rules. This work proposes a new technique capable of using domain knowledge in the form of queries in order to efficiently mine only the subset of the associations that are of interest to investigators in an incremental and interactive manner.

International Journal of Data Mining, Modelling and Management | 2013

Using naïve Bayesian classification as a meta-predictor to improve start codon prediction accuracy in prokaryotic organisms

Sean R. Landman; Imad Rahal

Modern gene location prediction techniques are able to achieve near-perfect accuracy for prokaryotic organisms, but this reported accuracy is generally only for the stop codon locations. Accurate prediction of the start codon locations is more difficult to attain, and different approaches often produce conflicting predictions for the same gene. In this paper, we describe a new approach to resolve these conflicts and improve start codon prediction accuracy. Our approach uses a set of gene location prediction results from other popular prediction approaches to find consistently predicted gene locations. It then uses these consistent genes as a training set for a naive Bayesian classifier to improve accuracy in the ambiguous genes, those in which there are some inconsistencies in the predicted start codon location among the original predictions. The methods detailed here apply to prokaryotic organisms, using E. coli and the EcoGene Verified Set database as a case study.

Journal of Information & Knowledge Management | 2009

Automated Gene-Retrieval System for Biological Information Needs

Imad Rahal; Baoying Wang; Riad M. Rahhal

In this day and age, conducting a biological experiment is presumably a very expensive procedure largely owing to the highly sophisticated and expensive equipment necessitated by the process. Conceivably, being capable of isolating and focusing on a smaller set of imperative genes or gene products that are of high relevance to the experiment, pathway, or biological system under investigation is very desirable largely owing to the potential savings in experimental costs. In this work, we propose an intelligent information system capable of generating a ranked list of genes and gene products that are most pertinent to a given biological pathway, experiment or system (referred to as a biological context henceforth). We assume that the biological context of interest can be described by various textual query terms and phrases from the biological domain which, in turn, relate to various molecular functions, biological processes and cellular components of genes and their products. Intelligent text-based analyses and mining are utilised for this purpose by using the published literature, in the form of publication abstracts downloaded from PubMed, with the intention of ranking genes and gene products having identified relationships to the specified description terms based on the gene ontology (GO) standard. At this stage, our approach is capable of producing promising results given all surrounding restrictions, one of which is the lack of similar work in the literature. For demonstration purposes, we report experimental results on the molting regulation pathway in Drosophila melanogaster (fruit fly).

Journal of Information & Knowledge Management | 2006

Efficiency Considerations for Vertical kNN Text Categorisation

Imad Rahal; Hassan Najadat; William Perrizo

The importance of text mining stems from the availability of huge volumes of text databases holding a wealth of valuable information that needs to be mined. Text mining is a coarse area encompassing many finer branches, one of which is text categorisation or text classification. Text categorisation is the process of assigning class labels to documents based entirely on their textual contents, where we are given a document d and asked to find its subject matter or class label, Ci. In this paper, an optimised k-Nearest Neighbours classifier that uses discretisation, the P-tree technology, and dimensionality reduction to achieve a high degree of accuracy, space utilisation and time efficiency is proposed. One of the fundamental contributions of this work is that as new samples arrive, the proposed classifier can find the k nearest neighbours to the new sample from the training space without a single database scan.
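
For readers unfamiliar with k-Nearest Neighbours text categorisation, a plain baseline can be sketched as follows. It deliberately omits everything specific to the paper (discretisation, P-trees, dimensionality reduction) and, unlike the proposed classifier, scans every training document for each query. The whitespace tokeniser, the cosine similarity over raw term counts, and all function names are assumptions of this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector from lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(document, training, k=3):
    """Label a document by majority vote among its k most similar training documents.

    `training` is a list of (text, label) pairs; every pair is scanned per query.
    """
    query = tf_vector(document)
    scored = sorted(
        ((cosine(query, tf_vector(text)), label) for text, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


training = [
    ("goal match football team", "sports"),
    ("team wins football cup", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates market", "finance"),
]
label = knn_classify("football team scores goal", training, k=3)
```

The full scan inside `knn_classify` is the cost that the abstract says the proposed classifier avoids.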