Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Zahidul Islam is active.

Publication


Featured research published by Zahidul Islam.


Knowledge-Based Systems | 2011

Privacy preserving data mining: a noise addition framework using a novel clustering technique

Zahidul Islam

During the whole process of data mining (from data collection to knowledge discovery) various sensitive data get exposed to several parties, including data collectors, cleaners, preprocessors, miners and decision makers. The exposure of sensitive data can potentially lead to a breach of individual privacy. Therefore, many privacy preserving techniques have been proposed recently. In this paper we present a framework that uses a few novel noise addition techniques for protecting individual privacy while maintaining high data quality. We add noise to all attributes, both numerical and categorical. We present a novel technique for clustering categorical values and use it for noise addition purposes. A security analysis is also presented for measuring the security level of a data set.
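To illustrate the general idea of noise addition to numerical attributes (a minimal sketch only, not the paper's actual framework, which also clusters and perturbs categorical values; the function name and parameters below are illustrative):

```python
import random

def add_noise(records, numeric_keys, scale=0.1, seed=42):
    """Return a copy of `records` with zero-mean Gaussian noise added to
    each numeric attribute; the noise scale is relative to the
    attribute's observed range, so all attributes are perturbed
    proportionally."""
    rng = random.Random(seed)
    values = {k: [r[k] for r in records] for k in numeric_keys}
    spans = {k: (max(v) - min(v)) or 1.0 for k, v in values.items()}
    noisy = []
    for r in records:
        out = dict(r)  # leave the original record untouched
        for k in numeric_keys:
            out[k] = r[k] + rng.gauss(0.0, scale * spans[k])
        noisy.append(out)
    return noisy
```

The released (noisy) data set preserves aggregate statistics approximately while no longer revealing exact individual values.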


Knowledge-Based Systems | 2014

A hybrid clustering technique combining a novel genetic algorithm with K-Means

Anisur Rahman; Zahidul Islam

Many existing clustering techniques, including K-Means, require a user input on the number of clusters. It is often extremely difficult for a user to accurately estimate the number of clusters in a data set. Genetic algorithms (GAs) can determine the number of clusters automatically. However, they typically choose the genes and the number of genes randomly. If the right genes are identified for the initial population, a GA is more likely to produce a high quality clustering result than when the genes are chosen randomly. We propose a novel GA based clustering technique that is capable of automatically finding the right number of clusters and identifying the right genes through a novel initial population selection approach. With the help of a novel fitness function and a gene rearrangement operation, it produces high quality cluster centers. The centers are then fed into K-Means as initial seeds in order to produce an even higher quality clustering solution by allowing the initial seeds to readjust as needed. Our experimental results indicate a statistically significant superiority (according to the sign test analysis) of our technique over five recent techniques on the twenty natural data sets used in this study, based on six evaluation criteria.
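The final refinement step, feeding precomputed centers into K-Means as initial seeds, can be sketched with a plain Lloyd iteration (this is standard K-Means, not the paper's GA; the GA's output would be passed in as `seeds`):

```python
def kmeans(points, seeds, iters=50):
    """Lloyd's algorithm: refine the given initial centers (e.g. those
    produced by a GA) instead of choosing them randomly."""
    centers = [list(c) for c in seeds]
    for _ in range(iters):
        # assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # move each center to the mean of its assigned points
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centers
```

Because K-Means only converges to a local optimum, the quality of the seeds largely decides the quality of the final clustering, which is why seeding it with GA-selected centers helps.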


Knowledge-Based Systems | 2013

Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques

Md. Geaur Rahman; Zahidul Islam

We present two novel techniques for the imputation of both categorical and numerical missing values. The techniques use decision trees and forests to identify horizontal segments of a data set where the records belonging to a segment exhibit higher similarity and stronger attribute correlations. Using the similarity and correlations, missing values are then imputed. To achieve a higher quality of imputation, some segments are merged together using a novel approach. We use nine publicly available data sets to experimentally compare our techniques with a few existing ones in terms of four commonly used evaluation criteria. The experimental results indicate a clear superiority of our techniques based on statistical analyses such as confidence intervals.
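The core idea of imputing within a homogeneous segment can be sketched as follows (a minimal illustration, with a single grouping key standing in for the leaves of a decision tree; the function name is hypothetical):

```python
def impute_by_segment(records, segment_key, target_key):
    """Impute missing (None) values of `target_key` with the mean of the
    records that fall in the same segment; `segment_key` stands in for
    the leaf a decision tree would assign each record to."""
    means = {}
    for r in records:
        if r[target_key] is not None:
            means.setdefault(r[segment_key], []).append(r[target_key])
    means = {k: sum(v) / len(v) for k, v in means.items()}
    return [
        dict(r, **{target_key: means.get(r[segment_key], r[target_key])})
        if r[target_key] is None else dict(r)
        for r in records
    ]
```

A segment mean is typically a better estimate than the global mean because records in the same segment are, by construction, more similar to each other.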


British National Conference on Databases | 2010

EXPLORE: a novel decision tree classification algorithm

Zahidul Islam

Decision tree algorithms such as See5 (or C5) are typically used in data mining for classification and prediction purposes. In this study we propose EXPLORE, a novel decision tree algorithm, which is a modification of See5. The modifications are made to improve the capability of a tree in extracting hidden patterns. Justification of the proposed modifications is also presented. We experimentally compare EXPLORE with some existing algorithms such as See5, REPTree and J48 on several issues including quality of extracted rules/patterns, simplicity, and classification accuracy of the trees. Our initial experimental results indicate advantages of EXPLORE over existing algorithms.


Information Systems | 2015

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

Michael J. Siers; Zahidul Islam

Software development projects inevitably accumulate defects throughout the development process. Due to the high cost that defects can incur, careful consideration is crucial when predicting which sections of code are likely to contain defects. Classification algorithms used in machine learning can be used to create classifiers which can predict defects. While traditional classification algorithms optimize for accuracy, cost-sensitive classification methods attempt to make the predictions which incur the lowest classification cost. In this paper we propose a cost-sensitive classification technique called CSForest, which is an ensemble of decision trees. We also propose a cost-sensitive voting technique called CSVoting in order to take advantage of the set of decision trees in minimizing the classification cost. We then investigate a potential solution to class imbalance within our decision forest algorithm. We empirically evaluate the proposed techniques by comparing them with six classifier algorithms on six publicly available clean datasets that are commonly used in research on software defect prediction. Our initial experimental results indicate a clear superiority of the proposed techniques over the existing ones.

Author highlights:
- SDP is short for Software Defect Prediction.
- We show that there is no clear winner among the studied existing methods for SDP.
- A cost-sensitive decision forest and voting technique are proposed.
- The superiority of the proposed techniques is shown.
- A framework for handling class imbalance within the forest algorithm is proposed.
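The cost-sensitive voting idea can be sketched as follows (a minimal illustration of choosing the minimum-expected-cost class, not the published CSVoting procedure; names and the cost-matrix layout are assumptions):

```python
def cs_vote(tree_probs, cost):
    """Pick the class with the lowest expected misclassification cost.
    tree_probs: one dict of class -> probability per tree (averaged
    below); cost[actual][predicted]: cost of predicting `predicted`
    when the true class is `actual`."""
    classes = list(cost)
    avg = {c: sum(p.get(c, 0.0) for p in tree_probs) / len(tree_probs)
           for c in classes}
    expected = {
        pred: sum(avg[actual] * cost[actual][pred] for actual in classes)
        for pred in classes
    }
    return min(expected, key=expected.get)
```

With an asymmetric cost matrix (missing a defect is far more expensive than a false alarm), this can predict "defect" even when the majority of trees lean towards "clean", which is exactly the behaviour plain majority voting cannot provide.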


Advanced Data Mining and Applications | 2013

kDMI: A Novel Method for Missing Values Imputation Using Two Levels of Horizontal Partitioning in a Data set

Md. Geaur Rahman; Zahidul Islam

Imputation of missing values is an important data mining task for improving the quality of data mining results. Imputation based on similar records is generally more accurate than imputation based on all records of a data set. Therefore, in this paper we present a novel algorithm called kDMI that employs two levels of horizontal partitioning of a data set, based on a decision tree and the k-NN algorithm, in order to find the records that are most similar to the one with missing value(s). Additionally, it uses a novel approach to automatically find the value of k for each record. We evaluate the performance of kDMI against three high quality existing methods on two real data sets in terms of four evaluation criteria. Our initial experimental results, including 95% confidence interval analysis and statistical t-test analysis, indicate the superiority of kDMI over the existing methods.
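The second partitioning level, imputing from the k nearest similar records, can be sketched as below (a minimal illustration only; kDMI additionally partitions via a decision tree first and chooses k automatically, which this sketch does not do):

```python
def knn_impute(complete, query, k):
    """Impute the missing last attribute of `query` using the mean of
    that attribute over its k nearest complete records (Euclidean
    distance on the observed attributes)."""
    def dist(r):
        return sum((a - b) ** 2 for a, b in zip(r, query[:-1]))
    nearest = sorted(complete, key=dist)[:k]
    return sum(r[-1] for r in nearest) / k
```

In kDMI the candidate records come from the same decision-tree leaf as the query record, so the k-NN search runs over an already-similar subset rather than the whole data set.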


Expert Systems with Applications | 2016

Discretization of continuous attributes through low frequency numerical values and attribute interdependency

Md. Geaur Rahman; Zahidul Islam

Author highlights:
- A new discretization technique called LFD.
- Does not require any user input.
- Interval width, number and frequency are automatically determined; all data driven.
- Minimizes information loss due to discretization by choosing low frequency cut points.
- Categorical attributes are taken as the reference point for discretization.

Discretization is the process of converting numerical values into categorical values. There are many existing techniques for discretization. However, the existing techniques have various limitations, such as requiring user input on the number of categories and the number of records in each category. Therefore, we propose a new discretization technique called the low frequency discretizer (LFD) that does not require any user input. There are some existing techniques that do not require user input, but they rely on various assumptions, such as that the number of records in each interval is the same, or that the number of intervals is equal to the number of records in each interval. These assumptions are often difficult to justify. LFD does not require any such assumptions. In LFD the number of categories and the frequency of each category are not pre-defined, but rather data driven. Other contributions of LFD are as follows. LFD uses low frequency values as cut points and thus reduces the information loss due to discretization. To discretize an attribute, it uses all other categorical attributes and any numerical attribute that has already been categorized. It considers that the influence of an attribute in the discretization of another attribute depends on the strength of their relationship. We evaluate LFD by comparing it with six existing techniques on eight datasets for three different types of evaluation, namely classification accuracy, imputation accuracy and noise detection accuracy. Our experimental results indicate a significant improvement based on the sign test analysis.
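The central heuristic, cutting at low frequency values so that few records sit near an interval boundary, can be sketched as follows (an illustrative simplification; the actual LFD also exploits relationships with the other attributes, which this sketch ignores):

```python
from collections import Counter

def low_freq_cut_points(values, n_cuts=2):
    """Choose the n_cuts least frequent distinct values as interval
    boundaries, so that few records lie on or near a boundary and
    little information is lost by the split."""
    freq = Counter(values)
    cuts = sorted(freq, key=lambda v: (freq[v], v))[:n_cuts]
    return sorted(cuts)

def discretize(x, cuts):
    """Map x to the index of the interval it falls in."""
    return sum(x > c for c in cuts)
```

Here `n_cuts` is a parameter of the sketch only; LFD itself determines the number of intervals from the data.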


Science and Engineering Ethics | 2015

Data Mining and Privacy of Social Network Sites' users: Implications of the data mining problem

Yeslam Al-Saggaf; Zahidul Islam

This paper explores the potential of data mining as a technique that could be used by malicious data miners to threaten the privacy of social network sites (SNS) users. It applies a data mining algorithm to a real dataset to provide empirically-based evidence of the ease with which characteristics of SNS users can be discovered and used in a way that could invade their privacy. One major contribution of this article is the use of the decision forest data mining algorithm (SysFor) in the context of SNS; SysFor builds not just a single decision tree but a forest, allowing more logic rules to be explored from a dataset. One logic rule that SysFor built in this study, for example, revealed that anyone having a profile picture showing just the face, or a picture showing a family, is less likely to be lonely. Another contribution of this article is the discussion of the implications of the data mining problem for governments, businesses, developers and the SNS users themselves.


Knowledge-Based Systems | 2016

Optimizing the number of trees in a decision forest to discover a subforest with high ensemble accuracy using a genetic algorithm

Nasim Adnan; Zahidul Islam

A decision forest is an ensemble of decision trees, and it is often built to discover more patterns (i.e. logic rules) and predict/classify class values more accurately than a single decision tree. Existing decision forest algorithms typically build huge numbers of decision trees, involving large memory and computational overhead, in order to achieve high accuracy. Generally, many of the trees do not contribute to improving the ensemble accuracy of a forest. As a result, ensemble pruning algorithms aim to discard those trees while generating a subforest, in order to achieve higher (or comparable) ensemble accuracy than the original forest. The objectives are twofold: to select as few trees as possible, and to keep the ensemble accuracy of the subforest as high as possible. An optimal subforest can be found by exhaustive search; however, this is not practical for any standard-sized forest, as the number of candidate subforests grows exponentially. In order to avoid the computational burden of an exhaustive search, many greedy and genetic algorithm-based subforest selection techniques have been proposed in the literature. In this paper, we propose a subforest selection technique that achieves small size as well as high accuracy. We use a genetic algorithm where we carefully select high quality individual trees for the initial population of the genetic algorithm in order to improve the final output of the algorithm. Experiments are conducted on 20 data sets from the UCI Machine Learning Repository to compare the proposed technique with several existing state-of-the-art techniques. The results indicate that the proposed technique can select effective subforests which are significantly smaller than the original forests while achieving better (or comparable) accuracy.
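The fitness a subforest is judged by, its majority-vote ensemble accuracy, can be sketched as follows (an illustrative evaluation function only, not the paper's GA; tree outputs are represented as precomputed per-tree prediction lists):

```python
def ensemble_accuracy(predictions, labels, subset):
    """Majority-vote accuracy of the subforest given by `subset`
    (a list of indices into the per-tree prediction lists)."""
    correct = 0
    for i, y in enumerate(labels):
        votes = [predictions[t][i] for t in subset]
        # most common vote wins (ties broken arbitrarily)
        if max(set(votes), key=votes.count) == y:
            correct += 1
    return correct / len(labels)
```

A GA-based pruner would encode `subset` as a bit string per individual and search for a small subset whose accuracy matches or beats that of the full forest.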


Availability, Reliability and Security | 2010

Communal Reputation and Individual Trust (CRIT) in Wireless Sensor Networks

Tanveer A. Zia; Zahidul Islam

Deployment of wireless sensor networks in sensitive applications such as healthcare, defence, habitat monitoring and early bushfire detection requires careful consideration. These networks are prone to security attacks due to the nature of their wireless communication and deployment. It is very likely that after deployment of the network, sensor nodes are left unattended, which causes serious security concerns. Insecure wireless communication aggravates the inherent vulnerabilities of wireless sensor networks. Several countermeasures have been proposed in the literature to counter the threats posed by attacks on sensor networks; however, security does not come for free. For resource-limited nodes in particular, it is very costly to deploy computationally intensive security solutions. This paper studies the notion of trust in wireless sensor networks and proposes a solution based on communal reputation and individual trust (CRIT) in sensor nodes. The viability of the proposed solution is demonstrated through simulation results and performance analysis.

Collaboration


Dive into Zahidul Islam's collaborations.

Top Co-Authors

Nasim Adnan (Charles Sturt University)
Anisur Rahman (Charles Sturt University)
Geaur Rahman (Charles Sturt University)
Sam Fletcher (Charles Sturt University)