Publication


Featured research published by Manoranjan Dash.


Intelligent Data Analysis | 1997

Feature Selection for Classification

Manoranjan Dash; Huan Liu

Feature selection has been a focus of interest for quite some time, and much work has been done. With the creation of huge databases and the consequent requirements for good machine learning techniques, new problems arise and novel approaches to feature selection are in demand. This survey is a comprehensive overview of many existing methods from the 1970s to the present. It identifies four steps of a typical feature selection method, categorizes the existing methods in terms of generation procedures and evaluation functions, and reveals hitherto unattempted combinations of the two. Representative methods are chosen from each category for detailed explanation and discussion via example. Benchmark datasets with different characteristics are used for a comparative study. The strengths and weaknesses of different methods are explained, and guidelines for applying feature selection methods are given based on data types and domain characteristics. The survey identifies future research areas in feature selection, introduces newcomers to the field, and paves the way for practitioners searching for suitable methods to solve domain-specific real-world applications.
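
The four-step structure the survey identifies (subset generation, subset evaluation, a stopping criterion, and validation of the result) can be sketched as a generic loop. The sketch below is illustrative only: the greedy forward generation and the toy scoring function are stand-ins, not methods from the paper.

```python
def feature_select(features, evaluate, max_iters=100):
    """Generic feature selection loop: generation -> evaluation ->
    stopping criterion; validation of the chosen subset is left to the caller."""
    best_subset, best_score = frozenset(), float("-inf")
    for _ in range(max_iters):                     # stopping criterion: budget
        # generation: grow the current subset by one candidate feature
        candidates = [best_subset | {f} for f in features if f not in best_subset]
        if not candidates:
            break
        winner = max(candidates, key=evaluate)     # evaluation function
        if evaluate(winner) <= best_score:         # stopping criterion: no gain
            break
        best_subset, best_score = winner, evaluate(winner)
    return best_subset

# toy evaluation function: reward features 0 and 2, penalize subset size
score = lambda s: (0 in s) + (2 in s) - 0.1 * len(s)
print(feature_select(range(5), score))             # frozenset({0, 2})
```

Swapping in different generation procedures and evaluation functions yields the combinations the survey catalogues.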


Data Mining and Knowledge Discovery | 2002

Discretization: An Enabling Technique

Huan Liu; Farhad Hussain; Chew Lim Tan; Manoranjan Dash

Discrete values play an important role in data mining and knowledge discovery. They represent intervals of numbers, which are more concise to represent and specify, and easier to use and comprehend, as they are closer to a knowledge-level representation than continuous values. Many studies show that induction tasks can benefit from discretization: rules with discrete values are normally shorter and more understandable, and discretization can lead to improved predictive accuracy. Furthermore, many induction algorithms in the literature require discrete features. All this prompts researchers and practitioners to discretize continuous features before or during a machine learning or data mining task. Numerous discretization methods are available in the literature, and it is time to examine these seemingly different methods to find out how different they really are, what the key components of a discretization process are, and how we can improve the current level of research for new development as well as the use of existing methods. This paper aims at a systematic study of discretization methods, covering their history of development, their effect on classification, and the trade-off between speed and accuracy. The contributions of this paper are an abstract description summarizing existing discretization methods, a hierarchical framework that categorizes the existing methods and paves the way for further development, concise discussions of representative discretization methods, extensive experiments and their analysis, and guidelines on how to choose a discretization method under various circumstances. We also identify some issues yet to be solved and directions for future research on discretization.
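
As a concrete instance of the kind of method such a framework categorizes, here is a minimal unsupervised discretizer (equal-width binning). It is a generic textbook method, not one proposed by the paper.

```python
import numpy as np

def equal_width_discretize(x, k):
    """Map a continuous feature to k interval indices of equal width,
    one of the simplest unsupervised splitting methods."""
    edges = np.linspace(x.min(), x.max(), k + 1)   # k+1 edges -> k intervals
    # digitize against the interior edges yields indices 0..k-1 for every value
    return np.digitize(x, edges[1:-1])

x = np.array([0.1, 0.4, 1.2, 3.5, 3.9, 7.8])
print(equal_width_discretize(x, 3))                # [0 0 0 1 1 2]
```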


Systems, Man and Cybernetics | 2007

Wrapper–Filter Feature Selection Algorithm Using a Memetic Framework

Zexuan Zhu; Yew-Soon Ong; Manoranjan Dash

This correspondence presents a novel hybrid wrapper and filter feature selection algorithm for classification using a memetic framework. It incorporates a filter ranking method into a traditional genetic algorithm to improve classification performance and accelerate the search for core feature subsets. In particular, the method adds or deletes a feature from a candidate feature subset based on univariate feature ranking information. An empirical study on commonly used datasets from the University of California, Irvine (UCI) repository and on microarray datasets shows that the proposed method outperforms existing methods in terms of classification accuracy, number of selected features, and computational efficiency. Furthermore, we investigate several major issues of memetic algorithms (MAs) to identify a good balance between local search and genetic search, so as to maximize search quality and efficiency in the hybrid filter and wrapper MA.
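
A hedged sketch of the kind of filter-guided local search described above: within a memetic step, the operator tries adding the highest-ranked excluded feature and deleting the lowest-ranked included one, keeping whichever moves the wrapper score accepts. The ranking values and scoring function below are toy stand-ins, not the paper's.

```python
def memetic_local_search(subset, ranking, evaluate):
    """Improve one GA chromosome with filter-guided add/delete moves."""
    best, best_score = set(subset), evaluate(subset)
    excluded = sorted(set(ranking) - set(subset), key=ranking.get, reverse=True)
    included = sorted(subset, key=ranking.get)
    if excluded:                                   # add move: best-ranked outsider
        cand = best | {excluded[0]}
        if evaluate(cand) > best_score:
            best, best_score = cand, evaluate(cand)
    if included:                                   # delete move: worst-ranked member
        cand = best - {included[0]}
        if evaluate(cand) > best_score:
            best, best_score = cand, evaluate(cand)
    return best

ranking = {0: 0.9, 1: 0.1, 2: 0.7, 3: 0.05}        # toy univariate filter scores
score = lambda s: sum(ranking[f] for f in s) - 0.2 * len(s)  # toy wrapper score
print(memetic_local_search({1, 2}, ranking, score))  # {0, 2}
```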


Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2000

Feature Selection for Clustering

Manoranjan Dash; Huan Liu

Clustering is an important data mining task. Data mining often concerns large, high-dimensional data, but unfortunately most clustering algorithms in the literature are sensitive to data size, dimensionality, or both. Different features affect clusters differently: some are important for forming clusters, while others may hinder the clustering task. An efficient way of handling this is to select a subset of important features, which helps in finding clusters efficiently, understanding the data better, and reducing data size for efficient storage, collection, and processing. The task of finding important original features for unsupervised data is largely untouched: traditional feature selection algorithms work only for supervised data, where class information is available. For unsupervised data, without class information, principal components (PCs) are often used instead, but PCs still require all original features and can be difficult to interpret. In our approach, features are first ranked according to their importance for clustering, and then a subset of important features is selected; for large data we use a scalable sampling-based method. Empirical evaluation shows the effectiveness and scalability of our approach on benchmark and synthetic datasets.
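
The rank-then-select pipeline with sampling might look like the sketch below. The variance score is a deliberately simple stand-in for the paper's importance measure (which is entropy-based); the sampling step illustrates the scalability idea.

```python
import numpy as np

def rank_then_select(X, importance, k, sample_size=None, seed=0):
    """Rank features by an importance score computed on (a sample of) the
    unlabeled data, then keep the top k."""
    rng = np.random.default_rng(seed)
    if sample_size is not None and sample_size < len(X):
        X = X[rng.choice(len(X), sample_size, replace=False)]  # scalability
    scores = [importance(X, j) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]            # indices of the top-k features

# stand-in importance score: per-feature variance (for illustration only)
variance = lambda X, j: X[:, j].var()
X = np.random.default_rng(1).normal(size=(1000, 5)) * [1, 5, 1, 3, 1]
print(rank_then_select(X, variance, k=2, sample_size=200))  # [1 3]
```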


Fuzzy Sets and Systems | 2000

Entropy-based fuzzy clustering and fuzzy modeling

Jun Yao; Manoranjan Dash; S. T. Tan; Huan Liu

Fuzzy clustering can find the vague boundaries that crisp clustering fails to capture. However, the time complexity of fuzzy clustering is usually high, and the need to specify complicated parameters hinders its use. In this paper, an entropy-based fuzzy clustering method is proposed that automatically identifies the number and initial locations of cluster centers. It calculates the entropy at each data point and selects the point with minimum entropy as the first cluster center, then removes all data points whose similarity to that center exceeds a threshold. This process is repeated until all data points are removed. Unlike previous methods of its kind, it does not need to revise the entropy value of each data point after a cluster center is determined, which saves considerable time, and it requires just two parameters that are easy to specify. It is able to find the natural clusters in the data. The clustering method is also extended to construct a rule-based fuzzy model, with a new way of estimating initial membership functions for fuzzy sets. Experimental results show that the fuzzy model predicts output variable values well.
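
A minimal sketch of the center-selection loop described above, assuming a similarity of the form S_ij = exp(-beta * d_ij); beta and the threshold value are assumptions, not the paper's settings. Note that the entropy array E is computed once and never revised, which is exactly the time saving the abstract highlights.

```python
import numpy as np

def entropy_cluster_centers(X, beta=1.0, sim_threshold=0.5):
    """Repeatedly pick the minimum-entropy point as the next cluster center
    and remove all points similar to it."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    S = np.clip(np.exp(-beta * d), 1e-12, 1 - 1e-12)   # pairwise similarity
    # entropy of each point against all others; computed once, never revised
    E = -np.sum(S * np.log2(S) + (1 - S) * np.log2(1 - S), axis=1)
    centers, active = [], np.arange(len(X))
    while active.size:
        c = active[np.argmin(E[active])]           # min-entropy point = center
        centers.append(X[c])
        active = active[S[active, c] <= sim_threshold]  # drop similar points
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
print(entropy_cluster_centers(X, beta=0.5))        # typically two centers, near 0 and 5
```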


Pattern Recognition | 2007

Markov blanket-embedded genetic algorithm for gene selection

Zexuan Zhu; Yew-Soon Ong; Manoranjan Dash

Microarray technologies enable quantitative, simultaneous monitoring of expression levels for thousands of genes under various experimental conditions, providing a new way of classifying biological samples on a genome-wide scale. However, predictive accuracy is affected by the presence of thousands of genes, many of which are unnecessary from the classification point of view. A key issue in microarray data classification is therefore to identify the smallest possible set of genes that achieves good predictive accuracy. In this study, we propose a novel Markov blanket-embedded genetic algorithm (MBEGA) for the gene selection problem. In particular, the embedded Markov blanket-based memetic operators add or delete features (genes) from a genetic algorithm (GA) solution so as to quickly improve the solution and fine-tune the search. Empirical results on synthetic and microarray benchmark datasets suggest that MBEGA is effective and efficient at eliminating irrelevant and redundant features based on both the Markov blanket and predictive power in the classifier model. A detailed comparison with filter, wrapper, and standard GA methods shows that MBEGA gives the best compromise among four evaluation criteria: classification accuracy, number of selected genes, computational cost, and robustness.
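
A simplified illustration of a Markov blanket-style delete operator: a selected gene is dropped when a more relevant selected gene is so strongly correlated with it that the latter approximately subsumes it. The correlation matrix, relevance scores, and threshold below are toy assumptions, not the paper's operators or data.

```python
import numpy as np

def mb_delete(subset, corr, relevance, redundancy=0.9):
    """Drop a selected gene when a more relevant selected gene is correlated
    with it strongly enough to act as its approximate Markov blanket."""
    kept = set(subset)
    for g in sorted(subset, key=relevance.get):    # consider weakest genes first
        if any(h != g and abs(corr[g][h]) >= redundancy
               and relevance[h] >= relevance[g] for h in kept):
            kept.discard(g)                        # g is subsumed by some h
    return kept

corr = np.array([[1.0, 0.95, 0.1],                 # toy gene-gene correlations
                 [0.95, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
relevance = {0: 0.8, 1: 0.6, 2: 0.7}               # toy univariate relevance
print(mb_delete({0, 1, 2}, corr, relevance))       # {0, 2}: gene 1 is subsumed
```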


International Conference on Tools with Artificial Intelligence | 1997

Dimensionality reduction of unsupervised data

Manoranjan Dash; Huan Liu; Jun Yao

Dimensionality reduction is an important problem for the efficient handling of large databases. Many feature selection methods exist for supervised data, where class information is available; little work has been done on dimensionality reduction for unsupervised data, where it is not. Principal component analysis (PCA) is often used, but PCA creates new features, and it is difficult to gain an intuitive understanding of the data from the new features alone. We are concerned with the problem of determining and choosing important original features for unsupervised data. Our method is based on the observation that removing an irrelevant feature from the feature set may not change the underlying concept of the data, whereas removing a relevant one will. We propose an entropy measure for ranking features and conduct extensive experiments showing that our method is able to find the important features. It also compares well with a similar feature ranking method (Relief) which, unlike our method, requires class information.
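
A sketch of a similarity-based entropy ranking in this spirit, under the assumed similarity S_ij = exp(-alpha * d_ij): dataset entropy is low when points form tight clusters, and a feature is ranked as important when dropping it changes that entropy most. The exact measure is the paper's; this is an illustrative reconstruction, and alpha is an assumption.

```python
import numpy as np

def dataset_entropy(X, alpha=0.5):
    """Similarity-based entropy: low for well-clustered data, high for
    uniformly scattered data (similarity form and alpha are assumptions)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    S = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return -np.sum(S * np.log2(S) + (1 - S) * np.log2(1 - S))

def rank_features(X, alpha=0.5):
    """A feature is important if removing it changes dataset entropy most."""
    base = dataset_entropy(X, alpha)
    deltas = [abs(base - dataset_entropy(np.delete(X, j, axis=1), alpha))
              for j in range(X.shape[1])]
    return np.argsort(deltas)[::-1]                # most important first

rng = np.random.default_rng(0)
clusters = np.repeat([[0.0], [4.0]], 30, axis=0) + rng.normal(0, 0.2, (60, 1))
noise = rng.uniform(0, 4, (60, 2))                 # two irrelevant features
X = np.hstack([clusters, noise])
print(rank_features(X))                            # feature 0 should rank first
```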


Data Warehousing and Knowledge Discovery | 2008

Efficient K-Means Clustering Using Accelerated Graphics Processors

S. A. Shalom; Manoranjan Dash; Minh Tue

We exploit the parallel architecture of the Graphics Processing Unit (GPU) found in desktop computers to efficiently implement the traditional K-means algorithm. Our approach avoids transferring data and cluster information between the GPU and CPU between iterations. In this paper we present the novelties of our approach and the techniques employed to represent data, compute distances and centroids, and identify cluster elements on the GPU, measuring performance as computational time per iteration. Our implementation of K-means clustering is 4 to 12 times faster than the CPU on an Nvidia 5900 graphics processor and 7 to 22 times faster on an Nvidia 8500, for various data sizes. We also achieved speed gains of 12 to 64 times on the 5900 and 20 to 140 times on the 8500 in computational time per iteration for evaluations with various cluster sizes.
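
The per-iteration structure being accelerated is the standard K-means loop below (plain NumPy, not the paper's GPU code). The two bulk operations, all-pairs distances and per-cluster means, are what map naturally onto GPU threads, and keeping X and the centroids resident in device memory across iterations is the transfer-avoidance point the abstract makes.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Textbook K-means; a GPU version parallelizes the two bulk steps below
    while keeping X and the centroids in device memory throughout."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assignment: all point-to-centroid distances in one bulk operation
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # update: recompute each centroid from its assigned points
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels
```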


European Conference on Machine Learning | 1998

A monotonic measure for optimal feature selection

Huan Liu; Hiroshi Motoda; Manoranjan Dash

Feature selection is the problem of choosing a subset of relevant features. In general, only an exhaustive search can guarantee the optimal subset; with a monotonic measure, however, exhaustive search can be avoided without sacrificing optimality. Unfortunately, most error- or distance-based measures are not monotonic. This work employs a new measure that is monotonic and fast to compute, and the search for relevant features according to this measure is guaranteed to be complete but not exhaustive. Experiments are conducted for verification.
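
The abstract does not name the measure, but a classic monotonic measure from this line of work is the inconsistency rate, sketched below as an illustration: rows that agree on every selected feature yet disagree in class are inconsistent, and the rate can only stay equal or grow as features are removed, which is what lets a complete search prune safely.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Fraction of rows that match another row on the selected features but
    carry a different class label; monotonic under feature removal."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[j] for j in subset)][y] += 1
    # each group contributes its size minus its majority-class count
    return sum(sum(c.values()) - max(c.values())
               for c in groups.values()) / len(rows)

rows = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = ["a", "a", "b", "b"]
print(inconsistency_rate(rows, labels, [0, 1]))    # 0.0
print(inconsistency_rate(rows, labels, [0]))       # 0.0 (feature 1 irrelevant)
print(inconsistency_rate(rows, labels, [1]))       # 0.5 (dropping feature 0 hurts)
```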


Database Systems for Advanced Applications | 2001

'1+1>2': merging distance and density based clustering

Manoranjan Dash; Huan Liu; Xiaowei Xu

Clustering is an important data exploration task whose use in data mining is growing rapidly. Traditional clustering algorithms, which no longer meet data mining requirements, are increasingly being modified. Numerous clustering algorithms exist and can be divided into several categories, two prominent ones being distance-based and density-based (e.g., K-means and DBSCAN, respectively). While K-means is fast, easy to implement, and converges to a local optimum almost surely, it is also easily affected by noise. Density-based clustering, on the other hand, can find clusters of arbitrary shape and handles noise well, but it is comparatively slow because of the neighborhood search performed for each data point, and setting the density threshold properly is difficult. We propose BRIDGE, which efficiently merges the two, exploiting the advantages of each to counter the limitations of the other. BRIDGE enables DBSCAN to handle very large data efficiently and improves the quality of K-means clusters by removing noisy points; it also helps the user set the density threshold parameter properly. We further show that other clustering algorithms can be merged using a similar strategy; as an example, the paper merges BIRCH clustering with DBSCAN.
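
One way to realize such a merge (a simplified sketch, not the paper's exact BRIDGE algorithm): use K-means to coarsely partition the data, then run DBSCAN inside each partition so the expensive neighborhood search touches far fewer points, with DBSCAN's noise flag cleaning up the K-means clusters. Clusters spanning partition boundaries would need an extra merge pass, which this sketch omits; all parameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def bridge_like(X, k=10, eps=0.5, min_samples=5):
    """K-means partitions the data; DBSCAN refines each partition and
    flags noise (-1) instead of letting it pollute the clusters."""
    part = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    labels, next_id = np.full(len(X), -1), 0
    for p in range(k):
        idx = np.where(part == p)[0]
        if len(idx) < min_samples:
            continue                               # tiny partitions stay noise
        sub = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for c in set(sub) - {-1}:                  # remap local cluster ids
            labels[idx[sub == c]] = next_id
            next_id += 1
    return labels                                  # -1 marks noise
```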

Collaboration


Dive into Manoranjan Dash's collaborations.

Top Co-Authors

Huan Liu, Arizona State University
Willie Ng, Nanyang Technological University
S. A. Arul Shalom, Nanyang Technological University
Minh Tue, National University of Singapore
Jun Yao, National University of Singapore
Liang-Tien Chia, Nanyang Technological University
Surong Wang, Nanyang Technological University
Vivekanand Gopalkrishnan, Nanyang Technological University