Congnan Luo | Researchain

Publication

Featured research published by Congnan Luo.

IEEE Transactions on Knowledge and Data Engineering | 2008

Text Clustering with Feature Selection by Using Statistical Data

Yanjun Li; Congnan Luo; Soon Myoung Chung

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the χ² statistic and new statistical data that can measure the positive term-category dependency. We also propose a new text clustering algorithm, named text clustering with feature selection (TCFS). TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, so that the clustering becomes a learning process. We compared TCFS and the K-means clustering algorithm in combination with different feature selection methods on various real data sets. Our experimental results show that TCFS with CHIR has better clustering accuracy in terms of the F-measure and the purity.
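As a rough illustration of the χ² statistic that CHIR builds on, the sketch below computes χ² for one term and one category from document counts, and flags whether the dependency is positive (the term occurs in the category more often than independence would predict). The function name and its count arguments are hypothetical, and this is the textbook statistic, not the CHIR measure defined in the paper.

```python
def chi2_term_category(n_tc, n_t, n_c, n):
    """Chi-square statistic for one term/category pair (hypothetical helper).

    n_tc: documents in the category that contain the term
    n_t:  documents that contain the term
    n_c:  documents in the category
    n:    total number of documents
    Returns (chi2, positive); positive is True when the observed
    co-occurrence count exceeds the count expected under independence.
    """
    # Observed 2x2 table: rows = term present/absent, columns = in/out of category.
    observed = [
        [n_tc, n_t - n_tc],
        [n_c - n_tc, n - n_t - n_c + n_tc],
    ]
    row_totals = [n_t, n - n_t]
    col_totals = [n_c, n - n_c]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected

    positive = n_tc > n_t * n_c / n
    return chi2, positive
```

For example, with 100 documents, a term in 50 of them, a category of 50 documents, and 40 documents in both, every expected cell is 25, so the call `chi2_term_category(40, 50, 50, 100)` reports a strong positive dependency.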

Data and Knowledge Engineering | 2009

Text document clustering based on neighbors

Congnan Luo; Yanjun Li; Soon Myoung Chung

Clustering is a very powerful data mining technique for topic discovery from text documents. Partitional clustering algorithms, such as the k-means family, are reported to perform well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345-366], to document clustering. If two documents are similar enough, they are considered neighbors of each other, and the link between two documents is the number of their common neighbors. Instead of considering only the pairwise similarity, neighbors and link bring global information into the measurement of the closeness of two documents. In this paper, we propose to use neighbors and link in the k-means family of algorithms in three ways: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure that combines the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that the proposed methods can significantly improve the accuracy of document clustering without increasing the execution time much.
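The neighbor and link concepts described above can be sketched in a few lines. This is a minimal illustration, assuming documents are given as term-weight vectors, that the similarity threshold `theta` is chosen by the user, and that every document counts as its own neighbor; the helper names are hypothetical and the paper's combined cosine-and-link measure is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def neighbor_matrix(docs, theta):
    """0/1 matrix: entry [i][j] is 1 when documents i and j are neighbors.

    Assumption: a document is always a neighbor of itself.
    """
    n = len(docs)
    return [
        [1 if i == j or cosine(docs[i], docs[j]) >= theta else 0
         for j in range(n)]
        for i in range(n)
    ]

def link(neighbors, i, j):
    """Number of common neighbors of documents i and j."""
    return sum(a * b for a, b in zip(neighbors[i], neighbors[j]))
```

With `docs = [[1, 0], [1, 1], [0, 1]]` and `theta = 0.7`, the first and third documents have cosine similarity 0, yet `link` reports one common neighbor (the second document), which is the kind of global information the pairwise cosine alone misses.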

Knowledge and Information Systems | 2008

A scalable algorithm for mining maximal frequent sequences using a sample

Congnan Luo; Soon Myoung Chung

In this paper, we propose an efficient scalable algorithm for mining Maximal Sequential Patterns using Sampling (MSPS). The MSPS algorithm prunes much more of the search space than other algorithms because both the subsequence infrequency-based pruning and the supersequence frequency-based pruning are applied. In MSPS, a sampling technique is used to identify long frequent sequences earlier, instead of enumerating all their subsequences. We show how to adjust the user-specified minimum support level for mining a sample of the database to achieve better overall performance. This method makes sampling more efficient when the minimum support is small. A signature-based method and a hash-based method are developed for the subsequence infrequency-based pruning when the seed set of frequent sequences for the candidate generation is too big to be loaded into memory. A prefix tree structure is developed to count the candidate sequences of different sizes during the database scanning, and it also facilitates the customer sequence trimming. Our experiments showed that MSPS has very good performance and better scalability than other algorithms.
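To make the notion of a maximal frequent sequence concrete, here is a minimal sketch under a simplifying assumption: each sequence is a list of single items rather than a list of itemsets. The helper names are hypothetical, and none of the MSPS machinery (sampling, signatures, hashing, prefix tree) is modeled.

```python
def is_subsequence(short, long):
    """True if `short` appears in `long` in order, not necessarily contiguously."""
    remaining = iter(long)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in short)

def support(pattern, database):
    """Number of database sequences that contain `pattern` as a subsequence."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))

def maximal_only(frequent):
    """Keep the frequent sequences that are not subsequences of another one."""
    return [
        p for p in frequent
        if not any(p != q and is_subsequence(p, q) for q in frequent)
    ]
```

Because every subsequence of a frequent sequence is itself frequent, finding a long frequent sequence early means its subsequences never need to be counted, which is the intuition behind the supersequence frequency-based pruning mentioned in the abstract.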

The Journal of Supercomputing | 2006

Distributed Mining of Maximal Frequent Itemsets on a Data Grid System

Congnan Luo; Anil L. Pereira; Soon Myoung Chung

In this paper, we propose a new algorithm, named Grid-based Distributed Max-Miner (GridDMM), for mining maximal frequent itemsets from databases on a Data Grid. A frequent itemset is maximal if none of its supersets is frequent. GridDMM is particularly suitable for use in Grid environments due to its low communication and synchronization overhead. GridDMM consists of a local mining phase and a global mining phase. During the local mining phase, each node mines its local database to discover the local maximal frequent itemsets, which then form a set of maximal candidate itemsets for the top-down search in the subsequent global mining phase. A new prefix-tree data structure is developed to facilitate the storage and counting of the global candidate itemsets of different sizes. We built a Data Grid system on a cluster of workstations using the open-source Globus Toolkit, and evaluated the GridDMM algorithm in terms of performance, scalability, and the overhead of communication and synchronization. GridDMM demonstrates better performance than other sequential and parallel algorithms, and its performance is scalable in terms of the database size and the number of nodes.
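The definition of a maximal frequent itemset can be illustrated with a brute-force sketch for toy data. It assumes transactions are Python sets and that support is an absolute count; the function names are hypothetical, and the exhaustive enumeration below is for illustration only, not the GridDMM search.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """All itemsets contained in at least `min_count` transactions (brute force)."""
    items = sorted(set().union(*transactions))
    found = []
    for size in range(1, len(items) + 1):
        level = [
            frozenset(combo)
            for combo in combinations(items, size)
            if sum(1 for t in transactions if set(combo) <= t) >= min_count
        ]
        if not level:
            # No frequent itemset of this size, so none of a larger size either.
            break
        found.extend(level)
    return found

def maximal(itemsets):
    """Keep the itemsets that have no proper superset in the collection."""
    return [s for s in itemsets if not any(s < other for other in itemsets)]
```

Lowering `min_count` tends to make larger itemsets frequent, which shrinks the maximal set: every subset of a maximal itemset is implied and does not need to be listed.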

Cluster Computing | 2015

A parallel text document clustering algorithm based on neighbors

Yanjun Li; Congnan Luo; Soon Myoung Chung

In this paper, we propose a new parallel algorithm for text document clustering based on the concept of neighbors (Guha et al. in Inf Syst 25(5):345–366, 2000). If two documents are similar enough, they are considered neighbors of each other. The new algorithm is named parallel k-means based on neighbors (PKBN), and it is a parallel version of sequential k-means based on neighbors (SKBN) that we proposed in Luo et al. (Data Knowl Eng 68(11):1271–1288, 2009). PKBN fully exploits the data-parallelism of SKBN and adopts a new parallel pair-generating method to build the neighbor matrix. Our new parallel pair-generating method causes less communication overhead between processors than existing methods. PKBN is designed for message-passing multiprocessor systems and is implemented on a cluster of Linux workstations to analyze its performance. Our experimental results on real-life data sets demonstrate that PKBN is very efficient and scales well with respect to the number of processors and the size of the data set.

International Conference on Tools with Artificial Intelligence | 2003

Parallel mining of maximal frequent itemsets from databases

Soon Myoung Chung; Congnan Luo

In this paper, we propose a parallel algorithm for mining maximal frequent itemsets from databases. A frequent itemset is maximal if none of its supersets is frequent. The new parallel algorithm is named parallel max-miner (PMM), and it is a parallel version of the sequential max-miner algorithm by R.J. Bayardo (1998). Most existing mining algorithms discover the frequent k-itemsets on the kth pass over the databases, and then generate the candidate (k + 1)-itemsets for the next pass. Compared to those level-wise algorithms, PMM looks ahead at each pass and prunes more candidate itemsets by checking the frequencies of their supersets. We implemented PMM on a cluster of workstations, and evaluated its performance for various cases. PMM demonstrated better performance than other sequential and parallel algorithms, and its performance is quite scalable, even when there are large maximal frequent itemsets (i.e., long patterns) in databases.
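The look-ahead idea can be shown in miniature: once a superset is known to be frequent, every candidate contained in it is also frequent and need not be counted on the next pass. The sketch below is a simplified illustration with a hypothetical function name; it does not reproduce the candidate-group structure of max-miner or the parallel scheme of PMM.

```python
def lookahead_prune(candidates, frequent_supersets):
    """Drop candidates that are already covered by a known frequent superset.

    Both arguments are collections of frozensets. A candidate contained in a
    frequent itemset is itself frequent, so counting it would be redundant.
    """
    return [
        c for c in candidates
        if not any(c <= s for s in frequent_supersets)
    ]
```

For instance, if `{a, b, c}` is already known to be frequent, the candidates `{a, b}` and `{b, c}` are pruned, while `{a, d}` still has to be counted.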

International Journal on Artificial Intelligence Tools | 2012

Weighted Naïve Bayes for Text Classification Using Positive Term-Class Dependency

Yanjun Li; Congnan Luo; Soon Myoung Chung

Naive Bayes is a simple and efficient classification algorithm that performs well on text classification, also known as text categorization. Much research has been done to improve the ...

SIAM International Conference on Data Mining | 2005

Efficient Mining of Maximal Sequential Patterns Using Multiple Samples

Congnan Luo; Soon Myoung Chung

The Journal of Supercomputing | 2012

Parallel mining of maximal sequential patterns using multiple samples

Congnan Luo; Soon Myoung Chung

Collaboration

Dive into Congnan Luo's collaborations.

Top Co-Authors

Soon Myoung Chung

Wright State University

Yanjun Li

Fordham University

Anil L. Pereira

Wright State University
