John D. Holt
Wright State University
Publication
Featured research published by John D. Holt.
Data and Knowledge Engineering | 2008
Yanjun Li; Soon Myoung Chung; John D. Holt
Most existing text clustering algorithms use the vector space model, which treats documents as bags of words. Thus, word sequences in the documents are ignored, while the meaning of natural languages strongly depends on them. In this paper, we propose two new text clustering algorithms, named Clustering based on Frequent Word Sequences (CFWS) and Clustering based on Frequent Word Meaning Sequences (CFWMS). A word is the word form appearing in the document, and a word meaning is the concept expressed by synonymous word forms. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The frequent word (meaning) sequences can provide compact and valuable information about those text documents. For experiments, we used the Reuters-21578 text collection, CISI documents of the Classic data set [Classic data set, ftp://ftp.cs.cornell.edu/pub/smart/], and a corpus of the Text Retrieval Conference (TREC) [High Accuracy Retrieval from Documents (HARD) Track of Text Retrieval Conference, 2004]. Our experimental results show that CFWS and CFWMS have much better clustering accuracy than Bisecting k-means (BKM) [M. Steinbach, G. Karypis, V. Kumar, A Comparison of Document Clustering Techniques, KDD-2000 Workshop on Text Mining, 2000], a modified bisecting k-means using background knowledge (BBK) [A. Hotho, S. Staab, G. Stumme, Ontologies improve text document clustering, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 541-544] and Frequent Itemset-based Hierarchical Clustering (FIHC) [B.C.M. Fung, K. Wang, M. Ester, Hierarchical document clustering using frequent itemsets, in: Proceedings of SIAM International Conference on Data Mining, 2003] algorithms.
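The frequency criterion in this abstract can be illustrated with a minimal sketch. It is not the CFWS algorithm itself: it assumes, for simplicity, that a word sequence is a contiguous run of n words, and the function name `frequent_word_sequences` and the parameters `n` and `min_support` are hypothetical names chosen for illustration.

```python
from collections import Counter


def frequent_word_sequences(docs, n=2, min_support=0.5):
    """Return word n-grams that occur in more than min_support of the documents.

    Assumption: a "word sequence" is treated as a contiguous n-gram, and
    documents are tokenized by whitespace after lowercasing.
    """
    doc_freq = Counter()
    for doc in docs:
        words = doc.lower().split()
        # Count each n-gram at most once per document (document frequency).
        seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(seen)
    threshold = min_support * len(docs)
    # "Frequent" means strictly more than the given fraction of documents.
    return {seq: count for seq, count in doc_freq.items() if count > threshold}


docs = [
    "stock prices rise sharply",
    "stock prices fall",
    "oil prices rise sharply",
]
result = frequent_word_sequences(docs, n=2, min_support=0.5)
```

With these three toy documents, only the two-word sequences that appear in at least two documents pass the 50% threshold.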
Information Processing Letters | 2002
John D. Holt; Soon Myoung Chung
In this paper, we propose a new algorithm named Inverted Hashing and Pruning (IHP) for mining association rules between items in transaction databases. The performance of the IHP algorithm was evaluated for various cases and compared with that of two well-known mining algorithms, the Apriori algorithm [Proc. 20th VLDB Conf., 1994, pp. 487-499] and the Direct Hashing and Pruning algorithm [IEEE Trans. on Knowledge Data Engrg. 9 (5) (1997) 813-825]. It has been shown that the IHP algorithm has better performance for databases with long transactions.
Knowledge and Information Systems | 2001
John D. Holt; Soon Myoung Chung
In this paper, we propose two new algorithms for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., words) that need to be counted. Two well-known mining algorithms, the Apriori algorithm and the Direct Hashing and Pruning (DHP) algorithm, are evaluated in the context of mining text databases, and are compared with the newly proposed algorithms named Multipass-Apriori (M-Apriori) and Multipass-DHP (M-DHP). It has been shown that the proposed algorithms have better performance for large text databases.
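As background for the baseline mentioned above, the following is a minimal sketch of level-wise (Apriori-style) mining of frequent word sets, treating each document as a transaction of distinct words. It is a generic illustration, not M-Apriori or M-DHP; the function name `apriori_word_sets` and the parameters `min_count` and `max_size` are hypothetical.

```python
from itertools import combinations


def apriori_word_sets(docs, min_count=2, max_size=3):
    """Level-wise search for word sets contained in at least min_count documents."""
    # Each document becomes a transaction: the set of its distinct words.
    transactions = [set(d.lower().split()) for d in docs]

    # Level 1: count single words.
    counts = {}
    for t in transactions:
        for w in t:
            key = frozenset([w])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent and k <= max_size:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Counting step: one scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


docs = [
    "data mining text",
    "text mining rules",
    "data mining rules text",
]
result = apriori_word_sets(docs, min_count=2, max_size=3)
```

Even in this toy example, four distinct words yield six candidate pairs; with a realistic vocabulary the number of candidates to count grows quickly, which is the cost the abstract refers to.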
Conference on Information and Knowledge Management | 1999
John D. Holt; Soon Myoung Chung
In this paper, we propose two new algorithms for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., words) that need to be counted. Two well-known mining algorithms, the Apriori algorithm and the Direct Hashing and Pruning (DHP) algorithm, are evaluated in the context of mining text databases, and are compared with the newly proposed algorithms named Multipass-Apriori (M-Apriori) and Multipass-DHP (M-DHP). It has been shown that the proposed algorithms have better performance for large text databases.
International Parallel and Distributed Processing Symposium | 2004
John D. Holt; Soon Myoung Chung
We propose a new algorithm named Parallel Multipass with Inverted Hashing and Pruning (PMIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., sets of words) that need to be counted. The new PMIHP algorithm is a parallel version of our Multipass with Inverted Hashing and Pruning (MIHP) algorithm, which was shown to be more efficient than other existing algorithms in the context of mining text databases. The PMIHP algorithm reduces the overhead of communication between miners running on different processors because they mine local databases asynchronously and prune the global candidates by using the inverted hashing and pruning technique.
International Conference on Tools with Artificial Intelligence | 2002
John D. Holt; Soon Myoung Chung
In this paper, we propose a new algorithm named Multipass with Inverted Hashing and Pruning (MIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., words) that need to be counted. Two well-known mining algorithms, the Apriori algorithm and the Direct Hashing and Pruning (DHP) algorithm, are evaluated in the context of mining text databases, and are compared with the proposed MIHP algorithm. It has been shown that the MIHP algorithm performs better for large text databases.
International Conference on Tools with Artificial Intelligence | 2007
John D. Holt; Soon Myoung Chung; Yanjun Li
In this paper, we evaluated the efficacy of mined association rules between words for measuring the similarity between documents to enhance text retrieval. In our experiments, for each document relevant to a query, we formed a group of documents having at least one frequent set of words in common with the answer document. Then we measured the precision of the documents in the same group as an answer set to the corresponding query. This experiment was performed using a corpus of the Text Retrieval Conference (TREC) and search results. Our experimental results show that the frequent sets of words mined from our test database are useful in ranking query result sets to improve the precision of retrieval.
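The grouping and precision measurement described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' experimental code: the function names, the toy documents, and the choice to exclude the answer document from its own group are all hypothetical.

```python
def documents_sharing_frequent_set(answer_id, doc_words, frequent_sets):
    """Return ids of other documents that contain at least one frequent
    word set also contained in the answer document."""
    # Frequent word sets that the answer document itself contains.
    answer_sets = [fs for fs in frequent_sets if fs <= doc_words[answer_id]]
    return {
        doc_id
        for doc_id, words in doc_words.items()
        # Assumption: the answer document is not counted in its own group.
        if doc_id != answer_id and any(fs <= words for fs in answer_sets)
    }


def precision(group, relevant_ids):
    """Fraction of the grouped documents that are relevant to the query."""
    if not group:
        return 0.0
    return len(group & relevant_ids) / len(group)


# Toy data: document id -> set of distinct words (hypothetical).
doc_words = {
    "d1": {"oil", "prices", "rise"},
    "d2": {"oil", "prices", "fall"},
    "d3": {"football", "match", "result"},
    "d4": {"prices", "rise", "again"},
}
frequent_sets = [frozenset({"oil", "prices"}), frozenset({"prices", "rise"})]
relevant_ids = {"d1", "d2"}  # documents judged relevant to the query

group = documents_sharing_frequent_set("d1", doc_words, frequent_sets)
group_precision = precision(group, relevant_ids)
```

Here the group for answer document `d1` contains `d2` and `d4`, of which only `d2` is relevant, so the precision of the group is 0.5.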
Data Warehousing and Knowledge Discovery | 2000
John D. Holt; Soon Myoung Chung
In this paper, we propose a new algorithm named Inverted Hashing and Pruning (IHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., words) that need to be counted. Two well-known mining algorithms, the Apriori algorithm [1] and the Direct Hashing and Pruning (DHP) algorithm [5], are evaluated in the context of mining text databases, and are compared with the proposed IHP algorithm. It has been shown that the IHP algorithm has better performance for large text databases.
The Journal of Supercomputing | 2007
John D. Holt; Soon Myoung Chung
Archive | 2003
Soon Myoung Chung; John D. Holt