Verayuth Lertnattee
Silpakorn University
Publication
Featured research published by Verayuth Lertnattee.
Information Sciences | 2004
Verayuth Lertnattee; Thanaruk Theeramunkong
Most traditional text categorization approaches use term frequency (tf) and inverse document frequency (idf) to represent the importance of words and/or terms in classifying a text document. This paper describes an approach that applies term distributions, in addition to tf and idf, to improve the performance of centroid-based text categorization. Three types of term distributions, called inter-class, intra-class and in-collection distributions, are introduced. These distributions help increase classification accuracy by exploiting information on (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to the weight of each term in documents, e.g., whether a high term distribution of a word promotes or demotes the importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performance is investigated and compared to a standard centroid-based classifier (TFIDF) and a centroid-based classifier modified with information gain. Moreover, we also compare them to two well-known methods: k-NN and naive Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions in improving classification accuracy is explored with regard to the training set size and the number of classes.
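For orientation, the standard centroid-based TFIDF baseline that the abstract compares against can be sketched roughly as follows. This is a minimal illustration only: the function names, the use of log(N/df) for idf, and cosine normalization are assumptions, and the inter-class, intra-class and in-collection term distributions proposed in the paper are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def normalize(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def train_centroids(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in docs for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in df}  # assumed idf form
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        vec = normalize({t: tf * idf[t] for t, tf in Counter(tokens).items()})
        for t, w in vec.items():
            sums[label][t] += w
    # One prototype (centroid) vector per class.
    centroids = {label: normalize(vec) for label, vec in sums.items()}
    return centroids, idf

def classify(tokens, centroids, idf):
    """Assign the class whose centroid has the highest dot product with the query."""
    query = normalize({t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()})
    def score(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in query.items())
    return max(centroids, key=score)

# Toy data, invented for illustration.
docs = [["drug", "dose", "tablet"], ["drug", "dose", "dose"],
        ["goal", "match", "team"], ["team", "score", "goal"]]
labels = ["med", "med", "sport", "sport"]
centroids, idf = train_centroids(docs, labels)
print(classify(["dose", "tablet"], centroids, idf))
```

A term-distribution weighting of the kind the abstract describes would replace or extend the `tf * idf[t]` factor; the rest of the pipeline would stay the same under these assumptions.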
international symposium on communications and information technologies | 2004
Verayuth Lertnattee; Thanaruk Theeramunkong
Most previous work on text categorization applied term occurrence frequency and inverse document frequency to represent the importance of terms. This work presents an analysis of inverse class frequency in centroid-based text categorization. The paper has two aims. The first is to find appropriate functions of inverse class frequency. The other is to find the key factors for using inverse class frequency. The experimental results show that the key factors that improve classification accuracy are the numbers of few-class terms and most-class terms. When large numbers of few-class terms and most-class terms are obtained, the logarithmic function of inverse class frequency is the most effective when it is combined with term frequency. The square root of inverse class frequency, incorporated into TFIDF, works well when data sets include a small number of few-class terms and most-class terms. Several methods can increase the numbers of these effective terms, i.e., using higher-gram models, a small number of classes and a large number of training sets.
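As a rough illustration of the two functions mentioned in the abstract, inverse class frequency can be computed by analogy with idf, counting classes instead of documents. The exact definitions used in the paper are not given here, so the forms log(|C|/cf) and sqrt(|C|/cf) below, and all function names, are assumptions.

```python
import math
from collections import defaultdict

def class_frequency(docs, labels):
    """Return ({term: number of classes containing it}, total number of classes)."""
    classes_with_term = defaultdict(set)
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            classes_with_term[t].add(label)
    cf = {t: len(classes) for t, classes in classes_with_term.items()}
    return cf, len(set(labels))

def icf_log(cf, n_classes):
    """Logarithmic inverse class frequency (assumed form)."""
    return math.log(n_classes / cf)

def icf_sqrt(cf, n_classes):
    """Square-root inverse class frequency (assumed form)."""
    return math.sqrt(n_classes / cf)

# Toy data: "the" occurs in every class, "dose" in only one.
docs = [["the", "dose"], ["the", "goal"], ["the", "vote"]]
labels = ["med", "sport", "politics"]
cf, n_classes = class_frequency(docs, labels)
for term in ("the", "dose"):
    print(term, icf_log(cf[term], n_classes), icf_sqrt(cf[term], n_classes))
```

Under these assumed forms, a most-class term such as "the" gets a logarithmic weight of zero, while a few-class term such as "dose" gets the largest weight, which matches the intuition behind the few-class and most-class terms discussed in the abstract.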
Information Sciences | 2006
Verayuth Lertnattee; Thanaruk Theeramunkong
Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection have quite different sizes and/or the numbers of documents in classes are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some considered a simple kind of class normalization, so-called class-length normalization, to solve the unbalancedness problem. However, there is no intensive work that clarifies how these normalizations affect classification performance and whether there are any other useful normalizations. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits term distribution among documents in the class. The experimental results show that a classifier with the weight-merge-normalize approach (class-length normalization) performs better than one with the weight-normalize-merge approach (document-length normalization) for data sets with unbalanced numbers of documents in classes, and is quite competitive for those with balanced numbers of documents. Among normalization functions, the normalization based on term weighting performs better than the others on average. Term-length normalization is useful for improving classification accuracy. The combination of term- and class-length normalizations outperforms pure class-length normalization, pure term-length normalization and no normalization, with gaps of 4.29%, 11.50% and 30.09%, respectively.
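The difference between the two orderings named in the abstract can be sketched as follows. The sketch assumes cosine (Euclidean-length) normalization and already-weighted document vectors; the function names are hypothetical, and term-length normalization is not shown because the abstract does not define it.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

def merge(vectors):
    """Sum sparse vectors term by term."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return total

def weight_merge_normalize(doc_vectors):
    """Class-length normalization: merge the weighted documents, then normalize once."""
    return l2_normalize(merge(doc_vectors))

def weight_normalize_merge(doc_vectors):
    """Document-length normalization: normalize each document, then merge."""
    return merge(l2_normalize(v) for v in doc_vectors)

# One short and one long document of the same class (toy weights).
short_doc = {"a": 1.0}
long_doc = {"b": 10.0}
print(weight_merge_normalize([short_doc, long_doc]))
print(weight_normalize_merge([short_doc, long_doc]))
```

In the first ordering the long document dominates the class prototype, while in the second each document contributes equally regardless of its length.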
international conference on computational linguistics | 2002
Thanaruk Theeramunkong; Verayuth Lertnattee
This paper proposes a multi-dimensional framework for classifying text documents. In this framework, the concept of a multi-dimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multi-dimensional category model classifies each text document in a collection using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multi-dimensional model can be converted to flat and hierarchical models, three classification strategies are possible, i.e., classifying directly based on the multi-dimensional model and classifying with the equivalent flat or hierarchical models. The efficiency of these three classifications is investigated on two data sets. Using k-NN, naive Bayes and centroid-based classifiers, the experimental results show that the multi-dimensional-based and hierarchical-based classifications perform better than the flat-based classification.
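One way to picture the conversion from a multi-dimensional category model to an equivalent flat model is to take the Cartesian product of the category sets. This is only a sketch under that assumption; the dimension names and categories below are invented, and the paper's own conversion procedure may differ.

```python
from itertools import product

def to_flat(dimensions):
    """Build flat classes as every combination of one category per dimension."""
    names = list(dimensions)
    return [combo for combo in product(*(dimensions[n] for n in names))]

def combine_predictions(per_dimension, dimension_order):
    """Join independent per-dimension predictions into one flat class."""
    return tuple(per_dimension[n] for n in dimension_order)

# Hypothetical two-dimensional category model.
dimensions = {
    "topic": ["sports", "politics"],
    "region": ["asia", "europe", "america"],
}
flat_classes = to_flat(dimensions)
print(len(flat_classes))  # 2 * 3 combinations
print(combine_predictions({"topic": "sports", "region": "asia"}, list(dimensions)))
```

Classifying directly in the multi-dimensional model means training one classifier per dimension, each with few classes, whereas the flat model needs a single classifier over all combinations, so each flat class tends to have fewer training documents.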
international conference of the ieee engineering in medicine and biology society | 2004
Verayuth Lertnattee; Thanaruk Theeramunkong
This paper proposes a multidimensional model for classifying drug information text documents. The concept of a multidimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multidimensional category model classifies each document using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multidimensional model can be converted to flat and hierarchical models, three classification approaches are possible, i.e., classifying directly based on the multidimensional model and classifying with the equivalent flat or hierarchical models. The efficiency of these three approaches is investigated using a drug information collection with two different dimensions: 1) drug topics and 2) primary therapeutic classes. In the experiments, k-nearest neighbor, naïve Bayes, and two centroid-based methods are selected as classifiers. The three classification approaches are compared using two-way analysis of variance, followed by Scheffé's test for post hoc comparison. The experimental results show that multidimensional-based classification performs better than the others, especially in the presence of a relatively small training set. As one application, a category-based search engine using the multidimensional category concept was developed to help users retrieve drug information.
computer and information technology | 2009
Verayuth Lertnattee; Sinthop Chomya; Thanaruk Theeramunkong; Virach Sornlertlamvanich
Knowledge about herbal medicine can be contributed by experts from several cultures. With conventional techniques, it is hard to find a way in which the experts can build a self-sustainable community for exchanging their information. In this paper, the Knowledge Unifying Initiator for Herbal Information (KUIHerb) is used as a platform for building a web community that collects intercultural herbal knowledge using the concept of collective intelligence. With this system, herb identification, herbal vocabulary and medicinal usages can be collected. KUIHerb provides herbal vocabulary which is dynamically and confidentially applied to improve searching on the Thai herbal search engine. Three strategies are utilized: (1) providing a set of technical terms in Thai which can be added into the dictionary; these terms are utilized by Thai word segmentation to improve the indexing process; (2) building a set of synonyms of these technical terms in both Thai and English to help users who search with different keywords for the same term; and (3) combining a set of keywords from herbal usages with the name keyword. The results show that information collected from KUIHerb is useful for searching.
advanced information networking and applications | 2008
Verayuth Lertnattee; Thanaruk Theeramunkong
Automatic text classification for Web collections is a non-trivial task. Since Thai academic Web pages usually present technical articles, they may have many technical terms in both Thai and English. This paper presents two approaches to the problem of a large number of unique terms in a Web page: 1) term weighting schemes and 2) schemes using Web link information. We propose an approach using inverse class frequency instead of inverse document frequency in centroid-based text categorization. Web link information provides information for users to follow to another part or page. It adds useful unique terms for classification. The experimental results show that inverse class frequency is useful on a set of Thai academic Web documents, which is categorized by sources (sites) of information. It should be applied to both prototype and query vectors. Moreover, Web link information proves useful when inverse class frequency is also applied.
international symposium on computers and communications | 2002
Verayuth Lertnattee; Thanaruk Theeramunkong
Centroid-based text classification is one of the most popular supervised approaches to classifying texts into a set of pre-defined classes. Based on the vector-space model, the performance of this classification depends particularly on the way important terms in documents are weighted and selected for constructing a prototype class vector for each class. In the past, it was shown that term weighting using statistical term distributions could improve classification accuracy. However, for different data sets, the best weighting systems are different. To address this problem, we propose a method that uses homogeneous centroid-based classification. The effectiveness of this approach is explored using four data sets. Two main factors are taken into account: model selection and score combination. The experimental results show that our system can improve classification accuracy by up to 7.5-8.5% compared to the k-NN classifier, 3.7-4.0% compared to the naive Bayes classifier and 1.6-2.7% over the best single-model classification method (p<0.05).
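The score-combination step can be pictured as merging the per-class scores produced by several centroid-based models that differ only in their term weighting. The abstract does not state which combination rule is used, so the simple averaging below, along with the function names and the example scores, is an assumption for illustration.

```python
def combine_scores(model_scores):
    """Average per-class scores over models; every model must score the same classes."""
    classes = model_scores[0].keys()
    n_models = len(model_scores)
    return {c: sum(scores[c] for scores in model_scores) / n_models for c in classes}

def predict(model_scores):
    """Pick the class with the highest combined score."""
    combined = combine_scores(model_scores)
    return max(combined, key=combined.get)

# Hypothetical similarity scores from three differently weighted centroid models.
scores = [
    {"med": 0.60, "sport": 0.40},
    {"med": 0.30, "sport": 0.70},
    {"med": 0.35, "sport": 0.65},
]
print(predict(scores))
```

Model selection, the other factor named in the abstract, would decide which weighting schemes contribute to `model_scores` in the first place.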
Proceedings of the 2009 international workshop on Intercultural collaboration | 2009
Verayuth Lertnattee; Kergrit Robkob; Virach Sornlertlamvanich
Traditional knowledge about herbal medicine can be contributed by several cultures. With conventional techniques, it is hard to find a way in which experts can build a self-sustainable community for exchanging their knowledge. To alleviate the problem of gathering intellectual herbal information from different cultures, the Knowledge Unifying Initiator for Herbal Information (KUIHerb) is used as a platform for building a web community that collects intercultural herbal knowledge. KUIHerb supports contributing images, local names, parts used, indications, methods of preparation, precautions including toxicity, and additional information. In cases where multiple opinions are provided, a popular vote selects the most preferable term used in the community. Herb identification, herbal vocabulary, a list of experts in herbal medicine and multicultural knowledge can be collected from this system.
international conference on knowledge-based and intelligent information and engineering systems | 2003
Verayuth Lertnattee; Thanaruk Theeramunkong
Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection have quite different sizes. In the past, normalization involved only document- or class-length normalization. In this paper, we propose a new type of normalization called term-length normalization, which considers term distribution in a class. The performance of this normalization is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization and (3) with summing weight normalization. The results suggest that our term-length normalization is useful for improving classification accuracy in all cases.