Thanaruk Theeramunkong
Sirindhorn International Institute of Technology
Publication
Featured research published by Thanaruk Theeramunkong.
Information Sciences | 2004
Verayuth Lertnattee; Thanaruk Theeramunkong
Most traditional text categorization approaches use term frequency (tf) and inverse document frequency (idf) to represent the importance of words and/or terms when classifying a text document. This paper describes an approach that applies term distributions, in addition to tf and idf, to improve the performance of centroid-based text categorization. Three types of term distributions, called inter-class, intra-class and in-collection distributions, are introduced. These distributions help increase classification accuracy by exploiting information about (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to weighting each term in documents, e.g., whether a high distribution of a word promotes or demotes the importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performance is investigated and compared to a standard centroid-based classifier (TFIDF) and a centroid-based classifier modified with information gain. We also compare them to two well-known methods: k-NN and naive Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions in improving classification accuracy is examined with regard to training set size and the number of classes.
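A minimal sketch of how the three term distributions could feed into a centroid-based classifier is shown below. The variable names and the exact weighting formula are illustrative assumptions, not the authors' definitions; only the general idea (promote terms that vary across classes and are stable within a class) follows the abstract.

```python
# Sketch of term-distribution weighting for a centroid-based classifier.
# The weighting formula below is an assumed stand-in for the paper's weightings.
import numpy as np

def term_distributions(tf, labels):
    """tf: (n_docs, n_terms) term-frequency matrix; labels: class id per document."""
    classes = np.unique(labels)
    class_means = np.vstack([tf[labels == c].mean(axis=0) for c in classes])
    inter_class = class_means.std(axis=0)                 # spread of a term across classes
    intra_class = np.vstack([tf[labels == c].std(axis=0)
                             for c in classes])           # spread within each class
    in_collection = tf.std(axis=0)                        # spread over the whole collection
    return inter_class, intra_class, in_collection

def weighted_centroids(tf, idf, labels, inter_class, intra_class):
    classes = np.unique(labels)
    centroids = []
    for i, c in enumerate(classes):
        # promote terms that vary across classes but are stable within the class
        w = tf[labels == c].mean(axis=0) * idf * (1.0 + inter_class) / (1.0 + intra_class[i])
        centroids.append(w / (np.linalg.norm(w) + 1e-12))
    return np.vstack(centroids)
```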
international symposium on computers and communications | 2002
Kritsada Sriphaew; Thanaruk Theeramunkong
Generalized association rule mining is an extension of traditional association rule mining that discovers more informative rules, given a taxonomy. We describe a formal framework for the problem of mining generalized association rules. In this framework, the subset-superset and parent-child relationships among generalized itemsets are introduced to present two different views of generalized itemsets, i.e., the lattice of generalized itemsets and the taxonomies of k-generalized itemsets, respectively. We present an optimization technique that reduces computation time by applying two constraints, each of which corresponds to one of these views. For the mining process, a new set enumeration algorithm, named SET, is proposed. It utilizes these constraints to speed up the mining of all generalized frequent itemsets. Experiments on synthetic data show that SET outperforms the currently most efficient algorithm, Prutax, by an order of magnitude or more.
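The sketch below illustrates the basic idea of generalized itemsets under a taxonomy, with a pruning rule in the spirit of the subset-superset view (an itemset pairing an item with its own ancestor is redundant). The toy taxonomy and the brute-force counting are illustrative only; SET's set-enumeration tree is more involved.

```python
# Toy sketch of counting generalized 2-itemsets under a taxonomy (assumed example).
from itertools import combinations
from collections import Counter

taxonomy = {"cola": "drink", "beer": "drink", "chips": "snack"}   # child -> parent

def ancestors(item):
    out = []
    while item in taxonomy:
        item = taxonomy[item]
        out.append(item)
    return out

def extend(transaction):
    ext = set(transaction)
    for it in transaction:
        ext.update(ancestors(it))
    return ext

def frequent_generalized_itemsets(transactions, minsup, k=2):
    counts = Counter()
    for t in transactions:
        ext = extend(t)
        for iset in combinations(sorted(ext), k):
            # prune itemsets that pair an item with its own ancestor
            if any(a in iset for it in iset for a in ancestors(it)):
                continue
            counts[iset] += 1
    n = len(transactions)
    return {iset: c / n for iset, c in counts.items() if c / n >= minsup}

# frequent_generalized_itemsets([{"cola", "chips"}, {"beer", "chips"}], minsup=0.5)
```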
IEEE Transactions on Nanobioscience | 2013
Eakasit Pacharawongsakda; Thanaruk Theeramunkong
Predicting protein subcellular location is one of the major challenges in bioinformatics, since such knowledge helps us understand protein functions and enables us to select targeted proteins during the drug discovery process. While many computational techniques have been proposed to improve predictive performance for protein subcellular location, they have several shortcomings. In this work, we propose a method that addresses three main issues in such techniques: (i) manipulation of multiplex proteins, which may exist in or move between multiple cellular compartments, (ii) handling of high dimensionality in the input and output spaces, and (iii) the requirement of sufficient labeled data for model training. Toward these issues, this work presents a new computational method for predicting proteins that have either single or multiple locations. The proposed technique, named iFLAST-CORE, incorporates dimensionality reduction in the feature and label spaces with a co-training paradigm for semi-supervised multi-label classification. For this purpose, Singular Value Decomposition (SVD) is applied to transform the high-dimensional feature and label spaces into lower-dimensional spaces. Then, because labeled data are limited, co-training regression makes use of unlabeled data by predicting target values in the lower-dimensional spaces of the unlabeled data. In the last step, the SVD components are used to project labels in the lower-dimensional space back to the original space, and an adaptive threshold maps each numeric value to a binary value for label determination. A set of experiments on viral proteins and gram-negative bacterial proteins shows that our proposed method improves classification performance in terms of various evaluation metrics, such as Aiming (or Precision), Coverage (or Recall) and macro F-measure, compared to the traditional method that uses only labeled data.
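A rough sketch of the SVD-based label-space reduction and back-projection step follows; the co-training on unlabeled data is omitted for brevity. The ridge regressors and the fixed 0.5 threshold are simplifying assumptions (the paper uses an adaptive threshold).

```python
# Sketch: reduce the label space with SVD, regress in the reduced space,
# project predictions back and binarize.
import numpy as np
from sklearn.linear_model import Ridge

def fit_reduced_multilabel(X, Y, k):
    """X: (n_samples, n_features); Y: (n_samples, n_labels) binary label matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Z = Y @ Vt[:k].T                                  # labels projected to a k-dim space
    models = [Ridge().fit(X, Z[:, j]) for j in range(k)]
    return models, Vt[:k]

def predict_labels(models, Vk, X_new, threshold=0.5):
    Z_hat = np.column_stack([m.predict(X_new) for m in models])
    Y_hat = Z_hat @ Vk                                # back to the original label space
    return (Y_hat >= threshold).astype(int)
```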
international conference on human language technology research | 2001
Thanaruk Theeramunkong; Sasiporn Usanavasin
For languages without word boundary delimiters, dictionaries are needed to segment running texts. As a result, segmentation accuracy depends significantly on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, a great number of unknown or unrecognized words will appear, and these unrecognized words reduce segmentation accuracy. To solve this problem, we propose a method based on decision tree models. Without using a dictionary, specific information, called syntactic attributes, is applied to identify the structure of Thai words. C4.5 is used as the tool for this purpose. Using a Thai corpus, experimental results show that our method outperforms some well-known dictionary-dependent techniques, the maximum and longest matching methods, when no dictionary is available.
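The sketch below shows the general shape of dictionary-free boundary detection with a decision tree. The paper uses C4.5 on Thai-specific syntactic attributes; here scikit-learn's CART tree and a crude character-type window stand in as assumptions.

```python
# Sketch: classify each character position as word-boundary / non-boundary
# using a decision tree over a small window of character-type features.
from sklearn.tree import DecisionTreeClassifier

def char_type(c):
    if c == " ":     return 0
    if c.isdigit():  return 1
    if c.isalpha():  return 2
    return 3

def window_features(text, i, size=2):
    feats = []
    for j in range(i - size, i + size + 1):
        feats.append(char_type(text[j]) if 0 <= j < len(text) else -1)
    return feats

def train_segmenter(texts, boundary_sets):
    """boundary_sets[i] holds the positions in texts[i] where a new word starts."""
    X, y = [], []
    for text, bounds in zip(texts, boundary_sets):
        for i in range(len(text)):
            X.append(window_features(text, i))
            y.append(1 if i in bounds else 0)
    return DecisionTreeClassifier().fit(X, y)
```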
Proceedings of the fifth international workshop on on Information retrieval with Asian languages | 2000
Thanaruk Theeramunkong; Virach Sornlertlamvanich; Thanasan Tanhermhong; Wirat Chinnan
Some languages, including Thai, Japanese and Chinese, do not have explicit word boundaries. This causes word boundary ambiguity, which decreases the accuracy of information retrieval. This paper proposes a new technique, called character clustering, to reduce word boundary ambiguity in Thai documents and hence improve searching efficiency. To investigate this efficiency, a set of experiments using Thai newspapers is conducted with both non-indexing and indexing searching approaches. The experimental results show that our method outperforms the traditional methods in both approaches on all measures.
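A toy sketch of character clustering follows: Thai combining marks (tone marks, above/below vowels) are attached to the preceding character so they can never be separated from it. The mark set below is partial and given only for illustration; the authors' clustering rules are not reproduced here.

```python
# Toy character clustering: glue combining marks onto the previous cluster.
COMBINING = set("\u0e31\u0e33\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39"
                "\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c")   # partial, illustrative set

def character_clusters(text):
    clusters = []
    for ch in text:
        if ch in COMBINING and clusters:
            clusters[-1] += ch          # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters

# character_clusters("น้ำใส") -> ['น้ำ', 'ใ', 'ส']
```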
international symposium on communications and information technologies | 2004
Verayuth Lertnattee; Thanaruk Theeramunkong
Most previous works on text categorization applied term occurrence frequency and inverse document frequency to represent the importance of terms. This work presents an analysis of inverse class frequency in centroid-based text categorization. The paper has two aims. The first is to find appropriate functions of inverse class frequency. The other is to find the key factors for using inverse class frequency. The experimental results show that the key factors that improve classification accuracy are the numbers of few-class terms and most-class terms. When large numbers of few-class terms and most-class terms are available, the logarithmic function of inverse class frequency is the most effective when combined with term frequency. The square root of inverse class frequency, incorporated into TFIDF, works well when data sets include only small numbers of few-class terms and most-class terms. To increase the numbers of these effective terms, several methods can be applied, i.e., using higher-order gram models, a small number of classes and a large training set.
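The sketch below computes the inverse-class-frequency (ICF) variants compared above. The notation is assumed; only the logarithmic and square-root variants and their combination with term frequency follow the abstract.

```python
# Sketch of inverse class frequency (ICF) variants for term weighting.
import numpy as np

def icf_weights(tf_class, variant="log"):
    """tf_class: (n_classes, n_terms) term frequencies aggregated per class."""
    n_classes = tf_class.shape[0]
    cf = (tf_class > 0).sum(axis=0)            # class frequency of each term
    icf = n_classes / np.maximum(cf, 1)
    if variant == "log":
        return np.log(icf + 1.0)
    if variant == "sqrt":
        return np.sqrt(icf)
    return icf

# e.g. per-class term weights: w = tf_class * icf_weights(tf_class, "log")
```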
international conference on asian digital libraries | 2002
Thanaruk Theeramunkong; Chainat Wongtapan; Sukree Sinthupinyo
Many traditional works on offline Thai handwritten character recognition use a set of local features, including circles, concavity, endpoints and lines, to recognize hand-printed characters. However, in natural handwriting these local features are often missing due to fast writing, which dramatically reduces recognition accuracy. Instead of using such local features, this paper presents a method to extract features from handwritten characters using so-called multi-directional island-based projection. Two statistical recognition approaches, using an interpolated n-gram model (n-gram) and a hidden Markov model (HMM), are also proposed. The performance of our feature extraction and recognition methods is investigated using nearly 23,400 hand-printed and naturally written characters collected from 25 subjects. The results show that, in situations where local features are hard to detect, both the n-gram and HMM approaches achieve up to 96-99% accuracy for closed tests and 84-90% for open tests.
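A hedged sketch of a multi-directional projection feature is given below: each scan line (rows, columns and diagonals of a binary character image) is reduced to the number of contiguous ink runs ("islands") it crosses. The exact island definition and the set of directions used by the authors are not reproduced here; this is only one plausible reading of the abstract.

```python
# Sketch: count ink "islands" along rows, columns and diagonals of a binary image.
import numpy as np

def count_islands(line):
    line = np.asarray(line, dtype=bool)
    # an island starts wherever an ink pixel follows a background pixel
    return int(line[0]) + int(np.sum(line[1:] & ~line[:-1]))

def island_projection(img):
    img = np.asarray(img, dtype=bool)
    h, w = img.shape
    feats  = [count_islands(img[r, :]) for r in range(h)]                            # horizontal
    feats += [count_islands(img[:, c]) for c in range(w)]                            # vertical
    feats += [count_islands(np.diagonal(img, offset=o)) for o in range(-h + 1, w)]   # diagonal
    return np.array(feats)
```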
Information Sciences | 2006
Verayuth Lertnattee; Thanaruk Theeramunkong
Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor for improving the performance of a centroid-based classifier when documents in the text collection have quite different sizes and/or the numbers of documents in classes are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some considered a simple kind of class normalization, so-called class-length normalization, to solve the unbalancedness problem. However, no intensive work has clarified how these normalizations affect classification performance and whether there are any other useful normalizations. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalizations on several data sets, (2) to evaluate a number of commonly used normalization functions and (3) to introduce a new type of class normalization, called term-length normalization, which exploits the term distribution among documents in a class. The experimental results show that a classifier with the weight-merge-normalize approach (class-length normalization) performs better than one with the weight-normalize-merge approach (document-length normalization) for data sets with unbalanced numbers of documents in classes, and is quite competitive for those with balanced numbers of documents. Among the normalization functions, normalization based on term weighting performs better than the others on average. Term-length normalization is useful for improving classification accuracy, and the combination of term- and class-length normalizations outperforms pure class-length normalization, pure term-length normalization and no normalization, with gaps of 4.29%, 11.50% and 30.09%, respectively.
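The contrast between the two normalization orders can be sketched as below. The tf-idf weight matrix and the L2 (cosine) normalization are standard stand-ins; the paper also evaluates other normalization functions.

```python
# Sketch: document-length vs. class-length normalization of a class centroid.
import numpy as np

def l2norm(v, axis=None):
    return v / (np.linalg.norm(v, axis=axis, keepdims=axis is not None) + 1e-12)

def centroid_normalize_merge(W, labels, c):
    """weight-normalize-merge: normalize each document vector, then average
    (document-length normalization). W: (n_docs, n_terms) weight matrix."""
    docs = W[labels == c]
    return l2norm(docs, axis=1).mean(axis=0)

def centroid_merge_normalize(W, labels, c):
    """weight-merge-normalize: sum the raw class vector first, then normalize
    (class-length normalization)."""
    return l2norm(W[labels == c].sum(axis=0))
```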
Information Processing and Management | 2005
Thanaruk Theeramunkong; Chainat Wongtapan
Many traditional works on off-line Thai handwritten character recognition used a set of local features, including circles, concavity, endpoints and lines, to recognize hand-printed characters. However, in natural handwriting these local features are often missing due to rough or quick writing, resulting in a dramatic reduction in recognition accuracy. Instead of using such local features, this paper presents a method called multi-directional island-based projection to extract global features from handwritten characters. For recognition, two statistical approaches, namely the interpolated n-gram model (n-gram) and the hidden Markov model (HMM), are proposed. The experimental results indicate that the proposed scheme achieves high accuracy in the recognition of naturally written Thai characters with numerous variations, compared to some common previous feature extraction techniques. Another experiment with English characters also shows quite promising results.
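A minimal sketch of the interpolated n-gram side of the recognizer follows: each character class is scored by an interpolated bigram/unigram model over a quantized feature sequence, and the best-scoring class wins. The interpolation weight, smoothing and symbol alphabet are assumptions.

```python
# Sketch: interpolated bigram/unigram scoring of feature-symbol sequences.
from collections import Counter
import math

class InterpolatedBigram:
    def __init__(self, sequences, lam=0.7):
        self.lam = lam
        self.uni = Counter(s for seq in sequences for s in seq)
        self.bi = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        self.total = sum(self.uni.values())

    def logprob(self, seq):
        lp = 0.0
        for prev, cur in zip(seq, seq[1:]):
            p_bi = self.bi[(prev, cur)] / max(self.uni[prev], 1)
            p_uni = (self.uni[cur] + 1) / (self.total + len(self.uni))   # add-one smoothing
            lp += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
        return lp

def recognize(seq, models):
    """models: dict mapping character label -> trained InterpolatedBigram."""
    return max(models, key=lambda label: models[label].logprob(seq))
```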
international conference on computational linguistics | 2002
Thanaruk Theeramunkong; Verayuth Lertnattee
This paper proposes a multi-dimensional framework for classifying text documents. In this framework, the concept of a multi-dimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multi-dimensional category model classifies each text document in a collection using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multi-dimensional model can be converted to flat and hierarchical models, three classification strategies are possible, i.e., classifying directly based on the multi-dimensional model and classifying with the equivalent flat or hierarchical models. The efficiency of these three strategies is investigated on two data sets. Using k-NN, naive Bayes and centroid-based classifiers, the experimental results show that the multi-dimensional-based and hierarchical-based classifications perform better than the flat-based classification.
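The sketch below illustrates how a multi-dimensional category assignment can be flattened: each document carries one label per dimension, and the equivalent flat model treats the joint tuple as a single class. The dimension names and categories are illustrative only.

```python
# Sketch: flattening a multi-dimensional category model into a flat class set.
from itertools import product

dimensions = {"topic": ["sport", "politics"], "region": ["domestic", "foreign"]}  # assumed example

# equivalent flat category set = Cartesian product of the per-dimension categories
flat_classes = list(product(*dimensions.values()))
# [('sport', 'domestic'), ('sport', 'foreign'), ('politics', 'domestic'), ('politics', 'foreign')]

def to_flat_label(doc_labels):
    """doc_labels: e.g. {'topic': 'sport', 'region': 'foreign'} -> flat class tuple."""
    return tuple(doc_labels[d] for d in dimensions)

# Multi-dimensional classification trains one classifier per dimension and combines
# their outputs, whereas the flat model trains a single classifier over flat_classes.
```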