Takashi Ishida | Researchain

Publication

Featured research published by Takashi Ishida.

systems, man and cybernetics | 2011

A proposal of extended cosine measure for distance metric learning in text classification

Kenta Mikawa; Takashi Ishida; Masayuki Goto

This paper discusses a new similarity measure between documents in a vector space model from the viewpoint of distance metric learning. Documents are represented as points in the vector space using the frequencies of the words appearing in each document. A similarity measure between two documents is useful for recognizing their relationship and can be applied to classification or clustering of the data. Usually, the cosine similarity and the Euclidean distance have been used to measure the similarity between points in the Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take into account the correlation among the words that appear in documents. Generally speaking, many words that appear in documents are correlated with one another depending on sentence structures, topics and subjects. Therefore, it is effective to build a suitable metric on the vector space that takes the correlation of words into consideration in order to improve the performance of document classification and clustering. This paper presents a new effective method to acquire a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed to acquire the proper metric from the viewpoint of supervised learning. The effectiveness of our proposal is clarified by simulation experiments on text classification problems for customer reviews posted on a web site and for newspaper articles.
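The idea of a cosine measure that accounts for word correlation can be illustrated with a minimal sketch. It assumes one common generalization, in which a symmetric positive definite matrix M replaces the identity in the inner product; the abstract does not give the paper's exact formulation, so the function name `extended_cosine`, the matrix `correlated` and the toy vectors are all hypothetical.

```python
import numpy as np

def extended_cosine(x, y, M):
    """Cosine-style similarity under a metric matrix M.

    Assumption: M is symmetric positive definite, so x @ M @ x > 0
    for every non-zero x. With M = I this is the ordinary cosine.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = x @ M @ y
    denominator = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return float(numerator / denominator)

# Toy term-frequency vectors over a 3-word vocabulary (hypothetical data).
doc_a = [2, 0, 1]
doc_b = [0, 3, 1]

identity = np.eye(3)
# Hypothetical metric that treats words 0 and 1 as strongly correlated.
correlated = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

plain = extended_cosine(doc_a, doc_b, identity)
weighted = extended_cosine(doc_a, doc_b, correlated)
print(f"plain cosine: {plain:.3f}, with correlated metric: {weighted:.3f}")
```

Here the two documents share little vocabulary, so the plain cosine is small, while the metric that links words 0 and 1 rates them as much more similar. In a metric learning setting, M would be fitted from labeled data rather than written by hand.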

systems, man and cybernetics | 2010

On a new model for automatic text categorization based on Vector Space Model

Makoto Suzuki; Naohide Yamagishi; Takashi Ishida; Masayuki Goto; Shigeichi Hirasawa

In our previous paper, we proposed a new classification technique called the Frequency Ratio Accumulation Method (FRAM). This is a simple technique that adds up the ratios of term frequencies among categories, and it is able to use index terms without limit. Then, we adopted the character N-gram to form index terms, thereby improving FRAM. However, FRAM did not have a satisfactory mathematical basis. Therefore, we present here a new mathematical model based on a “Vector Space Model” and consider its implications. The proposed method is evaluated by performing several experiments. In these experiments, we classify newspaper articles from the English Reuters-21578 data set and the Japanese CD-Mainichi 2002 data set using the proposed method. The Reuters-21578 data set is a benchmark data set for automatic text categorization. It is shown that FRAM has good classification accuracy. Specifically, the micro-averaged F-measure of the proposed method is 92.2% for English. The proposed method can perform classification using a single program, and it is language-independent.
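The one-sentence description of FRAM (adding up the ratios of term frequencies among categories, with character N-grams as index terms) can be sketched as follows. This is only one plausible reading of that sentence, not the authors' implementation: the functions `char_ngrams`, `train` and `classify`, the choice of n = 3 and the training sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(labeled_docs, n=3):
    """Map each n-gram to its frequency ratio in every category."""
    freq = defaultdict(Counter)  # n-gram -> Counter(category -> count)
    for text, category in labeled_docs:
        for term in char_ngrams(text, n):
            freq[term][category] += 1
    ratios = {}
    for term, counts in freq.items():
        total = sum(counts.values())
        ratios[term] = {c: k / total for c, k in counts.items()}
    return ratios

def classify(text, ratios, n=3):
    """Accumulate the ratios of the n-grams in the text; return the top category."""
    scores = Counter()
    for term in char_ngrams(text, n):
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    return scores.most_common(1)[0][0] if scores else None

# Hypothetical toy training data.
training = [
    ("stock prices rose sharply", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored two goals", "sports"),
]
model = train(training)
label = classify("interest rates and stock prices", model)
print(label)
```

Because the index terms are raw character N-grams, no tokenizer or dictionary is needed, which is consistent with the language independence claimed in the abstract.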

computer and information technology | 2007

Statistical Evaluation of Measure and Distance on Document Classification Problems in Text Mining

Masayuki Goto; Takashi Ishida; Shigeichi Hirasawa

This paper discusses document classification problems in text mining from the viewpoint of asymptotic statistical analysis. By formulating text mining as a statistical hypothesis testing problem, some interesting properties can be visualized. In text mining, several heuristics are applied in practical analysis because of their experimental effectiveness in many case studies. A theoretical explanation of the performance of text mining techniques is required, and this approach gives a clear picture. A distance measure in the word vector space is used to classify the documents. In this paper, the performance of the distance measure is also analyzed from the new viewpoint of asymptotic analysis.

international symposium on information theory and its applications | 2010

English and Taiwanese text categorization using N-gram based on Vector Space Model

Makoto Suzuki; Naohide Yamagishi; Yi Ching Tsai; Takashi Ishida; Masayuki Goto

In this paper, we present a new mathematical model based on a “Vector Space Model” and consider its implications. The proposed method is evaluated by performing several experiments. In these experiments, we classify newspaper articles from the English Reuters-21578 data set and the Taiwanese China Times 2005 data set using the proposed method. The Reuters-21578 data set is a benchmark data set for automatic text categorization. It is shown that FRAM has good classification accuracy. Specifically, the micro-averaged F-measure of the proposed method is 94.5% for English, but only 78.0% for Taiwanese. Though the proposed method is language-independent and provides a new perspective, our future work is to improve classification accuracy for Taiwanese.

international symposium on information theory and its applications | 2008

Asymptotic evaluation of distance measure on high dimensional vector spaces in text mining

Masayuki Goto; Takashi Ishida; Makoto Suzuki; Shigeichi Hirasawa

This paper discusses document classification problems in text mining from the viewpoint of asymptotic statistical analysis. In text mining, several heuristics are applied in practical analysis because of their experimental effectiveness in many case studies. A theoretical explanation of the performance of text mining techniques is required, and such an analysis gives a clear picture. In this paper, the performance of distance measures used to classify documents is analyzed from the new viewpoint of asymptotic analysis. We also discuss the asymptotic performance of the IDF measure used in the information retrieval field.
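For readers unfamiliar with the IDF measure mentioned above, a minimal sketch of the textbook definition, idf(t) = log(N / df(t)), is given here. The abstract does not say which IDF variant the paper analyzes, so the unsmoothed form, the function name `idf_weights` and the toy documents are assumptions.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Textbook inverse document frequency: log(N / df(t)).

    `documents` is a list of token lists; df(t) counts the documents
    that contain term t at least once.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical toy corpus.
docs = [
    ["text", "mining", "distance"],
    ["text", "classification"],
    ["text", "distance", "measure"],
]
weights = idf_weights(docs)
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

A term that occurs in every document gets weight zero, while rare terms get larger weights, which is the intuition behind using IDF to rescale coordinates of the word vector space.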

IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences | 2006

Properties of a Word-Valued Source with a Non-prefix-free Word Set

Takashi Ishida; Masayuki Goto; Toshiyasu Matsushima; Shigeichi Hirasawa

Recently, a word-valued source has been proposed as a new class of information source models. A word-valued source is regarded as a source with a probability distribution over a word set. Although a word-valued source is a nonstationary source in general, it has been proved that an entropy rate of the source exists and the Asymptotic Equipartition Property (AEP) holds when the word set of the source is prefix-free. However, when the word set is not prefix-free (non-prefix-free), only an upper bound on the entropy density rate for an i.i.d. word-valued source has been derived so far. In this paper, we newly derive a lower bound on the entropy density rate for an i.i.d. word-valued source with a finite non-prefix-free word set. Then some numerical examples are given in order to investigate the behavior of the bounds.
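The distinction between prefix-free and non-prefix-free word sets can be made concrete with a small sketch. It checks whether any word is a proper prefix of another and counts how many word sequences could have produced a given symbol sequence; with a non-prefix-free set the count can exceed one, which is why the emitted sequence no longer determines the underlying words. The function names and the example word sets are hypothetical and are not taken from the paper.

```python
def is_prefix_free(words):
    """True if no word in the set is a proper prefix of another word."""
    ordered = sorted(set(words))
    # In lexicographic order, a word that prefixes any other word
    # also prefixes its immediate successor.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def count_parses(sequence, words):
    """Number of ways to write `sequence` as a concatenation of words."""
    words = set(words)
    counts = [1] + [0] * len(sequence)  # counts[i]: parses of sequence[:i]
    for end in range(1, len(sequence) + 1):
        for word in words:
            start = end - len(word)
            if start >= 0 and sequence[start:end] == word:
                counts[end] += counts[start]
    return counts[-1]

prefix_free_set = {"a", "b"}
non_prefix_free_set = {"a", "b", "ab"}  # "a" is a prefix of "ab"

print(is_prefix_free(prefix_free_set), count_parses("aab", prefix_free_set))
print(is_prefix_free(non_prefix_free_set), count_parses("aab", non_prefix_free_set))
```

For the non-prefix-free set, the sequence "aab" can be read as a, a, b or as a, ab, so several word sequences contribute probability to the same output.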

international symposium on information theory and its applications | 2008

Refinement of index term set and improvement of classification accuracy on text categorization

Makoto Suzuki; Takashi Ishida; Masayuki Goto

In our previous paper, we proposed a new classification technique called the frequency ratio accumulation method (FRAM). This is a simple technique that adds up the ratios of term frequency among categories. In FRAM, the use of index terms is unlimited, so we adopt character N-grams as index terms, building on this characteristic of FRAM. In the present paper, we refine the database of the index term set using mutual information and frequency ratio, and improve the classification accuracy. Next, the proposed method is evaluated by performing several experiments. In these experiments, we classify newspaper articles from the English Reuters-21578 data set using FRAM. Reuters-21578 provides benchmark data for automatic text categorization. As a result, we show that the method has good classification accuracy. Specifically, the macro-averaged F-measure of the proposed method is 92.3% for Reuters-21578. Our method is language-independent, provides a new perspective and has excellent potential.
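Since the abstracts on this page report both micro-averaged and macro-averaged F-measures, a short sketch of the standard definitions may help when comparing the figures. The per-category counts below are invented for illustration and have nothing to do with the reported results; the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def micro_macro_f1(per_category):
    """per_category maps a category name to a (tp, fp, fn) tuple."""
    # Macro: average the per-category F1 scores, each category weighted equally.
    macro = sum(f1(*c) for c in per_category.values()) / len(per_category)
    # Micro: pool the counts over all categories, then compute a single F1.
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(tp, fp, fn), macro

# Hypothetical counts: one large, easy category and one small, hard category.
counts = {"large": (90, 5, 5), "small": (5, 5, 10)}
micro, macro = micro_macro_f1(counts)
print(f"micro-F1: {micro:.3f}, macro-F1: {macro:.3f}")
```

Micro-averaging is dominated by frequent categories, whereas macro-averaging gives small categories equal weight, so the two numbers are not directly comparable across papers.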

computer and information technology | 2007

Word Segmentation for the Sequences Emitted from a Word-Valued Source

Takashi Ishida; Toshiyasu Matsushima; Shigeichi Hirasawa

Word segmentation is the most fundamental and important process in Japanese or Chinese language processing. Because there is no separation between words in these languages, we first have to separate the sequence into words. For this problem, the approach based on a probabilistic language model is known to be highly efficient, and this has been shown in practice. On the other hand, a word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model can be supposed to reflect more of the probability structure of natural languages. We may regard a Japanese or Chinese sentence as a sequence emitted from a non-prefix-free WVS. In this paper, as the first phase of applying WVS to natural language processing, we formulate a word segmentation problem for sequences from a non-prefix-free WVS. Then, we examine the performance of word segmentation for the models by numerical computations.
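One way to picture word segmentation for a sequence from a non-prefix-free word set is a dynamic program that picks the most probable split under an i.i.d. word distribution. The abstract does not describe the paper's actual formulation, so this is a generic Viterbi-style sketch; the function `segment` and the word probabilities are hypothetical.

```python
import math

def segment(sequence, word_probs):
    """Most probable split of `sequence` into words drawn i.i.d. from word_probs.

    Returns a list of words, or None if the sequence cannot be
    written as a concatenation of words from the set.
    """
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[i]: max log-probability of sequence[:i]
    back = [None] * (n + 1)       # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for word, prob in word_probs.items():
            start = end - len(word)
            if start < 0 or best[start] == -math.inf:
                continue
            if sequence[start:end] == word:
                score = best[start] + math.log(prob)
                if score > best[end]:
                    best[end], back[end] = score, start
    if n > 0 and back[n] is None:
        return None
    words, end = [], n
    while end > 0:  # follow the back-pointers from the end of the sequence
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]

# Hypothetical word distribution over a non-prefix-free word set.
probs = {"a": 0.5, "b": 0.2, "ab": 0.3}
result = segment("aab", probs)
print(result)
```

For "aab" the split a, ab has probability 0.5 × 0.3 = 0.15, which beats a, a, b at 0.5 × 0.5 × 0.2 = 0.05, so the longer word is preferred in this example.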

Journal of Japan Industrial Management Association | 2013