
Publication


Featured research published by Toru Hisamitsu.


Conference on Current Trends in Theory and Practice of Informatics (SOFSEM) | 2000

Information Access Based on Associative Calculation

Akihiko Takano; Yoshiki Niwa; Shingo Nishioka; Makoto Iwayama; Toru Hisamitsu; Osamu Imaichi; Hirofumi Sakurai

Statistical measures of similarity have been widely used in textual information retrieval for decades. They form the basis for improving the effectiveness of IR systems in retrieval, clustering, and summarization. We have developed an information retrieval system, DualNAVI, which provides users with rich interaction in both document space and word space. We show that associative calculation for measuring similarity among documents or words is the computational basis of this effective information access with DualNAVI. New approaches to document clustering (Hierarchical Bayesian Clustering) and to measuring term representativeness (the baseline method) are also discussed. Both have a sound mathematical basis and depend essentially on associative calculation.
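The abstract does not spell out the calculation itself, but the core idea of associative (similarity-based) retrieval can be sketched with bag-of-words vectors and cosine similarity. All names and the toy corpus below are illustrative, not DualNAVI's actual implementation:

```python
from collections import Counter
from math import sqrt

def word_vector(doc):
    """Represent a document as a sparse bag-of-words frequency vector."""
    return Counter(doc.split())

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

docs = ["information retrieval and clustering",
        "retrieval of textual information",
        "hierarchical bayesian clustering of documents"]
vecs = [word_vector(d) for d in docs]

# Rank the other documents by similarity to the first one
ranking = sorted(range(1, len(docs)), key=lambda i: cosine(vecs[0], vecs[i]), reverse=True)
```

The same vector machinery applies symmetrically in "word space": a word can be represented by the documents (or co-occurring words) it appears with, which is what makes the dual document/word navigation possible.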


International Conference on Computational Linguistics (COLING) | 2002

A measure of term representativeness based on the number of co-occurring salient words

Toru Hisamitsu; Yoshiki Niwa

We propose a novel measure of the representativeness (i.e., indicativeness or topic specificity) of a term in a given corpus. The measure embodies the idea that the distribution of words co-occurring with a representative term should be biased with respect to the word distribution of the whole corpus. The bias is quantified as the number of distinct words whose occurrences among the co-occurring words are saliently skewed. The saliency of a word is defined by a threshold probability that can be determined automatically from the whole corpus. Comparative evaluation showed that the measure is clearly superior to conventional measures at finding topic-specific words in newspaper archives of different sizes.
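A minimal sketch of this idea, counting distinct saliently over-represented co-occurring words. The salience test here (a simple rate ratio against a fixed threshold) is an illustrative stand-in for the paper's automatically derived threshold probability:

```python
from collections import Counter

def cooccurring_counts(corpus_sents, term):
    """Counts of words in sentences that contain `term` (its co-occurrence set)."""
    co = Counter()
    for sent in corpus_sents:
        words = sent.split()
        if term in words:
            co.update(w for w in words if w != term)
    return co

def representativeness(corpus_sents, term, threshold=2.0):
    """Number of distinct words saliently over-represented among the
    co-occurrences of `term`, relative to their whole-corpus rate."""
    whole = Counter(w for s in corpus_sents for w in s.split())
    total = sum(whole.values())
    co = cooccurring_counts(corpus_sents, term)
    co_total = sum(co.values())
    salient = 0
    for w, c in co.items():
        expected = whole[w] / total * co_total  # count expected under no bias
        if c > threshold * expected:
            salient += 1
    return salient
```

A topic-specific term attracts many words whose counts in its co-occurrence set far exceed their corpus-wide expectation, so its salient-word count is high; a general term's co-occurrences mirror the corpus distribution and yield few salient words.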


International Conference on Computational Linguistics (COLING) | 2000

A method of measuring term representativeness: baseline method using co-occurrence distribution

Toru Hisamitsu; Yoshiki Niwa; Jun’ichi Tsujii

This paper introduces a scheme, which we call the baseline method, for defining measures of term representativeness, together with measures defined using the scheme. The representativeness of a term is measured by a normalized characteristic value defined over the set of all documents that contain the term. Normalization is done by comparing the original characteristic value with the characteristic value of a randomly chosen document set of the same size; the latter is estimated by a baseline function obtained through random sampling and log-linear approximation. We found that the distance between the word distribution in a document set and the word distribution in the whole corpus is an effective characteristic value for the baseline method. Measures defined by the baseline method have several advantages: they can compare the representativeness of two terms with very different frequencies, and they have well-defined thresholds for being representative. In addition, the baseline function is robust against differences in corpora; that is, it can be used for normalization in a corpus of a different size or domain.
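The normalization step can be sketched as follows. This is a toy version under stated assumptions: total-variation distance stands in for whatever distance the paper uses, and the baseline is estimated by direct Monte Carlo sampling rather than the paper's log-linear fit:

```python
import random
from collections import Counter

def word_dist(docs):
    """Relative word-frequency distribution over a set of documents."""
    c = Counter(w for d in docs for w in d.split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def dist(p, q):
    """Total-variation distance between two word distributions (illustrative)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in keys)

def baseline(corpus, size, trials=30, seed=0):
    """Expected distance for a random document set of the same size,
    estimated by sampling (the paper fits a log-linear curve instead)."""
    rng = random.Random(seed)
    whole = word_dist(corpus)
    vals = [dist(word_dist(rng.sample(corpus, size)), whole) for _ in range(trials)]
    return sum(vals) / len(vals)

def normalized_representativeness(corpus, term):
    """Characteristic value for D(term), normalized by the random baseline."""
    d_t = [doc for doc in corpus if term in doc.split()]
    whole = word_dist(corpus)
    return dist(word_dist(d_t), whole) / baseline(corpus, len(d_t))
```

Because the score is a ratio against what a random same-size document set would produce, terms of very different frequencies become directly comparable, and a value near 1 gives a natural threshold for "no more representative than chance".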


International Conference on Computational Linguistics (COLING) | 1996

Analysis of Japanese compound nouns by direct text scanning

Toru Hisamitsu; Yoshihiko Nitta

This paper analyzes word dependency structure in compound nouns appearing in Japanese newspaper articles. The analysis is difficult because such compound nouns can be quite long, have no word boundaries between their constituent nouns, and often contain unregistered words such as abbreviations. The lack of segmentation and the unregistered words cause initial segmentation errors that lead to erroneous analysis. This paper presents a corpus-based approach that scans a corpus with a set of pattern matchers and gathers co-occurrence examples to analyze compound nouns. It employs a bootstrapping search to cope with unregistered words: when an unregistered word is found while searching for examples, it is recorded and triggers additional searches for examples containing it. This makes it possible to correct initial over-segmentation errors and leads to higher accuracy. The accuracy of the method is evaluated on compound nouns of lengths 5, 6, 7, and 8, and compared against a baseline.


Lecture Notes in Computer Science | 2002

Measuring Term Representativeness

Toru Hisamitsu; Jun’ichi Tsujii

This report introduces several measures of term representativeness and a scheme, called the baseline method, for defining such measures. The representativeness of a term T is measured by a normalized characteristic value indicating the bias of the distribution of words in D(T), the set of all documents that contain the term. Dist(D(T)), the distance between the distribution of words in D(T) and that in the whole corpus, was found, after normalization, to be an effective characteristic value for this bias. Experiments showed that the measure based on the normalized value of Dist(D(∙)) strongly outperforms existing measures in evaluating the representativeness of terms in newspaper articles. Combined with term frequency, the measure was also effective for automatically extracting terms from abstracts of papers on artificial intelligence.


International Conference on Computational Linguistics (COLING) | 1994

An efficient treatment of Japanese verb inflection for morphological analysis

Toru Hisamitsu; Yoshihiko Nitta

Because of its apparent simplicity, Japanese verb inflection has rarely been treated seriously. In this paper we reconsider traditional lexical treatments of Japanese verb inflection and propose a new treatment based on newly devised segmenting units. We show that the proposed treatment minimizes the number of lexical entries and avoids useless segmentation. It requires 20 to 40% less chart-parsing computation and is also suitable for error correction in optical character readers.


International Conference on Document Analysis and Recognition (ICDAR) | 1995

Optimal techniques in OCR error correction for Japanese texts

Toru Hisamitsu; Katsumi Marukawa; Yoshihiro Shima; Hiromichi Fujisawa; Yoshihiko Nitta

This paper investigates three fundamental techniques in OCR error correction for Japanese texts using morphological analysis: (1) an optimal method for candidate word extraction from a candidate character lattice, (2) optimal word entries for Japanese verb inflection analysis, and (3) a new method of word-matching cost calculation that is better suited to linguistic criteria. Comparative evaluation shows that the combination of these techniques requires 84% less computation, captures 2.6% more candidate words, reduces chart-parsing computation by 20%, and attains a 25% higher error-correction rate than a commonly used method.


Systems and Computers in Japan | 1995

A generalized algorithm for Japanese morphological analysis and a comparative evaluation of some heuristics

Toru Hisamitsu; Yoshihiko Nitta

In ordinary written Japanese, words are not separated by spaces, so morphological analysis involves both segmenting and tagging sentences. Since each sentence has a huge number of possible tagged segmentations, various criteria have been proposed for making plausible decisions. However, there is still no unified framework that incorporates the various heuristics, and there has been no comparative evaluation of the commonly used ones. This paper presents a clear framework for describing various heuristics and an N-best algorithm for extracting optimal solutions. The time complexity of the algorithm is O(nN log2(1 + N)), where n is the sentence length. The advantage of the N-best algorithm over the standard beam-search algorithm is also discussed. The paper also presents a comparative evaluation of three major heuristics and proposes a precise and portable rule-based heuristic. Evaluation was done using the aforementioned algorithm and six criteria. The newly proposed heuristic is based on the Extended Least Bunsetsu (Phrase) Number method.
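The N-best search over unsegmented text can be illustrated with a toy dynamic program over a word lattice. The lexicon and its costs below are hypothetical stand-ins for the paper's heuristic criteria; this sketch is not the paper's algorithm and does not match its stated complexity:

```python
import heapq

def n_best_segmentations(sentence, lexicon, n=3):
    """Enumerate the n lowest-cost segmentations of an unsegmented string.
    `lexicon` maps words to costs; plausibility criteria such as
    least-bunsetsu heuristics would replace these toy costs."""
    L = len(sentence)
    # best[i] holds up to n (cost, segmentation) hypotheses ending at position i
    best = [[] for _ in range(L + 1)]
    best[0] = [(0.0, [])]
    for i in range(L):
        for cost, seg in best[i]:
            for j in range(i + 1, L + 1):
                w = sentence[i:j]
                if w in lexicon:
                    heapq.heappush(best[j], (cost + lexicon[w], seg + [w]))
        # prune: keep only the n cheapest hypotheses at the next position
        best[i + 1] = heapq.nsmallest(n, best[i + 1])
    return sorted(best[L])[:n]
```

Keeping n hypotheses per position rather than a single best one is what distinguishes N-best decoding from greedy segmentation, and unlike a fixed-width beam it guarantees the n globally cheapest complete segmentations under this additive cost model.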


Archive | 2003

Document search method and system, and document search result display system

Makoto Iwayama; Yoshiki Niwa; Shingo Nishioka; Toru Hisamitsu; Osamu Imaichi


Archive | 2001

Document retrieval system; method of document retrieval; and search server

Akihiko Takano; Toru Hisamitsu; Makoto Iwayama; Osamu Imaichi; Shingo Nishioka

Collaboration


Toru Hisamitsu's top co-authors.

Top Co-Authors

Masakazu Fujio

Nara Institute of Science and Technology
