Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Masao Utiyama is active.

Publication


Featured research published by Masao Utiyama.


Meeting of the Association for Computational Linguistics | 2001

A Statistical Model for Domain-Independent Text Segmentation

Masao Utiyama; Hitoshi Isahara

We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.
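
As a rough illustration of the idea, the sketch below finds a maximum-probability segmentation by dynamic programming over sentence boundaries, scoring each candidate segment with a Laplace-smoothed unigram model estimated from the segment itself, so no training data is needed. The `penalty` term is a hypothetical stand-in for the paper's description-length prior, and the cost function only approximates the paper's exact probability model.

```python
import math
from collections import Counter

def seg_cost(words, vocab_size):
    # Negative log-probability of a segment under a Laplace-smoothed
    # unigram model estimated from the segment itself (no training data).
    counts = Counter(words)
    n = len(words)
    return sum(-c * math.log((c + 1) / (n + vocab_size))
               for c in counts.values())

def max_prob_segmentation(sentences, penalty=3.0):
    # sentences: list of token lists. DP over boundaries: best[j] is the
    # minimum cost of segmenting sentences[0:j]; each segment also pays a
    # fixed per-boundary penalty (a stand-in for the paper's prior).
    vocab_size = len({w for s in sentences for w in s})
    n = len(sentences)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            words = [w for s in sentences[i:j] for w in s]
            cost = best[i] + seg_cost(words, vocab_size) + penalty
            if cost < best[j]:
                best[j], back[j] = cost, i
    bounds, j = [], n            # recover segment boundaries
    while j > 0:
        bounds.append(j)
        j = back[j]
    return sorted(bounds)        # indices after which a segment ends
```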


Meeting of the Association for Computational Linguistics | 2003

Reliable Measures for Aligning Japanese-English News Articles and Sentences

Masao Utiyama; Hitoshi Isahara

We have aligned Japanese and English news articles and sentences to make a large parallel corpus. We first used a method based on cross-language information retrieval (CLIR) to align the Japanese and English articles and then used a method based on dynamic programming (DP) matching to align the Japanese and English sentences in these articles. However, the results included many incorrect alignments. To remove these, we propose two measures (scores) that evaluate the validity of alignments. The measure for article alignment uses similarities in sentences aligned by DP matching and that for sentence alignment uses similarities in articles aligned by CLIR. They enhance each other to improve the accuracy of alignment. Using these measures, we have successfully constructed a large-scale article and sentence alignment corpus available to the public.
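
The DP-matching step can be pictured with the following sketch, which aligns two sentence lists while allowing 1-1, 1-2, and 2-1 beads. The similarity function `sim` is a placeholder for the paper's score (for example, bilingual dictionary overlap); the actual system also exploits the CLIR-based article similarities described above.

```python
def align_sentences(ja, en, sim):
    # DP alignment over sentence "beads": 1-1, 1-2, and 2-1 matches.
    # sim(ja_span, en_span) -> similarity score; higher is better.
    NEG = float("-inf")
    m, n = len(ja), len(en)
    score = [[NEG] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if score[i][j] == NEG:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                if i + di <= m and j + dj <= n:
                    s = score[i][j] + sim(ja[i:i + di], en[j:j + dj])
                    if s > score[i + di][j + dj]:
                        score[i + di][j + dj] = s
                        back[i + di][j + dj] = (i, j)
    pairs, cell = [], (m, n)     # trace the best path back to (0, 0)
    while back[cell[0]][cell[1]] is not None:
        pi, pj = back[cell[0]][cell[1]]
        pairs.append((ja[pi:cell[0]], en[pj:cell[1]]))
        cell = (pi, pj)
    return list(reversed(pairs))
```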


International Symposium on Neural Networks | 2004

A part-versus-part method for massively parallel training of support vector machines

Bao-Liang Lu; Kai-An Wang; Masao Utiyama; Hitoshi Isahara

This work presents a part-versus-part decomposition method for massively parallel training of multi-class support vector machines (SVMs). Using this method, a massive multi-class classification problem is decomposed into a number of two-class subproblems as small as needed. An important advantage of the part-versus-part method over the existing, popular pairwise-classification approach is that a large-scale two-class subproblem can be further divided into a number of relatively smaller and balanced two-class subproblems, so fast training of SVMs on massive multi-class classification problems can easily be implemented in a massively parallel way. To demonstrate the effectiveness of the proposed method, we perform simulations on a large-scale text categorization problem. The experimental results show that the proposed method is faster than the existing pairwise-classification approach, achieves better generalization performance, and scales up to massive, complex multi-class classification problems.
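
A minimal sketch of the decomposition, assuming scikit-learn's LinearSVC as the base learner (the paper predates it; any two-class SVM trainer would do). `chunk_size` is a hypothetical knob controlling how small the balanced subproblems become; every task in the returned list is independent, so all of them can be trained in parallel. The trained modules would then be combined with the min-max rules sketched under the 2008 patent-classification entry below.

```python
from itertools import product
from sklearn.svm import LinearSVC

def part_versus_part_tasks(X_by_class, chunk_size):
    # For every class pair (i, j), pit each chunk of class i against each
    # chunk of class j: many small, balanced, independent two-class tasks.
    def chunks(X):
        return [X[k:k + chunk_size] for k in range(0, len(X), chunk_size)]
    labels = sorted(X_by_class)
    return [(i, j, Xi, Xj)
            for a, i in enumerate(labels)
            for j in labels[a + 1:]
            for Xi, Xj in product(chunks(X_by_class[i]),
                                  chunks(X_by_class[j]))]

def train_task(task):
    # One small base SVM per subproblem; run these calls in parallel.
    i, j, Xi, Xj = task
    clf = LinearSVC().fit(Xi + Xj, [1] * len(Xi) + [0] * len(Xj))
    return i, j, clf
```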


arXiv: Computation and Language | 2000

Japanese probabilistic information retrieval using location and category information

Masaki Murata; Qing Ma; Kiyotaka Uchimoto; Hiromi Ozaku; Masao Utiyama; Hitoshi Isahara

Robertson's 2-Poisson information retrieval model does not use location and category information. We constructed a framework that uses location and category information in a 2-Poisson model. We submitted two systems based on this framework to the IREX contest, a Japanese-language information retrieval contest held in Japan in 1999. For precision under the A-judgement measure they scored 0.4926 and 0.4827, the highest values among the 15 teams and 22 systems that participated in the contest. We describe our systems and the comparative experiments performed when various parameters were changed. These experiments confirmed the effectiveness of using location and category information.
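
To make the idea concrete, here is a hedged sketch of an Okapi-style (2-Poisson derived) scorer with two illustrative extras: term occurrences in a document's headline or lead paragraph are boosted (location information), and documents whose category matches the query's expected category get a multiplicative bonus (category information). All parameter values and the `doc` field names are hypothetical, not the paper's tuned settings.

```python
import math

def score(query_terms, doc, N, df, avgdl, query_category=None,
          k1=1.2, b=0.75, loc_boost=1.5, cat_boost=1.2):
    # doc: {"tf": {term: freq}, "lead_terms": set of headline/lead terms,
    #       "length": int, "category": str}; df: document frequencies.
    s = 0.0
    K = k1 * ((1 - b) + b * doc["length"] / avgdl)
    for t in query_terms:
        tf = doc["tf"].get(t, 0.0)
        if t in doc["lead_terms"]:          # location information
            tf *= loc_boost
        if tf == 0.0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        s += idf * tf * (k1 + 1) / (tf + K)
    if query_category and doc.get("category") == query_category:
        s *= cat_boost                      # category information
    return s
```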


Empirical Methods in Natural Language Processing | 2014

Neural Network Based Bilingual Language Model Growing for Statistical Machine Translation

Rui Wang; Hai Zhao; Bao-Liang Lu; Masao Utiyama; Eiichiro Sumita

Since a larger n-gram language model (LM) usually performs better in statistical machine translation (SMT), how to construct an efficient large LM is an important topic in SMT. However, most existing LM-growing methods need an extra monolingual corpus, which makes additional LM adaptation technology necessary. In this paper, we propose a novel neural-network-based bilingual LM-growing method that uses only the bilingual parallel corpus already available in SMT. The results show that our method improves both the perplexity score for LM evaluation and the BLEU score for SMT, and, without any extra corpus, significantly outperforms the existing LM-growing methods.
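
The abstract does not spell out the growing procedure, so the following is only a toy sketch of the general idea under stated assumptions: a trained bilingual neural LM (represented here by the stand-in `nn_logprob`) scores candidate n-grams harvested from the parallel corpus, and those it finds sufficiently likely are added to the n-gram table.

```python
def grow_lm(base_ngrams, candidate_ngrams, nn_logprob, threshold):
    # base_ngrams: {ngram tuple: log-probability} from the baseline LM.
    # nn_logprob(ngram): stand-in for the trained bilingual neural LM.
    grown = dict(base_ngrams)
    for ng in candidate_ngrams:
        if ng in grown:
            continue
        lp = nn_logprob(ng)
        if lp > threshold:      # keep only n-grams the neural LM finds likely
            grown[ng] = lp
    return grown
```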


International Conference on Computational Linguistics | 2000

A statistical approach to the processing of metonymy

Masao Utiyama; Masaki Murata; Hitoshi Isahara

This paper describes a statistical approach to the interpretation of metonymy. A metonymy is received as input; its possible interpretations are then ranked by applying a statistical measure. The method has been tested experimentally and correctly interpreted 53 out of 75 metonymies in Japanese.
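
A toy version of such ranking, assuming a simple co-occurrence statistic in place of the paper's actual measure: candidate interpretations of the metonymic noun are ordered by how often they co-occur with it in a corpus (for example, in Japanese "A no B" constructions).

```python
from collections import Counter

def rank_interpretations(noun, candidates, corpus_pairs):
    # corpus_pairs: iterable of (noun, related-word) pairs mined from text.
    cooc = Counter(corpus_pairs)
    total = sum(cooc[(noun, c)] for c in candidates) or 1
    return sorted(((cooc[(noun, c)] / total, c) for c in candidates),
                  reverse=True)

# Toy usage for "read Shakespeare": the works reading wins if the corpus
# pairs the author's name with his works more often than with his person.
pairs = [("Shakespeare", "plays")] * 5 + [("Shakespeare", "biography")]
print(rank_interpretations("Shakespeare", ["plays", "biography"], pairs))
```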


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Bilingual continuous-space language model growing for statistical machine translation

Rui Wang; Hai Zhao; Bao-Liang Lu; Masao Utiyama; Eiichiro Sumita

Larger n-gram language models (LMs) perform better in statistical machine translation (SMT). However, the existing approaches to constructing larger LMs have two main drawbacks: 1) it is not convenient to obtain larger corpora in the same domain as the bilingual parallel corpora used in SMT; 2) most previous studies focus on monolingual information from the target corpora only, and redundant n-grams have not been fully utilized in SMT. Continuous-space language models (CSLMs), especially neural network language models (NNLMs), have recently been shown to greatly improve the estimation accuracy of the probabilities for predicting target words. However, most of these CSLM and NNLM approaches still consider monolingual information only or require an additional corpus. In this paper, we propose a novel neural-network-based bilingual LM-growing method. Compared to the existing approaches, the proposed method enables us to use the bilingual parallel corpus for LM growing in SMT. The results show that our new method significantly outperforms the existing approaches in both SMT performance and computational efficiency.


International Conference on Computational Linguistics | 2013

An empirical study on word segmentation for Chinese machine translation

Hai Zhao; Masao Utiyama; Eiichiro Sumita; Bao-Liang Lu

Word segmentation has been shown to help Chinese-to-English machine translation (MT), yet the way different segmentation strategies affect MT is poorly understood. In this paper, we compare different segmentation strategies in terms of machine translation quality. Our empirical study covers both English-to-Chinese and Chinese-to-English translation for the first time. Our results show that the necessity of word segmentation depends on the translation direction. After comparing two types of segmentation strategies with their associated linguistic resources, we demonstrate that optimizing segmentation itself does not guarantee better MT performance, and that the choice of segmentation strategy is not the key to improving MT. Instead, we find that the linguistic resources, such as the segmented corpora or the dictionaries that segmentation tools rely on, actually determine how word segmentation affects machine translation. Based on these findings, we propose an empirical approach that directly optimizes the word segmenter's dictionary with respect to the MT task, providing a BLEU score improvement of 1.30.
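
As a small illustration of why the dictionary dominates, here is a standard forward maximum-matching segmenter (not the paper's method): swap the dictionary and the output changes, even though the algorithm is fixed. The paper's proposal is, roughly, to tune such a dictionary against BLEU on the MT task rather than against segmentation accuracy.

```python
def forward_max_match(text, dictionary, max_len=6):
    # Greedy forward maximum matching: at each position take the longest
    # dictionary word; fall back to a single character if none matches.
    out, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                out.append(text[i:i + l])
                i += l
                break
    return out
```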


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009

Evaluating effects of machine translation accuracy on cross-lingual patent retrieval

Atsushi Fujii; Masao Utiyama; Mikio Yamamoto; Takehito Utsuro

We organized a machine translation (MT) task at the Seventh NTCIR Workshop. Participating groups were requested to machine-translate sentences in patent documents, as well as the search topics used for retrieving patent documents across languages. We analyzed the relationship between the accuracy of MT and its effect on retrieval accuracy.


International Symposium on Neural Networks | 2008

Large-scale patent classification with min-max modular support vector machines

Xiao-Lei Chu; Chao Ma; Jing Li; Bao-Liang Lu; Masao Utiyama; Hitoshi Isahara

Patent classification is a large-scale, hierarchical, imbalanced, multi-label problem. The number of samples in a real-world patent classification task typically exceeds one million, and this number increases every year. An effective patent classifier must be able to deal with this situation. This paper discusses the use of the min-max modular support vector machine (M3-SVM) to deal with large-scale patent classification problems. The method comprises three steps: decomposing a large-scale and imbalanced patent classification problem into a group of relatively smaller and more balanced two-class subproblems that are independent of each other, learning these subproblems with support vector machines (SVMs) in parallel, and combining all of the trained SVMs according to the minimization and maximization rules. M3-SVM has two attractive features that are urgently needed for large-scale patent classification problems. First, it can be realized in a massively parallel form. Second, it can be built up incrementally. Results from experiments using the NTCIR-5 patent data set, which contains more than two million patents, confirm these two attractive features and demonstrate that M3-SVM outperforms conventional SVMs in terms of both training time and generalization performance.
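
The combination step can be stated compactly. In this hedged sketch of the min-max rule, `outputs[u][v]` holds the decision value of the small SVM trained on positive chunk u versus negative chunk v; the MIN is taken over negative chunks first, then the MAX over positive chunks. It is a companion to the part-versus-part decomposition sketched under the 2004 entry above.

```python
def min_max_combine(outputs):
    # outputs[u][v]: decision value of the module trained on positive
    # chunk u vs negative chunk v. MIN over v first, then MAX over u.
    return max(min(row) for row in outputs)

# Toy usage: 2 positive chunks x 3 negative chunks of decision values.
decision = min_max_combine([[0.8, -0.2, 0.5],
                            [0.9,  0.7, 0.6]])   # -> 0.6, i.e. positive
```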

Collaboration


Dive into Masao Utiyama's collaborations.

Top Co-Authors

Andrew M. Finch

National Institute of Information and Communications Technology

Rui Wang

Shanghai Jiao Tong University

Bao-Liang Lu

Shanghai Jiao Tong University

Hai Zhao

Shanghai Jiao Tong University

Kiyotaka Uchimoto

National Institute of Information and Communications Technology
