Is this you? Create Your Porfile

Yusuke Miyao

National Institute of Informatics

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yusuke Miyao is active.

Explore More

Publication

Featured researches published by Yusuke Miyao.

meeting of the association for computational linguistics | 2005

Probabilistic CFG with Latent Annotations

Takuya Matsuzaki; Yusuke Miyao; Jun’ichi Tsujii

This paper defines a generative probabilistic model of parse trees, which we call PCFG-LA. This model is an extension of PCFG in which non-terminal symbols are augmented with latent variables. Fine-grained CFG rules are automatically induced from a parsed corpus by training a PCFG-LA model using an EM-algorithm. Because exact parsing with a PCFG-LA is NP-hard, several approximations are described and empirically compared. In experiments using the Penn WSJ corpus, our automatically trained model gave a performance of 86.6% (F1, sentences ≤ 40 words), which is comparable to that of an unlexicalized PCFG parser created using extensive manual feature selection.

pacific symposium on biocomputing | 2000

Event extraction from biomedical papers using a full parser.

Akane Yakushiji; Yuka Tateisi; Yusuke Miyao; Jun’ichi Tsujii

We have designed and implemented an information extraction system using a full parser to investigate the plausibility of full analysis of text using general-purpose parser and grammar applied to biomedical domain. We partially solved the problems of full parsing of inefficiency, ambiguity, and low coverage by introducing the preprocessors, and proposed the use of modules that handles partial results of parsing for further improvement. Our approach makes it possible to modularize the system, so that the IE system as a whole becomes easy to be tuned to specific domains, and easy to be maintained and improved by incorporating various techniques of disambiguation, speed up, etc. In preliminary experiment, from 133 argument structures that should be extracted from 97 sentences, we obtained 23% uniquely and 24% with ambiguity. And 20% are extractable from not complete but partial results of full parsing.

Computational Linguistics | 2008

Feature forest models for probabilistic hpsg parsing

Yusuke Miyao; Jun’ichi Tsujii

Probabilistic modeling of lexicalized grammars is difficult because these grammars exploit complicated data structures, such as typed feature structures. This prevents us from applying common methods of probabilistic modeling in which a complete structure is divided into sub-structures under the assumption of statistical independence among sub-structures. For example, part-of-speech tagging of a sentence is decomposed into tagging of each word, and CFG parsing is split into applications of CFG rules. These methods have relied on the structure of the target problem, namely lattices or trees, and cannot be applied to graph structures including typed feature structures. This article proposes the feature forest model as a solution to the problem of probabilistic modeling of complex data structures including typed feature structures. The feature forest model provides a method for probabilistic modeling without the independence assumption when probabilistic events are represented with feature forests. Feature forests are generic data structures that represent ambiguous trees in a packed forest structure. Feature forest models are maximum entropy models defined over feature forests. A dynamic programming algorithm is proposed for maximum entropy estimation without unpacking feature forests. Thus probabilistic modeling of any data structures is possible when they are represented by feature forests. This article also describes methods for representing HPSG syntactic structures and predicate-argument structures with feature forests. Hence, we describe a complete strategy for developing probabilistic models for HPSG parsing. The effectiveness of the proposed methods is empirically evaluated through parsing experiments on the Penn Treebank, and the promise of applicability to parsing of real-world sentences is discussed.

Bioinformatics | 2009

Evaluating contributions of natural language parsers to protein–protein interaction extraction

Yusuke Miyao; Kenji Sagae; Rune Sætre; Takuya Matsuzaki; Jun’ichi Tsujii

Motivation: While text mining technologies for biomedical research have gained popularity as a way to take advantage of the explosive growth of information in text form in biomedical papers, selecting appropriate natural language processing (NLP) tools is still difficult for researchers who are not familiar with recent advances in NLP. This article provides a comparative evaluation of several state-of-the-art natural language parsers, focusing on the task of extracting protein–protein interaction (PPI) from biomedical papers. We measure how each parser, and its output representation, contributes to accuracy improvement when the parser is used as a component in a PPI system. Results: All the parsers attained improvements in accuracy of PPI extraction. The levels of accuracy obtained with these different parsers vary slightly, while differences in parsing speed are larger. The best accuracy in this work was obtained when we combined Miyao and Tsujiis Enju parser and Charniak and Johnsons reranking parser, and the accuracy is better than the state-of-the-art results on the same data. Availability: The PPI extraction system used in this work (AkanePPI) is available online at http://www-tsujii.is.s.u-tokyo.ac.jp/-100downloads/downloads.cgi. The evaluated parsers are also available online from each developers site. Contact: [email protected]

international joint conference on natural language processing | 2004

Corpus-Oriented grammar development for acquiring a head-driven phrase structure grammar from the penn treebank

Yusuke Miyao; Takashi Ninomiya; Jun’ichi Tsujii

This paper describes a method of semi-automatically acquiring an English HPSG grammar from the Penn Treebank. First, heuristic rules are employed to annotate the treebank with partially-specified derivation trees of HPSG. Lexical entries are automatically extracted from the annotated corpus by inversely applying HPSG schemata to partially-specified derivation trees. Predefined HPSG schemata assure the acquired lexicon to conform to the theoretical formulation of HPSG. Experimental results revealed that this approach enabled us to develop an HPSG grammar with significant robustness at small cost.

meeting of the association for computational linguistics | 2005

Probabilistic Disambiguation Models for Wide-Coverage HPSG Parsing

Yusuke Miyao; Jun’ichi Tsujii

This paper reports the development of log-linear models for the disambiguation in wide-coverage HPSG parsing. The estimation of log-linear models requires high computational cost, especially with wide-coverage grammars. Using techniques to reduce the estimation cost, we trained the models using 20 sections of Penn Tree-bank. A series of experiments empirically evaluated the estimation techniques, and also examined the performance of the disambiguation models on the parsing of real-world sentences.

International Journal of Medical Informatics | 2009

Protein–protein interaction extraction by leveraging multiple kernels and parsers

Makoto Miwa; Rune Sætre; Yusuke Miyao; Jun’ichi Tsujii

Protein-protein interaction (PPI) extraction is an important and widely researched task in the biomedical natural language processing (BioNLP) field. Kernel-based machine learning methods have been used widely to extract PPI automatically, and several kernels focusing on different parts of sentence structure have been published for the PPI task. In this paper, we propose a method to combine kernels based on several syntactic parsers, in order to retrieve the widest possible range of important information from a given sentence. We evaluate the method using a support vector machine (SVM), and we achieve better results than other state-of-the-art PPI systems on four out of five corpora. Further, we analyze the compatibility of the five corpora from the viewpoint of PPI extraction, and we see that some of them have small incompatibilities, but they can still be combined with a little effort.

meeting of the association for computational linguistics | 2006

Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases

Yusuke Miyao; Tomoko Ohta; Katsuya Masuda; Yoshimasa Tsuruoka; Kazuhiro Yoshida; Takashi Ninomiya; Jun’ichi Tsujii

This paper introduces a novel framework for the accurate retrieval of relational concepts from huge texts. Prior to retrieval, all sentences are annotated with predicate argument structures and ontological identifiers by applying a deep parser and a term recognizer. During the run time, user requests are converted into queries of region algebra on these annotations. Structural matching with pre-computed semantic annotations establishes the accurate and efficient retrieval of relational concepts. This framework was applied to a text retrieval system for MEDLINE. Experiments on the retrieval of biomedical correlations revealed that the cost is sufficiently small for real-time applications and that the retrieval precision is significantly improved.

meeting of the association for computational linguistics | 2006

Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition

Daisuke Okanohara; Yusuke Miyao; Yoshimasa Tsuruoka; Jun’ichi Tsujii

This paper presents techniques to apply semi-CRFs to Named Entity Recognition tasks with a tractable computational cost. Our framework can handle an NER task that has long named entities and many labels which increase the computational cost. To reduce the computational cost, we propose two techniques: the first is the use of feature forests, which enables us to pack feature-equivalent states, and the second is the introduction of a filtering process which significantly reduces the number of candidate states. This framework allows us to use a rich set of features extracted from the chunk-based representation that can capture informative characteristics of entities. We also introduce a simple trick to transfer information about distant entities by embedding label information into non-entity labels. Experimental results show that our model achieves an F-score of 71.48% on the JNLPBA 2004 shared task without using any external resources or post-processing techniques.

international joint conference on natural language processing | 2005

Adapting a probabilistic disambiguation model of an HPSG parser to a new domain

Tadayoshi Hara; Yusuke Miyao; Jun’ichi Tsujii

This paper describes a method of adapting a domain-independent HPSG parser to a biomedical domain. Without modifying the grammar and the probabilistic model of the original HPSG parser, we develop a log-linear model with additional features on a treebank of the biomedical domain. Since the treebank of the target domain is limited, we need to exploit an original disambiguation model that was trained on a larger treebank. Our model incorporates the original model as a reference probabilistic distribution. The experimental results for our model trained with a small amount of a treebank demonstrated an improvement in parsing accuracy.

Explore More