Publication


Featured research published by Jun’ichi Tsujii.


Panhellenic Conference on Informatics | 2005

Developing a robust part-of-speech tagger for biomedical text

Yoshimasa Tsuruoka; Yuka Tateishi; Jin-Dong Kim; Tomoko Ohta; John McNaught; Sophia Ananiadou; Jun’ichi Tsujii

This paper presents a part-of-speech tagger which is specifically tuned for biomedical text. We have built the tagger with maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus revealed that adding training data from a different domain does not hurt the performance of a tagger, and our tagger exhibits very good precision (97% to 98%) on all these corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.
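
To give a rough flavor of the approach, the sketch below trains a per-token maximum entropy (multinomial logistic regression) tagger over simple orthographic and contextual features. The toy data, feature set, and use of scikit-learn are illustrative assumptions, not the paper's actual implementation or tagging algorithm.

```python
# Minimal sketch of a maximum-entropy POS tagger (illustrative only; not the
# authors' implementation). Assumes scikit-learn is available.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Orthographic and contextual features for the i-th token."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:],
        "is_capitalized": w[0].isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training data (hypothetical): (sentence tokens, POS tags).
train = [
    (["IL-2", "activates", "T", "cells", "."], ["NN", "VBZ", "NN", "NNS", "."]),
    (["The", "protein", "binds", "DNA", "."], ["DT", "NN", "VBZ", "NN", "."]),
]

X, y = [], []
for tokens, tags in train:
    for i, tag in enumerate(tags):
        X.append(token_features(tokens, i))
        y.append(tag)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression = maximum entropy model
clf.fit(vec.fit_transform(X), y)

test = ["NF-kB", "binds", "DNA", "."]
print(clf.predict(vec.transform([token_features(test, i) for i in range(len(test))])))
```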


Bioinformatics | 2002

Accomplishments and challenges in literature data mining for biology

Lynette Hirschman; Jong C. Park; Jun’ichi Tsujii; Limsoon Wong; Cathy H. Wu

We review recent results in literature data mining for biology and discuss the need and the steps for a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search, identifying cellular location, and so on. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.


Meeting of the Association for Computational Linguistics | 2005

Probabilistic CFG with Latent Annotations

Takuya Matsuzaki; Yusuke Miyao; Jun’ichi Tsujii

This paper defines a generative probabilistic model of parse trees, which we call PCFG-LA. This model is an extension of PCFG in which non-terminal symbols are augmented with latent variables. Fine-grained CFG rules are automatically induced from a parsed corpus by training a PCFG-LA model using an EM-algorithm. Because exact parsing with a PCFG-LA is NP-hard, several approximations are described and empirically compared. In experiments using the Penn WSJ corpus, our automatically trained model gave a performance of 86.6% (F1, sentences ≤ 40 words), which is comparable to that of an unlexicalized PCFG parser created using extensive manual feature selection.
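
As a rough sketch of the latent-annotation idea, each non-terminal can be split into latent subsymbols and the original rule probabilities divided among the split rules with a small random perturbation, which is the usual starting point before EM training. The toy grammar below is hypothetical and the EM step itself is not shown.

```python
# Toy illustration of splitting non-terminals into latent subsymbols for a
# PCFG-LA (initialization step only; EM training is not shown).
import itertools
import random

random.seed(0)
K = 2  # number of latent annotations per non-terminal

# Hypothetical toy PCFG: (lhs, (rhs...)) -> probability
pcfg = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.6,
    ("NP", ("NN",)): 0.4,
    ("VP", ("VB", "NP")): 1.0,
}
nonterminals = {"S", "NP", "VP"}

def annotate(sym):
    """All latent variants of a symbol; preterminals are left unsplit here."""
    return [f"{sym}[{k}]" for k in range(K)] if sym in nonterminals else [sym]

pcfg_la = {}
for (lhs, rhs), prob in pcfg.items():
    for lhs_a in annotate(lhs):
        variants = list(itertools.product(*(annotate(s) for s in rhs)))
        for rhs_a in variants:
            # Spread the original probability over the split rules, with a
            # small random perturbation to break symmetry before EM.
            pcfg_la[(lhs_a, rhs_a)] = prob / len(variants) * random.uniform(0.95, 1.05)

# Renormalize so that rules sharing an annotated LHS sum to 1.
totals = {}
for (lhs_a, _), p in pcfg_la.items():
    totals[lhs_a] = totals.get(lhs_a, 0.0) + p
pcfg_la = {rule: p / totals[rule[0]] for rule, p in pcfg_la.items()}

for rule, p in sorted(pcfg_la.items()):
    print(rule, round(p, 3))
```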


Empirical Methods in Natural Language Processing | 2005

Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data

Yoshimasa Tsuruoka; Jun’ichi Tsujii

This paper presents a bidirectional inference algorithm for sequence labeling problems such as part-of-speech tagging, named entity recognition and text chunking. The algorithm can enumerate all possible decomposition structures and find the highest-probability sequence together with the corresponding decomposition structure in polynomial time. We also present an efficient decoding algorithm based on the easiest-first strategy, which gives performance comparable to full bidirectional inference at significantly lower computational cost. Experimental results on part-of-speech tagging and text chunking show that the proposed bidirectional inference methods consistently outperform unidirectional inference methods, and that bidirectional MEMMs give performance comparable to that achieved by state-of-the-art learning algorithms, including kernel support vector machines.
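
A minimal sketch of the easiest-first decoding idea follows, with a hand-written stand-in scoring function rather than the trained maximum entropy classifiers used in the paper: at each step, the position whose most probable tag has the highest confidence is labeled first, so later decisions can condition on already-fixed tags on either side.

```python
# Toy easiest-first decoder for sequence labeling. The scoring function below
# is a hand-written stand-in for a trained local classifier.
TAGS = ["DT", "NN", "VB"]

def tag_distribution(tokens, i, assigned):
    """Hypothetical local classifier: P(tag | token, neighboring assigned tags)."""
    w = tokens[i].lower()
    left = assigned.get(i - 1)
    scores = {t: 1.0 for t in TAGS}
    if w in ("the", "a"):
        scores["DT"] += 5.0
    if w.endswith("s"):
        scores["VB"] += 1.0
        scores["NN"] += 1.0
    if left == "DT":          # bidirectional context: an already-fixed left tag
        scores["NN"] += 3.0   # makes NN much more likely here
    if left == "NN":
        scores["VB"] += 2.0
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def easiest_first(tokens):
    assigned = {}
    while len(assigned) < len(tokens):
        # Pick the yet-unlabeled position where the classifier is most confident.
        best_i, best_tag, best_p = None, None, -1.0
        for i in range(len(tokens)):
            if i in assigned:
                continue
            dist = tag_distribution(tokens, i, assigned)
            tag, p = max(dist.items(), key=lambda kv: kv[1])
            if p > best_p:
                best_i, best_tag, best_p = i, tag, p
        assigned[best_i] = best_tag
    return [assigned[i] for i in range(len(tokens))]

print(easiest_first(["The", "dog", "barks"]))  # -> ['DT', 'NN', 'VB']
```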


International Conference on Computational Linguistics | 2000

Extracting the names of genes and gene products with a hidden Markov model

Nigel Collier; Chikashi Nobata; Jun’ichi Tsujii

We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation methods we achieved an F-score of 0.73, and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
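
The linear interpolation component can be illustrated in isolation: a bigram estimate is smoothed by mixing it with unigram and uniform estimates, so unseen bigrams still receive probability mass. The toy counts and fixed weights below are made up; the full class-based HMM and character features are omitted.

```python
# Toy sketch of linearly interpolated bigram probabilities, the smoothing
# device used inside the HMM; counts and lambda weights are invented.
from collections import Counter

corpus = "the IL-2 gene encodes a protein that binds the IL-2 receptor".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

# Interpolation weights; in the paper these would be estimated, here they are
# simply fixed by hand for illustration.
L1, L2, L3 = 0.6, 0.3, 0.1

def p_interp(prev, word):
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / sum(unigrams.values())
    p_uniform = 1.0 / V
    return L1 * p_bi + L2 * p_uni + L3 * p_uniform

print(p_interp("the", "IL-2"))   # seen bigram: dominated by the bigram term
print(p_interp("the", "binds"))  # unseen bigram: still gets non-zero mass
```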


Pacific Symposium on Biocomputing | 2000

Event extraction from biomedical papers using a full parser

Akane Yakushiji; Yuka Tateisi; Yusuke Miyao; Jun’ichi Tsujii

We have designed and implemented an information extraction system that uses a full parser in order to investigate the feasibility of fully analyzing text with a general-purpose parser and grammar applied to the biomedical domain. We partially addressed the problems of full parsing, namely inefficiency, ambiguity, and low coverage, by introducing preprocessors, and we proposed modules that handle partial parsing results for further improvement. Our approach modularizes the system, so that the IE system as a whole is easy to tune to specific domains and easy to maintain and improve by incorporating various techniques for disambiguation, speed-up, and so on. In a preliminary experiment, of the 133 argument structures that should be extracted from 97 sentences, we obtained 23% uniquely and a further 24% with ambiguity, and another 20% were extractable from partial (rather than complete) results of full parsing.


Meeting of the Association for Computational Linguistics | 2002

Tuning support vector machines for biomedical named entity recognition

Jun’ichi Kazama; Takaki Makino; Yoshihiro Ohta; Jun’ichi Tsujii

We explore the use of Support Vector Machines (SVMs) for biomedical named entity recognition. To make SVM training with the largest available corpus, the GENIA corpus, tractable, we propose splitting the non-entity class into sub-classes using part-of-speech information. In addition, we explore new features such as word cache and the states of an HMM trained by unsupervised learning. Experiments on the GENIA corpus show that our class-splitting technique not only makes training with the GENIA corpus feasible but also improves the accuracy. The proposed new features also contribute to improving the accuracy. We compare our SVM-based recognition system with a system using the maximum entropy tagging method.
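
A minimal sketch of the class-splitting idea is given below, using toy data and scikit-learn's LinearSVC as a stand-in for the SVMs used in the paper: the non-entity class "O" is subdivided by part-of-speech before training, and the sub-classes are merged back at prediction time.

```python
# Toy sketch of splitting the non-entity ("O") class by POS before SVM
# training; data, features and the use of scikit-learn are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Hypothetical tokens annotated with (word, POS, entity class).
train = [
    ("IL-2", "NN", "PROTEIN"), ("gene", "NN", "DNA"), ("activates", "VBZ", "O"),
    ("the", "DT", "O"), ("receptor", "NN", "PROTEIN"), ("binds", "VBZ", "O"),
    ("a", "DT", "O"), ("p53", "NN", "PROTEIN"),
]

def features(word, pos):
    return {"word": word.lower(), "pos": pos, "has_digit": any(c.isdigit() for c in word)}

X, y = [], []
for word, pos, label in train:
    X.append(features(word, pos))
    # Class splitting: replace the single "O" class with POS-specific sub-classes.
    y.append(f"O-{pos}" if label == "O" else label)

vec = DictVectorizer()
clf = LinearSVC()
clf.fit(vec.fit_transform(X), y)

test = [("p21", "NN"), ("inhibits", "VBZ")]
pred = clf.predict(vec.transform([features(w, p) for w, p in test]))
# Merge the split sub-classes back into "O" for the final output.
print([("O" if t.startswith("O-") else t) for t in pred])
```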


Trends in Biotechnology | 2010

Event Extraction for Systems Biology by Text Mining the Literature

Sophia Ananiadou; Sampo Pyysalo; Jun’ichi Tsujii; Douglas B. Kell

Systems biology recognizes in particular the importance of interactions between biological components and the consequences of these interactions. Such interactions and their downstream effects are known as events. To computationally mine the literature for such events, text mining methods that can detect, extract and annotate them are required. This review summarizes the methods that are currently available, with a specific focus on protein-protein interactions and pathway or network reconstruction. The approaches described will be of considerable value in associating particular pathways and their components with higher-order physiological properties, including disease states.


Computational Linguistics | 2008

Feature forest models for probabilistic HPSG parsing

Yusuke Miyao; Jun’ichi Tsujii

Probabilistic modeling of lexicalized grammars is difficult because these grammars exploit complicated data structures, such as typed feature structures. This prevents us from applying common methods of probabilistic modeling in which a complete structure is divided into sub-structures under the assumption of statistical independence among sub-structures. For example, part-of-speech tagging of a sentence is decomposed into tagging of each word, and CFG parsing is split into applications of CFG rules. These methods have relied on the structure of the target problem, namely lattices or trees, and cannot be applied to graph structures including typed feature structures. This article proposes the feature forest model as a solution to the problem of probabilistic modeling of complex data structures including typed feature structures. The feature forest model provides a method for probabilistic modeling without the independence assumption when probabilistic events are represented with feature forests. Feature forests are generic data structures that represent ambiguous trees in a packed forest structure. Feature forest models are maximum entropy models defined over feature forests. A dynamic programming algorithm is proposed for maximum entropy estimation without unpacking feature forests. Thus probabilistic modeling of any data structures is possible when they are represented by feature forests. This article also describes methods for representing HPSG syntactic structures and predicate-argument structures with feature forests. Hence, we describe a complete strategy for developing probabilistic models for HPSG parsing. The effectiveness of the proposed methods is empirically evaluated through parsing experiments on the Penn Treebank, and the promise of applicability to parsing of real-world sentences is discussed.
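
The core computation can be sketched as follows, in a simplified toy rather than the authors' implementation: a feature forest alternates disjunctive nodes (choices) and conjunctive nodes (feature-bearing structures), and the normalization constant of the log-linear model is obtained by dynamic programming over the packed forest instead of by unpacking it.

```python
# Toy sketch of computing the log-linear normalization constant Z over a
# packed feature forest without unpacking it. Structure and weights are made up.
import math
from functools import lru_cache

# Model weights over named features (hypothetical).
weights = {"rule:S->NP,VP": 0.4, "rule:VP->V,NP": 0.2, "attach:high": 0.7, "attach:low": -0.3}

# A conjunctive node carries features and points to disjunctive daughters;
# a disjunctive node packs alternative conjunctive nodes.
conj = {
    "c_high": {"features": ["rule:VP->V,NP", "attach:high"], "children": []},
    "c_low":  {"features": ["rule:VP->V,NP", "attach:low"],  "children": []},
    "c_root": {"features": ["rule:S->NP,VP"], "children": ["d_vp"]},
}
disj = {
    "d_vp":   ["c_high", "c_low"],
    "d_root": ["c_root"],
}

@lru_cache(maxsize=None)
def inside_disj(d):
    # Sum over the alternatives packed under a disjunctive node.
    return sum(inside_conj(c) for c in disj[d])

def inside_conj(c):
    node = conj[c]
    local = math.exp(sum(weights[f] for f in node["features"]))
    # Product over the node's disjunctive daughters.
    for d in node["children"]:
        local *= inside_disj(d)
    return local

Z = inside_disj("d_root")  # normalization constant of the log-linear model
print(Z)
```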


Bioinformatics | 2009

Evaluating contributions of natural language parsers to protein–protein interaction extraction

Yusuke Miyao; Kenji Sagae; Rune Sætre; Takuya Matsuzaki; Jun’ichi Tsujii

Motivation: While text mining technologies for biomedical research have gained popularity as a way to take advantage of the explosive growth of information in text form in biomedical papers, selecting appropriate natural language processing (NLP) tools is still difficult for researchers who are not familiar with recent advances in NLP. This article provides a comparative evaluation of several state-of-the-art natural language parsers, focusing on the task of extracting protein–protein interactions (PPIs) from biomedical papers. We measure how each parser, and its output representation, contributes to accuracy improvement when the parser is used as a component in a PPI extraction system.

Results: All the parsers attained improvements in the accuracy of PPI extraction. The levels of accuracy obtained with these different parsers vary slightly, while differences in parsing speed are larger. The best accuracy in this work was obtained when we combined Miyao and Tsujii's Enju parser and Charniak and Johnson's reranking parser, and this accuracy is better than the state-of-the-art results on the same data.

Availability: The PPI extraction system used in this work (AkanePPI) is available online at http://www-tsujii.is.s.u-tokyo.ac.jp/-100downloads/downloads.cgi. The evaluated parsers are also available online from each developer's site.

Contact: [email protected]
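
To give a flavor of how parser output typically feeds a PPI extractor, the sketch below derives a shortest-dependency-path feature between two protein mentions. The dependency graph is hand-written and the feature is generic; this is an illustrative pattern, not the AkanePPI implementation or the parsers evaluated in the paper.

```python
# Toy sketch: derive a shortest-dependency-path feature between two protein
# mentions from (hand-written) parser output, the kind of representation a
# PPI classifier would consume. Pure-stdlib BFS; no real parser is invoked.
from collections import deque

# Hypothetical dependency edges for "IL-2 activates the IL-2R receptor":
# (head, dependent, relation)
edges = [
    ("activates", "IL-2", "nsubj"),
    ("activates", "receptor", "dobj"),
    ("receptor", "the", "det"),
    ("receptor", "IL-2R", "compound"),
]

# Build an undirected adjacency map labeled with relations.
graph = {}
for head, dep, rel in edges:
    graph.setdefault(head, []).append((dep, rel))
    graph.setdefault(dep, []).append((head, rel))

def shortest_path(graph, src, dst):
    """Breadth-first search returning the token/relation path from src to dst."""
    queue, seen = deque([(src, [src])]), {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for nbr, rel in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, path + [rel, nbr]))
    return None

# The path between the two candidate protein mentions becomes a classifier feature.
print("->".join(shortest_path(graph, "IL-2", "IL-2R")))
# IL-2->nsubj->activates->dobj->receptor->compound->IL-2R
```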

Collaboration


Dive into Jun’ichi Tsujii's collaborations.

Top Co-Authors

Yusuke Miyao
National Institute of Informatics

Tomoko Ohta
University of Manchester

Takuya Matsuzaki
National Institute of Informatics

Sampo Pyysalo
University of Manchester

Makoto Miwa
Toyota Technological Institute