Publication


Featured research published by Rie Kubota Ando.


International Conference on Machine Learning | 2007

Two-view feature generation model for semi-supervised learning

Rie Kubota Ando; Tong Zhang

We consider a setting for discriminative semi-supervised learning where unlabeled data are used with a generative model to learn effective feature representations for discriminative training. Within this framework, we revisit the two-view feature generation model of co-training and prove that the optimum predictor can be expressed as a linear combination of a few features constructed from unlabeled data. From this analysis, we derive methods that employ two views but are very different from co-training. Experiments show that our approach is more robust than co-training and EM, under various data generation conditions.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000

Latent semantic space: iterative scaling improves precision of inter-document similarity measurement

Rie Kubota Ando

We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% higher than SVD on average. Our algorithm creates basis vectors for a reduced space by iteratively “scaling” vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.
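The phrase "iteratively scaling vectors and computing eigenvectors" can be pictured with a small sketch. What follows is one possible reading, not the paper's algorithm: the scaling rule (each residual document vector weighted by its length raised to a power `q`), the function names, and the use of power iteration are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def top_eigenvector(vectors, iters=200):
    """Power iteration for the leading eigenvector of sum_j v_j v_j^T."""
    dim = len(vectors[0])
    # Non-uniform start; it could still be orthogonal to the leading
    # eigenvector in pathological cases, which this sketch ignores.
    x = [1.0 / (i + 1) for i in range(dim)]
    n = norm(x)
    x = [a / n for a in x]
    for _ in range(iters):
        y = [0.0] * dim
        for v in vectors:
            c = dot(v, x)
            for i in range(dim):
                y[i] += c * v[i]
        n = norm(y)
        if n == 0.0:  # all vectors are zero; nothing left to extract
            break
        x = [a / n for a in y]
    return x

def rescaled_basis(docs, k, q=1.0):
    """Build k basis vectors from document vectors (hypothetical helper).

    Assumed scheme: before each eigenvector computation, every residual
    vector is multiplied by its own length to the power q.
    """
    residuals = [list(d) for d in docs]
    basis = []
    for _ in range(k):
        scaled = [[(norm(r) ** q) * a for a in r] for r in residuals]
        b = top_eigenvector(scaled)
        basis.append(b)
        # Subtract from each residual its component along the new basis vector.
        residuals = [[a - dot(r, b) * bi for a, bi in zip(r, b)]
                     for r in residuals]
    return basis

def project(doc, basis):
    return [dot(doc, b) for b in basis]

def cosine(u, v):
    nu, nv = norm(u), norm(v)
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

# Toy term-weight vectors: two documents share terms 0-1, one uses terms 2-3.
docs = [[2.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
basis = rescaled_basis(docs, k=2)
reduced = [project(d, basis) for d in docs]
print(cosine(reduced[0], reduced[1]), cosine(reduced[0], reduced[2]))
```

In this toy collection the first two documents end up nearly identical in the reduced space, while the third stays orthogonal to them. Setting `q=0` removes the rescaling, so each step simply extracts the leading eigenvector of the unweighted residuals.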


Journal of Biomedical Informatics | 2005

Domain-specific language models and lexicons for tagging

Anni Coden; Serguei V. S. Pakhomov; Rie Kubota Ando; Patrick H. Duffy; Christopher G. Chute

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.
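The reported gain is easier to appreciate as a reduction in error rate. The short calculation below is a worked example only; it assumes exactly 92% for the upper figure, whereas the abstract says "over 92%", so the result is a lower bound. The function name is hypothetical.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original errors removed by the accuracy gain."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

# 87% -> 92% accuracy: the error rate falls from 13% to 8%.
reduction = relative_error_reduction(0.87, 0.92)
print(f"{reduction:.1%}")  # prints 38.5%
```

In other words, a five-point accuracy gain corresponds to removing more than a third of the tagging errors made by the general-English model.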


North American Chapter of the Association for Computational Linguistics | 2000

Multi-document summarization by visualizing topical content

Rie Kubota Ando; Branimir Boguraev; Roy J. Byrd; Mary S. Neff

This paper describes a framework for multi-document summarization which combines three premises: coherent themes can be identified reliably; highly representative themes, running across subsets of the document collection, can function as multi-document summary surrogates; and effective end-use of such themes should be facilitated by a visualization environment which clarifies the relationship between themes and documents. We present algorithms that formalize our framework, describe an implementation, and demonstrate a prototype system and interface.


Natural Language Engineering | 2005

Visualization-enabled multi-document summarization by Iterative Residual Rescaling

Rie Kubota Ando; Branimir Boguraev; Roy J. Byrd; Mary S. Neff

This paper describes a novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections. We place equal emphasis on the processes of theme identification and theme presentation. For the former, we apply Iterative Residual Rescaling (IRR); for the latter, we argue for graphical display elements. IRR is an algorithm designed to account for correlations between words and to construct multi-dimensional topical space indicative of relationships among linguistic objects (documents, phrases, and sentences). Summaries are composed of objects with certain properties, derived by exploiting the many-to-many relationships in such a space. Given their inherent complexity, our multi-faceted summaries benefit from a visualization environment. We discuss some essential features of such an environment.


Language Resources and Evaluation | 2007

TimeBank evolution as a community resource for TimeML parsing

Branimir Boguraev; James Pustejovsky; Rie Kubota Ando; Marc Verhagen

TimeBank is the only reference corpus for TimeML, an expressive language for annotating complex temporal information. It is a rich resource for a broad range of research into various aspects of the expression of time and temporally related events. This paper traces the development of TimeBank from its initial—and somewhat noisy—version (1.1) to a substantially revised release (1.2), now available via the Linguistic Data Consortium. The development path is motivated by the encouraging empirical results of TimeML-compliant annotators developed on the basis of TimeBank 1.1, and is informed by a detailed study of the characteristics of that initial release, which guides a clean-up process turning TimeBank 1.2 into a consistent and robust community resource.


Natural Language Engineering | 2003

Mostly-unsupervised statistical segmentation of Japanese Kanji sequences

Rie Kubota Ando; Lillian Lee

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.


Meeting of the Association for Computational Linguistics | 2004

Exploiting unannotated corpora for tagging and chunking

Rie Kubota Ando

We present a method that exploits unannotated corpora to compensate for the paucity of annotated training data in chunking and tagging tasks. It collects and compresses feature frequencies from a large unannotated corpus for use by linear classifiers. Experiments on two tasks show that it consistently produces significant performance improvements.


Meeting of the Association for Computational Linguistics | 2005

A High-Performance Semi-Supervised Learning Method for Text Chunking

Rie Kubota Ando; Tong Zhang


International Joint Conference on Artificial Intelligence | 2005

TimeML-compliant text analysis for temporal reasoning

Branimir Boguraev; Rie Kubota Ando
