Publication


Featured research published by Monica Rogati.


Conference on Information and Knowledge Management | 2002

High-performing feature selection for text classification

Monica Rogati; Yiming Yang

This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on chi2 statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
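The chi2 criterion scores each term by how strongly its presence deviates from independence with a class, using a 2x2 contingency table of document counts. The sketch below is a minimal illustration of that idea, not the paper's implementation; the term counts are hypothetical:

```python
def chi2(a, b, c, d):
    """Chi-square statistic for a term/class 2x2 contingency table.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    return num / den if den else 0.0


def select_features(term_stats, k):
    """Keep the k terms with the highest chi-square score."""
    scored = {t: chi2(*counts) for t, counts in term_stats.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical (a, b, c, d) counts per term for one class of 50 docs
# in a 500-doc collection.
stats = {
    "oil": (40, 5, 10, 445),
    "the": (49, 440, 1, 10),
    "price": (30, 20, 20, 430),
}
print(select_features(stats, 2))  # → ['oil', 'price']
```

Note how the frequent but class-independent term "the" scores near zero while the class-correlated terms dominate, which is the behavior that makes chi2 effective as a filter criterion.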


North American Chapter of the Association for Computational Linguistics | 2001

SPoT: a trainable sentence planner

Marilyn A. Walker; Owen Rambow; Monica Rogati

Sentence planning is a set of inter-related but distinct tasks, one of which is sentence scoping, i.e. the choice of syntactic structure for elementary speech acts and the decision of how to combine them into one or more sentences. In this paper, we present SPoT, a sentence planner, and a new methodology for automatically training SPoT on the basis of feedback provided by human judges. We reconceptualize the task into two distinct phases. First, a very simple, randomized sentence-plan-generator (SPG) generates a potentially large list of possible sentence plans for a given text-plan input. Second, the sentence-plan-ranker (SPR) ranks the list of output sentence plans, and then selects the top-ranked plan. The SPR uses ranking rules automatically learned from training data. We show that the trained SPR learns to select a sentence plan whose rating on average is only 5% worse than the top human-ranked sentence plan.


Computer Speech & Language | 2002

Training a sentence planner for spoken dialogue using boosting

Marilyn A. Walker; Owen Rambow; Monica Rogati

In the past few years, as the number of dialogue systems has increased, there has been an increasing interest in the use of natural language generation in spoken dialogue. Our research assumes that trainable natural language generation is needed to support more flexible and customized dialogues with human users. This paper focuses on methods for automatically training the sentence planning module of a spoken language generator. Sentence planning is a set of inter-related but distinct tasks, one of which is sentence scoping, i.e., the choice of syntactic structure for elementary speech acts and the decision of how to combine them into one or more sentences. The paper first presents SPoT, a trainable sentence planner, and a new methodology for automatically training SPoT on the basis of feedback provided by human judges. Our methodology is unique in depending neither on hand-crafted rules nor on the existence of a domain-specific corpus. SPoT first randomly generates a candidate set of sentence plans and then selects one. We show that SPoT learns to select a sentence plan whose rating on average is only 5% worse than the top human-ranked sentence plan. We then experimentally evaluate SPoT by asking human judges to compare SPoT's output with a hand-crafted template-based generation component, two rule-based sentence planners, and two baseline sentence planners. We show that SPoT performs better than the rule-based systems and the baselines, and as well as the hand-crafted system.


Meeting of the Association for Computational Linguistics | 2003

Unsupervised Learning of Arabic Stemming Using a Parallel Corpus

Monica Rogati; J. Scott McCarley; Yiming Yang

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results are given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state-of-the-art proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2004

Resource selection for domain-specific cross-lingual IR

Monica Rogati; Yiming Yang

An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods - with different training corpora - on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN), versus a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We start exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. By using cosine similarity between training and target corpora as resource weights we obtained an average of 5.6% improvement over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
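The matching criterion described above can be sketched as cosine similarity between bag-of-words term-frequency vectors of a training resource and the target collection. This is a minimal illustration under that reading of the abstract; the corpora and names are hypothetical, and a real system would apply tokenization, stopping, and tf-idf weighting:

```python
import math
from collections import Counter


def tf_vector(texts):
    """Bag-of-words term-frequency vector for a corpus of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def resource_weights(target, resources):
    """Weight each candidate training resource by its similarity to the target."""
    tv = tf_vector(target)
    return {name: cosine(tf_vector(docs), tv) for name, docs in resources.items()}


# Hypothetical corpora: a medical target collection and two candidate resources.
target = ["patient dosage trial", "clinical dosage results"]
resources = {
    "medical": ["patient trial outcome", "clinical dosage study"],
    "news": ["election results today", "market prices fall"],
}
weights = resource_weights(target, resources)
print(weights)
```

As expected, the domain-matched resource receives a much higher weight than the general-purpose one, which is the signal the paper uses to select or weight translation resources.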


Empirical Methods in Natural Language Processing | 2005

BLANC: Learning Evaluation Metrics for MT

Lucian Vlad Lita; Monica Rogati; Alon Lavie

We introduce BLANC, a family of dynamic, trainable evaluation metrics for machine translation. Flexible, parametrized models can be learned from past data and automatically optimized to correlate well with human judgments for different criteria (e.g. adequacy, fluency) using different correlation measures. Towards this end, we discuss ACS (all common skip-ngrams), a practical algorithm with trainable parameters that estimates reference-candidate translation overlap by computing a weighted sum of all common skip-ngrams in polynomial time. We show that the BLEU and ROUGE metric families are special cases of BLANC, and we compare correlations with human judgments across these three metric families. We analyze the algorithmic complexity of ACS and argue that it is more powerful in modeling both local meaning and sentence-level structure, while offering the same practicality as the established algorithms it generalizes.
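The ACS idea of measuring reference-candidate overlap via common skip-ngrams can be illustrated with its simplest special case, unweighted skip-bigrams (the ROUGE-S-like setting the paper generalizes). This sketch omits ACS's trainable weights and considers only bigrams; the gap cap `max_skip` is one such parameter:

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_skip=None):
    """All ordered token pairs (skip-bigrams), optionally capped by gap size."""
    pairs = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs[(tokens[i], tokens[j])] += 1
    return pairs


def skip_bigram_overlap(reference, candidate, max_skip=None):
    """Fraction of reference skip-bigrams matched by the candidate (clipped counts)."""
    ref = skip_bigrams(reference.split(), max_skip)
    cand = skip_bigrams(candidate.split(), max_skip)
    matched = sum(min(c, cand[p]) for p, c in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0


ref = "the cat sat on the mat"
print(skip_bigram_overlap(ref, "the cat sat on the mat"))  # identical → 1.0
print(skip_bigram_overlap(ref, "the mat sat on the cat"))  # reordered → lower
```

Because skip-bigrams preserve token order across arbitrary gaps, reordered candidates lose credit even when every unigram matches, which is how the metric captures sentence-level structure as well as local meaning.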


Meeting of the Association for Computational Linguistics | 2001

Evaluating a Trainable Sentence Planner for a Spoken Dialogue System

Owen Rambow; Monica Rogati; Marilyn A. Walker

Techniques for automatically training modules of a natural language generator have recently been proposed, but a fundamental concern is whether the quality of utterances produced with trainable components can compete with hand-crafted template-based or rule-based approaches. In this paper, we experimentally evaluate a trainable sentence planner for a spoken dialogue system by eliciting subjective human judgments. In order to perform an exhaustive comparison, we also evaluate a hand-crafted template-based generation component, two rule-based sentence planners, and two baseline sentence planners. We show that the trainable sentence planner performs better than the rule-based systems and the baselines, and as well as the hand-crafted system.


Cross-Language Evaluation Forum | 2003

Multilingual Information Retrieval Using Open, Transparent Resources in CLEF 2003

Monica Rogati; Yiming Yang

Corpus-based approaches to cross-lingual information retrieval (CLIR) have been studied and applied for many years. However, using general-purpose commercial MT systems for CLEF has been considered easier and better performing, which is to be expected given the non-domain-specific nature of the newspaper articles used in CLEF. Corpus-based approaches are easier to adapt to new domains and languages; however, it is possible that their performance would be lower on a general test collection such as CLEF. Our results show that the performance drop is not large enough to justify the loss of control, transparency and flexibility. We participated in two bilingual runs and the small multilingual run using software and data that are free to obtain, transparent and modifiable.


Cross-Language Evaluation Forum | 2001

Cross-Lingual Pseudo-Relevance Feedback Using a Comparable Corpus

Monica Rogati; Yiming Yang

We applied a Cross-Lingual PRF (Pseudo-Relevance Feedback) system to both the monolingual task and the German->English task. We focused on the effects of extracting a comparable corpus from the given newspaper data; our corpus doubled the average precision when used together with a parallel corpus made available to participants. The PRF performance was lower for the queries with few relevant documents. We also examined the effects of the PRF first-step retrieval in the parallel corpus vs. the entire document collection.
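The core PRF loop can be sketched in its simplest monolingual form: assume the top-ranked documents for the initial query are relevant, and expand the query with their most frequent unseen terms. This is a minimal illustration of the mechanism, not the paper's cross-lingual system, and the toy scoring function and corpus are hypothetical:

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Term-overlap retrieval score (stand-in for a real ranking function)."""
    return sum(doc_terms.count(t) for t in query_terms)


def expand_query(query, corpus, top_docs=2, new_terms=2):
    """Pseudo-relevance feedback: treat the top-ranked documents as relevant
    and add their most frequent terms not already in the query."""
    q = query.split()
    docs = sorted(corpus, key=lambda d: score(q, d.split()), reverse=True)
    pooled = Counter()
    for d in docs[:top_docs]:
        pooled.update(t for t in d.split() if t not in q)
    return q + [t for t, _ in pooled.most_common(new_terms)]


corpus = [
    "machine translation of arabic news",
    "arabic news translation systems",
    "cooking recipes for dinner",
]
print(expand_query("arabic translation", corpus))
```

The expanded query picks up terms such as "news" from the pseudo-relevant documents. The abstract's observation that PRF degrades on queries with few relevant documents follows directly from this first retrieval step: when the top-ranked pool is mostly non-relevant, the pooled expansion terms introduce topic drift.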


Asia Information Retrieval Symposium | 2004

Applying CLIR techniques to event tracking

Nianli Ma; Yiming Yang; Monica Rogati

Cross-lingual event tracking from a very large number of information sources (thousands of Web sites, for example) is an open challenge. In this paper we investigate effective and scalable solutions for this problem, focusing on the use of cross-lingual information retrieval techniques to translate a small subset of the training documents, as an alternative to the conventional approach of translating all the multilingual test documents. In addition, we present a new variant of weighted pseudo-relevance feedback for adaptive event tracking. This new method simplifies the assumptions and computation of the best-known approach of this kind, and outperforms it on benchmark datasets in our evaluations.

Collaboration

Monica Rogati's top co-authors:

Yiming Yang (Carnegie Mellon University)
Abhimanyu Lad (Carnegie Mellon University)
Bryan Kisiel (Carnegie Mellon University)
Daqing He (University of Pittsburgh)
Jonathan Grady (University of Pittsburgh)