Nancy Ide | Researchain

Publication

Featured research published by Nancy Ide.

Linguistic Annotation Workshop | 2007

GrAF: A Graph-based Format for Linguistic Annotations

Nancy Ide; Keith Suderman

In this paper we describe the Graph Annotation Format (GrAF) and show how it is used to represent not only independent linguistic annotations, but also sets of merged annotations as a single graph. To demonstrate this, we have automatically transduced several different annotations of the Wall Street Journal corpus into GrAF and show how the annotations can then be merged, analyzed, and visualized using standard graph algorithms and tools. We also discuss how, as a standard graph representation, it allows for the application of well-established graph traversal and analysis algorithms to produce information about interactions and commonalities among merged annotations. GrAF is an extension of the Linguistic Annotation Framework (LAF) (Ide and Romary, 2004, 2006) developed within ISO TC37 SC4 and, as such, implements state-of-the-art best practice guidelines for representing linguistic annotations.

International Conference on Computational Linguistics | 1990

Word sense disambiguation with very large neural networks extracted from machine readable dictionaries

Jean Véronis; Nancy Ide

In this paper, we describe a means for automatically building very large neural networks (VLNNs) from definition texts in machine-readable dictionaries, and demonstrate the use of these networks for word sense disambiguation. Our method brings together two earlier, independent approaches to word sense disambiguation: the use of machine-readable dictionaries and spreading and activation models. The automatic construction of VLNNs enables real-size experiments with neural networks for natural language processing, which in turn provides insight into their behaviour and design and can lead to possible improvements.

Archive | 1995

Text Encoding Initiative

Nancy Ide; Jean Véronis

This paper traces the history of the Text Encoding Initiative, through the Vassar Conference and the Poughkeepsie Principles to the publication, in May 1994, of the Guidelines for Electronic Text Encoding and Interchange. The authors explain the types of questions that were raised, the attempts made to resolve them, the TEI project's aims, and the general organization of the TEI committees, and they discuss the project's future.

Archive | 2007

Making Sense About Sense

Nancy Ide; Yorick Wilks

We first reconsider the role of lexicographers in word-sense disambiguation as a computational task, as providers of both legacy material (dictionaries) and special test material for competitions like SENSEVAL. We suggest that the standard fine-grained division of senses and (larger) homographs by a lexicographer for use by a human reader may not be an appropriate goal for the computational WSD task. We argue that the level of sense discrimination that NLP needs corresponds roughly to homographs, though we discuss psycholinguistic evidence that there are broad sense divisions with some etymological derivation (i.e. non-homographic) that are as distinct for humans as homographic ones, and they may be part of the broad class of sense divisions we seek to identify here. Fifteen years or more of WSD research has shown that it is this kind of discrimination that existing WSD programs are able to capture at the ~95% success level, whereas the full lexicographically derived division of senses seems to remain too hard for both programs and human discriminators. We link this discussion to the observation that major NLP tasks like MT and IR seem not to need independent WSD modules of the sort produced in the research field, even though they are undoubtedly doing WSD by other means. Our conclusion is that WSD should continue to focus on these broad discriminations, at which it can do very well, thereby possibly offering the close-to-100% success that IR needs (especially search-engine, rather than classic long-query, IR), and assume that this is what most NLP requires, with the possible exception of very fine questions of target word choice in MT. This proposal can be seen as reorienting WSD to what it can actually perform at the standard success levels, but we argue that this, rather than some more idealized vision of sense inherited from lexicography, is what humans and machines can reliably discriminate.

Meeting of the Association for Computational Linguistics | 1998

Veins Theory: A Model of Global Discourse Cohesion and Coherence

Dan Cristea; Nancy Ide; Laurent Romary

In this paper, we propose a generalization of Centering Theory (CT) (Grosz, Joshi, Weinstein (1995)) called Veins Theory (VT), which extends the applicability of centering rules from local to global discourse. A key facet of the theory involves the identification of «veins» over discourse structure trees such as those defined in RST, which delimit domains of referential accessibility for each unit in a discourse. Once identified, reference chains can be extended across segment boundaries, thus enabling the application of CT over the entire discourse. We describe the processes by which veins are defined over discourse structure trees and how CT can be applied to global discourse by using these chains. We also define a discourse «smoothness» index which can be used to compare different discourse structures and interpretations, and show how VT can be used to abstract a span of text in the context of the whole discourse. Finally, we validate our theory by analyzing examples from corpora of English, French, and Romanian.

Meeting of the Association for Computational Linguistics | 1998

Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages

Ludmila Dimitrova; Tomaz Erjavec; Nancy Ide; Heiki-Jaan Kaalep; Vladimír Petkevič; Dan Tufis

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally usable for encoding linguistic corpora. We also describe the methodology for the development of a harmonized set of morphosyntactic descriptions (MSDs), which builds upon the scheme for western European languages developed within the EAGLES project. We discuss the special concerns for handling the six project languages, which cover three distinct language families.

International Conference on Computational Linguistics | 1994

MULTEXT: Multilingual Text Tools and Corpora

Nancy Ide; Jean Véronis

MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded in the Commission of European Communities' Linguistic Research and Engineering Program. The project will contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multilingual text corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards establishing a set of guidelines for text software development, which will be widely published in order to enable future development by others. All tools and data developed within the project will be made freely and publicly available.

International Conference on Computational Linguistics | 2004

Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets

Dan Tufiş; Radu Ion; Nancy Ide

The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents, and is supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by EuroWordNet. The evaluation of the WSD system implementing the method described herein showed very encouraging results. The same system, used in a validation mode, can check and spot alignment errors in multilingually aligned wordnets such as BalkaNet and EuroWordNet.

Natural Language Engineering | 2004

International standard for a linguistic annotation framework

Nancy Ide; Laurent Romary

This paper describes the Linguistic Annotation Framework under development within ISO TC37 SC4 WG1. The Linguistic Annotation Framework is intended to serve as a basis for harmonizing existing language resources as well as developing new ones.

Computers and The Humanities | 2000

Cross-lingual Sense Determination: Can It Work?

Nancy Ide

This article reports the results of a preliminary analysis of translation equivalents in four languages from different language families, extracted from an on-line parallel corpus of George Orwell's Nineteen Eighty-Four. The goal of the study is to determine the degree to which translation equivalents for different meanings of a polysemous word in English are lexicalized differently across a variety of languages, and to determine whether this information can be used to structure or create a set of sense distinctions useful in natural language processing applications. A coherence index is computed that measures the tendency for different senses of the same English word to be lexicalized differently, and from this data a clustering algorithm is used to create sense hierarchies.