Publication


Featured research published by Béatrice Daille.


International Joint Conference on Natural Language Processing | 2005

French-English terminology extraction from comparable corpora

Béatrice Daille; Emmanuel Morin

This article presents a method of extracting bilingual lexica composed of single-word terms (SWTs) and multi-word terms (MWTs) from comparable corpora of a technical domain. First, this method extracts MWTs in each language, and then uses statistical methods to align single words and MWTs by exploiting the term contexts. After explaining the difficulties involved in aligning MWTs and specifying our approach, we show the adopted process for bilingual terminology extraction and the resources used in our experiments. Finally, we evaluate our approach and demonstrate its significance, particularly in relation to non-compositional MWT alignment.


Language Resources and Evaluation | 2010

Compositionality and lexical alignment of multi-word terms

Emmanuel Morin; Béatrice Daille

The automatic compilation of bilingual lists of terms from specialized comparable corpora using lexical alignment has been successful for single-word terms (SWTs), but remains disappointing for multi-word terms (MWTs). The low frequency and the variability of the syntactic structures of MWTs in the source and the target languages are the main reported problems. This paper defines a general framework dedicated to the lexical alignment of MWTs from comparable corpora that includes a compositional translation process and the standard lexical context analysis. Because the compositional method, which is based on the translation of lexical items, is restrictive, we introduce an extended compositional method that bridges the gap between MWTs of different syntactic structures through morphological links. We experimented with the two compositional methods for the French–Japanese alignment task. The results show a significant improvement for the translation of MWTs and advocate further morphological analysis in lexical alignment.
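
As a rough illustration of what a compositional translation step can look like, the sketch below translates each word of a multi-word term with a bilingual dictionary, recombines the translations in every word order, and keeps only the candidates that appear among the terms extracted from the target corpus. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy French-to-English dictionary, and the target term list are all hypothetical, and the extended method based on morphological links is not covered.

```python
from itertools import permutations, product


def compositional_candidates(mwt, dictionary):
    """Generate candidate translations of a multi-word term, word by word.

    `dictionary` maps a source word to a list of target words (hypothetical data).
    """
    options = [dictionary.get(word, []) for word in mwt.split()]
    if any(not translations for translations in options):
        return []  # at least one component has no dictionary entry
    candidates = set()
    for combo in product(*options):
        # Word order often differs between languages, so try every ordering.
        for ordering in permutations(combo):
            candidates.add(" ".join(ordering))
    return sorted(candidates)


def translate_mwt(mwt, dictionary, target_terms):
    """Keep only the candidates attested among the target-language terms."""
    return [c for c in compositional_candidates(mwt, dictionary) if c in target_terms]


# Toy data, invented for illustration only.
toy_dictionary = {"fatigue": ["fatigue", "tiredness"], "chronique": ["chronic"]}
toy_target_terms = {"chronic fatigue", "fatigue syndrome"}

print(translate_mwt("fatigue chronique", toy_dictionary, toy_target_terms))
```

The filter against attested target terms is what makes the plain method restrictive: if no word-for-word recombination exists in the target term list, nothing is returned.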


Machine Translation | 1998

Bricks and Skeletons: Some Ideas for the Near Future of MAHT

Jean-Marc Lange; Eric Gaussier; Béatrice Daille

This paper sets forth some ideas for the evolution of translation tools in the near future. The proposed improvements consist in a closer integration of terminology and sentence databases. In particular, we suggest that bilingual sentence databases (translation memories) could be refined by splitting sentences into large “bricks” of text. We also propose a mechanism through which bilingual sentence databases could be generalized by replacing the known terms by variable placeholders, thus yielding technical sentence “skeletons”. This latter idea is the most original and, in our view, the most promising for future implementations. Although these ideas are not yet supported by experiments, we believe that they can be implemented using simple techniques, following the general philosophy that such tools should go as far as possible while remaining robust and useful for the human translator.
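
The "skeleton" idea, replacing known terms with variable placeholders, can be sketched in a few lines. The abstract states that the proposal is not backed by experiments, so the following is only one possible reading: the function name, the `<TERMn>` placeholder format, and the example sentence are assumptions made for illustration.

```python
import re


def make_skeleton(sentence, known_terms):
    """Replace each known term with a numbered placeholder.

    Returns the skeleton and a mapping from placeholder to the term it replaced.
    """
    bindings = {}
    skeleton = sentence
    # Longest terms first, so a multi-word term wins over a term nested inside it.
    for term in sorted(known_terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if pattern.search(skeleton):
            slot = f"<TERM{len(bindings) + 1}>"
            bindings[slot] = term
            skeleton = pattern.sub(slot, skeleton)
    return skeleton, bindings


sentence = "Remove the fuel pump before replacing the oil filter."
skeleton, bindings = make_skeleton(sentence, ["oil filter", "fuel pump", "pump"])
print(skeleton)
print(bindings)
```

Two sentences that differ only in their terms map to the same skeleton, which is what would let a translation memory generalize across them. Placeholders here are numbered in the order the terms are processed, not in sentence order.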


ACM Transactions on Speech and Language Processing | 2010

Brains, not brawn: The use of “smart” comparable corpora in bilingual terminology mining

Emmanuel Morin; Béatrice Daille; Koichi Takeuchi; Kyo Kageura

Current research in text mining favors the quantity of texts over their representativeness. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the representativeness rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the representativeness of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. This article tests this hypothesis on a French-Japanese bilingual term extraction task. To demonstrate how important the type of discourse is as a characteristic of the comparable corpora, we used a state-of-the-art multilingual terminology mining chain composed of two extraction programs, one in each language, and an alignment program. We evaluated the candidate translations using a reference list, and found that taking discourse type into account resulted in candidate translations of a better quality even when the corpus size was reduced by half.


Information Processing and Management | 2002

In vitro evaluation of a program for machine-aided indexing

Christian Jacquemin; Béatrice Daille; Jean Royanté; Xavier Polanco

This article presents the human evaluation of ILIAD, a program for machine-aided indexing (MAI). It consists of two language engineering modules and is designed to assist expert librarians in computer-aided indexing and document analysis. Our aim is the expert evaluation of automatic multi-word term indexing. Evaluation is performed by documentary engineers. Cataloging and indexing are their principal tasks. They also have a good scientific knowledge of the domain to which the indexed documents belong. We first present the ILIAD program and the two systems submitted to this evaluation, the methodology (protocol) adopted, the differences between the protocol and the implementation, and the results of these evaluations. Human evaluation is divided into three parts: firstly the evaluation of controlled indexing, then free indexing and finally term variant extraction performed during controlled indexing. Finally, we analyze the relevance of this evaluation by calculating the agreement frequency and the Kappa coefficient and propose some future developments.
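
For readers unfamiliar with the two agreement statistics mentioned above, the sketch below computes the observed agreement frequency and a Kappa coefficient for two judges. The abstract does not say which Kappa variant or how many judges were involved, so Cohen's two-rater formulation is an assumption here, and the judgment labels are invented.

```python
from collections import Counter


def agreement_and_kappa(ratings_a, ratings_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length rating lists."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Share of items on which both judges gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each judge's label distribution.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both judges used one and the same label throughout
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Invented judgments on eight indexing terms.
judge_1 = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "ok"]
judge_2 = ["ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok"]
observed, kappa = agreement_and_kappa(judge_1, judge_2)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than the raw agreement because it discounts the matches two judges would produce by chance given how often each uses every label.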


International Symposium on Computers and Communications | 2008

Multi-word term indexing for Arabic document retrieval

Siham Boulaknadel; Béatrice Daille; Driss Aboutajdine

To improve the performance of information retrieval systems, it seems important to identify key phrases, which represent the semantic content of a text better than single-word terms do. In this paper, we adapt the standard method for multi-word term extraction to the Arabic language. We define the linguistic specifications and develop a term extraction tool. We test the term extraction program on document retrieval in a specific domain, evaluate two kinds of multi-word term weighting functions that consider either the corpus or the document, and demonstrate the efficiency of multi-word term indexing for both weightings, with gains of up to 5.8% in average precision.


Language Resources and Evaluation | 2011

Annotating opinion--evaluation of blogs: the Blogoscopy corpus

Béatrice Daille; Estelle Dubreil; Laura Monceaux; Matthieu Vernier

The blog phenomenon is universal. Blogs are characterized by their evaluative use, in that they enable Internet users to express their opinion on a given subject. From this point of view, they are an ideal resource for building an annotated sentiment analysis corpus that crosses the subject with the opinion expressed on it. This paper presents the Blogoscopy corpus for the French language, which was built from personal thematic blogs. The annotation was governed by three principles: theoretical, as opinion is grounded in a linguistic theory of evaluation; practical, as every opinion is linked to an object; and methodological, as annotation rules and successive phases are defined to ensure quality and thoroughness.


International Conference on Computational Linguistics | 2012

Neoclassical compound alignments from comparable corpora

Rima Harastani; Béatrice Daille; Emmanuel Morin

The paper deals with the automatic compilation of a bilingual dictionary from specialized comparable corpora. We concentrate on a method to automatically extract and align neoclassical compounds in two languages from comparable corpora. In order to do this, we assume that neoclassical compounds translate compositionally to neoclassical compounds from one language to another. The method covers the two main forms of neoclassical compounds and is split into three steps: extraction, generation, and selection. Our program takes as input a list of aligned neoclassical elements and a bilingual dictionary in two languages. We also align neoclassical compounds by a pivot language approach, based on the hypothesis that the neoclassical element remains stable in meaning across languages. We experiment with four languages (English, French, German, and Spanish) using corpora in the domain of renewable energy; we obtain a precision of 96%.


International Conference on Computational Linguistics | 2012

Clustering short text and its evaluation

Prajol Shrestha; Christine Jacquin; Béatrice Daille

Interest in clustering short text has increased recently because it can be used in many NLP applications. Depending on the application, short texts can be characterized mainly by their length (e.g. sentences, paragraphs) and type (e.g. scientific papers, newspapers). Finding a clustering method that is able to cluster short text in general is difficult. In this paper, we cluster four different corpora containing different types of text of varying length and evaluate them against the gold standard. Based on these clustering experiments, we show how different similarity measures, clustering algorithms, and cluster evaluation methods affect the resulting clusters. We discuss four existing corpus-based similarity methods (Cosine similarity, Latent Semantic Analysis, Short text Vector Space Model, and Kullback-Leibler distance), four well-known clustering methods (Complete Link, Single Link, and Average Link hierarchical clustering, and Spectral clustering), and three evaluation methods (clustering F-measure, adjusted Rand Index, and V). Our experiments show that corpus-based similarity measures do not significantly affect the clusters and that spectral clustering performs better than hierarchical clustering. We also show that the values given by the evaluation methods do not always represent the usability of the clusters.


Applications of Natural Language to Data Bases | 2007

A service oriented architecture for adaptable terminology acquisition

Farid Cerbah; Béatrice Daille

Terminology acquisition has proven to be useful in a wide range of applications, such as Information Retrieval. Robust tools have been developed to support these corpus-based acquisition processes. However, practitioners in this field cannot yet benefit from reference architectures that may greatly help to build large-scale applications. The work described in this paper shows how an open architecture can be designed using web services technology. This architecture is implemented in the HyperTerm acquisition platform. We show how it has been used for coupling HyperTerm and the ACABIT term extractor.
