Publication


Featured research published by Hervé Déjean.


meeting of the association for computational linguistics | 2004

A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora

Eric Gaussier; Jean-Michel Renders; Irina Matveeva; Cyril Goutte; Hervé Déjean

We present a geometric view on bilingual lexicon extraction from comparable corpora, which allows us to reinterpret the methods proposed so far and to identify unresolved problems. This motivates three new methods that aim at solving these problems. Empirical evaluation shows the strengths and weaknesses of these methods, as well as a significant gain in the accuracy of extracted lexicons.


international conference on computational linguistics | 2002

An approach based on multilingual thesauri and model combination for bilingual lexicon extraction

Hervé Déjean; Eric Gaussier; Fatiha Sadat

This paper focuses on exploiting different models and methods in bilingual lexicon extraction, either from parallel or comparable corpora, in specialized domains. First, special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to combine the different models for bilingual lexicon extraction is presented. Our results show that the combination of the models significantly improves results, and that the use of the hierarchical information contained in our thesaurus, UMLS/MeSH, is of primary importance. Lastly, methods for bilingual terminology extraction and thesaurus enrichment are discussed.


document analysis systems | 2006

A system for converting PDF documents into structured XML format

Hervé Déjean; Jean-Luc Meunier

We present in this paper a system for converting legacy PDF documents into structured XML format. This conversion system first extracts the different streams contained in PDF files (text, bitmap and vector images) and then applies different components in order to express the logically structured documents in XML. Some of these components are traditional in Document Analysis; others are more specific to PDF. We also present a graphical user interface for checking, correcting and validating the analysis of the components. Finally, we report on two real use cases in which this system was applied.


document engineering | 2005

Structuring documents according to their table of contents

Hervé Déjean; Jean-Luc Meunier

In this paper, we present a method for structuring a document according to the information present in its Table of Contents (ToC). The detection of the ToC, as well as the determination of the parts it refers to in the document body, relies on a series of generic properties characterizing any ToC, while its hierarchization is achieved using clustering techniques. We also report on the robustness and performance of the method before discussing it in light of related work.


north american chapter of the association for computational linguistics | 2003

Reducing parameter space for word alignment

Hervé Déjean; Eric Gaussier; Cyril Goutte; Kenji Yamada

This paper presents the experimental results of our attempts to reduce the size of the parameter space in a word alignment algorithm. We use IBM Model 4 as a baseline. In order to reduce the parameter space, we pre-processed the training corpus using a word lemmatizer and a bilingual term extraction algorithm. Using these additional components, we obtained an improvement in the alignment error rate.


Artificial Intelligence in Medicine | 2005

Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval

Hervé Déjean; Eric Gaussier; Jean-Michel Renders; Fatiha Sadat

OBJECTIVES: We present in this article experiments on multi-language information extraction and access in the medical domain. For such applications, multilingual terminology plays a crucial role when working on specialized languages and specific domains. MATERIAL AND METHODS: We propose, first, a method for enriching multilingual thesauri which extracts new terms from parallel corpora and, second, a new approach for bilingual lexicon extraction from comparable corpora, which uses a bilingual thesaurus as a pivot. We illustrate their use in multi-language information retrieval (English/German) in the medical domain. RESULTS: Our experiments show that these automatically extracted bilingual lexicons are accurate enough (85% precision for term extraction) for semi-automatically enriching mono- or bilingual thesauri such as the Unified Medical Language System, and that their use in cross-language information retrieval significantly improves retrieval performance (from 22 to 40% average precision) and clearly outperforms existing bilingual lexicon resources (both general lexicons and specialized ones). CONCLUSION: We show in this paper, first, that bilingual lexicon extraction from parallel corpora in the medical domain can lead to accurate, specialized lexicons, which can be used to help enrich existing thesauri and, second, that bilingual lexicons extracted from comparable corpora outperform general bilingual resources for cross-language information retrieval.


International Journal on Document Analysis and Recognition | 2009

On tables of contents and how to recognize them

Hervé Déjean; Jean-Luc Meunier

We present a method for structuring a document according to the information present in its different organizational tables: table of contents, tables of figures, etc. This method is based on a two-step approach that leverages functional and formal (layout-based) kinds of knowledge. The functional definition of an organizational table, based on five properties, is used to provide a first solution, which is improved in a second step by automatically learning the form of the table of contents. We also report on the robustness and performance of the method, and we illustrate its use in a real conversion case.


european conference on research and advanced technology for digital libraries | 2005

From legacy documents to XML: a conversion framework

Jean-Pierre Chanod; Boris Chidlovskii; Hervé Déjean; Olivier Fambon; Jérôme Fuselier; Thierry Jacquin; Jean-Luc Meunier

We present an integrated framework for document conversion from legacy formats to XML. We describe the LegDoC project, aimed at automating the conversion of layout annotations in layout-oriented formats such as PDF, PS and HTML into semantic-oriented annotations. A toolkit of different components covers complementary techniques, combining logical document analysis and semantic annotation with machine learning methods. We use a real conversion project as a driving example to illustrate the different techniques implemented in the project.


document engineering | 2007

Logical document conversion: combining functional and formal knowledge

Hervé Déjean; Jean-Luc Meunier

We present in this paper a method for document layout analysis based on identifying the function of document elements (what they do). This approach is orthogonal and complementary to the traditional view based on the form of document elements (how they are constructed). One key advantage of such functional knowledge is that the functions of some document elements are very stable from document to document and over time. Relying on the stability of such functions, the method is not impacted by layout variability, a key issue in logical document analysis, and is thus very robust and versatile. The method starts the recognition process by using functional knowledge and then, in a second step, uses formal knowledge as a source of feedback in order to correct some errors. This allows the method to adapt to specific documents by exploiting their formal specificities.


document analysis systems | 2010

Reflections on the INEX structure extraction competition

Hervé Déjean; Jean-Luc Meunier

After participating twice in the Structure Extraction task of the INEX competition, which consists in building navigation tools for digitised books by constructing hyperlinked tables of contents from OCR text and layout information, we present in this paper some reflections on this competition, regarding its dataset and its evaluation measure. We point out some issues and propose some recommendations for improving the ground truth and the measures.

Collaboration


Dive into Hervé Déjean's collaboration.

Top Co-Authors

Eric Gaussier

Centre national de la recherche scientifique

Cyril Goutte

National Research Council
