Vincent Claveau
Centre national de la recherche scientifique
Publication
Featured research published by Vincent Claveau.
european conference on information retrieval | 2007
Fabienne Moreau; Vincent Claveau; Pascale Sébillot
Information retrieval systems (IRSs) usually have difficulty recognizing the same idea when it is expressed in different forms. One way to improve these systems is to take morphological variants into account. We propose a simple yet effective method to recognize such variants, which are then used to enrich queries. Unlike previously published methods, our system needs no external resources or a priori knowledge and thus supports many languages. This new approach is evaluated on several collections covering 6 different languages and is compared to existing tools such as a stemmer and a lemmatizer. The reported results show a significant and systematic improvement of overall IRS effectiveness, in terms of both precision and recall, for every language.
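The abstract does not give implementation details; as a minimal sketch only, the code below groups terms sharing a long common prefix (a crude, resource-free proxy for morphological relatedness, not the authors' actual method) and uses those groups to enrich a query. The vocabulary and parameter are invented for illustration.

```python
from collections import defaultdict

def variant_groups(vocabulary, min_prefix=5):
    """Group vocabulary terms sharing a long common prefix.

    Only a crude, resource-free proxy for morphological relatedness;
    the paper's actual variant-detection method is not reproduced here.
    """
    groups = defaultdict(set)
    for term in vocabulary:
        if len(term) >= min_prefix:
            groups[term[:min_prefix]].add(term)
    return groups

def enrich_query(query_terms, groups, min_prefix=5):
    """Expand each query term with the variants found in its prefix group."""
    expanded = set(query_terms)
    for term in query_terms:
        if len(term) >= min_prefix:
            expanded |= groups.get(term[:min_prefix], set())
    return sorted(expanded)

# Hypothetical vocabulary and query
vocab = ["retrieval", "retrieve", "retrieving", "morphology", "morphological"]
groups = variant_groups(vocab)
print(enrich_query(["retrieving", "morphology"], groups))
```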
multimedia information retrieval | 2010
Pierre Tirilly; Vincent Claveau; Patrick Gros
Current text retrieval techniques make it possible to index and retrieve text documents very efficiently and with good accuracy. Image retrieval, by contrast, is still very coarse and does not yield satisfying results. Computer vision researchers therefore try to build on text retrieval contributions to enhance their retrieval systems. In particular, Sivic and Zisserman, with their Video Google framework [1], propose a description of images similar to standard text descriptors: images are described by elementary image parts, called visual words. They thus perform image retrieval using the standard Vector Space Model (VSM) of Information Retrieval (IR) and benefit from classical IR techniques such as inverted files. Among available text retrieval techniques, automatically finding the most relevant words to describe a document has been studied intensively for texts, but not for images. In this paper, we explore the use of term weighting techniques and classical distances from text retrieval in the case of images. These weights are standard VSM weights, weights derived from probabilistic models of IR, or new weighting schemes that we propose. Our experiments, performed on several datasets, show that no weighting scheme improves retrieval on every dataset, but also that choosing weights that fit the properties of the dataset can improve precision and MAP by up to 10%. This study provides interesting insights into the semantic and statistical differences between textual and visual words, and into the way visual word-based image retrieval systems can be optimized. It also shows some limits of the bag-of-visual-words model and the relation between Minkowski distances and local weighting. Finally, this study questions some experimental habits commonly found in the literature (choice of the L1 distance, TF*IDF weights, and evaluation using a single dataset only).
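As a rough sketch of the kind of pipeline discussed (not the paper's specific weighting schemes), the following applies TF*IDF weighting to bag-of-visual-words histograms and ranks images by L1 (Minkowski p=1) distance; the toy counts are assumptions.

```python
import numpy as np

def tfidf_weight(counts):
    """counts: (n_images, n_visual_words) bag-of-visual-words histograms."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=0)
    idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1.0
    return tf * idf

def rank_l1(query_vec, index_vecs):
    """Rank indexed images by L1 distance to the query image."""
    dists = np.abs(index_vecs - query_vec).sum(axis=1)
    return np.argsort(dists)

# Toy example: 4 images described over 6 visual words (hypothetical counts)
counts = np.array([[3, 0, 1, 0, 2, 0],
                   [0, 4, 0, 1, 0, 0],
                   [2, 0, 2, 0, 3, 1],
                   [0, 1, 0, 5, 0, 0]])
weighted = tfidf_weight(counts)
print(rank_l1(weighted[0], weighted))  # image 0 and its nearest neighbours first
```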
BMC Bioinformatics | 2008
Aurélie Névéol; Sonya E. Shooshan; Vincent Claveau
Background: Indexing is a crucial step in any information retrieval system. In MEDLINE, a widely used database of the biomedical literature, the indexing process involves the selection of Medical Subject Headings in order to describe the subject matter of articles. The need for automatic tools to assist MEDLINE indexers in this task is growing with the increasing number of publications being added to MEDLINE. Methods: In this paper, we describe the use and the customization of Inductive Logic Programming (ILP) to infer indexing rules that may be used to produce automatic indexing recommendations for MEDLINE indexers. Results: Our results show that this original ILP-based approach outperforms manual rules when they exist. In addition, the use of ILP rules also improves the overall performance of the Medical Text Indexer (MTI), a system producing automatic indexing recommendations for MEDLINE. Conclusion: We expect the sets of ILP rules obtained in this experiment to be integrated into MTI.
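The paper learns its indexing rules automatically with ILP; the sketch below only illustrates what applying such rules could look like once learned. The rules and the MeSH headings shown are purely hypothetical examples, not rules from the paper or from MTI.

```python
# Each rule maps a set of trigger conditions on the citation text to a
# recommended MeSH heading. These rules are invented for illustration;
# the paper infers its rules with Inductive Logic Programming.
HYPOTHETICAL_RULES = [
    ({"randomized", "placebo"}, "Randomized Controlled Trials as Topic"),
    ({"mice", "knockout"}, "Mice, Knockout"),
]

def recommend_headings(citation_text, rules=HYPOTHETICAL_RULES):
    """Return the headings whose trigger terms all appear in the citation."""
    tokens = set(citation_text.lower().split())
    return [heading for triggers, heading in rules if triggers <= tokens]

print(recommend_headings(
    "A randomized placebo controlled study of drug X in adults"))
```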
european conference on information retrieval | 2009
Patrick Bosc; Vincent Claveau; Olivier Pivert; Laurent Ughetto
This paper investigates the use of fuzzy logic mechanisms coming from the database community, namely graded inclusions, to model the information retrieval process. In this framework, documents and queries are represented by fuzzy sets, which are paired with operations such as fuzzy implications and T-norms. Through different experiments, it is shown that only some of the wide range of fuzzy operations are relevant for information retrieval. When appropriate settings are chosen, it is possible to mimic classical systems, thus yielding results rivaling those of state-of-the-art systems. These positive results validate the proposed approach, while negative ones give some insights into the properties needed by such a model. Moreover, this paper shows the added value of this graded-inclusion-based model, which gives new and theoretically grounded ways for users to weight their query terms, to include negative information in their queries, or to expand them with related terms.
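As a minimal sketch of the general idea (a graded inclusion of the query fuzzy set in the document fuzzy set, computed with a fuzzy implication), the code below uses the Łukasiewicz implication and averages the implication degrees over query terms; the operators and aggregation actually retained in the paper may differ, and the fuzzy memberships are hypothetical.

```python
def lukasiewicz_implication(a, b):
    """Łukasiewicz fuzzy implication: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def graded_inclusion(query, document, implication=lukasiewicz_implication):
    """Degree to which the query fuzzy set is included in the document fuzzy set.

    query, document: dicts mapping terms to membership degrees in [0, 1].
    Implication degrees are averaged over query terms here; other
    aggregations (e.g. a T-norm) could be used instead.
    """
    if not query:
        return 1.0
    degrees = [implication(mu_q, document.get(term, 0.0))
               for term, mu_q in query.items()]
    return sum(degrees) / len(degrees)

# Hypothetical fuzzy representations (e.g. normalized term weights)
doc = {"fuzzy": 0.8, "retrieval": 0.6, "logic": 0.4}
query = {"fuzzy": 1.0, "retrieval": 0.5}
print(graded_inclusion(query, doc))
```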
Computer Speech & Language | 2015
Vincent Claveau; Sébastien Lefèvre
A fine-grained segmentation of radio or TV broadcasts is an essential step for most multimedia processing tasks. Applying segmentation algorithms to speech transcripts seems straightforward, yet most of these algorithms are not suited to short segments or noisy data. In this paper, we present a new segmentation technique inspired by the image analysis field and relying on a new way to compute similarities between candidate segments, called vectorization. Vectorization makes it possible to match text segments that do not share common words; this property is shown to be particularly useful when dealing with transcripts in which transcription errors and short segments make segmentation difficult. This new topic segmentation technique is evaluated on two corpora of transcripts from French TV broadcasts, on which it largely outperforms existing state-of-the-art approaches.
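As an illustrative sketch of the vectorization idea (representing each candidate segment by its similarity profile against a set of pivot texts, so that two segments sharing no word can still be compared through the pivots), assuming scikit-learn is available; the actual similarity measure and pivot selection of the paper are not reproduced here, and the texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vectorize(segments, pivots):
    """Represent each segment by its cosine similarities to pivot texts."""
    tfidf = TfidfVectorizer().fit(pivots + segments)
    return cosine_similarity(tfidf.transform(segments), tfidf.transform(pivots))

# Hypothetical pivots and candidate segments
pivots = ["economy market finance bank",
          "football match player goal",
          "weather rain sun forecast"]
segments = ["the bank raised its rates", "the striker scored a late goal"]
profiles = vectorize(segments, pivots)
# Segments are now compared through their pivot-similarity profiles
print(cosine_similarity(profiles)[0, 1])
```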
conference of european society for fuzzy logic and technology | 2011
Laurent Ughetto; Vincent Claveau
Recently, a theoretical fuzzy IR system based on gradual inclusion measures has been proposed [1]. In this model, derived from the division of fuzzy relations, the gradual inclusion of a query in a document is modeled by a fuzzy implication. In previous papers, we have shown that, under some assumptions, this model can be seen as a Vector Space Model. This paper studies other interpretations of our fuzzy IR models based on gradual inclusions. It is shown that the fuzzy models can also be interpreted as language models for IR. The links with logical models of IR are also recalled. More generally, this paper discusses the links between these models, seen from the point of view of our fuzzy models.
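For intuition only, and not necessarily with the operators retained in the paper, the derivation below shows how a gradual-inclusion score can collapse to a Vector Space Model ranking when the Reichenbach implication I(a,b) = 1 - a + ab is averaged over query terms.

```latex
% Illustrative derivation (an assumption, not the paper's exact setting):
% Reichenbach implication I(a,b) = 1 - a + ab, arithmetic mean over query terms.
\[
  \mathrm{RSV}(q,d)
  = \frac{1}{|q|} \sum_{t \in q} \bigl(1 - \mu_q(t) + \mu_q(t)\,\mu_d(t)\bigr)
  = \underbrace{1 - \tfrac{1}{|q|}\textstyle\sum_{t \in q}\mu_q(t)}_{\text{depends on } q \text{ only}}
  \;+\; \frac{1}{|q|}\,\langle \mu_q, \mu_d \rangle ,
\]
% so documents are ranked exactly as by the inner product
% \(\langle \mu_q, \mu_d \rangle\), i.e. as in a Vector Space Model.
```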
ieee international conference on fuzzy systems | 2009
Patrick Bosc; Laurent Ughetto; Olivier Pivert; Vincent Claveau
This paper investigates the use of fuzzy logic mechanisms coming from the database community, namely graded inclusions, to model the information retrieval process. Two kinds of graded inclusions are considered. In this framework, documents and queries are represented by fuzzy sets, which are paired with operations such as fuzzy implications and T-norms. Through different experiments, it is shown that only some of the wide range of fuzzy operations are relevant for information retrieval. When appropriate settings are chosen, it is possible to mimic classical systems, thus yielding results rivaling those of state-of-the-art systems. These positive results validate the proposed approach, while negative ones give some insights into the properties needed by such a model. Moreover, this paper shows the added value of this graded-inclusion-based model, which gives new and theoretically grounded ways for users to weight their query terms, to include negative information in their queries, or to expand them with related terms.
european conference on information retrieval | 2018
Vincent Claveau
Examining the properties of representation spaces for documents or words in Information Retrieval (IR) – typically ℝ^n with n large – brings precious insights that help the retrieval process. Recently, several authors have studied the real dimensionality of datasets, called intrinsic dimensionality, in specific parts of these spaces [14]. They have shown that this dimensionality is chiefly tied to the notion of indiscriminateness among the neighbors of a query point in the vector space. In this paper, we propose to revisit this notion in the specific case of IR. More precisely, we show how to estimate indiscriminateness from IR similarities in order to use it in the representation spaces of documents and words [7, 18]. We show that indiscriminateness may be used to characterize difficult queries; moreover, we show that this notion, applied to word embeddings, can help choose terms for query expansion.
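To illustrate the kind of estimate involved (not necessarily the exact estimator used in the paper), the sketch below computes a maximum-likelihood estimate of local intrinsic dimensionality from the k nearest-neighbor distances of a query point, in the style of the Levina-Bickel / Amsaleg et al. estimators; the distances are randomly generated for the example.

```python
import numpy as np

def local_intrinsic_dimensionality(distances, k=20):
    """MLE of local intrinsic dimensionality from k nearest-neighbor distances.

    Shown only as an illustration of the kind of local estimate discussed
    in the paper; the paper's own indiscriminateness estimation from IR
    similarities is not reproduced here.
    """
    r = np.sort(np.asarray(distances, dtype=float))[:k]
    r = r[r > 0]                      # drop exact duplicates of the query point
    if len(r) < 2:
        return float("inf")
    return -1.0 / np.mean(np.log(r / r[-1]))

# Hypothetical neighbor distances of a query point in an embedding space
rng = np.random.default_rng(0)
dists = rng.uniform(0.1, 1.0, size=50)
print(local_intrinsic_dimensionality(dists))
```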
MediaEval 2016: "Verfiying Multimedia Use" task | 2016
Cédric Maigrot; Vincent Claveau; Ewa Kijak; Ronan Sicre
This paper presents a multi-modal hoax detection system composed of text, source, and image analysis. As hoaxes can be very diverse, we analyze several modalities to better detect them. This system is applied in the context of the Verifying Multimedia Use task of MediaEval 2016. Experiments show the performance of each separate modality as well as their combination.
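As a minimal sketch of late fusion across modalities (not the combination strategy described in the paper), the code below assumes each modality already outputs a hoax probability; the scores and weights are invented for illustration.

```python
def fuse_modalities(scores, weights=None):
    """Late fusion of per-modality hoax probabilities by weighted average.

    scores: dict such as {"text": 0.8, "source": 0.4, "image": 0.6},
    each value being a probability that the post is a hoax.
    The weights are hypothetical; the paper's actual combination is
    not reproduced here.
    """
    weights = weights or {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

print(fuse_modalities({"text": 0.8, "source": 0.4, "image": 0.6},
                      weights={"text": 2.0, "source": 1.0, "image": 1.0}))
```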
applications of natural language to data bases | 2015
Béatrice Arnulphy; Vincent Claveau; Xavier Tannier; Anne Vilnat
Identifying events in texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and the TempEval challenges, it has received some attention in recent years. However, no reference result is available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining, for instance, Conditional Random Fields, language modeling, and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained in both languages validate our approach.
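As a small sketch of CRF-based event mention tagging in the BIO scheme (one of the ingredients mentioned in the abstract, not the authors' actual system or features), assuming the sklearn-crfsuite package is installed; the sentences, labels, and features are toy examples.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sentence, i):
    """Simple word-shape features for one token (purely illustrative)."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev.lower": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

# Toy French sentences with hypothetical BIO labels for event mentions
sentences = [["Le", "séisme", "a", "frappé", "la", "ville"],
             ["Une", "réunion", "aura", "lieu", "demain"]]
labels = [["O", "B-EVENT", "O", "B-EVENT", "O", "O"],
          ["O", "B-EVENT", "O", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```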