Lucas Sterckx
Ghent University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Lucas Sterckx.
empirical methods in natural language processing | 2016
Lucas Sterckx; Cornelia Caragea; Thomas Demeester; Chris Develder
The problem of noisy and unbalanced training data for supervised keyphrase extraction results from the subjectivity of keyphrase assignment, which we quantify by crowdsourcing keyphrases for news and fashion magazine articles with many annotators per document. We show that annotators exhibit substantial disagreement, meaning that single annotator data could lead to very different training sets for supervised keyphrase extractors. Thus, annotations from single authors or readers lead to noisy training data and poor extraction performance of the resulting supervised extractor. We provide a simple but effective solution to still work with such data by reweighting the importance of unlabeled candidate phrases in a two stage Positive Unlabeled Learning setting. We show that performance of trained keyphrase extractors approximates a classifier trained on articles labeled by multiple annotators, leading to higher average F1scores and better rankings of keyphrases. We apply this strategy to a variety of test collections from different backgrounds and show improvements over strong baseline models.
international world wide web conferences | 2015
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We propose an improvement on a state-of-the-art keyphrase extraction algorithm, Topical PageRank (TPR), incorporating topical information from topic models. While the original algorithm requires a random walk for each topic in the topic model being used, ours is independent of the topic model, computing but a single PageRank for each text regardless of the amount of topics in the model. This increases the speed drastically and enables it for use on large collections of text using vast topic models, while not altering performance of the original algorithm.
Knowledge Based Systems | 2016
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
Training relation extractors for the purpose of automated knowledge base population requires the availability of sufficient training data. The amount of manual labeling can be significantly reduced by applying distant supervision, which generates training data by aligning large text corpora with existing knowledge bases. This typically results in a highly noisy training set, where many training sentences do not express the intended relation. In this paper, we propose to combine distant supervision with minimal human supervision by annotating features (in particular shortest dependency paths) rather than complete relation instances. Such feature labeling eliminates noise from the initial training set, resulting in a significant increase of precision at the expense of recall. We further improve on this approach by introducing the Semantic Label Propagation (SLP) method, which uses the similarity between low-dimensional representations of candidate training instances to again extend the (filtered) training set in order to increase recall while maintaining high precision. Our strategy is evaluated on an established test collection designed for knowledge base population (KBP) from the TAC KBP English slot filling task. The experimental results show that SLP leads to substantial performance gains when compared to existing approaches while requiring an almost negligible human annotation effort.
european conference on information retrieval | 2014
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Laurent Mertens; Chris Develder
How useful are topic models based on song lyrics for applications in music information retrieval? Unsupervised topic models on text corpora are often difficult to interpret. Based on a large collection of lyrics, we investigate how well automatically generated topics are related to manual topic annotations. We propose to use the kurtosis metric to align unsupervised topics with a reference model of supervised topics. This metric is well-suited for topic assessments, as it turns out to be more strongly correlated with manual topic quality scores than existing measures for semantic coherence. We also show how it can be used for a detailed graphical topic quality assessment.
international world wide web conferences | 2015
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
We explore how the unsupervised extraction of topic-related keywords benefits from combining multiple topic models. We show that averaging multiple topic models, inferred from different corpora, leads to more accurate keyphrases than when using a single topic model and other state-of-the-art techniques. The experiments confirm the intuitive idea that a prerequisite for the significant benefit of combining multiple models is that the models should be sufficiently different, i.e., they should provide distinct contexts in terms of topical word importance.
language resources and evaluation | 2018
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
AbstractWhile several automatic keyphrase extraction (AKE) techniques have been developed and analyzed, there is little consensus on the definition of the task and a lack of overview of the effectiveness of different techniques. Proper evaluation of keyphrase extraction requires large test collections with multiple opinions, currently not available for research. In this paper, we (i) present a set of test collections derived from various sources with multiple annotations (which we also refer to as opinions in the remained of the paper) for each document, (ii) systematically evaluate keyphrase extraction using several supervised and unsupervised AKE techniques, (iii) and experimentally analyze the effects of disagreement on AKE evaluation. Our newly created set of test collections spans different types of topical content from general news and magazines, and is annotated with multiple annotations per article by a large annotator panel. Our annotator study shows that for a given document there seems to be a large disagreement on the preferred keyphrases, suggesting the need for multiple opinions per document. A first systematic evaluation of ranking and classification of keyphrases using both unsupervised and supervised AKE techniques on the test collections shows a superior effectiveness of supervised models, even for a low annotation effort and with basic positional and frequency features, and highlights the importance of a suitable keyphrase candidate generation approach. We also study the influence of multiple opinions, training data and document length on evaluation of keyphrase extraction. Our new test collection for keyphrase extraction is one of the largest of its kind and will be made available to stimulate future work to improve reliable evaluation of new keyphrase extractors.
international conference on multimedia retrieval | 2016
Baptist Vandersmissen; Lucas Sterckx; Thomas Demeester; Azarakhsh Jalalvand; Wesley De Neve; Rik Van de Walle
The searchability of video content is often limited to the descriptions authors and/or annotators care to provide. The level of description can range from absolutely nothing to fine-grained annotations at the level of frames. Based on these annotations, certain parts of the video content are more searchable than others. Within the context of the STEAMER project, we developed an innovative end-to-end system that attempts to tackle the problem of unsupervised retrieval of news video content, leveraging multiple information streams and deep neural networks. In particular, we extracted keyphrases and named entities from transcripts, subsequently refining these keyphrases and named entities based on their visual appearance in the news video content. Moreover, to allow for fine-grained frame-level annotations, we temporally located high-confidence keyphrases in the news video content. To that end, we had to tackle challenges such as the automatic construction of training sets and the automatic assessment of keyphrase imageability. In this paper, we discuss the main components of our end-to-end system, capable of transforming textual and visual information into fine-grained video annotations.
neural information processing systems | 2014
Lucas Sterckx; Thomas Demeester; Johannes Deleu; Chris Develder
7th Text Analysis Conference, Proceedings | 2014
Matthias Feys; Lucas Sterckx; Laurent Mertens; Johannes Deleu; Thomas Demeester; Chris Develder
north american chapter of the association for computational linguistics | 2018
Klim Zaporojets; Lucas Sterckx; Johannes Deleu; Thomas Demeester; Chris Develder