Gustavo Paetzold
University of Sheffield
Publications
Featured research published by Gustavo Paetzold.
Meeting of the Association for Computational Linguistics | 2015
Lucia Specia; Gustavo Paetzold; Carolina Scarton
This paper presents QUEST++, an open-source tool for quality estimation that can predict quality for texts at the word, sentence, and document levels. It also provides pipelined processing, whereby predictions made at a lower level (e.g. for words) can be used as input to build models for predictions at a higher level (e.g. for sentences). QUEST++ allows the extraction of a variety of features, and provides machine learning algorithms to build and test quality estimation models. Results on recent datasets show that QUEST++ achieves state-of-the-art performance.
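The pipelined design can be illustrated with a minimal sketch, in which word-level quality predictions are aggregated into features for a sentence-level model. Everything below, from the feature values to the choice of learners, is hypothetical; this is not QUEST++'s actual API.

```python
# Minimal sketch of pipelined quality estimation: word-level predictions
# become input features for a sentence-level model. Illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical word-level training data: one feature vector per word,
# labelled 1 (good) or 0 (bad).
word_X = np.array([[0.2, 1.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.8]])
word_y = np.array([1, 0, 1, 1])
word_model = LogisticRegression().fit(word_X, word_y)

def sentence_features(word_vectors):
    """Aggregate word-level quality probabilities into sentence features."""
    probs = word_model.predict_proba(word_vectors)[:, 1]
    return [probs.mean(), probs.min(), float((probs < 0.5).mean())]

# Hypothetical sentences (arrays of word vectors) with quality scores.
sents = [np.array([[0.2, 1.0], [0.5, 0.5]]), np.array([[0.9, 0.1], [0.1, 0.8]])]
sent_y = [0.9, 0.4]
sent_model = LinearRegression().fit([sentence_features(s) for s in sents], sent_y)
print(sent_model.predict([sentence_features(sents[0])]))
```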
North American Chapter of the Association for Computational Linguistics | 2015
Gustavo Paetzold
Lexical Simplification is the task of modifying the lexical content of complex sentences in order to make them simpler. Due to the lack of reliable resources available for the task, most existing approaches have difficulty producing simplifications that are grammatical and preserve the meaning of the original text. In order to improve on the state of the art for this task, we propose user studies with non-native speakers, which will result in new, sizeable datasets, as well as novel ways of performing Lexical Simplification. The results of our first experiments show that new types of classifiers, along with the use of additional resources such as spoken-text language models, produce state-of-the-art results for the Lexical Simplification task of SemEval-2012.
North American Chapter of the Association for Computational Linguistics | 2016
Gustavo Paetzold; Lucia Specia
We report the findings of the Complex Word Identification task of SemEval 2016. To create a dataset, we conduct a user study with 400 non-native English speakers, and find that complex words tend to be rarer, less ambiguous and shorter. A total of 42 systems were submitted from 21 distinct teams, and nine baselines were provided. The results highlight the effectiveness of Decision Trees and Ensemble methods for the task, but ultimately reveal that word frequencies remain the most reliable predictor of word complexity.
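The headline finding, that frequency is the most reliable predictor, is easy to picture with a toy baseline: flag a word as complex when its corpus frequency falls below a threshold. The counts and the threshold below are invented for illustration.

```python
# Toy complex word identification baseline built on frequency alone.
# Hypothetical frequencies (occurrences per million tokens).
freq = {"house": 512.0, "dog": 804.0, "ubiquitous": 3.1, "perspicacious": 0.02}

def is_complex(word, threshold=5.0):
    """Flag a word as complex if its frequency is below the threshold."""
    return freq.get(word, 0.0) < threshold

for w in freq:
    print(w, "->", "complex" if is_complex(w) else "simple")
```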
Workshop on Statistical Machine Translation | 2015
Kashif Shah; Varvara Logacheva; Gustavo Paetzold; Frédéric Blain; Daniel Beck; Fethi Bougares; Lucia Specia
We describe our systems for Tasks 1 and 2 of the WMT15 Shared Task on Quality Estimation. Our submissions use (i) a continuous-space language model to extract additional features for Task 1 (SHEF-GP, SHEF-SVM), (ii) a continuous bag-of-words model to produce word embeddings as features for Task 2 (SHEF-W2V), and (iii) a combination of features produced by QuEst++ and features produced with word embedding models (SHEF-QuEst++). Our systems outperform the baseline as well as many other submissions. The results are especially encouraging for Task 2, where our best-performing system (SHEF-W2V) only uses features learned in an unsupervised fashion.
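The idea behind the unsupervised SHEF-W2V features can be sketched as follows: embeddings are learned from raw text with a continuous bag-of-words model and then used directly as classifier input. The gensim calls are real, but the corpus and the feature layout are toy assumptions, not the submitted system.

```python
# Sketch: unsupervised word embeddings as word-level QE features.
import numpy as np
from gensim.models import Word2Vec

# Tiny toy corpus; a real system would train on large monolingual data.
sentences = [["the", "cat", "sat"], ["a", "dog", "ran"], ["the", "dog", "sat"]]
w2v = Word2Vec(sentences, vector_size=10, min_count=1, sg=0)  # sg=0: CBOW

def word_features(target, aligned_source):
    """Concatenate target- and source-word embeddings as QE features."""
    return np.concatenate([w2v.wv[target], w2v.wv[aligned_source]])

print(word_features("cat", "dog").shape)  # (20,)
```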
North American Chapter of the Association for Computational Linguistics | 2016
Gustavo Paetzold; Lucia Specia
We introduce a bootstrapping algorithm for regression that exploits word embedding models. We use it to infer four psycholinguistic properties of words: Familiarity, Age of Acquisition, Concreteness, and Imagery, and we further populate the MRC Psycholinguistic Database with these properties. The approach achieves a 0.88 correlation with human-produced values, and the inferred psycholinguistic features lead to state-of-the-art results when used in a Lexical Simplification task.
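A minimal sketch of the bootstrapping idea: train a regressor on a small seed lexicon of annotated words, predict the property for unannotated words from their embeddings, move the most confident prediction into the training set, and repeat. The words, scores, embeddings, and confidence proxy below are all invented for illustration.

```python
# Bootstrapped regression over word embeddings (illustrative sketch).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["sun", "dog", "moon", "axiom", "lemma"]}
seed = {"sun": 6.1, "dog": 6.5}            # hypothetical Familiarity scores
unlabeled = [w for w in emb if w not in seed]

for _ in range(2):                          # two bootstrap rounds
    X = np.array([emb[w] for w in seed])
    y = np.array([seed[w] for w in seed])
    model = Ridge().fit(X, y)
    preds = {w: model.predict(emb[w][None, :])[0] for w in unlabeled}
    # Crude confidence proxy: keep the prediction closest to the seed mean.
    best = min(preds, key=lambda w: abs(preds[w] - y.mean()))
    seed[best] = preds[best]
    unlabeled.remove(best)
print(seed)
```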
Meeting of the Association for Computational Linguistics | 2015
Gustavo Paetzold; Lucia Specia
Lexical Simplification consists of replacing complex words in a text with simpler alternatives. We introduce LEXenstein, the first open-source framework for Lexical Simplification. It covers all major stages of the process and allows for easy benchmarking of various approaches. We test the tool’s performance on different datasets and report comparisons against state-of-the-art approaches. The results show that combining the novel Substitution Selection and Substitution Ranking approaches introduced in LEXenstein is the most effective approach to Lexical Simplification.
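The combination the abstract refers to can be sketched generically: Substitution Selection filters candidates that fit the context, and Substitution Ranking orders the survivors by simplicity. The stub below is not LEXenstein's actual API; the candidate list and frequency table are toy values, with frequency standing in for simplicity.

```python
# Generic Substitution Selection + Substitution Ranking sketch.
freq = {"execute": 40.0, "do": 900.0, "enact": 5.0}  # toy frequency table

def select(candidates, context):
    """Keep candidates that fit the context (stub: keep known words)."""
    return [c for c in candidates if c in freq]

def rank(candidates):
    """Order candidates by simplicity, proxied here by corpus frequency."""
    return sorted(candidates, key=lambda c: freq[c], reverse=True)

cands = ["execute", "do", "enact", "effectuate"]
print(rank(select(cands, "they will ____ the plan")))  # ['do', 'execute', 'enact']
```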
Text, Speech and Dialogue | 2017
Leandro Borges dos Santos; Magali Sanches Duran; Nathan Siegle Hartmann; Arnaldo Candido; Gustavo Paetzold; Sandra Maria Aluísio
Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective and must be gathered through costly, time-consuming surveys. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries, and word embedding models. The correlations obtained for the inferred properties are close to those reported in related work. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability and subjective frequency.
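The "light" feature set named in the abstract can be assembled in a few lines: word length, a log-scaled frequency, and an embedding, concatenated as regressor input. The words, counts, and stand-in embeddings below are purely illustrative.

```python
# Plausible assembly of the light feature set (illustrative values only).
import math
import numpy as np

freq = {"casa": 2100.0, "paralelepípedo": 3.0}   # hypothetical BP counts
emb = {w: np.full(4, float(i)) for i, w in enumerate(freq)}  # stand-in vectors

def light_features(word):
    """Word length, log frequency, and embedding, concatenated."""
    return np.concatenate([[len(word), math.log(freq[word] + 1)], emb[word]])

print(light_features("casa"))
```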
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016
Gustavo Paetzold; Lucia Specia
We introduce SimpleNets: a resource-light solution to the sentence-level Quality Estimation task of WMT16 that combines Recurrent Neural Networks, word embedding models, and the principle of compositionality. The SimpleNets systems explore the idea that the quality of a translation can be derived from the quality of its n-grams. This approach has been successfully employed in Text Simplification quality assessment in the past. Our experiments show that, surprisingly, our models can learn more about a translation’s quality by focusing on the original sentence, rather than on the translation itself.
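The compositional principle can be shown with a stub: score each n-gram of a sentence, then derive sentence quality from the n-gram scores. The scorer below is a placeholder; the paper uses recurrent networks over embeddings, not this rule.

```python
# Sketch of n-gram-compositional quality estimation (scorer is a stub).
def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_quality(ng):
    """Placeholder scorer; SimpleNets uses an RNN here instead."""
    return 0.0 if "mistranslation" in ng else 1.0

def sentence_quality(tokens, n=3):
    scores = [ngram_quality(ng) for ng in ngrams(tokens, n)]
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_quality("the quick brown fox jumps".split()))  # 1.0
```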
Journal of Artificial Intelligence Research | 2017
Gustavo Paetzold; Lucia Specia
Lexical Simplification is the process of replacing complex words in a given sentence with simpler alternatives of equivalent meaning. This task has wide applicability both as an assistive technology for readers with cognitive impairments or disabilities, such as Dyslexia and Aphasia, and as a pre-processing tool for other Natural Language Processing tasks, such as machine translation and summarisation. The problem is commonly framed as a pipeline of four steps: the identification of complex words, the generation of substitution candidates, the selection of those candidates that fit the context, and the ranking of the selected substitutes according to their simplicity. In this survey we review the literature for each step in this typical Lexical Simplification pipeline and provide a benchmarking of existing approaches for these steps on publicly available datasets. We also provide pointers for datasets and resources available for the task.
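The four-step pipeline reads naturally as four functions passing their output forward. The rules below (a length cut-off for complexity, a toy thesaurus, length-based ranking) are stand-ins that show the data flow, not any surveyed system.

```python
# The canonical Lexical Simplification pipeline as four stub stages.
def identify_complex(sentence):
    return [w for w in sentence.split() if len(w) > 8]       # toy CWI rule

def generate_candidates(word):
    toy_thesaurus = {"perplexing": ["confusing", "puzzling", "abstruse"]}
    return toy_thesaurus.get(word, [])

def select_candidates(word, candidates, sentence):
    return candidates                                         # stub selection

def rank_candidates(candidates):
    return sorted(candidates, key=len)                        # shorter = simpler

sentence = "a perplexing result"
for w in identify_complex(sentence):
    cands = select_candidates(w, generate_candidates(w), sentence)
    print(w, "->", rank_candidates(cands)[0] if cands else w)
```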
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers | 2016
Daniel Beck; Andreas Vlachos; Gustavo Paetzold; Lucia Specia
We describe the University of Sheffield’s submission to the word-level Quality Estimation shared task. Our system is based on imitation learning, an approach to structured prediction which relies on a classifier trained on data generated appropriately to ameliorate error propagation. Compared to other structured prediction approaches such as conditional random fields, it allows the use of arbitrary information from previous tag predictions and the use of non-decomposable loss functions over the structure. We explore these two aspects in our submission while using the baseline features provided by the shared task organisers. Our system outperformed the conditional random field baseline while using the same feature set.
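The error propagation the abstract mentions is easy to demonstrate: a tagger that consumes the previous tag sees gold tags at training time but its own (possibly wrong) predictions at decoding time. The sketch below sets up exactly that mismatch, which imitation learning counters by training on the learner's own rollouts; the data is toy and this is not Sheffield's actual system.

```python
# Sketch: previous-tag features and the train/decode mismatch that
# imitation learning is designed to ameliorate. Toy data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-word features and gold OK(1)/BAD(0) tags.
X = np.array([[0.9], [0.2], [0.8], [0.1]])
gold = np.array([1, 0, 1, 0])

# Training: the previous-tag feature comes from the gold sequence.
prev_gold = np.concatenate([[1], gold[:-1]])    # assume sentence-initial OK
clf = LogisticRegression().fit(np.column_stack([X, prev_gold]), gold)

# Decoding: the model must feed back its *own* previous prediction,
# so one early mistake can propagate down the sequence.
tags, last = [], 1
for x in X:
    last = int(clf.predict([[x[0], last]])[0])
    tags.append(last)
print(tags)
```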