Magali Sanches Duran
University of São Paulo
Publication
Featured research published by Magali Sanches Duran.
Conference of the European Chapter of the Association for Computational Linguistics | 2014
Magali Sanches Duran; Lucas Avanço; Sandra Maria Aluísio; Thiago Alexandre Salgueiro Pardo; Maria das Graças Volpe Nunes
This paper describes the analysis of different kinds of noise in a corpus of product reviews in Brazilian Portuguese. Case folding, punctuation, spelling and the use of internet slang are the major kinds of noise we face. After noting the effect of this noise on the POS tagging task, we propose some procedures to minimize it.
International Conference on Computational Linguistics | 2014
Carolina Scarton; Lin Sun; Karin Kipper-Schuler; Magali Sanches Duran; Martha Palmer; Anna Korhonen
Levin-style classes, which capture the shared syntax and semantics of verbs, have proven useful for many Natural Language Processing (NLP) tasks and applications. However, lexical resources which provide information about such classes are only available for a handful of the world's languages. Because manual development of such resources is extremely time-consuming and cannot reliably capture domain variation in classification, methods for automatic induction of verb classes from texts have gained popularity. However, to date such methods have been applied to English and a handful of other, mainly resource-rich languages. In this paper, we apply the methods to Brazilian Portuguese, a language for which no VerbNet or automatic class induction work exists yet. Since Levin-style classification is said to have a strong cross-linguistic component, we use unsupervised clustering techniques similar to those developed for English without language-specific feature engineering. This yields interesting results which line up well with those obtained for other languages, demonstrating the cross-linguistic nature of this type of classification. However, we also discover and discuss issues which require specific consideration when aiming to optimise the performance of verb clustering for Brazilian Portuguese and other less-resourced languages.
Processing of the Portuguese Language | 2014
Carolina Scarton; Magali Sanches Duran; Sandra Maria Aluísio
In this paper, we present a new language-independent method to build VerbNet-based lexical resources. As a proof of concept, we show the use of this method to build a VerbNet-style lexicon for Brazilian Portuguese. The resulting resource was built semi-automatically by using existing lexical resources for English and Portuguese and knowledge extracted from corpora. The results reached an F-measure of around 60% when compared with a gold standard for Brazilian Portuguese, which is also described in this paper. The method proposed here also outperformed a state-of-the-art machine learning method (verb clustering) by around 20% in F-measure.
Text, Speech and Dialogue | 2017
Leandro Borges dos Santos; Magali Sanches Duran; Nathan Siegle Hartmann; Arnaldo Candido; Gustavo Paetzold; Sandra Maria Aluísio
Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective and must be gathered through costly and time-consuming surveys. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries and word embedding models. The correlations between the properties inferred are close to those obtained in related work. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability and subjective frequency.
Proceedings of the Workshop on Noisy User-generated Text | 2015
Magali Sanches Duran; Maria das Graças Volpe Nunes; Lucas Avanço
User-generated content (UGC) represents an important source of information for governments, companies, political candidates and consumers. However, most Natural Language Processing tools and techniques are developed from and for texts in standard language, and UGC is a type of text especially full of creativity and idiosyncrasies, which represent noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It encompasses a tokenizer, a sentence segmentation tool, a phonetic-based speller and some lexicons, which originated from a deep analysis of a corpus of product reviews in Brazilian Portuguese. The normalizer was evaluated on two different data sets and performed between 31% and 89% of the appropriate corrections, depending on the type of text noise. The use of UGCNormal was also validated in a POS tagging task, where accuracy improved from 91.35% to 93.15%, and in an opinion classification task, where the average of the F1-score measures (F1-score positive and F1-score negative) improved from 0.736 to 0.758.
Processing of the Portuguese Language | 2016
Nathan Siegle Hartmann; Magali Sanches Duran; Sandra Maria Aluísio
Semantic Role Labeling (SRL) is a Natural Language Processing task that enables the detection of events described in sentences and the participants of these events. For Brazilian Portuguese (BP), two recently concluded studies perform SRL on journalistic texts. [1] obtained F1-measure scores of 79.6, using the PropBank.Br corpus, which has manually revised syntactic trees; [8], without using a treebank for training, obtained F1-measure scores of 68.0 for the same corpus. However, the use of manually revised syntactic trees for this task does not represent a real application scenario. The goal of this paper is to evaluate the performance of SRL on revised and non-revised syntactic trees using a larger and balanced corpus of BP journalistic texts. First, we show that [1]’s system also performs better than [8]’s system on the larger corpus. Second, the SRL system trained on non-revised syntactic trees performs better over non-revised trees than a system trained on gold-standard data.
Processing of the Portuguese Language | 2016
Gustavo Augusto de Mendonça Almeida; Lucas Avanço; Magali Sanches Duran; Erick Rocha Fonseca; Maria das Graças Volpe Nunes; Sandra Maria Aluísio
Recently, spell checking (or spelling correction) has regained attention due to the need to normalize user-generated content (UGC) on the web. UGC presents new challenges to spellers, as its register is much more informal and contains much more variability than traditional spelling correction systems can handle. This paper proposes two new approaches to deal with spelling correction of UGC in Brazilian Portuguese (BP), both of which take into account phonetic errors. The first approach is based on three phonetic modules running in a pipeline. The second one is based on machine learning, with soft decision making, and considers context-sensitive misspellings. We compared our methods with others on a human-annotated UGC corpus of product reviews. The machine learning approach surpassed all other methods, with a 78.0% correction rate, a very low false positive rate (0.7%) and a false negative rate of 21.9%.
Joint Conference on Lexical and Computational Semantics | 2015
Magali Sanches Duran; Sandra Maria Aluísio
This paper reports an approach to automatically generate a lexical resource to support incremental semantic role labeling annotation in Portuguese. The data come from the corpus Propbank-Br (Propbank of Brazilian Portuguese) and from the lexical resource of the English Propbank, as both share the same structure. In order to enable the strategy, we added extra annotation to Propbank-Br. This approach follows from a previous decision to invert the process of implementing a Propbank project, by first annotating a core corpus and only then generating a lexical resource to enable further annotation tasks. The reasoning behind this inversion is to explore the task empirically before distributing the annotation task and to provide simultaneously: 1) a first training corpus for SRL in Brazilian Portuguese and 2) annotated examples to compose a lexical resource to support SRL. The main contribution of this paper is to point out to what extent linguistic effort may be reduced, thereby speeding up the construction of a lexical resource to support SRL for less-resourced languages. The corpus Propbank-Br, with the extra annotation described herein, is publicly available.
Language Resources and Evaluation | 2012
Magali Sanches Duran; Sandra Maria Aluísio
Meeting of the Association for Computational Linguistics | 2011
Magali Sanches Duran; Carlos Ramisch; Sandra Maria Aluísio; Aline Villavicencio