Erick Rocha Fonseca
University of São Paulo
Publication
Featured research published by Erick Rocha Fonseca.
Journal of the Brazilian Computer Society | 2015
Erick Rocha Fonseca; João Luís Garcia Rosa; Sandra Maria Aluísio
Background: Part-of-speech tagging is an important preprocessing step in many natural language processing applications. Despite much work already carried out in this field, there is still room for improvement, especially in Portuguese. We experiment here with an architecture based on neural networks and word embeddings that has achieved promising results in English. Methods: We tested our classifier on different corpora: a new revision of the Mac-Morpho corpus, in which we merged some tags and performed corrections, and two previous versions of it. We evaluate the impact of using different types of word embeddings and explicit features as input. Results: We compare our tagger's performance with other systems and achieve state-of-the-art results on the new corpus. We show how different methods for generating word embeddings and additional features differ in accuracy. Conclusions: The work reported here contributes a new revision of the Mac-Morpho corpus and a new state-of-the-art tagger available for use out of the box.
International Symposium on Neural Networks | 2013
Erick Rocha Fonseca; João Luís Garcia Rosa
Semantic role labeling (SRL) is a well-known task in Natural Language Processing, consisting of identifying and labeling verbal arguments. It has been widely studied in English, but scarcely explored in other languages. In this paper, we employ a two-step convolutional neural architecture to label semantic arguments in Brazilian Portuguese texts, avoiding the use of external NLP tools. We achieve an F1 score of 62.2, which, although considerably lower than the state of the art for English, seems promising considering the available resources. Also, dividing the process into two easier subtasks makes it more feasible to further improve performance through semi-supervised learning. Our system is available online and ready to be used out of the box to label new texts.
North American Chapter of the Association for Computational Linguistics | 2015
Erick Rocha Fonseca; Sandra Maria Aluísio
Graph-based dependency parsing algorithms commonly employ features up to third order in an attempt to capture richer syntactic relations. However, each level and each feature combination must be defined manually. In addition, input features are usually represented as huge, sparse binary vectors, offering limited generalization. In this work, we present a deep architecture for dependency parsing based on a convolutional neural network. It can examine the whole sentence structure before scoring each head/modifier candidate pair, and uses dense embeddings as input. Our model is still a work in progress and currently achieves a 91.6% unlabeled attachment score on the Penn Treebank.
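For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of how an unlabeled attachment score is typically computed: the fraction of tokens whose predicted head index matches the gold head. The function name, the list-of-lists input layout, and the toy data are assumptions made for illustration; they are not taken from the paper, and details such as punctuation handling vary between evaluations.

```python
from typing import Sequence


def unlabeled_attachment_score(
    gold_heads: Sequence[Sequence[int]],
    pred_heads: Sequence[Sequence[int]],
) -> float:
    """Return the fraction of tokens whose predicted head equals the gold head.

    Each inner sequence holds one head index per token of a sentence
    (0 is assumed to denote the artificial root). Hypothetical layout.
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted corpora differ in sentence count")

    total = 0
    correct = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        total += len(gold)
        # Dependency labels are ignored: only the head index is compared.
        correct += sum(1 for g, p in zip(gold, pred) if g == p)

    if total == 0:
        raise ValueError("cannot score an empty corpus")
    return correct / total


# Toy example (made-up data): 5 tokens in total, 4 heads predicted correctly.
gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
uas = unlabeled_attachment_score(gold, pred)
```

On the toy data, four of the five heads agree, so the score is 0.8; a reported 91.6% corresponds to the same ratio computed over every scored token of the test set.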
Processing of the Portuguese Language | 2012
Erick Rocha Fonseca; João Luís Garcia Rosa
We present an adaptation of the architecture of the SENNA system, which performs various NLP tasks, to Portuguese, taking into account the richly inflected morphology of the language. We propose to separate words into lemmas and their inflectional attributes. We point out the major problems that could arise from this approach as well as their potential solutions. This architecture can greatly benefit from the use of unlabeled data, which is especially valuable given the small amount of labeled resources available for Portuguese.
Processing of the Portuguese Language | 2016
Erick Rocha Fonseca; Sandra Maria Aluísio
Brazilian Portuguese (BP) and European Portuguese (EP) have specific NLP resources and tools for many tasks. It is generally agreed that applying them to the variant other than their intended one results in a performance drop; however, very little research has measured it. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, training corpora and test corpora, and found that (i) BP is easier than EP, (ii) word embeddings help increase tagger performance significantly, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus with both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to any NLP task, this is an important first step for such evaluations.
Processing of the Portuguese Language | 2018
Erick Rocha Fonseca; Sandra Maria Aluísio
Natural Language Inference (NLI) is the task of detecting relations such as entailment, contradiction and paraphrase in pairs of sentences. With the recent release of the ASSIN corpus, NLI in Portuguese is now getting more attention. However, published results on ASSIN have neither explored syntactic structure nor combined word embedding metrics with other types of features. In this work, we sought to fill this gap, proposing a new model for NLI that achieves an F\(_1\) score of 0.72 on ASSIN, setting a new state of the art. Our feature analysis shows that word embeddings and syntactic knowledge are both important to achieve such results.
Processing of the Portuguese Language | 2016
Gustavo Augusto de Mendonça Almeida; Lucas Avanço; Magali Sanches Duran; Erick Rocha Fonseca; Maria das Graças Volpe Nunes; Sandra Maria Aluísio
Recently, spell checking (or spelling correction) has regained attention due to the need to normalize user-generated content (UGC) on the web. UGC presents new challenges to spellers, as its register is much more informal and contains much more variability than traditional spelling correction systems can handle. This paper proposes two new approaches to spelling correction of UGC in Brazilian Portuguese (BP), both of which take phonetic errors into account. The first approach is based on three phonetic modules running in a pipeline. The second is based on machine learning, with soft decision making, and considers context-sensitive misspellings. We compared our methods with others on a human-annotated UGC corpus of product reviews. The machine learning approach surpassed all other methods, with a 78.0% correction rate, a very low false positive rate (0.7%) and a false negative rate of 21.9%.
STIL | 2013
Erick Rocha Fonseca; João Luís Garcia Rosa
arXiv: Computation and Language | 2017
Nathan Siegle Hartmann; Erick Rocha Fonseca; Christopher Dane Shulby; Marcos Vinícius Treviso; Jessica Rodrigues; Sandra Maria Aluísio
Linguamática | 2016
Erick Rocha Fonseca; Leandro Borges dos Santos; Marcelo Criscuolo; Sandra Maria Aluísio