Gabriel Pereira Lopes
Universidade Nova de Lisboa
Publication
Featured research published by Gabriel Pereira Lopes.
Computational Linguistics | 2005
Pablo Gamallo; Alexandre Agustini; Gabriel Pereira Lopes
This article describes an unsupervised strategy to acquire syntactico-semantic requirements of nouns, verbs, and adjectives from partially parsed text corpora. The linguistic notion of requirement underlying this strategy is based on two specific assumptions. First, it is assumed that two words in a dependency are mutually required. This phenomenon is called here corequirement. Second, it is also claimed that the set of words occurring in similar positions defines extensionally the requirements associated with these positions. The main aim of the learning strategy presented in this article is to identify clusters of similar positions by identifying the words that define their requirements extensionally. This strategy allows us to learn the syntactic and semantic requirements of words in different positions. This information is used to solve attachment ambiguities. Results of this particular task are evaluated at the end of the article. Extensive experimentation was performed on Portuguese text corpora.
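The abstract's second assumption — that the words occurring in similar positions define those positions' requirements extensionally — can be illustrated with a small sketch. This is not the paper's algorithm, only a toy illustration: the position names, filler sets, and the greedy Jaccard grouping are all assumptions made up for the example.

```python
# Hypothetical filler sets observed for syntactic positions in parsed text;
# positions with similar fillers are taken to share selectional requirements.
positions = {
    "obj_of:read":    {"book", "article", "paper"},
    "obj_of:publish": {"book", "article", "paper", "corpus"},
    "obj_of:eat":     {"bread", "soup", "apple"},
    "obj_of:cook":    {"bread", "soup"},
}

def jaccard(a, b):
    """Overlap of two filler sets."""
    return len(a & b) / len(a | b)

def cluster_positions(threshold=0.5):
    """Greedy single-link grouping of positions whose fillers overlap."""
    clusters = []
    for pos, words in positions.items():
        for c in clusters:
            if any(jaccard(words, positions[q]) >= threshold for q in c):
                c.append(pos)
                break
        else:
            clusters.append([pos])
    return clusters
```

Here `obj_of:read` and `obj_of:publish` end up in one cluster and the cooking-related positions in another, mirroring how shared fillers group positions with similar requirements.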
computational intelligence for modelling, control and automation | 2006
Joaquim Ferreira da Silva; Gabriel Pereira Lopes
Existing Language Identification (LID) approaches reach 100% precision in most common situations, when dealing with documents written in a single language and when those documents are large enough. However, LID approaches do not provide a reliable solution in some situations: when there is a need to discriminate the correct variant of the language used in a text, for example, the European or Brazilian variants of Portuguese, the UK or USA variants of English, or any other language variants. Another hard context occurs with small touristic advertisements on the web, addressing foreigners but using the local language to name most local entities. In this paper, we present a fully statistics-based LID approach which learns the most discriminant information according to each context and identifies the correct language or language variant a text is written in. This methodology is shown to be correct for normal texts and maintains its robustness in hard LID contexts.
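A statistics-based LID system of the general kind described above can be sketched with character n-gram frequency profiles. This is a generic baseline, not the paper's method — the training sentences and the simple sum-of-frequencies score are assumptions for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, with padding spaces."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profile(samples, n=3):
    """Build a relative-frequency profile of character n-grams."""
    counts = Counter()
    for s in samples:
        counts.update(char_ngrams(s, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def identify(text, profiles, n=3):
    """Score a text against each language profile; return the best match."""
    grams = char_ngrams(text, n)
    def score(lang):
        return sum(profiles[lang].get(g, 0.0) for g in grams)
    return max(profiles, key=score)

# Tiny illustrative training data (real systems use much larger corpora)
profiles = {
    "en": train_profile(["the quick brown fox jumps over the lazy dog",
                         "this is a short english sentence"]),
    "pt": train_profile(["a rápida raposa castanha salta sobre o cão",
                         "esta é uma pequena frase em português"]),
}
```

Discriminating close language variants would follow the same scheme, with the profiles trained to retain only the n-grams that differ between variants.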
meeting of the association for computational linguistics | 2000
António Ribeiro; Gabriel Pereira Lopes; João T. Mexia
This paper describes a language-independent method for the alignment of parallel texts that makes use of homograph tokens for each pair of languages. In order to filter out tokens that may cause misalignment, we use confidence bands of linear regression lines instead of heuristics, which are not theoretically supported. This method was originally inspired by work done by Pascale Fung and Kathleen McKeown, and by Melamed, providing the statistical support those authors could not claim.
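The core filtering idea — fit a regression line through candidate alignment points (token offsets in the two texts) and discard points that fall outside a band around it — can be sketched as follows. Note the simplification: a fixed band of `k` residual standard deviations stands in for the proper confidence bands the paper uses, and the offset values are invented for the example.

```python
import math

def fit_line(points):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

def filter_candidates(points, k=1.5):
    """Keep candidate alignment points lying within k residual standard
    deviations of the regression line (simplified fixed-width band)."""
    a, b = fit_line(points)
    residuals = [y - (a + b * x) for x, y in points]
    s = math.sqrt(sum(r * r for r in residuals) / (len(points) - 2))
    return [p for p, r in zip(points, residuals) if abs(r) <= k * s]

# Token offsets of homograph candidates in two parallel texts;
# (50, 600) is a spurious match that would cause misalignment.
pts = [(10, 12), (20, 23), (30, 31), (40, 43), (50, 600), (60, 61), (70, 72)]
```

In practice the fit-and-filter step would be iterated, since an extreme outlier inflates both the slope and the residual spread on the first pass.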
portuguese conference on artificial intelligence | 2009
José Aires; Gabriel Pereira Lopes; Luís Gomes
In this paper, we address term translation extraction from indexed, aligned parallel corpora, using a couple of association measures combined by a voting scheme to scale down translation pairs according to their degree of internal cohesiveness, and we evaluate the results obtained. The precision obtained is clearly much better than the results reported in related work for the very low range of occurrences we have dealt with, and compares with the best results obtained in word translation.
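Combining association measures by voting can be sketched as below. The specific measures (Dice and pointwise mutual information), the Borda-style point scheme, and all the counts are assumptions for illustration, not the paper's actual configuration.

```python
import math

# Hypothetical counts for candidate translation pairs:
# pair -> (cooccurrence count, source-term frequency, target-term frequency)
counts = {
    ("cão", "dog"):  (40, 50, 45),
    ("cão", "cat"):  (5, 50, 60),
    ("gato", "cat"): (50, 55, 60),
}
TOTAL = 1000  # assumed corpus size

def dice(c, fa, fb):
    """Dice coefficient: one view of a pair's internal cohesiveness."""
    return 2.0 * c / (fa + fb)

def pmi(c, fa, fb):
    """Pointwise mutual information: another cohesiveness measure."""
    return math.log(c * TOTAL / (fa * fb))

def vote(pairs, measures):
    """Borda-style voting: each measure ranks all pairs; points are summed,
    so pairs ranked high by several measures float to the top."""
    scores = {p: 0 for p in pairs}
    for m in measures:
        ranked = sorted(pairs, key=lambda p: m(*counts[p]), reverse=True)
        for rank, p in enumerate(ranked):
            scores[p] += len(pairs) - rank
    return sorted(scores, key=scores.get, reverse=True)

ranking = vote(list(counts), [dice, pmi])
```

The weakly cohesive pair ("cão", "cat") ends up last under both measures and therefore in the combined ranking, which is the effect the voting scheme is after.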
Grammars | 2001
Vitor Rocio; Gabriel Pereira Lopes; Éric Villemonte de la Clergerie
Efficient partial parsing systems (chunkers) are urgently required by various natural language application areas because these parsers always produce partially parsed text, even when the text does not fully fit existing lexica and grammars. The availability of partially parsed corpora is absolutely necessary for extracting various kinds of information that may then be fed into those systems, thereby increasing their processing power. In this paper, we propose an efficient partial parsing scheme, based on chart parsing, that is flexible enough to support both normal parsing tasks and the diagnosis, in previously obtained partial parses, of possible causes (kinds of faults) that led to those partial, instead of complete, parses. Through the use of the built-in tabulation capabilities of the DyALog system, we implemented a partial parser that runs as fast as the best non-deterministic parsers. In this paper we elaborate on the implementation of two different grammar formalisms: Definite Clause Grammars (DCG) extended with head declarations and Bound Movement Grammars (BMG).
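The defining property of partial parsing — material that matches no rule is passed through instead of causing failure — can be shown with a much-simplified cascade chunker. This is not the chart-based, tabulated system the paper implements; the toy grammar and the fixpoint rewriting loop are assumptions for the example.

```python
# Toy rewrite grammar over POS tags; spans that match no rule are kept
# as-is, so the output is always a (possibly partial) parse.
grammar = {
    ("DET", "NOUN"): "NP",
    ("ADJ", "NOUN"): "NP",
    ("DET", "ADJ", "NOUN"): "NP",
    ("PREP", "NP"): "PP",
}

def chunk(seq):
    """Repeatedly reduce matching spans (longest match first) until no
    rule applies; unmatched symbols survive unchanged."""
    changed = True
    while changed:
        changed = False
        out, i = [], 0
        while i < len(seq):
            for width in (3, 2):
                label = grammar.get(tuple(seq[i:i + width]))
                if label:
                    out.append(label)
                    i += width
                    changed = True
                    break
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq
```

For instance, `chunk(["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"])` reduces to `["NP", "VERB", "PP"]`, while a sequence matching no rule comes back untouched — a degenerate "partial parse" rather than a failure.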
international conference on data mining | 2001
Joaquim Ferreira da Silva; João T. Mexia; A. Coelho; Gabriel Pereira Lopes
A statistics-based approach for clustering documents and for extracting cluster topics is described. Relevant (meaningful) expressions (REs), automatically extracted from corpora, are used as base clustering features. These features are transformed and their number is strongly reduced in order to obtain a small set of document classification features. This is achieved on the basis of principal components analysis. Model-based clustering analysis finds the best number of clusters. Then, the most important REs are extracted from each cluster and taken as document cluster topics.
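The feature-reduction step — projecting a document-by-RE frequency matrix onto a few principal components before clustering — can be sketched with an SVD-based PCA. The matrix is synthetic (two topic groups using disjoint RE vocabularies), and the model-based clustering stage of the paper is omitted.

```python
import numpy as np

# Hypothetical document-by-RE frequency matrix: 6 documents, 8 relevant
# expressions; the two topic groups use disjoint halves of the vocabulary.
rng = np.random.default_rng(0)
X = np.zeros((6, 8))
X[:3, :4] = rng.poisson(5, size=(3, 4))  # docs 0-2: topic A expressions
X[3:, 4:] = rng.poisson(5, size=(3, 4))  # docs 3-5: topic B expressions

# PCA via SVD of the centred matrix: keep the top-2 principal components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T  # reduced 6x2 feature matrix, fed to clustering
```

With this block structure the first principal component separates the two topic groups, so a clustering algorithm run on `Z` sees two well-separated clouds instead of an 8-dimensional sparse space.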
doctoral conference on computing, electrical and industrial systems | 2011
Luís F. S. Teixeira; Gabriel Pereira Lopes; Rita A. Ribeiro
A keyword or topic for a document is a word or multi-word (a sequence of 2 or more words) that summarizes in itself part of that document's content. In this paper we compare several statistics-based, language-independent methodologies for automatically extracting keywords. We rank words, multi-words, and word prefixes (with a fixed length of 5 characters) using several similarity measures (some widely known and some newly coined) and evaluate the results obtained as well as the agreement between evaluators. Portuguese, English and Czech were the languages experimented with.
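Ranking both words and multi-words as keyword candidates can be sketched with one of the widely known measures, Tf-Idf. The measure choice and the three toy documents are assumptions for the example; the paper compares several measures, including newly coined ones.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def candidates(tokens):
    """Single words plus 2-word sequences (multi-words)."""
    return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]

def rank_keywords(docs, target):
    """Rank the target document's candidates by a Tf-Idf score."""
    doc_cands = [set(candidates(tokenize(d))) for d in docs]
    tf = Counter(candidates(tokenize(target)))
    n = len(docs)
    def idf(t):
        df = sum(t in c for c in doc_cands)
        return math.log((n + 1) / (df + 1))
    return sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)

docs = [
    "partial parsing of portuguese text corpora",
    "language identification for text documents",
    "keyword extraction ranks words and multi words by tf idf scores "
    "keyword extraction is language independent",
]
top = rank_keywords(docs, docs[2])
```

Candidates that are frequent in the target document but absent from the others — including the multi-word "keyword extraction" — rise to the top of the ranking.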
european conference on machine learning | 1998
Nuno Miguel Marques; Gabriel Pereira Lopes; Carlos A. Coelho
In this paper we show how loglinear models can be used to cluster verbs based on their subcategorization preferences. We describe how the information about the phrases or clauses a verb goes with can be computationally learned from an automatically tagged corpus of 9,333,555 words. We use loglinear modeling to describe the relation between the acquired counts for the part-of-speech tags co-occurring with the verbs in predetermined positions. Based on these results, an unsupervised clustering algorithm is proposed.
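The underlying intuition — verbs with similar subcategorization preferences have similar distributions of tags in the positions following them — can be sketched with a distributional distance over tag counts. This is not loglinear modeling; the counts and the symmetrised KL-style distance are assumptions for the illustration.

```python
import math

# Hypothetical counts of constituent types observed right after each verb
counts = {
    "say":   {"that-clause": 80, "np": 15, "pp": 5},
    "claim": {"that-clause": 70, "np": 20, "pp": 10},
    "put":   {"that-clause": 2,  "np": 60, "pp": 38},
}

def dist(c):
    """Normalise raw counts to a probability distribution."""
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

def divergence(p, q):
    """Symmetrised KL-style distance between two tag distributions."""
    keys = set(p) | set(q)
    eps = 1e-9  # smoothing for unseen tags
    def kl(a, b):
        return sum(a.get(k, eps) * math.log(a.get(k, eps) / b.get(k, eps))
                   for k in keys)
    return kl(p, q) + kl(q, p)

d_say_claim = divergence(dist(counts["say"]), dist(counts["claim"]))
d_say_put = divergence(dist(counts["say"]), dist(counts["put"]))
```

The clause-taking verbs "say" and "claim" come out far closer to each other than either is to "put", which is the kind of signal an unsupervised clustering over verbs would exploit.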
portuguese conference on artificial intelligence | 2009
Joaquim Ferreira da Silva; Gabriel Pereira Lopes
People are often asked to associate keywords with documents so that applications can access the summarized core content of those documents. This fact was the main motivation for working on an approach that may contribute to moving from this manual procedure to an automatic one. Since Relevant Expressions (REs), or multi-word term expressions, can be automatically extracted using the LocalMaxs algorithm, the most relevant ones can be used to describe the core content of each document. In this paper we present a language-independent approach for the automatic generation of document descriptors. Results are shown for three different European languages, and comparisons are made concerning different metrics for selecting the most informative REs of each document.
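A much-simplified sketch of a LocalMaxs-style criterion follows: an n-gram is kept when its "glue" is a local maximum relative to the (n-1)-grams it contains and the (n+1)-grams containing it. The SCP-like glue below, the toy text, and the absence of any frequency cutoff are simplifications; consult the LocalMaxs literature for the exact definitions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def scp(gram, freq):
    """SCP-style glue: squared n-gram frequency over the average product
    of the frequencies of its binary splits (simplified)."""
    splits = [(gram[:i], gram[i:]) for i in range(1, len(gram))]
    denom = sum(freq[l] * freq[r] for l, r in splits) / len(splits)
    return freq[gram] ** 2 / denom

def local_maxs(tokens, max_n=3):
    """Keep n-grams whose glue is a local maximum with respect to the
    (n-1)-grams they contain and the (n+1)-grams containing them."""
    freq = Counter()
    for n in range(1, max_n + 2):
        freq.update(ngrams(tokens, n))
    glue = {g: scp(g, freq) for g in freq if len(g) > 1}
    relevant = []
    for g, gl in glue.items():
        if len(g) > max_n:
            continue  # (max_n+1)-grams serve only as upward context
        subs = [s for s in (g[:-1], g[1:]) if s in glue]
        supers = [s for s in glue
                  if len(s) == len(g) + 1 and (s[:-1] == g or s[1:] == g)]
        if all(gl >= glue[s] for s in subs) and all(gl > glue[s] for s in supers):
            relevant.append(g)
    return relevant

text = ("the european union announced new rules today and "
        "the european union will enforce the new rules").split()
mwes = local_maxs(text)
```

On this tiny text the repeated cohesive pairs ("european union", "new rules") are kept, while the longer "the european union" loses to its tighter sub-pair; a real run would also apply a minimum-frequency cutoff to suppress one-off pairs.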
text speech and dialogue | 2007
Pablo Gamallo; Gabriel Pereira Lopes; Alexandre Agustini
This paper describes a clustering method for organizing a list of terms into semantic classes. The experiments were made using a POS-annotated corpus, the ACL Anthology, which consists of technical articles in the field of Computational Linguistics. The method, mainly based on some assumptions of Formal Concept Analysis, consists of building bi-dimensional clusters of both terms and their lexico-syntactic contexts. Each generated cluster is defined as a semantic class, with a set of terms describing the extension of the class and a set of contexts perceived as the intensional attributes (or properties) valid for all the terms in the extension. The clustering process relies on two restrictive operations: abstraction and specification. The result is a concept lattice that describes a domain-specific ontology of terms.
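The bi-dimensional clusters of Formal Concept Analysis — pairs of a term set (extent) and the contexts shared by all those terms (intent) — can be enumerated by brute force on a toy incidence relation. The terms, contexts, and exhaustive enumeration are assumptions for the example; real FCA implementations use far more efficient lattice-construction algorithms.

```python
from itertools import combinations

# Hypothetical incidence relation: term -> lexico-syntactic contexts
incidence = {
    "parser":   {"obj_of:implement", "subj_of:fail"},
    "tagger":   {"obj_of:implement", "subj_of:fail"},
    "corpus":   {"obj_of:annotate", "obj_of:parse"},
    "treebank": {"obj_of:annotate", "obj_of:parse"},
}

def concepts():
    """Enumerate formal concepts (extent, intent) by brute force: for each
    term set, intersect its contexts (the intent), then close back to the
    extent: every term carrying all those contexts."""
    terms = sorted(incidence)
    found = set()
    for r in range(1, len(terms) + 1):
        for combo in combinations(terms, r):
            intent = frozenset.intersection(
                *(frozenset(incidence[t]) for t in combo))
            if not intent:
                continue
            extent = frozenset(t for t in terms if intent <= incidence[t])
            found.add((extent, intent))
    return found
```

On this data exactly two concepts emerge — {parser, tagger} with the implement/fail contexts and {corpus, treebank} with the annotate/parse contexts — each a semantic class whose intent holds for every term in its extent.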