Rafael Geraldeli Rossi
University of São Paulo
Publication
Featured research published by Rafael Geraldeli Rossi.
Journal of Computer Science and Technology | 2014
Rafael Geraldeli Rossi; Alneu de Andrade Lopes; Thiago de Paulo Faleiros; Solange Oliveira Rezende
Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and allowing relationships among the different objects that compose a text collection to be modeled. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent textual collections by a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Heterogeneous bipartite networks do not require computation of similarities or relations among the objects and can be used to model any type of text collection. Due to the advantages of representing text collections through bipartite heterogeneous networks, in this article we present a text classifier which builds a classification model using the structure of a bipartite heterogeneous network. This algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
Information Processing and Management | 2016
Rafael Geraldeli Rossi; Alneu de Andrade Lopes; Solange Oliveira Rezende
Scalable algorithm based on bipartite networks to perform transduction. Unlabeled data effectively employed to improve classification performance. Better performance than algorithms based on vector space model or networks. Rigorous evaluation to show the drawbacks of the existing transductive algorithms. Trade-off analysis between inductive supervised and transductive classification.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms have been proposed to perform transductive classification on text collections represented in a vector space model. However, the use of these algorithms is infeasible in practical applications due to the independence assumption among instances or terms and the drawbacks of these algorithms. Network-based algorithms have emerged to avoid the drawbacks of the algorithms based on the vector space model and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful to represent text collections as networks and perform label propagation. The generation of this type of network avoids requirements such as collections with hyperlinks or citations, computation of similarities among all texts in the collection, and the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. The label propagation is performed from documents to terms and then from terms to documents iteratively. Nevertheless, instead of using terms just as a means of label propagation, in this article we propose the use of the bipartite network structure to define the relevance scores of terms for classes through an optimization process and then propagate these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms, which consequently redefine the labels of unlabeled documents in an iterative process. We demonstrate that the proposed approach surpasses the algorithms for transductive classification based on the vector space model or networks. Moreover, we demonstrate that the proposed algorithm effectively makes use of unlabeled documents to improve classification and that it is faster than other transductive algorithms.
International Conference on Data Mining | 2012
Rafael Geraldeli Rossi; Thiago de Paulo Faleiros; Alneu de Andrade Lopes; Solange Oliveira Rezende
Usually, algorithms for categorization of numeric data have been applied to text categorization after a preprocessing phase which assigns weights to textual terms treated as attributes. However, due to characteristics of textual data, some algorithms for data categorization are not efficient for text categorization. Characteristics of textual data such as sparsity and high dimensionality sometimes impair the quality of general-purpose classifiers. Here, we propose a text classifier based on a bipartite heterogeneous network used to represent textual document collections. This algorithm induces a classification model by assigning weights to the objects that represent the terms of the textual document collection. The induced weights correspond to the influence of the terms on the classification of the documents in which they appear. The least-mean-square algorithm is used in the inductive process. Empirical evaluation using a large number of textual document collections shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM and Naïve Bayes algorithms.
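The idea of inducing per-class term weights with a least-mean-square update can be illustrated with a minimal sketch. This is not the authors' IMBHN implementation: the function names, the learning rate, the fixed number of epochs and the 0/1 class targets are all assumptions made for illustration.

```python
from collections import defaultdict


def train_lms_term_weights(docs, labels, classes, lr=0.1, epochs=50):
    """Induce one weight per (term, class) pair with a least-mean-square rule.

    docs   : list of dicts mapping term -> frequency (document-term edges)
    labels : list of class names, one per document
    lr, epochs : illustrative hyperparameters (assumed, not from the paper)
    """
    weights = defaultdict(lambda: {c: 0.0 for c in classes})
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            for c in classes:
                # Document output for class c: frequency-weighted sum of term weights.
                predicted = sum(f * weights[t][c] for t, f in doc.items())
                target = 1.0 if c == label else 0.0
                error = target - predicted
                # LMS correction: move each connected term weight along the error.
                for t, f in doc.items():
                    weights[t][c] += lr * f * error
    return weights


def classify(doc, weights, classes):
    """Assign the class whose term weights give the highest score."""
    scores = {
        c: sum(f * weights[t][c] for t, f in doc.items() if t in weights)
        for c in classes
    }
    return max(scores, key=scores.get)
```

With a toy collection such as two sports documents and two politics documents, the weight of a term that appears only in sports documents grows for the sports class and stays at zero for the other class, so a new document containing that term is assigned to sports.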
ACM Symposium on Applied Computing | 2014
Rafael Geraldeli Rossi; Alneu de Andrade Lopes; Solange Oliveira Rezende
A bipartite heterogeneous network is one of the simplest ways to represent a textual document collection. In such a case, the network consists of two types of vertices, representing documents and terms, and links connecting terms to the documents. Transductive algorithms are usually applied to perform classification of networked objects. This type of classification is usually applied when few labeled examples are available, which may be worthwhile for practical situations. Nevertheless, for existing transductive algorithms, users have to set several parameters that significantly affect the classification accuracy. In this paper, we propose a parameter-free algorithm for transductive classification of textual data, referred to as LPBHN (Label Propagation using Bipartite Heterogeneous Networks). LPBHN uses a bipartite heterogeneous network to perform the classification task. The proposed algorithm achieves accuracy equivalent to or higher than that of state-of-the-art algorithms for transductive classification in heterogeneous or homogeneous networks.
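Label propagation over a document-term bipartite network can be sketched as follows. This is a generic illustration under stated assumptions, not the published LPBHN procedure: the function name, the frequency-weighted averaging and the fixed iteration count (used here instead of a convergence test) are all hypothetical choices.

```python
def propagate_labels(docs, seed_labels, classes, iterations=20):
    """Propagate class scores documents -> terms -> documents.

    docs        : dict doc_id -> dict term -> frequency
    seed_labels : dict doc_id -> class, for the labeled documents only
    iterations  : assumed stopping rule for this sketch
    """
    doc_scores = {
        d: {c: (1.0 if seed_labels.get(d) == c else 0.0) for c in classes}
        for d in docs
    }
    for _ in range(iterations):
        # Documents -> terms: each term takes the frequency-weighted
        # average of the scores of the documents it occurs in.
        term_scores, term_totals = {}, {}
        for d, terms in docs.items():
            for t, f in terms.items():
                scores = term_scores.setdefault(t, {c: 0.0 for c in classes})
                term_totals[t] = term_totals.get(t, 0.0) + f
                for c in classes:
                    scores[c] += f * doc_scores[d][c]
        for t, scores in term_scores.items():
            for c in classes:
                scores[c] /= term_totals[t]
        # Terms -> documents: only unlabeled documents are updated,
        # labeled documents keep their seed labels.
        for d, terms in docs.items():
            total = sum(terms.values())
            if d in seed_labels or total == 0:
                continue
            doc_scores[d] = {
                c: sum(f * term_scores[t][c] for t, f in terms.items()) / total
                for c in classes
            }
    return {d: max(s, key=s.get) for d, s in doc_scores.items()}
```

Because the scores are plain weighted averages over the network links, this sketch has no tuning parameter other than the iteration count, which reflects the spirit of a parameter-free propagation scheme.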
Conference on Intelligent Text Processing and Computational Linguistics | 2015
Rafael Geraldeli Rossi; Solange Oliveira Rezende; Alneu de Andrade Lopes
Transductive classification is a useful way to classify texts when just a few labeled examples are available. Transductive classification algorithms rely on term frequency to directly classify texts represented in the vector space model or to build networks and perform label propagation. Related terms tend to belong to the same class, and this information can be used to assign relevance scores of terms for classes and consequently the labels of documents. In this paper we propose the use of term networks to model term relations and perform transductive classification. In order to do so, we propose (i) different ways to generate term networks, (ii) how to assign initial relevance scores for terms, (iii) how to propagate the relevance scores among terms, and (iv) how to use the relevance scores of terms in order to classify documents. We demonstrate that transductive classification based on term networks can surpass the accuracies obtained by transductive classification considering texts represented in other types of networks or in the vector space model, or even the accuracies obtained by inductive classification. We also demonstrate that we can decrease the size of term networks through feature selection while keeping classification accuracy and decreasing computational complexity.
Knowledge Based Systems | 2018
Roberta Akemi Sinoara; Jose Camacho-Collados; Rafael Geraldeli Rossi; Roberto Navigli; Solange Oliveira Rezende
Accurate semantic representation models are essential in text mining applications. For a successful application of the text mining process, the text representation adopted must keep the interesting patterns to be discovered. Although competitive results for automatic text classification may be achieved with a traditional bag of words, such a representation model cannot provide satisfactory classification performance in hard settings where richer text representations are required. In this paper, we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced and low-dimensional representations. We overcome the lack of interpretability of embedded vectors, which is a drawback of this kind of representation, with the use of word sense embedded vectors. Moreover, the experimental evaluation indicates that the use of the proposed representations provides stable classifiers with strong quantitative results, especially in semantically complex classification scenarios.
Pattern Recognition Letters | 2017
Thiago de Paulo Faleiros; Rafael Geraldeli Rossi; Alneu de Andrade Lopes
Scalable algorithm based on bipartite graphs to perform transductive learning. Label propagation procedure that uses class information associated with vertices and edges. Better performance than state-of-the-art algorithms based on vector space or graphs. Comprehensive evaluation showing the proposal's performance with few labeled instances. Optimization process using KL-Divergence.
Transductive classification is a useful way to classify a collection of unlabeled textual documents when only a small fraction of this collection can be manually labeled. Graph-based algorithms have attracted considerable interest in recent years for transductive classification, since the graph-based representation facilitates label propagation through the graph edges. In a bipartite graph representation, nodes represent objects of two types, here documents and terms, and the edges between documents and terms represent the occurrences of the terms in the documents. In this context, the label propagation is performed from documents to terms and then from terms to documents iteratively. In this paper we propose a new graph-based transductive algorithm that uses the bipartite graph structure to associate the available class information of labeled documents and then propagates this class information to assign labels to unlabeled documents. By associating the class information with the edges linking documents to terms, we guarantee that a single term can propagate different class information to its distinct neighbors. We also demonstrate that the proposed method surpasses the algorithms for transductive classification based on the vector space model or graphs when only a small number of labeled documents is available.
Processing of the Portuguese Language | 2014
Brett Drury; Rafael Geraldeli Rossi; Alneu de Andrade Lopes
Causative verbs can assist in the identification of causative relations. Portuguese has a large number of verbs, which would make the manual labelling of causative verbs an expensive task. This paper presents a classification strategy which uses the characteristics of causative verbs co-occurring with common nouns to classify Brazilian Portuguese verbs as either causative or non-causative. The strategy constructs a graph where verbs extracted from text are nodes. Two verbs are connected if they co-occur with common nouns. The classification strategy uses the distinct characteristics of links between (1) causative verbs, (2) causative verbs and non-causative verbs, and (3) non-causative verbs to predict a label (causative or non-causative) for unlabelled verbs. The proposed strategy significantly outperforms a baseline and supervised learning strategies.
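The graph construction step described above, in which verbs are linked when they co-occur with common nouns, can be sketched as follows. The function name, the input format and the use of the shared-noun count as an edge weight are assumptions for illustration only; the abstract does not specify them.

```python
from itertools import combinations


def build_verb_graph(verb_nouns):
    """Link two verbs when they co-occur with at least one common noun.

    verb_nouns : dict verb -> set of nouns observed with that verb
    Returns a dict mapping (verb_a, verb_b) -> number of shared nouns,
    with each pair stored once in alphabetical order.
    """
    edges = {}
    for v1, v2 in combinations(sorted(verb_nouns), 2):
        shared = verb_nouns[v1] & verb_nouns[v2]
        if shared:
            # Edge weight (an assumption of this sketch): count of shared nouns.
            edges[(v1, v2)] = len(shared)
    return edges
```

For example, if "causar" and "provocar" were both observed with the nouns "dano" and "crise", the sketch would link them with weight 2, while a verb sharing no nouns with any other verb would remain isolated.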
Document Engineering | 2011
Rafael Geraldeli Rossi; Solange Oliveira Rezende
Archive | 2008
Bruno M. Nogueira; Maria Fernanda Moura; Merley da Silva Conrado; Rafael Geraldeli Rossi; Ricardo Marcondes Marcacini; Solange Oliveira Rezende