Anderson A. Ferreira
Universidade Federal de Ouro Preto
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Anderson A. Ferreira.
international conference on management of data | 2012
Anderson A. Ferreira; Marcos André Gonçalves; Alberto H. F. Laender
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. The challenges of dealing with author name ambiguity have led to a myriad of disambiguation methods. Generally speaking, the proposed methods usually attempt to group citation records of a same author by finding some similarity among them or try to directly assign them to their respective authors. Both approaches may either exploit supervised or unsupervised techniques. In this article, we propose a taxonomy for characterizing the current author name disambiguation methods described in the literature, present a brief survey of the most representative ones and discuss several open challenges.
acm/ieee joint conference on digital libraries | 2009
Denilson Alves Pereira; Berthier A. Ribeiro-Neto; Nivio Ziviani; Alberto H. F. Laender; Marcos André Gonçalves; Anderson A. Ferreira
In digital libraries, ambiguous author names may occur due to the existence of multiple authors with the same name (polysemes) or different name variations for the same author (synonyms). We proposed here a new method that uses information available on the Web to deal with both problems at the same time. Our idea consists of gathering information from input citations and submitting queries to a Web search engine, aiming at finding curricula vitae and Web pages containing publications of the ambiguous authors. From the content of documents in the answer sets returned by the Web search engine, useful information that can help in the disambiguation process is extracted. Using this information, author names are disambiguated by leveraging a hierarchical clustering method that groups citations in the same document together in a bottom-up fashion. Experimental results show that the our method yields results that outperform those of two state-of-the-art unsupervised methods and are statistically comparable with those of a supervised one, but requiring no training. We observe gains of up to 65.2% in the pairwise F1 metric when compared with our best unsupervised baseline method.
acm/ieee joint conference on digital libraries | 2010
Anderson A. Ferreira; Adriano Veloso; Marcos André Gonçalves; Alberto H. F. Laender
Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available, are the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need of any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.
Information Processing and Management | 2012
Adriano Veloso; Anderson A. Ferreira; Marcos André Gonçalves; Alberto H. F. Laender; Wagner Meira
Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges such as the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only few citations), may prevent the full potential of such techniques. In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), that extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), that extends LAND with self-training capabilities, thus drastically reducing the amount of examples required for building effective disambiguation functions, besides being able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambigutators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, being extremely effective and practical.
document engineering | 2008
Alberto H. F. Laender; Marcos André Gonçalves; Ricardo G. Cota; Anderson A. Ferreira; Rodrygo L. T. Santos; Allan J. C. Silva
Digital Libraries are complex information systems that involve rich sets of digital objects and their respective metadata, along with multiple organizational structures and services (e.g., searching, browsing, and personalization), and are normally built having a target community of users with specific interests. Central to the success of this type of system is the quality of their services and content. In the context of DLs of scientific literature, among the many problems faced to sustain their information quality, two specific ones, related to information consistency, have taken a lot of attention from the research community: name disambiguation and lack of information to access the full-text of cataloged documents. In this paper, we examine these two problems and describe the solutions we have proposed to solve them.
acm/ieee joint conference on digital libraries | 2012
Anderson A. Ferreira; Rodrigo M. Silva; Marcos André Gonçalves; Adriano Veloso; Alberto H. F. Laender
One of the hardest problems faced by current scholarly digital libraries is author name ambiguity. This problem occurs when, in a set of citation records, there are records of a same author under distinct names, or citation records belonging to distinct authors with similar names. Among the several proposed methods, the most effective ones seem to be based on the direct assignment of the records to their respective authors by means of the application of supervised machine learning techniques. The effectiveness of such methods is usually directly correlated with the amount of supervised training data available. However, the acquisition of training examples requires skilled human annotators to manually label references. Aiming to reduce the set of examples needed to produce the training data, in this paper we propose a new active sampling strategy based on association rules for the author name disambiguation task. We compare our strategy with state-of-the-art supervised baselines that use the complete labeled training dataset and other active methods and show that very competitive results in terms of disambiguation effectiveness can be obtained with reductions in the training set of up to 71%.
association for information science and technology | 2014
Anderson A. Ferreira; Adriano Veloso; Marcos André Gonçalves; Alberto H. F. Laender
We present a novel 3‐step self‐training method for author name disambiguation—SAND (self‐training associative name disambiguator)—which requires no manual labeling, no parameterization (in real‐world scenarios) and is particularly suitable for the common situation in which only the most basic information about a citation record is available (i.e., author names, and work and venue titles). During the first step, real‐world heuristics on coauthors are able to produce highly pure (although fragmented) clusters. The most representative of these clusters are then selected to serve as training data for the third supervised author assignment step. The third step exploits a state‐of‐the‐art transductive disambiguation method capable of detecting unseen authors not included in any training example and incorporating reliable predictions to the training data. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation, demonstrate that our proposed method outperforms all representative unsupervised author grouping disambiguation methods and is very competitive with fully supervised author assignment methods. Thus, different from other bootstrapping methods that explore privileged, hard to obtain information such as self‐citations and personal information, our proposed method produces topnotch performance with no (manual) training data or parameterization and in the presence of scarce information.
International Journal on Digital Libraries | 2015
Alan Filipe Santana; Marcos André Gonçalves; Alberto H. F. Laender; Anderson A. Ferreira
Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but with the burden of having to rely on manually labeled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this article, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions, and apply supervision only to optimize such parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios. In fact, we show that our method can beat state-of-the-art supervised methods in terms of effectiveness in many situations while being orders of magnitude faster. It can also run without any training information, using only default parameters, and still be very competitive when compared to these supervised methods (beating several of them) and better than most existing unsupervised author name disambiguation solutions.
acm/ieee joint conference on digital libraries | 2014
Alan Filipe Santana; Marcos André Gonçalves; Alberto H. F. Laender; Anderson A. Ferreira
Author name disambiguation has been one of the hardest problems faced by digital libraries since their early days. Historically, supervised solutions have empirically outperformed those based on heuristics, but with the burden of having to rely on manually labelled training sets for the learning process. Moreover, most supervised solutions just apply some type of generic machine learning solution and do not exploit specific knowledge about the problem. In this paper, we follow a similar reasoning, but in the opposite direction. Instead of extending an existing supervised solution, we propose a set of carefully designed heuristics and similarity functions and apply supervision only to optimize such parameters for each particular dataset. As our experiments show, the result is a very effective, efficient and practical author name disambiguation method that can be used in many different scenarios.
acm/ieee joint conference on digital libraries | 2013
Thiago Anzolin de Godoi; Ricardo da Silva Torres; Ariadne Maria Brito Rizzoni Carvalho; Marcos André Gonçalves; Anderson A. Ferreira; Weiguo Fan; Edward A. Fox
This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.