Publications


Featured research published by Viviane Pereira Moreira.


Very Large Data Bases | 2011

Multilingual schema matching for Wikipedia infoboxes

Thanh Hoang Nguyen; Viviane Pereira Moreira; Huong Nguyen; Hoa Nguyen; Juliana Freire

Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, and has proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content.
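The core intuition here, that attribute correspondences can be found from the data values themselves (numbers, dates, and names often survive translation) rather than from syntactically similar attribute names, can be pictured with a small sketch. The attribute names, values, and the Jaccard measure below are illustrative assumptions, not the paper's actual algorithm:

```python
def value_similarity(vals_a, vals_b):
    """Jaccard overlap of two attribute value sets; numeric values
    and proper names often survive translation unchanged."""
    a, b = set(vals_a), set(vals_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# toy English and Portuguese infoboxes describing the same entity
en = {"population": ["12325232"], "founded": ["1554"]}
pt = {"populacao": ["12325232"], "fundacao": ["1554"]}

# map each English attribute to the Portuguese one with the most
# similar values, ignoring the attribute names entirely
mapping = {a: max(pt, key=lambda p: value_similarity(en[a], pt[p])) for a in en}
print(mapping)  # {'population': 'populacao', 'founded': 'fundacao'}
```

Because the match is driven by values, it works even when the two languages share no cognate attribute names.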


Cross-Language Evaluation Forum | 2010

A new approach for cross-language plagiarism analysis

Rafael Corezola Pereira; Viviane Pereira Moreira; Renata de Matos Galante

This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed of five main phases: language normalization, retrieval of candidate documents, classifier training, plagiarism analysis, and post-processing. To evaluate our method, we created a corpus containing artificial plagiarism offenses. Two different experiments were conducted: the first considers only monolingual plagiarism cases, while the second considers only cross-language plagiarism cases. The results showed that the cross-language experiment achieved 86% of the performance of the monolingual baseline. We also analyzed how the length of the plagiarized text affects the overall performance of the method. This analysis showed that our method achieved better results with medium and large plagiarized passages.
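The detection step of such a pipeline, comparing suspicious passages against candidate source documents after language normalization, can be sketched with a simple word n-gram overlap measure. This is a hypothetical stand-in for the paper's classifier-based analysis phase, not the authors' implementation:

```python
def ngrams(text, n=3):
    """Set of word n-grams in a normalized (lowercased) text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(suspicious, source, n=3):
    """Fraction of the suspicious passage's n-grams found in the source."""
    a, b = ngrams(suspicious, n), ngrams(source, n)
    return len(a & b) / len(a) if a else 0.0

suspicious = "the quick brown fox jumps over the lazy dog near the river"
source = "a quick brown fox jumps over the lazy dog every morning"
score = overlap_score(suspicious, source)
print(round(score, 2))  # 0.6
```

A score near 1.0 flags the passage for the post-processing phase; in the cross-language setting, both texts would first pass through language normalization so the comparison happens in one language.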


Information Processing and Management | 2016

Assessing the impact of Stemming Accuracy on Information Retrieval - A multilingual perspective

Felipe N. Flores; Viviane Pereira Moreira

We tested the quality of many stemmers for English, French, Spanish, and Portuguese with both intrinsic and extrinsic metrics. We found that a correlation between the two types of measures does exist, but it is not as strong as one might have expected. The most accurate stemmer was not the one with the biggest improvement in Information Retrieval in any of the languages.

The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval systems. In this article, we evaluate various stemming algorithms, in four languages, in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Experiments in English, French, Portuguese, and Spanish show that this is not always the case, as stemmers with higher error rates yield better retrieval quality. As a byproduct, we also identified the most accurate stemmers and the best for Information Retrieval purposes.
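Intrinsic stemmer accuracy of the kind evaluated here can be sketched as the fraction of same-lemma word pairs that a stemmer conflates to one stem. Both the metric and the toy suffix-stripping stemmer below are simplified, hypothetical illustrations, not the measures or stemmers from the article:

```python
def accuracy(stemmer, groups):
    """Fraction of same-lemma word pairs mapped to the same stem."""
    correct = total = 0
    for words in groups:
        for i in range(len(words)):
            for j in range(i + 1, len(words)):
                total += 1
                if stemmer(words[i]) == stemmer(words[j]):
                    correct += 1
    return correct / total

def toy_stem(w):
    """Naive suffix stripping (illustrative only)."""
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

# each inner list holds variant forms of one lemma
groups = [["connect", "connected", "connecting", "connects"],
          ["run", "running", "runs"]]
print(accuracy(toy_stem, groups))
```

The extrinsic side of the comparison would instead plug each stemmer into an IR system and measure retrieval quality; the article's point is that rankings on the two measures need not agree.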


IEEE Transactions on Knowledge and Data Engineering | 2013

Prequery Discovery of Domain-Specific Query Forms: A Survey

Mauricio C. Moraes; Carlos A. Heuser; Viviane Pereira Moreira; Denilson Barbosa

The discovery of HTML query forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-specific query forms that do not involve form submission. We detail these methods and discuss how form discovery has become increasingly more automated over time. We conclude with a forecast of what we believe are the immediate next steps in this trend.


Information Systems | 2009

A strategy for allowing meaningful and comparable scores in approximate matching

Carina F. Dorneles; Marcos Freitas Nunes; Carlos A. Heuser; Viviane Pereira Moreira; Altigran Soares da Silva; Edleno Silva de Moura

Approximate data matching aims at assessing whether two distinct instances of data represent the same real-world object. The comparison between data values is usually done by applying a similarity function which returns a similarity score. If this score surpasses a given threshold, both data instances are considered as representing the same real-world object. These score values depend on the algorithm that implements the function and have no meaning to the user. In addition, score values generated by different functions are not comparable. This will potentially lead to problems when the scores returned by different similarity functions need to be combined for computing the similarity between records. In this article, we propose that thresholds should be defined in terms of the precision that is expected from the matching process rather than in terms of the raw scores returned by the similarity function. Precision is a widely known similarity metric and has a clear interpretation from the user's point of view. Our approach defines mappings from score values to precision values, which we call adjusted scores. In order to obtain such mappings, our approach requires training over a small dataset. Experiments show that training can be reused for different datasets on the same domain. Our results also demonstrate that existing methods for combining scores for computing the similarity between records may be enhanced if adjusted scores are used.
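The idea of adjusted scores, replacing a raw similarity score with the precision observed at that score on a labeled training sample, can be sketched as follows. This is a hypothetical simplification of the paper's mapping; real training data and any smoothing of the curve are omitted:

```python
def adjusted_scores(scored_pairs):
    """Map each raw similarity score to the precision observed when
    accepting every training pair scoring at or above it."""
    ranked = sorted(scored_pairs, key=lambda p: -p[0])
    mapping, matches = {}, 0
    for k, (score, is_match) in enumerate(ranked, start=1):
        matches += is_match
        mapping[score] = matches / k  # precision at threshold == score
    return mapping

# labeled sample: (raw score from some similarity function, true match?)
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
m = adjusted_scores(sample)
print(m[0.7])  # raw 0.7 becomes the interpretable precision 2/3
```

A user can now say "accept matches at 90% precision" without knowing anything about the raw score scale, and scores from different similarity functions become comparable on the shared precision scale.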


Information Sciences | 2011

Automatic threshold estimation for data matching applications

Juliana dos Santos; Carlos A. Heuser; Viviane Pereira Moreira; Leandro Krug Wives

Several advanced data management applications, such as data integration, data deduplication, or similarity querying, rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to assess if two different data instances match, i.e., if they represent the same real-world object. In this context, the threshold definition is a central problem. In this paper, we propose a method for the estimation of the quality of a similarity function. Quality is measured in terms of recall and precision calculated at several different thresholds. On the basis of the results of the proposed estimation process, and taking into account the requirements of a specific application, a user is able to choose a threshold value that is adequate for the application. The proposed estimation process is based on a clustering phase performed on a sample taken from a data collection and requires no human intervention.
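Choosing a threshold from such quality estimates can be pictured as sweeping candidate thresholds and reading off precision and recall at each. In the sketch below the match labels are given directly; in the paper they would come from the unsupervised clustering phase, not from manual annotation:

```python
def precision_recall_at(threshold, scored_pairs):
    """scored_pairs: (similarity, is_match); returns (precision, recall)
    for accepting all pairs scoring at or above the threshold."""
    tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
    fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
    fn = sum(1 for s, m in scored_pairs if s < threshold and m)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

pairs = [(0.95, True), (0.85, True), (0.75, False), (0.65, True), (0.40, False)]
for t in (0.9, 0.7, 0.5):
    p, r = precision_recall_at(t, pairs)
    print(t, round(p, 2), round(r, 2))
```

An application that cannot tolerate false matches would read the table from the high-precision end; one that must find every duplicate would favor the high-recall end.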


Processing of the Portuguese Language | 2010

Assessing the impact of stemming accuracy on information retrieval

Felipe N. Flores; Viviane Pereira Moreira; Carlos A. Heuser

The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Our results show that some kind of correlation does exist, but it is not as strong as one might have expected.


Association for Information Science and Technology | 2016

Comparing and combining Content- and Citation-based approaches for plagiarism detection

Solange de L. Pertile; Viviane Pereira Moreira; Paolo Rosso

The vast amount of scientific publications available online makes it easier for students and researchers to reuse text from other authors, and harder to check the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in academia. As a consequence, numerous institutions and researchers are dedicated to devising systems to automate the process of checking for plagiarism. This work focuses on the problem of detecting text reuse in scientific papers. The contributions of this paper are twofold: (a) we survey the existing approaches for plagiarism detection based on content, based on content and structure, and based on citations and references; and (b) we compare content- and citation-based approaches with the goal of evaluating whether they are complementary and if their combination can improve the quality of the detection. We carried out experiments with real data sets of scientific papers and concluded that a combination of the methods can be beneficial.


International Conference on Management of Data | 2015

Automatic Filling of Hidden Web Forms: A Survey

Gustavo Zanini Kantorski; Viviane Pereira Moreira; Carlos A. Heuser

A significant part of the information available on the Web is stored in online databases which compose what is known as Hidden Web or Deep Web. In order to access information from the Hidden Web, one must fill an HTML form that is submitted as a query to the underlying database. In recent years, many works have focused on how to automate the process of form filling by creating methods for choosing values to fill the fields in the forms. This is a challenging task since forms may contain fields for which there are no predefined values to choose from. This article presents a survey of methods for Web Form Filling, analyzing the existing solutions with respect to the type of forms that they handle and the filling strategy adopted. We provide a comparative analysis of 15 key works in this area and discuss directions for future research.


Cross-Language Evaluation Forum | 2008

UFRGS@CLEF2008: using association rules for cross-language information retrieval

André Pinto Geraldo; Viviane Pereira Moreira

For UFRGS's participation in the TEL task at CLEF2008, our aim was to assess the validity of using algorithms for mining association rules to find mappings between concepts in a Cross-Language Information Retrieval scenario. Our approach requires a sample of parallel documents to serve as the basis for the generation of the association rules. The results of the experiments show that the performance of our approach is not statistically different from the monolingual baseline in terms of mean average precision. This is an indication that association rules can be effectively used to map concepts between languages. We have also tested a modification to BM25 that aims at increasing the weight of rare terms. The results show that this modified version achieved better performance. The improvements were considered to be statistically significant in terms of MAP on our monolingual runs.
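The BM25 scheme being modified here already scores a term higher the rarer it is in the collection, via its IDF component. The abstract does not spell out the authors' specific modification, so the sketch below shows only the classic, unmodified BM25 term weight:

```python
import math

def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Classic BM25 weight of one term in one document:
    tf/df = term/document frequency, N = collection size,
    dl/avgdl = document length and average document length."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# a rare term (df = 5) outweighs a very common one (df = 5000)
rare = bm25_weight(tf=2, df=5, N=10000, dl=100, avgdl=120)
common = bm25_weight(tf=2, df=5000, N=10000, dl=100, avgdl=120)
print(rare > common)  # True
```

A modification that "increases the weight of rare terms" would steepen this IDF curve at the low-df end; the exact form used in the paper is not given in this summary.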

Collaboration


Dive into Viviane Pereira Moreira's collaborations.

Top Co-Authors (all affiliated with Universidade Federal do Rio Grande do Sul):

Carlos A. Heuser
André Pinto Geraldo
Anderson Uilian Kauer
Edson R. D. Weren
José Palazzo Moreira de Oliveira
Aline Villavicencio
Danny Suarez Vargas
Karin Becker
Solange de L. Pertile
Felipe N. Flores