Network


Pável Calado's latest external collaborations, at the country level.

Hotspot


Research topics in which Pável Calado is active.

Publication


Featured research published by Pável Calado.


Conference on Information and Knowledge Management | 2003

Combining link-based and content-based methods for web document classification

Pável Calado; Marco Cristo; Edleno Silva de Moura; Nivio Ziviani; Berthier A. Ribeiro-Neto; Marcos André Gonçalves

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F1, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides important insight into which measures derived from links are more appropriate to compare Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
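
As a rough illustration of the evidence-combination idea described above (not the paper's actual Bayesian network model), the sketch below derives a link-based class score from co-citation overlap and mixes it linearly with a content classifier's class probability. The documents, link sets, and weights are invented for the example.

```python
# Illustrative sketch only: combining a content classifier's class probability
# with a link-derived similarity signal, in the spirit of evidence combination.
# The documents, link sets, and weights below are hypothetical, not from the paper.

from collections import defaultdict

def cocitation_similarity(inlinks_a, inlinks_b):
    """Jaccard overlap of the pages that link to documents a and b."""
    if not inlinks_a and not inlinks_b:
        return 0.0
    return len(inlinks_a & inlinks_b) / len(inlinks_a | inlinks_b)

def link_class_score(doc, category_docs, inlinks):
    """Average co-citation similarity between `doc` and known members of a category."""
    sims = [cocitation_similarity(inlinks[doc], inlinks[d]) for d in category_docs]
    return sum(sims) / len(sims) if sims else 0.0

def combine(content_prob, link_score, w=0.5):
    """Simple linear combination; the paper combines evidence with a Bayesian network."""
    return w * content_prob + (1 - w) * link_score

# Toy usage
inlinks = defaultdict(set, {
    "d1": {"p1", "p2", "p3"},
    "d2": {"p2", "p3"},
    "new": {"p1", "p3"},
})
score = combine(content_prob=0.62,
                link_score=link_class_score("new", ["d1", "d2"], inlinks),
                w=0.6)
print(round(score, 3))
```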


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000

Link-based and content-based evidential information in a belief network model

Ilmério Silva; Berthier A. Ribeiro-Neto; Pável Calado; Edleno Silva de Moura; Nivio Ziviani

This work presents an information retrieval model developed to deal with hyperlinked environments. The model is based on belief networks and provides a framework for combining information extracted from the content of the documents with information derived from cross-references among the documents. The information extracted from the content of the documents is based on statistics regarding the keywords in the collection and is one of the bases of traditional information retrieval (IR) ranking algorithms. The information derived from cross-references among the documents is based on link references in a hyperlinked environment and has received increased attention lately due to the success of the Web. We discuss a set of strategies for combining these two sources of evidential information and experiment with them using a reference collection extracted from the Web. The results show that this type of combination can improve retrieval performance without requiring any extra information from the users at query time. In our experiments, the improvements reach up to 59% in terms of average precision.
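
The following toy sketch illustrates the general belief-network pattern of combining two normalized evidence scores per document with a disjunctive ("OR"-like) operator; the documents and scores are made up, and the paper's actual network is considerably richer.

```python
# Minimal sketch of a disjunctive ("noisy-OR" style) combination of two normalized
# evidence scores per document, a pattern commonly used in belief-network IR models.
# Document ids and scores are invented; this is not the paper's exact model.

def or_combine(content_score, link_score):
    """Disjunctive combination of two pieces of evidence in [0, 1]."""
    return 1.0 - (1.0 - content_score) * (1.0 - link_score)

docs = {
    "d1": {"content": 0.80, "link": 0.10},
    "d2": {"content": 0.40, "link": 0.70},
    "d3": {"content": 0.55, "link": 0.55},
}

ranking = sorted(docs,
                 key=lambda d: or_combine(docs[d]["content"], docs[d]["link"]),
                 reverse=True)
print(ranking)  # documents ordered by combined belief
```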


IEEE Transactions on Knowledge and Data Engineering | 2004

Image retrieval using multiple evidence ranking

Tatiana Almeida Souza Coelho; Pável Calado; Lamarque Vieira Souza; Berthier A. Ribeiro-Neto; Richard R. Muntz

The World Wide Web is the largest publicly available image repository and a natural focus of attention. An immediate consequence is that searching for images on the Web has become a common and important task. To search for images of interest, the most direct approach is keyword-based searching. However, since images on the Web are poorly labeled, direct application of standard keyword-based image searching techniques frequently yields poor results. We propose a comprehensive solution to this problem. In our approach, multiple sources of evidence related to the images are considered. To allow combining these distinct sources of evidence, we introduce an image retrieval model based on Bayesian belief networks. To evaluate our approach, we perform experiments on a reference collection composed of 54,000 Web images. Our results indicate that retrieval using an image's surrounding text passages is as effective as standard retrieval based on HTML tags. This is an interesting result because current image search engines on the Web usually do not take text passages into consideration. Most importantly, according to our results, the combination of information derived from text passages with information derived from HTML tags leads to improved retrieval, with relative gains in average precision of roughly 50 percent, when compared to the results obtained by the use of each source of evidence in isolation.
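
To make the multiple-evidence idea concrete, here is a hypothetical sketch that scores an image against a query using several textual sources (alt text, surrounding passage, page title) and combines the per-source cosine similarities with fixed weights. The real model combines evidence through a Bayesian belief network; the sources, weights, and texts below are illustrative only.

```python
# Hypothetical sketch: score an image against a query from several textual
# evidence sources and combine the per-source similarities with fixed weights.
# Not the paper's belief-network model; texts and weights are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def multi_evidence_score(query, sources, weights):
    score = 0.0
    for name, text in sources.items():
        vect = TfidfVectorizer()
        tfidf = vect.fit_transform([text, query])   # row 0: evidence, row 1: query
        score += weights[name] * cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return score

sources = {
    "alt": "sunset over lisbon bridge",
    "passage": "a photo taken at dusk showing the 25 de abril bridge in lisbon",
    "title": "travel photos portugal",
}
weights = {"alt": 0.3, "passage": 0.5, "title": 0.2}
print(multi_evidence_score("lisbon bridge at dusk", sources, weights))
```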


ACM/IEEE Joint Conference on Digital Libraries | 2009

Automatic quality assessment of content created collaboratively by web communities: a case study of Wikipedia

Daniel Hasan Dalip; Marcos André Gonçalves; Marco Cristo; Pável Calado

The old dream of a universal repository containing all human knowledge and culture is becoming possible through the Internet and the Web. Moreover, this is happening with the direct, collaborative participation of people. Wikipedia is a great example: it is an enormous repository of information with free access and open editing, created by the community in a collaborative manner. However, this large amount of information, made available democratically and virtually without any control, raises questions about its relative quality. In this work we explore a significant number of quality indicators, some of them proposed by us and used here for the first time, and study their capability to assess the quality of Wikipedia articles. Furthermore, we explore machine learning techniques to combine these quality indicators into a single assessment judgment. Through experiments, we show that the most important quality indicators are the easiest ones to extract, namely, textual features related to length, structure, and style. We were also able to determine which indicators did not contribute significantly to the quality assessment. These were, coincidentally, the most complex features, such as those based on link analysis. Finally, we compare our combination method with a state-of-the-art solution and show significant improvements in terms of effective quality prediction.
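
As a sketch of the kind of cheap textual indicators the abstract singles out (length, structure, and style), the function below computes a handful of such features from raw wikitext. The feature names and the toy example are ours, not the paper's exact feature set; in the paper, such vectors feed a learned combiner.

```python
# Illustrative only: a few length/structure/style indicators of the kind the
# abstract highlights as most effective. Feature names are ours, not the paper's.

import re

def text_quality_features(wikitext: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s", wikitext) if s.strip()]
    words = re.findall(r"\w+", wikitext)
    return {
        "char_length": len(wikitext),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "section_count": sum(1 for line in wikitext.splitlines()
                             if line.strip().startswith("==")),   # wiki headings
        "citation_count": wikitext.count("<ref"),                 # inline references
        "link_density": wikitext.count("[[") / max(len(words), 1),
    }

features = text_quality_features("== History ==\nThe city was founded in 1200. "
                                 "It grew quickly.<ref>source</ref> [[Portugal]]")
print(features)
# These vectors would then be passed to a machine-learned combiner to produce
# a single quality score, as described in the abstract.
```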


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

A combined component approach for finding collection-adapted ranking functions based on genetic programming

Humberto Mossri de Almeida; Marcos André Gonçalves; Marco Cristo; Pável Calado

In this paper, we propose a new method to discover collection-adapted ranking functions based on Genetic Programming (GP). Our Combined Component Approach (CCA) is based on the combination of several term-weighting components (i.e., term frequency, collection frequency, normalization) extracted from well-known ranking functions. In contrast to related work, the GP terminals in our CCA are not based on simple statistical information about a document collection, but on meaningful, effective, and proven components. Experimental results show that our approach was able to outperform standard TF-IDF, BM25, and another GP-based approach on two different collections. CCA obtained improvements in mean average precision of up to 40.87% for the TREC-8 collection and 24.85% for the WBR99 collection (a large Brazilian Web collection), over the baseline functions. The CCA evolution process was also able to reduce the overtraining commonly found in machine learning methods, especially genetic programming, and to converge faster than the other GP-based approach used for comparison.
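
A minimal sketch of the combined-component idea follows: the GP terminals are whole term-weighting components (a tf factor, an idf factor, a length normalization) rather than raw statistics, and an individual is an expression over them. The specific components and the hand-written individual below are illustrative, not the paper's exact terminal set.

```python
# Sketch of the "combined component" idea: GP terminals are whole term-weighting
# components rather than raw collection statistics. Components and the sample
# individual are illustrative, not the exact set used in the paper.

import math

# Term-weighting components, each a function of simple per-term statistics.
COMPONENTS = {
    "tf_log":     lambda tf, df, N, dl, avgdl: 1.0 + math.log(tf) if tf > 0 else 0.0,
    "idf_prob":   lambda tf, df, N, dl, avgdl: math.log((N - df + 0.5) / (df + 0.5)),
    "pivot_norm": lambda tf, df, N, dl, avgdl: 1.0 / (0.25 + 0.75 * dl / avgdl),
}

# One GP "individual": an expression over the components, written directly as a
# Python expression for brevity.
def individual_score(tf, df, N, dl, avgdl):
    c = {name: f(tf, df, N, dl, avgdl) for name, f in COMPONENTS.items()}
    return c["tf_log"] * c["idf_prob"] * c["pivot_norm"]

# A GP loop would generate many such expression trees, score each on a training
# collection (e.g. by mean average precision), and evolve the best ones.
print(round(individual_score(tf=3, df=50, N=100000, dl=120, avgdl=300), 3))
```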


ACM Transactions on Information Systems | 2003

Local versus global link information in the Web

Pável Calado; Berthier A. Ribeiro-Neto; Nivio Ziviani; Edleno Silva de Moura; Ilmério Silva

Information derived from the cross-references among the documents in a hyperlinked environment, usually referred to as link information, is considered important since it can be used to effectively improve document retrieval. Depending on the retrieval strategy, link information can be local or global. Local link information is derived from the set of documents returned as answers to the current user query. Global link information is derived from all the documents in the collection. In this work, we investigate how the use of local link information compares to the use of global link information. For the comparison, we run a series of experiments using a large document collection extracted from the Web. For our reference collection, the results indicate that the use of local link information improves precision by 74%. When global link information is used, precision improves by 35%. However, when only the first 10 documents in the ranking are considered, the average gain in precision obtained with the use of global link information is higher than the gain obtained with the use of local link information. This is an interesting result since it provides insight and justification for the use of global link information in major Web search engines, where users are mostly interested in the first 10 answers. Furthermore, global information can be computed in the background, which allows speeding up query processing.
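
The toy sketch below illustrates the local/global distinction with the simplest possible link signal, in-degree, computed once over the whole (made-up) collection graph and once restricted to the documents retrieved for a query; the paper's experiments use richer link evidence.

```python
# Toy sketch of local vs. global link information: the same signal (in-degree)
# computed over the full collection versus only over the query's answer set.
# The link graph and answer set are invented.

links = {                     # source -> targets, over the whole collection
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}

def indegree(graph, restrict_to=None):
    nodes = set(graph) if restrict_to is None else set(restrict_to)
    counts = {n: 0 for n in nodes}
    for src, targets in graph.items():
        if restrict_to is not None and src not in nodes:
            continue
        for t in targets:
            if t in counts:
                counts[t] += 1
    return counts

print("global:", indegree(links))                          # uses every link
print("local :", indegree(links, restrict_to={"a", "c"}))  # only the answer set
```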


International World Wide Web Conference | 2006

Site level noise removal for search engines

André Luiz da Costa Carvalho; Paul Alexandru Chirita; Edleno Silva de Moura; Pável Calado; Wolfgang Nejdl

The currently booming search engine industry has led many online organizations to attempt to artificially increase their rankings in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, whether they arise from spam, from simple relationships between real-world entities represented by sites, from replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and eliminate site-level mutual reinforcement relationships, abnormal support from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after applying our techniques.
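
As a simplified illustration of site-level analysis, the sketch below aggregates page-level links to the site level and flags site pairs with heavy mutual support. The threshold and the toy link set are arbitrary, and the paper's detection algorithms are more refined.

```python
# Illustrative sketch: aggregate page-level links to the site level and flag
# site pairs with suspicious mutual reinforcement. Threshold and data are
# arbitrary; not the paper's algorithms.

from collections import Counter
from urllib.parse import urlparse

page_links = [  # (source page, target page)
    ("http://a.com/1", "http://b.com/x"),
    ("http://a.com/2", "http://b.com/y"),
    ("http://b.com/x", "http://a.com/1"),
    ("http://b.com/y", "http://a.com/2"),
    ("http://c.com/1", "http://a.com/1"),
]

site_links = Counter(
    (urlparse(src).netloc, urlparse(dst).netloc) for src, dst in page_links
)

SUPPORT_THRESHOLD = 2
noisy_pairs = {
    (s, t) for (s, t), n in site_links.items()
    if n >= SUPPORT_THRESHOLD and site_links.get((t, s), 0) >= SUPPORT_THRESHOLD
}
print(noisy_pairs)  # site pairs with heavy mutual reinforcement
```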


Conference on Information and Knowledge Management | 2007

Structure-based inference of XML similarity for fuzzy duplicate detection

Luís Leitão; Pável Calado; Melanie Weis

Fuzzy duplicate detection aims at identifying multiple representations of real-world objects stored in a data source, and is a task of critical practical relevance in data cleaning, data mining, and data integration. It has a long history for relational data stored in a single table (or in multiple tables with identical schemas). Algorithms for fuzzy duplicate detection in more complex structures, e.g., hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors, e.g., children in hierarchical data, to improve duplicate detection effectiveness. In this paper, we propose a novel method for fuzzy duplicate detection in hierarchical and semi-structured XML data. Unlike previous approaches, it considers not only the duplicate status of children, but rather the probability of descendants being duplicates. Probabilities are computed efficiently using a Bayesian network. Experiments show that the proposed algorithm is able to maintain high precision and recall, even when dealing with data containing a high amount of errors and missing information. Our proposal is also able to outperform a state-of-the-art duplicate detection system on three different XML databases.
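
The sketch below conveys the core intuition of propagating duplicate evidence up an XML tree: a node's duplicate probability mixes its own value similarity with the probabilities computed for its matched children. Matching children by tag name, the string similarity measure, and the fixed weight are simplifications for illustration, not the paper's Bayesian network.

```python
# Rough sketch of propagating duplicate probabilities up an XML tree. The
# matching strategy, similarity measure, and weight are simplifications, not
# the paper's Bayesian network model.

import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

def value_sim(a, b):
    return SequenceMatcher(None, (a.text or "").strip(), (b.text or "").strip()).ratio()

def dup_probability(a, b, w_value=0.5):
    children_a = {c.tag: c for c in a}   # simplification: match children by tag
    children_b = {c.tag: c for c in b}
    shared = set(children_a) & set(children_b)
    if not shared:
        return value_sim(a, b)
    child_prob = sum(dup_probability(children_a[t], children_b[t])
                     for t in shared) / len(shared)
    return w_value * value_sim(a, b) + (1 - w_value) * child_prob

x1 = ET.fromstring("<person><name>Joao Silva</name><city>Lisboa</city></person>")
x2 = ET.fromstring("<person><name>J. Silva</name><city>Lisbon</city></person>")
print(round(dup_probability(x1, x2), 3))
```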


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Exploiting user feedback to learn to rank answers in Q&A forums: a case study with Stack Overflow

Daniel Hasan Dalip; Marcos André Gonçalves; Marco Cristo; Pável Calado

Collaborative web sites, such as collaborative encyclopedias, blogs, and forums, are characterized by loose edit control, which allows anyone to freely edit their content. As a consequence, the quality of this content raises much concern. To deal with this, many sites adopt manual quality control mechanisms. However, given their size and change rate, manual assessment strategies do not scale, and content that is new or unpopular is seldom reviewed. This has a negative impact on the many services provided, such as ranking and recommendation. To tackle this problem, we propose a learning to rank (L2R) approach for ranking answers in Q&A forums. In particular, we adopt an approach based on Random Forests and represent query and answer pairs using eight different groups of features, some of which are used in the Q&A domain for the first time. Our L2R method was trained to learn the answer rating, based on the feedback users give to answers in Q&A forums. Using the proposed method, we were able (i) to outperform a state-of-the-art baseline with gains of up to 21% in NDCG, a metric used to evaluate rankings. We also conducted a comprehensive study of the features, showing that (ii) review and user features are the most important in the Q&A domain, although text features are useful for assessing the quality of new answers; and (iii) the best set of new features we proposed was able to yield the best quality rankings.
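
A minimal pointwise sketch of this setup: a Random Forest regressor is trained to predict an answer's user rating from a feature vector, and candidate answers are then ranked by predicted score. The four features and the tiny dataset below are invented; the paper uses eight much richer feature groups and evaluates rankings with NDCG.

```python
# Minimal pointwise sketch: train a Random Forest to predict an answer's user
# rating, then rank candidate answers by predicted score. Features and data
# are invented, not the paper's feature groups.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each row: [answer length, code blocks, answerer reputation, query-answer term overlap]
X_train = np.array([
    [120, 1, 5000, 0.40],
    [ 35, 0,  120, 0.10],
    [300, 2, 9000, 0.55],
    [ 80, 0,  300, 0.30],
])
y_train = np.array([12, 0, 25, 3])   # user votes: the feedback signal

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

candidates = np.array([
    [200, 1, 4000, 0.50],   # answer A
    [ 50, 0,  200, 0.20],   # answer B
])
scores = model.predict(candidates)
ranking = np.argsort(-scores)         # best-scoring answer first
print(scores, ranking)
```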


ACM/IEEE Joint Conference on Digital Libraries | 2003

The Web-DL environment for building digital libraries from the Web

Pável Calado; Marcos André Gonçalves; Edward A. Fox; Berthier A. Ribeiro-Neto; Alberto H. F. Laender; A.S. da Silva; Davi de Castro Reis; Paulo Roberto; Monique V. Vieira; Juliano Palmieri Lage

The Web contains a huge volume of unstructured data, which is difficult to manage. In digital libraries, on the other hand, information is explicitly organized, described, and managed, and community-oriented services are built to address specific information needs and tasks. In this paper, we describe an environment, Web-DL, that allows the construction of digital libraries from the Web. The Web-DL environment allows us to collect data from the Web, standardize it, and publish it through a digital library system. It provides support for the services and organizational structure normally available in digital libraries, while benefiting from the breadth of the Web's contents. We experimented with applying the Web-DL environment to the Networked Digital Library of Theses and Dissertations (NDLTD), thus demonstrating that the rapid construction of digital libraries from the Web is possible. Web-DL also provides an alternative large-scale solution for interoperability between independent digital libraries.
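
Purely as an illustration of the "standardize" step mentioned above, the sketch below maps heterogeneous, Web-harvested record fields onto a small Dublin Core-like schema before publication through a digital library system; the field names and aliases are hypothetical, not Web-DL's actual schema.

```python
# Hypothetical sketch of a standardization step: mapping heterogeneous,
# Web-harvested records onto a small Dublin Core-like schema. Field names and
# aliases are examples only, not Web-DL's actual mapping rules.

DC_FIELDS = ("title", "creator", "date", "subject", "identifier")

FIELD_ALIASES = {
    "title": {"title", "thesis_title", "name"},
    "creator": {"author", "creator"},
    "date": {"date", "year", "defense_date"},
    "subject": {"subject", "keywords", "area"},
    "identifier": {"url", "identifier", "handle"},
}

def standardize(raw_record: dict) -> dict:
    record = {}
    for field in DC_FIELDS:
        for alias in FIELD_ALIASES[field]:
            if alias in raw_record:
                record[field] = raw_record[alias]
                break
    return record

print(standardize({"thesis_title": "Link analysis for IR", "author": "A. Student",
                   "year": "2002", "url": "http://example.edu/etd/123"}))
```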

Collaboration


Dive into Pável Calado's collaborations.

Top Co-Authors

Bruno Martins
Instituto Superior Técnico

Marco Cristo
Federal University of Amazonas

Berthier A. Ribeiro-Neto
Universidade Federal de Minas Gerais

Marcos André Gonçalves
Universidade Federal de Minas Gerais

Edleno Silva de Moura
Federal University of Amazonas

Altigran Soares da Silva
Universidade Federal de Minas Gerais

Alberto H. F. Laender
Universidade Federal de Minas Gerais

Daniel Hasan Dalip
Universidade Federal de Minas Gerais

Nivio Ziviani
Universidade Federal de Minas Gerais