Publication


Featured research published by André Luiz da Costa Carvalho.


International World Wide Web Conference | 2006

Site level noise removal for search engines

André Luiz da Costa Carvalho; Paul Alexandru Chirita; Edleno Silva de Moura; Pável Calado; Wolfgang Nejdl

The currently booming search engine industry has led many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, be they spam, simple relationships between real-world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structure, residing at the site level instead of the page level. We thus investigate and annihilate site-level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after applying our techniques.
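As a rough illustration of the site-level view taken here, the sketch below collapses a page-level link graph to the site level and flags mutual reinforcement between site pairs; the function names, the host-based notion of a site, and the min_support threshold are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch (not the authors' implementation): collapse a page-level link
# graph to the site level and flag mutual-reinforcement pairs as candidate noise.
from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    """Map a page URL to its site (here simply the host)."""
    return urlparse(url).netloc

def site_level_noisy_pairs(page_links, min_support=10):
    """page_links: iterable of (source_url, target_url) pairs.
    Returns site pairs whose mutual inter-site link counts both exceed
    min_support, a rough proxy for site-level mutual reinforcement."""
    site_links = defaultdict(int)
    for src, dst in page_links:
        s, d = site_of(src), site_of(dst)
        if s != d:                      # ignore intra-site links
            site_links[(s, d)] += 1
    noisy = set()
    for (s, d), n in site_links.items():
        m = site_links.get((d, s), 0)
        if n >= min_support and m >= min_support:
            noisy.add(frozenset((s, d)))
    return noisy
```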


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Fast document-at-a-time query processing using two-tier indexes

Cristian Rossi; Edleno Silva de Moura; André Luiz da Costa Carvalho; Altigran Soares da Silva

In this paper we present two new algorithms designed to reduce the overall time required to process top-k queries. These algorithms are based on the document-at-a-time approach and modify the best baseline we found in the literature, Block Max WAND (BMW), to take advantage of a two-tier index, in which the first tier is a small index containing only the higher-impact entries of each inverted list. This small index is used to pre-process the query before accessing a larger index in the second tier, considerably speeding up the whole process. The first algorithm we propose, named BMW-CS, achieves higher performance, but may result in small changes in the top results of the final ranking. The second algorithm, named BMW-t, preserves the top results and, while slower than BMW-CS, is faster than BMW. In our experiments, BMW-CS was more than 40 times faster than BMW when computing top-10 results and, while it does not guarantee preserving the top results, it preserved all ranking results evaluated at this level.
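The two-tier idea can be pictured with the following sketch, which scans a small high-impact tier to obtain a top-k score threshold before evaluating the full index; the index layout, additive scoring, and function names are simplifying assumptions, and BMW's block-max pruning is omitted.

```python
# Illustrative sketch of the two-tier idea behind BMW-CS/BMW-t (not the paper's code).
import heapq
from collections import defaultdict

def score_docs(index, query_terms):
    """Accumulate per-document scores from {term: {doc_id: impact}} postings."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc, impact in index.get(t, {}).items():
            scores[doc] += impact
    return scores

def two_tier_topk(tier1, full_index, query_terms, k=10):
    # Pass 1: a cheap scan of the small high-impact tier yields a score threshold.
    t1_scores = score_docs(tier1, query_terms)
    top_t1 = heapq.nlargest(k, t1_scores.values())
    threshold = top_t1[-1] if len(top_t1) == k else 0.0
    # Pass 2: evaluate the full index; a real implementation would combine the
    # threshold with block-max upper bounds to skip most postings here.
    full_scores = score_docs(full_index, query_terms)
    return heapq.nlargest(k, full_scores.items(), key=lambda x: x[1]), threshold
```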


Information Systems | 2010

Modeling the web as a hypergraph to compute page reputation

Klessius Berlt; Edleno Silva de Moura; André Luiz da Costa Carvalho; Marco Cristo; Nivio Ziviani; Thierson Couto

In this work we propose a model to represent the web as a directed hypergraph (instead of a graph), where links connect pairs of disjoint sets of pages. The web hypergraph is derived from the web graph by dividing the set of pages into non-overlapping blocks and using the links between pages of distinct blocks to create hyperarcs. A hyperarc connects a block of pages to a single page, in order to provide more reliable information for link analysis. We use the hypergraph model to create the hypergraph versions of the Pagerank and Indegree algorithms, referred to as HyperPagerank and HyperIndegree, respectively. The hypergraph is derived from the web graph by grouping pages by two different partition criteria: grouping together the pages that belong to the same web host or to the same web domain. We compared the original page-based algorithms with the host-based and domain-based versions of the algorithms, considering a combination of the page reputation, the textual content of the pages and the anchor text. Experimental results using three distinct web collections show that the HyperPagerank and HyperIndegree algorithms may yield better results than the original graph versions of the Pagerank and Indegree algorithms. We also show that the hypergraph versions of the algorithms were slightly less affected by noisy links and spamming.
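A loose sketch of the host-level hypergraph idea is given below, under the simplifying assumptions that each host forms a block, a hyperarc connects a whole block to a target page, and a block's mass is the sum of its pages' scores split evenly among its hyperarcs; this illustrates the concept only and is not the paper's exact formulation.

```python
# Rough sketch of a host-level HyperPagerank-style iteration (assumed model).
from collections import defaultdict
from urllib.parse import urlparse

def hyper_pagerank(page_links, d=0.85, iters=30):
    """page_links: iterable of (source_url, target_url) pairs."""
    pages = set()
    block_targets = defaultdict(set)          # host -> pages it hyperlinks to
    for src, dst in page_links:
        pages.update((src, dst))
        if urlparse(src).netloc != urlparse(dst).netloc:
            block_targets[urlparse(src).netloc].add(dst)
    block_pages = defaultdict(set)            # host -> pages in the block
    for p in pages:
        block_pages[urlparse(p).netloc].add(p)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / len(pages) for p in pages}
        for host, targets in block_targets.items():
            # Block mass: sum of the scores of all pages in the block, split
            # evenly among its outgoing hyperarcs (toy version; mass is not
            # conserved exactly as in a proper stochastic formulation).
            mass = sum(rank[p] for p in block_pages[host])
            for t in targets:
                new[t] += d * mass / len(targets)
        rank = new
    return rank
```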


International World Wide Web Conference | 2009

On Finding Templates on Web Collections

Karane Vieira; André Luiz da Costa Carvalho; Klessius Berlt; Edleno Silva de Moura; Altigran Soares da Silva; Juliana Freire

Templates are pieces of HTML code common to a set of web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in web sites, such as search engines, clustering and automatic categorization programs. Further, the information available in templates is redundant, and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates considering a scenario where multiple templates can be found in a collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises important questions that may require extensions and adjustments in previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of HTML nodes, and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single-template detection based on a restricted form of bottom-up tree mapping that requires only a small set of pages to correctly identify a template and has worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.
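The clustering step can be pictured with the sketch below, which groups pages by shared HTML tag paths and keeps the paths common to a cluster as its template candidate; the path representation, the greedy clustering, and the Jaccard threshold are illustrative assumptions rather than the tree-mapping algorithm proposed in the paper.

```python
# Sketch of path-based clustering for multi-template collections (assumed simplification).
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collect root-to-element tag paths such as 'html/body/div/a'."""
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))
    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):   # pop to matching open tag
            if self.stack[i] == tag:
                del self.stack[i:]
                break

def tag_paths(html):
    c = PathCollector()
    c.feed(html)
    return c.paths

def cluster_by_paths(pages, min_jaccard=0.6):
    """pages: dict url -> html. Greedy clustering by tag-path similarity."""
    clusters = []          # each cluster: (shared path set, [urls])
    for url, html in pages.items():
        paths = tag_paths(html)
        for shared, urls in clusters:
            inter, union = len(shared & paths), len(shared | paths)
            if union and inter / union >= min_jaccard:
                urls.append(url)
                shared &= paths        # template candidate: paths common to the cluster
                break
        else:
            clusters.append((set(paths), [url]))
    return clusters
```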


Information Processing and Management | 2016

Fast top-k preserving query processing using two-tier indexes

Caio Moura Daoud; Edleno Silva de Moura; André Luiz da Costa Carvalho; Altigran Soares da Silva; David Fernandes; Cristian Rossi

We present a new query processing method for text search. We extend the BMW-CS algorithm to preserve the top-k results, proposing BMW-CSP, and show through experiments that the method is competitive when compared to baselines.

In this paper we propose and evaluate the Block Max WAND with Candidate Selection and Preserving Top-K Results algorithm, or BMW-CSP. It is an extension of BMW-CS, a method previously proposed by us. Although very efficient, BMW-CS does not guarantee preserving the top-k results for a given query. Algorithms that do not preserve the top results may reduce the quality of ranking results in search systems. BMW-CSP extends BMW-CS to ensure that the top-k results have their rankings preserved. In the experiments we performed for computing top-10 results, the average time required to process queries with BMW-CSP was lower than the times required by the baselines adopted. For instance, when computing top-10 results, the average time achieved by MBMW, the best multi-tier baseline we found in the literature, was 36.29 ms per query, while the average time achieved by BMW-CSP was 19.64 ms per query. The price paid by BMW-CSP is the extra memory required to store partial scores of documents. As we show in the experiments, this price is not prohibitive and, in cases where it is acceptable, BMW-CSP may constitute an excellent alternative query processing method.
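The preservation idea, exactly re-scoring a pool of candidates that survive the fast pass, can be sketched as below; the function names and the exact_score callback are hypothetical, and unlike BMW-CSP this toy version gives no formal guarantee that the true top-k is preserved.

```python
# Hedged sketch of re-ranking fast-pass candidates with exact scores.
import heapq

def preserve_topk(candidates, approx_scores, exact_score, k=10):
    """candidates: doc ids surviving the fast approximate pass;
    exact_score(doc): recomputes the full score, e.g. from stored partial
    scores plus the postings the fast pass skipped."""
    # Re-rank a pool somewhat larger than k; a guarantee like BMW-CSP's would
    # require the stored partial scores and bounds, which are omitted here.
    pool = heapq.nlargest(max(2 * k, k + 10), candidates,
                          key=lambda d: approx_scores.get(d, 0.0))
    exact = [(exact_score(doc), doc) for doc in pool]
    return heapq.nlargest(k, exact)
```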


International Workshop on Social Computing | 2015

A Quantitative Analysis of Learning Objects and Their Metadata in Web Repositories

André Luiz da Costa Carvalho; Moisés G. de Carvalho; Davi Guimaraes; Davi Kalleb; Roberto Cavalcanti; Rodrigo S. Gouveia; Helvio Lopes; Tiago Thompsen Primo; Fernando Koch

This work conducts a quantitative analysis of a number of Learning Object Repositories (LORs) containing Learning Objects (LOs) in both English and Portuguese. The focus of this exercise is to understand how the contributors organize their metadata, the update frequency, and measurements of LOR items such as: (i) the size distribution; (ii) the growth rate; and (iii) statistics about metadata completion, blank fields and LO types. We conclude our analysis with a discussion about the implications of our findings for tasks such as LO search and recommendation.
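The kind of metadata-completion statistics reported here can be computed along the lines of the sketch below; the field names and the dictionary representation of a learning object are illustrative assumptions, not the schema of any particular repository.

```python
# Sketch: per-field metadata completion and LO-type distribution (assumed fields).
from collections import Counter

def metadata_stats(learning_objects,
                   fields=("title", "description", "type", "language", "keywords")):
    """learning_objects: list of dicts of metadata fields. Returns the fraction
    of objects with a non-blank value per field, plus a distribution of types."""
    n = len(learning_objects) or 1
    completion = {f: sum(1 for lo in learning_objects
                         if str(lo.get(f, "")).strip()) / n
                  for f in fields}
    type_dist = Counter(lo.get("type", "unknown") for lo in learning_objects)
    return completion, type_dist

# Example:
# completion, types = metadata_stats([{"title": "Algebra 1", "type": "video"},
#                                     {"title": "", "type": "quiz"}])
```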


Journal of the Association for Information Science and Technology | 2012

LePrEF: Learn to precompute evidence fusion for efficient query evaluation

André Luiz da Costa Carvalho; Cristian Rossi; Edleno Silva de Moura; Altigran Soares da Silva; David Fernandes

State-of-the-art search engine ranking methods combine several distinct sources of relevance evidence to produce a high-quality ranking of results for each query. The fusion of information is currently done at query-processing time, which has a direct effect on the response time of search systems. Previous research also shows that an alternative to improve search efficiency in textual databases is to precompute term impacts at indexing time. In this article, we propose a novel alternative to precompute term impacts, providing a generic framework for combining any distinct set of sources of evidence by using a machine-learning technique. This method retains the advantage of producing high-quality results, but avoids the cost of combining evidence at query-processing time. Our method, called Learn to Precompute Evidence Fusion (LePrEF), uses genetic programming to compute a unified precomputed impact value for each term found in each document prior to query processing, at indexing time. Compared with previous research on precomputing term impacts, our method offers the advantage of providing a generic framework to precompute impacts using any set of relevance evidence on any text collection, whereas previous approaches do not. The precomputed impact values are indexed and used later for computing document ranking at query-processing time. By doing so, our method effectively reduces query processing to simple additions of such impacts. We show that this approach, while leading to results comparable to state-of-the-art ranking methods, can also lead to a significant decrease in computational costs during query processing.
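The query-time side of this idea reduces to summing precomputed impacts, as in the sketch below; the index layout and function name are assumptions, and the offline genetic-programming step that learns the impact values is not shown.

```python
# Sketch: ranking with a single precomputed impact per (term, document) pair.
import heapq
from collections import defaultdict

def rank_with_precomputed_impacts(index, query_terms, k=10):
    """index: {term: {doc_id: precomputed_impact}} built at indexing time."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc, impact in index.get(term, {}).items():
            scores[doc] += impact          # no per-query evidence fusion needed
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])
```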


Information Retrieval Journal | 2017

Waves: a fast multi-tier top-k query processing algorithm

Caio Moura Daoud; Edleno Silva de Moura; David Fernandes; Altigran Soares da Silva; Cristian Rossi; André Luiz da Costa Carvalho

In this paper, we present Waves, a novel document-at-a-time algorithm for fast computation of top-k query results in search systems. The Waves algorithm uses multi-tier indexes for processing queries. It performs successive tentative evaluations of results, which we call waves. Each wave traverses the index, starting from a specific tier level i, and may insert into the answer only those documents that occur in that tier level. After processing a wave, the algorithm checks whether the answer achieved might be changed by successive waves or not. A new wave is started only if it has a chance of changing the top-k scores. We show through experiments that such a lazy query processing strategy results in smaller query processing times when compared to previous approaches proposed in the literature. We present experiments comparing Waves’ performance to state-of-the-art document-at-a-time query processing methods that preserve top-k results and show scenarios where the method is a good alternative for computing top-k results.
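The wave loop can be pictured roughly as follows: each wave accumulates scores from one tier, and a further wave runs only if the remaining tiers could still push a new document into the top-k. The tier representation, the per-tier score bounds, and the stopping test below are simplifying assumptions, not the BMW-based implementation evaluated in the paper.

```python
# Toy sketch of the wave-by-wave evaluation with an early-stop check.
import heapq
from collections import defaultdict

def waves_topk(tiers, tier_bounds, query_terms, k=10):
    """tiers: list of {term: {doc: impact}} indexes, highest-impact tier first.
    tier_bounds[i]: upper bound on what tier i can add to a single document's
    score for one query term."""
    scores = defaultdict(float)
    ranked = []
    for i, tier in enumerate(tiers):
        # Wave i: accumulate partial scores from this tier only.
        for t in query_terms:
            for doc, impact in tier.get(t, {}).items():
                scores[doc] += impact
        ranked = heapq.nlargest(k + 1, scores.items(), key=lambda x: x[1])
        # Upper bound on what the remaining tiers could still add to any document.
        slack = len(query_terms) * sum(tier_bounds[i + 1:])
        if len(ranked) > k and ranked[k][1] + slack < ranked[k - 1][1]:
            break   # no document outside the current top-k can still enter it
    return ranked[:k]
```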


Conference on Information and Knowledge Management | 2017

Fast Word Recognition for Noise Channel-Based Models in Scenarios with Noise-Specific Domain Knowledge

Marco Cristo; Raíza Hanada; André Luiz da Costa Carvalho; Fernando Anglada Lores; Maria da Graça Campos Pimentel

Word recognition is a challenging task faced by many applications, especially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy channel, such that the task is to determine which known word of a lexicon corresponds to the received string. To make this feasible, only a reduced set of candidate words is selected, usually those that can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real-use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well, as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using minimal perfect hashing. The method allows the early processing of the most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which takes advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, while being about ten times faster.
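A minimal sketch of the noisy-channel setup is shown below: lexicon words within k edit operations of the received string are selected and ranked by a prior times a channel term. Brute-force candidate generation and a uniform edit cost stand in for the Mor-Fraenkel index and the learned error distributions, so everything beyond the general setup is an assumption.

```python
# Toy noisy-channel word recognition: candidate selection by edit distance,
# ranking by P(word) * approximate P(received | word).
import math

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def recognize(received, lexicon, word_prior, k=2):
    """Rank candidates w by word_prior[w] * exp(-edit_distance), a stand-in
    for a channel model learned from real error data."""
    candidates = [(w, edit_distance(received, w)) for w in lexicon]
    candidates = [(w, d) for w, d in candidates if d <= k]
    scored = [(word_prior.get(w, 1e-9) * math.exp(-d), w) for w, d in candidates]
    return sorted(scored, reverse=True)
```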


Journal of the Association for Information Science and Technology | 2012

Using site-level connections to estimate link confidence

Jucimar Brito de Souza; André Luiz da Costa Carvalho; Marco Cristo; Edleno Silva de Moura; Pável Calado; Paul-Alexandru Chirita; Wolfgang Nejdl

Search engines are essential tools for web users today. They rely on a large number of features to compute the rank of search results for each given query. The estimated reputation of pages is among the effective features available to search engine designers, and is probably adopted by most current commercial search engines. Page reputation is estimated by analyzing the linkage relationships between pages. This information is used by link analysis algorithms as a query-independent feature, to be taken into account when computing the rank of the results. Unfortunately, several types of links found on the web may damage the estimated page reputation and thus cause a negative effect on the quality of search results. This work studies alternatives to reduce the negative impact of such noisy links. More specifically, the authors propose and evaluate new methods that deal with noisy links, considering scenarios where the reputation of pages is computed using the PageRank algorithm. They show, through experiments with real web content, that their methods achieve significant improvements when compared to previous solutions proposed in the literature.
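The general idea of weighting links by an estimated confidence when computing PageRank can be sketched as below; the confidence heuristic, which damps links in proportion to how many pages of the source site point at the target site, is an illustrative assumption rather than one of the methods evaluated in the paper.

```python
# Sketch: confidence-weighted PageRank over a page-level link graph.
from collections import defaultdict
from urllib.parse import urlparse

def link_confidence(page_links):
    """Heuristic confidence: links backed by heavy site-to-site support from
    many distinct source pages are damped as possible site-level noise."""
    support = defaultdict(set)          # (src_site, dst_site) -> source pages
    for src, dst in page_links:
        support[(urlparse(src).netloc, urlparse(dst).netloc)].add(src)
    conf = {}
    for src, dst in page_links:
        n = len(support[(urlparse(src).netloc, urlparse(dst).netloc)])
        conf[(src, dst)] = 1.0 / n
    return conf

def weighted_pagerank(page_links, conf, d=0.85, iters=30):
    out_w = defaultdict(float)
    for (src, dst), c in conf.items():
        out_w[src] += c
    pages = {p for link in page_links for p in link}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / len(pages) for p in pages}
        for (src, dst), c in conf.items():
            if out_w[src]:
                new[dst] += d * rank[src] * c / out_w[src]
        rank = new
    return rank
```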

Collaboration


Dive into André Luiz da Costa Carvalho's collaboration.

Top Co-Authors

Edleno Silva de Moura (Federal University of Amazonas)
Marco Cristo (Federal University of Amazonas)
Cristian Rossi (Federal University of Amazonas)
David Fernandes (Federal University of Amazonas)
Klessius Berlt (Federal University of Amazonas)
Pável Calado (Instituto Superior Técnico)
Caio Moura Daoud (Federal University of Amazonas)
Josiane Rodrigues (Federal University of Amazonas)