Paulo Braz Golgher

Universidade Federal de Minas Gerais

Publication

Featured research published by Paulo Braz Golgher.

international world wide web conferences | 2004

Automatic web news extraction using tree edit distance

Davi de Castro Reis; Paulo Braz Golgher; Altigran Soares da Silva; Alberto H. F. Laender

The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made in order to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
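
To give a feel for what a tree edit distance measures, here is a minimal sketch of a generic top-down (Selkow-style) variant, in which whole subtrees are inserted or deleted and node labels may be renamed. The abstract does not spell out the paper's own algorithm, so this is an illustration only; the `(label, children)` tuple encoding and the unit costs are assumptions made for the example.

```python
def tree_size(tree):
    """Number of nodes in a (label, children) tree."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)


def top_down_distance(a, b):
    """Top-down edit distance with unit costs (illustrative assumption).

    Roots are always mapped to each other; a relabel costs 1, and
    inserting or deleting a subtree costs its number of nodes.
    """
    relabel = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)

    # d[i][j] = cost of turning the first i children of a into the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + tree_size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + tree_size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + tree_size(kids_a[i - 1]),      # delete subtree
                d[i][j - 1] + tree_size(kids_b[j - 1]),      # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(kids_a[i - 1], kids_b[j - 1]),  # map subtrees
            )
    return relabel + d[m][n]


# Two hypothetical page skeletons that differ in one wrapper element.
page_a = ("html", [("body", [("p", []), ("p", [])])])
page_b = ("html", [("body", [("p", []), ("div", [("p", [])])])])
distance = top_down_distance(page_a, page_b)
```

Pages generated from the same template would be expected to sit at a small distance from each other under a measure of this kind, which is what makes it usable for grouping pages before extraction.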

conference on information and knowledge management | 2005

Concept-based interactive query expansion

Bruno M. Fonseca; Paulo Braz Golgher; Bruno Pôssas; Berthier A. Ribeiro-Neto; Nivio Ziviani

Despite the recent advances in search quality, the fast increase in the size of the Web collection has introduced new challenges for Web ranking algorithms. In fact, there are still many situations in which the users are presented with imprecise or very poor results. One of the key difficulties is the fact that users usually submit very short and ambiguous queries, and they do not fully specify their information needs. That is, it is necessary to improve the query formation process if better answers are to be provided. In this work we propose a novel concept-based query expansion technique, which allows disambiguating queries submitted to search engines. The concepts are extracted by analyzing and locating cycles in a special type of query relations graph. This is a directed graph built from query relations mined using association rules. The concepts related to the current query are then shown to the user, who selects the one concept that he interprets as most related to his query. This concept is used to expand the original query and the expanded query is processed instead. Using a Web test collection, we show that our approach leads to gains in average precision figures of roughly 32%. Further, if the user also provides information on the type of relation between his query and the selected concept, the gains in average precision go up to roughly 52%.

lasers and electro optics society meeting | 2003

Using association rules to discover search engines related queries

Bruno M. Fonseca; Paulo Braz Golgher; E.S. de Moura; Nivio Ziviani

We present a method for automatically generating suggestions of related queries submitted to Web search engines. The method extracts information from the log of past queries submitted to search engines using algorithms for mining association rules. Experiments were performed on a log containing more than 2.3 million queries submitted to a commercial search engine, giving correct suggestions in 90.5% of the top 5 suggestions presented for common queries extracted from a real log.
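
As a rough illustration of mining association rules from a query log, the sketch below counts how often pairs of queries occur in the same user session and keeps the rules whose support and confidence pass a threshold. The grouping of the log into sessions, the restriction to pairwise rules, the threshold values, and the function name `related_queries` are all assumptions for the example, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def related_queries(sessions, min_support=2, min_confidence=0.5):
    """Map each query to related queries, ranked by rule confidence.

    sessions: iterable of lists of query strings, one list per session
    (a hypothetical preprocessing of the raw log).
    """
    query_count = Counter()
    pair_count = Counter()
    for session in sessions:
        unique = sorted(set(session))            # count each query once per session
        query_count.update(unique)
        pair_count.update(combinations(unique, 2))

    suggestions = {}
    for (a, b), together in pair_count.items():
        if together < min_support:               # support: sessions containing both
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / query_count[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                suggestions.setdefault(lhs, []).append((rhs, confidence))

    for lhs in suggestions:
        suggestions[lhs].sort(key=lambda item: (-item[1], item[0]))
    return suggestions


# Toy log with four sessions (made-up data for illustration only).
toy_sessions = [
    ["cars", "used cars"],
    ["cars", "used cars"],
    ["cars", "car rental"],
    ["used cars", "car prices"],
]
toy_suggestions = related_queries(toy_sessions)
```

With these toy sessions only the pair "cars" and "used cars" co-occurs often enough to produce a rule, so each of the two is suggested for the other.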

data and knowledge engineering | 2004

Automatic generation of agents for collecting hidden web pages for data extraction

Juliano Palmieri Lage; Altigran Soares da Silva; Paulo Braz Golgher; Alberto H. F. Laender

As the Web grows, more and more data has become available under dynamic forms of publication, such as legacy databases accessed by an HTML form (the so-called hidden Web). In situations such as this, integration of this data relies more and more on the fast generation of agents that can automatically fetch pages for further processing. As a result, there is an increasing need for tools that can help users generate such agents. In this paper, we describe a method for automatically generating agents to collect hidden Web pages. This method uses a pre-existing data repository for identifying the contents of these pages and takes advantage of some patterns that can be found among Web sites to identify the navigation paths to follow. To demonstrate the accuracy of our method, we discuss the results of a number of experiments carried out with sites from different domains.

conference on information and knowledge management | 2007

Efficient search ranking in social networks

Monique V. Vieira; Bruno Maciel Fonseca; Rodrigo Damazio; Paulo Braz Golgher; Berthier A. Ribeiro-Neto

In social networks such as Orkut, www.orkut.com, a large portion of the user queries refer to names of other people. Indeed, more than 50% of the queries in Orkut are about names of other users, with an average of 1.8 terms per query. Further, the users usually search for people with whom they maintain relationships in the network. These relationships can be modelled as edges in a friendship graph, a graph in which the nodes represent the users. In this context, search ranking can be modelled as a function that depends on the distances among users in the graph, more specifically, on shortest paths in the friendship graph. However, application of this idea to ranking is not straightforward because the large size of modern social networks (dozens of millions of users) prevents efficient computation of shortest paths at query time. We overcome this by designing a ranking formula that strikes a balance between producing good results and reducing query processing time. Using data from the Orkut social network, which includes over 40 million users, we show that our ranking, augmented by this new signal, produces high quality results, while keeping query processing time small.
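
The idea of ranking by distance in the friendship graph can be sketched as follows. The abstract does not give the paper's ranking formula, so this example simply orders candidates by a breadth-first distance that is cut off at a small depth, as one way of keeping the per-query cost bounded; the adjacency-dict graph, the `max_depth` cutoff, and both function names are assumptions made for illustration.

```python
from collections import deque


def bounded_distances(graph, source, max_depth=2):
    """Breadth-first distances from source, exploring at most max_depth hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        user = queue.popleft()
        if dist[user] == max_depth:      # do not expand beyond the cutoff
            continue
        for friend in graph.get(user, ()):
            if friend not in dist:
                dist[friend] = dist[user] + 1
                queue.append(friend)
    return dist


def rank_by_distance(graph, searcher, candidates, max_depth=2):
    """Order name-matching candidates by closeness to the searcher.

    Candidates outside the explored neighbourhood share the worst score
    and are ordered alphabetically among themselves.
    """
    dist = bounded_distances(graph, searcher, max_depth)
    far = max_depth + 1
    return sorted(candidates, key=lambda user: (dist.get(user, far), user))


# Hypothetical friendship chain: ana - bia - caio - davi - eva.
friends = {
    "ana": ["bia"],
    "bia": ["ana", "caio"],
    "caio": ["bia", "davi"],
    "davi": ["caio", "eva"],
    "eva": ["davi"],
}
ranking = rank_by_distance(friends, "ana", ["eva", "caio", "bia", "davi"])
```

Here "bia" (a direct friend) and "caio" (a friend of a friend) rank ahead of "davi" and "eva", who lie beyond the two-hop cutoff.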

string processing and information retrieval | 1999

CoBWeb - a crawler for the Brazilian Web

A.S. da Silva; E.A. Veloso; Paulo Braz Golgher; Berthier A. Ribeiro-Neto; A.H.F. Laender; Nivio Ziviani

One of the key components of current Web search engines is the document collector. The paper describes CoBWeb, an automatic document collector whose architecture is distributed and highly scalable. CoBWeb aims at collecting large amounts of documents per time period while observing operational and ethical limits in the crawling process. CoBWeb is part of the SIAM (Information Systems in Mobile Computing Environments) search engine which is being implemented to support the Brazilian Web. Thus, several results related to the Brazilian Web are presented.

conceptual modeling approaches for e business | 2000

An Example-Based Environment for Wrapper Generation

Paulo Braz Golgher; Alberto H. F. Laender; Altigran Soares da Silva; Berthier A. Ribeiro-Neto

In the so-called Web information systems, the role of extracting data of interest from Web sites is played by software components generically known as wrappers. As a result, the existence of flexible tools for designing, developing and maintaining wrappers is crucial. In this paper, we present WByE (Wrapping By Example), a user-oriented set of tools for helping the user to build wrappers. WByE is based on information implicitly provided by the user by means of suitable and intuitive interfaces. It includes two components: the ASByE tool, used for generating specifications on how to fetch desired pages (be they static or dynamic), and the DEByE tool, used for the extraction of data implicitly present in the fetched pages.

conference on information and knowledge management | 2001

Bootstrapping for example-based data extraction

Paulo Braz Golgher; Altigran Soares da Silva; Alberto H. F. Laender; Berthier A. Ribeiro-Neto

The effortless generation of wrappers for Web data sources is a crucial task if proper access to the huge amount of semi-structured data on the Web is to be granted. In particular, the development of strategies for wrapper generation based on user-given examples is currently one of the most promising research directions in Web data extraction. In this paper we show how to use a pre-existing data repository to automatically generate examples and allow fully automated example-based data extraction. To demonstrate the feasibility of our approach we provide a number of results obtained from experiments we carried out and discuss how our ideas can be used to improve extraction rates and to provide resilience and adaptiveness for example-based generated wrappers.

international acm sigir conference on research and development in information retrieval | 2005

Basic issues on the processing of web queries

Claudine Badue; Ramurti A. Barbosa; Paulo Braz Golgher; Berthier A. Ribeiro-Neto; Nivio Ziviani

In this paper we study three basic and key issues related to Web query processing: load balance, broker behavior, and performance by individual index servers. Our study, while preliminary, does reveal interesting tradeoffs: (1) load imbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) the broker is not a bottleneck, and (3) disk utilization is higher than CPU utilization.

web information and data management | 2002

Collecting hidden web pages for data extraction

Juliano Palmieri Lage; Altigran Soares da Silva; Paulo Braz Golgher; Alberto H. F. Laender

As the Web grows, more and more data has become available under dynamic forms of publication, such as a legacy database accessed by an HTML form (the so-called Hidden Web). In situations such as this, integration of this data relies more and more on the fast generation of page fetching agents. As a result, there is an increasing need for tools that can help the user to generate such agents. In this paper, we describe an approach to automatically generating agents to collect hidden Web pages that uses a pre-existing data repository for identifying the contents of these pages and takes advantage of some regularities that can be found among Web sites. To demonstrate the effectiveness of our approach, we discuss the results of a number of experiments carried out with sites from different domains. We also discuss how such regularities among sites can be formalized.