Guilherme Tavares de Assis
Universidade Federal de Ouro Preto
Publications
Featured research published by Guilherme Tavares de Assis.
String Processing and Information Retrieval | 2007
Guilherme Tavares de Assis; Alberto H. F. Laender; Marcos André Gonçalves; Altigran Soares da Silva
In this paper, we propose a novel approach to focused crawling that exploits genre and content-related information present in Web pages to guide the crawling process. The effectiveness, efficiency and scalability of this approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi (genre) of computer science courses (content). The results of these experiments show that focused crawlers built according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers) and need to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
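The core idea can be sketched in a few lines: a page is kept only when it matches both a genre term set and a content term set. The function names, tokenizer, thresholds, and example terms below are illustrative assumptions, not the paper's exact formulation:

```python
import re

def term_score(text, terms):
    """Fraction of the given terms that occur in the page text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for t in terms if t in tokens) / len(terms)

def is_relevant(text, genre_terms, content_terms,
                genre_threshold=0.3, content_threshold=0.3):
    """A page is kept only if it matches both the genre and the content aspect."""
    return (term_score(text, genre_terms) >= genre_threshold and
            term_score(text, content_terms) >= content_threshold)

# Example: syllabus (genre) of a computer science course (content).
genre_terms = ["syllabus", "schedule", "grading", "lectures", "readings"]
content_terms = ["algorithms", "programming", "complexity", "data", "structures"]
page = "CS 101 Syllabus: lectures on algorithms and data structures, grading policy"
print(is_relevant(page, genre_terms, content_terms))  # True
```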
International World Wide Web Conference | 2009
Guilherme Tavares de Assis; Alberto H. F. Laender; Marcos André Gonçalves; Altigran Soares da Silva
The main goal of focused crawlers is to crawl Web pages that are relevant to a specific topic or user interest, and they play an important role in a great variety of applications. In general, they work by trying to find and crawl all kinds of pages deemed related to an implicitly declared topic. However, users are often not interested in just any document about a topic; they may want only documents of a given type or genre on that topic. In this article, we describe an approach to focused crawling that exploits not only content-related information but also genre information present in Web pages to guide the crawling process. This approach has been designed for situations in which the topic of interest can be expressed by two sets of terms, the first describing genre aspects of the desired pages and the second related to their subject or content, thus requiring no training or preprocessing of any kind. The effectiveness, efficiency and scalability of the proposed approach are demonstrated by a set of experiments involving the crawling of pages related to syllabi of computer science courses, job offers in the computer science field, and sale offers of computer equipment. These experiments show that focused crawlers built according to our genre-aware approach achieve F1 levels above 88% and need to analyze no more than 65% of the visited pages to find 90% of the relevant pages. In addition, we experimentally analyze the impact of term selection on our approach and evaluate a proposed strategy for the semi-automatic generation of such terms. This analysis shows that a small set of terms selected by an expert, or a set of terms specified by a typical user familiar with the topic, is usually enough to produce good results, and that the semi-automatic strategy is very effective in supporting the task of selecting the term sets required to guide a crawling process.
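To make the training-free nature of the approach concrete, here is a minimal crawl loop in which the two term sets fully define the target. The `WEB` dict is a toy in-memory stand-in for real HTTP fetching, and the `matches` rule is an illustrative simplification of the paper's relevance criteria:

```python
from collections import deque

# A toy stand-in for the Web: url -> (page text, outlinks).
WEB = {
    "seed": ("computer science department course list", ["a", "b"]),
    "a": ("CS 101 syllabus: grading, lectures, algorithms, data structures", ["c"]),
    "b": ("campus news and sports", []),
    "c": ("CS 202 syllabus: schedule, readings, complexity, programming", []),
}

def matches(text, terms, k=2):
    """True if at least k of the terms occur in the text (illustrative rule)."""
    low = text.lower()
    return sum(1 for t in terms if t in low) >= k

def crawl(seeds, genre_terms, content_terms):
    frontier, seen, relevant = deque(seeds), set(seeds), []
    while frontier:
        url = frontier.popleft()
        text, outlinks = WEB[url]
        if matches(text, genre_terms) and matches(text, content_terms):
            relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return relevant

print(crawl(["seed"],
            genre_terms=["syllabus", "schedule", "grading", "lectures", "readings"],
            content_terms=["algorithms", "data structures", "complexity", "programming"]))
# -> ['a', 'c']
```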
Multimedia Tools and Applications | 2015
Moisés Henrique Ramos Pereira; Celso Luiz de Souza; Flávio Luis Cardeal Pádua; Giani David Silva; Guilherme Tavares de Assis; Adriano C. M. Pereira
This paper presents a novel multimedia information system, called SAPTE, for supporting the discourse analysis and information retrieval of television programs from their corresponding video recordings. Unlike most common systems, SAPTE uses both content-independent and content-dependent metadata, which are determined by applying discourse analysis techniques as well as image and audio analysis methods. The system was developed in partnership with the free-to-air Brazilian TV channel Rede Minas in an effort to provide TV researchers with computational tools to assist their studies of this media universe. It is based on the Matterhorn framework for managing video libraries and combines: (1) discourse analysis techniques for describing and indexing the videos, considering aspects such as the definition of the subject of analysis, the nature of the speaker, and the corpus of data resulting from the discourse; (2) Julius, a state-of-the-art decoder for large-vocabulary continuous speech recognition; (3) image and frequency-domain techniques to compute visual signatures for the video recordings, containing color, shape and texture information; and (4) hashing and k-d tree methods for data indexing. The capabilities of SAPTE were successfully validated by our experimental results, indicating that it is a promising computational tool for TV researchers.
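As an illustration of the indexing side, the sketch below indexes fixed-length visual signatures with a k-d tree and retrieves the most similar key-frames. The random 8-dimensional vectors stand in for real color/shape/texture descriptors, and scipy's cKDTree is one common implementation, not necessarily the one used in SAPTE:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
signatures = rng.random((1000, 8))   # one signature per indexed key-frame
tree = cKDTree(signatures)

query = signatures[42] + 0.01 * rng.random(8)   # a slightly perturbed signature
dist, idx = tree.query(query, k=3)              # the 3 most similar key-frames
print(idx)   # frame 42 is expected to rank first
```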
Artificial Intelligence Review | 2014
Celso Luiz de Souza; Flávio Luis Cardeal Pádua; Cristiano F. G. Nunes; Guilherme Tavares de Assis; Giani David Silva
This work addresses the development of a unified approach to content-based indexing and retrieval of digital videos from television archives. The proposed approach has been designed to deal with arbitrary television genres, making it suitable for various applications. To achieve this goal, the main steps of a content-based video retrieval system are addressed in this work, namely: video segmentation, key-frame extraction, content-based video indexing and the video retrieval operation itself. Video segmentation is addressed as a typical TV broadcast structuring problem, which consists of automatically determining the boundaries of each broadcast program (movies and news, among others) and inter-program material (for instance, commercials). Specifically, to segment the videos, Electronic Program Guide (EPG) metadata is combined with the detection of two special cues, namely audio cuts (silence) and dark monochrome frames. A color histogram-based approach performs key-frame extraction. Video indexing and retrieval are accomplished using hashing and k-d tree methods, while visual signatures containing color, shape and texture information are estimated for the key-frames using image and frequency-domain techniques. Experimental results with the dataset of a multimedia information system especially developed for managing television broadcast archives demonstrate that our approach works efficiently, retrieving videos in 0.16 seconds on average and achieving recall, precision and F1 values as high as 0.76, 0.97 and 0.86, respectively.
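The two low-level segmentation cues can be sketched as simple threshold tests over frame luminance and audio energy. The function names and threshold values below are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def is_dark_monochrome(frame, lum_max=20.0, std_max=5.0):
    """frame: HxW array of 8-bit luminance values; dark AND nearly uniform."""
    return frame.mean() < lum_max and frame.std() < std_max

def is_silence(samples, rms_max=0.01):
    """samples: 1-D array of audio samples normalised to [-1, 1]."""
    return np.sqrt(np.mean(samples ** 2)) < rms_max

black = np.zeros((480, 640))     # an all-black frame
quiet = 0.001 * np.ones(44100)   # one near-silent second at 44.1 kHz
# A program boundary candidate is where both cues co-occur.
print(is_dark_monochrome(black), is_silence(quiet))  # True True
```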
Latin American Web Congress | 2012
Vitor Mangaravite; Guilherme Tavares de Assis; Anderson A. Ferreira
Focused crawlers attempt to crawl Web pages that are relevant to a specific topic or user interest. Although such crawlers have proven effective, their efficiency can still be improved. Focused crawlers usually keep a frontier of non-visited URLs from which they pick the next pages to visit and gather the relevant ones. In this work, we define and evaluate a queueing policy for non-visited URLs, based on link context, to improve the efficiency of a genre-aware focused crawler. Our experimental evaluation shows, in some situations, an efficiency improvement of around 100%.
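A minimal version of such a queueing policy is a priority frontier ordered by a link-context score. Here the score is a simple count of topic terms in the anchor text and its surrounding words; the names and the scoring rule are illustrative, not the paper's exact definition:

```python
import heapq
import itertools

topic_terms = {"syllabus", "course", "algorithms", "lectures"}

def context_score(link_context):
    words = set(link_context.lower().split())
    return len(words & topic_terms)

frontier, counter = [], itertools.count()

def enqueue(url, link_context):
    # Negative score because heapq is a min-heap and we want the best
    # context first; the counter breaks ties in FIFO order.
    heapq.heappush(frontier, (-context_score(link_context), next(counter), url))

enqueue("page1", "campus sports news")
enqueue("page2", "CS course syllabus with lectures")
enqueue("page3", "about us")
print(heapq.heappop(frontier)[2])   # 'page2' is visited first
```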
ACM Symposium on Applied Computing | 2008
Guilherme Tavares de Assis; Alberto H. F. Laender; Altigran Soares da Silva; Marcos André Gonçalves
The genre-aware approach to focused crawling aims at crawling pages related to specific topics that can be expressed in terms of both genre and content information. Such an approach requires an expert to specify a set of terms that describe the genre and the content of the pages of interest. In this paper, we analyze the impact of term selection on this approach. To do so, we performed an experimental study in which we vary the number of genre and content terms used in focused crawling processes aimed at crawling pages related to syllabi (genre) of computer science courses (subject) and sale offers (genre) of computer equipment (subject). This study showed that a small set of terms selected by an expert is usually enough to produce good results. In addition, we propose and experimentally evaluate a strategy for the semi-automatic generation of the terms used in such an approach. The results of these experiments showed that the strategy is very effective and provides a means to assist an expert in the task of specifying the required term sets.
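One plausible shape of such a semi-automatic strategy: rank the most frequent non-stopword terms from a handful of sample pages and hand the list to an expert for pruning. The tokenizer, stopword list, and cut-off below are assumptions, not the paper's exact procedure:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "this"}

def candidate_terms(sample_pages, top_k=10):
    counts = Counter()
    for text in sample_pages:
        tokens = re.findall(r"[a-z]{3,}", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_k)]

samples = [
    "Course syllabus: weekly schedule, grading policy, required readings.",
    "Syllabus for CS 250. Grading: two exams. Schedule of lectures and readings.",
]
print(candidate_terms(samples, top_k=5))
# e.g. ['syllabus', 'grading', 'schedule', 'readings', ...] for expert review
```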
Latin American Web Congress | 2014
Gabriel Resende Gonçalves; Anderson A. Ferreira; Guilherme Tavares de Assis; Andrea Iabrudi Tavares
An undergraduate program must prepare its students for the major needs of the labor market. One of the main ways to identify these demands is to manage information about the program's alumni: gathering data from them and finding out their main areas of employment in the labor market or their main fields of research in academia. Usually, this data is obtained through forms available on the Web or sent by mail or email; however, these methods, besides being laborious, yield poor response rates from alumni. Thus, this work proposes a novel method to help the teaching staff of undergraduate programs gather information on a desired population of alumni semi-automatically on the Web. Overall, using a few alumni pages as an initial set of sample pages, the proposed method was able to gather information on twice as many alumni as conventional methods.
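The seed idea can be sketched as a similarity test against the sample pages: candidate pages found on the Web are kept when they resemble the profile built from the known alumni pages. The term-frequency representation, cosine measure, threshold, and toy data are illustrative assumptions, not necessarily the method's actual page classifier:

```python
import math
import re
from collections import Counter

def tf_vector(text):
    return Counter(re.findall(r"[a-z]{3,}", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

samples = ["John Doe, BSc 2005, software engineer at Acme, alumni profile",
           "Jane Roe, BSc 2007, PhD student, alumni profile page"]
profile = sum((tf_vector(t) for t in samples), Counter())

candidate = "Alumni profile: Bob Poe, BSc 2006, data engineer"
print(cosine(profile, tf_vector(candidate)) > 0.3)   # keep this page?
```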
Information Processing and Management | 2017
Leandro Neiva Lopes Figueiredo; Guilherme Tavares de Assis; Anderson A. Ferreira
Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Ordinarily, the data in web pages represent records from a database and are obtained through a web search. One of the most important steps in extracting records from a web page is identifying, among the different data regions, the one that contains the records to be extracted. An incorrect identification of this region may lead to the extraction of incorrect records. This step is followed by the equally important step of detecting and correctly splitting the records and their attributes within the main data region. In this study, we propose a method for data extraction based on rendering information and an n-gram model (DERIN) that aims to improve wrapper performance by automatically selecting the main data region from a search results page and extracting its records and attributes based on rendering information. The proposed DERIN method can detect different record structures using techniques based on an n-gram model. Moreover, DERIN does not require examples to learn how to extract the data, works independently of the domain, and can detect records that are not children of the same parent element in the DOM tree. Experimental results using web pages from several domains show that DERIN is highly effective and performs well compared with other methods.
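The n-gram idea can be illustrated on HTML tag sequences: records generated from the same template keep similar n-gram profiles even when individual fields are missing. The Jaccard comparison and threshold below are assumptions for illustration, not DERIN's exact model:

```python
def tag_ngrams(tags, n=2):
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def similar(rec_a, rec_b, threshold=0.5):
    a, b = tag_ngrams(rec_a), tag_ngrams(rec_b)
    return len(a & b) / len(a | b) >= threshold   # Jaccard similarity

# Tag sequences of three sibling subtrees: two records and one ad block.
rec1 = ["div", "img", "h3", "a", "span", "span"]
rec2 = ["div", "img", "h3", "a", "span"]          # same template, one field less
ad   = ["div", "iframe", "script"]
print(similar(rec1, rec2))   # True  -> same record structure
print(similar(rec1, ad))     # False -> different structure
```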
International Conference of the Chilean Computer Science Society | 1997
Berthier A. Ribeiro-Neto; Guilherme Tavares de Assis
A cooperative database allows the user to specify approximate or vague query conditions. A vague query requires the database system to rank the retrieved answers according to their similarity to that query. In this paper, we discuss a reactive ranking strategy for cooperative databases. The reactiveness of our approach comes from an interactive mechanism that allows the user to select the answers of their preference and then uses these answers to tune the original query. The ranking formula we propose draws a parallel with the AutoClass II system, a Bayesian probabilistic classification engine. The reactive mechanism draws a parallel with the relevance feedback technique widely used in information retrieval. We implemented our cooperative and reactive ranking approach in a Web browser interface and performed some experiments, which provide initial evidence that our approach may help users solve query tasks more precisely and in less time.
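The feedback loop can be sketched with a Rocchio-style update, a standard relevance-feedback formula used here as an illustrative stand-in for the paper's Bayesian-inspired ranking: answers the user marks as preferred pull the query weights toward them, and ignored answers push the weights away.

```python
import numpy as np

def rocchio(query, preferred, others, alpha=1.0, beta=0.75, gamma=0.15):
    """query, preferred, others: vectors of attribute weights / answer vectors."""
    q = alpha * query
    if len(preferred):
        q = q + beta * np.mean(preferred, axis=0)
    if len(others):
        q = q - gamma * np.mean(others, axis=0)
    return np.clip(q, 0.0, None)     # keep weights non-negative

query = np.array([1.0, 0.5, 0.0])        # initial vague-query weights
preferred = np.array([[0.9, 0.8, 0.1]])  # answers the user selected
others = np.array([[0.1, 0.2, 0.9]])     # answers the user ignored
print(rocchio(query, preferred, others)) # tuned query for re-ranking
```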
Latin American Web Congress | 2014
Leandro Neiva Lopes Figueiredo; Anderson A. Ferreira; Guilherme Tavares de Assis
Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Much of that data is provided by search result pages, in which each result, called a search result record, represents a record from a database. One of the most important steps in extracting such records is identifying, among the different data regions of a page, the one that contains the records to be extracted. An incorrect identification of this region may lead to an incorrect extraction of the search result records. In this paper, we propose a simple but efficient method that generates a path expression to select the main data region of a given page, based on the rendering area information of its elements. The generated path expression can be used by wrappers to extract the search result records and their data units, reducing wrapper complexity and increasing accuracy. Experimental results using web pages from several domains show that the method is highly effective.
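The rendering-area heuristic can be sketched as a greedy descent toward the element whose rendered box dominates the page, emitting a simple path expression along the way. The Node class and the area-only stopping criterion are simplified assumptions, since the actual method obtains rendering information from a browser:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    area: float                     # rendered width * height, in pixels
    children: list = field(default_factory=list)

def main_region_path(node, path=""):
    path = f"{path}/{node.tag}"
    if not node.children:
        return path, node
    biggest = max(node.children, key=lambda c: c.area)
    # Stop descending once no single child dominates the current element:
    # the records region is the last area-dominant ancestor.
    if biggest.area < 0.5 * node.area:
        return path, node
    return main_region_path(biggest, path)

page = Node("body", 1000.0, [
    Node("div", 120.0),                        # header
    Node("div", 700.0, [                       # results area
        Node("ul", 650.0, [Node("li", 120.0), Node("li", 120.0)]),
    ]),
    Node("div", 100.0),                        # sidebar
])
print(main_region_path(page)[0])   # -> /body/div/ul
```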