Ángel Francisco Zazo Rodríguez
University of Salamanca
Publication
Featured research published by Ángel Francisco Zazo Rodríguez.
Cross-Language Evaluation Forum | 2001
Carlos G. Figuerola; Raquel Gómez Díaz; Ángel Francisco Zazo Rodríguez; José Luis Alonso Berrocal
Most of the techniques used in Information Retrieval rely on the identification of terms from queries and documents, as much to carry out calculations based on the frequencies of these terms as to carry out comparisons between documents and queries. Terms coming from the same stem, either by morphological inflection or through derivation, can be presumed to have semantic proximity. The conflation of these words to a common form can produce improvements in retrieval. The stemming mechanisms used depend directly on each language. In this paper, a stemmer for Spanish and the tests conducted by applying it to the CLEF Spanish document collection are described, and the results are discussed.
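The conflation idea in this abstract can be illustrated with a minimal sketch. This is not the stemmer described in the paper: the suffix list, the minimum stem length and the function name `toy_stem` are assumptions chosen only to show how inflected and derived Spanish forms can be reduced to a shared form.

```python
# Toy suffix-stripping conflation for Spanish (illustrative only).
# The suffix list and MIN_STEM are hypothetical, not the paper's rules.

# Map accented vowels to plain vowels; "ñ" is deliberately left untouched.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")

# Longest suffixes are tried first so "ciones" wins over "es" or "s".
SUFFIXES = sorted(
    ["ciones", "mente", "cion", "ito", "ita", "es", "as", "os", "s", "a", "o"],
    key=len,
    reverse=True,
)

MIN_STEM = 3  # never strip a suffix if fewer than 3 characters would remain


def toy_stem(word):
    """Return a crude stem: lowercase, remove accents, strip one suffix."""
    normalized = word.lower().translate(ACCENTS)
    for suffix in SUFFIXES:
        if normalized.endswith(suffix) and len(normalized) - len(suffix) >= MIN_STEM:
            return normalized[: -len(suffix)]
    return normalized


if __name__ == "__main__":
    for w in ["libro", "libros", "librito", "canción", "canciones"]:
        print(w, "->", toy_stem(w))
```

A real stemmer needs far more rules and exception handling; the sketch only shows why terms sharing a stem end up indexed under the same form.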
Journal of Documentation | 2001
Carlos G. Figuerola; Ángel Francisco Zazo Rodríguez; José Luis Alonso Berrocal
Automatic categorisation can be understood as a learning process during which a program recognises the characteristics that distinguish each category or class from others, i.e. those characteristics which the documents should have in order to belong to that category. As yet few experiments have been carried out with documents in Spanish. Here we show the possibilities of elaborating pattern vectors that include the characteristics of different classes or categories of documents, using techniques based on those applied to the expansion of queries by relevance; likewise, the results of applying these techniques to a collection of documents in Spanish are given. The same collection of documents was categorised manually and the results of both procedures were compared.
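As a rough illustration of pattern vectors built with relevance-feedback techniques, the sketch below uses a Rocchio-style formula: terms from documents in a class are added, terms from documents outside it are subtracted, and new documents are assigned to the most similar pattern. The weights `beta` and `gamma`, the raw term-frequency weighting and the tiny training set are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector from whitespace tokenization."""
    return Counter(text.lower().split())


def build_pattern(in_class, out_class, beta=1.0, gamma=0.25):
    """Rocchio-style pattern: reward in-class terms, penalize out-of-class terms."""
    pattern = Counter()
    for doc in in_class:
        for term, freq in term_vector(doc).items():
            pattern[term] += beta * freq / len(in_class)
    for doc in out_class:
        for term, freq in term_vector(doc).items():
            pattern[term] -= gamma * freq / len(out_class)
    # Keep only terms whose weight is still positive.
    return {term: w for term, w in pattern.items() if w > 0}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def categorise(text, patterns):
    """Assign the category whose pattern vector is closest to the document."""
    vec = term_vector(text)
    return max(patterns, key=lambda label: cosine(vec, patterns[label]))


# Hypothetical miniature training collection.
TRAINING = {
    "deportes": ["el equipo gana el partido", "partido de fútbol del equipo"],
    "economia": ["el banco sube los tipos", "la bolsa y el banco"],
}

patterns = {
    label: build_pattern(
        docs,
        [d for other, ds in TRAINING.items() if other != label for d in ds],
    )
    for label, docs in TRAINING.items()
}
```

Calling `categorise("el equipo juega un partido", patterns)` picks the category whose pattern shares the most weighted terms with the text.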
Practical Applications of Agents and Multi-Agent Systems | 2011
Carlos G. Figuerola; Raquel Gómez Díaz; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez
The web is the largest repository of documents available. To retrieve documents from it for various purposes, we must use crawlers that navigate autonomously, select documents and process them according to the objectives pursued. However, it is apparent, even intuitively, that a significant number of documents are replicated, more or less abundantly. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. Standard hashing techniques only detect exact duplicates, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but with different headers or meta tags, or rendered with different style sheets. A frequent case is that of the same document in different formats; in these cases the documents are completely different at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but these conversions are never identical, because converters treat various formatting elements differently (textual characters, diacritics, spacing, paragraphs ...). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.). A comparison between two fingerprints can then give an estimate of the closeness or distance between the two files or documents. Based on the concept of “rolling hash”, fuzzy hashing has been used successfully in computer security tasks, such as identifying malware, spam, virus scanning, etc.
We have added fuzzy hashing capabilities to a lightweight crawler and have run several tests in a heterogeneous network domain, consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
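The fingerprinting idea can be sketched in a simplified form. The code below is not the tool used in the paper: it cuts a text into pieces wherever a rolling sum over the last few bytes hits a trigger value, hashes each piece, and compares two documents by the overlap of their piece hashes. The constants `WINDOW` and `TRIGGER`, the use of SHA-256 and the Jaccard comparison are assumptions; established fuzzy-hashing tools build an ordered signature and compare signatures differently.

```python
import hashlib

WINDOW = 7    # number of bytes covered by the rolling sum (assumed value)
TRIGGER = 16  # a piece ends when rolling_sum % TRIGGER == TRIGGER - 1 (assumed)


def fingerprint(text):
    """Return the set of short hashes of content-defined pieces of the text."""
    data = text.encode("utf-8")
    pieces = set()
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Maintain the sum of the last WINDOW bytes.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # Boundaries depend only on local content, so an edit in one place
        # leaves pieces elsewhere in the text unchanged.
        if rolling % TRIGGER == TRIGGER - 1:
            pieces.add(hashlib.sha256(data[start:i + 1]).hexdigest()[:8])
            start = i + 1
    if start < len(data):
        pieces.add(hashlib.sha256(data[start:]).hexdigest()[:8])
    return pieces


def similarity(text_a, text_b):
    """Jaccard overlap of the two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(text_a), fingerprint(text_b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

A crawler could flag two pages as near-duplicates when `similarity` exceeds a threshold; the appropriate threshold is an empirical question of the kind the tests described above address.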
Cross-Language Evaluation Forum | 2000
Carlos G. Figuerola; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez; Raquel Gómez Díaz
This paper describes our participation in the CLEF bilingual retrieval task (formulating queries in Spanish to retrieve documents in English), using an information retrieval (IR) system based on the vector model. Our aim was to use a simple approach to solve the problem, without expecting to obtain great results, especially owing to the short time available. The queries formulated in Spanish were translated to English by a commercial machine translation system. The translations were filtered to eliminate stop words, and then the remaining terms were stemmed using a standard stemmer. Results were poorer than those obtained through monolingual retrieval with original English queries, the difference being slightly over 15%.
Cross-Language Evaluation Forum | 2008
Carlos G. Figuerola; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez; Montserrat Mateos
This year's WebCLEF task was to retrieve snippets and pieces from documents on various topics. The extraction and the choice of the most widely used snippets can be carried out using various methods. However, the way in which web pages are usually converted to plain text introduces a series of problems that cause inefficiency in the retrieval. Duplicate information and absolutely irrelevant or even meaningless snippets are some of these problems. This paper also aims to explore the real impact of using several languages on obtaining relevant fragments.
Cross-Language Evaluation Forum | 2006
Carlos G. Figuerola; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez; Emilio Rodríguez
This article describes the participation of the REINA Research Group of the University of Salamanca in WebCLEF 2006. This year we participated in the Monolingual Mixed Task in Spanish. The entire EuroGOV collection was processed to select all the pages in Spanish. All the pages with domain .es were also pre-selected. Our objective this year was to test pre-retrieval techniques for combining information fields or elements from web pages, and to assess the retrieval capability of these fields. In vector-based retrieval systems, terms coming from different sources can be combined by operating on the frequency of the terms in the document using a tf×idf weighting scheme. The BODY field is, of course, the most useful from the retrieval perspective, but the text of the backlinks brings considerable improvement. META fields or tags, however, contribute little to retrieval improvement.
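One way to combine fields by operating on term frequencies can be sketched as follows. The field names, the multipliers in `FIELD_WEIGHTS` and the plain logarithmic idf are assumptions for illustration; the abstract does not give the actual values or formula used.

```python
import math
from collections import Counter

# Hypothetical multipliers: each occurrence of a term in a field counts
# this many times toward the term frequency of the page.
FIELD_WEIGHTS = {"body": 1.0, "backlink_text": 2.0, "meta": 0.5}


def combined_tf(page):
    """Merge the term frequencies of all fields of a page into one vector."""
    tf = Counter()
    for field, text in page.items():
        weight = FIELD_WEIGHTS.get(field, 0.0)
        for term in text.lower().split():
            tf[term] += weight
    return tf


def idf(term, collection):
    """log(N / df), where df counts pages containing the term in any field."""
    df = sum(
        1
        for page in collection
        if any(term in text.lower().split() for text in page.values())
    )
    return math.log(len(collection) / df) if df else 0.0


def tfidf_vector(page, collection):
    """Field-weighted tf multiplied by idf for every term of the page."""
    return {
        term: freq * idf(term, collection)
        for term, freq in combined_tf(page).items()
    }


# Hypothetical two-page collection.
pages = [
    {"body": "ayuntamiento de salamanca", "backlink_text": "salamanca"},
    {"body": "ministerio de hacienda", "backlink_text": "hacienda"},
]
weights = tfidf_vector(pages[0], pages)
```

Here "salamanca" receives a higher weight than "ayuntamiento" because the backlink text contributes additional frequency before the idf factor is applied.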
Revista General de Información y Documentación | 2001
José Luis Alonso Berrocal; Carlos G. Figuerola; Ángel Francisco Zazo Rodríguez
This paper introduces the power laws enunciated by Michalis Faloutsos, which allow us to characterize the Web through the analysis of its topology. Their most important characteristics are described, along with how to calculate some of the values of the most interesting functions.
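A power law has the form y = c·x^k, so it appears as a straight line on log-log axes and its exponent can be estimated as the slope of that line. The sketch below fits the exponent by ordinary least squares on the logarithms; the function name `fit_power_law` and the synthetic data are assumptions, and this is only one common estimation approach, not necessarily the calculation presented in the article.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**k by least squares on (log x, log y); return (k, c)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    if any(x <= 0 for x in xs) or any(y <= 0 for y in ys):
        raise ValueError("all values must be positive to take logarithms")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    exponent = sxy / sxx
    constant = math.exp(mean_y - exponent * mean_x)
    return exponent, constant


# Synthetic example: frequencies that follow y = 50 * x**-2 exactly.
degrees = [1, 2, 4, 8, 16]
frequencies = [50 * d ** -2.0 for d in degrees]
exponent, constant = fit_power_law(degrees, frequencies)
```

With real web-graph data the points scatter around the line, and the quality of the fit should be checked before the exponent is interpreted.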
Cross-Language Evaluation Forum | 2008
Carlos G. Figuerola; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez
This year's WebCLEF task was to retrieve snippets and pieces from documents on various topics. The extraction and the choice of the most widely used snippets can be carried out using various methods. This article illustrates the segmentation process and the choice of snippets produced in this process. It also describes the tests carried out and their results.
Archive | 2004
Ángel Francisco Zazo Rodríguez; Carlos García-Figuerola Paniagua; José Luis Alonso Berrocal
BiD: Textos Universitaris de Biblioteconomia i Documentació | 2000
Carlos G. Figuerola; José Luis Alonso Berrocal; Ángel Francisco Zazo Rodríguez