Víctor M. Prieto
University of A Coruña
Publication
Featured research published by Víctor M. Prieto.
PLOS ONE | 2014
Víctor M. Prieto; Sérgio Matos; Manuel Álvarez; Fidel Cacheda; José Luís Oliveira
With the proliferation of social networks and blogs, the Internet is increasingly being used to disseminate personal health information rather than just as a source of information. In this paper we exploit the wealth of user-generated data, available through the micro-blogging service Twitter, to estimate and track the incidence of health conditions in society. The method has two stages: we first extract possibly relevant tweets using a set of specially crafted regular expressions, and then classify these initial messages using machine learning methods. We also select relevant features to improve the results and the execution times. To test the method, we considered four health states or conditions, namely flu, depression, pregnancy and eating disorders, and two locations, Portugal and Spain. We present the results obtained and demonstrate that both the detection results and the performance of the method improve after feature selection. The results are promising, with areas under the receiver operating characteristic curve between 0.7 and 0.9, and F-measure values around 0.8 and 0.9. This indicates that such an approach provides a feasible solution for measuring and tracking the evolution of health states within society.
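The two-stage idea described in this abstract (a regular-expression filter followed by a classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the patterns, the toy training tweets and the choice of a small Naive Bayes classifier are all assumptions made for the example, since the abstract does not specify the actual expressions, features or learning algorithms.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stage-1 patterns; the paper's real expressions are not given here.
FLU_PATTERNS = [
    re.compile(r"\b(flu|influenza)\b", re.IGNORECASE),
    re.compile(r"\bfever\b", re.IGNORECASE),
]


def stage1_filter(tweets):
    """Stage 1: keep tweets that match at least one pattern."""
    return [t for t in tweets if any(p.search(t) for p in FLU_PATTERNS)]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Stage 2 stand-in: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labelled data (invented for illustration): True = reports having the flu.
train_texts = [
    "i have the flu and a fever",
    "in bed with influenza and a fever",
    "saturday night fever is a great movie",
    "flu season news on tv tonight",
]
train_labels = [True, True, False, False]
model = NaiveBayes().fit(train_texts, train_labels)

incoming = [
    "I am in bed with the flu",
    "Lovely weather today",
    "Saturday Night Fever soundtrack on repeat",
]
candidates = stage1_filter(incoming)               # stage 1: regex filter
predictions = [model.predict(t) for t in candidates]  # stage 2: classification
print(list(zip(candidates, predictions)))
```

The filter is deliberately permissive, so it lets through tweets that merely mention a keyword. The second stage then has to separate genuine reports from incidental mentions.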
Journal of Systems and Software | 2013
Víctor M. Prieto; Manuel Álvarez; Fidel Cacheda
Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies on characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper analyses different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics that can partially detect them. We also discuss and explain several heuristics from the point of view of their effectiveness and computational efficiency. Taking these into account, we study several sets of heuristics and demonstrate how they improve on current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system on several well-known Web Spam datasets and found it to be very effective.
Practical Applications of Agents and Multi-Agent Systems | 2012
Víctor M. Prieto; Manuel Álvarez; Rafael López-García; Fidel Cacheda
The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the “client-side” Hidden Web. To that end, we accomplish several tasks. First, we perform a thorough analysis of the different client-side technologies and the main features of Web 2.0 pages in order to determine the initial levels of the scale. Second, we submit a Web site whose purpose is to check which crawlers are capable of dealing with those technologies and features. Third, we propose several methods to evaluate the performance of the crawlers on the Web site and to classify them according to the levels of the scale. Fourth, we show the results of applying those methods to some open-source and commercial crawlers, as well as to the robots of the main Web search engines.
International Conference on Innovative Computing Technology | 2013
Víctor M. Prieto; Manuel Álvarez; Fidel Cacheda
The WWW is continuously growing, but not always in the best way, owing to the proliferation of garbage content such as Web Spam pages, duplicate content and dead links. Some web servers do not use the appropriate HTTP response code for dead links, so these links are incorrectly identified, which causes problems for search engines. Our analysis has revealed that 7.35% of web servers send a 200 HTTP code when a request for an unknown document is received, instead of a 404 code, which indicates that the document was not found. These web pages are known as Soft-404 pages. Soft-404 pages are a problem for search engines and their crawling modules, which process and index these pages and waste resources as a result. Few studies analyse this problem and try to solve it. In this article we propose a new detection system for Soft-404 pages, called Soft404Detector, which uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that our system is very effective, achieving a precision of 0.992 and a recall of 0.980 for Soft-404 pages.
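The server behaviour described in this abstract (answering 200 instead of 404 for an unknown document) can be illustrated with a simple probe. This sketch is not Soft404Detector, which according to the abstract relies on content-based heuristics and a C4.5 classifier. It only checks the status-code symptom, and the function names and the `fetch_status` hook are hypothetical.

```python
import uuid
from urllib.parse import urljoin


def random_missing_url(base_url):
    """Build a URL on the same host that almost certainly does not exist."""
    return urljoin(base_url, f"/{uuid.uuid4().hex}.html")


def returns_200_for_unknown(base_url, fetch_status):
    """Return True if the server answers 200 for a document that should be missing.

    fetch_status is a caller-supplied callable (hypothetical hook) that takes
    a URL and returns the integer HTTP status code of the response.
    """
    return fetch_status(random_missing_url(base_url)) == 200


# Stand-in fetchers for illustration; a real one would issue an HTTP request.
def misconfigured_server(url):
    return 200


def well_behaved_server(url):
    return 404


print(returns_200_for_unknown("http://example.com/index.html", misconfigured_server))
print(returns_200_for_unknown("http://example.com/index.html", well_behaved_server))
```

Passing the fetcher in as a parameter keeps the sketch independent of any HTTP library. A real fetcher would also need to decide how to treat redirects, since a server may redirect unknown URLs to a page that itself returns 200.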
PACBB | 2013
Víctor M. Prieto; Sérgio Matos; Manuel Álvarez; Fidel Cacheda; José Luís Oliveira
The Internet constitutes a huge source of information that can be exploited by individuals in many different ways. With the increasing use of social networks and blogs, the Internet is now used not only as an information source but also to disseminate personal health information. In this paper we exploit the wealth of user-generated data, available through the micro-blogging service Twitter, to estimate and track the incidence of health conditions in society, specifically in Portugal and Spain. We present results for the acquisition of relevant tweets for a set of four different conditions (flu, depression, pregnancy and eating disorders) and for the binary classification of these tweets as relevant or not for each case. The results obtained, ranging in AUC from 0.7 to 0.87, are very promising and indicate that such an approach provides a feasible solution for measuring and tracking the evolution of many health-related aspects within society.
Information Retrieval Facility Conference | 2012
Víctor M. Prieto; Manuel Álvarez; Rafael López-García; Fidel Cacheda
Computer Science and Information Systems | 2012
Víctor M. Prieto; Manuel Álvarez; Rafael López-García; Fidel Cacheda
Computer Science and Information Systems | 2015
Víctor M. Prieto; Manuel Álvarez; Victor Carneiro; Fidel Cacheda
Journal of Digital Information Management | 2014
Víctor M. Prieto; Manuel Álvarez; Fidel Cacheda
KDIR | 2012
Víctor M. Prieto; Manuel Álvarez; Rafael López-García; Fidel Cacheda