José María Gómez Hidalgo
European University of Madrid
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by José María Gómez Hidalgo.
document engineering | 2006
José María Gómez Hidalgo; Guillermo Cajigas Bringas; Enrique Puertas Sanz; Francisco Carrero García
In the recent years, we have witnessed a dramatic increment in the volume of spam email. Other related forms of spam are increasingly revealing as a problem of importance, specially the spam on Instant Messaging services (the so called SPIM), and Short Message Service (SMS) or mobile spam.Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam, can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of messages representation techniques and Machine Learning algorithms, in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.
conference on information and knowledge management | 2007
Gordon V. Cormack; José María Gómez Hidalgo; Enrique Puertas Sanz
We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. Short messages often consist of only a few words, and therefore present a challenge to traditional bag-of-words based spam filters. Using three corpora of short messages and message fields derived from real SMS, blog, and spam messages, we evaluate feature-based and compression-model-based spam filters. We observe that bag-of-words filters can be improved substantially using different features, while compression-model filters perform quite well as-is. We conclude that content filtering for short messages is surprisingly effective.
document engineering | 2011
Tiago A. Almeida; José María Gómez Hidalgo; Akebo Yamakami
The growth of mobile phone users has lead to a dramatic increasing of SMS spam messages. In practice, fighting mobile phone spam is difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. On the other hand, in academic settings, a major handicap is the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. Moreover, as SMS messages are fairly short, content-based spam filters may have their performance degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is the largest one as far as we know. Moreover, we compare the performance achieved by several established machine learning methods. The results indicate that Support Vector Machine outperforms other evaluated classifiers and, hence, it can be used as a good baseline for further comparison.
acm symposium on applied computing | 2002
José María Gómez Hidalgo
In the recent years, Unsolicited Bulk Email has became an increasingly important problem, with a big economic impact. In this paper, we discuss cost-sensitive Text Categorization methods for UBE filtering. In concrete, we have evaluated a range of Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector Machines and Rocchio), made cost sensitive through several methods (Threshold Optimization, Instance Weighting, and Meta-Cost). We have used the Receiver Operating Characteristic Convex Hull method for the evaluation, that best suits classification problems in which target conditions are not known, as it is the case. Our results do not show a dominant algorithm nor method for making algorithms cost-sensitive, but are the best reported on the test collection used, and approach real-world hand-crafted classifiers accuracy.
international acm sigir conference on research and development in information retrieval | 2007
Gordon V. Cormack; José María Gómez Hidalgo; Enrique Puertas Sanz
Mobile spam in an increasing threat that may be addressed using filtering systems like those employed against email spam. We believe that email filtering techniques require some adaptation to reach good levels of performance on SMS spam, especially regarding message representation. In order to test this assumption, we have performed experiments on SMS filtering using top performing email spam filters on mobile spam messages using a suitable feature representation, with results supporting our hypothesis.
conference on computational natural language learning | 2000
José María Gómez Hidalgo; Manual Maña López; Enrique Puertas Sanz
Spam filtering is a text categorization task that shows especial features that make it interesting and difficult. First, the task has been performed traditionally using heuristics from the domain. Second, a cost model is required to avoid misclassification of legitimate messages. We present a comparative evaluation of several machine learning algorithms applied to spam filtering, considering the text of the messages and a set of heuristics for the task. Cost-oriented biasing and evaluation is performed.
Advances in Computers | 2008
Enrique Puertas Sanz; José María Gómez Hidalgo; José Carlos Cortizo Pérez
Abstract In recent years, email spam has become an increasingly important problem, with a big economic impact in society. In this work, we present the problem of spam, how it affects us, and how we can fight against it. We discuss legal, economic, and technical measures used to stop these unsolicited emails. Among all the technical measures, those based on content analysis have been particularly effective in filtering spam, so we focus on them, explaining how they work in detail. In summary, we explain the structure and the process of different Machine Learning methods used for this task, and how we can make them to be cost sensitive through several methods like threshold optimization, instance weighting, or MetaCost. We also discuss how to evaluate spam filters using basic metrics, TREC metrics, and the receiver operating characteristic convex hull method, that best suits classification problems in which target conditions are not known, as it is the case. We also describe how actual filters are used in practice. We also present different methods used by spammers to attack spam filters and what we can expect to find in the coming years in the battle of spam filters against spammers.
european conference on research and advanced technology for digital libraries | 1999
Manuel J. Maña López; Manuel de Buenaga Rodríguez; José María Gómez Hidalgo
Textual information available has grown so much as to make necessary to study new techniques that assist users in information access (IA). In this paper, we propose utilizing a user directed summarization system in an IA setting for helping users to decide about document relevance. The summaries are generated using a sentence extraction method that scores the sentences performing some heuristics employed successfully in previous works (keywords, title and location). User modeling is carried out exploiting users query to an IA system and expanding query terms using WordNet. We present an objective and systematic evaluation method oriented to measure the summary effectiveness in two IA significant tasks: ad hoc retrieval and relevance feedback. Results obtained prove our initial hypothesis, i.e., user adapted summaries are a useful tool assisting users in an IA context.
international conference on machine learning and applications | 2012
José María Gómez Hidalgo; Tiago A. Almeida; Akebo Yamakami
Mobile phones are becoming the latest target of electronic junk mail. Recent reports clearly indicate that the volume of SMS spam messages are dramatically increasing year by year. Probably, one of the major concerns in academic settings was the scarcity of public SMS spam datasets, that are sorely needed for validation and comparison of different classifiers. To address this issue, we have recently proposed a new SMS Spam Collection that, to the best of our knowledge, is the largest, public and real SMS dataset available for academic studies. However, as it has been created by augmenting a previously existing database built using roughly the same sources, it is sensible to certify that there are no duplicates coming from them. So, in this paper we offer a comprehensive analysis of the new SMS Spam Collection in order to ensure that this does not happen, since it may ease the task of learning SMS spam classifiers and, hence, it could compromise the evaluation of methods. The analysis of results indicate that the procedure followed does not lead to near-duplicates and, consequently, the proposed dataset is reliable to use for evaluating and comparing the performance achieved by different classifiers.
international conference natural language processing | 2005
José María Gómez Hidalgo; Manuel de Buenaga Rodríguez; José Carlos Cortizo Pérez
Automated Text Categorization has reached the levels of accuracy of human experts. Provided that enough training data is available, it is possible to learn accurate automatic classifiers by using Information Retrieval and Machine Learning Techniques. However, performance of this approach is damaged by the problems derived from language variation (specially polysemy and synonymy). We investigate how Word Sense Disambiguation can be used to alleviate these problems, by using two traditional methods for thesaurus usage in Information Retrieval, namely Query Expansion and Concept Indexing. These methods are evaluated on the problem of using the Lexical Database WordNet for text categorization, focusing on the Word Sense Disambiguation step involved. Our experiments demonstrate that rather simple dictionary methods, and baseline statistical approaches, can be used to disambiguate words and improve text representation and learning in both Query Expansion and Concept Indexing approaches.