Eva Lorenzo Iglesias | Researchain

Publication

Featured research published by Eva Lorenzo Iglesias.

Expert Systems With Applications | 2007

Applying lazy learning algorithms to tackle concept drift in spam filtering

Florentino Fdez-Riverola; Eva Lorenzo Iglesias; Fernando Díaz; José Ramon Méndez; Juan M. Corchado

A large number of machine learning techniques have been applied to problems where data is collected over an extended period of time. However, a drawback in many real-world applications is that the distribution underlying the data is likely to change over time. In these situations, a problem that many global eager learners face is their inability to adapt to local concept drift. Concept drift in spam is particularly difficult to handle, as spammers actively change the nature of their messages to elude spam filters. Algorithms that track concept drift must be able to identify a change in the target concept (spam or legitimate e-mails) without direct knowledge of the underlying shift in distribution. In this paper we show how a previously successful instance-based reasoning e-mail filtering model can be improved in order to better track concept drift in the spam domain. Our proposal is based on the definition of two complementary techniques able to select both terms and e-mails representative of the current situation. The enhanced system is evaluated against other well-known successful lazy learning approaches in two scenarios, all within a cost-sensitive framework. The results obtained from the experiments are very promising and back up the idea that instance-based reasoning systems can offer a number of advantages when tackling concept drift in dynamic problems, as in the case of the anti-spam filtering domain.

European Conference on Information Retrieval | 2003

An efficient compression code for text databases

Nieves R. Brisaboa; Eva Lorenzo Iglesias; Gonzalo Navarro; José R. Paramá

We present a new compression format for natural language texts, allowing both exact and approximate search without decompression. This new code, called End-Tagged Dense Code, has some advantages with respect to other compression techniques with similar features, such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) a simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code, i.e., exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes.
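As a rough illustration of how an end-tagged dense code can work, the Python sketch below rests on assumptions the abstract does not state: a byte-oriented code, words identified by their 0-based frequency rank, the highest bit of a byte marking the last byte of a codeword, and the remaining seven bits used densely. The function names are hypothetical and this is not the authors' implementation.

```python
def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with the given 0-based frequency rank."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]       # last byte: tag bit set (value >= 128)
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)     # earlier bytes: tag bit clear (value < 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(data: bytes) -> list:
    """Recover the sequence of ranks from a concatenation of codewords."""
    ranks, acc = [], 0
    for b in data:
        if b >= 128:               # tagged byte closes the current codeword
            ranks.append(acc * 128 + (b - 128))
            acc = 0
        else:
            acc = acc * 128 + b + 1
    return ranks


def find_codeword(data: bytes, codeword: bytes) -> list:
    """Offsets where codeword occurs as a whole codeword, found by plain
    byte search on the compressed data (no decoding)."""
    hits = []
    start = data.find(codeword)
    while start != -1:
        # A true occurrence starts the stream or follows a tagged byte;
        # otherwise the match is only the tail of a longer codeword.
        if start == 0 or data[start - 1] >= 128:
            hits.append(start)
        start = data.find(codeword, start + 1)
    return hits
```

Under these assumptions the 128 most frequent words get one-byte codewords and the next 128² words get two-byte codewords. The boundary check in `find_codeword` is what lets an ordinary byte-level pattern matcher run directly on the compressed text.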

Decision Support Systems | 2007

SpamHunting: An instance-based reasoning system for spam labelling and filtering

Florentino Fdez-Riverola; Eva Lorenzo Iglesias; Fernando Díaz; José Ramon Méndez; Juan M. Corchado

In this paper we present an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learning approaches in the domain of anti-spam filtering. The architecture of the learning-based anti-spam filter is based on a tuneable enhanced instance retrieval network able to accurately generalize e-mail representations. The reuse of similar messages is carried out by a simple unanimous voting mechanism to determine whether the target case is spam or not. Before the system gives its final response, a revision stage is performed only when the assigned class is spam, in which the system employs general knowledge in the form of meta-rules.
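The unanimous voting step can be pictured with a small Python sketch. Everything here is a hypothetical simplification: retrieval is reduced to counting shared terms (the paper's enhanced instance retrieval network is not reproduced), the function names and toy case base are invented for illustration, and treating an empty retrieval as legitimate is an assumption.

```python
def retrieve(case_base, target_terms, k=3):
    """Labels of the k stored e-mails sharing the most terms with the target.

    case_base: list of (set_of_terms, label) pairs; cases with no shared
    term are never retrieved.
    """
    ranked = sorted(case_base,
                    key=lambda case: len(case[0] & target_terms),
                    reverse=True)
    return [label for terms, label in ranked[:k] if terms & target_terms]


def unanimous_vote(neighbour_labels):
    """Label the target as spam only if every retrieved neighbour is spam."""
    labels = list(neighbour_labels)
    if labels and all(label == "spam" for label in labels):
        return "spam"
    # Assumption: any disagreement, or no neighbours at all, yields
    # "legitimate", the cheaper error in a cost-sensitive setting.
    return "legitimate"


case_base = [
    ({"viagra", "cheap", "offer"}, "spam"),
    ({"meeting", "agenda", "offer"}, "legitimate"),
    ({"cheap", "pills", "offer"}, "spam"),
    ({"project", "report"}, "legitimate"),
]
target = {"cheap", "offer", "pills"}
decision_k2 = unanimous_vote(retrieve(case_base, target, k=2))
decision_k3 = unanimous_vote(retrieve(case_base, target, k=3))
```

With two neighbours both retrieved cases are spam, so the vote is unanimous. With three, one legitimate neighbour is enough to block the spam label.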

International Conference on Data Mining | 2006

A comparative performance study of feature selection methods for the anti-spam filtering domain

José Ramon Méndez; Florentino Fdez-Riverola; Fernando Díaz; Eva Lorenzo Iglesias; Juan M. Corchado

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ² test, Mutual Information and Document Frequency feature selection methods have been analysed in conjunction with Naive Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From the experiments carried out, the underlying ideas behind feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
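For readers unfamiliar with these measures, the sketch below computes Information Gain for a single term in its textbook form: the class entropy minus the weighted class entropy of the documents with and without the term. The data layout (a list of term-set and label pairs) and the function names are assumptions made for illustration, not the experimental setup of the paper.

```python
from math import log2


def entropy(counts):
    """Shannon entropy, in bits, of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(docs, term):
    """Information Gain of `term` over docs = [(set_of_terms, label), ...]."""
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, lab in subset if lab == label) for label in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    gain = entropy(class_counts(docs))
    for subset in (with_term, without_term):
        if subset:  # an empty split contributes nothing
            gain -= len(subset) / len(docs) * entropy(class_counts(subset))
    return gain


docs = [
    ({"cheap", "offer"}, "spam"),
    ({"cheap", "hello"}, "spam"),
    ({"agenda", "offer"}, "legitimate"),
    ({"agenda", "hello"}, "legitimate"),
]
gain_cheap = information_gain(docs, "cheap")   # term separates the classes
gain_offer = information_gain(docs, "offer")   # term is uninformative
```

Ranking terms by this score and keeping the top ones is one common way to apply the measure as a feature selector.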

Lecture Notes in Computer Science | 2006

Tracking concept drift at feature selection stage in SpamHunting: an anti-spam instance-based reasoning system

José Ramon Méndez; Florentino Fdez-Riverola; Eva Lorenzo Iglesias; Fernando Díaz; Juan M. Corchado

In this paper we propose a novel feature selection method able to handle concept drift problems in the spam filtering domain. The proposed technique is applied to a previously successful instance-based reasoning e-mail filtering system called SpamHunting. Our information criterion is based on several ideas extracted from the well-known information measure introduced by Shannon. We show how our previous system, in combination with the improved feature selection method, outperforms classical machine learning techniques and other well-known lazy learning approaches. In order to evaluate the performance of all the analysed models, we employ two different corpora and six well-known metrics in various scenarios.

CAEPIA'05 Proceedings of the 11th Spanish association conference on Current Topics in Artificial Intelligence | 2005

Tokenising, stemming and stopword removal on anti-spam filtering domain

José Ramon Méndez; Eva Lorenzo Iglesias; Florentino Fdez-Riverola; Fernando Díaz; Juan M. Corchado

Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. Nevertheless, preprocessing methods and noise reduction strategies used to enhance computational efficiency in text classification may not be as efficient in e-mail filtering. This is demonstrated here through a comparative study of the use of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based techniques for spam filtering (machine learning approaches and case-based systems). Sound conclusions are drawn from the experiments carried out, in which different scenarios are taken into consideration.

Expert Systems With Applications | 2013

An HMM-based over-sampling technique to improve text classification

Eva Lorenzo Iglesias; A. Seara Vieira; Lourdes Borrajo

This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), includes an HMM that is trained with a corpus in order to create new samples according to current documents. The HMM is treated as a document generator which can produce synthetic instances based on what it was trained with. To demonstrate its effectiveness, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics), and is then compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that the application of over-sampling strategies increases the global performance of the SVM when classifying documents. Based on the empirical and statistical studies, the new method clearly outperforms the baseline method (ROS), and offers greater performance than SMOTE in the majority of tested cases.
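To make the baseline concrete, here is a minimal Python sketch of Random Over-Sampling in its usual form, where randomly chosen minority-class samples are duplicated until every class matches the largest one. It illustrates only the ROS baseline, not COS-HMM; the function name, the fixed seed and the toy data are assumptions.

```python
import random
from collections import Counter


def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes are the same size."""
    counts = Counter(labels)
    if not counts:
        return [], []
    rng = random.Random(seed)          # fixed seed keeps the sketch repeatable
    target = max(counts.values())      # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # exact copy of an existing sample
            out_labels.append(label)
    return out_samples, out_labels


docs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
labels = ["neg", "neg", "neg", "neg", "neg", "pos", "pos"]
balanced_docs, balanced_labels = random_over_sample(docs, labels)
```

Because ROS only copies existing documents, it adds no new content to the minority class, which is the limitation that generative approaches such as the one described above aim to address.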

Computer Methods and Programs in Biomedicine | 2016

Improving the text classification using clustering and a novel HMM to reduce the dimensionality

A. Seara Vieira; Lourdes Borrajo; Eva Lorenzo Iglesias

In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of the classical structured representations can lead to burdensome computations due to the great size of real-world data. Consequently, there is a need to reduce the quantity of handled information to improve the classification process. In this paper, we propose a method to reduce the dimensionality of a classical text representation based on a clustering technique to group documents, and a previously developed Hidden Markov Model to represent them. We have run tests with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora using the proposed dimensionality reduction technique. The experimental results obtained are very satisfactory compared to commonly used techniques like InfoGain, and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step in a text classification task.

Applied Soft Computing | 2015

TCBR-HMM

Lourdes Borrajo; A. Seara Vieira; Eva Lorenzo Iglesias

Highlights: The paper presents an innovative solution to model distributed adaptive systems in biomedical environments. A Case Based Reasoning system with an original Hidden Markov Model for biomedical text classification is proposed. The model classifies scientific documents by their content, taking into account the relevance of words. The model is able to adapt to new documents in an iterative learning frame. The model is tested with the SVM and k-NN classifiers using the OHSUMED scientific collection. Empirical and statistical results show the method outperforms other efficient text classifiers.

This paper presents an innovative solution to model distributed adaptive systems in biomedical environments. We present an original TCBR-HMM (Text Case Based Reasoning-Hidden Markov Model) for biomedical text classification based on document content. The main goal is to propose a more effective classifier than current methods in this environment, where the model needs to be adapted to new documents in an iterative learning frame. To demonstrate its effectiveness, we include a set of experiments performed on the OHSUMED corpus. Our classifier is compared with Naive Bayes and SVM techniques, commonly used in text classification tasks. The results suggest that the TCBR-HMM model is indeed more suitable for document classification. The model is empirically and statistically comparable to the SVM classifier and outperforms it in terms of time efficiency.

PACBB | 2014

BioClass: A Tool for Biomedical Text Classification

R. Romero; A. Seara Vieira; Eva Lorenzo Iglesias; Lourdes Borrajo

Traditional search engines are not efficient enough to extract useful information from scientific text databases. Therefore, it is necessary to develop advanced information retrieval software tools that allow for further classification of scientific texts. The aim of this work is to present BioClass, a freely available graphical tool for biomedical text classification. With BioClass a user can parameterize, train and test different text classifiers to determine which technique performs better on a given document corpus. The framework includes data balancing and attribute reduction techniques to prepare the input data and improve the classification efficiency. Classification methods analyze documents by content and identify those that are best suited to the user's requirements. BioClass also offers graphical interfaces that make it simple to draw conclusions.
