Julián Luengo
University of Burgos
Publications
Featured research published by Julián Luengo.
Information Sciences | 2010
Salvador García; Alberto Fernández; Julián Luengo; Francisco Herrera
Experimental analysis of the performance of a proposed method is a crucial and necessary task in an investigation. In this paper, we focus on the use of nonparametric statistical inference for analyzing the results obtained in an experimental design in the field of computational intelligence. We present a case study involving a set of techniques for classification tasks, and we study a set of nonparametric procedures useful for analyzing the behavior of a method with respect to a set of algorithms, as in the framework in which a new proposal is developed. In particular, we discuss some basic and advanced nonparametric approaches that improve on the results offered by the Friedman test in some circumstances. A set of post hoc procedures for multiple comparisons is presented together with the computation of adjusted p-values. We also perform an experimental analysis comparing their power, with the objective of detecting the advantages and disadvantages of the statistical tests described. We found that aspects such as the number of algorithms, the number of data sets, and the differences in performance offered by the control method strongly influence the statistical tests studied. Our final goal is to offer a complete guideline for the use of nonparametric statistical procedures for performing multiple comparisons in experimental studies.
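The workflow this abstract describes can be sketched briefly in Python. Below is a minimal, hypothetical illustration (not the authors' code): SciPy's Friedman test on an accuracy matrix, followed by a hand-rolled Holm step-down adjustment of pairwise z-test p-values against the best-ranked (control) algorithm; the accuracy values are invented.

```python
# Minimal sketch of nonparametric multiple comparisons; the accuracy
# matrix (rows = data sets, columns = algorithms) is hypothetical.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm

results = np.array([
    [0.81, 0.79, 0.85],
    [0.72, 0.70, 0.74],
    [0.90, 0.88, 0.93],
    [0.66, 0.64, 0.69],
    [0.78, 0.77, 0.82],
])

# Friedman test: do the algorithms differ at all?
stat, p = friedmanchisquare(*results.T)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Post hoc z-tests against the control (best-ranked) algorithm,
# with Holm's step-down adjustment of the p-values.
n, k = results.shape
avg_ranks = rankdata(-results, axis=1).mean(axis=0)   # rank 1 = best
control = int(np.argmin(avg_ranks))
se = np.sqrt(k * (k + 1) / (6.0 * n))
raw_p = 2 * (1 - norm.cdf(np.abs(avg_ranks - avg_ranks[control]) / se))
others = sorted((i for i in range(k) if i != control), key=lambda i: raw_p[i])
running = 0.0
for step, i in enumerate(others):
    running = max(running, (len(others) - step) * raw_p[i])
    print(f"algorithm {i} vs control {control}: adjusted p = {min(running, 1.0):.4f}")
```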
Soft Computing | 2009
Salvador García; Alberto Fernández; Julián Luengo; Francisco Herrera
The experimental analysis of the performance of a proposed method is a crucial and necessary task in research. This paper focuses on the statistical analysis of results in the field of genetics-based machine learning. It presents a study involving a set of techniques that can be used for a rigorous comparison among algorithms, in terms of obtaining successful classification models. Two accuracy measures for multi-class problems have been employed: classification rate and Cohen's kappa. Furthermore, two interpretability measures have been employed: size of the rule set and number of antecedents. We have studied whether the samples of results obtained by genetics-based classifiers, using the performance measures cited above, satisfy the necessary conditions for being analyzed by means of parametric tests. The results obtained show that the fulfillment of these conditions is problem-dependent and inconclusive, which supports the use of non-parametric statistics in the experimental analysis. In addition, non-parametric tests can be satisfactorily employed for comparing generic classifiers over various data sets considering any performance measure. On these grounds, we propose the use of the most powerful non-parametric statistical tests to carry out multiple comparisons. However, the statistical analysis conducted on interpretability must be considered carefully.
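For reference, both accuracy measures named above are available directly in scikit-learn; the label vectors below are invented for illustration.

```python
# Classification rate and Cohen's kappa on hypothetical multi-class labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 1, 1, 2, 2, 2, 0, 1, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1, 0, 1, 2]

print("classification rate:", accuracy_score(y_true, y_pred))   # 0.8
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))      # chance-corrected agreement
```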
IEEE Transactions on Knowledge and Data Engineering | 2013
Salvador García; Julián Luengo; José A. Sáez; Victoria López; Francisco Herrera
Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones by associating categorical values with intervals, thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals for discretization, and some attempts to organize them into a taxonomy can be found. However, previous papers lack consensus in the definition of the properties, and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from both a theoretical and an empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all known methods to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers, and a large number of data sets. Their performance, measured in terms of accuracy, number of intervals, and inconsistency, has been verified by means of nonparametric statistical tests, and a set of discretizers is highlighted as the best performing.
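As a concrete illustration of the preprocessing task surveyed here, the sketch below discretizes one continuous attribute with scikit-learn's KBinsDiscretizer; it is a generic example (equal-width vs. equal-frequency cut points), not one of the discretizers from the paper's taxonomy, and the data are invented.

```python
# Turning a continuous attribute into 3 intervals, two classic ways.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.2], [1.7], [2.9], [4.4], [5.1], [7.8], [9.6]])

# "uniform" = equal-width intervals; "quantile" = equal-frequency intervals.
for strategy in ("uniform", "quantile"):
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    codes = disc.fit_transform(X).ravel()
    print(strategy, "->", codes, "cut points:", disc.bin_edges_[0])
```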
IEEE Transactions on Evolutionary Computation | 2010
Alberto Fernández; Salvador García; Julián Luengo; Ester Bernadó-Mansilla; Francisco Herrera
The classification problem can be addressed by numerous techniques and algorithms belonging to different paradigms of machine learning. In this paper, we are interested in evolutionary algorithms, the so-called genetics-based machine learning (GBML) algorithms. In particular, we focus on evolutionary approaches that evolve a set of rules, i.e., evolutionary rule-based systems, applied to classification tasks, in order to provide a state of the art in this field. This paper has a twofold aim: to present a taxonomy of the GBML approaches for rule induction, and to develop an empirical analysis both for standard classification and for classification with imbalanced data sets. We also include a comparative study of the GBML methods with some classical non-evolutionary algorithms, in order to observe the suitability and potential of the search performed by evolutionary algorithms and the behavior of the GBML algorithms in contrast to the classical approaches, in terms of classification accuracy.
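To make the rule-evolution idea concrete, here is a deliberately tiny, hypothetical sketch: a single interval rule over two features, improved by a (1+1)-style mutation loop that maximizes training accuracy. Real GBML systems (Pittsburgh- or Michigan-style) evolve whole rule sets with crossover and far richer representations; this only conveys the flavor.

```python
# Toy sketch: evolve one interval rule "predict 1 if x is inside the box".
import random

random.seed(0)
# Hypothetical 2-feature data: label 1 iff both features exceed 0.5.
X = [(random.random(), random.random()) for _ in range(200)]
y = [int(a > 0.5 and b > 0.5) for a, b in X]

def covers(rule, x):
    # A rule is one (low, high) interval per feature.
    return all(lo <= v <= hi for (lo, hi), v in zip(rule, x))

def fitness(rule):
    # Training accuracy of "predict 1 if covered, else 0".
    return sum(covers(rule, x) == bool(label) for x, label in zip(X, y)) / len(X)

def mutate(rule, step=0.1):
    # Perturb each bound, clip to [0, 1], keep low <= high.
    return tuple(
        tuple(sorted(min(max(b + random.uniform(-step, step), 0.0), 1.0)
                     for b in bounds))
        for bounds in rule
    )

best = ((0.0, 1.0), (0.0, 1.0))          # start with a rule covering everything
for _ in range(300):
    child = mutate(best)
    if fitness(child) >= fitness(best):  # (1+1) hill climbing
        best = child
print("evolved rule:", best, "accuracy:", round(fitness(best), 3))
```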
Expert Systems with Applications | 2009
Julián Luengo; Salvador García; Francisco Herrera
In this paper, we focus on the experimental analysis of the performance of artificial neural networks in classification tasks using statistical tests. In particular, we have studied whether the samples of results from multiple trials obtained by conventional artificial neural networks and support vector machines satisfy the necessary conditions for being analyzed through parametric tests. The study considers three sources of variation in classification experiments: random variation in the selection of test data, random variation in the selection of training data, and internal randomness in the learning algorithm. The results obtained show that the fulfillment of these conditions is problem-dependent and inconclusive, which justifies the need for non-parametric statistics in the experimental analysis.
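The condition checking described above typically involves normality and homoscedasticity tests. A minimal sketch with SciPy could look as follows; the per-trial accuracy samples are hypothetical.

```python
# Checking parametric-test preconditions on hypothetical per-trial accuracies.
from scipy.stats import shapiro, levene

ann_acc = [0.81, 0.84, 0.79, 0.83, 0.80, 0.82, 0.78, 0.85, 0.81, 0.83]
svm_acc = [0.86, 0.88, 0.85, 0.87, 0.84, 0.89, 0.86, 0.85, 0.88, 0.87]

# Normality of each sample (Shapiro-Wilk).
for name, sample in (("ANN", ann_acc), ("SVM", svm_acc)):
    stat, p = shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p:.3f} (p < 0.05 suggests non-normality)")

# Equality of variances across samples (Levene).
stat, p = levene(ann_acc, svm_acc)
print(f"Levene p = {p:.3f} (p < 0.05 suggests unequal variances)")
```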
Information Sciences | 2015
José A. Sáez; Julián Luengo; Jerzy Stefanowski; Francisco Herrera
Noisy and borderline examples in imbalanced datasets harm classifier performance. Our proposal reduces the noise, makes the class boundaries more regular, and performs better than other re-sampling methods in this scenario; ensemble-based noise filters work well when dealing with noise, and iterative noise elimination is a good approach to handling noisy datasets. Classification datasets often have an unequal class distribution among their examples; this problem is known as imbalanced classification. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the best-known data pre-processing methods to cope with it and to balance the different numbers of examples of each class. However, as recent works claim, class imbalance is not a problem in itself, and performance degradation is also associated with other factors related to the distribution of the data. One of these is the presence of noisy and borderline examples, the latter lying in the areas surrounding class boundaries. Certain intrinsic limitations of SMOTE can aggravate the problem produced by these types of examples, and current generalizations of SMOTE are not correctly adapted to their treatment. This paper proposes the extension of SMOTE through a new element, an iterative ensemble-based noise filter called the Iterative-Partitioning Filter (IPF), which can overcome the problems produced by noisy and borderline examples in imbalanced datasets. This extension results in SMOTE-IPF. The properties of this proposal are discussed in a comprehensive experimental study, in which it is compared against basic SMOTE and its best-known generalizations. The experiments are carried out both on a set of synthetic datasets with different levels of noise and shapes of borderline examples and on real-world datasets; furthermore, the impact of introducing additional types and levels of noise into these real-world data is studied. The results show that the new proposal performs better than existing SMOTE generalizations for all these scenarios. The analysis of these results also helps to identify the characteristics of IPF which differentiate it from other filtering approaches.
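A rough sketch of the pipeline's shape, assuming the imbalanced-learn and scikit-learn libraries: SMOTE oversampling followed by a filtering pass in which out-of-fold decision-tree predictions flag suspect examples. The real IPF is an iterative, ensemble-voting filter; this single-pass stand-in (and the synthetic data) only conveys the idea, and is not the authors' code.

```python
# Simplified SMOTE + noise-filter pipeline on a synthetic imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=500, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)

# 1) Balance the classes with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# 2) One filtering pass: remove examples misclassified out-of-fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(DecisionTreeClassifier(random_state=0),
                         X_bal, y_bal, cv=cv)
keep = pred == y_bal
X_clean, y_clean = X_bal[keep], y_bal[keep]
print(f"kept {keep.sum()} of {len(y_bal)} examples after filtering")
```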
Knowledge and Information Systems | 2012
Julián Luengo; Salvador García; Francisco Herrera
In real-life data, information is frequently lost in data mining due to the presence of missing values in attributes. Several schemes have been studied to overcome the drawbacks produced by missing values in data mining tasks; one of the best known is based on preprocessing, commonly known as imputation. In this work, we focus on a classification task in which twenty-three classification methods and fourteen different imputation approaches to missing value treatment are presented and analyzed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of particular missing value imputation methods can improve the accuracy obtained for these methods. This study establishes the convenience of using imputation methods to preprocess data sets with missing values. The analysis suggests that the use of particular imputation methods conditioned on the group is required.
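As a small illustration of the preprocessing step discussed here, two common imputation strategies from scikit-learn are shown below on an invented array with missing entries; the fourteen methods analyzed in the paper are not reproduced.

```python
# Mean imputation vs. k-NN imputation of missing attribute values.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))  # per-column mean
print(KNNImputer(n_neighbors=2).fit_transform(X))       # neighbors' average
```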
Soft Computing | 2011
Julián Luengo; Alberto Fernández; Salvador García; Francisco Herrera
In the classification framework, there are problems in which the number of examples per class is not equitably distributed, commonly known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A usual approach to dealing with the problem of imbalanced data sets is the use of a preprocessing step. In this paper, we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, several oversampling techniques and an evolutionary undersampling method have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
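One widely used data complexity measure of the kind employed above is Fisher's maximum discriminant ratio (commonly denoted F1); a minimal two-class implementation might look like the sketch below, on invented data. The interval coding and rule extraction from the paper are not reproduced.

```python
# Fisher's maximum discriminant ratio (F1) for a two-class problem.
import numpy as np

def fisher_ratio(X, y):
    """Max over features of (mu0 - mu1)^2 / (var0 + var1)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c0, c1 = X[y == 0], X[y == 1]
    num = (c0.mean(axis=0) - c1.mean(axis=0)) ** 2
    den = c0.var(axis=0) + c1.var(axis=0)
    return float(np.max(num / den))

X = [[0.1, 5.0], [0.3, 4.8], [0.2, 5.1],   # class 0
     [0.9, 5.0], [1.1, 4.9], [1.0, 5.2]]   # class 1
y = [0, 0, 0, 1, 1, 1]
print("F1 =", fisher_ratio(X, y))  # high F1 -> well-separated, easier problem
```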
Knowledge and Information Systems | 2014
José A. Sáez; Mikel Galar; Julián Luengo; Francisco Herrera
The presence of noise in data is a common problem that produces several negative consequences in classification problems. In multi-class problems, these consequences are aggravated in terms of accuracy, building time, and complexity of the classifiers. In these cases, an interesting approach to reduce the effect of noise is to decompose the problem into several binary subproblems, reducing the complexity and, consequently, dividing the effects caused by noise into each of these subproblems. This paper analyzes the usage of decomposition strategies, and more specifically the One-vs-One scheme, to deal with noisy multi-class datasets. In order to investigate whether the decomposition is able to reduce the effect of noise or not, a large number of datasets are created introducing different levels and types of noise, as suggested in the literature. Several well-known classification algorithms, with or without decomposition, are trained on them in order to check when decomposition is advantageous. The results obtained show that methods using the One-vs-One strategy lead to better performances and more robust classifiers when dealing with noisy data, especially with the most disruptive noise schemes.
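The One-vs-One decomposition is available off the shelf in scikit-learn. The hypothetical sketch below injects uniform label noise into the Iris data and compares a decision tree with and without the decomposition, loosely mirroring the setup described above (the paper's noise schemes and datasets are not reproduced).

```python
# Direct multi-class tree vs. One-vs-One decomposition under label noise.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_noisy = y.copy()
flip = rng.random(len(y)) < 0.20               # 20% uniform label noise
y_noisy[flip] = rng.integers(0, 3, flip.sum())

base = DecisionTreeClassifier(random_state=0)
for name, clf in (("direct", base), ("one-vs-one", OneVsOneClassifier(base))):
    acc = cross_val_score(clf, X, y_noisy, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```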
Pattern Recognition | 2013
José A. Sáez; Julián Luengo; Francisco Herrera
Classifier performance, particularly that of instance-based learners such as k-nearest neighbors, is affected by the presence of noisy data. Noise filters are traditionally employed to remove these corrupted data and improve classification performance. However, their efficacy depends on the properties of the data, which can be analyzed by what are known as data complexity measures. This paper studies the relation between the complexity metrics of a dataset and the efficacy of several noise filters in improving the performance of the nearest neighbor classifier. A methodology is proposed to extract a rule set based on data complexity measures that enables one to predict in advance whether the use of noise filters will be statistically profitable. The results obtained show that noise filtering efficacy is to a great extent dependent on the characteristics of the data as analyzed by the measures. The validation process carried out shows that the final rule set provided is fairly accurate in predicting the efficacy of noise filters before their application, and it produces an improvement with respect to the indiscriminate usage of noise filters.
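As a concrete instance of the setting studied here, the sketch below applies a classical noise filter, Wilson's Edited Nearest Neighbours (via imbalanced-learn), before 1-NN classification on synthetic noisy data; the complexity-based rules that predict when filtering pays off are not shown.

```python
# 1-NN with and without an ENN noise-filtering pass on the training set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=600, flip_y=0.15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=1)
print("1-NN, no filter: ", knn.fit(X_tr, y_tr).score(X_te, y_te))

# Filter all classes: drop training examples misclassified by their neighbors.
X_f, y_f = EditedNearestNeighbours(sampling_strategy="all").fit_resample(X_tr, y_tr)
print("1-NN, ENN filter:", knn.fit(X_f, y_f).score(X_te, y_te))
```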