Publications


Featured research published by José-Francisco Díez-Pastor.


Information Sciences | 2015

Diversity techniques improve the performance of the best imbalance learning ensembles

José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio; Ludmila I. Kuncheva

Many real-life problems can be described as imbalanced: the number of instances belonging to one class is much larger than the number in other classes. Examples are spam detection, credit card fraud detection and medical diagnosis. Ensembles of classifiers have become popular for this kind of problem because of their ability to obtain better results than individual classifiers. The techniques most commonly used by ensembles specifically designed for imbalanced problems are re-weighting, oversampling and undersampling. Other techniques, originally intended to increase ensemble diversity, have not been systematically studied for their effect on imbalanced problems; among these are Random Oracles, Disturbing Neighbors, Random Feature Weights and Rotation Forest. This paper presents an overview and an experimental study of various ensemble-based methods for imbalanced problems. The methods were tested in their original form and in conjunction with several diversity-increasing techniques, using 84 imbalanced data sets from two well-known repositories. The study shows that these diversity-increasing techniques significantly improve the performance of ensemble methods for imbalanced problems, and it provides some guidance on when it is most convenient to use them.
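
To make the idea concrete, here is a minimal sketch (not the authors' implementation) of layering one diversity-increasing technique, random feature subspaces, on top of one imbalance technique, random undersampling. All function names are illustrative; the paper's actual techniques (Random Oracles, Disturbing Neighbors, Random Feature Weights, Rotation Forest) differ in detail.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample(X, y, rng):
    """Randomly undersample every class down to the minority class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def fit_ensemble(X, y, n_trees=10, subspace=0.5, seed=0):
    """Each tree sees an undersampled sample (imbalance technique) drawn
    over a random feature subspace (diversity technique)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(subspace * X.shape[1]))
    ensemble = []
    for _ in range(n_trees):
        Xb, yb = undersample(X, y, rng)
        feats = rng.choice(X.shape[1], size=k, replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(Xb[:, feats], yb)
        ensemble.append((tree, feats))
    return ensemble

def predict(ensemble, X):
    """Majority vote; assumes integer class labels."""
    votes = np.array([tree.predict(X[:, feats]) for tree, feats in ensemble])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```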


Knowledge-Based Systems | 2016

Instance selection of linear complexity for big data

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large; however, even these methods face similar problems with very large to massive data sets. In this paper, two new algorithms of linear complexity for instance selection are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal reduces complexity to linear in the data set size. The new proposal has been compared with some of the best-known instance selection methods and has also been evaluated on large data sets (up to a million instances).
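
A rough sketch of the core idea, under the assumption (not stated in the abstract) that the selection rule keeps the first instance of each class seen in every hash bucket; a single pass over the data gives the linear complexity:

```python
import numpy as np

def lsh_instance_selection(X, y, n_planes=8, seed=0):
    """Hash instances with random hyperplanes (locality-sensitive hashing),
    then keep one representative per (bucket, class) pair. One pass: O(n)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_planes))
    # signature: the sign pattern of the projections defines the bucket
    signatures = (X @ planes > 0).astype(np.uint8)
    seen, keep = set(), []
    for i, sig in enumerate(signatures):
        key = (tuple(sig), y[i])
        if key not in seen:        # first instance of this class in bucket
            seen.add(key)
            keep.append(i)
    return np.array(keep)
```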


Information Fusion | 2014

Tree ensemble construction using a GRASP-based heuristic and annealed randomness

José-Francisco Díez-Pastor; César Ignacio García-Osorio; Juan José Rodríguez

Two new methods for tree ensemble construction are presented: G-Forest and GAR-Forest. In a similar way to Random Forest, the tree construction process entails a degree of randomness. The same strategy used in the GRASP metaheuristic for generating random and adaptive solutions is applied at each node of the trees, and the randomness of this solution generation method is the source of the ensemble's diversity. A further key feature of the tree construction method for GAR-Forest is a decreasing level of randomness during the process of constructing the tree: maximum randomness at the root and minimum randomness at the leaves. The method is therefore named “GAR”, GRASP with annealed randomness. The results conclusively demonstrate that G-Forest and GAR-Forest outperform Bagging, AdaBoost, MultiBoost, Random Forest and Random Subspaces. The results are even more convincing in the presence of noise, demonstrating the robustness of the method. The relationship between the accuracy and the diversity of the base classifiers is analysed by application of kappa-error diagrams and a variant of these called kappa-error relative movement diagrams.
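
A minimal sketch of the "annealed randomness" idea, with an assumed linear schedule (the paper's actual schedule may differ): the restricted candidate list (RCL) used by GRASP shrinks as tree depth grows, so splits are nearly random at the root and nearly greedy at the leaves.

```python
import numpy as np

def annealed_alpha(depth, max_depth):
    """Randomness schedule: alpha=1 (fully random) at the root, alpha=0
    (greedy) near the leaves. The linear shape here is an assumption."""
    return max(0.0, 1.0 - depth / max_depth)

def grasp_pick(scores, alpha, rng):
    """GRASP-style choice: build a restricted candidate list (RCL) of
    attributes whose score is within alpha of the best, pick one at random."""
    scores = np.asarray(scores, dtype=float)
    best, worst = scores.max(), scores.min()
    rcl = np.where(scores >= best - alpha * (best - worst))[0]
    return rng.choice(rcl)

# at depth 0 the RCL contains every attribute (maximum randomness);
# near max_depth it shrinks to the single best attribute (greedy)
rng = np.random.default_rng(0)
gains = [0.12, 0.40, 0.33, 0.05]   # hypothetical info-gain scores
attr = grasp_pick(gains, annealed_alpha(depth=1, max_depth=8), rng)
```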


International Conference on Multiple Classifier Systems | 2011

Ensembles of decision trees for imbalanced data

Juan José Rodríguez; José-Francisco Díez-Pastor; César Ignacio García-Osorio

Ensembles of decision trees are considered for imbalanced datasets. Conventional decision trees (C4.5) and trees for imbalanced data (CCPDT: Class Confidence Proportion Decision Tree) are used as base classifiers. Ensemble methods for imbalanced data, based on undersampling and oversampling, are considered. Conventional ensemble methods, not specific to imbalanced data, are also studied: Bagging, Random Subspaces, AdaBoost, Real AdaBoost, MultiBoost and Rotation Forest. The results show that the ensemble method is much more important than the type of decision tree used as the base classifier. Rotation Forest is the ensemble method with the best results. For the decision tree methods, CCPDT shows no advantage.
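
The experimental setup can be sketched roughly as below: several conventional ensembles built on the same base tree, evaluated on an imbalanced data set with AUC. Rotation Forest, Real AdaBoost, MultiBoost and CCPDT are not available in scikit-learn, so this is only an illustrative analogue of the comparison, on a hypothetical synthetic data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data set with ~9:1 class imbalance (stand-in for the benchmarks)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

base = DecisionTreeClassifier(max_depth=3, random_state=0)
ensembles = {
    "Bagging": BaggingClassifier(base, n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(base, n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, clf in ensembles.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```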


Progress in Artificial Intelligence | 2017

MR-DIS: democratic instance selection for big data by MapReduce

Álvar Arnaiz-González; Alejandro González-Rogel; José-Francisco Díez-Pastor; Carlos López-Nozal

Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. The usual problem is that these methods quite often suffer from high computational complexity, which becomes highly inconvenient when processing huge data sets. In this paper, a parallel implementation of the instance selection algorithm Democratic Instance Selection (DIS) is presented. The main advantages of the DIS algorithm are its computational complexity, which is linear in the number of instances, and its internal structure, which is intuitively parallelizable. The purpose of this paper is threefold: firstly, the design of the DIS algorithm following the MapReduce model; secondly, its implementation in the popular big data framework Spark; and finally, its empirical comparison over large-scale data sets. The results show that the processing time decreases linearly as the number of Spark executors increases, which makes the algorithm suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.
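
A simplified, sequential sketch of one democratic round, using a toy condensed-NN selector as the per-partition method (the real DIS internals differ, and the real algorithm chooses the vote threshold by trading off error against reduction rather than fixing it). Each partition is processed independently, which is what makes the map step trivially parallel in Spark.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_select(X, y):
    """Toy condensed-NN selection inside one partition (placeholder for
    the real per-partition selector). Returns local indices to keep."""
    keep = [0]
    for i in range(1, len(y)):
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        if knn.predict(X[i:i + 1])[0] != y[i]:
            keep.append(i)
    return keep

def democratic_round(X, y, n_partitions, votes, rng):
    """One DIS round: shuffle, split into disjoint partitions (the 'map'
    step -- each partition is independent, hence parallelizable), and
    vote against instances not selected within their partition."""
    order = rng.permutation(len(y))
    for part in np.array_split(order, n_partitions):
        kept = set(part[cnn_select(X[part], y[part])])
        for i in part:
            if i not in kept:
                votes[i] += 1
    return votes

def dis(X, y, rounds=5, n_partitions=4, max_votes=3, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(y), dtype=int)
    for _ in range(rounds):
        votes = democratic_round(X, y, n_partitions, votes, rng)
    return np.where(votes < max_votes)[0]   # survivors of the vote
```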


International Conference on Multiple Classifier Systems | 2011

GRASP forest: a new ensemble method for trees

José-Francisco Díez-Pastor; César Ignacio García-Osorio; Juan José Rodríguez; Andres Bustillo

This paper proposes a method for constructing ensembles of decision trees: GRASP Forest. This method uses the GRASP metaheuristic, usually applied to optimization problems, to increase the diversity of the ensemble. While Random Forest increases diversity by randomly choosing a subset of attributes at each tree node, GRASP Forest takes all the attributes into account; the source of randomness in the method is the GRASP metaheuristic itself. Instead of choosing the best attribute from a randomly selected subset of attributes, as Random Forest does, the attribute is randomly chosen from a subset of good candidate attributes. Besides the selection of attributes, GRASP is used to select the split value for each numeric attribute. The method is compared to Bagging, Random Forest, Random Subspaces, AdaBoost and MultiBoost, and the results are very competitive for the proposed method.
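
The second use of GRASP, choosing the split value for a numeric attribute, can be sketched as follows. This is illustrative only: it assumes the attribute has at least two distinct values and uses Gini impurity as the scoring function, which the abstract does not specify.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def grasp_split_value(x, y, alpha=0.3, rng=None):
    """Pick the split threshold for numeric attribute x at random from a
    restricted candidate list (RCL) of good thresholds, instead of always
    taking the single best one as a greedy tree inducer would."""
    if rng is None:
        rng = np.random.default_rng(0)
    values = np.unique(x)
    thresholds = (values[:-1] + values[1:]) / 2.0   # candidate midpoints

    def gain(t):   # impurity decrease of the split x <= t
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        return gini(y) - w * gini(left) - (1 - w) * gini(right)

    scores = np.array([gain(t) for t in thresholds])
    best, worst = scores.max(), scores.min()
    rcl = thresholds[scores >= best - alpha * (best - worst)]   # the RCL
    return rng.choice(rcl)
```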


Expert Systems with Applications | 2018

Study of data transformation techniques for adapting single-label prototype selection algorithms to multi-label learning

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

In this paper, the focus is on the application of prototype selection to multi-label data sets as a preliminary stage in the learning process. There are two general strategies when designing machine learning algorithms capable of dealing with multi-label problems: data transformation and method adaptation. These strategies have been successfully applied in obtaining classifiers and regressors for multi-label learning. Here we investigate the feasibility of data transformation for obtaining multi-label prototype selection algorithms from three single-label prototype selection algorithms. The data transformation methods used were: binary relevance, dependent binary relevance, label powerset, and random k-labelsets. The general conclusion is that the prototype selection methods obtained by data transformation are not better than those obtained through method adaptation. Moreover, the prototype selection algorithms designed for multi-label data do not do an entirely satisfactory job: although they reduce the size of the data set without significantly affecting accuracy, the classifier trained with the reduced data set does not improve on the accuracy of the classifier trained with the whole data set.
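
Of the four transformations, binary relevance is the simplest to illustrate: the single-label selector runs once per label column, and the per-label selections are merged. The merge rule below (union or intersection) is an assumption made for the sketch, not something the abstract prescribes.

```python
import numpy as np

def binary_relevance_selection(X, Y, select, combine="union"):
    """Run a single-label prototype selection function
    `select(X, y) -> array of kept indices` once per label column of the
    binary label matrix Y, then merge the per-label selections."""
    kept_per_label = [set(select(X, Y[:, j])) for j in range(Y.shape[1])]
    if combine == "union":        # keep an instance if any label kept it
        kept = set.union(*kept_per_label)
    else:                         # "intersection": every label must agree
        kept = set.intersection(*kept_per_label)
    return np.array(sorted(kept))

# usage: kept = binary_relevance_selection(X, Y, select=some_single_label_selector)
```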


International Conference on Multiple Classifier Systems | 2010

An experimental study on ensembles of functional trees

Juan José Rodríguez; César Ignacio García-Osorio; Jesús Maudes; José-Francisco Díez-Pastor

Functional Trees are one type of multivariate tree. This work studies the performance of different ensemble methods (Bagging, Random Subspaces, AdaBoost, Rotation Forest) using three variants of these trees (multivariate internal nodes, multivariate leaves, or both) as base classifiers. The best results, for all the ensemble methods, are obtained using Functional Trees with multivariate leaves and univariate internal nodes. The best overall configuration is obtained with Rotation Forest. Ensembles of Functional Trees are compared to ensembles of univariate decision trees, and the results favour the variant of Functional Trees with univariate internal nodes and multivariate leaves. Kappa-error diagrams are used to study the diversity and accuracy of the base classifiers.
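
A rough analogue of the best-performing variant (univariate internal nodes, multivariate leaves) can be built by fitting a model in each leaf of an ordinary tree. This is only a sketch under that reading, not the original Functional Tree induction algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

class TreeWithModelLeaves:
    """Univariate internal nodes (a plain decision tree routes each
    instance to a leaf) and multivariate leaves (a logistic regression
    fitted on each leaf's training instances makes the prediction)."""

    def __init__(self, max_depth=3):
        self.tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)           # leaf id of each instance
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            if len(np.unique(y[mask])) > 1:   # fitting LR needs 2+ classes
                self.leaf_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
            else:                             # pure leaf: constant prediction
                self.leaf_models[leaf] = y[mask][0]
        return self

    def predict(self, X):
        preds = []
        for i, leaf in enumerate(self.tree.apply(X)):
            m = self.leaf_models[leaf]
            preds.append(m.predict(X[i:i + 1])[0] if isinstance(m, LogisticRegression) else m)
        return np.array(preds)
```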


Computer Applications in Engineering Education | 2018

Seshat - a web-based educational resource for teaching the most common algorithms of lexical analysis

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Ismael Ramos-Pérez; César Ignacio García-Osorio

The theoretical background to automata and formal languages represents a complex learning area for students. Computer tools for interacting with an algorithm, and interfaces that visualize its different steps, can assist the learning process and make it more attractive. In this paper, we present a web application for learning some of the most common algorithms in an appealing way, specifically those linked to the recognition of regular languages, which is taught in classes on both automata theory and compiler design. Although several simulators are available to students, they usually only serve to validate grammars, automata and languages, rather than helping students to learn the internal processes that an algorithm performs. The resource presented here can execute and display each algorithm, step by step, providing explanations at each step that assist student comprehension. Additionally, as a web-based resource, it can be used on any device with no need for specific software installation.
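
As an example of the kind of algorithm such a tool walks through, here is a step-by-step trace of subset construction (NFA to DFA), one of the standard lexical-analysis algorithms. The toy NFA is a hypothetical input; the tool itself is a web application, not this script.

```python
from collections import deque

def subset_construction(nfa, start, alphabet):
    """nfa maps (state, symbol) -> set of next states; symbol '' is epsilon.
    Prints each construction step, mimicking a step-by-step display."""
    def eps_closure(states):
        stack, closure = list(states), set(states)
        while stack:
            for t in nfa.get((stack.pop(), ""), set()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    start_set = eps_closure({start})
    dfa, queue, seen = {}, deque([start_set]), {start_set}
    while queue:
        current = queue.popleft()
        for a in alphabet:
            move = set().union(*(nfa.get((s, a), set()) for s in current))
            target = eps_closure(move)
            dfa[(current, a)] = target
            print(f"step: {set(current)} --{a}--> {set(target)}")
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa

# toy NFA (hypothetical) recognising strings over {a, b} ending in 'ab'
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
subset_construction(nfa, start=0, alphabet="ab")
```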


Applied Soft Computing | 2018

Local sets for multi-label instance selection

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

The multi-label classification problem is an extension of traditional (single-label) classification, in which the output is a vector of values rather than a single categorical value. The multi-label problem is therefore a very different and much more challenging one than the single-label problem. Recently, multi-label classification has attracted interest because of its real-life applications, such as image recognition, bioinformatics and text categorization, among others. Unfortunately, there are few instance selection techniques capable of processing the data used for these applications, even though such techniques are very useful for cleaning and reducing the size of data sets. In single-label problems, the local set of an instance x comprises all instances in the largest hypersphere centered on x such that they are all of the same class. This concept has been successfully integrated in the design of Iterative Case Filtering, one of the most influential instance selection methods in single-label learning. Unfortunately, the concept as originally defined for single-label learning cannot be directly applied to multi-label data, because each instance has more than one label. An adaptation of the local set concept to multi-label data is proposed in this paper, and its effectiveness is verified in the design of two new algorithms that yielded competitive results. One of the adaptations cleans the data sets to improve their predictive capabilities, while the other aims to reduce data set sizes. Both are tested and compared against the state-of-the-art instance selection methods available for multi-label learning.
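
For reference, the single-label local set can be computed as below (brute force, with O(n²) pairwise distances). The paper's contribution, adapting the "different class" test to label vectors, is not reproduced here.

```python
import numpy as np

def local_set_sizes(X, y):
    """For each instance x, the local set is every instance lying closer
    to x than x's nearest enemy (nearest instance of a different class),
    i.e. inside the largest same-class hypersphere centered on x.
    Single-label version; assumes at least two classes are present."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise
    sizes = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        enemy_dist = D[i][y != y[i]].min()   # distance to nearest enemy
        sizes[i] = np.sum((D[i] < enemy_dist) & (np.arange(len(y)) != i))
    return sizes
```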
