Álvar Arnaiz-González

Publication

Featured research published by Álvar Arnaiz-González.

Information Fusion | 2016

Fusion of instance selection methods in regression tasks

Álvar Arnaiz-González; Marcin Blachnik; Mirosław Kordos; César Ignacio García-Osorio

Few instance selection (IS) methods exist for regression. Two different families of instance selection methods for regression are compared. One is based on a simple discretization of the output variable, yet gives good results. Both approaches can be used to adapt IS methods for classification to regression. The fusion of these IS algorithms in an ensemble for regression is also analyzed.

Data pre-processing is a very important aspect of data mining. In this paper we discuss instance selection, one of the pre-processing approaches, as used for prediction algorithms. The purpose of instance selection is to improve data quality through data size reduction and noise elimination. Until recently, instance selection has been applied mainly to classification problems. Very few recent papers address instance selection for regression tasks. This paper proposes the fusion of instance selection algorithms for regression tasks to improve selection performance. As members of the ensemble, two different families of instance selection methods are evaluated: one based on a distance threshold and the other on converting the regression task into a multiple-class classification task. Extensive experimental evaluation performed on the two regression versions of the Edited Nearest Neighbor (ENN) and Condensed Nearest Neighbor (CNN) methods showed that the best performance, measured by error value and data size reduction, is in most cases obtained by the ensemble methods.
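The discretization idea described above can be illustrated with a minimal sketch. It assumes equal-width binning of the output variable and a plain Edited Nearest Neighbor rule with squared Euclidean distance; the function names, the number of bins and the value of k are illustrative choices, not details taken from the paper.

```python
from collections import Counter

def discretize(y, n_bins):
    """Map each numeric target to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all targets are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in y]

def enn_keep_mask(X, labels, k=3):
    """Edited Nearest Neighbor: keep an instance only if its label matches
    the majority label of its k nearest neighbours."""
    keep = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        keep.append(votes.most_common(1)[0][0] == labels[i])
    return keep

def enn_regression_by_discretization(X, y, n_bins=2, k=3):
    """Turn the regression targets into pseudo-classes, then apply ENN.
    Returns the indices of the instances that are kept."""
    labels = discretize(y, n_bins)
    mask = enn_keep_mask(X, labels, k)
    return [i for i, kept in enumerate(mask) if kept]

# Toy data: the instance at index 3 has a target that disagrees with its neighbours.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y = [1.0, 1.1, 0.9, 10.0, 10.1, 9.9, 10.0, 10.2]
kept = enn_regression_by_discretization(X, y)
```

On this toy data the noisy instance at index 3 is filtered out, because its pseudo-class differs from that of its three nearest neighbours.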

Knowledge Based Systems | 2016

Instance selection of linear complexity for big data

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n^2), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).
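To make the linear-time idea concrete, here is a rough sketch of one way locality-sensitive hashing can drive instance selection in a single pass. It is not the paper's pair of algorithms: the random-projection hash family, the rule of keeping the first instance per (bucket, class) pair, and all names and parameters (`n_hashes`, `width`, `seed`) are assumptions made for illustration.

```python
import math
import random

def make_hash(dim, width, rng):
    """Build one random-projection hash: h(x) = floor((a . x + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x):
        return math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)

    return h

def lsh_select(X, y, n_hashes=4, width=1.0, seed=0):
    """Single pass over the data: keep the first instance seen for each
    (bucket, class) pair. Each instance is hashed once, so the cost grows
    linearly with the number of instances."""
    rng = random.Random(seed)
    hashes = [make_hash(len(X[0]), width, rng) for _ in range(n_hashes)]
    seen = set()
    selected = []
    for i, (x, label) in enumerate(zip(X, y)):
        key = (tuple(h(x) for h in hashes), label)
        if key not in seen:
            seen.add(key)
            selected.append(i)
    return selected

# Heavily redundant toy data: many copies of two points, one per class.
X = [[0.0, 0.0]] * 100 + [[10.0, 10.0]] * 100
y = [0] * 100 + [1] * 100
selected = lsh_select(X, y)
```

Identical instances always fall into the same bucket, so each run of duplicates collapses to a single representative. Nearby but distinct instances share a bucket only with some probability, which `width` and `n_hashes` control.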

Neurocomputing | 2016

Instance selection for regression

Álvar Arnaiz-González; José F. Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

Machine Learning has two central processes of interest that captivate the scientific community: classification and regression. Although instance selection for classification has shown its usefulness and has been researched in depth, instance selection for regression has not followed the same path and there are few published algorithms on the subject. In this paper, we propose that various adaptations of DROP, a well-known family of instance selection methods for classification, be applied to regression. Their behaviour is analysed using a broad range of datasets. Four new proposals are analysed in terms of the reduction of dataset size, the effect on error when several classifiers are trained with the reduced dataset, and their robustness against noise. This last aspect is especially important, since in real life recorded data are often inexact and distorted for various reasons: errors in the measurement tools, typos when writing results, outliers and spurious readings, corrupted files, etc. When the datasets are small it is possible to correct these problems manually, but for large and very large datasets it is better to have automatic methods to deal with them. In the experimental part, the proposed methods are found to be quite robust to noise.

Progress in Artificial Intelligence | 2017

MR-DIS: democratic instance selection for big data by MapReduce

Álvar Arnaiz-González; Alejandro González-Rogel; José-Francisco Díez-Pastor; Carlos López-Nozal

Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. A common problem is that these methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the Democratic Instance Selection (DIS) algorithm is presented. The main advantages of the DIS algorithm are its computational complexity, linear in the number of instances, and its internal structure, which is intuitively parallelizable. The purpose of this paper is threefold: firstly, the design of the DIS algorithm following the MapReduce model; secondly, its implementation in the popular big data framework Spark; and finally, its empirical comparison over large-scale data sets. The results show that the processing time is reduced linearly as the number of Spark executors increases, which makes it suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.

Neurocomputing | 2018

A taxonomic look at instance-based stream classifiers

Iain A. D. Gunn; Álvar Arnaiz-González; Ludmila I. Kuncheva

Large numbers of data streams are today generated in many fields. A key challenge when learning from such streams is the problem of concept drift. Many methods, including many prototype methods, have been proposed in recent years to address this problem. This paper presents a refined taxonomy of instance selection and generation methods for the classification of data streams subject to concept drift. The taxonomy allows discrimination among a large number of methods which pre-existing taxonomies for offline instance selection methods did not distinguish. This makes possible a valuable new perspective on experimental results, and provides a framework for discussion of the concepts behind different algorithm-design approaches. We review a selection of modern algorithms for the purpose of illustrating the distinctions made by the taxonomy. We present the results of a numerical experiment which examined the performance of a number of representative methods on both synthetic and real-world data sets with and without concept drift, and discuss the implications for the directions of future research in light of the taxonomy. On the basis of the experimental results, we are able to give recommendations for the experimental evaluation of algorithms which may be proposed in the future.

Expert Systems With Applications | 2018

Study of data transformation techniques for adapting single-label prototype selection algorithms to multi-label learning

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

In this paper, the focus is on the application of prototype selection to multi-label data sets as a preliminary stage in the learning process. There are two general strategies when designing Machine Learning algorithms that are capable of dealing with multi-label problems: data transformation and method adaptation. These strategies have been successfully applied in obtaining classifiers and regressors for multi-label learning. Here we investigate the feasibility of using data transformation to obtain prototype selection algorithms for multi-label data sets from three prototype selection algorithms for single-label data. The data transformation methods used were: binary relevance, dependent binary relevance, label powerset, and random k-labelsets. The general conclusion is that the prototype selection methods obtained using data transformation are not better than those obtained through method adaptation. Moreover, prototype selection algorithms designed for multi-label data do not do an entirely satisfactory job: although they reduce the size of the data set without significantly affecting accuracy, a classifier trained with the reduced data set is no more accurate than one trained with the whole data set.
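Two of the data transformations named above are simple enough to sketch directly. The following is an illustrative sketch only: it assumes the labels are given as a binary indicator matrix (one row per instance, one column per label), and the function names are hypothetical. How the paper feeds the transformed problems into the single-label prototype selectors is not shown here.

```python
def binary_relevance(Y):
    """Binary relevance: split a multi-label problem into one independent
    binary problem per label. Returns one target vector per label column."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]

def label_powerset(Y):
    """Label powerset: treat every distinct combination of labels as one
    class of a single multi-class problem. Returns one class id per instance."""
    class_ids = {}
    return [class_ids.setdefault(tuple(row), len(class_ids)) for row in Y]

# Four instances, three labels.
Y = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
per_label_targets = binary_relevance(Y)  # three binary target vectors
powerset_classes = label_powerset(Y)     # one multi-class target vector
```

After either transformation, each resulting target vector is an ordinary single-label problem, so a selector written for single-label data can be applied to it without modification.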

Computer Applications in Engineering Education | 2018

Seshat - a web-based educational resource for teaching the most common algorithms of lexical analysis

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Ismael Ramos-Pérez; César Ignacio García-Osorio

The theoretical background to automata and formal languages represents a complex learning area for students. Computer tools for interacting with the algorithm and interfaces to visualize its different steps can assist the learning process and make it more attractive. In this paper, we present a web application for learning some of the most common algorithms in an appealing way. They are specifically linked to the recognition of regular languages, which are taught in classes on both automata theory and compiler design. Although several simulators are available to students, they usually only serve to validate grammars, automata, and languages, rather than helping students to learn the internal processes that an algorithm can perform. The resource presented here can execute and display each algorithm process, step by step, providing explanations on each step that assist student comprehension. Additionally, as a web-based resource, it can be used on any device with no need for specific software installation.

Applied Soft Computing | 2018

Local sets for multi-label instance selection

Álvar Arnaiz-González; José-Francisco Díez-Pastor; Juan José Rodríguez; César Ignacio García-Osorio

The multi-label classification problem is an extension of traditional (single-label) classification, in which the output is a vector of values rather than a single categorical value. The multi-label problem is therefore a very different and much more challenging one than the single-label problem. Recently, multi-label classification has attracted interest, because of its real-life applications, such as image recognition, bio-informatics, and text categorization, among others. Unfortunately, there are few instance selection techniques capable of processing the data used for these applications. These techniques are also very useful for cleaning and reducing the size of data sets. In single-label problems, the local set of an instance x comprises all instances in the largest hypersphere centered on x, so that they are all of the same class. This concept has been successfully integrated in the design of Iterative Case Filtering, one of the most influential instance selection methods in single-label learning. Unfortunately, the concept that was originally defined for single-label learning cannot be directly applied to multi-label data, as each instance has more than one label. An adaptation of the local set concept to multi-label data is proposed in this paper and its effectiveness is verified in the design of two new algorithms that yielded competitive results. One of the adaptations cleans the data sets, to improve their predictive capabilities, while the other aims to reduce data set sizes. Both are tested and compared against the state-of-the-art instance selection methods available for multi-label learning.
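The single-label local set defined above can be computed directly from that definition. The sketch below assumes Euclidean distance, takes the hypersphere radius to be the distance to the nearest instance of a different class, and leaves x itself out of the returned set; these are illustrative choices. The multi-label adaptation proposed in the paper is not reproduced here.

```python
import math

def local_set(X, y, i):
    """Indices of the instances lying strictly inside the largest hypersphere
    centred on X[i] that contains no instance of a different class."""
    def dist(j):
        return math.dist(X[i], X[j])

    # Radius: distance to the nearest "enemy" (instance of another class).
    # If there is no enemy at all, the sphere is unbounded.
    radius = min(
        (dist(j) for j in range(len(X)) if y[j] != y[i]),
        default=math.inf,
    )
    return [j for j in range(len(X)) if j != i and dist(j) < radius]

# One-dimensional toy data with two classes.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0, 0, 0, 1, 1]
ls_first = local_set(X, y, 0)   # nearest enemy of x=0 is at x=3
ls_border = local_set(X, y, 2)  # x=2 sits next to an enemy, so its local set is empty
```

Instances deep inside a class region have large local sets, while instances near a class boundary have small or empty ones.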

Progress in Artificial Intelligence | 2016

Random feature weights for regression trees

Álvar Arnaiz-González; José F. Díez-Pastor; César Ignacio García-Osorio; Juan José Rodríguez

Ensembles are learning methods the operation of which relies on a combination of different base models. The diversity of ensembles is a fundamental aspect that conditions their operation. Random Feature Weights (

Multiple Classifier Systems | 2015