Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Sergio Ramírez-Gallego is active.

Publication


Featured research published by Sergio Ramírez-Gallego.


Mathematical Problems in Engineering | 2015

Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Daniel Peralta; Sara del Río; Sergio Ramírez-Gallego; Isaac Triguero; José Manuel Benítez; Francisco Herrera

Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim at eliminating noisy, redundant, or irrelevant features that may deteriorate classification performance. However, traditional methods lack the scalability to cope with datasets of millions of instances and to deliver results within a limited time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances to learn from them in the map phase; the reduce phase then merges the obtained partial results into a final vector of feature weights, which allows a flexible application of the feature selection procedure using a threshold to determine the selected subset of features. The feature selection method is evaluated using three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets with up to 67 million instances and up to 2000 attributes have been managed, showing that this is a suitable framework for evolutionary feature selection, improving both classification accuracy and runtime when dealing with big data problems.
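
The map/reduce decomposition described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it is a minimal Python illustration in which a simple per-block scoring function (score_block, an invented name) stands in for the evolutionary search, and the reduce step is a plain average of the partial weight vectors.

import numpy as np

def score_block(X_block, y_block):
    # Stand-in for the evolutionary search on one block of instances:
    # here we simply use |correlation with the label| as a feature weight.
    weights = np.array([abs(np.corrcoef(X_block[:, j], y_block)[0, 1])
                        for j in range(X_block.shape[1])])
    return np.nan_to_num(weights)

def mapreduce_feature_selection(X, y, n_blocks=4, threshold=0.1):
    # Map phase: learn a partial weight vector from each block of instances.
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    partial = [score_block(X[idx], y[idx]) for idx in blocks]
    # Reduce phase: merge the partial results into a final vector of weights.
    final_weights = np.mean(partial, axis=0)
    # A threshold on the merged weights yields the selected subset of features.
    selected = np.where(final_weights >= threshold)[0]
    return final_weights, selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=1000) > 0).astype(int)
    weights, selected = mapreduce_feature_selection(X, y)
    print("weights:", np.round(weights, 2))
    print("selected features:", selected)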


Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2016

Data discretization: taxonomy and big data challenge

Sergio Ramírez-Gallego; Salvador García; Héctor Mouriño-Talín; David Martínez-Rego; Verónica Bolón-Canedo; Amparo Alonso-Betanzos; José Manuel Benítez; Francisco Herrera

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories that are adequate for the learning task while retaining as much information from the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this article is twofold: first, to present a comprehensive taxonomy of discretization techniques to help practitioners in the use of the algorithms; and second, to demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well-known Information Theory-based discretizers: the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5–21. doi: 10.1002/widm.1173
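
As context for the discretizer mentioned above, here is a minimal, single-attribute Python sketch of entropy-based cut point selection. It is a simplification (bounded recursion and a fixed minimum-gain stop instead of the full MDL criterion) and is not the distributed Spark implementation described in the article.

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(values, labels):
    # Evaluate midpoints between consecutive distinct values as candidate cuts
    # and return the one minimizing the weighted entropy of the two halves.
    uniq = np.unique(values)
    candidates = (uniq[:-1] + uniq[1:]) / 2
    best, best_e = None, np.inf
    for c in candidates:
        left, right = labels[values <= c], labels[values > c]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best, best_e = c, e
    return best, best_e

def entropy_discretize(values, labels, min_gain=0.01, depth=3):
    # Recursively split while the information gain of the best cut exceeds
    # min_gain (a simplification of the MDL stopping rule of Fayyad and Irani).
    if depth == 0 or len(np.unique(labels)) < 2:
        return []
    cut, e = best_cut(values, labels)
    if cut is None or entropy(labels) - e < min_gain:
        return []
    left, right = values <= cut, values > cut
    return (entropy_discretize(values[left], labels[left], min_gain, depth - 1)
            + [cut]
            + entropy_discretize(values[right], labels[right], min_gain, depth - 1))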


IEEE Transactions on Systems, Man, and Cybernetics | 2016

Multivariate Discretization Based on Evolutionary Cut Points Selection for Classification

Sergio Ramírez-Gallego; Salvador García; José Manuel Benítez; Francisco Herrera

Discretization is one of the most relevant techniques for data preprocessing. The main goal of discretization is to transform numerical attributes into discrete ones to help experts understand the data more easily; it also makes it possible to use learning algorithms that require discrete data as input, such as Bayesian or rule learning. We focus our attention on handling multivariate classification problems, where high interactions among multiple attributes exist. In this paper, we propose the use of evolutionary algorithms to select a subset of cut points that defines the best possible discretization scheme of a data set, using a wrapper fitness function. We also incorporate a reduction mechanism to successfully manage the multivariate approach on large data sets. Our method has been compared with the best state-of-the-art discretizers on 45 real datasets. The experiments show that our proposed algorithm outperforms the other methods, producing competitive discretization schemes in terms of accuracy for the C4.5, Naive Bayes, PART, and PrUning and BuiLding Integrated in Classification (PUBLIC) classifiers, while obtaining far simpler solutions.
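
To make the idea of choosing a cut-point subset with a wrapper fitness concrete, here is a heavily simplified, single-attribute Python sketch. A random search stands in for the evolutionary algorithm, and a 1-rule classifier (majority class per bin) stands in for the wrapper classifier, so this only illustrates the shape of the approach, not the paper's method. Class labels are assumed to be small nonnegative integers.

import numpy as np

def discretize(values, cuts):
    # Map each value to the index of its bin given a set of cut points.
    return np.searchsorted(np.sort(cuts), values)

def wrapper_fitness(values, labels, cuts):
    # Wrapper-style fitness: training accuracy of a 1-rule classifier that
    # predicts the majority class of each bin of the discretized attribute,
    # with a small penalty on the number of cut points (simpler schemes win).
    bins = discretize(values, cuts)
    correct = sum(np.bincount(labels[bins == b]).max() for b in np.unique(bins))
    return correct / len(labels) - 0.01 * len(cuts)

def random_search_cuts(values, labels, n_iter=200, seed=0):
    # Stand-in for the evolutionary search: sample random subsets of candidate
    # cut points (a random "chromosome") and keep the best one found.
    rng = np.random.default_rng(seed)
    candidates = np.unique(values)[:-1]
    best_cuts, best_fit = np.array([]), -np.inf
    for _ in range(n_iter):
        cuts = candidates[rng.random(len(candidates)) < 0.2]
        fit = wrapper_fitness(values, labels, cuts)
        if fit > best_fit:
            best_cuts, best_fit = cuts, fit
    return best_cuts, best_fit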


Information Fusion | 2018

Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce

Sergio Ramírez-Gallego; Alberto Fernández; Salvador García; Min Chen; Francisco Herrera

We live in a world where data are generated from a myriad of sources, and it is really cheap to collect and store such data. However, the real benefit is not related to the data itself, but to the algorithms that are capable of processing such data in a tolerable elapsed time and of extracting valuable knowledge from it. Therefore, the use of Big Data Analytics tools provides very significant advantages to both industry and academia. The MapReduce programming framework can be regarded as the main paradigm related to such tools. It is mainly identified by carrying out a distributed execution for the sake of providing a high degree of scalability, together with a fault-tolerant scheme. In every MapReduce algorithm, local models are first learned from subsets of the original data within the so-called Map tasks. Then, the Reduce task is devoted to fusing the partial outputs generated by each Map. The way such fusion of information/models is designed may have a strong impact on the quality of the final system. In this work, we enumerate and analyze two alternative methodologies that may be found both in the specialized literature and in standard Machine Learning libraries for Big Data. Our main objective is to provide an introduction to the characteristics of these methodologies, as well as to give some guidelines for the design of novel algorithms in this field of research. Finally, a short experimental study allows us to contrast the scalability issues for each type of process fusion in MapReduce for Big Data Analytics.
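
A tiny sketch of the two broad fusion styles the tutorial refers to, under my own naming and simplifications: fusing predictions (keep every locally learned model and combine their outputs, e.g. by voting) versus fusing parameters (average the local models into a single global one). The "local model" below is just a per-class centroid, chosen purely for illustration; integer class labels are assumed.

import numpy as np

def learn_local_model(X_part, y_part):
    # Map task: learn a local model on one partition.
    # Here the "model" is the per-class mean (a nearest-centroid classifier).
    return {int(c): X_part[y_part == c].mean(axis=0) for c in np.unique(y_part)}

def predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

def fuse_by_voting(models, X):
    # Fusion of predictions: keep all local models and take a majority vote.
    votes = np.stack([predict(m, X) for m in models], axis=1)
    return np.array([np.bincount(row).argmax() for row in votes])

def fuse_by_averaging(models):
    # Fusion of parameters: average the local centroids into one global model.
    classes = set().union(*[m.keys() for m in models])
    return {c: np.mean([m[c] for m in models if c in m], axis=0) for c in classes}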


International Journal of Intelligent Systems | 2017

Fast-mRMR: Fast Minimum Redundancy Maximum Relevance Algorithm for High-Dimensional Big Data

Sergio Ramírez-Gallego; Iago Lastra; David Martínez-Rego; Verónica Bolón-Canedo; José Manuel Benítez; Francisco Herrera; Amparo Alonso-Betanzos

With the advent of large‐scale problems, feature selection has become a fundamental preprocessing step to reduce input dimensionality. The minimum‐redundancy‐maximum‐relevance (mRMR) selector is considered one of the most relevant methods for dimensionality reduction due to its high accuracy. However, it is a computationally expensive technique, sharply affected by the number of features. This paper presents fast‐mRMR, an extension of mRMR, which tries to overcome this computational burden. Associated with fast‐mRMR, we include a package with three implementations of this algorithm in several platforms, namely, CPU for sequential execution, GPU (graphics processing units) for parallel computing, and Apache Spark for distributed computing using big data technologies.
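
For readers unfamiliar with mRMR, the greedy criterion it maximizes at each step is relevance minus redundancy: I(Xk; Y) minus the average of I(Xk; Xj) over the already selected features Xj. Below is a minimal, non-optimized Python sketch of that greedy loop for already-discretized features; it is not the fast-mRMR implementation, which reorganizes the computation and data layout precisely to avoid this cost.

import numpy as np

def mutual_info(a, b):
    # Mutual information (in nats) between two discrete variables.
    ai = np.unique(a, return_inverse=True)[1]
    bi = np.unique(b, return_inverse=True)[1]
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mrmr(X, y, k):
    # Greedy mRMR: start from the most relevant feature, then repeatedly add
    # the feature maximizing relevance minus mean redundancy with the selected set.
    relevance = np.array([mutual_info(X[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            if relevance[j] - redundancy > best_score:
                best_j, best_score = j, relevance[j] - redundancy
        selected.append(best_j)
    return selected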


IEEE Transactions on Systems, Man, and Cybernetics | 2018

An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark

Sergio Ramírez-Gallego; Héctor Mouriño-Talín; David Martínez-Rego; Verónica Bolón-Canedo; José Manuel Benítez; Amparo Alonso-Betanzos; Francisco Herrera

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Of the many techniques available, feature selection (FS) is of growing interest for its ability to identify both relevant features and frequently repeated instances in huge datasets. We aim to demonstrate that standard FS methods can be parallelized in big data platforms like Apache Spark so as to boost both performance and accuracy. We propose a distributed implementation of a generic FS framework that includes a broad group of well-known information theory-based methods. Experimental results for a broad set of real-world datasets show that our distributed framework is capable of rapidly dealing with ultrahigh-dimensional datasets as well as those with a huge number of samples, outperforming the sequential version in all the cases studied.
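
Many of the information-theory-based selectors such a framework covers (MIM, mRMR, JMI, and others) can be written as instances of one greedy score combining a relevance term with redundancy and conditional-redundancy terms; a common unifying formulation in the literature is J(Xk) = I(Xk;Y) - beta * sum_j I(Xk;Xj) + gamma * sum_j I(Xk;Xj|Y), with the sums over the already selected features. The snippet below only shows how different (beta, gamma) choices instantiate different criteria; it is my own illustration, not the framework's API.

import numpy as np

def it_criterion(relevance, redundancies, cond_redundancies, beta, gamma):
    # Generic greedy score for information-theoretic feature selection:
    #   J(Xk) = I(Xk;Y) - beta * sum_j I(Xk;Xj) + gamma * sum_j I(Xk;Xj|Y)
    # relevance: scalar I(Xk;Y); redundancies / cond_redundancies: arrays of
    # I(Xk;Xj) and I(Xk;Xj|Y) over the already selected features Xj.
    return relevance - beta * np.sum(redundancies) + gamma * np.sum(cond_redundancies)

# Known criteria follow from particular (beta, gamma) choices, e.g.:
#   MIM : beta = 0,     gamma = 0       (pure relevance ranking)
#   mRMR: beta = 1/|S|, gamma = 0
#   JMI : beta = 1/|S|, gamma = 1/|S|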


IEEE Transactions on Systems, Man, and Cybernetics | 2017

Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

Sergio Ramírez-Gallego; Bartosz Krawczyk; Salvador García; Michal Wozniak; José Manuel Benítez; Francisco Herrera

Mining massive and high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods displaying high computational efficacy, with the ability to continuously update their structure and to handle the ever-arriving large numbers of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams proves the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed, big, and streaming data.
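
A single-machine sketch of the incremental nearest neighbor idea, assuming a bounded case-base that is updated as labeled instances arrive and pruned by simply dropping the oldest examples; the metric-space ordering, distributed search, and the actual instance selection policy of the paper are not reproduced here. Integer class labels are assumed.

import numpy as np
from collections import deque

class IncrementalKNN:
    # Bounded case-base: new examples are appended, the oldest are dropped.
    def __init__(self, k=3, max_size=1000):
        self.k = k
        self.case_base = deque(maxlen=max_size)

    def predict(self, x):
        if not self.case_base:
            return None
        X = np.array([c[0] for c in self.case_base])
        y = np.array([c[1] for c in self.case_base])
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[: self.k]
        return int(np.bincount(y[nearest]).argmax())

    def partial_fit(self, x, label):
        # Test-then-train: predict first, then insert the labeled example.
        pred = self.predict(x)
        self.case_base.append((np.asarray(x), int(label)))
        return pred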


IEEE Transactions on Cloud Computing | 2016

A Forecasting Methodology for Workload Forecasting in Cloud Systems

Francisco Javier Baldan; Sergio Ramírez-Gallego; Christoph Bergmeir; Jose M. Benitez-Sanchez; Francisco Herrera

Cloud Computing is an essential paradigm of computing services based on the “elasticity” property, where available resources are adapted efficiently to different workloads over time. In elastic platforms, the forecasting component can be considered by far the most important element and the differentiating factor when comparing such systems, with workload forecasting being one of the problems to solve if we want to achieve a truly elastic system. When properly addressed, the cloud workload forecasting problem becomes a really interesting case study. As there is no general methodology in the literature that addresses this problem analytically and from a time series forecasting perspective (even less so in the cloud field), we propose a combination of these tools based on a state-of-the-art forecasting methodology which we have enhanced with some elements, such as a specific cost function, statistical tests, and visual analysis. The insights obtained from this analysis are used to detect the asymmetrical nature of the forecasting problem and to find the best forecasting model from the viewpoint of the current state of the art in time series forecasting. From an operational point of view, the most interesting forecasts are those over a short time horizon, so we focus on these. To show the feasibility of this methodology, we apply it to several realistic workload datasets from different datacenters. The results indicate that the analyzed series are non-linear in nature and that no seasonal patterns can be found. Moreover, on the analyzed datasets, the penalty cost as usually included in the SLA can be reduced by 30 percent on average.
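
One element the methodology mentions is a problem-specific cost function; in elastic provisioning, under-forecasting (too few resources, SLA violations) is typically more costly than over-forecasting (idle resources). A hypothetical asymmetric cost of that kind, not the paper's exact function, could look like this:

import numpy as np

def asymmetric_cost(actual, forecast, under_penalty=3.0, over_penalty=1.0):
    # Penalize under-provisioning (forecast below the actual workload) more
    # heavily than over-provisioning; both weights are illustrative values.
    error = np.asarray(actual) - np.asarray(forecast)
    return float(np.mean(np.where(error > 0,
                                  under_penalty * error,
                                  -over_penalty * error)))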


Swarm and Evolutionary Computation | 2018

A distributed evolutionary multivariate discretizer for Big Data processing on Apache Spark

Sergio Ramírez-Gallego; Salvador García; José Manuel Benítez; Francisco Herrera

Nowadays, the phenomenon of Big Data is overwhelming our capacity to extract relevant knowledge through classical machine learning techniques. Discretization (as part of data reduction) is presented as a real solution to reduce this complexity. However, standard discretizers are not designed to perform well with such amounts of data. This paper proposes a distributed discretization algorithm for Big Data analytics based on evolutionary optimization. After comparing it with a distributed discretizer based on the Minimum Description Length Principle, we have found that our solution yields more accurate and simpler discretization schemes in reasonable time.


Knowledge-Based Systems | 2018

Principal Components Analysis Random Discretization Ensemble for Big Data

Diego García-Gil; Sergio Ramírez-Gallego; Salvador García; Francisco Herrera

Humongous amounts of data have created a lot of challenges in terms of data computation and analysis. Classic data mining techniques are not prepared for the new space and time requirements. Discretization and dimensionality reduction are two of the data reduction tasks in knowledge discovery. Random Projection Random Discretization is an ensemble method recently proposed by Ahmad and Brown (2014) that performs discretization and dimensionality reduction to create more informative data. Despite the good efficiency of random projections in dimensionality reduction, more robust methods like Principal Components Analysis (PCA) can improve the performance. We propose a new ensemble method to overcome this drawback using the Apache Spark platform and PCA for dimensionality reduction, named Principal Components Analysis Random Discretization Ensemble. Experimental results on five large-scale datasets show that our solution outperforms both the original algorithm and Random Forest in terms of prediction performance. Results also show that high-dimensional data can affect the runtime of the algorithm.
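
A rough, non-distributed sketch of the ensemble idea: each member applies PCA to the data and then a random discretization (here, cut points drawn at random from the observed projected values) before a base learner is trained, and the members combine their predictions by voting. Component counts, the number of cuts, and the base learner are left as illustrative choices, not the paper's configuration.

import numpy as np

def fit_pca(X, n_components):
    # PCA basis via SVD on centered data.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def fit_random_cuts(Z, n_cuts, rng):
    # Random discretization: for each principal component, draw cut points
    # at random from its observed values (assumes at least n_cuts distinct values).
    return [np.sort(rng.choice(np.unique(Z[:, j]), size=n_cuts, replace=False))
            for j in range(Z.shape[1])]

def transform(X, mean, components, cuts):
    # Project onto the PCA basis, then bin each component with its cut points.
    Z = (X - mean) @ components.T
    return np.column_stack([np.searchsorted(c, Z[:, j]) for j, c in enumerate(cuts)])

# One ensemble member = (PCA basis, its own random cut points, a base learner
# trained on transform(X, ...)); members then combine predictions by voting.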

Collaboration


Dive into Sergio Ramírez-Gallego's collaborations.
