Sara del Río | Researchain

Publication

Featured research published by Sara del Río.

Information Sciences | 2014

On the use of MapReduce for imbalanced big data using Random Forest

Sara del Río; Victoria López; José Manuel Benítez; Francisco Herrera

In this age, big data applications are increasingly becoming the main focus of attention because of the enormous increase in data generation and storage that has taken place in recent years. This situation becomes a challenge when huge amounts of data are processed to extract knowledge because the data mining techniques are not adapted to the new space and time requirements. Furthermore, real-world data applications usually present a class distribution where the samples that belong to one class, which is precisely the main interest, are hugely outnumbered by the samples of the other classes. This circumstance, known as the class imbalance problem, complicates the learning process as the standard learning techniques do not correctly address this situation. In this work, we analyse the performance of several techniques used to deal with imbalanced datasets in the big data scenario using the Random Forest classifier. Specifically, oversampling, undersampling and cost-sensitive learning have been adapted to big data using MapReduce so that these techniques are able to manage datasets as large as needed, providing the necessary support to correctly identify the underrepresented class. The Random Forest classifier provides a solid basis for the comparison because of its performance, robustness and versatility. An experimental study is carried out to evaluate the performance of the diverse algorithms considered. The results obtained show that no single approach to imbalanced big data classification outperforms the others for all the data considered when using Random Forest. Moreover, even for the same type of problem, the best performing method depends on the number of mappers selected to run the experiments. In most cases, when the number of splits is increased, an improvement in the running times can be observed; however, this gain in time is obtained at the expense of a slight drop in accuracy. This decrease in performance is related to the lack of density problem, which is evaluated in this work from the imbalanced data point of view, as this issue degrades the performance of classifiers in the imbalanced scenario more severely than in standard learning.
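
As a rough illustration of how random oversampling could be applied per mapper, the sketch below balances the classes inside each data split independently. It is not the authors' implementation: the function names `oversample_partition` and `split`, the toy dataset, and the round-robin partitioning are all assumptions made for this example, and plain Python lists stand in for a real MapReduce runtime.

```python
import random
from collections import Counter

def oversample_partition(rows, seed=0):
    """Hypothetical map step: replicate minority rows at random until
    every class in this partition has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies with replacement from the same class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

def split(rows, n_mappers):
    """Round-robin partitioning (an assumption; real splits depend on the runtime)."""
    return [rows[i::n_mappers] for i in range(n_mappers)]

# Toy imbalanced dataset: 18 negative rows and 6 positive rows.
data = [((i,), "neg") for i in range(18)] + [((100 + i,), "pos") for i in range(6)]

partitions = split(data, 3)
balanced = [
    row
    for part_id, part in enumerate(partitions)
    for row in oversample_partition(part, seed=part_id)
]
counts = Counter(label for _, label in balanced)
print(counts)
```

With three mappers, each split sees only two positive rows, so all replicated minority examples in a split are copies of those two. This is a small-scale view of the lack of density issue the abstract mentions: more splits mean fewer distinct minority examples per mapper.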

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2014

Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks

Alberto Fernández; Sara del Río; Victoria López; Abdullah Bawakid; María José del Jesús; José Manuel Benítez; Francisco Herrera

The term ‘Big Data’ has spread rapidly in the framework of Data Mining and Business Intelligence. This new scenario can be defined by means of those problems that cannot be effectively or efficiently addressed using the standard computing resources that we currently have. We must emphasize that Big Data does not just imply large volumes of data but also the necessity for scalability, i.e., to ensure a response in an acceptable elapsed time. When the scalability term is considered, usually traditional parallel‐type solutions are contemplated, such as the Message Passing Interface or high performance and distributed Database Management Systems. Nowadays there is a new paradigm that has gained popularity over the latter due to the number of benefits it offers. This model is Cloud Computing, and among its main features we have to stress its elasticity in the use of computing resources and space, less management effort, and flexible costs. In this article, we provide an overview on the topic of Big Data, and how the current problem can be addressed from the perspective of Cloud Computing and its programming frameworks. In particular, we focus on those systems for large‐scale analytics based on the MapReduce scheme and Hadoop, its open‐source implementation. We identify several libraries and software projects that have been developed for aiding practitioners to address this new programming model. We also analyze the advantages and disadvantages of MapReduce, in contrast to the classical solutions in this field. Finally, we present a number of programming frameworks that have been proposed as an alternative to MapReduce, developed under the premise of solving the shortcomings of this model in certain scenarios and platforms. WIREs Data Mining Knowl Discov 2014, 4:380–409. doi: 10.1002/widm.1134

Fuzzy Sets and Systems | 2015

Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data

Victoria López; Sara del Río; José Manuel Benítez; Francisco Herrera

Classification with big data has become one of the latest trends when talking about learning from the available information. The data growth in recent years has driven up the interest in effectively acquiring knowledge to analyze and predict trends. The variety and veracity that are related to big data introduce a degree of uncertainty that has to be handled in addition to the volume and velocity requirements. This data usually also presents what is known as the problem of classification with imbalanced datasets, a class distribution where the most important concepts to be learned are represented by a negligible number of examples in relation to the number of examples from the other classes. In order to adequately deal with imbalanced big data we propose the Chi-FRBCS-BigDataCS algorithm, a fuzzy rule based classification system that is able to deal with the uncertainty that is introduced in large volumes of data without disregarding the learning in the underrepresented class. The method uses the MapReduce framework to distribute the computational operations of the fuzzy model while it includes cost-sensitive learning techniques in its design to address the imbalance that is present in the data. The good performance of this approach is supported by the experimental analysis that is carried out over twenty-four imbalanced big data case studies. The results obtained show that the proposal is able to handle these problems, obtaining competitive results both in the classification performance of the model and the time needed for the computation.

International Journal of Computational Intelligence Systems | 2015

A MapReduce Approach to Address Big Data Classification Problems Based on the Fusion of Linguistic Fuzzy Rules

Sara del Río; Victoria López; José Manuel Benítez; Francisco Herrera

The big data term is used to describe the exponential data growth that has recently occurred and represents an immense challenge for traditional learning techniques. To deal with big data classification problems we propose the Chi-FRBCS-BigData algorithm, a linguistic fuzzy rule-based classification system that uses the MapReduce framework to learn and fuse rule bases. It has been developed in two versions with different fusion processes. An experimental study is carried out and the results obtained show that the proposal is able to handle these problems providing competitive results.

Mathematical Problems in Engineering | 2015

Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

Daniel Peralta; Sara del Río; Sergio Ramírez-Gallego; Isaac Triguero; José Manuel Benítez; Francisco Herrera

Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim at eliminating noisy, redundant, or irrelevant features that may deteriorate the classification performance. However, traditional methods lack enough scalability to cope with datasets of millions of instances and extract successful results in a delimited time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances to learn from them in the map phase; then, the reduce phase merges the obtained partial results into a final vector of feature weights, which allows a flexible application of the feature selection procedure using a threshold to determine the selected subset of features. The feature selection method is evaluated by using three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets of up to 67 million instances and up to 2000 attributes have been managed, showing that this is a suitable framework to perform evolutionary feature selection, improving both the classification accuracy and its runtime when dealing with big data problems.
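
The reduce and threshold steps described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: each mapper is assumed to return a binary mask (1 = feature kept), the reduce step is assumed to average those masks, and the hard-coded masks stand in for the output of an evolutionary search that is not shown here. The names `reduce_feature_votes` and `select_features` are hypothetical.

```python
def reduce_feature_votes(partial_masks):
    """Hypothetical reduce step: average the per-mapper binary masks
    into a single weight per feature, in the range [0, 1]."""
    n_maps = len(partial_masks)
    n_features = len(partial_masks[0])
    return [
        sum(mask[j] for mask in partial_masks) / n_maps
        for j in range(n_features)
    ]

def select_features(weights, threshold):
    """Keep the indices of features whose weight reaches the threshold."""
    return [j for j, w in enumerate(weights) if w >= threshold]

# Stand-in for the map phase: one binary mask per data block
# (in the paper, each mask would come from an evolutionary search).
partial_masks = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]

weights = reduce_feature_votes(partial_masks)
selected = select_features(weights, threshold=0.5)
print(weights)   # [1.0, 0.25, 0.75, 0.0]
print(selected)  # [0, 2]
```

Because the weights are computed once, the threshold can be changed afterwards without repeating the map phase, which is the flexibility the abstract refers to.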

BioMed Research International | 2015

An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species

Deborah Galpert; Sara del Río; Francisco Herrera; Evys Ancede-Gallardo; Agostinho Antunes; Guillermin Agüero-Chapin

Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification.

IEEE International Conference on Fuzzy Systems | 2014

On the use of MapReduce to build linguistic fuzzy rule based classification systems for big data

Victoria López; Sara del Río; José Manuel Benítez; Francisco Herrera

Big data has become one of the emergent topics when learning from data is involved. The notable increase in data generation has directed attention towards obtaining effective models that are able to analyze and extract knowledge from these colossal data sources. However, the vast amount of data, the variety of the sources and the need for an immediate intelligent response pose a critical challenge to traditional learning algorithms. To be able to deal with big data, we propose the usage of a linguistic fuzzy rule based classification system, which we have called Chi-FRBCS-BigData. As a fuzzy method, it is able to deal with the uncertainty that is inherent to the variety and veracity of big data, and because of the usage of linguistic fuzzy rules it is able to provide an interpretable and effective classification model. This method is based on the MapReduce framework, one of the most popular approaches for big data nowadays, and has been developed in two different versions: Chi-FRBCS-BigData-Max and Chi-FRBCS-BigData-Ave. The good performance of the Chi-FRBCS-BigData approach is supported by means of an experimental study over six big data problems. The results show that the proposal is able to provide competitive results, obtaining more precise but slower models in the Chi-FRBCS-BigData-Ave alternative and faster but less accurate classification results for Chi-FRBCS-BigData-Max.
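
The abstract names two variants, Max and Ave, without detailing how each fuses the rule bases learned by the mappers. The sketch below shows one plausible reading and should be treated as an assumption rather than the published procedure: for rules that share an antecedent, `fuse_max` keeps the rule with the highest weight, while `fuse_ave` averages the weights per class and keeps the class with the highest average. The rule encoding `(antecedent, label, weight)` and both function names are hypothetical.

```python
from collections import defaultdict

def fuse_max(rule_bases):
    """Assumed 'Max' fusion: per antecedent, keep the highest-weight rule."""
    best = {}
    for rule_base in rule_bases:
        for antecedent, label, weight in rule_base:
            if antecedent not in best or weight > best[antecedent][1]:
                best[antecedent] = (label, weight)
    return best

def fuse_ave(rule_bases):
    """Assumed 'Ave' fusion: per antecedent, average the weights of each
    class and keep the class with the highest average weight."""
    grouped = defaultdict(lambda: defaultdict(list))
    for rule_base in rule_bases:
        for antecedent, label, weight in rule_base:
            grouped[antecedent][label].append(weight)
    fused = {}
    for antecedent, per_class in grouped.items():
        averages = {label: sum(ws) / len(ws) for label, ws in per_class.items()}
        label = max(averages, key=averages.get)
        fused[antecedent] = (label, averages[label])
    return fused

# One rule base per mapper; the antecedent is a tuple of linguistic labels.
rule_bases = [
    [(("low", "high"), "A", 1.0)],
    [(("low", "high"), "A", 0.25), (("high", "low"), "A", 0.5)],
    [(("low", "high"), "B", 0.75)],
]

print(fuse_max(rule_bases)[("low", "high")])  # ('A', 1.0)
print(fuse_ave(rule_bases)[("low", "high")])  # ('B', 0.75)
```

In this toy case the two strategies disagree: a single strong rule wins under Max, while Ave penalises class A for its weak rule from the second mapper. Ave also has to collect every weight before deciding, which is consistent with the abstract's remark that it is the slower variant.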

IEEE International Conference on Fuzzy Systems | 2016

A First Approach in Evolutionary Fuzzy Systems based on the lateral tuning of the linguistic labels for Big Data classification

Alberto Fernández; Sara del Río; Francisco Herrera

The treatment and processing of Big Data problems imply an essential advantage for researchers and corporations. This is due to the huge quantity of knowledge that is hidden within the vast amount of information that is available nowadays. In order to be able to deal with such a volume of information in an efficient way, the scalability for Big Data applications is achieved by means of the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose result is “assembled” to provide a single solution. Focusing on classification tasks, Fuzzy Rule Based Classification Systems have shown interesting results with a MapReduce approach for Big Data. It is well known that the behaviour of these types of systems can be further improved in synergy with Evolutionary Algorithms, leading to Evolutionary Fuzzy Systems. However, to the best of our knowledge there are no developments in this field yet. In this work, we propose a first Evolutionary Fuzzy System for Big Data problems. It consists of an initial Knowledge Base built by means of the Chi-FRBCS-BigData algorithm, followed by a genetic tuning of the Data Base by means of the 2-tuples representation. This way, the fuzzy labels will be better contextualized within every subset of the problem, and the coverage of the Rule Base will be enhanced. Then, the Knowledge Bases from each Map process are joined to build an ensemble classifier. Experimental results show the improvement achieved by this model with respect to the standard Chi-FRBCS-BigData approach, and open the way for promising future work on the topic.

Advanced Data Analysis and Classification | 2017

Fuzzy rule based classification systems for big data with MapReduce: granularity analysis

Alberto Fernández; Sara del Río; Abdullah Bawakid; Francisco Herrera

Due to the vast amount of information available nowadays, and the advantages related to the processing of this data, the topics of big data and data science have acquired great importance in current research. Big data applications are mainly about scalability, which can be achieved via the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose result is “assembled” to provide a single solution. Among different classification paradigms adapted to this new framework, fuzzy rule based classification systems have shown interesting results with a MapReduce approach for big data. It is well known that the performance of these types of systems has a strong dependence on the selection of a good granularity level for the Data Base. However, in the context of MapReduce this parameter is even harder to determine as it can also be related to the number of Maps chosen for the processing stage. In this paper, we aim at analyzing the interrelation between the number of labels of the fuzzy variables and the scarcity of the data due to the data sampling in MapReduce. Specifically, we consider that as the partitioning of the initial instance set grows, the level of granularity necessary to achieve a good performance also becomes higher. The experimental results, carried out for several Big Data problems, and using the Chi-FRBCS-BigData algorithms, support our claims.

Knowledge Based Systems | 2015