Verónica Bolón-Canedo

Publication

Featured research published by Verónica Bolón-Canedo.

Knowledge and Information Systems | 2013

A review of feature selection methods on synthetic data

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

With the advent of high dimensionality, adequate identification of relevant features of the data has become indispensable in real-world scenarios. In this context, the importance of feature selection is beyond doubt and different methods have been developed. However, with such a vast body of algorithms available, choosing an adequate feature selection method is not easy, and it is necessary to check their effectiveness in different situations. Assessing which features are relevant is difficult in real datasets, so an interesting option is to use artificial data. In this paper, several synthetic datasets are employed for this purpose, aiming at reviewing the performance of feature selection methods in the presence of a growing number of irrelevant features, noise in the data, redundancy and interaction between attributes, as well as a small ratio between number of samples and number of features. Seven filters, two embedded methods, and two wrappers are applied over eleven synthetic datasets, tested by four classifiers, so as to be able to choose a robust method, paving the way for its application to real datasets.

Information Sciences | 2014

A review of microarray datasets and applied feature selection methods

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos; José Manuel Benítez; Francisco Herrera

Microarray data classification is a difficult challenge for machine learning researchers due to its high number of features and small sample sizes. Feature selection was soon considered a de facto standard in this field after its introduction, and a huge number of feature selection methods have been used to reduce the input dimensionality while improving the classification performance. This paper is devoted to reviewing the most up-to-date feature selection methods developed in this field and the microarray databases most frequently used in the literature. We also make the interested reader aware of the problematic data characteristics in this domain, such as the imbalance of the data, their complexity, or the so-called dataset shift. Finally, an experimental evaluation on the most representative datasets using well-known feature selection methods is presented, bearing in mind that the aim is not to provide the best feature selection method, but to facilitate their comparative study by the research community.

Expert Systems With Applications | 2011

Feature selection and classification in multiple class datasets

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

Research highlights: A combination of discretizers, filters and classifiers is presented. This combination is applied to binary and multiple class classification problems. Its performance is compared to the results of the KDD Cup winner and other methods. It achieves better performance while significantly reducing the number of features.

In this work, a new method consisting of a combination of discretizers, filters and classifiers is presented. Its aim is to improve the performance of classifiers while using a significantly reduced set of features. The method has been applied to a binary and to a multiple class classification problem. Specifically, the KDD Cup 99 benchmark was used for testing its effectiveness. A comparative study with other methods and the KDD winner was carried out. The results obtained showed the adequacy of the proposed method, achieving better performance in most cases while reducing the number of features by more than 80%.

Pattern Recognition | 2012

An ensemble of filters and classifiers for microarray data classification

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

In this paper a new framework for feature selection consisting of an ensemble of filters and classifiers is described. Five filters, based on different metrics, were employed. Each filter selects a different subset of features which is used to train and to test a specific classifier. The outputs of these five classifiers are combined by simple voting. In this study three well-known classifiers were employed for the classification task: C4.5, naive-Bayes and IB1. The rationale of the ensemble is to reduce the variability of the features selected by filters in different classification domains. Its adequacy was demonstrated by employing 10 microarray data sets.
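The voting scheme described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes binary labels, uses three toy scoring functions (`mean_gap`, `fisher_score`, `abs_correlation`) as hypothetical stand-ins for the paper's five filters, and uses a 1-nearest-neighbour rule in place of the C4.5, naive-Bayes and IB1 classifiers. All names and the toy data are assumptions.

```python
from collections import Counter

def class_means(X, y, j):
    """Mean of feature j within class 0 and within class 1."""
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    return sum(a) / len(a), sum(b) / len(b)

def mean_gap(X, y, j):
    # Toy filter 1: absolute difference between the two class means.
    m0, m1 = class_means(X, y, j)
    return abs(m1 - m0)

def fisher_score(X, y, j):
    # Toy filter 2: squared mean gap divided by the within-class scatter.
    m0, m1 = class_means(X, y, j)
    s0 = sum((row[j] - m0) ** 2 for row, label in zip(X, y) if label == 0)
    s1 = sum((row[j] - m1) ** 2 for row, label in zip(X, y) if label == 1)
    return (m1 - m0) ** 2 / (s0 + s1 + 1e-12)

def abs_correlation(X, y, j):
    # Toy filter 3: absolute Pearson correlation between feature j and the label.
    n = len(y)
    mx = sum(row[j] for row in X) / n
    my = sum(y) / n
    cov = sum((row[j] - mx) * (label - my) for row, label in zip(X, y))
    sx = sum((row[j] - mx) ** 2 for row in X) ** 0.5
    sy = sum((label - my) ** 2 for label in y) ** 0.5
    return abs(cov) / (sx * sy + 1e-12)

def select_top_k(score, X, y, k):
    """Indices of the k features ranked highest by the given filter score."""
    return sorted(range(len(X[0])), key=lambda j: -score(X, y, j))[:k]

def one_nn(X, y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    nearest = min(range(len(X)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    return y[nearest]

def ensemble_predict(X, y, x_new, scores, k):
    """Each filter picks its own subset, one classifier votes per subset."""
    votes = []
    for score in scores:
        idx = select_top_k(score, X, y, k)
        X_sub = [[row[j] for j in idx] for row in X]
        votes.append(one_nn(X_sub, y, [x_new[j] for j in idx]))
    return Counter(votes).most_common(1)[0][0]  # simple majority vote

# Toy data: only feature 0 separates the two classes.
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 3.0], [0.1, 3.0, 2.0],
     [1.0, 4.0, 1.5], [0.9, 2.0, 2.5], [1.1, 3.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]
FILTERS = [mean_gap, fisher_score, abs_correlation]
prediction = ensemble_predict(X, y, [0.95, 1.0, 3.0], FILTERS, k=1)
```

Using an odd number of voters avoids ties in the binary case; a real ensemble would pair each filter with a trained classifier rather than a shared nearest-neighbour rule.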

Knowledge Based Systems | 2015

Recent advances and emerging challenges of feature selection in the context of big data

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

Highlights: The explosion of big data has posed important challenges to researchers. Feature selection is paramount when dealing with high-dimensional datasets. We review the state-of-the-art and recent contributions in feature selection. The emerging challenges in feature selection are identified and discussed.

In an era of growing data complexity and volume and the advent of big data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems. We discuss the origins and importance of feature selection and outline recent contributions in a range of applications, from DNA microarray analysis to face recognition. Recent years have witnessed the creation of vast datasets and it seems clear that these will only continue to grow in size and number. This new big data scenario offers both opportunities and challenges to feature selection researchers, as there is a growing need for scalable yet efficient feature selection methods, given that existing methods are likely to prove inadequate.

Applied Soft Computing | 2015

Distributed feature selection

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

Highlights: Feature selection is indispensable when dealing with microarray data. A new method for distributing the filtering process is proposed. The data is distributed by features and then merged in a final subset. The method is tested on 8 microarray datasets. The classification accuracy is maintained and the time considerably shortened.

Feature selection is often required as a preliminary step for many pattern recognition problems. However, most of the existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure which updates the feature subset according to improvements in the classification accuracy. The effectiveness of our proposal is tested on microarray data, which poses a difficult challenge for researchers due to the high number of gene expression values it contains and the small sample size. The results on eight microarray datasets show that the execution time is considerably shortened whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets.
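The vertical partitioning and accuracy-driven merging described in this abstract might look roughly like the sketch below. It is an assumption-laden illustration, not the published method: the round-robin partition, the `class_mean_gap` filter, the leave-one-out 1-nearest-neighbour accuracy estimate, and the rule "accept a packet's features only if accuracy strictly improves" are all hypothetical choices made here for brevity.

```python
def partition_features(n_features, n_packets):
    """Split feature indices into disjoint packets (vertical distribution)."""
    return [list(range(p, n_features, n_packets)) for p in range(n_packets)]

def class_mean_gap(X, y, j):
    # Toy filter score for binary labels: gap between the two class means.
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def filter_packet(X, y, packet, keep):
    """Run the filter on one packet only and keep its best features."""
    return sorted(packet, key=lambda j: -class_mean_gap(X, y, j))[:keep]

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-NN rule restricted to the given features."""
    hits = 0
    for i, row in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: sum((row[j] - X[k][j]) ** 2 for j in features))
        hits += y[nearest] == y[i]
    return hits / len(X)

def distributed_selection(X, y, n_packets, keep):
    """Merge per-packet selections, keeping a packet only if accuracy improves."""
    selected, best_acc = [], 0.0
    for packet in partition_features(len(X[0]), n_packets):
        candidate = selected + filter_packet(X, y, packet, keep)
        acc = loo_accuracy(X, y, candidate)
        if acc > best_acc:  # assumed acceptance rule
            selected, best_acc = candidate, acc
    return selected, best_acc

# Toy data: four features, only feature 0 is informative.
X = [[0.0, 5.0, 1.0, 2.0], [0.2, 1.0, 3.0, 4.0], [0.1, 3.0, 2.0, 3.0],
     [1.0, 4.0, 1.5, 3.5], [0.9, 2.0, 2.5, 2.5], [1.1, 3.0, 2.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
subset, accuracy = distributed_selection(X, y, n_packets=2, keep=1)
```

Because each packet is filtered independently, the per-packet step could run on separate workers; only the merge needs to see the accumulated subset.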

Neurocomputing | 2014

Data classification using an ensemble of filters

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

Ensemble learning has been the focus of much attention, based on the assumption that combining the output of multiple experts is better than the output of any single expert. Many methods have been proposed of which bagging and boosting were the most popular. In this research, the idea of ensembling is adapted for feature selection. We propose an ensemble of filters for classification, aimed at achieving a good classification performance together with a reduction in the input dimensionality. With this approach, we try to overcome the problem of selecting an appropriate method for each problem at hand, as it is overly dependent on the characteristics of the datasets. The adequacy of using an ensemble of filters rather than a single filter was demonstrated on synthetic and real data, paving the way for its final application over a challenging scenario such as DNA microarray classification.

Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2016

Data discretization: taxonomy and big data challenge

Sergio Ramírez-Gallego; Salvador García; Héctor Mouriño-Talín; David Martínez-Rego; Verónica Bolón-Canedo; Amparo Alonso-Betanzos; José Manuel Benítez; Francisco Herrera

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories which are adequate for the learning task, retaining as much information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this article is twofold: first, a comprehensive taxonomy of discretization techniques is presented to help practitioners in the use of the algorithms; second, the article aims to demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well‐known discretizers based on Information Theory, obtaining better results than the one produced by: the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and it is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5–21. doi: 10.1002/widm.1173
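As a rough illustration of the entropy minimization idea mentioned in this abstract, the sketch below finds a single cut point that minimizes the class-weighted entropy of the two resulting intervals. It is a sequential toy version only: it omits the recursive splitting and the MDL stopping criterion of the full Fayyad and Irani discretizer, and it says nothing about the distributed Spark implementation. The function names are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_entropy, best_point = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best_entropy:
            # Place the cut midway between the two neighbouring values.
            best_entropy = weighted
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point, best_entropy

cut, cut_entropy = best_cut([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

On this toy input the two classes are perfectly separated, so the chosen cut falls midway between 3 and 10 and both intervals are pure.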

Pattern Recognition | 2014

A framework for cost-based feature selection

Verónica Bolón-Canedo; Iago Porto-Díaz; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

Over the last few years, the dimensionality of datasets involved in data mining applications has increased dramatically. In this situation, feature selection becomes indispensable as it allows for dimensionality reduction and relevance detection. The research proposed in this paper broadens the scope of feature selection by taking into consideration not only the relevance of the features but also their associated costs. A new general framework is proposed, which consists of adding a new term to the evaluation function of a filter feature selection method so that the cost is taken into account. Although the proposed methodology could be applied to any feature selection filter, in this paper the approach is applied to two representative filter methods: Correlation-based Feature Selection (CFS) and Minimal-Redundancy-Maximal-Relevance (mRMR), as an example of use. The behavior of the proposed framework is tested on 17 heterogeneous classification datasets, employing a Support Vector Machine (SVM) as a classifier. The results of the experimental study show that the approach is sound and that it allows the user to reduce the cost without compromising the classification error.
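The idea of adding a cost term to a filter's evaluation function can be sketched with an mRMR-style greedy search. The exact form of the term used in the paper is not given in this abstract, so the subtractive penalty `lam * cost[j]`, the parameter name `lam`, and the precomputed `relevance` and `redundancy` inputs are all assumptions made for illustration.

```python
def cost_aware_greedy(relevance, redundancy, cost, n_select, lam):
    """Greedy selection maximizing relevance - mean redundancy - lam * cost.

    relevance[j]      relevance of feature j to the class (e.g. mutual information)
    redundancy[j][s]  redundancy between features j and s
    cost[j]           cost of acquiring feature j
    lam               assumed trade-off weight; lam = 0 ignores the cost
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < n_select:
        def score(j):
            if selected:
                red = sum(redundancy[j][s] for s in selected) / len(selected)
            else:
                red = 0.0
            return relevance[j] - red - lam * cost[j]
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy inputs: features 0 and 1 are both relevant and mutually redundant,
# but feature 0 is ten times more expensive to acquire.
relevance = [0.9, 0.85, 0.3]
redundancy = [[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]]
cost = [1.0, 0.1, 0.1]

without_cost = cost_aware_greedy(relevance, redundancy, cost, n_select=2, lam=0.0)
with_cost = cost_aware_greedy(relevance, redundancy, cost, n_select=2, lam=0.5)
```

With `lam=0.0` the search picks the expensive feature 0 first; with `lam=0.5` it prefers the cheaper, nearly as relevant feature 1, which is the kind of trade-off the framework is meant to expose.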

International Symposium on Neural Networks | 2009

A combination of discretization and filter methods for improving classification performance in KDD Cup 99 dataset

Verónica Bolón-Canedo; Noelia Sánchez-Maroño; Amparo Alonso-Betanzos

The KDD Cup 99 dataset is a classical challenge for computer intrusion detection as well as machine learning researchers. Due to the difficulties this dataset presents, several sophisticated machine learning algorithms have been tried by different authors. In this paper a new approach is proposed that consists of a combination of a discretizer, a filter method and a very simple classical classifier. The results obtained show the adequacy of the method, which achieves comparable or even better performance than other more complicated algorithms, but with a considerable reduction in the number of input features. The proposed method has also been tried on two other large datasets, maintaining the same behavior as on the KDD Cup 99 dataset.
