Rafael Pino-Mejías
University of Seville
Publications
Featured research published by Rafael Pino-Mejías.
Expert Systems With Applications | 2013
Antonio Muñoz Blanco; Rafael Pino-Mejías; Juan Lara; Salvador Rayo
Credit scoring systems are currently in common use by numerous financial institutions worldwide. However, credit scoring within the microfinance industry is a relatively recent application, and no model which employs a non-parametric statistical technique has yet, to the best of our knowledge, been published. This lack is surprising since the implementation of credit scoring should contribute towards the efficiency of microfinance institutions, thereby improving their competitiveness in an increasingly constrained environment. This paper builds several non-parametric credit scoring models based on the multilayer perceptron approach (MLP) and benchmarks their performance against other models which employ the traditional linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression (LR) techniques. Based on a sample of almost 5500 borrowers from a Peruvian microfinance institution, the results reveal that neural network models outperform the other three classic techniques both in terms of the area under the receiver operating characteristic curve (AUC) and misclassification costs.
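As a rough illustration of the benchmark described above, the following R sketch fits the four classifiers and compares their test-set AUC values. The data frame loans, its two-level factor default, the 70/30 split and the network settings are assumptions for illustration, not the paper's actual configuration.

# Minimal sketch: LDA, QDA, logistic regression and an MLP scored by AUC.
# 'loans' and its two-level factor 'default' are assumed to exist.
library(MASS)   # lda(), qda()
library(nnet)   # nnet() multilayer perceptron
library(pROC)   # roc(), auc()

set.seed(1)
idx   <- sample(nrow(loans), floor(0.7 * nrow(loans)))
train <- loans[idx, ]
test  <- loans[-idx, ]

fit_lda <- lda(default ~ ., data = train)
fit_qda <- qda(default ~ ., data = train)
fit_lr  <- glm(default ~ ., data = train, family = binomial)
fit_mlp <- nnet(default ~ ., data = train, size = 5, decay = 0.01,
                maxit = 500, trace = FALSE)

p_lda <- predict(fit_lda, test)$posterior[, 2]
p_qda <- predict(fit_qda, test)$posterior[, 2]
p_lr  <- predict(fit_lr, test, type = "response")
p_mlp <- as.numeric(predict(fit_mlp, test, type = "raw"))

# Higher AUC indicates better discrimination between defaulters and payers
sapply(list(LDA = p_lda, QDA = p_qda, LR = p_lr, MLP = p_mlp),
       function(p) as.numeric(auc(roc(test$default, p))))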
Neural Networks | 2011
Esther-Lydia Silva-Ramírez; Rafael Pino-Mejías; Manuel López-Coello; María-Dolores Cubiles-de-la-Vega
Data mining is based on data files which usually contain errors in the form of missing values. This paper focuses on a methodological framework for the development of an automated data imputation model based on artificial neural networks. Fifteen real and simulated data sets, ranging in size from 47 to 1389 records, are exposed to a perturbation experiment based on the random generation of missing values, with the probability of a missing value set to 0.05. Several architectures and learning algorithms for the multilayer perceptron are tested and compared with three classic imputation procedures: mean/mode imputation, regression and hot-deck. The obtained results, considering different performance measures, suggest not only that this approach improves the quality of a database with missing values, but also that the best results are clearly obtained with the multilayer perceptron model on data sets containing categorical variables. Three learning rules (Levenberg-Marquardt, BFGS quasi-Newton and the Fletcher-Reeves conjugate gradient update) and a small number of hidden nodes are recommended.
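A minimal R sketch of the perturbation step and of the mean/mode baseline might look as follows; the complete data frame dat is an assumption, and the MLP, regression and hot-deck imputers evaluated in the paper are not reproduced here.

# Delete each value independently with probability 0.05, then re-estimate
# numeric columns by their mean and categorical columns by their mode.
set.seed(1)
perturb <- function(df, p = 0.05) {
  as.data.frame(lapply(df, function(col) {
    col[runif(length(col)) < p] <- NA
    col
  }))
}

impute_mean_mode <- function(df) {
  as.data.frame(lapply(df, function(col) {
    if (is.numeric(col)) {
      col[is.na(col)] <- mean(col, na.rm = TRUE)
    } else {
      tab <- table(col)
      col[is.na(col)] <- names(tab)[which.max(tab)]
    }
    col
  }))
}

dat_missing <- perturb(dat)        # 'dat' is an assumed complete data set
dat_imputed <- impute_mean_mode(dat_missing)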
Environmental Modelling and Software | 2010
Rafael Pino-Mejías; María Dolores Cubiles-de-la-Vega; María Anaya-Romero; Antonio Pascual-Acosta; Antonio Jordán-López; Nicolás Bellinfante-Crocci
Oak forests are essential for the ecosystems of many countries, particularly when they are used in vegetal restoration. Therefore, models for predicting the potential habitat of oaks can be a valuable tool for environmental work. In accordance with this objective, the building and comparison of data mining models are presented for the prediction of potential habitats for the oak forest type in Mediterranean areas (southern Spain), with conclusions applicable to other regions. Thirty-one environmental input variables were measured and six base models for supervised classification problems were selected: linear and quadratic discriminant analysis, logistic regression, classification trees, neural networks and support vector machines. Three ensemble methods, based on the combination of classification tree models fitted from samples and sets of variables generated from the original data set, were also evaluated: bagging, random forests and boosting. The available data set was randomly split into three parts: training set (50%), validation set (25%) and test set (25%). The analysis of the accuracy, the sensitivity and the specificity, together with the area under the ROC curve for the test set, reveals that the best models for our oak data set are those of bagging and random forests. All of these models can be fitted with free R programs which use the libraries and functions described in this paper. Furthermore, the methodology used in this study will allow researchers to determine the potential distribution of oaks in other kinds of areas.
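The two models singled out above could be fitted along these lines in R; the data frame oaks, its two-level factor presence, the tree counts and the use of the test set for the ROC comparison are assumptions (the validation set would be reserved for model selection).

library(randomForest)  # randomForest()
library(ipred)         # bagging() of classification trees
library(pROC)

set.seed(1)
n   <- nrow(oaks)
grp <- sample(c("train", "valid", "test"), n, replace = TRUE,
              prob = c(0.50, 0.25, 0.25))
train <- oaks[grp == "train", ]
test  <- oaks[grp == "test", ]

fit_rf  <- randomForest(presence ~ ., data = train, ntree = 500)
fit_bag <- bagging(presence ~ ., data = train, nbagg = 50)

p_rf  <- predict(fit_rf,  test, type = "prob")[, 2]
p_bag <- predict(fit_bag, test, type = "prob")[, 2]
c(rf  = as.numeric(auc(roc(test$presence, p_rf))),
  bag = as.numeric(auc(roc(test$presence, p_bag))))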
Applied Soft Computing | 2015
Esther-Lydia Silva-Ramírez; Rafael Pino-Mejías; Manuel López-Coello
Highlights: imputation of data for monotone patterns of missing values; an estimation model of missing data based on a multilayer perceptron; combination of a neural network and k-nearest-neighbour-based multiple imputation; comparison of the performance of the proposed models with three classic single imputation procedures (mean/mode, regression and hot-deck).

The knowledge discovery process is supported by data files gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models based on artificial neural networks for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of a multilayer perceptron and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of monotone missing-data patterns. An empirical test was carried out on these data sets covering both approaches (single and multiple imputation), together with three classical single imputation procedures: mean/mode imputation, regression and hot-deck. The experiments therefore involved five imputation methods. The results, considering different performance measures, demonstrate that, in comparison with traditional tools, both proposals improve the automation level and data quality, offering satisfactory performance.
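A hedged sketch of the monotone perturbation step is given below, followed by a k-nearest-neighbour imputation as a simple stand-in for the combined MLP/kNN multiple imputation proposed in the paper; dat, the row-wise missingness probability and k = 5 are assumptions.

library(VIM)   # kNN() imputation

set.seed(1)
# Monotone pattern: once a row becomes missing at some column, it stays
# missing for all later columns.
make_monotone_missing <- function(df, p = 0.05) {
  for (i in seq_len(nrow(df))) {
    if (runif(1) < p) {
      start <- sample(seq_len(ncol(df)), 1)
      df[i, start:ncol(df)] <- NA
    }
  }
  df
}

dat_mono    <- make_monotone_missing(dat)   # 'dat' is an assumed complete data set
dat_imputed <- kNN(dat_mono, k = 5, imp_var = FALSE)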
Science of The Total Environment | 2014
Álvaro Gómez-Losada; Antonio Lozano-García; Rafael Pino-Mejías; Juan Contreras-González
BACKGROUND: Existing air quality monitoring programs are, on occasion, not updated according to local, varying conditions, and as such they become non-informative over time, under-detecting new sources of pollutants or duplicating information. Furthermore, inadequate maintenance may render the monitoring equipment utterly deficient in providing information. To deal with these issues, a combination of formal statistical methods is used to optimize monitoring resources and to characterize the monitoring networks, introducing new criteria for their refinement.

METHODS: Monitoring data were obtained on key pollutants such as carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), particulate matter (PM10) and sulfur dioxide (SO2) from 12 air quality monitoring sites in Seville (Spain) during 2012. A total of 49 data sets were fitted to Gaussian mixture models using the expectation-maximization (EM) algorithm. To summarize these 49 models, the mean and coefficient of variation were calculated for each mixture, and a hierarchical clustering analysis (HCA) was carried out to study the grouping of the sites according to these statistics. To handle the lack of observational data from sites with unmonitored pollutants, the missing statistical values were imputed by applying the random forests technique, and a principal component analysis (PCA) was then carried out to better understand the relationship between the level of pollution and the classification of monitoring sites. All of the techniques were applied using free, open-source statistical software.

RESULTS AND CONCLUSION: One example of source attribution and contribution is analyzed using mixture models, and their potential for characterizing pollution trends is discussed. The mixture statistics have proven to be a fingerprint of each model; this work presents a novel use of them and represents a promising approach to characterizing mixture models in the air quality management discipline. The imputation technique allowed the missing information on key unmonitored pollutants to be estimated, providing information about unknown pollution levels and suggesting possible new monitoring configurations for this network. A posterior PCA confirmed the misclassification of one site detected with HCA. The authors consider the stepwise approach used in this work to be promising and applicable to other air monitoring network studies.
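The mixture-model step could be reproduced roughly as follows in R; the pollutant series no2, the choice of two components and the clustering call are illustrative assumptions rather than the paper's exact configuration.

library(mixtools)   # normalmixEM()

set.seed(1)
fit <- normalmixEM(no2, k = 2)     # 'no2' is an assumed concentration series

# Mean and coefficient of variation of the fitted Gaussian mixture
mix_mean <- sum(fit$lambda * fit$mu)
mix_var  <- sum(fit$lambda * (fit$sigma^2 + fit$mu^2)) - mix_mean^2
mix_cv   <- sqrt(mix_var) / mix_mean

# With one (mean, cv) pair per fitted model, sites can then be grouped by
# hierarchical clustering, e.g. hclust(dist(scale(site_stats)))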
Expert Systems With Applications | 2013
María-Dolores Cubiles-de-la-Vega; Antonio Blanco-Oliver; Rafael Pino-Mejías; Juan Lara-Rubio
A wide range of supervised classification algorithms have been successfully applied to credit scoring in non-microfinance environments according to recent literature. However, credit scoring in the microfinance industry is a relatively recent application, and current research is based, to the best of our knowledge, on classical statistical methods. This lack is surprising since the implementation of credit scoring based on supervised classification algorithms should contribute towards the efficiency of microfinance institutions, thereby improving their competitiveness in an increasingly constrained environment. This paper explores an extensive list of statistical learning techniques as microfinance credit scoring tools from an empirical viewpoint. A data set of microcredits belonging to a Peruvian microfinance institution is considered, and the following models are applied to discriminate between default and non-default credits: linear and quadratic discriminant analysis, logistic regression, multilayer perceptron, support vector machines, classification trees, and ensemble methods based on bagging and boosting algorithms. The obtained results suggest the use of a multilayer perceptron trained in the R statistical system with a second-order algorithm. Moreover, our findings show that, with the implementation of this MLP-based model, the MFIs' misclassification costs could be reduced by 13.7% with respect to the application of other classic models.
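One plausible way to train such a model in R is sketched below: nnet() optimises the network weights with BFGS, a quasi-Newton (second-order) method. The data frame micro, the class labels "non-default"/"default" (with "default" as the second factor level), the network size and the 5:1 cost ratio are all assumptions, not the paper's settings.

library(nnet)

set.seed(1)
idx   <- sample(nrow(micro), floor(0.7 * nrow(micro)))
train <- micro[idx, ]
test  <- micro[-idx, ]

fit  <- nnet(default ~ ., data = train, size = 10, decay = 0.1,
             maxit = 1000, trace = FALSE)
# With factor levels c("non-default", "default"), the single output
# estimates P(default).
prob <- as.numeric(predict(fit, test, type = "raw"))
pred <- ifelse(prob > 0.5, "default", "non-default")

# Asymmetric misclassification costs (the 5:1 ratio is purely illustrative):
fn <- sum(test$default == "default"     & pred == "non-default")  # missed defaults
fp <- sum(test$default == "non-default" & pred == "default")      # good credits rejected
5 * fn + 1 * fp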
Journal of Applied Statistics | 2008
Rafael Pino-Mejías; Mercedes Carrasco-Mairena; Antonio Pascual-Acosta; María-Dolores Cubiles-de-la-Vega; Joaquín Muñoz-García
The main models of machine learning are briefly reviewed and considered for building a classifier to identify Fragile X Syndrome (FXS). We have analyzed 172 patients potentially affected by FXS in Andalusia (Spain) and, by means of a DNA test, each member of the data set is known to belong to one of two classes: affected or not affected. The whole predictor set, formed by 40 variables, and a reduced set with only nine predictors significantly associated with the response are considered. Four alternative base classification models have been investigated: logistic regression, classification trees, multilayer perceptron and support vector machines. For both predictor sets, the best accuracy, considering both the mean and the standard deviation of the test error rate, is achieved by the support vector machines, confirming the increasing importance of this learning algorithm. Three ensemble methods – bagging, random forests and boosting – were also considered, amongst which the bagged versions of support vector machines stand out, especially when they are constructed with the reduced set of predictor variables. The analysis of the sensitivity, the specificity and the area under the ROC curve agrees with the main conclusions extracted from the accuracy results. All of these models can be fitted by free R programs.
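A bagged support vector machine of the kind that stood out above can be sketched as follows; the data frames train_fxs and test_fxs, the two-level factor affected, the radial kernel and B = 25 bootstrap replicates are assumptions.

library(e1071)   # svm()

set.seed(1)
bagged_svm <- function(train, test, B = 25) {
  votes <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]   # bootstrap sample
    fit  <- svm(affected ~ ., data = boot, kernel = "radial")
    as.character(predict(fit, test))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}

pred <- bagged_svm(train_fxs, test_fxs)
mean(pred != test_fxs$affected)    # test error rate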
Journal of Applied Statistics | 1997
Joaquín Muñoz-García; Rafael Pino-Mejías; J. M. Muñoz-Pichardo; María-Dolores Cubiles-de-la-Vega
We define a variation of Efron's method II based on the concept of the outlier bootstrap sample. A criterion for the identification of such samples is given, with which a variation of the bootstrap sample generation algorithm is introduced. The results of several simulations are analyzed, in which a higher degree of closeness to the estimated quantities can be observed in comparison with Efron's method II.
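The abstract does not reproduce the identification criterion itself, so the sketch below only illustrates the general scheme: ordinary bootstrap resamples are screened and regenerated whenever they are flagged. The flagging rule used here (a resample mean far from the observed mean) is a hypothetical stand-in, not the paper's criterion.

set.seed(1)
bootstrap_filtered <- function(x, B = 1000, stat = mean) {
  # Placeholder rule for an "outlier bootstrap sample"; purely illustrative
  flag_outlier <- function(xb) abs(mean(xb) - mean(x)) > 3 * sd(x) / sqrt(length(x))
  replicate(B, {
    xb <- sample(x, replace = TRUE)
    while (flag_outlier(xb)) xb <- sample(x, replace = TRUE)  # regenerate if flagged
    stat(xb)
  })
}

boot_means <- bootstrap_filtered(rnorm(50, mean = 10, sd = 2))
quantile(boot_means, c(0.025, 0.975))   # bootstrap percentile interval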
Statistics & Probability Letters | 2003
M.D. Jiménez-Gamero; Joaquín Muñoz-García; Rafael Pino-Mejías
Let X1, X2, ..., Xn be independent random vectors with common distribution function F, and let {F(theta)} be a parametric family of distributions. Let Tn(theta) = Tn(X1, X2, ..., Xn; theta) be a degree-2 V-statistic and let theta_hat be a consistent estimator of theta. Several test statistics for testing the composite null hypothesis that F belongs to {F(theta)} have the form Tn(theta_hat). Typically, the null distribution of Tn(theta_hat) depends on the unknown value of theta. The purpose of this paper is to show that the bootstrap can be used to approximate the null distribution of this type of statistic. We also give similar results for statistics Wn(theta_hat), with Wn(theta) a degree-2 U-statistic.
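A parametric-bootstrap approximation of such a null distribution can be sketched in R as follows; the kernel h, the normal parametric family and the sample sizes are illustrative choices, not taken from the paper.

set.seed(1)
# Toy degree-2 V-statistic Tn(theta) = n^-2 * sum_i sum_j h(Xi, Xj; theta)
h  <- function(x, y, mu, s) (pnorm(x, mu, s) - 0.5) * (pnorm(y, mu, s) - 0.5)
Tn <- function(x, mu, s) sum(outer(x, x, h, mu = mu, s = s)) / length(x)^2

x     <- rnorm(100, mean = 5, sd = 2)    # observed sample (simulated here)
theta <- c(mean(x), sd(x))               # consistent estimator of theta
t_obs <- Tn(x, theta[1], theta[2])

# Parametric bootstrap: resample from F_thetahat, re-estimate, recompute Tn
t_boot <- replicate(500, {
  xb <- rnorm(length(x), theta[1], theta[2])
  Tn(xb, mean(xb), sd(xb))
})
mean(t_boot >= t_obs)   # bootstrap p-value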
Lecture Notes in Computer Science | 2004
Rafael Pino-Mejías; María-Dolores Cubiles-de-la-Vega; Manuel López-Coello; Esther-Lydia Silva-Ramírez; M.D. Jiménez-Gamero
Bagging is an ensemble method proposed to improve the predictive performance of learning algorithms, being especially effective when applied to unstable predictors. It is based on the aggregation of a certain number of prediction models, each one generated from a bootstrap sample of the available training set. We introduce an alternative method for bagging classification models, motivated by the reduced bootstrap methodology, where the generated bootstrap samples are forced to have a number of distinct original observations between two values k1 and k2. Five choices for k1 and k2 are considered, and the five resulting models are empirically studied and compared with bagging on three real data sets, employing classification trees and neural networks as the base learners. This comparison reveals a tendency of this reduced bagging technique to diminish both the mean and the variance of the error rate.
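The constrained resampling step could be implemented along these lines; the data frames train_df and test_df, the factor class, and the 55-60% range of distinct observations are assumptions (an unconstrained bootstrap sample contains about 63.2% distinct observations on average).

library(rpart)   # classification trees as the base learner

set.seed(1)
reduced_bagging <- function(train, test, k1, k2, B = 50) {
  n     <- nrow(train)
  votes <- replicate(B, {
    repeat {   # regenerate until the number of distinct observations is in [k1, k2]
      idx <- sample(n, replace = TRUE)
      d   <- length(unique(idx))
      if (d >= k1 && d <= k2) break
    }
    fit <- rpart(class ~ ., data = train[idx, ], method = "class")
    as.character(predict(fit, test, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote
}

pred <- reduced_bagging(train_df, test_df,
                        k1 = floor(0.55 * nrow(train_df)),
                        k2 = floor(0.60 * nrow(train_df)))
mean(pred != test_df$class)   # test error rate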