
Publication


Featured research published by Manuel Baena-García.


Intelligent Systems Design and Applications | 2011

TF-SIDF: Term frequency, sketched inverse document frequency

Manuel Baena-García; José M. Carmona-Cejudo; Gladys Castillo; Rafael Morales-Bueno

Exact calculation of the TF-IDF weighting function over massive streams of documents involves challenging memory space requirements. In this work, we propose TF-SIDF, a novel solution for extracting relevant words from streams of documents with a high number of terms. TF-SIDF relies on the Count-Min Sketch data structure, which makes it possible to estimate the counts of all terms in the stream. Results of experiments conducted with two datasets show that this sketch-based algorithm achieves good approximations of the TF-IDF weighting values (as a rule, the top terms with the highest TF-IDF values remain the same) while yielding substantial savings in memory usage. We also observe that performance is highly correlated with the sketch size and that, for a given sketch size, wider sketch configurations are preferable.
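
The abstract does not reproduce the exact TF-SIDF equations, but its core ingredient is well known: a Count-Min Sketch that approximates term counts in bounded memory. The following Python sketch illustrates that idea under simple assumptions (MD5-derived hash functions, sketched document frequencies feeding a TF-IDF-style weight); the class and parameter names are illustrative, not the paper's.

```python
import hashlib
import math

class CountMinSketch:
    """Approximate counting in bounded memory (width x depth counters)."""
    def __init__(self, width=4096, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.tables[row][col] += 1

    def estimate(self, item):
        # Count-Min never underestimates; take the minimum over rows.
        return min(self.tables[row][col] for row, col in self._buckets(item))

df_sketch = CountMinSketch()   # sketched document frequencies
n_docs = 0

def observe(doc_terms):
    """Update the sketched document frequencies with one document."""
    global n_docs
    n_docs += 1
    for term in set(doc_terms):          # document frequency: once per doc
        df_sketch.add(term)

def tf_sidf(doc_terms):
    """TF-IDF weights computed from sketched (approximate) frequencies."""
    weights = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term) / len(doc_terms)
        idf = math.log(n_docs / max(1, df_sketch.estimate(term)))
        weights[term] = tf * idf
    return weights

for doc in (["stream", "mining", "sketch"], ["stream", "tfidf"]):
    observe(doc)
print(tf_sidf(["stream", "mining", "sketch"]))
```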


European Conference on Artificial Intelligence | 2010

GNUsmail: Open Framework for On-line Email Classification

José M. Carmona-Cejudo; Manuel Baena-García; José del Campo-Ávila; Rafael Morales-Bueno; Albert Bifet

Real-time classification of massive email data is a challenging task that presents its own particular difficulties. Since email data has an important temporal component, several problems arise: emails arrive continuously, and the criteria used to classify them can change, so the learning algorithms have to be able to deal with concept drift. Our problem is more general than spam detection, which has received much more attention in the literature. In this paper we present GNUsmail, an open-source extensible framework for email classification whose structure supports incremental and online learning. The framework enables the incorporation of algorithms developed by other researchers, such as those included in WEKA and MOA. We evaluate this framework, characterized by two overlapping phases (preprocessing and learning), using the ENRON dataset, and we compare the results achieved by WEKA and MOA algorithms.
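
GNUsmail itself wraps learners from WEKA and MOA; as a standalone illustration of the incremental, test-then-train style of learning the framework supports, here is a minimal Python sketch with a toy incremental naive Bayes classifier. All names and the toy data are invented for the example.

```python
from collections import defaultdict
import math

class IncrementalNaiveBayes:
    """Tiny incremental multinomial naive Bayes for text streams."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.term_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def update(self, terms, label):
        self.class_counts[label] += 1
        for t in terms:
            self.term_counts[label][t] += 1
            self.vocab.add(t)

    def predict(self, terms):
        if not self.class_counts:
            return None                      # cold start: nothing learned yet
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, c in self.class_counts.items():
            lp = math.log(c / total)
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for t in terms:
                # Laplace smoothing keeps unseen terms from zeroing the score.
                lp += math.log((self.term_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Stream loop: predict first, then learn from the true folder (test-then-train).
stream = [("project meeting agenda".split(), "work"),
          ("family dinner sunday".split(), "personal"),
          ("meeting notes attached".split(), "work")]
model = IncrementalNaiveBayes()
for terms, folder in stream:
    print(model.predict(terms), "->", folder)
    model.update(terms, folder)
```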


The Journal of Medical Research | 2012

DB4US: A Decision Support System for Laboratory Information Management

José M. Carmona-Cejudo; Maria Luisa Hortas; Manuel Baena-García; Jorge Lana-Linati; Carlos A. González; Maximino Redondo; Rafael Morales-Bueno

Background: Until recently, laboratory automation has focused primarily on improving hardware. Future advances will concentrate on intelligent software, since laboratories performing clinical diagnostic testing require improved information systems to address their data processing needs. In this paper we propose DB4US, an application that automates the management of information related to laboratory quality indicators. Currently there is a lack of ready-to-use management quality measures. This application addresses that deficiency through the extraction, consolidation, statistical analysis, and visualization of data related to demographics, reagent use, and turn-around times. The design and implementation issues, as well as the technologies used to implement the system, are also discussed.

Objective: To develop a general methodology that integrates the computation of ready-to-use management quality measures with a dashboard for easily analyzing the overall performance of a laboratory and automatically detecting anomalies or errors. The novelty of our approach lies in the application of integrated web-based dashboards as an information management system in hospital laboratories.

Methods: We propose a new methodology for laboratory information management based on the extraction, consolidation, statistical analysis, and visualization of data related to demographics, reagents, and turn-around times, offering a dashboard-like web interface to the laboratory manager. The methodology comprises a unified data warehouse that stores and consolidates multidimensional data from different data sources. It is illustrated through the implementation and validation of DB4US, a novel web application based on this methodology that provides an interface for ready-to-use indicators and the ability to drill down from high-level metrics to more detailed summaries. The indicators are calculated beforehand, by a set of parallel precalculation processes, so that they are ready when the user needs them. The application displays information related to tests, requests, samples, and turn-around times, and the dashboard shows the whole set of indicators on a single screen.

Results: DB4US was first deployed in the Hospital Costa del Sol in 2008. Our evaluation shows the positive impact of this methodology for laboratory professionals: the application has reduced the time needed to produce the various statistical indicators and has provided information used to optimize laboratory resources through the discovery of anomalies in the indicators. DB4US users also benefit from Internet-based communication of results, since the information is available from any computer without installing additional software.

Conclusions: The proposed methodology and the accompanying web application, DB4US, automate the processing of information related to laboratory quality indicators and offer a novel approach to managing laboratory-related information, benefiting from an Internet-based communication mechanism. Applying this methodology has been shown to improve the use of time as well as of other laboratory resources.
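
The paper's schema and indicator definitions are not given in the abstract; the following Python sketch only illustrates the precalculation idea, computing a turn-around-time indicator per test type from hypothetical raw records so that a dashboard could read the results directly. Field names and values are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical raw records: (test_type, requested_at, reported_at).
records = [
    ("glucose",  "2008-03-01 08:05", "2008-03-01 09:40"),
    ("glucose",  "2008-03-01 08:30", "2008-03-01 11:10"),
    ("hemogram", "2008-03-01 09:00", "2008-03-01 09:55"),
]

def turnaround_minutes(requested, reported):
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(reported, fmt) - datetime.strptime(requested, fmt)
    return delta.total_seconds() / 60

# Consolidate: precompute per-test indicators ahead of time, so the
# dashboard only reads ready-to-use values instead of raw records.
by_test = defaultdict(list)
for test, req, rep in records:
    by_test[test].append(turnaround_minutes(req, rep))

indicators = {test: {"n": len(tats), "median_tat_min": median(tats)}
              for test, tats in by_test.items()}
print(indicators)
```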


Computational Intelligence and Data Mining | 2011

Feature extraction for multi-label learning in the domain of email classification

José M. Carmona-Cejudo; Manuel Baena-García; José del Campo-Ávila; Rafael Morales-Bueno

Multi-label learning is a very interesting field in machine learning. It makes it possible to generalise standard methods and evaluation procedures and to tackle challenging real-world problems where one example can be tagged with more than one label. In this paper we study the performance of different multi-label methods in combination with standard single-label algorithms, using several specific multi-label metrics. We show how a good preprocessing phase can improve the performance of such methods and algorithms; its main advantage is a shorter model-induction time, while other classification quality measures are maintained or even improved. We use the GNUsmail framework to preprocess an existing and extensively used dataset, obtaining a reduced feature space that preserves the relevant information and enables performance improvements. Thanks to the capabilities of GNUsmail, the preprocessing step can easily be applied to different email datasets.
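
As a rough illustration of the two ingredients discussed here, the sketch below combines a frequency-based feature space reduction with binary relevance, the simplest multi-label method (one independent yes/no model per label). The "model" is a trivial term-count scorer standing in for a real single-label base classifier, and all data and names are invented.

```python
from collections import Counter

# Toy multi-label emails: (terms, set of labels).
emails = [
    ("budget meeting finance".split(),  {"work", "finance"}),
    ("holiday plans beach".split(),     {"personal"}),
    ("finance report deadline".split(), {"work", "finance"}),
]

def top_k_terms(data, k):
    """Preprocessing: keep only the k most frequent terms (reduced feature space)."""
    counts = Counter(t for terms, _ in data for t in terms)
    return {t for t, _ in counts.most_common(k)}

def train_binary_relevance(data, features):
    """Binary relevance: one independent per-label model (here a term-count scorer)."""
    labels = {l for _, ls in data for l in ls}
    models = {}
    for label in labels:
        seen = Counter()
        for terms, ls in data:
            if label in ls:
                seen.update(t for t in terms if t in features)
        models[label] = seen
    return models

def predict(models, terms, threshold=1):
    # An email gets every label whose model scores it above the threshold.
    return {label for label, seen in models.items()
            if sum(seen[t] for t in terms) >= threshold}

features = top_k_terms(emails, k=5)
models = train_binary_relevance(emails, features)
print(predict(models, "finance meeting".split()))   # e.g. {'work', 'finance'}
```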


Intelligent Systems Design and Applications | 2011

A comparative study on feature selection and adaptive strategies for email foldering

José M. Carmona-Cejudo; Gladys Castillo; Manuel Baena-García; Rafael Morales-Bueno

Email foldering is a challenging problem, mainly due to its high dimensionality and dynamic nature. This work presents a comparison of several feature extraction/selection and adaptive strategies aimed at coping with these two main difficulties. To this end, several studies have been carried out on the ENRON email dataset to test how different configuration settings for the feature extraction/selection and adaptation processes affect classification performance.
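
Feature selection in this setting is typically driven by a per-term ranking score. As one common choice (not necessarily the one used in the paper), here is a small Python sketch that ranks terms by information gain with respect to the folder label; the toy data is invented.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG of the binary feature 'document contains term' w.r.t. the folder label."""
    n = len(labels)
    with_t  = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    cond = 0.0
    for subset in (with_t, without):
        if subset:
            cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

docs = ["meeting agenda".split(), "beach holiday".split(), "agenda minutes".split()]
labels = ["work", "personal", "work"]
vocab = {t for d in docs for t in d}
# Keep the highest-scoring terms as the reduced feature space.
ranking = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
print(ranking)
```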


Knowledge-Based Systems | 2013

A comparative study on feature selection and adaptive strategies for email foldering using the ABC-DynF framework

José M. Carmona-Cejudo; Gladys Castillo; Manuel Baena-García; Rafael Morales-Bueno

Email foldering is a challenging problem, mainly due to its high dimensionality and dynamic nature. This work presents ABC-DynF, an adaptive learning framework with a dynamic feature space that we use to compare several incremental and adaptive strategies for coping with these two difficulties. Several studies have been carried out using datasets from the ENRON email corpus and different configuration settings of the framework. The main aim is to study how feature ranking methods, concept drift monitoring, adaptive strategies, and the implementation of a dynamic feature space affect the performance of Bayesian email classification systems.
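
The abstract does not specify which drift monitor ABC-DynF uses internally; as one standard example of concept drift monitoring, here is a DDM-style detector (after Gama et al.) over a stream of 0/1 prediction errors, with the usual two- and three-standard-deviation warning and drift thresholds.

```python
import math

class DDMStyleMonitor:
    """DDM-style drift monitor: track the running error rate p and its std s,
    remember the minimum of p + s, and flag 'warning' at p_min + 2*s_min and
    'drift' at p_min + 3*s_min. A generic sketch, not ABC-DynF's monitor."""
    def __init__(self, warmup=30):
        self.warmup = warmup
        self.n = 0
        self.p = 0.0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def observe(self, error):
        self.n += 1
        self.p += (error - self.p) / self.n        # incremental mean of 0/1 errors
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n <= self.warmup:
            return "learning"
        if self.p + s < self.p_min + self.s_min:   # new best operating point
            self.p_min, self.s_min = self.p, s
        if self.p + s >= self.p_min + 3 * self.s_min:
            return "drift"                          # caller should retrain/adapt
        if self.p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"

stream = [1 if i % 10 == 0 else 0 for i in range(200)]  # ~10% error rate
stream += [i % 2 for i in range(80)]                    # jumps to ~50%: drift
monitor = DDMStyleMonitor()
for i, err in enumerate(stream):
    if monitor.observe(err) == "drift":
        print("drift detected at example", i)
        break
```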


Intelligent Data Analysis | 2011

Online evaluation of email streaming classifiers using GNUsmail

José M. Carmona-Cejudo; Manuel Baena-García; José del Campo-Ávila; Albert Bifet; João Gama; Rafael Morales-Bueno

Real-time email classification is a challenging task because of its online nature, which is subject to concept drift. Identifying spam, where only two labels exist, has received great attention in the literature. We are nevertheless interested in classification involving multiple folders, which is an additional source of complexity. Moreover, neither cross-validation nor other sampling procedures are suitable for evaluating data streams, so other metrics, like the prequential error, have been proposed. The prequential error, however, poses some problems of its own, which can be alleviated by mechanisms such as fading factors. In this paper we present GNUsmail, an open-source extensible framework for email classification, and focus on its ability to perform online evaluation. GNUsmail's architecture supports incremental and online learning, and it can be used to compare different online mining methods using state-of-the-art evaluation metrics. We show how GNUsmail can be used to compare different algorithms, including a tool for launching replicable experiments.
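
The prequential error with fading factors mentioned here has a standard recursive form (Gama et al.): E_i = S_i / B_i with S_i = L_i + α·S_{i−1} and B_i = 1 + α·B_{i−1}, so older losses are exponentially down-weighted. A minimal Python sketch:

```python
class FadingPrequentialError:
    """Prequential error with a fading factor:
    E_i = S_i / B_i,  S_i = L_i + a*S_{i-1},  B_i = 1 + a*B_{i-1}."""
    def __init__(self, alpha=0.99):
        self.alpha = alpha
        self.s = 0.0   # faded sum of losses
        self.b = 0.0   # faded count

    def update(self, loss):
        self.s = loss + self.alpha * self.s
        self.b = 1.0 + self.alpha * self.b
        return self.s / self.b

tracker = FadingPrequentialError(alpha=0.995)
losses = [1, 0, 0, 1, 0, 0, 0, 1]   # 0/1 losses from test-then-train predictions
for L in losses:
    err = tracker.update(L)
print(round(err, 3))                # recent mistakes weigh more than old ones
```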


Intelligent Systems Design and Applications | 2011

Online calculation of word-clouds for efficient label summarization

José M. Carmona-Cejudo; Manuel Baena-García; Gladys Castillo; Rafael Morales-Bueno

Large amounts of information are available on the Internet in the form of natural language text that can be processed as a stream of documents. Users need solutions that summarize this vast volume of data. Word clouds are a popular graphical representation that provides such a quick visual summary. Nevertheless, the exact solution to this problem has high memory requirements and does not scale as the collection grows. In this work, we provide a method for the approximate online computation of word clouds that summarize the contents of each label or category of a given text stream, using only the most relevant terms of each document according to some weighting function. We show experimentally that our method, based on sketching techniques, obtains good performance while using a restricted amount of memory.
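
The paper's exact sketching scheme is not reproduced here; the following simplified Python sketch keeps, per label, a bounded counter updated with only each document's strongest terms, trading exactness for memory in the same spirit. All names and parameters are illustrative.

```python
from collections import Counter

class LabelWordClouds:
    """Approximate per-label word clouds in bounded memory.

    Per document, only the m highest-frequency terms are kept (standing in
    for the paper's weighting function), and each label's counter is pruned
    to its `capacity` strongest entries (lossy, like a sketch)."""
    def __init__(self, m=3, capacity=50):
        self.m, self.capacity = m, capacity
        self.clouds = {}

    def add_document(self, terms, label):
        top_m = Counter(terms).most_common(self.m)
        cloud = self.clouds.setdefault(label, Counter())
        cloud.update(dict(top_m))
        if len(cloud) > self.capacity:          # lossy pruning step
            self.clouds[label] = Counter(dict(cloud.most_common(self.capacity)))

    def cloud(self, label, k=10):
        """The k heaviest terms for a label: the word cloud contents."""
        return self.clouds.get(label, Counter()).most_common(k)

wc = LabelWordClouds(m=2, capacity=4)
wc.add_document("stream stream sketch memory".split(), "data-mining")
wc.add_document("sketch sketch counts".split(), "data-mining")
print(wc.cloud("data-mining"))
```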


Intelligent Systems Design and Applications | 2011

New data structures for analyzing frequent factors in strings

Manuel Baena-García; Rafael Morales-Bueno

Discovering frequent factors in long strings is an important problem in many applications, such as biosequence mining. Classical algorithms process a vast database of small strings; in this paper, by contrast, we analyze a small database of long strings. The main difference lies in the high number of patterns to analyze. To tackle the problem, we have developed a new algorithm for discovering frequent factors in long strings. It uses a new data structure to arrange nodes in a trie, with a positioning matrix defined as a new positioning strategy. By using positioning matrices, we can apply advanced pruning heuristics in a trie at minimal computational cost. Positioning matrices also let us process strings containing Short Tandem Repeats and calculate different interestingness measures efficiently. The algorithm has been successfully used in natural language and biological sequence contexts.
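
The positioning-matrix machinery is the paper's contribution and is not reconstructed here; the sketch below shows only the baseline idea it builds on: counting all factors up to a maximum length in a trie and pruning by the frequency threshold (factor counts are anti-monotone, so pruning a node prunes all its extensions).

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def count_factors(text, max_len):
    """Insert every factor of length <= max_len into a trie; each node's
    count is the number of (possibly overlapping) occurrences of the
    factor spelled by the path from the root to it."""
    root = TrieNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + max_len]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root

def frequent(node, min_count, prefix=""):
    """Yield (factor, count) for factors occurring at least min_count times.
    Skipping an infrequent node prunes all of its extensions too."""
    for ch, child in node.children.items():
        if child.count >= min_count:
            yield prefix + ch, child.count
            yield from frequent(child, min_count, prefix + ch)

root = count_factors("abababcabab", max_len=4)
print(sorted(frequent(root, min_count=3)))
```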


Journal of Computer and System Sciences | 2014

String analysis by sliding positioning strategy

Manuel Baena-García; José M. Carmona-Cejudo; Rafael Morales-Bueno

Discovering frequent factors in long strings is an important problem in many applications, such as biosequence mining. Classical algorithms process a vast database of small strings; in this paper, by contrast, we analyze a small database of long strings. The main difference lies in the high number of patterns to analyze. To tackle the problem, we have developed a new algorithm, SANSPOS, for discovering frequent factors in long strings. It is an Apriori-like solution that exploits the fact that no super-pattern of an infrequent pattern can be frequent, following a multiple-pass, candidate-generation-and-test approach in which patterns of multiple lengths can be generated in a single pass. The algorithm uses a new data structure to arrange nodes in a trie, with a Positioning Matrix defined as a new positioning strategy. By using Positioning Matrices, we can apply advanced pruning heuristics in a trie at minimal computational cost. Positioning Matrices also let us process strings containing Short Tandem Repeats and calculate different interestingness measures efficiently. Furthermore, our algorithm traverses different sections of the input strings in parallel, reducing the overall running time. The algorithm has been successfully used in natural language and biological sequence contexts.
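
To make the Apriori-style pruning concrete, here is a simplified level-wise frequent-factor miner: a length-k candidate is generated only if its length-(k−1) prefix and suffix are both frequent. It sketches the anti-monotonicity argument only, not SANSPOS's trie, Positioning Matrices, or parallel traversal.

```python
from collections import Counter

def frequent_factors(text, min_count):
    """Level-wise (Apriori-like) frequent factor mining in one long string.

    Counts (possibly overlapping) substring occurrences level by level;
    since no super-pattern of an infrequent pattern can be frequent, a
    length-k candidate is built only from frequent length-(k-1) factors."""
    results = {}
    # Level 1: frequent single characters.
    frequent = {f: c for f, c in Counter(text).items() if c >= min_count}
    k = 1
    while frequent:
        results.update(frequent)
        k += 1
        # Candidate generation: join factors that overlap in k-2 characters.
        candidates = set()
        for factor in frequent:
            for other in frequent:
                if factor[1:] == other[:-1]:
                    candidates.add(factor + other[-1])
        # Test: one scan over the string counts all level-k candidates.
        counts = Counter()
        for i in range(len(text) - k + 1):
            window = text[i:i + k]
            if window in candidates:
                counts[window] += 1
        frequent = {f: c for f, c in counts.items() if c >= min_count}
    return results

print(frequent_factors("abababcabab", min_count=3))
```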

Collaboration


Dive into Manuel Baena-García's collaborations.

Top Co-Authors

Albert Bifet

Université Paris-Saclay
