Thiago F. Covoes
University of São Paulo
Publications
Featured research published by Thiago F. Covoes.
Hybrid Artificial Intelligence Systems | 2009
Thiago F. Covoes; Eduardo R. Hruschka; Leandro Nunes de Castro; Átila M. Santos
This paper proposes a filter-based method for feature selection. The filter is based on partitioning the feature space into clusters of similar features. The number of clusters, and consequently the cardinality of the subset of selected features, is automatically estimated from the data. Empirical results illustrate the performance of the proposed algorithm, which in general obtained competitive classification accuracy compared with a state-of-the-art feature selection algorithm, while requiring more modest computing time.
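The general idea of selecting features by clustering them lends itself to a compact illustration. Below is a minimal, hypothetical sketch: it clusters correlated features hierarchically and keeps one representative per cluster. Note that it fixes the number of clusters for simplicity, whereas the paper estimates this number automatically from the data.

```python
# Sketch only: cluster mutually correlated features, keep one
# representative per cluster. The paper estimates the number of
# clusters automatically; here we fix it for simplicity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_by_feature_clustering(X, n_clusters=5):
    """Keep one representative feature per cluster of correlated features."""
    corr = np.corrcoef(X, rowvar=False)   # feature-feature correlations
    dist = 1.0 - np.abs(corr)             # correlated features are "close"
    # condensed (upper-triangular) distances for hierarchical clustering
    iu = np.triu_indices_from(dist, k=1)
    labels = fcluster(linkage(dist[iu], method="average"),
                      t=n_clusters, criterion="maxclust")
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # representative: the feature closest, on average, to its cluster mates
        avg_dist = dist[np.ix_(members, members)].mean(axis=1)
        selected.append(int(members[np.argmin(avg_dist)]))
    return sorted(selected)

X = np.random.default_rng(0).normal(size=(100, 20))
print(select_by_feature_clustering(X, n_clusters=5))
```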
Information Sciences | 2011
Thiago F. Covoes; Eduardo R. Hruschka
This paper proposes a filter-based algorithm for feature selection. The filter is based on partitioning the set of features into clusters. The number of clusters, and consequently the cardinality of the subset of selected features, is automatically estimated from data. The computational complexity of the proposed algorithm is also investigated. A variant of this filter that considers feature-class correlations is also proposed for classification problems. Empirical results involving ten datasets illustrate the performance of the developed algorithm, which in general obtained competitive classification accuracy compared with state-of-the-art algorithms that find clusters of features. We show that, if computational efficiency is an important issue, the proposed filter may be preferred over its counterparts, making it eligible to join a pool of feature selection algorithms used in practice. As an additional contribution, a theoretical framework is used to formally analyze some properties of feature selection methods that rely on finding clusters of features.
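For the supervised variant, the per-cluster representative can be chosen using feature-class correlation. A hedged sketch of that idea follows, using mutual information as the relevance measure; the paper's actual criterion and clustering step may differ, and the cluster assignment below is hypothetical.

```python
# Sketch: given a clustering of features, pick from each cluster the
# feature most relevant to the class label (mutual information is an
# assumption here, not necessarily the paper's measure).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def pick_by_class_relevance(X, y, cluster_labels):
    """From each feature cluster, keep the feature most relevant to y."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = []
    for c in np.unique(cluster_labels):
        members = np.where(cluster_labels == c)[0]
        selected.append(int(members[np.argmax(relevance[members])]))
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # class depends on features 0 and 3
clusters = np.array([0, 0, 1, 1, 2, 2])   # hypothetical feature clustering
print(pick_by_class_relevance(X, y, clusters))
```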
Intelligent Data Analysis | 2013
Thiago F. Covoes; Eduardo R. Hruschka; Joydeep Ghosh
The problem of clustering with constraints has received considerable attention in the last decade. Several algorithms have been proposed, but only a few studies have partially compared their performances. In this work, three well-known algorithms for k-means-based clustering with soft constraints (Constrained Vector Quantization Error, CVQE; its variant, LCVQE; and Metric Pairwise Constrained K-Means, MPCK-Means) are systematically compared according to three criteria: Adjusted Rand Index, Normalized Mutual Information, and the number of violated constraints. Experiments were performed on 20 datasets, and for each of them 800 sets of constraints were generated. To provide some reassurance about the non-randomness of the obtained results, outcomes of statistical significance tests are presented. In terms of accuracy, LCVQE was shown to be competitive with CVQE while violating fewer constraints. On most of the datasets, both CVQE and LCVQE presented better accuracy than MPCK-Means, which is capable of learning distance metrics. In this sense, it was also observed that learning a particular distance metric for each cluster does not necessarily lead to better results than learning a single metric for all clusters. The robustness of the algorithms with respect to noisy constraints was also analyzed. From this perspective, the most interesting conclusion is that CVQE showed better performance than LCVQE in most of the experiments. The computational complexities of the algorithms are also presented. Finally, a variety of more specific new experimental findings are discussed in the paper; e.g., deduced constraints usually do not help to find better data partitions.
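The three evaluation criteria used in this comparison are straightforward to compute. A minimal sketch with scikit-learn follows; the label vectors and constraint pairs here are hypothetical.

```python
# Sketch: the three comparison criteria (ARI, NMI, violated constraints)
# on a toy partition with hypothetical must-link/cannot-link pairs.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def count_violations(labels, must_link, cannot_link):
    """Count pairwise constraints violated by a partition."""
    v = sum(labels[i] != labels[j] for i, j in must_link)
    v += sum(labels[i] == labels[j] for i, j in cannot_link)
    return v

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
must_link = [(0, 1), (2, 3)]      # pairs that should share a cluster
cannot_link = [(0, 2), (1, 4)]    # pairs that should be separated

print("ARI:", adjusted_rand_score(true_labels, pred_labels))
print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
print("violations:", count_violations(pred_labels, must_link, cannot_link))
```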
IEEE Transactions on Neural Networks | 2013
Thiago F. Covoes; Eduardo R. Hruschka; Joydeep Ghosh
Constrained clustering has been an active research topic for the last decade. Most studies focus on batch-mode algorithms. This brief introduces two algorithms for on-line constrained learning, named on-line linear constrained vector quantization error (O-LCVQE) and constrained rival penalized competitive learning (C-RPCL). The former is a variant of the LCVQE algorithm for on-line settings, whereas the latter is an adaptation of the (on-line) RPCL algorithm to deal with constrained clustering. The accuracy results, in terms of normalized mutual information (NMI), from experiments with nine datasets show that the partitions induced by O-LCVQE are competitive with those found by the (batch-mode) LCVQE. Compared with this formidable baseline algorithm, it is surprising that C-RPCL can provide better partitions (in terms of NMI) for most of the datasets. Experiments on a large dataset also show that on-line algorithms for constrained clustering can significantly reduce the computational time.
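To give a flavor of on-line constrained learning, here is a generic sketch of a single prototype update for a must-link pair: both points are jointly assigned to one prototype, which is then nudged toward them. This is illustrative only and not the published O-LCVQE or C-RPCL update rule.

```python
# Sketch: a generic on-line prototype update honoring a must-link pair
# (not the published O-LCVQE/C-RPCL rules).
import numpy as np

def online_must_link_update(prototypes, x_a, x_b, lr=0.05):
    """Assign a must-linked pair to the prototype that is jointly best
    for both points, then move that prototype toward each point."""
    cost = ((prototypes - x_a) ** 2).sum(axis=1) \
         + ((prototypes - x_b) ** 2).sum(axis=1)
    k = int(np.argmin(cost))               # winning prototype for the pair
    prototypes[k] += lr * (x_a - prototypes[k])
    prototypes[k] += lr * (x_b - prototypes[k])
    return k

rng = np.random.default_rng(1)
protos = rng.normal(size=(3, 2))
print(online_must_link_update(protos, np.array([0.5, 0.5]),
                              np.array([0.6, 0.4])))
```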
Brazilian Symposium on Neural Networks | 2010
Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Thiago F. Covoes; Eduardo R. Hruschka
The Simplified Silhouette Filter (SSF) is a recently introduced feature selection method that automatically estimates the number of features to be selected. To do so, a sampling strategy is combined with a clustering algorithm that seeks clusters of correlated (potentially redundant) features. It is well known that the choice of a similarity measure may have a great impact on clustering results. Consequently, in this application scenario, this choice may have a great impact on the feature subset to be selected. In this paper we study six correlation coefficients as similarity measures in the clustering stage of SSF, thus giving rise to several variants of the original method. The obtained results show that, in particular scenarios, some correlation measures select fewer features than others while still providing accurate classifiers.
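The sketch below shows how the choice of correlation coefficient changes the feature-feature similarity matrix fed to the clustering stage. Only three common coefficients are shown (the paper studies six), and the data are synthetic.

```python
# Sketch: feature-feature similarity under different correlation
# coefficients (three common choices; the paper evaluates six).
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def feature_similarity(X, coef="pearson"):
    funcs = {"pearson": lambda a, b: pearsonr(a, b)[0],
             "spearman": lambda a, b: spearmanr(a, b)[0],
             "kendall": lambda a, b: kendalltau(a, b)[0]}
    f = funcs[coef]
    d = X.shape[1]
    S = np.eye(d)                           # similarity matrix
    for i in range(d):
        for j in range(i + 1, d):
            S[i, j] = S[j, i] = abs(f(X[:, i], X[:, j]))
    return S

X = np.random.default_rng(2).normal(size=(50, 4))
for c in ("pearson", "spearman", "kendall"):
    print(c, np.round(feature_similarity(X, c)[0, 1], 3))
```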
Congress on Evolutionary Computation | 2013
Thiago F. Covoes; Eduardo R. Hruschka
This paper describes the Evolutionary Create & Eliminate for Expectation Maximization algorithm (ECE-EM) for learning finite Gaussian Mixture Models (GMMs). The proposed algorithm is a variant of the recently proposed Evolutionary Split & Merge for Expectation Maximization algorithm (ESM-EM). ECE-EM uses simpler guiding functions and mutation operators than ESM-EM, while keeping the appealing properties of its counterpart. As an additional contribution of our work, we compare, on eighteen datasets, both ECE-EM and ESM-EM with two state-of-the-art algorithms able to learn the structure and the parameters of GMMs. Our experimental results suggest that both evolutionary algorithms present a sound tradeoff between computational time and accuracy when compared to the other algorithms. Furthermore, ECE-EM obtained results at least as good as those achieved by ESM-EM.
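A hedged sketch of a "create"-style operation in the spirit of ECE-EM: add a component centered at the point the current model explains worst, then refit with EM. The paper's actual guiding functions and operators are more elaborate; this only illustrates the mechanism.

```python
# Sketch: a "create" step for GMM learning (illustrative, not the
# paper's operator): seed a new component at the worst-fit point.
import numpy as np
from sklearn.mixture import GaussianMixture

def create_component(gmm, X):
    worst = X[np.argmin(gmm.score_samples(X))]   # lowest log-likelihood point
    means = np.vstack([gmm.means_, worst])
    return GaussianMixture(n_components=len(means), means_init=means,
                           random_state=0).fit(X)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-3, 0.5, size=(80, 2)),
               rng.normal(3, 0.5, size=(80, 2))])
gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
print(create_component(gmm, X).n_components)     # 2
```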
International Conference on Machine Learning and Applications | 2011
Thiago F. Covoes; Eduardo R. Hruschka
This paper introduces an evolutionary algorithm, named ESM-EM (Evolutionary Split & Merge for Expectation Maximization), for estimating both the number of components and the parameters of Gaussian Mixture Models. ESM-EM is based on splitting and merging operations, which are applied to the components of the mixture model. By combining such operations with random search procedures, an evolutionary algorithm potentially capable of escaping from local optima can be designed. In our experiments, we compare ESM-EM with a widely used approach that consists of fitting a set of mixture models with different numbers of components and then selecting the model that provides the best result according to a particular model selection criterion. The results of the performed experiments show that ESM-EM provides the best results (in terms of the Minimum Description Length principle) on seven out of the eight assessed datasets.
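The baseline approach described above is easy to reproduce: fit GMMs with varying numbers of components and keep the one minimizing a model selection criterion. The sketch below uses BIC, which is closely related to (but not identical with) the MDL principle used in the paper.

```python
# Sketch of the baseline: multiple EM fits with different component
# counts, selected by BIC (the paper's criterion is MDL).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, size=(100, 2)),
               rng.normal(2, 0.5, size=(100, 2))])

models = [GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))       # lowest BIC wins
print("selected components:", best.n_components)
```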
Intelligent Systems Design and Applications | 2009
Thiago F. Covoes; Eduardo R. Hruschka
Feature selection is an essential task in data mining because it makes it possible not only to reduce computational times and storage requirements, but also to favor model improvement and better data understanding. In this work, we analyze three methods for unsupervised feature selection that are based on the clustering of features for redundancy removal. We report experimental results obtained on ten datasets that illustrate practical scenarios of particular interest, in which one method may be preferred over another. To provide some reassurance about the validity and non-randomness of the obtained results, we also present the results of statistical tests.
Brazilian Symposium on Neural Networks | 2012
Davidson M. Sestaro; Thiago F. Covoes; Eduardo R. Hruschka
The disparity between the amounts of unlabeled and labeled data available in many applications has made semi-supervised learning an active research topic. Most studies on semi-supervised clustering assume that the number of classes is equal to the number of clusters. This paper introduces a semi-supervised clustering algorithm, named Multiple Clusters per Class k-means (MCCK), which estimates the number of clusters per class via pairwise constraints generated from class labels. Experiments with eight datasets indicate that the algorithm outperforms three traditional algorithms for semi-supervised clustering, especially when the one-cluster-per-class assumption does not hold. Finally, the learned structure can offer a valuable description of the data in several applications; for instance, it can aid the identification of disease subtypes in medical diagnosis problems.
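The constraint-generation step (deriving pairwise constraints from class labels) can be sketched as follows; the sampling scheme here is illustrative, not necessarily the one used by MCCK.

```python
# Sketch: derive must-link/cannot-link pairs from class labels by
# sampling point pairs (sampling scheme is an assumption).
import itertools, random

def constraints_from_labels(y, n_pairs=10, seed=0):
    rng = random.Random(seed)
    pairs = rng.sample(list(itertools.combinations(range(len(y)), 2)), n_pairs)
    must_link = [(i, j) for i, j in pairs if y[i] == y[j]]
    cannot_link = [(i, j) for i, j in pairs if y[i] != y[j]]
    return must_link, cannot_link

y = [0, 0, 0, 1, 1, 2, 2, 2]
ml, cl = constraints_from_labels(y, n_pairs=8)
print("must-link:", ml)
print("cannot-link:", cl)
```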
Evolutionary Computation | 2016
Thiago F. Covoes; Eduardo R. Hruschka; Joydeep Ghosh
This paper describes the evolutionary split and merge for expectation maximization (ESM-EM) algorithm and eight of its variants, which are based on the use of split and merge operations to evolve Gaussian mixture models. Asymptotic time complexity analysis shows that the proposed algorithms are competitive with the state-of-the-art genetic-based expectation maximization (GA-EM) algorithm. Experiments performed on 35 datasets showed that ESM-EM can be computationally more efficient than the widely used multiple runs of EM (for different numbers of components and initializations). Moreover, a variant of ESM-EM free from critical parameters was shown to provide results competitive with GA-EM, even when GA-EM's parameters were fine-tuned a priori.