Murilo Coelho Naldi
University of São Paulo
Publications
Featured research published by Murilo Coelho Naldi.
Data Mining and Knowledge Discovery | 2013
Murilo Coelho Naldi; André Carlos Ponce Leon Ferreira de Carvalho; Ricardo J. G. B. Campello
Cluster ensembles aim at producing high-quality data partitions by combining a set of different partitions produced from the same data. Diversity and quality are claimed to be critical for the selection of the partitions to be combined. To enhance these characteristics, methods can be applied to evaluate and select a subset of the partitions that provides ensemble results similar to or better than those based on the full set of partitions. Previous studies have shown that this selection can significantly improve the quality of the final partitions. To this end, an appropriate evaluation of the candidate partitions to be combined must be performed. In this work, several methods to evaluate and select partitions are investigated, most of them based on relative clustering validity indexes. These indexes select the partitions with the highest quality to participate in the ensemble. However, each relative index can be more suitable for particular data conformations. Thus, distinct relative indexes are combined to create a final evaluation that tends to be robust to changes in the application scenario, as the majority of the combined indexes may compensate for the poor performance of some individual indexes. We also investigate the impact of the diversity among the partitions used for the ensemble. A comparative evaluation of results obtained from an extensive collection of experiments involving state-of-the-art methods and statistical tests is presented. Based on the obtained results, a practical design approach is proposed to support cluster ensemble selection. This approach was successfully applied to real public-domain data sets.
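As an illustration of the index-combination idea described above, the sketch below ranks candidate partitions under two common relative validity measures (a centroid-based simplified silhouette and within-cluster dispersion) and averages the ranks. The function names and toy data are illustrative assumptions, not the authors' implementation, and the two indexes stand in for whatever set of indexes is actually combined.

```python
import math

def simplified_silhouette(points, labels, centroids):
    """Centroid-based simplified silhouette (higher is better)."""
    scores = []
    for p, l in zip(points, labels):
        a = math.dist(p, centroids[l])
        b = min(math.dist(p, c) for k, c in centroids.items() if k != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def within_dispersion(points, labels, centroids):
    """Total within-cluster squared dispersion (lower is better)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def centroids_of(points, labels):
    cents = {}
    for l in set(labels):
        members = [p for p, m in zip(points, labels) if m == l]
        cents[l] = tuple(sum(x) / len(members) for x in zip(*members))
    return cents

def combined_rank(points, partitions):
    """Rank each candidate partition under each index and average the
    ranks, so no single index can dominate the selection."""
    sil, disp = [], []
    for labels in partitions:
        c = centroids_of(points, labels)
        sil.append(simplified_silhouette(points, labels, c))
        disp.append(within_dispersion(points, labels, c))
    order_sil = sorted(range(len(partitions)), key=lambda i: -sil[i])
    order_disp = sorted(range(len(partitions)), key=lambda i: disp[i])
    rank_sil = {i: r for r, i in enumerate(order_sil)}
    rank_disp = {i: r for r, i in enumerate(order_disp)}
    return [(rank_sil[i] + rank_disp[i]) / 2 for i in range(len(partitions))]

# Two toy candidate partitions of four 2-D points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
ranks = combined_rank(points, [good, bad])
best = min(range(len(ranks)), key=lambda i: ranks[i])
print(best)  # index 0: the well-separated partition ranks best
```

Averaging ranks rather than raw index values avoids having to normalize indexes that live on different scales.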
Neurocomputing | 2014
Murilo Coelho Naldi; Ricardo J. G. B. Campello
One of the challenges for clustering lies in dealing with data distributed across separate repositories, because most clustering techniques require the data to be centralized. One of them, k-means, has been elected one of the most influential data mining algorithms for being simple, scalable, and easily modifiable to a variety of contexts and application domains. Although distributed versions of k-means have been proposed, the algorithm is still sensitive to the selection of the initial cluster prototypes and requires the number of clusters to be specified in advance. In this paper, we propose the use of evolutionary algorithms to overcome these limitations of k-means and, at the same time, to deal with distributed data. Two different distribution approaches are adopted: the first obtains a final model identical to that of the centralized version of the clustering algorithm; the second generates and selects clusters for each distributed data subset and combines them afterwards. The algorithms are compared from two perspectives: the theoretical one, through asymptotic complexity analyses, and the experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests. The obtained results indicate which variant is more adequate for each application scenario.
Neurocomputing | 2015
Murilo Coelho Naldi; Ricardo J. G. B. Campello
Dealing with distributed data is one of the challenges for clustering, as most clustering techniques require the data to be centralized. One of them, k-means, has been elected one of the most influential data mining algorithms for being simple, scalable, and easily modifiable to a variety of contexts and application domains. However, exact distributed versions of k-means are still sensitive to the selection of the initial cluster prototypes and require the number of clusters to be specified in advance. Additionally, preserving data privacy among repositories may be a complicating factor. In order to overcome these limitations of k-means, two different approaches were adopted in this paper: the first obtains a final model identical to that of the centralized version of the clustering algorithm, and the second generates and selects clusters for each distributed data subset and combines them afterwards. We also describe how to apply the compared algorithms while preserving data privacy. The algorithms are compared from two perspectives: the theoretical one, through asymptotic complexity analyses, and the experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests. The results obtained indicate which algorithm is more suitable for each application scenario.
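The first kind of approach (an exact model identical to centralized k-means) can be sketched as follows: in each iteration, every repository sends the coordinator only per-cluster sums and counts for the current global centroids, so raw points never leave a site, which is also relevant when privacy is a concern. This is a minimal illustration under assumed names, not the paper's code.

```python
import math

def local_stats(site_points, centroids):
    """Per-site step: assign local points to the nearest global centroid
    and return only per-cluster coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in site_points:
        j = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def distributed_kmeans(sites, centroids, iters=10):
    """Coordinator step: merge the sites' sufficient statistics and
    update the centroids; the resulting model is identical to running
    centralized k-means on the union of the sites' data."""
    dim = len(centroids[0])
    for _ in range(iters):
        total_sums = [[0.0] * dim for _ in centroids]
        total_counts = [0] * len(centroids)
        for site in sites:
            sums, counts = local_stats(site, centroids)
            for j in range(len(centroids)):
                total_counts[j] += counts[j]
                for d in range(dim):
                    total_sums[j][d] += sums[j][d]
        new_cents = []
        for j in range(len(centroids)):
            if total_counts[j]:
                new_cents.append(tuple(s / total_counts[j] for s in total_sums[j]))
            else:
                new_cents.append(centroids[j])  # keep empty clusters in place
        centroids = new_cents
    return centroids

# Two repositories holding halves of a toy data set.
sites = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
final = distributed_kmeans(sites, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(final)  # [(0.5, 0.0), (10.5, 10.0)]
```

Because the mean of each cluster is fully determined by these sums and counts, the merged update is exactly the centralized one; only the communication pattern changes.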
knowledge discovery and data mining | 2008
Murilo Coelho Naldi; André Carlos Ponce Leon Ferreira de Carvalho; Ricardo José Gabrielli Barreto Campello; Eduardo R. Hruschka
Genetic Algorithms (GAs) have been successfully applied to several complex data analysis problems in a wide range of domains, such as image processing, bioinformatics, and crude oil analysis. The need for organizing data into categories of similar objects has made the task of clustering increasingly important to those domains. In this chapter, the authors present a survey of the use of GAs for clustering applications. A variety of encoding (chromosome representation) approaches, fitness functions, and genetic operators are described, all of them customized to solve problems in such an application context.
intelligent systems design and applications | 2009
Murilo Coelho Naldi; Andre Fontana; Ricardo J. G. B. Campello
One of the most influential algorithms in data mining, k-means, is broadly used in practical tasks for its simplicity, computational efficiency, and effectiveness in high-dimensional problems. However, k-means has two major drawbacks: the need to choose the number of clusters, k, and the sensitivity to the initial positions of the prototypes. In this work, systematic, evolutionary, and order heuristics used to overcome these drawbacks are compared. In total, 27 variants of 4 algorithmic approaches are used to partition 324 synthetic data sets, and the obtained results are compared.
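As one concrete example of an initialization heuristic of the kind compared in this line of work, the sketch below shows k-means++-style seeding, which spreads the initial prototypes apart by sampling each new prototype with probability proportional to its squared distance from the prototypes chosen so far. The function name and toy data are illustrative; this is a well-known heuristic shown for context, not necessarily one of the 27 variants compared.

```python
import math
import random

def ordered_seeding(points, k, rng):
    """k-means++-style seeding: pick the first prototype at random, then
    pick each subsequent prototype with probability proportional to its
    squared distance to the nearest prototype chosen so far."""
    cents = [rng.choice(points)]
    while len(cents) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in cents) for p in points]
        r = rng.random() * sum(d2)  # sample proportionally to d2
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                cents.append(p)
                break
    return cents

rng = random.Random(0)  # fixed seed for a reproducible run
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
centers = ordered_seeding(points, 2, rng)
print(centers)
```

With two tight groups far apart, the second prototype is almost always drawn from the group not yet covered, which is exactly the sensitivity-reducing effect such heuristics aim for.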
foundations of computational intelligence | 2009
D. Horta; Murilo Coelho Naldi; Ricardo J. G. B. Campello; Eduardo R. Hruschka; A. de Carvalho
Clustering algorithms have been successfully applied to several data analysis problems in a wide range of domains, such as image processing, bioinformatics, crude oil analysis, market segmentation, document categorization, and web mining. The need for organizing data into categories of similar objects has made the task of clustering very important to these domains. In this context, there has been an increasing interest in the study of evolutionary algorithms for clustering, especially those algorithms capable of finding blurred clusters that are not clearly separated from each other. In particular, a number of evolutionary algorithms for fuzzy clustering have been addressed in the literature. This chapter has two main contributions. First, it presents an overview of evolutionary algorithms designed for fuzzy clustering. Second, it describes a fuzzy version of an evolutionary algorithm for clustering, which has been shown to be more computationally efficient than systematic (i.e., repetitive) approaches when the number of clusters in a data set is unknown. Illustrative experiments showing the influence of local optimization on the efficiency of the evolutionary search are also presented. These experiments reveal interesting aspects of the effect of an important parameter found in many evolutionary algorithms for clustering, namely, the number of iterations of a given local search procedure to be performed at each generation.
Neurocomputing | 2017
G. V. Oliveira; F. P. Coutinho; Ricardo J. G. B. Campello; Murilo Coelho Naldi
The recent growth in the size of datasets requires scalability of data mining algorithms, such as clustering algorithms. The MapReduce programming model provides the needed scalability, along with portability as well as automatic data safety and management. k-means is one of the most popular algorithms in data mining and can be easily adapted to the MapReduce model. Nevertheless, k-means has drawbacks, such as the need to provide the number of clusters (k) in advance and the sensitivity of the algorithm to the initial cluster prototypes. This paper presents two evolutionary scalable metaheuristics in MapReduce that automatically seek the solution with the optimal number of clusters and the best clustering structure for scalable datasets. The first is an algorithm able to iteratively enhance k-means clusterings through evolutionary operators designed to handle distributed data. The second applies evolutionary k-means to cluster each distributed portion of a dataset independently, combining the obtained results into an ensemble afterwards. The proposed techniques are compared asymptotically and experimentally with other state-of-the-art clustering algorithms also developed in MapReduce. The results are analyzed by statistical tests and show that the first proposed metaheuristic yielded results with the best quality, while the second achieved the best computing times.
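To make the "k-means adapts easily to MapReduce" remark concrete, the sketch below simulates one way a k-means iteration decomposes into map and reduce phases: mappers emit (nearest-centroid, point) pairs over their data splits, and reducers average the points per centroid key. This is a plain-Python illustration of the general pattern, not the evolutionary metaheuristics proposed in the paper.

```python
import math
from collections import defaultdict

def map_phase(chunk, centroids):
    """Mapper: for one data split, emit (nearest-centroid-index, (point, 1))
    key-value pairs."""
    pairs = []
    for p in chunk:
        j = min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        pairs.append((j, (p, 1)))
    return pairs

def reduce_phase(pairs):
    """Reducer: for each centroid key, sum the assigned points and counts
    and emit the updated centroid."""
    sums = {}
    counts = defaultdict(int)
    for j, (p, n) in pairs:
        if j not in sums:
            sums[j] = list(p)
        else:
            for d in range(len(p)):
                sums[j][d] += p[d]
        counts[j] += n
    return {j: tuple(x / counts[j] for x in s) for j, s in sums.items()}

def kmeans_mapreduce(chunks, centroids, iters=5):
    """Driver: alternate map and reduce phases for a fixed budget."""
    for _ in range(iters):
        pairs = [kv for chunk in chunks for kv in map_phase(chunk, centroids)]
        updated = reduce_phase(pairs)
        centroids = [updated.get(j, c) for j, c in enumerate(centroids)]
    return centroids

chunks = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
final = kmeans_mapreduce(chunks, [(0.0, 0.0), (12.0, 0.0)])
print(final)  # [(1.0, 0.0), (11.0, 0.0)]
```

In a real MapReduce job a combiner would pre-aggregate sums and counts on each mapper node to cut shuffle traffic; the toy driver above skips that optimization for clarity.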
brazilian conference on intelligent systems | 2014
Kemilly Dearo Garcia; Murilo Coelho Naldi
Dealing with large amounts of data is one of the challenges for clustering, which creates the need to distribute and manage huge data sets in separate repositories. New distributed systems have been designed to scale up from a single server to thousands of machines. The MapReduce framework allows a job to be divided and its results to be combined seamlessly. k-means is one of the few clustering algorithms that satisfies the MapReduce constraints, but it requires the number of clusters to be specified in advance and is sensitive to their initialization. In this work, we propose a MapReduce clustering algorithm that executes multiple parallel runs of k-means with different initializations and numbers of clusters. Additionally, a MapReduce version of a relative cluster validity index is implemented and used to find the best result. The proposed algorithm is experimentally compared with the Apache Mahout Project's MapReduce implementation of k-means. Statistical tests applied to the results indicate that the proposed algorithm can outperform Mahout's implementation when multiple k-means partitions are required.
brazilian symposium on neural networks | 2012
Murilo Coelho Naldi; Ricardo J. G. B. Campello
One of the challenges for clustering lies in dealing with huge amounts of data, which creates the need to distribute large data sets across separate repositories. However, most clustering techniques require the data to be centralized. One of them, k-means, has been elected one of the most influential data mining algorithms. Although exact distributed versions of the k-means algorithm have been proposed, the algorithm is still sensitive to the selection of the initial cluster prototypes and requires that the number of clusters be specified in advance. This work tackles the problem of generating an approximated model for distributed clustering, based on k-means, for scenarios where the number of clusters of the distributed data is unknown. We propose a collection of algorithms that generate and select k-means clusterings for each distributed subset of the data and combine them afterwards. The variants of the algorithm are compared from two perspectives: the theoretical one, through asymptotic complexity analyses, and the experimental one, through a comparative evaluation of results obtained from a collection of experiments and statistical tests.
brazilian conference on intelligent systems | 2013
Murilo Coelho Naldi; Ricardo J. G. B. Campello
Dealing with large amounts of data is one of the challenges for clustering, which creates the need to distribute large data sets across separate repositories. However, most clustering techniques require the data to be centralized. One of them, k-means, has been elected one of the most influential data mining algorithms. Although exact distributed versions of the k-means algorithm have been proposed, the algorithm is still sensitive to the selection of the initial cluster prototypes and requires that the number of clusters be specified in advance. Additionally, distributed versions of clustering algorithms usually require multiple rounds of data transmission. This work tackles the problem of generating an approximated model for distributed clustering, based on k-means, for scenarios where the number of clusters of the distributed data is unknown and the data transmission rate is low or costly. A collection of algorithms is proposed to combine k-means clusterings for each distributed subset of the data with a single round of communication. These algorithms are compared from two perspectives: the theoretical one, through asymptotic complexity analyses, and the experimental one, through a comparative evaluation of results obtained from experiments and statistical tests.
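One simple way to realize the single-communication-round idea is for each site to send only its local (centroid, count) pairs, with the coordinator then merging the closest weighted centroids until the desired number remains. The sketch below is an illustrative approximation of this pattern under assumed names, not the authors' actual combination procedure.

```python
import math

def combine_local_models(models, k):
    """Single-round combination sketch: pool every site's (centroid, count)
    pairs and repeatedly merge the closest pair of weighted centroids
    until only k remain."""
    cents = [(c, n) for model in models for c, n in model]
    while len(cents) > k:
        best = None
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                d = math.dist(cents[i][0], cents[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (c1, n1), (c2, n2) = cents[i], cents[j]
        merged = (tuple((a * n1 + b * n2) / (n1 + n2) for a, b in zip(c1, c2)),
                  n1 + n2)
        cents = [c for idx, c in enumerate(cents) if idx not in (i, j)] + [merged]
    return [c for c, _ in cents]

# Two sites summarize their local k-means results as (centroid, count) pairs.
site_a = [((0.0, 0.0), 10), ((9.0, 9.0), 5)]
site_b = [((0.5, 0.0), 10), ((10.0, 9.0), 5)]
combined = combine_local_models([site_a, site_b], k=2)
print(sorted(combined))  # [(0.25, 0.0), (9.5, 9.0)]
```

Since only centroids and counts cross the network, the transmission cost is independent of the data set size, which is the point of the single-round setting.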