Pablo A. Jaskowiak | Researchain

Publication

Featured research published by Pablo A. Jaskowiak.

Statistical and Scientific Database Management | 2013

On the combination of relative clustering validity criteria

Lucas Vendramin; Pablo A. Jaskowiak; Ricardo J. G. B. Campello

Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori by the user. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad hoc fashion only; its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
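As a rough illustration of what combining criteria into a committee can look like, the sketch below averages the per-criterion ranks of each candidate partition. The abstract does not name its four combination strategies, so mean-rank aggregation is an assumption here, and the criterion names and scores are hypothetical values chosen only for the example.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def combine_by_mean_rank(criteria_scores, higher_is_better):
    """Average the ranks that each criterion assigns to each partition.

    criteria_scores: dict mapping criterion name -> list of scores,
    one score per candidate partition (same order for every criterion).
    """
    per_criterion = [ranks(criteria_scores[name], higher_is_better[name])
                     for name in criteria_scores]
    return [mean(column) for column in zip(*per_criterion)]


# Hypothetical scores for three candidate partitions (illustrative only).
scores = {
    "silhouette": [0.42, 0.61, 0.55],             # higher is better
    "davies_bouldin": [1.30, 0.80, 0.95],         # lower is better
    "calinski_harabasz": [210.0, 240.0, 260.0],   # higher is better
}
direction = {"silhouette": True, "davies_bouldin": False,
             "calinski_harabasz": True}

committee = combine_by_mean_rank(scores, direction)
# The partition with the lowest mean rank is the committee's choice.
best = min(range(len(committee)), key=committee.__getitem__)
```

Working on ranks rather than raw values sidesteps the fact that different criteria live on different scales and may be maximized or minimized.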

BMC Bioinformatics | 2014

On the selection of appropriate distances for gene expression data clustering

Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Ivan G. Costa

Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions.

Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2013

Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis

Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Ivan G. Costa

Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
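To make concrete why the choice of proximity measure matters, the following minimal sketch contrasts Euclidean distance with a Pearson-based distance (1 minus the correlation coefficient) on three toy expression profiles. The profiles and function names are hypothetical, and this is not the evaluation methodology of the paper.

```python
import math


def euclidean(x, y):
    """Euclidean distance: sensitive to differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to profile shape, not magnitude.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]  # same shape as gene_a, ten times the magnitude
gene_c = [2.0, 2.0, 2.0, 3.0]      # similar magnitude to gene_a, different shape

# Euclidean distance places gene_a closer to gene_c;
# Pearson distance places gene_a closer to gene_b.
nearest_by_euclidean = min([gene_b, gene_c], key=lambda g: euclidean(gene_a, g))
nearest_by_pearson = min([gene_b, gene_c], key=lambda g: pearson_distance(gene_a, g))
```

The two measures disagree on which profile is the nearest neighbour of `gene_a`, so any clustering method built on them may group the genes differently.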

SIAM International Conference on Data Mining | 2014

Density-based clustering validation

Davoud Moulavi; Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Arthur Zimek; Jörg Sander

One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real-world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.

Intelligent Systems Design and Applications | 2011

A bottom-up oblique decision tree induction algorithm

Rodrigo C. Barros; Ricardo Cerri; Pablo A. Jaskowiak; André Carlos Ponce Leon Ferreira de Carvalho

Decision tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The majority of the oblique and univariate decision tree induction algorithms perform a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose a novel bottom-up algorithm for inducing oblique trees named BUTIA. It does not require an impurity measure for dividing nodes, since we know a priori the data resulting from each split. For generating the splitting hyperplanes, our algorithm implements a support vector machine solution, and a clustering algorithm is used for generating the initial leaves. We compare BUTIA to traditional univariate and oblique decision tree algorithms, C4.5, CART, OC1 and FT, as well as to a standard SVM implementation, using real gene expression benchmark data. Experimental results show the effectiveness of the proposed approach in several cases.

BMC Bioinformatics | 2015

Impact of missing data imputation methods on gene expression clustering and classification

Marcílio Carlos Pereira de Souto; Pablo A. Jaskowiak; Ivan G. Costa

Background: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.

Results and conclusions: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.

Brazilian Symposium on Bioinformatics | 2012

Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer

Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Ivan G. Costa

Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.

Brazilian Symposium on Neural Networks | 2010

A Comparative Study on the Use of Correlation Coefficients for Redundant Feature Elimination

Pablo A. Jaskowiak; Ricardo J. G. B. Campello; Thiago F. Covoes; Eduardo R. Hruschka

Simplified Silhouette Filter (SSF) is a recently introduced feature selection method that automatically estimates the number of features to be selected. To do so, a sampling strategy is combined with a clustering algorithm that seeks clusters of correlated (potentially redundant) features. It is well known that the choice of a similarity measure may have great impact on clustering results. As a consequence, in this application scenario, this choice may have great impact on the feature subset to be selected. In this paper we study six correlation coefficients as similarity measures in the clustering stage of SSF, thus giving rise to several variants of the original method. The obtained results show that, in particular scenarios, some correlation measures select fewer features than others, while providing accurate classifiers.

Brazilian Conference on Intelligent Systems | 2015

A Cluster Based Hybrid Feature Selection Approach

Pablo A. Jaskowiak; Ricardo J. G. B. Campello

Data collection and storage capacities have increased significantly in the past decades. In order to cope with the increasing complexity of data, feature selection methods have become an omnipresent preprocessing step in data analysis. In this paper we present a hybrid (filter-wrapper) feature selection method tailored for data classification problems. Our hybrid approach is composed of two stages. In the first stage, a filter clusters features to identify and remove redundancy. In the second stage, a wrapper evaluates different feature subsets produced by the filter, determining the one that produces the best classification performance in terms of accuracy. The effectiveness of our method is demonstrated through an empirical evaluation performed on real-world datasets coming from various sources.

Methods | 2018

Clustering of RNA-Seq samples: comparison study on cancer data

Pablo A. Jaskowiak; Ivan G. Costa; Ricardo J. G. B. Campello

RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated w.r.t. microarrays. In what concerns the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines to practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on a gene quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
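The two preprocessing recommendations above (log transformation and non-specific filtering down to a small number of genes) might be sketched as follows. The abstract does not give the filtering criterion, the pseudocount, or the order of the two steps, so the variance-based filter, the `log2(x + 1)` transform, and the transform-then-filter order are all assumptions, and the tiny count matrix is hypothetical.

```python
import math
from statistics import pvariance


def log_transform(matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every entry (the pseudocount is an assumption)."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


def top_variance_genes(matrix, n_keep):
    """Return sorted column indices of the n_keep genes with the highest variance.

    matrix: rows are samples, columns are genes.
    Variance is one possible non-specific criterion, assumed here for illustration.
    """
    n_genes = len(matrix[0])
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    keep = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)[:n_keep]
    return sorted(keep)


def preprocess(matrix, n_keep=1000):
    """Log-transform, then keep at most n_keep genes."""
    logged = log_transform(matrix)
    keep = top_variance_genes(logged, min(n_keep, len(logged[0])))
    return [[row[g] for g in keep] for row in logged], keep


# Hypothetical expression estimates: 3 samples x 4 genes.
counts = [
    [0.0, 15.0, 1023.0, 7.0],
    [0.0, 15.0, 3.0, 7.0],
    [0.0, 63.0, 255.0, 7.0],
]
filtered, kept = preprocess(counts, n_keep=2)
```

The filtered matrix would then be passed to a clustering algorithm together with a chosen distance measure; genes 0 and 3 are dropped here because they do not vary across samples.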
