Alejandro Murua
Université de Montréal
Publication
Featured research published by Alejandro Murua.
Knowledge Discovery and Data Mining | 2002
Jeremy Tantrum; Alejandro Murua; Werner Stuetzle
The goal of clustering is to identify distinct groups in a dataset. Compared to non-parametric clustering methods like complete linkage, hierarchical model-based clustering has the advantage of offering a way to estimate the number of groups present in the data. However, its computational cost is quadratic in the number of items to be clustered, and it is therefore not applicable to large problems. We review an idea called Fractionation, originally conceived by Cutting, Karger, Pedersen and Tukey for non-parametric hierarchical clustering of large datasets, and describe an adaptation of Fractionation to model-based clustering. A further extension, called Refractionation, leads to a procedure that can be successful even in the difficult situation where there are large numbers of small groups.
Human Brain Mapping | 2008
Larissa Stanberry; Alejandro Murua; Dietmar Cordes
An unsupervised stochastic clustering method based on the ferromagnetic Potts spin model is introduced as a powerful tool to determine functionally connected regions. The method provides an intuitively simple approach to clustering and makes no assumptions about the number of clusters in the data or their underlying distribution. The performance of the method and its dependence on the intrinsic parameters (size of the neighborhood, form of the interaction term, etc.) are investigated on simulated data and on real fMRI data acquired during a conventional periodic finger-tapping task. The merits of incorporating Euclidean information into the connectivity analysis are discussed. The ability of Potts model clustering to uncover hidden structure in complex data is demonstrated through its application to resting-state data to determine functional connectivity networks of the anterior and posterior cingulate cortices for a group of nine healthy male subjects.
Journal of Computational and Graphical Statistics | 2008
Alejandro Murua; Larissa Stanberry; Werner Stuetzle
Many clustering methods, such as K-means, kernel K-means, and MNCut clustering, follow the same recipe: (i) choose a measure of similarity between observations; (ii) define a figure of merit assigning a large value to partitions of the data that put similar observations in the same cluster; and (iii) optimize this figure of merit over partitions. Potts model clustering represents an interesting variation on this recipe. Blatt, Wiseman, and Domany defined a new figure of merit for partitions that is formally similar to the Hamiltonian of the Potts model for ferromagnetism, extensively studied in statistical physics. For each temperature T, the Hamiltonian defines a distribution assigning a probability to each possible configuration of the physical system or, in the language of clustering, to each partition. Instead of searching for a single partition optimizing the Hamiltonian, they sampled a large number of partitions from this distribution for a range of temperatures. They proposed a heuristic for choosing an appropriate temperature and from the sample of partitions associated with this chosen temperature, they then derived what we call a consensus clustering: two observations are put in the same consensus cluster if they belong to the same cluster in the majority of the random partitions. In a sense, the consensus clustering is an “average” of plausible configurations, and we would expect it to be more stable (over different samples) than the configuration optimizing the Hamiltonian. The goal of this article is to contribute to the understanding of Potts model clustering and to propose extensions and improvements: (1) We show that the Hamiltonian used in Potts model clustering is closely related to the kernel K-means and MNCut criteria. (2) We propose a modification of the Hamiltonian penalizing unequal cluster sizes and show that it can be interpreted as a weighted version of the kernel K-means criterion. (3) We introduce a new version of the Wolff algorithm to simulate configurations from the distribution defined by the penalized Hamiltonian, leading to penalized Potts model clustering. (4) We note a link between kernel-based clustering methods and nonparametric density estimation and exploit it to automatically determine locally adaptive kernel bandwidths. (5) We propose a new simple rule for selecting a good temperature T. As an illustration we apply Potts model clustering to gene expression data and compare our results to those obtained by model-based clustering and a nonparametric dendrogram sharpening method.
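The consensus rule described in this abstract can be illustrated with a minimal Python sketch. It assumes the sampled partitions are already available as integer label vectors (the Potts sampling itself is not shown). The function name `consensus_clusters`, the `threshold` parameter, and the use of connected components to turn the pairwise majority relation into a partition are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def consensus_clusters(partitions, threshold=0.5):
    """Derive a consensus clustering from sampled partitions.

    partitions: array-like of shape (n_samples, n_obs); row s holds the
    cluster label of each observation in the s-th sampled partition.
    Returns (labels, co), where co[i, j] is the fraction of samples in
    which observations i and j share a cluster.
    """
    partitions = np.asarray(partitions)
    n_samples, n_obs = partitions.shape

    # Co-occurrence frequencies across the sampled partitions.
    co = np.zeros((n_obs, n_obs))
    for labels in partitions:
        co += labels[:, None] == labels[None, :]
    co /= n_samples

    # Link two observations if they are together in a majority of samples.
    adjacency = co > threshold

    # Assumption: the majority relation need not be transitive, so we take
    # connected components of the resulting graph as consensus clusters.
    consensus = np.full(n_obs, -1)
    current = 0
    for start in range(n_obs):
        if consensus[start] != -1:
            continue
        consensus[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacency[i]):
                if consensus[j] == -1:
                    consensus[j] = current
                    stack.append(j)
        current += 1
    return consensus, co

# Four hypothetical sampled partitions of four observations.
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
labels, co = consensus_clusters(samples)
print(labels)
```

Because cluster labels are arbitrary within each sampled partition, the sketch compares only whether two observations share a label, never the label values themselves.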
Journal of Applied Statistics | 2015
Thierry Chekouo; Alejandro Murua
Biclustering is the simultaneous clustering of two related dimensions, for example, of individuals and features, or genes and experimental conditions. Very few statistical models for biclustering have been proposed in the literature. Instead, most of the research has focused on algorithms to find biclusters. The models underlying them have not received much attention. Hence, very little is known about the adequacy and limitations of the models and the efficiency of the algorithms. In this work, we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the known popular biclustering techniques, and to justify, and many times improve on, the algorithms used to find the biclusters. It turns out that most of the known techniques have a hidden Bayesian flavor. Therefore, we adopt a Bayesian framework to model biclustering. We propose a measure of biclustering complexity (number of biclusters and overlapping) through a penalized plaid model, and present a suitable version of the deviance information criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Our ideas are motivated by the analysis of gene expression data.
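For orientation, the plaid model as it usually appears in the biclustering literature can be written as follows. This is the standard unpenalized form, and the notation is an assumption made here for illustration; the penalized Bayesian version proposed in the paper may differ in its details.

```latex
% y_{ij}: observed value for row i (e.g. gene) and column j (e.g. condition)
% \rho_{ik}, \kappa_{jk} \in \{0, 1\}: row and column memberships in bicluster k
y_{ij} = \mu_0 + \sum_{k=1}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right) \rho_{ik}\, \kappa_{jk} + \varepsilon_{ij}
```

Overlap arises when a row or column belongs to more than one bicluster, that is, when $\sum_k \rho_{ik} > 1$ or $\sum_k \kappa_{jk} > 1$.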
Journal of Computational and Graphical Statistics | 2014
Alejandro Murua; Nicolas Wicker
This article presents a Bayesian kernel-based clustering method. The associated model arises as an embedding of the Potts density for class membership probabilities into an extended Bayesian model for joint data and class membership probabilities. The method may be seen as a principled extension of the super-paramagnetic clustering. The model depends on two parameters: the temperature and the kernel bandwidth. The clustering is obtained from the posterior marginal adjacency membership probabilities and does not depend on any particular value of the parameters. We elicit an informative prior based on random graph theory and kernel density estimation. A stochastic population Monte Carlo algorithm, based on parallel runs of the Wang–Landau algorithm, is developed to estimate the posterior adjacency membership probabilities and the parameter posterior. The convergence of the algorithm is also established. The method is applied to the whole human proteome to uncover human genes that share common evolutionary history. Our experiments and application show that good clustering results are obtained at many different values of the temperature and bandwidth parameters. Hence, instead of focusing on finding adequate values of the parameters, we advocate making clustering inference based on the study of the distribution of the posterior adjacency membership probabilities. This article has online supplementary material.
Journal of Public Health Policy | 2008
Sue Thomas Hegyvary; Devon M. Berry; Alejandro Murua
Clustering countries based on health outcomes is a useful technique for assessing global health disparities. However, data on country-specific indicators of health outcomes are inconsistent across databases from different sources, such as the World Bank, WHO, and UNICEF. The new database on under-five child mortality from the Institute for Health Metrics and Evaluation advances information about child mortality by showing both country-level estimates and confidence intervals. We used the new database for child mortality and WHO data for HALE from 160 countries to identify country clusters through model-based clustering techniques. The four clusters in 2000 and six in 2003, within levels of uncertainty, showed nonlinear distributions of health outcomes globally, indicating that no single trajectory for progression is evident. We propose the use of country clusters in further study of societal conditions that contribute to health outcomes and changes over time.
International Conference on Acoustics, Speech, and Signal Processing | 1999
Jiayu Li; Alejandro Murua
A two-dimensional extension of hidden Markov models (HMMs) is introduced, aiming at improving the modeling of speech signals. The extended model (a) focuses on the conditional joint distribution of state durations given the length of utterances, rather than on state transition probabilities; (b) extends the dependency of observation densities to current, as well as neighboring, states; and (c) introduces a local averaging procedure to smooth the outcome associated with transitions between successive states. A set of efficient iterative algorithms for implementing the extended model, based on segmental K-means and iterative conditional modes, is also presented. In applications to the recognition of segmented digits spoken over the telephone, the extended model achieved about a 23% reduction in the recognition error rate compared to standard HMMs.
The Annals of Applied Statistics | 2015
Thierry Chekouo; Alejandro Murua; Wolfgang Raffelsberger
We propose and develop a Bayesian plaid model for biclustering that accounts for the prior dependency between genes (and/or conditions) through a stochastic relational graph. This work is motivated by the need for improved understanding of the molecular mechanisms of human diseases for which effective drugs are lacking, and based on the extensive raw data available through gene expression profiling. We model the prior dependency information from biological knowledge gathered from gene ontologies. Our model, the Gibbs-plaid model, assumes that the relational graph is governed by a Gibbs random field. To estimate the posterior distribution of the bicluster membership labels, we develop a stochastic algorithm that is partly based on the Wang-Landau flat-histogram algorithm. We apply our method to a gene expression database created from the study of retinal detachment, with the aim of confirming known or finding novel subnetworks of proteins associated with this disorder.
Journal of Computational and Graphical Statistics | 2017
Alejandro Murua; Fernando A. Quintana
We consider Bayesian nonparametric regression through random partition models. Our approach involves the construction of a covariate-dependent prior distribution on partitions of individuals. Our goal is to use covariate information to improve predictive inference. To do so, we propose a prior on partitions based on the Potts clustering model associated with the observed covariates. Covariate proximity thus drives both the formation of clusters and the prior predictive distribution. The resulting prior model is flexible enough to support many different types of likelihood models. We focus the discussion on nonparametric regression. Implementation details are discussed for the specific case of multivariate multiple linear regression. The proposed model performs well in terms of model fitting and prediction when compared to alternative nonparametric regression approaches. We illustrate the methodology with an application to the health status of nations at the turn of the 21st century. Supplementary materials are available online.
Stochastic Processes and their Applications | 1999
Alejandro Murua
This paper deals with the study of the relationship between the complete linear regularity of continuous-time weakly stationary processes and the smoothness of their spectral densities. It is shown that when the coefficient of complete linear regularity behaves like $O(\tau^{-(r+\mu)})$ as $\tau \to +\infty$, for some integer $r$ and $\mu \in (0,1]$, then the spectral density has at least $r$ uniformly continuous, bounded, and integrable derivatives, with the $r$th derivative satisfying a Lipschitz continuity condition of order $\mu$. Conversely, under certain smoothness assumptions on the spectral density, upper bounds on the rate of decay of the coefficient of complete linear regularity are obtained.
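For reference, a Lipschitz condition of order $\mu$ on the $r$th derivative is usually stated as below. The symbols $f$ (the spectral density), $C$ (a finite constant), and $\lambda_1, \lambda_2$ (frequencies) are notation assumed here for illustration and do not appear in the abstract.

```latex
% Lipschitz (Hölder) continuity of order \mu for the r-th derivative of f
\left| f^{(r)}(\lambda_1) - f^{(r)}(\lambda_2) \right| \le C \, \left| \lambda_1 - \lambda_2 \right|^{\mu}
\quad \text{for all } \lambda_1, \lambda_2 \in \mathbb{R}, \qquad 0 < \mu \le 1 .
```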