Pedro Contreras
Royal Holloway, University of London
Publications
Featured research published by Pedro Contreras.
Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery | 2012
Fionn Murtagh; Pedro Contreras
We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.
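The agglomerative merge loop that these surveyed algorithms share can be made concrete with a deliberately naive, pure-Python single-linkage sketch. This is a textbook baseline for exposition only (the function name and the 1-D toy data are ours), not any of the surveyed R implementations:

```python
# Naive agglomerative hierarchical clustering (single linkage) on
# 1-D points: repeatedly merge the two closest clusters. This is the
# cubic-time baseline that efficient implementations improve on.

def single_linkage(points):
    """Return the sequence of merges (a flat dendrogram record)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two well-separated groups; the final merge joins them at distance 4.8.
merges = single_linkage([0.0, 0.1, 0.2, 5.0, 5.1])
```

Each of the n-1 merges scans all cluster pairs, which is what makes the classical algorithm quadratic or worse and motivates the linear-time alternatives below.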
SIAM Journal on Scientific Computing | 2008
Fionn Murtagh; Geoff Downs; Pedro Contreras
Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding (through use of less than full precision in data values) we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
Journal of Classification | 2012
Pedro Contreras; Fionn Murtagh
Artificial Intelligence Review | 2003
Fionn Murtagh; Tugba Taskaya; Pedro Contreras; Josiane Mothe; Kurt Englmeier
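The longest common prefix (Baire) distance that recurs throughout these abstracts can be sketched in a few lines. A minimal illustration, assuming strings of decimal digits and base 2 for the distance (both choices are ours for exposition, not necessarily the papers'):

```python
# Baire (longest common prefix) distance: d(x, y) = base**(-k), where
# k is the length of the shared prefix of the two strings. The base
# and function names are illustrative assumptions.

def baire_distance(x: str, y: str, base: float = 2.0) -> float:
    if x == y:
        return 0.0          # convention: identical strings coincide
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return base ** -k       # longer shared prefixes -> smaller distances

d_near = baire_distance("0.345", "0.346")  # shared prefix "0.34", k = 4
d_far = baire_distance("0.345", "0.999")   # shared prefix "0.",   k = 2
```

Because the distance depends only on prefix agreement, values sharing a prefix fall into the same ball at that level, which is what makes single-pass hierarchy construction possible.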
P-adic Numbers, Ultrametric Analysis, and Applications | 2012
Fionn Murtagh; Pedro Contreras
The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic-time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in-depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e., we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this.
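The clusterwise regression idea can be pictured as: bucket objects Baire-style by a shared prefix of one variable, then fit a separate least-squares line in each bucket. A toy sketch with invented data (the prefix length and the values are our assumptions, not SDSS figures):

```python
# Clusterwise regression sketch: group (photometric, spectrometric)
# pairs by the first decimal digit of the photometric value (a
# Baire-style prefix bucket), then fit ordinary least squares per group.
from collections import defaultdict

def fit_line(pairs):
    """OLS slope and intercept for a list of (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

# Hypothetical (photometric, spectrometric) redshift pairs.
data = [(0.11, 0.12), (0.12, 0.13), (0.13, 0.14),
        (0.31, 0.30), (0.32, 0.31), (0.33, 0.32)]

clusters = defaultdict(list)
for photo, spec in data:
    clusters[f"{photo:.2f}"[:3]].append((photo, spec))  # keys "0.1", "0.3"

models = {key: fit_line(pairs) for key, pairs in clusters.items()}
```

Each bucket gets its own regression line, so the mapping from photometric to spectrometric redshift can differ across regions of the data, which is the point of the clusterwise approach.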
International Symposium on Statistical Learning and Data Sciences | 2015
Fionn Murtagh; Pedro Contreras
Following a short survey of input data types on which to construct interactive visual user interfaces, we report on a new and recent implementation taking concept hierarchies as input data. The visual user interfaces express domain ontologies which are based on these concept hierarchies. We detail a web-based implementation, and show examples of usage. An appendix surveys related systems, many of them commercial.
Archive | 2010
Pedro Contreras; Fionn Murtagh
We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We also offer insights on the practical implications of the precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections.
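Reading a hierarchical clustering in one pass, as described here, amounts to inserting each value into a digit trie: every extra digit of precision adds one level of the multiway tree. A minimal sketch (the precision, names, and toy values are illustrative choices of ours):

```python
# One-pass Baire hierarchy for values in [0, 1): each value's decimal
# digits index a path in a dict-of-dicts trie, so the whole multiway
# tree is built in a single scan of the data.

def baire_hierarchy(values, precision=3):
    root = {}
    for v in values:                # a single pass through the data
        digits = f"{v:.{precision}f}".split(".")[1]
        node = root
        for d in digits:            # descend one level per digit
            node = node.setdefault(d, {})
    return root

# 0.123 and 0.124 share the path "1" -> "2" and split at the third digit;
# 0.456 branches off at the root.
tree = baire_hierarchy([0.123, 0.124, 0.456])
```

No pairwise distances are ever computed, which is the sense in which the clusters are obtained "directly" and in linear time.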
arXiv: Machine Learning | 2012
Fionn Murtagh; Pedro Contreras
For high-dimensional clustering and proximity finding, also referred to as high dimension and low sample size data, we use random projection with the following principle. Since close-to-orthogonal projections are far more probable than exactly orthogonal ones in high dimensions, we can exploit the rank-order sensitivity of the projected values. Our Baire-metric, divisive hierarchical clustering runs in linear time.
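The random projection step can be sketched as: project each high-dimensional row onto a single random unit vector and keep the scalar, which tends to preserve the rank order of nearby rows. A pure-Python illustration (the dimension, seed, and data are invented):

```python
# Project rows of a high-dimensional dataset onto one random unit
# vector. Rows that are close in the original space stay close in the
# projected scalar values with high probability.
import math
import random

def random_projection(rows, seed=0):
    random.seed(seed)
    dim = len(rows[0])
    r = [random.gauss(0, 1) for _ in range(dim)]     # random direction
    norm = math.sqrt(sum(c * c for c in r))
    r = [c / norm for c in r]                        # unit vector
    return [sum(a * b for a, b in zip(row, r)) for row in rows]

# Two nearby 50-dimensional rows and one distant row.
rows = [[1.0] * 50, [1.01] * 50, [5.0] * 50]
proj = random_projection(rows)
```

The projected scalars can then be fed to a Baire-style prefix clustering, keeping the whole pipeline linear in the number of objects.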
Archives of Data Science Series A (Online First) | 2017
Fionn Murtagh; Pedro Contreras
The Baire or longest common prefix ultrametric allows a hierarchy, a multiway tree, or ultrametric topology embedding, to be constructed very efficiently. The Baire distance is a 1-bounded ultrametric. For high dimensional data, one approach for the use of the Baire distance is to base the hierarchy construction on random projections. In this paper we use the Baire distance on the Sloan Digital Sky Survey (SDSS, http://www.sdss.org) archive. We are addressing the regression of (high quality, more costly to collect) spectroscopic and (lower quality, more readily available) photometric redshifts. Nonlinear regression is used for mapping photometric to spectroscopic redshifts.
Entropy | 2009
Fionn Murtagh; Pedro Contreras; Jean-Luc Starck
Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. “Structure” can be understood as symmetry, and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy here, including ultrametric topology, generalized ultrametrics, linkages with lattices and other discrete algebraic structures, and p-adic number representations. By focusing on symmetries in data we have a powerful means of structuring and analyzing massive, high dimensional data stores. We illustrate the power of hierarchical clustering in case studies in chemistry and finance, and we provide pointers to other published case studies.
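The ultrametric topology reviewed here is characterized by the strong triangle inequality, d(x, z) ≤ max(d(x, y), d(y, z)), which is exactly what makes a distance hierarchical. A brute-force check on longest-common-prefix distances over an invented word set (names and data are ours):

```python
# Verify the strong triangle (ultrametric) inequality for the
# longest-common-prefix distance over all ordered triples of a small
# string set. An illustrative check, not a proof.
from itertools import permutations

def lcp_dist(x: str, y: str) -> float:
    if x == y:
        return 0.0
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return 2.0 ** -k

words = ["0.314", "0.315", "0.341", "0.500"]
ok = all(lcp_dist(x, z) <= max(lcp_dist(x, y), lcp_dist(y, z))
         for x, y, z in permutations(words, 3))
```

Every triangle under this distance is isosceles with the two longest sides equal, which is the geometric fingerprint of a hierarchy.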