Juan José Egozcue
Polytechnic University of Catalonia
Publications
Featured research published by Juan José Egozcue.
Mathematical Geosciences | 2003
Juan José Egozcue; Vera Pawlowsky-Glahn; G. Mateu-Figueras; C. Barceló-Vidal
Geometry in the simplex has been developed in the last 15 years mainly based on the contributions due to J. Aitchison. The main goal was to develop analytical tools for the statistical analysis of compositional data. Our present aim is to get a further insight into some aspects of this geometry in order to clarify the way for more complex statistical approaches. This is done by way of orthonormal bases, which allow for a straightforward handling of geometric elements in the simplex. The transformation into real coordinates preserves all metric properties and is thus called isometric logratio transformation (ilr). An important result is the decomposition of the simplex, as a vector space, into orthogonal subspaces associated with nonoverlapping subcompositions. This gives the key to join compositions with different parts into a single composition by using a balancing element. The relationship between ilr transformations and the centered-logratio (clr) and additive-logratio (alr) transformations is also studied. Exponential growth or decay of mass is used to illustrate compositional linear processes, parallelism and orthogonality in the simplex.
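The three logratio transformations named in this abstract can be sketched in a few lines. The following is only an illustration, not the authors' code: the function names are made up here, the alr uses the last part as denominator by assumption, and the ilr uses one particular orthonormal basis (each coordinate compares the geometric mean of the first i parts with part i + 1), which is one of many valid choices.

```python
import math

def clr(x):
    """Centered logratio: log of each part minus the mean of the logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def alr(x):
    """Additive logratio, with the last part as common denominator (an assumption)."""
    return [math.log(v / x[-1]) for v in x[:-1]]

def ilr(x):
    """Isometric logratio coordinates for one specific orthonormal basis."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        mean_first = sum(logs[:i]) / i  # log geometric mean of the first i parts
        coords.append(math.sqrt(i / (i + 1)) * (mean_first - logs[i]))
    return coords

def euclidean(u, v):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical three-part compositions.
x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]

d_clr = euclidean(clr(x), clr(y))  # Aitchison distance
d_ilr = euclidean(ilr(x), ilr(y))  # same value: ilr is an isometry
d_alr = euclidean(alr(x), alr(y))  # generally different: alr is not an isometry
print(d_clr, d_ilr, d_alr)
```

The distance between ilr coordinate vectors matches the distance between clr vectors, which is the metric-preserving property the abstract refers to; the alr distance does not match in general.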
Geological Society, London, Special Publications | 2006
Vera Pawlowsky-Glahn; Juan José Egozcue
Compositional data are those which contain only relative information. They are parts of some whole. In most cases they are recorded as closed data, i.e. data summing to a constant, such as 100%; whole-rock geochemical data are classic examples. Compositional data have important and particular properties that preclude the application of standard statistical techniques to such data in raw form. Standard techniques are designed to be used with data that are free to range from −∞ to +∞. Compositional data are always positive and range only from 0 to 100, or any other constant, when given in closed form. If one component increases, others must, perforce, decrease, whether or not there is a genetic link between these components. This means that the results of standard statistical analysis of the relationships between raw components or parts in a compositional dataset are clouded by spurious effects. Although such analyses may give apparently interpretable results, they are, at best, approximations and need to be treated with considerable circumspection. The methods outlined in this volume are based on the premise that it is the relative variation of components which is of interest, rather than absolute variation. Log-ratios of components provide the natural means of studying compositional data. In this contribution the basic terms and operations are introduced using simple numerical examples to illustrate their computation and to familiarize the reader with their use.
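A small numerical sketch may help with the closure idea described here. The values below are hypothetical and the helper name `closure` is an assumption of this example; the point is only that percentages change when a part is dropped, whereas the log-ratio of two parts does not.

```python
import math

def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

raw = [12.0, 30.0, 18.0]   # hypothetical amounts of three components
percent = closure(raw)     # the same sample expressed in percent
sub = closure(raw[:2])     # subcomposition: third part dropped, then re-closed

# The percentage of the first part depends on which parts are included...
print(percent[0], sub[0])

# ...but the log-ratio between the first two parts is unchanged.
log_ratio_full = math.log(percent[0] / percent[1])
log_ratio_sub = math.log(sub[0] / sub[1])
print(log_ratio_full, log_ratio_sub)
```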
Geological Society, London, Special Publications | 2006
Juan José Egozcue; Vera Pawlowsky-Glahn
The main features of the Aitchison geometry of the simplex of D parts are reviewed. Compositions are positive vectors in which the relevant information is contained in the ratios between their components or parts. They can be represented in the simplex of D parts by closing them to a constant sum, e.g. percentages, or parts per million. Perturbation and powering in the simplex of D parts are, respectively, an internal operation playing the role of a sum and an external product by real numbers or scalars. These operations give the simplex of D parts the structure of a (D − 1)-dimensional vector space. An inner product, norm and distance, compatible with perturbation and powering, complete the structure of the simplex, a structure known in mathematical terms as a Euclidean space. This general structure allows the representation of compositions by coordinates with respect to a basis of the space, particularly, an orthonormal basis. The interpretation of the so-called balances, coordinates with respect to orthonormal bases associated with groups of parts, is stressed. Subcompositions and balances are interpreted as orthogonal projections. Finally, log-ratio transformations (alr, clr and ilr) are considered in this geometric context.
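The operations listed in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: compositions are closed to sum 1, the function names are invented for the example, and the inner product is computed through clr coefficients, which is one standard way to express it.

```python
import math

def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def perturb(x, y):
    """Perturbation: the 'sum' of the simplex (closed componentwise product)."""
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    """Powering: the 'scalar product' of the simplex (closed componentwise power)."""
    return closure([a ** alpha for a in x])

def clr(x):
    """Centered logratio coefficients."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def inner(x, y):
    """Aitchison inner product, computed as the dot product of clr vectors."""
    return sum(a * b for a, b in zip(clr(x), clr(y)))

def norm(x):
    return math.sqrt(inner(x, x))

def distance(x, y):
    """Aitchison distance: norm of the perturbation difference x 'minus' y."""
    return norm(perturb(x, power(y, -1.0)))

# Hypothetical three-part compositions.
x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]
z = [0.1, 0.8, 0.1]
neutral = [1 / 3, 1 / 3, 1 / 3]  # neutral element of perturbation

print(perturb(x, neutral))  # x is recovered, up to rounding
print(inner(perturb(x, y), z), inner(x, z) + inner(y, z))  # additivity
print(distance(x, y), distance(y, x))  # symmetry
```

The printed pairs agree up to rounding, reflecting that the inner product is compatible with perturbation and that the distance is symmetric.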
Mathematical Geosciences | 2002
Vera Pawlowsky-Glahn; Juan José Egozcue
One of the principal objections to the logratio approach for the statistical analysis of compositional data has been the absence of unbiasedness and minimum variance properties of some estimators: they seem not to be BLU estimators. Using a geometric approach, we introduce the concept of metric variance and of a compositional unbiased estimator, and we show that the closed geometric mean is a c-BLU estimator (compositional best linear unbiased estimator with respect to the geometry of the simplex) of the center of the distribution of a random composition. Thus, it satisfies properties analogous to those of the arithmetic mean as a BLU estimator of the expected value in real space. The geometric approach used gives real meaning to the concepts of measure of central tendency and measure of dispersion and opens up a new way of understanding the statistical analysis of compositional data.
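As a rough illustration of the closed geometric mean, the sketch below computes it for a small hypothetical sample and compares the mean squared Aitchison distance around it with the same quantity around the closed arithmetic mean. The sample values and function names are assumptions of this example, and `mean_sq_dist` is only a simple sample analogue of the metric variance (dividing by n), not necessarily the estimator used in the paper.

```python
import math

def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def clr(x):
    """Centered logratio coefficients."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def center(sample):
    """Closed geometric mean, taken part by part over the sample."""
    n, d = len(sample), len(sample[0])
    geo = [math.exp(sum(math.log(row[j]) for row in sample) / n) for j in range(d)]
    return closure(geo)

def sq_dist(x, y):
    """Squared Aitchison distance, via clr coefficients."""
    return sum((a - b) ** 2 for a, b in zip(clr(x), clr(y)))

def mean_sq_dist(sample, point):
    """Average squared Aitchison distance from the sample to `point`."""
    return sum(sq_dist(row, point) for row in sample) / len(sample)

# Hypothetical sample of three-part compositions.
sample = [[0.1, 0.3, 0.6], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]

cen = center(sample)
arith = closure([sum(row[j] for row in sample) / len(sample) for j in range(3)])

metric_var = mean_sq_dist(sample, cen)      # dispersion around the center
around_arith = mean_sq_dist(sample, arith)  # never smaller than metric_var
print(cen, metric_var, around_arith)
```

The dispersion is smallest around the closed geometric mean, because its clr coefficients are the arithmetic mean of the sample's clr coefficients.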
PLOS Computational Biology | 2015
David Lovell; Vera Pawlowsky-Glahn; Juan José Egozcue; Samuel Marguerat; Jürg Bähler
In the life sciences, many measurement methods yield only the relative abundances of different components in a sample. With such relative—or compositional—data, differential expression needs careful interpretation, and correlation—a statistical workhorse for analyzing pairwise relationships—is an inappropriate measure of association. Using yeast gene expression data we show how correlation can be misleading and present proportionality as a valid alternative for relative data. We show how the strength of proportionality between two variables can be meaningfully and interpretably described by a new statistic ϕ which can be used instead of correlation as the basis of familiar analyses and visualisation methods, including co-expression networks and clustered heatmaps. While the main aim of this study is to present proportionality as a means to analyse relative data, it also raises intriguing questions about the molecular mechanisms underlying the proportional regulation of a range of yeast genes.
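A minimal sketch of a proportionality statistic of this kind follows. It assumes the definition ϕ = var(log(x/y)) / var(log x) applied to clr-transformed relative abundances, which should be checked against the paper before use; the abundance table is invented for illustration and is not the yeast data.

```python
import math
from statistics import variance

def clr(x):
    """Centered logratio coefficients of one sample."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def phi(log_x, log_y):
    """Assumed definition: variance of the log-ratio over variance of log x."""
    diff = [a - b for a, b in zip(log_x, log_y)]
    return variance(diff) / variance(log_x)

# Hypothetical abundances of components a, b, c in five samples.
# b is exactly three times a; c is constant and therefore not proportional to a.
a = [1.0, 2.0, 4.0, 8.0, 16.0]
b = [3.0 * v for v in a]
c = [5.0, 5.0, 5.0, 5.0, 5.0]

# Transform each sample (row) with clr, then read off the columns.
rows = [clr([ai, bi, ci]) for ai, bi, ci in zip(a, b, c)]
clr_a = [r[0] for r in rows]
clr_b = [r[1] for r in rows]
clr_c = [r[2] for r in rows]

phi_ab = phi(clr_a, clr_b)  # essentially zero: a and b are proportional
phi_ac = phi(clr_a, clr_c)  # clearly positive: a and c are not proportional
print(phi_ab, phi_ac)
```

Under this definition, values near zero indicate strong proportionality, and the statistic is not symmetric in its two arguments.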
Mathematical Geosciences | 2002
Hilmar von Eynatten; Vera Pawlowsky-Glahn; Juan José Egozcue
Perturbation is an operation defined on the simplex and can be used for centering compositional data in a ternary diagram, applying objective criteria. Because a straight line in the original diagram is still a straight line in the perturbed diagram, gridlines or compositional fields defined by straight lines can easily be included in the operation. Simultaneous perturbation of data, gridlines, and/or compositional fields is shown to improve both visualization and graphical interpretation of compositions in ternary diagrams. This is illustrated by some examples using simulated as well as real data.
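Centering by perturbation can be sketched as below. The example assumes that the perturbation used is the inverse of the closed geometric mean of the data, so that the centered data have the barycenter of the ternary diagram as their center; the data and function names are hypothetical.

```python
import math

def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def perturb(x, y):
    """Perturbation: closed componentwise product."""
    return closure([a * b for a, b in zip(x, y)])

def center(sample):
    """Closed geometric mean, taken part by part over the sample."""
    n, d = len(sample), len(sample[0])
    geo = [math.exp(sum(math.log(row[j]) for row in sample) / n) for j in range(d)]
    return closure(geo)

def centering_vector(sample):
    """Inverse of the center: perturbing by it moves the center to the barycenter."""
    return closure([1.0 / c for c in center(sample)])

# Hypothetical data crowded near one vertex of a ternary diagram.
data = [[0.90, 0.06, 0.04], [0.85, 0.10, 0.05], [0.92, 0.03, 0.05]]
shift = centering_vector(data)

centered = [perturb(row, shift) for row in data]

# The same vector would be applied to any gridline or field boundary points,
# so that they stay consistent with the centered data.
gridline_point = perturb([0.5, 0.25, 0.25], shift)

print(center(centered))  # close to [1/3, 1/3, 1/3]
```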
Frontiers in Plant Science | 2013
Serge-Étienne Parent; Léon-Étienne Parent; Juan José Egozcue; Danilo-Eduardo Rozane; Amanda Hernandes; Line Lapointe; Valérie Hébert-Gentile; Kristine Naess; Sébastien Marchand; Jean Lafond; Dirceu Mattos; Philip Barlow; William Natale
Tissue analysis is commonly used in ecology and agronomy to portray plant nutrient signatures. Nutrient concentration data, or ionomes, belong to the compositional data class, i.e., multivariate data that are proportions of some whole, hence carrying important numerical properties. Statistics computed across raw or ordinary log-transformed nutrient data are intrinsically biased, hence possibly leading to wrong inferences. Our objective was to present a sound and robust approach based on a novel nutrient balance concept to classify plant ionomes. We analyzed leaf N, P, K, Ca, and Mg of two wild and six domesticated fruit species from Canada, Brazil, and New Zealand sampled during reproductive stages. Nutrient concentrations were (1) analyzed without transformation, (2) ordinary log-transformed as commonly but incorrectly applied in practice, (3) additive log-ratio (alr) transformed as surrogate to stoichiometric rules, and (4) converted to isometric log-ratios (ilr) arranged as sound nutrient balance variables. Raw concentration and ordinary log transformation both led to biased multivariate analysis due to redundancy between interacting nutrients. The alr- and ilr-transformed data provided unbiased discriminant analyses of plant ionomes, where wild and domesticated species formed distinct groups and the ionomes of species and cultivars were differentiated without numerical bias. The ilr nutrient balance concept is preferable to alr, because the ilr technique projects the most important interactions between nutrients into a convenient Euclidean space. This novel numerical approach allows rectifying historical biases and supervising phenotypic plasticity in plant nutrition studies.
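An ilr balance between two groups of nutrients can be computed as a normalized log-ratio of their geometric means. The sketch below uses the usual balance formula, sqrt(r·s/(r+s)) · ln(g(numerator)/g(denominator)); the leaf concentrations and the grouping [N, P | K, Ca, Mg] are hypothetical choices for illustration, not values or partitions taken from the study.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(composition, numerator, denominator):
    """ilr balance between two non-overlapping groups of parts."""
    r, s = len(numerator), len(denominator)
    g_num = geometric_mean([composition[k] for k in numerator])
    g_den = geometric_mean([composition[k] for k in denominator])
    return math.sqrt(r * s / (r + s)) * math.log(g_num / g_den)

# Hypothetical leaf concentrations (percent of dry mass); not data from the paper.
leaf = {"N": 2.4, "P": 0.2, "K": 1.5, "Ca": 1.0, "Mg": 0.3}

# Hypothetical grouping: anions-forming nutrients versus cations.
b1 = balance(leaf, ["N", "P"], ["K", "Ca", "Mg"])

# Changing the measurement unit rescales every part and leaves the balance unchanged.
scaled = {k: 10.0 * v for k, v in leaf.items()}
b1_scaled = balance(scaled, ["N", "P"], ["K", "Ca", "Mg"])
print(b1, b1_scaled)
```

Because a balance depends only on ratios between parts, it is unaffected by the unit or closure constant used to report the concentrations.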
Annals of Epidemiology | 2016
Gregory B. Gloor; Jia Rong Wu; Vera Pawlowsky-Glahn; Juan José Egozcue
Purpose: The ability to properly analyze and interpret large microbiome data sets has lagged behind our ability to acquire such data sets from environmental or clinical samples. Sequencing instruments impose a structure on these data: the natural sample space of a 16S rRNA gene sequencing data set is a simplex, which is a part of real space that is restricted to nonnegative values with a constant sum. Such data are compositional and should be analyzed using compositionally appropriate tools and approaches. However, most of the tools for 16S rRNA gene sequencing analysis assume these data are unrestricted. Methods: We show that existing tools for compositional data (CoDa) analysis can be readily adapted to analyze high-throughput sequencing data sets. Results: The Human Microbiome Project tongue versus buccal mucosa data set shows how the CoDa approach can address the major elements of microbiome analysis. Reanalysis of a publicly available autism microbiome data set shows that the CoDa approach, in concert with multiple hypothesis test corrections, prevents false positive identifications. Conclusions: The CoDa approach is readily scalable to microbiome-sized analyses. We provide example code and make recommendations to improve the analysis and reporting of microbiome data sets.
Frontiers in Microbiology | 2017
Gregory B. Gloor; Jean M. Macklaim; Vera Pawlowsky-Glahn; Juan José Egozcue
Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers, metagenomes or metatranscriptomes are commonplace and being used to study human disease states, ecological differences between sites, and the built environment. There is increasing awareness that microbiome datasets generated by HTS are compositional because they have an arbitrary total imposed by the instrument. However, many investigators are either unaware of this or assume specific properties of the compositional data. The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis. We briefly introduce compositional data, illustrate the pathologies that occur when compositional data are analyzed inappropriately, and finally give guidance and point to resources and examples for the analysis of microbiome datasets using compositional data analysis.
Journal of Hydraulic Research | 2008
Agustín Sánchez-Arcilla; Jesus Gomez Aguar; Juan José Egozcue; M. I. Ortego; Panagiota Galiatsatou; Panagiotis Prinos
This paper deals with the analysis of extreme wave heights and their uncertainties. The main purpose is to assess confidence intervals using a conventional extreme value approach and a Bayesian approach. It is shown how the introduction of a priori information helps to bound the upper confidence limit. The analysis is performed with wave-height data recorded off the Spanish Catalan coast (NW Mediterranean) and wave-height data from the Dutch coast (North Sea). An analysis with natural-scale and log-transformed wave-height time series has been performed. This scale selection is proven to be advantageous for naturally bounded variables and also better captures some distribution features. The paper ends with a discussion on how the different techniques can be used to select a statistically robust threshold for an extreme event definition. This affects the evaluation of risk in low-lying coastal areas, associated with variables controlling flooding and erosion risks.