intRinsic: an R package for model-based estimation of the intrinsic dimension of a dataset

Francesco Denti
[email protected]
University of California, Irvine

February 24, 2021
Abstract

The estimation of the intrinsic dimension of a dataset is a fundamental step in most dimensionality reduction techniques. This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset. In detail, the methods included in this package are the TWO-NN, Gride, and Hidalgo models. To allow these novel estimators to be easily accessible, the package contains a few high-level, intuitive functions that rely on a broader set of efficient, low-level routines. intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the TWO-NN and Gride models. The functions dedicated to these two methods carry out inference under both the frequentist and Bayesian frameworks. In the second category we find Hidalgo, a Bayesian mixture model, for which an efficient Gibbs sampler is implemented. After discussing the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can assess the results by comparing them with the ground truth. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. We show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights about the topological structure of a dataset.
1 Introduction

Statisticians and data scientists are often called to manipulate, analyze, and summarize datasets that present high-dimensional and elaborate dependency structures. In multiple cases, these large datasets are constituted by variables that contain a considerable amount of redundant information. One can exploit these redundancies to represent a large dataset on a much lower-dimensional scale. This summarization procedure, called dimensionality reduction, is a fundamental step in many statistical analyses. For example, dimensionality reduction techniques make otherwise prohibitive manipulation and visualization of large datasets feasible by reducing computational time and memory requirements.

More formally, dimensionality reduction is possible whenever the data lie on one manifold (or more) characterized by a lower dimension than what was observed. We call the dimension of this latent, potentially non-linear manifold the intrinsic dimension (id). Several other definitions of the id exist in the literature. For example, we can see the id as the minimal number of parameters needed to represent all the information contained in the data without significant information loss [Ansuini et al., 2019, Rozza et al., 2011, Bennett, 1969]. Intuitively, the id is an indicator of the complexity of the features of a dataset. It is a necessary piece of information to have before attempting to perform any dimensionality reduction, manifold learning, or visualization task. Indeed, most dimensionality reduction methods would be worthless without a reliable estimate of the true id they need to target: a too-small id value can cause needless information loss, while the reverse can lead to an unnecessary waste of time and computational resources [Hino et al., 2017].

Over the past few decades, a vast number of methods for id estimation and dimensionality reduction have been developed. The algorithms can be broadly classified into two main categories: projection methods and geometric methods. The former maps the original data to a lower-dimensional space. The projection function can be linear, as in the case of Principal Component Analysis (PCA) [Hotelling, 1933], or nonlinear, as in the case of Locally Linear Embedding [Roweis and Saul, 2000], Isomap [Tenenbaum et al., 2000], and t-SNE [van der Maaten and Hinton, 2008]. For more examples, see Jolliffe and Cadima [2016] and the references therein. Consequently, there is a plethora of R packages that implement these types of algorithms: for example, RDRToolbox [Bartenhagen, 2020], lle [Kayo, 2006],
Rtsne [Krijthe, 2015], and the classic princomp function from the default package stats.

Geometric methods rely instead on the topology of a dataset, exploiting the properties of the distances between data points. Within this family, we can find fractal methods [Falconer, 2003], graphical methods [Costa and Hero, 2004], methods based on likelihood approaches [Levina and Bickel, 2005], and methods based on nearest neighbors distances [Pettis et al., 1979]. Numerous packages are available: for example, for fractal methods alone there are fractaldim [Sevcikova et al., 2014], nonlinearTseries [Garcia, 2020], and tseriesChaos [Di Narzo, 2019], among others. For a recent review of the methods used for id estimation we refer to Campadelli et al. [2015].

Given the abundance of methods in this area, several R developers have also attempted to provide unifying collections of dimensionality reduction and id estimation methods. For example, valuable ensembles of methodologies are implemented in the packages ider [Hino, 2017], dimRed and coRanking [Kraemer et al., 2018], dyndimred [Cannoodt and Saelens, 2020], IDmining [Golay and Kanevski, 2017], and intrinsicDimension [Johnsson and Lund University, 2019]. Among these proposals, the package Rdimtools [You, 2020b] stands out, implementing 150 different algorithms, 17 of which are exclusively dedicated to id estimation [You, 2020a].

In this paper, we introduce and discuss the R package intRinsic, which is openly available at the GitHub repository https://github.com/Fradenti/intRinsic. The package implements the TWO-NN, Gride, and
Hidalgo models, three state-of-the-art id estimators recently introduced in Facco et al. [2017], Denti [2020], and Allegra et al. [2020], respectively. All are likelihood-based estimators that rely on the theoretical properties of the distances among the nearest neighbors. The first two models yield an estimate of a global, unique id for a dataset and are implemented under both the frequentist and Bayesian paradigms. Hidalgo, on the other hand, is a Bayesian mixture model that allows for the estimation of clusters of points characterized by heterogeneous ids.

The package contains two sets of functions, organized into high-level and low-level routines. The former set contains R functions intended to be user-friendly and straightforward in their usage. Our goal is to make the package as accessible and intuitive as possible by automating most of the tasks. The low-level set represents the core of the package, and it is not exported. The most computationally intensive low-level functions are written in C++, exploiting the interface with R provided by the packages Rcpp and RcppArmadillo [Eddelbuettel and François, 2011, Eddelbuettel and Sanderson, 2014]. The C++ implementation considerably speeds up time-consuming tasks, like running the Gibbs sampler for the Bayesian mixture model Hidalgo. Moreover, intRinsic is well integrated with external R packages. For example, all the graphical outputs are produced using ggplot2 [Wickham, 2016] or pheatmap [Kolde, 2019] and, where appropriate, simulated Markov chain Monte Carlo (MCMC) samples are exported as coda [Plummer et al., 2006] objects. Lastly, we extended the generic method autoplot from the ggplot2 package to create a set of ad-hoc plotting functions that interact with the S3 classes defined in the package intRinsic.

The article is structured as follows. Section 2 introduces and describes the modeling background, linking each method to its corresponding high-level function. Section 3 presents the functions contained in the package with applications to simulated and real microarray datasets. We show how to obtain, manipulate, and interpret the different outputs. Additionally, we assess the robustness of the methods by monitoring how the results vary when the input parameters change. Finally, Section 4 discusses future extensions to the package.

2 Modeling background

Let us consider $X$, a collection of $n$ data points measured over $D$ variables. We denote each observation as $x_i \in \mathbb{R}^D$. Despite being observed over a $D$-dimensional space, we suppose that the points lie entirely on a latent manifold $\mathcal{M}$ with intrinsic dimension (id) $d \leq D$. Generally, we expect that $d \ll D$: a low-dimensional data-generating mechanism can accurately describe the dataset.

Now, we focus on the single data point $x_i$. Starting from this point, one can order the remaining $n-1$ observations according to their distance from $x_i$. In this way, we obtain a list of nearest neighbors (NNs) of increasing order. Formally, let $\Delta : \mathbb{R}^D \times \mathbb{R}^D \rightarrow \mathbb{R}^+$ be a generic distance function between data points. We denote with $x_i^{(l)}$ the $l$-th NN of $x_i$ and with $r_{i,l} = \Delta\big(x_i, x_i^{(l)}\big)$ their distance,
for $l = 1, \ldots, n-1$. Given the sequence of NNs for each data point, we can define the volume of the hyper-spherical shell enclosed between two successive neighbors of $x_i$ as

$$\nu_{i,l} = \omega_d \left( r_{i,l}^d - r_{i,l-1}^d \right), \qquad l = 1, \ldots, n-1, \quad i = 1, \ldots, n, \qquad (1)$$

where $d$ is the dimensionality of the latent manifold in which the points are embedded (the id) and $\omega_d$ is the volume of the $d$-dimensional hyper-sphere with unitary radius. For this formula to hold, we need to set $x_i^{(0)} \equiv x_i$ and $r_{i,0} = 0$. Considering the two-dimensional case for simplicity, we provide a visual representation of the quantities involved in Figure 1 for $l = 1, 2$.

[Figure 1: Representation of the first two NNs of point $x_i$ and the corresponding distances. In two dimensions, the volumes of the hyper-spherical shells $\nu_{i,1}$ and $\nu_{i,2}$ are equal to the surfaces of the first circle and the outer ring, respectively.]

The fundamental assumption is that $X$ is a realization of a Poisson point process characterized by density function $\rho(x)$. Facco et al. [2017] showed that the hyper-spherical shells defined in Equation (1) are the multivariate extension of the well-known inter-arrival times [Kingman, 1992]. Therefore, they proved that under the assumption of homogeneity of the Poisson point process, i.e. $\rho(x) = \rho$ for all $x$, all the $\nu_{i,l}$'s are independently drawn from an Exponential distribution with rate equal to the density $\rho$: $\nu_{i,l} \sim \text{Exp}(\rho)$, for $l = 1, \ldots, n-1$ and $i = 1, \ldots, n$. This fundamental result motivates the derivation of the estimators we will introduce in the following sections.

2.1 The TWO-NN family of estimators

Building on the distribution of the hyper-spherical shells, Facco et al. [2017] noticed that, if the intensity of the Poisson point process is assumed to be constant on the scale of the second NN, the following distributional result holds:

$$\mu_i = \frac{r_{i,2}}{r_{i,1}} \sim \text{Pareto}(1, d), \qquad \mu_i \in (1, +\infty), \quad i = 1, \ldots, n. \qquad (2)$$

In other words, if the intensity of the Poisson point process that generates the data can be regarded as locally constant (on the scale of the second NN), the ratio of the distances from the closest two NNs is Pareto distributed. Recall that a Pareto random variable is characterized by a scale parameter $a$, a shape parameter $b$, and density function $f_X(x) = b\, a^b x^{-(b+1)}$ defined over $x \in [a, +\infty)$. Remarkably, Equation (2) states that the ratio $r_{i,2}/r_{i,1}$ follows a Pareto distribution with scale $a = 1$ and shape $b = d$, i.e. the shape parameter can be interpreted as the id of the data. We regard the id estimation methods derived from this result as members of the TWO-NN family of estimators. To obtain the vector $\mu = (\mu_i)_{i=1}^n$, one can use the intRinsic function generate_mus. Once the vector $\mu$ is collected, we can employ different estimators for the id.
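Before presenting the estimators, the result in Equation (2) is easy to check by simulation. The following minimal sketch uses base R plus FNN (the NN-search package that intRinsic itself relies on, as discussed in Section 3); the variable names cube and mu are ours, not the package's.

# Simulation check of Equation (2): for (locally) homogeneous points on a
# d-dimensional manifold, mu_i = r_{i,2} / r_{i,1} behaves like a Pareto(1, d)
# draw. Here d = 5 by construction.
library(FNN)

set.seed(1)
d_true <- 5
cube   <- matrix(runif(1000 * d_true), ncol = d_true)  # 1000 points in [0,1]^5

knn <- get.knn(cube, k = 2)                # distances to the first two NNs
mu  <- knn$nn.dist[, 2] / knn$nn.dist[, 1]

# If mu ~ Pareto(1, d), then log(mu) ~ Exp(d), so the reciprocal of the
# average log-ratio should be close to the true dimension, 5.
1 / mean(log(mu))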
Linear Estimator. Facco et al. [2017] proposed to estimate the id via the linearization of the Pareto c.d.f., $F(\mu_i) = 1 - \mu_i^{-d}$. The estimate $\hat{d}_{OLS}$ is obtained as the solution of

$$-\log\left(1 - \hat{F}(\mu_{(i)})\right) = d \log(\mu_{(i)}), \qquad (3)$$

where $\hat{F}(\cdot)$ denotes the empirical c.d.f. of the sample and the $\mu_{(i)}$'s are the ratios defined in Equation (2) sorted in increasing order. To obtain a more robust estimation, the authors suggested trimming from $\mu$ a percentage $\alpha_{TR}$ of the most extreme values. This choice is justified because the extreme ratios often correspond to observations that do not comply with the local homogeneity assumption. This estimator is implemented within the function linfit_twonn.
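As a minimal sketch of Equation (3), the linear estimator can be reproduced with an intercept-free regression, reusing the mu vector computed in the previous chunk (this is an illustration, not the package's internal implementation):

# Regress -log(1 - F_hat(mu)) on log(mu) with no intercept; the slope
# estimates d. Trimming the most extreme 1% of ratios also removes the
# largest order statistic, where F_hat equals 1 and the log would diverge.
mu_sorted <- sort(mu)
n     <- length(mu_sorted)
F_hat <- (1:n) / n                  # empirical c.d.f. at the sorted ratios
keep  <- 1:floor(n * 0.99)

x <- log(mu_sorted[keep])
y <- -log(1 - F_hat[keep])
coef(lm(y ~ x - 1))                 # slope = OLS estimate of the id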
Maximum Likelihood Estimator. In a similar spirit, Denti [2020] took advantage of the distributional results in Equation (2) to derive a simple Maximum Likelihood Estimator (MLE) and a corresponding Confidence Interval (CI). Trivially, the (unbiased) MLE for the shape parameter of a Pareto distribution is given by

$$\hat{d} = \frac{n-1}{\sum_{i=1}^{n} \log(\mu_i)}, \qquad (4)$$

while the corresponding CI of level $1-\alpha$ is defined as

$$CI(d, 1-\alpha) = \left( \frac{\hat{d}}{q^{IG}_{1-\alpha/2}(n,\, n-1)},\; \frac{\hat{d}}{q^{IG}_{\alpha/2}(n,\, n-1)} \right), \qquad (5)$$

where $q^{IG}_{\alpha/2}(a, b)$ denotes the quantile of order $\alpha/2$ of an Inverse-Gamma distribution with shape $a$ and scale $b$. The MLE estimate and CI can be obtained with the function mle_twonn.

Bayesian Estimator. It is also straightforward to derive an estimator according to a Bayesian perspective. Indeed, one can specify a prior distribution on the shape parameter $d$. The most natural prior to choose is the conjugate $d \sim \text{Gamma}(a, b)$. It is immediate to derive the posterior distribution of the shape parameter:

$$d \mid \mu \sim \text{Gamma}\left( a + n,\; b + \sum_{i=1}^{n} \log(\mu_i) \right). \qquad (6)$$

With the function bayesfit_twonn, one can obtain the principal quantiles of the posterior distribution, collecting point estimates and uncertainty quantification with a credible interval (CrI) of level $\alpha$.

2.2 Gride: the generalized ratio intrinsic dimension estimator

Denti [2020] also extended the previous theoretical framework by deriving the distribution of the ratio of two generic distances from a point $x_i$ to two of its NNs of generic order. Let us consider two integers $n_1 < n_2$ and assume that the density $\rho$ is approximately locally constant on the scale defined by the distance of the $n_2$-th NN, denoted as $r_{i,n_2}$. Moreover, we define

$$\dot{\mu}_i = \mu_{i,n_1,n_2} = \frac{r_{i,n_2}}{r_{i,n_1}}. \qquad (7)$$

It is possible to derive a closed-form density for the random variable $\dot{\mu}$:

$$f_{\mu_{i,n_1,n_2}}(\dot{\mu}) = \frac{d\, \left( \dot{\mu}^d - 1 \right)^{n_2 - n_1 - 1}}{B(n_2 - n_1,\, n_1)\; \dot{\mu}^{(n_2-1)d + 1}}, \qquad \dot{\mu} \in (1, +\infty), \qquad (8)$$

where $B(\cdot, \cdot)$ denotes the Beta function. Starting from this distribution, they are able to propose the generalized ratio intrinsic dimension estimator (Gride) family. Unfortunately, closed-form representations of the MLE and CI are not available. We need to resort to numerical optimization and parametric bootstrapping within the frequentist framework, and to MCMC simulation within the Bayesian setting. To do so, we can use the functions mle_gride, bootstrap_gride, and bayesfit_gride.
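To fix ideas, the numerical optimization can be sketched in a few lines of base R. The helper names below (gride_loglik, gride_mle) are ours, and the Beta-function term of Equation (8), which does not depend on $d$, is dropped from the objective; the package's own routines are more refined.

# Numerical maximization of the log of the Gride density in Equation (8).
gride_loglik <- function(d, mu_dot, n1, n2) {
  sum(log(d) + (n2 - n1 - 1) * log(mu_dot^d - 1) -
        ((n2 - 1) * d + 1) * log(mu_dot))
}

gride_mle <- function(mu_dot, n1, n2, upper = 50) {
  optimize(gride_loglik, interval = c(1e-3, upper), maximum = TRUE,
           mu_dot = mu_dot, n1 = n1, n2 = n2)$maximum
}

# Example on the hypercube data from the earlier sketch, with n1 = 2, n2 = 4:
knn4   <- FNN::get.knn(cube, k = 4)
mu_dot <- knn4$nn.dist[, 4] / knn4$nn.dist[, 2]
gride_mle(mu_dot, n1 = 2, n2 = 4)   # should again be close to 5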
2.3 Hidalgo: the heterogeneous intrinsic dimension algorithm

All the estimators we described in the previous sections implicitly assume that the id of a dataset is unique. However, considering a single value for the id can often be limiting, especially when the data present complex dependence structures among variables. To extend the previous modeling framework, one can imagine that the data points are divided into clusters, each of them lying on a latent manifold with its specific id. Allegra et al. [2020] employed this heterogeneous id estimation approach in their model: the heterogeneous id algorithm (Hidalgo). The authors assumed that the density function of the generating point process is in fact a mixture of $K$ distributions defined on $K$ different latent manifolds, expressed as $\rho(x) = \sum_{k=1}^{K} \pi_k \rho_k(x)$, where $\pi = (\pi_1, \ldots, \pi_K)$ is the vector of mixture weights. This assumption induces a mixture of Pareto distributions as the distribution of the ratios $\mu_i$'s:

$$f(\mu_i \mid d, \pi) = \sum_{k=1}^{K} \pi_k\, d_k\, \mu_i^{-(d_k + 1)}, \qquad (9)$$

where $d = (d_1, \ldots, d_K)$ is the vector of the different ids and $\pi$ is the vector containing the $K$ mixture weights. Allegra et al. [2020] adopted a Bayesian perspective, specifying independent Gamma priors on each element of $d$, $d_k \sim \text{Gamma}(a_d, b_d)$, and a Dirichlet prior on the mixture weights, $\pi \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$.

Unfortunately, a model-based clustering approach such as the one presented in Equation (9) is ineffective at modeling the data. The problem lies in the fact that the different Pareto kernel densities constituting the mixture components are extremely similar and overlapping. Therefore, Pareto densities with varying shape parameters can fit the same data points equally well, compromising the clustering. Even when considering very diverse shape parameters, the right tails of the different Pareto distributions overlap to a great extent. This issue is evident in Figure 2, where we illustrate different examples of Pareto(1, d) densities. Thus, the distributional similarity jeopardizes the cluster assignments of the data points and the consequent estimation of the ids.

[Figure 2: Density functions of the Pareto(1, d) distribution for different values of the shape parameter d.]

To address this problem, Allegra et al. [2020] introduced a local homogeneity assumption, which postulates that points close to each other are more likely to be part of the same latent manifold. To incorporate this, the authors added an extra penalizing term in the likelihood. We now summarize their approach.

First, they introduced the latent membership labels $z = (z_1, \ldots, z_n)$ to assign each observation to a cluster, where $z_i = k$ means that the $i$-th observation is assigned to the $k$-th mixture component. Then, they defined the binary adjacency matrix $N^{(q)}$, whose entries are $N^{(q)}_{ij} = 1$ if the point $x_j$ is among the $q$ NNs of $x_i$ and 0 otherwise. Finally, they assumed the following probabilities:

$$P\left[ N^{(q)}_{ij} = 1 \mid z_i = z_j \right] = \zeta_1, \text{ with } \zeta_1 > 0.5; \qquad P\left[ N^{(q)}_{ij} = 1 \mid z_i \neq z_j \right] = \zeta_0, \text{ with } \zeta_0 < 0.5.$$
These probabilities are incorporated in the following distributional constraint for the data point $x_i$: $\pi(N^{(q)}_i \mid z) = \prod_{j=1}^{n} \zeta_1^{\mathbb{1}_{z_i = z_j}} \zeta_0^{\mathbb{1}_{z_i \neq z_j}} / Z_i$, where $Z_i$ is the normalizing constant and $\mathbb{1}_A$ is the indicator function, equal to 1 when the event $A$ is true, 0 otherwise. A more technical discussion of this model extension and the validity of the underlying hypothesis can be found in the Supplementary Material of Allegra et al. [2020]. For simplicity, we assume $\zeta_1 = \zeta$ and $\zeta_0 = 1 - \zeta$. The model becomes

$$\mathcal{L}\left( \mu_i, N^{(q)}_i \mid d, z, \zeta \right) = d_{z_i}\, \mu_i^{-(d_{z_i}+1)} \times \prod_{j=1}^{n} \frac{\zeta^{\mathbb{1}_{z_i = z_j}} (1-\zeta)^{\mathbb{1}_{z_i \neq z_j}}}{Z_i}, \qquad z_i \mid \pi \sim \text{Cat}_K(\pi), \qquad (10)$$

where $\text{Cat}_K$ denotes a Categorical distribution over $\{1, \ldots, K\}$. A closed form for the posterior distribution is not available, so we rely on MCMC techniques to simulate a posterior sample. The function Hidalgo implements the Bayesian mixture model under the conjugate prior and two other alternative distributions. When the nominal dimension $D$ is low, the unbounded support of a Gamma prior may provide unrealistic results, where the posterior distribution assigns positive density to the interval $(D, +\infty)$. Santos-Fernandez et al. [2020] proposed to employ a more informative prior for $d$:

$$\pi(d_k) = \hat{\rho} \cdot \frac{d_k^{a-1} \exp\{-b d_k\}\, \mathbb{1}_{(0,D)}(d_k)}{C_{a,b,D}} + (1 - \hat{\rho}) \cdot \delta_D(d_k) \qquad \forall k, \qquad (11)$$

where $C_{a,b,D}$ denotes the normalizing constant of a $\text{Gamma}(a, b)$ distribution truncated over $(0, D]$. That is, the prior distribution for $d_k$ is a mixture of a Gamma distribution truncated over $(0, D]$ and a point mass located at $D$. The parameter $\hat{\rho}$ denotes the mixing proportion. When $\hat{\rho} = 1$, the distribution in (11) reduces to a simple truncated Gamma. Both approaches are implemented in intRinsic.

The steps of the Gibbs sampler are the following:

1. Sample the mixture weights according to

$$\pi \mid \cdots \sim \text{Dirichlet}\left( \alpha_1 + \sum_{i=1}^{n} \mathbb{1}_{z_i = 1}, \ldots, \alpha_K + \sum_{i=1}^{n} \mathbb{1}_{z_i = K} \right).$$
2. Let $z_{-i}$ denote the vector $z$ without its $i$-th element. Sample the cluster indicators $z_i$ according to

$$P(z_i = k \mid z_{-i}, \cdots) \propto \pi_k\, f\left( \mu_i, N^{(q)}_i \mid z_1, \ldots, z_{i-1}, k, z_{i+1}, \ldots, z_n, d \right).$$
We emphasize that, given the new likelihood we are considering, the cluster labels are no longer independent given all the other parameters. Let us define $z^k_i = (z_1, \ldots, z_{i-1}, k, z_{i+1}, \ldots, z_n)$. Then, let $N_{z_i}(z_{-i})$ be the number of elements in the $(n-1)$-dimensional vector $z_{-i}$ that are assigned to the same manifold (mixture component) as $z_i$. Moreover, let $m^{in}_i = \sum_l N^{(q)}_{li} \mathbb{1}_{z_l = z_i}$ be the number of points sampled from the same manifold as the $i$-th observation that have $x_i$ as a neighbor, and let $n^{in}_i(z) = \sum_l N^{(q)}_{il} \mathbb{1}_{z_l = z_i} \leq q$ be the number of neighbors of $x_i$ sampled from the same manifold. Then, we can simplify the previous formula, obtaining the following full conditional:

$$P(z_i = k \mid z_{-i}, \cdots) \propto \pi_k\, d_k\, \mu_i^{-(d_k+1)}\, Z\left(\zeta, N_{z_i=k}(z_{-i}) + 1\right) \times \left( \frac{\zeta}{1-\zeta} \right)^{n^{in}_i(z^k_i) + m^{in}_i(z^k_i)} \times \left( \frac{Z(\zeta, N_{z_i=k}(z_{-i}))}{Z(\zeta, N_{z_i=k}(z_{-i}) + 1)} \right)^{N_{z_i=k}(z_{-i})}. \qquad (12)$$

See Facco and Laio [2017] for a detailed derivation of this result.

3. The posterior distribution for $d$ depends on the prior specification we adopt:

a) If we assume a conjugate Gamma prior, we obtain

$$d_k \mid \cdots \sim \text{Gamma}\left( a + n_k,\; b + \sum_{i : z_i = k} \log \mu_i \right),$$

where $n_k = \sum_{i=1}^{n} \mathbb{1}_{z_i = k}$ is the number of observations assigned to the $k$-th group;

b) If the prior is assumed to be a Gamma distribution truncated on $(0, D)$, then

$$d_k \mid \cdots \sim \text{Gamma}\left( a + n_k,\; b + \sum_{i : z_i = k} \log \mu_i \right) \mathbb{1}_{(0,D)}(\cdot);$$

c) Finally, let us define $a^* = a + n_k$ and $b^* = b + \sum_{i : z_i = k} \log \mu_i$. If the prior is assumed to be a truncated Gamma with a point mass at $D$, we obtain

$$d_k \mid \cdots \sim \frac{\hat{\rho}^*_1}{\hat{\rho}^*_1 + \hat{\rho}^*_2}\, \text{Gamma}(a^*, b^*)\, \mathbb{1}_{(0,D)}(\cdot) + \frac{\hat{\rho}^*_2}{\hat{\rho}^*_1 + \hat{\rho}^*_2}\, \delta_D(d_k),$$

where $\hat{\rho}^*_1 = \hat{\rho}\, \frac{C_{a^*,b^*,D}}{C_{a,b,D}}$ and $\hat{\rho}^*_2 = (1 - \hat{\rho})\, D^{n_k} \exp\left\{ -D \sum_{i : z_i = k} \log \mu_i \right\}$.

The algorithm for the Gibbs sampler concludes our brief introduction of the novel id estimators implemented in intRinsic. Table 1 summarizes the main characteristics of the three approaches we discussed.

Model     Frequentist   Bayesian   NN Ratio Scale    id
TWO-NN    ✓             ✓          Second to First   Homogeneous
Gride     ✓             ✓          General           Homogeneous
Hidalgo   ✗             ✓          Second to First   Heterogeneous

Table 1: Summary of the estimators included in the intRinsic package.
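To make step 3(a) concrete, here is a minimal sketch of the conjugate update of the component-specific ids. The names (sample_d, z_toy, mu_toy) are ours, and the package performs this step inside its compiled C++ Gibbs sampler.

# Conjugate update of step 3(a): one Gamma draw per mixture component,
# given current labels z (length n), ratios mu (length n), and a Gamma(a, b)
# prior. An empty component simply receives a draw from its prior.
sample_d <- function(z, mu, K, a = 1, b = 1) {
  sapply(seq_len(K), function(k) {
    in_k <- (z == k)
    rgamma(1, shape = a + sum(in_k), rate = b + sum(log(mu[in_k])))
  })
}

# Toy example: log(mu) ~ Exp(d) gives Pareto(1, d) ratios; with perfectly
# recovered labels, the draws concentrate around the true ids 2, 5, and 9.
set.seed(2)
mu_toy <- exp(rexp(300, rate = rep(c(2, 5, 9), each = 100)))
z_toy  <- rep(1:3, each = 100)
round(sample_d(z_toy, mu_toy, K = 3), 2)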
3 Package illustration

This section illustrates and discusses the main routines of the intRinsic package. We apply our id estimation techniques to four datasets. The first three datasets are simulated. That way, we can compare the results of the different experiments with the ground truth. Then, our methods are employed to estimate the ids present in the Alon microarray table, contained in the R package HiDimDA [Silva, 2015]. To start with the illustration, we first load our package by running:
R> library(intRinsic)
In the following, we will indicate the number of observations and the observed nominal dimension with $n$ and $D$, respectively. The id value is denoted with $d$. Let us also represent with $N_k(m, \Sigma)$ a multivariate Normal distribution of dimension $k$, mean $m$, and covariance matrix $\Sigma$. Moreover, let $T_{g,k}$ represent a multivariate Student's t-distribution with $g$ degrees of freedom in $k$ dimensions. Finally, we denote with $I_k$ the identity matrix of dimension $k$ and with $\mathbb{1}_A$ the indicator function, equal to 1 when event $A$ is true.
3.1 The simulated datasets

For the first part of the illustration, we consider three different simulated datasets: D1, D2, and D3. The first one is obtained via the classical Swissroll transformation $S : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ defined as $S(x, y) = (x \cos(x), y, x \sin(x))$. Each pair of points $(x, y)$ mapped with the Swissroll transformation is sampled from two independent Uniform distributions; the dataset can be generated with the function Swissroll_maker, with the number of observations N as the input parameter. D2 contains a cloud of points sampled from $T_{8,5}$, embedded in an eight-dimensional space $\mathbb{R}^8$. Lastly, the dataset D3 is the collection of data points generated from three different random variables, $X_1$, $X_2$, and $X_3$, embedded in a five-dimensional Euclidean space. In particular, following the code below, $X_1 = (X, 3X)$ with $X \sim N(-5, 2^2)$, $X_2 \sim N_3(0, I_3)$, and $X_3 \sim N_5(5, I_5)$, where the first two variables are embedded in $\mathbb{R}^5$ by padding with zeros. One can generate an exact replica of the three simulated datasets used in this paper by running the following code:

R> set.seed(123456)
R> Swissroll <- as_tibble(Swissroll_maker(N = 1000))
R> Student_T <- replicate(5, rt(500, 8))
R> Student_T <- as_tibble(cbind(Student_T, 0, 0, 0))
R> v1 <- rnorm(500, -5, sd = 2)
R> v1 <- cbind(x = v1, y = 3 * v1, 0, 0, 0)
R> v2 <- cbind(replicate(3, rnorm(500)), 0, 0)
R> v3 <- replicate(5, rnorm(500, 5))
R> GaussMix <- as_tibble(rbind(v1, v2, v3))
R> GaussMix <- GaussMix %>%
+    mutate(Mix_comp = rep(c("A", "B", "C"), rep(500, 3)))

                 n      D   d
D1  Swissroll    1000   3   2
D2  Student T    500    8   5
D3  GaussMix     1500   5   –

Table 2: Sample size, nominal dimension, and intrinsic dimension of the three simulated datasets.
Table 2 summarizes the sample sizes along with the nominal and intrinsic dimensions that characterize the three datasets. To visualize the data, we can use the PairPlot function from the WVPlots package.

R> WVPlots::PairPlot(d = Swissroll, meas_vars = c("x", "y", "z"),
+                    title = "Swiss Roll dataset",
+                    point_color = "royalblue", alpha = .5)
These scatterplots can be useful as an exploratory tool to spot any clear dependence across the columns of a dataset. The plot provides us with a qualitative idea about the value of the id that should be expected. For example, applying the function PairPlot to the Swissroll dataset, we obtain Figure 3. From the different panels, it is evident that two of the three coordinates are free, and we can recover the last coordinate as a function of the others. Therefore, the id of D1 is equal to 2. From the description of the data simulation, it is also evident that id = 5 for the second dataset, D2. However, it is not as simple to figure out the value of the ground truth id for D3, given the heterogeneity of its data generating mechanism.

[Figure 3: Scatterplot of the three variables in the Swissroll dataset obtained with the function PairPlot. The dependence among the coordinates x and z is evident.]

3.2 Computing the ratios of NN distances

The ratios between distances of NNs constitute the fundamental quantities on which the theoretical development presented in Section 2 is based. We can compute the ratios $\mu_i$ defined in Equation (2) with the function generate_mus. The function takes the following arguments:

• X: a dataset of dimension n × D of which we want to compute the ratios $\mu_i$'s;
• dupl_remove: logical, if TRUE the function automatically detects and removes duplicates in the sample;
• DM: an n × n user-provided distance matrix. If NULL, the function automatically detects the NNs in terms of Euclidean distance.

The output of the function is the vector of ratios $\mu = \{\mu_i\}_{i=1}^n$. For example, we can write:

R> mus_Swiss <- generate_mus(X = Swissroll)
R> mus_Students <- generate_mus(X = Student_T)
R> mus_GaussMix <- generate_mus(X = GaussMix %>% select(-Mix_comp))
The histograms of the three datasets' estimated ratios are reported in Figure 4. All of the histograms present the right-skewed shape typical of the Pareto distribution. If the distance matrix DM is not specified, generate_mus relies on the function get.knn from the package FNN [Beygelzimer et al., 2019], which implements fast NN-search algorithms.

[Figure 4: Histograms of the ratios $\mu = \{\mu_i\}_{i=1}^n$ for datasets D1, D2, and D3. The shape of the histograms suggests that a Pareto distribution could be a good fit.]

Recall that the model is based on the assumption that a Poisson point process is the generating mechanism of the dataset. Ergo, the model cannot handle situations where ties among data points are present. From a more practical point of view, if there exist $i \neq j$ such that $x_i = x_j$, the computation of $\mu_i$ would be infeasible, since $r_{i,1} = 0$. We devised the function generate_mus to automatically detect and remove duplicates in a dataset if the argument dupl_remove is set to TRUE. If dupl_remove is set to FALSE, the function stops and locates the problematic observations. Let us showcase this behavior with a simple example:

R> DummyData <- rbind(c(1, 2, 3), c(1, 4, 3), c(1, 2, 3),
+                     c(1, 4, 3), c(1, 4, 3), c(1, 4, 3), c(1, 1, 1))
R> generate_mus(X = DummyData, dupl_remove = F, DM = NULL)
Duplicates are present in the dataset.
I cannot proceed!
The problematic indexes are:
[1] 3 4 5 6
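For reference, the flagged indexes coincide with what base R's duplicated() reports on the same matrix (the package's internal check may be implemented differently):

# duplicated() marks every row that repeats an earlier one: rows 3-6 here.
which(duplicated(DummyData))
# [1] 3 4 5 6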
This function is the core of the higher-level, intuitive routines we use to estimate the id. In the following subsections, we show how to use the TWO-NN and Gride models to obtain a unique, homogeneous id estimate for each dataset.

3.3 Homogeneous id estimation with TWO-NN

Let us start with the TWO-NN model presented in Section 2.1. In alignment with recent developments in the literature, we propose three methods to carry out inference on the id value: the linear estimator, the MLE, and the Bayesian approach.

The function linfit_twonn implements the regression stated in Equation (3) and needs the following objects as input: the dataset X or the distance matrix DM (see the previous function's argument specification for more details), and

• trimmed: logical, if TRUE a percentage of the most extreme observations is discarded to help improve the estimation;
• alpha_trimmed: the percentage of extreme observations to discard before the linear estimation is performed.

The function returns a list of class lnf_twonn as output. The first element is the vector
Estimates, containing the id point estimate and the relative 95% confidence interval. The second element, Lin_mod, is the output of the linear regression performed within the function linfit_twonn.

We apply this estimator to D1 and D2. We compare the estimates with and without trimming, creating four R objects. Table 3 displays the estimates obtained by running the following code:

R> lin11 <- linfit_twonn(X = Swissroll, trimmed = F, alpha_trimmed = 0)
R> lin12 <- linfit_twonn(X = Swissroll, trimmed = T, alpha_trimmed = 0.01)
R> lin21 <- linfit_twonn(X = Student_T, trimmed = F, alpha_trimmed = 0)
R> lin22 <- linfit_twonn(X = Student_T, trimmed = T, alpha_trimmed = 0.01)

              Swissroll     Swissroll    Student T     Student T
              Not trimmed   1% trimmed   Not trimmed   1% trimmed
R object      lin11         lin12        lin21         lin22
Lower Bound   1.9524        2.2333       4.7476        5.2052
Estimate      1.9669        2.2456       4.7753        5.2263
Upper Bound   1.9814        2.2580       4.8031        5.2474

Table 3: Point estimates and relative confidence intervals for the id values retrieved with the linear estimator. The estimates change when removing the 1% most extreme ratios.

We can also plot the data and the relative regression line using the generic function autoplot. Examples of plots obtained from this experiment are shown in Figure 5. The slope of each regression line corresponds to the linear fit id estimate.

R> autoplot(lin12) + ggtitle("Swissroll - 1% trimmed")
R> autoplot(lin22) + ggtitle("Student_T - 1% trimmed")

[Figure 5: Graphical representation of the linear fit $-\log(1 - \hat{F}(\mu_i)) = d \log(\mu_i)$ for the Swissroll dataset (left panel) and the Student T dataset (right panel).]
A second way to obtain an id estimate, along with its confidence interval, is via MLE. The formulas implemented are presented in Equations (4) and (5). We can obtain the ML estimate with the mle_twonn function, which takes six arguments. Four arguments are identical to those we have already discussed: X, trimmed, alpha_trimmed, and DM. The last two arguments are:

• conf_lev: the confidence level specified to compute the confidence interval for the MLE. The default is 0.95;
• unbiased: logical, if TRUE the point estimate according to the unbiased estimator (where the numerator is n − 1, as in Equation (4)) is computed.

We present an example with two different distance matrices: one computed using the Euclidean distance, the other using the Manhattan distance. To obtain the distance matrices for D1 and D2, we write:

R> dist_Eucl_Sw <- as.matrix(stats::dist(Swissroll))
R> dist_Manh_Sw <- as.matrix(stats::dist(Swissroll, method = "manhattan"))
R> dist_Eucl_St <- as.matrix(stats::dist(Student_T))
R> dist_Manh_St <- as.matrix(stats::dist(Student_T, method = "manhattan"))

We can now run the function mle_twonn to obtain id estimates under different scenarios. Specifically, we first run mle_twonn specifying only the dataset X (ml1, ml4). The function will automatically detect the NNs using Euclidean distances. We also supply the Euclidean distance matrix (ml2), which yields a result equivalent to ml1, where we set DM = NULL. Lastly, we supply the Manhattan distance matrix (ml3, ml5). Other distance matrices can be employed as well. We also show how the bounds of the CIs change by varying the confidence levels. The results are summarized in Table 4. The type of distance can lead to differences in the results. However, in this case, the estimators agree with each other, obtaining values that are close to the ground truth.

R> ml1_Sw <- mle_twonn(X = Swissroll)
R> ml2_Sw <- mle_twonn(X = Swissroll, DM = dist_Eucl_Sw)
R> ml3_Sw <- mle_twonn(X = Swissroll, DM = dist_Manh_Sw)
R> ml4_Sw <- mle_twonn(X = Swissroll, conf_lev = .99)
R> ml5_Sw <- mle_twonn(X = Swissroll, DM = dist_Manh_Sw, conf_lev = .90)
R> ml1_St <- mle_twonn(X = Student_T)
R> ml2_St <- mle_twonn(X = Student_T, DM = dist_Eucl_St)
R> ml3_St <- mle_twonn(X = Student_T, DM = dist_Manh_St)
R> ml4_St <- mle_twonn(X = Student_T, conf_lev = .99)
R> ml5_St <- mle_twonn(X = Student_T, DM = dist_Manh_St, conf_lev = .90)
Dataset     R object   Distance   α      Lower Bound   Estimate   Upper Bound
Swissroll   ml1_Sw     Eucl       0.95
            ml2_Sw     Eucl       0.95
            ml3_Sw     Manh       0.95
            ml4_Sw     Eucl       0.99
            ml5_Sw     Manh       0.90
Student T   ml1_St     Eucl       0.95
            ml2_St     Eucl       0.95
            ml3_St     Manh       0.95
            ml4_St     Eucl       0.99
            ml5_St     Manh       0.90

Table 4: Maximum likelihood estimates of the id obtained with the TWO-NN model. Different distance functions and confidence level specifications are adopted.
The third option for id estimation is to adopt a Bayesian perspective and specify a prior distribution for the parameter $d$. To obtain the Bayesian estimates in this way, we can use the function bayesfit_twonn. Along with X, trimmed, alpha_trimmed, and DM, we also need to specify:

• a_d and b_d: shape and rate parameters for the Gamma prior distribution on d. A vague specification is adopted as a default, with a_d = 0.001 and b_d = 0.001;
• alpha: the probability contained in the credible interval computed on the posterior distribution.

In the following, four examples showcase the usage of this function on the Swissroll dataset. We compute the estimates under the default specification (bay1), with a different credible interval level (bay2) and prior specification (bay3), as well as with a combination of the previous settings (bay4). The estimates are reported in Table 5.
R> bay1 <- bayesfit_twonn(X = Swissroll)
R> bay2 <- bayesfit_twonn(X = Swissroll, alpha = 0.99)
R> bay3 <- bayesfit_twonn(X = Swissroll, a_d = 1, b_d = 1)
R> bay4 <- bayesfit_twonn(X = Swissroll, a_d = 1, b_d = 1, alpha = .99)

            Vague      Vague      Informative   Informative
R object    bay1       bay2       bay3          bay4
CrI level   α = 0.95   α = 0.99   α = 0.95      α = 0.99

Table 5: Bayesian TWO-NN estimates obtained on the Swissroll dataset under vague and informative prior specifications and different levels α of the credible interval.
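Because the posterior in Equation (6) is conjugate, the quantities reported by bayesfit_twonn can be cross-checked by hand from the ratios computed earlier with generate_mus. A minimal sketch under the default vague prior:

# Closed-form posterior from Equation (6):
# d | mu ~ Gamma(a_d + n, b_d + sum(log(mu))).
a_d <- 0.001; b_d <- 0.001
shape <- a_d + length(mus_Swiss)
rate  <- b_d + sum(log(mus_Swiss))

c(Lower = qgamma(0.025, shape, rate),   # bounds of the 95% credible interval
  Mean  = shape / rate,                 # posterior mean
  Upper = qgamma(0.975, shape, rate))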
We can plot the posterior distribution, as reported in Figure 6, using autoplot. When plotting an object returned by bayesfit_twonn, we can also specify the following parameters:

• plot_low and plot_upp: lower and upper extremes of the support on which the posterior is evaluated;
• by: increment of the sequence going from plot_low to plot_upp that defines the support.

We compare the default vague prior specification with a strongly informative one by writing:

R> b1 <- bayesfit_twonn(X = Swissroll)
R> b2 <- bayesfit_twonn(X = Swissroll, a_d = 200, b_d = 100)
R> autoplot(b1, plot_low = 1.5, plot_upp = 2.5, by = .01) +
+    ggtitle("Vague Prior")
R> autoplot(b2, plot_low = 1.5, plot_upp = 2.5, by = .01) +
+    ggtitle("Strongly Informative Prior")

[Figure 6: Swissroll dataset. Graphical representation of the posterior distribution (black line), prior distribution (blue line), and main quantiles and average (vertical dotted red lines) under vague (left panel) and informative (right panel) prior specifications.]

The posterior distribution is depicted in black, the prior in blue, and the dashed vertical red lines represent the estimates.
3.4 Homogeneous id estimation with Gride

Another estimator for the id is Gride, the TWO-NN generalization proposed in Denti [2020]. This estimator is based on the distributional properties of the generic ratios between distances of NNs. Paralleling the structure of core and high-level functions for the TWO-NN model, we start by presenting generate_mudots, which computes the ratios as specified in Equation (7). The function generate_mudots takes as inputs the dataset X, dupl_remove, DM, and:

• n1: integer, the order of the closest NN;
• n2: integer, the order of the furthest NN.

It is important to specify the orders of the two NNs so that n1 < n2. We can use this function in the following way:

R> mud1 <- generate_mudots(X = Swissroll, n1 = 1, n2 = 2)
R> mud2 <- generate_mudots(X = Swissroll, n1 = 2, n2 = 4)
R> mud3 <- generate_mudots(X = Swissroll, n1 = 5, n2 = 8)
R> mud4 <- generate_mudots(X = Swissroll, n1 = 5, n2 = 15)

[Figure 7: Swissroll dataset. Histograms of the generalized ratios $\dot{\mu} = \{\mu_{i,n_1,n_2}\}_{i=1}^n$ for different values of the NN orders n1 and n2.]
From this chunk of code, we obtain as output the vectors $\dot{\mu} = \{\mu_{i,n_1,n_2}\}_{i=1}^n$. We plot the histograms of the computed vectors in Figure 7. The shape of the distributions varies according to the different NN order specifications.

As discussed in Section 2.2, the MLE for $d$ has no closed-form expression given the distribution of $\dot{\mu}$ in Equation (8). Therefore, we rely on numerical methods to carry out the estimation. The package intRinsic implements both a frequentist MLE estimator and a Bayesian estimator. In the former framework, we provide uncertainty quantification via bootstrap samples. For the latter, a Gamma prior for $d$ is adopted. We simulate the posterior distribution via the Metropolis-Hastings algorithm, with a $N(0, \sigma^2)$ proposal distribution on $\log(d)$.

3.4.1 Maximum Likelihood Estimation

The MLE can be computed with the mle_gride function, which has the same input arguments as generate_mudots. Here, we report some examples specifying different NN orders and confidence levels. The code to obtain the results for the
Swissroll dataset is:

R> sw11 <- mle_gride(Swissroll, n1 = 1, n2 = 2)
R> sw21 <- mle_gride(Swissroll, n1 = 2, n2 = 4)
R> sw31 <- mle_gride(Swissroll, n1 = 5, n2 = 8)
R> sw41 <- mle_gride(Swissroll, n1 = 5, n2 = 15)
R> sw12 <- mle_gride(Swissroll, n1 = 1, n2 = 2, conf_lev = .99)
R> sw22 <- mle_gride(Swissroll, n1 = 2, n2 = 4, conf_lev = .99)
R> sw32 <- mle_gride(Swissroll, n1 = 5, n2 = 8, conf_lev = .99)
R> sw42 <- mle_gride(Swissroll, n1 = 5, n2 = 15, conf_lev = .99)

Similarly, for the Student_T dataset we have:

R> st11 <- mle_gride(Student_T, n1 = 1, n2 = 2)
R> st21 <- mle_gride(Student_T, n1 = 2, n2 = 4)
R> st31 <- mle_gride(Student_T, n1 = 5, n2 = 8)
R> st41 <- mle_gride(Student_T, n1 = 5, n2 = 15)
R> st12 <- mle_gride(Student_T, n1 = 1, n2 = 2, conf_lev = .99)
R> st22 <- mle_gride(Student_T, n1 = 2, n2 = 4, conf_lev = .99)
R> st32 <- mle_gride(Student_T, n1 = 5, n2 = 8, conf_lev = .99)
R> st42 <- mle_gride(Student_T, n1 = 5, n2 = 15, conf_lev = .99)

The results are reported in Table 6. The main function mle_gride calls the function bootstrap_gride to perform a parametric bootstrap, providing an approximate confidence interval for the parameter d. The function bootstrap_gride takes as arguments the objects X, n1, and n2, all previously discussed, and nsim, the number of bootstrap samples we want to generate. For example, we can collect a bootstrap sample with:

R> bootsam <- bootstrap_gride(X = Swissroll, n1 = 2, n2 = 4, nsim = 2000)
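The logic of a parametric bootstrap of this kind can be sketched in a few lines, reusing the gride_mle helper from the sketch in Section 2.2 (the names below are ours, not the package's). Under a homogeneous Poisson process, $\omega_d \rho\, r_{i,n}^d$ is Gamma(n, 1) distributed, so a generalized ratio can be simulated as $\dot{\mu} = ((G_1 + G_2)/G_1)^{1/d}$ with $G_1 \sim \text{Gamma}(n_1, 1)$ and $G_2 \sim \text{Gamma}(n_2 - n_1, 1)$.

# Parametric bootstrap: simulate ratios from the model at the fitted d_hat,
# re-estimate d on each simulated sample, and read off empirical quantiles.
bootstrap_gride_sketch <- function(d_hat, n, n1, n2, nsim = 2000) {
  replicate(nsim, {
    g1 <- rgamma(n, shape = n1)
    g2 <- rgamma(n, shape = n2 - n1)
    mu_sim <- ((g1 + g2) / g1)^(1 / d_hat)
    gride_mle(mu_sim, n1, n2)
  })
}

# boot <- bootstrap_gride_sketch(d_hat = gride_mle(mu_dot, 2, 4),
#                                n = length(mu_dot), n1 = 2, n2 = 4)
# quantile(boot, c(0.025, 0.975))   # approximate 95% confidence interval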
Dataset     R object   n1   n2   conf_lev   Lower Bound   Estimate   Upper Bound
Swissroll   sw11       1    2    0.95
            sw21       2    4    0.95
            sw31       5    8    0.95
            sw41       5    15   0.95
            sw12       1    2    0.99
            sw22       2    4    0.99
            sw32       5    8    0.99
            sw42       5    15   0.99
Student T   st11       1    2    0.95
            st21       2    4    0.95
            st31       5    8    0.95
            st41       5    15   0.95
            st12       1    2    0.99
            st22       2    4    0.99
            st32       5    8    0.99
            st42       5    15   0.99

Table 6: Estimates of the id obtained with mle_gride. Different NN orders and confidence levels are used.

The left panel of Figure 8 shows the density plot of the bootstrap sample generated with the previous line of code.

3.4.2 Bayesian Estimation

Under the Bayesian paradigm, we can sample from the posterior distribution $\pi(d \mid \dot{\mu})$ with the function posterior_sampler_gride. This function is the core of the higher-level routine bayesfit_gride. Both take as input X, DM, a_d, b_d, alpha, n1, and n2. Moreover, the following inputs can be specified:

• sigma: the standard deviation of the proposal for the Metropolis-Hastings step. The default value is set to 0.5;
• nsim: the number of posterior samples to collect;
• burn_in: the number of discarded initial iterations, to allow the chain to reach convergence;
• start_d: a user-specified initial value. If NULL, the MLE is adopted instead.

The first function returns the posterior MCMC sample. We now show the results according to two different specifications of the standard deviation σ of the proposal distribution for the Metropolis-Hastings algorithm.

R> sam1 <- posterior_sampler_gride(X = Swissroll, n1 = 2, n2 = 4)
R> sam2 <- posterior_sampler_gride(X = Swissroll, n1 = 2, n2 = 4, sigma = .1)
Figure 9 shows the trace plots of the two chains, obtained by applying the autoplot method to the objects returned by posterior_sampler_gride. By tuning the parameter sigma, different mixing behaviors of the chains can be achieved.

[Figure 9: Trace plots of the MCMC chains for the id parameter obtained with the function posterior_sampler_gride. The top and bottom panels differ in the value of the standard deviation of the proposal distribution.]

Alternatively, we can use the high-level function bayesfit_gride to collect a more complete and interpretable output in terms of inference.

R> baydot <- bayesfit_gride(X = Swissroll, nsim = 10000, burn_in = 50000,
+                           n1 = 2, n2 = 4, sigma = .1)

The function returns a list. The principal distributional values that summarize the posterior can be extracted with baydot$Estimates. We show the resulting posterior density in the right panel of Figure 8, produced with autoplot.

R> autoplot(baydot)
R> baydot$Estimates
Lower Bound     Mean   Median     Mode Upper Bound
   1.819256 1.904314 1.903480 1.905358    1.991911

[Figure 8: Swissroll dataset. The left panel shows the bootstrap density simulated with bootstrap_gride. The right panel shows the posterior density simulated with bayesfit_gride.]
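For intuition, the random-walk Metropolis step on $\log(d)$ can be sketched as follows, reusing the gride_loglik helper and the mu_dot vector from the earlier sketches; the function name gride_mh is ours, and the package's compiled sampler will differ in details and speed.

# Random-walk Metropolis on log(d) for the Gride posterior.
# Prior: d ~ Gamma(a_d, b_d); proposal: log(d') = log(d) + N(0, sigma^2).
gride_mh <- function(mu_dot, n1, n2, a_d = 1, b_d = 1,
                     sigma = 0.5, nsim = 5000, start_d = 1) {
  log_post <- function(d) {
    gride_loglik(d, mu_dot, n1, n2) + dgamma(d, a_d, b_d, log = TRUE)
  }
  d_chain <- numeric(nsim)
  d_cur   <- start_d
  for (t in seq_len(nsim)) {
    d_prop <- exp(log(d_cur) + rnorm(1, 0, sigma))
    # The log acceptance ratio includes the Jacobian of the log transform,
    # i.e., the extra log(d_prop) - log(d_cur) term.
    log_acc <- log_post(d_prop) - log_post(d_cur) + log(d_prop) - log(d_cur)
    if (log(runif(1)) < log_acc) d_cur <- d_prop
    d_chain[t] <- d_cur
  }
  d_chain
}

chain <- gride_mh(mu_dot, n1 = 2, n2 = 4)
mean(chain[-(1:1000)])   # posterior mean after discarding a burn-in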
3.5 Heterogeneous id estimation with Hidalgo

The methods presented so far allow us to determine the id of a dataset accurately and efficiently. Given the data generating process of the simulated data, it is easy to compare the obtained estimates with the ground truth for D1 and D2. However, the same task is not immediate when dealing with the GaussMix table. For D3, it is unclear what id value should represent the ground truth. The issue arises because the GaussMix data points are generated from Gaussian distributions lying on different manifolds of heterogeneous dimensions.

To introduce the problem, we start by considering the estimates obtained for GaussMix with the MLE and the linear fit TWO-NN estimators.

R> mle_twonn(GaussMix %>% select(-Mix_comp))
Lower Bound Estimate Upper Bound
[1,] 1.782876 1.875404 1.972805
R> linfit_twonn(GaussMix %>% select(-Mix_comp))
$Estimates
Lower Bound Estimate Upper Bound
   1.433901 1.454059 1.474218

Here, we see that the estimates obtained with the different methods do not agree. Figure 10 raises concerns about the appropriateness of the model. The data points are colored according to their generating mixture component. The log-ratios present a non-linear pattern, where the slope values vary among the different mixture components A, B, and C.

[Figure 10: Linear estimator applied to the GaussMix dataset. The points are colored according to the mixture component from which they originate. The gray lines represent the linear estimators applied to the subsets of points relative to each mixture component.]

We claim that the methods presented so far are powerful tools for estimating an overall id of a dataset, but they all rely on the assumption that the id of the data is unique. However, the last example suggests that this hypothesis is limiting and easily violated when dealing with complex data. To address this issue, in Section 2.3 we introduced Hidalgo, a Bayesian finite mixture model for heterogeneous id estimation. Hidalgo addresses the presence of multiple manifolds in the same dataset, yielding a vector of different estimated id values d. As already discussed, the estimation of a mixture model with Pareto components is challenging. A naive model-based estimation can lead to inaccurate results since there is no clear separation between the kernel densities. Therefore, we modify the classical Bayesian mixture using the likelihood stated in Equation (10). By introducing the extra term $\prod_{i=1}^n \pi(N^{(q)}_i \mid z)$ into the likelihood, we can induce local homogeneity, which helps identify the model parameters.

Using intRinsic, we can compute the ratios and adjacency matrix needed to evaluate the likelihood from the data points with
Hidalgo_data_preprocessing. This function takes as arguments the dataset X, the distance matrix DM, and

• q: integer, the number of NNs to be considered in the construction of the matrix $N^{(q)}$. The default is 3.

The function returns a list that contains, along with the value of q provided as input, the following two objects:

• mu: the vector of ratios of distances, easily obtained via generate_mus;
• Nq: the binary adjacency matrix.

To provide an idea of the structure of the adjacency matrix $N^{(q)}$, we report three examples obtained from a random sub-sample of the GaussMix dataset for increasing values of q.

R> set.seed(12345)
R> ind <- sort(sample(1:1500, 100, F))
R> DPreP1 <- Hidalgo_data_preprocessing(X = GaussMix[ind, -6], q = 3)
R> DPreP2 <- Hidalgo_data_preprocessing(X = GaussMix[ind, -6], q = 10)
R> DPreP3 <- Hidalgo_data_preprocessing(X = GaussMix[ind, -6], q = 20)

We display the heatmaps of the resulting matrices in Figure 11. As q increases, the binary matrix becomes more populated, uncovering the neighboring structure of the data points. Allegra et al. [2020] investigated how the performance of the model changes as q varies. They suggest fixing q = 3, a value that provides a good trade-off between the flexibility of the mixture allocations and local homogeneity.

[Figure 11: Heatmaps of the adjacency matrices $N^{(q)}$ computed on a subset of observations of the GaussMix dataset. Different values of q are assumed.]

The core, high-level function to fit the Bayesian mixture is Hidalgo. This routine implements the Gibbs sampler described in Section 2.3. The function has the following arguments: X, DM, q, and

• K: integer, the number of mixture components;
• nsim, burn_in, and thinning: the number of MCMC iterations to collect, the number of initial iterations to discard, and the thinning interval, respectively;
• verbose: logical, whether the progress of the sampler should be printed;
• csi: real number between 0 and 1, the local homogeneity parameter. The default is 0.75;
• alpha_Dirichlet: the parameter of the Dirichlet prior on the mixture weights;
• a0_d and b0_d: shape and rate parameters of the Gamma prior on d;
• prior_type: character, the type of Gamma prior on d, which can be
  – Conjugate: a Gamma prior is adopted;
  – Truncated: a Gamma prior truncated on the interval (0, D). This is useful for small datasets, to avoid the estimated id exceeding the nominal dimension D;
  – Truncated_PointMass: same as
Truncated, but a point mass is placed on D. That is, the estimated id is allowed to be exactly equal to the nominal dimension D;
• D: integer, the nominal dimension of the dataset;
• pi_mass: the probability placed a priori on D when the Truncated_PointMass prior specification is chosen;
• recap: logical, if TRUE the function includes a list of the chosen parameters in the output;
• if_coda: logical, if TRUE the function returns the simulated chains as a coda object.

We apply the Hidalgo model to the GaussMix dataset with two different prior configurations: conjugate and truncated with point mass at D = 5. The code is the following:

R> set.seed(1234)
R> hid_fit <- Hidalgo(X = as.matrix(GaussMix[,-6]), K = 10,
+                     alpha_Dirichlet = 0.05, verbose = F,
+                     nsim = 2000, burn_in = 2000)
R> set.seed(1234)
R> hid_fit_TR <- Hidalgo(X = as.matrix(GaussMix[,-6]), K = 10,
+                        prior_type = "Truncated_PointMass",
+                        D = 5, alpha_Dirichlet = 0.05, verbose = F,
+                        nsim = 2000, burn_in = 2000)
Notice that, by using alpha_Dirichlet = 0.05, we have adopted a sparse mixture modeling approach in the spirit of Malsiner-Walli et al. [2016, 2017].

The output object hid_fit is a list of class Hidalgo, containing three elements:

• membership_labels: a matrix of dimension nsim × n. Each column contains the MCMC sample of the membership labels for one observation;
• cluster_prob: a matrix of dimension nsim × K. Each column contains the MCMC sample of the mixing weight of one mixture component;
• intrinsic_dimension: a matrix of dimension nsim × K. Each column contains the MCMC sample of the id estimated in one cluster.

Selecting recap = TRUE adds a list with the details of the run to the output. This allows us to collect a summary of all the input values previously specified. For a simple visualization of the raw MCMC samples, we can use the function autoplot.

R> r1 <- autoplot(output = hid_fit) + ggtitle("Conjugate Prior")
R> r2 <- autoplot(output = hid_fit_TR) + ggtitle("Truncated_PointMass Prior")
Plotting the trace plots of the elements of d allows us to assess the convergence of the algorithm. We need to be aware that the chains may suffer from label-switching issues, which prevent us from drawing inference from the MCMC output directly. Due to the label-switching, mixture components can be discarded and emptied or repopulated across iterations. Figure 12 shows the MCMC trace plots of the two models, with the ergodic means for each mixture component superimposed. Additionally, we can see that if no constraint is imposed on the support of the prior distribution for d (left panel), the posterior estimates can exceed the nominal dimension D of the dataset. This problem disappears when imposing a truncation on the prior support (right panel).

[Figure 12: MCMC trace plots and superimposed ergodic means of the components of the vector d. Left panel: conjugate prior specification. Right panel: truncated with point mass prior specification.]

To address the label-switching issue and perform meaningful inference, we need to post-process the results. A simple solution is implemented in the function Hidalgo_postpr_chains, which takes as input an object of class Hidalgo, along with:

• all_chains: logical, if TRUE all the chains, after the disentanglement of the label switching, are reported. If FALSE, a dataset with summary statistics is returned instead.

The post-processing procedure works as follows. Let us consider an MCMC sample of length T, and denote a particular MCMC iteration with t, for t = 1, . . . , T. Let $z_i(t)$ indicate the cluster membership of observation $i$ at the $t$-th iteration. Similarly, $d_k(t)$ represents the value of the estimated id in the $k$-th mixture component at the $t$-th iteration. The function Hidalgo_postpr_chains maps the K chains for the parameters in d to each data point via the values of z. That is, it constructs n chains, one for each observation, by computing $d^*_i(t) = d_{z_i(t)}(t)$. We then obtain a collection of chains that link every observation to its id estimate. Once the chains have been post-processed, the local, observation-specific id can be estimated by the ergodic mean or median.

R> post1 <- Hidalgo_postpr_chains(output = hid_fit, all_chains = F)
R> post1_TR <- Hidalgo_postpr_chains(output = hid_fit_TR, all_chains = F)

If all_chains = TRUE, the function returns a tibble of class hid_all_mcmc, containing the entire post-processed chains. Otherwise, it returns a tibble of class hid_sum_mcmc that encompasses summary statistics (mean and principal quantiles) for each observation. For both classes, a suitable autoplot method is defined. In this case, we have

R> autoplot(post1) + ggtitle("Conjugate Prior")
R> autoplot(post1_TR) + ggtitle("Truncated_PointMass Prior")
R> kable(head(post1$ID_summary))

|   Q.05|   Q.25|   MEAN| MEDIAN|   Q.75|   Q.95| OBS|
|------:|------:|------:|------:|------:|------:|---:|
| 1.2083| 1.8481| 2.5386| 2.2175| 3.1230| 5.0666|   1|
| 0.6989| 0.7792| 0.9477| 0.8518| 0.9562| 1.4386|   2|
| 0.7652| 1.0224| 2.0319| 1.8650| 2.9425| 3.9050|   3|
| 0.6901| 0.7650| 1.0146| 0.8439| 0.9798| 2.2052|   4|
| 0.6832| 0.7589| 0.9847| 0.8265| 0.9231| 1.9468|   5|
| 0.7404| 0.8465| 1.4106| 1.1572| 1.7989| 2.9472|   6|
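The mapping $d^*_i(t) = d_{z_i(t)}(t)$ is also easy to reproduce by hand. A minimal sketch, assuming two MCMC matrices shaped like the ones stored in a Hidalgo object (labels: nsim × n; ids: nsim × K); the function name is ours.

# Build one chain per observation by reading, at every iteration t, the id
# of the mixture component the observation is currently assigned to.
postprocess_chains <- function(labels, ids) {
  nsim   <- nrow(labels)
  n      <- ncol(labels)
  d_star <- matrix(NA_real_, nrow = nsim, ncol = n)
  for (t in seq_len(nsim)) {
    d_star[t, ] <- ids[t, labels[t, ]]   # d*_i(t) = d_{z_i(t)}(t)
  }
  d_star
}

# Observation-specific point estimates and 90% credible intervals:
# d_star <- postprocess_chains(hid_fit$membership_labels,
#                              hid_fit$intrinsic_dimension)
# apply(d_star, 2, quantile, probs = c(0.05, 0.5, 0.95))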
[Figure 13: Posterior mean and median of the observation-specific id, represented with black crosses and red triangles, respectively. The gray bars represent a 90% credible interval. The two panels correspond to the two different prior specifications.]

The created plots are shown in Figure 13. They display the mean (cross) and median (triangle) id estimates for each data point. The separation of the data into the three generating manifolds is evident. Also, we notice that some of the estimates in the conjugate case are incorrectly above the nominal value D = 5, justifying the need for a truncated prior.

Once the observation-specific id chains are computed, we can investigate the presence of potential patterns between the ids and external variables, if available. To explore these possible relations, we can use the function Hidalgo_ID_class. Along with an object of class Hidalgo, we need to specify:

• class: factor, the variable used to stratify the id posterior estimates;
• avg: logical, if TRUE the average is used; if FALSE, the median is adopted.

To visualize the results, we can pass the output of this function to autoplot. This time, we also need to specify the variable class and

• type: the type of plot we require. The default is histogram. Other possibilities are density, violin, and boxplot.

As an example, Figure 14 shows the four possible plot types, with the id estimates of the GaussMix dataset stratified by the generating manifold of the observations (passed as class).

R> mix_comp <- pull(GaussMix[, 6])
R> d1 <- Hidalgo_ID_class(output = hid_fit, class = mix_comp, avg = TRUE)
R> autoplot(d1, class = mix_comp, type = "density")
R> autoplot(d1, class = mix_comp, type = "histogram")
R> autoplot(d1, class = mix_comp, type = "violin")
R> autoplot(d1, class = mix_comp, type = "boxplot")

[Figure 14: Hidalgo_ID_class produces four types of plots. The id estimates of the GaussMix dataset are stratified according to the generating distribution, as specified in the class argument.]
In Bayesian mixture models, the posterior co-clustering matrix (PCM) represents an important source of information. The entries of this matrix are computed as the proportion of times in which two observations have been clustered together across the MCMC iterations. The PCM provides a description of the underlying clustering structure of the data detected by Hidalgo. We can compute the PCM with Hidalgo_coclustering_matrix.

Given the PCM, we can compute a variety of widely used loss functions on the space of the partitions. Examples are the Binder loss [Lau and Green, 2007] or the Variation of Information [Wade and Ghahramani, 2015]. By minimizing such a loss function, we can retrieve the optimal partition of the dataset into clusters. The function relies on the function minVI from the R package mcclust.ext. Thus, Hidalgo_coclustering_matrix takes as inputs an object of class Hidalgo and

• estimate_clustering: logical, if TRUE the best partition of the data, obtained by minimizing the Variation of Information, is estimated;
• greed: logical, if TRUE the best partition is recovered adopting a greedy algorithm.

In our case, for example, we can write:

R> cl <- factor(pull(GaussMix[,6]))
R> coc <- Hidalgo_coclustering_matrix(output = hid_fit_TR,
+                                     class = cl,
+                                     VI = F, greed = F)

We can plot the result using autoplot, specifying as inputs the object returned by Hidalgo_coclustering_matrix, possibly class (to stratify the rows according to an external variable), and

• id_names: a vector of labels that uniquely identify the observations. The default is NULL.

R> autoplot(object = coc, class = mix_comp)

As we can see from Figure 15, the model correctly detects the three clusters in the data.

[Figure 15: Posterior co-clustering matrix estimated by the Hidalgo model. The rows are arranged according to the factor variable class, while the columns are sorted according to the best partition found with minVI from the package mcclust.ext.]
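The PCM itself is simple to compute from the raw membership chains. A minimal sketch, assuming a matrix of labels shaped like hid_fit$membership_labels (nsim × n); the function name is ours, not the package's.

# Posterior co-clustering matrix: entry (i, j) is the fraction of MCMC
# iterations in which observations i and j share a mixture component.
coclustering_matrix <- function(labels) {
  nsim <- nrow(labels)
  n    <- ncol(labels)
  pcm  <- matrix(0, n, n)
  for (t in seq_len(nsim)) {
    same <- outer(labels[t, ], labels[t, ], "==")  # n x n logical matrix
    pcm  <- pcm + same
  }
  pcm / nsim
}

# pcm <- coclustering_matrix(hid_fit$membership_labels)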
We now apply the functions discussed so far to investigate the id of the Alon dataset.A copy of this famous microarray measurements table can be accessed via the R package HiDimDA . The dataset, first presented in Alon et al. [1999], contains microarray data for2000 genes measured on 62 patients. Among them, 40 were diagnosed with colon cancerand 22 are healthy subjects. A factor variable named status , describes the patient healthstatus. We store the gene measurements in the object
Xalon , a matrix of nominal dimension
D=2000 with n=62 observations. To load the the dataset, write:
R> library(HiDimDA)R> data("AlonDS")R> status <- factor(AlonDS$grouping, labels = c("Cancer", "Healthy"))R> Xalon <- as.matrix(AlonDS[, -1])
Hidalgo model. The rowsare arranged according to the factor variable class , while the columns are sorted accordingto the best partition found with minVI from the package mcclust.ext .We obtain a visual summary of the dataset by plotting the heatmap of the log-datavalues annotated by status . The result is shown in Figure 16.
R> Ann <- data.frame("Health Status" = status)
R> rownames(Xalon) <- rownames(Ann) <- paste("Patient", 1:62)
R> ann_colors <- list(Health.Status = c(Healthy = "blue", Cancer = "lightblue"))
R> pheatmap::pheatmap(log(Xalon), show_colnames = F,
+                     cluster_cols = T, cluster_rows = T,
+                     annotation_row = Ann, angle_col = "0",
+                     annotation_colors = ann_colors)
Figure 16: Heatmap of the log values of the Alon microarray dataset. The patients on the rows are stratified according to their health status (Cancer vs Healthy).

Homogeneous id. Let us start by describing the overall complexity of the dataset, computing the estimates under the assumption of a homogeneous id. Using the TWO-NN model, we can compute:

R> r0 <- mle_twonn(Xalon)
R> r0
Lower Bound    Estimate Upper Bound
   7.508059    9.634819    12.37633
R> r1 <- linfit_twonn(Xalon)
R> r1$Estimates
Lower Bound    Estimate Upper Bound
   9.560248    9.875964   10.191680
R> r2 <- bayesfit_twonn(Xalon)
R> r2$Estimates
Lower Bound      Mean    Median      Mode Upper Bound
   7.507011  9.791379  9.738788  9.633456   12.374552
Figure 17: TWO-NN estimates for the Alon dataset. The left panel shows the result of the linear estimator, while the right panel depicts the posterior distribution obtained via the Bayesian approach.

The different estimates based on the TWO-NN model agree. The results are also illustrated in Figure 17, which shows the linear fit (left panel) and the posterior distribution (right panel). According to these results, we conclude that the information contained in the D=2000 genes can be summarized with approximately ten variables.
Heterogeneous id. Alternatively, we can investigate the presence of heterogeneous latent manifolds in the Alon dataset by employing Hidalgo. Since D is large, we do not need to truncate the prior on d. Moreover, given the small number of data points, we opt for an informative prior Gamma(1, 1) on the entries of the id parameter vector d.

R> set.seed(12345)
R> hid_fit <- Hidalgo(X = Xalon, L = 30, a0_d = 1, b0_d = 1,
+                     alpha_Dirichlet = .05, verbose = F,
+                     nsim = 10000, burn_in = 100000, thin = 5)
R> autoplot(hid_fit)
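For reference, the assumed prior has the standard Gamma density; with the arguments a0_d = b0_d = 1 used above, it reduces to a unit-rate exponential, which places most of its mass on small id values:
\[
\pi(d_k) = \frac{b_0^{a_0}}{\Gamma(a_0)}\, d_k^{a_0-1} e^{-b_0 d_k}, \qquad d_k > 0, \quad k = 1, \dots, L,
\]
so that, for \(a_0 = b_0 = 1\), \(\pi(d_k) = e^{-d_k}\).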
We see from the ergodic means reported in Figure 18 that we may have incurred thelabel switching problem: the fluctuations of the highlighted lines suggest that componentsget selected and discarded during the run. Therefore, we need to post-process our resultsbefore drawing any inference.
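A standard way out, sketched here under hypothetical names (d_chain and z_chain are illustrative objects, not fields of the fitted model), is to shift the focus from component-specific chains to observation-specific chains: at every iteration, each observation is matched with the id of the component it is currently assigned to, a quantity that is invariant to relabeling of the components.

R> # d_chain: hypothetical (n_iter x L) matrix of sampled component-wise ids
R> # z_chain: hypothetical (n_iter x n) matrix of membership labels in 1:L
R> obs_id_chain <- function(d_chain, z_chain) {
+    iters <- seq_len(nrow(d_chain))
+    sapply(seq_len(ncol(z_chain)),        # one column per observation
+           function(i) d_chain[cbind(iters, z_chain[, i])])
+  }

In practice, the post-processing step is handled for us by Hidalgo_postpr_chains, called next.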
R> post1 <- Hidalgo_postpr_chains(hid_fit, all_chains = F)

Figure 18: Raw MCMC output of the Hidalgo model applied to the Alon dataset. The ergodic means of the L=30 mixture components are superimposed. The label switching phenomenon affecting the ergodic means is evident.

As a first step, one should check the PCM. We opt to display the results stratified by status and by the optimal partition estimated from the MCMC chains of the cluster membership labels hid_fit$membership_labels.

R> gr <- Hidalgo_coclustering_matrix(hid_fit, estimate_clustering = T)
R> autoplot(gr, class = status)
From the heatmap reported in Figure 19 and the optimal partition found, we conclude that the data can be grouped into two main clusters. The next natural step is to investigate how strongly the estimated partition, and more generally the estimated id s, are associated with health status. As a starting point, we compute the confusion matrix between the optimal partition and status:

R> table(gr$optimalCL, status)
   status
    Cancer Healthy
  1     28       5
Figure 19: Posterior co-clustering matrix for the Alon dataset. The rows are arranged according to health status, while the columns present the optimal partition.
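As a side note (not part of the original analysis), the agreement between the optimal partition and the health status can also be condensed into a single number, for instance the adjusted Rand index; the call below assumes the CRAN package mcclust is available.

R> # values near 1 indicate strong agreement, values near 0 chance-level agreement
R> mcclust::arandi(as.integer(gr$optimalCL), as.integer(status))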
The estimated partition suggests that there might be an association between health status and different values of the id. To explore this relation further, we can employ the function Hidalgo_ID_class.

R> d1 <- Hidalgo_ID_class(hid_fit, class = status, avg = T)
R> autoplot(d1, type = "boxplot", class = status)
We report the boxplots of the posterior id means for each subject stratified by status in Figure 20. The two groups are characterized by different values of the id. This result is remarkable, since it is based only on the topological information of the genes that is enclosed in the distances between NNs.

Figure 20: Boxplots of the posterior id values stratified by health status. The healthy group is characterized by values of id that are smaller overall.

The estimated individual id values are potentially useful to classify the health status of new patients according to their genomic profiles. Alternatively, they can be used to train prediction models. As a simple example, we now perform a classification analysis using two random forest models, predicting the target variable Y = status. To train the models, we use two different sets of covariates: X_O, the original dataset composed of 2000 genes, and X_I, the observation-specific id summary returned by Hidalgo_postpr_chains.

R> X_O <- data.frame(Y = status, X = Xalon)
R> X_I <- data.frame(Y = status, X = post1$ID_summary[, 1:6])
R> set.seed(12346)
R> rfm1 <- randomForest::randomForest(Y ~ ., data = X_O, type = "classification")
R> set.seed(12346)
R> rfm2 <- randomForest::randomForest(Y ~ ., data = X_I, type = "classification")
We now inspect the results:
R> print(rfm1)

Call:
 randomForest(formula = Y ~ ., data = X_O, type = "classification")
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 44

        OOB estimate of  error rate: 19.35%
Confusion matrix:
        Cancer Healthy class.error
Cancer      35       5   0.1250000
Healthy      7      15   0.3181818

R> print(rfm2)

Call:
 randomForest(formula = Y ~ ., data = X_I, type = "classification")
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 12.9%
Confusion matrix:
        Cancer Healthy class.error
Cancer      36       4   0.1000000
Healthy      4      18   0.1818182

Remarkably, a simple dataset with six variables summarizing the main distributional traits of the observation-specific posterior id s obtains better performance in predicting the health status than the original dataset. Therefore, we conclude that, in this case, the topological properties of the dataset are associated with the outcome of interest.
The heterogeneous id can not only provide a reliable index of complexity for involved data structures, but can also help unveil relationships among data points hidden at the topological level. The application to the Alon dataset showcases how reliable id estimates represent a fundamental perspective that helps us discover hidden data patterns. Furthermore, the extracted information can be subsequently exploited in many downstream analyses, such as patient segmentation or predictive analyses.

In this paper we introduced and discussed intRinsic, an R package that implements novel routines for id estimation according to the models recently developed in Facco et al. [2017], Allegra et al. [2020], Denti [2020], and Santos-Fernandez et al. [2020]. intRinsic consists of a collection of high-level, user-friendly functions that in turn rely on efficient, low-level routines implemented in R and C++. We also remark that intRinsic integrates functionalities from external packages. For example, all the graphical outputs returned by the functions are built using the well-known package ggplot2. Therefore, they are easily customizable using the grammar of graphics [Wilkinson, 2005]. Moreover, the MCMC output of the functions implementing the Bayesian models can be exported as coda objects.
To summarize, the package includes both frequentist and Bayesian model specifications for the global id estimators TWO-NN and Gride. Moreover, it contains a Gibbs sampler for the posterior simulation of the Hidalgo model, which can capture the presence of heterogeneous manifolds within a single dataset. We showed how multiple latent manifolds can help unveil topological traits of a dataset by associating the id estimates with external variables. As a general analysis pipeline for practitioners, we suggest starting with the efficient TWO-NN functions to understand how appropriate the hypothesis of homogeneity is for the data at hand. If departures from the assumptions are visible, one should rely on Gride and Hidalgo. The latter also allows exploring the relationship between local id estimates and external variables.
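To make the suggested pipeline concrete, the following sketch chains together calls shown earlier in the paper. Here, X stands for a generic numeric data matrix; we also assume that the arguments omitted from the Hidalgo call fall back to sensible defaults and that autoplot has a method for the TWO-NN linear fit, as Figure 17 suggests.

R> # Step 1: fast homogeneous estimate; inspect the fit for departures
R> fit_twonn <- linfit_twonn(X)
R> autoplot(fit_twonn)
R> # Step 2: if heterogeneity is suspected, fit the Hidalgo mixture model
R> fit_hid <- Hidalgo(X = X, L = 30, alpha_Dirichlet = .05)
R> autoplot(fit_hid)
R> # Step 3: relate observation-specific ids to an external variable, if any
R> # (some_factor is a placeholder for a stratifying variable)
R> # Hidalgo_ID_class(fit_hid, class = some_factor, avg = TRUE)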
The most promising future research directions stem from Hidalgo. First, we plan to develop more reliable methods to obtain an optimal partition of the data based on the id estimates, since the one proposed heavily relies on a mixture model of overlapping distributions. Moreover, another research avenue worth exploring is a version of Hidalgo that relies on the Gride distribution, exploiting the information coming from multiple distance ratios that could enhance the id estimates. We plan to keep working on this package and to continuously update it in the long run. The novel id estimators we discussed have started a lively research branch, and we intend to include all the future advancements in intRinsic.

The author is extremely thankful to Michelle N. Ngo, Derenik Haghverdian, Wendy Rummerfield, Andrea Cappozzo, and Riccardo Corradin for their helpful comments and precious insights.

References
Michele Allegra, Elena Facco, Francesco Denti, Alessandro Laio, and Antonietta Mira. Data segmentation based on the local intrinsic dimension. Scientific Reports, 10(1):1–27, 2020. doi: 10.1038/s41598-020-72222-0. URL http://arxiv.org/abs/1902.10459.

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999. doi: 10.1073/pnas.96.12.6745.

Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019. URL http://arxiv.org/abs/1905.12784.

Christoph Bartenhagen. RDRToolbox: A package for nonlinear dimension reduction with Isomap and LLE, 2020. R package version 1.38.0.

Robert S. Bennett. The Intrinsic Dimensionality of Signal Collections. IEEE Transactions on Information Theory, 15(5):517–525, 1969. doi: 10.1109/TIT.1969.1054365.

Alina Beygelzimer, Sham Kakadet, John Langford, Sunil Arya, David Mount, and Shengqiao Li. FNN: Fast Nearest Neighbor Search Algorithms and Applications, 2019. URL https://CRAN.R-project.org/package=FNN. R package version 1.1.3.

P. Campadelli, E. Casiraghi, C. Ceruti, and A. Rozza. Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework. Mathematical Problems in Engineering, 2015, 2015. doi: 10.1155/2015/759567.

Robrecht Cannoodt and Wouter Saelens. dyndimred: Dimensionality Reduction Methods in a Common Format, 2020. URL https://CRAN.R-project.org/package=dyndimred. R package version 1.0.3.

Jose A. Costa and Alfred O. Hero. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Transactions on Signal Processing, 52(8):2210–2221, 2004. doi: 10.1109/TSP.2004.831130.

Francesco Denti. Bayesian Mixtures for Large Scale Inference. PhD thesis, 2020. URL https://boa.unimib.it/retrieve/handle/10281/262923/382631/phd_unimib_746944.pdf.

Antonio Fabio Di Narzo. tseriesChaos: Analysis of Nonlinear Time Series, 2019. URL https://CRAN.R-project.org/package=tseriesChaos. R package version 0.1-13.1.

Dirk Eddelbuettel and Romain François. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1–18, 2011. doi: 10.18637/jss.v040.i08.

Dirk Eddelbuettel and Conrad Sanderson. RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71:1054–1063, March 2014. URL http://dx.doi.org/10.1016/j.csda.2013.02.005.

Elena Facco and Alessandro Laio. The intrinsic dimension of biological data landscapes. PhD thesis, 2017. URL https://core.ac.uk/download/pdf/144263715.pdf.

Elena Facco, Maria D'Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):1–8, 2017. doi: 10.1038/s41598-017-11873-y.

K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 2nd edition, 2003.

Constantino A. Garcia. nonlinearTseries: Nonlinear Time Series Analysis, 2020. URL https://CRAN.R-project.org/package=nonlinearTseries. R package version 0.2.10.

Jean Golay and Mikhail Kanevski. Unsupervised feature selection based on the Morisita estimator of intrinsic dimension. Knowledge-Based Systems, 135:125–134, 2017. doi: 10.1016/j.knosys.2017.08.009.

Hideitsu Hino. ider: Intrinsic dimension estimation with R. R Journal, 9(2):329–341, 2017. doi: 10.32614/rj-2017-054.

Hideitsu Hino, Jun Fujiki, Shotaro Akaho, and Noboru Murata. Local intrinsic dimension estimation by generalized linear modeling. Neural Computation, 29(7):1838–1878, 2017. doi: 10.1162/NECO_a_00969.

H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(7):498–520, 1933. doi: 10.1037/h0070888.

Kerstin Johnsson and Lund University. intrinsicDimension: Intrinsic Dimension Estimation, 2019. URL https://CRAN.R-project.org/package=intrinsicDimension. R package version 1.2.0.

Ian T. Jollife and Jorge Cadima. Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 2016. doi: 10.1098/rsta.2015.0202.

Olga Kayo. Locally Linear Embedding Algorithm: Extensions and Applications. 2006.

J. F. C. Kingman. Poisson Processes, volume 3. 1992. ISBN 0191591246.

Raivo Kolde. pheatmap: Pretty Heatmaps, 2019. URL https://CRAN.R-project.org/package=pheatmap. R package version 1.0.12.

Guido Kraemer, Markus Reichstein, and Miguel D. Mahecha. dimRed and coRanking - unifying dimensionality reduction in R. The R Journal, 10(1):342–358, 2018. URL https://journal.r-project.org/archive/2018/RJ-2018-039/index.html. coRanking version 0.2.3.

Jesse H. Krijthe. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation, 2015. URL https://github.com/jkrijthe/Rtsne. R package version 0.15.

John W. Lau and Peter J. Green. Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16(3):526–558, 2007. doi: 10.1198/106186007X238855.

van der Maaten Laurens and Hinton Geoffrey. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2009.

Elizaveta Levina and Peter J. Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 777–784. MIT Press, 2005. URL http://papers.nips.cc/paper/2577-maximum-likelihood-estimation-of-intrinsic-dimension.pdf.

Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün. Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing, 26(1-2):303–324, 2016. doi: 10.1007/s11222-014-9500-2.

Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün. Identifying Mixtures of Mixtures Using Bayesian Estimation. Journal of Computational and Graphical Statistics, 26(2):285–295, 2017. doi: 10.1080/10618600.2016.1200472.

Karl W. Pettis, Thomas A. Bailey, Anil K. Jain, and Richard C. Dubes. An Intrinsic Dimensionality Estimator from Near-Neighbor Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(1):25–37, 1979. doi: 10.1109/TPAMI.1979.4766873.

Martyn Plummer, Nicky Best, Kate Cowles, and Karen Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, 2006. URL https://journal.r-project.org/archive/.

T. S. Roweis and K. Saul Lawrence. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290:2323–2326, 2000.

Alessandro Rozza, Gabriele Lombardi, Marco Rosa, Elena Casiraghi, and Paola Campadelli. IDEA: Intrinsic dimension estimation algorithm. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6978 LNCS(PART 1):433–442, 2011. doi: 10.1007/978-3-642-24085-0_45.

Edgar Santos-Fernandez, Francesco Denti, Kerrie Mengersen, and Antonietta Mira. The role of intrinsic dimension in high-resolution player tracking data - Insights in Basketball. arXiv, 2020.

Hana Sevcikova, Don Percival, and Tilmann Gneiting. fractaldim: Estimation of fractal dimensions, 2014. URL https://CRAN.R-project.org/package=fractaldim. R package version 0.8-4.

Antonio Pedro Duarte Silva. HiDimDA: High Dimensional Discriminant Analysis, 2015. URL https://CRAN.R-project.org/package=HiDimDA. R package version 0.2-4.

J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. doi: 10.1126/science.290.5500.2319.

Sara Wade and Zoubin Ghahramani. Bayesian cluster analysis: Point estimation and credible balls. arXiv:1505.03339, 2015. doi: 10.1214/17-BA1073. URL http://arxiv.org/abs/1505.03339.

Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN 978-3-319-24277-4. URL https://ggplot2.tidyverse.org.

Leland Wilkinson. The Grammar of Graphics, volume 95. Springer-Verlag, 2005.

Kisung You. Rdimtools: An R package for Dimension Reduction and Intrinsic Dimension Estimation. arXiv, 2020a.

Kisung You. Rdimtools: Dimension Reduction and Estimation Methods, 2020b. URL https://CRAN.R-project.org/package=Rdimtools.