Mode Hunting Using Pettiest Components Analysis

Tianhao Liu a, Daniel Andrés Díaz-Pachón a,∗, J. Sunil Rao a, Jean-Eudes Dazard b

a Division of Biostatistics, University of Miami, Don Soffer Clinical Research Center, 1120 NW 14th St, Miami, FL 33136
b Center for Proteomics and Bioinformatics, Case Western Reserve University, 10900 Euclid Ave., Cleveland, OH 44106

∗ Corresponding author
Abstract
Principal component analysis has long been used to reduce the dimensionality of datasets. In this paper, we demonstrate that for mode detection the components of smallest variance, the pettiest components, are the more important ones. We prove that when the data follows a multivariate normal distribution, implementing this "pettiest component analysis" yields boxes of optimal size, in the sense that their size is minimal over all possible boxes with the same number of dimensions and given probability. We illustrate our result with a simulation showing that pettiest component analysis works better than its competitors.
Keywords: Bump hunting, Dimension reduction, Mode detection, Patient rule induction method, Principal components analysis, Space rotation.
1. Introduction
Principal components analysis is a widely used method in data analysis which reduces the dimension by projecting onto the orthogonal rotation that maximizes the variance (Mardia et al., 1979, Ch. 8). Because of its simple mechanism and good compatibility with different regression methods, principal components analysis is used in many scenarios, especially when dealing with p ≫ n datasets. Even in regression and learning settings, most practitioners prefer to discard the components of the input with the smallest variance, using just the first few principal components. This is called principal components regression. But we should be careful about some potential pitfalls when applying principal components analysis to regression models. As Jolliffe demonstrated in a classic paper, one can easily find ordinary examples in which small-variance components become equally important, if not more so, than large-variance ones (Jolliffe, 1982). Hadi and Ling have also presented three cautionary notes on the use of principal components regression, mainly due to issues arising from multicollinearity (Hadi and Ling, 1998). In the end, maybe the strongest reason to question principal components regression is that the dependent variables are never used in principal components analysis. As Cox said, it is hard to see any reason "why the dependent variable should not be closely tied to the least important principal components" (Cox, 1968, p. 272). Here we will call those least important principal components pettiest components. Nevertheless, using principal components regression is not totally invalid. Actually, discarding the pettiest components does give satisfying results in many situations. In fact, Artemiou and Li gave conditions under which the first few leading principal components have a higher probability of correlating with the dependent variable than the small-variance ones (Artemiou and Li, 2009).

All these arguments are informative, but none, to the best of our knowledge, has tried to make systematic use of the pettiest components in their own right. That is, most criticisms have focused on the negative aspect of finding isolated counterexamples showing that principal components regression is not desirable, based on the underlying assumption that the leading principal components are better; none has focused on the positive aspect of systematically selecting the pettiest components. In this article we take advantage of these pettiest components. In fact, a modification of a mode hunting algorithm will show that the smallest regions of a given probability are obtained using the pettiest components. Intuitively, the result presented here follows this very simple idea: in a space S ⊂ ℝ^p, define a β-mode of a continuous distribution as the region of the space with the smallest volume constrained to have probability β. Assume the underlying distribution of a p-dimensional vector x is multivariate normal, and that we project it onto a p′-dimensional space, with p′ < p. Then the box that minimizes the size, among all possible boxes of probability β, is the one projected over the subspace of the orthogonal variables with the smallest variance.
This work can be seen as a continuation of Díaz-Pachón et al. (2017), where the optimality, in terms of smallest volume, of the p-dimensional box centered around the mean in the direction of the eigenvectors was proved for a multivariate normal distribution. The next obvious step, the reduction of dimensionality to p′ dimensions, is accomplished here; but contrary to common practice, we show that the optimal p′-box is obtained when the p-dimensional box of the previous step is projected onto the subspace of the p′ pettiest components.

Several questions and comments arise from this situation. First, finding a general solution for the set of minimum volume in the class C of all Borel sets with probability β in ℝ^p seems in general intractable; therefore we consider the next best thing: the more manageable class C′ ⊂ C of all hyper-rectangles I_1 × · · · × I_p, where the I_i are intervals in ℝ. Second, we define a β-mode and not the β-mode because there can be more than one Borel set with identical hyper-volume and probability β in the space; in fact, in many situations not only global but also local β-modes are of interest. Third, we talk about β-modes in general, instead of simply modes, because we are interested in regions of positive probability. And fourth, by the previous considerations, there are no spaces without β-modes; even if S is of finite volume and the underlying distribution is uniform, we have uncountably many β-modes, though these will not be very informative; on the other hand, if S has infinite Lebesgue measure, a continuous distribution on S will still have at least one β-mode.

Why the interest in the β-regions with the smallest volume? One answer among many lies in the fact that such regions might contain a large amount of relative Shannon information. To see this, imagine S = [0, 1]
and suppose a non-uniform distribution φ on this space whose β-mode I is unique in the sense that, if there is another set I′ with probability β, then the volume of the symmetric difference I △ I′ is zero. For definiteness, let I = [a, b]. Then, calling U the uniform distribution over S, we have that I⁺(I), defined as log(φ(I)/U(I)) = log(β/(b − a)), is maximal over all other intervals in S. That is, for any other interval I* with φ(I*) = β, we have I⁺(I) − I⁺(I*) > 0. I⁺ is called active information; it is the information added by φ with respect to a baseline distribution, in this case U (Díaz-Pachón et al., 2020). Active information was introduced in the context of search problems (Dembski and Marks II, 2009a,b; Díaz-Pachón and Marks II, 2020), and has also been used to detect modes in multivariate analyses (Díaz-Pachón et al., 2019).

Besides the theoretical curiosity of finding such a region, there is a vast number of applications. Since there are uncountably many regions of probability β in a p-dimensional support of a continuous distribution, even when p = 1, finding the smallest region of probability β tells us that the data is highly concentrated inside that region. This is of course of interest in many fields: for instance, in tumor detection by image recognition in medicine, where a high concentration of points in a small region can help in early detection; or in more accurate predictions based on closeness to a given target, as in the Netflix recommendation algorithm. Many more examples come easily to mind.

Yet mode detection continues to be a difficult problem, especially in multivariate settings. One of the most famous ways to approach it, especially in computer vision, is via the mean-shift algorithm by kernel density estimation (Fukunaga and Hostetler, 1975). It works well when p = 2, having amenable asymptotic properties (Parzen, 1962; de Valpine, 2004). However, it becomes very slow when p > 2, though when the shape of the density is known the speed improves (Cule et al., 2010).
More recently, Ruzankin and Logashov introduced a fast mode estimator with time complexity O(pn), whereas other estimators have time complexity O(pn²), where n is the number of observations (Ruzankin and Logashov, 2020).
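Before moving on, the β-mode and active information ideas above can be made concrete with a small computation. The following Python sketch (our illustration, not part of the original development) finds the shortest interval of probability β for a non-uniform distribution on S = [0, 1] and evaluates I⁺ = log(β/(b − a)); the Beta(2, 2) choice, the grid size, and all variable names are ours.

```python
# Sketch (our illustration): active information of a beta-mode interval on S = [0, 1].
# For a fixed beta, scan candidate intervals [a, b] with phi([a, b]) = beta and keep
# the shortest one; its active information I+ = log(beta / (b - a)) is then maximal.
import numpy as np
from scipy.stats import beta as beta_dist

phi = beta_dist(2, 2)        # a non-uniform distribution on [0, 1]
target = 0.3                 # beta, the prescribed probability content

best = None
for a in np.linspace(0.0, 1.0 - 1e-6, 2000):
    mass_left = phi.cdf(a)
    if mass_left + target > 1.0:   # interval no longer fits inside [0, 1]
        break
    b = phi.ppf(mass_left + target)  # right endpoint with phi([a, b]) = target
    if best is None or (b - a) < best[1] - best[0]:
        best = (a, b)

a, b = best
print(f"beta-mode interval: [{a:.3f}, {b:.3f}], length {b - a:.3f}")
print(f"active information I+ = {np.log(target / (b - a)):.4f} nats")
```

Because Beta(2, 2) peaks at 1/2, the recovered interval is centered there; any other interval of the same mass is longer and carries strictly less active information.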
2. Basic notions
2.1. Principal and pettiest components

Suppose there is a p-dimensional dataset of n observations {x_1, ..., x_p}^n with covariance matrix Σ. The problem is to solve the eigenvalue equation Σa = λa. From the solution, we get the eigenvectors a_1, ..., a_p, with respective eigenvalues λ_1, ..., λ_p. The original p-dimensional input space will be X(p), and the space after the rotation in the direction of its p eigenvectors, X′(p).

We will call these eigenvectors components, and their corresponding eigenvalues will be their variances. Then, sorting the components in order of increasing variance, the k principal components will be the k components with the largest variances, and the k pettiest components will be the k components with the smallest variances.

2.2. The patient rule induction method

The patient rule induction method (PRIM) is a greedy algorithm designed for bump hunting. A bump can be roughly defined as a region where the response variable has a higher mean compared to other places. Here we outline the algorithm briefly; more details can be found in Friedman and Fisher (1999).

Assume we have a dataset {y, x}^n, where x = {x_1, x_2, ..., x_p} is a continuous random vector and f(x) is the target function. The input domain X(p) is here called S. Defining f̄_B and f̄ as

    f̄_B = ∫_B f(x) p(x) dx / ∫_B p(x) dx,    f̄ = ∫_S f(x) p(x) dx,

where p(x) is the density function of x, the method aims to find a box B ⊂ S in which f̄_B / f̄ is maximized under the constraint that B is not too small. In fact, the probability of B is defined at the outset by a tuning meta-parameter β for the smallest permitted box size. In practice, f̄_B / f̄ is replaced by its corresponding estimator.

The algorithm is divided into three stages. First there is a peeling: beginning with the whole input space S, a class of boxes eligible for removal is set up. This class is made of sets of the form

    b_{j−} = {x | x_j < x_{j(α)}},    b_{j+} = {x | x_j > x_{j(1−α)}},

where x_{j(α)} is an α-quantile. If a box b* is in the eligible set, and without it the remaining S \ b* gives the maximum average value of the response variable, then b* is the box selected for removal in this loop. Formally, we write

    b* = arg max_b ave[y_i | x_i ∈ B \ b],

where ave is the average. The new box is updated as B = S \ b*, and the same procedure is run on B. The process is iterated until the final box B has probability β. The peeling procedure is shown in Fig. 1.

After peeling, a second step called pasting is performed in order to correct for the greediness of peeling. We will not consider pasting in this paper. Finally, the third step is called covering, which means that after the peeling stage the final box B is deleted from S, and then the whole peeling procedure is repeated in S \ B.

Figure 1: The peeling procedure in PRIM.
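To make the peeling stage concrete, here is a minimal Python sketch of it (a simplified reading of Friedman and Fisher's procedure, not their reference implementation; the function name, the α = 0.05 and β = 0.1 defaults, and the stopping rule on the empirical support mass are our choices):

```python
# Minimal peeling sketch (simplified; parameter defaults and names are ours).
import numpy as np

def prim_peel(X, y, alpha=0.05, beta=0.1):
    """Peel the box until it holds at most a fraction beta of the data."""
    n, p = X.shape
    inside = np.ones(n, dtype=bool)            # start with the whole space S
    box = [(-np.inf, np.inf)] * p              # current bounds, one pair per coordinate
    while inside.mean() > beta:
        best = None                            # (mean response, j, side, cut, keep)
        for j in range(p):
            lo = np.quantile(X[inside, j], alpha)       # candidate sliver b_{j-}
            hi = np.quantile(X[inside, j], 1 - alpha)   # candidate sliver b_{j+}
            for side, cut in (("-", lo), ("+", hi)):
                keep = inside & (X[:, j] >= cut if side == "-" else X[:, j] <= cut)
                if keep.sum() == 0:
                    continue
                m = y[keep].mean()             # ave[y_i | x_i in B \ b]
                if best is None or m > best[0]:
                    best = (m, j, side, cut, keep)
        _, j, side, cut, keep = best           # remove the best sliver b*
        box[j] = (cut, box[j][1]) if side == "-" else (box[j][0], cut)
        inside = keep
    return box, inside
```

On data with a planted high-response region, the returned bounds should close in on that region, as in Fig. 1.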
2.3. Fast patient rule induction method

PRIM has at least three shortcomings: first, it is computationally expensive; second, it does not behave well in the presence of collinearity; and third, even in a small number of dimensions, it cannot detect multiple bumps (Polonik and Wang, 2010). For this reason, Dazard and Rao developed a modified algorithm called local sparse bump hunting, in which the patient rule induction method is subsumed (Dazard and Rao, 2010). The local sparse bump hunting algorithm uses a recursive partitioning algorithm, say classification and regression trees, in order to divide the space X(p) into several regions S_1, ..., S_R; it then applies sparse principal components analysis inside each particular region S_i in order to reduce the dimension of each region in the partition; and it finally applies the patient rule induction method to each rotated and projected subspace induced by S_i. The local sparse bump hunting algorithm has been successfully applied to detect heterogeneity of colon tumors, dividing colon cancer patients into two subpopulations with different genetic/molecular profiles in all stages of cancer (Dazard et al., 2012).

Elaborating on local sparse bump hunting, a new modified algorithm called fastPRIM was developed in order to find the minimal volume of boxes with probability β when the distribution is multivariate normal (Díaz-Pachón et al., 2017, Algorithm 4). More accurately, the modes can be defined as β-modes; i.e., contiguous regions, not necessarily unique, with the smallest volume and probability β. Since the mode and the mean of the normal distribution coincide, mode hunting in this case is equivalent to finding a box of probability β centered around the mean. In fastPRIM the space is first rotated in the direction of its eigenvectors, and then the patient rule induction method is applied in the direction of the rotation. It turns out that in this setting the whole peeling and pasting iteration is reduced to a single step. That is, fastPRIM chooses the centralized box such that each side of the box is parallel to an axis of the input space and whose vertices are located at the quantiles 2^{−1}(1 ± β_T^{1/p}) of the corresponding variable, where

    β_T = ∑_{k=1}^{t} β(1 − β)^{k−1} = 1 − (1 − β)^t

is the probability measure after t steps of covering. As a result, we obtain a rectangular Lebesgue set (a box that is a square in probability) centered at the origin.

Therefore, from a sampling viewpoint fastPRIM is consistent: calculating the average of all the points inside the final centered box and taking this average as the center of our box, we know by the law of large numbers that the final box will be centered around the origin. Since the mean and the mode coincide in the normal distribution, as n → ∞ the procedure approaches a box whose center is the true mode. This is so with or without dimension reduction.
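Since the peeling-pasting iteration collapses to a single step, the fastPRIM box can be written down in closed form for rotated, centered data. The sketch below (ours) does this under the independent-normal setting of the next section; the function name and example values are assumptions, and σ_j here denotes a standard deviation.

```python
# fastPRIM box sketch for rotated, centered normal coordinates (names ours).
import numpy as np
from scipy.stats import norm

def fastprim_box(sigmas, beta, t=1):
    """Centered box of probability beta_T for independent N(0, sigma_j^2) coords."""
    p = len(sigmas)
    beta_t = 1 - (1 - beta) ** t              # probability after t covering steps
    marginal = beta_t ** (1.0 / p)            # each side carries mass beta_T^{1/p}
    k = norm.ppf((1 + marginal) / 2)          # quantile 2^{-1}(1 + beta_T^{1/p})
    half_widths = k * np.asarray(sigmas)      # edge length is 2 * k * sigma_j
    return np.column_stack([-half_widths, half_widths])

print(fastprim_box([1.0, 0.5, 0.2], beta=0.1))   # one [lower, upper] row per coordinate
```

Each marginal satisfies P(−kσ_j < X_j < kσ_j) = 2Φ(k) − 1 = β_T^{1/p}, which is where the ppf call comes from.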
3. Pettiest components analysis with fastPRIM
The fastPRIM algorithm satisfies the following optimality property: when the data is multivariate normal, say N(0, Σ), the box with the smallest volume subject to having probability β is found in the direction of the rotation of the p principal components; that is, it is the box of probability β centered around the mean in X′(p) (Díaz-Pachón et al., 2017, Proposition 1). However, no result in Díaz-Pachón et al. (2017) dealt with dimension reduction. This article goes one step further: it shows that if we are going to consider reduction of dimensionality, say to p′ < p, we should consider pettiest components instead of principal components.

Starting thus with the space X(p), let B_i be the final box obtained by fastPRIM on an input space X′_i(p′), where i indicates a specific way of choosing p′ variables from the p variables. The collection of all such boxes will be ℬ. The Lebesgue measure of the box B ∈ ℬ with probability measure β is Vol(B | β). For the input space X′_i(p′) spanned by the pettiest components, we write that specific box as B_pet.
Theorem 1. Let X be a p-random vector such that its components X_1, ..., X_p are independent of each other and have normal distributions N(0, σ_1²), ..., N(0, σ_p²), respectively. Apply fastPRIM on each possible projection space X′_i(p′) with the same probability β. Then

    arg min_{B ∈ ℬ} Vol(B | β) = B_pet.
Proof. There are C_p^{p′} ways to choose p′ dimensions from p. For X_1, ..., X_p following independent normal distributions, we write X_j ∼ N(0, σ_j²), j = 1, ..., p. From the fact that the final box B in fastPRIM is a square in probability, the marginal probability measure of every B ∈ ℬ is β^{1/p′}. Because all boxes B ∈ ℬ are centered at zero, the Lebesgue measure in the X_j direction, subject to a marginal probability β^{1/p′}, can be calculated from the equation P{−kσ_j < X_j < kσ_j} = β^{1/p′}. Here k depends only on β^{1/p′} and has nothing to do with j. Therefore, the edge length of B_i in the X_j direction is 2kσ_j. Then

    Vol(B_i | β) = ∏_{j(i)=1}^{p′} 2kσ_{j(i)} = (2k)^{p′} ∏_{j(i)=1}^{p′} σ_{j(i)},

where j(i) denotes the p′ variables in the choice of B_i. Since a fixed β implies a fixed k, the minimum of Vol(B_i | β) is obtained by minimizing ∏_{j(i)=1}^{p′} σ_{j(i)}. But this only requires σ_{1(i)}, ..., σ_{p′(i)} to have the smallest values, which is achieved by choosing the X_{1(i)}, ..., X_{p′(i)} with the smallest variances; that is, the p′ pettiest components. The corresponding box found by fastPRIM with the pettiest components is B_pet.
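The volume formula in the proof is easy to check numerically. The following sketch (our illustration, with arbitrary standard deviations) enumerates all C_p^{p′} choices of p′ = 2 coordinates out of p = 4 and confirms that the minimizer is the pair of smallest standard deviations:

```python
# Numerical check of Theorem 1 (illustration; all values ours).
import numpy as np
from itertools import combinations
from scipy.stats import norm

sigmas = np.array([0.3, 0.7, 1.0, 2.5])   # standard deviations sigma_1, ..., sigma_p
p_prime, beta = 2, 0.1
k = norm.ppf((1 + beta ** (1 / p_prime)) / 2)   # from P(-k s < X < k s) = beta^{1/p'}

vols = {c: (2 * k) ** p_prime * np.prod(sigmas[list(c)])
        for c in combinations(range(len(sigmas)), p_prime)}
print(min(vols, key=vols.get))   # (0, 1): the two pettiest components
```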
Using Theorem 1, we can easily show that the same conclusion is true for any multivariate normal input under fastPRIM.

Theorem 2. Let X be a p-random vector distributed as N_p(0, Σ). Apply fastPRIM on each possible projection space X′_i(p′) with the same probability β. Then

    arg min_{B ∈ ℬ} Vol(B | β) = B_pet.
Proof. For a multivariate normal input, rotating the space in the direction of its p components is equivalent to solving the eigenvalue equation of the covariance matrix Σ. The rotation produces a diagonal matrix D with the eigenvalues as its diagonal elements. This new matrix D is the covariance matrix of the components, which implies that they are all independent of each other and normally distributed. The problem is thus reduced to an instance of Theorem 1, and the result holds as before.
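The reduction in this proof is the usual diagonalization step. As a sketch (with our own toy covariance), rotating multivariate normal draws by the eigenvectors of Σ produces uncorrelated, hence independent, normal coordinates whose variances are the eigenvalues, after which the Theorem 1 recipe applies:

```python
# Sketch of the reduction in Theorem 2 (illustration; values ours).
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.7], [0.7, 2.0]])
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
Z = X @ eigvecs                            # rotate into the component space X'(p)
print(np.cov(Z.T).round(2))                # ~ diag(eigvals): independent coordinates
print(eigvals.round(2))
```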
4. Simulation
We generate a dataset with 300 observations and 100 dimensions following the multivariate normal distribution N_p(0, Σ). The covariance matrix Σ is assumed to have its first two diagonal elements equal to 1, with cov(x_1, x_2) = 0.7; its last two diagonal elements equal to 12, with cov(x_99, x_100) = 8; the remaining diagonal elements are 6, and all other entries are 0. A pure PRIM, with no rotation or reduction and ten iterations of covering, is then applied for contrast purposes. The result is shown in Fig. 2 for the first two dimensions. We then standardize the dataset, so as to use the correlation matrix, and apply the rotation in the direction of the eigenvectors. Once the rotation is performed, the two new variables with the largest variances are the principal components and the two variables with the smallest variances are the pettiest components. The next step is to run both PRIM and fastPRIM on the two principal components and on the two pettiest components, both with β = 0.5.
Figure 2: Region obtained by the PRIM.

Figure 3: Simulation results by method. (a) Principal components with PRIM; (b) pettiest components with PRIM; (c) principal components with fastPRIM; (d) pettiest components with fastPRIM.

Table 1: Density of boxes per volume by method; the ten columns correspond to the ten covering steps.

Method               1         2         3         4         5         6         7         8         9         10
PRIM                 1.07e-73  1.57e-73  1.76e-73  2.14e-73  2.59e-73  3.04e-73  3.45e-73  3.87e-73  4.26e-73  4.65e-73
PRIM-Principal       15.7      16.7      14.2      14.7      14.3      14.2      14.0      13.9      13.8      13.0
fastPRIM-Principal   16.3      15.6      16.2      16.3      14.5      13.2      13.1      13.1      12.7      11.8
PRIM-Pettiest        182       168       187       152       155       142       130       132       132       127
fastPRIM-Pettiest    215       240       237       218       207       186       189       177       169       161

Table 2: Empirical variance and bias by method.

           PRIM   PRIM-Principal   PRIM-Pettiest   fastPRIM-Principal   fastPRIM-Pettiest
Variance   104    2.28             0.204           2.23                 0.149
Bias

5. Summary

Even though it is well known that pettiest components can sometimes explain a response better than principal components, they have been treated more as counterexamples, anomalies to be avoided, than as a tool in their own right. In this article, however, we showed that pettiest components can be used systematically to find the best β-modes in multivariate datasets, provided that the data is normally distributed. In this sense, it is surprising that this had not been noted before. The finding goes against the general notion that principal components are more informative since, as shown in the Introduction, the box with the smallest size also maximizes the active information relative to the uniform distribution.

Given the mean's lack of robustness even in low dimensions, and the prevalence of big data and large-dimension analyses, mode-based statistics and learning are becoming more relevant (Chacón, 2020). To mention just one case, a recent analysis in affective computing, used for autism therapy, is performed around modes, not the mean (Rudovic et al., 2019). Consequently, new theory and methods are required in order to better detect modes, but this task is difficult and elusive. The importance of fastPRIM in mode hunting is dictated by the central limit theorem. The setback of this approach is the curse of dimensionality, as Vapnik explains in the introduction of his book (Vapnik, 1998, pp. 4-6). Therefore, even though some optimality is reached just by rotating the data in the direction of the eigenvectors, if fastPRIM is going to be useful it will need to reduce dimensionality.
References

A. Artemiou and B. Li. On principal components and regression: A statistical explanation of a natural phenomenon. Stat. Sinica, 19:1557-1565, 2009.

J. E. Chacón. The Modal Age of Statistics. International Statistical Review, 88:122-141, 2020. doi: 10.1111/insr.12340.

D. Cox. Notes on some aspects of regression analysis. J. R. Stat. Soc. Ser. A, 131:265-279, 1968.

M. L. Cule, R. J. Samworth, and M. I. Stewart. Maximum likelihood estimation of a multi-dimensional log-concave density. J. Roy. Statist. Soc. B, 72:545-600, 2010. doi: 10.1111/j.1467-9868.2010.00753.x.

J.-E. Dazard and J. S. Rao. Local Sparse Bump Hunting. J. Comput. Graph. Stat., 19(4):900-929, 2010. doi: 10.1198/jcgs.2010.09029.

J.-E. Dazard, J. S. Rao, and S. Markowitz. Local Sparse Bump Hunting reveals molecular heterogeneity of colon tumors. Stat. Med., 31(11-12):1203-1220, 2012. doi: 10.1002/sim.4389.

P. de Valpine. Monte Carlo State-Space Likelihoods by Weighted Posterior Kernel Density Estimation. J. Amer. Stat. Assoc., 99(466):523-536, 2004. doi: 10.1198/016214504000000476.

W. A. Dembski and R. J. Marks II. Bernoulli's Principle of Insufficient Reason and Conservation of Information in Computer Search. In Proc. of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, pages 2647-2652, October 2009a. doi: 10.1109/ICSMC.2009.5346119.

W. A. Dembski and R. J. Marks II. Conservation of Information in Search: Measuring the Cost of Success. IEEE Transactions on Systems, Man and Cybernetics A, Systems & Humans, 5(5):1051-1061, September 2009b. doi: 10.1109/TSMCA.2009.2025027.

D. A. Díaz-Pachón and R. J. Marks II. Generalized active information: Extension to unbounded domains. BIO-Complexity, 2020(3):1-6, 2020. doi: 10.5048/BIO-C.2020.3. URL https://bio-complexity.org/ojs/index.php/main/article/view/BIO-C.2020.3.

D. A. Díaz-Pachón, J.-E. Dazard, and J. S. Rao. Unsupervised Bump Hunting Using Principal Components. In S. E. Ahmed, editor, Big and Complex Data Analysis: Methodologies and Applications, pages 325-345. Springer International Publishing, 2017. URL https://doi.org/10.1007/978-3-319-41573-4_16.

D. A. Díaz-Pachón, J. P. Sáenz, J. S. Rao, and J.-E. Dazard. Mode hunting through active information. Applied Stochastic Models in Business and Industry, 35(2):376-393, 2019. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.2430.

D. A. Díaz-Pachón, J. P. Sáenz, and J. S. Rao. Hypothesis testing with active information. Statistics & Probability Letters, 161:108742, 2020. ISSN 0167-7152. doi: 10.1016/j.spl.2020.108742.

J. Friedman and N. Fisher. Bump hunting in high-dimensional data. Stat. Comput., 9:123-143, 1999.

K. Fukunaga and L. D. Hostetler. The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory, 21:32-40, 1975. doi: 10.1109/TIT.1975.1055330.

S. Hadi and R. Ling. Some Cautionary Notes on the Use of Principal Components Regression. Am. Stat., 52(1):15-19, 1998. doi: 10.1080/00031305.1998.10480530.

I. Jolliffe. A Note on the Use of Principal Components in Regression. J. R. Stat. Soc. Ser. C. Appl. Stat., 31(3):300-303, 1982. doi: 10.2307/2348005.

K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, 1979.

E. Parzen. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat., 33(3):1065-1076, 1962. doi: 10.1214/aoms/1177704472.

W. Polonik and Z. Wang. PRIM analysis. J. Multivar. Anal., 101(3):525-540, 2010. doi: 10.1016/j.jmva.2009.08.010.

O. Rudovic, M. Zhang, B. Schuller, and R. Picard. Multi-modal Active Learning From Human Data: A Deep Reinforcement Learning Approach. In 2019 International Conference on Multimodal Interaction, ICMI '19, pages 6-15, New York, NY, USA, 2019. ACM. doi: 10.1145/3340555.3353742. URL https://doi.org/10.1145/3340555.3353742.

P. S. Ruzankin and A. V. Logashov. A fast mode estimator in multidimensional space. Stat. Probab. Letters, 158:108670, 2020. doi: 10.1016/j.spl.2019.108670.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.