Submitted to the Annals of Statistics

PETER HALL’S WORK ON HIGH-DIMENSIONAL DATA AND CLASSIFICATION
By Richard J. Samworth, University of Cambridge

In this article, I summarise Peter Hall’s contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him.

∗ The research of Richard J. Samworth was supported by an EPSRC Early Career Fellowship and a Philip Leverhulme Prize.
1. High-dimensional data.
Peter Hall wrote many influential works on high-dimensional data, though notably he largely eschewed the notions of sparsity and penalised likelihood that have become so popular in recent years. Nevertheless, he was interested in variable selection, and wrote several papers that involved ranking variables in some way. Perhaps his most well-known papers in this area, though, concern geometrical representations of high-dimensional data.

1.1. Geometric representations of high-dimensional data.
Hall and Li (1993) was one of the pioneering works in the early days of high-dimensional data analysis that tried to understand the properties of low-dimensional projections of a high-dimensional isotropic random vector X in R^p. As motivation, let γ ∈ R^p have ‖γ‖ = 1 and suppose that

(1)  for every b ∈ R^p there exist α_b, β_b ∈ R such that E(b^T X | γ^T X = t) = α_b t + β_b.

This condition says that the regression function of b^T X on γ^T X is linear. Then, using the isotropy of X,

0 = E(b^T X) = E{E(b^T X | γ^T X)} = E(α_b γ^T X + β_b) = β_b.

Moreover,

b^T γ = Cov(b^T X, γ^T X) = E{E(b^T X X^T γ | γ^T X)} = α_b γ^T E(X X^T) γ = α_b,

and we conclude that E(X | γ^T X = t) = tγ, or equivalently,

(2)  ‖E(X | γ^T X = t)‖² − t² = 0.
The left-hand side of (2) is always non-negative (indeed, since γ^T E(X | γ^T X = t) = E(γ^T X | γ^T X = t) = t, we have ‖E(X | γ^T X = t)‖ ≥ |t|), so it can be used as a measure of the extent to which the condition (1) holds. Remarkably, under very mild conditions on the distribution of X, Hall and Li (1993) proved that if γ is drawn from the uniform distribution on the unit Euclidean sphere in R^p, then

‖E(X | γ, γ^T X = t)‖² − t² → 0 in probability as p → ∞.

This is equivalent to the statement

sup_{b ∈ R^p : ‖b‖ = 1, b^T γ = 0} |E(b^T X | γ, γ^T X = t)| → 0 in probability as p → ∞.

See also Diaconis and Freedman (1984), who showed that under mild conditions, most low-dimensional projections of high-dimensional data are nearly normal. Of course, when X has a spherically symmetric distribution, (1) holds for every γ ∈ R^p with ‖γ‖ = 1. But the result of this paper shows that even without spherical symmetry, there is a good chance (in the sense of random draws of γ as described above) that (1) holds, at least approximately, when p is large. An important statistical consequence of this is that even if the relationship between a response Y and a high-dimensional predictor is non-linear, say Y = g(γ^T X, ε) for some unknown link function g and error ε, standard linear regression procedures can often be expected to yield an approximately correct estimate of γ up to a constant of proportionality. The generalisation of this result that replaces γ^T X with Γ^T X, where Γ is a random p × k matrix with orthonormal columns, also plays an important role in justifying the use of sliced inverse regression for dimension reduction (Li, 1991).

Another seminal paper that articulated many of the key geometrical properties of high-dimensional data is Hall, Marron and Neeman (2005). This paper begins with the simple, yet remarkable, observation that if Z ∼ N_p(0, I), then ‖Z‖ = p^{1/2} + O_p(1) as p → ∞. Thus, data drawn from this distribution tend to lie near the boundary of a large ball. Similarly, pairs of points lie almost a deterministic distance apart, and the observations tend to be almost orthogonal. In fact, the authors go on to explain that, under much weaker assumptions than Gaussianity, the data lie approximately on the vertices of a regular simplex, and that the stochasticity in the data essentially appears as a random rotation of this simplex. As well as clarifying the relationship between Support Vector Machines (e.g. Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination classifiers (Marron, Todd and Ahn, 2007) in high dimensions, the paper forced researchers to rewire their intuition about high-dimensional data, and precipitated a flood of subsequent papers on high-dimensional asymptotics.
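These geometric properties are easy to observe numerically. Below is a minimal Python sketch (my own illustration, not code from Hall, Marron and Neeman, 2005; the values of n and p are arbitrary choices) showing that norms concentrate around p^{1/2}, pairwise distances around (2p)^{1/2}, and pairwise angles around 90 degrees:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 10_000

# n independent draws from N_p(0, I)
Z = rng.standard_normal((n, p))

# Norms concentrate: ||Z_i|| = p^{1/2} + O_p(1)
norms = np.linalg.norm(Z, axis=1)
print(np.sqrt(p), norms.min(), norms.max())        # all close to 100

# Pairs of points lie almost a deterministic distance (2p)^{1/2} apart
i, j = np.triu_indices(n, k=1)
dists = np.linalg.norm(Z[i] - Z[j], axis=1)
print(np.sqrt(2 * p), dists.min(), dists.max())    # all close to 141.4

# Observations are nearly orthogonal: cosines of order p^{-1/2}
cosines = (Z[i] * Z[j]).sum(axis=1) / (norms[i] * norms[j])
print(np.abs(cosines).max())                       # small
```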
1.2. Variable selection and ranking.

The last 15 years or so have seen variable selection emerge as one of the most prominently studied topics in Statistics. Although Peter’s instinct was to think nonparametrically, he realised that he could contribute to a prominent line of research in the variable selection literature, namely marginal screening (e.g. Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Li, Zhong and Zhu, 2012), via the deep understanding he developed for rankings. Hall and Miller (2009a) defined variable rankings through their generalised correlation with a response, while Delaigle and Hall (2012) studied variable transformations prior to ranking based on correlation as a method for dealing with heavy-tailed data. For classification, Hall, Titterington and Xue (2009a) proposed a cross-validation based criterion for assessing variable importance, while in the unsupervised setting, Chan and Hall (2010) suggested ranking the importance of variables for clustering based on nonparametric tests of modality.

These works were underpinned by Peter’s realisation that perhaps his favourite tool of all, namely the bootstrap, could be used to quantify the authority of a ranking (Hall and Miller, 2009b). In fact, there are some subtle issues here, particularly surrounding ties. Peter developed an ingenious method for proving that even though the standard n-out-of-n bootstrap does not handle this issue well, the m-out-of-n bootstrap overcomes it in an elegant way.
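To make the idea concrete, here is a small Python sketch of an m-out-of-n bootstrap for the authority of a correlation-based ranking. This is my own illustration rather than Hall and Miller's implementation; the function name and the use of absolute Pearson correlation as the ranking criterion are assumptions made purely for the example.

```python
import numpy as np

def ranking_intervals(X, y, m, B=500, seed=0):
    """m-out-of-n bootstrap intervals for the rank of each variable,
    where variables are ranked by |correlation| with the response y.
    A sketch of the general idea only, not Hall and Miller's method."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ranks = np.empty((B, p), dtype=int)
    for b in range(B):
        idx = rng.choice(n, size=m, replace=True)   # resample of size m < n
        corr = np.abs([np.corrcoef(X[idx, j], y[idx])[0, 1] for j in range(p)])
        order = np.argsort(-corr)                   # rank 1 = largest |corr|
        r = np.empty(p, dtype=int)
        r[order] = np.arange(1, p + 1)
        ranks[b] = r
    lo, hi = np.percentile(ranks, [5, 95], axis=0)  # pointwise 90% intervals
    return lo, hi
```

Wide intervals flag variables whose positions in the ranking carry little authority; as discussed above, the naive choice m = n does not handle ties well, which is what motivates the m-out-of-n version.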
2. Classification problems.
I believe that Peter may have become interested in classification problems in the early 2000s at least partly through ideas of bootstrap aggregating, or bagging (Breiman, 1996). Indeed, in Friedman and Hall (2007), a preprint of which was already available in early 2000, Peter had attempted to understand the effect of bagging in M-estimation problems. This is a typical example of Peter’s extraordinary ability to explain empirically observed effects through asymptotic expansions. One of the other interesting contributions of this work is the observation that subsampling (i.e. sampling without replacement) half of the observations closely mimics ordinary n-out-of-n bootstrap sampling, a very useful fact that has been observed and exploited in several other contexts, including stability selection for choosing variables in high-dimensional inference (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013) and stochastic search methods for semiparametric regression (Dümbgen, Samworth and Schuhmacher, 2013).

Classification problems are ideally suited to bagging, because the discrete nature of the response variable means that small changes to the training data can often yield different outputs from a classifier; in the terminology of Breiman (1996), many classifiers are ‘unstable’. Suppose we are given training data X := {(X_1, Y_1), . . . , (X_n, Y_n)}, where each X_i is a covariate taking values in a general normed space B, and Y_i is a response taking values in {−1, 1}. Assume further that we have access to a classifier Ĉ_n(·) = Ĉ_n(·; X) constructed from the training data, so that x ∈ B is assigned to class Ĉ_n(x; X). To form the bagged version Ĉ*_n of the classifier, we draw B bootstrap resamples {X*_b : b = 1, . . . , B} from X, and set

Ĉ*_n(x) := sgn( (1/B) Σ_{b=1}^{B} Ĉ_n(x; X*_b) ).

Peter got me interested in bagging nearest neighbour classifiers. Ironically, the nearest neighbour classifier had been described by Breiman as stable, since the nearest neighbour appears in more than half (in fact, around 1 − (1 − 1/n)^n ≈ 1 − e^{−1}) of the bootstrap resamples; thus the bagged nearest neighbour classifier is typically identical to the unbagged version. In Hall and Samworth (2005), however, we studied the effect of drawing resamples (either with or without replacement) of smaller size m. Naturally, this reduces the probability of including the nearest neighbour in the resample, and the bagged classifier is now well approximated by a weighted nearest neighbour classifier with geometrically decaying weights; see also Biau and Devroye (2010). In order for bagging to yield any asymptotic improvement over the basic nearest neighbour classifier, we require m/n < 1/2 (when sampling without replacement) or m/n < log 2 (when sampling with replacement); in order to converge to the theoretically-optimal Bayes classifier, we require m = m_n → ∞ but m/n → 0.
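In this setting the bagged 1-nearest neighbour classifier is straightforward to write down. The sketch below is my own illustration in the spirit of Hall and Samworth (2005), not the paper's code; it takes B = R^d with labels in {−1, 1}, and the helper names are mine.

```python
import numpy as np

def nn_classify(x, X, y):
    """1-nearest neighbour classifier: return the label of the closest
    training point to x."""
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

def bagged_nn_classify(x, X, y, m, B=200, replace=False, seed=0):
    """Bagged 1-NN using B resamples of size m (with or without
    replacement), aggregated by the sign of the average vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = 0
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=replace)
        votes += nn_classify(x, X[idx], y[idx])
    return 1 if votes >= 0 else -1   # labels are -1 or +1; ties go to +1
```

With m = n and replace=True, the nearest neighbour of x appears in around 1 − e^{−1} ≈ 63% of resamples, so this almost always reproduces the plain 1-NN answer; taking m much smaller than n yields the weighted nearest neighbour approximation described above.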
Peter also studied bandwidth choice for nonparametric classification in Hall and Kang (2005), in the case B = R^d, where the classifier is built from kernel estimates of the class conditional densities. A particularly curious discovery he made there is that even in the simplest case where d = 1 and where the class conditional densities f and g cross only at the single point x, the rate of convergence and order of the asymptotically optimal bandwidth depend on the sign of f″(x)g″(x). In Hall, Park and Samworth (2008), we considered similar problems in the context of k-nearest neighbour classification, obtaining an asymptotic expansion for the regret (i.e. the difference between the risk of the k-nearest neighbour classifier and that of the Bayes classifier), which implied that the usual nonparametric error rate of order n^{−4/(d+4)} was attainable with k chosen to be of order n^{4/(d+4)}. The form of the expansion made me realise that the limiting ratio of the regrets of the bagged nearest neighbour classifier and the k-nearest neighbour classifier (with both the resample size m and the number of neighbours k chosen optimally) depended only on d, and not on the underlying distributions. To my great surprise, this limiting ratio was greater than 1 when d = 1, equal to 1 when d = 2 and less than 1 for d ≥ 3 (and decreasing in d). It took me some years to explain this phenomenon in terms of the optimal weighting scheme (Samworth, 2012).
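For concreteness, the optimal order of k from this expansion is easy to use in practice, up to an unknown constant. The following short sketch (my own; the proportionality constant 1 is an arbitrary placeholder, since the optimal constant depends on the underlying distributions) sets k for a k-nearest neighbour majority vote:

```python
import numpy as np

def knn_classify(x, X, y, k):
    """k-nearest neighbour classifier by majority vote, labels in {-1, +1}."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return 1 if y[nearest].sum() >= 0 else -1

# k of the optimal order n^{4/(d+4)} from the regret expansion
n, d = 10_000, 2
k = int(round(n ** (4 / (d + 4))))   # about 464 when n = 10,000 and d = 2
```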
In more recent years, Peter turned his attention to a wealth of other important, though perhaps less well studied, issues in classification. Some of these were motivated by what he saw as drawbacks of existing classifiers. For instance, in Hall, Titterington and Xue (2009b), he developed classifiers based on componentwise medians, to alleviate the difficulties of both computing and interpreting multivariate medians; such methods can be highly effective for high-dimensional data that may have heavy tails. In Chan and Hall (2009a), he studied robust versions of nearest neighbour classifiers for high-dimensional data that try to perform an initial variable selection step to reduce variability. Chan and Hall (2009b) presented simple scale adjustments to make distance-based classifiers (primarily designed to detect location differences) less sensitive to scale variation between populations; see also Hall and Pham (2010). Hall and Xue (2010) and Hall, Xia and Xue (2013) concerned settings where one might want to incorporate the prior probabilities into a classifier, and where these prior probabilities may be significantly different from 1/2, respectively. Finally, Ghosh and Hall (2008) discovered the phenomenon that estimating the risk of a classifier, and estimating the tuning parameters to minimise that risk, are two rather different problems, requiring the use of different methodologies.
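The distinction can be felt even in a toy setting: leave-one-out cross-validation gives a nearly unbiased estimate of the risk of the k-nearest neighbour classifier for each fixed k, yet the resulting curve in k is jagged, so its minimiser is a noisy choice of tuning parameter. The sketch below is my own loose illustration of this point (Ghosh and Hall's methodology is more refined than this):

```python
import numpy as np

def loo_knn_error(X, y, k):
    """Leave-one-out cross-validation error of the k-NN classifier."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)      # a point is never its own neighbour
    errors = 0
    for i in range(len(y)):
        nearest = np.argsort(D[i])[:k]
        pred = 1 if y[nearest].sum() >= 0 else -1
        errors += (pred != y[i])
    return errors / len(y)

# The curve k -> loo_knn_error(X, y, k) estimates the risk well pointwise,
# but smoothing it before minimising can select k better, even though the
# smoothed values may be worse as estimates of the risk itself.
```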
3. Some personal reflections.
I first met Peter as a PhD student when he visited Cambridge in 2002. I spent an hour or so discussing a problem I was working on that involved using ideas of James–Stein estimation to find small confidence sets for the location parameter of a spherically symmetric distribution (Samworth, 2005). I was blown away at the speed with which he was able to understand where my difficulties lay, and make helpful suggestions. Shortly afterwards, he invited me to spend six weeks at the Australian National University in Canberra in July–August 2003. I arrived utterly exhausted after nearly 24 hours in the air, but Peter was full of energy when he kindly picked me up from the bus station. Almost the first thing he said to me was: ‘I’ve got a problem I thought we could think about...’, and he proceeded to take out a pen and pad of paper; one couldn’t help but be drawn along by his enthusiasm for research.
Everything with Peter happened at breakneck speed, whether it was dashing around the supermarket, a driving tour through the rural Australian Capital Territory or, of course, writing papers. Many of his collaborators will have experienced discussing a problem with Peter one evening and returning to the office the following morning to find that he had typed up a draft manuscript that would form the basis of a joint paper. His prose was always elegant, and he had a wonderful ability to see his way through technical asymptotic arguments, aided by an almost physicist-like intuition for what ought to be true.
Fig 1. Peter with Juhyun Park (Lancaster University), the author and Nick Bingham (Imperial College London) on a blustery day in rural Australian Capital Territory in 2003.
One of my favourite Peter stories, which I initially heard second-hand but which he later confirmed was true, concerned a time when he’d been asked to teach an elementary Statistics course to students with really very little quantitative background. Realising that he’d lost some of the students along the way, and in order not to ruin their grades, Peter had a cunning idea and spent the last class before the final going through the problems that he’d set on the exam. To his horror, however, the students still flunked the exam. When Peter bumped into one of the students and asked in bemusement ‘What happened? I went through the questions in the last class’, the student
replied ‘Yes, but you did them in a different order’!

Peter had seemingly boundless energy and capacity to work, but he was also a very gentle individual in many ways. He was extraordinarily generous to others, particularly junior researchers, for whom he did so much. He was a remarkable person and I miss him very deeply.

References.
Biau, G. and Devroye, L. (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J. Mult. Anal., 2499–2518.
Breiman, L. (1996) Bagging predictors. Mach. Learn., 123–140.
Chan, Y.-B. and Hall, P. (2009a) Robust nearest-neighbor methods for classifying high-dimensional data. Ann. Statist., 3186–3203.
Chan, Y.-B. and Hall, P. (2009b) Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 469–478.
Chan, Y.-B. and Hall, P. (2010) Using evidence of mixed populations to select variables for clustering very high dimensional data. J. Amer. Statist. Assoc., 798–809.
Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
Delaigle, A. and Hall, P. (2012) Effect of heavy tails on ultra high dimensional variable ranking methods. Statistica Sinica, 909–932.
Diaconis, P. and Freedman, D. (1984) Asymptotics of graphical projection pursuit. Ann. Statist., 793–815.
Dümbgen, L., Samworth, R. J. and Schuhmacher, D. (2013) Stochastic search for semiparametric linear regression models. In From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner (eds M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii and M. H. Maathuis), pp. 78–90.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. Roy. Statist. Soc. Ser. B, 849–911.
Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res., 2013–2038.
Friedman, J. H. and Hall, P. (2007) On bagging and nonlinear estimation. J. Statist. Plann. Inf., 669–683.
Ghosh, A. K. and Hall, P. (2008) On error-rate estimation in nonparametric classification. Statistica Sinica, 1081–1100.
Hall, P. and Kang, K.-H. (2005) Bandwidth choice for nonparametric classification. Ann. Statist., 284–306.
Hall, P. and Li, K.-C. (1993) On almost linearity of low dimensional projections from high dimensional data. Ann. Statist., 867–889.
Hall, P., Marron, J. S. and Neeman, A. (2005) Geometric representation of high dimension, low sample size data. J. Roy. Statist. Soc. Ser. B, 427–444.
Hall, P. and Miller, H. (2009a) Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist., 533–550.
Hall, P. and Miller, H. (2009b) Using the bootstrap to quantify the authority of an empirical ranking. Ann. Statist., 3929–3959.
Hall, P., Park, B. U. and Samworth, R. J. (2008) Choice of neighbor order in nearest-neighbor classification. Ann. Statist., 2135–2152.
Hall, P. and Pham, T. (2010) Optimal properties of centroid-based classifiers for very high-dimensional data. Ann. Statist., 1071–1093.
Hall, P. and Samworth, R. J. (2005) Properties of bagged nearest neighbour classifiers. J. Roy. Statist. Soc. Ser. B, 363–379.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009a) Tilting methods for assessing the influence of components in a classifier. J. Roy. Statist. Soc. Ser. B, 783–803.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009b) Median-based classifiers for high-dimensional data. J. Amer. Statist. Assoc., 1597–1608.
Hall, P., Xia, Y. and Xue, J.-H. (2013) Simple tiered classifiers. Biometrika, 431–445.
Hall, P. and Xue, J.-H. (2010) Incorporating prior probabilities into high-dimensional classifiers. Biometrika, 31–48.
Li, K.-C. (1991) Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 316–327.
Li, R., Zhong, W. and Zhu, L. (2012) Feature screening via distance correlation learning. J. Amer. Statist. Assoc., 1129–1139.
Marron, J. S., Todd, M. J. and Ahn, J. (2007) Distance-weighted discrimination. J. Amer. Statist. Assoc., 1267–1271.
Meinshausen, N. and Bühlmann, P. (2010) Stability selection (with discussion). J. Roy. Statist. Soc. Ser. B, 417–473.
Samworth, R. (2005) Small confidence sets for the mean of a spherically symmetric distribution. J. Roy. Statist. Soc. Ser. B, 343–361.
Samworth, R. J. (2012) Optimal weighted nearest neighbour classifiers. Ann. Statist., 2733–2763.
Shah, R. D. and Samworth, R. J. (2013) Variable selection with error control: another look at stability selection. J. Roy. Statist. Soc. Ser. B, 55–80.

Statistical Laboratory
Wilberforce Road
Cambridge CB3 0WB
United Kingdom
E-mail: [email protected]