Submitted to the Annals of Statistics

PETER HALL’S WORK ON HIGH-DIMENSIONAL DATA AND CLASSIFICATION
By Richard J. Samworth, University of Cambridge

In this article, I summarise Peter Hall’s contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him.

∗ The research of Richard J. Samworth was supported by an EPSRC Early Career Fellowship and a Philip Leverhulme Prize.
1. High-dimensional data.
Peter Hall wrote many influential works on high-dimensional data, though notably he largely eschewed the notions of sparsity and penalised likelihood that have become so popular in recent years. Nevertheless, he was interested in variable selection, and wrote several papers that involved ranking variables in some way. Perhaps his most well-known papers in this area, though, concern geometrical representations of high-dimensional data.

1.1. Geometric representations of high-dimensional data.
Hall and Li (1993) was one of the pioneering works in the early days of high-dimensional data analysis that tried to understand the properties of low-dimensional projections of a high-dimensional isotropic random vector X in R^p. As motivation, let γ ∈ R^p have ‖γ‖ = 1 and suppose that

(1)  for every b ∈ R^p there exist α_b, β_b ∈ R such that E(b^T X | γ^T X = t) = α_b t + β_b.

This condition says that the regression function of b^T X on γ^T X is linear. Then, using the isotropy of X,

0 = E(b^T X) = E{E(b^T X | γ^T X)} = E(α_b γ^T X + β_b) = β_b.

Moreover,

b^T γ = Cov(b^T X, γ^T X) = E{E(b^T X X^T γ | γ^T X)} = α_b γ^T E(X X^T) γ = α_b,

and we conclude that E(X | γ^T X = t) = tγ, or equivalently,

(2)  ‖E(X | γ^T X = t)‖² − t² = 0.
The left-hand side of (2) is always non-negative (indeed, since γ^T E(X | γ^T X = t) = E(γ^T X | γ^T X = t) = t, we have ‖E(X | γ^T X = t)‖ ≥ |t|), so it can be used as a measure of the extent to which the condition (1) holds. Remarkably, under very mild conditions on the distribution of X, Hall and Li (1993) proved that if γ is drawn from the uniform distribution on the unit Euclidean sphere in R^p, then

‖E(X | γ, γ^T X = t)‖² − t² → 0 in probability as p → ∞.

This is equivalent to the statement

sup_{b ∈ R^p : ‖b‖ = 1, b^T γ = 0} |E(b^T X | γ, γ^T X = t)| → 0 in probability as p → ∞.

See also Diaconis and Freedman (1984), who showed that under mild conditions, most low-dimensional projections of high-dimensional data are nearly normal. Of course, when X has a spherically symmetric distribution, (1) holds for every γ ∈ R^p with ‖γ‖ = 1. But the result of this paper shows that even without spherical symmetry, there is a good chance (in the sense of random draws of γ as described above) that (1) holds, at least approximately, when p is large. An important statistical consequence of this is that even if the relationship between a response Y and a high-dimensional predictor is non-linear, say Y = g(γ^T X, ε) for some unknown link function g and error ε, standard linear regression procedures can often be expected to yield an approximately correct estimate of γ up to a constant of proportionality. The generalisation of this result that replaces γ^T X with Γ^T X, where Γ is a random p × k matrix with orthonormal columns, also plays an important role in justifying the use of sliced inverse regression for dimension reduction (Li, 1991).

Another seminal paper that articulated many of the key geometrical properties of high-dimensional data is Hall, Marron and Neeman (2005). This paper begins with the simple, yet remarkable, observation that if Z ∼ N_p(0, I), then ‖Z‖ = p^{1/2} + O_p(1) as p → ∞. Thus, data drawn from this distribution tend to lie near the boundary of a large ball. Similarly, pairs of points lie almost a deterministic distance apart, and the observations tend to be almost orthogonal. In fact, the authors go on to explain that, under much weaker assumptions than Gaussianity, the data lie approximately on the vertices of a regular simplex, and that the stochasticity in the data essentially appears as a random rotation of this simplex. As well as clarifying the relationship between Support Vector Machines (e.g. Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination classifiers (Marron, Todd and Ahn, 2007) in high dimensions, the paper forced researchers to rewire their intuition about high-dimensional data, and precipitated a flood of subsequent papers on high-dimensional asymptotics.
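These geometric properties are easy to observe numerically. Below is a minimal Python sketch (my own illustration, not code from Hall, Marron and Neeman, 2005; the values of n and p are arbitrary choices) showing that norms concentrate around p^{1/2}, pairwise distances around (2p)^{1/2}, and pairwise angles around 90 degrees:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 10_000

# n independent draws from N_p(0, I)
Z = rng.standard_normal((n, p))

# Norms concentrate: ||Z_i|| = p^{1/2} + O_p(1)
norms = np.linalg.norm(Z, axis=1)
print(np.sqrt(p), norms.min(), norms.max())        # all close to 100

# Pairs of points lie almost a deterministic distance (2p)^{1/2} apart
i, j = np.triu_indices(n, k=1)
dists = np.linalg.norm(Z[i] - Z[j], axis=1)
print(np.sqrt(2 * p), dists.min(), dists.max())    # all close to 141.4

# Observations are nearly orthogonal: cosines of order p^{-1/2}
cosines = (Z[i] * Z[j]).sum(axis=1) / (norms[i] * norms[j])
print(np.abs(cosines).max())                       # small
```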
1.2. Variable selection and ranking.

The last 15 years or so have seen variable selection emerge as one of the most prominently studied topics in Statistics. Although Peter’s instinct was to think nonparametrically, he realised that he could contribute to a prominent line of research in the variable selection literature, namely marginal screening (e.g. Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Li, Zhong and Zhu, 2012), via the deep understanding he developed for rankings. Hall and Miller (2009a) defined variable rankings through their generalised correlation with a response, while Delaigle and Hall (2012) studied variable transformations prior to ranking based on correlation as a method for dealing with heavy-tailed data. For classification, Hall, Titterington and Xue (2009a) proposed a cross-validation based criterion for assessing variable importance, while in the unsupervised setting, Chan and Hall (2010) suggested ranking the importance of variables for clustering based on nonparametric tests of modality.

These works were underpinned by Peter’s realisation that perhaps his favourite tool of all, namely the bootstrap, could be used to quantify the authority of a ranking (Hall and Miller, 2009b). In fact, there are some subtle issues here, particularly surrounding ties. Peter developed an ingenious method for proving that even though the standard n-out-of-n bootstrap does not handle this issue well, the m-out-of-n bootstrap overcomes it in an elegant way.
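To make the idea concrete, here is a small Python sketch of an m-out-of-n bootstrap for the authority of a correlation-based ranking. This is my own illustration rather than Hall and Miller's implementation; the function name and the use of absolute Pearson correlation as the ranking criterion are assumptions made purely for the example.

```python
import numpy as np

def ranking_intervals(X, y, m, B=500, seed=0):
    """m-out-of-n bootstrap intervals for the rank of each variable,
    where variables are ranked by |correlation| with the response y.
    A sketch of the general idea only, not Hall and Miller's method."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ranks = np.empty((B, p), dtype=int)
    for b in range(B):
        idx = rng.choice(n, size=m, replace=True)   # resample of size m < n
        corr = np.abs([np.corrcoef(X[idx, j], y[idx])[0, 1] for j in range(p)])
        order = np.argsort(-corr)                   # rank 1 = largest |corr|
        r = np.empty(p, dtype=int)
        r[order] = np.arange(1, p + 1)
        ranks[b] = r
    lo, hi = np.percentile(ranks, [5, 95], axis=0)  # pointwise 90% intervals
    return lo, hi
```

Wide intervals flag variables whose positions in the ranking carry little authority; as discussed above, the naive choice m = n does not handle ties well, which is what motivates the m-out-of-n version.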
2. Classification problems.
I believe that Peter may have become interested in classification problems in the early 2000s at least partly through ideas of bootstrap aggregating, or bagging (Breiman, 1996). Indeed, in Friedman and Hall (2007), a preprint of which was already available in early 2000, Peter had attempted to understand the effect of bagging in M-estimation problems. This is a typical example of Peter’s extraordinary ability to explain empirically observed effects through asymptotic expansions. One of the other interesting contributions of this work is the observation that subsampling (i.e. sampling without replacement) half of the observations closely mimics ordinary n-out-of-n bootstrap sampling, a very useful fact that has been observed and exploited in several other contexts, including stability selection for choosing variables in high-dimensional inference (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013) and stochastic search methods for semiparametric regression (Dümbgen, Samworth and Schuhmacher, 2013).

Classification problems are ideally suited to bagging, because the discrete nature of the response variable means that small changes to the training data can often yield different outputs from a classifier; in the terminology of Breiman (1996), many classifiers are ‘unstable’. Suppose we are given training data X := {(X_1, Y_1), . . . , (X_n, Y_n)}, where each X_i is a covariate taking values in a general normed space B, and Y_i is a response taking values in {−1, 1}. Assume further that we have access to a classifier Ĉ_n(·) = Ĉ_n(·; X) constructed from the training data, so that x ∈ B is assigned to class Ĉ_n(x; X). To form the bagged version Ĉ*_n of the classifier, we draw B bootstrap resamples {X*_b : b = 1, . . . , B} from X, and set

Ĉ*_n(x) := sgn( (1/B) Σ_{b=1}^{B} Ĉ_n(x; X*_b) ).

Peter got me interested in bagging nearest neighbour classifiers. Ironically, the nearest neighbour classifier had been described by Breiman as stable, since the nearest neighbour appears in more than half (in fact, around 1 − (1 − 1/n)^n ≈ 1 − e^{−1}) of the bootstrap resamples; thus the bagged nearest neighbour classifier is typically identical to the unbagged version. In Hall and Samworth (2005), however, we studied the effect of drawing resamples (either with or without replacement) of smaller size m. Naturally, this reduces the probability of including the nearest neighbour in the resample, and the bagged classifier is now well approximated by a weighted nearest neighbour classifier with geometrically decaying weights; see also Biau and Devroye (2010). In order for bagging to yield any asymptotic improvement over the basic nearest neighbour classifier, we require m/n < 1/2 (when sampling without replacement) or m/n < log 2 (when sampling with replacement); in order to converge to the theoretically-optimal Bayes classifier, we require m = m_n → ∞ but m/n → 0.
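In this setting the bagged 1-nearest neighbour classifier is straightforward to write down. The sketch below is my own illustration in the spirit of Hall and Samworth (2005), not the paper's code; it takes B = R^d with labels in {−1, 1}, and the helper names are mine.

```python
import numpy as np

def nn_classify(x, X, y):
    """1-nearest neighbour classifier: return the label of the closest
    training point to x."""
    return y[np.argmin(np.linalg.norm(X - x, axis=1))]

def bagged_nn_classify(x, X, y, m, B=200, replace=False, seed=0):
    """Bagged 1-NN using B resamples of size m (with or without
    replacement), aggregated by the sign of the average vote."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = 0
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=replace)
        votes += nn_classify(x, X[idx], y[idx])
    return 1 if votes >= 0 else -1   # labels are -1 or +1; ties go to +1
```

With m = n and replace=True, the nearest neighbour of x appears in around 1 − e^{−1} ≈ 63% of resamples, so this almost always reproduces the plain 1-NN answer; taking m much smaller than n yields the weighted nearest neighbour approximation described above.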
Peter also studied bandwidth choice for nonparametric classification in Hall and Kang (2005), in the case B = R^d, where the classifier is built from kernel estimates of the class conditional densities. A particularly curious discovery he made there is that even in the simplest case where d = 1 and where the class conditional densities f and g cross only at the single point x, the rate of convergence and order of the asymptotically optimal bandwidth depend on the sign of f″(x)g″(x). In Hall, Park and Samworth (2008), we considered similar problems in the context of k-nearest neighbour classification, obtaining an asymptotic expansion for the regret (i.e. the difference between the risk of the k-nearest neighbour classifier and that of the Bayes classifier), which implied that the usual nonparametric error rate of order n^{−4/(d+4)} was attainable with k chosen to be of order n^{4/(d+4)}. The form of the expansion made me realise that the limiting ratio of the regrets of the bagged nearest neighbour classifier and the k-nearest neighbour classifier (with both the resample size m and the number of neighbours k chosen optimally) depended only on d, and not on the underlying distributions. To my great surprise, this limiting ratio was greater than 1 when d = 1, equal to 1 when d = 2 and less than 1 for d ≥ 3 (and decreasing in d). It took me some years to explain this phenomenon in terms of the optimal weighting scheme (Samworth, 2012).
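For concreteness, the optimal order of k from this expansion is easy to use in practice, up to an unknown constant. The following short sketch (my own; the proportionality constant 1 is an arbitrary placeholder, since the optimal constant depends on the underlying distributions) sets k for a k-nearest neighbour majority vote:

```python
import numpy as np

def knn_classify(x, X, y, k):
    """k-nearest neighbour classifier by majority vote, labels in {-1, +1}."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return 1 if y[nearest].sum() >= 0 else -1

# k of the optimal order n^{4/(d+4)} from the regret expansion
n, d = 10_000, 2
k = int(round(n ** (4 / (d + 4))))   # about 464 when n = 10,000 and d = 2
```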
In more recent years, Peter turned his attention to a wealth of other important, though perhaps less well studied, issues in classification. Some of these were motivated by what he saw as drawbacks of existing classifiers. For instance, in Hall, Titterington and Xue (2009b), he developed classifiers based on componentwise medians, to alleviate the difficulties of both computing and interpreting multivariate medians; such methods can be highly effective for high-dimensional data that may have heavy tails. In Chan and Hall (2009a), he studied robust versions of nearest neighbour classifiers for high-dimensional data that try to perform an initial variable selection step to reduce variability. Chan and Hall (2009b) presented simple scale adjustments to make distance-based classifiers (primarily designed to detect location differences) less sensitive to scale variation between populations; see also Hall and Pham (2010). Hall and Xue (2010) and Hall, Xia and Xue (2013) concerned settings where one might want to incorporate the prior probabilities into a classifier, and where these prior probabilities may be significantly different from 1/2, respectively. Finally, Ghosh and Hall (2008) discovered the phenomenon that estimating the risk of a classifier, and estimating the tuning parameters to minimise that risk, are two rather different problems, requiring the use of different methodologies.
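The distinction can be felt even in a toy setting: leave-one-out cross-validation gives a nearly unbiased estimate of the risk of the k-nearest neighbour classifier for each fixed k, yet the resulting curve in k is jagged, so its minimiser is a noisy choice of tuning parameter. The sketch below is my own loose illustration of this point (Ghosh and Hall's methodology is more refined than this):

```python
import numpy as np

def loo_knn_error(X, y, k):
    """Leave-one-out cross-validation error of the k-NN classifier."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)      # a point is never its own neighbour
    errors = 0
    for i in range(len(y)):
        nearest = np.argsort(D[i])[:k]
        pred = 1 if y[nearest].sum() >= 0 else -1
        errors += (pred != y[i])
    return errors / len(y)

# The curve k -> loo_knn_error(X, y, k) estimates the risk well pointwise,
# but smoothing it before minimising can select k better, even though the
# smoothed values may be worse as estimates of the risk itself.
```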
3. Some personal reflections.
I first met Peter as a PhD student when he visited Cambridge in 2002. I spent an hour or so discussing a problem I was working on that involved using ideas of James–Stein estimation to find small confidence sets for the location parameter of a spherically symmetric distribution (Samworth, 2005). I was blown away at the speed with which he was able to understand where my difficulties lay, and make helpful suggestions. Shortly afterwards, he invited me to spend six weeks at the Australian National University in Canberra in July–August 2003. I arrived utterly exhausted after nearly 24 hours in the air, but Peter was full of energy when he kindly picked me up from the bus station. Almost the first thing he said to me was: ‘I’ve got a problem I thought we could think about...’, and he proceeded to take out a pen and pad of paper; one couldn’t help but be drawn along by his enthusiasm for research.
Everything with Peter happened at breakneck speed, whether it was dashing around the supermarket, a driving tour through the rural Australian Capital Territory or, of course, writing papers. Many of his collaborators will have experienced discussing a problem with Peter one evening and returning to the office the following morning to find that he had typed up a draft manuscript that would form the basis of a joint paper. His prose was always elegant, and he had a wonderful ability to see his way through technical asymptotic arguments, aided by an almost physicist-like intuition for what ought to be true.
Fig 1. Peter with Juhyun Park (Lancaster University), the author and Nick Bingham (Imperial College London) on a blustery day in rural Australian Capital Territory in 2003.
One of my favourite Peter stories, which I initially heard second-hand but which he later confirmed was true, concerned a time when he’d been asked to teach an elementary Statistics course to students with really very little quantitative background. Realising that he’d lost some of the students along the way, and in order not to ruin their grades, Peter had a cunning idea and spent the last class before the final going through the problems that he’d set on the exam. To his horror, however, the students still flunked the exam. When Peter bumped into one of the students and asked in bemusement ‘What happened? I went through the questions in the last class’, the student
replied ‘Yes, but you did them in a different order’!

Peter had seemingly boundless energy and capacity to work, but he was also a very gentle individual in many ways. He was extraordinarily generous to others, particularly junior researchers, for whom he did so much. He was a remarkable person and I miss him very deeply.

References.
Biau, G. and Devroye, L. (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J. Mult. Anal., 2499–2518.
Breiman, L. (1996) Bagging predictors. Mach. Learn., 123–140.
Chan, Y.-B. and Hall, P. (2009a) Robust nearest-neighbor methods for classifying high-dimensional data. Ann. Statist., 3186–3203.
Chan, Y.-B. and Hall, P. (2009b) Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 469–478.
Chan, Y.-B. and Hall, P. (2010) Using evidence of mixed populations to select variables for clustering very high dimensional data. J. Amer. Statist. Assoc., 798–809.
Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
Delaigle, A. and Hall, P. (2012) Effect of heavy tails on ultra high dimensional variable ranking methods. Statistica Sinica, 909–932.
Diaconis, P. and Freedman, D. (1984) Asymptotics of graphical projection pursuit. Ann. Statist., 793–815.
Dümbgen, L., Samworth, R. J. and Schuhmacher, D. (2013) Stochastic search for semiparametric linear regression models. In From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner (eds M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii and M. H. Maathuis), pp. 78–90.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. Roy. Statist. Soc. Ser. B, 849–911.
Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res., 2013–2038.
Friedman, J. H. and Hall, P. (2007) On bagging and nonlinear estimation. J. Statist. Plann. Inf., 669–683.
Ghosh, A. K. and Hall, P. (2008) On error-rate estimation in nonparametric classification. Statistica Sinica, 1081–1100.
Hall, P. and Kang, K.-H. (2005) Bandwidth choice for nonparametric classification. Ann. Statist., 284–306.
Hall, P. and Li, K.-C. (1993) On almost linearity of low dimensional projections from high dimensional data. Ann. Statist., 867–889.
Hall, P., Marron, J. S. and Neeman, A. (2005) Geometric representation of high dimension, low sample size data. J. Roy. Statist. Soc. Ser. B, 427–444.
Hall, P. and Miller, H. (2009a) Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist., 533–550.
Hall, P. and Miller, H. (2009b) Using the bootstrap to quantify the authority of an empirical ranking. Ann. Statist., 3929–3959.
Hall, P., Park, B. U. and Samworth, R. J. (2008) Choice of neighbor order in nearest-neighbor classification. Ann. Statist., 2135–2152.
Hall, P. and Pham, T. (2010) Optimal properties of centroid-based classifiers for very high-dimensional data. Ann. Statist., 1071–1093.
Hall, P. and Samworth, R. J. (2005) Properties of bagged nearest neighbour classifiers. J. Roy. Statist. Soc. Ser. B, 363–379.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009a) Tilting methods for assessing the influence of components in a classifier. J. Roy. Statist. Soc. Ser. B, 783–803.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009b) Median-based classifiers for high-dimensional data. J. Amer. Statist. Assoc., 1597–1608.
Hall, P., Xia, Y. and Xue, J.-H. (2013) Simple tiered classifiers. Biometrika, 431–445.
Hall, P. and Xue, J.-H. (2010) Incorporating prior probabilities into high-dimensional classifiers. Biometrika, 31–48.
Li, K.-C. (1991) Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 316–327.
Li, R., Zhong, W. and Zhu, L. (2012) Feature screening via distance correlation learning. J. Amer. Statist. Assoc., 1129–1139.
Marron, J. S., Todd, M. J. and Ahn, J. (2007) Distance-weighted discrimination. J. Amer. Statist. Assoc., 1267–1271.
Meinshausen, N. and Bühlmann, P. (2010) Stability selection (with discussion). J. Roy. Statist. Soc. Ser. B, 417–473.
Samworth, R. (2005) Small confidence sets for the mean of a spherically symmetric distribution. J. Roy. Statist. Soc. Ser. B, 343–361.
Samworth, R. J. (2012) Optimal weighted nearest neighbour classifiers. Ann. Statist., 2733–2763.
Shah, R. D. and Samworth, R. J. (2013) Variable selection with error control: another look at stability selection. J. Roy. Statist. Soc. Ser. B, 55–80.

Statistical Laboratory
Wilberforce Road
Cambridge CB3 0WB
United Kingdom
E-mail: [email protected]