Generalized Forward Sufficient Dimension Reduction for Categorical and Ordinal Responses
Harris Quach † The Pennsylvania State University, University Park, USA.
Bing Li
The Pennsylvania State University, University Park, USA.
Summary. We present a forward sufficient dimension reduction method for categorical or ordinal responses by extending the outer product of gradients and minimum average variance estimator to the multinomial generalized linear model. Previous work in this direction extends forward regression to binary responses and is applied in a pairwise manner to multinomial data, which is less efficient than our approach. Like other forward regression-based sufficient dimension reduction methods, our approach avoids the relatively stringent distributional requirements necessary for inverse regression alternatives. We show consistency of our proposed estimator and derive its convergence rate. We develop an algorithm for our methods based on repeated applications of available algorithms for forward regression. We also propose a clustering-based tuning procedure to estimate the tuning parameters. The effectiveness of our estimator and related algorithms is demonstrated via simulations and applications.
Keywords: Classification, Bandwidth Selection
1. Introduction
The Outer Product of Gradients (OPG) and the Minimum Average Variance Estimator (MAVE) (Xia et al., 2002) are two popular methods for sufficient dimension reduction (SDR) (Li, 1991; Cook and Weisberg, 1991; Li, 2018) because of the weak distributional assumptions they require, as well as their efficient and stable performance. However, both OPG and MAVE are based on the derivative of the response variable, and therefore are inappropriate for categorical responses, which limits their application. The purpose of this paper is to extend these methods to categorical and ordinal categorical responses by imposing a multivariate link function on the conditional mean of the response, in a generalized linear model.

Let $Y$ denote the response variable and $X$ a $p$-dimensional predictor. Sufficient dimension reduction (SDR) estimates a lower dimensional function of $X$ that retains all relevant information about $Y$ that is available in $X$; that is, we seek to estimate a linear function of $X$, represented by $\beta^{\top} X$, where $\beta \in \mathbb{R}^{p \times d}$ with $d < p$, such that
$$Y \perp\!\!\!\perp X \mid \beta^{\top} X. \qquad (1)$$

† Address for correspondence: [email protected]
Since relation (1) determines $\beta$ only up to its column space, the parameter of interest is the subspace $\mathrm{span}(\beta) = \mathcal{S}(\beta)$. For any $\beta$ such that (1) holds, the corresponding $\mathcal{S}(\beta)$ is referred to as an SDR subspace, and the columns of $\beta$ are referred to as SDR directions. While the existence of some SDR subspace follows from taking $\beta$ as the identity, the existence of a minimal SDR subspace requires some mild conditions that are given in generality by Yin et al. (2008). We will assume existence of the minimal SDR subspace, referred to as the Central Subspace (CS) and denoted by $\mathcal{S}_{Y|X}$. The subspace $\mathcal{S}_{Y|X}$ is minimal in the sense that, for any $\beta$ such that (1) holds, $\mathcal{S}_{Y|X} \leq \mathcal{S}(\beta)$, where the inequality denotes subspace containment. The goal of SDR, in the context of (1), is to find $\beta$ such that $\mathrm{span}(\beta) = \mathcal{S}_{Y|X}$.

However, in many statistical problems, the regression function between $Y$ and $X$, $E(Y|X)$, is the relationship of interest. To preserve the regression function through a low-dimensional linear transformation of the predictor, SDR finds a matrix $\gamma$ such that
$$Y \perp\!\!\!\perp E(Y|X) \mid \gamma^{\top} X. \qquad (2)$$
This was studied by Cook et al. (2002), who showed that (2) is equivalent to $E(Y|X) = E(Y|\gamma^{\top} X)$. The minimal SDR subspace such that (2) holds, referred to as the Central Mean Subspace (CMS), was introduced by Cook et al. (2002), and is denoted by $\mathcal{S}_{E(Y|X)}$. The CMS for $E(Y|X)$ is analogous to the CS for $Y|X$, and so $\mathcal{S}_{E(Y|X)}$ exists under mild conditions as well. We will assume that $\mathcal{S}_{E(Y|X)}$ exists, and the aim of SDR for $E(Y|X)$ is to find $\gamma$ such that $\mathrm{span}(\gamma) = \mathcal{S}_{E(Y|X)}$.

Among the SDR methods for estimating $\mathcal{S}_{E(Y|X)}$, the Outer Product of Gradients (OPG) and the Minimum Average Variance Estimator (MAVE), both proposed by Xia et al. (2002), have maintained wide popularity. But OPG and MAVE are only applicable to continuous scalar responses, and subsequent extensions of MAVE have largely focused on the continuous scalar $Y$ setting (Xia, 2007; Wang and Xia, 2008). These methods are referred to as forward regression approaches for SDR since they rely on regressing $Y$ onto $X$. This is in contrast to inverse regression methods that rely on regressing $X$ onto $Y$. The quintessential inverse method is the Sliced Inverse Regression (SIR) originally introduced by Li (1991).

Our main contribution extends OPG and MAVE to multinomial responses, which include categorical and ordinal variables. We fit a local linear deviance criterion, as in Adragni (2018), but we employ a multivariate link that enables multi-label classification. In binary classification, our multivariate extension of MAVE coincides with Adragni's MADE. We also introduce a clustering-oriented tuning procedure for SDR in classification problems, which avoids the need to select a prediction method for conventional cross-validation.

The rest of the paper is organized as follows: section 2 provides the population-level theoretical motivation for our method. In section 3, we introduce the Outer Product of Canonical Gradients (OPCG) and provide its consistency and convergence rate. We also give our formulation of the
Minimum Average Deviance Estimator (MADE), discuss refinements to OPCG and MADE, and give methods for estimating the dimension $d$. In section 4, we provide the multivariate multinomial and ordinal link functions necessary for our estimator. In addition, we introduce our clustering procedure for tuning SDR methods in classification problems. In section 5, we provide a simulation and some applications to real data sets. We conclude in section 6 with a discussion. All proofs can be found in the online supplementary materials.
2. Population-Level Foundations
Suppose $X \in \Omega_X \subset \mathbb{R}^p$ is a continuous random vector and let $\tilde{Y} \in \{1, \ldots, m\}$ be a class label, where $m$ is the number of classes being considered. Then we can associate $\tilde{Y}$ with a discrete vector $Y = (Y_1, \ldots, Y_m) \in \Omega_Y \subset \{0,1\}^m$. The structure of this correspondence depends on whether we view the label as categorical or ordinal. But we can regard both scenarios as special cases of $Y \sim \mathrm{multinomial}(k, p)$, where $k$ is the number of trials and $p \in [0,1]^m$ is a vector of probabilities such that $\sum_{j=1}^{m} p_j = 1$. For our theoretical developments, the specification of categorical or ordinal $Y$ is not important. Let $Y \in \Omega_Y$ be any discrete random vector for which we may specify a linear exponential family.

Suppose $X \in \Omega_X$ and $Y \in \Omega_Y$ are such that relation (1) holds. Since $Y$ is a discrete random vector, its conditional expectation given $X = x \in \Omega_X$ is
$$E(Y \mid X = x) = p(x) \in [0,1]^m,$$
where $p(x)$ is a vector of conditional probabilities in bijection with the conditional probability mass function of $Y \mid X = x$. Since the conditional p.m.f. uniquely determines the distribution of $Y \mid X$, this implies that $Y$ depends on $X$ solely through its conditional mean. As a result, $\mathcal{S}_{E(Y|X)}$ coincides with $\mathcal{S}_{Y|X}$, as observed in Cook et al. (2002).

Proposition 2.1.
If a random element $Y$ depends on a random element $X$ only through the conditional mean $E(Y|X)$, i.e. $Y \perp\!\!\!\perp X \mid E(Y|X)$, then the Central Mean Subspace coincides with the Central Subspace, i.e. $\mathcal{S}_{E(Y|X)} = \mathcal{S}_{Y|X}$.

Therefore, we can appeal to methods for estimating $\mathcal{S}_{E(Y|X)}$ in order to estimate $\mathcal{S}_{Y|X}$. The advantage of estimators for $\mathcal{S}_{E(Y|X)}$ is that they avoid the Linear Conditional Mean assumption required for many inverse regression estimators of $\mathcal{S}_{Y|X}$ (Li, 1991; Cook and Weisberg, 1991; Li and Wang, 2007). Existing methods that avoid this linearity assumption for estimating $\mathcal{S}_{Y|X}$ are extensions of methods for $\mathcal{S}_{E(Y|X)}$ with continuous scalar $Y$ (Xia, 2007; Wang and Xia, 2008). Our contribution addresses this need for estimating $\mathcal{S}_{E(Y|X)}$ with discrete $Y$, which coincides with estimating $\mathcal{S}_{Y|X}$ by Proposition 2.1.

For a continuous scalar response $Y$, the OPG and MAVE estimators introduced by Xia et al. (2002) are consistent for $\mathcal{S}_{E(Y|X)}$. In the SDR literature, an estimator is unbiased for $\mathcal{S}_{E(Y|X)}$ if it is a member of $\mathcal{S}_{E(Y|X)}$, and is exhaustive if its span generates $\mathcal{S}_{E(Y|X)}$. In OPG, a local linear regression model is used to compute the gradient of the conditional mean, $\partial E(Y \mid X = x)/\partial x$. Li (2018) has shown that the gradient of the conditional mean is unbiased for $\mathcal{S}_{E(Y|X)}$ and, under some conditions, its expected outer product is exhaustive. On the other hand, MAVE re-parametrises the local linear regression model in order to estimate $\beta$ directly. The explicit estimation of $\beta$ contributes to the improved efficiency of MAVE over OPG (Xia, 2006, 2007).

Given the effectiveness of OPG and MAVE, we aim to extend them to multivariate, discrete $Y$ by fitting a local linear generalized linear model. To motivate this approach, let $X = x \in \Omega_X$, $Y = y \in \Omega_Y$, and specify a linear exponential family for $(x, y)$ with negative log-likelihood
$$\ell\{\theta(x); y\} = -\theta(x)^{\top} y + b\{\theta(x)\}, \qquad (3)$$
where $\theta(x) \in \mathbb{R}^m$ is the canonical parameter and $b(\cdot)$ the associated cumulant generating function. Then $X$ relates to $Y$ only through the canonical parameter $\theta(x)$. Under identifiability of $\theta(x)$, the canonical parameter is a function of the conditional mean via the inverse canonical link, $\theta(x) = \dot{b}^{-1}\{\mu(x)\}$, where $\mu(x) = E(Y \mid X = x)$. Since (2) holds, we have $E(Y \mid x) = E(Y \mid \beta^{\top} x)$, which implies $\theta(x) = \tilde{\theta}(\beta^{\top} x)$ and $\partial\theta(x)^{\top}/\partial x = \beta\{\partial\tilde{\theta}(\beta^{\top} x)^{\top}/\partial u\}$, where $u = \beta^{\top} x$. Then the gradient of the canonical parameter is unbiased for $\mathcal{S}_{E(Y|X)}$ and its expected outer product is exhaustive under some conditions.

Proposition 2.2.
Suppose $X \in \Omega_X$ and $Y \in \Omega_Y$ are such that (2) holds, and specify the negative log-likelihood as in (3).

(a) Suppose the canonical parameter $\theta(\cdot): \Omega_X \to \mathbb{R}^m$ is identifiable and continuously differentiable. Then the gradient of the canonical parameter, $\partial\theta(x)^{\top}/\partial x \in \mathbb{R}^{p \times m}$, is unbiased for $\mathcal{S}_{E(Y|X)}$.

(b) In addition, if $u = \beta^{\top} x$ is convexly supported, then the expected outer product of the canonical gradient is exhaustive for $\mathcal{S}_{E(Y|X)}$, i.e.
$$\mathrm{span}\left[ E\left\{ \frac{\partial\theta(X)^{\top}}{\partial x}\, \frac{\partial\theta(X)}{\partial x^{\top}} \right\} \right] = \mathcal{S}_{E(Y|X)}.$$

Proposition 2.2 implies that the $d$ leading eigenvectors of
$$\Lambda = E\left[ \{\partial\theta(X)^{\top}/\partial x\}\{\partial\theta(X)/\partial x^{\top}\} \right],$$
denoted by $\eta \in \mathbb{R}^{p \times d}$, span $\mathcal{S}_{E(Y|X)}$; i.e. $\mathrm{span}(\eta) = \mathcal{S}_{E(Y|X)} = \mathrm{span}(\beta)$. This result is analogous to OPG and motivates our approach for estimating $\mathcal{S}_{E(Y|X)}$. Proposition 2.2 also differentiates our method from GSIM (Lambert-Lacroix and Peyre, 2006), which likewise aims to generalize OPG for classification. GSIM targets the gradient of the conditional mean in order to estimate $\beta$ directly, and requires specifying the link function for the GLM. According to Proposition 2.2, the conditional mean is not necessary for SDR, and we can focus on the canonical parameter instead. By forgoing the conditional mean, our estimation procedure in section 3 is simpler relative to GSIM.

At the population level, we can estimate the canonical gradient at $x \in \Omega_X$ by minimizing the following expected risk function:
$$R_n(a, B) = E\left( K\{(X - x)/h_n\}\left[ -\{a + B^{\top}(X - x)\}^{\top} Y + b\{a + B^{\top}(X - x)\} \right] \right), \qquad (4)$$
where $K\{(X - x)/h_n\}$ is a kernel with bandwidth $h_n$ that depends on $n$. We make the following assumptions for the minima of (4) to be Fisher (i.e. population-level) consistent.

Assumption 1.
The joint density $f(x, y)$ of $(X, Y)$ is twice continuously differentiable with respect to $x$. The sample space $\Omega_X \subset \mathbb{R}^p$ is compact. The random vector $Y$ has finite third moments.
The canonical parameter $\theta$ is identifiable and the negative log-likelihood $\ell(\theta; y) = -\theta^{\top} y + b(\theta)$ is strictly convex in $\theta$, twice continuously differentiable in $\theta$, and has a unique minimum. Furthermore, for each $x \in \Omega_X$, the conditional risk $R(\theta, x) = E\{-\theta^{\top} Y + b(\theta) \mid X = x\}$ is twice continuously differentiable in $\theta$ with unique minimizer $\theta(x)$.
The canonical parameter $\theta(\cdot): \Omega_X \to \Theta \subset \mathbb{R}^m$ is a continuously differentiable function of $x$. The parameter space $\Theta \subset \mathbb{R}^m$ is convex. The parameters $a \in \mathbb{R}^m$ and $B \in \mathbb{R}^{p \times m}$ are identifiable and $(a, B)$ lie in a compact and convex parameter space.
Derivatives and integrals in
$$\frac{\partial}{\partial\theta} \int\!\!\int \ell(\theta; y)\, f(x, y)\, dy\, dx \qquad \text{and} \qquad \frac{\partial}{\partial x} \int \frac{\partial\ell(\theta; y)}{\partial\theta}\, f(y \mid x)\, dy$$
are interchangeable.
The kernel $K$ is symmetric with finite moments. Furthermore, $\int K(u)\, du = 1$, $\int K(u)\, u u^{\top}\, du = I_p$, and $K(u)$ is twice continuously differentiable.

Under these assumptions, the minimizers of (4), denoted by $a(h, x)$ and $B(h, x)$ respectively, are uniformly Fisher consistent for the canonical parameter and its gradient at $x \in \Omega_X$.

Theorem 2.1.
Under Assumptions 1-5, the minimizers $a(h, x)$ and $B(h, x)$ of (4) are Fisher consistent as $h \downarrow 0$, with convergence rates
$$\sup_{x \in \Omega_X} \| a(h, x) - \theta(x) \| = O(h), \qquad \sup_{x \in \Omega_X} \left\| B(h, x) - \frac{\partial\theta(x)^{\top}}{\partial x} \right\|_F = O(h),$$
where $\|\cdot\|$ is the Euclidean norm and $\|\cdot\|_F$ is the Frobenius norm.

Furthermore, by Theorem 2.1 and Proposition 2.2, the $d$ leading eigenvectors of $\Lambda(h) = E\{B(h, X)B(h, X)^{\top}\}$, denoted by $\eta(h) \in \mathbb{R}^{p \times d}$, are consistent for the eigenvectors $\eta$ of $\Lambda$, which span $\mathcal{S}_{E(Y|X)}$.

Theorem 2.2.
Under Assumptions 1-5, as $h \downarrow 0$, $\eta(h)$ is Fisher consistent for $\mathcal{S}_{E(Y|X)}$ with convergence rate
$$\| \eta(h)\eta(h)^{\top} - \eta\eta^{\top} \|_F = O(h).$$

We conclude this section by noting that, while we do not make use of $a(h, x)$ in this paper, these estimates of $\theta(x)$ can be used for prediction, as in Adragni (2018) and Lambert-Lacroix and Peyre (2006), where the latter also uses $a(h, x)$ to estimate $\beta$; their Fisher consistency is therefore desirable in these respects.
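As a concrete illustration of Proposition 2.2 and Theorem 2.2, consider the binary, single-index special case; this worked example is ours and only restates the results in the simplest setting. With $m = 2$ and $d = 1$, the linear exponential family (3) is the Bernoulli family with scalar canonical parameter $\theta(x) = \mathrm{logit}\{P(Y = 1 \mid X = x)\}$ and $b(\theta) = \log(1 + e^{\theta})$. If $E(Y \mid X) = E(Y \mid \beta^{\top} X)$ for some $\beta \in \mathbb{R}^p$, then $\theta(x) = \tilde{\theta}(\beta^{\top} x)$, and the chain rule gives
$$\frac{\partial\theta(x)}{\partial x} = \beta\,\dot{\tilde{\theta}}(\beta^{\top} x) \in \mathrm{span}(\beta), \qquad \Lambda = E\{\dot{\tilde{\theta}}(\beta^{\top} X)^2\}\,\beta\beta^{\top},$$
so every canonical gradient is unbiased for $\mathcal{S}_{E(Y|X)}$, and the eigenvector associated with the single nonzero eigenvalue of $\Lambda$ recovers $\mathrm{span}(\beta)$, provided $\dot{\tilde{\theta}}$ is not identically zero.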
3. Empirical Estimation of the Central Mean Subspace, $\mathcal{S}_{E(Y|X)}$

Suppose $X_n$ and $Y_n$ denote a random sample from $X$ and $Y$ respectively, where $X, Y$ satisfy (2) for some $\beta \in \mathbb{R}^{p \times d}$. Motivated by our population-level developments in section
2, we fit an empirical version of (4) to the sample. This corresponds to fitting a local linear GLM with full negative log-likelihood
$$\ell(a_1, \ldots, a_n, B_1, \ldots, B_n; Y_n, X_n) = \sum_{j=1}^{n} \ell_j(a_j, B_j; Y_n, X_n), \qquad (5)$$
where the negative local log-likelihood $\ell_j(a_j, B_j; Y_n, X_n)$ is
$$\ell_j(a_j, B_j; Y_n, X_n) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Y_i + b\{a_j + B_j^{\top}(X_i - X_j)\} \right], \qquad (6)$$
with $w_{ij} = K\{(X_i - X_j)/h\} / \sum_{i=1}^{n} K\{(X_i - X_j)/h\}$ being a normalized kernel weight. We will suppress the dependence of likelihoods on $X_n$ and $Y_n$, when convenient, for brevity in the remainder of this paper. Minimizing (5), we obtain estimates $\hat{B}_j$, for $j = 1, \ldots, n$. To show consistency of the estimation, we make an additional assumption on the bandwidth, $h$.

Assumption 6.
The bandwidth satisfies $h \propto n^{-\alpha}$ for $0 < \alpha \leq \cdots$, where $\cdots > (p+4)s/(s - \cdots)$ for $s > \cdots$.
Under Assumptions 1-6, for fixed $X_j \in \Omega_X$, the estimates $\hat{a}_j$ and $\hat{B}_j$ are uniformly consistent for the canonical parameter and its gradient at $X_j$, respectively, with convergence rates
$$\sup_{X_j \in \Omega_X} \| \hat{a}_j - \theta(X_j) \| = O_{a.s.}(h + \delta_{ph}), \qquad \sup_{X_j \in \Omega_X} \left\| \hat{B}_j - \frac{\partial\theta(X_j)^{\top}}{\partial x} \right\|_F = O_{a.s.}(h + h^{-1}\delta_{ph}),$$
where $\delta_{ph} = \sqrt{\log n/(n h^p)} \to 0$, $h \downarrow 0$, and $h^{-1}\delta_{ph} \to 0$ as $n \to \infty$.

Using these consistent estimates of the canonical gradients, we construct the average outer product of the $\hat{B}_j$ to obtain the candidate matrix for $\mathcal{S}_{E(Y|X)}$, $\hat{\Lambda}_n = n^{-1}\sum_{j=1}^{n} \hat{B}_j \hat{B}_j^{\top}$. The matrix $\hat{\Lambda}_n$ consistently estimates $\Lambda$, and the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $\hat{\Lambda}_n$, denoted by $\hat{\eta} \in \mathbb{R}^{p \times d}$, are consistent estimates for a basis that spans $\mathcal{S}_{E(Y|X)}$. We denote $\hat{\beta}_{\mathrm{opcg}} = \hat{\eta}$ as the Outer Product of Canonical Gradients (OPCG) estimator.
Theorem 3.2.
Under Assumptions 1-6, the Outer Product of Canonical Gradients estimator is consistent for a basis of the Central Mean Subspace, $\mathcal{S}_{E(Y|X)} = \mathrm{span}(\beta)$, and has convergence rate
$$\| \hat{\beta}_{\mathrm{opcg}} \hat{\beta}_{\mathrm{opcg}}^{\top} - \beta\beta^{\top} \|_F = O_{a.s.}(h + h^{-1}\delta_{ph} + \delta_n),$$
where $\delta_{ph} = \sqrt{\log n/(n h^p)} \to 0$, $\delta_n = \sqrt{\log n/n} \to 0$, $h \downarrow 0$, and $h^{-1}\delta_{ph} \to 0$ as $n \to \infty$.

Similar to the population-level scenario, we do not make use of the consistent estimates $\hat{a}_1, \ldots, \hat{a}_n$, but they can be used for prediction and estimation, as in Adragni (2018) and Lambert-Lacroix and Peyre (2006).

Since the canonical gradient satisfies the relation $\partial\theta(x)^{\top}/\partial x = \beta\{\partial\tilde{\theta}(\beta^{\top} x)^{\top}/\partial u\}$, we can re-parametrise the local linear GLM in (5) via $B_j \mapsto \beta\tilde{B}_j$, where $\tilde{B}_j \in \mathbb{R}^{d \times m}$, for $j = 1, \ldots, n$. For convenience, we overload the notation $B_j$ to mean $B_j \in \mathbb{R}^{p \times m}$ in OPCG and $B_j = \tilde{B}_j \in \mathbb{R}^{d \times m}$ in MADE. The re-parametrised full negative log-likelihood is now
$$\ell(a_1, \ldots, a_n, B_1, \ldots, B_n, \beta; Y_n, X_n) = \sum_{j=1}^{n} \ell_j(a_j, B_j, \beta; Y_n, X_n), \qquad (7)$$
with the local negative log-likelihood $\ell_j(a_j, B_j, \beta; Y_n, X_n)$ given by
$$\ell_j(a_j, B_j, \beta) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}\beta^{\top}(X_i - X_j)\}^{\top} Y_i + b\{a_j + B_j^{\top}\beta^{\top}(X_i - X_j)\} \right], \qquad (8)$$
where $w_{ij} = K\{(X_i - X_j)/h\}/\sum_{i=1}^{n} K\{(X_i - X_j)/h\}$ is a normalized kernel weight. Minimizing (7) iteratively between $a_1, \ldots, a_n, B_1, \ldots, B_n$ and $\beta$ until convergence, we obtain the estimate of $\beta$ directly, which we denote by $\hat{\beta}_{\mathrm{made}}$ and refer to as the Minimum Average Deviance Estimator (MADE) for $\mathcal{S}_{E(Y|X)}$. Since $\beta$ in (8) is identifiable only up to orthogonal transformations, the $\beta$ solutions to the minimization of (7) are not unique.

When estimating OPCG and MADE, we use a standard Gaussian kernel for the normalized weights $w_{ij} = K\{(X_i - X_j)/h\}/\sum_{i=1}^{n} K\{(X_i - X_j)/h\}$, with a bandwidth $h$. For the Gaussian kernel to be more appropriate, we standardize the predictors so that, under spherical kernels, all directions in the predictor space are treated similarly. The choice of bandwidth, $h$, is discussed in section 4.

We can improve OPCG and MADE by refining the kernel weights in (6) and (8), similar to the refinement schemes of Xia et al. (2002) and Xia (2007). By replacing $w_{ij}$ with $\tilde{w}_{ij} = K\{\hat{\beta}_{\mathrm{opcg}}^{\top}(X_i - X_j)/h\}$ in (5) for OPCG, or with $\tilde{w}_{ij} = K\{\hat{\beta}_{\mathrm{made}}^{\top}(X_i - X_j)/h\}$ in (7) for MADE, we can more accurately estimate $\mathcal{S}_{E(Y|X)}$. To obtain the refined estimators, we use the refined likelihoods to obtain new estimates of $\beta$. Then we calculate the new refined weights and substitute them back into the refined likelihood. We repeat this refinement process until some convergence criterion for the estimates of $\beta$ is achieved. The resulting estimators are referred to as the refined OPCG estimator, $\hat{\beta}_{\mathrm{ropcg}}$, and the refined MADE estimator, $\hat{\beta}_{\mathrm{rmade}}$, respectively. The refined estimators are expected to be more accurate than their unrefined counterparts.

We have developed our method under the assumption that $d$, the dimension of $\mathcal{S}_{E(Y|X)}$, is known. In practice, we need to estimate $d$ in order to construct a basis for $\mathcal{S}_{E(Y|X)}$. An advantage of OPCG is that we can appeal to contemporary methods based on the variation of eigenvalues and eigenvectors.
In particular, we estimate $d$ for OPCG using the Ladle and Predictor Augmentation methods for order determination, introduced by Luo and Li (2016, 2020). The Ladle estimator combines information extracted from the bootstrap variation in the eigenvectors and eigenvalues of $\hat{\Lambda}_n$ in order to determine the dimension of $\mathcal{S}_{E(Y|X)}$. The Predictor Augmentation estimator combines information extracted from the variation in the eigenvectors and eigenvalues of $\hat{\Lambda}_n$ that results from augmenting the predictors with irrelevant noisy components.

Since MADE estimates $\beta$ directly without the use of eigenvalues and eigenvectors, the Ladle and Predictor Augmentation methods are not applicable. To estimate the dimension $d$ in the context of MADE, we can use a cross-validation approach similar to Xia et al. (2002), a k-fold version of which is outlined in Adragni (2018).
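To summarize the OPCG construction of this section in code form, the following is a minimal sketch, not the authors' implementation: it specializes $b(\cdot)$ to the categorical (multinomial-logit) link introduced in section 4, fits each local GLM in (6) with a generic quasi-Newton solver rather than the Fisher-scoring/hybrid conjugate gradient scheme used later, assumes a Gaussian kernel and integer labels coded $0, \ldots, m-1$ (the last class as reference), and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def local_nll_and_grad(params, C, Y, w, m):
    """Weighted negative local log-likelihood (6) under the multinomial-logit link,
    and its gradient.  C: (n, p) centered predictors X_i - X_j;  Y: (n, m-1) indicator
    responses (the reference class is the all-zero row);  w: (n,) kernel weights."""
    n, p = C.shape
    a = params[:m - 1]
    B = params[m - 1:].reshape(p, m - 1)
    eta = a + C @ B                                    # canonical parameters, (n, m-1)
    full = np.hstack([eta, np.zeros((n, 1))])          # append the reference category
    mx = full.max(axis=1, keepdims=True)
    lse = np.log(np.exp(full - mx).sum(axis=1)) + mx[:, 0]   # log(1 + sum_j e^{eta_j})
    nll = np.sum(w * (-(eta * Y).sum(axis=1) + lse))
    P = np.exp(eta - lse[:, None])                     # fitted class probabilities
    R = w[:, None] * (P - Y)                           # weighted "residuals"
    return nll, np.concatenate([R.sum(axis=0), (C.T @ R).ravel()])

def opcg(X, labels, h, d, m):
    """Outer Product of Canonical Gradients: average the outer products of the local
    gradient estimates and return the top-d eigenvectors as the basis estimate."""
    n, p = X.shape
    Y = np.eye(m)[labels][:, :m - 1]                   # labels coded 0,...,m-1
    Lam = np.zeros((p, p))
    for j in range(n):
        C = X - X[j]
        w = np.exp(-0.5 * np.sum((C / h) ** 2, axis=1))    # Gaussian kernel weights
        w /= w.sum()
        x0 = np.zeros((m - 1) * (p + 1))
        fit = minimize(local_nll_and_grad, x0, args=(C, Y, w, m),
                       jac=True, method="L-BFGS-B")
        B_j = fit.x[m - 1:].reshape(p, m - 1)          # estimated canonical gradient
        Lam += B_j @ B_j.T / n
    evals, evecs = np.linalg.eigh(Lam)
    return evecs[:, np.argsort(evals)[::-1][:d]]       # basis estimate, p x d
```

In practice one would first standardize the predictors and could add the weight refinement described above; the same loop also yields the local intercepts $\hat{a}_j$ if prediction is of interest.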
4. Classification and Tuning
To construct the OPCG and MADE estimates, we need to specify the cumulant generating function $b(\cdot)$ in (6) and (8) for the local linear GLMs. The first and second derivatives of $b(\cdot)$ can be used to construct the score vector and information matrix needed for estimating OPCG and MADE via Fisher scoring (McCullagh and Nelder, 1989). In order to implement optimization algorithms for OPCG and MADE, we need to construct good initial values, which can be done using the inverse canonical link function, as in conventional GLM estimation. For categorical classification, we employ a multivariate logistic link function, while for ordinal classification, we employ a multivariate canonical link function corresponding to the Adjacent-Categories Logit Model (Agresti, 2010, 2013).

For a given number of categories $m$, a categorical response $\tilde{Y} = l \in \{1, \ldots, m\}$ can be represented by $Y = (Y_1, \ldots, Y_m) \in \{0,1\}^m$, where each component represents a category. The entries of $Y$ are $Y_j = 0$ for $j \neq l$ and $Y_l = 1$. Since categorical responses are a special case of $Y \sim \mathrm{multinomial}(k, p)$ with $k = 1$, we present the multivariate logistic link for general $k$. Since the response and probabilities are constrained by $\sum_{j=1}^{m} Y_j = k$ and $\sum_{j=1}^{m} p_j = 1$, respectively, without loss of generality we set $Y_m = k - \sum_{j=1}^{m-1} Y_j$ and $p_m = 1 - \sum_{j=1}^{m-1} p_j$. For convenience of notation, let $Y = (Y_1, \ldots, Y_{m-1})$ and $p = (p_1, \ldots, p_{m-1})$ be the unconstrained response and probabilities, respectively, for the remainder of this section. As an example, when we have $m = 3$ categories, the observation $\tilde{Y} = 2$ is equivalent to $Y = (0, 1)$, and $\tilde{Y} = 3$ is equivalent to $Y = (0, 0)$.

Suppose $X$ and $Y$ satisfy (2), so that $X$ relates to $Y$ only through the conditional mean, which corresponds to the conditional probabilities $p = p(X) \in [0,1]^{m-1}$. Let $1$ be a vector of ones with length determined by the context. Then, by our multinomial specification, the canonical parameter is the multivariate logistic transformation of the conditional mean
$$\theta(X) = \log\left\{ \frac{p(X)}{1 - 1^{\top} p(X)} \right\}.$$
With some algebra and the Sherman-Morrison-Woodbury formula, the inverse logistic transform is
$$p(X) = \frac{e^{\theta(X)}}{1 + 1^{\top} e^{\theta(X)}}.$$
This implies that the cumulant generating function is $b\{\theta(X)\} = k \log\{1 + 1^{\top} e^{\theta(X)}\}$, which satisfies the following relations:
$$\frac{\partial b\{\theta(X)\}}{\partial\theta} = k\, p(X), \qquad \frac{\partial^2 b\{\theta(X)\}}{\partial\theta\,\partial\theta^{\top}} = k\left[ \mathrm{Diag}\left( \frac{e^{\theta(X)}}{1 + 1^{\top} e^{\theta(X)}} \right) - \frac{e^{\theta(X)}(e^{\theta(X)})^{\top}}{[1 + 1^{\top} e^{\theta(X)}]^2} \right],$$
where $\mathrm{Diag}(\cdot)$ denotes a diagonal matrix with the argument specifying the diagonal entries. Substituting the cumulant generating function for categorical $Y$ into (6), the negative local log-likelihood about $X_j$, for OPCG, becomes
$$\ell_j(a_j, B_j) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Y_i + \log[1 + 1^{\top}\exp\{a_j + B_j^{\top}(X_i - X_j)\}] \right], \qquad (9)$$
where $k = 1$ for categorical $Y_i$, for all $i = 1, \ldots, n$. The substitution into (8) produces a similar likelihood for MADE.
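These derivatives translate directly into the score and information quantities used in Fisher scoring, and the inverse link supplies starting values. Below is a small sketch of those building blocks under the multivariate logistic link; the function names are ours and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def categorical_cumulant(theta, k=1):
    """b(theta) = k*log(1 + 1'e^theta) with gradient k*p(theta) and Hessian
    k*(diag(p) - p p'); theta is the (m-1)-vector of canonical parameters."""
    e = np.exp(theta)
    denom = 1.0 + e.sum()
    p = e / denom                              # inverse multivariate logistic link
    b = k * np.log(denom)
    grad = k * p                               # = db/dtheta
    hess = k * (np.diag(p) - np.outer(p, p))   # = d^2 b / dtheta dtheta'
    return b, grad, hess

def initial_theta(p_hat):
    """Inverse canonical link theta = log{ p / (1 - 1'p) }, e.g. applied to smoothed
    class proportions to produce starting values for Fisher scoring."""
    return np.log(p_hat / (1.0 - p_hat.sum()))
```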
Suppose $\tilde{Y} = k \in \{1, \ldots, m\}$ is an ordinal response for $m$ given ordered categories. Then $\tilde{Y}$ can be represented as a vector $Y = (Y_1, \ldots, Y_m) \in \{0,1\}^m$, with entries $Y_j = 1$ for $j \leq k$ and $Y_j = 0$ for $k < j \leq m$. For example, with $m = 3$ categories, the ordinal response $\tilde{Y} = 2$ is equivalent to $Y = (1, 1, 0)$. We then transform $Y$ to a cumulative response, as in McCullagh (1980). This will provide us with the canonical linear exponential family form needed for OPCG and MADE.

Let $R_s = \sum_{l=1}^{s} Y_l$ be the cumulative response up to category $s$, and $\gamma_s = \sum_{l=1}^{s} p_l$ be the cumulative probability up to category $s$, for $s = 1, \ldots, m$. Let $R = (R_1, \ldots, R_m) = (Y_1, Y_1 + Y_2, \ldots, k)$ and $\gamma = (\gamma_1, \ldots, \gamma_m) = (p_1, p_1 + p_2, \ldots, 1)$. Then $Y_s = R_s - R_{s-1}$, with $R_1 = Y_1$ and $R_m = k$, and $p_s = \gamma_s - \gamma_{s-1}$, with $\gamma_1 = p_1$ and $\gamma_m = 1$, where we set $R_0 = 0$ and $\gamma_0 = 0$. Then we can associate every ordinal $Y$ with $R$, and re-write the multinomial log-likelihood for ordinal $Y$ as
$$\ell(\gamma; Y_n) = \sum_{s=1}^{m-1} \left[ R_s \log\left( \frac{\gamma_s}{\gamma_{s+1}} \right) + (R_{s+1} - R_s) \log\left( \frac{\gamma_{s+1} - \gamma_s}{\gamma_{s+1}} \right) \right].$$
This formulation of the ordinal response model is canonical and is referred to as the Adjacent-Categories Logit Model for cumulative probabilities (Agresti, 2010, 2013). Note that the conditional mean of $R$ is $\gamma(X) = E(R \mid X)$. This implies that the canonical parameter, as a function of the predictor $X$, is $\theta(X) = (\theta(X)_1, \ldots, \theta(X)_{m-1})$, with entries
$$\theta_s(X) = k \log\left( \frac{\gamma(X)_s - \gamma(X)_{s-1}}{\gamma(X)_{s+1} - \gamma(X)_s} \right).$$
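The cumulative representation and the canonical parameter just defined (named the Adjacent-Categories transformation in the next paragraph) are straightforward to compute; the following is a hedged sketch with our own function names, written for a single trial per observation unless $k$ is supplied.

```python
import numpy as np

def cumulative_response(label, m, k=1):
    """Map an ordinal label in {1,...,m} to Z = (R_1,...,R_{m-1})/k, where
    R_s = sum_{l<=s} Y_l is the cumulative response under the coding Y_j = 1{j <= label}."""
    Y = (np.arange(1, m + 1) <= label).astype(float)
    R = np.cumsum(Y)
    return R[:m - 1] / k

def adcat_theta(gamma, k=1):
    """Adjacent-Categories canonical parameter from the cumulative probabilities
    gamma = (gamma_1,...,gamma_{m-1}):
    theta_s = k*log{(gamma_s - gamma_{s-1}) / (gamma_{s+1} - gamma_s)},
    with gamma_0 = 0 and gamma_m = 1."""
    g = np.concatenate([[0.0], gamma, [1.0]])
    p = np.diff(g)                       # category probabilities p_1, ..., p_m
    return k * np.log(p[:-1] / p[1:])    # equals k*log(p_s / p_{s+1}), s = 1..m-1
```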
We refer to the transformation from $\gamma(X)$ to $\theta(X)$ as the Adjacent-Categories (Ad-Cat) transformation. The inverse Ad-Cat transformation is
$$\gamma(X) = \left[ I_{m-1} - \frac{\tau\{\theta(X)\}\, e_{m-1}^{\top}}{1 + e_{m-1}^{\top}\tau\{\theta(X)\}} \right] \tau\{\theta(X)\},$$
where $\tau(X) \in \mathbb{R}^{m-1}$ has entries $\tau_s = \sum_{r=1}^{s} \exp\left\{ \sum_{t=r}^{m-1} (\theta(X)_t/k) \right\}$, and $e_{m-1}$ is a vector of 0's with a 1 in the $(m-1)$th position. The cumulant generating function $b(\cdot)$ is $b\{\theta(X)\} = -k \log[1 - e_{m-1}^{\top}\gamma\{\theta(X)\}]$, and it satisfies the following relations, as in McCullagh (1980):
$$\frac{\partial b\{\theta(X)\}}{\partial\theta} = \gamma(X), \qquad \frac{\partial^2 b\{\theta(X)\}}{\partial\theta\,\partial\theta^{\top}} = \frac{1}{k}\left[ \Gamma(X) - \gamma(X)\gamma(X)^{\top} \right],$$
where $\Gamma(X)$ is the $(m-1)\times(m-1)$ matrix with $(s, t)$ entry $\gamma_{\min(s,t)}(X)$, that is,
$$\Gamma(X) = \begin{pmatrix} \gamma_1(X) & \gamma_1(X) & \cdots & \gamma_1(X) \\ \gamma_1(X) & \gamma_2(X) & \cdots & \gamma_2(X) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_1(X) & \gamma_2(X) & \cdots & \gamma_{m-1}(X) \end{pmatrix}.$$
Since $R$ is constrained by $R_m = k$, we consider the unconstrained vector $(R_1, \ldots, R_{m-1})$ divided by $k$. Let $Z = (Z_1, \ldots, Z_{m-1})$, with entries $Z_s = R_s/k$ for $s = 1, \ldots, m-1$. Substituting the ordinal cumulant generating function into (6), the negative local log-likelihood about $X_j$, for OPCG, becomes
$$\ell_j(a_j, B_j) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Z_i - k \log\left(1 - e_{m-1}^{\top}\gamma[\{a_j + B_j^{\top}(X_i - X_j)\}]\right) \right], \qquad (10)$$
where $k$ is the observed ordinal category of $\tilde{Y}_i$, for all $i = 1, \ldots, n$. The substitution into (8) produces a similar likelihood for MADE.

Conventional means of selecting the bandwidth parameter $h$ typically involve minimizing out-of-sample prediction error, or choosing an optimal $h$ that minimizes a Mean Integrated Squared Error criterion; see Wand and Jones (1994), Silverman (2018) and Fan et al. (1995). Cross-validation depends on the prediction method, and the results can vary between methods. Optimal bandwidths are the prevailing choice for OPG, MAVE, and their extensions (Xia et al., 2002; Xia, 2006, 2008; Wang and Xia, 2008). Some examples include
$h = c\, n^{-1/(p_0+6)}$, where $p_0 = \max(p, 3)$ (Xia, 2006), and $h = n^{-1/(p+4)}$ (Wang and Xia, 2008). The suggested constant in the former is $c = 2.34$ when $K(\cdot)$ is the Epanechnikov kernel (Wand and Jones, 1994; Xia, 2007), while the constant in the latter is set to 1. This implies that the constant requires tuning when the recommended values perform poorly.

To determine the tuning parameter for the weights, $w_{ij}$, in (6) for OPCG and in (8) for MADE, we propose a k-means clustering procedure. The motivating principle behind this data-driven bandwidth selection procedure is that the tuning parameter, $h$, should be selected so as to make prediction "easier". Denoting by $\hat{\beta}(h)$ the estimator of $\beta$ corresponding to bandwidth $h$, the aim is to choose $h$ so that the sufficient predictors constructed on some tuning set, $\{X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$, provide the smallest prediction error. In the context of classification, we expect this prediction error to be minimized when the sufficient predictors $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$ cluster into distinct groups corresponding to their class labels.

Suppose we have $m$ classes and want to select $h$ such that the most distinct $m$ clusters are formed. To achieve this, for each $h$, we implement a k-means clustering algorithm on the sufficient predictors $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$ in order to estimate $m$ clusters, and compute the corresponding ratio of the within-cluster sum of squares, $WSS_{km}(h)$, to the between-cluster sum of squares, $BSS_{km}(h)$. The bandwidth $h_{km}$ corresponding to the smallest ratio is chosen as our bandwidth estimate, i.e. $h_{km} = \arg\min_{h>0}\{WSS_{km}(h)/BSS_{km}(h)\}$. Some drawbacks of this approach include assuming there is only one cluster per class label and not using the label information from the tuning set. A quick remedy for the former issue is to estimate a pre-specified total number of clusters instead of just $m$ clusters. We refer to this adjustment as our unsupervised k-means tuning procedure.

To incorporate the class label information available in the tuning set, we modify the k-means tuning approach. Fix $h > 0$ and, for each class $l \in \{1, \ldots, m\}$, let $\{X_i^{\mathrm{tune}}: i = 1, \ldots, n_l\}$ denote the sample of predictors in the tuning set that belong to class $l$. We estimate $c_l$ clusters for $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n_l\}$ using k-means. For the estimated $c_l$ clusters, we compute $WSS_l(h)$ and $BSS_l(h)$. Then, for each $h$ and pre-specified numbers of clusters per class, $c_1, \ldots, c_m$, we have $WSS_1(h), \ldots, WSS_m(h)$ and $BSS_1(h), \ldots, BSS_m(h)$. We define the supervised between and within sums of squares as $SBSS(h) = \sum_{l=1}^{m} BSS_l(h)$ and $SWSS(h) = \sum_{l=1}^{m} WSS_l(h)$, respectively, and choose the bandwidth that minimizes the ratio $SWSS(h)/SBSS(h)$, i.e. $h_{skm} = \arg\min_{h>0}\{SWSS(h)/SBSS(h)\}$. We refer to $h_{skm}$ as the supervised k-means bandwidth. While this approach uses the label information in the tuning set, we need to pre-specify the number of clusters per class, $c_l$, beforehand. In our applications, this tuning procedure seems robust to the choice of $c_l$, which we set to a constant $c_l = 1$, 2 or 3 across all $l$. If we specify $c_l \equiv 1$, then
$WSS_l(h)$ is the sum of squares of a single cluster and $BSS_l(h) \equiv 0$. In this case, we consider the estimated center for each label $l$, denoted $C_l$, and let $SBSS(h)$ be the sum of squares of the estimated centers $C_1, \ldots, C_m$.

To combine the unsupervised and supervised k-means tuning methods, we choose the $h$ that minimizes the average of the two standardized ratios above, i.e. $h_{wkm} = \arg\min_{h>0}\{WSS(h)/BSS(h) + SWSS(h)/SBSS(h)\}$. We refer to $h_{wkm}$ as the weighted k-means bandwidth. A further improvement to the tuning algorithm is to perform the procedure k-fold on the training data. This avoids the need for an external tuning set. We refer to the k-fold, weighted k-means procedure simply as our k-fold k-means tuning.

For the intuition behind this tuning procedure, consider the unsupervised k-means scenario, where there are $m$ classes/clusters to estimate, and let $K(\cdot)$ be the uniform kernel, i.e. balls of radius $h$. When $h$ is small, the local linear approximation to the log-likelihood is rough and the gradient estimates will be noisy due to a lack of observations in the local neighbourhood. If $h$ is small enough that only one other point is in the neighbourhood, then the canonical gradient estimate is effectively random, and so are the corresponding estimated SDR directions. Then the constructed sufficient predictors, $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$, are essentially random noise with large $WSS(h)$ and small $BSS(h)$, making the ratio large.

When $h$ is large, the local linear approximation to the log-likelihood is over-smoothed, and important features may be lost. As a result, the dimension reduction may not capture all the relevant directions, meaning only some classes may be separated. If $h$ is large enough that the entire sample is in every neighbourhood, then the canonical gradient is just the slope coefficient in a conventional GLM, and OPCG reduces to fitting the same GLM re-centered at the various observed points $X_j$. Since re-centering does not affect the slope estimate, this provides at most one gradient. Then OPCG will be able to estimate at most one direction in $\mathcal{S}_{E(Y|X)}$, which is not exhaustive if $d > 1$. We end up estimating a proper subspace of the central mean space when $h$ grows too large, and the sufficient predictors will only separate some classes while confounding others. The ideal choice of $h$ should be large enough to effectively estimate all the SDR directions, so that all the classes are clearly separated from each other, and this is achieved through clustering algorithms, such as k-means.
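The bandwidth search just described is straightforward to code. The sketch below combines the unsupervised and supervised criteria by simply summing the two ratios (one possible reading of the weighting above); it relies on scikit-learn's KMeans, assumes integer-coded class labels and at least two clusters per class so that $BSS_l(h) > 0$, and all names (`wss_bss`, `kmeans_bandwidth`, the `fit` argument) are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def wss_bss(Z, n_clusters):
    """Within- and between-cluster sums of squares for a k-means fit on Z."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(Z)
    wss = km.inertia_
    tss = np.sum((Z - Z.mean(axis=0)) ** 2)
    return wss, tss - wss

def kmeans_bandwidth(fit, X_train, y_train, X_tune, y_tune, h_grid, d, m,
                     clusters_per_class=2):
    """Choose h minimizing the combined unsupervised + supervised WSS/BSS ratio of
    the tuning-set sufficient predictors; `fit` is any routine returning a p x d
    basis estimate for a given bandwidth (e.g. the opcg sketch in section 3)."""
    scores = []
    for h in h_grid:
        beta = fit(X_train, y_train, h, d, m)
        Z = X_tune @ beta                                # sufficient predictors
        wss_u, bss_u = wss_bss(Z, n_clusters=m)          # unsupervised criterion
        wss_s = bss_s = 0.0
        for l in range(m):                               # supervised criterion
            wl, bl = wss_bss(Z[y_tune == l], clusters_per_class)  # needs >= 2 clusters
            wss_s, bss_s = wss_s + wl, bss_s + bl
        scores.append(wss_u / bss_u + wss_s / bss_s)
    return h_grid[int(np.argmin(scores))]
```

A k-fold variant would repeat this over folds of the training data and average the criterion before minimizing, removing the need for a separate tuning set.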
5. Simulations and Data Analysis
We mainly study the performance of OPCG, but consider MADE in our simulations. This is due to the non-strict convexity of MADE and its computational cost for even moderate-sized $p$. OPCG has unique gradient estimates (under identifiability) and can be computed in parallel to mitigate the computational cost issue. Adragni (2018) uses Grassmannian optimization methods to estimate MADE for scalar $Y$, which may be faster for moderate-sized $p$. From our simulations, we will see that the MADE estimates are comparable to OPCG, similar to observations made by Xia (2006) and Wang and Xia (2008). Therefore, our focus on OPCG will also provide insight into the performance of MADE in applications.

For our simulations and examples, we compare OPCG to the classical inverse regression method, Sliced Inverse Regression (SIR) (Li, 1991), and to the more recent Directional Regression (DR) (Li and Wang, 2007). In the simulations, we include the pairwise suggestion by Adragni (2018) for dealing with multi-label classification. For applications, we analyze the USPS handwritten digits dataset (Hull, 1994), as well as two data sets from the UCI database: wine quality (Cortez et al., 2009) and ISOLET (Fanty and Cole, 1990). The wine quality data is considered as an ordinal classification problem, while the simulations and remaining applications are categorical classification problems. For the USPS handwritten digits and ISOLET data, we include some results verbatim from Fukumizu and Leng (2014) for comparison. When the sample covariance matrix is ill-behaved, we use a Tikhonov-regularized covariance for standardizing SIR and DR, similar to Zhong et al. (2005). For the order determination of OPCG, we use the Ladle and Predictor Augmentation estimators in our simulations and in all our applications except ISOLET, due to computational time.

Given the cumulant generating function and its derivatives, we can estimate OPCG and MADE via traditional Fisher scoring (McCullagh and Nelder, 1989). However, to improve the computational speed of our methods, we use the hybrid conjugate gradient (hCG) algorithm of Dai and Yuan (2001) to minimize the full negative log-likelihoods (5) and (7), with the appropriate canonical links (9) or (10). Our implementation of the hybrid conjugate gradient performs well without needing to enforce the weak Wolfe conditions outlined in Dai and Yuan (2001), and this improves the computational speed of the minimization. A backtracking rule is used to determine the step size for the hCG algorithm, as in Bertsekas (2015, page 69). Without enforcing the weak Wolfe conditions, global convergence of the hCG algorithm is not guaranteed, but we suspect that the initial values available from the inverse canonical link functions suffice to start the estimation close to the minima.

For our simulation study, we consider the following five bivariate normal distributions: $(X_1, X_2) \sim N(\mu_c,\; \cdot \times I)$, $\mu_c \in \{(0, \cdot), (3, \cdot), (-\cdot, -\cdot), (-\cdot, \cdot), (2, -\cdot)\}$. For each $\mu_c$, we sample 50 observations for our training set, 30 for tuning and 30 for testing, giving a total of 250, 150 and 150 observations for training, tuning and testing, respectively. The 5 clusters in each set are then labeled into 3 classes, with one class being one cluster and the other two classes each comprising two clusters; see Figure 1.

Fig. 1.
The true directions and class labels for training, tuning and testing, from left to right, respectively. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3.

We generate 8 standard normal variables as noise, so that $p = 10$, and shuffle the 10 predictors, so that a basis for $\mathcal{S}_{E(Y|X)}$ is a matrix $\beta \in \mathbb{R}^{10 \times 2}$ whose two columns are standard basis vectors marking the positions of the two informative predictors; thus $\mathcal{S}_{E(Y|X)}$ has dimension $d = 2$. The k-means tuning procedures outlined in section 4 are illustrated in Figure 2; in particular, we demonstrate the performance of our unsupervised k-means, supervised k-means, weighted k-means and k-fold k-means, with the pre-specified number of clusters per class set to $c_l = 2$ and the dimension of the OPCG estimates set to $d = 2$. The sufficient predictors from the non-k-fold methods are constructed with the tuning set, and the behaviour of these sufficient predictors for various $h$ is presented in Figure 2a. The plots of the within to between sums of squares ratios are given in Figure 2b. When $h$ is small, the sufficient predictors are effectively random noise, and when $h$ is large, we only recover one SDR direction, which separates class 2 from classes 1 and 3; classes 1 and 3 are confounded. The estimated bandwidth from the tuning set is $h \approx$
.24. The plot for 3-fold k-means is also given in Figure 2b, with the estimated bandwidth being $h \approx$ .96. We use the 3-fold k-means estimate for the bandwidth, since the k-fold estimate is what we use in our data applications. Note that the suggested constant for the optimal bandwidth gives $h_{\mathrm{opt}} = 2.34\, n^{-1/(p_0+6)} = 1.65$, which is larger than all our estimates. Figure 2a suggests that this choice of optimal bandwidth would perform poorly at separating classes 1 and 3 from class 2.

Fig. 2. (a) Sufficient predictors for different bandwidths $h$. From left to right: the sufficient predictors constructed from the tuning set, for four increasing values of $h$. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3. (b) K-means tuning procedures for $h$: the ratios of within over between sums of squares for k-means, supervised k-means, weighted semi-supervised k-means and k-fold supervised k-means, from left to right, respectively. The four red points correspond to the four values of $h$ plotted in panel (a). The plot for 3-fold k-means does not have these points because its estimates are constructed by folding the training set, not from the tuning set.

To estimate the dimension of $\mathcal{S}_{E(Y|X)}$, we use the Ladle and Predictor Augmentation plots for OPCG with bandwidth $h = 1$. The Ladle and Predictor Augmentation plots are given in Figure 3. The Ladle estimator estimates the dimension as 3, though the ladle values for dimensions 2 and 3 are very close. The Predictor Augmentation method correctly detects the dimension as 2.

Fig. 3. (a) The Ladle plot estimates $\hat{d} = 3$; (b) the Predictor Augmentation plot estimates $\hat{d} = 2$.

For the initial values of $\beta$ in MADE, we compare using $(e_1, e_2)$, where $e_i \in \mathbb{R}^p$ has a 1 in the $i$th position and 0 elsewhere, and using OPCG, which we refer to as OPCG-MADE. For the pairwise binary classification suggested by Adragni (2018), we estimate 2 SDR directions for each pair of classes, (1,2), (1,3) and (2,3), giving us 6 directions in total, and select the two directions that best separate the three classes. This corresponds to taking the leading eigenvector for the pairwise classifications of (1,2) and (1,3); we refer to this estimate as OPCG-PW. Treating the dimension $d$ as known, and using bandwidth $h = 1$, we estimate the SDR directions using OPCG, MADE, OPCG-MADE, OPCG-PW, DR and SIR. We construct the sufficient predictors on the testing set and present the plots in Figure 4. The OPCG, MADE, OPCG-MADE and DR estimates work well, while the pairwise estimator appears reasonable, and SIR fails. The failure of SIR due to symmetry is well known (Cook and Weisberg, 1991), while the subpar performance of the pairwise estimate demonstrates the benefit of using the multivariate link approach. Furthermore, for problems with many class labels, the pairwise search for the SDR directions would be computationally costly. For MADE, the non-uniqueness of $\beta$ means the MADE iterations are slow to converge. Neither MADE nor OPCG-MADE achieves convergence in our simulations, but both still yield good estimates. Furthermore, with OPCG as the initial estimate, the MADE estimator does not differ much from the initial value.

We also compute the distances of the estimated SDR directions, $\hat{\beta}$, to the true $\beta$ under the Frobenius norm in Table 1. We see that DR is closest to the true subspace, followed by OPCG-MADE, MADE and OPCG. MADE and OPCG-MADE are expected to be more efficient than OPCG, while OPCG-PW performs the worst among the methods that are expected to work, and SIR is the absolute worst due to SIR's inability to cope with symmetry.
Fig. 4.
The sufficient predictors constructed on the testing set. From left to right and top to bottom: OPCG, MADE, MADE with OPCG initial value, pairwise OPCG, DR, SIR. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3.

Table 1.
Distances $\|\hat{\beta} - \beta\|_F$ under the Frobenius norm

OPCG       MADE       OPCG+MADE   PW-OPCG    DR         SIR
0.2437652  0.2261340  0.2261263   0.4485811  0.1972445  1.9939585
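For reference, a subspace distance of the kind used in Theorem 3.2 can be computed from the projection matrices of the two column spaces, which makes it invariant to the particular bases chosen; the function name is ours, and this is one common convention rather than necessarily the exact quantity reported in Table 1.

```python
import numpy as np

def projection_distance(B_hat, B):
    """Frobenius distance between the orthogonal projections onto span(B_hat)
    and span(B); invariant to the bases used to represent the two subspaces."""
    P = lambda A: A @ np.linalg.solve(A.T @ A, A.T)   # projection onto span(A)
    return np.linalg.norm(P(B_hat) - P(B), ord="fro")
```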
The red wine quality data set consists of a sensory-based integer score between 0 and 10 as the response, along with 11 continuous predictors. The "quality" variable is considered as the ordinal response. We use the inverse Ad-Cat transform to construct initial values, and substitute the ordinal cumulant generating function into the likelihood. We consider only the mediocre red wines with scores 5, 6 and 7. These classes had the most observations, while the poorer and higher quality wines had much fewer observations. An alternative is to group all scores below 5 as one group, and all scores above 7 as another group. This leads to similar, but noisier, results, since the more extreme scores are akin to outliers. We randomly split the sample into 1000 training and 513 testing observations. For assessing the importance of specifying the ordinal link, we compare the ordinal OPCG estimates against categorical OPCG estimates, which use the multinomial link; we refer to the latter as OPCG-MN.

We use a 3-fold k-means tuning procedure, with $d$ set to 2, and pre-specify the number of clusters per class to be 3. The procedure suggests $h \approx 3.5$; see Figure 5. Using a bandwidth of $h = 3$, the estimated dimension from both the Ladle and the PA plot is $\hat{d} = 2$; see Figure 5.

Fig. 5.
From left to right: the Ladle plot estimates the dimension to be $\hat{d} = 2$; the Predictor Augmentation plot agrees with the Ladle plot; 3-fold k-means tuning suggests $h = 3.5$. We look at the second dip in the tuning plot because, for small values of $h$, the sufficient predictors are ill-behaved and clump tightly together.

Using the estimated $\hat{d} = 2$ and $h \approx 3.5$, we construct the sufficient predictors on the testing set for OPCG, OPCG-MN, DR and SIR; see Figure 6. The OPCG and OPCG-MN sufficient predictors capture the ordinal nature of the quality scores well, while SIR performs to a lesser extent, and DR appears unable to clearly discern the categories. The sufficient predictors from OPCG appear to have lower variance than those from OPCG-MN, and this suggests that a misspecified link function negatively affects the efficiency of the estimates.

To conduct a simple ordinal classification, we consider the 2 binary classification problems for the events $\tilde{Y} > 5$ and $\tilde{Y} >$
6, as in Frank and Hall (2001). We train a Support Vector Machine with default settings, from the R library 'e1071', on the sufficient predictors constructed from the training set, and predict the class labels of the sufficient predictors constructed from the testing set; see Table 2. While all methods perform poorly, since ordinal classification is difficult, the OPCG sufficient predictors predict best.
Table 2.
Classification % Error for Wine Quality Scores 5, 6, 7.
OPCG    OPCG-MN    DR      SIR
36.68   37.26      48.26   40.73
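The error rates above (and in the later tables) come from training a classifier on the estimated sufficient predictors and scoring it on the held-out set. A minimal version of that step, using scikit-learn's SVC with default settings in place of the R routine used in the paper, might look as follows (the function name is ours).

```python
import numpy as np
from sklearn.svm import SVC

def svm_error_on_sufficient_predictors(beta_hat, X_train, y_train, X_test, y_test):
    """Train an SVM on the sufficient predictors beta_hat' X and return the
    test misclassification rate in percent."""
    clf = SVC().fit(X_train @ beta_hat, y_train)        # default (RBF) SVM
    y_pred = clf.predict(X_test @ beta_hat)
    return 100.0 * np.mean(y_pred != y_test)
```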
With the handwritten digits, we have 2007 observations consisting of vectorized, 16-by-16 pixel images of handwritten digits from 0 to 9. The predictors are vectors of length $p = 256$; since they represent a vectorized matrix, we do not standardize the predictors for OPCG. The 2007 observations are split into 1000 training and 1007 testing observations.
Fig. 6.
The sufficient predictors constructed on the testing set. Top left: OPCG with ordinal link. Top right: OPCG with categorical link. Bottom left: DR. Bottom right: SIR. The red squares are quality 5; the blue plusses are 6; the green triangles are 7.

We use a 2-fold k-means tuning with one cluster per class and $d$ set to 3 to determine $h$. The resulting tuning plot in Figure 7 suggests the bandwidth be set to $h \approx 4$. The Predictor Augmentation plot estimates the dimension to be $\hat{d} = 9$, while the Ladle plot is inconclusive; see Figure 7. Following Fukumizu and Leng (2014), we illustrate the effectiveness of OPCG by randomly sampling 500 observations of the digits 1, 2, 6, 7 and 9; their OPCG sufficient predictors are plotted in Figure 8.

Fig. 7.
From left to right: the Ladle plot is unable to meaningfully estimate the dimension; the Predictor Augmentation plot estimates the dimension to be $\hat{d} = 9$; the 2-fold k-means tuning algorithm suggests a bandwidth of around $h = 4$.

Fig. 8. (a) A plot of the OPCG sufficient predictors for 500 observations from '1' (red squares), '2' (blue plusses), '6' (green triangles), '7' (purple circles), '9' (orange stars). The OPCG estimate is for all digits 0-9, and we see that the estimated linear space separates the categories well. (b) The 9 estimated SDR directions, considered as estimated features, applied to 16 random observations from the testing set. We see that the directions can clearly recover the images in the testing set. This is true for DR and SIR as well.

We estimate SIR and DR using a Tikhonov-regularized covariance matrix, with a regularization penalty of 3. For various dimensions $d$, we construct the sufficient predictors and calculate their classification error rates, which are reported in Table 3. OPCG outperforms SIR and DR in that we achieve a lower classification error with fewer necessary dimensions. In this sense, OPCG also outperforms the partitioned version of the gradient-based Kernel Dimension Reduction, denoted by g-KDR-v, in Fukumizu and Leng (2014), though we do not replicate their analysis, but compare verbatim.
Table 3.
Classification % Error for USPS Handwritten Digits. The '-' is because SIR can only estimate $m - 1$ SDR directions. *The error rates for g-KDR-v are taken from Fukumizu and Leng (2014).

dim        3      5      7      9      13     20     25
SIR        41.51  21.95  15.39  15.09  -      -      -
DR         40.22  22.24  15.29  13.51  10.43  8.54   8.74
OPCG       33.76  19.17  10.43  7.94   7.85   7.94   7.85
g-KDR-v*   36.94  17.58  12.41  12.21  8.54   7.94   7.75
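One natural reading of Figure 8(b) is that each test image is projected onto the span of the nine estimated directions and mapped back to pixel space; a hedged sketch of that visualization, assuming $\hat{\beta} \in \mathbb{R}^{256 \times 9}$ from OPCG on the raw pixel vectors, is given below (names ours).

```python
import numpy as np
import matplotlib.pyplot as plt

def reconstruct_digits(beta_hat, X_test, n_images=16, seed=0):
    """Project test images onto span(beta_hat) and back, then display the 16 x 16
    reconstructions beneath the originals."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_test), size=n_images, replace=False)
    P = beta_hat @ np.linalg.solve(beta_hat.T @ beta_hat, beta_hat.T)  # projection
    fig, axes = plt.subplots(2, n_images, figsize=(2 * n_images, 4))
    for col, i in enumerate(idx):
        axes[0, col].imshow(X_test[i].reshape(16, 16), cmap="gray")      # original
        axes[1, col].imshow((P @ X_test[i]).reshape(16, 16), cmap="gray")  # reconstruction
        axes[0, col].axis("off"); axes[1, col].axis("off")
    fig.tight_layout()
    return fig
```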
For the ISOLET data from UCI, we have 150 individuals' verbal recordings of letters from the alphabet. We have $p = 617$ continuous predictors scaled between $-1$ and $1$. We do not tune $h$, due to the high computational cost, and so assess the performance of OPCG purely in terms of classification error using SVM in Table 4. We estimate SIR and DR using a Tikhonov-regularized covariance matrix, with a regularization penalty of 2. The OPCG sufficient predictors perform worse than DR, but eventually outperform SIR for large enough $d$. All the methods appear to be outperformed by the g-KDR-v method of Fukumizu and Leng (2014).
Classification % Error for ISOLET. The '-' is because SIR can only estimate $m - 1$ SDR directions. **The error rates for g-KDR-v are taken from Fukumizu and Leng (2014).

dim         5      10     15     20     25     30
SIR         25.85  10.84  8.66   8.47   8.40   -
DR          26.49  12.12  8.72   7.25   6.48   5.45
OPCG        41.37  20.65  12.64  8.21   6.48   6.54
g-KDR-v**   29.44  13.15  8.28   4.55   3.91   4.81
6. Conclusion
We introduce a generalized forward regression approach for sufficient dimension reduction that is applicable to categorical and ordinal classification problems. The Outer Product of Canonical Gradients (OPCG) and the multivariate formulation of the Minimum Average Deviance Estimator (MADE) provide estimates of the Central Mean Subspace, which coincides with the Central Subspace, for categorical and ordinal classification responses. We also introduce a tuning procedure for sufficient dimension reduction that does not require pre-specifying a prediction method, and our k-fold k-means tuning method performs reasonably well in our real data applications. Furthermore, the OPCG method is amenable to the Ladle and Predictor Augmentation estimators for order determination, which are faster and more accessible than conventional methods for MADE. In our applications, we demonstrated that OPCG is comparable to, and can outperform, conventional inverse regression methods. We also demonstrated the effect on efficiency of mis-specifying the response for OPCG.

References
Adragni, K. P. (2018) Minimum average deviance estimation for sufficient dimension reduction. Journal of Statistical Computation and Simulation, 411–431.
Agresti, A. (2010) Analysis of Ordinal Categorical Data, vol. 656. John Wiley & Sons.
— (2013) Categorical Data Analysis. Wiley, 3rd edn.
Bertsekas, D. P. (2015) Convex Optimization Algorithms. Athena Scientific, Belmont.
Cook, R. D., Li, B. et al. (2002) Dimension reduction for conditional mean in regression. The Annals of Statistics, 455–474.
Cook, R. D. and Weisberg, S. (1991) Comment. Journal of the American Statistical Association, 328–332.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 547–553.
Dai, Y.-H. and Yuan, Y. (2001) An efficient hybrid conjugate gradient method for unconstrained optimization. Annals of Operations Research, 33–47.
Fan, J., Heckman, N. E. and Wand, M. P. (1995) Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association, 141–150.
Fanty, M. and Cole, R. (1990) Spoken letter recognition. Advances in Neural Information Processing Systems, 220–226.
Frank, E. and Hall, M. (2001) A simple approach to ordinal classification. In European Conference on Machine Learning, 145–156. Springer.
Fukumizu, K. and Leng, C. (2014) Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 359–370.
Hull, J. J. (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 550–554.
Lambert-Lacroix, S. and Peyre, J. (2006) Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics & Data Analysis, 2091–2113.
Li, B. (2018) Sufficient Dimension Reduction: Methods and Applications with R. CRC Press.
Li, B. and Wang, S. (2007) On directional regression for dimension reduction. Journal of the American Statistical Association, 997–1008.
Li, K.-C. (1991) Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 316–327.
Luo, W. and Li, B. (2016) Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 875–887.
— (2020) On order determination by predictor augmentation. Biometrika.
McCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), 109–142.
McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, Second Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Silverman, B. W. (2018) Density Estimation for Statistics and Data Analysis. Routledge.
Wand, M. P. and Jones, M. C. (1994) Kernel Smoothing. Chapman and Hall/CRC.
Wang, H. and Xia, Y. (2008) Sliced regression for dimension reduction. Journal of the American Statistical Association, 811–821.
Xia, Y. (2006) Asymptotic distributions for two estimators of the single-index model. Econometric Theory, 1112–1137.
— (2007) A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 2654–2690.
— (2008) A multiple-index model and dimension reduction. Journal of the American Statistical Association, 1631–1640.
Xia, Y., Tong, H., Li, W. and Zhu, L.-X. (2002) An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 363–410.
Yin, X., Li, B. and Cook, R. D. (2008) Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 1733–1757.
Zhong, W., Zeng, P., Ma, P., Liu, J. S. and Zhu, Y. (2005) RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics, 21.