Generalized Forward Sufficient Dimension Reduction for Categorical and Ordinal Responses
Harris Quach † The Pennsylvania State University, University Park, USA.
Bing Li
The Pennsylvania State University, University Park, USA.
Summary. We present a forward sufficient dimension reduction method for categorical or ordinal responses by extending the outer product of gradients and minimum average variance estimator to the multinomial generalized linear model. Previous work in this direction extends forward regression to binary responses and is applied in a pairwise manner to multinomial data, which is less efficient than our approach. Like other forward regression-based sufficient dimension reduction methods, our approach avoids the relatively stringent distributional requirements necessary for inverse regression alternatives. We show consistency of our proposed estimator and derive its convergence rate. We develop an algorithm for our methods based on repeated applications of available algorithms for forward regression. We also propose a clustering-based tuning procedure to estimate the tuning parameters. The effectiveness of our estimator and related algorithms is demonstrated via simulations and applications.
Keywords: Classification, Bandwidth Selection
1. Introduction
The Outer Product of Gradients (OPG) and the Minimum Average Variance Estimator (MAVE) (Xia et al., 2002) are two popular methods for sufficient dimension reduction (SDR) (Li, 1991; Cook and Weisberg, 1991; Li, 2018) because of the weak distributional assumptions they require, as well as their efficient and stable performance. However, both OPG and MAVE are based on the derivative of the response variable, and therefore are inappropriate for categorical responses, which limits their application. The purpose of this paper is to extend these methods to categorical and ordinal categorical responses by imposing a multivariate link function on the conditional mean of the response, in a generalized linear model.

Let $Y$ denote the response variable and $X$ a $p$-dimensional predictor. Sufficient dimension reduction (SDR) estimates a lower dimensional function of $X$ that retains all relevant information about $Y$ that is available in $X$; that is, we seek to estimate a linear function of $X$, represented by $\beta^{\top} X$, where $\beta \in \mathbb{R}^{p \times d}$ with $d < p$, such that
$$Y \perp\!\!\!\perp X \mid \beta^{\top} X. \qquad (1)$$

† Address for correspondence: [email protected]
Since relation (1) determines $\beta$ only up to its column space, the parameter of interest is the subspace $\mathrm{span}(\beta) = \mathcal{S}(\beta)$. For any $\beta$ such that (1) holds, the corresponding $\mathcal{S}(\beta)$ is referred to as an SDR subspace, and the columns of $\beta$ are referred to as SDR directions. While the existence of some SDR subspace follows from taking $\beta$ as the identity, the existence of a minimal SDR subspace requires some mild conditions that are given in generality by Yin et al. (2008). We will assume existence of the minimal SDR subspace, referred to as the Central Subspace (CS) and denoted by $\mathcal{S}_{Y|X}$. The subspace $\mathcal{S}_{Y|X}$ is minimal in the sense that, for any $\beta$ such that (1) holds, $\mathcal{S}_{Y|X} \leq \mathcal{S}(\beta)$, where the inequality denotes subspace containment. The goal of SDR, in the context of (1), is to find $\beta$ such that $\mathrm{span}(\beta) = \mathcal{S}_{Y|X}$.

However, in many statistical problems, the regression function between $Y$ and $X$, $E(Y|X)$, is the relationship of interest. To preserve the regression function through a low-dimensional linear transformation of the predictor, SDR finds a matrix $\gamma$ such that
$$Y \perp\!\!\!\perp E(Y|X) \mid \gamma^{\top} X. \qquad (2)$$
This was studied by Cook et al. (2002), who showed that (2) is equivalent to $E(Y|X) = E(Y|\gamma^{\top} X)$. The minimal SDR subspace such that (2) holds, referred to as the Central Mean Subspace (CMS), was introduced by Cook et al. (2002), and is denoted by $\mathcal{S}_{E(Y|X)}$. The CMS for $E(Y|X)$ is analogous to the CS for $Y|X$, and so $\mathcal{S}_{E(Y|X)}$ exists under mild conditions as well. We will assume that $\mathcal{S}_{E(Y|X)}$ exists, and the aim of SDR for $E(Y|X)$ is to find $\gamma$ such that $\mathrm{span}(\gamma) = \mathcal{S}_{E(Y|X)}$.

Among the SDR methods for estimating $\mathcal{S}_{E(Y|X)}$, the Outer Product of Gradients (OPG) and the Minimum Average Variance Estimator (MAVE), both proposed by Xia et al. (2002), have maintained wide popularity. But OPG and MAVE are only applicable to continuous scalar responses, and subsequent extensions of MAVE have largely focused on the continuous scalar $Y$ setting (Xia, 2007; Wang and Xia, 2008). These methods are referred to as forward regression approaches for SDR since they rely on regressing $Y$ onto $X$. This is in contrast to inverse regression methods that rely on regressing $X$ onto $Y$. The quintessential inverse method is the Sliced Inverse Regression (SIR) originally introduced by Li (1991).

Our main contribution extends OPG and MAVE to multinomial responses, which include categorical and ordinal variables. We fit a local linear deviance criterion, as in Adragni (2018), but we employ a multivariate link that enables multi-label classification. In binary classification, our multivariate extension of MAVE coincides with Adragni's MADE. We also introduce a clustering-oriented tuning procedure for SDR in classification problems, which avoids the need to select a prediction method for conventional cross-validation.

The rest of the paper is organized as follows: section 2 provides the population-level theoretical motivation for our method. In section 3, we introduce the Outer Product of Canonical Gradients (OPCG) and provide its consistency and convergence rate. We also give our formulation of the
Minimum Average Deviance Estimator (MADE), discuss refinements to OPCG and MADE, and give methods for estimating the dimension $d$. In section 4, we provide the multivariate multinomial and ordinal link functions necessary for our estimator. In addition, we introduce our clustering procedure for tuning SDR methods in classification problems. In section 5, we provide a simulation and some applications to real data sets. We conclude in section 6 with a discussion. All proofs can be found in the online supplementary materials.
2. Population-Level Foundations
Suppose $X \in \Omega_X \subset \mathbb{R}^p$ is a continuous random vector and let $\tilde{Y} \in \{1, \ldots, m\}$ be a class label, where $m$ is the number of classes being considered. Then we can associate $\tilde{Y}$ with a discrete vector $Y = (Y_1, \ldots, Y_m) \in \Omega_Y \subset \{0,1\}^m$. The structure of this correspondence depends on whether we view the label as categorical or ordinal. But we can regard both scenarios as special cases of $Y \sim \mathrm{multinomial}(k, p)$, where $k$ is the number of trials and $p \in [0,1]^m$ is a vector of probabilities such that $\sum_{j=1}^{m} p_j = 1$. For our theoretical developments, the specification of categorical or ordinal $Y$ is not important. Let $Y \in \Omega_Y$ be any discrete random vector for which we may specify a linear exponential family.

Suppose $X \in \Omega_X$ and $Y \in \Omega_Y$ are such that relation (1) holds. Since $Y$ is a discrete random vector, its conditional expectation given $X = x \in \Omega_X$ is
$$E(Y \mid X = x) = p(x) \in [0,1]^m,$$
where $p(x)$ is a vector of conditional probabilities in bijection with the conditional probability mass function of $Y \mid X = x$. Since the conditional p.m.f. uniquely determines the distribution of $Y \mid X$, this implies that $Y$ depends on $X$ solely through its conditional mean. As a result, $\mathcal{S}_{E(Y|X)}$ coincides with $\mathcal{S}_{Y|X}$, as observed in Cook et al. (2002).

Proposition 2.1.
If a random element $Y$ depends on a random element $X$ only through the conditional mean $E(Y|X)$, i.e. $Y \perp\!\!\!\perp X \mid E(Y|X)$, then the Central Mean Subspace coincides with the Central Subspace, i.e. $\mathcal{S}_{E(Y|X)} = \mathcal{S}_{Y|X}$.

Therefore, we can appeal to methods for estimating $\mathcal{S}_{E(Y|X)}$ in order to estimate $\mathcal{S}_{Y|X}$. The advantage of estimators for $\mathcal{S}_{E(Y|X)}$ is that they avoid the Linear Conditional Mean assumption required for many inverse regression estimators of $\mathcal{S}_{Y|X}$ (Li, 1991; Cook and Weisberg, 1991; Li and Wang, 2007). Existing methods that avoid this linearity assumption for estimating $\mathcal{S}_{Y|X}$ are extensions of methods for $\mathcal{S}_{E(Y|X)}$ with continuous scalar $Y$ (Xia, 2007; Wang and Xia, 2008). Our contribution addresses this need for estimating $\mathcal{S}_{E(Y|X)}$ with discrete $Y$, which coincides with estimating $\mathcal{S}_{Y|X}$ by Proposition 2.1.

For a continuous scalar response $Y$, the OPG and MAVE estimators introduced by Xia et al. (2002) are consistent for $\mathcal{S}_{E(Y|X)}$. In the SDR literature, an estimator is unbiased for $\mathcal{S}_{E(Y|X)}$ if it is a member of $\mathcal{S}_{E(Y|X)}$, and is exhaustive if its span generates $\mathcal{S}_{E(Y|X)}$. In OPG, a local linear regression model is used to compute the gradient of the conditional mean, $\partial E(Y \mid X = x)/\partial x$. Li (2018) has shown that the gradient of the conditional mean is unbiased for $\mathcal{S}_{E(Y|X)}$ and, under some conditions, its expected outer product is exhaustive. On the other hand, MAVE re-parametrises the local linear regression model in order to estimate $\beta$ directly. The explicit estimation of $\beta$ contributes to the improved efficiency of MAVE over OPG (Xia, 2006, 2007).

Given the effectiveness of OPG and MAVE, we aim to extend them to multivariate, discrete $Y$ by fitting a local linear generalized linear model. To motivate this approach, let $X = x \in \Omega_X$, $Y = y \in \Omega_Y$, and specify a linear exponential family for $(x, y)$ with negative log-likelihood
$$\ell\{\theta(x); y\} = -\theta(x)^{\top} y + b\{\theta(x)\}, \qquad (3)$$
where $\theta(x) \in \mathbb{R}^m$ is the canonical parameter and $b(\cdot)$ the associated cumulant generating function. Then $X$ relates to $Y$ only through the canonical parameter $\theta(x)$. Under identifiability of $\theta(x)$, the canonical parameter is a function of the conditional mean via the inverse canonical link, $\theta(x) = \dot{b}^{-1}\{\mu(x)\}$, where $\mu(x) = E(Y \mid X = x)$. Since (2) holds, we have $E(Y \mid x) = E(Y \mid \beta^{\top} x)$, which implies $\theta(x) = \tilde{\theta}(\beta^{\top} x)$ and $\partial\theta(x)^{\top}/\partial x = \beta\{\partial\tilde{\theta}(\beta^{\top} x)^{\top}/\partial u\}$, where $u = \beta^{\top} x$. Then the gradient of the canonical parameter is unbiased for $\mathcal{S}_{E(Y|X)}$ and its expected outer product is exhaustive under some conditions.

Proposition 2.2.
Suppose $X \in \Omega_X$ and $Y \in \Omega_Y$ are such that (2) holds, and specify the negative log-likelihood as in (3).

(a) Suppose the canonical parameter $\theta(\cdot): \Omega_X \to \mathbb{R}^m$ is identifiable and continuously differentiable. Then the gradient of the canonical parameter, $\partial\theta(x)^{\top}/\partial x \in \mathbb{R}^{p \times m}$, is unbiased for $\mathcal{S}_{E(Y|X)}$.

(b) In addition, if $u = \beta^{\top} x$ is convexly supported, then the expected outer product of the canonical gradient is exhaustive for $\mathcal{S}_{E(Y|X)}$, i.e.
$$\mathrm{span}\left[ E\left\{ \frac{\partial\theta(X)^{\top}}{\partial x}\, \frac{\partial\theta(X)}{\partial x^{\top}} \right\} \right] = \mathcal{S}_{E(Y|X)}.$$

Proposition 2.2 implies that the $d$ leading eigenvectors of
$$\Lambda = E\left[ \{\partial\theta(X)^{\top}/\partial x\}\{\partial\theta(X)/\partial x^{\top}\} \right],$$
denoted by $\eta \in \mathbb{R}^{p \times d}$, span $\mathcal{S}_{E(Y|X)}$; i.e. $\mathrm{span}(\eta) = \mathcal{S}_{E(Y|X)} = \mathrm{span}(\beta)$. This result is analogous to OPG and motivates our approach for estimating $\mathcal{S}_{E(Y|X)}$. Proposition 2.2 also differentiates our method from GSIM (Lambert-Lacroix and Peyre, 2006), which likewise aims to generalize OPG for classification. GSIM targets the gradient of the conditional mean in order to estimate $\beta$ directly, and requires specifying the link function for the GLM. According to Proposition 2.2, the conditional mean is not necessary for SDR, and we can focus on the canonical parameter instead. By forgoing the conditional mean, our estimation procedure in section 3 is simpler relative to GSIM.

At the population level, we can estimate the canonical gradient at $x \in \Omega_X$ by minimizing the following expected risk function:
$$R_n(a, B) = E\left( K\{(X - x)/h_n\}\left[ -\{a + B^{\top}(X - x)\}^{\top} Y + b\{a + B^{\top}(X - x)\} \right] \right), \qquad (4)$$
where $K\{(X - x)/h_n\}$ is a kernel with bandwidth $h_n$ that depends on $n$. We make the following assumptions for the minima of (4) to be Fisher (i.e. population-level) consistent.

Assumption 1.
The joint density $f(x, y)$ of $(X, Y)$ is twice continuously differentiable with respect to $x$. The sample space $\Omega_X \subset \mathbb{R}^p$ is compact. The random vector $Y$ has finite third moments.
The canonical parameter $\theta$ is identifiable and the negative log-likelihood $\ell(\theta; y) = -\theta^{\top} y + b(\theta)$ is strictly convex in $\theta$, twice continuously differentiable in $\theta$, and has a unique minimum. Furthermore, for each $x \in \Omega_X$, the conditional risk $R(\theta, x) = E\{-\theta^{\top} Y + b(\theta) \mid X = x\}$ is twice continuously differentiable in $\theta$ with unique minimizer $\theta(x)$.
The canonical parameter $\theta(\cdot): \Omega_X \to \Theta \subset \mathbb{R}^m$ is a continuously differentiable function of $x$. The parameter space $\Theta \subset \mathbb{R}^m$ is convex. The parameters $a \in \mathbb{R}^m$ and $B \in \mathbb{R}^{p \times m}$ are identifiable and $(a, B)$ lie in a compact and convex parameter space.
Derivatives and integrals in
$$\frac{\partial}{\partial\theta} \int\!\!\int \ell(\theta; y)\, f(x, y)\, dy\, dx \qquad \text{and} \qquad \frac{\partial}{\partial x} \int \frac{\partial\ell(\theta; y)}{\partial\theta}\, f(y \mid x)\, dy$$
are interchangeable.
The kernel $K$ is symmetric with finite moments. Furthermore, $\int K(u)\, du = 1$, $\int K(u)\, u u^{\top}\, du = I_p$, and $K(u)$ is twice continuously differentiable.

Under these assumptions, the minimizers of (4), denoted by $a(h, x)$ and $B(h, x)$ respectively, are uniformly Fisher consistent for the canonical parameter and its gradient at $x \in \Omega_X$.

Theorem 2.1.
Under Assumptions 1-5, the minimizers $a(h, x)$ and $B(h, x)$ of (4) are Fisher consistent as $h \downarrow 0$, with convergence rates
$$\sup_{x \in \Omega_X} \| a(h, x) - \theta(x) \| = O(h), \qquad \sup_{x \in \Omega_X} \left\| B(h, x) - \frac{\partial\theta(x)^{\top}}{\partial x} \right\|_F = O(h),$$
where $\|\cdot\|$ is the Euclidean norm and $\|\cdot\|_F$ is the Frobenius norm.

Furthermore, by Theorem 2.1 and Proposition 2.2, the $d$ leading eigenvectors of $\Lambda(h) = E\{B(h, X)B(h, X)^{\top}\}$, denoted by $\eta(h) \in \mathbb{R}^{p \times d}$, are consistent for the eigenvectors $\eta$ of $\Lambda$, which span $\mathcal{S}_{E(Y|X)}$.

Theorem 2.2.
Under Assumptions 1-5, as $h \downarrow 0$, $\eta(h)$ is Fisher consistent for $\mathcal{S}_{E(Y|X)}$ with convergence rate
$$\| \eta(h)\eta(h)^{\top} - \eta\eta^{\top} \|_F = O(h).$$

We conclude this section by noting that, while we do not make use of $a(h, x)$ in this paper, these estimates of $\theta(x)$ can be used for prediction, as in Adragni (2018) and Lambert-Lacroix and Peyre (2006), where the latter also uses $a(h, x)$ to estimate $\beta$; their Fisher consistency is therefore desirable in these respects.
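As a concrete illustration of Proposition 2.2 and Theorem 2.2, consider the binary, single-index special case; this worked example is ours and only restates the results in the simplest setting. With $m = 2$ and $d = 1$, the linear exponential family (3) is the Bernoulli family with scalar canonical parameter $\theta(x) = \mathrm{logit}\{P(Y = 1 \mid X = x)\}$ and $b(\theta) = \log(1 + e^{\theta})$. If $E(Y \mid X) = E(Y \mid \beta^{\top} X)$ for some $\beta \in \mathbb{R}^p$, then $\theta(x) = \tilde{\theta}(\beta^{\top} x)$, and the chain rule gives
$$\frac{\partial\theta(x)}{\partial x} = \beta\,\dot{\tilde{\theta}}(\beta^{\top} x) \in \mathrm{span}(\beta), \qquad \Lambda = E\{\dot{\tilde{\theta}}(\beta^{\top} X)^2\}\,\beta\beta^{\top},$$
so every canonical gradient is unbiased for $\mathcal{S}_{E(Y|X)}$, and the eigenvector associated with the single nonzero eigenvalue of $\Lambda$ recovers $\mathrm{span}(\beta)$, provided $\dot{\tilde{\theta}}$ is not identically zero.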
3. Empirical Estimation of the Central Mean Subspace, $\mathcal{S}_{E(Y|X)}$

Suppose $X_n$ and $Y_n$ denote a random sample from $X$ and $Y$ respectively, where $X, Y$ satisfy (2) for some $\beta \in \mathbb{R}^{p \times d}$. Motivated by our population-level developments in section
2, we fit an empirical version of (4) to the sample. This corresponds to fitting a local linear GLM with full negative log-likelihood
$$\ell(a_1, \ldots, a_n, B_1, \ldots, B_n; Y_n, X_n) = \sum_{j=1}^{n} \ell_j(a_j, B_j; Y_n, X_n), \qquad (5)$$
where the negative local log-likelihood $\ell_j(a_j, B_j; Y_n, X_n)$ is
$$\ell_j(a_j, B_j; Y_n, X_n) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Y_i + b\{a_j + B_j^{\top}(X_i - X_j)\} \right], \qquad (6)$$
with $w_{ij} = K\{(X_i - X_j)/h\} / \sum_{i=1}^{n} K\{(X_i - X_j)/h\}$ being a normalized kernel weight. We will suppress the dependence of likelihoods on $X_n$ and $Y_n$, when convenient, for brevity in the remainder of this paper. Minimizing (5), we obtain estimates $\hat{B}_j$, for $j = 1, \ldots, n$. To show consistency of the estimation, we make an additional assumption on the bandwidth, $h$.

Assumption 6.
The bandwidth satisfies $h \propto n^{-\alpha}$ for $0 < \alpha \leq \cdots$, where $\cdots > (p+4)s/(s - \cdots)$ for $s > \cdots$.
Under Assumptions 1-6, for fixed $X_j \in \Omega_X$, the estimates $\hat{a}_j$ and $\hat{B}_j$ are uniformly consistent for the canonical parameter and its gradient at $X_j$, respectively, with convergence rates
$$\sup_{X_j \in \Omega_X} \| \hat{a}_j - \theta(X_j) \| = O_{a.s.}(h + \delta_{ph}), \qquad \sup_{X_j \in \Omega_X} \left\| \hat{B}_j - \frac{\partial\theta(X_j)^{\top}}{\partial x} \right\|_F = O_{a.s.}(h + h^{-1}\delta_{ph}),$$
where $\delta_{ph} = \sqrt{\log n/(n h^p)} \to 0$, $h \downarrow 0$, and $h^{-1}\delta_{ph} \to 0$ as $n \to \infty$.

Using these consistent estimates of the canonical gradients, we construct the average outer product of the $\hat{B}_j$ to obtain the candidate matrix for $\mathcal{S}_{E(Y|X)}$, $\hat{\Lambda}_n = n^{-1}\sum_{j=1}^{n} \hat{B}_j \hat{B}_j^{\top}$. The matrix $\hat{\Lambda}_n$ consistently estimates $\Lambda$, and the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $\hat{\Lambda}_n$, denoted by $\hat{\eta} \in \mathbb{R}^{p \times d}$, are consistent estimates for a basis that spans $\mathcal{S}_{E(Y|X)}$. We denote $\hat{\beta}_{\mathrm{opcg}} = \hat{\eta}$ as the Outer Product of Canonical Gradients (OPCG) estimator.
Theorem 3.2.
Under Assumptions 1-6, the Outer Product of Canonical Gradients estimator is consistent for a basis of the Central Mean Subspace, $\mathcal{S}_{E(Y|X)} = \mathrm{span}(\beta)$, and has convergence rate
$$\| \hat{\beta}_{\mathrm{opcg}} \hat{\beta}_{\mathrm{opcg}}^{\top} - \beta\beta^{\top} \|_F = O_{a.s.}(h + h^{-1}\delta_{ph} + \delta_n),$$
where $\delta_{ph} = \sqrt{\log n/(n h^p)} \to 0$, $\delta_n = \sqrt{\log n/n} \to 0$, $h \downarrow 0$, and $h^{-1}\delta_{ph} \to 0$ as $n \to \infty$.

Similar to the population-level scenario, we do not make use of the consistent estimates $\hat{a}_1, \ldots, \hat{a}_n$, but they can be used for prediction and estimation, as in Adragni (2018) and Lambert-Lacroix and Peyre (2006).

Since the canonical gradient satisfies the relation $\partial\theta(x)^{\top}/\partial x = \beta\{\partial\tilde{\theta}(\beta^{\top} x)^{\top}/\partial u\}$, we can re-parametrise the local linear GLM in (5) via $B_j \mapsto \beta\tilde{B}_j$, where $\tilde{B}_j \in \mathbb{R}^{d \times m}$, for $j = 1, \ldots, n$. For convenience, we overload the notation $B_j$ to mean $B_j \in \mathbb{R}^{p \times m}$ in OPCG and $B_j = \tilde{B}_j \in \mathbb{R}^{d \times m}$ in MADE. The re-parametrised full negative log-likelihood is now
$$\ell(a_1, \ldots, a_n, B_1, \ldots, B_n, \beta; Y_n, X_n) = \sum_{j=1}^{n} \ell_j(a_j, B_j, \beta; Y_n, X_n), \qquad (7)$$
with the local negative log-likelihood $\ell_j(a_j, B_j, \beta; Y_n, X_n)$ given by
$$\ell_j(a_j, B_j, \beta) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}\beta^{\top}(X_i - X_j)\}^{\top} Y_i + b\{a_j + B_j^{\top}\beta^{\top}(X_i - X_j)\} \right], \qquad (8)$$
where $w_{ij} = K\{(X_i - X_j)/h\}/\sum_{i=1}^{n} K\{(X_i - X_j)/h\}$ is a normalized kernel weight. Minimizing (7) iteratively between $a_1, \ldots, a_n, B_1, \ldots, B_n$ and $\beta$ until convergence, we obtain the estimate of $\beta$ directly, which we denote by $\hat{\beta}_{\mathrm{made}}$ and refer to as the Minimum Average Deviance Estimator (MADE) for $\mathcal{S}_{E(Y|X)}$. Since $\beta$ in (8) is identifiable only up to orthogonal transformations, the $\beta$ solutions to the minimization of (7) are not unique.

When estimating OPCG and MADE, we use a standard Gaussian kernel for the normalized weights $w_{ij} = K\{(X_i - X_j)/h\}/\sum_{i=1}^{n} K\{(X_i - X_j)/h\}$, with a bandwidth $h$. For the Gaussian kernel to be more appropriate, we standardize the predictors so that, under spherical kernels, all directions in the predictor space are treated similarly. The choice of bandwidth, $h$, is discussed in section 4.

We can improve OPCG and MADE by refining the kernel weights in (6) and (8), similar to the refinement schemes of Xia et al. (2002) and Xia (2007). By replacing $w_{ij}$ with $\tilde{w}_{ij} = K\{\hat{\beta}_{\mathrm{opcg}}^{\top}(X_i - X_j)/h\}$ in (5) for OPCG, or with $\tilde{w}_{ij} = K\{\hat{\beta}_{\mathrm{made}}^{\top}(X_i - X_j)/h\}$ in (7) for MADE, we can more accurately estimate $\mathcal{S}_{E(Y|X)}$. To obtain the refined estimators, we use the refined likelihoods to obtain new estimates of $\beta$. Then we calculate the new refined weights and substitute them back into the refined likelihood. We repeat this refinement process until some convergence criterion for the estimates of $\beta$ is achieved. The resulting estimators are referred to as the refined OPCG estimator, $\hat{\beta}_{\mathrm{ropcg}}$, and the refined MADE estimator, $\hat{\beta}_{\mathrm{rmade}}$, respectively. The refined estimators are expected to be more accurate than their unrefined counterparts.

We have developed our method under the assumption that $d$, the dimension of $\mathcal{S}_{E(Y|X)}$, is known. In practice, we need to estimate $d$ in order to construct a basis for $\mathcal{S}_{E(Y|X)}$. An advantage of OPCG is that we can appeal to contemporary methods based on the variation of eigenvalues and eigenvectors.
In particular, we estimate $d$ for OPCG using the Ladle and Predictor Augmentation methods for order determination, introduced by Luo and Li (2016, 2020). The Ladle estimator combines information extracted from the bootstrap variation in the eigenvectors and eigenvalues of $\hat{\Lambda}_n$ in order to determine the dimension of $\mathcal{S}_{E(Y|X)}$. The Predictor Augmentation estimator combines information extracted from the variation in the eigenvectors and eigenvalues of $\hat{\Lambda}_n$ that results from augmenting the predictors with irrelevant noisy components.

Since MADE estimates $\beta$ directly without the use of eigenvalues and eigenvectors, the Ladle and Predictor Augmentation methods are not applicable. To estimate the dimension $d$ in the context of MADE, we can use a cross-validation approach similar to Xia et al. (2002), a k-fold version of which is outlined in Adragni (2018).
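To summarize the OPCG construction of this section in code form, the following is a minimal sketch, not the authors' implementation: it specializes $b(\cdot)$ to the categorical (multinomial-logit) link introduced in section 4, fits each local GLM in (6) with a generic quasi-Newton solver rather than the Fisher-scoring/hybrid conjugate gradient scheme used later, assumes a Gaussian kernel and integer labels coded $0, \ldots, m-1$ (the last class as reference), and all function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def local_nll_and_grad(params, C, Y, w, m):
    """Weighted negative local log-likelihood (6) under the multinomial-logit link,
    and its gradient.  C: (n, p) centered predictors X_i - X_j;  Y: (n, m-1) indicator
    responses (the reference class is the all-zero row);  w: (n,) kernel weights."""
    n, p = C.shape
    a = params[:m - 1]
    B = params[m - 1:].reshape(p, m - 1)
    eta = a + C @ B                                    # canonical parameters, (n, m-1)
    full = np.hstack([eta, np.zeros((n, 1))])          # append the reference category
    mx = full.max(axis=1, keepdims=True)
    lse = np.log(np.exp(full - mx).sum(axis=1)) + mx[:, 0]   # log(1 + sum_j e^{eta_j})
    nll = np.sum(w * (-(eta * Y).sum(axis=1) + lse))
    P = np.exp(eta - lse[:, None])                     # fitted class probabilities
    R = w[:, None] * (P - Y)                           # weighted "residuals"
    return nll, np.concatenate([R.sum(axis=0), (C.T @ R).ravel()])

def opcg(X, labels, h, d, m):
    """Outer Product of Canonical Gradients: average the outer products of the local
    gradient estimates and return the top-d eigenvectors as the basis estimate."""
    n, p = X.shape
    Y = np.eye(m)[labels][:, :m - 1]                   # labels coded 0,...,m-1
    Lam = np.zeros((p, p))
    for j in range(n):
        C = X - X[j]
        w = np.exp(-0.5 * np.sum((C / h) ** 2, axis=1))    # Gaussian kernel weights
        w /= w.sum()
        x0 = np.zeros((m - 1) * (p + 1))
        fit = minimize(local_nll_and_grad, x0, args=(C, Y, w, m),
                       jac=True, method="L-BFGS-B")
        B_j = fit.x[m - 1:].reshape(p, m - 1)          # estimated canonical gradient
        Lam += B_j @ B_j.T / n
    evals, evecs = np.linalg.eigh(Lam)
    return evecs[:, np.argsort(evals)[::-1][:d]]       # basis estimate, p x d
```

In practice one would first standardize the predictors and could add the weight refinement described above; the same loop also yields the local intercepts $\hat{a}_j$ if prediction is of interest.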
4. Classification and Tuning
To construct the OPCG and MADE estimates, we need to specify the cumulant generating function $b(\cdot)$ in (6) and (8) for the local linear GLMs. The first and second derivatives of $b(\cdot)$ can be used to construct the score vector and information matrix needed for estimating OPCG and MADE via Fisher scoring (McCullagh and Nelder, 1989). In order to implement optimization algorithms for OPCG and MADE, we need to construct good initial values, which can be done using the inverse canonical link function, as in conventional GLM estimation. For categorical classification, we employ a multivariate logistic link function, while for ordinal classification, we employ a multivariate canonical link function corresponding to the Adjacent-Categories Logit Model (Agresti, 2010, 2013).

For a given number of categories $m$, a categorical response $\tilde{Y} = l \in \{1, \ldots, m\}$ can be represented by $Y = (Y_1, \ldots, Y_m) \in \{0,1\}^m$, where each component represents a category. The entries of $Y$ are $Y_j = 0$ for $j \neq l$ and $Y_l = 1$. Since categorical responses are a special case of $Y \sim \mathrm{multinomial}(k, p)$ with $k = 1$, we present the multivariate logistic link for general $k$. Since the response and probabilities are constrained by $\sum_{j=1}^{m} Y_j = k$ and $\sum_{j=1}^{m} p_j = 1$, respectively, without loss of generality we set $Y_m = k - \sum_{j=1}^{m-1} Y_j$ and $p_m = 1 - \sum_{j=1}^{m-1} p_j$. For convenience of notation, let $Y = (Y_1, \ldots, Y_{m-1})$ and $p = (p_1, \ldots, p_{m-1})$ be the unconstrained response and probabilities, respectively, for the remainder of this section. As an example, when we have $m = 3$ categories, the observation $\tilde{Y} = 2$ is equivalent to $Y = (0, 1)$, and $\tilde{Y} = 3$ is equivalent to $Y = (0, 0)$.

Suppose $X$ and $Y$ satisfy (2), so that $X$ relates to $Y$ only through the conditional mean, which corresponds to the conditional probabilities $p = p(X) \in [0,1]^{m-1}$. Let $1$ be a vector of ones with length determined by the context. Then, by our multinomial specification, the canonical parameter is the multivariate logistic transformation of the conditional mean
$$\theta(X) = \log\left\{ \frac{p(X)}{1 - 1^{\top} p(X)} \right\}.$$
With some algebra and the Sherman-Morrison-Woodbury formula, the inverse logistic transform is
$$p(X) = \frac{e^{\theta(X)}}{1 + 1^{\top} e^{\theta(X)}}.$$
This implies that the cumulant generating function is $b\{\theta(X)\} = k \log\{1 + 1^{\top} e^{\theta(X)}\}$, which satisfies the following relations:
$$\frac{\partial b\{\theta(X)\}}{\partial\theta} = k\, p(X), \qquad \frac{\partial^2 b\{\theta(X)\}}{\partial\theta\,\partial\theta^{\top}} = k\left[ \mathrm{Diag}\left( \frac{e^{\theta(X)}}{1 + 1^{\top} e^{\theta(X)}} \right) - \frac{e^{\theta(X)}(e^{\theta(X)})^{\top}}{[1 + 1^{\top} e^{\theta(X)}]^2} \right],$$
where $\mathrm{Diag}(\cdot)$ denotes a diagonal matrix with the argument specifying the diagonal entries. Substituting the cumulant generating function for categorical $Y$ into (6), the negative local log-likelihood about $X_j$, for OPCG, becomes
$$\ell_j(a_j, B_j) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Y_i + \log[1 + 1^{\top}\exp\{a_j + B_j^{\top}(X_i - X_j)\}] \right], \qquad (9)$$
where $k = 1$ for categorical $Y_i$, for all $i = 1, \ldots, n$. The substitution into (8) produces a similar likelihood for MADE.
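These derivatives translate directly into the score and information quantities used in Fisher scoring, and the inverse link supplies starting values. Below is a small sketch of those building blocks under the multivariate logistic link; the function names are ours and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def categorical_cumulant(theta, k=1):
    """b(theta) = k*log(1 + 1'e^theta) with gradient k*p(theta) and Hessian
    k*(diag(p) - p p'); theta is the (m-1)-vector of canonical parameters."""
    e = np.exp(theta)
    denom = 1.0 + e.sum()
    p = e / denom                              # inverse multivariate logistic link
    b = k * np.log(denom)
    grad = k * p                               # = db/dtheta
    hess = k * (np.diag(p) - np.outer(p, p))   # = d^2 b / dtheta dtheta'
    return b, grad, hess

def initial_theta(p_hat):
    """Inverse canonical link theta = log{ p / (1 - 1'p) }, e.g. applied to smoothed
    class proportions to produce starting values for Fisher scoring."""
    return np.log(p_hat / (1.0 - p_hat.sum()))
```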
Suppose $\tilde{Y} = k \in \{1, \ldots, m\}$ is an ordinal response for $m$ given ordered categories. Then $\tilde{Y}$ can be represented as a vector $Y = (Y_1, \ldots, Y_m) \in \{0,1\}^m$, with entries $Y_j = 1$ for $j \leq k$ and $Y_j = 0$ for $k < j \leq m$. For example, with $m = 3$ categories, the ordinal response $\tilde{Y} = 2$ is equivalent to $Y = (1, 1, 0)$. We then transform $Y$ to a cumulative response, as in McCullagh (1980). This will provide us with the canonical linear exponential family form needed for OPCG and MADE.

Let $R_s = \sum_{l=1}^{s} Y_l$ be the cumulative response up to category $s$, and $\gamma_s = \sum_{l=1}^{s} p_l$ be the cumulative probability up to category $s$, for $s = 1, \ldots, m$. Let $R = (R_1, \ldots, R_m) = (Y_1, Y_1 + Y_2, \ldots, k)$ and $\gamma = (\gamma_1, \ldots, \gamma_m) = (p_1, p_1 + p_2, \ldots, 1)$. Then $Y_s = R_s - R_{s-1}$, with $R_1 = Y_1$ and $R_m = k$, and $p_s = \gamma_s - \gamma_{s-1}$, with $\gamma_1 = p_1$ and $\gamma_m = 1$, where we set $R_0 = 0$ and $\gamma_0 = 0$. Then we can associate every ordinal $Y$ with $R$, and re-write the multinomial log-likelihood for ordinal $Y$ as
$$\ell(\gamma; Y_n) = \sum_{s=1}^{m-1} \left[ R_s \log\left( \frac{\gamma_s}{\gamma_{s+1}} \right) + (R_{s+1} - R_s) \log\left( \frac{\gamma_{s+1} - \gamma_s}{\gamma_{s+1}} \right) \right].$$
This formulation of the ordinal response model is canonical and is referred to as the Adjacent-Categories Logit Model for cumulative probabilities (Agresti, 2010, 2013). Note that the conditional mean of $R$ is $\gamma(X) = E(R \mid X)$. This implies that the canonical parameter, as a function of the predictor $X$, is $\theta(X) = (\theta(X)_1, \ldots, \theta(X)_{m-1})$, with entries
$$\theta_s(X) = k \log\left( \frac{\gamma(X)_s - \gamma(X)_{s-1}}{\gamma(X)_{s+1} - \gamma(X)_s} \right).$$
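The cumulative representation and the canonical parameter just defined (named the Adjacent-Categories transformation in the next paragraph) are straightforward to compute; the following is a hedged sketch with our own function names, written for a single trial per observation unless $k$ is supplied.

```python
import numpy as np

def cumulative_response(label, m, k=1):
    """Map an ordinal label in {1,...,m} to Z = (R_1,...,R_{m-1})/k, where
    R_s = sum_{l<=s} Y_l is the cumulative response under the coding Y_j = 1{j <= label}."""
    Y = (np.arange(1, m + 1) <= label).astype(float)
    R = np.cumsum(Y)
    return R[:m - 1] / k

def adcat_theta(gamma, k=1):
    """Adjacent-Categories canonical parameter from the cumulative probabilities
    gamma = (gamma_1,...,gamma_{m-1}):
    theta_s = k*log{(gamma_s - gamma_{s-1}) / (gamma_{s+1} - gamma_s)},
    with gamma_0 = 0 and gamma_m = 1."""
    g = np.concatenate([[0.0], gamma, [1.0]])
    p = np.diff(g)                       # category probabilities p_1, ..., p_m
    return k * np.log(p[:-1] / p[1:])    # equals k*log(p_s / p_{s+1}), s = 1..m-1
```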
We refer to the transformation from $\gamma(X)$ to $\theta(X)$ as the Adjacent-Categories (Ad-Cat) transformation. The inverse Ad-Cat transformation is
$$\gamma(X) = \left[ I_{m-1} - \frac{\tau\{\theta(X)\}\, e_{m-1}^{\top}}{1 + e_{m-1}^{\top}\tau\{\theta(X)\}} \right] \tau\{\theta(X)\},$$
where $\tau(X) \in \mathbb{R}^{m-1}$ has entries $\tau_s = \sum_{r=1}^{s} \exp\left\{ \sum_{t=r}^{m-1} (\theta(X)_t/k) \right\}$, and $e_{m-1}$ is a vector of 0's with a 1 in the $(m-1)$th position. The cumulant generating function $b(\cdot)$ is $b\{\theta(X)\} = -k \log[1 - e_{m-1}^{\top}\gamma\{\theta(X)\}]$, and it satisfies the following relations, as in McCullagh (1980):
$$\frac{\partial b\{\theta(X)\}}{\partial\theta} = \gamma(X), \qquad \frac{\partial^2 b\{\theta(X)\}}{\partial\theta\,\partial\theta^{\top}} = \frac{1}{k}\left[ \Gamma(X) - \gamma(X)\gamma(X)^{\top} \right],$$
where $\Gamma(X)$ is the $(m-1)\times(m-1)$ matrix with $(s, t)$ entry $\gamma_{\min(s,t)}(X)$, that is,
$$\Gamma(X) = \begin{pmatrix} \gamma_1(X) & \gamma_1(X) & \cdots & \gamma_1(X) \\ \gamma_1(X) & \gamma_2(X) & \cdots & \gamma_2(X) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_1(X) & \gamma_2(X) & \cdots & \gamma_{m-1}(X) \end{pmatrix}.$$
Since $R$ is constrained by $R_m = k$, we consider the unconstrained vector $(R_1, \ldots, R_{m-1})$ divided by $k$. Let $Z = (Z_1, \ldots, Z_{m-1})$, with entries $Z_s = R_s/k$ for $s = 1, \ldots, m-1$. Substituting the ordinal cumulant generating function into (6), the negative local log-likelihood about $X_j$, for OPCG, becomes
$$\ell_j(a_j, B_j) = \sum_{i=1}^{n} w_{ij}\left[ -\{a_j + B_j^{\top}(X_i - X_j)\}^{\top} Z_i - k \log\left(1 - e_{m-1}^{\top}\gamma[\{a_j + B_j^{\top}(X_i - X_j)\}]\right) \right], \qquad (10)$$
where $k$ is the observed ordinal category of $\tilde{Y}_i$, for all $i = 1, \ldots, n$. The substitution into (8) produces a similar likelihood for MADE.

Conventional means of selecting the bandwidth parameter $h$ typically involve minimizing out-of-sample prediction error, or choosing an optimal $h$ that minimizes a Mean Integrated Squared Error criterion; see Wand and Jones (1994), Silverman (2018) and Fan et al. (1995). Cross-validation depends on the prediction method, and the results can vary between methods. Optimal bandwidths are the prevailing choice for OPG, MAVE, and their extensions (Xia et al., 2002; Xia, 2006, 2008; Wang and Xia, 2008). Some examples include
$h = c\, n^{-1/(p_0+6)}$, where $p_0 = \max(p, 3)$ (Xia, 2006), and $h = n^{-1/(p+4)}$ (Wang and Xia, 2008). The suggested constant in the former is $c = 2.34$ when $K(\cdot)$ is the Epanechnikov kernel (Wand and Jones, 1994; Xia, 2007), while the constant in the latter is set to 1. This implies that the constant requires tuning when the recommended values perform poorly.

To determine the tuning parameter for the weights, $w_{ij}$, in (6) for OPCG and in (8) for MADE, we propose a k-means clustering procedure. The motivating principle behind this data-driven bandwidth selection procedure is that the tuning parameter, $h$, should be selected so as to make prediction "easier". Denoting by $\hat{\beta}(h)$ the estimator of $\beta$ corresponding to bandwidth $h$, the aim is to choose $h$ so that the sufficient predictors constructed on some tuning set, $\{X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$, provide the smallest prediction error. In the context of classification, we expect this prediction error to be minimized when the sufficient predictors $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$ cluster into distinct groups corresponding to their class labels.

Suppose we have $m$ classes and want to select $h$ such that the most distinct $m$ clusters are formed. To achieve this, for each $h$, we implement a k-means clustering algorithm on the sufficient predictors $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$ in order to estimate $m$ clusters, and compute the corresponding ratio of the within-cluster sum of squares, $WSS_{km}(h)$, to the between-cluster sum of squares, $BSS_{km}(h)$. The bandwidth $h_{km}$ corresponding to the smallest ratio is chosen as our bandwidth estimate, i.e. $h_{km} = \arg\min_{h>0}\{WSS_{km}(h)/BSS_{km}(h)\}$. Some drawbacks of this approach include assuming there is only one cluster per class label and not using the label information from the tuning set. A quick remedy for the former issue is to estimate a pre-specified total number of clusters instead of just $m$ clusters. We refer to this adjustment as our unsupervised k-means tuning procedure.

To incorporate the class label information available in the tuning set, we modify the k-means tuning approach. Fix $h > 0$ and, for each class $l \in \{1, \ldots, m\}$, let $\{X_i^{\mathrm{tune}}: i = 1, \ldots, n_l\}$ denote the sample of predictors in the tuning set that belong to class $l$. We estimate $c_l$ clusters for $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n_l\}$ using k-means. For the estimated $c_l$ clusters, we compute $WSS_l(h)$ and $BSS_l(h)$. Then, for each $h$ and pre-specified numbers of clusters per class, $c_1, \ldots, c_m$, we have $WSS_1(h), \ldots, WSS_m(h)$ and $BSS_1(h), \ldots, BSS_m(h)$. We define the supervised between and within sums of squares as $SBSS(h) = \sum_{l=1}^{m} BSS_l(h)$ and $SWSS(h) = \sum_{l=1}^{m} WSS_l(h)$, respectively, and choose the bandwidth that minimizes the ratio $SWSS(h)/SBSS(h)$, i.e. $h_{skm} = \arg\min_{h>0}\{SWSS(h)/SBSS(h)\}$. We refer to $h_{skm}$ as the supervised k-means bandwidth. While this approach uses the label information in the tuning set, we need to pre-specify the number of clusters per class, $c_l$, beforehand. In our applications, this tuning procedure seems robust to the choice of $c_l$, which we set to a constant $c_l = 1$, 2 or 3 across all $l$. If we specify $c_l \equiv 1$, then
$WSS_l(h)$ is the sum of squares of a single cluster and $BSS_l(h) \equiv 0$. In this case, we consider the estimated center for each label $l$, denoted $C_l$, and let $SBSS(h)$ be the sum of squares of the estimated centers $C_1, \ldots, C_m$.

To combine the unsupervised and supervised k-means tuning methods, we choose the $h$ that minimizes the average of the two standardized ratios above, i.e. $h_{wkm} = \arg\min_{h>0}\{WSS(h)/BSS(h) + SWSS(h)/SBSS(h)\}$. We refer to $h_{wkm}$ as the weighted k-means bandwidth. A further improvement to the tuning algorithm is to perform the procedure k-fold on the training data. This avoids the need for an external tuning set. We refer to the k-fold, weighted k-means procedure simply as our k-fold k-means tuning.

For the intuition behind this tuning procedure, consider the unsupervised k-means scenario, where there are $m$ classes/clusters to estimate, and let $K(\cdot)$ be the uniform kernel, i.e. balls of radius $h$. When $h$ is small, the local linear approximation to the log-likelihood is rough and the gradient estimates will be noisy due to a lack of observations in the local neighbourhood. If $h$ is small enough that only one other point is in the neighbourhood, then the canonical gradient estimate is effectively random, and so are the corresponding estimated SDR directions. Then the constructed sufficient predictors, $\{\hat{\beta}(h)^{\top} X_i^{\mathrm{tune}}: i = 1, \ldots, n\}$, are essentially random noise with large $WSS(h)$ and small $BSS(h)$, making the ratio large.

When $h$ is large, the local linear approximation to the log-likelihood is over-smoothed, and important features may be lost. As a result, the dimension reduction may not capture all the relevant directions, meaning only some classes may be separated. If $h$ is large enough that the entire sample is in every neighbourhood, then the canonical gradient is just the slope coefficient in a conventional GLM, and OPCG reduces to fitting the same GLM re-centered at the various observed points $X_j$. Since re-centering does not affect the slope estimate, this provides at most one gradient. Then OPCG will be able to estimate at most one direction in $\mathcal{S}_{E(Y|X)}$, which is not exhaustive if $d > 1$. We end up estimating a proper subspace of the central mean space when $h$ grows too large, and the sufficient predictors will only separate some classes while confounding others. The ideal choice of $h$ should be large enough to effectively estimate all the SDR directions, so that all the classes are clearly separated from each other, and this is achieved through clustering algorithms, such as k-means.
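The bandwidth search just described is straightforward to code. The sketch below combines the unsupervised and supervised criteria by simply summing the two ratios (one possible reading of the weighting above); it relies on scikit-learn's KMeans, assumes integer-coded class labels and at least two clusters per class so that $BSS_l(h) > 0$, and all names (`wss_bss`, `kmeans_bandwidth`, the `fit` argument) are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def wss_bss(Z, n_clusters):
    """Within- and between-cluster sums of squares for a k-means fit on Z."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(Z)
    wss = km.inertia_
    tss = np.sum((Z - Z.mean(axis=0)) ** 2)
    return wss, tss - wss

def kmeans_bandwidth(fit, X_train, y_train, X_tune, y_tune, h_grid, d, m,
                     clusters_per_class=2):
    """Choose h minimizing the combined unsupervised + supervised WSS/BSS ratio of
    the tuning-set sufficient predictors; `fit` is any routine returning a p x d
    basis estimate for a given bandwidth (e.g. the opcg sketch in section 3)."""
    scores = []
    for h in h_grid:
        beta = fit(X_train, y_train, h, d, m)
        Z = X_tune @ beta                                # sufficient predictors
        wss_u, bss_u = wss_bss(Z, n_clusters=m)          # unsupervised criterion
        wss_s = bss_s = 0.0
        for l in range(m):                               # supervised criterion
            wl, bl = wss_bss(Z[y_tune == l], clusters_per_class)  # needs >= 2 clusters
            wss_s, bss_s = wss_s + wl, bss_s + bl
        scores.append(wss_u / bss_u + wss_s / bss_s)
    return h_grid[int(np.argmin(scores))]
```

A k-fold variant would repeat this over folds of the training data and average the criterion before minimizing, removing the need for a separate tuning set.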
5. Simulations and Data Analysis
We mainly study the performance of OPCG, but consider MADE in our simulations. This is due to the non-strict convexity of MADE and its computational cost for even moderate-sized $p$. OPCG has unique gradient estimates (under identifiability) and can be computed in parallel to mitigate the computational cost issue. Adragni (2018) uses Grassmannian optimization methods to estimate MADE for scalar $Y$, which may be faster for moderate-sized $p$. From our simulations, we will see that the MADE estimates are comparable to OPCG, similar to observations made by Xia (2006) and Wang and Xia (2008). Therefore, our focus on OPCG will also provide insight into the performance of MADE in applications.

For our simulations and examples, we compare OPCG to the classical inverse regression method, Sliced Inverse Regression (SIR) (Li, 1991), and to the more recent Directional Regression (DR) (Li and Wang, 2007). In the simulations, we include the pairwise suggestion by Adragni (2018) for dealing with multi-label classification. For applications, we analyze the USPS handwritten digits dataset (Hull, 1994), as well as two data sets from the UCI database: wine quality (Cortez et al., 2009) and ISOLET (Fanty and Cole, 1990). The wine quality data is considered as an ordinal classification problem, while the simulations and remaining applications are categorical classification problems. For the USPS handwritten digits and ISOLET data, we include some results verbatim from Fukumizu and Leng (2014) for comparison. When the sample covariance matrix is ill-behaved, we use a Tikhonov-regularized covariance for standardizing SIR and DR, similar to Zhong et al. (2005). For the order determination of OPCG, we use the Ladle and Predictor Augmentation estimators in our simulations and in all our applications except ISOLET, due to computational time.

Given the cumulant generating function and its derivatives, we can estimate OPCG and MADE via traditional Fisher scoring (McCullagh and Nelder, 1989). However, to improve the computational speed of our methods, we use the hybrid conjugate gradient (hCG) algorithm of Dai and Yuan (2001) to minimize the full negative log-likelihoods (5) and (7), with the appropriate canonical links (9) or (10). Our implementation of the hybrid conjugate gradient performs well without needing to enforce the weak Wolfe conditions outlined in Dai and Yuan (2001), and this improves the computational speed of the minimization. A backtracking rule is used to determine the step size for the hCG algorithm, as in Bertsekas (2015, page 69). Without enforcing the weak Wolfe conditions, global convergence of the hCG algorithm is not guaranteed, but we suspect that the initial values available from the inverse canonical link functions suffice to start the estimation close to the minima.

For our simulation study, we consider the following five bivariate normal distributions: $(X_1, X_2) \sim N(\mu_c,\; \cdot \times I)$, $\mu_c \in \{(0, \cdot), (3, \cdot), (-\cdot, -\cdot), (-\cdot, \cdot), (2, -\cdot)\}$. For each $\mu_c$, we sample 50 observations for our training set, 30 for tuning and 30 for testing, giving a total of 250, 150 and 150 observations for training, tuning and testing, respectively. The 5 clusters in each set are then labeled into 3 classes, with one class being one cluster and the other two classes each comprising two clusters; see Figure 1.

Fig. 1.
The true directions and class labels for training, tuning and testing, from left to right, respectively. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3.

We generate 8 standard normal variables as noise, so that $p = 10$, and shuffle the 10 predictors, so that a basis for $\mathcal{S}_{E(Y|X)}$ is a matrix $\beta \in \mathbb{R}^{10 \times 2}$ whose two columns are standard basis vectors marking the positions of the two informative predictors; thus $\mathcal{S}_{E(Y|X)}$ has dimension $d = 2$. The k-means tuning procedures outlined in section 4 are illustrated in Figure 2; in particular, we demonstrate the performance of our unsupervised k-means, supervised k-means, weighted k-means and k-fold k-means, with the pre-specified number of clusters per class set to $c_l = 2$ and the dimension of the OPCG estimates set to $d = 2$. The sufficient predictors from the non-k-fold methods are constructed with the tuning set, and the behaviour of these sufficient predictors for various $h$ is presented in Figure 2a. The plots of the within to between sums of squares ratios are given in Figure 2b. When $h$ is small, the sufficient predictors are effectively random noise, and when $h$ is large, we only recover one SDR direction, which separates class 2 from classes 1 and 3; classes 1 and 3 are confounded. The estimated bandwidth from the tuning set is $h \approx$
.24. The plot for 3-fold k-means is also given in Figure 2b, with the estimated bandwidth being $h \approx$ .96. We use the 3-fold k-means estimate for the bandwidth, since the k-fold estimate is what we use in our data applications. Note that the suggested constant for the optimal bandwidth gives $h_{\mathrm{opt}} = 2.34\, n^{-1/(p_0+6)} = 1.65$, which is larger than all our estimates. Figure 2a suggests that this choice of optimal bandwidth would perform poorly at separating classes 1 and 3 from class 2.

Fig. 2. (a) Sufficient predictors for different bandwidths $h$. From left to right: the sufficient predictors constructed from the tuning set, for four increasing values of $h$. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3. (b) K-means tuning procedures for $h$: the ratios of within over between sums of squares for k-means, supervised k-means, weighted semi-supervised k-means and k-fold supervised k-means, from left to right, respectively. The four red points correspond to the four values of $h$ plotted in panel (a). The plot for 3-fold k-means does not have these points because its estimates are constructed by folding the training set, not from the tuning set.

To estimate the dimension of $\mathcal{S}_{E(Y|X)}$, we use the Ladle and Predictor Augmentation plots for OPCG with bandwidth $h = 1$. The Ladle and Predictor Augmentation plots are given in Figure 3. The Ladle estimator estimates the dimension as 3, though the ladle values for dimensions 2 and 3 are very close. The Predictor Augmentation method correctly detects the dimension as 2.

Fig. 3. (a) The Ladle plot estimates $\hat{d} = 3$; (b) the Predictor Augmentation plot estimates $\hat{d} = 2$.

For the initial values of $\beta$ in MADE, we compare using $(e_1, e_2)$, where $e_i \in \mathbb{R}^p$ has a 1 in the $i$th position and 0 elsewhere, and using OPCG, which we refer to as OPCG-MADE. For the pairwise binary classification suggested by Adragni (2018), we estimate 2 SDR directions for each pair of classes, (1,2), (1,3) and (2,3), giving us 6 directions in total, and select the two directions that best separate the three classes. This corresponds to taking the leading eigenvector for the pairwise classifications of (1,2) and (1,3); we refer to this estimate as OPCG-PW. Treating the dimension $d$ as known, and using bandwidth $h = 1$, we estimate the SDR directions using OPCG, MADE, OPCG-MADE, OPCG-PW, DR and SIR. We construct the sufficient predictors on the testing set and present the plots in Figure 4. The OPCG, MADE, OPCG-MADE and DR estimates work well, while the pairwise estimator appears reasonable, and SIR fails. The failure of SIR due to symmetry is well known (Cook and Weisberg, 1991), while the subpar performance of the pairwise estimate demonstrates the benefit of using the multivariate link approach. Furthermore, for problems with many class labels, the pairwise search for the SDR directions would be computationally costly. For MADE, the non-uniqueness of $\beta$ means the MADE iterations are slow to converge. Neither MADE nor OPCG-MADE achieves convergence in our simulations, but both still yield good estimates. Furthermore, with OPCG as the initial estimate, the MADE estimator does not differ much from the initial value.

We also compute the distances of the estimated SDR directions, $\hat{\beta}$, to the true $\beta$ under the Frobenius norm in Table 1. We see that DR is closest to the true subspace, followed by OPCG-MADE, MADE and OPCG. MADE and OPCG-MADE are expected to be more efficient than OPCG, while OPCG-PW performs the worst among the methods that are expected to work, and SIR is the absolute worst due to SIR's inability to cope with symmetry.
Fig. 4.
The sufficient predictors constructed on the testing set. From left to right and top to bottom: OPCG, MADE, MADE with OPCG initial value, pairwise OPCG, DR, SIR. The red squares are class 1; the blue plusses are class 2; the green triangles are class 3.

Table 1.
Distances $\|\hat{\beta} - \beta\|_F$ under the Frobenius norm

OPCG       MADE       OPCG+MADE   PW-OPCG    DR         SIR
0.2437652  0.2261340  0.2261263   0.4485811  0.1972445  1.9939585
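For reference, a subspace distance of the kind used in Theorem 3.2 can be computed from the projection matrices of the two column spaces, which makes it invariant to the particular bases chosen; the function name is ours, and this is one common convention rather than necessarily the exact quantity reported in Table 1.

```python
import numpy as np

def projection_distance(B_hat, B):
    """Frobenius distance between the orthogonal projections onto span(B_hat)
    and span(B); invariant to the bases used to represent the two subspaces."""
    P = lambda A: A @ np.linalg.solve(A.T @ A, A.T)   # projection onto span(A)
    return np.linalg.norm(P(B_hat) - P(B), ord="fro")
```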
The red wine quality data set consists of a sensory-based integer score between 0 and 10 as the response, along with 11 continuous predictors. The "quality" variable is considered as the ordinal response. We use the inverse Ad-Cat transform to construct initial values, and substitute the ordinal cumulant generating function into the likelihood. We consider only the mediocre red wines with scores 5, 6 and 7. These classes had the most observations, while the poorer and higher quality wines had much fewer observations. An alternative is to group all scores below 5 as one group, and all scores above 7 as another group. This leads to similar, but noisier, results, since the more extreme scores are akin to outliers. We randomly split the sample into 1000 training and 513 testing observations. For assessing the importance of specifying the ordinal link, we compare the ordinal OPCG estimates against categorical OPCG estimates, which use the multinomial link; we refer to the latter as OPCG-MN.

We use a 3-fold k-means tuning procedure, with $d$ set to 2, and pre-specify the number of clusters per class to be 3. The procedure suggests $h \approx 3.5$; see Figure 5. Using a bandwidth of $h = 3$, the estimated dimension from both the Ladle and the PA plot is $\hat{d} = 2$; see Figure 5.

Fig. 5.
From left to right: the Ladle plot estimates the dimension to be $\hat{d} = 2$; the Predictor Augmentation plot agrees with the Ladle plot; 3-fold k-means tuning suggests $h = 3.5$. We look at the second dip in the tuning plot because, for small values of $h$, the sufficient predictors are ill-behaved and clump tightly together.

Using the estimated $\hat{d} = 2$ and $h \approx 3.5$, we construct the sufficient predictors on the testing set for OPCG, OPCG-MN, DR and SIR; see Figure 6. The OPCG and OPCG-MN sufficient predictors capture the ordinal nature of the quality scores well, while SIR performs to a lesser extent, and DR appears unable to clearly discern the categories. The sufficient predictors from OPCG appear to have lower variance than those from OPCG-MN, and this suggests that a misspecified link function negatively affects the efficiency of the estimates.

To conduct a simple ordinal classification, we consider the 2 binary classification problems for the events $\tilde{Y} > 5$ and $\tilde{Y} >$
6, as in Frank and Hall (2001). We train a Support Vector Machine with default settings, from the R library 'e1071', on the sufficient predictors constructed from the training set, and predict the class labels of the sufficient predictors constructed from the testing set; see Table 2. While all methods perform poorly, since ordinal classification is difficult, the OPCG sufficient predictors predict best.
Table 2.
Classification % Error for Wine Quality Scores 5, 6, 7.
OPCG    OPCG-MN    DR      SIR
36.68   37.26      48.26   40.73
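The error rates above (and in the later tables) come from training a classifier on the estimated sufficient predictors and scoring it on the held-out set. A minimal version of that step, using scikit-learn's SVC with default settings in place of the R routine used in the paper, might look as follows (the function name is ours).

```python
import numpy as np
from sklearn.svm import SVC

def svm_error_on_sufficient_predictors(beta_hat, X_train, y_train, X_test, y_test):
    """Train an SVM on the sufficient predictors beta_hat' X and return the
    test misclassification rate in percent."""
    clf = SVC().fit(X_train @ beta_hat, y_train)        # default (RBF) SVM
    y_pred = clf.predict(X_test @ beta_hat)
    return 100.0 * np.mean(y_pred != y_test)
```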
With the handwritten digits, we have 2007 observations consisting of vectorized, 16-by-16 pixel images of handwritten digits from 0 to 9. The predictors are vectors of length $p = 256$; since they represent a vectorized matrix, we do not standardize the predictors for OPCG. The 2007 observations are split into 1000 training and 1007 testing observations.
Fig. 6.
The sufficient predictors constructed on the testing set. Top left: OPCG with ordinal link. Top right: OPCG with categorical link. Bottom left: DR. Bottom right: SIR. The red squares are quality 5; the blue plusses are 6; the green triangles are 7.

We use a 2-fold k-means tuning with one cluster per class and $d$ set to 3 to determine $h$. The resulting tuning plot in Figure 7 suggests the bandwidth be set to $h \approx 4$. The Predictor Augmentation plot estimates the dimension to be $\hat{d} = 9$, while the Ladle plot is inconclusive; see Figure 7. Following Fukumizu and Leng (2014), we illustrate the effectiveness of OPCG by randomly sampling 500 observations of the digits 1, 2, 6, 7 and 9; their OPCG sufficient predictors are plotted in Figure 8.

Fig. 7.
From left to right: the Ladle plot is unable to meaningfully estimate the dimension; the Predictor Augmentation plot estimates the dimension to be $\hat{d} = 9$; the 2-fold k-means tuning algorithm suggests a bandwidth of around $h = 4$.

Fig. 8. (a) A plot of the OPCG sufficient predictors for 500 observations from '1' (red squares), '2' (blue plusses), '6' (green triangles), '7' (purple circles), '9' (orange stars). The OPCG estimate is for all digits 0-9, and we see that the estimated linear space separates the categories well. (b) The 9 estimated SDR directions, considered as estimated features, applied to 16 random observations from the testing set. We see that the directions can clearly recover the images in the testing set. This is true for DR and SIR as well.

We estimate SIR and DR using a Tikhonov-regularized covariance matrix, with a regularization penalty of 3. For various dimensions $d$, we construct the sufficient predictors and calculate their classification error rates, which are reported in Table 3. OPCG outperforms SIR and DR in that we achieve a lower classification error with fewer necessary dimensions. In this sense, OPCG also outperforms the partitioned version of the gradient-based Kernel Dimension Reduction, denoted by g-KDR-v, in Fukumizu and Leng (2014), though we do not replicate their analysis, but compare verbatim.
Table 3.
Classification % Error for USPS Handwritten Digits. The '-' is because SIR can only estimate $m - 1$ SDR directions. *The error rates for g-KDR-v are taken from Fukumizu and Leng (2014).

dim        3      5      7      9      13     20     25
SIR        41.51  21.95  15.39  15.09  -      -      -
DR         40.22  22.24  15.29  13.51  10.43  8.54   8.74
OPCG       33.76  19.17  10.43  7.94   7.85   7.94   7.85
g-KDR-v*   36.94  17.58  12.41  12.21  8.54   7.94   7.75
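One natural reading of Figure 8(b) is that each test image is projected onto the span of the nine estimated directions and mapped back to pixel space; a hedged sketch of that visualization, assuming $\hat{\beta} \in \mathbb{R}^{256 \times 9}$ from OPCG on the raw pixel vectors, is given below (names ours).

```python
import numpy as np
import matplotlib.pyplot as plt

def reconstruct_digits(beta_hat, X_test, n_images=16, seed=0):
    """Project test images onto span(beta_hat) and back, then display the 16 x 16
    reconstructions beneath the originals."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_test), size=n_images, replace=False)
    P = beta_hat @ np.linalg.solve(beta_hat.T @ beta_hat, beta_hat.T)  # projection
    fig, axes = plt.subplots(2, n_images, figsize=(2 * n_images, 4))
    for col, i in enumerate(idx):
        axes[0, col].imshow(X_test[i].reshape(16, 16), cmap="gray")      # original
        axes[1, col].imshow((P @ X_test[i]).reshape(16, 16), cmap="gray")  # reconstruction
        axes[0, col].axis("off"); axes[1, col].axis("off")
    fig.tight_layout()
    return fig
```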
For the ISOLET data from UCI, we have 150 individuals' verbal recordings of letters from the alphabet. We have $p = 617$ continuous predictors scaled between $-1$ and $1$. We do not tune $h$, due to the high computational cost, and so assess the performance of OPCG purely in terms of classification error using SVM in Table 4. We estimate SIR and DR using a Tikhonov-regularized covariance matrix, with a regularization penalty of 2. The OPCG sufficient predictors perform worse than DR, but eventually outperform SIR for large enough $d$. All the methods appear to be outperformed by the g-KDR-v method of Fukumizu and Leng (2014).
Classification % Error for ISOLET. The '-' is because SIR can only estimate $m - 1$ SDR directions. **The error rates for g-KDR-v are taken from Fukumizu and Leng (2014).

dim         5      10     15     20     25     30
SIR         25.85  10.84  8.66   8.47   8.40   -
DR          26.49  12.12  8.72   7.25   6.48   5.45
OPCG        41.37  20.65  12.64  8.21   6.48   6.54
g-KDR-v**   29.44  13.15  8.28   4.55   3.91   4.81
6. Conclusion
We introduce a generalized forward regression approach for sufficient dimension reduction that is applicable to categorical and ordinal classification problems. The Outer Product of Canonical Gradients (OPCG) and the multivariate formulation of the Minimum Average Deviance Estimator (MADE) provide estimates of the Central Mean Subspace, which coincides with the Central Subspace, for categorical and ordinal classification responses. We also introduce a tuning procedure for sufficient dimension reduction that does not require pre-specifying a prediction method, and our k-fold k-means tuning method performs reasonably well in our real data applications. Furthermore, the OPCG method is amenable to the Ladle and Predictor Augmentation estimators for order determination, which are faster and more accessible than conventional methods for MADE. In our applications, we demonstrated that OPCG is comparable to, and can outperform, conventional inverse regression methods. We also demonstrated the effect on efficiency of mis-specifying the response for OPCG.

References
Adragni, K. P. (2018) Minimum average deviance estimation for sufficient dimension reduction. Journal of Statistical Computation and Simulation, 411–431.
Agresti, A. (2010) Analysis of Ordinal Categorical Data, vol. 656. John Wiley & Sons.
— (2013) Categorical Data Analysis. Wiley, 3rd edn.
Bertsekas, D. P. (2015) Convex Optimization Algorithms. Athena Scientific, Belmont.
Cook, R. D., Li, B. et al. (2002) Dimension reduction for conditional mean in regression. The Annals of Statistics, 455–474.
Cook, R. D. and Weisberg, S. (1991) Comment. Journal of the American Statistical Association, 328–332.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 547–553.
Dai, Y.-H. and Yuan, Y. (2001) An efficient hybrid conjugate gradient method for unconstrained optimization. Annals of Operations Research, 33–47.
Fan, J., Heckman, N. E. and Wand, M. P. (1995) Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. Journal of the American Statistical Association, 141–150.
Fanty, M. and Cole, R. (1990) Spoken letter recognition. Advances in Neural Information Processing Systems, 220–226.
Frank, E. and Hall, M. (2001) A simple approach to ordinal classification. In European Conference on Machine Learning, 145–156. Springer.
Fukumizu, K. and Leng, C. (2014) Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 359–370.
Hull, J. J. (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 550–554.
Lambert-Lacroix, S. and Peyre, J. (2006) Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics & Data Analysis, 2091–2113.
Li, B. (2018) Sufficient Dimension Reduction: Methods and Applications with R. CRC Press.
Li, B. and Wang, S. (2007) On directional regression for dimension reduction. Journal of the American Statistical Association, 997–1008.
Li, K.-C. (1991) Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 316–327.
Luo, W. and Li, B. (2016) Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 875–887.
— (2020) On order determination by predictor augmentation. Biometrika.
McCullagh, P. (1980) Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), 109–142.
McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, Second Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Silverman, B. W. (2018) Density Estimation for Statistics and Data Analysis. Routledge.
Wand, M. P. and Jones, M. C. (1994) Kernel Smoothing. Chapman and Hall/CRC.
Wang, H. and Xia, Y. (2008) Sliced regression for dimension reduction. Journal of the American Statistical Association, 811–821.
Xia, Y. (2006) Asymptotic distributions for two estimators of the single-index model. Econometric Theory, 1112–1137.
— (2007) A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 2654–2690.
— (2008) A multiple-index model and dimension reduction. Journal of the American Statistical Association, 1631–1640.
Xia, Y., Tong, H., Li, W. and Zhu, L.-X. (2002) An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 363–410.
Yin, X., Li, B. and Cook, R. D. (2008) Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 1733–1757.
Zhong, W., Zeng, P., Ma, P., Liu, J. S. and Zhu, Y. (2005) RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics, 21.