A Simple Approach to Online Sparse Sliced Inverse Regression
Haoyang Cheng and Wenquan Cui
Department of Statistics and Finance, School of Management, University of Science and Technology of China
Abstract
Sliced inverse regression is an efficient approach to estimating the central subspace for sufficient dimension reduction. Driven by the need to handle sparse high dimensional data, several methods of online sufficient dimension reduction have been proposed. However, as far as we know, none of them is well suited to high dimensional and sparse data. The purpose of this paper is therefore to propose a simple and efficient approach to online sparse sliced inverse regression (OSSIR). Motivated by Lasso-SIR and online SIR, we implement Lasso-SIR in an online fashion. Our method involves two key steps: one is to iteratively obtain the eigenvalues and eigenvectors of the matrix $\mathrm{cov}(E(x \mid Y))$; the other is online $L_1$ regularization. For the former, we extend online principal component analysis and summarize four different ways to carry out the update. For the latter, since the truncated gradient has been shown to be an online counterpart of $L_1$ regularization in the batch setting, we apply the truncated gradient within online sliced inverse regression. The theoretical properties of this online learner are established. By comparing with several existing methods in simulations and real data applications, we demonstrate the effectiveness and efficiency of our algorithm.

Keywords:
Sliced inverse regression, online learning, online PCA, sparsity, truncated gradient.
1. Introduction
With the arrival of the big data era, mining effective information from various kinds of data is attracting more and more interest in fields such as social networks, multimedia, and the stock market. However, data from practical scenarios are usually difficult to deal with because of their high dimensionality, a difficulty well known as the "curse of dimensionality". To tackle this problem, dimension reduction is widely adopted: the original data are compressed into a much lower dimensional representation that still preserves sufficient information for statistical analysis and inference. To be concrete, suppose the response variable $Y \in \mathbb{R}$ and the covariates $x = (x_1, \ldots, x_p)^T \in \mathbb{R}^p$. Assuming that only a few features, or a few linear combinations of the features $x$, contain the information we are interested in, we can consider the following model:
$$ Y = f(\beta_1^T x, \ldots, \beta_d^T x) + \epsilon \qquad (1.1) $$
where $d$ is usually far less than $p$. Sufficient dimension reduction [15] aims to find these linear combinations in the sense of statistical sufficiency, i.e.
$$ F(Y \mid x) = F(Y \mid B^T x) \qquad (1.2) $$
where $F(Y \mid \cdot)$ is the conditional distribution function of $Y$ and $B = (\beta_1, \ldots, \beta_d)$. The dimension reduction space spanned by $B$ is not unique, so we usually consider the intersection of all dimension reduction spaces, denoted $S_{Y|x}$. Many approaches have been proposed to estimate the sufficient dimension reduction space $S_{Y|x}$, such as sliced inverse regression (SIR) [15], sliced average variance estimation (SAVE) [7], kernel inverse regression [33], directional regression (DR) [14], principal Hessian directions (PHD) [16], minimum average variance estimation (MAVE) [27], density-based MAVE (dMAVE) [26], and the semiparametric approach [18].

Online learning has attracted a lot of attention due to the fast development and great demands of big data processing. When high dimensional data arrive sequentially in a data stream, new approaches to online dimension reduction are needed. In batch learning, principal component analysis (PCA) and linear discriminant analysis (LDA) are the most widely used dimension reduction methods, and there are PCA-based and LDA-based online algorithms, such as incremental PCA [8, 9, 22, 25, 31] and incremental LDA [6, 10, 21, 30]. The perturbation method, gradient descent optimization, and randomized algorithms are widely used in online principal component analysis, see [2, 24]. Besides, there are also several works on online sufficient dimension reduction. Chavent [5] offers a method to estimate the central dimension reduction subspace block by block. For the situation where observations arrive one by one, Cai [3] uses the perturbation method and gradient descent optimization to propose two online sufficient dimension reduction algorithms, of which the former usually costs more computational time. Zhang [28] extends incremental PCA to incremental sliced inverse regression. However, both of them need to update the inverse of the covariance matrix at every step, which is computationally expensive. Moreover, as far as we know, their methods are not well suited to sparse high dimensional data.
Hence, we build on sparse sliced inverse regression (Lasso-SIR) [17] and propose an online sparse sliced inverse regression in this paper. Lin et al. [17] first construct an artificial response variable $\tilde Y$ from the top-$d$ eigenvalues $\Lambda$ and corresponding eigenvectors $\eta$ of the estimated conditional covariance matrix $\mathrm{cov}(E(x \mid Y))$, and then apply the Lasso to estimate the sufficient dimension reduction direction $\beta$. The problem can be formulated as
$$ \min_\beta \frac{1}{2n}\|\tilde Y - X^T\beta\|_2^2 + \mu\|\beta\|_1 \qquad (1.3) $$
Our method is the online version of (1.3); one of its advantages is that it avoids updating the inverse covariance matrix at every step, which makes it more computationally efficient. To implement Lasso-SIR in an online fashion, two problems need to be tackled: one is to iteratively obtain the eigenvalues and eigenvectors of the matrix $\mathrm{cov}(E(x \mid Y))$; the other is online $L_1$ regularization. Because $\Lambda$ and $\eta$ are the top-$d$ eigenvalues and corresponding eigenvectors of the covariance matrix $\mathrm{cov}(E(x \mid Y))$, it is natural to consider online PCA. Cardot [4] presents the main approaches to online PCA, such as perturbation methods, incremental methods, and stochastic optimization. In our method, however, the quantities used in the $t$-th step are also involved in the $(t+1)$-th step, so we extend the online PCA algorithms and summarize four different methods to tackle this problem. Further, we derive the theoretical convergence of the extended candid covariance-free incremental PCA (CCIPCA). As Cai [3] and Zhang [28] did not consider the sparsity of the data or any regularization, we add $L_1$ regularization in our method. Since, in online learning, the truncated gradient [13] has been shown to be an online counterpart of $L_1$ regularization in the batch setting, we apply the truncated gradient within online sparse sliced inverse regression. With the above modifications, we finally obtain the online version of (1.3). Combining the discussion of the truncated gradient and Lasso-SIR, we also analyze the convergence property of our estimator.

The outline of this article is the following. In Section 2, we first give a brief introduction to Lasso-SIR, online PCA and the truncated gradient, and then present the derivation of online sparse sliced inverse regression and its detailed algorithm. In Section 3, we establish the convergence property of the estimator. Numerical simulations and real data applications are shown in Sections 4 and 5. The article finishes with a brief conclusion in Section 6.
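To fix ideas, the following is a minimal batch Lasso-SIR sketch in Python, illustrating how an artificial response could be built from the sliced means and then fed to an ordinary Lasso solver as in (1.3). It is only an illustration of the idea, not the authors' implementation: the function and parameter names (`lasso_sir`, `n_slices`, `mu`), the fixed penalty level, and the equal-size slicing are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_sir(X, y, d=1, n_slices=10, mu=0.1):
    """Sketch of batch Lasso-SIR: X is (n, p), y is (n,); returns (p, d) directions."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center so that E(x) = 0 is plausible
    # Slice the data by the order of y and compute the slice means of x
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    slice_means = np.stack([Xc[idx].mean(axis=0) for idx in slices])   # (H, p)
    Gamma = slice_means.T @ slice_means / n_slices    # estimate of cov(E(x | Y))
    vals, vecs = np.linalg.eigh(Gamma)
    lam, eta = vals[-d:][::-1], vecs[:, -d:][:, ::-1]  # top-d eigen-pairs
    betas = []
    for j in range(d):
        # each observation's artificial response: its slice mean projected on eta_j / lam_j
        y_tilde = np.empty(n)
        for h, idx in enumerate(slices):
            y_tilde[idx] = slice_means[h] @ eta[:, j] / lam[j]
        betas.append(Lasso(alpha=mu, fit_intercept=False).fit(Xc, y_tilde).coef_)
    return np.column_stack(betas)
```

The sklearn Lasso objective, $(1/2n)\|y - Xw\|_2^2 + \alpha\|w\|_1$, matches the form of (1.3) with $\alpha$ playing the role of $\mu$.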
2. Sparse Online Sliced Inverse Regression
We first give a brief introduction to Lasso-SIR [17]. Given the samples $\{(y_i, x_i)\}_{i=1}^n$, Lasso-SIR first sorts the samples so that $y_1 \le y_2 \le \ldots \le y_n$ and divides the data into $H$ equal-sized slices $I_1, \ldots, I_H$. For simplicity, assume that $n = cH$ and $E(x) = 0$. Then the estimator of $\Gamma \triangleq \mathrm{cov}(E(x \mid y))$ can be written as $\hat\Gamma_H = (1/H) X_H X_H^T$, where $X_H = (\bar x_{1,\cdot}, \ldots, \bar x_{H,\cdot})$ and $\bar x_{h,\cdot}$ is the sample mean of the $h$-th slice. Li [15] has shown that if the distribution of $x$ is elliptically symmetric, then
$$ \Sigma\, \mathrm{col}(B) = \mathrm{col}(\Gamma) \qquad (2.1) $$
where $\mathrm{col}(B)$ and $\mathrm{col}(\Gamma)$ are the spaces spanned by the columns of $B$ and $\Gamma$, respectively. Approximating $\Sigma$ and $\mathrm{col}(\Gamma)$ by $\frac{1}{n}XX^T$ and $\hat\eta$ respectively, equation (2.1) can be converted to
$$ \frac{1}{n} XX^T \mathrm{col}(\hat B) = \mathrm{col}(\hat\eta) \qquad (2.2) $$
Here we denote by $\Lambda_d = \mathrm{diag}(\hat\lambda_1, \ldots, \hat\lambda_d)$ the top-$d$ eigenvalues of $\hat\Gamma_H$ and by $\hat\eta = (\hat\eta_1, \ldots, \hat\eta_d)$ the corresponding eigenvectors. Rewriting $X_H = XM/c$, the equation $\hat\Gamma_H\hat\eta = \hat\eta\Lambda_d$ is equivalent to
$$ \frac{1}{nc} X M M^T X^T \hat\eta = \frac{1}{H}X_H X_H^T\hat\eta = \hat\eta\Lambda_d $$
where $M = I_H \otimes 1_c$ and $1_c$ is the $c \times 1$ vector of ones. Setting $\tilde Y = (1/c)\, MM^T X^T\hat\eta\Lambda_d^{-1}$, we have $\hat\eta = \frac{1}{n}X\tilde Y$. Then, with equation (2.2), we obtain $(1/n)X\tilde Y \propto (1/n)XX^T\beta$. To recover a sparse vector $\hat\beta \propto \beta$, Lin et al. [17] consider the following optimization problem
$$ \hat\beta_i = \arg\min_\beta L_{\beta,i} = \arg\min_\beta \frac{1}{2n}\|\tilde Y_{\star,i} - X^T\beta\|_2^2 + \mu_i\|\beta\|_1, \quad i = 1, \ldots, d \qquad (2.3) $$
where $\hat B = (\hat\beta_1, \ldots, \hat\beta_d)$ and $\mu_i = C\sqrt{\log(p)/(n\hat\lambda_i)}$ for a sufficiently large constant $C$.

Our idea in this paper is to turn Lasso-SIR into an online learning method. There are two problems to consider: the first is the update of the artificial response variable $\tilde Y$, and the second is online least squares regression with $L_1$ regularization. Before constructing the artificial response variable $\tilde Y$, we need to update the matrix $\hat\Gamma_H$. In online learning the observations arrive sequentially, so we denote the data stream by $\{(x_i, Y_i), i = 1, 2, \ldots\}$. To conduct online sliced inverse regression, we first pre-specify the cutting points $-\infty = q_0 < q_1 < \ldots < q_H = \infty$ using a small batch of data and partition accordingly; if $Y$ is continuous, quantiles of $Y$ are usually used as the cutting points. With the first $t$ observations $\{(x_i, Y_i), i = 1, \ldots, t\}$, we have
$$ \hat\Gamma_{H,t} = \frac{1}{H}\sum_{h=1}^H \tilde X_t M_{t,h} M_{t,h}^T \tilde X_t^T = \frac{1}{H}\tilde X_t M_t M_t^T \tilde X_t^T \qquad (2.4) $$
where $\tilde X_t = x_{(1:t)} - \bar x_t 1_t^T \in \mathbb{R}^{p\times t}$, $\bar x_t = \frac{1}{t}\sum_{i=1}^t x_i$, $M_{t,h} = \frac{1}{n_h}\big(1\{y_1\in I_h\}, \ldots, 1\{y_t\in I_h\}\big)^T \in \mathbb{R}^{t\times 1}$ and $M_t = (M_{t,1}, \ldots, M_{t,H}) \in \mathbb{R}^{t\times H}$. As the cutting points are pre-specified, the sample sizes in the intervals will usually differ and may even become extremely imbalanced as the observations arrive sequentially. To handle this situation, [3] recommended the cumulative slicing estimation [32], so it is also reasonable to replace $\hat\Gamma_H$ by the following matrix $\hat D$ in our method. Defining $d_h \triangleq E\{(x - Ex)\, 1\{Y\in I_h\}\}$ and $D = \frac{1}{H}\sum_{h=1}^H d_h d_h^T$, we have $\mathrm{col}(\Sigma^{-1}D) \subseteq S_{Y|x}$ according to Theorem 1 in [32]. We set
$$ \hat d_{t,h} = \frac{1}{t}\sum_{i=1}^t (x_i - \bar x_t)\,1\{y_i\in I_h\} \quad\text{and}\quad \hat D_t = \frac{1}{H}\sum_{h=1}^H \hat d_{t,h}\hat d_{t,h}^T = \frac{1}{Ht^2}\tilde X_t\tilde M_t\tilde M_t^T\tilde X_t^T \qquad (2.5) $$
Here $\tilde M_t = (\tilde M_{t,1}, \ldots, \tilde M_{t,H})$ and $\tilde M_{t,h} = (1\{y_1\in I_h\}, \ldots, 1\{y_t\in I_h\})^T$. To update $(\hat d_{t+1,1}, \ldots, \hat d_{t+1,H})$ from $(\hat d_{t,1}, \ldots, \hat d_{t,H})$ more efficiently, we define $\tilde e_t = (1\{y_t\in I_1\}, \ldots, 1\{y_t\in I_H\})^T\in\mathbb{R}^{H\times 1}$, and then
$$ \hat d_{t+1} = (\hat d_{t+1,1}, \ldots, \hat d_{t+1,H}) = \frac{1}{t+1}\Big(\sum_{i=1}^t \tilde x_i\tilde e_i^T + \tilde x_{t+1}\tilde e_{t+1}^T\Big) \qquad (2.6) $$
which gives the update formulae for $\hat d_t$ and $\hat D_t$. Following the discussion in [3, 32], we also have $\hat D_t = D + O_p(t^{-1/2})$.
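A minimal sketch of the streaming update (2.5)-(2.6) in Python follows. It keeps only the running sums needed for $\hat d_t$ and $\hat D_t$; the class and attribute names are our own, and it assumes the cutting points have already been fixed from a pilot batch.

```python
import numpy as np

class StreamingSliceMeans:
    """Maintain d_hat (p x H) and D_hat (p x p) of (2.5)-(2.6) from a data stream."""
    def __init__(self, cutpoints, p):
        self.cutpoints = np.asarray(cutpoints)      # inner cutting points q_1, ..., q_{H-1}
        self.H = len(self.cutpoints) + 1
        self.t = 0
        self.sum_x = np.zeros(p)                    # running sum of x_i, used to form x_bar_t
        self.sum_xe = np.zeros((p, self.H))         # running sum of x_i * e_i^T (slice indicators)
        self.counts = np.zeros(self.H)              # number of observations per slice

    def update(self, x, y):
        """One pass of (2.6); returns the current d_hat and D_hat."""
        self.t += 1
        h = int(np.searchsorted(self.cutpoints, y)) # which slice I_h the new y falls in
        self.sum_x += x
        self.sum_xe[:, h] += x
        self.counts[h] += 1
        x_bar = self.sum_x / self.t
        # d_hat_{t,h} = (1/t) sum_i (x_i - x_bar_t) 1{y_i in I_h}
        d_hat = (self.sum_xe - np.outer(x_bar, self.counts)) / self.t
        D_hat = d_hat @ d_hat.T / self.H            # eq. (2.5)
        return d_hat, D_hat
```

Only the $p\times H$ running sums are stored, so the per-step cost apart from forming $\hat D_t$ is $O(pH)$; the full $\hat D_t$ is only needed by the perturbation method discussed below.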
2.1. The update for $\tilde Y$

To implement (2.3) in an online fashion, the first problem we run into is the update of $\tilde Y$. In this subsection we show how to update $\tilde Y$ efficiently. Setting
$$ \tilde Y_t = \frac{1}{Ht}\tilde M_t\tilde M_t^T X_t^T\hat\eta_t\,\mathrm{diag}\Big(\frac{1}{\hat\lambda_{t,1}}, \ldots, \frac{1}{\hat\lambda_{t,d}}\Big), $$
we find that estimating $\tilde Y_t$ at the $t$-th iteration amounts to seeking the $d$ principal eigenvalues and eigenvectors of $\hat D_t$. To obtain the eigenvalues and eigenvectors of a covariance matrix, it is natural to consider online PCA. There have been many approaches to online PCA, such as stochastic gradient algorithms [11, 19, 20, 23], candid covariance-free incremental PCA (CCIPCA) [25], and incremental PCA [1]; the per-iteration computational cost and memory usage of these methods can be found in Table 1 of [4]. In the $t$-th update of standard online PCA, only one observation $x_t$ is used, and $x_t$ is independent of $x_{t-1}$. In our method, however, $x_t$ is replaced by $\hat d_t$, and since $\hat d_t$ is highly correlated with $\hat d_{t-1}$, it is necessary to reformulate the iterative algorithms in [4] (perturbation methods, incremental methods, and stochastic optimization) and analyze their convergence. Next, we summarize four different methods and discuss their convergence properties and computational complexity.

Firstly, the idea in [3] is to use the perturbation theorem to implement online singular value decomposition, based on the following lemma.

Lemma 1. Let $Q \in \mathbb{R}^{p\times p}$ be a symmetric matrix and $(\lambda_j, v_j)$ be the eigen-pairs of $Q$, $j = 1, \ldots, p$. Assume $|\lambda_1| > \cdots > |\lambda_d| > \lambda_{d+1} = \cdots = \lambda_p = 0$. Let $\epsilon$ be a very small constant and $G$ be a symmetric matrix. Denote the first order perturbation $Q(\epsilon) = Q + \epsilon G + O(\epsilon^2)$ and the eigen-pairs of $Q(\epsilon)$ by $\{\lambda_j(\epsilon), v_j(\epsilon)\}$. Then
$$ \lambda_j(\epsilon) = \lambda_j + \epsilon\,(v_j^T G v_j) + O(\epsilon^2) $$
$$ v_j(\epsilon) = v_j + \epsilon\,(\lambda_j I_{p\times p} - Q)^+ G v_j + O(\epsilon^2), \quad j = 1, \ldots, d $$
where $(\lambda_j I_{p\times p} - Q)^+$ stands for the Moore-Penrose pseudo-inverse of $(\lambda_j I_{p\times p} - Q)$ and $I_{p\times p}$ stands for the $p\times p$ identity matrix.

Setting $D_t = \frac{1}{t}\sum_{i=1}^t \hat D_i$, we have $D_{t+1} = D_t - (t+1)^{-1}(D_t - \hat D_{t+1})$. Taking $Q = D_t$, $G = D_t - \hat D_{t+1}$ and $\epsilon = -(t+1)^{-1}$ in Lemma 1, the online singular value decomposition update for $\hat D_{t+1}$ is:
$$ \hat\lambda_{t+1,j} = \hat\lambda_{t,j} - (t+1)^{-1}\hat\eta_{t,j}^T(D_t - \hat D_{t+1})\hat\eta_{t,j} \qquad (2.7) $$
$$ \hat\eta_{t+1,j} = \hat\eta_{t,j} - (t+1)^{-1}(\hat\lambda_{t,j}I_{p\times p} - D_t)^+(D_t - \hat D_{t+1})\hat\eta_{t,j} \qquad (2.8) $$
As the computational complexity of forming $(\hat\lambda_{t,j}I_{p\times p} - D_t)^+$ is $O_p(p^3)$ and the matrix product $(D_t - \hat D_{t+1})\hat\eta_{t,j}$ has to be computed $d$ times, the complexity of (2.8) is $O_p(p^3 + p^2 d)$. Thus, when $p$ is large, the computational cost of this method is hard to bear.

Besides the perturbation method, a more popular choice is stochastic gradient optimization. Defining $\phi_{t,j} = \hat d_{t+1}^T\hat\eta_{t,j}$, the stochastic gradient update for online PCA of $\hat D_t$ is:
$$ \hat\lambda_{t+1,j} = \hat\lambda_{t,j} + \gamma_n\big(\phi_{t,j}^T\phi_{t,j} - \hat\lambda_{t,j}\big) \qquad (2.9) $$
$$ \hat\eta_{t+1,j} = \hat\eta_{t,j} + \gamma_n\Big(\hat d_{t+1} - \hat\eta_{t,j}\phi_{t,j}^T - \sum_{i=1}^{j-1}\hat\eta_{t,i}\phi_{t,i}^T\Big)\phi_{t,j} \qquad (2.10) $$
Equation (2.10) is a first order approximation of the Gram-Schmidt orthonormalization of $\hat\eta$, so Gram-Schmidt orthonormalization can also be used in place of (2.10). Because the computational complexities of Gram-Schmidt orthonormalization and of (2.10) are $O_p(pd^2)$ and $O_p(pdH)$ respectively, in this paper we recommend equation (2.10), or performing Gram-Schmidt orthonormalization only every $L$ steps. The perturbation method and stochastic gradient optimization were discussed in Cai [3]; we slightly modify their formulae, obtaining complexities $O_p(p^3 + p^2 d)$ for (2.8) and $O_p(pdH)$ for (2.9)-(2.10), whereas the complexities of the perturbation method and of stochastic gradient optimization in Cai [3] are $O_p(p^3 d)$ and $O_p(p^2 d)$, respectively.

Next, we present two further methods for online PCA of $\hat D_t$. Similar to stochastic gradient optimization, the candid covariance-free incremental PCA (CCIPCA) update for $\hat D_{t+1}$ is as follows (a small sketch of this recursion is given right after the display):
$$ v_{t+1,j} = \frac{t}{t+1}v_{t,j} + \frac{1}{t+1}d_{t+1}(j)d_{t+1}(j)^T\frac{v_{t,j}}{\|v_{t,j}\|} $$
$$ d_{t+1}(j) = d_{t+1}(j-1) - \frac{v_{t+1,j-1}}{\|v_{t+1,j-1}\|}\frac{v_{t+1,j-1}^T}{\|v_{t+1,j-1}\|}d_{t+1}(j-1) \qquad (2.11) $$
where $j = 1, \ldots, d$, $d_{t+1}(1) = \hat d_{t+1}$, and the normalized eigenvector $\hat\eta_{t,j}$ and eigenvalue $\hat\lambda_{t,j}$ are estimated by $\hat\eta_{t,j} = v_{t,j}/\|v_{t,j}\|$ and $\hat\lambda_{t,j} = \|v_{t,j}\|$. If $j = t$, the $j$-th eigenvector is initialized as $v_{t,j} = d_t(j)$. The computational complexity of (2.11) is $O_p(pdH)$.
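Below is a minimal Python sketch of the CCIPCA-type recursion (2.11) applied to the streaming matrix $\hat d_{t+1}$. It is only meant to make the recursion concrete; the function name, the small norm floor used to avoid division by zero, and operating on the whole $p\times H$ matrix at once are our own choices.

```python
import numpy as np

def ccipca_step(V, d_new, t):
    """One CCIPCA pass of (2.11): V is (p, d) with column j holding v_{t,j};
    d_new is the (p, H) matrix d_hat_{t+1}; returns the updated V in place."""
    p, k = V.shape
    resid = d_new.copy()                          # d_{t+1}(1) = d_hat_{t+1}
    for j in range(k):
        v = V[:, j]
        nv = np.linalg.norm(v) + 1e-12
        # v_{t+1,j} = t/(t+1) v_{t,j} + 1/(t+1) d(j) d(j)^T v_{t,j} / ||v_{t,j}||
        V[:, j] = t / (t + 1) * v + (1.0 / (t + 1)) * (resid @ (resid.T @ (v / nv)))
        u = V[:, j] / (np.linalg.norm(V[:, j]) + 1e-12)
        # deflation: d(j+1) = d(j) - u u^T d(j)
        resid = resid - np.outer(u, u @ resid)
    return V

# Estimates read off from V:  eta_j = V[:, j] / ||V[:, j]||,  lambda_j = ||V[:, j]||.
```

Each pass costs $O(pdH)$ operations, matching the complexity stated for (2.11).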
Zhang [28] applied reduced rank incremental PCA to the matrix $\Sigma^{-1/2}\Gamma\Sigma^{-1/2}$; here we use this method for the matrix $\hat D_t$. When a new observation $(x_{t+1}, y_{t+1})$ arrives, we first locate the slice it belongs to according to the distances from $y_{t+1}$ to the sample slice means $\bar y_h$ of the response variable. Suppose the distance from $y_{t+1}$ to $\bar y_k$ is the smallest, so we place the new observation in slice $k$. We denote by $\hat d_{t+1,k}$ a new observation for $E\{(x - Ex)\,1\{Y\in I_k\}\}$, and define the residual
$$ v_{t+1} = \hat d_{t+1,k} - \hat\eta_t\hat\eta_t^T\hat d_{t+1,k} $$
Then the new $\hat\eta_{t+1}$ and $\Lambda_{t+1,d}$ are obtained from the top-$d$ eigenvectors and eigenvalues of
$$ \Big[\hat\eta_t, \tfrac{v_{t+1}}{\|v_{t+1}\|}\Big]^T\hat D_{t+1}\Big[\hat\eta_t, \tfrac{v_{t+1}}{\|v_{t+1}\|}\Big] = \frac{1}{H}\Big[\hat\eta_t, \tfrac{v_{t+1}}{\|v_{t+1}\|}\Big]^T\hat d_{t+1}\hat d_{t+1}^T\Big[\hat\eta_t, \tfrac{v_{t+1}}{\|v_{t+1}\|}\Big] \qquad (2.12) $$
The computational complexity of forming (2.12) and of its eigen-decomposition is $O_p(pH(d+1) + H(d+1)^2)$ and $O_p((d+1)^3)$ respectively, so the complexity of the reduced rank incremental PCA of $\hat D_t$ is $O_p\big(pH(d+1) + H(d+1)^2 + (d+1)^3\big)$. The computational complexity of all four methods is summarized in the following table.

Table 1: Computational complexity of online PCA for $\hat D_t$ per iteration

  Method         Computation time
  Perturbation   $O_p(p^3 + p^2 d)$
  SGD            $O_p(pdH)$ or $O_p(pd^2 + pdH)$
  CCIPCA         $O_p(pdH)$
  IPCA           $O_p(pH(d+1) + H(d+1)^2 + (d+1)^3)$

With the above methods we can iteratively obtain the eigenvalues $\Lambda_{t,d}$ and eigenvectors $\hat\eta_t$ of $\hat D_t$; then, when a new observation $(x_{t+1}, y_{t+1})$ arrives, we construct the new artificial response $\tilde y_{t+1}$ by
$$ \tilde y_{t+1} = \frac{t+1}{H}\,\tilde e_{t+1}^T\hat d_{t+1}^T\hat\eta_{t+1}\Lambda_{t+1,d}^{-1} \qquad (2.13) $$

Langford et al. [13] proposed the truncated gradient method, which is an online counterpart of $L_1$ regularization in the batch setting. Hence, we can turn (2.3) into an online method via the truncated gradient. The truncated gradient update can be formulated as
$$ f(w_i) = T_1\big(w_i - \gamma\nabla_1 L(w_i, z_i),\; \gamma g_i,\; \theta\big) \qquad (2.14) $$
where $g_i \ge 0$ and the truncation function $T_1$ is defined by
$$ T_1(v, \alpha, \theta) = \begin{cases} \max(0,\, v - \alpha) & \text{if } v \in [0, \theta] \\ \min(0,\, v + \alpha) & \text{if } v \in [-\theta, 0) \\ v & \text{otherwise} \end{cases} $$
In the update process, the truncation can be executed every $L$ steps: if $i/L$ is not an integer we set $g_i = 0$; if $i/L$ is an integer we set $g_i = Lg$ for a scalar $g > 0$. The larger the parameters $g$ and $\theta$ are, the more sparsity is incurred. Next, combining the truncated gradient for least squares in [13] with the update of the artificial response variable $\tilde Y$, we obtain the algorithm of online sparse sliced inverse regression with truncated gradient. The details are given in Algorithm 1, and a small code sketch of the per-observation coefficient update follows the algorithm.

Algorithm 1: Online Sliced Inverse Regression With Truncated Gradient
Input: threshold $\theta \ge 0$, gravity sequence $g_i \ge 0$, learning rate $\gamma \in (0, 1)$, data stream $(x_i, y_i), i = 1, 2, \ldots$
Output: $\hat B = (\hat\beta_1, \ldots, \hat\beta_d)$
Initialize $\hat D$ with a small batch sample $\{x_i, y_i\}_{i=1}^{t_0}$ to obtain the corresponding eigenvalues and eigenvectors $(\hat\lambda_{0,j}, \hat\eta_{0,j}), j = 1, \ldots, d$.
for $i = t_0 + 1, t_0 + 2, \ldots$ do
    Receive the new example $x_i = [x^1, \ldots, x^p]$ with response $y_i$;
    Update $\hat d_i$ and $\hat D_i$ by (2.6);
    Update $(\hat\lambda_{i,j}, \hat\eta_{i,j}), j = 1, \ldots, d$ by online PCA as in Section 2.1;
    Construct the new artificial response $\tilde y_i$ by (2.13);
    for $j = 1, \ldots, d$ do
        for each coefficient $\beta^\ell$ ($\ell = 1, \ldots, p$) do
            if $\beta^\ell > 0$ and $\beta^\ell \le \theta$ then $\beta^\ell \leftarrow \max\{\beta^\ell - g_i\gamma,\, 0\}$
            else if $\beta^\ell < 0$ and $\beta^\ell \ge -\theta$ then $\beta^\ell \leftarrow \min\{\beta^\ell + g_i\gamma,\, 0\}$
        end for
        Compute the prediction $\hat y = \sum_{\ell=1}^p \beta^\ell x^\ell$
        Update for all $\ell$: $\beta^\ell \leftarrow \beta^\ell + \gamma(\tilde y_{i,j} - \hat y)x^\ell$, and set $\hat\beta_{i,j} = (\beta^1, \ldots, \beta^p)^T$
    end for
end for

Note that we do not need to calculate $\hat D_t$ for all methods: only the perturbation method needs the matrix $\hat D_t$; the other methods only require $\hat d_t$. Hence, if we choose the perturbation method, the computational complexity of Algorithm 1 is $O_p(p^3 + p^2 d + p^2 H + pdH + pd)$; otherwise the computational complexity of Algorithm 1 is $O_p(pdH + pd + L)$, where $O_p(L)$ denotes the online PCA cost listed in Table 1. In contrast, Cai [3] needs to update $\hat\Sigma_t^{-1}$ and $(\hat m_{t,1}, \ldots, \hat m_{t,H}) = \hat\Sigma_t^{-1}\hat d_t$ at every step, so the computational cost of [3] is of order $O(p^3 d + p^2 H + p(p+H) + p^2)$ or $O(p^2 d + p^2 H + p(p+H) + p^2)$. We can therefore conclude that our method is more computationally efficient than the algorithms in [3].
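The per-observation coefficient update in Algorithm 1 (truncation as in (2.14), followed by a gradient step on the squared loss against the artificial response) can be sketched as below. The function names and the choice to apply the truncation at every step are our own simplifications.

```python
import numpy as np

def truncate(beta, alpha, theta):
    """T_1 of (2.14), applied componentwise to the coefficient vector."""
    out = beta.copy()
    pos = (beta >= 0) & (beta <= theta)
    neg = (beta < 0) & (beta >= -theta)
    out[pos] = np.maximum(0.0, beta[pos] - alpha)
    out[neg] = np.minimum(0.0, beta[neg] + alpha)
    return out

def ossir_coef_step(beta, x, y_tilde, gamma, g, theta):
    """One inner-loop pass of Algorithm 1 for a single direction:
    shrink small coefficients, then take a gradient step on (y_tilde - x^T beta)^2."""
    beta = truncate(beta, gamma * g, theta)
    y_hat = x @ beta
    return beta + gamma * (y_tilde - y_hat) * x
```

In Algorithm 1 the truncation is actually triggered only every $L$ steps ($g_i = Lg$ when $i/L$ is an integer and $0$ otherwise); the sketch above ignores this scheduling for brevity.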
3. Convergence Properties
In this section we discuss some properties of our method. We first recall two results: a theorem on the relationship between $L_1$ regularization and the truncated gradient, and the consistency property of Lasso-SIR. Langford et al. [13] analysed the relationship between $L_1$ regularization and the truncated gradient; the details are given in Theorem 2.
Theorem 2.
Consider the sparse online update rule (2.14) with $w_1 = 0$ and $\eta > 0$. If $L(w, z)$ is convex in $w$ and there exist non-negative constants $A$ and $B$ such that $\|\nabla_1 L(w, z)\|^2 \le AL(w, z) + B$ for all $w \in \mathbb{R}^d$ and $z \in \mathbb{R}^{d+1}$, then for all $\bar w \in \mathbb{R}^d$ we have
$$ \frac{1 - 0.5A\eta}{T}\sum_{i=1}^T\Big[L(w_i, z_i) + \frac{g_i}{1 - 0.5A\eta}\big\|w_{i+1}\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big] \le \frac{\eta B}{2} + \frac{\|\bar w\|^2}{2\eta T} + \frac{1}{T}\sum_{i=1}^T\Big[L(\bar w, z_i) + g_i\big\|\bar w\cdot I(|w_{i+1}| \le \theta)\big\|_1\Big] \qquad (3.1) $$
where, for vectors $v = [v_1, \ldots, v_d]$ and $v' = [v'_1, \ldots, v'_d]$, we let $\|v\cdot I(|v'| \le \theta)\|_1 = \sum_{j=1}^d |v_j|\, I(|v'_j| \le \theta)$, and $I(\cdot)$ is the set indicator function.

In our method the loss function is the squared loss, so from Theorem 2 we have $\|w_T - \bar w\| = O(1/\sqrt{T})$. Moreover, Theorem 3 gives the consistency of the estimator $\hat B$ in Lasso-SIR [17]. First, we state some conditions:

(C1) There exist constants $C_{\min}$ and $C_{\max}$ such that $0 < C_{\min} < \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) < C_{\max}$;
(C2) There exists a constant $\kappa \ge 1$ such that $0 < \lambda \triangleq \lambda_d(\mathrm{var}(E[x \mid y])) \le \ldots \le \lambda_1(\mathrm{var}(E[x \mid y])) \le \kappa\lambda \le \lambda_{\max}(\Sigma)$;
(C3) The central curve $m(y) = E(x \mid y)$ satisfies the sliced stability condition;
(C4) The observations $(x_i, y_i), i = 1, 2, \ldots$ are independent and identically distributed;
(C5) The nonzero eigenvalues of $D$ are all distinct;
(C6) The tuning parameter $\gamma_t$ in SGD satisfies $\gamma_t = Ct^{-1}$ for some constant $C$.

Conditions (C1)-(C3) are described and required in Lin et al. [17]; the others are presented in Cai [3]. The consistency of the estimator $\hat B$ in Lasso-SIR is then as follows.

Theorem 3.
Assume that $n\lambda = p^\alpha$ for some $\alpha > 1/2$, where $\lambda$ is the smallest nonzero eigenvalue of $\mathrm{var}(E[x\mid y])$, and that conditions (C1)-(C3) hold for the multiple index model (1.1). Assume further that the dimension $d$ of the central subspace is known. Let $\hat B$ be the output of Lasso-SIR; then
$$ \|P_{\hat B} - P_B\|_F \le C_1\sqrt{\frac{s\log(p)}{n\lambda}} $$
holds with probability at least $1 - C_2\exp(-C_3\log(p))$ for some constants $C_1$, $C_2$ and $C_3$.

With Theorems 2 and 3, the key to deriving the consistency of our method is to analyze the convergence of the online PCA of $\hat D_t$. The convergence of online principal component analysis has been analyzed in many works, see [1, 20, 25]. However, the setting considered there is that the $t$-th datum $x_t$ is independent of $x_{t-1}$ and its contribution to the online covariance matrix is additive, which is not true for the $\hat d_t$ in our method. This makes the theoretical analysis of the online PCA of $\hat D_t$ more complicated. Cai [3] discussed the convergence properties of the perturbation method and of stochastic gradient optimization; in this paper we show the convergence of the $\hat\eta_t$ obtained by CCIPCA in the following theorem.

Theorem 4.
Under Conditions (C1), (C4)-(C6), the column space of $\hat\eta_t = (\hat\eta_{t,1}, \ldots, \hat\eta_{t,d})$ obtained from (2.11) converges almost surely to the column space of $\Gamma_H$ as $t \to \infty$.

With Theorems 2-4 we can finally derive the consistency of our method. The proofs of Theorem 4 and Theorem 5 are presented in the Appendix.
Theorem 5.
Under Conditions (C1)-(C6), let $\hat B_t$ be the output of Algorithm 1. Then the column space of $\hat B_t = (\hat B_{t,1}, \ldots, \hat B_{t,d})$ converges almost surely to the column space of $B$ as $t \to \infty$.
4. Simulation
In this section we conduct several simulations to evaluate the performance of different methods. The data generating process is as follows. We consider three models:
$$ \text{model 1: } Y = (\beta_1^T x) + \epsilon $$
$$ \text{model 2: } Y = \sin(\beta_1^T x)\times\exp(\beta_2^T x) + \epsilon $$
$$ \text{model 3: } Y = \mathrm{sgn}(\beta_1^T x)\times\big|\,c_0 + (\beta_2^T x)/c_1\,\big| + \epsilon \ \text{ for fixed constants } c_0, c_1 $$
where $x$ is generated from a multivariate normal distribution with zero mean and covariance structure $\mathrm{Cov}(x_i, x_j) = \rho^{|i-j|}$ for a fixed $\rho$. Here $\beta_1$ and $\beta_2$ are $p$-dimensional sparse vectors whose nonzero entries are confined to the first ten coordinates, with all remaining entries set to zero.
Each setting is replicated $N = 100$ times with dimensions $p = 20, 100, 500, 1000$. To measure the accuracy of the different methods, we use the following distance:
$$ d(\beta, \hat\beta) = 1 - |\det(\beta^T\hat\beta)| $$
where $\det(\cdot)$ stands for the determinant operator.

The results are summarized in Tables 2 and 3. Table 2 shows the average distance between the estimator $\hat\beta$ and the true value $\beta$; to compare the computational efficiency of the methods, Table 3 reports the average computing times. From the results we find that the perturbation method is not suitable for high-dimensional data because of its high computational cost. Compared with the methods in [3], our method not only costs less time but also achieves better estimation accuracy for high dimensional data. Combining accuracy and computation time, we recommend the online Lasso-SIR with CCIPCA for online sparse sliced inverse regression.

Table 2: The averages of the distance d(β, β̂) based on 100 replications for Models 1-3

               O-LassoSIR                               O-SIR
  p            Perturbation  GD      CCIPCA  IPCA      Perturbation  GD      SIR     LassoSIR
  model I
  20           0.0066        0.0079  0.0066  0.0065    0.0194        0.0102  0.0065  0.0011
  100          0.0110        0.0096  0.0091  0.0088    0.2654        0.2160  0.0378  0.0015
  500          0.0433        0.0449  0.0431  0.0408    0.7941        0.7776  0.9710  0.0019
  1000         0.1145        0.1124  0.1190  0.1105    0.9190        0.8904  0.9806  0.0019
  model II
  20           0.0122        0.0121  0.0120  0.0121    0.0288        0.0146  0.0038  0.0023
  100          0.0178        0.0146  0.0148  0.0187    0.2171        0.0190  0.0223  0.0028
  500          0.0487        0.0357  0.0367  0.0413    0.8438        0.8239  0.9577  0.0027
  1000         0.1220        0.1156  0.1134  0.1172    0.9384        0.9310  0.9773  0.0033
  model III
  20           0.0591        0.0431  0.0565  0.0569    0.4775        0.0483  0.0202  0.0270
  100          0.0787        0.0713  0.0773  0.0787    0.3951        0.4061  0.0815  0.0258
  500          0.1679        0.1562  0.1363  0.1390    0.9747        0.9340  0.9821  0.0281
  1000         0.3451        0.3100  0.2787  0.2993    0.9826        0.9820  0.9996  0.0261

Table 3: The averages of the computation time (in seconds) based on 100 replications for Models 1-3
               O-LassoSIR                               O-SIR
  p            Perturbation  GD      CCIPCA  IPCA      Perturbation  GD
  model I
  20           0.8714        0.3317  0.3118  0.5366    0.4122        0.2749
  100          4.7487        0.5359  0.4708  0.7358    5.0475        1.062
  500          315.3         5.209   3.797   5.251     358.0         84.26
  1000         3239.7        26.38   20.78   26.64     3275.5        817.0
  model II
  20           0.5417        0.3139  0.3157  0.5511    0.4903        0.2563
  100          3.997         0.4800  0.4633  0.7272    4.544         1.0748
  500          313.8         4.723   4.156   5.545     366.3         82.17
  1000         3209.6        20.72   18.87   24.78     3254.0        801.77
  model III
  20           1.003         0.622   0.621   0.811     0.877         0.484
  100          7.928         0.961   0.934   1.119     8.249         1.344
  500          615.6         7.868   7.172   7.834     649.0         85.000
  1000         5033.6        36.278  33.058  35.030    5592.6        841.569
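For completeness, the subspace distance used in Table 2 can be computed as in the following sketch; the QR orthonormalization of the two bases is our own assumption (the formula presumes the columns of $\beta$ and $\hat\beta$ are orthonormal).

```python
import numpy as np

def subspace_distance(B, B_hat):
    """d(B, B_hat) = 1 - |det(B^T B_hat)| after orthonormalizing the columns."""
    Q1, _ = np.linalg.qr(np.atleast_2d(B.T).T)       # orthonormal basis of col(B)
    Q2, _ = np.linalg.qr(np.atleast_2d(B_hat.T).T)   # orthonormal basis of col(B_hat)
    return 1.0 - abs(np.linalg.det(Q1.T @ Q2))

# example: two nearly aligned directions in R^5 give a distance close to 0
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
b_hat = np.array([0.99, 0.01, 0.0, 0.0, 0.0])
print(subspace_distance(b, b_hat))
```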
5. Real Data Analysis
To further illustrate the performance of the proposed method, we apply it to two datasets. The first is the Cpusmall dataset from the Delve comp-activ collection (~delve/data/comp-activ/desc.html). It contains p = 12 features describing computer system activity measures, and the response variable is the portion of time that CPUs run in user mode; we regard it as a low-dimensional regression problem. We select 1000 observations as a training set and use the remaining observations as a test set, and we choose the number of dimension reduction directions d = 3. After applying the dimension reduction methods to the dataset, we use the SVM algorithm to build the regression model, and we evaluate the prediction performance by the relative prediction error
$$ \sum_{i\in\text{test set}}(Y_i - \hat Y_i)^2\Big/\sum_{i\in\text{test set}}(Y_i - \bar Y)^2. $$

The second dataset concerns activity recognition based on wearable physiological measurements and is publicly available online. It contains p = 533 features from the electrocardiogram (ECG), thoracic electrical bioimpedance (TEB) and electrodermal activity (EDA) signals: 174 attributes are statistics extracted from the ECG signal, 151 attributes are features extracted from the TEB signal, 104 attributes come from the EDA measured on the arm, and 104 from the EDA measured on the hand. There are four types of activity to be recognized: neutral, emotional, mental and physical. For this dataset we again randomly select 1000 observations as a training set and use the remainder as a test set, so it can be regarded as a high-dimensional classification problem. We choose the number of dimension reduction directions d = 3. After applying the dimension reduction methods, we use the SVM algorithm to construct the classifier, and the prediction accuracy $\sum_{i\in\text{test set}} I(y_i = \hat y_i)/n_{\text{test}}$ is used as the evaluation criterion. For both datasets, Lasso-SIR via batch learning is regarded as a benchmark. The results are presented in Table 4 (for Cpusmall the entries are relative prediction errors; for activity recognition they are classification accuracies).

Table 4: Prediction performance on the test sets

                       O-LassoSIR                              O-SIR
  dataset              Perturbation  GD     CCIPCA  IPCA      O-SIR-P  O-SIR-GD  LassoSIR
  Cpusmall             0.073         0.071  0.072   0.071     0.072    0.069     0.060
  Activity Recognize   0.611         0.617  0.640   0.656     0.402    0.461     0.689

From Table 4 we find that online sparse SIR and online SIR [3] give similar prediction performance on the Cpusmall dataset, while on the activity recognition dataset online sparse SIR achieves better prediction accuracy than online SIR. Moreover, compared with the benchmark, our methods are not much inferior. Hence, it is reasonable to conclude that our method is as effective as online SIR for low dimensional data, and more effective for high dimensional data.
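A sketch of the evaluation protocol used above, assuming an estimated direction matrix $\hat B$ is already available: project the features, fit a support vector regressor on the training part, and compute the relative prediction error. The train/test split and SVR hyperparameters here are illustrative defaults, not the ones used in the paper.

```python
import numpy as np
from sklearn.svm import SVR

def relative_prediction_error(X, y, B_hat, n_train=1000, random_state=0):
    """Project onto the estimated directions, fit an SVR, report the relative test error."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(y))
    train, test = idx[:n_train], idx[n_train:]
    Z = X @ B_hat                                   # reduced predictors (n, d)
    model = SVR().fit(Z[train], y[train])
    y_hat = model.predict(Z[test])
    resid = np.sum((y[test] - y_hat) ** 2)
    total = np.sum((y[test] - np.mean(y[test])) ** 2)
    return resid / total
```

For the classification dataset the same pipeline would use a support vector classifier and report the fraction of correctly predicted labels instead.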
6. Conclusion
By implementing Lasso-SIR in an online fashion, we have proposed an approach to online sparse dimension reduction with the truncated gradient, which is more computationally efficient and performs better than Cai [3] for high-dimensional data. The online version of (1.3) consists of two important steps: one is the online update of the eigenvalues and eigenvectors of $\hat D_t$, the other is online $L_1$ regularization. We slightly modify online PCA to tackle the former problem and summarize four different algorithms for the online eigen-decomposition of $\hat D_t$; we also establish the theoretical convergence of CCIPCA in our setting. To tackle the sparsity problem, we use the truncated gradient, which has been shown to be an online counterpart of $L_1$ regularization in the batch setting, and we establish the theoretical convergence properties of our estimators. The analysis of computational complexity and the simulation studies demonstrate the advantages of our method for high dimensional data. However, there is still a gap in accuracy between our method and batch Lasso-SIR, which calls for further research.
References

[1] Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. Pages 861-868, IEEE, 2012.
[2] Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. Online principal components analysis. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 887-901. SIAM, 2014.
[3] Zhanrui Cai, Runze Li, and Liping Zhu. Online sufficient dimension reduction through sliced inverse regression. Journal of Machine Learning Research, 21(10):1-25, 2020.
[4] Hervé Cardot and David Degras. Online principal component analysis in high dimension: Which algorithm to choose? International Statistical Review, 86(1):29-50, 2018.
[5] M. Chavent, S. Girard, and V. Kuentz-Simonet. A sliced inverse regression approach for data stream. Computational Statistics, 29(5):1129-1152, 2014.
[6] Delin Chu, Li-Zhi Liao, Michael Kwok-Po Ng, and Xiaoyan Wang. Incremental linear discriminant analysis: a fast algorithm and comparisons. IEEE Transactions on Neural Networks and Learning Systems, 26(11):2716-2735, 2015.
[7] R. Dennis Cook and Sanford Weisberg. Discussion of "Sliced inverse regression for dimension reduction". Journal of the American Statistical Association, 86(414):335, 1991.
[8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042-1049, 2000.
[9] Peter M. Hall, A. David Marshall, and Ralph R. Martin. Incremental eigenanalysis for classification. In BMVC, volume 98, pages 286-295, 1998.
[10] Tae-Kyun Kim, Shu-Fai Wong, Bjorn Stenger, Josef Kittler, and Roberto Cipolla. Incremental linear discriminant analysis using sufficient spanning set approximations. Pages 1-8, IEEE, 2007.
[11] T. Krasulina. Method of stochastic approximation in the determination of the largest eigenvalue of the mathematical expectation of random matrices. Automation and Remote Control, 2:50-56, 1970.
[12] Harold Joseph Kushner and Dean S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26. Springer Science & Business Media, 2012.
[13] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777-801, 2009.
[14] Bing Li and Shaoli Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997-1008, 2007.
[15] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.
[16] Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025-1039, 1992.
[17] Qian Lin, Zhigen Zhao, and Jun S. Liu. Sparse sliced inverse regression via Lasso. Journal of the American Statistical Association, pages 1-33, 2019.
[18] Yanyuan Ma and Liping Zhu. A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497):168-179, 2012.
[19] Erkki Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927-935, 1992.
[20] Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69-84, 1985.
[21] Shaoning Pang, Seiichi Ozawa, and Nikola Kasabov. Incremental linear discriminant analysis for classification of data streams. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(5):905-914, 2005.
[22] Paul Rodriguez and Brendt Wohlberg. A Matlab implementation of a fast incremental principal component pursuit algorithm for video background modeling. Pages 3414-3416, IEEE, 2014.
[23] Terence D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459-473, 1989.
[24] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287-2320, 2008.
[25] Juyang Weng, Yilu Zhang, and Wey-Shiuan Hwang. Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034-1040, 2003.
[26] Yingcun Xia. A constructive approach to the estimation of dimension reduction directions. The Annals of Statistics, 35(6):2654-2690, 2007.
[27] Yingcun Xia, Howell Tong, Wai Keung Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363-410, 2002.
[28] Ning Zhang and Qiang Wu. Online learning for supervised dimension reduction. 2018.
[29] Yilu Zhang and Juyang Weng. Convergence analysis of complementary candid incremental principal component analysis. Michigan State University, 2001.
[30] Haitao Zhao and Pong Chi Yuen. Incremental linear discriminant analysis for face recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(1):210-221, 2008.
[31] Haitao Zhao, Pong Chi Yuen, and James T. Kwok. A novel incremental principal component analysis and its application for face recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(4):873-886, 2006.
[32] Li-Ping Zhu, Li-Xing Zhu, and Zheng-Hui Feng. Dimension reduction in regressions through cumulative slicing estimation. Journal of the American Statistical Association, 105(492):1455-1466, 2010.
[33] Li-Xing Zhu and Kai-Tai Fang. Asymptotics for kernel estimate of sliced inverse regression. The Annals of Statistics, 24(3):1053-1068, 1996.
Appendix A.1. Proof of Theorem 4
Since the perturbation method, the SGD method and the reduced rank incremental PCA have been discussed in [3] and [28], we only discuss the convergence of CCIPCA in our theorem. Firstly, we consider the case $j = 1$. We have
$$ v_{t+1,1} = \frac{t}{t+1}v_{t,1} + \frac{1}{t+1}\hat d_{t+1}\hat d_{t+1}^T\frac{v_{t,1}}{\|v_{t,1}\|} \qquad (A.1) $$
which is equivalent to
$$ v_{t+1,1} = v_{t,1} + \frac{1}{t+1}\Big(\frac{\hat D_{t+1}}{\|v_{t,1}\|} - I\Big)v_{t,1} \qquad (A.2) $$
$$ v_{t+1,1} = v_{t,1} + \frac{1}{t+1}\Big(\frac{D}{\|v_{t,1}\|} - I\Big)v_{t,1} + \frac{1}{t+1}\frac{\hat D_{t+1} - D}{\|v_{t,1}\|}v_{t,1} \qquad (A.3) $$
We then refer to the following lemmas.

Lemma 6. Assume that a sequence of non-negative random variables $\{r_n\}$ satisfies $r_n = O_p(n^{-1/2})$. Then for every $\varepsilon > 0$,
$$ \lim_{k\to\infty}\Pr\Big(\sum_{n=k}^{\infty} n^{-1}r_n > \varepsilon\Big) = 0. $$

Lemma 7. Let $v^*$ be a locally asymptotically stable (in the sense of Liapunov) solution to
$$ \dot v = \Big(\frac{D}{\|v\|} - I\Big)v \qquad (A.5) $$
with domain of attraction $\mathcal{D}(v^*)$. If there is a compact set $\mathcal{A}\subset\mathcal{D}(v^*)$ such that the solution $v_{t,1}$ satisfies $P\{v(n)\in\mathcal{A}\} = 1$, then $v(n)$ tends to $v^*$ almost surely.

Proof: To prove this lemma, we use Theorem 2.3.1 in [12]. By (A.2), (A.5) and $a_t = 1/t$, Assumptions A2.2.1, A2.2.2 and A2.2.3 in [12] are easy to verify. Next we show the boundedness of $v(t)$. By the iteration equation (A.2), we have
$$ \|v(t)\|^2 = \|v(t-1)\|^2 + \frac{2}{t}\frac{v^T(t-1)D(t)v(t-1)}{\|v(t-1)\|} - \frac{2}{t}v^T(t-1)v(t-1) + \frac{1}{t^2}\frac{v^T(t-1)D(t)^2v(t-1)}{\|v(t-1)\|^2} + \frac{1}{t^2}v^T(t-1)v(t-1) - \frac{2}{t^2}\frac{v^T(t-1)D(t)v(t-1)}{\|v(t-1)\|} \qquad (A.6) $$
Next, we focus on each quantity in (A.6). If $\lambda_{\max}(D(t)) \le \|v(t-1)\|$, then
$$ \frac{2}{t}\frac{v^T(t-1)D(t)v(t-1)}{\|v(t-1)\|} \le \frac{2\lambda_{\max}(D(t))}{t}\|v(t-1)\| \le \frac{2}{t}\|v(t-1)\|^2 \qquad (A.7) $$
Moreover, when $t$ is large enough and satisfies $t > \max\{1, \lambda_{\max}(D(t))\}$,
$$ \frac{1}{t^2}\frac{v^T(t-1)D(t)^2v(t-1)}{\|v(t-1)\|^2} \le \frac{\lambda_{\max}^2(D(t))}{t^2} < \frac{1}{t}\|v(t-1)\|^2 \qquad (A.8) $$
and
$$ \frac{1}{t^2}v^T(t-1)v(t-1) \le \frac{1}{t}\|v(t-1)\|^2 \qquad (A.9) $$
With (A.6), (A.7), (A.8) and (A.9), we have
$$ \|v(t)\|^2 < \|v(t-1)\|^2 + \frac{2}{t}\|v(t-1)\|^2 - \frac{2}{t}\|v(t-1)\|^2 + \frac{1}{t}\|v(t-1)\|^2 + \frac{1}{t}\|v(t-1)\|^2 - \frac{2}{t^2}\frac{v^T(t-1)D(t)v(t-1)}{\|v(t-1)\|} \qquad (A.10) $$
Hence, when $t > \max\{1, \lambda_{\max}(D(t))\}$, we have $\|v(t)\|^2 < \|v(t-1)\|^2$. Since $\|D(t) - D\| = O_p(t^{-1/2})$ and the largest eigenvalue of $D$ is bounded, $v(t)$ is bounded while $\lambda_{\max}(D(t)) \le \|v(t-1)\|$. From the two cases, $\|v(t)\| < \|v(t-1)\|$ or $\|v(t-1)\| < \lambda_{\max}(D(t))$, we conclude that $v(t)$ is bounded with probability 1.

Besides the boundedness of $v(t)$, we also verify Assumption A2.2.4 in Kushner and Clark [12]. Defining
$$ r_t = \Big\|\frac{\hat D_{t+1} - D}{\|v_{t,1}\|}v_{t,1}\Big\|, $$
we have
$$ \Big\|\frac{\hat D_{t+1} - D}{\|v_{t,1}\|}v_{t,1}\Big\| \le \|\hat D_{t+1} - D\| = O_p(t^{-1/2}). $$
Thus
$$ \Pr\Big(\sup_{m\ge k}\Big\|\sum_{i=k}^m \frac{1}{i}\frac{\hat D_{i+1} - D}{\|v_{i,1}\|}v_{i,1}\Big\| > \varepsilon\Big) \le \Pr\Big(\sum_{n=k}^\infty \frac{1}{n}r_n > \varepsilon\Big). $$
With Lemma 6, Assumption A2.2.4 is satisfied. Then Theorem 2.3.1 in [12] implies the result of Lemma 7. □
To complete the proof of Theorem 4, it remains to show that the locally asymptotically stable solution of (A.5) is $\lambda_1\eta_1$ and that (A.3) satisfies $P\{v(n)\in\mathcal{A}\} = 1$. Firstly, we rewrite $v(t) = \sum_{j=1}^d \alpha_j(t)\eta_j$, where $\alpha_j(t) = v^T(t)\eta_j$ and $(\lambda_j, \eta_j), j = 1, \ldots, d$ are the top-$d$ eigenvalues and eigenvectors of $D$. Then (A.5) is equivalent to
$$ \dot\alpha = \Big(\frac{\Lambda_d}{\sqrt{\sum_{k=1}^d\alpha_k^2}} - I\Big)\alpha \qquad (A.11) $$
where $\alpha = (\alpha_1, \ldots, \alpha_d)$ and $\Lambda_d = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. Referring to the derivation in [29], we have $\alpha_1 \to \pm\lambda_1$ and $\alpha_j \to 0$ for $j > 1$, and
$$ v^T(t)\lambda_1\eta_1 = \lambda_1\sum_{k=1}^d\alpha_k\eta_k^T\eta_1 \to \pm\lambda_1^2. $$
Hence $v(n)$ enters $\mathcal{D}(\pm\lambda_1\eta_1)$ with probability one. Finally, applying Lemma 7 we obtain that $v(n) \to \pm\lambda_1\eta_1$ with probability 1 as $n\to\infty$. We omit the proof of the case $j > 1$. □

Proof of Theorem 5. For simplicity, we assume $E(x) = 0$. Denote by $\hat\beta_t$ the output of Algorithm 1. Let $\tilde\beta_{\mathrm{lasso}}$ be the solution of
$$ \min_{\beta_t}\frac{1}{2t}\|\tilde Y(t) - X^T\beta_t\|_2^2 + \mu\|\beta_t\|_1 $$
where $\tilde Y(t) = \frac{1}{Ht}\tilde M_t\tilde M_t^TX_t^T\hat\eta_t\,\mathrm{diag}(1/\hat\lambda_{t,1}, \ldots, 1/\hat\lambda_{t,d})$. Then, with Theorem 2, we have $\|\hat\beta_t - \tilde\beta_{\mathrm{lasso}}\| = O_p(t^{-1/2})$. Next, let $\hat\beta_{\mathrm{lasso}}$ be the solution of
$$ \min_\beta \frac{1}{2t}\|\tilde Y - X^T\beta\|_2^2 + \mu\|\beta\|_1 $$
where $\tilde Y = \frac{1}{c}MM^TX^T\hat\eta\Lambda_d^{-1}$. Further, with Theorem 4, $\hat\beta_{\mathrm{lasso}}$ converges almost surely to $\tilde\beta_{\mathrm{lasso}}$. Then, similarly to the proof of Theorem 3 in [17], we set $\eta = \Sigma\beta$, $\tilde\eta = P_\eta\hat\eta$ and $\tilde\beta = \Sigma^{-1}\tilde\eta \propto \beta$, where $\beta$ is the true value of $\beta$. Then we have
$$ \|P_{\hat\beta_t} - P_\beta\|_F = \|P_{\hat\beta_t} - P_{\tilde\beta}\|_F \le \frac{2\|\hat\beta_t - \tilde\beta\|}{\|\tilde\beta\|} \le \frac{2\big(\|\hat\beta_t - \tilde\beta_{\mathrm{lasso}}\| + \|\tilde\beta_{\mathrm{lasso}} - \hat\beta_{\mathrm{lasso}}\| + \|\hat\beta_{\mathrm{lasso}} - \tilde\beta\|\big)}{\|\tilde\beta\|} = \frac{2\|\hat\beta_{\mathrm{lasso}} - \tilde\beta\|}{\|\tilde\beta\|} + O(t^{-1/2}) = O\Big(\sqrt{\frac{s\log(p)}{t\lambda}} + t^{-1/2}\Big) $$
Hence the column space of $\hat B_t = (\hat B_{t,1}, \ldots, \hat B_{t,d})$ converges almost surely to the column space of $B$ as $t\to\infty$. □