Bayesian inverse regression for dimension reduction with small datasets
Xin Cai · Guang Lin · Jinglai Li
Abstract
We consider supervised dimension reduction problems, namely to identify a low dimensional projection of the predictors x which can retain the statistical relationship between x and the response variable y. We follow the idea of the sliced inverse regression (SIR) and the sliced average variance estimation (SAVE) type of methods, which is to use the statistical information of the conditional distribution π(x|y) to identify the dimension reduction (DR) space. In particular we focus on the task of computing this conditional distribution without slicing the data. We propose a Bayesian framework to compute the conditional distribution, where the likelihood function is obtained using the Gaussian process regression model. The conditional distribution π(x|y) can then be computed directly via Monte Carlo sampling. We can then perform DR by considering certain moment functions (e.g. the first or the second moment) of the samples of the posterior distribution. With numerical examples, we demonstrate that the proposed method is especially effective for small data problems.
Keywords Bayesian inference · covariance operator · dimension reduction · Gaussian process · inverse regression

Mathematics Subject Classification (2000)

This work was supported by the NSFC under grant number 11301337. The computational resources were provided by the Student Innovation Center at Shanghai Jiao Tong University.

X. Cai: School of Mathematical Sciences, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai 200240, China.
G. Lin: Department of Mathematics, Purdue University, West Lafayette, IN 47906, USA.
J. Li: Department of Mathematical Sciences, University of Liverpool, Liverpool L69 7XL, UK. E-mail: [email protected]
1 Introduction

In many statistical regression problems, one has to deal with situations where the available data are insufficient to support a robust regression. If one conducts regression directly in such problems, one often risks overfitting or incorrect regularization. In either case, the resulting regression model may lose its prediction accuracy. Extracting and selecting the important features, or eliminating the redundant ones, is a key step to avoid overfitting and improve the robustness of the regression task [11]. Feature extraction and selection thus consists of identifying a low dimensional subspace of the predictors x which retains the statistical relationship between x and the response y, i.e. a supervised dimension reduction problem. Mathematically such problems are often posed as estimating the central dimension reduction (DR) subspace [5]. A very popular class of methods estimates this central subspace by considering the statistics of the predictors x conditional on the response y; such methods include the sliced inverse regression (SIR) proposed in the seminal work [18], the sliced average variance estimation (SAVE) [6,8], and many of their variants, e.g. [5,17,36,19,33,16,21,29,20]. Some of the extensions and variants have been developed specifically for machine learning problems, e.g., [12,32,13]. The literature on this topic is vast and we refer to [22,15] for a more comprehensive overview of the subject. It should be noted that most of the aforementioned methods adopt a nonparametric formulation without assuming any specific relation between x and y. As will be shown in the examples, the nonparametric approaches may not work well for problems with a very small number of data points, which is the setting considered in the present work. To this end, an alternative type of methods is to assume a parametric model of the likelihood function p(y|x), and then compute the reduced dimensions with maximum likelihood estimation [4,3], or in a Bayesian formulation [23,27]. A main disadvantage of the parametric models is that they may lack the flexibility to accurately characterize the relation between the predictors x and the response y.

In this work we present a method that combines the SIR/SAVE type of methods with the model based ones, to make them more effective for small data problems. In particular we remain in the SIR/SAVE framework to identify the DR space. Many works in this class focus on the question: what statistical information of the conditional distribution π(x|y) should one use to obtain the DR subspace? For example, SIR makes use of the expectation of π(x|y) to identify the DR directions, SAVE utilizes its variance, and the method in [34] is based on the third moments. In this work we consider a different aspect of the problem: how to obtain the conditional distribution π(x|y) when the data set is small? In SIR and SAVE, the conditional moments are approximately estimated by slicing the data [18]. As will be demonstrated with numerical examples, the slicing strategy does not perform well if we have a very small data set.
The main purpose of this work is to address the problem of computing the conditional distribution π(x|y). In particular we present a Bayesian formulation which can provide not only the first or second moments, but the full conditional distribution π(x|y); once the conditional distribution is available, one can use any desired method to estimate the DR subspace based on it. Just like [4,3], our method also involves constructing the likelihood function π(y|x) from data, but a main difference here is that we characterize the likelihood function with a nonparametric Gaussian process (GP) model [31], which may provide more flexibility than a parametric model. Once the likelihood function is available, we can compute the posterior distribution π(x|y) from the likelihood function and a desired distribution of x. In this work we choose to mainly use the first order moment of the conditional distribution (following SIR) to demonstrate the method, while noting that the method can be easily extended to other conditional moments. It is important to note here that, while the conditional distribution π(x|y) is computed in a Bayesian fashion, the core of the method (i.e. the estimation of the DR subspace) remains frequentist, and so it is fundamentally different from the methods [30,23,27] that estimate the DR subspace itself with a Bayesian formulation (e.g., imposing a prior on the DR subspace).

To summarize, the main contribution of this work is to propose a GP based Bayesian formulation to compute the conditional distribution π(x|y) for any value of y in the SIR/SAVE framework; by doing so it avoids slicing the samples, which makes it particularly effective for problems with very small numbers of data points.

The rest of this paper is organized as follows. In Section 2 we set up a formulation of dimension reduction and go through the basic idea of the classic dimension reduction approaches SIR and SAVE. The Bayesian inverse regression and the Bayesian average variance estimation are introduced explicitly in Section 3, including a Bayesian formulation for computing π(x|y), the GP model used, and complete algorithms to draw samples from π(x|y). In Section 4 we provide several numerical examples. Section 5 offers concluding remarks.

2 Preliminaries

2.1 Problem setup

Let x be a p-dimensional random variable defined on R^p following a distribution π(x), and suppose that we are interested in a scalar function of x, which ideally can be written as
$$y = f(\mathbf{b}_1^T\mathbf{x},\,\mathbf{b}_2^T\mathbf{x},\,\ldots,\,\mathbf{b}_K^T\mathbf{x},\,\epsilon), \tag{2.1}$$
where b_k for k = 1...K are p-dimensional vectors, and ε is a small noise independent of x. It should be clear that, when this model holds, the projection of the p-dimensional variable x onto the K-dimensional subspace of R^p spanned by {b_1, ..., b_K} captures all the information of x with respect to y, and if K < p, we can achieve the goal of dimension reduction by estimating the coefficients {b_k}_{k=1}^K. In practice, both the explicit expression of f and the coefficients {b_k}_{k=1}^K are unknown; instead we have a set of data pairs {(x_j, y_j)}_{j=1}^n drawn from the joint distribution π(x, y) defined by π and Eq. (2.1). Finding a set of {b_k}_{k=1}^K that satisfy Eq. (2.1) from the given data set {(x_j, y_j)}_{j=1}^n is the task of supervised dimension reduction.
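To make the setup concrete, the following sketch simulates data from a model of the form (2.1) with p = 5 predictors and K = 2 active directions; the particular choice of f and of the directions b_1, b_2 is hypothetical and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

p, K, n = 5, 2, 100                         # predictor dim, DR dim, sample size
b1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # hypothetical DR directions
b2 = np.array([0.0, 1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)

x = rng.standard_normal((n, p))             # x ~ N(0, I_p)
eps = rng.standard_normal(n)

# y depends on x only through the projections b1^T x and b2^T x, as in Eq. (2.1)
t1, t2 = x @ b1, x @ b2
y = t1 * (t1 + t2) + 0.1 * eps
```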
In what follows we shall refer to the coefficients {b_k}_{k=1}^K as dimension reduction (DR) directions, and the linear space B spanned by {b_k}_{k=1}^K as the DR subspace. For a more formal and generic description of the DR problem (in the central DR subspace and sufficient dimension reduction framework) we refer to [5].

2.2 Sliced inverse regression

The SIR approach [18] estimates the DR directions based on the idea of inverse regression (IR). In contrast to the forward regression E(y|x), IR regresses each coordinate of x against y. Thus as y varies, E(x|y) draws a curve in R^p along the y coordinate, whose center is located at E(E(x|y)) = E(x). For simplicity we shall assume throughout this section that x is a standardized random variable, namely E(x) = 0 and Cov(x) = I. Under the following condition the IR curve E(x|y) is contained in the DR subspace B [18]:

Condition 1
For any β ∈ R^p, the conditional expectation E(β^T x | b_1^T x, ..., b_K^T x) is linear in b_1^T x, ..., b_K^T x.

This condition is satisfied when the distribution of x is elliptically symmetric [18]. An important implication of this property is that the covariance matrix Cov[E(x|y)] is degenerate in any direction orthogonal to the DR subspace B. We see, therefore, that the eigenvectors associated with the largest K eigenvalues of Cov[E(x|y)] are the DR directions. So the key to estimating the DR directions is to obtain the covariance of the conditional expectation of the data, Cov[E(x|y)].

One of the most popular approaches to estimate this covariance is SIR. Simply put, SIR produces a crude estimate of E(x|y) by slicing the data (x_1, y_1), ..., (x_n, y_n) into H partitions according to the value of y_j, and then estimating E(x | y ∈ I_h) for each h = 1, ..., H using the data inside the interval I_h. Finally one uses the H estimates to compute an estimate of the covariance matrix Cov[E(x|y)]. A complete SIR scheme is described as follows (a code sketch is given after the scheme):

1. Divide the range of y into H slices, I_1, ..., I_H. Let the proportion of the y_j that fall in slice I_h be p̂_h, i.e.,
$$\hat{p}_h = \frac{1}{n}\sum_{j=1}^{n}\delta_h(y_j),$$
where δ_h(y_j) takes the value 1 if y_j falls into the h-th slice I_h and 0 otherwise.
2. Within each slice, compute the sample mean of the x_j's, denoted by x̂_h (h = 1, ..., H):
$$\hat{\mathbf{x}}_h = \frac{1}{n\hat{p}_h}\sum_{y_j\in I_h}\mathbf{x}_j.$$
3. Compute the weighted covariance matrix
$$\hat{C} = \sum_{h=1}^{H}\hat{p}_h\,\hat{\mathbf{x}}_h\hat{\mathbf{x}}_h^T.$$
4. Perform an eigenvalue decomposition of Ĉ, and return the eigenvectors associated with the K largest eigenvalues as the estimated DR directions b̂_1, ..., b̂_K.
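The scheme above is straightforward to implement. Below is a minimal sketch of SIR in Python, assuming the predictors have already been standardized; the function name and the equal-count slicing choice are ours, not from the paper.

```python
import numpy as np

def sir_directions(x, y, n_slices, n_dirs):
    """Minimal SIR sketch: slice on y, average x within each slice,
    then eigendecompose the weighted covariance of the slice means.
    Assumes x (shape (n, p)) is already standardized."""
    n, p = x.shape
    order = np.argsort(y)                     # sort data by response
    slices = np.array_split(order, n_slices)  # H slices of (nearly) equal size
    C = np.zeros((p, p))
    for idx in slices:
        p_h = len(idx) / n                    # slice proportion  p_hat_h
        xbar_h = x[idx].mean(axis=0)          # slice mean        x_hat_h
        C += p_h * np.outer(xbar_h, xbar_h)   # weighted covariance
    eigval, eigvec = np.linalg.eigh(C)        # eigenvalues in ascending order
    return eigvec[:, -n_dirs:][:, ::-1]       # K leading eigenvectors
```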
As mentioned in Section 1, the slicing treatment is often not sufficiently accurate when the data set is small, and in what follows we shall provide an alternative way to compute this covariance matrix.

2.3 Sliced average variance estimation

The SAVE method extracts the DR directions from the variance of π(x|y), and by doing so it is able to recover information that could be overlooked by SIR because of symmetries in the forward regression function [8]. Let the columns of B form a basis for the DR subspace. To use SAVE, we need to assume the following two conditions [8]:

1. E(x | B^T x) is linear in B^T x,
2. Var(x | B^T x) is constant.

The conditions hold when x is normally distributed, although normality is not necessary. Under these two conditions, one can derive that span{I_p − E(Cov[x|y])} is a DR space [8], which is the basis for SAVE. A complete SAVE scheme is as follows (a code sketch is given after the scheme):

1. Divide the range of y into H slices, I_1, ..., I_H. Let the proportion of the y_j that fall in slice I_h be p̂_h, i.e.,
$$\hat{p}_h = \frac{1}{n}\sum_{j=1}^{n}\delta_h(y_j),$$
where δ_h(y_j) takes the value 1 if y_j falls into the h-th slice I_h and 0 otherwise.
2. Within each slice, compute the sample covariance matrix of the x_j's, denoted by M̂_h (h = 1, ..., H):
$$\hat{M}_h = \frac{1}{n\hat{p}_h}\sum_{y_j\in I_h}(\mathbf{x}_j-\hat{\mathbf{x}}_h)(\mathbf{x}_j-\hat{\mathbf{x}}_h)^T. \tag{2.2}$$
3. Perform an eigenvalue decomposition of the matrix
$$\hat{C} = \sum_{h=1}^{H}\hat{p}_h\,(I-\hat{M}_h), \tag{2.3}$$
and return the eigenvectors associated with the K largest eigenvalues as the estimated DR directions.
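A matching sketch for SAVE, under the same assumptions as the SIR sketch above (standardized predictors, equal-count slicing); again the function name is ours.

```python
import numpy as np

def save_directions(x, y, n_slices, n_dirs):
    """Minimal SAVE sketch: slice on y, form within-slice covariances,
    then eigendecompose the weighted average of (I - M_h), Eq. (2.3)."""
    n, p = x.shape
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    C = np.zeros((p, p))
    for idx in slices:
        p_h = len(idx) / n
        xc = x[idx] - x[idx].mean(axis=0)     # center within the slice
        M_h = xc.T @ xc / len(idx)            # slice covariance  M_hat_h
        C += p_h * (np.eye(p) - M_h)          # Eq. (2.3)
    eigval, eigvec = np.linalg.eigh(C)
    return eigvec[:, -n_dirs:][:, ::-1]
```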
3 Bayesian inverse regression and Bayesian average variance estimation

3.1 A Bayesian formulation for computing π(x|y)

Recall that in the SIR framework, a key step is to compute the covariance Cov[E(x|y)]. A natural choice for estimating this covariance is to use the sample covariance of the data points,
$$\hat{C} = \frac{1}{n-1}\sum_{j=1}^{n}(\hat{\mathbf{x}}_j-\bar{\mathbf{x}})(\hat{\mathbf{x}}_j-\bar{\mathbf{x}})^T, \qquad \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\hat{\mathbf{x}}_j, \tag{3.1}$$
where x̂_j is an estimate of E(x|y_j) for all j = 1...n, and (y_1, ..., y_n) are the data points. Next we need to compute x̂_j, the estimate of E(x|y_j), and we propose to do so in a Bayesian framework. Namely we formulate the problem as computing the posterior distribution
$$\pi(\mathbf{x}|y) \propto \pi(y|\mathbf{x})\,\pi(\mathbf{x}), \tag{3.2}$$
where π(y|x) is the likelihood function and π(x) is the prior of x.

We consider the prior distribution π(x) first. To start we note that the choice of prior does not affect the DR subspace, as the subspace structure lies in the function f(x, ε) in Eq. (2.1) rather than in the distribution of x. As such, in principle one may use any prior distribution that satisfies the conditions required by SIR/SAVE. However, if the chosen prior is too different from the true distribution π, the GP model constructed from the data (which follow π) may not be accurate for the samples drawn from the prior, which in turn may hurt the accuracy of the posterior π(x|y). To this end, one should choose the prior to be π or close to it. We consider the following three cases. First, in certain problems, especially those where the data are generated from computer models, the distribution π(x) may be known in advance. Secondly, for most problems where π is not available in advance, a natural choice is to perform a crude density estimation for the data {x_j}_{j=1}^n and use the estimated density as the prior. For example, one may use Gaussian mixtures [25] or a simple Gaussian to estimate the prior distribution from the data {x_j}_{j=1}^n. Finally, for problems where estimating the density of x is particularly challenging, we can just use the original data points {x_j}_{j=1}^n, in which case the prior is simply the empirical distribution of the data.

3.2 The Gaussian process likelihood

We now construct the likelihood function π(y|x) from data, which, as mentioned earlier, is done using the GP regression model. Simply speaking, GP regression performs a nonparametric regression in a Bayesian framework [31]. The main idea of the GP method is to assume that the function of interest f(x, ε) is a realization of a Gaussian random field, whose mean is μ(x) and whose covariance is specified by a kernel function k(x, x'), namely
Cov[f(x), f(x')] = k(x, x').
The kernel k(x, x') is positive semidefinite and bounded. Now given the data points {(x_j, y_j)}_{j=1}^n, we want to predict the value of y at a new point x. Let X := [x_1, ..., x_n] and Y := [y_1, ..., y_n]. Under the GP assumption, the joint distribution of (Y, y) is Gaussian,
$$\begin{bmatrix} Y \\ y \end{bmatrix} \sim N\left(\begin{bmatrix} \mu(X) \\ \mu(\mathbf{x}) \end{bmatrix}, \begin{bmatrix} K(X,X)+\sigma_n^2 I & K(X,\mathbf{x}) \\ K(\mathbf{x},X) & K(\mathbf{x},\mathbf{x}) \end{bmatrix}\right), \tag{3.3}$$
where σ_n² is the variance of the observation noise, I is an identity matrix, and the notation K(A, B) denotes the matrix of the covariance evaluated at all pairs of points in set A and in set B using the kernel function k(·,·).
It follows immediately from Eq. (3.3) that the conditional distribution π_GP(y | x, X, Y) is also Gaussian:
$$\pi_{GP}(y|\mathbf{x},X,Y) = N(\mu_{pos}(\mathbf{x}),\,\sigma_{pos}^2(\mathbf{x})), \tag{3.4a}$$
where the posterior mean and variance are
$$\mu_{pos}(\mathbf{x}) = \mu(\mathbf{x}) + k(\mathbf{x},X)\big(K(X,X)+\sigma_n^2 I\big)^{-1}(Y-\mu(X)),$$
$$\sigma_{pos}^2(\mathbf{x}) = k(\mathbf{x},\mathbf{x}) - k(\mathbf{x},X)\big(K(X,X)+\sigma_n^2 I\big)^{-1}k(X,\mathbf{x}).$$
There are also a number of technical issues in the GP model, such as choosing the kernel function and determining the hyperparameters. For a detailed discussion of these matters, we refer the readers to [31]. In what follows we shall use the GP posterior as the likelihood function, i.e., we let π(y|x) = π_GP(y | x, X, Y).

3.3 Computing the posterior mean

Once we obtain the likelihood function and the prior, a straightforward idea is to draw samples from the posterior distribution (3.2) with Markov chain Monte Carlo (MCMC) simulation. An alternative strategy is to sample from π(x) in an importance sampling (IS) formulation. Namely, suppose that we draw a set of samples {x_i}_{i=1}^{n_MC} from the prior distribution π(x); for each x_i we can compute the weight
w_i = π(y | x_i).
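The weight w_i is just the GP predictive density (3.4a) evaluated at y. Below is a minimal sketch of the predictive computation, assuming a zero prior mean μ(x) = 0 and a fixed isotropic squared exponential kernel (both placeholder choices; the paper fits hyperparameters by maximum likelihood, see Section 4).

```python
import numpy as np

def gp_posterior(x_new, X, Y, kernel, sigma_n2):
    """GP predictive mean and variance at the rows of x_new, Eq. (3.4),
    assuming a zero prior mean mu(x) = 0."""
    Kxx = kernel(X, X) + sigma_n2 * np.eye(len(X))   # K(X,X) + sigma_n^2 I
    Ksx = kernel(x_new, X)                           # k(x, X)
    alpha = np.linalg.solve(Kxx, Y)                  # (K + sigma_n^2 I)^{-1} Y
    mu_pos = Ksx @ alpha
    v = np.linalg.solve(Kxx, Ksx.T)
    var_pos = kernel(x_new, x_new).diagonal() - np.einsum('ij,ji->i', Ksx, v)
    return mu_pos, var_pos

def sqexp_kernel(A, B, sigma2=1.0, lam=1.0):
    """Isotropic squared exponential kernel (placeholder)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-d2 / lam**2)
```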
Algorithm 1
The Bayesian inverse regression algorithm with MCMC
Require: {(x_j, y_j)}_{j=1}^n, n_MC, π(x)
Ensure:
The estimated DR directions b̂_1, ..., b̂_K.
1: Construct the GP model from data {(x_j, y_j)}_{j=1}^n: π_GP(y | x, X, Y);
2: for j = 1 to n do
3: Draw n_MC samples from π_GP(y_j | x, X, Y) π(x): {x_i}_{i=1}^{n_MC};
4: Compute x̂_j = (1/n_MC) Σ_{i=1}^{n_MC} x_i;
5: end for
6: Compute Ĉ using Eq. (3.1) and {x̂_j}_{j=1}^n;
7: Perform eigenvalue decomposition of Ĉ;
8: Return the eigenvectors associated with the K largest eigenvalues as b̂_1, ..., b̂_K.
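A compact sketch of Algorithm 1, using self-normalized importance sampling from the prior rather than MCMC (both are valid per Section 3.3). The GP helpers are the sketches from Section 3.2, the noise level is a placeholder, and the Gaussian prior fitted to the data is one of the three prior choices discussed in Section 3.1.

```python
import numpy as np

def bir_directions(X, Y, n_dirs, n_mc=10_000, rng=None):
    """Bayesian inverse regression (Algorithm 1), importance-sampling variant."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    # Prior: a simple Gaussian fitted to the predictor data (Section 3.1)
    mean, cov = X.mean(axis=0), np.cov(X.T)
    x_prior = rng.multivariate_normal(mean, cov, size=n_mc)
    # GP predictive distribution at the prior samples (Section 3.2 sketch)
    mu, var = gp_posterior(x_prior, X, Y, sqexp_kernel, sigma_n2=0.01)
    x_hat = np.zeros((n, p))
    for j in range(n):
        # Likelihood pi(y_j | x_i) of each prior sample (up to a constant)
        w = np.exp(-0.5 * (Y[j] - mu) ** 2 / var) / np.sqrt(var)
        w /= w.sum()                     # self-normalized IS weights
        x_hat[j] = w @ x_prior           # estimate of E(x | y_j), Eq. (3.5)
    C = np.cov(x_hat.T)                  # Eq. (3.1)
    eigval, eigvec = np.linalg.eigh(C)
    return eigvec[:, -n_dirs:][:, ::-1]
```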
Algorithm 2 The Bayesian average variance estimation algorithm with MCMC
Require: {(x_j, y_j)}_{j=1}^n, n_MC, π(x)
Ensure:
The estimated DR directions b̂_1, ..., b̂_K.
1: Construct the GP model from data {(x_j, y_j)}_{j=1}^n: π_GP(y | x, X, Y);
2: for j = 1 to n do
3: Draw n_MC samples from π_GP(y_j | x, X, Y) π(x): {x_i}_{i=1}^{n_MC};
4: Compute x̂ = (1/n_MC) Σ_{i=1}^{n_MC} x_i;
5: Compute M̂_j = (1/(n_MC − 1)) Σ_{i=1}^{n_MC} (x_i − x̂)(x_i − x̂)^T;
6: end for
7: Compute Ĉ = (1/n) Σ_{j=1}^n (I_p − M̂_j);
8: Perform eigenvalue decomposition of Ĉ;
9: Return the eigenvectors associated with the K largest eigenvalues as b̂_1, ..., b̂_K.

The weights w_1, ..., w_{n_MC} are then normalized so that Σ_{i=1}^{n_MC} w_i = 1 (if the samples are drawn with MCMC, then w_i = 1/n_MC for all i = 1...n_MC). We thus obtain a set of weighted samples {(x_i, w_i)}_{i=1}^{n_MC} drawn from the posterior π(x|y), and we can estimate E(x|y) as
$$\hat{\mathbf{x}} = \sum_{i=1}^{n_{MC}} w_i\,\mathbf{x}_i. \tag{3.5}$$
We repeat this procedure for each y_j for j = 1...n, and then use Eq. (3.1) to compute Cov[E(x|y)]. Since we use a Bayesian method to estimate E(x|y), we refer to the proposed method as Bayesian inverse regression (BIR). Similarly, the samples can also be used to estimate the conditional covariance Cov[x|y] in SAVE, and the resulting method is termed Bayesian average variance estimation (BAVE). As discussed earlier, what BIR/BAVE essentially provides is a means to draw samples from the conditional distribution π(x|y) without slicing the data; its application is not limited to estimating E(x|y) or Cov[x|y], and it is possible to make use of the conditional distribution in a different manner. The BIR algorithm is summarized in Alg. 1 and BAVE in Alg. 2.
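The same weighted samples also give the conditional covariance needed by BAVE. A small sketch of the weighted moments under the IS formulation, with hypothetical inputs x_samples and w (weights assumed normalized to sum to one):

```python
import numpy as np

def weighted_conditional_moments(x_samples, w):
    """E(x|y) and Cov(x|y) from weighted posterior samples {(x_i, w_i)}."""
    x_hat = w @ x_samples                  # Eq. (3.5)
    xc = x_samples - x_hat
    M_hat = (w[:, None] * xc).T @ xc       # weighted covariance estimate
    return x_hat, M_hat
```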
Remark 1 It is important to reiterate here that the BIR/BAVE methods only use Bayes' formula to compute the conditional distribution π(x|y); the DR methods themselves are frequentist.

Remark 2
A key step in the proposed method is to construct the likelihood π(y|x) with a GP. It is well known that GP regression may not perform well for high dimensional problems. Nevertheless, as demonstrated by the examples, even when it is unable to provide accurate regression results, the resulting GP model is often adequate for dimension reduction purposes. Moreover, as stated earlier, in this work we focus on problems with modestly high dimensionality (less than 100) and a very limited number of data points (hundreds or fewer).

Remark 3
Another issue that should be mentioned here is how to select the number of reduced dimensions; since BIR is also a method based on the eigenvalue decomposition of Cov[E(x|y)], the methods used in [18] and related works, e.g., [10], can be used directly here.

4 Numerical examples

In this section we compare the performance of the proposed BIR/BAVE methods with a number of common methods: SIR, SAVE, likelihood-based DR (LDR) [4], and localized SIR (LSIR) [32], in three mathematical and two real-data examples. The first example uses data simulated from a mathematical function, with which we want to examine the scalability of the methods with respect to the dimensionality of the problem. The second one is also a mathematical example, with which we compare how the performance of the different methods is affected by the non-ellipticity of the distribution of x. The third example is used specifically to compare the two second-moment methods: SAVE and BAVE. Our last two examples are based on real data, in which we compare the performance of the methods in the small data situation. In the GP model used in all the examples, we set the prior mean μ(x) = 0 and choose the automatic relevance determination (ARD) squared exponential kernel [31]:
$$k(\mathbf{x},\mathbf{x}') = \sigma^2\exp\left(-\sum_{i=1}^{p}\frac{(x_i-x_i')^2}{\lambda_i^2}\right), \tag{4.1}$$
where the hyperparameters σ, λ_1, ..., λ_p, and σ_n are determined by maximum likelihood estimation [31]. In all the examples except the one in Section 4.2, the prior is obtained by fitting a Gaussian distribution to the data, while for the example in Section 4.2 we assume that the distribution π is known and use it as the prior. In addition, in all the examples, 10000 MCMC samples are used to represent the conditional distribution π(x|y) in the BIR and BAVE methods.
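A sketch of the ARD kernel (4.1); the hyperparameters here are passed in by hand, whereas the paper determines them by maximum likelihood.

```python
import numpy as np

def ard_sqexp_kernel(A, B, sigma2, lam):
    """ARD squared exponential kernel, Eq. (4.1).
    A: (m, p), B: (n, p); lam: length scales, one per coordinate."""
    lam = np.asarray(lam)
    d2 = (((A[:, None, :] - B[None, :, :]) / lam) ** 2).sum(-1)
    return sigma2 * np.exp(-d2)
```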
Fig. 1
The R² accuracy of the DR subspace (left), the first DR direction b_1 (center) and the second DR direction b_2 (right), all plotted against the dimensionality d. (a) results for function (4.2a); (b) results for function (4.2b).

4.1 Mathematical examples with Gaussian distribution

We first consider a d-dimensional problem where x follows a standard normal distribution. The data are simulated from the following functions:
$$f(\mathbf{x},\epsilon) = x_1(x_1+x_2) + 0.1\,\epsilon, \tag{4.2a}$$
$$f(\mathbf{x},\epsilon) = x_1 + x_2 + x_3\,(x_1+x_2) + 0.1\,\epsilon, \tag{4.2b}$$
where ε ∼ N(0,1). We test the methods for d = 10, 20, 30, 40,
50, where we set the number of data points to be n = 5d, i.e., growing linearly with respect to dimensionality. To evaluate the performance of the methods, we use the R² metric of accuracy used in [18] to measure the accuracy of the DR subspace and the DR directions. We repeat all the tests 100 times and report the average. Specifically, we show the R² accuracy of the DR subspace B and the two DR directions in Fig. 1. We can see that the BIR method has the best performance in all the tests in the two examples, except one situation: d = 10 for function (4.2b). The R² accuracy for each DR direction provides more information on the results. Namely, for function (4.2a), BIR performs better than all the other methods in both directions. For function (4.2b), the accuracy of BIR is slightly lower than that of SIR and LSIR for the first direction, but it achieves significantly higher accuracy in the second direction than all the other methods. Finally, we note that as the dimensionality increases, the performance of BIR does not decay evidently, suggesting that the method can handle rather high dimensional problems.
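The R² metric of [18] measures, for an estimated direction b̂, its squared multiple correlation with the true DR subspace. A sketch under the assumption that x is standardized (so R²(b̂) reduces to the squared norm of the projection of b̂ onto the true subspace):

```python
import numpy as np

def r2_accuracy(b_hat, B_true):
    """Squared multiple correlation of estimated direction(s) with the true
    DR subspace, assuming standardized predictors (Cov(x) = I).
    b_hat: (p, k) estimated directions; B_true: (p, K) true basis."""
    Q, _ = np.linalg.qr(B_true)                 # orthonormal basis of true subspace
    b = b_hat / np.linalg.norm(b_hat, axis=0)   # normalize each estimated direction
    proj = Q @ (Q.T @ b)                        # projection onto the true subspace
    return (proj ** 2).sum(axis=0)              # one R^2 value per column
```

For a single direction this returns ‖P_B b̂‖², which equals 1 exactly when b̂ lies in the true subspace; averaging over the estimated directions gives a subspace-level score.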
Fig. 2 The scatter plots of (x_1, x_2) for different values of b (b = 0, 5, 10, 15, 20).

4.2 Mathematical examples with non-Gaussian distributions

In our second example, we want to test the performance of the methods when the distribution of x is strongly non-Gaussian. We assume x is a 10-dimensional variable and the data are generated as follows. First let u = (u_1, u_2) follow a two-dimensional standard normal distribution. We then perform the transform
$$x_1 = u_1, \qquad x_2 = u_2 - b\,u_1^2, \tag{4.3}$$
where b ≥ 0.
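A sketch of this data-generating process as reconstructed above; the quadratic term in (4.3) is our reading of the garbled original, and the remaining eight coordinates are assumed independent standard normal.

```python
import numpy as np

def generate_nongaussian_x(n, b, rng):
    """Draw x via the transform (4.3): a 'banana'-shaped (x_1, x_2) pair
    plus 8 further standard normal coordinates (assumption)."""
    u = rng.standard_normal((n, 2))
    x = rng.standard_normal((n, 10))
    x[:, 0] = u[:, 0]
    x[:, 1] = u[:, 1] - b * u[:, 0] ** 2
    return x, u                      # y is generated from u via Eq. (4.4)
```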
Here, by varying the parameter b, one can control how different the distribution of x is from Gaussian. The data of y are generated from u, and so the transform used to generate x does not affect the data of y. In this example we use the following two functions to generate y:
$$y = u_1 u_2 + 1.5 + 0.1\,\epsilon, \tag{4.4a}$$
$$y = \sin(5\pi u_1) + u_2 + 0.1\,\epsilon, \tag{4.4b}$$
where ε ∼ N(0,1). We test five values of b: b = 0, 5, 10, 15,
20, with sample size n = 100. We show the scatter plots of the data points for all these cases in Fig. 2, where we can see that the resulting data points depart further from Gaussian as b increases. We plot the R² accuracy against the value of b in Fig. 3 for both functions. From the figures we can see that for function (4.4a), BIR clearly outperforms all the other methods for all values of b, and for function (4.4b), BIR also has the best performance in all the cases, with LDR being about the same at b = 10 and 20.

4.3 Mathematical example for BAVE

We now consider a mathematical example which requires the second conditional moments.
Fig. 3
The R² accuracy of the DR subspace plotted as a function of b, for function (4.4a) (left) and function (4.4b) (right).
Fig. 4
The R² accuracy of the DR direction computed with different sample sizes.

Let x be a 20-dimensional random variable following a standard normal distribution, and let
$$y = x_1^2 + 0.1\,\epsilon,$$
where the noise ε ∼ N(0,1). For this function E(x|y) = 0, which implies that the first-moment based approach, i.e., SIR, does not apply to this problem. We conduct numerical experiments with six different sample sizes: 30, 40, 60, 80, 100 and 120, and for each sample size we randomly generate 100 sets of data. With each set of data, we estimate the DR direction with SIR, LDR, SAVE and BAVE. The R² accuracy of the DR direction obtained by each method, averaged over the 100 trials, is shown in Fig. 4. As expected, SIR fails completely for this example: its resulting R² accuracy is near zero, regardless of the sample size. The results of LDR are better than SIR but the overall accuracy remains quite low (less than
0.4) even when the sample size reaches 120. On the other hand, the performance of SAVE and BAVE improves notably as the sample size increases, while for each sample size the results of BAVE are considerably better than those of SAVE, suggesting that BAVE is considerably more effective for this small dataset problem.
Table 1 The mean and the standard deviation (in parentheses) of the MRRE for Example 3. The best results are marked in bold.

Table 2 The minimal and maximal relative regression error (RRE) in the 100 trials with 20 data points for the death rate example. The best results are marked in bold.
4.4 Real data example: the death rate data

The example considered in this section is to use pollution and related factors to predict the death rate [24,1]. This is a regression problem with 15 predictors and 60 data points, and we choose this example to test how the methods perform with a very small number of data points. We first apply the DR methods to select one feature (we have conducted tests with 2 and 3 features, which do not improve the regression accuracy, and so we omit those results here) and then construct a standard linear regression model of the data in the reduced dimension. As a comparison, we also perform the regression directly without DR. To test the methods with different numbers of data points, we perform the experiments with 15, 20, 25, 30, 35, 40 data points randomly selected from the data set, and another randomly selected 20 data points used as the test set. In each experiment we compute the mean relative regression error (MRRE) using the data in the test set. Specifically, suppose {(x_i, y_i)}_{i=1}^{n_t} is the test set and f_r(·) is the regression model; the MRRE is computed as
$$\mathrm{MRRE} = \frac{1}{n_t}\sum_{i=1}^{n_t}\frac{|y_i - f_r(\mathbf{x}_i)|}{y_i}.$$
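The MRRE is straightforward to compute; a one-function sketch:

```python
import numpy as np

def mrre(y_true, y_pred):
    """Mean relative regression error over a test set."""
    return np.mean(np.abs(y_true - y_pred) / y_true)
```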
We repeat all the experiments 100 times, and compute the mean and the standard deviation of the obtained MRRE, which are shown in Table 1. First we observe that for n = 40 all the methods achieve rather good accuracy; as n decreases, the results of all the other methods become evidently worse, while those of BIR remain quite stable, suggesting that BIR is especially effective in the small data case. It should be noted that for n = 15, LDR and LSIR fail to produce reasonable results due to numerical instability, and so we omit those results. More importantly, it can be seen from the table that starting from n = 30, the regression without DR actually has the best performance, suggesting that implementing DR is only necessary when the number of data points is below 30. In all the cases where DR is genuinely needed, i.e., n ≤ 25, the BIR method performs significantly better than all other methods. To further analyze the performance, we also compute the minimal and the maximal relative regression errors (RRE) for the 20 data-point case, and present the results in Table 2. Once again, we can see that the BIR method has the best results in both the minimal and the maximal cases.
Table 3 The mean and the standard deviation (in parentheses) of the MRRE for Example 4. The best results are marked in bold.
Table 4 The minimal and maximal relative regression error (RRE) in the 100 trials with 20 data points for the automobile price example. The best results are marked in bold.

4.5 Real data example: the automobile price data

Our last example is the automobile price data set [9], where the task is to predict the price of an automobile from its specifications. We perform the experiments with n = 10, 20, ...,
100 randomly selected samples, and another 50 random samples used as the test set in all cases. We repeat each experiment 100 times, and compute the MRRE each time. The mean and the standard deviation of the MRRE results are reported in Table 3. From the data given in Table 3, we draw rather similar conclusions to those of Example 3; namely, the BIR method has the best MRRE of all the four methods used. In Table 4, we show the minimal and the maximal RRE for the 20 data-point case, and just like the results in Example 3, we find that the BIR method has the smallest RRE in both the minimal and the maximal cases.
5 Concluding remarks

We consider dimension reduction problems for regression, and we propose a Bayesian approach that computes the conditional distribution π(x|y) and uses it to perform the dimension reduction. The method constructs the likelihood function from the data with a GP regression model and uses MCMC to generate samples from the conditional distribution π(x|y). Numerical examples demonstrate that the proposed method is particularly effective for problems with very small data sets. We reiterate here that, due to the use of the GP model, BIR does not apply to problems of very high dimensions. Rather, we expect BIR to be useful for problems with moderately high dimensions and a very limited amount of data.

We believe the method can be useful in many real world applications. For example, in many high dimensional inverse problems and data assimilation problems, the data can only be informative on a small number of dimensions [7,28,35]. A method that utilizes the DR methods to identify such data-informed dimensions is currently under investigation. On the other hand, in certain problems gradient information is available, and DR methods which take advantage of the gradient information have also been developed, e.g. [12,2,14]. In this case, we expect that the gradient information can also be used to enhance the performance of the BIR method, via, for example, gradient-enhanced Kriging [26], and we plan to investigate this in the future.

References
1. Samprit Chatterjee and Ali S Hadi. Regression Analysis by Example. John Wiley & Sons, 2015.
2. Paul G Constantine. Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, volume 2. SIAM, 2015.
3. R Dennis Cook and Liliana Forzani. Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485):197–208, 2009.
4. R Dennis Cook, Liliana M Forzani, and Diego R Tomassi. LDR: a package for likelihood-based sufficient dimension reduction. Journal of Statistical Software, 39(i03), 2011.
5. R Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. Journal of the American Statistical Association, 100(470):410–428, 2005.
6. R Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
7. Tiangang Cui, James Martin, Youssef M Marzouk, Antti Solonen, and Alessio Spantini. Likelihood-informed dimension reduction for nonlinear inverse problems. Inverse Problems, 30(11):114015, 2014.
8. R Dennis Cook. SAVE: a method for dimension reduction and graphics in regression. Communications in Statistics - Theory and Methods, 29(9-10):2109–2121, 2000.
9. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
10. Louis Ferré. Determining the dimension in sliced inverse regression and related methods. Journal of the American Statistical Association, 93(441):132–140, 1998.
11. Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
12. Kenji Fukumizu and Chenlei Leng. Gradient-based kernel method for feature extraction and variable selection. In Advances in Neural Information Processing Systems, pages 2114–2122, 2012.
13. Minyoung Kim and Vladimir Pavlovic. Dimensionality reduction using covariance operator inverse regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
14. Rémi Lam, Olivier Zahm, Youssef Marzouk, and Karen Willcox. Multifidelity dimension reduction via active subspaces. arXiv preprint arXiv:1809.05567, 2018.
15. Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, 2018.
16. Bing Li, Yuexiao Dong, et al. Dimension reduction for nonelliptically distributed predictors. The Annals of Statistics, 37(3):1272–1298, 2009.
17. Bing Li and Shaoli Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008, 2007.
18. Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
19. Lexin Li and Xiangrong Yin. Sliced inverse regression with regularizations. Biometrics, 64(1):124–131, 2008.
20. Qian Lin, Zhigen Zhao, and Jun S Liu. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, pages 1–33, 2019.
21. Yanyuan Ma and Liping Zhu. A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497):168–179, 2012.
22. Yanyuan Ma and Liping Zhu. A review on dimension reduction. International Statistical Review, 81(1):134–150, 2013.
23. Kai Mao, Feng Liang, and Sayan Mukherjee. Supervised dimension reduction using Bayesian mixture modeling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 501–508, 2010.
24. Gary C McDonald and Richard C Schwing. Instabilities of regression estimates relating air pollution to mortality. Technometrics, 15(3):463–481, 1973.
25. Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.
26. Max D Morris, Toby J Mitchell, and Donald Ylvisaker. Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics, 35(3):243–255, 1993.
27. Brian J Reich, Howard D Bondell, and Lexin Li. Sufficient dimension reduction via Bayesian mixture modeling. Biometrics, 67(3):886–895, 2011.
28. Antti Solonen, Tiangang Cui, Janne Hakkarainen, and Youssef Marzouk. On dimension reduction in Gaussian filters. Inverse Problems, 32(4):045003, 2016.
29. Kean Ming Tan, Zhaoran Wang, Tong Zhang, Han Liu, and R Dennis Cook. A convex formulation for high-dimensional sparse sliced inverse regression. Biometrika, 105(4):769–782, 2018.
30. Surya T Tokdar, Yu M Zhu, Jayanta K Ghosh, et al. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis, 5(2):319–344, 2010.
31. Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.
32. Qiang Wu, Sayan Mukherjee, and Feng Liang. Localized sliced inverse regression. In Advances in Neural Information Processing Systems, pages 1785–1792, 2009.
33. Yingcun Xia, Howell Tong, Wai Keung Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002.
34. Xiangrong Yin and R Dennis Cook. Estimating central subspaces via inverse third moments. Biometrika, 90(1):113–125, 2003.
35. Olivier Zahm, Tiangang Cui, Kody Law, Alessio Spantini, and Youssef Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.
36. Li-Ping Zhu and Li-Xing Zhu. On kernel method for sliced average variance estimation. Journal of Multivariate Analysis, 98(5):970–991, 2007.