Efficient Bayesian reduced rank regression using Langevin Monte Carlo approach
The Tien Mai
Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Norway. Email: [email protected]
Abstract
The problem of Bayesian reduced rank regression is considered in this paper. We propose, for the first time, to use the Langevin Monte Carlo method for this problem. A spectral scaled Student prior distribution is used to exploit the underlying low-rank structure of the coefficient matrix. We show that our algorithms are significantly faster than the Gibbs sampler in high-dimensional settings. Simulation results show that our proposed algorithms for Bayesian reduced rank regression are comparable to the state-of-the-art method, where the rank is chosen by cross-validation.
1 Introduction

Reduced rank regression [Anderson et al., 1951, Izenman, 1975, Velu and Reinsel, 2013] is a widely used model in multivariate linear regression. In this model, a low-rank constraint is imposed on the coefficient matrix to promote estimation and prediction accuracy. This low-rank structure builds on the belief that the response variables are related to the predictors through only a few latent directions. The assumption is also useful for extending the model to high-dimensional settings [Bunea et al., 2011].

Various methods based on the Bayesian approach have been developed for reduced rank regression. A first study can be found in [Geweke, 1996] in Bayesian econometrics, based on a low-rank matrix factorization approach. Since then, numerous works have followed, for example [Kleibergen and Paap, 2002, Corander and Villani, 2004, Babacan et al., 2012, Schmidli, 2019, Alquier, 2013], and the approach has been developed further for low-rank matrix estimation and completion [Lim and Teh, 2007, Salakhutdinov and Mnih, 2008, Mai and Alquier, 2015]. Moreover, Bayesian reduced rank regression has been successfully applied in other fields such as genomics [Marttinen et al., 2014, Zhu et al., 2014]. More recently, several works have extended Bayesian methods to incorporate sparsity into reduced rank regression [Goh et al., 2017, Chakraborty et al., 2020, Yang et al., 2020].

It is noted, however, that most works in Bayesian reduced rank regression (BRRR) employ conjugate priors, for which the conditional posterior distributions can be derived explicitly, allowing the implementation of Gibbs sampling [Geweke, 1996, Salakhutdinov and Mnih, 2008]. The details of these priors are reviewed and discussed in [Alquier, 2013]. Nevertheless, these Gibbs sampling approaches need to compute a number of matrix inversions or singular value decompositions at each iteration, which is costly and can significantly slow down the algorithm for large data. Different attempts, based on optimization, have been made to address this issue, including variational Bayesian methods and the maximum a posteriori approach [Lim and Teh, 2007, Marttinen et al., 2014, Yang et al., 2018].

In this paper, we consider a different way of choosing the prior distribution on low-rank matrices, rather than relying on the traditional low-rank matrix factorization. More specifically, a scaled Student prior is used in our approach, for which the rank of the coefficient matrix does not need to be prespecified, as it does in the matrix factorization approach. This prior has been used successfully before in the related contexts of low-rank matrix completion [Yang et al., 2018] and image denoising [Dalalyan et al., 2020a].

We develop, for the first time, a Langevin Monte Carlo (MC) approach for Bayesian reduced rank regression. The Langevin MC method was introduced in physics, based on Langevin diffusions [Ermak, 1975], and became popular in statistics and machine learning following [Roberts et al., 1996]. Recent advances in the study of Langevin Monte Carlo have made it increasingly popular in practice [Dalalyan, 2017, Durmus et al., 2017, Dalalyan et al., 2020b] and a promising approach for Bayesian statistics [Durmus et al., 2019].

More specifically, in this work we first present a naive (unadjusted) Langevin MC algorithm and then propose a Metropolis-Hastings correction for it. Interestingly, our implementations require neither matrix inversion nor singular value decomposition, and thus our algorithms can deal with large data sets efficiently.
Numerical results from these two algorithms are comparable to the frequentist approach in which the rank is chosen by cross-validation. Moreover, we show that our proposed Langevin MC algorithms are significantly faster than the Gibbs sampler in high-dimensional settings; see Section 4.

The paper is structured as follows. In Section 2 we present the reduced rank regression model and define the prior distribution, together with some discussion of low-rank factorization priors. In Section 3, the implementations of the Langevin MC approach are given in detail. Numerical simulations and a real data application are presented in Section 4. Discussion and conclusions are given in Sections 5 and 6, respectively.
2 Model and prior

We observe two matrices $X$ and $Y$ with

$$Y = XB + E, \qquad (1)$$

where $Y \in \mathbb{R}^{n \times m}$ is the response matrix, $X \in \mathbb{R}^{n \times p}$ is the predictor matrix, $B \in \mathbb{R}^{p \times m}$ is the unknown regression coefficient matrix and $E$ is an $n \times m$ random noise matrix with $\mathbb{E}(E) = 0$. We assume that the entries $E_{i,j}$ of $E$ are i.i.d. Gaussian $\mathcal{N}(0, \sigma^2)$ and that $\sigma$ is given. In this case, the likelihood is given by

$$L(Y) \propto \exp\Big\{ -\frac{1}{2\sigma^2}\, \| Y - XB \|_F^2 \Big\},$$

where $\|\cdot\|_F$ denotes the Frobenius norm, $\|M\|_F^2 = \mathrm{Tr}(M^\top M)$. The objective is to estimate the parameter matrix $B$. In many applications, it makes sense to assume that the matrix $B$ has low rank, i.e. $\mathrm{rank}(B) \ll \min(p, m)$.

With a prior distribution $\pi(B)$, the posterior for Bayesian reduced rank regression is $L_n(B) \propto L(Y)\,\pi(B)$. Then, the Bayesian estimator (the posterior mean) is defined as

$$\hat{B} = \int B \, L_n(B) \, \mathrm{d}B. \qquad (2)$$

Note that here we focus on estimating $B$ and thus $\sigma$ is fixed. One can relax this assumption and place a prior distribution on $\sigma$.

Borrowing motivation from a low-rank prior explored in recent works [Yang et al., 2018, Dalalyan et al., 2020a], we consider the following prior,

$$\pi(B) \propto \det(\lambda I_p + BB^\top)^{-(p+m+2)/2}, \qquad (3)$$

where $\lambda > 0$ and $I_p$ is the $p \times p$ identity matrix.

To see that this prior has the potential to encourage low-rankness of $B$, one can check that

$$\pi(B) \propto \prod_{j=1}^m \big( \lambda + s_j(B)^2 \big)^{-(p+m+2)/2},$$

where $s_j(B)$ denotes the $j$-th largest singular value of $B$. It is well known that the log-sum function $\sum_{j=1}^m \log(\lambda + s_j(B)^2)$ encourages sparsity on $\{s_j(B)\}$, see [Candes et al., 2008, Yang et al., 2018]. Thus the resulting matrix $B$ has an approximately low-rank structure.

The following lemma explains why this prior is called a spectral scaled Student prior distribution.

Lemma 1 ([Dalalyan et al., 2020a]). If a random matrix $B$ has the density as in (3), then its random (column) vectors $B_i$ are all drawn from the $p$-variate scaled Student distribution with scale parameter $\sqrt{\lambda}$.
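To make this concrete, the following minimal R sketch evaluates the log-posterior, up to an additive constant, under the Gaussian likelihood above and the prior (3); the function name and the defaults (σ² = 1 and the paper's λ = 3) are our own choices for illustration.

```r
# Log-posterior (up to an additive constant) for BRRR with the spectral
# scaled Student prior (3): Gaussian log-likelihood plus the log-prior
# -((p + m + 2) / 2) * log det(lambda I_p + B B^T).
log_posterior <- function(B, X, Y, sigma2 = 1, lambda = 3) {
  p <- nrow(B); m <- ncol(B)
  loglik <- -sum((Y - X %*% B)^2) / (2 * sigma2)
  logdet <- as.numeric(determinant(lambda * diag(p) + B %*% t(B),
                                   logarithm = TRUE)$modulus)
  loglik - (p + m + 2) / 2 * logdet
}
```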
On low-rank factorization priors

The first idea of a low-rank prior was carried out in [Geweke, 1996]: express the matrix parameter $B$ as $B_{p \times m} = M_{p \times k} N^\top_{m \times k}$ with $k \le \min(p, m)$. The prior is then defined on $M$ and $N$ rather than on $B$, as

$$\pi(M, N) \propto \exp\big\{ -\tau \big( \|M\|_F^2 + \|N\|_F^2 \big) \big\}$$

for some $\tau > 0$.
This prior allows one to obtain explicit forms for the conditional posteriors, so that a Gibbs algorithm can be implemented to sample from the posterior, see [Geweke, 1996]. However, the downside of this approach is that the problem of choosing the reduced rank $k$ is not directly addressed. Thus, one has to perform model selection over all possible values of $k$, as done in [Kleibergen and Paap, 2002, Corander and Villani, 2004]. Recent approaches instead fix a large $k$, e.g. $k = \min(p, m)$, and place sparsity-promoting priors on the columns of $M$ and $N$ such that most columns are almost null, so that the resulting matrix $B = MN^\top$ is approximately low-rank. This direction was first proposed in [Babacan et al., 2012] in the context of matrix completion, but it can also be used in reduced rank regression. See [Alquier, 2013] for details and discussion of low-rank factorization priors.

With low-rank factorization priors, most authors simulate from the posterior using the Gibbs sampler, as the conditional posterior distributions can be derived explicitly, e.g. in [Geweke, 1996, Salakhutdinov and Mnih, 2008]. However, these Gibbs sampling algorithms update the factor matrices in a row-by-row fashion and involve a number of matrix inversions at each iteration. This is expensive and slows down the algorithm for large data sets.

3 Langevin Monte Carlo implementation

In this section, we propose to compute an approximation of the posterior with the spectral scaled Student prior by a suitable version of the Langevin Monte Carlo algorithm.
Let us recall that the log-posterior is of the form

$$\log L_n(B) = -\frac{1}{2\sigma^2} \|Y - XB\|_F^2 - \frac{p+m+2}{2} \log\det(\lambda I_p + BB^\top),$$

and consequently

$$\nabla \log L_n(B) = \frac{1}{\sigma^2} X^\top (Y - XB) - (p+m+2)(\lambda I_p + BB^\top)^{-1} B.$$

We use the constant step-size unadjusted Langevin MC (denoted by LMC) [Durmus et al., 2019]. It is defined by choosing an initial matrix $B_0$ and then using the recursion

$$B_{k+1} = B_k + h \nabla \log L_n(B_k) + \sqrt{2h}\, W_k, \qquad k = 0, 1, \ldots, \qquad (4)$$

where $h > 0$ is the step size and $W_0, W_1, \ldots$ are independent random matrices with i.i.d. standard Gaussian entries. The details are given in Algorithm 1.

Note that a direct application of the Langevin MC recursion (4) needs to compute a $p \times p$ matrix inversion at each iteration. This can slow down the algorithm significantly and may be expensive. However, one can easily verify that the matrix $M = (\lambda I_p + BB^\top)^{-1} B$ is the solution to the following convex optimization problem,

$$\min_M \big\{ \|I_m - B^\top M\|_F^2 + \lambda \|M\|_F^2 \big\}:$$

setting the gradient $-2B(I_m - B^\top M) + 2\lambda M$ to zero yields $(BB^\top + \lambda I_p) M = B$. The solution of this optimization problem can be obtained using the package 'glmnet' [Friedman et al., 2010] (with family option 'mgaussian'). This requires neither matrix inversion nor other costly operations. Note, however, that in this case we are using Langevin MC with approximate gradient evaluation; a theoretical assessment of this method can be found in [Dalalyan and Karagulyan, 2019].
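As an illustration, here is a minimal R sketch of this gradient. It performs one linear solve with a matrix right-hand side instead of the glmnet route described above; this is an alternative for exposition, not the paper's implementation.

```r
# Gradient of the log-posterior; the term (lambda I_p + B B^T)^{-1} B is
# obtained from one linear solve with a matrix right-hand side, so no
# explicit matrix inverse is ever formed.
grad_log_post <- function(B, X, Y, sigma2 = 1, lambda = 3) {
  p <- nrow(B); m <- ncol(B)
  M <- solve(lambda * diag(p) + B %*% t(B), B)
  t(X) %*% (Y - X %*% B) / sigma2 - (p + m + 2) * M
}
```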
Algorithm 1 (LMC for BRRR)
    Input: matrices $Y \in \mathbb{R}^{n \times m}$, $X \in \mathbb{R}^{n \times p}$
    Parameters: positive real numbers $\lambda, h, T$
    Output: the matrix $\hat{B}$
    Initialize: $B_0 \leftarrow (X^\top X + 0.1\, I_p)^{-1} X^\top Y$; $\hat{B} = 0_{p \times m}$
    for $k \leftarrow 1$ to $T$ do
        simulate $B_k$ from (4);
        $\hat{B} \leftarrow \hat{B} + B_k / T$
    end for

Remark 1. Algorithm 1 may look like an iterative gradient method for the penalized Gaussian log-likelihood. However, Algorithm 1 computes the posterior mean and not a maximum a posteriori estimator. More precisely, our algorithm includes a final step of averaging.
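A compact R sketch of Algorithm 1, reusing grad_log_post() from above, is given below. With burnin = 0 it averages all iterates, as in Algorithm 1; setting burnin = 100 mirrors the setting of Section 4. The defaults are illustrative assumptions.

```r
# Sketch of Algorithm 1 (unadjusted LMC for BRRR).
# T: number of iterations (the paper's notation); h: step size.
lmc_brrr <- function(X, Y, lambda = 3, h = 1e-4, T = 200, burnin = 0,
                     sigma2 = 1) {
  p <- ncol(X); m <- ncol(Y)
  B    <- solve(t(X) %*% X + 0.1 * diag(p), t(X) %*% Y)  # ridge-type init
  Bhat <- matrix(0, p, m)
  for (k in 1:T) {
    W <- matrix(rnorm(p * m), p, m)                      # i.i.d. N(0,1) entries
    B <- B + h * grad_log_post(B, X, Y, sigma2, lambda) + sqrt(2 * h) * W
    if (k > burnin) Bhat <- Bhat + B / (T - burnin)      # running posterior mean
  }
  Bhat
}
```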
Remark 2. For small values of $h$, the output $\hat{B}$ is very close to the integral (2) of interest. However, for an $h$ that is not small enough, the Markov process is transient and thus the sum explodes [Roberts and Stramer, 2002]. To address this problem, one has to take a smaller $h$ and restart the algorithm, or a Metropolis-Hastings correction can be included in the algorithm. The Metropolis-Hastings approach ensures convergence to the desired distribution; however, it slows down the algorithm because of an additional acceptance/rejection step at each iteration. Taking a smaller $h$ also slows down the algorithm, but we keep some control on its execution time.
Remark 3. Based on our observations from the numerical studies in Section 4, the initial matrix $B_0$ can also affect the convergence of the algorithm. We suggest using $B_0 = (X^\top X + 0.1\, I_p)^{-1} X^\top Y$ as a default.

Metropolis-adjusted Langevin algorithm

Here, we consider a Metropolis-Hastings correction to Algorithm 1. This approach guarantees convergence to the posterior. More precisely, we consider the update rule in (4) as a proposal for a new state,

$$\tilde{B}_{k+1} = B_k + h \nabla \log L_n(B_k) + \sqrt{2h}\, W_k, \qquad k = 0, 1, \ldots. \qquad (5)$$

Note that $\tilde{B}_{k+1}$ is normally distributed with mean $B_k + h \nabla \log L_n(B_k)$ and covariance matrix $2h\, I_p$. This proposal is accepted or rejected according to the Metropolis-Hastings rule; that is, the proposal is accepted with probability

$$\min\left\{ 1, \ \frac{L_n(\tilde{B}_{k+1})\, q(B_k \mid \tilde{B}_{k+1})}{L_n(B_k)\, q(\tilde{B}_{k+1} \mid B_k)} \right\},$$

where

$$q(x' \mid x) \propto \exp\left( -\frac{1}{4h} \big\| x' - x - h \nabla \log L_n(x) \big\|_F^2 \right)$$

is the transition probability density from $x$ to $x'$. The details of the Metropolis-adjusted Langevin algorithm (denoted by MALA) for BRRR are given in Algorithm 2. Compared to random-walk Metropolis-Hastings, MALA has the advantage that it usually proposes moves into regions of higher probability, which are then more likely to be accepted.
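The following is a minimal R sketch of one such accept/reject transition, reusing log_posterior() and grad_log_post() from the sketches above; the function name mala_step is ours.

```r
# One MALA transition (the accept/reject step of Algorithm 2).
mala_step <- function(B, X, Y, h, sigma2 = 1, lambda = 3) {
  p <- nrow(B); m <- ncol(B)
  # proposal (5): Gaussian with mean B + h * grad and covariance 2h I
  Btil <- B + h * grad_log_post(B, X, Y, sigma2, lambda) +
          sqrt(2 * h) * matrix(rnorm(p * m), p, m)
  # log q(x' | x) up to a constant: -||x' - x - h grad log L_n(x)||_F^2 / (4h)
  logq <- function(xp, x)
    -sum((xp - x - h * grad_log_post(x, X, Y, sigma2, lambda))^2) / (4 * h)
  loga <- log_posterior(Btil, X, Y, sigma2, lambda) + logq(B, Btil) -
          log_posterior(B,    X, Y, sigma2, lambda) - logq(Btil, B)
  if (log(runif(1)) <= loga) Btil else B   # Metropolis-Hastings accept/reject
}
```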
Remark 4. Following [Roberts and Rosenthal, 1998], the step size $h$ is tuned such that the acceptance rate is approximately $0.5$. See Section 4 for the choices made in special cases in our simulations.
Algorithm 2 (MALA for BRRR)
    Input: matrices $Y \in \mathbb{R}^{n \times m}$, $X \in \mathbb{R}^{n \times p}$
    Parameters: positive real numbers $\lambda, h, T$
    Output: the matrix $\hat{B}$
    Initialize: $B_0 \leftarrow (X^\top X + 0.1\, I_p)^{-1} X^\top Y$; $\hat{B} = 0_{p \times m}$
    for $k = 1$ to $T$ do
        simulate $\tilde{B}_k$ from (5)
        calculate $\alpha = \min\big\{ 1, \, [L_n(\tilde{B}_k)\, q(B_{k-1} \mid \tilde{B}_k)] \, / \, [L_n(B_{k-1})\, q(\tilde{B}_k \mid B_{k-1})] \big\}$
        sample $u \sim \mathcal{U}_{[0,1]}$
        if $u \le \alpha$ then $B_k = \tilde{B}_k$ else $B_k = B_{k-1}$ end if
        $\hat{B} \leftarrow \hat{B} + B_k / T$
    end for
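As a rough illustration of the tuning rule in Remark 4, the sketch below (our own helper tune_h, with arbitrarily chosen pilot length and thresholds) halves or doubles h until a short pilot run of mala_step() accepts roughly half of its proposals.

```r
# Heuristic step-size tuning: adjust h until the empirical MALA
# acceptance rate over a short pilot chain is near 0.5.
tune_h <- function(X, Y, h0, n_pilot = 50, lambda = 3) {
  h <- h0
  for (round in 1:20) {                      # cap the number of tuning rounds
    B   <- solve(t(X) %*% X + 0.1 * diag(ncol(X)), t(X) %*% Y)
    acc <- 0
    for (k in 1:n_pilot) {
      Bnew <- mala_step(B, X, Y, h, lambda = lambda)
      acc  <- acc + !identical(Bnew, B)      # count accepted proposals
      B    <- Bnew
    }
    rate <- acc / n_pilot
    if (rate < 0.3) h <- h / 2               # too many rejections: shrink h
    else if (rate > 0.7) h <- h * 2          # too many acceptances: grow h
    else break
  }
  h
}
```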
4 Numerical studies

Simulation setups

First, we perform numerical studies on simulated data to assess the performance of our proposed algorithms. We consider the following model setups:

• Model I: A low-dimensional setup with $n = 100$, $p = 12$, $m = 8$ and true rank $r = \mathrm{rank}(B) = 3$. The rows of the design matrix $X$ are generated from $\mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ has diagonal entries 1 and off-diagonal entries $\rho_X \ge 0$. We consider $\rho_X = 0$ and $\rho_X = 0.5$, which creates a wide range of correlation in the predictors. The true coefficient matrix is generated as $B = B_1 B_2^\top$, where $B_1 \in \mathbb{R}^{p \times r}$, $B_2 \in \mathbb{R}^{m \times r}$, and all entries of $B_1$ and $B_2$ are randomly sampled from $\mathcal{N}(0, 1)$.

• Model II: This model is similar to Model I, but a high-dimensional setup is considered, with $n = 100$, $p = 150$, $m = 90$.

• Model III: An approximately low-rank setup. This series of simulations is similar to Model II, except that the true coefficient matrix is no longer of rank 3, but can be well approximated by a rank-3 matrix: $B = 2 \cdot B_1 B_2^\top + E$, where $E$ is a matrix with small i.i.d. Gaussian entries.

We compare LMC and MALA with the frequentist reduced rank regression (RRR), for which the rank is chosen by cross-validation, available from the R package 'rrpack' (https://cran.r-project.org/package=rrpack). We also compare LMC and MALA to the Gibbs sampler from [Alquier, 2013]; however, we are only able to perform this comparison in Model I, as the Gibbs sampler is too slow in large dimensions, see Figure 1. The R codes for the Gibbs sampler were kindly provided by the author of [Alquier, 2013].

The evaluations are done using the estimation error (Est) and the normalized mean square error (Nmse),

$$\mathrm{Est} := \|B - \hat{B}\|_F^2 / (pm), \qquad \mathrm{Nmse} := \|B - \hat{B}\|_F^2 / \|B\|_F^2,$$

calculated as averages of the mean squared errors over all 100 runs. We also evaluate the average over 100 runs of the prediction error (Pred),

$$\mathrm{Pred} := \|Y_{\mathrm{test}} - X_{\mathrm{test}} \hat{B}\|_F^2 / (nm),$$

where $X_{\mathrm{test}}$ is a newly generated $n \times p$ test-sample matrix of predictors and $Y_{\mathrm{test}}$ is a newly generated $n \times m$ test-sample matrix of responses. We also report the average of the estimated rank (denoted Rank) for the different methods over all runs.

The step-size parameters are set as follows: for Model III we take $h = 3/\sqrt{mnp}$, for Model II $h = 5/(mnp)$, and for Model I $h = 2/(pm\sqrt{n})$. These choices are made so that the acceptance rate of MALA is approximately 0.5. We fix $\lambda = 3$ in all models. LMC, MALA and the Gibbs sampler are run for $T = 200$ iterations and we take the first 100 steps as burn-in.
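For concreteness, the following R sketch generates one replicate of Model I with ρ_X = 0.5 and computes the errors defined above, reusing lmc_brrr() from the earlier sketch; the seed is arbitrary.

```r
# One replicate of Model I and the three error measures of Section 4.
set.seed(1)
n <- 100; p <- 12; m <- 8; r <- 3; rho <- 0.5
Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
X  <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows ~ N(0, Sigma)
B1 <- matrix(rnorm(p * r), p, r); B2 <- matrix(rnorm(m * r), m, r)
B  <- B1 %*% t(B2)                                # true rank-r coefficient
Y  <- X %*% B + matrix(rnorm(n * m), n, m)

Bhat <- lmc_brrr(X, Y, lambda = 3, h = 2 / (p * m * sqrt(n)))
Est  <- sum((B - Bhat)^2) / (p * m)               # estimation error
Nmse <- sum((B - Bhat)^2) / sum(B^2)              # normalized MSE

Xtest <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
Ytest <- Xtest %*% B + matrix(rnorm(n * m), n, m)
Pred  <- sum((Ytest - Xtest %*% Bhat)^2) / (n * m)  # prediction error
```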
Simulation results

In the low-dimensional setting where $pm < n$, as in Model I, the Langevin MC algorithms (LMC, MALA) are able to recover the true rank of the model, see Table 1. The results of MALA are slightly better than those of LMC. The prediction errors of LMC and MALA are comparable to RRR and the Gibbs sampler. In terms of the other errors (Est and Nmse), LMC and MALA are about twice as large as RRR and the Gibbs sampler.

However, it is worth noting that the running time of our algorithms is linear in $p$ for fixed $n$ and $m$, while that of the Gibbs sampler is not. More specifically, we conducted a comparison of the running times of these four algorithms with the dimension $p$ varied over $10, 50, 100, 150$ and fixed $n = 100$, $m = 90$. The results are given in Figure 1. It is clear that the Gibbs sampler is several orders of magnitude slower than our algorithms.

Results from the high-dimensional settings, as in Models II and III, reveal that our algorithms perform quite similarly to the RRR method in terms of all considered errors, see Table 2. Moreover, it is interesting that LMC and MALA are slightly better than RRR in Model III, where the coefficient matrix is only approximately low-rank. More specifically, MALA produces promising results that are slightly better than both LMC and RRR.
Table 1: Simulation results on simulated data in Model I for different methods, with standard errors in parentheses. (Est: average estimation error; Pred: average prediction error; Nmse: normalized mean square error; Rank: average estimated rank.)

ρ_X = 0:
               LMC          MALA         Gibbs        RRR
  Est          1.25 (0.21)  1.26 (0.21)  0.58 (0.14)  0.59 (0.13)
  Pred         1.15 (0.06)  1.15 (0.06)  1.07 (0.05)  1.07 (0.05)
  Nmse (×10)   4.66 (2.21)  4.80 (2.46)  2.16 (1.11)  2.18 (1.06)
  Rank         3 (0.0)      3 (0.0)      3 (0.0)      3 (0.0)

ρ_X = 0.5:
               LMC          MALA         Gibbs        RRR
  Est          2.52 (0.41)  2.39 (0.56)  1.03 (0.24)  1.12 (0.02)
  Pred         1.17 (0.07)  1.16 (0.07)  1.08 (0.06)  1.08 (0.07)
  Nmse (×10)   0.94 (0.41)  0.96 (0.55)  0.43 (0.31)  0.46 (0.29)
  Rank         3 (0.0)      3 (0.0)      3 (0.0)      3 (0.0)

Table 2: Simulation results on simulated data in Models II & III for different methods, with standard errors in parentheses. (Est: average estimation error; Pred: average prediction error; Nmse: normalized mean square error; Rank: average estimated rank.)

Model            Error   LMC          MALA         RRR
II,  ρ_X = 0     Est     1.00 (0.13)  1.00 (0.13)  0.99 (0.13)
                 Pred    1.51 (0.23)  1.51 (0.23)  1.49 (0.23)
                 Nmse    0.34 (0.03)  0.34 (0.03)  0.33 (0.03)
                 Rank    3 (0.0)      3 (0.0)      3 (0.0)
II,  ρ_X = 0.5   Est     1.01 (0.14)  1.01 (0.14)  0.99 (0.14)
                 Pred    7.65 (1.20)  7.65 (1.20)  7.45 (1.20)
                 Nmse    0.34 (0.03)  0.34 (0.03)  0.33 (0.03)
                 Rank    3 (0.0)      3 (0.0)      3 (0.0)
III, ρ_X = 0     Est     4.32 (0.64)  4.32 (0.64)  4.34 (0.64)
                 Pred    6.53 (1.05)  6.53 (1.05)  6.56 (1.05)
                 Nmse    0.33 (0.03)  0.33 (0.03)  0.33 (0.03)
                 Rank    3.04 (0.20)  3.04 (0.20)  3.04 (0.20)
III, ρ_X = 0.5   Est     4.41 (0.58)  4.40 (0.57)  4.43 (0.57)
                 Pred    3.31 (0.47)  3.30 (0.47)  3.32 (0.47)
                 Nmse    0.34 (0.03)  0.34 (0.03)  0.34 (0.03)
                 Rank    3.09 (0.35)  3.08 (0.34)  3.08 (0.34)

Figure 1: Comparison of the running times (in log scale) for 10 iterations of LMC, MALA and the Gibbs sampler, and for 10-fold cross-validation RRR, with fixed $n = 100$, $m = 90$, $r = 2$ and varying dimension $p \in \{10, 50, 100, 150\}$.

Real data application

We apply our algorithms to a breast cancer dataset [Witten et al., 2009] to assess their performance on real data. The data consist of gene expression measurements and comparative genomic hybridization measurements for $n = 89$ samples. The dataset is available from the R package 'PMA' [Witten et al., 2009]. These data were used before in [Chen et al., 2013] in the context of reduced rank regression.

Following [Chen et al., 2013], we consider the gene expression profiles of a chromosome as predictors and the copy-number variations of the same chromosome as responses. The analysis focuses on chromosome 21, for which $m = 44$ and $p = 227$. The data are randomly divided into a training set of size $n_{\mathrm{train}} = 79$ and a test set of size $n_{\mathrm{test}} = 10$. The model is estimated using the training data. Then the predictive performance is calculated on the test data by the mean squared prediction error $\|Y_{\mathrm{test}} - X_{\mathrm{test}} \hat{B}\|_F^2 / (m\, n_{\mathrm{test}})$, where $(Y_{\mathrm{test}}, X_{\mathrm{test}})$ denotes the test set. We repeat the random training/test splitting 100 times and report the average mean squared prediction error and the average rank estimate for each method. The results are given in Table 3; we can see that MALA is better than LMC.
Table 3: Comparison of the model fits to the real data. Mean squared prediction errors (MSPE) and estimated ranks are reported, with standard errors in parentheses.

         LMC           MALA          RRR
  MSPE   0.052 (.009)  0.049 (.008)  0.030 (.008)
  Rank   1.03 (.17)    1.03 (.17)    0.74 (.44)
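A sketch of this repeated-split protocol is given below. It assumes the chromosome-21 matrices X and Y have already been extracted (e.g. from the breast cancer data accompanying the 'PMA' package), reuses lmc_brrr() from the earlier sketch, and uses an illustrative step size rather than a tuned one.

```r
# Repeated random-split evaluation on the chromosome-21 data; X is the
# n x 227 gene-expression matrix and Y the n x 44 copy-number matrix.
n <- nrow(X); n_train <- 79
mspe <- replicate(100, {
  idx  <- sample(n, n_train)                    # random training set
  Bhat <- lmc_brrr(X[idx, ], Y[idx, ], lambda = 3, h = 1e-4)
  sum((Y[-idx, ] - X[-idx, ] %*% Bhat)^2) / (ncol(Y) * (n - n_train))
})
c(mean = mean(mspe), sd = sd(mspe))
```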
5 Discussion

It is noted that our Langevin MC approaches for BRRR use a different prior on (approximately) low-rank matrices than the matrix factorization priors underlying the Gibbs sampler. There are several other ways to define such priors on the whole matrix, see [Sundin, 2016]. For example, one could consider, with $\lambda > 0$,

$$\pi(B) \propto \exp\big( -\lambda\, \mathrm{Tr}\big[ (BB^\top)^{1/2} \big] \big),$$

whose log-prior is the nuclear norm, which also promotes low-rank structure on $B$. The application of the Langevin MC method to such priors would be an interesting direction for future research.

A vital part of the Langevin MC approach is the choice of the step size $h$. In this work, $h$ is chosen such that the acceptance rate of MALA is around 0.5, motivated by [Roberts and Rosenthal, 1998]. We have also tried a decreasing step size, $h_t = h_{t-1}/t$; however, this choice does not improve the results compared with the choice defined through the acceptance rate. It is noted that there are several other ways of choosing $h$; for example, $h$ can be adaptively changed at each iteration as in [Marshall and Roberts, 2012]. The study of such approaches for BRRR is left for future research.

Bayesian studies that incorporate sparsity into the RRR model, to account for both rank selection and variable selection, have recently been carried out in [Goh et al., 2017, Yang et al., 2020]. However, these works are still based on low-rank factorization priors and the implementation of the Gibbs sampler. Thus, the application of Langevin MC to this problem would be another interesting future work.

6 Conclusion
In this paper, we have proposed efficient Langevin MC approaches for BRRR. The performance of our algorithms is similar to that of the state-of-the-art method in simulations. More importantly, we showed that the proposed algorithms are significantly faster than the Gibbs sampler. This makes BRRR more applicable to large data sets.
Availability of data and materials
The R codes and data used in the numerical experiments are available at https://github.com/tienmt/BRRR.

Acknowledgments
The research of T.T.M. was supported by the European Research Council, grant no. 742158.
References
Alquier, P. (2013). Bayesian methods for low-rank matrix estimation: short survey and theoretical study. In International Conference on Algorithmic Learning Theory, pages 309-323. Springer.

Anderson, T. W. et al. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22(3):327-351.

Babacan, S. D., Luessi, M., Molina, R., and Katsaggelos, A. K. (2012). Sparse Bayesian methods for low-rank matrix estimation. IEEE Transactions on Signal Processing, 60(8):3964-3977.

Bunea, F., She, Y., Wegkamp, M. H., et al. (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, 39(2):1282-1309.

Candes, E. J., Wakin, M. B., and Boyd, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877-905.

Chakraborty, A., Bhattacharya, A., and Mallick, B. K. (2020). Bayesian sparse multiple regression for simultaneous rank reduction and variable selection. Biometrika, 107(1):205-221.

Chen, K., Dong, H., and Chan, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901-920.

Corander, J. and Villani, M. (2004). Bayesian assessment of dimensionality in reduced rank regression. Statistica Neerlandica, 58(3):255-270.

Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651-676.

Dalalyan, A. S. et al. (2020a). Exponential weights in multivariate regression and a low-rankness favoring prior. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 1465-1483. Institut Henri Poincaré.

Dalalyan, A. S. and Karagulyan, A. (2019). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278-5311.

Dalalyan, A. S., Riou-Durand, L., et al. (2020b). On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956-1988.

Durmus, A., Moulines, E., et al. (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551-1587.

Durmus, A., Moulines, E., et al. (2019). High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854-2882.

Ermak, D. L. (1975). A computer simulation of charged particles in solution. I. Technique and equilibrium properties. The Journal of Chemical Physics, 62(10):4189-4196.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22.

Geweke, J. (1996). Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1):121-146.

Goh, G., Dey, D. K., and Chen, K. (2017). Bayesian sparse reduced rank multivariate regression. Journal of Multivariate Analysis, 157:14-28.

Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248-264.

Kleibergen, F. and Paap, R. (2002). Priors, posteriors and Bayes factors for a Bayesian analysis of cointegration. Journal of Econometrics, 111(2):223-249.

Lim, Y. J. and Teh, Y. W. (2007). Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, volume 7, pages 15-21. Citeseer.

Mai, T. T. and Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electronic Journal of Statistics, 9(1):823-841.

Marshall, T. and Roberts, G. (2012). An adaptive approach to Langevin MCMC. Statistics and Computing, 22(5):1041-1057.

Marttinen, P., Pirinen, M., Sarin, A.-P., Gillberg, J., Kettunen, J., Surakka, I., Kangas, A. J., Soininen, P., O'Reilly, P., Kaakinen, M., et al. (2014). Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 30(14):2026-2034.

Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255-268.

Roberts, G. O. and Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337-357.

Roberts, G. O., Tweedie, R. L., et al. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341-363.

Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887.

Schmidli, H. (2019). Bayesian reduced rank regression for classification. In Applications in Statistical Computing, pages 19-30. Springer.

Sundin, M. (2016). Bayesian methods for sparse and low-rank matrix problems. PhD thesis, KTH Royal Institute of Technology.

Velu, R. and Reinsel, G. C. (2013). Multivariate Reduced-Rank Regression: Theory and Applications, volume 136. Springer Science & Business Media.

Witten, D. M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534.

Yang, D., Goh, G., and Wang, H. (2020). A fully Bayesian approach to sparse reduced-rank multivariate regression. Statistical Modelling, page 1471082X20948697.

Yang, L., Fang, J., Duan, H., Li, H., and Zeng, B. (2018). Fast low-rank Bayesian matrix completion with hierarchical Gaussian prior models. IEEE Transactions on Signal Processing, 66(11):2804-2817.
Zhu, H., Khondker, Z., Lu, Z., and Ibrahim, J. G. (2014). Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. Journal of the American Statistical Association, 109(507):977-990.