Multilevel Gibbs Sampling for Bayesian Regression
Joris Tavernier, Jaak Simm, Adam Arany, Karl Meerbergen, Yves Moreau
Abstract.
Bayesian regression remains a simple but effective tool based on Bayesian inference techniques. For large-scale applications with complicated posterior distributions, Markov Chain Monte Carlo methods are applied. To reduce the well-known computational burden of Markov Chain Monte Carlo for Bayesian regression, we developed a multilevel Gibbs sampler for Bayesian regression of linear mixed models. The level hierarchy of data matrices is created by clustering the features and/or samples of the data matrices. Additionally, the use of correlated samples is investigated for variance reduction to improve the convergence of the Markov chain. Testing on a diverse set of data sets, speed-up is achieved for almost all of them without significant loss in predictive performance.
Key words.
Markov Chain Monte Carlo, Bayesian regression, linear mixed models, multilevel methods
AMS subject classifications.
1. Introduction.
Linear regression remains a simple yet widely used technique in statistics and machine learning. Although more advanced techniques exist, linear models have the advantage that they offer a simple explanation of how the input variables influence the output variable(s). The simplest linear model is a linear combination of a weight vector b ∈ R^{F×1} and a data matrix X ∈ R^{N×F},

y = X b,

with y ∈ R^{N×1} the target values, N the number of observations and F the number of features. Fitting such a model can be seen as statistical inference, and the most common approaches are Bayesian inference and frequentist inference [2, 11, 15].

Bayesian inference draws conclusions about the parameters or observations in terms of probability distributions. One advantage of the Bayesian approach is that it allows uncertainty to be expressed before (prior) and after (posterior) the observations X and y have been incorporated. Using prior probability distributions allows the inclusion of information about the model parameters. Another advantage of the Bayesian approach is the ability to make probabilistic predictions for new data using the posterior of the parameter b.

For high-dimensional spaces, using exact analytical techniques is not feasible or even impossible, and the use of Monte Carlo (MC) methods is widespread [2, 11, 29, 33]. MC still requires a distribution to draw samples from, and Markov Chain Monte Carlo (MCMC) is a general method to sample from arbitrary distributions. MCMC algorithms have given researchers the opportunity to investigate Bayesian inference for complicated statistical models. The most common MCMC methods are Metropolis-Hastings [16, 26] and Gibbs sampling [10, 12], with Gibbs sampling being a special case of Metropolis-Hastings.

Bayesian inference has been used for applications in genetics [6, 33], chemogenomics [31] and cell imaging [32]. The data matrices in these fields are often large-scale and sparse, and high performance computing (HPC) techniques are required for these data sets. More specifically, research has focused on speeding up Metropolis-Hastings, since not all the generated samples are accepted in the Markov chain. In Metropolis-Hastings, computing the next sample in the chain is often computationally intensive and this sample is not always accepted in the Markov chain. Several techniques have been developed in different fields to improve the acceptance rate of Metropolis-Hastings, see for example [4, 9, 20, 36]. In contrast, Gibbs sampling does not have an acceptance step, and Simm et al. [31] describe a Gibbs sampler based on solving a linear system using Krylov subspace methods, thus preserving the data sparsity essential for HPC.

We draw inspiration from HPC techniques that have been developed for large-scale partial differential equations (PDEs) using multiple levels. The finite difference method approximates the solution of the PDE using a spatial discretization step, resulting in a linear system. By varying this discretization step, several levels of different accuracies are created, as shown in Figure 1.1a.
More specifically, discretizing the spatial variables results in a grid of nodes and the solution of the PDE is then approximated in these nodes. A fine discretization leads to many nodes, while a coarse discretization results in fewer grid points. Both the fine and the coarse grid provide approximate solutions of the PDE, where the fine grid gives a better approximation to the exact solution but is more expensive to compute due to the larger number of grid points. By varying the discretization, a hierarchy of grids is created with an increasing number of grid points. This approach is called Multigrid and exploits this hierarchy to create an efficient iterative solver or preconditioner for the underlying linear system of the PDE [3, 37].

While Multigrid exploits the hierarchy of grids when solving the resulting linear system, multilevel Monte Carlo (MLMC) for PDEs with random coefficients exploits the same hierarchy of grids in the sampling process. As detailed in [5], most of the sampling variance is present in the coarsest grid and MLMC will take most of the samples on the coarsest and thus cheapest levels. MLMC has also been successfully applied to stochastic differential equations [13]. More recently, Robbe et al. [28] have combined Multigrid with MLMC for PDEs with random coefficients. The success of MLMC for classical MC methods has led to increased interest in multilevel Metropolis-Hastings algorithms [8].

We will design a multilevel version of the Gibbs sampler for Bayesian regression by Simm et al. [31]. In contrast with classical MLMC, which varies the discretization step, a hierarchy of data matrices is created using clustering algorithms for the rows and/or columns of the data matrix X. Figure 1.1 shows an example for dense matrices with clustering for both rows and columns. Samples from these levels are then combined in one Markov chain. Samples taken on the coarser levels are cheaper to compute, resulting in overall speed-up.

The outline of the paper is as follows. First, we provide further background on MCMC and MLMC in Section 2. Next, the Gibbs sampler for Bayesian regression [31] is given and extended to linear mixed models in Section 3. This is followed by our multilevel Gibbs sampler in Section 4 and our multilevel Monte Carlo Gibbs sampler in Section 5. Finally, results are given for a variety of data sets in Section 6 and a conclusion is given in Section 7.
2. Preliminaries.
Linear models are defined by

y = X b + e,    (2.1)

with y ∈ R^{N×1} the N sample observations, X ∈ R^{N×F} the collected feature matrix, F the number of features and e ∈ R^{N×1} the unknown residuals. Linear regression aims to find a distribution or point estimate for the weight vector b ∈ R^{F×1}.

Fig. 1.1: Examples of the level hierarchies in PDEs and data. Figure 1.1a shows the hierarchy of grids for the discretization of a PDE. Figure 1.1b shows the hierarchy of levels for X^T X using a data matrix X and clustering both the rows and columns.

Bayesian inference for linear models estimates the parameter b using prior information and Bayes' rule. Bayes' rule is defined by

p(b | X) = p(X | b) p(b) / p(X),

with p(b | X) the posterior, p(X | b) the likelihood, p(b) the prior and p(X) the marginal likelihood. Using the posterior, probabilistic predictions can be made for new data X_new with the predictive distribution

p(X_new | X) = ∫ p(X_new | b) p(b | X) db.

Deriving the exact posterior distribution is not always possible. However, the quantity of interest is often the expected value of the posterior p(b | X) or, more generally, of a function f(b) under a target distribution p(b), with b a discrete or continuous variable. The expected value is defined by

F = E[f(b)] = ∫ f(b) p(b) db.

The Monte Carlo method (MC) approximates this expected value by drawing H random and uncorrelated samples b^(h) from p(b) and then takes the average

E[f(b)] ≈ F^MC_H    (2.2)
        = (1/H) Σ_{h=1}^{H} f(b^(h)),    (2.3)

with F^MC_H the MC estimate for the expected value using H samples.

Equation (2.3) estimates the expected value, but still requires sampling from the distribution p(b). Sampling from a specific distribution p(b) can be severely hard, particularly in high-dimensional spaces, and p(b) might even be unavailable. Markov Chain Monte Carlo (MCMC) is widely used to sample from these distributions and is a general method for drawing samples from arbitrary posterior distributions [2, 11]. The idea is to sample from proposal distributions, where each sample is iteratively drawn from a distribution that approximates the target distribution better, with each sample used to improve and correct the proposal distribution. These samples are drawn sequentially, b^(1), b^(2), b^(3), ..., resulting in the Markov chain. Importantly, each sample is thus drawn from a proposal distribution that increasingly better approximates the desired target distribution. A Markov chain generally has a burn-in period where the first H_burn-in samples are discarded, as these are not yet drawn from a distribution that is close enough to the target distribution.

In order to create a Markov chain, a new sample b^(h) is proposed using a transition distribution T_h(b^(h), b^(h−1)), with b^(h−1) the previous sample. In a Markov chain, a distribution is called stationary or invariant if each new sample leaves that distribution unchanged. The target distribution p*(b) should be invariant over the Markov chain, and a sufficient condition is detailed balance, defined as

p*(b) T(b, b') = p*(b') T(b', b),

with b and b' two samples.
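To make the estimator in (2.3) concrete, the following minimal Python sketch (illustrative only, not part of the paper) approximates E[f(b)] for a toy target distribution by averaging independent draws; the target N(1, 0.5^2) and the function f(b) = b^2 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(b):
    # quantity of interest, here simply f(b) = b^2
    return b ** 2

# draw H independent samples from a toy target p(b) = N(1, 0.5^2)
H = 10_000
b_samples = rng.normal(loc=1.0, scale=0.5, size=H)

# plain Monte Carlo estimate (2.3): average of f over the samples
F_mc = np.mean(f(b_samples))
print(F_mc)  # close to E[b^2] = 1^2 + 0.5^2 = 1.25
```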
The final property to be mentioned is ergodicity, which says that for any starting distribution, the target distribution is invariant and for h → ∞ the distribution p(b^(h)) converges to the target distribution p*(b). The invariant distribution is then called the equilibrium distribution.

The Metropolis-Hastings algorithm is widely used to generate samples in a Markov chain. For each sample b^(h), a follow-up sample b' is proposed from a jumping distribution J_h(b', b^(h)). The proposed sample is then accepted with probability

A(b', b^(h)) = min( 1, [ p(b') J_h(b^(h), b') ] / [ p(b^(h)) J_h(b', b^(h)) ] ).

If the sample is accepted, then b^(h+1) = b'; else b^(h+1) = b^(h). Gibbs sampling can be considered as a special case of the more general Metropolis-Hastings algorithm [2, 11]. Suppose the parameter b consists of F subcomponents b_i, with b_i the individual weights for linear regression. A Gibbs sampler samples each component b_i individually, assuming the other components b_{\i} are given, using p(b_i | b_{\i}). A Gibbs sampler has the advantage that each new sample is accepted.

As detailed in [8], the mean square error (MSE) of the MCMC estimator is given by

ε^2(F^MC_H) = E_B[ (F^MC_H − F)^2 ],    (2.4)

with E_B the expected value with respect to the joint distribution of B = {b^(h)}. Equation (2.4) is typically rewritten as

ε^2(F^MC_H) = V_B[ F^MC_H ] + ( E_B[ F^MC_H ] − F )^2,    (2.5)

with V the variance. More details can be found in [8, 29]. It is not possible to express the MSE in terms of H due to the fact that the generated, consecutive samples in the Markov chain are not independent. This is in contrast with classical Monte Carlo methods, where the MSE can be expressed in terms of the number of independent samples H, e.g. ε^2 ∼ H^{−1}.

Classical Monte Carlo methods are used in applications for which the numerical solution depends on a discretization step. For these applications, a hierarchy of levels can be created. This is, for example, true for stochastic differential equations [13] using a time step, or for PDEs using a spatial discretization step [5], as seen in Figure 1.1a. Multilevel Monte Carlo (MLMC) combines decreasing discretization steps, resulting in several levels l = 0, ..., L with L the smallest discretization step and thus the finest level. By introducing this level hierarchy, the given notation is adjusted to incorporate the level parameter l. We denote by f_l(b) the function of interest on level l and by F^MC_{l,H} the MC estimate on level l for E[f_l(b)]. Following the MLMC terminology, the telescoping sum uses this level hierarchy and exploits the linearity of the expectation operator,

E[f_L(b)] = E[f_0(b)] + Σ_{l=1}^{L} E[ f_l(b) − f_{l−1}(b) ].    (2.6)

Using Monte Carlo as an unbiased estimator for the expectation combined with (2.6) results in

E[f_L(b)] ≈ (1/H_0) Σ_{h=1}^{H_0} f_0(b^(h)) + Σ_{l=1}^{L} (1/H_l) Σ_{h=1}^{H_l} ( f_l(b^(h)) − f_{l−1}(b^(h)) ),

with H_l the number of samples on level l. Denoting

Y^MC_{l,H_l} = (1/H_l) Σ_{h=1}^{H_l} ( f_l(b^(h)) − f_{l−1}(b^(h)) )    (2.7)

results in

E[f_L(b)] ≈ F^MLMC_{L,{H_l}} = F^MC_{0,H_0} + Σ_{l=1}^{L} Y^MC_{l,H_l},

with F^MLMC_{L,{H_l}} the MLMC estimate for E[f_L(b)].
In short, there are two principles behind MLMC. The first principle is that taking samples on the coarser levels l < L is computationally cheaper than taking samples on the finest level L. These cheaper samples, however, still approximate the same solution as the finest level, with some loss in accuracy. The second principle can be seen as the use of a control variate. A control variate E for a random variable U is defined by the facts that taking samples from E is cheap and E is positively correlated with U. Exploiting the linearity of the expectation operator results in

E[U] = E[E] + E[U − E].    (2.8)

Since E is positively correlated with U, V[U − E] < V[U], and an MC estimator for the right-hand side of (2.8) will require fewer samples taken from U than an MC estimator for E[U] in order to reach the same MSE. Note the resemblance of (2.8) and the telescoping sum (2.6). In order for the coarser level in (2.7) to act as a control variate, the samples f_l(b^(h)) and f_{l−1}(b^(h)) need to be positively correlated. It is thus important that, when using MC for estimating the difference E[f_l(b) − f_{l−1}(b)] ≈ (1/H_l) Σ_{h=1}^{H_l} ( f_l(b^(h)) − f_{l−1}(b^(h)) ), the H_l pairs {f_l(b^(h)), f_{l−1}(b^(h))} are uncorrelated, but within a pair the samples f_l(b^(h)) and f_{l−1}(b^(h)) are positively correlated. The coarser level l − 1 then acts as a control variate for level l. In conclusion, the coarser levels in MLMC are constructed such that they contain most of the variance, and MLMC will as a result take more samples on the coarser and thus cheaper levels, resulting in less computational work [5, 8, 13, 28].

Due to the success of MLMC in different fields, multilevel Markov chain Monte Carlo has been investigated [8, 36]. By using a hierarchy of levels, the acceptance rate of Metropolis-Hastings can be improved. In [8], a hierarchy of chains is created to improve the acceptance rate and additionally reduce the variance for subsurface flow. Currently, there is no generally accepted method to design multilevel Markov Chain Monte Carlo methods. One problem is that the generated samples in MCMC are not independent, as required by MLMC. One technique is to only consider every k-th sample in the chain; this is called thinning.
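The following sketch illustrates the two MLMC principles on a toy problem (not the paper's sampler): a cheap quantity f_coarse that is positively correlated with the expensive quantity f_fine acts as a control variate, and the telescoping estimator (2.6) spends many samples on the cheap level and only a few on the correction term. The functions f_fine and f_coarse are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy fine/coarse quantities: f_fine plays the expensive role and f_coarse is a
# cheap, positively correlated approximation evaluated on the same random input
def f_fine(x):
    return np.sin(x) + 0.05 * x ** 2

def f_coarse(x):
    return x  # crude but highly correlated approximation of sin(x) near 0

# level 0: many cheap samples of the coarse quantity
H0 = 100_000
x0 = rng.normal(size=H0)
level0 = np.mean(f_coarse(x0))

# level 1: few samples of the *difference*, using the same x within each pair so
# that f_fine(x) and f_coarse(x) are positively correlated and the variance is small
H1 = 1_000
x1 = rng.normal(size=H1)
level1 = np.mean(f_fine(x1) - f_coarse(x1))

F_mlmc = level0 + level1   # telescoping sum (2.6) with two levels
print(F_mlmc)
```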
3. Noise Injection Gibbs Sampler for Bayesian regression.
In ordinary linear regression [2, 11] the residuals in (2.1) are assumed uncorrelated and normally distributed, e.g. e ∼ N(0, τ^{−1} I_N), with I_N the identity matrix of size N × N and τ > 0 the noise precision. The observations y are then probabilistic,

p(y | X, b) = Π_{i=1}^{N} N( y_i | x_i b, τ^{−1} ),

with x_i ∈ R^{1×F} the features of sample i. The conditional posterior distribution of b given τ, X, β and y is then

p(b | τ, y, X) ∼ N( (X^T X + βI_F)^{−1} X^T y, ((X^T X + βI_F) τ)^{−1} ),

with β a regularization parameter. Computing the covariance matrix ((X^T X + βI_F) τ)^{−1} is inconvenient and often computationally infeasible. Simm et al. [31], however, describe a Gibbs sampler for Bayesian regression and, using their noise injection trick, a sample of b can be taken by solving

( X^T X + (λ_b / τ) I_F ) b = X^T ( y + e_1 ) + e_2 / τ,

with e_1 ∼ N(0, τ^{−1} I_N) and e_2 ∼ N(0, λ_b I_F). A zero-mean and uncorrelated normal distribution is used as prior for b:

p(b, λ_b | α_b, β_b) ∼ N( b | 0, λ_b^{−1} I_F ) G( λ_b | α_b, β_b ),

with G(λ_b | α_b, β_b) the gamma distribution used as conjugate prior for the precision parameter λ_b > 0.

Linear mixed models are defined by

y = W v + Z u + e,    (3.1)

with y ∈ R^{N×1} the observations and N the sample size. The vector v ∈ R^{F×1} represents the F fixed effects and u ∈ R^{S×1} the S random effects. The matrices W ∈ R^{N×F} and Z ∈ R^{N×S} are incidence matrices coupling the effects to the observations [6]. We assume that the random effects and the residual error e ∈ R^{N×1} are normally distributed: u ∼ N(0, G) and e ∼ N(0, R). This formulation can further be simplified by assuming that the random effects are a priori uncorrelated, e.g. G = λ_u^{−1} I_S with λ_u > 0, and that the residuals (e) are uncorrelated, resulting in R = τ^{−1} I_N with τ > 0. The model can then be written as y = X b + e with X = [W Z] and b = [v^T u^T]^T, resulting in

p(y | X, b) = Π_{i=1}^{N} N( y_i | x_i b, τ^{−1} ).

In the Bayesian setting [11, 33], the fixed effects are uncertain and assumed normally distributed, v ∼ N(0 | λ_v^{−1} I_F) with λ_v > 0. Using gamma priors for λ_v and λ_u, the conditional probability is

p(b, Λ | y, X) = N( b | 0, Λ^{−1} ) G( λ_v | α_v, β_v ) G( λ_u | α_u, β_u ),

with
Λ = diag( λ_v, ..., λ_v, λ_u, ..., λ_u ),    (3.2)

the diagonal matrix whose first F diagonal entries equal λ_v and whose remaining S entries equal λ_u. Using the noise injection trick [31], a sample of b can be taken by solving

( X^T X + Λ/τ ) b = X^T ( y + e_1 ) + e_2 / τ,    (3.3)

with e_1 ∼ N(0, τ^{−1} I) and e_2 ∼ N(0, Λ). Samples for τ, λ_v and λ_u are taken from, respectively,

p(τ | b, y, X, α_e, β_e) = G( τ | α_e + N/2, β_e + (1/2) Σ_{i=1}^{N} (y_i − X_i b)^2 ),    (3.4)
p(λ_v | b, y, X, α_v, β_v) = G( λ_v | α_v + F/2, β_v + (1/2) Σ_{f=1}^{F} b_f^2 ),    (3.5)
p(λ_u | b, y, X, α_u, β_u) = G( λ_u | α_u + S/2, β_u + (1/2) Σ_{s=F+1}^{F+S} b_s^2 ).    (3.6)

The noise injection Gibbs sampler for linear mixed models is given in Algorithm 3.1.
Algorithm 3.1 Noise injection for linear mixed models
Input: matrix X; hyperparameters α_e, α_v, α_u, β_e, β_v and β_u
sample τ, λ_v and λ_u from their priors
generate e_1 ∼ N(0, τ^{−1} I) and e_2 ∼ N(0, Λ) using (3.2)
sample b^(1) by solving (3.3)
for h = 2, ..., H do
    sample τ, λ_v and λ_u from (3.4), (3.5) and (3.6)
    generate e_1 ∼ N(0, τ^{−1} I) and e_2 ∼ N(0, Λ) using (3.2)
    sample b^(h) by solving (3.3)
end for
return E[y] ≈ (1/H) Σ_{h=1}^{H} X b^(h)
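A minimal Python sketch of Algorithm 3.1, assuming a dense NumPy feature matrix X = [W Z] and an unpreconditioned conjugate-gradient solve of (3.3); the hyperparameter defaults, the CG settings and the function name are illustrative choices and not the paper's C++ implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def noise_injection_gibbs(X, y, F, S, H, alpha_e=1.0, beta_e=1.0,
                          alpha_v=1.0, beta_v=1e-6, alpha_u=1.0, beta_u=1e-6,
                          seed=0):
    """Sketch of Algorithm 3.1 for X = [W Z] with F fixed and S random effects."""
    rng = np.random.default_rng(seed)
    N, D = X.shape                          # D = F + S
    b = np.zeros(D)
    y_sum = np.zeros(N)
    for h in range(H):
        # sample the precisions tau, lambda_v and lambda_u from (3.4)-(3.6)
        resid = y - X @ b
        tau = rng.gamma(alpha_e + N / 2, 1.0 / (beta_e + 0.5 * resid @ resid))
        lam_v = rng.gamma(alpha_v + F / 2, 1.0 / (beta_v + 0.5 * b[:F] @ b[:F]))
        lam_u = rng.gamma(alpha_u + S / 2, 1.0 / (beta_u + 0.5 * b[F:] @ b[F:]))
        Lam = np.concatenate([np.full(F, lam_v), np.full(S, lam_u)])  # diagonal of (3.2)
        # noise injection: draw e1, e2 and solve the linear system (3.3)
        e1 = rng.normal(0.0, np.sqrt(1.0 / tau), size=N)
        e2 = rng.normal(0.0, np.sqrt(Lam))
        rhs = X.T @ (y + e1) + e2 / tau
        A = LinearOperator((D, D), matvec=lambda w: X.T @ (X @ w) + (Lam / tau) * w,
                           dtype=X.dtype)
        b, _ = cg(A, rhs, x0=b)
        y_sum += X @ b
    return y_sum / H                        # Monte Carlo estimate of E[y]
```

Burn-in handling and predictions for unseen data are omitted for brevity; with X = np.hstack([W, Z]) the function returns the training-set estimate of E[y].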
4. Hierarchical Markov Chain Monte Carlo for Bayesian regression.
We are interested in E[y] = E[X b]. Using the Gibbs sampler approach, we approximate

E[y] ≈ (1/H) Σ_{h=1}^{H} X b^(h),

with b^(h) the samples taken by solving (3.3). For large-scale data, solving thousands of linear systems (3.3) can be computationally expensive even when iterative solvers are used. Therefore, we want to create a hierarchy of levels approximating the data set X, resulting in data levels with increasing computational work and data information. We propose to use clustering algorithms to create this hierarchy of L + 1 data matrices X_l for l = 0, ..., L by clustering the features (columns) of X = X_L. This strategy has been applied to create a two-level preconditioner for Bayesian regression [34]. A coarser data matrix is created by defining the coarse features as

c_S = (1/√|S|) Σ_{i∈S} X(:, i),

with c_S the coarse feature representing the features of cluster S and |S| the number of features in S. When the features in X_L = [X_1, X_2, ..., X_{F_C}] are ordered per cluster, with F_C the number of clusters and X_i the data matrix with the features of cluster i, the coarser data matrix is given by X_{L−1} = X_L P_L, where

P_L^T is the F_C × F block matrix whose i-th row contains the value 1/√n_i in the n_i columns belonging to cluster i and zeros elsewhere,    (4.1)

with n_i the number of features in cluster i. Note that this definition of P_L leads to

P_L^T (X_L^T X_L + βI) P_L = P_L^T X_L^T X_L P_L + β P_L^T P_L = X_{L−1}^T X_{L−1} + βI,

since P_L^T P_L = I, as detailed in [34].

The linear system (3.3) involves X^T X, and this matrix product is also used in principal component analysis (PCA). PCA finds the eigenvectors with the largest eigenvalues of X^T X and is a projection method that maximizes the total data variance [2, 34]. By clustering the features, ideally the variance within a cluster is as small as possible. This means that the coarser levels can approximate the larger eigenvectors and contain most of the data variance. Note that applying PCA directly to create the coarser levels is not feasible, since it requires the eigenvalue decomposition of X^T X.

Using a hierarchy of data matrices, it is possible to define a Gibbs sampler on each level. One simple approach is to define a Markov chain for each level. This would result in L + 1 Markov chains and thus L + 1 burn-ins. It is, however, possible to interpolate a sample b_l on level l to level l + 1 with b_{l+1} = P_{l+1} b_l for l = 0, ..., L − 1. The interpolated sample is used to sample τ, λ_v and λ_u in equations (3.4), (3.5) and (3.6). Next, a sample b_{l+1} on level l + 1 is drawn using equation (3.3). This provides us with samples for the parameter b_l on each level l.

The quantity of interest is the expected value of the predicted observations E[y]. Since X_{l−1} = X_l P_l, we have X_{l−1} b_{l−1} = X_l P_l b_{l−1}. This means that E[y] can be estimated using the fine level data matrix X_L and the coarser solutions b_l with

E[y] ≈ X_L ( Π_{k=l}^{L−1} P_{k+1} ) ( Σ_{h=1}^{H_l} b_l^(h) ) / H_l    for l = 0, ..., L.

This is important since generally we are interested in the predictions for new and unseen data X_test, with the model trained using different data X_train. Combining the samples from each level l, E[y] is approximated by

E[y] ≈ X_L [ Σ_{l=0}^{L} ( Π_{k=l}^{L−1} P_{k+1} ) ( Σ_{h=1}^{H_l} b_l^(h) ) ] / ( Σ_{l=0}^{L} H_l ).

Multilevel Gibbs sampling using noise injection for linear mixed models is given in Algorithm 4.1.
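The sketch below illustrates the construction in (4.1), assuming cluster labels for the feature columns are already available; it builds a dense prolongation P, forms the coarse matrix X_{l−1} = X_l P_l and checks that P^T P = I. A sparse P would be used in practice.

```python
import numpy as np

def prolongation_from_clusters(labels, n_coarse):
    """Build the dense prolongation P_l (fine x coarse) as in (4.1):
    column j holds 1/sqrt(n_j) on the rows belonging to cluster j."""
    labels = np.asarray(labels)
    P = np.zeros((labels.size, n_coarse))
    counts = np.bincount(labels, minlength=n_coarse)
    P[np.arange(labels.size), labels] = 1.0 / np.sqrt(counts[labels])
    return P

labels = np.array([0, 0, 1, 1, 1, 2])       # 6 fine features grouped into 3 clusters
P = prolongation_from_clusters(labels, 3)

rng = np.random.default_rng(2)
X_fine = rng.normal(size=(8, 6))
X_coarse = X_fine @ P                        # X_{l-1} = X_l P_l
print(np.allclose(P.T @ P, np.eye(3)))       # True: P^T P = I, as used above
```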
Algorithm 4.1
Multilevel Noise injection Gibbs sampler for linear mixed models
Input: hierarchy of matrices X_l for l = 0, ..., L; hyperparameters α_e, α_v, α_u, β_e, β_v and β_u
take H_0 samples b_0^(h) on the coarsest level l = 0 using the Noise Injection Algorithm 3.1
Σ_0 = X_L ( Π_{k=0}^{L−1} P_{k+1} ) ( Σ_{h=1}^{H_0} b_0^(h) )
Sample on the remaining levels:
for l = 1, ..., L do
    interpolate b_l = P_l b_{l−1}^(H_{l−1})
    for h = 1, ..., H_l do
        sample τ, λ_v and λ_u from (3.4), (3.5) and (3.6)
        generate e_1 ∼ N(0, τ^{−1} I) and e_2 ∼ N(0, Λ) using (3.2)
        sample b_l^(h) by solving (3.3) using X_l
    end for
    Σ_l = X_L ( Π_{k=l}^{L−1} P_{k+1} ) ( Σ_{h=1}^{H_l} b_l^(h) )
end for
return E[y] ≈ ( Σ_{l=0}^{L} Σ_l ) / ( Σ_{l=0}^{L} H_l )
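The following sketch shows how the per-level sample sums of Algorithm 4.1 are prolongated to the finest level and combined into one estimate of E[y]; the level samples passed in are random placeholders standing in for the Gibbs draws of the inner loop.

```python
import numpy as np

def combine_level_estimates(X_fine, P_list, level_samples):
    """Combine per-level Gibbs samples into E[y] as in Algorithm 4.1.

    X_fine        : finest data matrix X_L
    P_list        : [P_1, ..., P_L], prolongations from level l-1 to level l
    level_samples : level_samples[l] is a list of weight samples b_l^(h) on level l
    """
    L = len(P_list)
    total, n_total = 0.0, 0
    for l, samples in enumerate(level_samples):
        s = np.sum(samples, axis=0)          # sum_h b_l^(h)
        for k in range(l, L):                # prolongate up to the finest level
            s = P_list[k] @ s
        total = total + X_fine @ s
        n_total += len(samples)
    return total / n_total                   # estimate of E[y]

# toy usage with random stand-ins for the actual Gibbs samples
rng = np.random.default_rng(3)
P1 = np.kron(np.eye(2), np.ones((2, 1)) / np.sqrt(2))   # 4 fine <- 2 coarse features
X = rng.normal(size=(5, 4))
coarse = [rng.normal(size=2) for _ in range(20)]         # b_0 samples (placeholders)
fine = [rng.normal(size=4) for _ in range(5)]            # b_1 samples (placeholders)
print(combine_level_estimates(X, [P1], [coarse, fine]))
```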
5. Multilevel Markov Chain Monte Carlo using correlated samples.
MLMC has proven to speed up Monte Carlo methods for different applications. As explained in Section 2, positive correlation between samples on different levels results in variance reduction of the MC estimator. Before detailing the correlation for the Gibbs sampler, the telescoping sum (2.6) is first applied to linear models,

E[y_L] = E[y_0] + Σ_{l=1}^{L} E[y_l − y_{l−1}]    (5.1)
       = E[X_0 b_0] + Σ_{l=1}^{L} E[X_l b_l − X_{l−1} b_{l−1}]
       = X_0 E[b_0] + Σ_{l=1}^{L} X_l E[b_l − P_l b_{l−1}]
       ≈ X_0 (1/H_0) Σ_{h=1}^{H_0} b_0^(h) + Σ_{l=1}^{L} X_l (1/H_l) Σ_{h=1}^{H_l} ( b_l^(h) − P_l b_{l−1}^(h) ),    (5.2)

which consists of the expected value on the coarsest level and L correcting terms. For the MLMC scheme to be profitable, the samples b_l^(h) and P_l b_{l−1}^(h) in (5.2) should be correlated. Two approaches to ensure correlation between the samples are considered.

The first approach is simply to define P_l^T b_l = b_{l−1}; this approach thus takes approximately the average within a cluster (up to a factor) of the fine solution as the coarse value, which results in

E[y_l − y_{l−1}] ≈ X_L ( Π_{k=l}^{L−1} P_{k+1} ) ( (1/H_l) Σ_{h=1}^{H_l} ( b_l^(h) − P_l P_l^T b_l^(h) ) ).    (5.3)

This is a simple projection of the finer solution to the coarser solution. It is, however, important that the finer samples b_l^(h) used in the difference on level l = i are drawn from the same distribution as the coarser samples b_l^(h) used in the difference on the next level l = i + 1, for i = 0, ..., L − 1.
Simply aggregating the finer solution does not guarantee this.

The second approach solves (3.3) to draw the sample b_{l−1} on the coarser level with exactly the same values of τ, λ_v, λ_u and e_1 used for drawing the sample b_l on the finer level. Only the random vector e_2 is projected to the coarser space, e_{2,l−1} = P_l^T e_{2,l}. Empirical results will show that the second approach is computationally more expensive but obtains better results. The algorithm using the second approach for the correlated samples is outlined in Algorithm 5.1.

Algorithm 5.1
Multilevel Monte Carlo noise injection Gibbs sampler using linear system solves
Input: hierarchy of matrices X_l for l = 0, ..., L; hyperparameters α_e, α_v, α_u, β_e, β_v and β_u
take H_0 samples b_0^(h) on the coarsest level l = 0 using the Noise Injection Algorithm 3.1
E[y_0] ≈ X_L ( Π_{k=0}^{L−1} P_{k+1} ) ( (1/H_0) Σ_{h=1}^{H_0} b_0^(h) )
Sample on the remaining levels:
for l = 1, ..., L do
    interpolate b_l = P_l b_{l−1}^(H_{l−1})
    for h = 1, ..., H_l do
        sample τ, λ_v and λ_u from (3.4), (3.5) and (3.6)
        generate e_1 ∼ N(0, τ^{−1} I) and e_2 ∼ N(0, Λ) using (3.2)
        sample b_l^(h) by solving (3.3) using X_l
        restrict e_{2,l−1} = P_l^T e_{2,l}
        using X_{l−1} and e_{2,l−1}, sample b_{l−1}^(h) by solving (3.3), reusing τ, λ_v, λ_u and e_1
        store the difference d_l^(h) = b_l^(h) − P_l b_{l−1}^(h)
    end for
    E[y_l − y_{l−1}] ≈ X_L ( Π_{k=l}^{L−1} P_{k+1} ) ( (1/H_l) Σ_{h=1}^{H_l} d_l^(h) )
end for
return E[y] ≈ E[y_0] + Σ_{l=1}^{L} E[y_l − y_{l−1}]

Algorithms 4.1 and 5.1 start sampling at the coarsest level and then refine until the finest level. This means that all the samples on each level are drawn consecutively. In contrast to multilevel Monte Carlo, the samples of a Markov chain are not i.i.d. and consecutive samples are correlated. Instead of taking all the samples at once for each level, it is possible to take a few samples at one level, then move to another level and take a few samples before changing again to another level. Figure 5.1 shows examples of traversing the levels. Figure 5.1a details a V-cycle with three samples per level and Figure 5.1b a W-cycle. Restricting the previous sample b_l from a finer level to a coarser level b_{l−1} is done using b_{l−1} = P_l^T b_l.
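A sketch of the inner correlated fine/coarse step of Algorithm 5.1, assuming feature clustering only (so the observation-space noise e_1 can be reused directly) and using dense direct solves instead of the (preconditioned) CG solves of the actual implementation.

```python
import numpy as np

def correlated_pair_difference(X_f, X_c, P, Lam_f, Lam_c, tau, y, rng):
    """One inner step of Algorithm 5.1: draw b on the fine and the coarse level from
    the *same* tau, Lambda and injected noise, and return b_f - P b_c."""
    e1_f = rng.normal(0.0, np.sqrt(1.0 / tau), size=X_f.shape[0])
    e2_f = rng.normal(0.0, np.sqrt(Lam_f))
    A_f = X_f.T @ X_f + np.diag(Lam_f / tau)
    b_f = np.linalg.solve(A_f, X_f.T @ (y + e1_f) + e2_f / tau)

    e1_c = e1_f                              # reuse the observation-space noise
    e2_c = P.T @ e2_f                        # restrict the feature-space noise
    A_c = X_c.T @ X_c + np.diag(Lam_c / tau)
    b_c = np.linalg.solve(A_c, X_c.T @ (y + e1_c) + e2_c / tau)
    return b_f - P @ b_c                     # d_l^(h) in Algorithm 5.1

# toy usage
rng = np.random.default_rng(4)
P = np.kron(np.eye(2), np.ones((2, 1)) / np.sqrt(2))    # 4 fine <- 2 coarse features
X_f = rng.normal(size=(10, 4))
X_c = X_f @ P
y = rng.normal(size=10)
d = correlated_pair_difference(X_f, X_c, P, np.full(4, 2.0), np.full(2, 2.0),
                               tau=1.5, y=y, rng=rng)
print(d)
```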
6. Experiments.
Fig. 5.1: Three level V-cycle and W-cycle with three samples after each change of level.

We have investigated the described multilevel sampling schemes for Bayesian regression. Using real data X but simulated b and y, the model was trained with y_train = X_train b + e and evaluated using y_test = X_test b, with X_train and X_test determined by 5-fold cross-validation. The characteristics of the different data matrices X are given in Table 6.1; most are available online as indicated. Generating normally distributed b ∼ N(0, I_F) and e ∼ N(0, I_N), y_train was simulated.

The ChEMBL data comes from chemogenomics, where the aim is to predict the activity of chemical compounds on certain proteins. More precisely, one of the indicators for drug-protein activity is the IC50 value. IC50, or the half maximal inhibitory concentration, measures the substance concentration required to inhibit the activity of a protein by 50%. The bioactivity database ChEMBL version 19 [1] is publicly available. The chemical compounds are modeled by combinations of molecular substructures, described by extended-connectivity fingerprints (ECFP [30]) computed using rdkit [27] with 3 layers. More details can be found in [35].

The single-nucleotide polymorphism (SNP) data sets are used in genomic prediction for animals or plants [6]. This application uses a linear mixed model (3.1). The matrix Z was simulated using AlphaMPSim [19] and the random effects u represent the SNP marker effects. The number of SNP markers was set to 10 000, 50 000 and 100 000, resulting in three different data sets. Only the overall mean was modeled for the fixed effects, resulting in W = 1. For the SNP data sets, the observations y were available and used.
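The evaluation protocol described above can be sketched as follows; scikit-learn's KFold is used here purely for illustration (the paper's implementation is in C++), and the matrix X is a random stand-in for one of the real data sets.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
N, F = 1_000, 200
X = rng.normal(size=(N, F))                 # stand-in for a real data matrix

# simulate weights and targets as in the experimental setup
b_true = rng.normal(size=F)                 # b ~ N(0, I_F)
e = rng.normal(size=N)                      # e ~ N(0, I_N)
y = X @ b_true + e                          # y_train = X_train b + e

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test = X[test_idx]
    y_test = X_test @ b_true                # evaluation uses the noiseless X_test b
    # ... run a sampler on (X_train, y_train) and compare predictions with y_test
```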
Name   Source   value type

Table 6.1: The characteristics of the data matrices X used in the experiments.

Leader-follower clustering [14] was recursively used to cluster the features, resulting in a hierarchy of data matrices. The performance of clustering algorithms is often data dependent, and leader-follower clustering has the advantage that the number of clusters does not have to be specified in advance.
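A minimal single-pass leader-follower clustering of the feature columns, in the spirit of [14]; the distance threshold is a hypothetical tuning parameter and, unlike some variants, the leaders are kept fixed once created.

```python
import numpy as np

def leader_follower(columns, threshold):
    """Assign each feature column to the nearest existing leader column, creating a
    new leader (cluster) when the distance exceeds the threshold. Single pass."""
    leaders, labels = [], []
    for col in columns:
        if leaders:
            dists = [np.linalg.norm(col - lead) for lead in leaders]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                labels.append(j)
                continue
        leaders.append(col.copy())
        labels.append(len(leaders) - 1)
    return np.array(labels), len(leaders)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 50))
# the threshold is data dependent and controls the number of coarse features
labels, n_clusters = leader_follower(X.T, threshold=14.0)   # cluster the columns
print(n_clusters)
```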
Name         Number of levels   Feature sizes
Trec10       3                  [181, 319, 478]
CNAE         3                  [198, 354, 856]
micromass    3                  [542, 754, 1300]
DrivFace     3                  [2061, 3907, 6400]
arcene       3                  [3341, 5933, 10000]
SNP1         3                  [3582, 5178, 10000]
tmc2007      3                  [5881, 18709, 30438]
nlpdata      3                  [5258, 12095, 34023]
rcv1 multi   3                  [6858, 18363, 47236]
SNP2         4                  [7647, 15764, 28326, 50000]
news20       4                  [5922, 17811, 31000, 62061]
SNP3         4                  [8235, 21608, 44928, 100000]
E2006        4                  [7647, 28326, 78091, 150360]
ChEMBL       6                  [5664, 17019, 50382, 112612, 170223, 291714]

Table 6.2: The number of levels and average feature sizes for the different data sets used in the experiments.

Abbreviation   Description
Gibbs          Noise injection Gibbs sampler given in Algorithm 3.1
ML-G           Noise injection Multilevel (ML) Gibbs sampler (G) given in Algorithm 4.1
MLCSS-G        Noise injection Multilevel (ML) Gibbs sampler (G) with correlated samples (CS) using linear solves (S) given in Algorithm 5.1
MLCSP-G        Noise injection Multilevel (ML) Gibbs sampler (G) with correlated samples (CS) using projections (P) as detailed in (5.3)
MLMLP-G        Noise injection Multilevel (ML) Multilevel-preconditioned (MLP) Gibbs sampler (G) given in Algorithm 4.1
MLMLPCSS-G     Noise injection Multilevel (ML) Multilevel-preconditioned (MLP) Gibbs sampler (G) with correlated samples (CS) using linear solves (S) given in Algorithm 5.1
MLMLPCSP-G     Noise injection Multilevel (ML) Multilevel-preconditioned (MLP) Gibbs sampler (G) with correlated samples (CS) using projections (P) as detailed in (5.3)
W-cycle(30)    W-cycle level sample configuration with 30 samples at each level before changing to the next level.

Table 6.3: Abbreviations.

The level hierarchy created by clustering can be used as a two-level preconditioner to accelerate the convergence of CG [34]. This generally results in lower execution times for solving the linear system in the sampling process.
This does, however, require the computation of the singular value decomposition on the coarsest level. We investigated both CG as the solver on each level and CG accelerated by a two-level preconditioner with two iterations of CG as pre-smoothing for the multilevel sampling [34]. Table 6.3 shows the abbreviations of all the different samplers used in the experiments.

Note that the sampling distributions of τ, λ_v and λ_u in equations (3.4), (3.5) and (3.6) are determined by the solution b found by solving the linear system. As a result, the sampling process is influenced by the underlying solver. Note that CG has a natural regularization role determined by the required tolerance or maximum number of iterations. Solving exactly on the coarse level can lead to different solutions, for better or worse. Especially when using the correlated samples with system solves (MLCSS or MLMLPCSS) on adjacent levels, the solvers on both levels should be of the same nature, e.g. iterative and thus not exact. If this is not the case, the solutions on both levels can be too different, resulting in noise for the difference b_l^(h) − P_l b_{l−1}^(h) in (5.2). In the presented experiments, CG was always used on the coarsest level for a fair comparison with all the other methods. Note that using an exact solver on the coarsest level for multilevel sampling without correlated samples is possible and would lead to even faster execution times with slightly different solutions.

Table 6.4 provides the average setup time, execution time, speed-up, Pearson correlation ρ, root mean square error (RMSE) and mean absolute error (MAE) for the given data sets using the noise injection Gibbs sampler (Gibbs), the multilevel Gibbs sampler (ML-G), the multilevel preconditioned multilevel Gibbs sampler (MLMLP-G) and the multilevel preconditioned multilevel Gibbs sampler with correlated samples using system solves (MLMLPCSS-G) on the test set. 2200 samples were taken, with 200 discarded as burn-in. Non-informative priors were used, setting α_e = 1.0, β_e = 1.0, α_v = 1.0 and α_u = 1.0, with small values for β_v and β_u. The burn-in samples of the multilevel samplers were taken at the coarsest level and the remaining samples were taken consecutively and equally distributed over the different levels, starting at the coarsest level. The experiments were implemented in C++ and compiled with gcc 9.3.0 and OpenMP 4.0 with compile option -O3 (source code: https://scm.cs.kuleuven.be/scm/git/multilevel macau). The experiments were run on a machine with an Intel(R) Core(TM) i7-6560U (2.20GHz) processor with an L3 cache of 4096 KB and 16 GB of DRAM, where 3 out of 4 cores were used. The reported values for the different experiments are the average of 5-fold cross-validation.

As can be seen from Table 6.4, taking all the samples on the fine level (Gibbs) often results in slightly better predictions. Using the multilevel sampling techniques results in a small loss of accuracy, but for almost all data sets within the range of the standard deviation. The finest level contains all of the information, so it is natural that this leads to the best predictions. The fine level does, however, contain more noise components, which are less present in the coarser levels, and the multilevel approach can then achieve better accuracy, as is the case for the SNP3 data set. Note that for the SNP3 data set, there is a significant improvement using the multilevel sampling schemes.
Table 6.4: Average setup time (setup), execution time (exec.), Pearson correlation (ρ), root-mean-square error (RMSE) and mean absolute error (MAE) in scientific notation for the given data sets using Gibbs, ML-G, MLMLP-G and MLMLPCSS-G. The standard deviation for the accuracy measures is given within round brackets.
The speed-up achieved by the multilevel schemes varies for the different data sets. Part of the speed-up depends on the use of preconditioning. Using the preconditioner results in larger setup times due to the calculation of the SVD, but generally results in faster execution times since the linear systems are solved more efficiently. For the ChEMBL data, the multilevel sampling without preconditioning is slower than taking all the samples on the fine level, due to the fact that the nonzeros are binary on the fine level and floating point numbers on the coarser levels. The multilevel sampling with correlated samples results in better accuracy for some data sets, but not consistently. The execution time for the correlated samples with linear system solves (MLMLPCSS-G) has increased since, for each sample, a linear system has to be solved on two adjacent levels.

Since taking samples at the coarser levels is cheaper, more samples can be taken at the coarse level in the same execution time as at the finest level. Table 6.5 shows the number of samples taken in different stages for 3 and 4 levels. Figure 6.1 shows the RMSE for four data sets as a function of the execution time using the stages in Table 6.5.
Stage   [H_0, H_1, H_2]   [H_0, H_1, H_2, H_3]   H

Table 6.5: The increasing number of samples H_l for the different levels.

Figure 6.1 shows that the multilevel sampler (MLMLP-G) achieves better accuracy faster. Additionally, the multilevel sampler converges faster than only sampling at the fine level (Gibbs) in Figures 6.1a, 6.1b and 6.1d. For the E2006 data set the convergence is fast for all samplers. For the ChEMBL data, only the 4 coarsest levels were used and no samples were taken at the two finest levels. The multilevel sampler with correlated samples (MLMLPCSS-G) results in slightly better accuracy than without correlated samples for two data sets, but requires more execution time.

For the results in Table 6.4, the samples are taken consecutively on each level. One can easily change levels while building the chain, and Table 6.6 shows the number of samples taken at each level for different configurations using 3 and 6 levels. Figure 6.2 shows the RMSE for the different configurations in Table 6.6. As can be seen from Figure 6.2, more than 10 samples should be taken before changing to another level. It takes a few, generally fewer than 3, samples for the Markov chain to adapt to the new level. This can easily be incorporated by using a burn-in after a level change or by means of thinning. If enough samples are taken before each level change, one can achieve the same accuracy as if the samples were taken consecutively and equally distributed over the levels, with the remark that when using these configurations fewer samples are taken on the fine level.

As seen from Table 6.6 and Figure 6.2, it is possible to achieve high accuracy without taking many samples at the finest and thus most expensive levels. For multilevel Markov chain Monte Carlo involving PDEs, one can determine the optimal number of samples on each level [8].
Fig. 6.1: The average RMSE as a function of the execution time in seconds for SNP1, nlpdata, E2006 and ChEMBL for Gibbs sampling, MLMLP-G and MLMLPCSS-G.
Config         3 levels            6 levels
V-cycle(3)     [498, 999, 501]     [201, 399, 399, 399, 399, 201]
V-cycle(10)    [500, 1000, 500]    [200, 400, 400, 400, 400, 200]
V-cycle(30)    [480, 990, 510]     [210, 390, 390, 390, 390, 210]
V-cycle(100)   [500, 1000, 500]    [200, 400, 400, 400, 400, 200]
W-cycle(3)     [666, 999, 333]     [522, 783, 387, 189, 90, 27]
W-cycle(10)    [670, 1000, 330]    [530, 790, 390, 190, 80, 20]
W-cycle(30)    [660, 990, 330]     [570, 840, 390, 150, 30, 0]
W-cycle(100)   [700, 1000, 300]    [600, 900, 400, 100, 0, 0]
Table 6.6: The number of samples H_l, given as [H_0, ..., H_L], for the different configurations used in the experiments.

The number of samples on level l is given by

H_l = 2 ε^{−2} ( Σ_{k=0}^{L} √(s_k C_k) ) √(s_l / C_l),    (6.1)

with C_l the cost to take one sample at level l, s_l the sample variance at level l and ε the required tolerance on the RMSE of the quantity of interest.

In contrast, we are interested in the predictions for new and unknown data. We actually do not know our quantity of interest. One could optimize the number of samples based on the training set, but these are noisy observations.
Fig. 6.2: RMSE as a function of the different sample configurations given in Table 6.6 for DrivFace, SNP1, nlpdata and ChEMBL using MLMLP-G and MLMLPCSS-G. The reference values for taking H_l consecutively (consec.) and Gibbs sampling from Table 6.4 are additionally given.

A closer look at equation (6.1) shows that the factor 2 ε^{−2} ( Σ_{k=0}^{L} √(s_k C_k) ) is the same for all levels and the factor √(s_l / C_l) is level specific. The latter can thus be used to define the ratio between the number of samples H_l at each level l. Using this factor, we define the number of samples at level l as

H_l = √(s_l / C_l) / ( Σ_{k=0}^{L} √(s_k / C_k) ) · H,    (6.2)

with H the given total number of samples over all levels and s_l the sample variance of the average of the observations y. The cost C_l to take one sample is defined as the number of nonzeros of X_l. Definition (6.2) is based on the fact that the variance of the difference V[y_l − y_{l−1}] is smaller than the variance V[y_l] on level l due to positive correlation. As an alternative, we considered a ratio purely based on the cost of taking one sample at each level,

H_l = (1/C_l) / ( Σ_{k=0}^{L} 1/C_k ) · H,    (6.3)

with C_l and H as before.
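A small helper illustrating the two allocation rules: given a total budget H, it returns per-level sample counts proportional to √(s_l/C_l) as in (6.2), or to 1/C_l as in (6.3). The cost and variance values in the example are made up.

```python
import numpy as np

def allocate_samples(H, costs, variances=None):
    """Distribute a total budget of H samples over the levels.

    variances given  -> variance-based ratio of (6.2): H_l ~ sqrt(s_l / C_l)
    variances absent -> cost-based ratio of (6.3):     H_l ~ 1 / C_l
    """
    costs = np.asarray(costs, dtype=float)
    if variances is None:
        ratio = 1.0 / costs
    else:
        ratio = np.sqrt(np.asarray(variances, dtype=float) / costs)
    # rounding means the counts may not sum exactly to H
    return np.round(H * ratio / ratio.sum()).astype(int)

# toy example: costs = number of nonzeros of X_l on each level
costs = [5e4, 2e5, 1e6]
print(allocate_samples(2000, costs))                              # cost-based (6.3)
print(allocate_samples(2000, costs, variances=[4.0, 1.0, 0.5]))   # variance-based (6.2)
```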
                      DrivFace   nlpdata   SNP2      news20   ChEMBL
Execution time (s)
  Cost-MLMLPCSS       3.83E2     3.41E2    1.25E4    7.00E3   2.30E3
RMSE
  Gibbs               9.41       7.09E1    1.57E-1   2.87E1   1.54E1
  Var-MLMLP           9.42       7.78E1    1.53E-1   3.32E1
  Var-MLMLPCSS        9.49       7.25E1    1.55E-1   3.30E1   2.09E1
  Cost-MLMLPCSS       9.40
Table 6.7: The average execution time (s) and RMSE using different data sets for multilevel sampling with and without correlated samples, with the number of samples H_l based on the variance of the levels (Var) in equation (6.2) or on the number of nonzeros of X_l (Cost) in equation (6.3). The values for Gibbs, MLMLP-G and MLMLPCSS with the samples equally distributed over the levels from Table 6.4 are provided as reference. The best values for the multilevel sampling schemes are presented in bold.

Table 6.7 details the average execution time and the RMSE for the multilevel sampler with and without correlated samples, with the number of samples H_l for each level l determined by (6.2) or (6.3). Using the variance of the average predictions y on level l combined with the correlated samples performs worse than using only the cost of taking one sample on level l. The sample variance was calculated using 50 samples on each level and then the total number of samples on each level was calculated using (6.2). Using the cost-based distribution of the number of samples H in equation (6.3) results in more samples at the coarser levels, generally without loss in accuracy with respect to uniformly distributing the number of samples H over the levels in MLMLP. This results in slightly faster execution times.

As shown in the previous results, using correlated samples can improve accuracy but does increase the execution time since each sample requires two linear solves on adjacent levels. As mentioned in Section 5, there is a second way to define correlation between samples. The second approach simply projects the finer solution to the coarser level. This could lead to problems when sampling from different distributions for the same level, depending on the quality of the clustering. If all features within one cluster were the same, this would not be a problem; in reality this is almost never true. Table 6.8 shows the execution time, Pearson correlation and RMSE for correlated samples using two linear solves (CSS) and projecting the fine level solution to the coarser level (CSP).
For almost all data sets, the accuracy of CSP, although it is faster, is significantly lower than that of CSS.
           Execution time (s)            Pearson correlation         RMSE
Data       Gibbs    CSS      CSP         Gibbs   CSS     CSP         Gibbs     CSS       CSP
DrivFace   1.13E3   3.88E2   2.65E2      0.974   0.974   0.974       9.41      9.40      9.50
nlpdata    3.23E2   3.35E2   2.07E2      0.955   0.955   0.955       7.09E1    7.12E1    7.11E1
SNP2       4.34E4   9.43E3   6.45E3      0.959   0.960   0.958       1.57E-1   1.54E-1   1.60E-1
news20     7.22E3   6.75E3   4.25E3      0.912   0.892   0.859       2.87E1    3.17E1    3.68E1
E2006      7.76E3   2.85E3   1.77E3      0.962   0.961   0.957       7.88E-2   8.01E-2   8.33E-2
ChEMBL     1.72E3   2.44E3   1.39E3      0.772   0.768   0.766       1.54E1    1.56E1    1.57E1
Table 6.8: The average execution time (s), Pearson correlation and RMSE using different data sets for Gibbs sampling and multilevel sampling using correlated samples by linear solves (CSS) and projections (CSP). Using CSP performs worse than CSS for almost all data sets.

Figure 6.3 shows the variance on each level l and the variance of the difference of the correlated samples for observation y(0), for CSS and CSP. The figure shows that, as expected, the variance of the difference between levels l and l − 1 is smaller than the variance on level l. For the ChEMBL data, the finer levels for CSP do not contain additional information with respect to the coarser levels to aid in the prediction of y(0), resulting in zero variance.

Fig. 6.3: The variance V[y_l] of level l for l = 0, ..., L and the variance V[y_l − y_{l−1}] of the difference for l = 1, ..., L on the training data of nlpdata, SNP2, E2006 and ChEMBL. The results are given using correlated samples with linear solves (CSS) and projections (CSP).
7. Conclusion.
We have presented a multilevel Gibbs sampler for Bayesian regression for linear mixed models. Using clustering algorithms, a hierarchy of data matrices is created. This hierarchy allows sampling on different levels, reducing execution time without significant loss in predictive performance for almost all data sets. Since the coarser levels contain most of the variance of the data, more samples can be taken at the coarser levels and the chain converges faster than when taking all the samples at the finest level. Distributing the number of samples over the levels based on the cost to take one sample on each level was shown to provide fast and accurate results.

Furthermore, we investigated the use of correlated samples to reduce the variance of the Markov chain. The use of correlated samples increased the execution time but did not consistently improve the accuracy of the predictions. The distribution of the number of samples across the levels based on the variance performed worse than the distribution based on sampling cost alone. The multilevel sampler with correlated samples was often still faster than plain sampling on the fine level and, by storing both the solutions of level l and the differences between levels l and l − 1, it is possible to get the predictions of both multilevel samplers as part of an ensemble.
8. Acknowledgement.
This work was supported by the Research Foundation - Flanders (FWO, No. G079016N). We would like to thank David Vanavermaete for generating the SNP data sets.
REFERENCES
[1]
A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, and J. P. Overington, The ChEMBL bioactivity database: an update, Nucleic Acids Research, 42 (2014), pp. D1083–D1090.[2]
C. M. Bishop , Pattern Recognition and Machine Learning , Springer-Verlag New York, 2006.[3]
W. Briggs, V. Henson, and S. McCormick , A Multigrid Tutorial, Second Edition , Soci-ety for Industrial and Applied Mathematics, second ed., 2000, https://doi.org/10.1137/1.9780898719505, https://epubs.siam.org/doi/abs/10.1137/1.9780898719505, https://arxiv.org/abs/https://epubs.siam.org/doi/pdf/10.1137/1.9780898719505.[4]
J. A. Christen and C. Fox , Markov Chain Monte Carlo using an approximation , Journal ofComputational and Graphical Statistics, 14 (2005), pp. 795–810, https://doi.org/10.1198/106186005X76983, https://doi.org/10.1198/106186005X76983.[5]
K. A. Cliffe, M. B. Giles, R. Scheichl, and A. L. Teckentrup , Multilevel Monte Carlomethods and applications to elliptic PDEs with random coefficients , Computing and Visu-alization in Science, 14 (2011), p. 3.[6]
A. D. Coninck, J. Fostier, S. Maenhout, and B. De Baets, DAIRRy-BLUP: A high-performance computing approach to genomic prediction.[7]
T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software (TOMS), 38 (2011), p. 1.[8]
T. Dodwell, C. Ketelsen, R. Scheichl, and A. Teckentrup , A hierarchical multi-level Markov Chain Monte Carlo algorithm with applications to uncertainty quantifica-tion in subsurface flow , SIAM/ASA Journal on Uncertainty Quantification, 3 (2015),pp. 1075–1108, https://doi.org/10.1137/130915005, https://doi.org/10.1137/130915005,https://arxiv.org/abs/https://doi.org/10.1137/130915005.[9]
Y. Efendiev, T. Hou, and W. Luo , Preconditioning Markov Chain Monte Carlo simulationsusing coarse-scale models , SIAM Journal on Scientific Computing, 28 (2006), pp. 776–803,https://doi.org/10.1137/050628568, https://doi.org/10.1137/050628568.[10]
A. E. Gelfand and A. F. M. Smith , Sampling-based approaches to calculating mar-ginal densities , Journal of the American Statistical Association, 85 (1990), pp. 398–409, https://doi.org/10.1080/01621459.1990.10476213, https://amstat.tandfonline.com/doi/abs/10.1080/01621459.1990.10476213.[11]
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin , J. TAVERNIER, J. SIMM, ´A. ARANY, K. MEERBERGEN AND Y. MOREAU
Bayesian data analysis , Chapman and Hall/CRC, 2013.[12]
S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6 (1984), pp. 721–741, https://doi.org/10.1109/TPAMI.1984.4767596.[13]
M. B. Giles , Multilevel Monte Carlo path simulation , Operations Research, 56 (2008),pp. 607–617, https://doi.org/10.1287/opre.1070.0496, https://doi.org/10.1287/opre.1070.0496, https://arxiv.org/abs/https://doi.org/10.1287/opre.1070.0496.[14]
J. A. Hartigan , Clustering Algorithms , John Wiley & Sons, Inc., New York, NY, USA,99th ed., 1975.[15]
T. Hastie, R. Tibshirani, and J. Friedman , The Elements of Statistical Learning , Springer-Verlag New York, 2009.[16]
W. K. Hastings , Monte Carlo sampling methods using Markov Chains and their applications
C. R. Henderson , Selection index and expected genetic advance , Statistical genetics and plantbreeding, 982 (1963), pp. 141–163.[18]
C. R. Henderson, O. Kempthorne, S. R. Searle, and C. Von Krosigk , The estimation ofenvironmental and genetic trends from records subject to culling , Biometrics, 15 (1959),pp. 192–218.[19]
J. M. Hickey, G. Gorjanc, S. Hearne, and B. E. Huang, AlphaMPSim: flexible simulation of multi-parent crosses, Bioinformatics, 30 (2014), pp. 2686–2688, https://doi.org/10.1093/bioinformatics/btu206.[20]
V. H. Hoang, C. Schwab, and A. M. Stuart , Complexity analysis of accelerated MCMCmethods for bayesian inversion , Inverse Problems, 29 (2013), p. 085010, https://doi.org/10.1088/0266-5611/29/8/085010, https://doi.org/10.1088%2F0266-5611%2F29%2F8%2F085010.[21]
S. S. Keerthi and D. DeCoste , A modified finite newton method for fast solution of largescale linear svms , Journal of Machine Learning Research, 6 (2005), pp. 341–361.[22]
S. Kogan, D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith , Predicting risk fromfinancial reports with regression
D. D. Lewis, Y. Yang, T. G. Rose, and F. Li , Rcv1: A new benchmark collection for textcategorization research , J. Mach. Learn. Res., 5 (2004), pp. 361–397, http://dl.acm.org/citation.cfm?id=1005332.1005345.[24]
M. Lichman , UCI machine learning repository , 2013, http://archive.ics.uci.edu/ml.[25]
MATLAB statistics and machine learning toolbox , 2018. The MathWorks, Natick, MA, USA.[26]
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller , Equation of state calculations by fast computing machines , The journal of chemical physics,21 (1953), pp. 1087–1092.[27]
RDKit: Open-source cheminformatics
P. Robbe, D. Nuyens, and S. Vandewalle , Recycling samples in the Multigrid multilevel(quasi-)Monte Carlo method , SIAM Journal on Scientific Computing, 41 (2019), pp. S37–S60, https://doi.org/10.1137/18M1194031.[29]
C. Robert and G. Casella , Monte Carlo statistical methods , Springer-Verlag New York, 2004,https://doi.org/10.1007/978-1-4757-4145-2.[30]
D. Rogers and M. Hahn , Extended-connectivity fingerprints , Journal of Chemical Informationand Modeling, 50 (2010), pp. 742–754. PMID: 20426451.[31]
J. Simm, A. Arany, P. Zakeri, T. Haber, J. K. Wegner, V. Chupakhin, H. Ceulemans, andY. Moreau , Macau: Scalable bayesian factorization with high-dimensional side informa-tion using MCMC , in 2017 IEEE 27th International Workshop on Machine Learning for Sig-nal Processing (MLSP), Sept 2017, pp. 1–6, https://doi.org/10.1109/MLSP.2017.8168143.[32]
J. Simm, G. Klambauer, A. Arany, M. Steijaert, J. K. Wegner, E. Gustin, V. Chupakhin, Y. T. Chong, J. Vialard, P. Buijnsters, I. Velter, A. Vapirev, S. Singh, A. E. Carpenter, R. Wuyts, S. Hochreiter, Y. Moreau, and H. Ceulemans, Repurposing high-throughput image assays enables biological activity prediction for drug discovery [33] D. Sorensen and D. Gianola, Likelihood, Bayesian, and MCMC methods in quantitative genetics, Springer Science+Business Media New York, 1 ed., 2002, https://doi.org/10.1007/b98952.[34]
J. Tavernier, J. Simm, K. Meerbergen, and Y. Moreau , Multilevel preconditioning forridge regression , CoRR, abs/1806.05826 (2018), http://arxiv.org/abs/1806.05826, https://arxiv.org/abs/1806.05826.[35]
J. Tavernier, J. Simm, K. Meerbergen, J. K. Wegner, H. Ceulemans, and Y. Moreau , Fast semi-supervised discriminant analysis for binary classification of large data sets
H. Vandecasteele and G. Samaey, A micro-macro Markov chain Monte Carlo method for molecular dynamics using reaction coordinate proposals I: direct reconstruction, 2020, https://arxiv.org/abs/2002.09324.[37]