[PDF] Macau: Scalable Bayesian Multi-relational Factorization with Side Information using MCMC

Abstract

We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundred millions of observations, and sparse entity features with millions of dimensions. To achieve the scale up, we specially designed sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show performance and advanced features of Macau in a set of experiments, including challenging drug-protein activity prediction task.

Full PDF

MMacau: Scalable Bayesian Multi-relationalFactorization with Side Information using MCMC

Jaak Simm ∗ ESAT-STADIUS, KU LeuvenB-3001 Heverlee, Belgium [email protected]

Adam Arany ∗ ESAT-STADIUS, KU LeuvenB-3001 Heverlee, Belgium [email protected]

Pooya Zakeri

ESAT-STADIUS, KU LeuvenB-3001 Heverlee, Belgium [email protected]

Tom Haber

Hasselt University3500 Hasselt, Belgium [email protected]

J¨org K. Wegner

Janssen PharmaceuticaB-2340 Beerse, Belgium [email protected]

Vladimir Chupakhin

Janssen PharmaceuticaB-2340 Beerse, BelgiumJanssen Pharmaceuticals45007 Toledo, Spain [email protected]

Hugo Ceulemans

Janssen PharmaceuticaB-2340 Beerse, Belgium [email protected]

Yves Moreau

ESAT-STADIUS, KU LeuvenB-3001 Heverlee, Belgium [email protected]

Abstract

We propose Macau, a powerful and ﬂexible Bayesian factorization method for het-erogeneous data. Our model can factorize any set of entities and relations that canbe represented by a relational model, including tensors and also multiple relationsfor each entity. Macau can also incorporate side information, speciﬁcally entityand relation features, which are crucial for predicting sparsely observed relations.Macau scales to millions of entity instances, hundred millions of observations,and sparse entity features with millions of dimensions. To achieve the scale up,we specially designed sampling procedure for entity and relation features that re-lies primarily on noise injection in linear regressions. We show performance andadvanced features of Macau in a set of experiments, including challenging drug-protein activity prediction task.

Matrix factorization (MF) has a long history and a wide range of applications in data sciences,engineering and many ﬁelds of scientiﬁc research. While classical approaches, such as SVD, fac-torize fully observed matrices, a previous work proposed matrix factorization for partially observedmatrices [8]. This enabled the direct use of MF in predictive machine learning problems ( e.g. , incollaborative ﬁltering). However, the original formulation [8] can easily overﬁt the data and it was ∗ Adam Arany and Jaak Simm contributed both equally as ﬁrst authors. a r X i v : . [ s t a t . M L ] D ec mproved in Probabilistic Matrix Factorization (PMF) [4]. The main idea in these two papers andin subsequent research has been to represent each row and each column by a latent vector of size D and ﬁnd the best match to the observed elements of the matrix: min u , v (cid:88) ( i,j ) ∈ I R ( R ij − u (cid:62) i v j ) + λ u (cid:107) u (cid:107) F + λ v (cid:107) v (cid:107) F , (1)where u i , v j ∈ R D are the latent vectors for i th row and j th column, I R is the set of matrix cellswhose value has been observed, R ij ∈ R are the observed values, (cid:107) · (cid:107) F is the Frobenius norm,and λ u , λ v > are regularization parameters. The last two terms in optimization problem (1),introduced in PMF [4], are derived from zero-mean Gaussian priors on u i and v j and a Gaussiannoise model on the observed values R ij . Bayesian PMF (BPMF) [6] extends PMF to full Bayesian inference by introducing common multi-variate Gaussian priors for the latent variables, one for the rows and one for the columns. To inferthese two priors from the data, BPMF places ﬁxed uninformative Normal-Wishart hyperpriors onthem. Let µ u and Λ u ( µ v and Λ v ) be the mean and precision matrix of the Gaussian prior for rows(columns) then the model used by BPMF is p ( u , µ u , Λ u | Θ ) = (cid:89) i =1 N ( u i | µ u , Λ − u ) N W ( µ u , Λ u | Θ ) (2) p ( v , µ v , Λ v | Θ ) = (cid:89) j =1 N ( v j | µ v , Λ − v ) N W ( µ v , Λ v | Θ ) , (3)where N and N W are Normal and Normal-Wishart distributions and Θ are the ﬁxed hyperparam-eters of the Normal-Wishart hyperprior. Similarly to PMF the noise model of BPMF is Gaussian: p ( R | u , v , α R ) = (cid:89) ( i,j ) ∈ I R N ( R ij | u (cid:62) i v j , α − R ) , (4)where α R > is the precision parameter and is assumed to be known. From (2)–(4), it is straight-forward to derive block Gibbs sampler for each latent vector u i and v j , and for the parameters ofthe Gaussian priors µ u , Λ u , µ v , Λ v .It is generally observed that BPMF shows improvement in predictive performance compared to PMF( e.g. , in the collaborative ﬁltering task of Netﬂix [6]). Another advantage of BPMF is that it providescredibility intervals for the estimates. It should be also noted that BPMF can be easily parallelizedand can handle large scale data sets, such as the Netﬂix challenge data, which contains 200 millionobservations. In this paper we propose Macau, a powerful and ﬂexible method for factorization of heterogeneousdata. Its essential features are • The ability to factorize wide range of data models, which we represent by a hypergraphwhere entities are nodes and relations are hyperedges. Supported models include the casesof ordinary graphs and tensor relations, see Appendix A.1. • The incorporation of features (side information) for any entity and for any relation. • Scalability up to millions of entity instances, hundred millions of observations, and sparseentity features with millions of dimensions.We follow the approach of BPMF by proposing a Gibbs sampler scheme that includes speciallydesigned noise injection step for entity and relation features. This enables the scaling of the methodto millions of sparse or to tens of thousands of dense features.The novelty of the proposed method is the combination of all above mentioned functionality into auniﬁed Bayesian framework (see Section 3 for detail overview). Also in the context of matrix factor-ization with entity features Macau carries out

MCMC inference rather than variational approximateapproaches, such as Variational Bayes as proposed in previous research [5].2e apply our method to a standard matrix factorization benchmark of MovieLens, outperforming thestate-of-the-art MF approaches. Additionally, we explore the performance of Macau in a challeng-ing biochemistry task of drug-protein activity prediction where we demonstrate the effectiveness ofthe aforementioned characteristics of the method. This task is based on publicly available data fromChEMBL [2]. Finally, we report runtime information for private industrial data set from Pharma-ceutica containing millions of drug candidates with millions of sparse features and tens of millionsof observed activity values.Our contribution includes an open source package implementing all of the above mentioned fea-tures together with multi-core and multi-node parallelization in the Julia language. In this section we outline the probabilistic model for Macau. Then we give an overview of relatedresearch (Section 3). Finally, we outline the details for the Gibbs sampler (Section 2.4), includingthe crucial noise injection based scheme for sampling the weight variables linking the entity andrelation features (Section 2.5).

In practice, data sets can often contain multiple relations between entities ( e.g. , drugs and proteins,see Fig. 3). To handle it in Macau, we consider a relational model with a set of entities E and aset of relations R such that each relation R ∈ R can link together two or more entities, i.e., R isa tensor. Each relation R maps the instances of its entities to a real number, denoted by R j where j = ( j , . . . , j k ) is the index vector and k is the degree of the relation ( i.e. , the number of entitiesconnected by R ). Formally, R is a map N k → R . As in the case of partially observed matrix thevalues R j ,...,j k are partially observed. We denote the latent vector of instance i ∈ N of entity e ∈ E by u ( e ) i ∈ R D .Each relation R has a Gaussian noise model with precision α R > p ( R | u , α R ) = (cid:89) j ∈ I R N ( R j | (cid:62) u E R j , α − R ) , (5)where I R ⊂ N k is the set of index vectors for which R is observed, E R is the ordered list of entitiesconnected by relation R where the order is the same as in the index vectors j ∈ I R , is the vector ofones and u ej = u ( e ) j ◦ . . . ◦ u ( e k ) j k is the element-wise product of the latent vectors. The conditionalprobability of the observations of all relations is then p ( R| u , α ) = (cid:89) R ∈R (cid:89) j ∈ I R N ( R j | (cid:62) u E R j , α − R ) . (6)Equation (6) allows Macau to simultaneously factorize more than two entities and multiple relationswith possibly different degrees. Entity features are extra information available about instances of entities, often referred to as side-information. For example, in the case of movie ratings, it could be the genre and the release year formovies, or the age and the gender for users. In the example of drug-protein activity modeling, it ispossible to use substructure information of the drug candidate, represented by a large sparse binaryvector. The idea exploited in Macau is that we can use this extra information to predict the latentvector of the instance and, thus, get more accurate factorization, especially for entities that have fewor no observations.First, let us write the standard Gaussian prior (2), used in BPMF, for the latent variable u ( e ) i of aninstance i of entity e : p ( u ( e ) i | µ e , Λ e ) = N ( u ( e ) i | µ e , Λ − e ) , (7) URL to the package: https://github.com/jaak-s/BayesianDataFusion.jl µ e and Λ e are the common prior mean and precision matrix for entity e , respectively. Toincorporate the instance’s feature x ( e ) i ∈ R F e we add a term β (cid:62) e x ( e ) i into the Gaussian mean: p ( u ( e ) i | x ( e ) i , µ e , Λ e ) = N ( u ( e ) i | µ e + β (cid:62) e x ( e ) i , Λ − e ) , (8)where β e ∈ R F e × D is the weight matrix for the entity features and F e is the dimensionality of thefeatures. Equation (8) can be interpreted as a linear model for the latent vectors. If an instancedoes not have any observations then the distribution of its latent variable is fully determined by (8),because there are no terms involving its latent variable in (6). On the other hand, if the instance hasmany observations its features will have only a minor impact.To have a full Bayesian treatment for β e , we introduce a zero mean multivariate normal as its prior: p ( β e | Λ e , λ β e ) = N (vec( β e ) | , Λ − e ⊗ ( λ β e I ) − ) (9) ∝ λ F e D/ β e | Λ e | D/ exp( − λ β e tr( β e Λ − e β (cid:62) e )) (10)where vec( β e ) is the vectorization of β e , ⊗ denotes the Kronecker product and λ β e ≥ is thediagonal element of the precision matrix. The inclusion of Λ e (the precision matrix of the latentvectors) in (9) is crucial for deriving a computationally efﬁcient noise injection sampler, describedin detail in Section 2.5.As the choice of λ β e is problem dependent, we set a gamma distribution as its hyperprior, as used insimilar context for neural networks [3]: p ( λ β e | µ, ν ) = G ( λ β e | µ, ν ) ∝ λ ν/ − β e exp( − ν µ λ β e ) , (11)where µ and ν are ﬁxed hyperparameters, which are both set to in the experiments. Often there is extra information regarding observations ( e.g. , the day (from the release) when theuser went to see the movie or the temperature of the chemical experiment). However, this data is notlinked to a single entity instance but instead to the observation ( e.g. , a particular user-movie pair).If these features are fully observed for a particular relation R , then Macau can incorporate themdirectly into the observation model. Let x ( R ) j ∈ R F R be the relation feature for observation R j , thenthe previous observation model (5) for relation R is replaced by p ( R | u , α R ) = (cid:89) j ∈ I R N ( R j | (cid:62) u E R j + β (cid:62) R x ( R ) j , α − R ) , (12)where β R ∈ R F R is the weight vector. The treatment of β R is similar to β e , i.e. , Macau uses zeromean Gaussian prior on β R with precision λ β R ≥ that has gamma hyperprior: p ( β R | λ β R ) = N ( β R | , λ β R I ) (13) p ( λ β R | µ, ν ) = G ( λ β R | µ, ν ) , (14)where as before µ and ν are ﬁxed hyperparameters, set to 1 in experiments. Gibbs sampling is used to sample from the posterior of the model variables. In this section wepresent the conditional distributions of the Gibbs sampler for all variables (except β e and β R forwhich we propose a specially designed sampler in Section 2.5). Based on (8) and (12) the conditional probability for u ( e ) i is p ( u ( e ) i |R , u , x , β, λ, α, Λ e ) = N ( u ( e ) i | µ ( e ) ∗ i , [Λ ( e ) ∗ i ] − ) (15) ∝ (cid:89) R ∈R e (cid:89) j ∈ I R ( e,i ) N ( R j | (cid:62) u E R j + β (cid:62) R x ( R ) j , α − R ) (16) × N ( u ( e ) i | µ e + β (cid:62) e x ( e ) i , Λ − e ) , Λ ( e ) ∗ i = Λ e + (cid:88) R ∈R e α R (cid:88) j ∈ I R ( e,i ) (cid:32) u E R j u ( e ) i (cid:33) (cid:32) u E R j u ( e ) i (cid:33) (cid:62) µ ( e ) ∗ i = [Λ ( e ) ∗ i ] −  Λ e ( µ e + β (cid:62) e x ( e ) i ) + (cid:88) R ∈E R α R (cid:88) j ∈ I R ( e,i ) ( R j − β (cid:62) R x ( R ) j ) u E R j u ( e ) i  , where R e is the set of relations that the entity e is linked to, u E R j / u ( e ) i denotes element-wise divi-sion , I R ( e, i ) ⊆ I R is the set of indexes of observations to which instance i of entity e is linked to, i.e. , all the observed data in relation R for instance i . The above equations use a shorthand that ifrelation R does not have features then β (cid:62) R x ( R ) j is and similarly if entity e does not have featuresthen β (cid:62) e x ( e ) i is . Macau also uses the same Normal-Wishart hyperprior for µ e and Λ e as BPMF [6]: p ( µ e , Λ e | Θ ) = N ( µ e | µ , ( β Λ e ) − ) W (Λ e | W , ν ) , (17)where the hyperparameters are set to uninformative values of µ = , β = 2 , W = I (the identitymatrix), and ν = D . Combining the hyperprior (17) with (8) we get conditional probability p ( µ e , Λ e | u ( e ) , x , β, Θ ) = N ( µ e | µ ∗ , ( β ∗ Λ e ) − ) W (Λ e | W ∗ , ν ∗ ) . (18)Because of space constraints the formulas for µ ∗ , β ∗ , W ∗ , ν ∗ are presented in the Appendix A.2.One of the essential differences compared to BPMF is that in BPMF the Gaussian priors model thelatent vectors u ( e ) i whereas in Macau they model the residual u ( e ) i − β (cid:62) e x ( e ) i . From (9) and (11) we can derive the conditional probability for λ β e as p ( λ β e | β e , Λ e , µ, ν ) = G ( λ β e | ˜ µ, ˜ ν ) , (19)where ˜ ν = F e D + ν ˜ µ = ( F e D + ν ) µν + µ tr( β (cid:62) e β e Λ e ) . The conditional probability for λ β R is analogous and is described in Appendix A.3. From (8) and (9) we can write out the conditional probability for β e p ( β e | µ e , Λ e , u , x , λ β e ) (20) ∝ exp (cid:32) − N e (cid:88) i =1 ( u ( e ) i − µ e − β (cid:62) e x ( e ) i ) (cid:62) Λ e ( u ( e ) i − µ e − β (cid:62) e x ( e ) i ) − λ β e tr( β e Λ e β (cid:62) e ) (cid:33) Let us denote U = [ u ( e )1 − µ e , . . . , u ( e ) N e − µ e ] (cid:62) and X = [ x ( e )1 , . . . , x ( e ) N ] (cid:62) then, because both thelikelihood and the prior contain Λ e , we can factorize Λ e out: p ( β e | µ e , Λ e , u , x , λ β e ) ∝ exp (cid:18) −

12 tr[(( U − X β e ) (cid:62) ( U − X β e ) + λ β e β (cid:62) e β e )Λ e ] (cid:19) (21) The formula assumes that u ( e ) i is only present once. To handle such cases where, e.g. , E R = ( e, e ) , andthere are observations on the diagonal, equations should be modiﬁed. p ( β e | µ e , Λ e , u , x , λ β e ) ∝ exp (cid:18) −

12 vec( β e − ˆ β e ) (cid:62) (Λ e ⊗ ( X (cid:62) X + λ β e I )) vec( β e − ˆ β e ) (cid:19) (22)where ˆ β e = ( X (cid:62) X + λ β e I ) − X (cid:62) U is the mean and Λ e ⊗ ( X (cid:62) X + λ β e I ) is the precision of the pos-terior. However, even for moderate feature dimensions F e the standard sampling of the multivariateGaussian is computationally intractable because the size of the precision matrix is DF e × DF e .By exploiting the Kronecker product structure of the precision matrix and the existence of ( X (cid:62) X + λ β e I ) in both the mean and the precision we derive an alternative approach. A sample of β e from(22) can be obtained by solving a linear system for ˜ β : ( X (cid:62) X + λ β e I ) ˜ β = X (cid:62) ( U + E ) + (cid:112) λ β e E , (23)where each row of matrices E ∈ R N e × D and E ∈ R F e × D is sampled from N ( , Λ − e ) . Thecorrectness of (23) is proven in Appendix A.5. The derivation of noise injection sampler for theweight vector β R for relation features is analogous, with the difference that the linear system hasonly single right-hand side.Thus, to sample β e we need to solve a linear system of size F e × F e with D different right-handsides. If F e is medium size (up to 20,000) we propose to use direct solvers . If X is sparse wecan tackle high-dimensional systems by solving each right-hand side separately by using iterativemethod of conjugate gradient (CG). CG only requires multiplication of X (and X (cid:62) ) with a vectorand can handle cases where F e is in the order of millions. There are several extensions already proposed to BPMF that are related to our work. BayesianProbabilistic Tensor Factorization [9] extends BPMF to the factorization of a single 3-way relation(tensor) without entity and relation features. Like BPMF their approach uses Gibbs sampler.Singh and Gordon [7] propose Bayesian MF method that can link together more than one 2-wayrelation (matrix). Their sampling approach is analogous to BPMF except using Hamiltonian MonteCarlo within Gibbs where each latent vector is sampled separately by using Hamiltonian MonteCarlo. Their method does not have support for tensors, entity features or relation features. Thelack of support for tensors also means their method cannot support multiple relations between twoentities, which is useful, for example, in the case of drug-protein activity modeling (see Section 4.2for the experiment on ChEMBL data with two relations between potential drugs and proteins).Hierarchical Bayesian Matrix Factorization with Side Information (HBMFSI) [5] is a method for thespecial case of factorizing a single matrix with entity features based on Variational Bayes. HBMFSIdoes not allow the model to use relation features as in Macau. However, they propose to add theconcatenation of row entity features and column entity features in the same way as relation featuresin Macau, i.e. x ( R ) i,j = ( x ( row ) i , x ( col ) j ) . This section gives results for 1) the standard MF benchmark MovieLens and 2) a challenging bio-chemical problem based on the ChEMBL data set [2], and reports runtimes of Macau on a large-scaleindustrial drug–protein activity data set. The performance reported is mean RMSE and the error barsin ﬁgures represent standard deviations. All experiments are repeated 10 times.

The MovieLens data set consists of a single matrix of movie–user ratings from 6,040 users and3,952 movies. There are total of 1,000,209 ratings taking values from 1.0 to 5.0. Recent research On a system with 2 Intel Xeon E5-2699 v3 CPUs using 8 cores Julia takes 25 seconds to solve a 20,000 ×

61] has investigated in detail the noise level in movie ratings. Amatriain et al. made a conservativeestimate of between-trial RMSE of . . Based on that estimate, we chose to use α = 1 . inour experiments. The data set contains 29 and 18 dimensional entity features for users and movies,respectively. We compare Macau against HBMFSI , which is the state-of-the-art MF approach withentity features, using the same evaluation setup used in their paper [5]. Namely, one half of theratings are randomly set as the test set and another half as the training set. The methods comparedare 1) Macau-E: Macau with entity features, 2) Macau-ER: Macau with entity features and relationfeatures constructed as in HBMFSI (see Section 3), 3) HBMFSI, and 4) BPMF. All methods uselatent dimension D = 30 . It should be noted that the relative performance between the methodswas similar when we used D = 10 (data not shown). The results show that both Macau setupsoutperform HBMFSI and BPMF, see Figure 1. Macau-ER Macau-E HBMFSI BPMF0.8460.8480.8500.8520.8540.8560.858 R M S E p<0.0001 Figure 1: Results for MovieLens experi-ments. The p-value of the two-sided t-test be-tween Macau-ER and HBMFSI is lower than0.0001.

BPMF Macau Ridge0.600.650.700.750.800.850.900.95 R M S E Figure 2: Results for ChEMBL experiments.BPMF and Macau use D = 30 . The prediction of drug and protein interactions is crucial for the development of new drugs. Inthis case study we focus on the interaction measure IC50, which measures the concentration of thedrug necessary to inhibit the activity of the protein by . We prepared a data set from the publicbioactivity database ChEMBL[2] Version 19. First, we selected proteins that had at least 200 IC50measurements, and then we kept drugs with 3 or more IC50 measurements. Finally, we ﬁlteredout some measurements with clear data errors (these were also reported to ChEMBL). The ﬁnalnumbers for small molecules and proteins are 15,073 and 346, respectively, with total of 59,280IC50 measurements. In all of the ChEMBL experiments we model log of IC50 and set α = 5 . ,because this corresponds to a reasonable standard deviation of . . For drugs, we use sparsefeatures (substructure ﬁngerprint) with F drug = F prot = × × , and the Ki part contains 5,121 observations. As before In the experiments we used the MATLAB implementation of HBMFSI provided by the authors. It is possible to enhance the performance by tuning/sampling α . rug Protein β drug β prot IC50 Ki

VariablesAvailable data

Drug feat. Protein feat.

Type {IC50, Ki}

Figure 3: The IC50+Ki model has two

Type sof relations between entity

Drug and entity

Protein . IC50 IC50+Ki0.600.610.620.630.64 R M S E p<0.0001 Figure 4: Comparison of IC50 and IC50+KiMacau models using D = 30 . Drug Protein β drug β prot IC50

Drug feat. Protein feat.

Assay

Pheno

Figure 5: The IC50+Pheno model has threeentities and two relations.

10 15 20 25 30Latent dimension [D]0.730.740.750.760.770.78 R M S E IC50IC50+Pheno

Figure 6: Comparison of IC50 andIC50+Pheno models.the test set contains measurements only from the IC50 part. As can be seen from Figure 4 the tensormodel IC50+Ki signiﬁcantly outperforms the single relation model with only IC50 ( p < . ).In the ﬁnal experiment, we explore the effect of connecting drugs to two relations. For the drugsin our IC50 data set, we compiled all cancer assays from ChEMBL that had at least 20 compoundsin that set. From these, we created a new relation Pheno with , observations measuring thephenotypic effect of drugs in Assay s, which is depicted in Figure 5. Because the effects ofassays are not directly linked to any speciﬁc protein, we expected weaker effect than from the Kidata. Therefore, for the comparison of IC50 and IC50+Pheno models the test set of IC50 mea-surements are selected only from that 1,678 compounds that have

Pheno observations. In Figure 6we can observe the IC50+Pheno model outperforms the IC50 model when appropriately large latentdimension is used. It is interesting to note, that with D = 10 the IC50+Pheno model is slightlylosing in performance, which can be an evidence that, with too small D , adding more relations canresult in an overcrowded latent space. For large scale problems, our implementation has multi-core and multi-node parallelization. Thesampling of latent vectors can be parallelized straightforwardly as the the latent vectors of a singleentity are, in our use cases, independent of each other and can be sampled in parallel. The onlydifference here compared to BPMF is that, for entities that have features Macau requires computing β (cid:62) e x ( e ) i , for which our implementation provides parallelization as well.In the parallelization of (23), if F e ≤ , the direct solver is fast enough not to require addi-tional parallelization. As mentioned in the case of sparse features, Macau uses CG to solve (23) foreach right-hand side separately. For each CG, our implementation parallelizes the matrix product8perations in a multi-core way and CGs can be distributed across multiple processes and thus canbe parallelized over multiple nodes.The large-scale data set is a subset of a proprietary data set from Janssen Pharmaceutica containingmillions of compounds. The subset has more than 1.8M compounds and more than 1,000 proteinsfor a total of several tens of millions of compound–protein measurements. Here we report thecomputation times of Macau on two types of features for the compounds using systems with 2 IntelXeon E5-2699 v3 CPUs. Firstly, for the feature dimension F e ≈ and D = 30 , the computationof the full Gibbs step using 8 cores of a single node takes about 40 seconds (using a direct solver for(23)). Secondly, for the feature dimension F e ≈ . and D = 30 thecomputation of the full Gibbs step using 10 cores per CG and total of 15 nodes (2 CGs per node)takes about 600 seconds. We observed that 1,000 Gibbs iterations (from which 800 were discardedas burn-in) were sufﬁcient to reach a stable posterior. The best of our knowledge, this paper proposes the ﬁrst Bayesian factorization method that allowsto handle tensors, multiple relations, and entity and relation features.

Acknowledgments

Jaak Simm, Adam Arany, Pooya Zakeri and Yves Moreau are funded by Research Council KU Leu-ven (CoE PFV/10/016 SymBioSys) and by Flemish Government (IOF, Hercules Stitching, iMindsMedical Information Technologies SBO 2015, IWT: O&O ExaScience Life Pharma; ChemBio-Bridge, PhD grants).

References [1] X. Amatriain, J. M. Pujol, and N. Oliver. I like it... i like it not: Evaluating user ratings noisein recommender systems. In

User modeling, adaptation, and personalization , pages 247–258.Springer, 2009.[2] A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Kr¨uger, Y. Light,L. Mak, S. McGlinchey, et al. The chembl bioactivity database: an update.

Nucleic acidsresearch , 42(D1):D1083–D1090, 2014.[3] D. Husmeier, W. D. Penny, and S. J. Roberts. An empirical evaluation of bayesian samplingwith hybrid monte carlo for training neural network classiﬁers.

Neural Networks , 12(4):677–705, 1999.[4] A. Mnih and R. Salakhutdinov. Probabilistic matrix factorization. In

Advances in neural infor-mation processing systems , pages 1257–1264, 2007.[5] S. Park, Y.-D. Kim, and S. Choi. Hierarchical bayesian matrix factorization with side informa-tion. In

Proceedings of the Twenty-Third international joint conference on Artiﬁcial Intelligence ,pages 1593–1599. AAAI Press, 2013.[6] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using markov chainmonte carlo. In

Proceedings of the 25th international conference on Machine learning , pages880–887. ACM, 2008.[7] A. Singh and G. Gordon. A bayesian matrix factorization model for relational data. In

Proceed-ings of the Twenty-Sixth Conference Annual Conference on Uncertainty in Artiﬁcial Intelligence(UAI-10) , pages 556–563, Corvallis, Oregon, 2010. AUAI Press.[8] N. Srebro, T. Jaakkola, et al. Weighted low-rank approximations. In

ICML , volume 3, pages720–727, 2003.[9] L. Xiong, X. Chen, T.-K. Huang, J. G. Schneider, and J. G. Carbonell. Temporal collaborativeﬁltering with bayesian probabilistic tensor factorization. In

SDM , volume 10, pages 211–222.SIAM, 2010. 9

Appendices

A.1 Data Models Supported by Macau

Let M be a Macau model ( i.e. , hypergraph) specifying the entities and their relations. We say M is factorizable if given large enough latent dimension D we can choose latent vectors for every entityin such a way that they can ﬁt any relation data with arbitrarily small error.However, it is clear that some models are not factorizable . For example, consider two entities e and e , with the latent vectors u and v , respectively, and two relations, R and S , between them.Since both relations are modelled by the same formula u (cid:62) i v j it is not possible to ﬁt arbitrary data.Actually we can only ﬁt the case when the two relations are equal ( i.e., R ij = S ij ).Let us deﬁne the relation R in model M to be factorizable if in the single latent dimensional case( D = 1 ) it is possible to specify arbitrary values to the latent variables of its entities, e ∈ E R ,while keeping the predictions for all other relations in the model M equal to . It is straight-forward to see that if R is factorizable we can ﬁt any observed data of the relation R by adding newlatent dimensions without affecting the predictions for other relations. Additionally, the fact that allrelations of M are factorizable implies M is factorizable , because we can always add new latentdimensions that only effect a speciﬁc relation and, thus, ﬁt all relations as accurately as needed.It is easy to show that if all pairs of entities in the hypergraph (Macau model) M have at most onehyperedge (relation) between them, the model M is factorizable . To see that this is true considera relation R in such a model. R is factorizable because if we set the latent variables of the non-participating entities, e / ∈ E R , equal to then the predictions of the other relations will be zero asthey all contain at least one non-participating entity.From this we can see that Macau can factorize any • ordinary undirected graph, • acyclic hypergraph.Additionally, it is also possible to tensorize simple cases when there are multiple edges betweentwo entities. For example, the model IC50+Ki in our paper has two relations, namely IC50 and Ki,between the entities Drug and Protein. To handle it in Macau we represent it as a tensor with threemodes: Drug, Protein, Type, where the third mode speciﬁes the relation (either IC50 or Ki). A.2 Sampling of Gaussian Priors of the Latent Variables

In Macau µ e and Λ e are interpreted as the model for the residuals u ( e ) i − β (cid:62) e x ( e ) i . The conditionaljoint probability µ e , Λ e used in sampler is p ( µ e , Λ e | u ( e ) , x , β, Θ ) = N ( µ e | µ ∗ , ( β ∗ Λ e ) − ) W (Λ e | W ∗ , ν ∗ ) , (24)where µ ∗ = β µ + N e ¯ Uβ + N e , (25) β ∗ = β + N e , (26) ν ∗ = ν + N e + F e , (27) [ W ∗ ] − = W − + N e ¯ S + β µ µ (cid:62) − β ∗ µ ∗ µ ∗ (cid:62) + λ β e β (cid:62) e β e , (28) ¯ U = 1 N e N e (cid:88) i =1 ( u ( e ) i − β (cid:62) e x ( e ) i ) , (29) ¯ S = 1 N e N e (cid:88) i =1 ( u ( e ) i − β (cid:62) e x ( e ) i )( u ( e ) i − β (cid:62) e x ( e ) i ) (cid:62) , (30)where N e is the number of instances of entity e . Note that the terms λ β e β (cid:62) e β e in (28) and F e in (27)come due to the dependence of the prior of β e on Λ e , see (9).10 .3 Gibbs Sampling of Precision Parameter of Weight Vector of Relation Features Recall that the λ β R is the diagonal value for the precision variable in the prior for weight vector β R and that the hyperprior of λ β R is gamma distribution with ﬁxed parameters µ and ν : p ( β R | λ β R ) = N ( β R | , λ β R I ) (31) p ( λ β R | µ, ν ) = G ( λ β R | µ, ν ) , (32)where G ( λ β R | µ, ν ) ∝ λ ν/ − β R exp( − ν µ λ β R ) . (33)The conditional probability is then p ( λ β R | β R , µ, ν ) ∝ p ( β R | λ β R ) p ( λ β R | µ, ν ) (34) ∝ λ F R / β R exp (cid:18) − λ β R β (cid:62) R β R (cid:19) λ ν/ − β R exp (cid:18) − ν µ λ β R (cid:19) (35) ∝ λ ( F R + ν ) / − β R exp (cid:18) − λ β R ν + µβ (cid:62) R β R µ (cid:19) (36) ∝ λ ( F R + ν ) / − β R exp (cid:18) − λ β R ( ν + µβ (cid:62) R β R )( F R + ν )2( F R + ν ) µ (cid:19) (37) ∝ G ( λ β R | ˜ µ, ˜ ν ) , (38)where ˜ ν = F R + ν, (39) ˜ µ = ( F R + ν ) µν + µβ (cid:62) R β R , (40)where F R is the dimensionality of the relation features of R . A.4 Derivation of Gaussian Mean and Precision for the Weight Vector of the Entity Features

The conditional probability for β e is p ( β e | µ e , Λ e , u , x , λ β e ) (41) ∝ exp (cid:18) −

12 tr[(( U − X β e ) (cid:62) ( U − X β e ) + λ β e β (cid:62) e β e )Λ e ] (cid:19) . (42)Next we use the link X (cid:62) U = ( X (cid:62) X + λ β e I ) ˆ β e (from the deﬁnition of ˆ β e ) and expand the innerpart of (42): ( U − X β e ) (cid:62) ( U − X β e ) + λ β e β (cid:62) e β e (43) = U (cid:62) U + β (cid:62) e X (cid:62) X β e − U (cid:62) X β e − β (cid:62) e X (cid:62) U + λ β e β (cid:62) e β e . (44) = U (cid:62) U + β (cid:62) e ( X (cid:62) X + λ β e I ) β e − U (cid:62) X β e − β (cid:62) e X (cid:62) U (45) = U (cid:62) U + β (cid:62) e ( X (cid:62) X + λ β e I ) β e − ˆ β (cid:62) e ( X (cid:62) X + λ β e I ) β e − β (cid:62) e ( X (cid:62) X + λ β e I ) ˆ β e (46) = U (cid:62) U + β (cid:62) e ( X (cid:62) X + λ β e I ) β e − ˆ β (cid:62) e ( X (cid:62) X + λ β e I ) β e − β (cid:62) e ( X (cid:62) X + λ β e I ) ˆ β e + ˆ β (cid:62) e ( X (cid:62) X + λ β e I ) ˆ β e − ˆ β (cid:62) e ( X (cid:62) X + λ β e I ) ˆ β e (47) = U (cid:62) U (cid:124) (cid:123)(cid:122) (cid:125) const +( β (cid:62) e − ˆ β e ) (cid:62) ( X (cid:62) X + λ β e I )( β (cid:62) e − ˆ β e ) − ˆ β (cid:62) e ( X (cid:62) X + λ β e I ) ˆ β e (cid:124) (cid:123)(cid:122) (cid:125) const (48)Next we plug the non-constant part back to (42) p ( β e | µ e , Λ e , u , x , λ β e ) (49) ∝ exp (cid:18) −

12 tr[(( β (cid:62) e − ˆ β e ) (cid:62) ( X (cid:62) X + λ β e I )( β (cid:62) e − ˆ β e ))Λ e ] (cid:19) . (50) ∝ exp (cid:18) −

12 vec( β e − ˆ β e ) (cid:62) (Λ e ⊗ ( X (cid:62) X + λ β e I )) vec( β e − ˆ β e ) (cid:19) , (51)where we can clearly see the precision and mean of β e .11 .5 Correctness of Noise Injection Sampler Let X , U , Λ e be matrices described in Gibbs sampling section of β e . Here we prove a more generalversion of the sampler where instead of precision matrix λ β e I we allow any positive deﬁnite matrix Λ ∈ R F × F . Let √ Λ be a matrix such that √ Λ √ Λ (cid:62) = Λ . Lemma A.1.

Let E ∈ R N e × D and E ∈ R F e × D be matrices where their each row is independentlygenerated from N ( , Λ − e ) and let the variable ˜ β be the solution to the linear system ( X (cid:62) X + Λ ) ˜ β = X (cid:62) ( U + E ) + √ ΛE , (52) then vec( ˜ β ) is distributed by multinomial Gaussian distribution with mean vec(( X (cid:62) X + Λ ) − X (cid:62) U ) and precision Λ e ⊗ ( X (cid:62) X + Λ ) .Proof. From ˜ β = ( X (cid:62) X + Λ ) − ( X (cid:62) ( U + E ) + √ ΛE ) (53)it is clear that ˜ β is distributed by Gaussian as it constructed by afﬁne transformations and sums ofGaussian variables. As E and E have zero mean we get E [ ˜ β ] = ( X (cid:62) X + Λ ) − ( X (cid:62) E [ U + E ] + √ Λ E [ E ]) (54) = ( X (cid:62) X + Λ ) − X (cid:62) U , (55)proving the correctness of the mean.For the precision we investigate the covariance between i and j column of ˜ β . In what follows weuse notation A i to denote the column i of matrix A . Let’s also denote K = ( X (cid:62) X + Λ ) − , givingus E [ ˜ β ] = KX (cid:62) U , then cov( ˜ β i , ˜ β j ) = E (cid:104) ( ˜ β i − E [ ˜ β i ])( ˜ β j − E [ ˜ β j ]) (cid:62) (cid:105) (56) = E (cid:104)(cid:16) K ( X (cid:62) ( U + E ) i + ( √ ΛE ) i ) − KX (cid:62) U i (cid:17) (57) · (cid:16) K ( X (cid:62) ( U + E ) j + ( √ ΛE ) j ) − KX (cid:62) U j (cid:17) (cid:62) (cid:21) = K E (cid:20)(cid:16) ( X (cid:62) E ) i + ( √ ΛE ) i (cid:17) (cid:16) ( X (cid:62) E ) j + ( √ ΛE ) j (cid:17) (cid:62) (cid:21) K (58) = KX (cid:62) E (cid:2) ( E ) i (( E ) j ) (cid:62) (cid:3) XK (59) + KX (cid:62) E (cid:2) ( E ) i (( E ) j ) (cid:62) (cid:3) √ Λ (cid:62) K + K √ Λ E (cid:2) ( E ) i (( E ) j ) (cid:62) (cid:3) XK + K √ Λ E (cid:2) ( E ) i (( E ) j ) (cid:62) (cid:3) √ Λ (cid:62) K . The expectations in the ﬁrst and last term of the equation (59) give E [( E ) i (( E ) j ) (cid:62) ] = (Λ − e ) i,j I N e (60) E [( E ) i (( E ) j ) (cid:62) ] = (Λ − e ) i,j I F e , (61)where I n is n -dimensional identity matrix. The middle two terms are equal to zero because E (cid:2) ( E ) i ( E ) (cid:62) j (cid:3) = due to E and E being zero mean and independent of each other. Thus,we get cov( ˜ β i , ˜ β j ) = KX (cid:62) (Λ − e ) i,j XK + K √ Λ (Λ − e ) i,j √ Λ (cid:62) K (62) = (Λ − e ) i,j K ( X (cid:62) X + Λ ) K (63) = (Λ − e ) i,j K (64) = (Λ − e ) i,j ( X (cid:62) X + Λ ) − . (65)This means the covariance matrix of vec( ˜ β ) is Λ − e ⊗ ( X (cid:62) X + Λ ) − and thus the precision is Λ e ⊗ ( X (cid:62) X + Λ ) . 12

10 20 30Latent dimension [D]0.620.630.640.650.660.670.68 R M S E Figure 7: The IC50 model with different latent dimensions.