Neural content-aware collaborative filtering for cold-start music recommendation
arXiv preprint [cs.IR]

Paul Magron, Cédric Févotte

Abstract
State-of-the-art music recommender systems are based on collaborative filtering, which builds upon learning similarities between users and songs from the available listening data. These approaches inherently face the cold-start problem, as they cannot recommend novel songs with no listening history. Content-aware recommendation addresses this issue by incorporating content information about the songs on top of collaborative filtering. However, methods falling in this category rely on a shallow user/item interaction that originates from a matrix factorization framework. In this work, we introduce neural content-aware collaborative filtering, a unified framework which alleviates these limits, and extends the recently introduced neural collaborative filtering to its content-aware counterpart. We propose a generative model which leverages deep learning both for extracting content information from low-level acoustic features and for modeling the interaction between user and song embeddings. The deep content feature extractor can either directly predict the item embedding, or serve as a regularization prior, yielding two variants (strict and relaxed) of our model. Experimental results show that the proposed method reaches state-of-the-art results on a cold-start music recommendation task. We notably observe that exploiting deep neural networks for learning refined user/item interactions outperforms approaches using a simpler interaction model in a content-aware framework.
Keywords— Content-aware recommendation, neural collaborative filtering, matrix factorization, cold-start problem.
1 Introduction

Music recommendation consists in predicting users' listening habits in order to suggest novel tracks that they might enjoy [1]. This task, which is at the core of many commercial platforms, has been extensively investigated, but remains challenging due to the complexity of music and to the lack of explicit and reliable user feedback [2]. In particular, leveraging the musical function at hand is necessary to produce recommendations that are tailored to a specific usage [3, 4]. More generally, the usage of contextual information is of paramount importance in order to adapt the recommendation to a particular location or event [5, 6]. Besides, it has been shown that psychological cues such as personality and emotional response play a major role in musical tastes and user behavior [7]. Recommender systems could then benefit from such findings in music psychology research [8, 9, 10]. Finally, music recommender systems face the cold-start problem [11], which is the topic of investigation of this paper: when a new song is added to a music streaming platform, it has no interaction data with the set of users. Consequently, the system cannot properly recommend this novel item to existing users. Even though this problem has fueled a large number of studies, it is still considered a major challenge in music recommendation research [12].

State-of-the-art approaches for music recommendation are based on collaborative filtering [2], a family of techniques which rely solely on users' listening history: the interest of a given user for a given song is predicted using similarities between various user profiles. The users' feedback is most often implicit and takes the form of playcounts, that is, how many times a given user has listened to a particular song. Collaborative filtering techniques consist in learning user and item embeddings from the data, where these embeddings respectively characterize the users' preferences and the items' attributes.
This listening history is however noisy and lacks negative feedback data [2, 13]. To alleviate this problem, weighted matrix factorization (WMF) techniques [14, 15] process binarized data computed from the raw playcounts, and associate a measure of confidence with these binarized feedbacks. This family of techniques has shown good performance in music recommender systems [16]. More recently, deep learning has been leveraged in collaborative filtering techniques with promising results [17, 18, 19]. While some earlier models rely on shallow architectures [20, 21], deep neural networks (DNNs) are now exploited for learning deep user/item embeddings [22] and interaction models that replace the matrix product in WMF [23], or a combination of both [24].

However, these approaches are agnostic to any form of item-related content, which becomes a major issue for new items. Indeed, since collaborative filtering methods only exploit the available user/item interaction data, they are not able to recommend a song without listening history, and therefore face the cold-start problem. This problem has been tackled with content-based methods, which aim to exploit additional information about the items for recommendation [25, 26, 27, 28]. Deep learning has been used as a tool to learn features from the content that can help collaborative filtering. In [16], the last hidden layer of an auto-tagging DNN is used as content feature. Conversely, in [29, 30], acoustic features are mapped to the learned item attribute matrix in order to be further used for cold-start recommendation. These approaches are however limited in performance since the user/item embeddings and the deep content feature extractor are learned in two distinct stages.

∗ This work is supported by the European Research Council (ERC FACTORY-CoG-6681839).
† IRIT, Université de Toulouse, CNRS, Toulouse, France (e-mail: fi[email protected]).
To alleviate this issue, several recent works have proposed to combine these steps into a single-stage approach, where the user/item embeddings and the content feature extractor are jointly learned. Examples of architectures for the deep content feature extractor include stacked denoising auto-encoders [31, 32], variational auto-encoders [33], recurrent neural networks [34] and convolutional networks [35, 36]. However, these approaches still rely on a simple matrix product for modeling the interaction between users and items, and therefore do not leverage DNNs for learning more complex interaction models.

In this work, we introduce neural content-aware collaborative filtering (NCACF), a unified framework which overcomes these limits, and extends the recently introduced neural collaborative filtering (NCF) [23] to its content-aware counterpart. We propose a generative model which leverages DNNs both for extracting content information from low-level acoustic features and for learning refined interaction models between user and item embeddings. Two variants of the model are considered. In the strict variant, the deep content feature extractor directly predicts the item embedding. Conversely, in the relaxed variant, it serves as a regularization prior for the item embedding. These two approaches allow for more flexibility and generalize previous models from the literature. In particular, when the interaction model reduces to a simple product (which is a building block of matrix factorization models), we derive an estimation algorithm which hybridizes closed-form updates using alternating least squares (ALS) for the embeddings and gradient descent (GD) for the deep content feature extractor. We further incorporate the embeddings in the network in order to estimate the whole model jointly using a single GD algorithm. Finally, we use this technique to estimate the more general NCACF model, which uses a deep user/item interaction model.
To assess the potential of these methods, we conduct experiments on the Million Song Dataset, a publicly available database for music information retrieval tasks. We observe that the proposed NCACF outperforms the baselines using a shallow user/item interaction model, which reveals its interest for cold-start music recommendation applications.

The rest of this paper is structured as follows. Section 2 presents the work related to our approach. The proposed NCACF model is then introduced in Section 3. The experimental protocol is detailed in Section 4 and the results are presented and discussed in Section 5. Finally, Section 6 draws some concluding remarks.

2 Related work

In this section, we present the related work upon which our methods build. We first describe the baseline WMF model, and then present approaches that incorporate deep content feature extraction in WMF. Finally, we present the collaborative filtering methods using deep learning for user/item interaction modeling. The notations used throughout this paper are summarized in Table 1.
2.1 Weighted matrix factorization

Let us consider a large data matrix Y ∈ R^{U×I} representing interactions between a set of U users and I items, where y_{u,i} denotes the interaction between user u and item i. Matrix factorization [37, 2] consists in decomposing the data Y as the product of two low-dimensional factors: a (transposed) user preferences matrix W ∈ R^{K×U} and an item attributes matrix H ∈ R^{K×I}, such that:

    Y ≈ W^T H.    (1)

The rank K of the decomposition is chosen such that K(U + I) ≪ UI to ensure dimensionality reduction. More specifically, weighted matrix factorization (WMF) [38, 15] is a variant of matrix factorization that is appropriate for handling implicit feedback data. In this variant, the factorization is performed on a binarized interaction data matrix R (that is, r_{u,i} = 1 if user u has interacted with item i and 0 otherwise). The factors are estimated by addressing the following optimization problem:

    min_{W,H} Σ_{u,i} c_{u,i} (r_{u,i} − w_u^T h_i)^2 + λ_W Σ_u ||w_u||^2 + λ_H Σ_i ||h_i||^2,    (2)

where the weight c_{u,i} quantifies the confidence in the (u,i)-th interaction [15], and where λ_W and λ_H are regularization hyperparameters. The loss in (2) can be minimized using the alternating least squares (ALS) algorithm [2, 39], which yields the following updates in closed form:

    ∀u ∈ {1, ..., U},  w_u ← (H diag(c_u) H^T + λ_W I_K)^{−1} H diag(c_u) r_u,    (3)
    ∀i ∈ {1, ..., I},  h_i ← (W diag(c_i) W^T + λ_H I_K)^{−1} W diag(c_i) r_i.    (4)

Table 1: Notations.

    U           Number of users.
    I           Number of items (= songs).
    K           Dimension of the user/item embeddings.
    L           Dimension of the content vectors (= acoustic features).
    P           Number of layers in the deep content feature extractor.
    Q           Number of layers in the deep interaction model.
    Y           User/item interaction matrix (raw playcounts).
    R           Binarized playcounts matrix.
    r_u, r_i    Binarized playcounts for user u and item i.
    C           Confidence matrix.
    c_u, c_i    Confidence for user u and item i.
    x_i         Acoustic feature vector for item i.
    W, H        User and item embedding matrices.
    w_u, h_i    Embedding vectors for user u and item i.
    φ_θ         Deep content feature extractor, with parameters θ.
    ψ_γ         User/item interaction model, with parameters γ.
    ^{−1}       Matrix inverse.
    ⊙           Element-wise vector or matrix multiplication.
    ^T          Vector or matrix transpose.
    I_K         Identity matrix of size K.
    1           Vector or matrix whose entries are all equal to 1.
    diag(a)     Diagonal matrix whose entries are given by the vector a.
    ||.||       Euclidean norm.

Once the user and item factors have been estimated, recommendation can be performed using the predicted interaction defined as r̂_{u,i} = w_u^T h_i, and constructing a recommendation list for each user by picking the items with the highest predicted interaction for this user. Even though the WMF model has shown good performance for recommender systems [2, 29], it only applies to songs for which some listening history is available, hence facing the cold-start problem.

2.2 Content-aware WMF

To alleviate the aforementioned cold-start problem, several works [16, 29] have proposed to leverage deep learning in conjunction with this WMF framework. These are illustrated in Fig. 1 and described hereafter. In [16], the authors propose to incorporate some content information in WMF. More precisely, they consider the following prior on the item attribute matrix:

    ∀i ∈ {1, ..., I},  h_i ∼ N(Φ z_i, λ_H^{−1} I_K),    (5)

where z_i is a content feature vector calculated beforehand, and Φ linearly maps this vector to the item embedding space. To obtain such a content feature vector, the authors first train an auto-tagging DNN, that is, a network that maps low-level acoustic features x_i to a set of tags. Once the tagging network is trained, the last hidden representation is used as content feature vector z_i in (5). This approach alleviates the cold-start problem: for novel items without listening history, binarized playcounts can still be predicted through r̂_{u,i} = w_u^T Φ z_i, from which recommendations can be made.
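Returning to the WMF model of Section 2.1, the closed-form ALS updates (3)–(4) can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming and hyperparameter choices (lam_w, lam_h), not the authors' code:

```python
import numpy as np

def wmf_als_step(R, C, W, H, lam_w=0.1, lam_h=0.1):
    """One ALS sweep for WMF: closed-form updates (3)-(4).

    R: (U, I) binarized playcounts, C: (U, I) confidence weights,
    W: (K, U) user factors, H: (K, I) item factors.
    The regularization weights lam_w and lam_h are illustrative values.
    """
    K, U = W.shape
    I = H.shape[1]
    for u in range(U):
        Cu = np.diag(C[u])                       # diag(c_u)
        A = H @ Cu @ H.T + lam_w * np.eye(K)     # left-hand side of (3)
        W[:, u] = np.linalg.solve(A, H @ Cu @ R[u])
    for i in range(I):
        Ci = np.diag(C[:, i])                    # diag(c_i)
        A = W @ Ci @ W.T + lam_h * np.eye(K)     # left-hand side of (4)
        H[:, i] = np.linalg.solve(A, W @ Ci @ R[:, i])
    return W, H
```

Since each update exactly minimizes the objective (2) with respect to one factor, a full sweep can only decrease the regularized loss.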
However, the quality of the content features is highly dependent on the performance of the auto-tagging network. While these methods have significantly improved in recent years [40, 41], they are still limited by the noisy nature of the tags, which are mostly user-annotated. Besides, the learned features are not necessarily relevant for the task at hand (i.e., recommendation instead of tagging), since they are neither estimated jointly with the user/item embeddings nor fine-tuned for this task.

In [29], the authors adopt a different strategy in order to leverage WMF for content-based recommendation. First, the factors W and H are estimated from the listening data using ALS updates, as described in Section 2.1. Then, in a second stage, they train a DNN whose purpose is to map input acoustic content features x_i to the estimated attributes h_i. To that end, they consider two possible losses. The first one is based on the mean square error between attributes:

    min_θ Σ_i ||h_i − φ_θ(x_i)||^2,    (6)
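As a point of comparison, when the extractor in (6) is restricted to a linear map (as Φ in (5)), the minimization admits a closed-form least-squares solution. A NumPy sketch, with a small ridge term of our own added to keep the solve well posed:

```python
import numpy as np

def fit_linear_content_map(X, H, lam=1e-6):
    """Minimize sum_i ||h_i - Phi x_i||^2 over a linear map Phi (cf. (6)).

    X: (L, I) acoustic content features, one column per item.
    H: (K, I) item attributes estimated by WMF.
    lam: small ridge term (our addition) for numerical stability.
    Returns Phi of shape (K, L).
    """
    L = X.shape[0]
    return H @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(L))
```

A nonlinear φ_θ, as used in the works discussed here, gives up this closed form and requires gradient-based training instead.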
Figure 1: Illustration of the content-aware WMF methods using deep content features. In [16], content features are extracted using an auto-tagging DNN and then incorporated as prior information in the WMF model (left). Conversely, [29] first obtains the WMF decomposition, and the learned attribute matrix is then used as target for training the deep content extractor (right).

The second one is based on the binarized playcounts weighted prediction error:

    min_θ Σ_{u,i} c_{u,i} (r_{u,i} − w_u^T φ_θ(x_i))^2,    (7)

where φ_θ denotes the deep content feature extractor with parameters θ. The authors use Mel spectrograms as input acoustic features x_i and a convolutional architecture for φ_θ, as it is shown to outperform a multilayer perceptron (MLP) structure. However, this approach strongly relies on the proper estimation of the factors W and H, which are needed for minimizing the objective function ((6) or (7)). Since the authors do not alternate the updates on the factors and the network parameters, the quality of the prior factorization sets a performance limit to this approach.

In a nutshell, these methods are mainly limited by the fact that they operate in two stages, which means that the content feature extractor and the factorization model are not trained jointly. As a result, in [16] the extracted features are not necessarily relevant for a recommendation task, and in [29] the capacity of the deep content extractor is inherently limited by the prior factorization.

To alleviate these drawbacks, several recent works have proposed to combine these steps into a single-stage approach, which allows for jointly learning the user/item embeddings and the deep content feature extractor. Various architectures have been proposed for modeling the DNN φ_θ, such as stacked denoising auto-encoders [31, 32] or variational auto-encoders [33].
These approaches estimate the whole model jointly, that is, with alternating updates on the factors and the network parameters (this will be described in detail in Section 3.2). Predictions are finally given by:

    r̂_{u,i} = w_u^T (h_i + φ_θ(x_i)),    (8)

where h_i corresponds to the item embedding obtained in a collaborative filtering framework. However, this approach relies on a shallow user/item interaction model, since the collaborative filtering part remains matrix factorization-based.

2.3 Neural collaborative filtering

In recent years, leveraging deep learning in collaborative filtering-based recommender systems has attracted a lot of attention [18]. The core idea of such approaches is to replace a shallow matrix factorization model such as presented in Section 2.1 with a DNN. Such a framework allows for learning refined user/item embeddings [22] and more complex interactions between these factors [23, 24]. More specifically, [23] proposes to integrate the user and item factors as embeddings in a neural network, which are then jointly learned using a gradient descent (GD) algorithm. First, they propose a model termed generalized matrix factorization, in which the predicted binarized playcounts are modeled as:

    r̂_{u,i} = σ(w_u^T h_i),    (9)

where σ is a non-linear activation function which replaces the identity function used in matrix factorization. They further replace the dot product with a concatenation of the user and item factors, on top of which an MLP is applied to learn complex interactions between these factors. They also combine both approaches, yielding a model termed neural matrix factorization. This approach based on learning deep user/item interaction models is shown to outperform the shallow matrix factorization model, which has further been confirmed in [24]. However, this so-called neural collaborative filtering (NCF) framework does not use any extra content information, and as such, faces the cold-start problem.
Several attempts were made to incorporate content information in such a framework [42, 43], but these approaches rely on a single layer for learning the user/item interaction, and therefore do not leverage the full potential of deep learning. Besides, their potential has been shown for traditional collaborative filtering tasks, but not for cold-start recommendation.

We propose to overcome this issue by introducing a unified model in which deep learning is leveraged both for learning deep user/item interactions (as presented in this section) and for extracting content features (as described in Section 2.2) for cold-start recommendation.

3 Neural content-aware collaborative filtering

3.1 Model

3.1.1 Implicit feedback data

We consider implicit feedback data in the form of a playcount matrix Y ∈ R^{U×I}. As recalled in Section 2.1, in order to better account for the over-dispersed nature of such data, it is common to consider the binarized playcount matrix R rather than the raw listening history. R indicates whether a user has listened to a song more than a certain number of times τ or not [16, 44, 45]:

    ∀(u,i),  r_{u,i} = 1 if y_{u,i} ≥ τ, and 0 otherwise.    (10)

Each binarized playcount is associated with a confidence:

    ∀(u,i),  c_{u,i} = 1 + α log(1 + y_{u,i}/ǫ),    (11)

whose parameters are commonly set at α = 2 and ǫ = 10^{−6} in the literature [2, 16]. Note that alternative schemes exist for defining the confidence, such as using constant values [31, 33]. We resorted to using (11) as it yielded slightly better results in our preliminary experiments.

3.1.2 Generative model

We model the binarized playcounts as the result of the interaction between the user and item factors. Similarly as in WMF, we consider a Gaussian generative model:

    ∀(u,i),  r_{u,i} ∼ N(ψ_γ(w_u, h_i), c_{u,i}^{−1}),    (12)

where c_{u,i} is the confidence defined in (11) and ψ_γ is the interaction model, which might depend on some parameters γ. The structure of this interaction model will be specifically described in Sections 3.2 and 3.3. As in (2), we consider the following prior on the user factor:

    ∀u ∈ {1, ..., U},  w_u ∼ N(0, λ_W^{−1} I_K),    (13)

where λ_W is a regularization hyperparameter.
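The binarization threshold τ and the confidence weighting (11) amount to a short preprocessing step, sketched below in NumPy (the function name is ours, and the values of τ, α and ǫ follow the ones discussed in the text):

```python
import numpy as np

def binarize_and_confidence(Y, tau=1, alpha=2.0, eps=1e-6):
    """Binarized playcounts and confidence weights (11).

    Y: (U, I) raw playcounts.
    Returns R (binary) and C (confidence), both of shape (U, I).
    """
    R = (Y >= tau).astype(float)                   # r_{u,i} = 1 iff y_{u,i} >= tau
    C = 1.0 + alpha * np.log(1.0 + Y / eps)        # confidence grows with the playcount
    return R, C
```

Note that zero playcounts still receive the baseline confidence 1, so that observed zeros contribute (weakly) to the loss.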
In order to exploit some content information, one needs to consider an additional assumption about the item factor h_i. Let us first consider a relaxed formulation, where this content information is incorporated in the model in the form of a prior:

    ∀i ∈ {1, ..., I},  h_i ∼ N(φ_θ(x_i), λ_H^{−1} I_K),    (14)

where φ_θ is a deep content feature extractor with parameters θ and x_i is a vector of low-level acoustic features for item i. We also consider a strict formulation, where the item attribute is directly predicted by the deep content feature extractor:

    ∀i ∈ {1, ..., I},  h_i = φ_θ(x_i),    (15)

which corresponds to (14) with λ_H → ∞.

3.1.3 Deep content feature extractor

As for the deep content feature extractor φ_θ, we consider an MLP architecture with P layers, that is:

    φ_θ(x_i) = φ_P(φ_{P−1}(...φ_1(x_i))),    (16)

where x_i is a set of low-level acoustic features (described in Section 4.2), and φ_p denotes the p-th layer of φ_θ (for clarity purposes, we do not explicitly write that φ_p depends on some parameters θ_p), such that:

    ∀p ∈ {1, ..., P},  φ_p(z) = σ_p(A_p z + b_p),    (17)

where A_p, b_p and σ_p respectively denote the weights, biases and activation function of the p-th layer. The activation function σ_p is chosen as the rectified linear unit (ReLU) for all layers except the last one, which uses the identity function. The choice of ReLU over alternative non-linear activation functions (such as the hyperbolic tangent or the sigmoid) is motivated by their consistent performance in the literature, notably their capability to reduce the gradient vanishing and overfitting problems [46]. Based on preliminary validation experiments and on prior work [16], each layer uses 1024 neurons, except the last one, which outputs the item factor of dimension K.
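The forward pass (16)–(17) amounts to a few lines of NumPy; this is a sketch with our own parameter container, not the training code:

```python
import numpy as np

def mlp_content_extractor(x, params):
    """Forward pass of phi_theta, eqs (16)-(17).

    x: (L,) acoustic feature vector for one item.
    params: list of P pairs (A_p, b_p), one per dense layer.
    ReLU is applied on every layer except the last (identity).
    """
    z = x
    for p, (A, b) in enumerate(params):
        z = A @ z + b
        if p < len(params) - 1:          # ReLU on all but the last layer
            z = np.maximum(z, 0.0)
    return z                             # item factor of dimension K
```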
We consider a total of P = 3 layers in our experiments. The goal of this paper is to assess the potential of combining deep content feature extraction with deep interaction modeling in a unified content-aware collaborative filtering framework, rather than specifically optimizing the network architectures. As such, we leave the usage of more advanced networks (e.g., convolutional layers [29], potentially acting on raw audio waveforms [36]) to future work.

3.1.4 Model estimation

Estimating the whole model in the maximum a posteriori sense results in the following optimization problem for the relaxed variant:

    min_{ξ_R} L_NCACF-R(ξ_R) := Σ_{u,i} c_{u,i} (r_{u,i} − ψ_γ(w_u, h_i))^2 + λ_W Σ_u ||w_u||^2 + λ_H Σ_i ||h_i − φ_θ(x_i)||^2,    (18)

where ξ_R = {θ, γ, W, H} is the whole set of parameters. The problem for the strict variant writes similarly:

    min_{ξ_S} L_NCACF-S(ξ_S) := Σ_{u,i} c_{u,i} (r_{u,i} − ψ_γ(w_u, φ_θ(x_i)))^2 + λ_W Σ_u ||w_u||^2,    (19)

with ξ_S = {θ, γ, W}. These formulations generalize several past works presented in Section 2. Depending on the design choice for the interaction model ψ_γ, we can propose several optimization schemes for problems (18) and (19). In general, we will rely on a gradient descent strategy, but when the interaction model reduces to a shallow dot product, it becomes possible to leverage closed-form updates for estimating W and H (see Section 3.2.1).

3.1.5 Prediction

Once the model is trained, recommendation is performed by computing the predicted binarized playcounts for all user/item pairs through:

    r̂_{u,i} = ψ_γ(w_u, φ_θ(x_i)).    (20)

Using (20) allows performing recommendation in a cold-start scenario, which is the framework considered in this paper. However, the proposed model can also be used for a traditional collaborative filtering task which does not suffer from the cold-start problem. With the relaxed variant, predictions can then be made through:

    r̂_{u,i} = ψ_γ(w_u, h_i).    (21)

Prior studies such as [16] have outlined that incorporating additional content information in collaborative filtering does not significantly improve the performance over a content-free approach when there is no cold-start issue. Consequently, in such a scenario, we shall advise to use the relaxed variant instead of the strict one, and to perform recommendation through (21), that is, using the item embedding h_i rather than the deeply learned content feature φ_θ(x_i) (as used in (20)).

3.2 Shallow interaction: content-aware matrix factorization

Let us first consider a particular case of the model described above, where the user/item interaction ψ_γ reduces to a matrix factorization model:

    ψ_γ(w_u, h_i) = w_u^T h_i.    (22)

The corresponding model is illustrated in Fig. 2 in its relaxed and strict variants. In such a scenario, the sets of parameters reduce to ξ_R = {θ, W, H} and ξ_S = {θ, W}, and problems (18) and (19) respectively rewrite:

    min_{ξ_R} L_MF-R(ξ_R) := Σ_{u,i} c_{u,i} (r_{u,i} − w_u^T h_i)^2 + λ_W Σ_u ||w_u||^2 + λ_H Σ_i ||h_i − φ_θ(x_i)||^2,    (23)

and

    min_{ξ_S} L_MF-S(ξ_S) := Σ_{u,i} c_{u,i} (r_{u,i} − w_u^T φ_θ(x_i))^2 + λ_W Σ_u ||w_u||^2.    (24)

Figure 2: Proposed NCACF models in the specific case where the interaction model reduces to a product between the user and item embeddings, in its relaxed (left) and strict (right) variants.

To address (23) and (24), we propose two optimization strategies, which we describe hereafter.

3.2.1 MF-Hybrid: alternating ALS and gradient descent

First, we propose a hybrid approach termed MF-Hybrid, which boils down to combining two different algorithms. Similarly to the WMF model and its content-aware counterpart presented in Sections 2.1 and 2.2, this strategy consists in estimating the user and item factors using an ALS scheme, which yields closed-form updates for W and H at each iteration.
On the other hand, the parameters θ of the deep content network φ_θ are estimated using a GD algorithm. This yields the following update scheme for the relaxed version:

    ∀u ∈ {1, ..., U},  w_u ← (H diag(c_u) H^T + λ_W I_K)^{−1} H diag(c_u) r_u,    (25)
    ∀i ∈ {1, ..., I},  h_i ← (W diag(c_i) W^T + λ_H I_K)^{−1} (W diag(c_i) r_i + λ_H φ_θ(x_i)),    (26)
    θ ← θ − η ∇_θ ( Σ_i ||h_i − φ_θ(x_i)||^2 ),    (27)

where ∇ denotes the gradient operator, η is the learning rate, r_u = [r_{u,1}, ..., r_{u,I}]^T, r_i = [r_{1,i}, ..., r_{U,i}]^T, and similarly for c_u and c_i. Using a similar approach for the strict variant yields:

    ∀u ∈ {1, ..., U},  w_u ← (φ_θ(X) diag(c_u) φ_θ(X)^T + λ_W I_K)^{−1} φ_θ(X) diag(c_u) r_u,    (28)
    θ ← θ − η ∇_θ ( Σ_{u,i} c_{u,i} (r_{u,i} − w_u^T φ_θ(x_i))^2 ).    (29)

Note that in practice, (27) and (29) might differ as one might use a stochastic variant of a GD algorithm with momentum. The proposed procedures are termed MF-Hybrid-Relaxed and MF-Hybrid-Strict and summarized in Algorithms 1 and 2, respectively. These are iterative schemes in which the first stage of each iteration consists in updating the embeddings using ALS updates. While the relaxed variant can use an arbitrary number N_als of ALS updates, its strict counterpart uses 1, since it only requires updating W instead of alternating between the two factors. Then, a number N_gd of epochs is performed for estimating the parameters θ using (27) or (29). Therefore, these algorithms generalize the related work [29], where the second stage consists of learning the mapping φ_θ using either (27) or (29). This approach corresponds to using Algorithms 1 and 2 with N = 1, which we will refer to as Baseline-Relaxed and Baseline-Strict, respectively.

In order to save some computational time in our practical implementation, we resort to a pretraining strategy. To that end, we first compute the embeddings using ALS updates without content, which corresponds to applying (25) and (26) with φ_θ = 0.
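One iteration of the relaxed hybrid scheme (25)–(27) can be sketched as follows. For brevity, the deep extractor φ_θ is replaced by a linear map Phi so that the gradient step (27) stays explicit; all names, default regularizations and step sizes are ours:

```python
import numpy as np

def mf_hybrid_relaxed_step(R, C, X, W, H, Phi, lam_w=0.1, lam_h=0.1, eta=1e-3):
    """One MF-Hybrid-Relaxed iteration, eqs (25)-(27), with phi_theta
    simplified to a linear map Phi: (K, L) acting on features X: (L, I).
    """
    K, U = W.shape
    I = H.shape[1]
    Phi_X = Phi @ X                                  # content predictions phi_theta(x_i)
    for u in range(U):                               # (25): closed-form user update
        Cu = np.diag(C[u])
        W[:, u] = np.linalg.solve(H @ Cu @ H.T + lam_w * np.eye(K),
                                  H @ Cu @ R[u])
    for i in range(I):                               # (26): content acts as a prior mean
        Ci = np.diag(C[:, i])
        W_Ci = W @ Ci
        H[:, i] = np.linalg.solve(W_Ci @ W.T + lam_h * np.eye(K),
                                  W_Ci @ R[:, i] + lam_h * Phi_X[:, i])
    grad = -2.0 * (H - Phi_X) @ X.T                  # (27), explicit for the linear case
    Phi = Phi - eta * grad
    return W, H, Phi
```

With a small enough step size η, each stage decreases the objective (23), so the iteration is monotone.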
Then, we perform GD updates using the relaxed (27) or strict (29) variants. This pretraining procedure is detailed in Algorithm 3. The pretrained embeddings W and H and network parameters θ are then used as inputs to the Baseline and MF-Hybrid methods for a fair comparison.

While r_i denotes the i-th column of R, r_u denotes its u-th row, which might appear as a slight notation abuse. Indeed, using the same notation convention, the u-th row should be denoted by [R^T]_u^T. Nonetheless, we decided to keep the notation r_u for brevity.

Algorithm 1: MF-Hybrid-Relaxed
    Inputs: Binarized playcounts R ∈ [0,1]^{U×I} and confidence C ∈ R_+^{U×I}
            Initial user preferences W ∈ R^{K×U} and item attributes H ∈ R^{K×I} matrices
            Initial deep content feature network parameters θ
            Number of overall, ALS and GD iterations N, N_als and N_gd
    for j = 1 to N do
        for j' = 1 to N_als do
            Update W using (25)
            Update H using (26)
        end
        for j' = 1 to N_gd do
            Update θ using (27)
        end
    end
    Output: W, H and θ.

Algorithm 2: MF-Hybrid-Strict
    Inputs: Binarized playcounts R ∈ [0,1]^{U×I} and confidence C ∈ R_+^{U×I}
            Initial user preferences matrix W ∈ R^{K×U}
            Initial deep content feature network parameters θ
            Number of overall and GD iterations N and N_gd
    for j = 1 to N do
        Update W using (28)
        for j' = 1 to N_gd do
            Update θ using (29)
        end
    end
    Output: W and θ.

3.2.2 MF-Uni: joint estimation with gradient descent

We now address the optimization problems (23) and (24) using a single GD algorithm. We integrate the user and item factors as embedding layers within a DNN, as illustrated in Fig. 2. This network consists of a collaborative filtering part and a content feature extractor part. The collaborative filtering part is fed with user and item indices to yield user and item factors through an embedding layer. These are then combined using an interaction model (which here reduces to a dot product) to yield predicted binarized playcounts. The item factor can be regularized using the deep content feature extractor branch (in the relaxed variant) or directly predicted by this branch (in the strict variant).

Even though the resulting models are equivalent to those presented in Section 3.2.1, this now allows for a unified learning approach where only the GD algorithm is used to train all the parameters, by considering as objective functions the losses given in (23) and (24). As a result, the gradient updates are:

    ξ_R ← ξ_R − η ∇_{ξ_R} L_MF-R(ξ_R),    (30)

and

    ξ_S ← ξ_S − η ∇_{ξ_S} L_MF-S(ξ_S).    (31)

The corresponding procedures will be referred to as MF-Uni-Relaxed and MF-Uni-Strict, respectively.

3.3 Deep interaction: NCACF

We now propose a more general version of the network where we leverage a deep interaction model in order to learn more refined interactions between embedding vectors [23, 47]. This approach is illustrated in Fig. 3. First, the two embeddings are combined into a single vector.
Drawing on prior work [48, 23, 47], we propose two embedding combinations, based on a multiplication or a concatenation of the embeddings:

    ∀(u,i),  v_{u,i} = w_u ⊙ h_i  (multiplication),  or  v_{u,i} = [w_u ; h_i]  (concatenation).    (32)

Algorithm 3: MF-Hybrid Pretraining
    Inputs: Binarized playcounts R ∈ [0,1]^{U×I} and confidence C ∈ R_+^{U×I}
            Number of ALS and GD iterations N_als and N_gd
    Initialize the user preferences matrix W ∈ R^{K×U} and the deep content feature network parameters θ
    for j = 1 to N_als do
        ∀u ∈ {1, ..., U},  w_u ← (H diag(c_u) H^T + λ_W I_K)^{−1} H diag(c_u) r_u
        ∀i ∈ {1, ..., I},  h_i ← (W diag(c_i) W^T + λ_H I_K)^{−1} W diag(c_i) r_i
    end
    for j = 1 to N_gd do
        Update θ using (27) (relaxed) or (29) (strict)
    end
    Output: W, H and θ.
Figure 3: Proposed neural content-aware collaborative filtering model, which uses a deep interaction model in addition to the deep content feature extractor, with its relaxed (left) and strict (right) variants.
The resulting combined vector v_{u,i} is of length K or 2K in the case of a multiplication or concatenation, respectively. Previous studies such as [24] have shown that concatenating the user and item factors overall outperforms multiplying them, regardless of the size and number of deep layers of the subsequent network. However, since our framework is different (we consider the cold-start scenario and we use an additional deep content extractor), we evaluate both combination methods experimentally. The combined vector is then fed as input to an MLP with Q layers, defined similarly to the deep content feature extractor (17), to output the predicted binarized playcounts:

ˆr_{u,i} = ψ_Q(ψ_{Q−1}(...ψ_1(v_{u,i}))), (33)

where ψ_q denotes the q-th layer of the network ψ_γ, such that:

∀q ∈ {1, ..., Q}, ψ_q(z) = σ_q(A_q z + b_q), (34)

where A_q, b_q and σ_q are the weights, biases and activation function of the q-th layer, respectively (for clarity, we use the same notations as for the deep content feature extractor φ_θ, but the corresponding parameters are different). The last layer has a particular structure: it does not use biases (that is, b_Q = 0) and uses a single neuron, in order to output the predicted binarized playcount of dimension 1.

As activation functions, we choose ReLU for the first Q − 1 layers and a sigmoid for the last layer, and we halve the size of each successive higher layer, as this allows for learning more abstract representations in higher layers of the network [23]. In practice, the last layer uses 1 neuron, the intermediate layers use 8 · 2^{Q−1−q} neurons (that is, 8 neurons for layer Q − 1, 16 for layer Q − 2, and so on), and the first layer takes an input of size K or 2K, depending on whether the embedding combination is a multiplication or a concatenation.

These models are referred to as NCACF-Relaxed and NCACF-Strict, and are estimated by addressing the optimization problems (18) and (19) with a GD algorithm, which yields the following updates:

ξ_R ← ξ_R − η ∇_{ξ_R} L_{NCACF-R}(ξ_R), (35)

and

ξ_S ← ξ_S − η ∇_{ξ_S} L_{NCACF-S}(ξ_S), (36)

where L_{NCACF-R} and L_{NCACF-S} are the losses defined in (18) and (19), respectively. In particular, when the interaction model is based on a multiplication of the embeddings and uses a single layer with a fixed weight vector a = 1 and no activation function (i.e., the sigmoid is replaced with the identity), this model becomes equivalent to the MF-Uni model presented in Section 3.2.2.

The proposed NCACF model and its variants are based on a Gaussian generative process for the playcounts, from which a weighted prediction error naturally arises as loss function (cf. (18) and (19)). However, this loss is known not to be the most appropriate choice when handling implicit feedback data such as playcounts [20]. Consequently, other losses such as the log loss have been preferred for such tasks [23, 22, 24]. Nonetheless, we use the weighted prediction error in this work, since the underlying Gaussian generative model makes it easy to incorporate prior information, which has motivated previous work on content-aware collaborative filtering such as [31, 34, 35] to adopt this framework. Besides, while alternative losses might improve the overall performance of all proposed methods, they would make the comparison with the baselines unfair [29]. We therefore leave the usage of alternative losses to future work. Note that this would be consistent with the design of alternative generative models, for instance by directly modeling the raw interaction data instead of the binarized playcounts.
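To make the interaction model concrete, here is a minimal NumPy sketch of the embedding combination (32) and the pyramid MLP of (33)–(34); the function names and the random-weight setup in the usage below are illustrative assumptions:

```python
import numpy as np

def combine(w_u, h_i, mode="concat"):
    """Embedding combination of (32): element-wise product or concatenation."""
    return w_u * h_i if mode == "mult" else np.concatenate([w_u, h_i])

def interaction_layer_sizes(Q):
    """Output sizes of layers 1..Q: 8 * 2^(Q-1-q) neurons for hidden layer q,
    and a single output neuron for layer Q."""
    return [8 * 2 ** (Q - 1 - q) for q in range(1, Q)] + [1]

def predict_rating(v, weights, biases):
    """Forward pass of (33)-(34): ReLU hidden layers, then a bias-free
    single-neuron sigmoid output layer, yielding a rating in (0, 1)."""
    z = v
    for A, b in zip(weights[:-1], biases):
        z = np.maximum(A @ z + b, 0.0)    # ReLU on the first Q-1 layers
    z = weights[-1] @ z                   # last layer: no bias (b_Q = 0)
    return 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
```

For instance, with K = 4 and a concatenation, the input is of size 2K = 8 and, with Q = 3, the successive layer sizes are [16, 8, 1], matching the halving scheme described above.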
In particular, refined statistical models based on the Poisson or compound Poisson distributions [50, 51, 52, 53] allow for robust modeling of such over-dispersed implicit feedback, and might yield more appropriate losses for this task.

Except for MF-Hybrid-Relaxed, learning the proposed models involves minimizing a weighted prediction error of the form:

Σ_{u=1}^{U} Σ_{i=1}^{I} c_{u,i} (r_{u,i} − ˆr_{u,i})². (37)

In practice, this problem is split over batches of data B ⊂ {1, ..., U} × {1, ..., I}, each of which is processed at each iteration of the GD algorithm. The loss then rewrites for each batch:

Σ_{(u,i)∈B} c_{u,i} (r_{u,i} − ˆr_{u,i})². (38)

Processing all the available data (both non-zero and zero playcounts) results in a relatively high computational burden, which might become cumbersome for very large datasets. To alleviate this issue, and to account for the sparsity of the dataset, it is common to resort to a negative sampling strategy [54, 36, 24], where only part of the null playcounts is considered, that is:

Σ_{(u,i)∈B⁺} c_{u,i} (r_{u,i} − ˆr_{u,i})² + Σ_{j∈B⁻_u} c_{u,j} (r_{u,j} − ˆr_{u,j})², (39)

where B⁺ is a batch of positive samples (that is, a set of user/item pairs with a non-zero playcount) and B⁻_u = {j = 1, ..., N⁻ | r_{u,j} = 0} is a set of N⁻ negative samples for user u. We tested this strategy in preliminary experiments, but obtained a very poor performance for all methods. Consequently, we rather consider a more traditional sampling strategy based on batches of items:

Σ_{i∈B_i} Σ_{u=1}^{U} c_{u,i} (r_{u,i} − ˆr_{u,i})². (40)

Note that an alternative common choice consists in considering batches of users instead of items. However, in a content-aware framework, considering batches of items makes it more straightforward to train both the collaborative filtering part and the content extractor part, since the latter operates on items (and not on users).
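The item-batch weighted loss of (40) can be sketched as follows (a NumPy sketch with dense matrices for illustration; in practice the playcounts are sparse and ˆR is produced by the network):

```python
import numpy as np

def item_batch_loss(R, R_hat, C, batch_items):
    """Weighted squared error of (40) over a batch of items B_i:
    sum over i in B_i and over all users u of c_{u,i} * (r_{u,i} - rhat_{u,i})^2.

    R, R_hat, C: (U, I) arrays of targets, predictions and confidences;
    batch_items: indices of the items in the current batch B_i.
    """
    cols = np.asarray(batch_items)
    diff = R[:, cols] - R_hat[:, cols]
    return float(np.sum(C[:, cols] * diff ** 2))
```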
where B_i ⊂ {1, ..., I}. Even though this approach imposes the use of a smaller batch size, since playcounts for all users are considered for each sample i ∈ B_i, it yields a significantly better performance for all tested methods. Consequently, we present the results obtained using this sampling strategy.

The proposed NCACF models and their variants generalize several collaborative filtering-based models in the literature. We summarize these in Table 2 along with their main characteristics, that is:
• whether the model is trained using a two-stage or single-stage approach;
• whether the training algorithm hybridizes two forms of updates or uses a single gradient descent;
• whether the model uses a shallow or a deep interaction model;
• and whether it is content-aware, and therefore suitable for cold-start recommendation, or not.

Table 2: Models encompassed by the proposed framework and their main characteristics.

Model     | Variant          | Single-stage training | Single GD algorithm | Deep interaction model | Content-aware
Baseline  | Relaxed [29]     | ✗ | ✗ | ✗ | ✓
Baseline  | Strict [29]      | ✗ | ✗ | ✗ | ✓
MF-Hybrid | Relaxed [31, 33] | ✓ | ✗ | ✗ | ✓
MF-Hybrid | Strict –         | ✓ | ✗ | ✗ | ✓
MF-Uni    | Relaxed –        | ✓ | ✓ | ✗ | ✓
MF-Uni    | Strict [36]      | ✓ | ✓ | ✗ | ✓
NCACF     | Relaxed [43]     | ✓ | ✓ | ✓ | ✓
NCACF     | Strict [42]      | ✓ | ✓ | ✓ | ✓

The NCACF-Relaxed model is notably an extension of the NCF approach [23] with an additional branch to account for content information, which makes it suitable for cold-start recommendation. In particular, when the interaction model is based on a multiplication of the embeddings and uses a single layer with a fixed weight vector a = 1, NCACF reduces to the variant termed generalized matrix factorization (cf. (9)).

When the interaction model in NCACF reduces to a dot product, the corresponding models (MF-Hybrid and MF-Uni) also encompass several past proposals. In particular, MF-Hybrid-Relaxed is similar to [31, 33], as these works use a one-stage approach for learning the model, based on a hybrid algorithm (using both ALS and GD). On the other hand, MF-Uni-Strict is somewhat equivalent to the model introduced in [36]. However, the authors in [36] do not test their approach in a specific cold-start setting, even though the model is suitable for this task. They consider a traditional collaborative filtering application, where they actually obtain worse results than with a simpler model based on matrix factorization.
This observation is reminiscent of prior conclusions [16] that such a task usually does not benefit from adding extra content information, as recalled in Section 3.1.5. Therefore, a cold-start recommendation task appears more appropriate for highlighting the potential of incorporating additional content information in collaborative filtering. Besides, we also proposed MF-Uni-Relaxed, which is more flexible than its strict counterpart and also allows for performing traditional collaborative filtering without a performance drop [16].

More generally, NCACF-Relaxed shares some similarity with the model presented in [43]. The main difference is that [43] uses only one shallow layer as interaction model (Q = 0) and the hyperbolic tangent as activation function. On the other hand, NCACF-Strict is similar to [18] when the interaction model reduces to a factorization machine without non-linear activation function. Once again, these techniques have not been tested for cold-start recommendation. Besides, NCACF has more expressive power since it may use an arbitrary (higher) number of layers for the interaction model.

In this section, we present the experiments conducted to assess the potential of the proposed model for music recommendation. Even though this method is suitable for traditional collaborative filtering tasks, as recalled in Section 3.1.4, we consider here a cold-start recommendation problem, where leveraging content information is necessary. All the computations have been performed using an NVIDIA Quadro RTX 6000 GPU with 24 GB of RAM.
As implicit feedback data, we use the Taste Profile dataset, which is part of the Million Song Dataset [55]. It provides listening counts for 1 million users and 380,000 songs. After removing duplicates, we kept the songs whose acoustic content features were available (see Section 4.2). The playcount data is binarized by retaining values of seven or higher as implicit feedback (that is, τ = 7 in (10)), since lower values yield feedback that is commonly considered as non-informative [45]. In accordance with (10), other values are set to 0. As in [16, 28, 45], in order to keep the computational burden low, we retain the top songs and users (sorted by playcounts) and remove inactive users and items: that is, we only keep users who listened to at least 20 songs, and songs which have been listened to by at least 50 users. The resulting dataset has a density of about 0.

Several previous works [29, 36] rely on extracting acoustic features directly from the raw audio waveforms. Due to copyright restrictions, the audio files corresponding to the songs in the Million Song Dataset cannot be distributed. Therefore, these prior works have relied on the 7digital platform, which allows one to download short audio samples for small evaluations and prototyping. However, these audio snippets are not easily obtained due to access restrictions of the 7digital API. Besides, they are limited to 30s-long snippets, which makes it difficult to extract features that encompass the full temporal dynamics of the songs. An alternative option consists in using the pre-extracted features provided with the Million Song Dataset and computed using the Echo Nest platform. While a subset of these features is easily available for prototyping, the full dataset is difficult to access, as it is stored on an Amazon snapshot whose access requires a non-free account. Finally, we could not access the Echo Nest API to recompute these features. In the spirit of reproducible research, we aimed at using easily accessible acoustic features.
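The preprocessing described above (binarization with τ = 7, then removal of inactive users and items) can be sketched as follows; the function name and the iterative filtering loop are our own assumptions, the latter because dropping songs can make further users fall below the activity threshold:

```python
import numpy as np

def preprocess_playcounts(P, tau=7, min_songs=20, min_users=50):
    """Binarize raw playcounts P (r = 1 iff count >= tau, as in (10)), then
    iteratively drop users with fewer than min_songs interactions and songs
    listened to by fewer than min_users users, until stable."""
    R = (P >= tau).astype(float)
    users = np.arange(R.shape[0])
    items = np.arange(R.shape[1])
    while True:
        keep_u = R.sum(axis=1) >= min_songs   # active users
        keep_i = R.sum(axis=0) >= min_users   # sufficiently popular songs
        if keep_u.all() and keep_i.all():
            return R, users, items
        R = R[keep_u][:, keep_i]
        users, items = users[keep_u], items[keep_i]
```

On the full Taste Profile data, the same logic would of course be applied on a sparse representation rather than a dense array.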
Therefore, we used the statistical spectrum descriptors (SSD) [56], since these features have shown good performance in several music information retrieval tasks such as genre recognition [57, 58], and are freely available online. The SSD is a set of statistical moments extracted from the sonogram of each song, a mid-level time-frequency representation that reflects the human loudness sensation. To compute the SSD, the authors in [57] first obtained the raw audio waveform of each song using the 7digital platform, from which they obtained a time-frequency representation by applying a short-time Fourier transform. The resulting frequency channels are then grouped into 24 psycho-acoustically motivated critical bands, accounting for several masking effects. Finally, seven statistical moments (mean, median, variance, skewness, kurtosis, min, and max) are computed in each critical band to account for the temporal dynamics of each song. This results in a set of low-level acoustic features x_i ∈ R^L for each song, with L = 168, which we scale to have zero mean and unit variance.

Following a previous music recommendation work using the same dataset [45], we set the user and item embedding dimension to K = 128, whether these are computed using ALS or integrated within the network. The user/item embeddings are initialized (in all models) with random values drawn from a centered normal distribution with a standard deviation of 10−. The deep content feature extractor and deep interaction model parameters θ and γ are initialized using the default Le Cun initialization scheme [59]. GD is performed using the Adam algorithm [60] with a learning rate of 10− and a batch size of 128, except for NCACF, where a batch size of 8 yielded better results.

http://millionsongdataset.com/tasteprofile/
https://us.7digital.com/
The Echo Nest developer API (http://developer.echonest.com/) was not accessible at the time of conducting this research.
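The SSD computation can be sketched as follows, assuming a sonogram over 24 critical bands is already available; the moment definitions below are the standard ones, and the exact normalizations used in [57] may differ:

```python
import numpy as np

def ssd_features(sonogram):
    """Statistical spectrum descriptor sketch: seven statistical moments
    (mean, median, variance, skewness, kurtosis, min, max) per critical band.

    sonogram: (24, T) loudness values in 24 critical bands over T frames.
    Returns a feature vector of length 24 * 7 = 168.
    """
    mean = sonogram.mean(axis=1)
    med = np.median(sonogram, axis=1)
    var = sonogram.var(axis=1)
    centered = sonogram - mean[:, None]
    std = np.sqrt(var) + 1e-12                 # guard against constant bands
    skew = (centered ** 3).mean(axis=1) / std ** 3
    kurt = (centered ** 4).mean(axis=1) / std ** 4
    feats = np.stack([mean, med, var, skew, kurt,
                      sonogram.min(axis=1), sonogram.max(axis=1)], axis=1)
    return feats.ravel()                       # flatten to length 168
```

In our setting, each resulting 168-dimensional vector x_i would then be standardized (zero mean, unit variance) across songs before being fed to the content extractor.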
For pretraining, Algorithm 3 uses 30 iterations of ALS (performance was not further improved beyond), and the number of GD epochs is determined on the validation set. The hyperparameters λ_W and λ_H are tuned on the validation set and chosen to maximize the NDCG metric (see Section 4.4). Note that to save some computational time, NCACF uses the same hyperparameters as MF-Uni. The subsequent number of epochs for all methods is selected on the validation set using the NDCG, with a maximum number of epochs set at 30.

We use the NDCG [61] as a measure of the overall quality of the recommendation. For each user u, we compute a ranked list of items in the test set based on the predicted preferences ˆr_u. We then compute the relevance of this list with respect to the ground truth preferences, that is, the observed user/item interactions in the test set: rel_{u,i} = 1 if the item i is in the listening history of user u (in the test set) and 0 otherwise. In order to favor recommendations that place the test items high in the list, we apply a discounted weight to the relevance, which yields the discounted cumulative gain (DCG):

DCG_u = Σ_{i=1}^{I′} rel_{u,i} / log₂(i + 1), (41)

where I′ denotes the length of the list of ranked items. To obtain a metric that accounts for all items in the test set, I′ should be equal to the number of songs in this set (that is, approximately 11,700 for our dataset). However, this considerably increases the computational load of evaluating the methods. Consequently, a common approach consists in considering the truncated list of top-I′ items instead [44]. Following similar work conducted on this dataset [45], we used I′ = 50 in our experiments, which is reasonable for a music recommendation application and significantly reduces the computational burden.

The normalized version of the DCG is then obtained as follows:

NDCG_u = DCG_u / IDCG_u, (42)

where IDCG_u is the ideal DCG, which corresponds to the DCG of a perfectly ranked list. Finally, these scores are averaged over users to yield an overall recommendation performance. The resulting NDCG ranges from 0 to 1 (higher is better), and will be expressed in % in our experiments for readability.

The goal of this experiment is to assess the potential of a joint training strategy over a two-stage approach. To that end, we compare MF-Hybrid and the Baselines, for which we first tune the hyperparameters λ_W and λ_H on the validation set (the same optimal values are used for both approaches). Note that even though the strict variant does not directly depend on λ_H, the pretraining strategy relies on estimating W beforehand: as such, Algorithm 2 is still impacted by this hyperparameter through its initial input W. The results presented in Figure 4 show that the relaxed variant exhibits a better performance and a smoother (quasi-monotonic) behavior for large values of λ_W and λ_H. On the other hand, the strict variant exhibits a less stable behavior overall, and performs better for small values of λ_H. Overall, these results emphasize the importance of a (relatively strong) regularization on the user preference factor W, which has a great impact on the recommendation performance.

Let us now investigate more specifically the impact of the number of GD epochs per iteration N_gd on the performance of the MF-Hybrid algorithms.
Indeed, as can be seen in Algorithms 1 and 2, these approaches consist in alternating ALS updates for estimating W (and potentially H for the strict variant) and GD updates for estimating the deep content extractor's parameters. As such, the number of GD epochs performed in each iteration in between ALS updates is expected to have an impact on the overall performance. We test the MF-Hybrid algorithms with N_gd = 1, 2 and 5, and note that even though the number of ALS updates in Algorithm 1 might also have an impact on the performance, we set it at N_als = 1 for a fair comparison with Algorithm 2, which only uses one update for W. The results are presented in Table 4. For the relaxed variant, we observe that N_gd = 1 and 2 yield similar results, even though the performance is slightly improved with N_gd = 2. The overall trend for the relaxed and strict variants is that increasing the number of epochs per iteration does not improve the performance. In turn, since the total number of epochs is fixed for a fair comparison, increasing N_gd implies decreasing the number of ALS updates, which yields a lower performance. The decrease in performance is more pronounced for the strict variant, so that performing a single GD epoch per update of W appears to be the optimal choice in both variants.

Finally, we compare the performance of these algorithms with the baselines, which is also presented in Table 4. We observe that the MF-Hybrid approaches outperform the corresponding baselines in both variants, with a similar performance improvement.

In this experiment, we compare the learning strategies of the MF models, that is, whether they are trained using a combination of ALS and GD (MF-Hybrid) or a single GD algorithm (MF-Uni). We first tune the hyperparameters λ_W (and λ_H for the relaxed variant) on the validation set and present the results in Figure 4. Interestingly, we observe that unlike MF-Hybrid, the performance in both variants is maximized when using small values for λ_W, which might explain why several previous studies incorporating the factors within the network did not specifically consider a regularization on the user preference matrix [23]. We use the selected value of λ_W (and λ_H = 1 for the relaxed variant) in the following experiment.

The performance of MF-Uni on the test set is reported in the last row of Table 4. We observe that in the strict variant, MF-Uni significantly outperforms MF-Hybrid, but a different behavior is observed in the relaxed variant, where MF-Uni yields the worst performance, which underlines the importance of choosing the learning algorithm in accordance with the underlying model. This might explain why a strict formulation has been preferred in similar studies for music recommendation such as [36].

We also present in Table 4 the total computational time for these methods. The fastest method is MF-Hybrid-Relaxed: indeed, training the network in this approach only involves the target attributes H instead of the set of all binarized playcounts R, as opposed to its strict counterpart and to the MF-Uni approach. Even though MF-Uni-Strict outperforms the baseline, MF-Hybrid-Relaxed is significantly faster to train and yields a larger performance improvement.
Therefore, this experiment shows that when (part of) the model is tractable, leveraging closed-form updates in a flexible formulation is preferable to treating all the factors as embedding layers in a deep network trained with GD, in terms of both computational time and performance. Incorporating user/item factors in the network becomes interesting when the interaction model no longer allows for deriving closed-form updates, as will be shown in the next experiment.

In this experiment, we study the impact of the interaction model on the performance of NCACF. To save some computational time, we use the same hyperparameter values as in the previous experiment. We consider two possible embedding combinations, based on multiplication or concatenation, and a variable number of layers Q in the interaction network. The results are presented in Figure 6. First, when considering embedding multiplication, we remark that NCACF-Relaxed benefits from a deep interaction model, since the performance obtained with several layers exceeds that of a shallow single-layer model. Q = 2 yields the best overall performance, and the NDCG then decreases for higher values of Q, even though it still outperforms a shallow single-layer model. In the strict variant however, NCACF does not benefit from extra interaction layers, and its performance remains lower than MF-Uni, which might be caused by overfitting when using a deep interaction model. Nonetheless, its performance remains superior to that of MF-Hybrid-Strict, which confirms our conclusion from the previous experiment and outlines the potential of fully learning a strict model with a single GD algorithm.

When the interaction model is based on a concatenation of the embeddings, a different trend can be observed. NCACF-Relaxed exhibits a relatively smooth behavior, since increasing the number of layers results in a noticeable improvement in terms of NDCG, but this performance remains significantly lower than that of the Baseline.
Note that alternative strategies (increasing the number of layers, using other activation functions or a greater number of epochs) did not further improve the recommendation performance. In turn, NCACF-Strict exhibits an overall poor performance, and we systematically observe overfitting when training this model.

These results show that several architectural design choices employed in “pure” collaborative filtering techniques cannot be systematically exploited in the more challenging cold-start scenario. Indeed, embedding concatenation has been shown to perform better than multiplication in several studies addressing pure collaborative filtering [23, 24]. We conducted additional experiments in such a pure collaborative filtering scenario, and observed that embedding concatenation performs comparably to (or slightly better than) multiplication, but we could not obtain a satisfactory performance in terms of cold-start recommendation. This might be explained by the ability of the (non-linear) concatenation technique to learn complex interaction patterns between users and items for which some shared feedback is available, which in turn does not properly generalize to unseen items, where a simpler multiplicative model yields better performance.

Note that this conclusion is reminiscent of recent work such as [62], where it is shown that a carefully tuned matrix factorization technique outperforms an NCF model using embedding concatenation. The authors in [62] also point out that a “leave-one-out” strategy, commonly employed for evaluating recommender systems [23, 24], is not appropriate for drawing general conclusions, notably when a different train/test splitting strategy is used, which is the case in our experiment as described in Section 4.1.
Further investigation is then required to fully identify to what extent a concatenation of the embeddings is appropriate for addressing the cold-start problem.

Finally, we compare the best performing method in each category, since several models introduced in this paper (MF-Hybrid and MF-Uni) encompass methods from the literature. The results are summarized in Table 5, where we also report the performance of a naive random recommendation approach, which constitutes a lower bound for the performance. NCACF yields the best performance, with an NDCG improvement of 3.

In this work, we introduced the neural content-aware collaborative filtering model for cold-start recommendation. We proposed several variants of this model in order to study the impact of the learning strategy (two-stage vs. joint learning) and algorithm (hybrid vs. unified), as well as the interaction model (shallow vs. deep). In particular, the proposed NCACF method fully leverages deep learning for modeling user/item interactions and extracting content information, and achieved state-of-the-art cold-start recommendation results on a large-scale and highly sparse music dataset.

In future work, we will investigate alternative generative models which better account for the over-dispersed nature of the data, such as compound Poisson models [50, 53, 63]. From these models should also arise alternative losses, such as the binary cross-entropy, which has been shown to be more appropriate for handling implicit feedback than the quadratic loss [24]. We will also explore more refined sampling strategies [64, 45] in order to reduce the computational time while keeping the performance high. Besides, our work can be extended to handle other modalities in order to fully exploit the available content, such as artist biographies [65] or musical tags [16, 66, 28], but also contextual data such as culture [67] or location [5].
Finally, alternative architectures could be exploited, such as convolutional networks that directly extract content features from the raw audio data [36] in an end-to-end fashion.
References

[1] Markus Schedl, Peter Knees, Brian McFee, Dmitry Bogdanov, and Marius Kaminskas, “Music recommender systems,” in
Recommender Systems Handbook , pp. 453–492. Springer, 2015.[2] Yifan Hu, Yehuda Koren, and Chris Volinsky, “Collaborative filtering for implicit feedback datasets,” in
Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM ’08) , December 2008,pp. 263–272.
[3] Markus Schedl, Peter Knees, and Fabien Gouyon, “New paths in music recommender systems research,” in
Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys ’17) , August 2017, p.392–393.[4] Thomas Sch¨afer and Claudia Mehlhorn, “Can personality traits predict musical style preferences? A meta-analysis,”
Personality and Individual Differences , vol. 116, pp. 265 – 273, October 2017.[5] Michael Gillhofer and Markus Schedl, “Iron Maiden while jogging, Debussy for dinner? an analysis of musiclistening behavior in context,” in
Proceedings of the 21st International conference on MultiMedia Modeling(MMM 2015) , January 2015, p. 380–391.[6] Zhiyong Cheng and Jialie Shen, “On effective location-aware music recommendation,”
ACM Transactionson Information Systems , vol. 34, no. 13, April 2016.[7] Bruce Ferwerda, Emily Yang, Markus Schedl, and Marko Tkalcic, “Personality traits predict music taxonomypreferences,” in
Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors inComputing Systems (CHI EA ’15) , April 2015, pp. 2241–2246.[8] Audrey Laplante, “Improving music recommender systems: What can we learn from research on musictastes?,” in
Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) ,October 2014, pp. 451–456.[9] Mohammad Soleymani, Aljanaki Aljanaki, Frans Wiering, and Remco C. Veltkamp, “Content-based musicrecommendation using underlying music preference structure,” in
Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), June 2015.
[10] Paul Magron and Cédric Févotte, “Leveraging the structure of musical preference in content-aware music recommendation,” arXiv preprint, https://arxiv.org/abs/2010.10276, 2021.
[11] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock, “Methods and metrics for cold-start recommendations,” in
Proceedings of the 25th Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval , August 2002, SIGIR ’02, p. 253–260.[12] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi, “Current challenges andvisions in music recommender systems research,”
International Journal of Multimedia Information Retrieval ,vol. 7, no. 2, pp. 95–116, June 2018.[13] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme, “BPR: Bayesian per-sonalized ranking from implicit feedback,” in
Proceedings of the Twenty-Fifth Conference on Uncertainty inArtificial Intelligence (UAI ’09) , June 2009, p. 452–461.[14] Ruslan Salakhutdinov and Andriy Mnih, “Probabilistic matrix factorization,” in
Proceedings of the 20th In-ternational Conference on Neural Information Processing Systems (NIPS’07) , December 2007, p. 1257–1264.[15] Yehuda Koren, Robert Bell, and Chris Volinsky, “Matrix factorization techniques for recommender systems,”
Computer , vol. 42, no. 8, pp. 30–37, August 2009.[16] Dawen Liang, Minshu Zhan, and Daniel P.W. Ellis, “Content-aware collaborative music recommendationusing pre-trained neural networks,” in
Proceedings of the 16th International Society for Music InformationRetrieval Conference (ISMIR) , October 2015.[17] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, GlenAnderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain,Xiaobing Liu, and Hemal Shah, “Wide & deep learning for recommender systems,” in
Proceedings of the 1stWorkshop on Deep Learning for Recommender Systems , September 2016, p. 7–10.[18] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay, “Deep learning based recommender system: A survey andnew perspectives,”
ACM Compututing Surveys , vol. 52, no. 1, February 2019.[19] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara, “Variational autoencoders forcollaborative filtering,” in
Proceedings of the 2018 World Wide Web Conference , 2018, WWW ’18, p. 689–698.[20] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton, “Restricted Boltzmann machines for collaborativefiltering,” in
Proceedings of the International Conference on Machine Learning (ICML) , June 2007, p.791–798.[21] Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester, “Collaborative denoising auto-encoders fortop-N recommender systems,” in
Proceedings of the Ninth ACM International Conference on Web Searchand Data Mining (WSDM ’16) , February 2016, p. 153–162.[22] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen, “Deep matrix factorizationmodels for recommender systems,” in
Proceedings of the Twenty-Sixth International Joint Conference onArtificial Intelligence (IJCAI-17) , August 2017, pp. 3203–3209.[23] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua, “Neural collaborativefiltering,” in
Proceedings of the 26th International Conference on World Wide Web (WWW ’17) , April 2017,pp. 173–182.
[24] Wanyu Chen, Fei Cai, Honghui Chen, and Maarten De Rijke, “Joint neural collaborative filtering for recommender systems,”
ACM Transactions on Information Systems , vol. 37, no. 39, pp. 1–30, December2019.[25] Kazuyoshi Yoshii, Masataka Goto, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno, “Hybrid col-laborative and content-based music recommendation using probabilistic model with latent user preferences,”in
Proceedings of the 7th International Society for Music Information Retrieval Conference (ISMIR) , October2006.[26] Chong. Wang and David M. Blei, “Collaborative topic modeling for recommending scientific articles,” in
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), August 2011, pp. 448–456.
[27] Yi Fang and Luo Si, "Matrix co-factorization for recommendation with rich side information and implicit feedback," in Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec '11), October 2011, pp. 65–69.
[28] Olivier Gouvert, Thomas Oberlin, and Cédric Févotte, "Matrix co-factorization for cold-start recommendation," in Proceedings of the 19th International Conference on Music Information Retrieval (ISMIR), September 2018, pp. 792–798.
[29] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen, "Deep content-based music recommendation," in Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS '13), December 2013, pp. 2643–2651.
[30] Xinxi Wang and Ye Wang, "Improving content-based and hybrid music recommendation using deep learning," in Proceedings of the 22nd ACM International Conference on Multimedia (MM '14), November 2014, pp. 627–636.
[31] Hao Wang, Naiyan Wang, and Dit-Yan Yeung, "Collaborative deep learning for recommender systems," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), August 2015, pp. 1235–1244.
[32] Sheng Li, Jaya Kawale, and Yun Fu, "Deep collaborative filtering via marginalized denoising auto-encoder," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM '15), October 2015, pp. 811–820.
[33] Xiaopeng Li and James She, "Collaborative variational autoencoder for recommender systems," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), August 2017, pp. 305–314.
[34] Hao Wang, Xingjian Shi, and Dit-Yan Yeung, "Collaborative recurrent autoencoder: Recommend while learning to fill in the blanks," in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS '16), December 2016, pp. 415–423.
[35] Huo Huan, Zhang Wei, Liu Liang, and Li Yang, "Collaborative filtering recommendation model based on convolutional denoising auto encoder," in Proceedings of the 12th Chinese Conference on Computer Supported Cooperative Work and Social Computing (ChineseCSCW '17), September 2017, pp. 64–71.
[36] Jongpil Lee, Kyungyun Lee, Jiyoung Park, Jangyeon Park, and Juhan Nam, "Deep content-user embedding model for music recommendation," July 2018.
[37] Benjamin Marlin and Richard S. Zemel, "The multiple multiplicative factor model for collaborative filtering," in Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), July 2004.
[38] Ruslan Salakhutdinov and Andriy Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proceedings of the 25th International Conference on Machine Learning (ICML '08), July 2008, pp. 880–887.
[39] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua, "Fast matrix factorization for online recommendation with implicit feedback," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), July 2016, pp. 549–558.
[40] Taejun Kim, Jongpil Lee, and Juhan Nam, "Sample-level CNN architectures for music auto-tagging using raw waveforms," in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 366–370.
[41] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, and Xavier Serra, "End-to-end learning for music audio tagging at scale," in Proceedings of the 19th International Conference on Music Information Retrieval (ISMIR), September 2018, pp. 637–644.
[42] Lei Zheng, Vahid Noroozi, and Philip S. Yu, "Joint deep modeling of users and items using reviews for recommendation," in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17), February 2017, pp. 425–434.
[43] Jianxun Lian, Fuzheng Zhang, Xing Xie, and Guangzhong Sun, "CCCFNet: A content-boosted collaborative filtering neural network for cross domain recommender systems," in Proceedings of the 26th International Conference on World Wide Web Companion, April 2017, pp. 817–818.
[44] Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei, "Modeling user exposure in recommendation," in Proceedings of the International World Wide Web Conference (WWW), April 2016, pp. 951–961.
[45] Viet-Anh Tran, Romain Hennequin, Jimena Royo-Letelier, and Manuel Moussallam, "Improving collaborative metric learning with efficient negative sampling," in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19), July 2019, pp. 1201–1204.
[46] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, April 2011, vol. 15, pp. 315–323.
[47] Feng Xue, Xiangnan He, Xiang Wang, Jiandong Xu, Kai Liu, and Richang Hong, "Deep item-based collaborative filtering for top-N recommendation," ACM Transactions on Information Systems, vol. 37, no. 3, April 2019.
[48] Xiaomeng Liu, Yuanxin Ouyang, Wenge Rong, and Zhang Xiong, "Item category aware conditional restricted Boltzmann machine based recommendation," in Proceedings, Part II, of the 22nd International Conference on Neural Information Processing (ICONIP 2015), 2015, pp. 609–616.
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
[50] Prem K. Gopalan, Laurent Charlin, and David Blei, "Content-based recommendations with Poisson factorization," in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS '14), December 2014, pp. 3176–3184.
[51] Prem K. Gopalan, Jake M. Hofman, and David Blei, "Scalable recommendation with hierarchical Poisson factorization," in Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI '15), July 2015, pp. 326–335.
[52] Mehmet E. Basbug and Barbara E. Engelhardt, "Hierarchical compound Poisson factorization," in Proceedings of the 33rd International Conference on Machine Learning (ICML), June 2016, pp. 1795–1803.
[53] Olivier Gouvert, Thomas Oberlin, and Cédric Févotte, "Recommendation from raw data with adaptive compound Poisson factorization," in Proceedings of the International Conference on Machine Learning (ICML), July 2019.
[54] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin, "Collaborative metric learning," in Proceedings of the 26th International Conference on World Wide Web (WWW '17), April 2017, pp. 193–201.
[55] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Brian Lamere, "The million song dataset," in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), October 2011, pp. 591–596.
[56] Thomas Lidy and Andreas Rauber, "Evaluation of feature extractors and psycho-acoustic transformations for music genre classification," in Proceedings of the 6th International Society for Music Information Retrieval Conference (ISMIR), September 2005.
[57] Alexander Schindler and Andreas Rauber, "Capturing the temporal domain in Echonest features for improved classification effectiveness," in Proceedings of the International Workshop on Adaptive Multimedia Retrieval (AMR), October 2012, pp. 214–227.
[58] Alexander Schindler, Rudolf Mayer, and Andreas Rauber, "Facilitating comprehensive benchmarking experiments on the million song dataset," in Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), October 2012, pp. 469–474.
[59] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller, Efficient BackProp, pp. 9–48, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[60] Diederik P. Kingma and Jimmy L. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), May 2015.
[61] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu, "A theoretical analysis of NDCG type ranking measures," in Proceedings of the 26th Conference on Learning Theory (COLT), April 2013, pp. 25–54.
[62] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson, "Neural collaborative filtering vs. matrix factorization revisited," in Proceedings of the Fourteenth ACM Conference on Recommender Systems (RecSys '20), September 2020, pp. 240–248.
[63] Olivier Gouvert, Thomas Oberlin, and Cédric Févotte, "Ordinal non-negative matrix factorization for recommendation," in Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), July 2020.
[64] Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong, "On sampling strategies for neural network-based collaborative filtering," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 2017, pp. 767–776.
[65] Sergio Oramas, Oriol Nieto, Mohamed Sordo, and Xavier Serra, "A deep multimodal approach for cold-start music recommendation," in Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems (DLRS 2017), August 2017, pp. 32–37.
[66] Yi Zuo, Jiulin Zeng, Maoguo Gong, and Licheng Jiao, "Tag-aware recommender systems based on deep neural networks," Neurocomputing, vol. 204, pp. 51–60, September 2016.
[67] Eva Zangerle, Martin Pichl, and Markus Schedl, "User models for culture-aware music recommendation: Fusing acoustic and cultural cues," Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 3, no. 1, pp. 1–16, March 2020.
Figure 4: NDCG on the validation set during pretraining of the MF-Hybrid algorithms for several values of the hyperparameters and for the relaxed (top) and strict (bottom) variants.
Figure 5: NDCG on the validation set for the MF-Uni models for several values of the hyperparameters and for the relaxed (top) and strict (bottom) variants.