Offline Recommender Learning Meets Unsupervised Domain Adaptation
Yuta Saito
Tokyo Institute of Technology [email protected]
Abstract
We study the offline recommender learning problem in the presence of selection bias in rating feedback. A current promising solution to address the bias is to use the propensity score. However, the performance of existing propensity-based methods can suffer significantly from propensity estimation bias. To solve this problem, we formulate recommendation with selection bias as unsupervised domain adaptation and derive a propensity-independent generalization error bound. We further propose a novel algorithm that minimizes the bound via adversarial learning. Our theory and algorithm do not depend on propensity scores, and thus can yield a well-performing rating predictor without requiring the true propensity information. Empirical evaluation demonstrates the effectiveness and real-world applicability of the proposed approach.
Introduction
It is essential to obtain a well-performing rating predictor from sparse rating feedback to recommend relevant items to users in recommender systems. An important challenge is that the missing mechanism of most real-world rating data is missing-not-at-random (MNAR) (Hernández-Lobato et al., 2014; Marlin & Zemel, 2009; Schnabel et al., 2016; Wang et al., 2019, 2018). The following two major factors create the MNAR mechanism. The first is the past recommendation policy. Suppose we relied on a policy that recommends popular items with high probability; then the ratings observed under that policy include more data on popular items (Bonner & Vasile, 2018; Yang et al., 2018). The other is the self-selection of users. For example, users tend to rate items for which they exhibit positive preferences, and the ratings of items with negative preferences are more likely to be missing (Marlin & Zemel, 2009; Schnabel et al., 2016).
Open Problems. The selection bias makes it difficult to learn rating predictors, as naive methods typically result in sub-optimal and biased recommendations with MNAR data (Schnabel et al., 2016; Steck, 2010; Wang et al., 2019). One of the most established solutions to the problem is the propensity-based approach. It defines the probability of each feedback being observed as a propensity score and obtains an unbiased estimator for the true metric of interest via inverse propensity weighting (Liang et al., 2016; Schnabel et al., 2016; Wang et al., 2019). Generally, its unbiasedness is desirable; however, it is valid only when true propensities are available. Previous studies utilized some amount of missing-completely-at-random (MCAR) test data to estimate propensity scores and ensure their empirical performance (Schnabel et al., 2016; Wang et al., 2019). However, in most real-world recommender systems, true propensities are mostly unknown, and MCAR data are unavailable as well, resulting in severe bias in the estimation of the loss function of interest and in the poor performance of the resulting recommender. We provide a detailed version of the related work section in Appendix B.
Two previously proposed methods aim to solve the challenges of propensity-based methods. The first is causal embeddings (CausE) by Bonner & Vasile (2018), which introduces a new regularization term to address the bias. However, this regularization is a heuristic approach and lacks a theoretical guarantee; thus, the reason why this method works is unclear. Moreover, CausE needs some amount of MCAR data by design; it cannot be generalized to a realistic setting with only MNAR data. The other method is 1BitMC by Ma & Chen (2019), which estimates propensity scores using only MNAR data, along with a theoretical guarantee.
However, the problem is that the method presupposes a debiasing procedure with inverse propensity weighting, and thus it cannot be used when there is a user–item pair with zero observation probability. Furthermore, the experiments on 1BitMC were conducted using only small datasets (Coat and MovieLens 100k) and prediction accuracy measures (MSE); accordingly, its performance on moderate-size benchmark data (e.g., Yahoo! R3 (Mnih & Salakhutdinov, 2008)) and on ranking tasks is unknown.
Contributions.
To overcome the limitations of the existing methods, we establish a new theory of MNAR recommendation inspired by the theoretical framework of unsupervised domain adaptation (Ben-David et al., 2010, 2007; Ganin & Lempitsky, 2015; Kuroki et al., 2018; Mansour et al., 2009). This framework aims to obtain a good predictor in settings where the feature distributions of the training and test sets differ. To this end, it utilizes distance metrics that measure the dissimilarity between probability distributions and do not depend on propensity scores (Ganin & Lempitsky, 2015; Ganin et al., 2016; Saito et al., 2017, 2018; Zhang et al., 2019). Thus, the framework is usable when true propensities are unknown and is expected to alleviate the issues caused by propensity estimation bias in the absence of MCAR data. Moreover, the method remains valid when there is a user–item pair with zero observation probability. However, the connection between MNAR recommendation and unsupervised domain adaptation has not been thoroughly investigated.
To bridge the two potentially related fields, we first define a novel discrepancy metric to measure the dissimilarity between two missing mechanisms of rating feedback. Subsequently, we derive a generalization error bound building on our discrepancy. Furthermore, we propose domain adversarial matrix factorization, which minimizes the derived theoretical bound in an adversarial manner. Our theoretical bound and algorithm are independent of propensity scores, and thus the issues related to propensity estimation bias are expected to be solved. Finally, we conduct extensive experiments using public real-world datasets.
In particular, we demonstrate that the proposed approach outperforms the existing propensity-based methods in terms of both rating prediction and ranking performance under a realistic situation in which the true propensities are inaccessible. These theoretical and empirical findings provide practitioners with guidelines on how to build recommender systems in an offline environment using only biased rating feedback.
In this study, we denote a user as u ∈ [m] and an item as i ∈ [n]. We also denote the set of all user–item pairs as D = [m] × [n]. Let R ∈ R^{m×n} denote a fixed true rating matrix, where each entry R_{u,i} represents the real-valued true rating of user u for item i.
We aim to develop an algorithm to obtain a better predicted rating matrix (or a hypothesis) \hat{R}, where each entry \hat{R}_{u,i} denotes the predicted rating value for (u, i). To this end, we formally define "the ideal loss function of interest" that should ideally be optimized to obtain a recommender as follows:

L^{\ell}_{\mathrm{ideal}}(\hat{R}) = \frac{1}{mn} \sum_{(u,i) \in \mathcal{D}} \ell\big(R_{u,i}, \hat{R}_{u,i}\big),   (1)

where ℓ(·, ·) : R × R → [0, Δ] denotes any L-Lipschitz loss function bounded by a positive constant Δ. For example, when ℓ(x, y) = (x − y)^2, Eq. (1) represents the mean squared error (MSE).
In real-life recommender systems, one cannot directly calculate the ideal loss function, as most rating data are missing. To precisely formulate this missing mechanism, we utilize two other matrices. The first is the propensity matrix, denoted as P ∈ P, where P represents the space of probability distributions over D. Each entry P_{u,i} ∈ [0, 1] is the propensity score of (u, i), and it represents the probability of the feedback being observed. Next, let O ∈ {0, 1}^{m×n} be an observation matrix in which each entry O_{u,i} ∈ {0, 1} is a Bernoulli random variable with expectation E[O_{u,i}] = P_{u,i}. If O_{u,i} = 1, the rating of the pair is observed; otherwise, it is unobserved. We will denote O ∼ P when the entries of O are realizations of the Bernoulli distributions defined by the entries of P.
For simplicity and without loss of generality, we assume M = \sum_{(u,i) \in \mathcal{D}} O_{u,i} for all observation matrices hereinafter. In our formulation, it is essential to approximate the ideal loss function using only observable feedback to obtain an effective recommender offline.
Given a set of observed rating feedback, the most basic estimator for the ideal loss is the naive estimator, which is defined as follows:

\hat{L}^{\ell}_{\mathrm{naive}}(\hat{R} \mid O) = \frac{1}{M} \sum_{(u,i) \in \mathcal{D}} O_{u,i} \cdot \ell\big(R_{u,i}, \hat{R}_{u,i}\big).   (2)

The naive estimator is the average of the loss values over the observed rating feedback. It is valid when the missing mechanism is MCAR, as it is unbiased against the ideal loss function with MCAR data (Schnabel et al., 2016; Steck, 2010). However, several previous studies indicated that this estimator exhibits a bias under general MNAR settings (i.e., E[\hat{L}^{\ell}_{\mathrm{naive}}] ≠ L^{\ell}_{\mathrm{ideal}} for some \hat{R}). Thus, one should use an estimator that addresses the bias as an alternative to the naive one (Schnabel et al., 2016; Steck, 2010).
To improve on the naive estimator, several previous studies applied IPS estimation to alleviate the bias of MNAR rating feedback (Liang et al., 2016; Schnabel et al., 2016). In causal inference, propensity scoring estimators are widely used to estimate the causal effects of treatments from observational data (Imbens & Rubin, 2015; Rosenbaum & Rubin, 1983; Rubin, 1974). In our formulation, one can derive an unbiased estimator for the loss function of interest by using the true propensity scores as follows:

\hat{L}^{\ell}_{\mathrm{IPS}}(\hat{R} \mid O) = \frac{1}{mn} \sum_{(u,i) \in \mathcal{D}} \frac{O_{u,i} \cdot \ell\big(R_{u,i}, \hat{R}_{u,i}\big)}{P_{u,i}}.   (3)

This estimator is unbiased against the ideal loss (i.e., E[\hat{L}^{\ell}_{\mathrm{IPS}}] = L^{\ell}_{\mathrm{ideal}} for any \hat{R}), and thus it is more desirable than the naive estimator in terms of bias.
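As a concrete illustration of the three losses above, the following NumPy sketch evaluates the ideal loss of Eq. (1), the naive estimator of Eq. (2), and the IPS estimator of Eq. (3) under the squared loss. All data here are synthetic toy values, not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 40

R = rng.integers(1, 6, size=(m, n)).astype(float)  # true rating matrix (hypothetical)
R_hat = R + rng.normal(0.0, 1.0, size=(m, n))      # some predicted rating matrix
P = rng.uniform(0.05, 0.9, size=(m, n))            # true propensities P_{u,i}
O = rng.binomial(1, P)                             # observation matrix O ~ P

def sq_loss(r, r_hat):
    return (r - r_hat) ** 2

# Eq. (1): ideal loss, averaged over ALL user-item pairs (needs the full R)
L_ideal = sq_loss(R, R_hat).mean()

# Eq. (2): naive estimator, averaged over the observed entries only
M = O.sum()
L_naive = (O * sq_loss(R, R_hat)).sum() / M

# Eq. (3): IPS estimator, inverse-propensity-weighted over the observed entries
L_ips = (O * sq_loss(R, R_hat) / P).sum() / (m * n)
```

By construction, E[L_ips] over draws of O equals L_ideal when the true P is used, which is the unbiasedness property stated after Eq. (3).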
However, its unbiasedness is ensured only when the true propensity score is available, and it can be biased with an inaccurate propensity estimator (see Lemma 5.1 of (Schnabel et al., 2016)). The bias of IPS typically occurs in most real-world recommender systems, as the missing mechanism of rating feedback can depend on user self-selection, which cannot be controlled by analysts and is difficult to estimate (Marlin & Zemel, 2009; Schnabel et al., 2016; Wang et al., 2018). Specifically, most previous studies estimated propensity scores by using some amount of MCAR test data to ensure empirical performance (Schnabel et al., 2016; Wang et al., 2019). However, this is infeasible owing to the costly annotation process (Gilotte et al., 2018; Joachims et al., 2017). The IPS approach is also valid only when the ratings of all user–item pairs are observed with non-zero probability (i.e., P_{u,i} ∈ (0, 1], ∀(u, i) ∈ D), which is difficult to verify (Ma & Chen, 2019). Therefore, we explore a theory and an algorithm independent of propensity scores and MCAR data, aiming to alleviate the issues related to propensity-based methods.
In this section, we first derive a propensity-independent generalization error bound of the ideal loss function. Then we propose an algorithm to minimize this bound from observable data. Throughout this section, we use the following
Rademacher complexity (Bartlett & Mendelson, 2002; Mohri et al., 2018), which captures the complexity of a class of functions by measuring its capability to correlate with random noise (Kuroki et al., 2018).

Definition 1 (Rademacher complexity). Let H be any set of real-valued matrices. Given i.i.d. samples with observed ratings {(u, i, R_{u,i}) | O_{u,i} = 1, (u, i) ∈ D}, the Rademacher complexity of H is defined as follows:

\mathfrak{R}_{P,M}(\mathcal{H}) := \mathbb{E}_{O \sim P} \, \mathbb{E}_{\sigma} \left[ \sup_{\hat{R} \in \mathcal{H}} \frac{1}{M} \sum_{(u,i):\, O_{u,i}=1} \sigma_{u,i} \hat{R}_{u,i} \right],

where σ = (σ_1, ..., σ_M) denotes a set of independent uniform random variables taking values in {+1, −1}.

There exist many results that bound the empirical version of the Rademacher complexity. For example, for a class of matrices with a max-norm constraint (i.e., H = {\hat{R} ∈ R^{m×n} : ||\hat{R}||_max ≤ A}), where A denotes the maximum max-norm, the bound is O(\sqrt{A(m+n)/M}) (Foygel & Srebro, 2011), which converges to zero as M increases.
To derive our theoretical upper bound, we first define a discrepancy measure between two different propensity matrices in the following.
Definition 2 (Propensity Matrix Divergence (PMD)). Let H be a set of real-valued predicted rating matrices and \hat{R} ∈ H be a specific prediction. The PMD between any two given propensity matrices P and P′ is defined as follows:

\psi_{\hat{R}, \mathcal{H}}(P, P') := \sup_{\hat{R}' \in \mathcal{H}} \left\{ L^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - L^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) \right\},

where L^{\ell}(\hat{R}, \hat{R}' \mid P) = \mathbb{E}_{O \sim P}\big[\hat{L}^{\ell}_{\mathrm{naive}}(\hat{R}, \hat{R}' \mid O)\big] = M^{-1} \sum_{(u,i) \in \mathcal{D}} P_{u,i} \cdot \ell\big(\hat{R}_{u,i}, \hat{R}'_{u,i}\big).

Notably, the PMD is well defined because ψ_{\hat{R},H}(P, P) = 0, ∀P ∈ P, and it satisfies both non-negativity and subadditivity. Moreover, it is independent of the true rating matrix, and thus it is calculable for any given pair of propensity matrices without the true rating information. However, in reality, the true propensity matrices (P and P′) are unknown, and it is necessary to estimate the PMD using their realizations (O and O′). The following lemma shows the deviation bound of the PMD.

Lemma 1.
Suppose any pair of propensity matrices (P and P′) and their realizations (O and O′) are given. Then, for any δ ∈ (0, 1) and any H, the following inequality holds with probability at least 1 − δ:

\left| \psi_{\hat{R}, \mathcal{H}}(P, P') - \psi_{\hat{R}, \mathcal{H}}(O, O') \right| \le 2L\big(\mathfrak{R}_{P,M}(\mathcal{H}) + \mathfrak{R}_{P',M}(\mathcal{H})\big) + 2\Delta \sqrt{\frac{\log(4/\delta)}{2M}}.
See Appendix A.2 for the proof.
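To make Definition 2 concrete, the sketch below evaluates the inner difference of naive disagreement losses for one fixed candidate \hat{R}'; the PMD itself is the supremum of this quantity over the whole hypothesis class, which the proposed algorithm approximates by gradient ascent. The data and the squared loss are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 40

R_hat = rng.normal(3.0, 1.0, size=(m, n))    # fixed prediction \hat{R} (hypothetical)
R_prime = rng.normal(3.0, 1.0, size=(m, n))  # one candidate \hat{R}' from H

# Realizations of two missing mechanisms: MNAR-like and MCAR-like
P_mnar = rng.uniform(0.02, 0.6, size=(m, n))
O_mnar = rng.binomial(1, P_mnar)
O_mcar = rng.binomial(1, 0.3, size=(m, n))

def naive_disagreement(R_a, R_b, O):
    """Empirical L^ell(R_a, R_b | O): mean squared disagreement on observed entries."""
    return (O * (R_a - R_b) ** 2).sum() / O.sum()

# Inner term of Definition 2 for this particular R'. The PMD takes the sup of
# this difference over all R' in H; note it needs no true ratings at all.
gap = (naive_disagreement(R_hat, R_prime, O_mcar)
       - naive_disagreement(R_hat, R_prime, O_mnar))
```

Because only predicted matrices and observation masks enter the computation, this quantity is calculable without the true rating matrix, exactly as stated after Definition 2.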
Subsequently, we use ψ_{\hat{R},H} to derive a propensity-independent upper bound of the ideal loss function.

Theorem 1 (Propensity-Independent Generalization Error Bound). Suppose two observation matrices with MCAR and MNAR mechanisms (O_MCAR ∼ P_MCAR and O_MNAR ∼ P_MNAR) are given. For any prediction matrix \hat{R} ∈ H and any δ ∈ (0, 1), the following inequality holds with probability at least 1 − δ:

L^{\ell}_{\mathrm{ideal}}(\hat{R}) \le \hat{L}^{\ell}_{\mathrm{naive}}\big(\hat{R} \mid O_{\mathrm{MNAR}}\big) + \psi_{\hat{R}, \mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}) + 2L\big(3\mathfrak{R}_{P_{\mathrm{MCAR}},M}(\mathcal{H}) + 2\mathfrak{R}_{P_{\mathrm{MNAR}},M}(\mathcal{H})\big) + 3\Delta \sqrt{\frac{\log(6/\delta)}{2M}}.   (4)
See Appendix A.4 for the proof. Existing studies assume that the hypothesis space H is finite (Schnabel et al., 2016; Wang et al., 2019), which is unrealistic; thus, we use this complexity measure to consider infinite hypothesis spaces.
The theoretical bound comprises the following four factors: (i) the naive loss on MNAR data, (ii) the empirical PMD, (iii) complexity measures of the hypothesis class, and (iv) a confidence term that depends on the value of δ. Notably, (i) and (ii) can be optimized using the observable data, as we describe in the next section. Additionally, (iii) and (iv) converge to zero as M increases with an appropriate hypothesis class. We empirically show that the bound is informative in the sense that optimizing (i) and (ii) results in a desired value of the ideal loss function.
Here, we describe the proposed algorithm. Building on the theoretical bound derived in Theorem 1, we consider minimizing the following objective:

\min_{\hat{R} \in \mathcal{H}} \underbrace{\hat{L}^{\ell}_{\mathrm{naive}}\big(\hat{R} \mid O_{\mathrm{MNAR}}\big)}_{\text{naive loss on MNAR feedback}} + \beta \cdot \underbrace{\psi_{\hat{R}, \mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}})}_{\text{discrepancy between MCAR and MNAR}} + \lambda \cdot \underbrace{\Omega(\hat{R})}_{\text{regularization}},   (5)

where β ≥ 0 denotes the trade-off hyperparameter between the naive loss and the discrepancy measure, Ω(·) is an arbitrary regularization function on the complexity of \hat{R}, and λ is the hyperparameter for the regularization term. This objective builds on the two controllable terms of the theoretical bound in Eq. (4). Note that all the components of Eq. (5) are independent of the propensity score, and thus we need not estimate the propensity score to optimize this objective.
First, by definition, we can empirically approximate ψ_{\hat{R},H} as

\hat{R}^{\star} = \arg\max_{\hat{R}' \in \mathcal{H}} \psi_{\hat{R}, \mathcal{H}}(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}) = \arg\max_{U', V'} \left\{ L^{\ell}\big(\hat{R}, \hat{R}'(U'_u, V'_i) \mid O_{\mathrm{MCAR}}\big) - L^{\ell}\big(\hat{R}, \hat{R}'(U'_u, V'_i) \mid O_{\mathrm{MNAR}}\big) \right\},   (6)

where we can obtain O_MCAR by uniformly sampling unlabeled user–item pairs from D, and U′ ∈ R^{m×d}, V′ ∈ R^{n×d} denote the user and item latent factors used to construct \hat{R}'. This optimization corresponds to accurately estimating the PMD from observable data.
Subsequently, using the derived \hat{R}^⋆, we can optimize Eq. (5) as follows:

\min_{U, V} \underbrace{\frac{1}{M} \sum_{(u,i):\, O_{u,i}=1} \ell\big(R_{u,i}, \hat{R}(U_u, V_i)\big)}_{\text{empirical loss on MNAR feedback}} + \beta \cdot \underbrace{\left\{ L^{\ell}\big(\hat{R}, \hat{R}^{\star} \mid O_{\mathrm{MCAR}}\big) - L^{\ell}\big(\hat{R}, \hat{R}^{\star} \mid O_{\mathrm{MNAR}}\big) \right\}}_{\text{approximated discrepancy between MCAR and MNAR}} + \lambda \cdot \underbrace{\big(\|U\|_F^2 + \|V\|_F^2\big)}_{\text{regularization}},   (7)

where U ∈ R^{m×d}, V ∈ R^{n×d} denote the user and item latent factors to be optimized, and \hat{R}(U_u, V_i) = U_u V_i^⊤ denotes the predicted rating value for (u, i).

Algorithm 1: Domain Adversarial Matrix Factorization (DAMF)
Input: MNAR observation matrix O_MNAR; trade-off parameter β; batch size T; number of steps k
Output: prediction matrix \hat{R} = U V^⊤
Randomly initialize U, V, U′, and V′
repeat
  Sample a mini-batch of size T from O_MNAR
  for n = 1, ..., k do
    Update U and V by gradient descent according to Eq. (7) with fixed \hat{R}^⋆ = U′(V′)^⊤
  end for
  Uniformly sample user–item pairs of size T from D to construct O_MCAR
  for n = 1, ..., k do
    Update U′ and V′ by gradient ascent according to Eq. (6) with fixed \hat{R} = U V^⊤
  end for
until convergence

Algorithm 1 describes the detailed procedure of DAMF. Note that our learning procedure is general and can be used in combination with any recommendation model for explicit feedback.

Table 1: Statistics of the datasets used in the experiments after preprocessing.
Notes: The sparsity is defined as M/mn. KL-div is the Kullback–Leibler divergence of the rating distributions between the train and test sets.
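The adversarial procedure of Algorithm 1 can be sketched end-to-end in NumPy under simplifying assumptions: full-batch updates instead of mini-batches, the squared loss, and a crude entry-wise clip on the adversary's factors as a stand-in for a norm-constrained hypothesis class (the unconstrained maximization in Eq. (6) would otherwise be unbounded). All data and hyperparameter values are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, d = 30, 20, 5
lr, beta, lam, k = 0.5, 0.5, 0.01, 5

# Synthetic MNAR training data: ratings R are observed where O_mnar == 1.
R = rng.integers(1, 6, size=(m, n)).astype(float)
P_mnar = rng.uniform(0.05, 0.5, size=(m, n))
O_mnar = rng.binomial(1, P_mnar)

U = rng.normal(0, 0.1, (m, d)); V = rng.normal(0, 0.1, (n, d))    # predictor factors
Up = rng.normal(0, 0.1, (m, d)); Vp = rng.normal(0, 0.1, (n, d))  # adversary factors

def fit_loss(U, V):
    """Naive squared loss on the observed MNAR entries (first term of Eq. (7))."""
    return (O_mnar * (U @ V.T - R) ** 2).sum() / O_mnar.sum()

initial_loss = fit_loss(U, V)

for step in range(200):
    # O_mcar: uniformly sampled user-item pairs; no rating labels are needed.
    O_mcar = rng.binomial(1, 0.3, size=(m, n))
    Mn, Mc = O_mnar.sum(), O_mcar.sum()

    # Minimization step (Eq. (7)): update U, V with the adversary R_star fixed.
    for _ in range(k):
        R_hat, R_star = U @ V.T, Up @ Vp.T
        G = 2 * (O_mnar * (R_hat - R) / Mn
                 + beta * (O_mcar * (R_hat - R_star) / Mc
                           - O_mnar * (R_hat - R_star) / Mn))
        gU, gV = G @ V + 2 * lam * U, G.T @ U + 2 * lam * V
        U, V = U - lr * gU, V - lr * gV

    # Maximization step (Eq. (6)): update U', V' with the prediction R_hat fixed.
    for _ in range(k):
        R_hat, R_star = U @ V.T, Up @ Vp.T
        Gp = 2 * (O_mcar * (R_star - R_hat) / Mc
                  - O_mnar * (R_star - R_hat) / Mn)
        gUp, gVp = Gp @ Vp, Gp.T @ Up
        # Gradient ascent, then clip to keep the adversary in a bounded class.
        Up = np.clip(Up + lr * gUp, -1.0, 1.0)
        Vp = np.clip(Vp + lr * gVp, -1.0, 1.0)
```

The alternating structure mirrors the pseudo-code: the inner loops correspond to the k gradient steps, and the clip plays the role that a max-norm constraint on H plays in the theory.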
We empirically test and analyze the proposed method via experiments with real-world datasets. Following previous works (Ma & Chen, 2019; Schnabel et al., 2016; Wang et al., 2019), we used the Yahoo! R3 and Coat datasets. These are explicit feedback data with five-star ratings and contain training and test sets with different user–item distributions; they therefore inherently include the MNAR problem. Table 1 summarizes some statistics of these datasets. For both datasets, the original data were divided into training and test sets. We randomly selected 10% of the original training set as the validation set. To the best of our knowledge, these are the only real-world recommendation datasets that contain test sets with MCAR rating data.
Baselines and Propensity Estimators.
We compared the following methods with our proposed DAMF: (i) Naive Matrix Factorization (MF) (Koren et al., 2009): it optimizes its model parameters by minimizing the naive loss in Eq. (2) with regularization terms and does not depend on the propensity score. (ii) Matrix Factorization with Inverse Propensity Score (MF-IPS) (Schnabel et al., 2016): it optimizes its model parameters by minimizing the IPS loss in Eq. (3) with regularization terms. (iii) Matrix Factorization with Doubly Robust (MF-DR) (Wang et al., 2019): it optimizes its model parameters by minimizing the doubly robust (DR) loss with regularization terms. (iv) CausE (Bonner & Vasile, 2018): it minimizes the sum of the naive loss and a regularization term that measures the divergence between the predictions on the MNAR and MCAR datasets. To calculate its loss function, we sampled 10% of the MCAR test data. Because MCAR data are generally unavailable, we report the results of this method simply as a reference.
For MF-IPS and MF-DR, we used user propensity, item propensity, user–item propensity, and 1BitMC (Ma & Chen, 2019) as the variants of propensity estimators and report the results with the best estimator for each dataset. Notably, these four propensity estimators are usable in real-world recommender systems, as they use only MNAR data. We provide the exact definitions of these propensity estimators in Section C.1.
We also report the results of MF-IPS and MF-DR with the true propensity score given by

P_{u,i} = \frac{P(R = R_{u,i} \mid O = 1) \, P(O = 1)}{P(R = R_{u,i})}.

Previous works calculated it by using 5% of MCAR test data (Schnabel et al., 2016; Wang et al., 2019), and we followed this procedure. Note that the true propensity is incalculable in most real-world situations, as it requires MCAR explicit feedback to estimate the denominator. Therefore, we report the results with the true propensity score only as a reference. The code for reproducing the results is provided as part of the supplementary materials and will be made public upon publication. Our code contains the implementations of the proposed method and all baselines, as well as the hyperparameters used for all methods. We describe the exact loss functions of MF-DR and CausE in Appendix C.3.
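The true-propensity formula above can be estimated with simple rating-value histograms: the conditional P(R = r | O = 1) from observed MNAR ratings, the marginal P(R = r) from a small MCAR sample, and the overall observation rate P(O = 1). The sketch below uses synthetic five-star samples; every distribution and the observation rate are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical five-star ratings: MNAR training ratings and a small MCAR sample.
mnar_ratings = rng.choice([1, 2, 3, 4, 5], size=5000, p=[0.05, 0.05, 0.1, 0.3, 0.5])
mcar_ratings = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.3, 0.25, 0.2, 0.15, 0.1])
p_obs = 0.05  # P(O = 1): overall fraction of observed pairs (assumed known)

values = np.arange(1, 6)
p_r_given_o = np.array([(mnar_ratings == v).mean() for v in values])  # P(R = r | O = 1)
p_r = np.array([(mcar_ratings == v).mean() for v in values])          # P(R = r), from MCAR

# Bayes' rule: P(O = 1 | R = r) = P(R = r | O = 1) * P(O = 1) / P(R = r)
propensity = p_r_given_o * p_obs / p_r
```

This makes the paper's caveat concrete: the denominator P(R = r) requires MCAR explicit feedback, so the "true" propensity is incalculable in most production settings.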
Table 2: Avg. metrics (± StdDev) — MSE, NDCG, and Recall for each method on the Yahoo! R3 and Coat datasets. [Table body omitted.]
Notes: The bold fonts indicate the best performance in each metric and dataset among the methods with only MNAR data. We tuned the dimensions of the latent factors and the L2-regularization parameter of all the methods using the validation sets. For DAMF and CausE, the trade-off hyperparameter β was also tuned. The combinations of hyperparameters were selected using an adaptive procedure implemented in Optuna (Akiba et al., 2019). We provide the hyperparameter search space in Appendix C.
Table 3: Comparing different propensity estimators by MSE

Datasets    Methods   user             item             user–item        1BitMC
Yahoo! R3   MF-IPS    1.7633 (+81.8%)  1.8054 (+86.2%)  1.8684 (+92.7%)  1.7320 (+78.6%)
            MF-DR     1.7915 (+76.1%)  1.8311 (+80.0%)  1.9703 (+93.7%)  1.8944 (+86.3%)
Coat        MF-IPS    1.2199 (+9.7%)   1.2127 (+9.1%)   1.2169 (+9.4%)   1.2193 (+9.7%)
            MF-DR     1.2070 (+9.8%)   1.2061 (+9.7%)   1.2070 (+9.7%)   1.2186 (+10.8%)

Notes: Performances relative to the true propensity score are in parentheses. The results suggest that MF-IPS and MF-DR with only MNAR data perform significantly worse than their counterparts with the true propensity.
We evaluated the prediction performance by using MSE and the ranking performance by using normalized discounted cumulative gain (NDCG) and Recall. Table 2 provides the averaged metrics and their standard deviations (StdDev) over 10 different initializations.
How do propensity-based methods perform with different propensity estimators?
First, consistent with the previous works (Schnabel et al., 2016; Wang et al., 2019), MF-IPS and MF-DR with the true propensity exhibit the best rating prediction performance. Furthermore, MF-DR outperforms MF-IPS, which is also consistent with the results of (Wang et al., 2019). However, as shown in Table 3, they perform poorly with the other propensity estimators, including 1BitMC, especially on the Yahoo! R3 dataset. In some cases, they underperform naive MF. Therefore, although propensity-based methods are potentially high-performing with the true propensity, they are highly sensitive to the choice of propensity estimator and negatively affected by propensity estimation bias.
Another important insight derived from the results is that the propensity-based methods are not effective in terms of ranking metrics. Even with the true propensity, they are outperformed by our proposed method on both datasets. This fact suggests that they do not improve the user experience compared with the naive method, although they predict the rating values satisfactorily.
How well does the proposed algorithm perform empirically?
Figure 1: Comparing the theoretical upper bound and the ideal loss function for DAMF. (a) Yahoo! R3; (b) Coat.

Next, we discuss the performance of DAMF. For the Yahoo! R3 dataset, DAMF performs the best in terms of MSE among the methods using only MNAR data. Specifically, it outperforms MF by 29.7%, MF-IPS by 25.7%, and MF-DR by 28.2% in MSE. Additionally, DAMF performs the best among all the methods, including those with the true propensity score, in terms of ranking metrics. In particular, it outperforms MF by 5.72%, MF-IPS by 3.15%, MF-DR by 3.87%, MF-IPS (true) by 2.27%, and MF-DR (true) by 2.28% in NDCG. These results demonstrate that the proposed method predicts the rating values well in the absence of true propensities and MCAR data. Furthermore, as suggested by its ranking performance, our method is useful for improving recommendation quality and user experience with only biased rating data.
For the Coat dataset, the differences in MSE between the proposed method and the other baselines were not as large as for Yahoo! R3. This is because, as shown in Table 1, the distributional shift between the training and test sets is much smaller than that in the Yahoo! R3 dataset, although its test data are ensured to be MCAR by the collection process. Nonetheless, DAMF outperforms MF by 6.39%, MF-IPS by 5.86%, and MF-DR by 5.29% in MSE. Moreover, it performs the best in ranking metrics, improving on MF by 2.60%, MF-IPS by 3.52%, MF-DR by 1.99%, MF-IPS (true) by 2.43%, and MF-DR (true) by 1.39% in NDCG. These results demonstrate that the proposed method works satisfactorily even when the dataset is small and the level of bias is not large. Note that it is reasonable to assume that there exists a user–item pair with zero observation probability in Coat. This is because the training set was collected via the workers' self-selection, and male workers did not provide ratings of women's coats and vice versa.
Thus, this result suggests the stability and adaptability of the proposed method to data containing a user–item pair with P_{u,i} = 0. Conversely, the performance of the propensity-based methods on the Coat dataset is not theoretically grounded, as the training and test sets of the Coat dataset do not overlap.
How informative is the theoretical upper bound in Theorem 1?
Finally, we investigate the correlation between the propensity-independent upper bound in Eq. (4) and the ideal loss function in Eq. (1). Figure 1 depicts the upper bound and the ideal loss (in terms of MSE) during the training of DAMF. First, it is evident that the upper bound of the ideal loss is effectively minimized by our adversarial learning procedure. Next, the figure suggests that the upper bound correlates well with the ideal loss function. Thus, minimizing our theoretical bound is a valid approach toward improving the recommendation quality on the MCAR test set. However, for both datasets, there is a gap between the bound and the ideal loss in the initial part of the training steps. This observation suggests that our theory and algorithm can be further improved.
In summary, propensity-based methods are significantly affected by the choice of propensity estimators and exhibit poor performance when the true propensity is unavailable. The proposed method significantly outperforms the other methods that use only MNAR data in the rating prediction task. Moreover, it considerably outperforms all the methods in terms of ranking performance measures, thereby suggesting its ability to improve the user experience. Figure 1 validates that our theory is useful for constructing well-performing recommendation algorithms. These results demonstrate the real-world applicability of the proposed upper bound minimization approach.
Conclusion
We explored the problem of learning rating predictors from MNAR explicit feedback. To this end, we derived a propensity-independent generalization error bound of the loss function of interest and proposed an algorithm that minimizes the bound via adversarial learning. Through experiments, we demonstrated that the proposed method significantly outperformed the baselines in terms of rating prediction and ranking measures when true propensities are inaccessible.
References
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pp. 2623–2631, New York, NY, USA, 2019. ACM.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pp. 137–144, 2007.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
Stephen Bonner and Flavian Vasile. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys '18, pp. 104–112, New York, NY, USA, 2018. ACM.
Rina Foygel and Nathan Srebro. Concentration-based guarantees for low-rank matrix reconstruction. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 315–340, 2011.
Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1180–1189, Lille, France, 2015. PMLR.
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 198–206. ACM, 2018.
José Miguel Hernández-Lobato, Neil Houlsby, and Zoubin Ghahramani. Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pp. 1512–1520, 2014.
Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 781–789. ACM, 2017.
Joseph D. Y. Kang and Joseph L. Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. arXiv preprint arXiv:1809.03839, 2018.
Jongyeong Lee, Nontawat Charoenphakdee, Seiichi Kuroki, and Masashi Sugiyama. Domain discrepancy measure using complex models in unsupervised domain adaptation. arXiv preprint arXiv:1901.10654, 2019.
Dawen Liang, Laurent Charlin, and David M. Blei. Causal inference for recommendation. In Causation: Foundation to Application, Workshop at UAI, 2016.
Wei Ma and George H. Chen. Missing not at random in matrix completion: The effectiveness of estimating missingness probabilities under a low nuclear norm assumption. In Advances in Neural Information Processing Systems, pp. 14871–14880, 2019.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems, pp. 5–12. ACM, 2009.
Andriy Mnih and Ruslan R. Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2008.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies.
Journal of educational Psychology , 66(5):688, 1974.Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised do-main adaptation. In Doina Precup and Yee Whye Teh (eds.),
Proceedings of the 34th InternationalConference on Machine Learning , volume 70 of
Proceedings of Machine Learning Research , pp.2988–2997, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/saito17a.html .Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifierdiscrepancy for unsupervised domain adaptation. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pp. 3723–3732, 2018.Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims.Recommendations as treatments: Debiasing learning and evaluation. In Maria Florina Balcanand Kilian Q. Weinberger (eds.),
Proceedings of The 33rd International Conference on Ma-chine Learning , volume 48 of
Proceedings of Machine Learning Research , pp. 1670–1679, NewYork, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/schnabel16.html .Harald Steck. Training and testing of recommender systems on data missing not at random. In
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and datamining , pp. 713–722. ACM, 2010.Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domainadaptation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ,pp. 7167–7176, 2017.Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. Doubly robust joint learning for recommen-dation on data missing not at random. In
International Conference on Machine Learning , pp.6638–6647, 2019. 10ixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. The deconfounded recommender:A causal inference approach to recommendation.
CoRR , abs/1808.06581, 2018. URL http://arxiv.org/abs/1808.06581 .Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. Unbiasedoffline recommender evaluation for missing-not-at-random implicit feedback. In
Proceedingsof the 12th ACM Conference on Recommender Systems , RecSys ’18, pp. 279–287, New York,NY, USA, 2018. ACM. ISBN 978-1-4503-5901-6. doi: 10.1145/3240323.3240355. URL http://doi.acm.org/10.1145/3240323.3240355 .Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm fordomain adaptation. In
International Conference on Machine Learning , pp. 7404–7413, 2019.11
A Omitted Proofs
Table 4: Summary of notations used in this paper.

$u, i$ : Indices for a user and an item in a recommender system.
$R$, $R_{u,i}$ : The true rating matrix and its entry for $(u, i)$.
$\hat{R}$, $\hat{R}_{u,i}$ : A predicted rating matrix and its entry for $(u, i)$.
$P$, $P_{u,i}$ : A propensity matrix and its entry for $(u, i)$, called the propensity score.
$O$, $O_{u,i}$ : An observation matrix and its entry for $(u, i)$.
$\mathcal{H}$ : A class of real-valued matrices, called the hypothesis space.
$\mathfrak{R}_{P,M}(\mathcal{H})$ : The Rademacher complexity of $\mathcal{H}$ over $P$.
$\psi_{\hat{R},\mathcal{H}}$ : The proposed divergence that measures the difference between two propensity score matrices.
$\ell$ : An $L$-Lipschitz bounded loss function.
$\mathcal{L}^{\ell}_{\mathrm{ideal}}$ : The ideal loss function of interest.
A.1 Uniform Deviation Bound

Lemma 2 (Rademacher Generalization Bound; a modified version of Theorem 3.3 in (Mohri et al., 2018)). Let $\mathcal{F} = \{f : \mathcal{D} \to [0, \Delta]\}$ be a class of bounded functions, where $\Delta > 0$ is a positive constant, and let $\{(u, i, R_{u,i}) \mid O_{u,i} = 1, (u, i) \in \mathcal{D}\}$ be an i.i.d. sample drawn from $P$ of size $M$. Then, for any $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$:

$$\sup_{f \in \mathcal{F}} \Bigg| \underbrace{\frac{1}{M} \sum_{(u,i): O_{u,i} = 1} f(u, i)}_{(a)} - \underbrace{\frac{1}{M} \sum_{(u,i) \in \mathcal{D}} P_{u,i} \cdot f(u, i)}_{(b)} \Bigg| \le 2 \mathfrak{R}_{P,M}(\mathcal{F}) + \Delta \sqrt{\frac{\log(2/\delta)}{2M}}, \quad (8)$$

where $(a)$ is the empirical mean of a function $f$ and $(b)$ is its expectation over $P$.

A.2 Proof of Lemma 1
Proof.
For any given real-valued prediction matrix $\hat{R}$, we have

$$\begin{aligned}
\Big| \psi_{\hat{R},\mathcal{H}}(P, P') - \psi_{\hat{R},\mathcal{H}}(O, O') \Big|
&= \Bigg| \sup_{\hat{R}' \in \mathcal{H}} \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) \Big\} - \sup_{\hat{R}' \in \mathcal{H}} \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big\} \Bigg| \\
&\le \sup_{\hat{R}' \in \mathcal{H}} \Big| \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) \Big\} - \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big\} \Big| \\
&= \sup_{\hat{R}' \in \mathcal{H}} \Big| \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) \Big\} - \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big\} \Big| \\
&\le \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) \Big\} - \Big\{ \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big\} \Big| \\
&\le \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) \Big| + \sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big|. \quad (9)
\end{aligned}$$

The two deviations in the last line can be bounded by Lemma 2, where we regard $\mathcal{H}' := \{(u, i) \mapsto \ell(\hat{R}_{u,i}, \hat{R}'_{u,i}) \mid \hat{R}, \hat{R}' \in \mathcal{H}\}$ as $\mathcal{F}$ in Lemma 2; each of the following inequalities holds with probability at least $1 - \delta/2$:

$$\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) \Big| \le 2 \mathfrak{R}_{P,M}(\mathcal{H}') + \Delta \sqrt{\frac{\log(4/\delta)}{2M}}, \quad (10)$$

$$\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big| \le 2 \mathfrak{R}_{P',M}(\mathcal{H}') + \Delta \sqrt{\frac{\log(4/\delta)}{2M}}. \quad (11)$$

Then, because $\mathfrak{R}_{P,M}(\mathcal{H}') \le L \mathfrak{R}_{P,M}(\mathcal{H})$ by the result of Corollary 5 of (Mansour et al., 2009), we have

$$\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O\big) \Big| \le 2 L \mathfrak{R}_{P,M}(\mathcal{H}) + \Delta \sqrt{\frac{\log(4/\delta)}{2M}}, \quad (12)$$

$$\sup_{\hat{R}, \hat{R}' \in \mathcal{H}} \Big| \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R}, \hat{R}' \mid O'\big) \Big| \le 2 L \mathfrak{R}_{P',M}(\mathcal{H}) + \Delta \sqrt{\frac{\log(4/\delta)}{2M}}. \quad (13)$$

Finally, combining Eq. (9), Eq. (12), and Eq. (13) with the union bound completes the proof. $\Box$

A.3 Proof of Additional Lemmas
Here, we state the generalization error bound under an ideal MCAR environment.
Lemma 3 (Generalization Error Bound under MCAR Observation). Suppose that an MCAR observation matrix $O_{\mathrm{MCAR}} \sim P_{\mathrm{MCAR}}$, where $P_{u,i} = \mathbb{E}[O_{u,i}] = M / |\mathcal{D}|, \ \forall (u, i) \in \mathcal{D}$, and any prediction matrix $\hat{R} \in \mathcal{H}$ are given. Then, for any $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$:

$$\mathcal{L}^{\ell}_{\mathrm{ideal}}\big(\hat{R}\big) \le \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}\big(\hat{R} \mid O_{\mathrm{MCAR}}\big) + 2 L \mathfrak{R}_{P,M}(\mathcal{H}) + \Delta \sqrt{\frac{\log(2/\delta)}{2M}}. \quad (14)$$
Proof.
By using Lemma 2, we have

$$\sup_{\hat{R} \in \mathcal{H}} \Big| \mathcal{L}^{\ell}_{\mathrm{ideal}}\big(\hat{R}\big) - \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}\big(\hat{R} \mid O_{\mathrm{MCAR}}\big) \Big| \le 2 \mathfrak{R}_{P,M}(\ell \circ \mathcal{H}) + \Delta \sqrt{\frac{\log(2/\delta)}{2M}}$$

with probability at least $1 - \delta$ for any $\delta \in (0, 1)$, where we regard $\ell \circ \mathcal{H} := \{(u, i) \mapsto \ell(R_{u,i}, \hat{R}_{u,i}) \mid \hat{R} \in \mathcal{H}\}$ as $\mathcal{F}$ in Lemma 2. Then, by Talagrand's lemma (Lemma 5.7 of (Mohri et al., 2018)), we have $\mathfrak{R}_{P,M}(\ell \circ \mathcal{H}) \le L \mathfrak{R}_{P,M}(\mathcal{H})$, as $\ell$ is $L$-Lipschitz. $\Box$

Lemma 4.
For any given predicted rating matrix $\hat{R} \in \mathcal{H}$ and two propensity matrices ($P$ and $P'$), the following inequality holds:

$$\mathcal{L}^{\ell}\big(\hat{R} \mid P\big) \le \mathcal{L}^{\ell}\big(\hat{R} \mid P'\big) + \psi_{\hat{R},\mathcal{H}}\big(P, P'\big).$$

Proof.
By the definition of the PMD, we have

$$\begin{aligned}
\mathcal{L}^{\ell}\big(\hat{R} \mid P\big) &= \mathcal{L}^{\ell}\big(\hat{R} \mid P'\big) - \mathcal{L}^{\ell}\big(\hat{R} \mid P'\big) + \mathcal{L}^{\ell}\big(\hat{R} \mid P\big) \\
&\le \mathcal{L}^{\ell}\big(\hat{R} \mid P'\big) + \sup_{\hat{R}' \in \mathcal{H}} \Big\{ \mathcal{L}^{\ell}\big(\hat{R}' \mid P\big) - \mathcal{L}^{\ell}\big(\hat{R}' \mid P'\big) \Big\} \\
&= \mathcal{L}^{\ell}\big(\hat{R} \mid P'\big) + \psi_{\hat{R},\mathcal{H}}\big(P, P'\big). \qquad \Box
\end{aligned}$$

A.4 Proof of Theorem 1
Proof.
First, we obtain the following inequality by substituting $P_{\mathrm{MCAR}}$ and $P_{\mathrm{MNAR}}$ for $P$ and $P'$ in Lemma 4:

$$\mathcal{L}^{\ell}_{\mathrm{ideal}}\big(\hat{R}\big) \le \mathcal{L}^{\ell}\big(\hat{R} \mid P_{\mathrm{MNAR}}\big) + \psi_{\hat{R},\mathcal{H}}\big(P_{\mathrm{MCAR}}, P_{\mathrm{MNAR}}\big), \quad (15)$$

where $\mathcal{L}^{\ell}_{\mathrm{ideal}}(\hat{R}) = \mathcal{L}^{\ell}(\hat{R} \mid P_{\mathrm{MCAR}})$ by definition. Then, from Lemma 3 and Lemma 1, the following inequalities hold with probability at least $1 - \delta/3$ and $1 - 2\delta/3$, respectively:

$$\mathcal{L}^{\ell}\big(\hat{R} \mid P_{\mathrm{MNAR}}\big) \le \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}\big(\hat{R} \mid O_{\mathrm{MNAR}}\big) + 2 L \mathfrak{R}_{P_{\mathrm{MNAR}},M}(\mathcal{H}) + \Delta \sqrt{\frac{\log(6/\delta)}{2M}}, \quad (16)$$

$$\Big| \psi_{\hat{R},\mathcal{H}}\big(P_{\mathrm{MCAR}}, P_{\mathrm{MNAR}}\big) - \psi_{\hat{R},\mathcal{H}}\big(O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}\big) \Big| \le 2 L \big( \mathfrak{R}_{P_{\mathrm{MCAR}},M}(\mathcal{H}) + \mathfrak{R}_{P_{\mathrm{MNAR}},M}(\mathcal{H}) \big) + 2 \Delta \sqrt{\frac{\log(6/\delta)}{2M}}. \quad (17)$$

Combining Eq. (15), Eq. (16), and Eq. (17) with the union bound completes the proof. $\Box$
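As a numerical sanity check of the concentration behavior bounded in Lemma 2, the following simulation (our own sketch; the matrix sizes, propensity ranges, and variable names are illustrative assumptions, not values from the paper) compares the empirical mean of a bounded function over observed entries, term (a), with its propensity-weighted expectation, term (b):

```python
import numpy as np

# Sanity check for Lemma 2: for a fixed bounded function f with Delta = 1,
# the empirical mean over observed entries concentrates around the
# propensity-weighted expectation as the expected sample size M grows.
rng = np.random.default_rng(0)
m, n = 300, 200
f = rng.uniform(0.0, 1.0, size=(m, n))    # one bounded function on D
P = rng.uniform(0.05, 0.95, size=(m, n))  # propensity matrix, E[O_{u,i}] = P_{u,i}

term_b = float(np.sum(P * f) / P.sum())   # expectation of f over P, term (b)

deviations = []
for _ in range(200):
    O = rng.random((m, n)) < P            # O_{u,i} ~ Bernoulli(P_{u,i})
    term_a = float(f[O].mean())           # empirical mean over observed entries
    deviations.append(abs(term_a - term_b))
```

Across the 200 observation draws, the maximum deviation stays small (the expected sample size here is roughly 30,000), consistent with the $O(1/\sqrt{M})$ rate in the bound.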
B More Related Work
B.1 Recommendation for MNAR Feedback
To address the selection bias of MNAR explicit feedback, several related works assume a missing-data model and a rating model and estimate their parameters using iterative procedures (Hernández-Lobato et al., 2014; Marlin & Zemel, 2009). However, these methods are highly complex and do not perform well on real-world rating datasets (Schnabel et al., 2016; Wang et al., 2019, 2018).

Propensity-based methods were proposed to overcome the limitations of these conventional methods and theoretically address the bias (Liang et al., 2016; Schnabel et al., 2016; Wang et al., 2019, 2018). Among these, the most basic method is IPS estimation, which was originally established in causal inference (Imbens & Rubin, 2015; Rosenbaum & Rubin, 1983; Rubin, 1974). This estimation method provides an unbiased estimator of the true metric of interest by weighting each observation by the inverse of its propensity. The rating predictor based on the IPS estimator empirically outperformed both naive MF (Koren et al., 2009) and the probabilistic generative model (Hernández-Lobato et al., 2014). Propensity-based methods can thus remove the bias of naive methods in theory; however, their performance depends mainly on the accuracy of propensity score estimation. Specifically, it is challenging to ensure the performance of propensity estimators in real-world recommendations, as users independently select which items to rate, and one cannot control the missing mechanism (Marlin & Zemel, 2009). In addition to the simple IPS estimator, (Wang et al., 2019) proposed a doubly robust (DR) variant to reduce the variance of the propensity-weighting approach. The DR estimator utilizes both an error imputation model and the propensity score, and it theoretically improves the bias and estimation error bound compared with its IPS counterpart. However, the proposed joint learning algorithm still requires pre-estimated propensity scores (Wang et al., 2019).
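To make the IPS estimator concrete, the following numpy sketch (our own illustration with hypothetical variable names, not code from any of the cited papers) computes the inverse-propensity-weighted estimate of the ideal squared loss and numerically illustrates its unbiasedness when the true propensities are known:

```python
import numpy as np

def ips_loss(R, R_hat, O, P):
    """IPS estimate of the ideal squared loss over all (u, i) pairs.

    R, R_hat : (m, n) true and predicted rating matrices
    O        : (m, n) binary observation indicator (1 = rating observed)
    P        : (m, n) propensity scores, assumed positive wherever O can be 1
    """
    m, n = R.shape
    sq_loss = (R - R_hat) ** 2
    return float(np.sum(O * sq_loss / P) / (m * n))

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(50, 40)).astype(float)   # true ratings
R_hat = R + rng.normal(0.0, 1.0, size=R.shape)        # noisy predictions
P = rng.uniform(0.2, 0.9, size=R.shape)               # true propensities
ideal = float(np.mean((R - R_hat) ** 2))              # ideal loss of interest

# Averaged over many observation draws O ~ Bernoulli(P), the IPS estimate
# matches the ideal loss, illustrating unbiasedness under true propensities.
estimates = [ips_loss(R, R_hat, (rng.random(R.shape) < P).astype(float), P)
             for _ in range(2000)]
```

Plugging in misspecified propensities instead of `P` would generally leave a residual bias, which is exactly the failure mode of propensity-based methods discussed above.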
Furthermore, the estimation performance of the DR estimator is significantly degraded when both the error imputation model and the propensity model are misspecified (Kang et al., 2007). In the empirical evaluations of propensity-based methods, MCAR test data are used to estimate the propensity score (the so-called naive Bayes estimator) (Schnabel et al., 2016; Wang et al., 2019). In practice, however, MCAR data are unavailable in most situations, as gathering a sufficient amount of MCAR data requires considerable time and cost for the annotation process (Gilotte et al., 2018).

Currently, two studies address the issues of the conventional propensity-based recommendation methods. The first is by (Bonner & Vasile, 2018), which proposed an algorithm called CausE, a domain-adaptation-inspired method that introduces a regularization term on the discrepancy between the latent factors obtained from MCAR and MNAR data. However, this method, by its design, requires a (small) amount of MCAR training data, which is generally unavailable. Moreover, it uses the idea of domain adaptation in a heuristic manner; there is no theoretical guarantee for the proposed loss function. Therefore, our method is more desirable than CausE in the following two respects: (i) our method is theoretically refined in the sense that it is designed to minimize a propensity-independent upper bound of the ideal loss function; (ii) our method does not use any MCAR data in its training process and is thus feasible in realistic situations with no MCAR data. The other study is by (Ma & Chen, 2019), which proposed a propensity estimation method, 1BitMC, that does not require MCAR data, and constructed a theoretical guarantee for the consistency of the proposed method.
However, (Ma & Chen, 2019) presupposed the use of inverse propensity weighting to debias the downstream recommender, which cannot be used when there exists a user–item pair with zero observation probability. Furthermore, the experiments in (Ma & Chen, 2019) were conducted using only small recommendation datasets (Coat and MovieLens 100k), and the performance of recommendation methods with 1BitMC was evaluated using only prediction accuracy measures (MSE and MAE). From the above discussion, the advantages of our method over that of (Ma & Chen, 2019) are as follows: (i) our proposed method and theory do not depend on the propensity score and thus are applicable to settings in which there exists a user–item pair with zero observation probability (i.e., $P_{u,i} = 0$); (ii) via comprehensive experiments, we demonstrate that our proposed method performs better than MF-IPS and MF-DR with 1BitMC for both rating prediction and ranking tasks.

B.2 Unsupervised Domain Adaptation
The aim of unsupervised domain adaptation (UDA) is to train a predictor that performs well on a target domain using only labeled source data and unlabeled target data during training (Kuroki et al., 2018; Saito et al., 2017). The major challenge of UDA is that the feature distributions and labeling functions can differ between the source and target domains; thus, a predictor trained using only labeled source data does not generalize well to the target domain. It is therefore essential to measure the dissimilarity between the two domains to achieve the desired performance on the target domain (Kuroki et al., 2018; Lee et al., 2019). Several discrepancy measures have been proposed to quantify the difference in feature distributions between the source and target domains (Ben-David et al., 2010; Kuroki et al., 2018; Lee et al., 2019; Zhang et al., 2019). For example, the $\mathcal{H}$-divergence and $\mathcal{H}\Delta\mathcal{H}$-divergence (Ben-David et al., 2007, 2010) were used to construct many prediction methods in UDA, such as DANN, ADDA, and MCD (Ganin & Lempitsky, 2015; Ganin et al., 2016; Tzeng et al., 2017; Saito et al., 2018). These methods are based on the adversarial learning framework and can be theoretically interpreted as minimizing the empirical error and a discrepancy measure between the source and target domains. The optimization of these methods does not depend on the propensity score; thus, UDA methods are useful for constructing an effective recommender from biased rating feedback when the true propensities are inaccessible.

This study extends the idea of discrepancy measures to quantify the difference between two propensity score matrices and derives a propensity-independent generalization error bound for the first time in the literature. Moreover, we provide an algorithm that optimizes the upper bound of the ideal loss function via adversarial learning.

C Detailed Experimental Setups and Additional Results
Here, we describe the detailed experimental setups and results. Table 5 presents the hyperparameter search spaces used for all datasets. We also provide the tuned hyperparameters for all methods and all datasets to ensure reproducibility (see the hyper_params.yaml file in our code). Below, we define the propensity estimators, performance measures, and the loss functions of MF-DR and CausE used in the experiments.
C.1 Definitions of propensity estimators
The four variants of propensity estimators used in the experiments are defined as follows:

user propensity: $\hat{P}_{u,*} = \dfrac{\sum_{i \in \mathcal{I}} O_{u,i}}{\max_{u \in \mathcal{U}} \sum_{i \in \mathcal{I}} O_{u,i}}$

item propensity: $\hat{P}_{*,i} = \dfrac{\sum_{u \in \mathcal{U}} O_{u,i}}{\max_{i \in \mathcal{I}} \sum_{u \in \mathcal{U}} O_{u,i}}$

user-item propensity: $\hat{P}_{u,i} = \hat{P}_{u,*} \cdot \hat{P}_{*,i}$

1BitMC: $\hat{P}_{u,i} = \sigma(\hat{\Gamma}_{u,i})$, where $\hat{\Gamma} = \arg\min_{\Gamma \in \mathcal{F}_{\tau,\gamma}} - \sum_{(u,i) \in \mathcal{D}} \big\{ O_{u,i} \log(\sigma(\Gamma_{u,i})) + (1 - O_{u,i}) \log(1 - \sigma(\Gamma_{u,i})) \big\}$,

where $\mathcal{F}_{\tau,\gamma} := \{ \Gamma \in \mathbb{R}^{m \times n} \mid \|\Gamma\|_* \le \tau \sqrt{mn}, \ \|\Gamma\|_{\max} \le \gamma \}$, $\|\cdot\|_*$ is the nuclear norm, $\|\cdot\|_{\max}$ is the entry-wise max norm, $\tau, \gamma > 0$, and $\sigma(\cdot)$ is the sigmoid function. We used the implementation provided by the authors of (Ma & Chen, 2019) for 1BitMC.

C.2 Performance Measures

Here, we formally define the performance measures used in Section 4.

• NDCG measures ranking quality and is defined as
$$\mathrm{DCG@K} = \frac{1}{m} \sum_{u \in [m]} \sum_{i \in \mathcal{I}^{\mathrm{test}}_u} \frac{\big(2^{R_{u,i}} - 1\big) \cdot \mathbb{I}\{\mathrm{rank}(u, i) \le K\}}{\log_2\big(\mathrm{rank}(u, i) + 1\big)}, \qquad \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}},$$

where IDCG@K is the maximum possible DCG@K.

• Recall evaluates how many relevant items are selected and is defined as

$$\mathrm{Recall@K} = \frac{1}{m} \sum_{u \in [m]} \frac{\sum_{i \in \mathcal{I}^{\mathrm{test}}_u} R_{u,i} \cdot \mathbb{I}\{\mathrm{rank}(u, i) \le K\}}{\sum_{i \in \mathcal{I}^{\mathrm{test}}_u} R_{u,i}}.$$

• MSE and MAE evaluate how far the predicted ratings are from the true ratings and are defined as

$$\mathrm{MSE} = \frac{1}{|\mathcal{D}^{\mathrm{test}}|} \sum_{(u,i) \in \mathcal{D}^{\mathrm{test}}} \big( R_{u,i} - \hat{R}_{u,i} \big)^2, \qquad \mathrm{MAE} = \frac{1}{|\mathcal{D}^{\mathrm{test}}|} \sum_{(u,i) \in \mathcal{D}^{\mathrm{test}}} \big| R_{u,i} - \hat{R}_{u,i} \big|,$$

where $\mathbb{I}\{\cdot\}$ is the indicator function, $\mathrm{rank}(u, i)$ is the rank of item $i$ for user $u$ induced by $\hat{R}$, $\mathcal{I}^{\mathrm{test}}_u$ is the set of test items for user $u$, and $\mathcal{D}^{\mathrm{test}}$ is the set of user-item pairs in the test set.

C.3 Loss functions of MF-DR and CausE
First, MF-DR (Wang et al., 2019) optimizes the following doubly robust (DR) estimator of the ideal loss function in Eq. (1):

$$\hat{\mathcal{L}}^{\ell}_{\mathrm{DR}}\big(\hat{R} \mid O\big) = \frac{1}{mn} \sum_{(u,i) \in \mathcal{D}} \left\{ \hat{\ell}_{u,i} + \frac{O_{u,i} \cdot \big( \ell(R_{u,i}, \hat{R}_{u,i}) - \hat{\ell}_{u,i} \big)}{P_{u,i}} \right\}, \quad (18)$$

where $\hat{\ell}_{u,i}$ is called the imputation model and estimates $\ell(R_{u,i}, \hat{R}_{u,i})$. As discussed in (Wang et al., 2019), this estimator is unbiased given the true propensities (i.e., $\mathbb{E}[\hat{\mathcal{L}}^{\ell}_{\mathrm{DR}}] = \mathcal{L}^{\ell}_{\mathrm{ideal}}$) and has a tighter estimation error tail bound than the IPS estimator under mild conditions. MF-DR obtains the final prediction $\hat{R}$ and the imputation model simultaneously via a joint learning procedure (see Algorithm 1 of (Wang et al., 2019)).

Next, in the experiments, CausE (Bonner & Vasile, 2018) optimizes the following loss function:

$$\hat{\mathcal{L}}^{\ell}_{\mathrm{CausE}}\big(\hat{R}_{\mathrm{MCAR}}, \hat{R}_{\mathrm{MNAR}} \mid O_{\mathrm{MCAR}}, O_{\mathrm{MNAR}}\big) = \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}\big(\hat{R}_{\mathrm{MCAR}} \mid O_{\mathrm{MCAR}}\big) + \hat{\mathcal{L}}^{\ell}_{\mathrm{naive}}\big(\hat{R}_{\mathrm{MNAR}} \mid O_{\mathrm{MNAR}}\big) + \beta \big( \| U_{\mathrm{MCAR}} - U_{\mathrm{MNAR}} \|_F + \| V_{\mathrm{MCAR}} - V_{\mathrm{MNAR}} \|_F \big), \quad (19)$$

where $\beta \ge 0$ is a trade-off hyperparameter and $\|\cdot\|_F$ is the Frobenius norm. The two prediction matrices are given by $\hat{R}_{\mathrm{MCAR}} = U_{\mathrm{MCAR}} V_{\mathrm{MCAR}}^{\top}$ and $\hat{R}_{\mathrm{MNAR}} = U_{\mathrm{MNAR}} V_{\mathrm{MNAR}}^{\top}$, where $\hat{R}_{\mathrm{MNAR}}$ is used as the final prediction matrix. The last term in Eq. (19) is a regularizer between the two tasks and penalizes the divergence between the user and item factors of the MNAR and MCAR datasets. As implied by Eq. (19), CausE requires both MNAR and MCAR datasets and is thus infeasible in most real-world recommender systems, where costly MCAR data are unavailable. Note that this method is identical to naive MF when MCAR test data are unavailable.
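The DR estimator in Eq. (18) can be sketched in a few lines of numpy (our own illustration with hypothetical names, not the implementation of (Wang et al., 2019)); the sketch also makes the doubly robust property easy to verify: when the imputed losses are exact, the estimate equals the ideal loss regardless of the propensities:

```python
import numpy as np

def dr_loss(R, R_hat, L_imp, O, P):
    """Doubly robust estimate of the ideal loss, following Eq. (18).

    L_imp : (m, n) imputation model's estimates of the loss l(R_{u,i}, R_hat_{u,i})
    O     : (m, n) binary observation indicator; P : (m, n) propensity scores
    """
    m, n = R.shape
    sq_loss = (R - R_hat) ** 2
    return float(np.sum(L_imp + O * (sq_loss - L_imp) / P) / (m * n))

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(30, 20)).astype(float)
R_hat = R + rng.normal(0.0, 1.0, size=R.shape)
P = rng.uniform(0.1, 0.9, size=R.shape)
O = (rng.random(R.shape) < P).astype(float)

ideal = float(np.mean((R - R_hat) ** 2))
perfect_imputation = (R - R_hat) ** 2   # oracle imputation model
# With a perfect imputation model, the correction term vanishes entry-wise,
# so the DR estimate equals the ideal loss even under misspecified P.
```

Conversely, setting the imputed losses to zero collapses the estimator to the IPS form, so accurate propensities then carry the full debiasing burden.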
(i) Yahoo! R3 (ii) Coat

Figure 2: Comparing rating distributions of the training and test sets for the Yahoo! R3 and Coat datasets. Notes: The rating distributions are significantly different between the training and test sets for both datasets. KL-div is the Kullback–Leibler divergence of the rating distributions between the training and test sets; the distributional shift of the Yahoo! R3 dataset is relatively large compared with that of the Coat dataset.
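For completeness, the ranking measures defined in Appendix C.2 can be sketched in numpy for a single user as follows (hypothetical helper names; our own illustration rather than the exact evaluation code used in the experiments):

```python
import numpy as np

def ndcg_at_k(rel, scores, k):
    """NDCG@K for one user, with gain 2^{R_{u,i}} - 1 as in Appendix C.2.

    rel    : (n,) true ratings of the user's test items
    scores : (n,) predicted scores inducing the ranking
    """
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    gains = 2.0 ** rel[np.argsort(-scores)] - 1.0     # gains in predicted order
    dcg = float(np.sum((gains * discounts)[:k]))
    ideal_gains = 2.0 ** np.sort(rel)[::-1] - 1.0     # gains in the best order
    idcg = float(np.sum((ideal_gains * discounts)[:k]))   # maximum possible DCG@K
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(rel, scores, k):
    """Recall@K for one user with binary relevance labels."""
    top_k = np.argsort(-scores)[:k]
    return float(rel[top_k].sum() / rel.sum())

rel = np.array([3.0, 0.0, 1.0, 2.0])
# Ranking by the true ratings themselves is optimal, so ndcg_at_k(rel, rel, 2) == 1.0
```

The reported metrics are then obtained by averaging these per-user values over all $m$ users.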
Table 5: Hyperparameter searching spaces.
Method        d                 λ           β           optimizer   init. learning rate   max iterations
MF-IPS        { , , . . . , }   [10^- , ]   -           Adam        0.01                  2500
MF-DR         { , , . . . , }   [10^- , ]   -           Adam        0.01                  2500
CausE         { , , . . . , }   [10^- , ]   [10^- , ]   Adam        0.01                  2500
DAMF (ours)   { , , . . . , }   [10^- , ]   [10^- , ]   Adam        0.01                  2500

Notes: The same searching space was used for all datasets (Yahoo! R3 and Coat). Specifically, $d$ denotes the dimension of the latent factors, $\lambda$ denotes the hyperparameter for the L2 regularization, and $\beta$ is the trade-off hyperparameter for DAMF and CausE.