Matrix Completion under Low-Rank Missing Mechanism
Xiaojun Mao*, Raymond K. W. Wong† and Song Xi Chen‡

*School of Data Science, Fudan University, Shanghai 200433, China. Email: [email protected]
†Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A. Email: [email protected]
‡Department of Business Statistics and Econometrics, Guanghua School of Management and Center for Statistical Science, Peking University, Beijing 100651, China. Email: [email protected]

Abstract
Matrix completion is a modern missing data problem where both the missing structure and the underlying parameter are high dimensional. Although the missing structure is a key component of any missing data problem, existing matrix completion methods often assume a simple uniform missing mechanism. In this work, we study matrix completion from corrupted data under a novel low-rank missing mechanism. The probability matrix of observation is estimated via a high-dimensional low-rank matrix estimation procedure, and is further used to complete the target matrix via inverse probability weighting. Due to both the high-dimensional and the extreme (i.e., very small) nature of the true probability matrix, the effect of inverse probability weighting requires careful study. We derive optimal asymptotic convergence rates of the proposed estimators for both the observation probabilities and the target matrix.

Keywords:
Low-rank; Missing; Nuclear-norm; Regularization.
1 Introduction

The problem of recovering a high-dimensional matrix $A_\star \in \mathbb{R}^{n_1 \times n_2}$ from very few (noisy) observations of its entries is commonly known as matrix completion, with applications including collaborative filtering, computer vision and positioning. From a statistical viewpoint, it is a high-dimensional estimation problem with severely incomplete data. In this work, we study the recovery of $A_\star$ under a flexible high-dimensional low-rank sampling structure. This is achieved by a weighted empirical risk minimization, with application of inverse probability weighting (e.g., Schnabel et al., 2016; Mao et al., 2019) to adjust for the effect of non-uniform missingness.

Data arising in many applications of matrix completion, such as recommender systems, usually possess a complex "sampling" structure which is largely unknown. In a movie recommender system, users tend to rate the movies that they prefer or dislike most, while often remaining "silent" on other movies. Another example of a complex sampling regime is online merchandising, where some users may purchase certain items regularly without rating them, but often evaluate products that they rarely buy. Similar to the widely adopted model in which ratings are generated from a small number of hidden factors, it is reasonable to believe that the missingness is also governed by a small and possibly different set of hidden factors, which leads to a low-rank model of the missing structure.

Inspired by generalized linear models, we model the probabilities of observation $\Theta_\star = (\theta_{\star,ij})_{i,j=1}^{n_1,n_2} \in (0,1)^{n_1\times n_2}$ by a high-dimensional low-rank matrix $M_\star = (m_{\star,ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$ together with an inverse link function $f$. That is, on the entry-wise level, $\theta_{\star,ij} = f(m_{\star,ij})$. In generalized linear models, the linear predictor $m_{\star,ij}$ is further modeled as a linear function of observed covariates. However, to reflect the difficulty of attaining (appropriate and adequate) covariate information and the complexity of modeling $\Theta_\star$ in some matrix completion applications, the predictor matrix $M_\star$ is assumed completely hidden in this study. Despite $M_\star$ being hidden, as demonstrated in this work, the low-rankness of $M_\star$ together with its high dimensionality allows both identification and consistent estimation of $\Theta_\star$, which facilitates inverse probability weighted matrix completion. Motivated by the nature of matrix completion, we propose a novel parametrization $M_\star = \mu_\star \mathbf{1}_{n_1}\mathbf{1}_{n_2}^{\mathrm{T}} + Z_\star$, where $Z_\star$ satisfies $\mathbf{1}_{n_1}^{\mathrm{T}} Z_\star \mathbf{1}_{n_2} = 0$. Our proposal extends the work of Davenport et al. (2014), which aims to solve a binary matrix completion problem and pursues a different goal.
Compared with Davenport et al. (2014), the proposed method does not regularize the estimation of $\mu_\star$, but only penalizes the nuclear norm of the estimate of $Z_\star$, which requires a different algorithmic treatment to avoid the bias caused by the nuclear-norm penalty.

There are three fundamental aspects that set our work apart from existing work on matrix completion: (i) the high-dimensional probability matrix $\Theta_\star$, whose dimensions $n_1$, $n_2$ go to infinity in our asymptotic setting; (ii) the diminishing lower bound of the observation probabilities (as $n_1, n_2 \to \infty$), and the resulting instability of inverse probability weighting; (iii) the effect of estimation error in the inverse probability weights on the matrix completion procedure. Aspects (i) and (ii) are unique to our problem and are not found in the missing data literature. Work related to Aspect (iii) is sparse in the matrix completion literature; Mao et al. (2019) focused on a low-dimensional parametric model of inverse probability weighting with observable covariates.

We develop non-asymptotic upper bounds on the mean squared errors of the proposed estimators of the observation probabilities and of the target matrix. To sustain the convergence rate of the target matrix estimator under the high dimensionality of $M_\star$ and low levels of observation probabilities, we propose to re-estimate $Z_\star$ by constraining the magnitude of its entries to a smaller threshold. Our analysis shows that the proposed constrained inverse probability weighting estimator achieves the optimal rate (up to a logarithmic factor) in the estimation of the target matrix. We also compare the inverse probability weighted completion based on the proposed constrained estimation with the completion based on direct weight trimming (or winsorization), a known practice in the conventional missing value literature (e.g., Rubin, 2001; Kang and Schafer, 2007; Schafer and Kang, 2008), and show that the constrained estimation has both theoretical and empirical advantages.

2 Model

2.1 Observation Model

Let $A_\star = (a_{\star,ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$ be an unknown high-dimensional matrix of interest, and let $Y = (y_{ij})_{i,j=1}^{n_1,n_2}$ be a contaminated version of $A_\star$ according to the additive noise model

$y_{ij} = a_{\star,ij} + \epsilon_{ij}$, for $i = 1,\ldots,n_1$; $j = 1,\ldots,n_2$,   (1)

where $\{\epsilon_{ij}\}$ are independently distributed random errors with zero mean and finite variance. In the setting of matrix completion, only a portion of $\{y_{ij}\}$ is observed. For the $(i,j)$th entry, define the sampling indicator $w_{ij} = 1$ if $y_{ij}$ is observed, and $0$ otherwise, and assume $\{\epsilon_{ij}\}$ are independent of $\{w_{ij}\}$.

As for the sampling mechanism, we adopt a Bernoulli model where $\{w_{ij}\}$ are independent Bernoulli random variables with observation probabilities $\{\theta_{\star,ij}\}$, collectively denoted by a matrix $\Theta_\star = (\theta_{\star,ij})_{i,j=1}^{n_1,n_2} \in (0,1)^{n_1\times n_2}$. Similar to generalized linear models, the observation probabilities can be expressed in terms of an unknown matrix $M_\star = (m_{\star,ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$ and a pre-specified monotone and differentiable function $f: \mathbb{R} \to [0,1]$ such that $\theta_{\star,ij} = f(m_{\star,ij})$ for all $i,j$. The matrix $M_\star$ plays the same role as the linear predictor in a generalized linear model, while the function $f$ is an inverse link function. Two popular choices of $f$ are the inverse logit function $g(m) = e^m/(1+e^m)$ (logistic model) and the standard normal cumulative distribution function (probit model).
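To make the sampling model concrete, the following minimal sketch simulates data from model (1) under a logistic inverse link; the sizes, ranks and constants here are illustrative assumptions rather than settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 300, 300, 5                      # illustrative sizes and rank (assumptions)

# Low-rank linear predictor M_star; the negative offset keeps Theta_star small,
# mimicking the high-missingness regime discussed above.
M_star = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2)) / np.sqrt(r) - 3.0

Theta_star = 1.0 / (1.0 + np.exp(-M_star))   # theta_ij = f(m_ij), logistic link
W = rng.binomial(1, Theta_star)              # Bernoulli sampling indicators w_ij

A_star = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2)) / np.sqrt(r)
Y = A_star + 0.5 * rng.normal(size=(n1, n2)) # y_ij = a_ij + eps_ij; only W == 1 observed
print(f"average observation rate: {W.mean():.3f}")
```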
2.2 Low-rank Modeling of $A_\star$ and $M_\star$

The above setup is general. Without additional assumptions, it is virtually impossible to recover the hidden feature matrix $M_\star$ or the target matrix $A_\star$. A common and powerful assumption is that $A_\star$ is a low-rank matrix, i.e., $\mathrm{rank}(A_\star) \ll \min\{n_1, n_2\}$. Take the Yahoo! Webscope data set (to be analyzed in Section 7) as an example. This data set contains a partially observed matrix of ratings from 15,400 users of 1000 songs, and the goal is to complete the rating matrix. The low-rank assumption reflects the belief that users' ratings are generated by a small number of factors, representing several standard preference profiles for songs. This viewpoint has proven useful in the modeling of recommender systems (e.g., Candès and Plan, 2010; Cai et al., 2010).

The same idea can be adapted to the missing pattern, even though the factors that induce the missingness may differ from those that generate the ratings. To this end, we assume $M_\star$ is also low-rank. Next, we decompose $M_\star$ as

$M_\star = \mu_\star J + Z_\star$  where  $\mathbf{1}_{n_1}^{\mathrm T} Z_\star \mathbf{1}_{n_2} = 0$,   (2)

with $\mathbf{1}_n$ an $n$-vector of ones and $J = \mathbf{1}_{n_1}\mathbf{1}_{n_2}^{\mathrm T}$. Here $\mu_\star$ is the mean of $M_\star$, i.e., $\mu_\star = \mathbf{1}_{n_1}^{\mathrm T} M_\star \mathbf{1}_{n_2}/(n_1 n_2)$. Note that this decomposition holds for any matrix $M$ by setting $\mu = (n_1 n_2)^{-1}\mathbf{1}_{n_1}^{\mathrm T} M \mathbf{1}_{n_2}$ and $Z = M - \mu J$. Moreover, the decomposition is unique due to the constraint $\mathbf{1}_{n_1}^{\mathrm T} Z_\star \mathbf{1}_{n_2} = 0$. The key here is to reparametrize $M_\star$ in terms of $\mu_\star$ and $Z_\star$, which require different treatments in their estimation; see Section 3 for details. Further, the low-rankness of $M_\star$ translates to the low-rankness of $Z_\star$.

We note that the rank of $M_\star$ is not the same as that of $\Theta_\star$ due to the nonlinear transformation $f$. In general, the low-rank structure of $M_\star$ implies a specific low-dimensional nonlinear structure of $\Theta_\star$. In a common high-missingness scenario, most entries of $M_\star$ are significantly negative, where many common choices of the inverse link function are well-approximated by a linear function, so our modeling can be regarded as an approximate low-rank modeling of $\Theta_\star$.

There are some related but more specialized models. Srebro and Salakhutdinov (2010) and Negahban and Wainwright (2012) utilize an independent row and column sampling mechanism, leading to a rank-1 structure for $\Theta_\star$. Cai et al. (2016) consider a block structure for $\Theta_\star$, and hence for $M_\star$, which can be regarded as a special case of low-rank modeling. Mao et al. (2019) considered the case where the missingness depends on observable covariates, and adopted a low-rank model with a known row space for $M_\star$. The proposal in this paper is for the situation where the missingness depends on hidden factors, which reflects settings where obvious covariates are unknown or unavailable.
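The decomposition (2) is straightforward to compute; the sketch below (with a hypothetical helper name) verifies the identifiability constraint numerically.

```python
import numpy as np

def mean_residual_split(M):
    """Decompose M = mu * J + Z with 1' Z 1 = 0 (J is the all-ones matrix)."""
    mu = M.mean()            # mu = 1' M 1 / (n1 * n2)
    Z = M - mu               # Z = M - mu * J
    return mu, Z

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 6))
mu, Z = mean_residual_split(M)
assert abs(Z.sum()) < 1e-10  # identifiability: 1' Z 1 = 0
```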
2.3 Weighted Empirical Risk

Write the Hadamard product as $\circ$ and the Frobenius norm as $\|\cdot\|_F$. To recover the target matrix $A_\star$, many existing matrix completion techniques assume a uniform missing structure and hence utilize an unweighted/uniform empirical risk function $\widehat R_{\mathrm{UNI}}(A) = (n_1 n_2)^{-1}\|W \circ (A - Y)\|_F^2$ (e.g., Candès and Plan, 2010; Koltchinskii et al., 2011; Mazumder et al., 2010), which is an unbiased estimator of the risk $R(A) = E(\|A - Y\|_F^2)/(n_1 n_2)$ (up to a multiplicative constant) under uniform missingness. The work of Klopp (2014) is a notable exception that considers the use of $\widehat R_{\mathrm{UNI}}$ under non-uniform missingness.

For any matrix $B = (b_{ij})_{i,j=1}^{n_1,n_2}$, we denote $B^\dagger = (b_{ij}^{-1})_{i,j=1}^{n_1,n_2}$ and $B^\ddagger = (b_{ij}^{-1/2})_{i,j=1}^{n_1,n_2}$. Under general missingness (uniform or non-uniform), one can show that, for any $A \in \mathbb{R}^{n_1\times n_2}$,

$R(A) = (n_1 n_2)^{-1} E(\|A - Y\|_F^2) = (n_1 n_2)^{-1} E(\|W \circ \Theta_\star^\ddagger \circ (A - Y)\|_F^2)$.

Clearly, $A_\star$ uniquely minimizes $R$. If $\Theta_\star$ were known, an unbiased estimator of $R$ would be

$\widehat R(A) = (n_1 n_2)^{-1}\|W \circ \Theta_\star^\ddagger \circ (A - Y)\|_F^2$,   (3)

which motivates the use of inverse probability weighting in matrix completion as in Mao et al. (2019). In addition, our theoretical analysis shows that the nuclear-norm-regularized empirical risk estimator (to be defined in detail later) based on $\widehat R$ (assuming the use of the true observation probabilities) improves upon the existing error upper bound of the corresponding estimator based on $\widehat R_{\mathrm{UNI}}$ obtained by Klopp (2014), as shown in Section 5.3. However, the inverse probability weights $\Theta_\star^\ddagger$ are often unknown and have to be estimated, which must be done carefully in the context of matrix completion.
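In code, the contrast between the unweighted risk $\widehat R_{\mathrm{UNI}}$ and the inverse-probability-weighted risk (3) is a one-line change; the function names below are hypothetical.

```python
import numpy as np

def risk_unweighted(A, Y, W):
    """R_UNI(A) = ||W o (A - Y)||_F^2 / (n1 n2)."""
    return np.sum(W * (A - Y) ** 2) / W.size

def risk_ipw(A, Y, W, Theta):
    """R_hat(A) of (3): ||W o Theta^{ddagger} o (A - Y)||_F^2 / (n1 n2).
    Squaring the entry-wise theta^{-1/2} weights amounts to dividing by theta,
    so very small probabilities inflate the corresponding residuals."""
    return np.sum(W * (A - Y) ** 2 / Theta) / W.size
```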
Despite the popularity of inverse probability weighting in the missing data literature, it is known to produce unstable estimates when small probabilities occur (e.g., Rubin, 2001; Kang and Schafer, 2007; Schafer and Kang, 2008). This problematic scenario is common in matrix completion, where a target matrix is recovered from very few observations. Theoretically, a reasonable setup should allow some $\theta_{\star,ij}$ to go to zero as $n_1, n_2 \to \infty$, leading to diverging weights and a non-standard setup of inverse probability weighting. Due to these observations, a careful construction of the estimation procedure is required.

For uniform sampling ($\theta_{\star,ij} \equiv \theta$ for some probability $\theta$), one only has to worry about a small common probability $\theta$ (or a $\theta$ diminishing in an asymptotic sense). Although a small $\theta$ increases the difficulty of estimation, $\widehat R(A)$ changes only up to a multiplicative constant. For a non-uniform setting, however, it is not as straightforward, due to the heterogeneity among $\{\theta_{\star,ij}\}$. To demonstrate the issue, we briefly look at the Yahoo! Webscope data set described in Section 7. A sign of strong heterogeneity in $\{\theta_{\star,ij}\}$ is a large $\theta_U/\theta_L$, where $\theta_L = \min_{i,j}\theta_{\star,ij}$ and $\theta_U = \max_{i,j}\theta_{\star,ij}$. We found that the corresponding ratio of estimated probabilities $\widehat\theta_U/\widehat\theta_L$ based on the rank-1 structure of Negahban and Wainwright (2012) was 25656.2, and that based on our proposed method (without re-estimation, to be described below) was 23988.0. This strong heterogeneity can jeopardize the convergence rate of our estimator, which will be properly addressed in our framework.

In the following, we propose an estimation approach for $\Theta_\star$ in Section 3.1 and an appropriate modification in Section 3.3 which, when substituted into the empirical risk $\widehat R$, allow us to construct a stable estimator of $A_\star$.
3 Estimation of $\Theta_\star$

3.1 Regularized Maximum Likelihood

We develop the estimation of $\Theta_\star$ within the framework of regularized maximum likelihood. Given the inverse link function $f$, the log-likelihood with respect to the indicator matrix $W = (w_{ij}) \in \mathbb{R}^{n_1\times n_2}$ is

$\ell_W(M) = \sum_{i,j}\big[I_{[w_{ij}=1]}\log\{f(m_{ij})\} + I_{[w_{ij}=0]}\log\{1 - f(m_{ij})\}\big]$,

for any $M = (m_{ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$, where $I_A$ is the indicator of an event $A$. Due to the low-rank assumption on $M_\star$, one natural candidate estimator of $M_\star$ is the maximizer of the regularized log-likelihood $\ell_W(M) - \lambda\|M\|_*$ subject to $\|M\|_\infty \le \alpha$ for some $\alpha > 0$, where $\|\cdot\|_*$ represents the nuclear norm and $\lambda > 0$ is a tuning parameter. However, the nuclear-norm penalty shrinks the estimate towards $M = 0$, corresponding to $\Pr(w_{ij} = 1) = 0.5$ for all $i,j$. This would not align well with common settings of matrix completion, under which the average probability of observation is quite small, and hence would result in a large bias. In view of this, we instead adopt the parametrization $M_\star = \mu_\star J + Z_\star$ and consider the following estimator of $(\mu_\star, Z_\star)$:

$(\widehat\mu, \widehat Z) = \arg\max_{(\mu, Z) \in \mathcal C_{n_1,n_2}(\alpha_0,\alpha_1)} \ell_W(\mu J + Z) - \lambda\|Z\|_*$,  where   (4)

$\mathcal C_{n_1,n_2}(\alpha_0, \alpha_1) = \{(\mu, Z) \in \mathbb{R}\times\mathbb{R}^{n_1\times n_2} : |\mu| \le \alpha_0,\ \|Z\|_\infty \le \alpha_1,\ \mathbf{1}_{n_1}^{\mathrm T} Z \mathbf{1}_{n_2} = 0\}$.

Note that the mean $\mu$ of the linear predictor $\mu J + Z$ is not penalized. The constraint $\mathbf{1}_{n_1}^{\mathrm T} Z\mathbf{1}_{n_2} = 0$ ensures the identifiability of $\mu$ and $Z$. The constraints in $\mathcal C_{n_1,n_2}(\alpha_0,\alpha_1)$ are analogous to $\|M\|_\infty \le \alpha$ with $\alpha = \alpha_0 + \alpha_1$, but are imposed on the parameters $\mu$ and $Z$ separately. With $(\widehat\mu, \widehat Z)$, the corresponding estimator of $M_\star$ is $\widehat M = \widehat\mu J + \widehat Z$.

Davenport et al. (2014) considered a regularized maximum likelihood approach for a binary matrix completion problem. Their goal was different, as they aimed at recovering a binary rating matrix in lieu of the missing structure, and considered a regularization on $M$ (instead of $Z$) via $\|M\|_* \le \alpha'\{\mathrm{rank}(M_\star)n_1 n_2\}^{1/2}$. As for the scaling parameter $\alpha'$, Davenport et al. (2014) considered an $\alpha'$ independent of the dimensions $n_1$ and $n_2$ to restrict the "spikiness" of $M$. As explained earlier, in our framework $\theta_L$ should be allowed to go to zero as $n_1, n_2 \to \infty$. To this end, we allow $\alpha_0$ and $\alpha_1$ to depend on the dimensions $n_1$ and $n_2$; see Section 5 for more details.

3.2 Computational Algorithm

To solve the optimization (4), we begin with the observation that $\ell_W$ is a smooth concave function, which allows the use of an iterative scheme known as the accelerated proximal gradient algorithm (Beck and Teboulle, 2009). Given a pair $(\mu_{\mathrm{old}}, Z_{\mathrm{old}})$ from the previous iteration, a quadratic approximation of the objective function $-\ell_W(\mu J + Z) + \lambda\|Z\|_*$ is formed:

$P_L\{(\mu, Z), (\mu_{\mathrm{old}}, Z_{\mathrm{old}})\} = -\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}}) + (\mu - \mu_{\mathrm{old}})\mathbf{1}_{n_1}^{\mathrm T}\{-\nabla_\mu \ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\}\mathbf{1}_{n_2} + \frac{Ln_1n_2}{2}(\mu - \mu_{\mathrm{old}})^2 + \langle Z - Z_{\mathrm{old}}, -\nabla_Z\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\rangle + \frac{L}{2}\|Z - Z_{\mathrm{old}}\|_F^2 + \lambda\|Z\|_*$,

where $L > 0$ is a step-size parameter and $\langle B, C\rangle = \sum_{i,j}b_{ij}c_{ij}$ for any matrices $B = (b_{ij})$ and $C = (c_{ij})$ of the same dimensions. In this iterative algorithm, a successive update of $(\mu, Z)$ is obtained by

$\arg\min_{(\mu, Z) \in \mathcal C_{n_1,n_2}(\alpha_0,\alpha_1)} P_L\{(\mu, Z), (\mu_{\mathrm{old}}, Z_{\mathrm{old}})\}$,

where the optimization with respect to $\mu$ and $Z$ can be performed separately.
For $\mu$, one can derive the closed-form update

$\mu \leftarrow \min\big[\alpha_0,\ \max\big[-\alpha_0,\ \mu_{\mathrm{old}} + (Ln_1n_2)^{-1}\mathbf{1}_{n_1}^{\mathrm T}\{\nabla_\mu \ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\}\mathbf{1}_{n_2}\big]\big]$.
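For the logistic link, the gradient of $\ell_W$ with respect to the linear predictor has a simple form, so the update above reduces to a clipped gradient step. A minimal sketch under this assumption (names are hypothetical):

```python
import numpy as np

def update_mu(mu_old, Z_old, W, L, alpha0):
    """Closed-form mu update: ascent step on l_W, then projection onto [-alpha0, alpha0]."""
    Theta = 1.0 / (1.0 + np.exp(-(mu_old + Z_old)))  # fitted probabilities at (mu_old, Z_old)
    grad = np.sum(W - Theta)                         # d l_W / d mu under the logistic link
    mu_new = mu_old + grad / (L * W.size)            # step size 1 / (L n1 n2)
    return float(np.clip(mu_new, -alpha0, alpha0))   # enforce |mu| <= alpha0
```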
As for $Z$, we need to perform the minimization

$\arg\min_{\|Z\|_\infty \le \alpha_1,\ \mathbf{1}_{n_1}^{\mathrm T} Z\mathbf{1}_{n_2} = 0}\ \langle Z - Z_{\mathrm{old}}, -\nabla_Z\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\rangle + \frac{L}{2}\|Z - Z_{\mathrm{old}}\|_F^2 + \lambda\|Z\|_*$,

which is equivalent to

$\arg\min_{\|Z\|_\infty \le \alpha_1,\ \mathbf{1}_{n_1}^{\mathrm T} Z\mathbf{1}_{n_2} = 0}\ \frac{1}{2}\big\|Z - Z_{\mathrm{old}} - L^{-1}\nabla_Z\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\big\|_F^2 + \frac{\lambda}{L}\|Z\|_*$.   (5)

We apply a three-block extension of the alternating direction method of multipliers (Chen et al., 2016) to an equivalent form of (5):

$\arg\min_{Z = G_1 = G_2,\ \mathbf{1}_{n_1}^{\mathrm T} G_1\mathbf{1}_{n_2} = 0,\ \|G_2\|_\infty \le \alpha_1}\ \frac{\lambda}{L}\|Z\|_* + \frac{1}{2}\big\|G_1 - Z_{\mathrm{old}} - L^{-1}\nabla_Z\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\big\|_F^2$.   (6)

Write $H = (H_1, H_2)$. The augmented Lagrangian for (6) is

$\mathcal L_u(Z, G_1, G_2; H) = \frac{\lambda}{L}\|Z\|_* + \frac{1}{2}\big\|G_1 - Z_{\mathrm{old}} - L^{-1}\nabla_Z\ell_W(\mu_{\mathrm{old}}J + Z_{\mathrm{old}})\big\|_F^2 - \langle H_1, Z - G_1\rangle - \langle H_2, Z - G_2\rangle + \frac{u}{2}\|Z - G_1\|_F^2 + \frac{u}{2}\|Z - G_2\|_F^2 + I_{[\mathbf{1}_{n_1}^{\mathrm T} G_1\mathbf{1}_{n_2} = 0]} + I_{[\|G_2\|_\infty \le \alpha_1]}$,

where $u > 0$ is a penalty parameter and $I_A = 0$ if the constraint $A$ holds and $\infty$ otherwise.
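A minimal sketch of the resulting three-block updates is given below; the variable roles (quadratic fit and mean-zero constraint on $G_1$, box constraint on $G_2$) follow the formulation above, and all names and the plain (non-accelerated) loop are illustrative assumptions.

```python
import numpy as np

def svt(B, t):
    """Singular value soft-thresholding: prox of t * ||.||_* at B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def admm_z_update(C, lam_over_L, alpha1, u=1.0, n_iter=200):
    """Three-block ADMM sketch for (6): min (lam/L)||Z||_* + 0.5||G1 - C||_F^2
    subject to Z = G1 = G2, 1' G1 1 = 0 and ||G2||_inf <= alpha1."""
    Z = np.zeros_like(C); G1 = np.zeros_like(C); G2 = np.zeros_like(C)
    H1 = np.zeros_like(C); H2 = np.zeros_like(C)
    for _ in range(n_iter):
        # Z-step: nuclear-norm prox at the average of the two consensus targets.
        Z = svt((u * (G1 + G2) + H1 + H2) / (2.0 * u), lam_over_L / (2.0 * u))
        # G1-step: quadratic fit, then projection onto the mean-zero constraint.
        G1 = (C + u * Z - H1) / (1.0 + u)
        G1 -= G1.mean()
        # G2-step: projection onto the box constraint.
        G2 = np.clip(Z - H2 / u, -alpha1, alpha1)
        # Dual updates.
        H1 -= u * (Z - G1)
        H2 -= u * (Z - G2)
    return Z
```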
The detailed algorithm for solving (6) is summarized in Algorithm 1 in the supplementary material. It is noted that, in general, the multi-block alternating direction method of multipliers may fail to converge for some $u > 0$, so the choice of $u$ is crucial. However, we are able to show that our algorithm belongs to a special class (Chen et al., 2016) for which convergence is guaranteed for any $u > 0$. Therefore, we simply set $u = 1$. We summarize the corresponding convergence result in the following theorem, whose proof is provided in the supplementary material.
Theorem 1. The sequence $\{Z^{(k)}, G_1^{(k)}, G_2^{(k)}\}$ generated by Algorithm 1 in the supplementary material converges to the global optimum of (6).
Theorem 2. The estimator $(\widehat\mu, \widehat Z)$ generated by the proximal gradient algorithm converges to the global optimum of (4).

The tuning parameters $\alpha_0$ and $\alpha_1$ can be chosen according to prior knowledge of the problem setup, if available. When prior knowledge is not available, one can choose large values for these parameters; once they are large enough, our method is not sensitive to their specific values. A more principled way to tune $\alpha_0$ and $\alpha_1$ is a challenging problem and beyond the scope of this work. As for $\lambda$, we adopt the Akaike information criterion (AIC), where the degrees of freedom are approximated by $r_{\widehat M}(n_1 + n_2 - r_{\widehat M})$.

3.3 Constrained Re-estimation

To use $\widehat R$ of (3), a naive idea is to obtain $\widehat\Theta = (\widehat\theta_{ij})_{i,j=1}^{n_1,n_2} = F(\widehat M)$, where $F(M) = (f(m_{ij}))_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$ for any $M = (m_{ij})_{i,j=1}^{n_1,n_2} \in \mathbb{R}^{n_1\times n_2}$, and then replace $\Theta_\star^\ddagger$ by $\widehat\Theta^\ddagger = (\widehat\theta_{ij}^{-1/2})_{i,j=1}^{n_1,n_2}$. However, this direct implementation is not robust to extremely small probabilities of observation, and our theoretical analysis shows that it can lead to a slower convergence rate in the estimation of $A_\star$. In the missing data literature, a simple way to robustify is to winsorize the small probabilities (Potter, 1990; Scharfstein et al., 1999).

The estimation of $\widehat\Theta$ defined in (4) assumes $\|Z_\star\|_\infty \le \alpha_1$, and a large $\alpha_1$ has an adverse effect on the estimation. In the setting of diverging $\alpha_1$ (due to diminishing $\theta_L$), the convergence rate of $\widehat Z$ becomes slower, and the estimator obtained after direct winsorization is also affected. That is, even though the extreme probabilities can be controlled by winsorizing, the unchanged entries of $\widehat Z$ (in the winsorizing procedure) may already suffer from a slower rate of convergence. This results in a larger estimation error under certain settings of missingness, as revealed in Section 5.

A seemingly better strategy is to impose a tighter constraint directly in the minimization problem (4), that is, to adopt the constraint $\|Z\|_\infty \le \beta$ where $0 \le \beta \le \alpha_1$. Theoretically, one can then better control the errors on those entries of magnitude smaller than $\beta$. However, the mean-zero constraint on $Z$ no longer makes sense, as the constraint $\|Z\|_\infty \le \beta$ may have shifted the mean.

We propose a re-estimation of $Z_\star$ with a different constraint level $\beta$:

$\widehat Z_\beta = \arg\max_{Z \in \mathbb{R}^{n_1\times n_2}} \ell_W(\widehat\mu J + Z) - \lambda'\|Z\|_*$  subject to $\|Z\|_\infty \le \beta$.   (7)

Note that we only re-compute $Z$ but not $\mu$, which allows us to drop the mean-zero constraint. Thus, $\widehat M_\beta = \widehat\mu J + \widehat Z_\beta$. The corresponding algorithm for the optimization (7) can be derived similarly as in Davenport et al. (2014), and is provided in the supplementary material. In what follows, we write $\widehat\Theta = F(\widehat M)$ and $\widehat\Theta_\beta = F(\widehat M_\beta)$.

4 Estimation of $A_\star$

Now, we come back to (3) and replace $\Theta_\star^\ddagger$ by $\widehat\Theta_\beta^\ddagger$ to obtain a modified empirical risk:

$\widetilde R(A) = (n_1 n_2)^{-1}\|W \circ \widehat\Theta_\beta^\ddagger \circ (A - Y)\|_F^2$,   (8)

where $\widehat\Theta_\beta^\ddagger = (\widehat\theta_{ij,\beta}^{-1/2}) \in \mathbb{R}^{n_1\times n_2}$. Since $A$ is a high-dimensional parameter, a direct minimization of $\widetilde R$ often results in over-fitting. To circumvent this, we consider a regularized version:

$\widetilde R(A) + \tau\|A\|_*$,   (9)

where $\tau > 0$ is a tuning parameter. The proposed estimator of $A_\star$ is

$\widehat A_\beta = \arg\min_{\|A\|_\infty \le a}\ \big\{(n_1 n_2)^{-1}\|W \circ \widehat\Theta_\beta^\ddagger \circ (A - Y)\|_F^2 + \tau\|A\|_*\big\}$,   (10)

where $a$ is an upper bound on $\|A_\star\|_\infty$.
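Each proximal gradient iteration for (10) combines a gradient step on the weighted risk (8) with singular value soft-thresholding, the proximal operator of the nuclear norm (cf. Cai et al., 2010). The sketch below then clips entries to enforce $\|A\|_\infty \le a$, a common practical simplification rather than the exact constrained prox.

```python
import numpy as np

def svt(B, t):
    """Prox of t * ||.||_*: soft-threshold the singular values of B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def apg_step(A, Y, W, Theta_hat, tau, L, a):
    """One proximal gradient step on objective (8) + tau * ||A||_* (sketch)."""
    grad = 2.0 * W * (A - Y) / Theta_hat / A.size  # gradient of the weighted risk (8)
    A_new = svt(A - grad / L, tau / L)             # gradient step + nuclear-norm prox
    return np.clip(A_new, -a, a)                   # then enforce ||A||_inf <= a
```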
The above $\widehat A_\beta$ contains as special cases (i) the matrix completion estimator $\widehat A_{\alpha_1}$, with the unconstrained probability estimator $\widehat\Theta$, by setting $\beta = \alpha_1$; and (ii) the estimator $\widehat A_\beta$, with the constrained probability estimator $\widehat\Theta_\beta$, when $\beta < \alpha_1$.

We use an accelerated proximal gradient algorithm (Beck and Teboulle, 2009) to solve (10). For the choice of the tuning parameter $\tau$ in (10), we adopt 5-fold cross-validation to select the remaining tuning parameters. Due to the non-uniform missing mechanism, we use a weighted version of the validation errors; the specific details are given in Algorithm 3 in the supplementary material.
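As a sketch of the idea behind the weighted validation errors (the precise procedure is Algorithm 3 in the supplementary material), held-out observed entries can be reweighted by their estimated observation probabilities so that heavily sampled entries do not dominate the tuning of $\tau$:

```python
import numpy as np

def weighted_cv_error(A_hat, Y, W_val, Theta_hat):
    """Inverse-probability-weighted validation error on held-out observed entries (sketch)."""
    err = W_val * (A_hat - Y) ** 2 / Theta_hat
    return float(err.sum() / W_val.sum())
```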
5 Theoretical Results

5.1 Convergence of the Probability Estimators

Let $\|B\| = \sigma_{\max}(B)$, $\|B\|_\infty = \max_{i,j}|b_{ij}|$ and $\|B\|_{\infty,2} = (\max_i\sum_j b_{ij}^2)^{1/2}$ be the spectral norm, the maximum norm and the $l_{\infty,2}$-norm of a matrix $B$, respectively. We use the symbol $\asymp$ to represent asymptotic equivalence in order, i.e., $a_n \asymp b_n$ if $a_n = O(b_n)$ and $b_n = O(a_n)$. The average squared distance between two matrices $B, C \in \mathbb{R}^{n_1\times n_2}$ is $d^2(B, C) = \|B - C\|_F^2/(n_1 n_2)$. The average squared errors of $\widehat M_\beta$ and $\widehat\Theta_\beta^\dagger$ are then $d^2(\widehat M_\beta, M_\star)$ and $d^2(\widehat\Theta_\beta^\dagger, \Theta_\star^\dagger)$, respectively. We adopt the Hellinger distance for any two matrices $S, T \in [0,1]^{n_1\times n_2}$: $d_H^2(S, T) = (n_1n_2)^{-1}\sum_{i,j=1}^{n_1,n_2}d_H^2(s_{ij}, t_{ij})$, where $d_H^2(s, t) = (s^{1/2} - t^{1/2})^2 + \{(1-s)^{1/2} - (1-t)^{1/2}\}^2$ for $s, t \in [0,1]$. For convenience, we use $\mathrm{polylog}(n)$ for polynomials of $\log n$.

To investigate the asymptotic properties of $\widehat M_\beta$ and $\widehat\Theta_\beta^\dagger$ defined in Section 3, we introduce the following conditions on the missing structure.

C1. The indicators $\{w_{ij}\}_{i,j=1}^{n_1,n_2}$ are mutually independent, and independent of $\{\epsilon_{ij}\}_{i,j=1}^{n_1,n_2}$. For $i = 1,\ldots,n_1$ and $j = 1,\ldots,n_2$, $w_{ij}$ follows a Bernoulli distribution with probability of success $\theta_{\star,ij} = f(m_{\star,ij}) \in (0,1)$, where $f$ is monotone increasing and differentiable.

C2. The hidden feature matrix $M_\star = \mu_\star J + Z_\star$, where $\mathbf{1}_{n_1}^{\mathrm T} Z_\star\mathbf{1}_{n_2} = 0$, $|\mu_\star| \le \alpha_0 < \infty$ and $\|Z_\star\|_\infty \le \alpha_1 < \infty$. Here $\alpha_0$ and $\alpha_1$ are allowed to depend on the dimensions $n_1$ and $n_2$. This also implies that there exists a lower bound $\theta_L \in (0,1)$ (allowed to depend on $n_1$, $n_2$) such that $\min_{i,j}\{\theta_{\star,ij}\} \ge \theta_L \ge f(-\alpha_0 - \alpha_1) > 0$.

To facilitate the theoretical analysis, we consider an estimator of $(\mu_\star, Z_\star)$ defined by a constrained maximization problem instead of the Lagrangian form (4). For $r_{Z_\star} \le \min\{n_1, n_2\}$ and $\alpha_0, \alpha_1 \ge 0$, let
$(\widehat\mu, \widehat Z) = \arg\max_{(\mu, Z) \in \widetilde{\mathcal C}_{n_1,n_2}(\alpha_0,\alpha_1)} \ell_W(\mu J + Z)$,  where   (11)

$\widetilde{\mathcal C}_{n_1,n_2}(\alpha_0, \alpha_1) = \big\{(\mu, Z) \in \mathbb{R}\times\mathbb{R}^{n_1\times n_2} : |\mu| \le \alpha_0,\ \|Z\|_\infty \le \alpha_1,\ \|Z\|_* \le \alpha_1\sqrt{r_{Z_\star}n_1n_2},\ \mathbf{1}_{n_1}^{\mathrm T} Z\mathbf{1}_{n_2} = 0\big\}$.

It is easy to see that $(\mu_\star, Z_\star) \in \widetilde{\mathcal C}_{n_1,n_2}(\alpha_0, \alpha_1)$ once $(\mu_\star, Z_\star) \in \mathcal C_{n_1,n_2}(\alpha_0, \alpha_1)$ holds. For ease of presentation, we assume $n_1 = n_2 = n$ and choose the logit function as the inverse link $f$ in the rest of Section 5; corresponding results under general settings of $n_1$, $n_2$ and $f$ are delegated to Section S1.3 in the supplementary material. We first establish the convergence results for $\widehat\mu$, $\widehat Z$ and $\widehat M$, respectively. To simplify notation, let $\alpha = \alpha_0 + \alpha_1$, $h_{\alpha,\beta} = (1 + e^{\alpha+\beta})^{-1}$ and $\Gamma_n = e^{\alpha}(\alpha_0 + \alpha_1 r_{Z_\star}^{1/2})n^{-1/2}$.
Lemma 1. Suppose Conditions C1-C2 hold and $(\mu_\star, Z_\star) \in \mathcal C_{n_1,n_2}(\alpha_0, \alpha_1)$. Consider $\widehat M = \widehat\mu J + \widehat Z$, where $(\widehat\mu, \widehat Z)$ is the solution to (11). There exist positive constants $C_1$, $C_2$ such that, with probability at least $1 - C_2/n$,

$(\mu_\star - \widehat\mu)^2 \le C_1(\alpha_0^2 \wedge \Gamma_n)$, $\ n^{-2}\|\widehat Z - Z_\star\|_F^2 \le C_1(\alpha_1^2 \wedge \Gamma_n)\ $ and $\ n^{-2}\|\widehat M - M_\star\|_F^2 \le C_1(\alpha^2 \wedge \Gamma_n)$.   (12)

The upper bounds in (12) all consist of a trivial bound ($\alpha_0^2$, $\alpha_1^2$ or $\alpha^2$) and a more dedicated bound $\Gamma_n$. The trivial upper bounds $\alpha_0^2$, $\alpha_1^2$ and $\alpha^2$ can be derived directly from the constraint set $\mathcal C_{n_1,n_2}(\alpha_0, \alpha_1)$. For extreme settings of increasing $\alpha$, the more dedicated bound $\Gamma_n$ diverges and the trivial bounds may provide better control. The term $\Gamma_n$ can be controlled by the rank of $Z_\star$: for a range of non-extreme scenarios, in which $\alpha$ grows at most logarithmically in $n$, or equivalently $\theta_L$ decays at most polynomially, $\Gamma_n$ attains the convergence order once $r_{Z_\star} = O(1)$.

Similarly, we study the re-estimation of $Z_\star$ in terms of the constrained optimization

$\widehat Z_\beta = \arg\max_{Z \in \mathbb{R}^{n\times n}} \ell_W(\widehat\mu J + Z)$  subject to $\|Z\|_\infty \le \beta,\ \|Z\|_* \le \beta\sqrt{r_{T_\beta(Z_\star)}n_1n_2}$.   (13)

We now consider the constrained estimation of $Z_\star$, $M_\star$ and $\Theta_\star^\dagger$. For any matrix $B = (b_{ij})_{i,j=1}^{n_1,n_2}$, define the winsorizing operator $T_\beta$ by $T_\beta(B) = (T_\beta(b_{ij}))$, where

$T_\beta(b_{ij}) = b_{ij}I_{[-\beta \le b_{ij} \le \beta]} + \beta I_{[b_{ij} > \beta]} - \beta I_{[b_{ij} < -\beta]}$  for any $\beta \ge 0$.   (14)

Write $M_{\star,\beta} = \mu_\star J + T_\beta(Z_\star)$ and $\widehat M_{\star,\beta} = \widehat\mu J + T_\beta(Z_\star)$, and correspondingly $\Theta_{\star,\beta} = F(M_{\star,\beta})$ and $\widehat\Theta_{\star,\beta} = F(\widehat M_{\star,\beta})$. It is noted that $\widehat M_{\star,\beta}$ serves as a "bridge" between the underlying $M_{\star,\beta}$ and the empirical $\widehat M_\beta$. Write $N_\beta = \sum_{i,j}(I_{[z_{\star,ij} > \beta]} + I_{[z_{\star,ij} < -\beta]})$ for the number of extreme entries of $Z_\star$ at level $\beta$.
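The winsorizing operator (14) is a simple entry-wise clip; a minimal sketch:

```python
import numpy as np

def T_beta(B, beta):
    """Entry-wise winsorization at level beta, as in (14)."""
    return np.clip(B, -beta, beta)

def N_beta(Z_star, beta):
    """Number of extreme entries of Z_star at level beta."""
    return int(np.sum(np.abs(Z_star) > beta))
```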
The convergence rates of $d^2(\widehat M_\beta, M_\star)$ and $d^2(\widehat\Theta_\beta^\dagger, \Theta_\star^\dagger)$ are investigated in the next theorem. Define $\Lambda_n = \min[\beta^2,\ \widetilde\Gamma_n + h_{\alpha,\beta}^{-1}n^{-2}\beta\{N_\beta + (n^2 - N_\beta)|\mu_\star - \widehat\mu|\}]$, where $\widetilde\Gamma_n = h_{\alpha,\beta}^{-1}(\alpha_0 + \beta r_{T_\beta(Z_\star)}^{1/2})n^{-1/2}$.

Theorem 3. Assume that Conditions C1-C2 hold, and $(\mu_\star, Z_\star) \in \widetilde{\mathcal C}_{n_1,n_2}(\alpha_0, \alpha_1)$. Consider $\widehat M_\beta = \widehat\mu J + \widehat Z_\beta$, where $\widehat Z_\beta$ is the solution to (13) and $\beta \ge 0$. There exist positive constants $C_1$, $C_2$ and $C_3$ such that, with probability at least $1 - C_3/n$,

$d^2\{\widehat Z_\beta, T_\beta(Z_\star)\} \le C_1\Lambda_n$,
$d^2(\widehat M_\beta, M_\star) \le C_1(\alpha_0^2 \wedge \Gamma_n) + C_2\Lambda_n + 2(\alpha_1 - \beta)^2 N_\beta/(n_1n_2)$, and
$d^2(\widehat\Theta_\beta^\dagger, \Theta_\star^\dagger) \le C_1 h_{\alpha,\beta}^{-2}(\alpha_0^2 \wedge \Gamma_n) + C_2\Lambda_n h_{\alpha,\beta}^{-2} + 8N_\beta/(n^2\theta_L^2)$.   (15)

We can derive an upper bound of order $\beta^2$ for $d^2(\widehat Z_{\mathrm{Win},\beta}, T_\beta(Z_\star))$ from the second bound in Lemma 1, where $\widehat Z_{\mathrm{Win},\beta} = T_\beta(\widehat Z)$ is directly winsorized from $\widehat Z$. Obviously, the order of this upper bound is larger than or equal to that of $\Lambda_n$. Moreover, there are scenarios where $\Lambda_n$ is of strictly smaller order than $\beta^2$. To illustrate, assume $\alpha \asymp \beta \asymp 1$, so that $h_{\alpha,\beta} \asymp 1$. Once $N_\beta = o(n^2)$, $r_{T_\beta(Z_\star)} = o(n)$ and $|\widehat\mu - \mu_\star| = o(1)$, we have $\Lambda_n = o(\beta^2)$.

With a more dedicated investigation of (15), one can derive an upper bound for $d^2(\widehat\Theta_\beta^\dagger, \widehat\Theta_{\star,\beta}^\dagger)$, which will be used in Section 5.2. Denote $k'_{\alpha_0,\alpha_1,n} = \min\{\alpha^2,\ e^{\alpha}(\alpha_0 + \alpha_1 r_{Z_\star}^{1/2})n^{-1/2}\}$; such an upper bound is of order $k_{\alpha_0,\alpha_1,\beta,n}h_{\alpha,\beta}^{-2}$, where

$k_{\alpha_0,\alpha_1,\beta,n} \asymp \min\big[\beta^2,\ h_{\alpha,\beta}^{-1}\beta n^{-2}\{N_\beta + (n^2 - N_\beta)k'^{1/2}_{\alpha_0,\alpha_1,n}\} + h_{\alpha,\beta}^{-1}n^{-1/2}(\alpha_0 + \beta r_{T_\beta(Z_\star)}^{1/2})\big]$.

5.2 Convergence of the Target Matrix Estimator

To study the convergence of $d^2(\widehat A_\beta, A_\star)$, we require the following conditions on the random errors $\epsilon$ and the target matrix $A_\star$. Recall that $\widehat A_\beta$ covers both the estimator obtained with the unconstrained estimator $\widehat\Theta$ and that obtained with the constrained estimator $\widehat\Theta_\beta$, as $\widehat A(\widehat\Theta) = \widehat A_{\alpha_1}$ with $\beta = \alpha_1$.

C3. (a) The random errors $\{\epsilon_{ij}\}$ in Model (1) are independently distributed random variables such that $E(\epsilon_{ij}) = 0$ and $E(\epsilon_{ij}^2) = \sigma_{ij}^2 < \infty$ for all $i,j$. (b) For some finite positive constants $c_\sigma$ and $\eta$, $\max_{i,j}E|\epsilon_{ij}|^l \le \frac{1}{2}l!\,c_\sigma^2\eta^{l-2}$ for any integer $l \ge 2$.

C4. There exists a constant $a$ such that $\|A_\star\|_\infty \le a$.

Denote $h_\beta^{(1)} = \max_{i,j}(\theta_{\star,ij}^{-1}\theta_{\star,ij,\beta})$ and

$\Delta = \max\big[(c_\sigma \vee a)e^{-\mu_\star/2 + (\alpha_1-\beta) + |\alpha_1/2 - \beta|}n^{-1}\log^{1/2}n,\ \ \eta e^{\mu_\star/2 + \alpha_1 + |\alpha_1/2 - \beta|}k^{1/2}_{\alpha_0,\alpha_1,\beta,n}h_{\alpha,\beta}^{-1}n^{-1}\log^{1/2}n\big]$.   (16)

The following theorem establishes a general upper bound for $d^2(\widehat A_\beta, A_\star)$.

Theorem 4. Assume Conditions C1-C4 hold. For $\beta \ge 0$, there exist positive constants $C_1$, $C_2$, $C_3$ and $C_4$, all independent of $\beta$, such that for $h_\beta^{(1)}\tau \ge C_4\Delta$, we have with probability at least $1 - 1/(2n)$,

$d^2(\widehat A_\beta, A_\star) \le \max\big\{C_1 n h_\beta^{(2)} r_{A_\star}\tau^2 + C_2 h_\beta^{(1)}h_\beta^{(2)} r_{A_\star}n^{-1}\log n,\ \ C_3 h_\beta^{(1)}h_\beta^{(3)}n^{-1/2}\log^{1/2}n\big\}$,   (17)

where $h_\beta^{(2)}$ and $h_\beta^{(3)}$ are further maximal ratios between the entries of $\Theta_\star$ and $\Theta_{\star,\beta}$, analogous to $h_\beta^{(1)}$.

As for the estimator of the target matrix based on direct winsorization, $\widehat\Theta_{\mathrm{Win},\beta} = F(\widehat\mu J + \widehat Z_{\mathrm{Win},\beta})$ with $\widehat Z_{\mathrm{Win},\beta} = T_\beta(\widehat Z)$, an upper bound can be derived using Theorem 3. As noted in the remark after Theorem 3, $d^2(\widehat Z_{\mathrm{Win},\beta}, T_\beta(Z_\star))$ converges at the slower rate $\beta^2$, which causes a larger error bound for the target matrix.

Now, we discuss the rates of $d^2(\widehat A_\beta, A_\star)$ under various missing structures. For simplicity, the following discussion focuses on the low-rank linear predictor setting such that $r_{M_\star} \asymp 1$.

Uniform missingness. Under uniform missingness ($\theta_{\star,ij} \equiv \theta$), it has been shown in Koltchinskii et al. (2011) that $\theta^{-1}n^{-1}\mathrm{polylog}(n)$ is the optimal rate for $d^2(\widehat A_\beta, A_\star)$. Therefore it is reasonable to require $\alpha_0 + \alpha_1 = \alpha = O(\mathrm{polylog}(n))$ for the convergence of $d^2(\widehat A_\beta, A_\star)$. Under uniform missingness, we have $\alpha_1 = 0$, $\alpha = \alpha_0$ and $e^{\mu_\star} \asymp \theta$. For $\beta = 0$, our estimator $\widehat A_\beta$ degenerates to the estimator based on the unweighted empirical risk function, and Theorem 4 shows that $\widehat A_\beta$ achieves the optimal rate $\theta^{-1}n^{-1}\mathrm{polylog}(n)$.
As for $\beta > 0$, by taking $\beta \to 0$ with $k_{\alpha_0,\alpha_1,\beta,n} = O(e^{\mu_\star - \alpha_1 - \beta}n^{-1}\log^{-1}n)$, the estimator also reaches the optimal rate. Of interest here is that $\beta$ is allowed to be strictly positive while achieving the same rate.

Non-uniform missingness. Under non-uniform missingness, suppose the lower and upper bounds of the observation probabilities satisfy $\theta_L \asymp e^{\mu_\star - \alpha_1}$ and $\theta_U \asymp e^{\mu_\star + \alpha_1}$. For the non-constrained case of $\beta = \alpha_1$, so that $h_{\alpha,\beta} \asymp e^{-\alpha-\alpha_1}$, the second term of $\Delta$ in (16) dominates the first. Thus, the convergence rate of $d^2(\widehat A_\beta, A_\star)$ is $e^{\mu_\star + 5\alpha_1 + 3\alpha_0}n^{-1}\log n$. To guarantee convergence, it requires that $\alpha_0 + \alpha_1/2 < (1/12)\log n$, which implies that $\theta_L^{-1} = O(n^{1/6})$.
However, the above range of $\theta_L^{-1} = O(n^{1/6})$ excludes $\theta_L \equiv n^{-1}\mathrm{polylog}(n)$, the case that results in the number of observed matrix entries being of the order $n\,\mathrm{polylog}(n)$, which represents the most sparse case of observation under which the matrix can still be recovered (Candès and Recht, 2009; Candès and Plan, 2010; Koltchinskii et al., 2011; Negahban and Wainwright, 2012). We will show in the following that, with an appropriately chosen $\beta$, the constrained estimator $\widehat\Theta_\beta$ can accommodate the case of $\theta_L^{-1} = O(n\log^{-1}n)$.

Case (I): $\beta = 0$. To demonstrate this, we start with the absolutely constrained case, i.e., $\beta = 0$, which forces the estimated probabilities to be uniform and implies $e^{-\mu_\star/2 + (\alpha_1-\beta) + |\alpha_1/2 - \beta|} = e^{-\mu_\star/2 + 3\alpha_1/2} \asymp \theta_U^{1/2}\theta_L^{-1}$. Then, according to Theorem 4, $d^2(\widehat A_\beta, A_\star)$ attains the convergence rate $\theta_U\theta_L^{-2}n^{-1}\log(n)$, which converges to $0$ provided $\theta_U\theta_L^{-2} = o(n\log^{-1}n)$. This condition includes the extreme case of $\theta_L^{-1} = O(n\log^{-1}n)$ with $n\,\mathrm{polylog}(n)$ observations.
Case (II): $\beta > 0$.
For the more interesting setting $\beta > 0$, to simplify the discussion, we concentrate on the case where the first term in $k_{\alpha_0,\alpha_1,\beta,n}$ is of smaller order, which can be achieved by choosing $\beta = O(e^{-\mu_\star - \alpha_1 + \alpha_0}n^{-1/2}\log^{-1}n)$. Then, according to Theorem 4,

$d^2(\widehat A_\beta, A_\star) = O_p(e^{-\mu_\star + 2(\alpha_1-\beta) + 2|\alpha_1/2 - \beta|}n^{-1}\log n) = O_p(e^{3\alpha_1 - \mu_\star}n^{-1}\log n)$,

since $e^{-\mu_\star/2 + (\alpha_1-\beta) + |\alpha_1/2 - \beta|} \le e^{-\mu_\star/2 + 3\alpha_1/2}$. In the following we consider two further cases: (i) $\alpha_0 = O\{(\log\log n)^{-1}\alpha_1\}$ and (ii) $\alpha_0 = o(\alpha_1\log\log n)$. Note that in either case $e^{-\mu_\star + 2(\alpha_1-\beta) + 2|\alpha_1/2 - \beta|} \asymp \theta_U\theta_L^{-2}$, which leads to

$d^2(\widehat A_\beta, A_\star) = O_p(\theta_U\theta_L^{-2}n^{-1}\log n)$.

If $\alpha_0 = O\{(\log\log n)^{-1}\alpha_1\}$, we require $\alpha_1 < \{1 + 3(\log\log n)^{-1}\}^{-1}(\log n - \log\log n)$ to guarantee convergence, which implies that $\theta_L^{-1} = O[n\{\mathrm{polylog}(n)\}^{-1}]$. Thus, we only lose a $\mathrm{polylog}(n)$ factor when compared with the most extreme but feasible setting of $\theta_L^{-1} = O[n\{\mathrm{polylog}(n)\}^{-1}]$. Also, $\beta = O(e^{-\mu_\star - \alpha_1 + \alpha_0}n^{-1/2}\log^{-1}n)$ implies that $\beta = O(n^{-1/2}\log^{-1}n)$.

If $\alpha_0 = o(\alpha_1\log\log n)$, we require a more stringent bound of roughly half that order on $\alpha_1$, which leads to $\theta_L^{-1} = O(n^{1/2})$. Again, $\beta = O(e^{-\mu_\star - \alpha_1 + \alpha_0}n^{-1/2}\log^{-1}n)$ implies that $\beta = O(n^{-1/2}\log^{-1}n)$. However, to make $d^2(\widehat A_\beta, A_\star)$ convergent, the attainable range for $\theta_L^{-1}$ has to be $O(n^{1/2})$, which excludes the most extreme heterogeneity case of $\theta_L^{-1} = O\{n(\mathrm{polylog}(n))^{-1}\}$. The reason for not being able to cover this most extreme case is that Case (ii) allows more heterogeneity in $Z_\star$ relative to Case (i). As $\mu_\star$ is jointly estimated with $Z_\star$ in the unconstrained estimation (Section 3.1), stronger heterogeneity slows down the convergence rate of the estimation of $\mu_\star$, which becomes a bottleneck for further improvement. If $\mu_\star$ were observable, the problem would not be as serious, despite the adverse effect of stronger heterogeneity on the estimation of $Z_\star$.

To summarize, under uniform missingness and Cases (I) and (II)(i) of non-uniform missingness, we achieve the optimal rate up to a $\mathrm{polylog}(n)$ factor. For Case (II)(ii), when the missingness is not extreme, with an appropriately chosen $\beta > 0$ the proposed estimator also attains the optimal rate up to a $\mathrm{polylog}(n)$ factor.
5.3 Comparison with the Unweighted Empirical Risk

Recall that the unweighted empirical risk function $\widehat R_{\mathrm{UNI}}(A) = (n_1n_2)^{-1}\|W \circ (A - Y)\|_F^2$ is adopted by many existing matrix completion techniques (Klopp, 2014). An interesting question is whether there is any benefit to adopting the proposed weighted empirical risk function for matrix completion. In this subsection, we shed some light on this aspect by comparing the non-asymptotic error bounds of the corresponding estimators. Due to the additional complication from the estimation error of the observation probability matrix, we focus here on the weighted empirical risk function with the true inverse probability weights. We demonstrate empirically in Sections 6 and 7 the benefits of the weighted objective function with estimated weights.

Most existing work with the unweighted empirical risk function assumes that the true missingness is uniform (Candès and Plan, 2010; Koltchinskii et al., 2011). One notable exception is Klopp (2014), where the unweighted empirical risk function is studied under a possibly non-uniform missing structure. The estimator of Klopp (2014) is equivalent to our estimator with $\beta = 0$, which is denoted by $\widehat A_{\mathrm{UNI}}$. Thus, according to Theorem 4, we have with probability at least $1 - 1/(2n)$,

$d^2(\widehat A_{\mathrm{UNI}}, A_\star) \le \min\big\{(C_1 + C_2)r_{A_\star}\theta_U\theta_L^{-2}n^{-1}\log n,\ C_3\theta_U^{1/2}\theta_L^{-1}n^{-1/2}\log^{1/2}n\big\} = U_{\mathrm{UNI}}$,

which is the same upper bound as obtained in Klopp (2014). Define $\widehat A_{\mathrm{KNOWN}}$ as the estimator which minimizes the weighted empirical risk function (3) with the true weights. Then,

$d^2(\widehat A_{\mathrm{KNOWN}}, A_\star) \le \min\big\{(C_1 + C_2)r_{A_\star}\theta_L^{-1}n^{-1}\log n,\ C_3\theta_L^{-1/2}n^{-1/2}\log^{1/2}n\big\} = U_{\mathrm{KNOWN}}$.

The improvement offered by the weighted objective function $\widehat R$ lies in the fact that, under non-uniform missingness, $\theta_U\theta_L^{-1} > 1$ and hence $U_{\mathrm{KNOWN}} < U_{\mathrm{UNI}}$, as summarized below.

Theorem 5. Assume Conditions C1-C4 hold, and take $\tau_{\mathrm{KNOWN}} = C_4\theta_L^{-1/2}n^{-1}\log^{1/2}n$ and $\tau_{\mathrm{UNI}} = C_4\theta_U^{1/2}f^{-1}(\mu_\star)n^{-1}\log^{1/2}n$. The upper bound of $d^2(\widehat A_{\mathrm{UNI}}, A_\star)$ is $U_{\mathrm{UNI}}$ and the upper bound of $d^2(\widehat A_{\mathrm{KNOWN}}, A_\star)$ is $U_{\mathrm{KNOWN}}$. In addition, $U_{\mathrm{KNOWN}} \le U_{\mathrm{UNI}}$, and $U_{\mathrm{KNOWN}} < U_{\mathrm{UNI}}$ if $\theta_U > \theta_L$, i.e., if the true missing mechanism is non-uniform.
Our approach draws inspiration from the missing value literature, for instance Chen et al. (2008), which showed that using estimated parameters in the inverse probability weights can actually reduce the variance of the estimator of the parameter of interest; see Theorem 1 of that paper. Given the results of Chen et al. (2008), we would expect that using the estimated probabilities $\widehat\Theta_\beta$ in the weighting would not be inferior to the version with the true probabilities $\Theta_\star$.

6 Simulation Studies

This section reports results from simulation experiments designed to evaluate the numerical performance of the proposed methodologies. We first evaluate the estimation of the observation probabilities in Section 6.1 and then that of the target matrix in Section 6.2. In the simulations, the true observation probabilities $\Theta_\star$ and the target matrix $A_\star$ were randomly generated once and kept fixed for each simulation setting described below.

6.1 Observation Probabilities

To generate $\Theta_\star$, we first generated $U_{M_\star} \in \mathbb{R}^{n_1\times(r_{M_\star}-1)}$ and $V_{M_\star} \in \mathbb{R}^{n_2\times(r_{M_\star}-1)}$ as random Gaussian matrices with independent entries, and set $M_\star = U_{M_\star}V_{M_\star}^{\mathrm T} - \bar m_{n_1,n_2,r_{M_\star}}J$, where $\bar m_{n_1,n_2,r_{M_\star}}$ is a scalar chosen to ensure the desired (low) average observation rate. We then set $\Theta_\star = F(M_\star)$, where the inverse link function $f$ is the logistic function.

In our study, we set $r_{M_\star} = 11$ (i.e., $r_{Z_\star} = 10$) and chose $n_1 = n_2$ with four sizes: 600, 800, 1000 and 1200. The number of simulation runs for each setting was 500.

For the purpose of benchmarking, we compared various estimators of the missingness:

1. the non-constrained estimator $\widehat\Theta_\alpha$ defined in (4);
2. the constrained estimator $\widehat\Theta_\beta$ defined in (7);
3. the directly winsorized estimator $\widehat\Theta_{\mathrm{Win},\beta} = F\{\widehat\mu J + T_\beta(\widehat Z)\}$;
4. the 1-bit estimator $\widehat\Theta_{1,\alpha}$ proposed in Davenport et al. (2014) and its corresponding constrained and winsorized versions $\widehat\Theta_{1,\beta}$ and $\widehat\Theta_{1,\mathrm{Win},\beta}$ (note that the 1-bit estimator $\widehat\Theta_{1,\alpha}$ imposes the nuclear-norm regularization on the whole $M$ instead of $Z$, when compared with $\widehat\Theta_\alpha$);
5. the rank-1 probability estimator $\widehat\Theta_{\mathrm{NW}}$ used in Negahban and Wainwright (2012), where $\widehat g_{i\cdot} = n_2^{-1}\sum_{j=1}^{n_2}w_{ij}$, $\widehat g_{\cdot j} = n_1^{-1}\sum_{i=1}^{n_1}w_{ij}$ and $\widehat\theta_{ij,\mathrm{NW}} = \widehat g_{i\cdot}\widehat g_{\cdot j}$;
6. the uniform estimator $\widehat\Theta_{\mathrm{UNI}} = N/(n_1n_2)\,J$, where $N$ is the total number of observed entries.

For the non-constrained estimator $\widehat\Theta_\alpha$ and the 1-bit estimator $\widehat\Theta_{1,\alpha}$, the parameter $\alpha$ was set according to knowledge of the true $M_\star$. For the constrained estimators $\widehat\Theta_\beta$ and $\widehat\Theta_{\mathrm{Win},\beta}$, the constraint level $\beta$ was chosen so that either 5% or 10% of the elements of $\widehat Z_\alpha$ were winsorized; similarly for $\widehat\Theta_{1,\beta}$ and $\widehat\Theta_{1,\mathrm{Win},\beta}$.

To quantify the estimation performance for the linear predictor $M_\star$ and the observation probabilities $\Theta_\star$, we considered the empirical root mean squared error $\mathrm{RMSE}(B, C)$ for any two matrices $B$ and $C$ of dimension $n_1\times n_2$, and the Hellinger distance $d_H(\widehat\Theta, \Theta_\star)$ between $\widehat\Theta$ and $\Theta_\star$, defined as follows:

$\mathrm{RMSE}(B, C) = \|B - C\|_F/(n_1n_2)^{1/2}$  and  $d_H(\widehat\Theta, \Theta_\star) = \big\{\sum_{i,j=1}^{n_1,n_2}d_H^2(\widehat\theta_{ij}, \theta_{\star,ij})/(n_1n_2)\big\}^{1/2}$.

As the estimators $F^{-1}(\widehat\Theta_\alpha)$ and $F^{-1}(\widehat\Theta_{1,\alpha})$ are both low-rank, we also report their corresponding ranks.
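Both evaluation criteria are direct to compute; a minimal sketch:

```python
import numpy as np

def rmse(B, C):
    """RMSE(B, C) = ||B - C||_F / (n1 n2)^{1/2}."""
    return np.linalg.norm(B - C) / np.sqrt(B.size)

def hellinger(Theta_hat, Theta_star):
    """Aggregated Hellinger distance d_H(Theta_hat, Theta_star) for probability matrices."""
    d2 = (np.sqrt(Theta_hat) - np.sqrt(Theta_star)) ** 2 \
       + (np.sqrt(1 - Theta_hat) - np.sqrt(1 - Theta_star)) ** 2
    return np.sqrt(d2.mean())
```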
Table 1 summarizes the simulation results for the missingness. The most visible aspect of the results is that the proposed estimators $\widehat\Theta_\alpha$ and $\widehat\Theta_{1,\alpha}$ both perform better than the two existing estimators $\widehat\Theta_{\mathrm{NW}}$ and $\widehat\Theta_{\mathrm{UNI}}$, by having smaller root mean squared errors with respect to $\widehat M$, smaller Hellinger distances $d_H(\widehat\Theta, \Theta_\star)$ and more accurately estimated ranks of $M_\star$. Without the separation of $\mu_\star$ from $M_\star$, $\widehat\Theta_{1,\alpha}$ has larger error and Hellinger distance than the proposed estimator. The performance of $\widehat\Theta_{\mathrm{NW}}$ is roughly between the proposed estimators and the uniform estimator $\widehat\Theta_{\mathrm{UNI}}$. The estimator $\widehat\Theta_{\mathrm{UNI}}$ is a benchmark which captures no variation of the observation probabilities.

Table 1: Root mean squared errors $\mathrm{RMSE}(\widehat M, M_\star)$ and Hellinger distances $d_H(\widehat\Theta, \Theta_\star)$, with standard errors in parentheses, under the low-rank missing observation mechanism with $(n_1, n_2)$ = (600, 600), (800, 800), (1000, 1000), (1200, 1200) and $r_{M_\star} = 11$, for the proposed estimators $\widehat\Theta_\alpha$, $\widehat\Theta_{1,\alpha}$ and the two existing estimators ($\widehat\Theta_{\mathrm{NW}}$ and $\widehat\Theta_{\mathrm{UNI}}$).

(n1, n2) = (600, 600)           $\widehat\Theta_\alpha$   $\widehat\Theta_{1,\alpha}$   $\widehat\Theta_{\mathrm{NW}}$   $\widehat\Theta_{\mathrm{UNI}}$
RMSE($\widehat M$, $M_\star$)   2.6923 (0.0342)   2.9155 (0.0295)   -                -
$d_H(\widehat\Theta, \Theta_\star)$   0.0369 (0.0015)   0.0450 (0.0016)   0.1233 (1e-04)   0.1729 (1e-04)

(n1, n2) = (800, 800)
RMSE($\widehat M$, $M_\star$)   2.5739 (0.0116)   2.7796 (0.0033)   -                -
$d_H(\widehat\Theta, \Theta_\star)$   0.0317 (5e-04)    0.0379 (1e-04)    0.1219 (1e-04)   0.1767 (1e-04)

(n1, n2) = (1000, 1000)
RMSE($\widehat M$, $M_\star$)   2.4870 (0.0212)   2.7731 (0.0015)   -                -
$d_H(\widehat\Theta, \Theta_\star)$   0.0266 (8e-04)    0.0351 (1e-04)    0.1246 (1e-04)   0.1767 (1e-04)

(n1, n2) = (1200, 1200)
RMSE($\widehat M$, $M_\star$)   2.3809 (0.0018)   2.6470 (0.0012)   -                -
$d_H(\widehat\Theta, \Theta_\star)$   0.0242 (1e-04)    0.0314 (1e-04)    0.1211 (1e-04)   0.1761 (1e-04)

6.2 Target Matrix
To generate a target matrix $A_\star$, we first generated $U_{A_\star} \in \mathbb{R}^{n_1\times(r_{A_\star}-1)}$ and $V_{A_\star} \in \mathbb{R}^{n_2\times(r_{A_\star}-1)}$ as random matrices with independent Gaussian entries distributed as $N(0, \sigma_{A_\star}^2)$, and obtained $A_\star = c_0 J + U_{A_\star}V_{A_\star}^{\mathrm T}$, where the constant shift $c_0$ and the standard deviation of the entries of the product $U_{A_\star}V_{A_\star}^{\mathrm T}$ were set to a common value, i.e., $\sigma_{A_\star} = c_0(r_{A_\star}-1)^{-1/2}$. The contaminated version of $A_\star$ was then generated as $Y = A_\star + \epsilon$, where $\epsilon \in \mathbb{R}^{n_1\times n_2}$ has i.i.d. mean-zero Gaussian entries $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$. The noise level $\sigma_\epsilon$ was chosen such that $\mathrm{SNR} = (E\|A_\star\|_F^2/E\|\epsilon\|_F^2)^{1/2} = 1$.
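The noise calibration can be done empirically: with i.i.d. noise, $E\|\epsilon\|_F^2 = n_1n_2\sigma_\epsilon^2$, so matching the target SNR amounts to scaling $\sigma_\epsilon$ by the root mean square of the entries of $A_\star$. A sketch (names are hypothetical):

```python
import numpy as np

def add_noise_at_snr(A_star, snr=1.0, rng=None):
    """Return Y = A_star + eps with (||A_star||_F^2 / E||eps||_F^2)^{1/2} = snr."""
    rng = rng or np.random.default_rng()
    sigma_eps = np.sqrt(np.mean(A_star ** 2)) / snr  # RMS of the entries of A_star over snr
    return A_star + sigma_eps * rng.normal(size=A_star.shape), sigma_eps
```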
For the estimation of the target matrix, we evaluated ten versions of the proposed estimator: Proposed $\widehat\Theta_{\beta_t}$, Proposed $\widehat\Theta_{\mathrm{Win},\beta_t}$, Proposed $\widehat\Theta_\alpha$, Proposed $\widehat\Theta_{1,\beta_t}$, Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_t}$ and Proposed $\widehat\Theta_{1,\alpha}$. Here "Proposed" indicates that the estimators are obtained by solving problem (10), while $\widehat\Theta_\beta$, $\widehat\Theta_{\mathrm{Win},\beta}$, $\widehat\Theta_\alpha$, $\widehat\Theta_{1,\beta}$, $\widehat\Theta_{1,\mathrm{Win},\beta}$ and $\widehat\Theta_{1,\alpha}$ represent the probability estimators used in (10), as described in Section 6.1, and $t = 0.05$ or $0.1$ indicates the proportion of winsorized entries by which $\beta$ is chosen. In addition, as in Mao et al. (2019), we also compared them with three existing matrix completion techniques: the methods proposed in Negahban and Wainwright (2012) (NW), Koltchinskii et al. (2011) (KLT) and Mazumder et al. (2010) (MHT).
Among these three methods, NW is the only one that adjusts for non-uniform missingness. All three methods require tuning parameter selection, for which cross-validation is adopted; see Mao et al. (2019) for more details.

To quantify the performance of matrix completion, in addition to the empirical root mean squared error between $\widehat A_\beta$ and $A_\star$, we used one more measure:

Test Error $= \|W_\star \circ (\widehat A_\beta - A_\star)\|_F / \|W_\star \circ A_\star\|_F$,

where $W_\star$ is the matrix of missing indicators with the $(i,j)$th entry being $(1 - w_{ij})$. The test error measures the relative estimation error of the unobserved entries against their signal strength. The estimated ranks of $\widehat A_\beta$ are also reported.
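A sketch of the test error computation over the unobserved entries:

```python
import numpy as np

def test_error(A_hat, A_star, W):
    """||W_star o (A_hat - A_star)||_F / ||W_star o A_star||_F with W_star = 1 - W."""
    W_star = 1 - W
    return np.linalg.norm(W_star * (A_hat - A_star)) / np.linalg.norm(W_star * A_star)
```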
Table 2 summarizes the simulation results for dimensions $n_1 = n_2$ ranging from 600 to 800 with $r_{A_\star} = 11$. The results for $r_{A_\star} = 11$ with dimensions from 1000 to 1200 are delegated to Table S1, and the results for $r_{A_\star} = 31$ to Tables S2-S3 of Section S1.5 in the supplementary material. From the tables, we notice that the ten versions of the proposed method perform better than the three existing methods by having smaller root mean squared errors and test errors. Among the proposed methods, Proposed $\widehat\Theta_\beta$ is better than Proposed $\widehat\Theta_\alpha$ most of the time. This is because the constrained estimator $\widehat\Theta_\beta$ has a much smaller ratio $\widehat\theta_U/\widehat\theta_L$ than $\widehat\Theta_\alpha$, which improves the stability and accuracy of the prediction. Another observation is that Proposed $\widehat\Theta_\beta$ performs better than Proposed $\widehat\Theta_{1,\alpha}$ most of the time.

Table 2: Root mean squared errors, test errors, estimated ranks $r_{\widehat A_\beta}$ and their standard deviations (in parentheses) under the low-rank missing observation mechanism, for ten versions of the proposed method and the three existing methods, with $r_{M_\star} = 11$, $r_{A_\star} = 11$, $(n_1, n_2)$ = (600, 600), (800, 800) and SNR = 1. The three existing methods are proposed respectively in Negahban and Wainwright (2012) (NW), Koltchinskii et al. (2011) (KLT) and Mazumder et al. (2010) (MHT).

(n1, n2) = (600, 600)                          RMSE($\widehat A_\beta$, $A_\star$)   Test Error        $r_{\widehat A_\beta}$
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.05}}$   1.5615 (0.0147)   0.3005 (0.0062)   65.28 (5.72)
Proposed $\widehat\Theta_{\beta_{0.05}}$                1.5548 (0.0085)   0.2996 (0.0034)   54.98 (3.01)
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.1}}$
Proposed $\widehat\Theta_{\beta_{0.1}}$
Proposed $\widehat\Theta_\alpha$
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.05}}$ 1.5664 (0.0093)   0.3028 (0.0037)   62.76 (5.96)
Proposed $\widehat\Theta_{1,\beta_{0.05}}$              1.5573 (0.0089)   0.2996 (0.0036)   61.80 (5.34)
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.1}}$
Proposed $\widehat\Theta_{1,\beta_{0.1}}$
Proposed $\widehat\Theta_{1,\alpha}$
NW
KLT
MHT

(n1, n2) = (800, 800)                          RMSE($\widehat A_\beta$, $A_\star$)   Test Error        $r_{\widehat A_\beta}$
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.05}}$   1.4754 (0.0107)   0.2669 (0.0041)   88.58 (10.81)
Proposed $\widehat\Theta_{\beta_{0.05}}$                1.4797 (0.0080)   0.2714 (0.0030)   71.79 (4.12)
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.1}}$
Proposed $\widehat\Theta_{\beta_{0.1}}$
Proposed $\widehat\Theta_\alpha$
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.05}}$ 1.4917 (0.0078)   0.2743 (0.0030)   83.51 (1.45)
Proposed $\widehat\Theta_{1,\beta_{0.05}}$              1.4804 (0.0080)   0.2705 (0.0031)   82.60 (3.47)
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.1}}$
Proposed $\widehat\Theta_{1,\beta_{0.1}}$
Proposed $\widehat\Theta_{1,\alpha}$
NW
KLT
MHT
7 Real Data Analysis

In this section we demonstrate the proposed methodology by analyzing the Yahoo! Webscope dataset (ydata-ymusic-user-artist-ratings-v1_0) available at http://research.yahoo.com/Academic Relations. It contains (incomplete) ratings from 15,400 users of 1000 songs. The dataset consists of two subsets, a training set and a test set. The training set records approximately 300,000 ratings given by the aforementioned 15,400 users, and each song has at least 10 ratings. The test set was constructed by surveying 5,400 of these 15,400 users, each rating exactly 10 songs that are not rated in the training set. The missing rates of the training and test sets are therefore very high (approximately 0.98 and 0.996, respectively). For $\alpha_0$ and $\alpha_1$, we suggest choosing them large enough, say $\alpha_0 = 100$ and $\alpha_1 = 100$, to ensure that the range covers all the missing probabilities. It was noted that $\widehat\Theta_\alpha$ is not sensitive to larger $\alpha$.

Table 3 reports the root mean squared prediction errors, where

$\mathrm{RMSPE} = \|W^{\mathrm{test}} \circ (\widehat A_\beta - Y)\|_F / \big(\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}w^{\mathrm{test}}_{ij}\big)^{1/2}$,

and $W^{\mathrm{test}}$ is the indicator matrix of the test set with the $(i,j)$th entry being $w^{\mathrm{test}}_{ij}$.
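The RMSPE is computed over the test-set entries only; a sketch:

```python
import numpy as np

def rmspe(A_hat, Y, W_test):
    """||W_test o (A_hat - Y)||_F / (sum_ij w_test_ij)^{1/2}."""
    return np.linalg.norm(W_test * (A_hat - Y)) / np.sqrt(W_test.sum())
```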
Table 3: Root mean squared prediction errors (RMSPE) for the Yahoo! Webscope data, for ten versions of the proposed method and the three existing methods proposed respectively in Negahban and Wainwright (2012) (NW), Koltchinskii et al. (2011) (KLT) and Mazumder et al. (2010) (MHT).

Method                                                  RMSPE
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.05}}$   1.0396
Proposed $\widehat\Theta_{\beta_{0.05}}$                1.0381
Proposed $\widehat\Theta_{\mathrm{Win},\beta_{0.1}}$    1.0476
Proposed $\widehat\Theta_{\beta_{0.1}}$                 1.0490
Proposed $\widehat\Theta_\alpha$                        1.0383
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.05}}$ 1.0831
Proposed $\widehat\Theta_{1,\beta_{0.05}}$              1.1091
Proposed $\widehat\Theta_{1,\mathrm{Win},\beta_{0.1}}$  1.0760
Proposed $\widehat\Theta_{1,\beta_{0.1}}$               1.0523
Proposed $\widehat\Theta_{1,\alpha}$                    1.1065
NW                                                      1.7068
KLT                                                     3.6334
MHT                                                     1.3821
MHT , our proposed methods performsignificantly better in terms of root mean squared prediction errors, and achieve as much as25% improvement when compared with Mazumder, Hastie and Tibshirani’s method (the bestamong the three existing methods). This suggests that a more flexible modeling of missingstructure improves the prediction power.
8 Discussion

When the matrix entries are heterogeneously observed due to selection bias, this heterogeneity should be taken into account. This paper focuses on the problem of matrix completion under a low-rank missing structure. To recover the probabilities of observation, we adopt a generalized linear model with a low-rank linear predictor matrix. To avoid unnecessary bias, we introduce a separation of the mean effect $\mu$. As extreme values of the probabilities may lead to unstable estimation of the target matrix, we propose an inverse probability weighting based method with constrained probability estimates and demonstrate its empirical improvements. Our theoretical results show that the estimator of the high-dimensional probability matrix can be embedded into the inverse probability weighting framework without compromising the rate of convergence for the target matrix (for an appropriately tuned $\beta > 0$, or even $\beta = 0$). In addition, the corresponding computational algorithms are developed, and a related algorithmic convergence result is established. Empirical studies show the attractive performance of the proposed methods as compared with existing matrix completion methods.

References
Beck, A. and Teboulle, M. (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183-202.

Bi, X., Qu, A., Wang, J. and Shen, X. (2017) A group-specific recommender system. Journal of the American Statistical Association, 112, 1344-1353.

Cai, J.-F., Candès, E. J. and Shen, Z. (2010) A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20, 1956-1982.

Cai, T., Cai, T. T. and Zhang, A. (2016) Structured matrix completion with applications to genomic data integration. Journal of the American Statistical Association, 111, 621-633.

Cai, T. T. and Zhou, W.-X. (2016) Matrix completion via max-norm constrained optimization. Electronic Journal of Statistics, 10, 1493-1525.

Candès, E. J. and Plan, Y. (2010) Matrix completion with noise. Proceedings of the IEEE, 98, 925-936.

Candès, E. J. and Recht, B. (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 717-772.

Chen, C., He, B., Ye, Y. and Yuan, X. (2016) The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155, 57-79.

Chen, S. X., Leung, D. H. and Qin, J. (2008) Improving semiparametric estimation by using surrogate data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 803-823.

Davenport, M. A., Plan, Y., van den Berg, E. and Wootters, M. (2014) 1-bit matrix completion. Information and Inference, 3, 189-223.

Foygel, R., Shamir, O., Srebro, N. and Salakhutdinov, R. R. (2011) Learning with the weighted trace-norm under arbitrary sampling distributions. In Advances in Neural Information Processing Systems, 2133-2141.

Kang, J. D. and Schafer, J. L. (2007) Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523-539.

Keshavan, R. H., Montanari, A. and Oh, S. (2009) Matrix completion from noisy entries. In Advances in Neural Information Processing Systems, 952-960.

Klopp, O. (2014) Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20, 282-303.

Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39, 2302-2329.

Mao, X., Chen, S. X. and Wong, R. K. W. (2019) Matrix completion with covariate information. Journal of the American Statistical Association, 114, 198-210.

Mazumder, R., Hastie, T. and Tibshirani, R. (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287-2322.

Negahban, S. and Wainwright, M. J. (2012) Restricted strong convexity and weighted matrix completion: optimal bounds with noise. Journal of Machine Learning Research, 13, 1665-1697.

Potter, F. J. (1990) A study of procedures to identify and trim extreme sampling weights. In Proceedings of the American Statistical Association, Section on Survey Research Methods, 225-230.

Recht, B. (2011) A simpler approach to matrix completion. Journal of Machine Learning Research, 12, 3413-3430.

Rubin, D. B. (2001) Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2, 169-188.

Schafer, J. L. and Kang, J. (2008) Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychological Methods, 13, 279-313.

Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999) Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94, 1096-1120.

Schnabel, T., Swaminathan, A., Singh, A., Chandak, N. and Joachims, T. (2016) Recommendations as treatments: debiasing learning and evaluation. In Proceedings of the 33rd International Conference on Machine Learning, 1670-1679.

Srebro, N. and Salakhutdinov, R. R. (2010) Collaborative filtering in a non-uniform world: learning with the weighted trace norm. In Advances in Neural Information Processing Systems, 2056-2064.