Binary Matrix Completion with Nonconvex Regularizers
Chunsheng Liu, Hong Shan
April 9, 2019
Abstract
Many practical problems involve the recovery of a binary matrix from partial information, which has made the binary matrix completion (BMC) technique receive increasing attention in machine learning. In particular, we consider a special case of the BMC problem in which only a subset of positive elements can be observed. In recent years, convex regularization based methods have been the mainstream approaches for this task. However, the application of nonconvex surrogates in standard matrix completion has demonstrated better empirical performance. Accordingly, we propose a novel BMC model with nonconvex regularizers and provide a recovery guarantee for the model. Furthermore, for solving the resultant nonconvex optimization problem, we improve the popular proximal algorithm with acceleration strategies. It can be guaranteed that the convergence rate of the algorithm is in the order of 1/T, where T is the number of iterations. Extensive experiments conducted on both synthetic and real-world data sets demonstrate the superiority of the proposed approach over other competing methods.

Keywords: Binary matrix completion, Link prediction, Nonconvex regularizers, Topology inference
The matrix completion problem attempts to recover a low-rank or an approximately low-rank matrix by observing only partial elements [1]. In recent years, many strong theoretical analyses have been developed for the matrix completion problem [2-7], which has been applied to a wide variety of practical applications such as background modeling [8, 9], recommender systems [10], sensor localization [11, 12], image and video processing [13, 14], and link prediction [15]. In particular, all these results are based on the underlying assumption that the observed entries are continuous-valued. However, in many practical applications, the observations are not only incomplete but also often highly quantized to a single bit [16]. Therefore, there is a conspicuous gap between the existing approaches and the practical situation, which promotes the rapid development of 1-bit matrix completion [16].

Instead of observing a subset of full entries, a more common situation in practice is that only a subset of positive elements can be observed. Thus, the observations are not only binary but also nonnegative. For instance, consider the link prediction problem in social networks, where only positive relationships, such as "friendships", can be observed, while no "non-friendships" are observed. The goal here is to recover the whole social network from the observed friendships (positive entries). In the context of binary classification, problems learned from positive and unlabeled examples are called positive and unlabeled learning (PU learning for short) [17]. Consequently, the unobserved entries are regarded as unlabeled samples, and PU learning has been applied to matrix completion [18].

The existing methods of PU matrix completion [18, 19] are all based on convex regularizers such as the nuclear norm and the max-norm. However, many works [9, 13, 14, 20] have stated that the (convex) nuclear norm might not be a good enough approximation of the rank function. In contrast, better recovery performance can be achieved by nonconvex surrogates [21-23]. Accordingly, we attempt to introduce nonconvex regularizers into PU matrix completion.

In this paper, we propose a novel model of PU matrix completion with nonconvex regularizers and provide a recovery guarantee for the model. To cope with the challenges of the resultant nonconvex optimization problem, we improve the proximal algorithm with two acceleration schemes: i) instead of a full singular value decomposition (SVD), only a few leading singular values are needed to generate the next iterate; ii) we replace a large matrix by its projection onto the leading subspace, and the reduction of matrix size makes the calculation of the proximal operator more efficient. Moreover, we show that further acceleration is available by taking advantage of the sparsity structure. Subsequently, the resultant algorithm, named "PU matrix completion with nonconvex regularizers (PUMC_N)", is analyzed in detail from the aspects of convergence and time complexity, respectively.

The primary contributions of our work can be summarized as follows:

• Employing nonconvex regularization, we propose a novel PU matrix completion model and provide a strong guarantee for matrix recovery, i.e., the error in recovering an m × n matrix is of order O(1/(δ√(mn))), where δ denotes the sampling rate of positive entries.

• We develop an accelerated version of the proximal algorithm for solving the resultant nonconvex optimization model.
It can be guaranteed that the proposed algorithm has a convergence rate of O(1/T), where T denotes the number of iterations.

• We implement and analyze the proposed algorithm on both synthetic and real-world data sets. Experimental results demonstrate the superiority of the resultant algorithm over state-of-the-art methods.
Notation: In this paper, vectors and matrices are denoted by lowercase and uppercase boldface, respectively. For a matrix X, X^⊤ denotes its transpose, X_k = X(:, 1:k) is its leading k columns, ‖X‖_F = √(Σ_{i,j} X_ij²) is the Frobenius norm of X, and ‖X‖_* = Σ_i σ_i(X) is the nuclear norm, where σ_i(X) is the i-th largest singular value of X. For a set Ω, |Ω| is its cardinality. In addition, we use ∇f for the gradient of a differentiable function f.

In the last decade, based on the remarkable result of low-rank matrix completion [1], a tremendous amount of work has focused on the problem, which enabled a burst of progress concerning matrix completion theory. A strong theoretical basis of matrix completion [2, 3, 5], including the case of approximately low-rank matrices and noisy observations [1, 4, 9, 11], has been established. However, all these results are based on the underlying assumption that the observed entries are continuous-valued. In practice, many applications, such as the popular
Netflix (https://netflixprize.com/index.html) and MovieLens [16] in recommender systems, have a rating matrix whose entries are discrete and quantized rather than continuous. Consequently, there is a conspicuous gap between standard matrix completion theory and practice, revealing the inadequacy of the corresponding methods in dealing with the above case.

Motivated by the above challenge, 1-bit matrix completion was advocated for the first time in [16] to deal with binary (1-bit) observations. Theoretical guarantees were provided to show the efficiency of the method. In addition, a suite of experiments on both synthetic and real-world data sets illustrated some of the practical applications and demonstrated the superiority of 1-bit matrix completion. Then, [24] considered a general nonuniform sampling distribution for the 1-bit matrix completion problem, followed by corresponding theoretical guarantees. Moreover, the noisy version was studied in [25] under the same sampling scheme as [24]. Instead of the nuclear norm, [25] used the max-norm as a convex relaxation of the rank function. Similarly, [26] addressed the problem of social trust prediction with a 1-bit max-norm constrained formulation. Under constraints on the infinity norm and the exact rank, the noisy 1-bit matrix completion problem was explored in [27] and [28]. Furthermore, through the analysis of PAC-Bayesian bounds, [29] evaluated the performance of 1-bit matrix completion.

Relative to the settings of 1-bit matrix completion, there is a more common situation in practice in which only a subset of the positive entries can be observed; this is the PU setting considered in this paper. Moreover, nonconvex surrogates of the rank function, such as the log-sum penalty [21], have been successfully applied in many fields and have better empirical performance than nuclear norm regularizers.

Matrix completion is the problem of recovering the underlying target matrix given its partial information [1]. Following the "basic setting" of [18], let the target matrix M ∈ {0, 1}^{m×n} be a binary matrix that consists only of ones and zeros, and let Ω₁ = {(i, j) | M_ij = 1} denote the index set of all positive elements in M. Equivalently, the observation matrix is denoted, herein, by A ∈ {0, 1}^{m×n}, and Ω denotes the index set of observed elements. According to the "one-sided" sampling in [18], only a subset of the positive entries of M can be observed, that is, Ω ⊆ Ω₁. We suppose that the observation process follows the uniform sampling distribution, which is the popular choice in the majority of work, i.e., Ω is sampled uniformly at random from Ω₁. For the observation matrix A, A_ij = 1 if (i, j) ∈ Ω and A_ij = 0 otherwise. Here, our goal is to recover the underlying target matrix M from the observation matrix A.

According to the above description, the relationship between M and A can be expressed by the following conditional probabilities:

P(A_ij = 0 | M_ij = 1) = 1 − δ,
P(A_ij = 1 | M_ij = 0) = 0,     (1)

where δ = |Ω| / |Ω₁| denotes the sampling rate. We consider the problem of positive and unlabeled matrix completion (PU matrix completion) with nonconvex regularizers in the following form:

min_{X ∈ R^{m×n}} F(X) ≡ ℓ(X) + λ r_n(X),     (2)

where λ is a regularization parameter, ℓ is a smooth loss function, and r_n is a nonconvex regularizer from Table 1. In addition, (2) has the following characteristics.

• ℓ is differentiable with β-Lipschitz continuous gradient, that is, ‖∇ℓ(X₁) − ∇ℓ(X₂)‖_F ≤ β‖X₁ − X₂‖_F with β > 0. Moreover, ℓ is bounded from below, i.e., inf ℓ > −∞.

• r_n(X) = Σ_{i=1}^m r(σ_i(X)) is a nonconvex and nonsmooth function, where r is a nondecreasing concave function and r(0) = 0.
• r_n can be formulated as the difference of two convex functions [47], i.e., r_n(X) = ⌢r_n(X) − ⌣r_n(X), where ⌢r_n and ⌣r_n are convex. (The corresponding convex functions for the nonconvex regularizers mentioned in Section II are provided in Appendix A.)

For accurately quantifying the error in recovering the underlying binary matrix, we propose to adopt the ω-weighted square loss [37, 38] as the loss function. The ω-weighted square loss is defined as

ℓ_ω(x, a) = ω I_{a=1} ℓ(x, a) + (1 − ω) I_{a=0} ℓ(x, a),     (3)

where ℓ(x, a) = (x − a)² is the square loss, and I_{a=1} and I_{a=0} are indicator functions, i.e., I_{a=1} is 1 if a = 1 is true and 0 otherwise.

Consequently, (2) can be further formulated as follows:

min_X F(X) ≡ λ r_n(X) + Σ_{i,j} ℓ_ω(X_ij, A_ij),     (4)

where X and A ∈ {0, 1}^{m×n} are the recovery matrix and the observation matrix of the underlying target matrix M, respectively, and
ℓ_ω(X_ij, A_ij) = ω I_{A_ij=1} ℓ(X_ij, 1) + (1 − ω) I_{A_ij=0} ℓ(X_ij, 0).

Recovery error of (4). Following the definition of the recovery error in [16, 18], the recovery error here can be formulated as

R(X) = (1/mn) ‖X − M‖²_F,     (5)

where M, X ∈ R^{m×n} are the underlying target matrix and its recovery matrix, respectively.

In [38], the label-dependent loss is defined as U(x, a) = I_{x=1} I_{a=0} + I_{x=0} I_{a=1}. Similar to (3), we define the weighted version of the label-dependent loss as

U_ω(x, a) = (1 − ω) I_{x=1} I_{a=0} + ω I_{x=0} I_{a=1}.     (6)

Therefore, the corresponding ω-weighted expected error can be written as

R_ω(X) = E[ Σ_{i,j} U_ω(X_ij, A_ij) ].     (7)

Table 1: The function r and threshold γ for the nonconvex regularizers, where µ > 0 and η = λ/ρ. For the TNN regularizer, µ is an integer denoting the number of leading singular values that are not penalized.

  regularizer | η r(σ_i(X))                                    | γ
  TNN         | 0 for i ≤ µ;  η σ_i(X) for i > µ               | max(σ_{µ+1}(X), η)
  capped ℓ1   | η σ_i(X) for σ_i(X) ≤ µ;  ηµ for σ_i(X) > µ    | min(η, √(µη))
  LSP         | η log(σ_i(X)/µ + 1)                            | min(η/µ, µ)

According to the class-conditional random noise model in [39], (1) can be written correspondingly as

P(A_ij = 0 | M_ij = 1) = 1 − δ = ρ_{+1},
P(A_ij = 1 | M_ij = 0) = 0 = ρ_{−1}.     (8)

Theorem 1. For the choices ω̂ = (1 − ρ_{+1})/2 and κ = (1 − ρ_{+1})/2, there exists a constant c that is independent of X such that, for any matrix X, we have R_ω̂(X) = κ R(X) + c.

The above theorem (Theorem 1) is a special case of Theorem 9 in [39]. At this juncture, the linear mapping between the recovery error R(X) and the ω-weighted expected error R_ω̂(X) indicates that minimizing R(X) is equivalent to minimizing R_ω̂(X) on the partial information.
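To make the choice of ω̂ concrete, the following minimal Python sketch (our illustration, not part of the original paper; it assumes numpy, and the sizes, the positive rate, and the candidate matrix X are arbitrary toy values) builds one-sided observations A from a binary target M with sampling rate δ and evaluates the ω-weighted square loss of (3)-(4) with ω = (1 − ρ_{+1})/2 = δ/2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target M in {0,1}^{m x n}; each positive entry is revealed in A with probability delta.
m, n, delta = 20, 30, 0.3
M = (rng.random((m, n)) < 0.4).astype(float)
A = M * (rng.random((m, n)) < delta)

omega = delta / 2.0          # weight suggested by Theorem 1, since rho_{+1} = 1 - delta

def weighted_square_loss(X, A, omega):
    """omega-weighted square loss of (3)-(4), summed over all entries."""
    pos = (A == 1)
    return (omega * np.sum((X[pos] - 1.0) ** 2)
            + (1.0 - omega) * np.sum(X[~pos] ** 2))

X = rng.random((m, n))       # a hypothetical candidate recovery
print(weighted_square_loss(X, A, omega))
```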
Theorem 2 (Main Result 1). Let X̂ ∈ R^{m×n} be the solution to (4). Then, with probability at least 1 − α,

(1/mn) ‖X̂ − M‖²_F ≤ (C/δ) ( √(log(2/α)/(mn)) + b ),     (9)

where C is an absolute constant, δ denotes the sampling rate, b = (√m + √n + √|Ω₁|)/(mn), and Ω₁ is the index set of all positive elements in M. The proof can be found in Appendix C.

In this section, we show that the nonconvex model can be solved much faster. First, Subsection A presents the basic algorithm for nonconvex regularizers, followed by the acceleration measures in Subsection B. Subsection C summarizes the whole algorithm, and Subsection D analyzes the resultant algorithm from the aspects of convergence and time complexity.

Basic algorithm
Let L_ω(X, A) = Σ_{i,j} ℓ_ω(X_ij, A_ij). In line with the definition of the Frobenius norm, we have the following derivation:

L_ω(X, A) = (1 − ω) Σ_{A_ij=0} (X_ij − A_ij)² + ω Σ_{A_ij=1} (X_ij − A_ij)²
          = (1 − ω) ‖X − A‖²_F + (2ω − 1) ‖P_Ω(X − A)‖²_F,     (10)

where [P_Ω(X − A)]_ij = (X − A)_ij if (i, j) ∈ Ω and 0 otherwise.
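The identity in (10) is easy to check numerically; the short numpy sketch below (ours, with arbitrary toy sizes) compares the entrywise definition of L_ω with the Frobenius-norm form:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, omega = 15, 12, 0.2
A = (rng.random((m, n)) < 0.3).astype(float)   # binary observation matrix
X = rng.standard_normal((m, n))
Omega = (A == 1)                               # index set of observed (positive) entries

# Entrywise definition: (1-w) * sum_{A_ij=0} (X_ij - A_ij)^2 + w * sum_{A_ij=1} (X_ij - A_ij)^2
entrywise = ((1 - omega) * np.sum((X[~Omega] - A[~Omega]) ** 2)
             + omega * np.sum((X[Omega] - A[Omega]) ** 2))

# Frobenius form of (10): (1-w)||X-A||_F^2 + (2w-1)||P_Omega(X-A)||_F^2
P_Omega = np.where(Omega, X - A, 0.0)
frobenius = ((1 - omega) * np.linalg.norm(X - A, 'fro') ** 2
             + (2 * omega - 1) * np.linalg.norm(P_Omega, 'fro') ** 2)

print(np.isclose(entrywise, frobenius))        # True
```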
In recent years, the proximal algorithm [32] has been regarded as an efficient method for solving the optimization problem min_X f₁(X) + f₂(X) when f₁ and f₂ are convex. The following theorem shows the convergence of the proximal algorithm.

Theorem 3 [40]. Let f₁, f₂ be lower semicontinuous, and let f₁ be differentiable with β-Lipschitz continuous gradient. If f₁ + f₂ is coercive and strictly convex, the solution of the problem is unique. For an arbitrary initial matrix X₁ and any ρ ≥ β, the iterative sequence generated by the following statement converges to the unique solution of the problem:

X_{t+1} = prox_{(λ/ρ) f₂}( X_t − (1/ρ) ∇f₁(X_t) ),     (11)

where prox_{(λ/ρ) f₂}(Z) = arg min_{X ∈ R^{m×n}} { (1/2)‖X − Z‖²_F + (λ/ρ) f₂(X) } denotes the proximal operator.

If f₂ is the nuclear norm, the following theorem shows that the proximal operator of the nuclear norm has a closed-form solution.
Theorem 4 [33]. For an arbitrary matrix Z ∈ R^{m×n} and any τ > 0, the proximal operator of the nuclear norm is

prox_{τ‖·‖_*}(Z) = U(Σ − τI)_+ V^⊤,     (12)

where I denotes the identity matrix, SVD(Z) = UΣV^⊤, and [M_+]_ij = max(M_ij, 0).

For solving (4), we extend the proximal operator to the nonconvex problem. Similar to Theorem 3, at the t-th iteration it produces the iterative sequence

X_{t+1} = prox_{(λ/ρ) r_n}( X_t − (1/ρ) ∇L_ω(X_t, A) ),     (13)

where the learning rate 1/ρ is fixed, and ∇L_ω(X, A) denotes the gradient of the ω-weighted loss function, which can be computed efficiently as

∇L_ω(X, A) = 2(1 − ω)(X − A) + 2(2ω − 1) P_Ω(X − A).     (14)

Recently, due to its successful application to convex optimization problems, the proximal algorithm has been extended to the nonconvex setting [9, 13, 14, 20]. Similar to the nuclear norm case (Theorem 4), generalized singular value thresholding [20] was proposed to handle the nonconvex surrogates.
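For reference, a minimal numpy sketch of the singular value thresholding operator of Theorem 4 (our illustration; the test matrix and the value of τ are arbitrary):

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding, the proximal operator of tau*||.||_* in (12)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(2)
Z = rng.standard_normal((8, 6))
print(np.linalg.matrix_rank(svt(Z, tau=1.0)))   # thresholding typically lowers the rank
```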
Theorem 5 (Generalized singular value thresholding (GSVT) [20]).
For an arbitrary matrix Z ∈ R^{m×n}, let r_n be a function that satisfies the characteristics in (2). Then the proximal operator of r_n has the following closed-form solution:

prox_{(λ/ρ) r_n}(Z) = U diag(ŝ) V^⊤,     (15)

where UΣV^⊤ is the SVD of Z, and ŝ = {ŝ_i} with

ŝ_i ∈ arg min_{s_i ≥ 0} (1/2)(s_i − σ_i(Z))² + (λ/ρ) r(s_i).     (16)

Similar to Theorem 4, the above theorem indicates that closed-form solutions for the nonconvex regularizers in Table 1 do exist. In addition, we summarize the above procedure as the basic algorithm shown in Algorithm 1.
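As a concrete instance of (15)-(16), the sketch below (ours, not from the paper) applies GSVT with the LSP penalty of Table 1; each scalar problem is solved by a plain grid search, which stands in for the closed-form or Newton-type solver one would use in practice, and the values of η and µ are arbitrary:

```python
import numpy as np

def gsvt_lsp(Z, eta, mu, grid=2000):
    """GSVT of (15)-(16) with the LSP penalty eta*log(s/mu + 1) from Table 1.
    Each scalar problem (16) is solved by grid search; since r is nondecreasing,
    the minimizer always lies in [0, sigma_i]."""
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    s_hat = np.empty_like(sigma)
    for i, sig in enumerate(sigma):
        s = np.linspace(0.0, sig, grid)
        obj = 0.5 * (s - sig) ** 2 + eta * np.log(s / mu + 1.0)
        s_hat[i] = s[np.argmin(obj)]
    return U @ np.diag(s_hat) @ Vt

rng = np.random.default_rng(3)
Z = rng.standard_normal((10, 7))
X = gsvt_lsp(Z, eta=0.5, mu=1.0)
print(np.linalg.matrix_rank(X, tol=1e-6))       # small singular values are set to zero
```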
Algorithm 1: Basic algorithm
Input: A ∈ R^{m×n}, ρ > β, the sampling set Ω, and the regularization parameter λ.
1: initialize X₁ = 0;
2: for t = 1, 2, ..., T do
3:   X_ig = X_t − (1/ρ) ∇L_ω(X_t, A);
4:   [U, Σ, V] = SVD(X_ig);
5:   for i = 1, 2, ..., m do
6:     ŝ_i ∈ arg min_{s_i ≥ 0} (1/2)(s_i − Σ_ii)² + (λ/ρ) r(s_i);
7:   end
8:   X_{t+1} = U diag({ŝ_i}) V^⊤;
9: end
Result: X_{T+1}

However, the basic algorithm involves a full SVD (step 4) whose time complexity is O(mn²). Next, we take the following two schemes to make the basic algorithm much faster.

S1. The first thing that comes to mind is to use a partial SVD. Indeed, ŝ_i in (16) becomes zero whenever the singular value σ_i(Z) is not larger than a threshold γ; this is the automatic thresholding of [41]. This means that only the leading few singular values, instead of all singular values, are needed to compute the proximal operator in Theorem 5. The thresholds γ for the mentioned nonconvex regularizers are shown in Table 1.

S2. Another measure we carry out is reducing the size of the SVD. The following theorem shows that the proximal operator on a large matrix can be replaced by its counterpart on a smaller one. The proof can be found in Appendix D.

Theorem 6. For an arbitrary matrix Z ∈ R^{m×n}, let k be the number of singular values of Z that are not less than γ, let its rank-k SVD be U_k Σ_k V_k^⊤, and let W ∈ R^{m×k} be an orthogonal matrix. If span(U_k) ⊆ span(W), then the following equation holds:

prox_{(λ/ρ) r_n}(Z) = W prox_{(λ/ρ) r_n}(W^⊤ Z).     (17)

Theorem 6 enables replacing a large matrix by its projection onto the leading subspace. How can such a W in Theorem 6 be obtained? Two approaches are available to find W with the same time complexity. The first, the PROPACK package [42], is widely applied to partial SVD. The second is the power method, which has a good approximation guarantee [43]. Compared with the former, the latter can benefit from warm starts, taking full advantage of the iterative nature of the proximal algorithm. Hence, we use the power method to get W; the details are shown in Algorithm 2.
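Before the pseudocode, a small numpy sketch (ours; the test matrix and subspace sizes are arbitrary) of obtaining such a W by power iteration and of the projection idea behind Theorem 6:

```python
import numpy as np

def power_method(Z, Y, iters=3):
    """Power iteration in the spirit of Algorithm 2: returns an orthonormal W whose
    span approximately contains the leading left singular subspace of Z."""
    R = Z @ Y
    W = None
    for _ in range(iters):
        W, _ = np.linalg.qr(R)       # orthonormalize the current basis
        R = Z @ (Z.T @ W)            # multiply back through Z Z^T
    return W

rng = np.random.default_rng(4)
Z = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 150))   # low-rank test matrix
W = power_method(Z, rng.standard_normal((150, 35)))

# When span(U_k) is (approximately) contained in span(W), Theorem 6 allows the
# expensive proximal step on Z to be replaced by one on the much smaller W^T Z.
print(np.linalg.norm(Z - W @ (W.T @ Z)) / np.linalg.norm(Z))          # close to 0
```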
Algorithm 2: Power method
Input: Z ∈ R^{m×n}, Y₁ ∈ R^{n×k}, and the number of iterations H.
1: R₁ = ZY₁;
2: for h = 1, 2, ..., H do
3:   W_h = QR(R_h);   // QR decomposition
4:   R_{h+1} = Z(Z^⊤ W_h);
5: end
Result: W_H

Let X_ig = X_t − (1/ρ) ∇L_ω(X_t, A). Through the implementation of the above two acceleration measures, (13) can be rewritten as

X_{t+1} = W U_a diag(ŝ) V_a^⊤,     (18)

where U_a Σ_a V_a^⊤ is the rank-a SVD of W^⊤ X_ig, a is the number of singular values that are greater than the threshold γ, and ŝ can be obtained from (16).

We summarize the whole procedure for solving (4) in Algorithm 3 and name it PUMC_N (Positive and Unlabeled Matrix Completion with Nonconvex regularizers). In step 3, similar to the nmAPG algorithm [44], a linear combination of X_{t−1} and X_t is used to accelerate the algorithm. The column spaces of the current iteration (V_t) and the previous iteration (V_{t−1}) are used to accomplish the warm start in step 5, as in [32]. Steps 6 and 7 perform S2, and in step 8 a continuation strategy is introduced to speed up the algorithm further. Specifically, as the iterations proceed, λ_t is dynamic and gradually decreases from a large value. In addition, steps 9-11 perform S1.
Algorithm 3: PUMC_N
Input: λ > λ₀, ρ > β, υ, A ∈ R^{m×n}, and the sampling set Ω.
1: initialize X₀ = X₁ = 0, α₀ = α₁ = 1, and V₀, V₁ ∈ R^{n×1} as random Gaussian matrices;
2: for t = 1, 2, ..., T do
3:   Z_t = X_t + ((α_{t−1} − 1)/α_t)(X_t − X_{t−1});
4:   Z_ig = Z_t − (1/ρ) ∇L_ω(Z_t, A);
5:   Y_t = QR([V_t, V_{t−1}]);
6:   W = powermethod(Z_ig, Y_t);
7:   [U, Σ, V] = SVD(W^⊤ Z_ig);
8:   λ_t = (λ_{t−1} − λ₀) υ^t + λ₀;
9:   for i = 1, 2, ..., k do
10:    ŝ_i ∈ arg min_{s_i ≥ 0} (1/2)(s_i − Σ_ii)² + (λ_t/ρ) r(s_i);
11:  end
12:  X_{t+1} = W U_k diag({ŝ_i}) V_k^⊤;
13:  V_{t+1} = V;
14:  α_{t+1} = (√(4α_t² + 1) + 1)/2;
15: end
Result: X_{T+1}
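To illustrate how the steps of Algorithm 3 fit together, here is a simplified, illustrative numpy rendering (our sketch, not a reference implementation): for brevity it replaces steps 5-7 with a full SVD of Z_ig, keeps the momentum, continuation, and GSVT steps, uses the LSP penalty with a grid-search scalar solver, and all parameter values are arbitrary:

```python
import numpy as np

def grad_L(X, A, Omega, omega):
    """Gradient (14) of the omega-weighted square loss."""
    D = X - A
    return 2 * (1 - omega) * D + 2 * (2 * omega - 1) * np.where(Omega, D, 0.0)

def lsp_threshold(sigma, eta, mu, grid=1000):
    """Solve the scalar problems (16) for the LSP penalty by grid search on [0, sigma_i]."""
    out = np.empty_like(sigma)
    for i, sig in enumerate(sigma):
        s = np.linspace(0.0, sig, grid)
        out[i] = s[np.argmin(0.5 * (s - sig) ** 2 + eta * np.log(s / mu + 1.0))]
    return out

def pumc_n_simplified(A, delta, lam=1.0, lam0=0.05, upsilon=0.7, rho=4.0, mu=1.0, T=50):
    """Simplified rendering of Algorithm 3: momentum + continuation + GSVT.
    A full SVD of Z_ig is used here in place of steps 5-7 (the W-projection).
    rho must exceed the Lipschitz constant beta of the gradient (14)."""
    Omega = (A == 1)
    omega = delta / 2.0                                   # weight suggested by Theorem 1
    X_prev = X = np.zeros_like(A)
    alpha_prev = alpha = 1.0
    lam_t = lam
    for t in range(1, T + 1):
        Z = X + (alpha_prev - 1.0) / alpha * (X - X_prev)             # step 3
        Z_ig = Z - grad_L(Z, A, Omega, omega) / rho                   # step 4
        U, sigma, Vt = np.linalg.svd(Z_ig, full_matrices=False)
        lam_t = (lam_t - lam0) * upsilon ** t + lam0                  # step 8 (continuation)
        s_hat = lsp_threshold(sigma, lam_t / rho, mu)                 # steps 9-11 (GSVT)
        X_prev, X = X, U @ np.diag(s_hat) @ Vt                        # step 12
        alpha_prev, alpha = alpha, (np.sqrt(4 * alpha ** 2 + 1) + 1) / 2   # step 14
    return X

# Toy usage on synthetic PU data (sizes and rates are arbitrary).
rng = np.random.default_rng(5)
Q = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 60))
M = (Q >= 0.5).astype(float)
delta = 0.5
A = M * (rng.random(M.shape) < delta)
X_hat = pumc_n_simplified(A, delta)
print(np.mean((X_hat - M) ** 2))
```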
Convergence analysis. First, we present a lemma which provides the basic support for the convergence analysis of the proposed algorithm. The following lemma shows that the objective function F is nonincreasing as the iterations proceed.

Lemma 1 [45]. Let {X_t} be the iterative sequence produced by (13). For the optimization problem (2), we have

F(X_{t+1}) ≤ F(X_t) − ((ρ − β)/2) ‖X_{t+1} − X_t‖²_F,

where ρ > β.

The following theorem shows that the proposed algorithm generates a bounded iterative sequence. The proof can be found in Appendix E.
Theorem 7. Let {X_t} be the iterative sequence produced by (13). Then {X_t} is a bounded iterative sequence, i.e., Σ_{t=1}^∞ ‖X_{t+1} − X_t‖²_F < ∞.

For the convex optimization problem in Theorem 3, the proximal mapping in [45] is denoted by G_{ρ,f₂}(X_t) = prox_{(λ/ρ) f₂}( X_t − (1/ρ) ∇f₁(X_t) ) − X_t. In particular, when f₂ is convex, ‖G_{ρ,f₂}(X_t)‖²_F can be used to conduct the convergence analysis. In contrast, if f₂ is nonconvex, this is no longer applicable. Hence, we use ‖G_{ρ,r_n}(X_t)‖²_F = ‖X_{t+1} − X_t‖²_F instead to perform the convergence analysis of the proposed algorithm. The convergence of Algorithm 3 is shown in the following theorem, and the proof can be found in Appendix F.
Theorem 8 (Main Result 2). Let {X_t} be the iterative sequence in Algorithm 3. For the consecutive elements X_t and X_{t+1}, we have

min_{t=1,...,T} ‖X_{t+1} − X_t‖²_F ≤ (2/((ρ − β) T)) (F(X₁) − inf F).     (19)

Time complexity. Assume that Y_t in step 5 of Algorithm 3 has k_t columns at the current iteration. Consequently, step 5 takes O(n k_t²) time. Next, step 3 shows that Z_t is a linear combination of X_{t−1} and X_t. Let c_t = (α_{t−1} − 1)/α_t; combining steps 3 and 4, we have

Z_ig = { c₁ X_t + c₂ X_{t−1} + c₃ A } + c₄ P_Ω(Z_t − A),     (20)

where c₁ = (1 + c_t)(1 − c₃), c₂ = c_t(c₃ − 1), c₃ = 2(1 − ω)/ρ, and c₄ = 2(1 − 2ω)/ρ.

The first two terms involve low-rank matrices, whereas the remaining terms involve a sparsity structure. The combined structure in (20) was studied specifically in [42]. Consider the multiplication of Z_ig by a vector b ∈ R^n. The low-rank part costs O((m + n) k_t) time, whereas the sparse part costs O(|Ω|) time. Hence, the cost is O((m + n) k_t + |Ω|) per vector multiplication. Step 7 performs a rank-k_t SVD of W^⊤ Z_ig, which takes O((m + n) k_t² + |Ω| k_t) time. In summary, the order of the time complexity at the current iteration is O((m + n) k_t² + |Ω| k_t).
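The per-multiplication cost O((m + n)k_t + |Ω|) comes from never forming Z_ig explicitly; the following numpy sketch (ours, with arbitrary sizes and a hand-rolled coordinate format for the sparse part) multiplies the low-rank-plus-sparse structure of (20) by a vector:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 5000, 4000, 20

# Low-rank factors of the smooth part of Z_ig in (20).
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))          # low-rank part is U @ V.T

# Sparse part stored as coordinate lists (the entries indexed by Omega).
nnz = 5000
rows = rng.integers(0, m, nnz)
cols = rng.integers(0, n, nnz)
vals = rng.standard_normal(nnz)

b = rng.standard_normal(n)

# Product Z_ig @ b without forming the dense m x n matrix:
# the low-rank part costs O((m + n) k), the sparse part costs O(|Omega|).
y = U @ (V.T @ b)
np.add.at(y, rows, vals * b[cols])
print(y.shape)
```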
In this section, we perform experiments on synthetic and real-world data sets and demonstrate the effectiveness of the proposed algorithm in practical applications, including link prediction, topology inference, and recommender systems. All experiments are implemented in Matlab on Windows 10 with an Intel Xeon CPU (2.8GHz) and 128GB memory.

Data sets. As in [18, 41], we assume the matrix Q ∈ R^{m×m} is generated by Q = M₁M₂, where the elements of M₁ ∈ R^{m×k} and M₂ ∈ R^{k×m} are drawn from the Gaussian distribution N(0, 1). Then M ∈ {0, 1}^{m×m} is generated by M_ij = I_{Q_ij ≥ q}; without loss of generality, we assume that q = 0.5.
We fix k = 5 and vary m in {50, 100, 500, 1000, 2000}. For each scenario, the mean square error

MSE = ‖P_Ω̄(X − M)‖²_F / ‖P_Ω̄(M)‖²_F

is used for performance evaluation, where X is the recovery matrix for the underlying matrix M and Ω̄ is the index set of unobserved elements. Each experiment is repeated ten times with the sampling rate (δ) varying from 0.3 to 0.9, and the average results are reported.
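For reproducibility of the setup described above, a minimal Python sketch (ours; the default sizes are arbitrary) that generates the synthetic PU data and evaluates the MSE on the unobserved entries:

```python
import numpy as np

def make_synthetic(m=500, k=5, q=0.5, delta=0.5, seed=0):
    """Synthetic PU data as described above: M = I[M1 @ M2 >= q], then each
    positive entry of M is revealed in A with probability delta."""
    rng = np.random.default_rng(seed)
    M1 = rng.standard_normal((m, k))
    M2 = rng.standard_normal((k, m))
    M = (M1 @ M2 >= q).astype(float)
    observed = (M == 1) & (rng.random((m, m)) < delta)
    A = np.where(observed, 1.0, 0.0)
    return M, A, observed

def mse_unobserved(X, M, observed):
    """MSE of the text: ||P_{Omega_bar}(X - M)||_F^2 / ||P_{Omega_bar}(M)||_F^2."""
    unobs = ~observed
    return np.sum((X - M)[unobs] ** 2) / np.sum(M[unobs] ** 2)

M, A, observed = make_synthetic(delta=0.3)
print(mse_unobserved(A, M, observed))   # e.g. the error of the trivial recovery X = A
```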
From Theorem 1, if the sampling rate is δ = 0.3 (i.e., 30% of the positive entries of M are observed), ω = 0.15 is chosen.

Figure 1: Performance analysis of the proposed PUMC_N algorithm on synthetic data sets. (a) MSE vs. sampling rate. (b) MSE vs. time (in seconds) with the sampling rate fixed at 0.5.

Results are shown in Fig. 1. Only the TNN regularizer (with µ in Table 1 set to 5) is used in the synthetic experiments; similar results can be obtained with the other nonconvex regularizers in Table 1. Fig. 1(a) shows the testing MSE at different sampling rates. Notice that for an increasing sampling rate we see a monotonic decrease in the testing MSE until it is close to zero. This is reasonable, since a larger number of observations gives rise to more accurate information about the underlying matrix. In addition, there is also a negative correlation between the MSE and the matrix size, which is particularly evident at smaller sampling rates. In particular, if the underlying matrix M is large enough, it can be recovered accurately at a much smaller sampling rate (i.e., with a small number of observations). In Fig. 1(b), we follow the settings in [41] and fix the sampling rate at 0.5 (from Theorem 1, ω = 0.25), reporting the MSE as a function of the running time.

Data sets. In this practical setting, the popular data sets (Table 2), including
MovieLens (100K) [16], FilmTrust [53], and Douban [54], are used to
evaluate the performance of our algorithm. We follow the setup in [16, 19] and convert the ratings in each data set to binary observations by comparing each rating to the average value of the whole data sets (which is ∼3, considering the three data sets together).

Figure 2: Performance comparison of the recommender system methods. (a) Testing MSE vs. sampling rate on the MovieLens data set. (b) Testing MSE vs. sampling rate on the FilmTrust data set. (c) Testing MSE vs. sampling rate on the Douban data set.
Methods. We compare with the nuclear norm based algorithm AIS-Impute [32], as well as the 1-bit matrix completion method of [19]. In particular, AIS-Impute can be considered an accelerated and inexact version of the proximal algorithm, and the 1-bit matrix completion method is constrained by the infinity norm and the nuclear norm.

Results are shown in Fig. 2. Each point in the figure is the average across ten replicate experiments. Moreover, Table ?? shows the performance of the mentioned methods on the three data sets. From the above experiments, the LSP regularizer usually has better or comparable performance relative to the other two regularizers, so we only use the LSP regularizer here. It can be seen that, among the three recommendation algorithms, as long as the sampling rate is not less than 0.4, PUMC_N yields the lowest MSE in the least time. In addition, when the sampling
Table 2: Summary of recommender system data sets.

  data set        | #users | #items | #ratings
  MovieLens-100K  | 943    | 1,682  | 100,000
  FilmTrust       | —      | —      | —
  Douban          | —      | —      | —
In this paper, we addressed the problem of binary matrix completion with nonconvex regularizers, where the observations consist only of positive entries. We proposed a novel PU matrix completion model (4) for tackling the task based on the commonly used nonconvex regularizers and the ω-weighted loss. In particular, the error bound for the model is derived to show that the underlying matrix M ∈ {0, 1}^{m×n} can be recovered accurately. Accordingly, we improved the proximal algorithm with two main acceleration strategies in nonconvex settings for solving (4), and the convergence can also be guaranteed. The experiments on both synthetic and real-world data sets have verified the effectiveness of the proposed approach and validated its superiority over the state-of-the-art methods.

There still remain several directions for further work. From the experimental results, it can be seen that there is still room for further performance improvements at low sampling rates. Besides, as in [24, 25], a general nonuniform sampling distribution will be considered. In addition, to further speed up the proposed algorithm and apply it to massive data sets, we will focus on its distributed version.

References
[1] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[2] E. J. Candès and T. Tao, "The power of convex relaxation: near-optimal matrix completion," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, May 2010.
[3] B. Recht, "A simpler approach to matrix completion," Journal of Machine Learning Research, vol. 12, pp. 3413–3430, Dec. 2011.
[4] A. Rohde and A. B. Tsybakov, "Estimation of high-dimensional low-rank matrices," The Annals of Statistics, vol. 39, no. 2, pp. 887–930, Apr. 2011.
[5] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, Jun. 2010.
[6] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from noisy entries," Journal of Machine Learning Research, vol. 11, no. 22, pp. 2057–2078, Aug. 2010.
[7] T. T. Cai and W.-X. Zhou, "Matrix completion via max-norm constrained optimization," Electronic Journal of Statistics, vol. 10, no. 1, pp. 1493–1525, 2016.
[8] Q. Sun, S. Xiang, and J.-P. Ye, "Robust principal component analysis via capped norms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, 2013, pp. 311–319.
[9] T. H. Oh, Y. W. Tai, J. C. Bazin, H. Kim, and I. S. Kweon, "Partial sum minimization of singular values in robust PCA: algorithm and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 4, pp. 744–758, Apr. 2016.
[10] Z. Wang, M. J. Lai, Z.-S. Lu, W. Fan, H. Davulcu, and J.-P. Ye, "Orthogonal rank-one matrix pursuit for low rank matrix completion," SIAM Journal on Scientific Computing, vol. 37, no. 1, pp. A488–A514, Jan. 2015.
[11] F. Xiao, W. Liu, Z. Li, L. Chen, and R. Wang, "Noise-tolerant wireless sensor networks localization via multinorms regularized matrix completion," IEEE Transactions on Vehicular Technology, vol. 67, no. 3, pp. 2409–2419, Mar. 2018.
[12] C.-S. Liu, H. Shan, and B. Wang, "Wireless sensor network localization via matrix completion based on Bregman divergence," Sensors, vol. 18, no. 9, pp. 2974–2991, Sep. 2018.
[13] C. Lu, J. Tang, S. Yan, and Z. Lin, "Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm," IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 829–839, Feb. 2016.
[14] S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng, and L. Zhang, "Weighted nuclear norm minimization and its applications to low level vision," International Journal of Computer Vision, vol. 121, no. 2, pp. 183–208, Jan. 2017.
[15] R. Pech, D. Hao, L. Pan, H. Cheng, and T. Zhou, "Link prediction via matrix completion," Europhysics Letters, vol. 117, no. 3, p. 38002, Mar. 2017.
[16] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, "1-bit matrix completion," Information and Inference, vol. 3, no. 3, pp. 189–223, Sep. 2014.
[17] C. Elkan and K. Noto, "Learning classifiers from only positive and unlabeled data," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, 2008, pp. 213–220.
[18] C.-J. Hsieh, N. Natarajan, and I. S. Dhillon, "PU learning for matrix completion," Journal of Machine Learning Research, vol. 37, pp. 2445–2453, Jul. 2015.
[19] M. Hayashi, T. Sakai, and M. Sugiyama, "Binary matrix completion using unobserved entries," arXiv, 2018. [Online]. Available: https://arxiv.org/pdf/1803.04663.pdf
[20] C. Lu, C. Zhu, C. Xu, S. Yan, and Z. Lin, "Generalized singular value thresholding," in Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, Texas, USA, 2015, pp. 1805–1811.
[21] E. J. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," Journal of Fourier Analysis and Applications, vol. 14, no. 5–6, pp. 877–905, Dec. 2008.
[22] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," Journal of Machine Learning Research, vol. 11, no. 3, pp. 1081–1107, Mar. 2010.
[23] Y. Hu, D. Zhang, J. Ye, X. Li, and X. He, "Fast and accurate matrix completion via truncated nuclear norm regularization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2117–2130, Sep. 2013.
[24] O. Klopp, J. Lafond, É. Moulines, and J. Salmon, "Adaptive multinomial matrix completion," Electronic Journal of Statistics, vol. 9, no. 2, pp. 2950–2975, 2015.
[25] T. T. Cai and W.-X. Zhou, "A max-norm constrained minimization approach to 1-bit matrix completion," Journal of Machine Learning Research, vol. 14, no. 10, pp. 3619–3647, Dec. 2013.
[26] J. Wang, J. Shen, and H. Xu, "Social trust prediction via max-norm constrained 1-bit matrix completion," arXiv, Apr. 2015. [Online]. Available: https://arxiv.org/pdf/1504.06394.pdf
[27] S. A. Bhaskar and A. Javanmard, "1-bit matrix completion under exact low-rank constraint," Baltimore, MD, USA, 2015, pp. 1–6.
[28] R. Ni and Q. Gu, "Optimal statistical and computational rates for one bit matrix completion," Journal of Machine Learning Research, vol. 51, pp. 426–434, May 2016.
[29] V. Cottet and P. Alquier, "1-bit matrix completion: PAC-Bayesian analysis of a variational approximation," Machine Learning, vol. 107, no. 3, pp. 579–603, Mar. 2018.
[30] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama, "Positive-unlabeled learning with non-negative risk estimator," in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 1675–1685.
[31] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama, "Semi-supervised classification based on classification from positive and unlabeled data," in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 2998–3006.
[32] Q. Yao and J. T. Kwok, "Accelerated inexact soft-impute for fast large-scale matrix completion," in Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 2015, pp. 4002–4008.
[33] J. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, Jan. 2010.
[34] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Mathematical Programming, vol. 9, Sep. 2010.
[35] X. Su, Y. Wang, X. Kang, and R. Tao, "Nonconvex truncated nuclear norm minimization based on adaptive bisection method," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–14, Oct. 2018.
[36] J. Yao, D. Meng, Q. Zhao, W. Cao, and Z. Xu, "Nonconvex-sparsity and nonlocal-smoothness based blind hyperspectral unmixing," IEEE Transactions on Image Processing, pp. 1–16, Jan. 2019.
[37] V. Sindhwani, S. S. Bucak, J. Hu, and A. Mojsilovic, "One-class matrix completion with low-density factorizations," Sydney, Australia, 2010, pp. 1055–1060.
[38] C. Scott, "Calibrated asymmetric surrogate losses," Electronic Journal of Statistics, vol. 6, pp. 958–992, Jan. 2012.
[39] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari, "Learning with noisy labels," in Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, 2013, pp. 1196–1204.
[40] P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Modeling and Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
[41] Q. Yao, J. T. Kwok, T. Wang, and T. Liu, "Large-scale low-rank matrix learning with nonconvex regularizers," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–16, Jul. 2018.
[42] R. Mazumder, T. Hastie, and R. Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," Journal of Machine Learning Research, vol. 11, pp. 2287–2322, Aug. 2010.
[43] N. Halko, P. G. Martinsson, and J. A. Tropp, "Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, Jan. 2011.
[44] H. Li and Z. Lin, "Accelerated proximal gradient methods for nonconvex programming," in Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015, pp. 379–387.
[45] H. Attouch, J. Bolte, and B. F. Svaiter, "Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods," Mathematical Programming, vol. 137, no. 1–2, pp. 91–129, Feb. 2013.
[46] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, "Network topology inference from spectral templates," IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, Sep. 2017.
[47] P.-H. Gong, C.-S. Zhang, Z.-S. Lu, J.-H. Huang, and J.-P. Ye, "A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems," Journal of Machine Learning Research, vol. 28, no. 2, pp. 37–45, Mar. 2013.
[48] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems, vol. 6, no. 4, pp. 565–573, Dec. 2003.
[49] M. Gao, L. Chen, B. Li, and W. Liu, "A link prediction algorithm based on low-rank matrix completion," Applied Intelligence, vol. 48, no. 12, pp. 4531–4550, Dec. 2018.
[50] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, May 2007.
[51] A. Chaintreau, P. Hui, J. Crowcroft, C. Diot, R. Gass, and J. Scott, "Impact of human mobility on opportunistic forwarding algorithms," IEEE Transactions on Mobile Computing, vol. 6, no. 6, pp. 606–620, Jun. 2007.
[52] J. Leskovec, D. Huttenlocher, and J. Kleinberg, "Signed networks in social media," in Proceedings of the 28th International Conference on Human Factors in Computing Systems, Atlanta, Georgia, USA, 2010, pp. 1361–1370.
[53] G. Guo, J. Zhang, and N. Yorke-Smith, "A novel evidence-based Bayesian similarity measure for recommender systems," ACM Transactions on the Web, vol. 10, no. 2, pp. 1–30, May 2016.
[54] F. Monti, M. Bronstein, and X. Bresson, "Geometric matrix completion with recurrent multi-graph neural networks," in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017.