Bayesian Uncertainty Quantification for Low-rank Matrix Completion
Henry Shaowu Yuchi∗, Simon Mak†, Yao Xie∗

February 2021
Abstract
We consider the problem of uncertainty quantification for an unknown low-rank matrix X, given a partial and noisy observation of its entries. This quantification of uncertainty is essential for many real-world problems, including image processing, satellite imaging, and seismology, providing a principled framework for validating scientific conclusions and guiding decision-making. However, existing literature has largely focused on the completion (i.e., point estimation) of the matrix X, with little work on investigating its uncertainty. To this end, we propose in this work a new Bayesian modeling framework, called BayeSMG, which parametrizes the unknown X via its underlying row and column subspaces. This Bayesian subspace parametrization allows for efficient posterior inference on matrix subspaces, which represent interpretable phenomena in many applications. This can then be leveraged for improved matrix recovery. We demonstrate the effectiveness of BayeSMG over existing Bayesian matrix recovery methods in numerical experiments and a seismic sensor network application.

Keywords: hierarchical modeling, manifold sampling, matrix factorization, matrix completion, seismic imaging, uncertainty quantification

∗ H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology
† Department of Statistical Science, Duke University

1 Introduction
Low-rank matrices play a vital role in modeling many scientific and engineering problems, including (but not limited to) image processing, satellite imaging, and network analysis. In such applications, however, only a small portion of the desired matrix (which we denote as X ∈ ℝ^{m₁×m₂} in this article) can be observed. The reasons for this are two-fold: (i) the cost of observing all matrix entries can be high, requiring expensive computational, experimental, or communication expenditure; (ii) there can be missing observations at individual entries due to sensor malfunction, experimental failure, or unreliable data transmission. The matrix completion problem aims to complete the missing entries of X from a partial (and oftentimes noisy) observation. Matrix completion has attracted much attention since the seminal works of Candès and Tao (2010), Candès and Recht (2009), and Recht (2011). The theory and methodology behind point estimation are now well understood for matrix completion, under the assumption that X is low-rank, with various convex and non-convex optimization algorithms developed for performing this recovery.

However, much of the literature (a detailed review is provided in Section 1.1) has focused on the completion, i.e., point estimation, of X, with little work on exploring the uncertainty of such estimates. In many scientific and engineering applications, such estimates are much more useful when coupled with a measure of uncertainty. The principled characterization (and reduction) of this uncertainty is known as uncertainty quantification (UQ); see, e.g., Smith (2013). UQ is becoming increasingly important in various applications, providing a principled framework for validating scientific conclusions and guiding decision-making.

In this paper, we address the problem of UQ for the matrix completion problem from a Bayesian perspective. We propose a novel Bayesian modeling framework, called BayeSMG, which quantifies uncertainty in the desired matrix X via posterior sampling on its underlying subspaces. BayeSMG can be viewed as a hierarchical Bayesian extension of the singular matrix-variate Gaussian (SMG) distribution (see Gupta and Nagar, 1999; Mak and Xie, 2018), with hierarchical priors on matrix subspaces. In addition to providing point estimates on X, the proposed model also yields UQ on X via an efficient Gibbs sampler on matrix subspaces. By integrating this subspace structure for posterior inference, we show that BayeSMG enjoys improved recovery performance and better interpretability compared with existing Bayesian models, in extensive numerical experiments and a real-world seismic sensor network application.

1.1 Existing literature

Much of the existing literature on inferring X from partial observations falls under the topic of matrix completion – the completion (or point estimation) of X from observed entries. Early works in this area include the seminal works of Candès and Tao (2010), Candès and Recht (2009), and Recht (2011), which established conditions for exact completion via nuclear-norm minimization, under the assumption that observations are uniformly sampled without noise. This was then extended to the noisy matrix completion setting, where entries are observed with noise; important results include Candès and Plan (2010), Keshavan et al. (2010), Koltchinskii et al. (2011), and Negahban and Wainwright (2012), among others. There is now a rich body of work on matrix completion; recent overviews include Davenport and Romberg (2016) and Chi et al. (2019).
However, completion focuses solely on the point estimation of the matrix X and provides no quantification of uncertainty on each unobserved entry. In problems where only a few entries are observed (see motivating applications), this uncertainty can be as valuable as the point estimate itself in assessing the quality of the recovered matrix.

The aforementioned literature largely focuses on point estimation of the unknown matrix X. The problem of quantifying uncertainties in X has been relatively unexplored, but is clearly an important one given the motivating applications. One recent pioneering work on this is Chen et al. (2019), who proposed entrywise confidence intervals for both convex and non-convex estimators of X, via debiasing using low-rank factors of the matrix. The resulting debiased estimators admit nearly precise nonasymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals for missing matrix entries and low-rank factors. Our approach has several distinctions from this work. First, the latter is a frequentist approach with appealing theoretical guarantees, whereas our approach is Bayesian and may yield a richer quantification of uncertainty on X via a hierarchical Bayesian model. Second, to derive elegant theoretical results, the latter requires a sample size complexity condition on X, similar to the minimum sample size condition in standard matrix completion analysis (see, e.g., Candès and Recht, 2009). Our UQ approach, in contrast, is applicable for any sample size N on X, particularly for the "small-N" setting where observations are limited and uncertainty quantification is most needed.

Another approach for quantifying uncertainty is via Bayesian modeling. There is a growing literature on Bayesian matrix completion, of which the most popular approach is the Bayesian Probabilistic Matrix Factorization (BPMF) method of Salakhutdinov and Mnih (2008). BPMF adopts the following probabilistic model on X: X = MNᵀ, M ∈ ℝ^{m₁×R}, N ∈ ℝ^{m₂×R}, where R < m₁ ∧ m₂ := min(m₁, m₂) is an upper bound on the matrix rank. Each row of the factorized matrices M and N is then assigned an i.i.d. Gaussian prior, N(μ_M, Σ_M) and N(μ_N, Σ_N), respectively. Conjugate normal hyperpriors are then assigned on the row and column means, μ_M ~ N(0, Σ_M/β₀), μ_N ~ N(0, Σ_N/β₀), with Inverse-Wishart hyperpriors on the row and column covariance matrices, Σ_M ~ IW(R, W₀), Σ_N ~ IW(R, W₀). The hyperparameters β₀ and W₀ are typically specified to provide weakly- or non-informative priors. This Bayesian specification allows for an efficient Gibbs sampler, which performs conjugate sampling on each row of M and each row of N, along with conjugate updates on the mean vectors (μ_M, μ_N) and covariance matrices (Σ_M, Σ_N). This allows BPMF to tackle problems as large as the Netflix dataset, which has millions of user-movie ratings. Many existing Bayesian matrix completion methods (e.g., Lawrence and Urtasun, 2009; Zhou et al., 2010; Babacan et al., 2011; Alquier et al., 2014) can be viewed as variations or extensions of this BPMF framework.
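To make the BPMF generative model above concrete, here is a minimal NumPy sketch of a single draw from it; the dimensions, rank bound, and hyperparameter values are illustrative choices, not taken from Salakhutdinov and Mnih (2008).

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, R = 20, 20, 3                    # illustrative dimensions and rank bound

# Row/column means and covariances (here fixed at their hyperprior centers).
mu_M, Sigma_M = np.zeros(R), np.eye(R)
mu_N, Sigma_N = np.zeros(R), np.eye(R)

# Each row of M and N receives an i.i.d. Gaussian prior.
M = rng.multivariate_normal(mu_M, Sigma_M, size=m1)   # m1 x R
N = rng.multivariate_normal(mu_N, Sigma_N, size=m2)   # m2 x R
X = M @ N.T                                           # rank <= R matrix
```

Note that BPMF imposes no orthonormality constraint on M and N; the BayeSMG factorization differs precisely on this point, as discussed next.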
From a modeling perspective, the key novelty in BayeSMG is that our model requires orthonormality in the factorized matrices, whereas BPMF does not. Such a factorization can be viewed as parametrizing X via its singular value decomposition (SVD). This yields several advantages for our method, which we demonstrate later. First, by explicitly parametrizing row and column subspaces as model parameters, BayeSMG can incorporate prior knowledge on subspaces within the prior specification of such parameters. This prior information is often available in many signal processing and image processing problems, e.g., known signal structure or image features. Second, BayeSMG allows for direct inference on subspaces of X via posterior sampling; these subspaces are of direct interest in many problems, e.g., in sensor network localization (Zhang et al., 2020; an application we tackle later on) and topology identification problems (Eriksson et al., 2012). For subspace inference, our approach avoids performing an additional SVD step for every posterior sample (compared to BPMF), which significantly speeds up inference for high-dimensional problems. Finally, and perhaps most importantly, BayeSMG can leverage this posterior learning on subspaces to provide improved inference on X. Compared to BPMF, our approach can yield faster posterior contraction for unobserved entries when the underlying matrix has low-rank structure, in both numerical simulations and applications. This allows for a more accurate estimate and a more precise uncertainty quantification of X over BPMF.

The BayeSMG model also provides several novel theoretical insights. In Section 4, we show that the maximum a posteriori (MAP) estimator takes the form of a regularized matrix estimator, which provides a connection between the proposed method and existing matrix completion techniques. We also show that the BayeSMG model provides a probabilistic model on matrix coherence (Candès and Recht, 2009). Coherence has been widely used in the matrix completion literature as a theoretical condition for recovery, measuring the "recoverability" of a low-rank matrix. Through this, we then establish an error monotonicity result for BayeSMG, which provides a reassuring check on the UQ performance of the proposed model.

The paper is organized as follows. Section 2 introduces the BayeSMG model. Section 3 presents an efficient posterior sampling algorithm for X via manifold sampling on its subspaces. Section 4 reveals connections between the BayeSMG model and coherence, and its impact on error convergence. Section 5 investigates numerical experiments with synthetic and image data. Section 6 explores a real-world seismic sensor network application. Section 7 concludes with discussions.

2 The BayeSMG model

We first describe the proposed BayeSMG model, which combines the Singular Matrix-variate Gaussian (SMG) distribution with a Bayesian prior specification on matrix subspaces.
Let X ∈ ℝ^{m₁×m₂} be the matrix of interest, and assume X is low-rank, i.e., R := rank(X) ≪ m₁ ∧ m₂. Let [m] := {1, ···, m}. Suppose X is sampled with noise at an index set Ω ⊆ [m₁] × [m₂] of size |Ω| = N, yielding observations:

Y_{i,j} = X_{i,j} + ε_{i,j}, (i,j) ∈ Ω.  (1)

Here, Y_{i,j} is the observation at index (i,j), corrupted by noise ε_{i,j}. In this work, we assume ε_{i,j} ~i.i.d. N(0, η²), i.e., the noise on each entry follows an i.i.d. Gaussian distribution with zero mean and variance η². Further let Y_Ω := (Y_{i,j})_{(i,j)∈Ω} ∈ ℝ^N denote the vector of noisy observations, and let X_{Ω^c} be the vector of unobserved matrix entries, where Ω^c := ([m₁] × [m₂]) \ Ω is the set of unobserved indices.

With this framework, the desired goal of uncertainty quantification (UQ) can be made more concrete. Given noisy observations Y_Ω, we wish to not only estimate the unobserved matrix entries X_{Ω^c}, but also quantify a notion of uncertainty on both observed and unobserved entries (since observation noise is present).

We adopt the following SMG model for the low-rank matrix X, which we assume to be normalized with zero mean.

Definition 1 (SMG model, Definition 2.4.1 of Gupta and Nagar, 1999). Let Z ∈ ℝ^{m₁×m₂} be a random matrix with entries Z_{i,j} ~i.i.d. N(0, σ²) for (i,j) ∈ [m₁] × [m₂]. The random matrix X has a singular matrix-variate Gaussian (SMG) distribution if X =d P_U Z P_V for some choice of projection matrices P_U = UUᵀ and P_V = VVᵀ, where U ∈ ℝ^{m₁×R}, UᵀU = I, V ∈ ℝ^{m₂×R}, VᵀV = I and R < m₁ ∧ m₂. We will denote this as X ~ SMG(P_U, P_V, σ², R).

In other words, a realization from the SMG distribution can be obtained by first (i) simulating a matrix Z from a Gaussian ensemble with variance σ², i.e., a matrix with i.i.d. N(0, σ²) entries, then (ii) performing a left and right projection of Z using the projection matrices P_U and P_V. Recall that the projection operator P_U = UUᵀ ∈ ℝ^{m₁×m₁} maps a vector in ℝ^{m₁} to its orthogonal projection on the R-dimensional subspace U spanned by the columns of U. By performing this projection, the resulting matrix X = P_U Z P_V can be shown to be of rank R < m₁ ∧ m₂, with its row and column spaces corresponding to the subspaces U and V of P_U and P_V. The matrix X also lies in the space T := ⋃_{u_k ∈ U, v_k ∈ V} span({u_k v_kᵀ}_{k=1}^R). With a small choice of R, this provides a flexible probabilistic model for the low-rank matrix X.

The SMG distribution provides several appealing properties for modeling low-rank matrices. First, it provides a prior modeling framework on the matrix X involving its row and column subspaces U and V. It is known from Chikuse (2012) that, for each projection operator P ∈ ℝ^{m×m} of rank R, there exists a unique R-dimensional hyperplane (or an R-plane) in ℝ^m containing the origin which corresponds to the image of such a projection. This connects the space of rank-R projection matrices and the Grassmann manifold G_{R,m−R}, the space of R-planes in ℝ^m. Viewed this way, the projection matrices parametrizing X ~ SMG(P_U, P_V, σ², R) encode useful information on the row and column spaces of X. Second, since the projection of a Gaussian random vector is still Gaussian, the left-right projection of the Gaussian ensemble Z results in each entry of X being Gaussian-distributed as well. This is useful for deriving a UQ property of the BayeSMG model.
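As a quick illustration of Definition 1, the following NumPy sketch draws one SMG realization by projecting a Gaussian ensemble; the dimensions and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, R, sigma2 = 8, 8, 2, 1.0

# Random orthonormal frames U (m1 x R) and V (m2 x R) via QR factorization.
U, _ = np.linalg.qr(rng.standard_normal((m1, R)))
V, _ = np.linalg.qr(rng.standard_normal((m2, R)))
P_U, P_V = U @ U.T, V @ V.T               # rank-R projection matrices

# SMG draw: left and right projection of a Gaussian ensemble Z.
Z = np.sqrt(sigma2) * rng.standard_normal((m1, m2))
X = P_U @ Z @ P_V                          # X ~ SMG(P_U, P_V, sigma2, R)

print(np.linalg.matrix_rank(X))            # R (= 2), almost surely
```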
We now show several distributional properties of the SMG model:

Lemma 2 (Distributional properties of SMG). Let X ~ SMG(P_U, P_V, σ², R), with P_U ∈ ℝ^{m₁×m₁}, P_V ∈ ℝ^{m₂×m₂}, σ² > 0 and R < m₁ ∧ m₂ known. Then:

(a) The density of X is given by

f(X) = (2πσ²)^{−R²/2} etr{−(1/(2σ²)) (X P_V)ᵀ(P_U X)}, X ∈ T,  (2)

where etr(·) := exp{tr(·)}.

(b) Consider the block decomposition of P_V ⊗ P_U:

P_V ⊗ P_U = ( (P_V ⊗ P_U)_Ω          (P_V ⊗ P_U)_{Ω,Ω^c}
              (P_V ⊗ P_U)ᵀ_{Ω,Ω^c}   (P_V ⊗ P_U)_{Ω^c} ).  (3)

Conditional on the observed noisy entries Y_Ω, the unobserved entries X_{Ω^c} follow the distribution [X_{Ω^c} | Y_Ω] ~ N(X^P_{Ω^c}, Σ^P_{Ω^c}). Here, γ = η²/σ², R_N(Ω) := (P_V ⊗ P_U)_Ω ∈ ℝ^{N×N}, and

X^P_{Ω^c} := (P_V ⊗ P_U)ᵀ_{Ω,Ω^c} [R_N(Ω) + γI]^{−1} Y_Ω,
Σ^P_{Ω^c} := σ² { (P_V ⊗ P_U)_{Ω^c} − (P_V ⊗ P_U)ᵀ_{Ω,Ω^c} [R_N(Ω) + γI]^{−1} (P_V ⊗ P_U)_{Ω,Ω^c} }.  (4)

(c) Conditional on the observed noisy entries Y_Ω, the observed entries X_Ω follow the distribution [X_Ω | Y_Ω] ~ N(X^P_Ω, Σ^P_Ω), where

X^P_Ω := (P_V ⊗ P_U)_Ω [R_N(Ω) + γI]^{−1} Y_Ω,
Σ^P_Ω := σ² { (P_V ⊗ P_U)_Ω − (P_V ⊗ P_U)_Ω [R_N(Ω) + γI]^{−1} (P_V ⊗ P_U)_Ω }.  (5)
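For intuition on Lemma 2(b), here is a small NumPy sketch of the conditional mean and covariance in (4). It assumes the observed entries are given as flat indices into vec(X) in column-major order, so that they align with the Kronecker-product covariance; this is an illustrative implementation, not the authors' code.

```python
import numpy as np

def smg_conditional(P_U, P_V, Y_obs, obs_idx, sigma2, eta2):
    """Conditional mean and covariance of X_{Omega^c} given Y_Omega, per (4).

    obs_idx: flat indices of observed entries in vec(X), column-major order.
    """
    m1, m2 = P_U.shape[0], P_V.shape[0]
    K = np.kron(P_V, P_U)                          # covariance of vec(X), up to sigma2
    mis_idx = np.setdiff1d(np.arange(m1 * m2), obs_idx)
    gamma = eta2 / sigma2
    R_N = K[np.ix_(obs_idx, obs_idx)]              # (P_V kron P_U)_Omega
    C = K[np.ix_(obs_idx, mis_idx)]                # (P_V kron P_U)_{Omega, Omega^c}
    A = np.linalg.inv(R_N + gamma * np.eye(len(obs_idx)))
    mean = C.T @ A @ Y_obs                                      # X^P_{Omega^c}
    cov = sigma2 * (K[np.ix_(mis_idx, mis_idx)] - C.T @ A @ C)  # Sigma^P_{Omega^c}
    return mean, cov
```

Plugging subspace point estimates into this computation yields exactly the plug-in intervals examined below.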
Remark. Lemma 2 reveals two key properties of the SMG model. First, prior to observing data, part (a) shows that the low-rank matrix X lies on the space T, and follows a degenerate multivariate Gaussian distribution with mean zero and covariance matrix σ²(P_V ⊗ P_U), where ⊗ is the Kronecker product. Second, after observing the noisy entries Y_Ω, part (b) shows that the conditional distribution of X_{Ω^c} (the unobserved entries in X) given Y_Ω is still multivariate Gaussian, with closed-form expressions for its mean vector X^P_{Ω^c} and covariance matrix Σ^P_{Ω^c} in (4).

Lemma 2 thus provides a closed-form posterior distribution for the low-rank matrix X after observing the noisy observations Y_Ω. This provides a potential way of computing confidence intervals on each entry of X, assuming the underlying row and column subspaces U and V are known. Of course, in practice, such subspaces are never known with certainty. One solution might be to plug in point estimates of U and V (estimated from data) within the predictive equations in Lemma 2. We investigate the efficacy of this plug-in approach via a simple numerical example.

The simulation set-up is as follows. Let m₁ = m₂ = m = 8 be the row and column dimensions of the matrix, and let R = 2 be its rank. We first simulate two random orthonormal matrices U and V of size m × R, via a truncated SVD of an m × m matrix with i.i.d. U[0,1] entries.
With P_U = UUᵀ and P_V = VVᵀ, the "true" low-rank matrix is then simulated from the SMG model X ~ SMG(P_U, P_V, σ² = 1, R). Finally, noisy observations are sampled via (1) with fixed noise variance η². In total, 36 entries are observed (50% of total entries), with such entries chosen uniformly at random. From this, we can obtain point estimates of the subspaces U and V, by first estimating X via nuclear norm minimization (Candès and Plan, 2010), a standard method for matrix completion, and then taking the row and column subspaces of this matrix estimate via an SVD. These subspace estimates are then plugged into the expressions in Lemma 2 for UQ.

Figure 1: Plotted are the point estimates (blue points) and 95% CIs (blue intervals) for each matrix entry, ordered by increasing point estimates. Red points mark the true matrix values. Plot (a) makes use of the CIs constructed via Lemma 2 with plug-in subspace estimates; plot (b) uses the posterior predictive intervals from the proposed BayeSMG model.

Figure 1(a) plots the point estimates and 95% plug-in confidence intervals (CIs) for each matrix entry using Lemma 2, with the corresponding true values marked in red. We see that these intervals provide poor coverage performance, since many of the true matrix entries are not covered by these intervals. Indeed, the coverage ratio for these plug-in CIs is only 56%, meaning only around half of the confidence intervals cover the true entries. This poor coverage suggests that the CI approach with plug-in subspace estimates can significantly underestimate the underlying uncertainty of point estimates, which is unsurprising since uncertainty in subspace estimation is not incorporated. Figure 1(b) plots the point estimates and 95% posterior predictive intervals using the proposed BayeSMG method, which accounts for subspace uncertainty by assigning hierarchical priors on the subspaces U and V of the SMG model. We see that the proposed approach yields much better coverage: the 95% intervals, which are now slightly wider, cover the true matrix entries well. The resulting coverage ratio is now 100%, which is slightly higher than the nominal coverage rate of 95%, but much closer to this rate than the earlier plug-in approach. This shows that the proposed method can provide better uncertainty quantification of X via a fully-Bayesian model specification on matrix subspaces.

We now present the hierarchical specification for the proposed Bayesian SMG model, or BayeSMG for short. We begin by first introducing the matrix von Mises-Fisher (vMF) distribution, which will serve as the prior model for the row and column orthonormal frames U and V. We then present a Gibbs sampling algorithm which makes use of a reparametrization of the SMG model for efficient posterior sampling.

The matrix von Mises-Fisher distribution (Khatri and Mardia, 1977; Mardia and Jupp, 2009) provides a useful class of distributions on the row and column frames, which lie on a so-called Stiefel manifold. A Stiefel manifold (Chikuse, 2012) consists of all orthonormal R-frames in ℝ^m; this is denoted as V_{R,m} hereafter. The matrix vMF distribution assumes the following probability density function on V_{R,m}:

p(W; m, R, F) = [₀F₁(; m/2; FᵀF/4)]^{−1} etr(FᵀW), W ∈ V_{R,m},  (6)

where ₀F₁(; ·; ·) is the hypergeometric function, and F ∈ ℝ^{m×R} is the concentration matrix. We denote this distribution as W ~ MF(m, R, F). The matrix vMF distribution provides conditionally conjugate priors for a wide range of multivariate models, including for cluster analysis (Gopal and Yang, 2014) and factor models (Hoff, 2013). One appeal of this class of distributions is that it can be efficiently sampled: Hoff (2009) proposed a rejection sampling algorithm which sequentially samples each column of the matrix W. Recently, Jauch et al. (2020) presented a general simulation framework on Stiefel manifolds using polar expansions; using such an expansion with Hamiltonian Monte Carlo (Girolami and Calderhead, 2011) allows for an order of magnitude improvement in sampling efficiency over competing MCMC methods. We will leverage this useful family of priors via the following reparametrization of the BayeSMG model.

The following proposition gives a nice reformulation of the SMG model under uniform subspace priors on U and V:

Proposition 3 (SVD of BayeSMG). Suppose X ~ SMG(P_U, P_V, σ², R), with independent uniform priors P_U ~ U(G_{R,m₁−R}), P_V ~ U(G_{R,m₂−R}), and fixed σ² and R. Let X = UDVᵀ be the SVD of X, with singular values diag(D) = (d_k)_{k=1}^R not necessarily in decreasing order. Then:

1. The singular vectors U and V follow independent priors MF(m₁, R, 0) and MF(m₂, R, 0), respectively.
2. The singular values diag(D) = (d_k)_{k=1}^R follow the repulsed normal distribution RN(0, σ²), with density:

f(d₁, ···, d_R) = (1/Z_R) (2πσ²)^{−R/2} exp{−(1/(2σ²)) Σ_{k=1}^R d_k²} ∏_{1≤k<l≤R} |d_k² − d_l²|,  (7)

where Z_R is a normalizing constant.
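The repulsion product ∏_{k<l} |d_k² − d_l²| in (7) pushes singular values apart. Following the construction in the proof of Proposition 3, this density can be checked empirically by taking SVDs of Gaussian ensembles; the snippet below is a quick sanity check with arbitrary settings.

```python
import numpy as np

rng = np.random.default_rng(2)
R, sigma2, n_draws = 3, 1.0, 10000

# Singular values of an R x R Gaussian ensemble follow the repulsed normal (7).
sv = np.array([
    np.linalg.svd(np.sqrt(sigma2) * rng.standard_normal((R, R)), compute_uv=False)
    for _ in range(n_draws)
])
print(sv.mean(axis=0))   # repulsion: the mean singular values are well separated
```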
Table 1: Hierarchical specification of the BayeSMG model.

Model                                    Distribution
Observations [Y_Ω | X, η²]:              Y_{i,j} ~indep. N(X_{i,j}, η²), (i,j) ∈ Ω
Low-rank matrix [X | P_U, P_V, σ²]:      X ~ SMG(P_U, P_V, σ², R)
  (equivalently) [X | U, V, σ²]:         X = UDVᵀ, diag{D} ~ RN(0, σ²)
Priors [U, V, σ², η²] = [U][V][σ²][η²]:
  Matrix frames:                         [U] ~ MF(m₁, R, F₁), [V] ~ MF(m₂, R, F₂)
  Matrix variance:                       [σ²] ~ IG(α_σ, β_σ)
  Noise variance:                        [η²] ~ IG(α_η, β_η)

3 Posterior sampling

It can then be shown (see Appendix A.6 for a full derivation) that the full conditional distributions of U, D, V, σ² and η² take the form:

[U | D, V, Y, σ², η²] ~ MF(m₁, R, YVD/η² + F₁),
[V | D, U, Y, σ², η²] ~ MF(m₂, R, YᵀUD/η² + F₂),
[D | U, V, Y, σ², η²] ~ RN(σ² diag(UᵀYV)/(η² + σ²), η²σ²/(η² + σ²)),
[σ² | U, D, V, Y, η²] ~ IG(α_σ + R/2, β_σ + tr(D²)/2),
[η² | U, D, V, Y, σ²] ~ IG(α_η + m₁m₂/2, β_η + ‖Y − UDVᵀ‖²_F/2).  (10)

One can then perform the above full conditional updates cyclically, for posterior sampling on [Θ | Y] via Gibbs sampling. As mentioned previously, there are efficient sampling algorithms for the matrix vMF distribution (Hoff, 2009; Jauch et al., 2020), which enable efficient full conditional sampling on U and V. The full conditional distribution of D follows the aforementioned repulsed normal distribution with a location shift of μ (denoted as RN(μ, δ²)), with density:

(1/Z_R) (2πδ²)^{−R/2} exp{−(1/(2δ²)) Σ_{k=1}^R (d_k − μ_k)²} ∏_{1≤k<l≤R} |d_k² − d_l²|.  (11)
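The sweep below is a schematic Python skeleton of the full conditional updates in (10); sample_matrix_vmf and sample_repulsed_normal are user-supplied placeholders (e.g., the rejection sampler of Hoff (2009) and a Metropolis step for the d_k's), not calls to an existing library.

```python
import numpy as np

def gibbs_sweep(Y, U, D, V, sigma2, eta2, F1, F2,
                a_s, b_s, a_e, b_e, rng,
                sample_matrix_vmf, sample_repulsed_normal):
    """One cycle of the full conditional updates in (10)."""
    m1, m2 = Y.shape
    R = D.shape[0]
    U = sample_matrix_vmf(Y @ V @ D / eta2 + F1)            # [U | rest]
    V = sample_matrix_vmf(Y.T @ U @ D / eta2 + F2)          # [V | rest]
    mu = sigma2 * np.diag(U.T @ Y @ V) / (eta2 + sigma2)    # [D | rest]
    delta2 = eta2 * sigma2 / (eta2 + sigma2)
    D = np.diag(sample_repulsed_normal(mu, delta2))
    resid = Y - U @ D @ V.T                                 # conjugate IG updates
    sigma2 = 1.0 / rng.gamma(a_s + R / 2, 1.0 / (b_s + np.trace(D @ D) / 2))
    eta2 = 1.0 / rng.gamma(a_e + m1 * m2 / 2,
                           1.0 / (b_e + np.sum(resid ** 2) / 2))
    return U, D, V, sigma2, eta2
```

Here an IG(a, b) draw is taken as the reciprocal of a Gamma(a, scale = 1/b) draw; the missing entries of Y are assumed already imputed, as in Algorithm 1 below.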
Algorithm 1 BayeSMG(Y_Ω, R, F₁, F₂, α_σ, β_σ, α_η, β_η): Gibbs sampler for BayeSMG

Initialization:
• Complete X_[0] from Y_Ω via nuclear-norm minimization.
• Initialize [U_[0], D_[0], V_[0]] ← svd(X_[0]), with initial variances σ²_[0], η²_[0] > 0.

Gibbs sampling (T total samples):
for t = 1, ..., T do
  • Set X_[t] ← U_[t−1] D_[t−1] Vᵀ_[t−1].
  • Impute missing entries Y_{Ω^c} by sampling Y_{i,j} ~ X_[t],i,j + N(0, η²_[t−1]), (i,j) ∈ Ω^c.
  • Sample U_[t] ~ MF(m₁, R, Y V_[t−1] D_[t−1]/η²_[t−1] + F₁).
  • Sample V_[t] ~ MF(m₂, R, Yᵀ U_[t] D_[t−1]/η²_[t−1] + F₂).
  • Sample D_[t] ~ RN(σ²_[t−1] diag(Uᵀ_[t] Y V_[t]) / (η²_[t−1] + σ²_[t−1]), η²_[t−1] σ²_[t−1] / (η²_[t−1] + σ²_[t−1])).
  • Sample σ²_[t] ~ IG(α_σ + R/2, β_σ + tr(D²_[t])/2).
  • Sample η²_[t] ~ IG(α_η + m₁m₂/2, β_η + ‖Y − U_[t] D_[t] Vᵀ_[t]‖²_F/2).
end for

Output: Return posterior samples {(X_[t], U_[t], D_[t], V_[t], σ²_[t], η²_[t])}_{t=1}^T.

For a 256 × 256 matrix, the sampler takes around 1 minute to generate T = 1000 samples on a standard laptop computer (Intel i7 CPU and 16GB RAM), which is quite efficient given the size of the matrix. We will report computation times for larger matrices in the numerical studies later.

4 Theoretical insights

We now provide some theoretical insights on the BayeSMG model. We first discuss an interesting link between the maximum a posteriori (MAP) estimator and regularized estimators in the literature, then present a connection between model uncertainty from the BayeSMG model and coherence, which is then used to prove an error monotonicity result on uncertainty quantification.
The following lemma reveals a connection between the BayeSMG model and existing completion methods:
Lemma 4 (MAP estimator). Assume the BayeSMG model in Table 1, with F₁ = F₂ = 0, η² and σ² fixed, and a uniform prior on rank R. Conditional on Ω, the MAP estimator for X becomes

argmin_{X ∈ ℝ^{m₁×m₂}} ( ‖Y_Ω − X_Ω‖₂²/η² + log(2πσ²) rank²(X) + ‖X‖²_F/σ² ),  (12)

where ‖X‖_F = √(Σ_{i,j} X²_{i,j}) is the Frobenius norm of X.

The MAP estimator X̃ in (12) connects the proposed model with existing deterministic matrix completion methods (see Davenport and Romberg (2016) and references therein). Consider the following approximation to the MAP formulation (12). Treating log(2πσ²) rank²(X) as a Lagrange multiplier term, and replacing the rank function rank(X) by the nuclear norm ‖X‖_* (its tightest convex relaxation; Keshavan et al., 2010), the optimization in (12) becomes:

argmin_{X ∈ ℝ^{m₁×m₂}} ‖Y_Ω − X_Ω‖₂² + λ{α‖X‖_* + (1 − α)‖X‖²_F},  (13)

for some choice of λ > 0 and α ∈ (0, 1). This is akin to the elastic net estimator (Zou and Hastie, 2005) from linear regression, here for noisy matrix completion.

To see the connection between the MAP estimator X̃ and existing matrix completion methods, set α = 1 in (13). The problem then reduces to:

X̂ = argmin_{X ∈ ℝ^{m₁×m₂}} Σ_{(i,j)∈Ω} (Y_{i,j} − X_{i,j})² + λ‖X‖_*,  (14)

which is precisely the nuclear-norm formulation widely used for matrix completion (Candès and Recht, 2009; Candès and Tao, 2010; Recht, 2011).
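The special case (14) can be solved by proximal gradient descent with singular-value soft-thresholding (cf. Cai et al., 2010). The sketch below is one illustrative implementation of that idea, not the exact solver used in this paper.

```python
import numpy as np

def svt_complete(Y, mask, lam, n_iter=200):
    """Proximal gradient for (14): min ||P_Omega(Y - X)||^2 + lam * ||X||_*.

    Y: observed matrix (arbitrary values where unobserved);
    mask: boolean array of observed entries. Step size 1/2 matches the
    Lipschitz constant of the squared-error term.
    """
    X = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        G = X - mask * (X - Y)                    # gradient step on the fit term
        Uf, s, Vt = np.linalg.svd(G, full_matrices=False)
        s = np.maximum(s - lam / 2.0, 0.0)        # soft-threshold singular values
        X = (Uf * s) @ Vt
    return X
```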
Consider next the following definition of subspace coherence from Candès and Recht (2009), ignoring scaling factors:

Definition 5 (Coherence, Definition 1.2 of Candès and Recht, 2009). Let U ∈ G_{R,m−R} be an R-plane in ℝ^m, and let P_U be the orthogonal projection onto U. The coherence of subspace U with respect to the i-th basis vector e_i is defined as μ_i(U) := ‖P_U e_i‖₂², and the coherence of U is defined as μ(U) = max_{i=1,...,m} μ_i(U).

Figure 2: A visualization of near-maximal coherence (red basis vector) and minimal coherence (black basis vector) for subspace U.

In words, coherence measures how correlated a subspace U is with the basis vectors {e_i}_{i=1}^m. A large μ_i(U) suggests that U is highly correlated with the i-th basis vector e_i, in that the projection of e_i onto U preserves much of its original length; a small value of μ_i(U) suggests that U is nearly orthogonal to e_i, so a projection of e_i onto U loses most of its length. Figure 2 visualizes these two cases using the projection of three basis vectors onto a two-dimensional subspace U. Note that the projection of the red vector onto U retains nearly unit length, so U has near-maximal coherence for this basis vector. The projection of the black vector onto U results in a considerable length reduction, so U has near-minimal coherence for this basis vector. The overall coherence of U, μ(U), is largely due to the high coherence of the red basis vector.

In the matrix completion literature, coherence is widely used to quantify the recoverability of a low-rank matrix X. Here, the same notion of coherence arises in a different context within the proposed model's uncertainty quantification. Lemma 2 provides the basis for this connection. Consider first the case where no matrix entries have been observed. From Lemma 2(a), vec(X) follows the degenerate Gaussian distribution N{0, σ²(P_V ⊗ P_U)}. The variance of the (i,j)-th entry in X can then be shown to be:

Var(X_{i,j}) = σ² (e_iᵀ P_U e_i)(e_jᵀ P_V e_j) = σ² μ_i(U) μ_j(V).  (15)

Hence, before observing data, the model uncertainty for entry X_{i,j} is proportional to the product of coherences for the row and column spaces U and V, with respect to the i-th and j-th basis vectors. Put another way, BayeSMG assigns greater variation to matrix entries with higher subspace coherence in either their row or column index. This is quite appealing in view of the original role of coherence in matrix completion, where larger row (or column) coherences imply greater "spikiness" for entries; our framework accounts for this by assigning greater model uncertainty to such entries.

Consider next the case where noisy entries Y_Ω have been observed. Let us adopt a slightly generalized notion of coherence:

Definition 6 (Cross-coherence). The cross-coherence of subspace U with respect to the basis vectors e_i and e_{i′} is defined as ν_{i,i′}(U) = e_{i′}ᵀ P_U e_i.

The cross-coherence ν_{i,i′}(U) quantifies how correlated the basis vectors e_i and e_{i′} are after a projection onto U. For example, in Figure 2, the pair of red/blue projected basis vectors have negative cross-coherence for U, whereas the pair of blue/black projected vectors have positive cross-coherence. When i = i′, this cross-coherence reduces to the original coherence in Definition 5.

Define now the cross-coherence vector ν_i(U) = [ν_{i,i_n}(U)]_{n=1}^N ∈ ℝ^N, where again Ω = {(i_n, j_n)}_{n=1}^N.
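Both coherence quantities are simple functions of the projection matrix P_U; a minimal sketch:

```python
import numpy as np

def coherences(U):
    """Unscaled coherences mu_i(U) = ||P_U e_i||^2 (Definition 5) and the
    cross-coherence matrix nu[i_, i] = e_i_^T P_U e_i (Definition 6)."""
    P = U @ U.T                    # orthogonal projection onto span(U)
    mu = np.diag(P).copy()         # e_i^T P_U e_i = ||P_U e_i||^2 (P idempotent)
    return mu, P

# Example: a random 2-plane in R^5.
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))
mu, nu = coherences(U)
print(mu.max())                    # overall coherence mu(U)
```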
From equation (4) in Lemma 2, the conditional variance of entry X_{i,j} for an unobserved index (i,j) ∈ Ω^c becomes:

Var(X_{i,j} | Y_Ω) = σ² μ_i(U) μ_j(V) − σ² ν_{i,j}ᵀ [R_N(Ω) + γI]^{−1} ν_{i,j},  (16)

where ν_{i,j} := ν_i(U) ∘ ν_j(V), and ∘ denotes the entry-wise (Hadamard) product. The expression in (16) yields a nice interpretation. From a UQ perspective, the first term in (16), σ² μ_i(U) μ_j(V), is simply the unconditional uncertainty for entry X_{i,j}, prior to observing data. The second term, σ² ν_{i,j}ᵀ [R_N(Ω) + γI]^{−1} ν_{i,j}, can be viewed as the reduction in uncertainty after observing the noisy entries Y_Ω. This uncertainty reduction is made possible by the correlation structure imposed on X via the SMG model. Equation (16) also yields valuable insight in terms of subspace correlation: the first term can be seen as the joint correlation between (i) the row space U and row index i, and (ii) the column space V and column index j, prior to any observations, while the second term can be viewed as the portion of this correlation explained by the observed indices Ω.

This link between coherence and uncertainty then sheds insight on expected error decay. This is based on the following proposition:
Proposition 7 (Variance reduction). Suppose X follows the BayeSMG model in Table 1, with F₁ = F₂ = 0 and fixed σ² and η². Let Y_Ω contain the noisy entries at Ω ⊆ [m₁] × [m₂], and let Y_{Ω∪(i,j)} contain an additional noisy observation at (i,j) ∈ Ω^c. For any index (k,l) ∈ [m₁] × [m₂], the expected variance of X_{k,l} can be decomposed as

E_{U,V}[Var(X_{k,l} | Y_{Ω∪(i,j)})] = E_{U,V}[Var(X_{k,l} | Y_Ω)] − E_{U,V}[ Cov²(X_{k,l}, X_{i,j} | Y_Ω) / (Var(X_{i,j} | Y_Ω) + η²) ],  (17)

where Var(X_{k,l} | Y_{Ω∪(i,j)}) takes the form in (16), and

Cov(X_{i,j}, X_{k,l} | Y_Ω) = σ² { ν_{i,k}(U) ν_{j,l}(V) − ν_{i,j}ᵀ [R_N(Ω) + γI]^{−1} ν_{k,l} }.
Remark. Proposition 7 shows, given observed indices Ω, the reduction in uncertainty (as measured by variance) for an unobserved entry X_{k,l}, after observing an additional entry at index (i,j). The last term in (17) quantifies this reduction, and can be interpreted as follows. For an unobserved index (k,l) ∉ Ω ∪ (i,j), the amount of uncertainty reduction is related to a "signal-to-noise" ratio, where the signal is the conditional squared covariance between the "unobserved" entry X_{k,l} and the "to-be-observed" entry X_{i,j}, and the noise is the conditional variance of the "to-be-observed" entry.

The insight of error monotonicity then follows:

Corollary 1 (Error monotonicity). Suppose X follows the BayeSMG model in Table 1, with F₁ = F₂ = 0 and fixed σ² and η². Let [(i_n, j_n)]_{n=1}^{m₁m₂} ⊆ [m₁] × [m₂] be an arbitrary sampling sequence, where (i_n, j_n) ≠ (i_{n′}, j_{n′}) for n ≠ n′, and let Ω_N := {(i_n, j_n)}_{n=1}^N denote the first N sampled indices. Let X^P_{k,l} be the (k,l)-th entry of the conditional mean in (4). Define the error term

ε_N(k,l) := E_X[ (X_{k,l} − X^P_{k,l})² | Y_{Ω_N} ], (k,l) ∈ [m₁] × [m₂].

Then ε_{N+1}(k,l) ≤ ε_N(k,l) for any (k,l) ∈ [m₁] × [m₂] and N = 1, 2, ···.
Remark. This corollary shows that, for any sampling sequence and any index (k,l), the expected squared error in estimating X_{k,l} with the conditional mean X^P_{k,l} is monotonically decreasing as more samples are collected. This is intuitive, since one expects to gain greater accuracy and precision on the unknown matrix X as more entries are observed. The fact that the proposed model quantifies this monotonicity property provides a reassuring check on our UQ approach.

5 Numerical experiments
We now investigate the performance of the proposed BayeSMG method in numerical simulations and compare it to the aforementioned BPMF method (Salakhutdinov and Mnih, 2008), a popular Bayesian matrix completion method in the literature.
For the first numerical study, we assume the true matrix X is generated from the SMG distribution, i.e., X ~ SMG(P_U, P_V, σ² = 1, R = 2), with uniformly sampled subspaces U and V. The entries of X are assumed to be missing-at-random, and the observed entries are contaminated by noise with variance η², which we presume to be known. The prior specifications are as follows. For BayeSMG, we assign a weakly-informative Inverse-Gamma prior on σ², with non-informative manifold hyperparameters F₁ = F₂ = 0. For BPMF, we assign the recommended weak Inverse-Wishart priors on the covariance matrices, Σ_M ~ IW(R = 2, I), Σ_N ~ IW(R = 2, I). We then ran 10,000 MCMC iterations for both methods, with the first 2,000 samples taken as burn-in. Standard MCMC convergence checks were performed via trace plot inspection (see Figure 3(b)) and the Gelman-Rubin statistic (Gelman and Rubin, 1992).

We employ two metrics to compare the posterior contraction and UQ performance of the two methods. The first is the Mean Frobenius Error (MFE), defined as

MFE = (1/T) Σ_{t=1}^T ‖X − X_[t]‖_F.

The MFE calculates the Frobenius norm of the difference between the posterior predictive samples {X_[t]}_{t=1}^T and the true matrix X. A smaller MFE suggests better recovery and faster posterior contraction for the desired matrix X. The second metric is the Mean Spectral Distance (MSD), defined as

MSD = (1/T) Σ_{t=1}^T d_S(U, U_[t]), d_S(U, U′) := √(1 − ‖UᵀU′‖₂²),

where U (or U′) is any frame of subspace U (or U′). The MSD calculates the spectral distance (Calderbank et al., 2015) between the posterior samples {U_[t]}_{t=1}^T for the row subspaces (equivalently, {V_[t]}_{t=1}^T for the column subspaces) and the true row subspace U (equivalently, the true column subspace V). A smaller MSD suggests better recovery and posterior contraction for matrix subspaces.

The first two plots in Figure 3(a) visualize the true matrix X and the observed Y_Ω, with 20% of the entries observed uniformly-at-random. The next two plots visualize the posterior mean estimates for X using BayeSMG and BPMF. We see that the BayeSMG method provides a visually better recovery of the matrix X, with a lower posterior MFE than the BPMF method. The first two plots in Figure 3(b) visualize the true and estimated row spaces using BayeSMG and BPMF. We again see that BayeSMG gives a visually better recovery of the row space of X (the same holds for its column space), with a lower posterior MSD than BPMF. The next two plots show the trace plots for the first row coherence μ₁ and the first matrix entry X_{1,1}, which is unobserved. We see that the posterior samples for BayeSMG concentrate tightly around the true coherence and matrix values, whereas the posterior samples for BPMF fluctuate much more around the truth. The above observations suggest that, when the true matrix is generated from the assumed prior model, BayeSMG yields much faster posterior contraction than BPMF, leading to more accurate and precise estimates of X and its subspaces. We show next, in the following image recovery experiments and seismic sensor application, that the proposed method provides similar improvements over BPMF via learning and integrating subspace information for recovery.
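For reference, both metrics above are straightforward to compute from stored posterior samples; a minimal sketch, using the spectral norm for d_S as defined above:

```python
import numpy as np

def mfe(X_true, X_samples):
    """Mean Frobenius Error over posterior samples X_[t]."""
    return np.mean([np.linalg.norm(X_true - Xt, 'fro') for Xt in X_samples])

def msd(U_true, U_samples):
    """Mean Spectral Distance d_S(U, U') = sqrt(1 - ||U^T U'||_2^2)."""
    return np.mean([
        np.sqrt(max(0.0, 1.0 - np.linalg.norm(U_true.T @ Ut, 2) ** 2))
        for Ut in U_samples
    ])
```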
Image inpainting is a fundamental problem in image processing (Bertalmio et al., 2000; Cai et al., 2010), which aims to recover and reconstruct images with missing pixels and noise corruption. This is useful in a variety of applications where image data are susceptible to unreliable data transmission and scratches. Take, for example, the problem of solar imaging (Xie et al., 2012). When a satellite transmits an image of the sun back to the earth, many pixels will inevitably be lost or corrupted due to instabilities in the transmission process. Furthermore, missing pixels need to be filled in when the image is scaled up. Here, the quantification of image uncertainty can be as important as its recovery, since this UQ provides insight on the quality of recovered image features in different regions. There has been some work on applying (deterministic) matrix completion methods for image inpainting (e.g., Xue et al., 2017), but little has been done on uncertainty quantification. Our method addresses the latter goal.

Figure 3: Recovery and UQ performance for a simulated 25 × 25 matrix. In (a), the four plots show (from left to right) the true matrix X, observations Y_Ω, and the posterior mean estimates from BayeSMG and BPMF. In (b), the left plots visualize the true row space (green) and estimated row space (blue) from BayeSMG and BPMF for the first two dimensions, with posterior MSD calculated. On the right are the trace plots for row coherence μ₁ and an unobserved entry X_{1,1}, for BayeSMG and BPMF, with true values dotted in red.

We consider the aforementioned solar imaging problem, where the matrix X is a 256 × 256 image of a solar flare.
Pixel intensity values are encoded from 0 to 255, representing the pseudo-color used in the images. We then normalize the pixel intensities to have zero mean and unit variance. Half of the pixels in this image are observed uniformly at random, then corrupted by Gaussian noise with variance η². We note that, for this problem, the recovery and UQ of the row and column subspaces are of interest as well. This is because image features are often represented in the row and column spaces. Here, these subspaces may represent domain-specific, interpretable phenomena, such as different classes of solar flares, certain shapes, and sunspots. Furthermore, human eyes are typically not as sensitive to high-frequency image features; therefore, a few SVD components can often capture the vital features of an image, making its rank low. For BayeSMG and BPMF, we set the upper limit for rank as R = 40, and perform 1,000 iterations of MCMC, with a burn-in period of 200. As before, MCMC convergence checks were performed via trace plot inspection and standard diagnostics.

Figure 4: Performance comparison between BayeSMG and BPMF on a 256 × 256 solar flare image. The plots (from left to right) show the original image, the partially observed image with noise, the recovered images using BayeSMG and BPMF, and the widths of the entry-wise 95% HPD intervals from BayeSMG and BPMF.

Figure 4 shows the true solar image, its partial observations, and the recovered images using BayeSMG and BPMF via their posterior predictive means, as well as the corresponding uncertainties via the entry-wise 95% highest posterior density (HPD) interval widths. We see that the BayeSMG method provides a much better recovery of the image, with a noticeably lower MFE of 31.0 compared to the BPMF method (350.8). Visually, we see that the BayeSMG recovery captures the key features of the image, e.g., different types of solar flares. The BPMF recovery, on the other hand, loses much of the smaller-scale features and contains significant blocking defects. One plausible explanation is that, when low-rank subspace structure is present in X (as is the case here), the proposed method can better learn and integrate this structure for improved recovery. Furthermore, an inspection of the HPD plots shows that BayeSMG is quite certain of the recovered image, with narrow posterior HPD intervals across the whole matrix. In contrast, BPMF is much more uncertain of its recovery: its entry-wise posterior intervals are considerably wider, particularly for pixels with low intensities. Computation-wise, the posterior sampling for BayeSMG can be performed within one minute on a standard laptop computer (Intel i7 processor with 16GB of RAM), which is quite fast considering the relatively large size of the image.

To demonstrate the scalability of BayeSMG, we consider next a much higher-dimensional image of the Georgia Tech campus. This image is converted to a large gray-scale matrix with 1911 rows, with pixels partially observed and corrupted by Gaussian noise with variance η². For BayeSMG and BPMF, we set the upper limit for rank as R = 30, and run the MCMC sampler for 100 iterations after a burn-in period of 25.

Figure 5: Performance of BayeSMG on recovering a large image of the Georgia Tech campus. The four plots show (from left to right) the original image, the partial observations, the recovered image using BayeSMG, and the widths of the entry-wise 95% HPD intervals from BayeSMG.

Figure 5 shows the true image, its partial observations, and the recovered image from BayeSMG, as well as its corresponding uncertainty.
The MFE of this recovery is 1005.1, which is again noticeably smaller than that for the BPMF recovery (3004.8). We see that the recovered BayeSMG image captures the original image's key features, which again shows that the proposed method can learn and integrate the subspace structure for recovery. As before, BayeSMG is quite certain of this recovery, with narrow posterior HPD intervals over all pixels. Despite this being a much larger image, BayeSMG can again be run on the same standard laptop, albeit with a longer time of 24 minutes. This shows that, in addition to improved recovery performance and reduced uncertainty, the proposed method appears to be quite scalable for high-dimensional matrices as well.
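The entry-wise HPD interval widths reported in Figures 4 and 5 can be estimated from posterior draws via the shortest-interval method; a minimal sketch:

```python
import numpy as np

def hpd_width(samples, level=0.95):
    """Entry-wise width of the approximate HPD interval, from posterior
    draws of shape (T, m1, m2): the shortest interval containing a
    fraction `level` of the sorted draws."""
    T = samples.shape[0]
    k = int(np.ceil(level * T))
    s = np.sort(samples, axis=0)                  # sort draws entrywise
    widths = s[k - 1:] - s[:T - k + 1]            # all intervals holding k draws
    return widths.min(axis=0)                     # shortest such interval
```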
6 Seismic sensor network application

Seismic imaging is widely used for finding oil and natural gas beneath the surface of the earth. Ambient Noise Seismic Imaging (ANSI; Bensen et al., 2007) is a relatively new technique for seismic imaging with great potential. It uses "ambient noises" instead of actively collected signals and is non-invasive to the environment (compared to traditional active imaging techniques). ANSI has proved useful for imaging shallow earth structures; it is based on pairwise cross-correlation functions between signals recorded by seismic sensors, followed by time-frequency analysis. In a recent study (Xu et al., 2019) on the Old Faithful geyser at Yellowstone National Park, 133 sensors were deployed in its vicinity to collect ambient noise signals for investigating geological structures. Figure 6(a) shows the locations of these sensors.

Figure 6: The locations of all 133 sensors near the geyser in Yellowstone National Park. The yellow circles indicate the sensors and the red pentagram indicates the location of the geyser. (a) shows the distribution of all 133 sensors over the region close to the geyser (see Wu et al., 2017 for details); (b) shows the locations of the 12 most significant sensors and their relative directions from each other.

One limitation of ANSI, however, is that many pairwise cross-correlations do not have any useful signals, i.e., the peak in the cross-correlation is not observed, since it is based on weak ambient noises. This missing data then results in missing entries in the cross-correlation matrix. To determine whether a cross-correlation is "missing", we first identify which correlations have an unsatisfactory signal-to-noise ratio, by inspecting the standard deviation ξ outside of the main wave lobe relative to the magnitude of the wave peak g. The correlation is deemed missing if g/ξ < 20. We note that entries of this cross-correlation matrix X are observed with noise, due to background vibrations caused by bubble collapse and boiling water. Here, the noise standard deviation η is estimated from the observed data Y_Ω.

To proceed with the ANSI analysis, one then needs to estimate the missing entries in the cross-correlation matrix X. Bensen et al. (2007) show that such a matrix is indeed low-rank, thus the proposed BayeSMG method can be used for recovering this matrix. We note that, here, the quantification of uncertainty is crucial for estimating geologic structure and sources of activity. With this uncertainty, engineers can better interpret the wave tomography generated from time delay estimates, and identify where estimates are accurate and where they are not. This, in turn, guides the accuracy of downstream analysis, which subsequently provides greater insight on reconstruction quality.

Figure 7: Performance comparison between BayeSMG and BPMF on the ambient noise cross-correlation data matrix. The first plot (from the left) shows the observed entries in the cross-correlation matrix, with missing entries in white. The second plot shows the completed matrix using BayeSMG. The third and fourth plots visualize the widths of the entry-wise 95% HPD intervals from BayeSMG and BPMF.

Figure 7 visualizes the recovery and UQ performance from BayeSMG and BPMF, using an upper limit of R = 15 for rank. We see that BayeSMG yields more precise estimates (i.e., narrower HPD interval widths) compared to BPMF. In particular, when an entire row or column of X is missing, the uncertainties returned by BPMF can be very large, which reduces the usefulness of its recovered entries. On the contrary, the proposed BayeSMG method, by leveraging subspace information, can yield more precise inference on these missing rows and columns. One underlying reason is that the BayeSMG approach explicitly integrates subspace modeling for recovery and UQ. From the visualization of Y_Ω in Figure 7, we see that there are clear bright stripes on the left and top edges of the plot, which strongly suggests the presence of low-rank subspaces in X. This is not too surprising, since it is known that several sensors are highly correlated due to their relative locations. BayeSMG appears to exploit this subspace structure to provide more certain predictions, whereas BPMF yields much higher uncertainty in inference, particularly in rows and columns with little to no observations. While the ground truth for the full matrix X is not known for this sensor network, we would expect from the previous experiments that BayeSMG yields improved recovery performance over BPMF, particularly in rows and columns with few observations.

With posterior samples on X in hand, we can then use its subspace information to locate (or match) a few sensors that are most significantly correlated with each other. This sensor matching is useful in seismology studies, since it can be used to estimate the dimension and the capacity of the hydrothermal reservoir of the geyser (Wu et al., 2017). We first perform an SVD step on the posterior mean X̂, and find the singular vector with the largest singular value. We then inspect all the rows of the matrix X̂, and select the rows most aligned with this vector, as sketched below. These rows are then traced back to locate the most significantly correlated sensors. Figure 6(b) shows the locations of the 12 most correlated sensors and their relative directions from each other.
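A minimal sketch of this sensor-matching step, under the assumption that rows of the posterior mean are scored by their cosine alignment with the leading right singular vector:

```python
import numpy as np

def match_sensors(X_hat, n_sensors=12):
    """Rows of the posterior mean most aligned with the leading singular vector."""
    _, _, Vt = np.linalg.svd(X_hat)
    lead = Vt[0]                                   # direction of largest singular value
    norms = np.maximum(np.linalg.norm(X_hat, axis=1), 1e-12)
    align = np.abs(X_hat @ lead) / norms           # cosine alignment of each row
    return np.argsort(-align)[:n_sensors]          # indices of most-aligned rows
```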
The identified sensors are25mong the closest to the Old Faithful geyser, and their related observations aredominated by the highly fractured and porous geological structure undergroundadjacent to the geyser. Using the readings from these sensors, researchers arethen able to identify a different pattern of waveform in tremor signals, suggestingdifferent geological structures underneath the geyser. We proposed a new BayeSMG model for uncertainty quantification in low-rank matrix completion. A key novelty of the BayeSMG model is that itparametrizes the unknown matrix X via manifold prior distributions on its rowand column subspaces. This Bayesian subspace parametrization allows for directposterior inference on matrix subspaces, which can then be used for improvedmatrix recovery. We introduced a Gibbs sampler for posterior inference, whichprovides efficient posterior sampling even for matrices with dimensions on theorder of thousands. We also showed that the BayeSMG yields a probabilisticinterpretation for subspace coherence, which can be used to show an errormonotonicity result for UQ. We then showed the effective recovery and UQperformance of BayeSMG on simulated data, image data, and an applicationfor seismic sensor network recovery. For future work, it would be interestingto design locations for observations to control the uncertainties, exploring theconnection with experimental design literature, e.g., integrated mean-squarederror designs (Sacks et al., 1989) or distance-based designs (Mak and Joseph,2018). Acknowledgment
Henry Shaowu Yuchi and Yao Xie are supported by NSF CCF-1650913, NSF DMS-1938106, and NSF DMS-1830210. Simon Mak is supported by NSF CSSI Frameworks grant 2004571. The data and pictures used in the seismic sensor network recovery were provided by Sin-Mei Wu and Fan-Chi Lin.

References
Alquier, P., Cottet, V., Chopin, N., and Rousseau, J. (2014). Bayesian matrix completion: prior specification. arXiv preprint arXiv:1406.1440.

Babacan, S. D., Luessi, M., Molina, R., and Katsaggelos, A. K. (2011). Low-rank matrix completion by variational sparse Bayesian learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2188–2191.

Bensen, G., Ritzwoller, M., Barmin, M., Levshin, A. L., Lin, F., Moschetti, M., Shapiro, N., and Yang, Y. (2007). Processing seismic ambient noise data to obtain reliable broad-band surface wave dispersion measurements. Geophysical Journal International, 169(3):1239–1260.

Bertalmio, M., Sapiro, G., Caselles, V., and Ballester, C. (2000). Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 417–424.

Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Calderbank, R., Thompson, A., and Xie, Y. (2015). On block coherence of frames. Applied and Computational Harmonic Analysis, 38(1):50–71.

Candès, E. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772.

Candès, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Carson, W. R., Chen, M., Rodrigues, M. R., Calderbank, R., and Carin, L. (2012). Communications-inspired projection design with application to compressive sensing. SIAM Journal on Imaging Sciences, 5(4):1185–1212.

Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.

Chi, Y., Lu, Y. M., and Chen, Y. (2019). Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269.

Chikuse, Y. (2012). Statistics on Special Manifolds. Springer Science & Business Media.

Davenport, M. A. and Romberg, J. (2016). An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622.

Eriksson, B., Balzano, L., and Nowak, R. (2012). High-rank matrix completion. In Artificial Intelligence and Statistics, pages 373–381. PMLR.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Gopal, S. and Yang, Y. (2014). Von Mises-Fisher clustering models. In International Conference on Machine Learning, pages 154–162.

Gupta, A. K. and Nagar, D. K. (1999). Matrix Variate Distributions. CRC Press.

Hernández-Lobato, J. M., Houlsby, N., and Ghahramani, Z. (2014). Probabilistic matrix factorization with non-random missing data. In International Conference on Machine Learning, pages 1512–1520. PMLR.

Hoff, P. D. (2009). Simulation of the matrix Bingham–von Mises–Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics, 18(2):438–456.

Hoff, P. D. (2013). Bayesian analysis of matrix data with rstiefel. arXiv preprint arXiv:1304.3673.

Hoffman, K. and Kunze, R. (1971). Linear Algebra. Englewood Cliffs, New Jersey.

Jauch, M., Hoff, P. D., and Dunson, D. B. (2020). Monte Carlo simulation on the Stiefel manifold via polar expansion. Journal of Computational and Graphical Statistics, pages 1–23.

Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998.

Khatri, C. G. and Mardia, K. V. (1977). The von Mises-Fisher matrix distribution in orientation statistics. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):95–106.

Koltchinskii, V., Lounici, K., and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329.

Lawrence, N. D. and Urtasun, R. (2009). Non-linear matrix factorization with Gaussian processes. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 601–608.

Little, R. J. and Rubin, D. B. (2019). Statistical Analysis with Missing Data, volume 793. John Wiley & Sons.

Mak, S. and Joseph, V. R. (2018). Support points. Annals of Statistics, 46(6A):2562–2592.

Mak, S. and Xie, Y. (2018). Maximum entropy low-rank matrix recovery. IEEE Journal of Selected Topics in Signal Processing, 12(5):886–901.

Mardia, K. V. and Jupp, P. E. (2009). Directional Statistics, volume 494. John Wiley & Sons.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1):1665–1697.

Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4(4):409–423.

Salakhutdinov, R. and Mnih, A. (2008). Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 880–887.

Shen, J. (2001). On the singular values of Gaussian random matrices. Linear Algebra and its Applications, 326(1-3):1–14.

Smith, R. C. (2013). Uncertainty Quantification: Theory, Implementation, and Applications. SIAM.

Wang, Z. and Zhou, H. (2009). A general method of prior elicitation in Bayesian reliability analysis. Pages 415–419. IEEE.

Wu, S.-M., Ward, K. M., Farrell, J., Lin, F.-C., Karplus, M., and Smith, R. B. (2017). Anatomy of Old Faithful from subsurface seismic imaging of the Yellowstone Upper Geyser Basin. Geophysical Research Letters, 44(20):10–240.

Xie, Y., Huang, J., and Willett, R. (2012). Change-point detection for high-dimensional time series with missing data. IEEE Journal of Selected Topics in Signal Processing, 7(1):12–27.

Xu, D., Song, B., Xie, Y., Wu, S.-M., Lin, F.-C., and Song, W. (2019). Low-rank matrix completion for distributed ambient noise imaging systems. Pages 1059–1065. IEEE.

Xue, H., Zhang, S., and Cai, D. (2017). Depth image inpainting: Improving low rank matrix completion with low gradient regularization. IEEE Transactions on Image Processing, 26(9):4311–4320.

Zhang, X., Cui, W., and Liu, Y. (2020). Matrix completion with prior subspace information via maximizing correlation. arXiv preprint arXiv:2001.01152.

Zhou, M., Wang, C., Chen, M., Paisley, J., Dunson, D., and Carin, L. (2010). Nonparametric Bayesian matrix completion. Pages 213–216. IEEE.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320.
A Proofs
A.1 Proof of Lemma 2
Proof.
We first prove part (a) of the lemma. To show that X ∈ T almost surely, let Z be an arbitrary matrix in ℝ^{m₁×m₂}, with SVD Z = ŨDṼᵀ, D = diag({d_k}_{k=1}^R). Letting u_k = P_U ũ_k and v_k = P_V ṽ_k, where ũ_k and ṽ_k are column vectors of Ũ and Ṽ respectively, we have u_k ∈ U and v_k ∈ V for k = 1, ···, R. From Definition 1, X can then be written as X = P_U Z P_V = (P_U Ũ) D (P_V Ṽ)ᵀ = Σ_{k=1}^R d_k u_k v_kᵀ, as desired. Next, note that the pseudo-inverse of P_U, (P_U)⁺, is simply P_U, since P_U(P_U)⁺P_U = (P_U)⁺P_U(P_U)⁺ = P_U by the idempotency of P_U, and P_U(P_U)⁺ = (P_U)⁺P_U are both symmetric. Moreover, letting det* be the pseudo-determinant operator, we have det*(P_U) = det*(UUᵀ) = det(UᵀU) = 1, and det*(P_V) = 1 by the same argument. Using this along with Theorem 2.2.1 in Gupta and Nagar (1999), the density function f(X) and the distribution of vec(X) follow immediately.

We now prove part (b) of the lemma. From part (a), we have vec(X) ~ N{0, σ²(P_V ⊗ P_U)}, so:

[Y_Ω, X_{Ω^c}] ~ N{ 0, ( σ²R_N(Ω) + η²I             σ²(P_V ⊗ P_U)_{Ω,Ω^c}
                       σ²(P_V ⊗ P_U)ᵀ_{Ω,Ω^c}      σ²(P_V ⊗ P_U)_{Ω^c} ) }.

The expressions for X^P_{Ω^c} and Σ^P_{Ω^c} in (4) then follow from the conditional density of the multivariate Gaussian distribution. Part (c) of the lemma can be shown in a similar way as part (b).

A.2 Proof of Proposition 3
Proof.
For fixed P_U and P_V, X can be written as:

X = P_U Z P_V = U(UᵀZV)Vᵀ,  (18)

where Z_{i,j} ~i.i.d. N(0, σ²), P_U = UUᵀ and P_V = VVᵀ. By Theorem 2.3.10 in Gupta and Nagar (1999), each entry of Z̃ = UᵀZV ∈ ℝ^{R×R} follows Z̃_{i,j} ~i.i.d. N(0, σ²). Note that the distribution of Z̃ is independent of the initial choice of P_U and P_V (and thereby U and V). By Theorem 1 of Shen (2001), Z̃ can be further factorized via its SVD:

Z̃ = ŨDṼᵀ,  (19)

with independent Ũ ~ U(V_{R,R}), Ṽ ~ U(V_{R,R}), and diag(D) following the repulsed normal distribution (7).

Next, assign independent uniform priors U(G_{R,m₁−R}) and U(G_{R,m₂−R}) on the projection matrices P_U and P_V, which induces independent uniform priors U(V_{R,m₁}) and U(V_{R,m₂}) on the frames U and V. From (18), we have:

X = U(ŨDṼᵀ)Vᵀ = (UŨ)D(VṼ)ᵀ =: ŬDV̆ᵀ.  (20)

Note that Ŭ = UŨ is an orthonormal frame, since (UŨ)ᵀ(UŨ) = Ũᵀ(UᵀU)Ũ = ŨᵀŨ = I. Moreover, Ŭ ~ U(V_{R,m₁}), since U and Ũ are independent and uniformly distributed. Similarly, one can show V̆ = VṼ ~ U(V_{R,m₂}) as well, which proves the proposition.

A.3 Proof of Lemma 4
Proof.
Since U(G_{R,m−R}) is a special case of the matrix Langevin distribution (Section 2.3.2 of Chikuse (2012)), it follows from (2.3.22) of Chikuse (2012) that [P_U | R] ∝ 1 and [P_V | R] ∝ 1. For fixed η² and σ², the MAP estimator for X then becomes:

X̃ ∈ Argmax_{X ∈ ℝ^{m₁×m₂}} [Y_Ω | X, η²][X | P_U, P_V, σ², R] · [P_U | R][P_V | R][R]
     s.t. P_U ∈ G_{R,m₁−R}, P_V ∈ G_{R,m₂−R}, R ≤ m₁ ∧ m₂
   ∈ Argmax_{X ∈ ℝ^{m₁×m₂}} exp{−(1/(2η²))‖Y_Ω − X_Ω‖₂²} · [(2πσ²)^{−R²/2} exp{−(1/(2σ²)) tr[(XP_V)ᵀ(P_U X)]}]
     s.t. P_U ∈ G_{R,m₁−R}, P_V ∈ G_{R,m₂−R}, R ≤ m₁ ∧ m₂
   ∈ Argmin_{X ∈ ℝ^{m₁×m₂}} [ (1/η²)‖Y_Ω − X_Ω‖₂² + log(2πσ²)R² + (1/σ²) tr[(XP_V)ᵀ(P_U X)] ]
     s.t. P_U ∈ G_{R,m₁−R}, P_V ∈ G_{R,m₂−R}, R ≤ m₁ ∧ m₂.

Since X = P_U Z P_V, we have X = UDVᵀ for some D = diag({d_k}_{k=1}^R), U ∈ ℝ^{m₁×R} and V ∈ ℝ^{m₂×R}, where U and V are R-frames satisfying P_U = UUᵀ and P_V = VVᵀ. Hence:

tr[(XP_V)ᵀ(P_U X)] = tr[(VVᵀ)(VDUᵀ)(UUᵀ)(UDVᵀ)]
                   = tr[(VᵀV)D(UᵀU)D]   (cyclic invariance of trace)
                   = tr[D²]             (VᵀV = I and UᵀU = I)
                   = ‖X‖²_F,            (the Frobenius norm equals the Schatten 2-norm)

which proves the expression in (12).

A.4 Proof of Proposition 7
Proof.
Consider the following block decomposition:

R_{N+1}(Ω ∪ (i,j)) + γI = ( R_N(Ω) + γI              ν_i(U) ∘ ν_j(V)
                            [ν_i(U) ∘ ν_j(V)]ᵀ       μ_i(U)μ_j(V) + γ ).

Using the Schur complement identity for matrix inverses (Hoffman and Kunze, 1971), we have:

[R_{N+1}(Ω ∪ (i,j)) + γI]^{−1} = ( Γ + τ^{−1}ΓξξᵀΓ    −τ^{−1}Γξ
                                   −τ^{−1}ξᵀΓ          τ^{−1} ),  (21)

where ξ = ν_i(U) ∘ ν_j(V), Γ = [R_N(Ω) + γI]^{−1} and τ = μ_i(U)μ_j(V) − ξᵀΓξ + γ. Using the conditional variance expression in (16), τ = Var(X_{i,j} | Y_Ω)/σ² + γ. Letting ξ̃ = ν_k(U) ∘ ν_l(V) and applying (16) again, it follows that:

Var(X_{k,l} | Y_{Ω∪(i,j)})
  = σ²{μ_k(U)μ_l(V) − ξ̃ᵀΓξ̃} − τ^{−1}σ²{ν_{i,j}ᵀ[R_N(Ω) + γI]^{−1}ν_{k,l} − ν_{i,k}(U)ν_{j,l}(V)}²   (using (21) and algebraic manipulations)
  = Var(X_{k,l} | Y_Ω) − Cov²(X_{i,j}, X_{k,l} | Y_Ω)/(Var(X_{i,j} | Y_Ω) + η²),   (from (4))

which proves the proposition.

A.5 Proof of Corollary 1

Proof.
This follows directly from Proposition 7 and the fact that:

Cov²(X_{i,j}, X_{k,l} | Y_{Ω_N}) / {Var(X_{i,j} | Y_{Ω_N}) + η²} ≥ 0.

A.6 Proof of full conditional distributions
Proof.
For fixed rank R, the posterior distribution [Θ | Y] can be written as:

[U, D, V, σ², η² | Y] ∝ [Y | U, D, V, η²] · [U] · [V] · [D | σ²] · [σ²] · [η²]
  ∝ (2πη²)^{−m₁m₂/2} exp{−(1/(2η²))‖Y − UDVᵀ‖²_F} · etr(F₁ᵀU) · etr(F₂ᵀV)
    · (2πσ²)^{−R/2} exp{−(1/(2σ²)) Σ_{k=1}^R d_k²} ∏_{1≤k<l≤R} |d_k² − d_l²|
    · IG(σ²; α_σ, β_σ) · IG(η²; α_η, β_η).

Since UᵀU = I and VᵀV = I, the Frobenius term expands as ‖Y − UDVᵀ‖²_F = ‖Y‖²_F − 2 tr(DUᵀYV) + tr(D²). Collecting the terms involving U gives the kernel etr{(YVD/η² + F₁)ᵀU}, which is that of MF(m₁, R, YVD/η² + F₁); the full conditional for V follows analogously. Collecting the terms involving d_k gives, up to the repulsion product, a Gaussian kernel with mean σ² diag(UᵀYV)/(η² + σ²) and variance η²σ²/(η² + σ²), i.e., the repulsed normal full conditional in (10). Finally, the full conditionals for σ² and η² follow by Inverse-Gamma conjugacy, with updated parameters (α_σ + R/2, β_σ + tr(D²)/2) and (α_η + m₁m₂/2, β_η + ‖Y − UDVᵀ‖²_F/2), respectively. This proves the full conditional distributions in (10).