Covariance Estimation from Compressive Data Partitions using a Projected Gradient-based Algorithm
Jonathan Monsalve, Juan Ramirez, Iñaki Esnaola, Member, IEEE, Henry Arguello, Senior Member, IEEE
Abstract—Covariance matrix estimation techniques entail high acquisition costs that challenge the storage and transmission capabilities of sampling systems. For this reason, various acquisition approaches have been developed to simultaneously sense and compress the relevant information of the signal using random projections. However, estimating the covariance matrix from the random projections is an ill-posed problem that requires further information about the data, such as sparsity, low rank, or stationary behavior; furthermore, this approach fails at high compression ratios. Therefore, this paper proposes an algorithm based on the projected gradient method to recover a low-rank or Toeplitz approximation of the covariance matrix. The proposed algorithm divides the data into subsets projected onto different subspaces, assuming that each subset contains an approximation of the signal statistics, which improves the condition of the inverse problem. The error induced by this assumption is analytically derived, along with the convergence guarantees of the proposed method. Extensive simulations show that the proposed algorithm can effectively recover the covariance matrix of hyperspectral images at high compression ratios in noisy scenarios. Additionally, simulations and theoretical results show that filtering the gradient reduces the estimator's error, recovering up to twice the number of eigenvectors.

Index Terms—Random projections, compressive covariance sampling, low-rank matrix recovery, Toeplitz matrix recovery.
I. INTRODUCTION
A wide range of signal processing and machine learning applications demands high computational power and broad storage capabilities of the acquisition systems. These requirements have motivated the development of acquisition schemes that simultaneously sense and compress signals, preserving in this fashion the information embedded in the high-dimensional data. In this sense, compressive sensing has emerged as an alternative acquisition framework that enables reliable recovery of the signal of interest from its compressive projections [1]–[3]. This acquisition approach assumes that signal instances admit sparse representations in a predefined or trained dictionary. However, in diverse applications signals do not admit sparse representations, or the dictionary learning
stages imply expensive computational costs [4]–[6]. Therefore, alternative compressive acquisition approaches that consider other representations of the high-dimensional signals are also of importance [6], [7].

A class of compressive acquisition schemes that considers the low-dimensional information embedded in high-dimensional signals has recently emerged [6], [8]. These techniques aim to recover the second-order statistics from compressive projections, and they have been successfully used in applications such as power spectrum estimation [4], [8], system identification [9], and hyperspectral image dimensionality reduction [7]. More precisely, this acquisition approach, also referred to as compressive covariance sampling, estimates the covariance matrix from compressive measurements, for which signal sparsity is not required [5], [10], [11].

The covariance matrix plays a central role in reducing signal dimensionality. In particular, principal component analysis (PCA) is a data-dependent dimensionality reduction (DR) method derived from the covariance matrix eigendecomposition. PCA has been widely used in applications such as compression, denoising, feature extraction, classification, and image fusion [12]–[15]. For instance, in compressive spectral imaging, where the hyperspectral image is acquired using a reduced set of random projections, the covariance matrix is estimated from a set of compressive samples; many covariance recovery algorithms use the low-rank assumption as a prior of the problem. On the other hand, communication applications such as wide-band power spectrum sensing estimate the covariance matrix from tailored compressive acquisition schemes to detect the unused spectrum, and thus increase link availability [4]. For this application, algorithms use the Toeplitz assumption, since the underlying process is usually assumed stationary.

The work of J. Monsalve was supported by the Colciencias/Department of Santander Scholarship (771 of 2016). J. Monsalve is with the Department of Electrical Engineering, Universidad Industrial de Santander, Bucaramanga 680002, Colombia. J. Ramirez is with the Department of Computer Science, Universidad Rey Juan Carlos, Madrid 28933, Spain. I. Esnaola is with the Department of Automatic Control and Systems Engineering, The University of Sheffield, Western Bank, Sheffield, UK. H. Arguello is with the Department of Systems Engineering and Informatics, Universidad Industrial de Santander, Bucaramanga 680002, Colombia (e-mail: [email protected]).
A. Contribution
This paper develops an algorithm for estimating the sample covariance matrix of high-dimensional signals from their random compressive projections. To this end, data subsets are projected onto multiple subspaces in order to improve the condition of the information matrix of the problem. The estimation problem is formulated to recover a low-rank or Toeplitz representation of a positive semidefinite matrix that minimizes the Frobenius norm of the projection residuals. Theoretical guarantees for the global convergence of the proposed covariance matrix estimation algorithm are established. The error term induced by the proposed algorithm is analytically determined, the causes of the error are studied, and an approach to mitigate it is proposed. Although the method was mainly tested on hyperspectral imaging, it can be extended to other applications such as communications or video.
B. Related work
Covariance matrix estimation methods that use low-dimensional projections have been previously proposed. For example, the compressive projection PCA (CPPCA) algorithm directly calculates both the PCA coefficients and the most relevant eigenvectors of the covariance matrix from an orthogonal random projection of the data [7]. This approach is based on the assumption that the covariance matrix's eigenvalues have a highly eccentric distribution. Additionally, CPPCA assumes that there are small angle gaps between the orthogonal projections of the most relevant eigenvectors of the target covariance matrix and the eigenvectors of the covariance matrix of the random projections.

SpeCA (spectral compressive acquisition) is also an algorithm used to recover the principal components from a set of compressive measurements [16]. SpeCA requires a particular sensing protocol, which consists of one projection matrix applied to the whole image and multiple projection matrices that project different pixels of the scene. SpeCA assumes a low-rank structure on the data and uses the linear mixture model to recover the signal.

In the context of power spectrum estimation, various subsampling methods have been proposed to recover reliable versions of Toeplitz covariance matrices. For example, in radar applications, D. Romero et al. [6] study the use of sparse rulers to recover a Toeplitz covariance matrix by sampling each delay at least once. Finally, Hanchao Qi and Shannon Hughes [17] and Farhad Pourkamali-Anaraki [18] analyzed the bias introduced by the projection matrix on the covariance matrix estimator when the entries of the projection matrix are i.i.d. Gaussian. They prove that, under certain assumptions, the bias depends on the kurtosis of the projection matrix entries and the dimension of the projected subspace.
C. Paper organization
The paper is organized as follows. Section II introduces the covariance matrix estimation problem from random projections. Section III presents the optimization problem to be solved and the proposed algorithm for estimating covariance matrices from compressive projections in multiple subspaces. Sections IV and V include the global convergence guarantees of the proposed algorithm along with the bias analysis. In Section VI, the performance of the proposed algorithm is evaluated using extensive numerical simulations on hyperspectral images, and concluding remarks are summarized in Section VII.

II. COMPRESSIVE COVARIANCE SAMPLING FORMULATION
Let X = [x_1, ..., x_n] be a matrix whose columns x_j ∈ R^l, with j = 1, 2, ..., n, are independent realizations of a zero-mean Gaussian random vector with covariance matrix Σ, i.e., the distribution of x conditioned on Σ is

f(x | Σ) = (2π)^(−l/2) |Σ|^(−1/2) etr( −(1/2) Σ^(−1) x x^T ),   (1)

where etr(·) denotes the exponential of the trace. The maximum likelihood estimator of the covariance matrix is the sample covariance matrix given by

S = (1/n) X X^T = (1/n) ∑_{j=1}^{n} x_j x_j^T,   (2)

with Σ = E[S] and Σ ∈ S^l_{++}, where E[·] denotes the expectation operator. However, in many scenarios a lower-dimensional projection of the signal is available instead of the signal itself, which increases the difficulty of the covariance estimation. For this reason, structure of the covariance matrix, such as low rank or Toeplitz, is exploited to ease the problem. The lower-dimensional signal acquisition process is modeled as

Y = P^T X + N = [P^T x_1, ···, P^T x_n] + N,   (3)

where Y = [y_1, y_2, ..., y_n] ∈ R^{m×n} is the matrix containing the observations, P ∈ R^{l×m} with m < l is the projection/sensing matrix, and N ∈ R^{m×n} is additive Gaussian noise with i.i.d. entries N_{i,j} ∼ N(0, σ_N²). Note that, since x ∼ N(0, Σ) and P is deterministic, y ∼ N(0, P^T Σ P + σ_N² I). The sample covariance matrix obtained from the observation matrix is

˜S = (1/n) Y Y^T = (1/n)(P^T X + N)(P^T X + N)^T,   (4)

with ˜S ∈ R^{m×m}. Note that n˜S follows a Wishart distribution, i.e., n˜S ∼ W(P^T Σ P + σ_N² I, n) [19]. An optimization problem to recover the covariance matrix Σ from ˜S can be formulated as [11]

Σ* = argmin_{Σ ∈ D} (1/2) ||˜S − P^T Σ P||_F² + τ ψ(Σ),   (5)

where ψ(·) is a convex function that regularizes the problem, τ is the regularization parameter, ||·||_F denotes the Frobenius norm, and D is a proper convex and closed set, e.g., the set of positive semidefinite or Toeplitz matrices.

Moreover, note that the zero-mean assumption in (2) does not hold in applications such as imaging. Hence, the random projections can alternatively be written as

Y = P^T (X + ˜X) + N,   (6)

where ˜X = x̃1^T is a matrix whose columns are the mean vector x̃ = E[x], and 1 ∈ R^n is an n-dimensional vector with one-valued entries. An estimate of the mean vector can be obtained from the compressive projections [17], [20] as

x̂ = (α/n) ∑_{j=1}^{n} P (P^T P)^(−1) y_j,   (7)

where α = m/n and y_j is the j-th column of Y. It is proved in [17] that this estimate converges to the mean as n → ∞. Once the mean is estimated, the measurements are corrected by subtracting the projection of the estimated mean from the biased measurements, i.e., ˜Y = Y − P^T (x̂1^T). Without loss of generality, we assume in the sequel that signals are zero mean, and thus Y is used instead of ˜Y to denote the projection of the centered data.
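As a concrete illustration, the following minimal NumPy sketch simulates the acquisition model (3) and forms the compressed sample covariance (4); the dimensions, the low-rank ground-truth covariance, and all variable names are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n = 32, 8, 5000            # signal dimension, measurements, samples
sigma_n = 0.01                   # noise standard deviation (assumed)

# Illustrative ground truth: low-rank covariance plus a small identity load
U = rng.standard_normal((l, 4))
Sigma = U @ U.T + 0.01 * np.eye(l)

# Zero-mean Gaussian samples X in R^{l x n} and a Gaussian sensing matrix P
X = rng.multivariate_normal(np.zeros(l), Sigma, size=n).T
P = rng.standard_normal((l, m))

# Compressive acquisition (3) and compressed sample covariance (4)
Y = P.T @ X + sigma_n * rng.standard_normal((m, n))
S_tilde = (Y @ Y.T) / n          # E[S_tilde] = P^T Sigma P + sigma_n^2 I
```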
III. RECOVERY OF THE COVARIANCE MATRIX FROM COMPRESSED MEASUREMENTS

Solving (5) yields poor results when high compression ratios m/l are used, because all the data are projected onto the same subspace and relevant components may fall into the null space of the projection. Hence, we split the data into disjoint subsets that are projected onto different subspaces to improve the performance of the estimator. This idea is borrowed from the CPPCA sensing approach [7].
A. Projection setup and optimization problem
Let us split the dataset X into p disjoint subsets X_i, with columns defined as X_i = [x_{Ω_{i1}}, ···, x_{Ω_{ib}}], where i = 1, 2, ···, p and Ω_{ij} = Ω_{i′j′} only if i = i′ and j = j′. Then, since each column x_{Ω_{ij}} ∼ N(0, Σ), it holds that the sample covariance matrix satisfies b S_i = X_i X_i^T ∼ W(Σ, b), where b = n/p is the number of columns in each subset. Let us project each matrix X_i ∈ R^{l×b} with an independent matrix P_i ∈ R^{l×m} as

Y_i = P_i^T X_i + N_i.   (8)

Using this splitting procedure, each matrix S_i can be estimated as in (5) by

S_i* = argmin_{S_i ∈ D} (1/2) ||˜S_i − P_i^T S_i P_i||_F² + τ ψ(S_i),   (9)

where ˜S_i = (1/b) Y_i Y_i^T for i = 1, ···, p. Note that the formulation in (9) involves p different optimization problems, one for each matrix S_i, which increases the number of unknowns and thus the ill-posedness of the problem. However, for a sub-Gaussian process it holds that

||S_i − Σ|| ≤ ε,   (10)

with probability at least 1 − 2 exp(−t² l), for b ≥ J (t/ε)² l, ε ∈ (0, 1), and t ≥ 1, where J depends on the sub-Gaussian norm of X [21]; i.e., the statistics of a sample (subset) approximate the statistics of the whole random process.

Fig. 1 illustrates this similarity using the covariance matrices of two subsets of the signal X. To that end, two subsets of columns of the matrix X, highlighted in yellow and red, are used to compute the covariance matrices S_1 and S_2. In this example, X is the matrix representation of a hyperspectral image with spatial resolution 512 × 512 (i.e., n = 262144) and l = 102 spectral bands, where each column of X represents the spectrum at a given spatial location. The computation of S_1 and S_2 uses b = 2048 spectral signatures. As can be seen, the two matrices are similar, since the Frobenius distance ||S_1 − S_2||_F is small.

Instead of recovering all covariance matrices S_i, we assume that S_1 = S_2 = ··· = S_p = Σ. By doing this, we merge the separate problems in (9) into a single optimization problem by replacing each S_i with Σ, which results in

Σ* = argmin_{Σ ∈ D} (1/2) ∑_{i=1}^{p} ||˜S_i − P_i^T Σ P_i||_F² + τ ψ(Σ).   (11)

Fig. 1. Depiction of the partition approach. For a matrix X, different subsets of columns are selected and their covariance matrices are computed. Matrix S_1 represents the covariance matrix of the yellow columns and S_2 the covariance matrix of the red columns.

Theoretical results and simulations show that this splitting procedure improves the accuracy and variance of the estimator (see Lemma 5.3).
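The partitioned sensing in (8) and the per-subset covariances ˜S_i in (9) can be sketched as follows; the helper name partition_and_project is hypothetical, and p is assumed to divide n for simplicity.

```python
import numpy as np

def partition_and_project(X, p, m, sigma_n, rng):
    """Split X into p disjoint column subsets, project each with its own
    independent Gaussian P_i as in (8), and return the compressed
    per-subset covariances S~_i used in (9) and (11)."""
    l, n = X.shape
    b = n // p                                # columns per subset
    P_list, S_tilde_list = [], []
    for i in range(p):
        Xi = X[:, i * b:(i + 1) * b]          # disjoint subset X_i
        Pi = rng.standard_normal((l, m))      # independent sensing matrix P_i
        Yi = Pi.T @ Xi + sigma_n * rng.standard_normal((m, b))
        P_list.append(Pi)
        S_tilde_list.append((Yi @ Yi.T) / b)  # S~_i = (1/b) Y_i Y_i^T
    return P_list, S_tilde_list
```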
B. Proposed projected gradient algorithm for covariance matrix recovery
Problem (11) is solved following the projected gradient method. This method requires a differentiable function f(Σ) and a proper closed and convex set D, and formulates the optimization problem as

Σ* = argmin_{Σ ∈ R^{l×l}} f(Σ)  subject to  Σ ∈ D.   (12)

Note that (11) has the form of (12), so it can be solved using the projected gradient algorithm, as illustrated in Algorithm 1. The algorithm is summarized in three main steps. In the first step, a starting point Σ_0 ∈ D and the regularization parameter τ are selected; the parameter τ induces the low-rank structure in the solution. In the second step (line 3), the learning step is selected using the Armijo search [22]. For that, λ_k = λ_{k−1}/η^r, where η > 1, λ_{−1} > 0, and r is the smallest non-negative integer that satisfies

f(Σ_{k,r}, τ) ≤ f(Σ_k, τ) + Tr( ∇f(Σ_k, τ)^T (Σ_{k,r} − Σ_k) ) + (1/(2λ_k)) ||Σ_{k,r} − Σ_k||_F²,   (13)

where Σ_k is the k-th iterate, Σ_{k,r} = P_D( Σ_k − (λ_{k−1}/η^r) ∇f(Σ_k, τ) ) is an intermediate step between iterations k and k+1, and P_D is the projection onto the set D. In the third step (line 4), the variable Σ_k is updated using the gradient of the cost function

f(Σ, τ) ≡ (1/2) ∑_{i=1}^{p} ||˜S_i − P_i^T Σ P_i||_F² + τ ψ(Σ),   (14)

which for a fixed τ is given by

∇f(Σ, τ) = ∑_{i=1}^{p} P_i ( P_i^T Σ P_i − ˜S_i ) P_i^T + τ ∇ψ(Σ).   (15)
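The step-size rule (13) amounts to a standard backtracking loop; the sketch below is one possible reading of it, where f, grad_f, and proj_D are caller-supplied callables (hypothetical names, not the authors' API).

```python
import numpy as np

def armijo_step(f, grad_f, proj_D, Sigma_k, lam_prev, eta=2.0):
    """Backtracking search implementing the sufficient-decrease rule (13)."""
    G = grad_f(Sigma_k)
    lam = lam_prev
    while True:
        Sigma_kr = proj_D(Sigma_k - lam * G)          # candidate iterate
        D = Sigma_kr - Sigma_k
        bound = f(Sigma_k) + np.sum(G * D) + np.sum(D * D) / (2 * lam)
        if f(Sigma_kr) <= bound:                      # condition (13) holds
            return lam, Sigma_kr
        lam /= eta                                    # shrink step and retry
```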
When ψ(Σ) = Tr(Σ), which is used to promote the low-rank structure, the regularizer gradient is ∇Tr(Σ) = I ∈ R^{l×l}. Once the variable Σ_k is updated using the gradient, it is projected onto the set D, whose computation depends on the set itself. In this work we study two sets:

1) Positive semidefinite:
The orthogonal projection onto the set of positive semidefinite matrices is given by [23]

P_D(Σ) = W Λ_+ W^T,   (16)

where Σ = W Λ W^T is the eigendecomposition of Σ and Λ_+ is the diagonal matrix containing only the positive eigenvalues of Σ.

2) Toeplitz:
The orthogonal projection onto the set of Toeplitz matrices is given by [23]

P_D(Σ) =
[ t_0      t_1      t_2      ...  t_{l−1} ]
[ t_1      t_0      t_1      ...  t_{l−2} ]
[ t_2      t_1      t_0      ...  t_{l−3} ]
[ ...      ...      ...      ...  ...     ]
[ t_{l−1}  t_{l−2}  t_{l−3}  ...  t_0     ],   (17)

where t_k = (1/(l − k)) ∑_{i=1}^{l−k} Σ_{i,(i+k)}, with Σ = {Σ_{i,j}}. Thus, the proposed gradient method is summarized in Algorithm 1.
Algorithm 1 Projected gradient algorithm
1: Input: Σ_0 ∈ D, τ, λ_0
2: while the stopping criterion is not satisfied do
3:    pick λ_k > 0                              ▷ Armijo search
4:    Σ_{k+1} ← P_D( Σ_k − λ_k ∇f(Σ_k, τ) )     ▷ using (16) or (17)
5: end while

Note that Algorithm 1 works for both the low-rank and Toeplitz cases. Nevertheless, in the Toeplitz case τ is set to zero, since the low-rank regularizer is not necessary.
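Putting the pieces together, the following sketch implements Algorithm 1 with the projections (16) and (17). It uses a fixed step size instead of the Armijo search for brevity, so it approximates rather than reproduces the authors' implementation.

```python
import numpy as np

def proj_psd(S):
    """Projection (16): keep only the non-negative eigenvalues."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(w, 0)) @ V.T

def proj_toeplitz(S):
    """Projection (17): replace each diagonal of S by its average."""
    l = S.shape[0]
    T = np.zeros_like(S)
    for k in range(l):
        t_k = np.mean(np.diagonal(S, offset=k))
        T += t_k * np.eye(l, k=k)
        if k > 0:
            T += t_k * np.eye(l, k=-k)
    return T

def projected_gradient(S_tilde_list, P_list, l, tau=0.0, lam=1e-3,
                       iters=500, proj=proj_psd):
    """Minimal sketch of Algorithm 1 (fixed step size)."""
    Sigma = np.eye(l)
    for _ in range(iters):
        G = tau * np.eye(l)                            # gradient of tau*Tr(Sigma)
        for Pi, Si in zip(P_list, S_tilde_list):
            G += Pi @ (Pi.T @ Sigma @ Pi - Si) @ Pi.T  # data term of (15)
        Sigma = proj(Sigma - lam * G)                  # projected update
    return Sigma
```

For the Toeplitz case, the call would use proj=proj_toeplitz and tau=0, as prescribed above.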
IV. CONVERGENCE ANALYSIS

This section studies the convergence properties of the projected gradient Algorithm 1 when solving (11). Consider the function ψ(Σ) = Tr(Σ) in (11), and let

g(Σ) = (1/2) ∑_{i=1}^{p} ||vec(˜S_i) − Q_i vec(Σ)||_2² + τ d^T vec(Σ)   (18)

be the vector formulation of (14), with Q_i = P_i^T ⊗ P_i^T, where ⊗ is the Kronecker product, vec(·) denotes the operation that stacks the columns of a given matrix into a column vector, d = vec(I) with I ∈ R^{l×l} the l × l identity matrix, and ||·||_2 is the ℓ_2 norm. Given that g(Σ) ≡ f(Σ) [11], the function g(Σ) is considered instead of f(Σ):

Σ* = argmin_{Σ ∈ R^{l×l}} g(Σ) + h(Σ),   (19)

where h(Σ) is the indicator function of the positive semidefinite or Toeplitz set. For simplicity, and taking into account that g(Σ) vectorizes its input matrix, g(Σ) and g(σ) are used interchangeably, where σ = vec(Σ) and ˜s_i = vec(˜S_i). Further, Fejér monotonicity guarantees that the sequence of points generated by the projected gradient method converges to a solution [24, Theorem 10.23].
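The equivalence g(Σ) ≡ f(Σ) rests on the identity vec(P^T Σ P) = (P^T ⊗ P^T) vec(Σ); a quick numerical check with arbitrary small dimensions is:

```python
import numpy as np

rng = np.random.default_rng(1)
l, m = 6, 3
P = rng.standard_normal((l, m))
A = rng.standard_normal((l, l)); Sigma = A @ A.T      # arbitrary PSD matrix
B = rng.standard_normal((m, m)); S_tilde = B @ B.T

Q = np.kron(P.T, P.T)                                 # Q = P^T (Kronecker) P^T
r_mat = np.linalg.norm(S_tilde - P.T @ Sigma @ P, 'fro')
r_vec = np.linalg.norm(S_tilde.flatten('F') - Q @ Sigma.flatten('F'))
print(np.isclose(r_mat, r_vec))                       # True: residuals agree
```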
Theorem 10.23 (Fejér monotonicity theorem): Suppose that g(Σ) and h(Σ) are proper closed and convex functions, that dom(h(Σ)) ⊆ int(dom(g(Σ))), and that g(Σ) is L-smooth. Let {Σ_k}_{k≥0} be the sequence of points generated by the projected gradient algorithm. Then for any optimal point Σ* and k ≥ 0 it holds that

||Σ_{k+1} − Σ*|| ≤ ||Σ_k − Σ*||.   (20)

The function g(Σ) is convex and differentiable, and h(Σ) is proper closed and convex; thus the first assumptions of Theorem 10.23 are satisfied. Additionally, a function g is said to be L-smooth if it is differentiable and there exists L > 0 such that [24]

||∇g(x) − ∇g(y)||_2 ≤ L ||x − y||_2,   (21)

for all x, y ∈ E, with E the domain of the function g. For the function g defined in (18), the gradient is given by

∇g(x) = ∑_{i=1}^{p} Q_i^T (Q_i x − ˜s_i) + τ d.   (22)

Replacing (22) in (21) yields

|| ( ∑_{i=1}^{p} Q_i^T (Q_i x − ˜s_i) + τ d ) − ( ∑_{i=1}^{p} Q_i^T (Q_i y − ˜s_i) + τ d ) ||_2,   (23)

and after some algebraic manipulations, (23) can be rewritten as

|| ∑_{i=1}^{p} (Q_i^T Q_i)(x − y) ||_2 ≤ || ∑_{i=1}^{p} Q_i^T Q_i || ||x − y||_2.   (24)

This implies that L = || ∑_{i=1}^{p} Q_i^T Q_i ||. Thus, it can be concluded that the sequence of points generated by Algorithm 1 is Fejér monotone, which guarantees convergence.
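Since Q_i^T Q_i = (P_i P_i^T) ⊗ (P_i P_i^T), the smoothness constant L can be evaluated directly for small l; the dense construction below is for illustration only, as the matrix has size l² × l².

```python
import numpy as np

def lipschitz_constant(P_list):
    """L = || sum_i Q_i^T Q_i ||_2 with Q_i^T Q_i = (P_i P_i^T) kron (P_i P_i^T)."""
    l = P_list[0].shape[0]
    M = np.zeros((l * l, l * l))
    for Pi in P_list:
        K = Pi @ Pi.T
        M += np.kron(K, K)
    return np.linalg.norm(M, 2)       # spectral norm (largest singular value)
```

A fixed step size λ_k ≤ 1/L then satisfies the sufficient-decrease condition (13) automatically.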
V. ERROR TERM OF THE PROPOSED ESTIMATOR

The assumption in (10) introduces an error term in the gradient. To show this, let us characterize the difference between the ground-truth covariance matrix S and the subset covariance matrices S_i as

S_i = S + R_i,   (25)

where R_i ∈ R^{l×l} is a matrix that accounts for the error between the covariance matrices. In the ideal case, where S = S_i for all i, the estimator is optimal and (15) holds. However, in the more realistic scenario where S ≠ S_i, assuming (25), the error is described in Lemma 5.1.
Lemma 5.1: The gradient step of the proposed Algorithm 1 has an error term given by

Error[∇˜f(Σ)] = − ∑_{i=1}^{p} P_i P_i^T R_i P_i P_i^T,

where Error[·] = ∇f − ∇˜f, with ∇f the ideal gradient and ∇˜f the gradient actually computed from the subset covariances.
Proof: The gradient of f(Σ) is given in (15). However, as mentioned above, the subset covariance matrices are not all equal to Σ, and thus (15) is rewritten as

∇˜f(Σ) = ∑_{i=1}^{p} P_i ( P_i^T S_i P_i − ˜S_i ) P_i^T + τ ∇ψ(Σ).   (26)
Plugging (25) into (26) and assuming that S = Σ, we have that

∇˜f(Σ) = ∑_{i=1}^{p} P_i ( P_i^T (S + R_i) P_i − ˜S_i ) P_i^T + τ ∇ψ(Σ) = ∑_{i=1}^{p} P_i ( P_i^T (Σ + R_i) P_i − ˜S_i ) P_i^T + τ ∇ψ(Σ).   (27)

After some algebraic operations, (27) can be rewritten as

∇˜f(Σ) = ∑_{i=1}^{p} P_i ( P_i^T Σ P_i − ˜S_i ) P_i^T + ∑_{i=1}^{p} P_i P_i^T R_i P_i P_i^T + τ ∇ψ(Σ).   (28)

Comparing (15) and (28), the error term of the gradient induced by the splitting procedure is

Error[∇˜f(Σ)] = − ∑_{i=1}^{p} P_i P_i^T R_i P_i P_i^T.   (29)

An important property of the error term is that it is a sum over the subsets, and thus its norm grows with the number of partitions p. However, more partitions improve the condition of the information matrix of the problem. Consequently, choosing the number of subsets is a trade-off between improving the condition of the problem and increasing the error; the Monte Carlo sketch below illustrates how the error norm grows with p.
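The growth of the error norm with p can be observed empirically; the following Monte Carlo sketch, with an arbitrary low-rank Σ and illustrative sizes, evaluates ||∑_i P_i P_i^T R_i P_i P_i^T||_F for an increasing number of partitions.

```python
import numpy as np

rng = np.random.default_rng(2)
l, m, n = 32, 8, 8192
U = rng.standard_normal((l, 4))
Sigma = U @ U.T + 0.01 * np.eye(l)
X = rng.multivariate_normal(np.zeros(l), Sigma, size=n).T
S = (X @ X.T) / n                                   # full sample covariance

for p in (16, 64, 256):
    b = n // p
    E = np.zeros((l, l))
    for i in range(p):
        Xi = X[:, i * b:(i + 1) * b]
        Ri = (Xi @ Xi.T) / b - S                    # error matrix R_i = S_i - S
        Pi = rng.standard_normal((l, m))
        H = Pi @ Pi.T
        E += H @ Ri @ H                             # accumulate error term (29)
    print(p, np.linalg.norm(E, 'fro'))              # the norm grows with p
```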
The variance of any estimator in this setting is bounded by the following theorem.

Theorem 5.2: The variance of any estimator of the covariance matrix Σ in (8) with deterministic projection matrices P_i, assuming that the information matrix is non-singular, satisfies

var(˜Σ) ≥ (p/n) Tr[ ( ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T )^(−1) ],   (30)

with A_i = (Σ_N + P_i^T Σ P_i)^(−1), where ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T is the information matrix.
Proof: See Appendix C.

From Theorem 5.2 it is important to notice that for small values of p the Fisher information matrix is singular. Hence, a large enough number of partitions (p ≥ l²/m²) must be used, based on Lemma 5.3.
Lemma 5.3: Let P_i ∈ R^{l×m} and A_i ∈ R^{m×m}; then the matrix ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T is singular if p < l²/m².
Proof: See Appendix D.

From Lemma 5.3 it can be seen that the information matrix can be non-singular only for some d ≥ 1 such that p ≥ d l²/m². Nevertheless, choosing a large p increases the norm of the error term given in (29). Hence, p should be chosen large enough to avoid the singularity of (30) but small enough to keep the error term in (29) low. Additionally, the error term satisfies an important property, given by Lemma 5.4.
Lemma 5.4: Let {R_i} be the set of error matrices of the subset covariance matrices S_i. Since the sensing matrices P_i are deterministic, any entry B_{kj} of the matrix B = P_i P_i^T R_i P_i P_i^T satisfies

E[B_{kj}] = 0.   (31)
Proof: See Appendix B.

This result motivates the use of a filtered gradient to remove the effect of the error term, which the simulations show is usually associated with high frequencies. The proposed algorithm therefore filters the gradient at each iteration to mitigate this error, especially when the compression is high, since more partitions are then required (see Lemma 5.3). The filtered gradient is given by

∇ˆf(Σ) = K ∗ ∇f(Σ),   (32)

where ∗ represents the two-dimensional convolution operation and K ∈ R^{k×k} is the filter kernel. This filtered gradient replaces the gradient in line 4 of Algorithm 1.
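A minimal realization of (32) smooths the l × l gradient matrix with a 2-D Gaussian kernel, the choice reported in Section VI; the final symmetrization is an extra safeguard of ours, not stated in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filtered_gradient(grad, sigma=1.0):
    """Filtered gradient (32): Gaussian smoothing stands in for K * grad."""
    G = gaussian_filter(grad, sigma=sigma)
    return (G + G.T) / 2              # keep the smoothed gradient symmetric
```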
VI. SIMULATIONS AND RESULTS

The performance of the proposed algorithm is tested using synthetic and real data. The gradient is filtered using a Gaussian filter with σ = 1; the filtering is applied only together with the low-rank regularizer (i.e., τ > 0). Three different types of projection matrix P are used: i) Gaussian matrices whose entries follow a standard normal distribution, P_{i,j} ∼ N(0, 1); ii) binary matrices with Bernoulli-distributed entries; and iii) matrices whose entries follow a standard uniform distribution, P_{i,j} ∼ U(0, 1). In the simulations, two noisy scenarios of 20 and 30 dB SNR were tested, with the SNR defined as
SNR = 10 log_10( ||P^T X||_F² / ||N||_F² ).

A. Synthetic data performance evaluation
Synthetic data from a low-rank and a Toeplitz covariance matrix were generated. For the low-rank covariance matrix, the data points were generated in Matlab with µ = 0, the rank of Σ set to 7, and signal dimension l = 100. The data for the Toeplitz matrix were generated from an autoregressive model of order q = 8, also with signal dimension l = 100.

Fig. 2 shows the average normalized mean squared error, defined as NMSE = ||Σ − ˜Σ||_F / ||Σ||_F, between the original and reconstructed covariance matrices as a function of the number of partitions. It can be seen that Gaussian matrices yield the best performance. Based on these results, in the experiments with synthetic data we set the number of partitions to 4 and 128 for the Toeplitz and low-rank data, respectively. For the Toeplitz matrix, the results of the proposed algorithm are compared against sparse rulers and a least-squares autoregressive estimator. For the low-rank matrix, the proposed algorithm is compared against compressive-projection principal component analysis (CPPCA) [7] and the spectral compressive acquisition (SpeCA) method [16].
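For reference, the two synthetic datasets can be generated as in the sketch below; the AR coefficients and their normalization are illustrative assumptions, since the paper does not report them.

```python
import numpy as np

rng = np.random.default_rng(3)
l, n, rank, q = 100, 4096, 7, 8

# Low-rank case: zero-mean Gaussian samples with a rank-7 covariance
U = rng.standard_normal((l, rank))
X_lr = U @ rng.standard_normal((rank, n))   # x = U z with z ~ N(0, I_rank)

# Toeplitz case: each column is one realization of a stationary AR(q) process
a = rng.uniform(-1.0, 1.0, q)
a *= 0.95 / np.abs(a).sum()                 # enforce sum |a_k| < 1 (stability)
X_ar = np.zeros((l, n))
for j in range(n):
    e = rng.standard_normal(l + 10 * q)     # innovations, with burn-in
    x = np.zeros_like(e)
    for t in range(q, e.size):
        x[t] = a @ x[t - q:t][::-1] + e[t]  # AR recursion
    X_ar[:, j] = x[-l:]                     # discard the transient
```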
Fig. 2. Average normalized mean squared error of the reconstructed covariance matrix for the (top) Toeplitz and (bottom) low-rank matrices, varying the number of partitions at a fixed compression ratio, for two noise scenarios: (left) SNR = 30 dB and (right) SNR = 20 dB. Each line represents a different sensing matrix (Gaussian, Uniform, Binary). The shaded areas represent the confidence interval (in some cases it cannot be seen in the plot).

Fig. 3. Average normalized mean squared error of the reconstructed covariance matrix for the low-rank covariance matrix, varying the number of acquisitions at a fixed number of partitions, for two noise scenarios: (left) SNR = 30 dB and (right) SNR = 20 dB.

Figure 3 shows that both the proposed and SpeCA algorithms outperform the CPPCA algorithm, because the generated random signal does not exhibit the eccentric eigenvalue behavior that is an essential assumption of the CPPCA algorithm. On the other hand, the proposed algorithm achieves results comparable to SpeCA when Gaussian matrices are used, but outperforms SpeCA with binary matrices and in low-SNR regimes.
Fig. 4. Average normalized mean squared error of the reconstructed covariance matrix for the Toeplitz covariance matrix, varying the number of acquisitions at a fixed number of partitions, for two noise scenarios: (left) SNR = 30 dB and (right) SNR = 20 dB. The shaded areas represent the confidence interval.

Figure 4 compares the performance of the different algorithms in the recovery of the Toeplitz covariance matrix. The proposed algorithm outperforms the two state-of-the-art algorithms, AR coefficients [9] and sparse rulers [6], especially at high compression ratios. The proposed method is evaluated with two types of sensing matrices, Gaussian and binary. Note that both AR coefficients and sparse rulers define a specific sensing protocol, and hence their sensing matrix is fixed.
B. Real data performance evaluation - Hyperspectral imaging
Fig. 5. (Top) Urban dataset: (left) RGB composite of the hyperspectral image; (right) spectral signatures of three different pixels at locations P1, P2, and P3. (Bottom) Pavia Centre dataset: (left) RGB composite of the spectral image; (right) spectral signatures of three different pixels.

Additionally, the proposed method is evaluated by estimating the covariance matrix of hyperspectral images using subsets of compressive random projections. Two hyperspectral images are considered: the Urban dataset [25], with a spatial resolution of 307 × 307 pixels and l = 128 spectral bands; and a 512 × 512 section of the Pavia Centre dataset [26] with l = 102 spectral bands. The RGB composite and the spectral signatures of three pixels (at the spatial locations P1, P2, and P3) of the Urban dataset are displayed in Fig. 5 (top-left) and (top-right), respectively. Moreover, Fig. 5 (bottom-left) and (bottom-right) show the RGB composite and the spectral signatures for the Pavia Centre dataset. The results obtained with the proposed method are compared with those obtained using the CPPCA and SpeCA algorithms. Three metrics are used to compare the results: the mean squared error (MSE) between the covariance matrices, the angle error between the eigenvectors, and the peak signal-to-noise ratio (PSNR).
C. Cramér-Rao lower bound and optimal number of partitions
As described in Section III, the signal is split into p subsets that are projected using different matrices, with p ≥ l²/m² as required by Lemma 5.3. This section compares the variance of the estimator given by the theoretical expression in Theorem 5.2 with the empirical variance observed in the simulations. Table I shows the value l²/m² for both images as m increases.

TABLE I
MINIMUM OPTIMAL NUMBER OF PARTITIONS

image/m | 8   | 12  | 16 | 20 | 24 | 28 | 32
Urban p | 256 | 114 | 64 | 41 | 29 | 21 | 16
Pavia p | 163 | 72  | 41 | 26 | 18 | 13 | 10
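The entries of Table I follow directly from Lemma 5.3; the snippet below reproduces the underlying values l²/m², which Table I reports rounded to integers.

```python
ms = (8, 12, 16, 20, 24, 28, 32)
for l, name in ((128, "Urban"), (102, "Pavia")):
    print(name, [round(l ** 2 / m ** 2, 1) for m in ms])
# Urban [256.0, 113.8, 64.0, 41.0, 28.4, 20.9, 16.0]
# Pavia [162.6, 72.2, 40.6, 26.0, 18.1, 13.3, 10.2]
```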
Fig. 6 shows the theoretical variance given by the Cramér-Rao lower bound and the empirical variance, defined as (1/r) tr[ ∑_{i=1}^{r} (˜σ_i − σ)(˜σ_i − σ)^T ], where r is the number of realizations.
Fig. 6. Comparison of the Cramér-Rao lower bound and the empirical variance. Blue lines represent the empirical value and red lines the theoretical value. The red lines are shown only where the Fisher information matrix is non-singular.
Fig. 6 presents three different compression ratio scenarios, ranging from 6% to 30%. It can be seen that the values of p presented in Table I match those obtained in Fig. 6. Note that the red lines are shown only where the Fisher information matrix is non-singular, which, as expected, is close to the point of least empirical variance.

D. Accuracy of the recovered covariance matrix
The quality of the reconstructed covariance matrices was evaluated using the NMSE and the angle between the eigenvectors of the ground-truth covariance matrix and the eigenvectors recovered using Algorithm 1. Fig. 7 shows the NMSE of the reconstructed covariance matrix for the different types of sensing matrix. It can be seen that the proposed method outperforms both baseline methods (CPPCA and SpeCA), especially when binary matrices are used. For Gaussian matrices P, the proposed algorithm obtains results comparable to SpeCA.

Fig. 7. Average NMSE of the recovered covariance matrix as the compression ratio varies, for the Pavia and Urban images, using the number of partitions given in Table I.
Fig. 8 shows the angle gap obtained for the two images and the three types of random projection. The results in Fig. 8 are generated by running the proposed algorithm, CPPCA, and SpeCA 20 times and averaging the angles of the recovered eigenvectors. For the SpeCA algorithm, whose sensing protocol is defined as Y_a = AX ∈ R^{m_a×n}, with A ∈ R^{m_a×l}, and Y_b = [B_1 x_1, B_2 x_2, ···, B_n x_n] with B_i ∈ R^{m_b×l}, we set m_a = m − 1 and m_b = 1. For these simulations, the signal was corrupted with additive Gaussian noise as in (3) to yield an SNR of 20 dB. The results show that the angle gap of the recovered eigenvectors is smaller when the proposed algorithm is used, for any type of projection matrix. SpeCA produces results similar to those of the proposed algorithm when Gaussian projection matrices are used; however, the proposed algorithm outperforms SpeCA when binary and uniform matrices are used.

Fig. 8. Average angle error of the recovered eigenvectors at different compression ratios for the Pavia (left) and Urban (right) images with different types of sensing matrix. Rows show the angle gap of the first, second, and third eigenvectors.

E. Error term and filtered gradient analysis
In this section the error term is numerically analyzed. For test purposes we assume that the true covariance matrix is known, so that the error matrices R_i can be computed as R_i = S_i − S and the error calculated as in (29). The covariance matrix is estimated using the proposed algorithm without filtering the gradient, and its eigenvectors are compared with those of the error term B; the reason is that, when no filtering is applied, we observe in the simulations that some eigenvectors are corrupted with high-frequency noise. Fig. 9 (left) visually compares an eigenvector obtained when no filter is applied to the gradient with an eigenvector of the bias term (29). It can be seen that the fourth eigenvector of the recovered covariance matrix converges to the fourth eigenvector of B, which numerically validates the statement of Lemma 5.1.

Fig. 9. Comparison of the fourth eigenvector of the estimated covariance matrix with: (left) the non-filtered gradient and the first eigenvector of the bias term; (right) the filtered gradient and the fourth eigenvector of the true covariance matrix.
However, when the filtering procedure is applied using a Gaussian filter with σ = 1, this corrupted eigenvector converges to the actual one, as shown in Fig. 9 (right). Additionally, to test the impact of the error term given in (29), this term is computed and subtracted from the gradient, with the results shown in Fig. 10. The results show a notable improvement when the filtered gradient is used, especially with uniform sensing matrices. When the exact error term is computed and subtracted (for comparison purposes only, since this is possible only if the covariance matrix is known beforehand), the improvement is up to one order of magnitude; in the case of uniform matrices, however, the simulations show no improvement when the error term is subtracted. In all cases the error decreases when the filtered gradient is used, the improvement is largest for uniform matrices, and the stability of the reconstruction appears to improve as well.

F. Image reconstruction
The underlying signal is recovered from the estimated eigenvectors using the method described in [7].

Fig. 10. NMSE of the recovered covariance matrix when the filtered gradient is used. The dotted green line represents the unfiltered-gradient results; blue solid lines with dot markers show the results with the filtered gradient; the red solid line presents the results of the filtered gradient when the error term (29) is subtracted at each gradient step.
In particular, given the matrix W_m ∈ R^{l×m} containing m recovered eigenvectors, the signal is estimated as

ˆX = W_m (P^T W_m)† Y,   (33)

where † denotes the Moore-Penrose pseudoinverse.
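A direct implementation of the estimator (33) is a one-liner; here W_m is assumed to collect the m leading eigenvectors of the recovered covariance matrix.

```python
import numpy as np

def reconstruct(Y, P, Sigma_hat, m):
    """Signal estimate (33): X_hat = W_m (P^T W_m)^+ Y."""
    w, V = np.linalg.eigh(Sigma_hat)
    W_m = V[:, -m:]                   # eigenvectors of the m largest eigenvalues
    return W_m @ np.linalg.pinv(P.T @ W_m) @ Y
```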
Using this approach, the image is reconstructed and the performance is compared against the SpeCA and CPPCA algorithms. Figure 11 shows the results for the Pavia Centre image using the PSNR as the quality measure.

Fig. 11. Comparison of the different reconstruction methods in terms of PSNR as a function of the compression ratio. The shaded areas represent the confidence interval.

It can be seen that, using the estimator given in (33), the proposed method outperforms both state-of-the-art counterparts, especially CPPCA, which exhibits a large dispersion in performance. SpeCA attains results similar to those of the proposed method when Gaussian matrices are used, but with binary matrices the proposed method outperforms it by up to 5 dB.
VII. CONCLUSION
We proposed an algorithm, based on the projected gradient method, to recover the covariance matrix from a set of compressive measurements using a strategy of projections onto convex sets. The theoretical results show that, although the splitting procedure induces an error term, this error can be mitigated by using a filtered gradient. Additionally, the error grows with the number of partitions; nevertheless, more partitions improve the condition of the information matrix, so choosing the right number of partitions is critical. For that reason, a lower bound on the optimal number of partitions is proposed. Experimental results show that the proposed method outperforms the state-of-the-art algorithms CPPCA and SpeCA. The experiments were performed on two different hyperspectral images, for which the proposed method attained better results in terms of MSE and angle gap, which translates into a gain of up to 10 dB of PSNR with respect to CPPCA and up to 4 dB with respect to SpeCA.
APPENDIX A
PROOF THAT THE EXPECTED VALUE OF THE ERROR TERM IS ZERO

The proposed method assumes that the covariance matrix of each subset X_i is

S_i = S + R_i,   (34)

where R_i is the error of the subset covariance matrix estimate. Note that

S = (1/n) ∑_{j=1}^{n} x_j x_j^T = (1/n) ( ∑_{j ∈ Ω_1} x_j x_j^T + ∑_{j ∈ Ω_2} x_j x_j^T + ··· + ∑_{j ∈ Ω_p} x_j x_j^T ) = (1/n) ( X_1 X_1^T + X_2 X_2^T + ··· + X_p X_p^T ),   (35)

which yields

S = (1/p) ( (S + R_1) + (S + R_2) + ··· + (S + R_p) ).   (36)

Computing the expectation on both sides yields

E[S] = E[S] + (1/p) E[R_1 + R_2 + ··· + R_p],   (37)

which implies that E[R_1 + R_2 + ··· + R_p] = 0.

APPENDIX B
PROOF OF LEMMA 5.4

Let H_i ≡ P_i P_i^T = [h_1, h_2, ···, h_l], and define the matrix

B_i ≡ H_i R_i H_i,   (38)

where H_i is symmetric and E[R_i] = 0. For simplicity, the index i of the matrices B and R is dropped. The expected value of the entries B_{k,j} = h_k^T R h_j is given by

E[h_k^T R h_j] = E[tr(h_k^T R h_j)] = E[tr(R h_j h_k^T)] = tr(E[R] h_j h_k^T).   (39)

Additionally, the matrix h_j h_k^T is deterministic and E[R] = 0; hence

E[h_k^T R h_j] = 0.   (40)

APPENDIX C
PROOF OF THE CRAMÉR-RAO LOWER BOUND FOR THE ESTIMATOR
The variance of the estimator is bounded by

var(˜Σ) ≥ tr( I(Σ)^(−1) ),   (41)

where I(Σ) is the Fisher information matrix. To compute I(Σ) we observe that

˜S_i = (p/n) Y_i Y_i^T,   (42)

with (n/p) ˜S_i ∼ W( P_i^T Σ P_i + Σ_N, n/p ), so that each subset contains n/p samples. The likelihood function is given by

f(˜S_1, ···, ˜S_p | Σ) ∝ ∏_{i=1}^{p} |Σ_N + P_i^T Σ P_i|^(−n/(2p)) |Y_i Y_i^T|^((n/p − m − 1)/2) etr{ −(1/2) (Σ_N + P_i^T Σ P_i)^(−1) (Y_i Y_i^T) },   (43)

where |·| stands for the determinant and etr{·} for the exponential of the trace. Taking the logarithm on both sides yields

F(˜S_1, ···, ˜S_p | Σ) ∝ −(n/(2p)) ∑_{i=1}^{p} log|Σ_N + P_i^T Σ P_i| + ((n/p − m − 1)/2) ∑_{i=1}^{p} log|Y_i Y_i^T| − (1/2) ∑_{i=1}^{p} tr( (Σ_N + P_i^T Σ P_i)^(−1) (Y_i Y_i^T) ),   (44)

where F(˜S_1, ···, ˜S_p | Σ) = log f(˜S_1, ···, ˜S_p | Σ). Differentiating twice with respect to σ = vec(Σ) yields

∂²F / (∂σ ∂σ^T) = (n/(2p)) ∑_{i=1}^{p} [ P_i A_i^T P_i^T ⊗ P_i A_i P_i^T + P_i A_i^T ˜S_i A_i^T P_i^T ⊗ P_i A_i P_i^T − P_i A_i^T P_i^T ⊗ P_i A_i ˜S_i A_i P_i^T ],   (45)

with A_i = (Σ_N + P_i^T Σ P_i)^(−1). The Fisher information matrix is computed by taking the expectation of (45), which yields

I(Σ) = (n/p) ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T.   (46)

Note that, from (45) to (46), we used E[ P_i A_i^T ˜S_i A_i^T P_i^T ⊗ P_i A_i P_i^T ] = P_i A_i^T P_i^T ⊗ P_i A_i P_i^T, since P_i and A_i are deterministic matrices and E[˜S_i] = Σ_N + P_i^T Σ P_i. Hence, the variance of the estimator of Σ is bounded by

var(˜Σ) ≥ (p/n) Tr[ ( ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T )^(−1) ].   (47)

APPENDIX D
PROOF OF LEMMA 5.3

For P_i ∈ R^{l×m} and A_i ∈ R^{m×m} with m ≤ l it holds that

rank( P_i A_i P_i^T ) = rank( P_i A_i^T P_i^T ) ≤ m.   (48)

Using the facts that rank(C ⊗ D) = rank(C) rank(D) and rank(C + D) ≤ rank(C) + rank(D), it can be concluded that

rank( ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T ) ≤ m² p.   (49)

Hence, for fixed values of m and l, the matrix

I(Σ) = ∑_{i=1}^{p} P_i A_i^T P_i^T ⊗ P_i A_i P_i^T ∈ R^{l²×l²}   (50)

is singular if p < l²/m².

REFERENCES

[1] D. Donoho, “Compressed sensing,” IEEE Trans. on Inf. Theory, vol. 52, pp. 1289–1306, Apr. 2006.
[2] R. G. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Processing Magazine, vol. 24, pp. 118–121, Jul. 2007.
[3] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, pp. 21–30, Mar. 2008.
[4] D. D. Ariananda and G. Leus, “Compressive wideband power spectrum estimation,” IEEE Transactions on Signal Processing, vol. 60, pp. 4775–4789, Sep. 2012.
[5] D. Romero and G. Leus, “Compressive covariance sampling,” in Information Theory and Applications Workshop (ITA), pp. 1–8, Feb. 2013.
[6] D. Romero, D. D. Ariananda, Z. Tian, and G. Leus, “Compressive covariance sensing: Structure-based compressive sensing beyond sparsity,” IEEE Signal Processing Magazine, vol. 33, pp. 78–93, Jan. 2016.
[7] J. E. Fowler, “Compressive-projection principal component analysis,” IEEE Trans. on Image Processing, vol. 18, pp. 2230–2242, Oct. 2009.
[8] S. Qin, Y. D. Zhang, M. G. Amin, and A. M. Zoubir, “Generalized coprime sampling of Toeplitz matrices for spectrum estimation,” IEEE Transactions on Signal Processing, vol. 65, pp. 81–94, Jan. 2017.
[9] M. Testa and E. Magli, “Compressive estimation and imaging based on autoregressive models,” IEEE Transactions on Image Processing, vol. 25, pp. 5077–5087, Nov. 2016.
[10] Y. Chen, Y. Chi, and A. J. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” IEEE Trans. on Inf. Theory, vol. 61, pp. 4034–4059, Jul. 2015.
[11] J. M. Bioucas-Dias, D. Cohen, and Y. C. Eldar, “CoVALSA: Covariance estimation from compressive measurements using alternating minimization,” in 22nd European Signal Processing Conference (EUSIPCO), pp. 999–1003, Sep. 2014.
[12] M. D. Farrell and R. M. Mersereau, “On the impact of PCA dimension reduction for hyperspectral detection of difficult targets,” IEEE Geoscience and Remote Sensing Letters, vol. 2, pp. 192–195, Apr. 2005.
[13] F. Palsson, J. R. Sveinsson, M. O. Ulfarsson, and J. A. Benediktsson, “Model-based fusion of multi- and hyperspectral images using PCA and wavelets,” IEEE Trans. on Geoscience and Remote Sensing, vol. 53, pp. 2652–2663, May 2015.
[14] G. Chen and S. Qian, “Denoising of hyperspectral imagery using principal component analysis and wavelet shrinkage,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, pp. 973–980, Mar. 2011.
[15] J. Monsalve, H. Rueda-Chacon, and H. Arguello, “Sensing matrix design for compressive spectral imaging via binary principal component analysis,” IEEE Transactions on Image Processing, vol. 29, pp. 4003–4012, 2020.
[16] G. Martin and J. M. Bioucas-Dias, “Hyperspectral blind reconstruction from random spectral projections,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, pp. 2390–2399, Jun. 2016.
[17] H. Qi and S. M. Hughes, “Invariance of principal components under low-dimensional random projection of the data,” in Proceedings of the International Conference on Image Processing (ICIP), pp. 937–940, 2012.
[18] F. Pourkamali-Anaraki, “Estimation of the sample covariance matrix from compressive measurements,” IET Signal Processing, vol. 10, pp. 1089–1095, Dec. 2016.
[19] O. Besson, S. Bidon, and J. Y. Tourneret, “Bounds for estimation of covariance matrices from heterogeneous samples,” IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 3357–3362, 2008.
[20] F. P. Anaraki and S. M. Hughes, “Efficient recovery of principal components from compressive measurements with application to Gaussian mixture model estimation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2332–2336, May 2014.
[21] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” in Compressed Sensing (Y. C. Eldar and G. Kutyniok, eds.), pp. 210–268, Cambridge: Cambridge University Press, Nov. 2010.
[22] A. N. Iusem, “On the convergence properties of the projected gradient method for convex optimization,” Computational & Applied Mathematics, vol. 22, no. 1, pp. 37–52, 2003.
[23] K. Grigoriadis, A. Frazho, and R. Skelton, “Application of alternating convex projection methods for computation of positive Toeplitz matrices,” IEEE Trans. on Signal Processing, vol. 42, pp. 1873–1875, Jul. 1994.
[24] A. Beck, “Chapter 10: The proximal gradient method,” First-Order Methods in Optimization, pp. 269–329, 2017.
[25] F. Zhu, Y. Wang, S. Xiang, B. Fan, and C. Pan, “Structured sparse method for hyperspectral unmixing,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 88, pp. 101–118, 2014.
[26] A. A. Mueller, A. Hausold, and P. Strobl, “HySens-DAIS/ROSIS imaging spectrometers at DLR,” in