Linear models based on noisy data and the Frisch scheme
Lipeng Ning, Tryphon T. Georgiou, Allen Tannenbaum, Stephen P. Boyd
Abstract.
We address the problem of identifying linear relations among variables based on noisy measurements. This is, of course, a central question in problems involving "Big Data." Often a key assumption is that measurement errors in each variable are independent. This precise formulation has its roots in the work of Charles Spearman in 1904 and of Ragnar Frisch in the 1930's. Various topics such as errors-in-variables, factor analysis, and instrumental variables all refer to alternative formulations of the problem of how to account for the anticipated way that noise enters the data. In the present paper we begin by describing the basic theory and provide alternative modern proofs to some key results. We then go on to consider certain generalizations of the theory as well as to apply certain novel numerical techniques to the problem. A central role is played by the Frisch-Kalman dictum, which aims at a noise contribution that allows a maximal set of simultaneous linear relations among the noise-free variables, a rank minimization problem. In the years since Frisch's original formulation, there have been several insights, including trace minimization as a convenient heuristic to replace rank minimization. We discuss convex relaxations and certificates guaranteeing global optimality. A complementary point of view to the Frisch-Kalman dictum is introduced in which models lead to a min-max quadratic estimation error for the error-free variables. Points of contact between the two formalisms are discussed and various alternative regularization schemes are indicated.
1. Introduction.
The standard paradigm in modeling is to postulate that measured quantities contain a contribution of "accidental deviation" [41] from the otherwise "uniformities" that characterize an underlying law. Therefore, a key issue when identifying dependencies between variables is how to account for the contribution of noise in the data. Various assumptions on the structure of noise and of the possible dependencies lead to a number of corresponding methodologies.

The purpose of the present paper is to consider, from a modern computational point of view, the important situation where the noise components are assumed independent, and the consequences of this assumption; the data is typically abstracted into a corresponding (estimated) covariance statistic. This independence assumption underlies the errors-in-variables model [11, 26] and factor analysis [3, 29, 19, 21, 37], and has a century-old history [16, 35, 27]; see also [22, 23, 31, 44, 17, 40, 2, 15]. Accordingly, given the large classical literature on this problem, this paper will also have a tutorial flavor.

∗ L. Ning is with the Dept. of Electrical & Comp. Eng., University of Minnesota, Minneapolis, Minnesota 55455, [email protected]
† T. T. Georgiou is with the Dept. of Electrical & Comp. Eng., University of Minnesota, Minneapolis, Minnesota 55455, [email protected]
‡ A. Tannenbaum is with the Comprehensive Cancer Center and Dept. of Electrical & Comp. Eng., University of Alabama, Birmingham, AL 35294, [email protected]
§ S. P. Boyd is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, [email protected]

The precise formulation has its roots in the work of Ragnar Frisch in the 1930's. The central assumption is that the noise components are independent of the underlying variables and are also mutually independent [22, 23]. In addition, since several alternative linear relations are typically consistent with the data, a maximal set of simultaneous dependencies is sought as a means to limit uncertainty and to provide canonical models [22, 23]. This particular dictum gives rise to a (non-convex) rank-minimization problem. Thus, it is somewhat surprising that the special case where the maximal number of possible simultaneous linear relations is equal to 1 can be explicitly characterized; this was accomplished over half a century ago by Reiersøl [35]; see also [22, 26]. To date no other case is known that admits a precise closed-form solution.

In recent years, emphasis has been shifting from hard, non-convex optimization to convex regularizations, which in addition scale nicely with the size of the problem. Following this trend we revisit the Frisch problem from several alternative angles. We first present an overview of the literature, and present several new insights and proofs. In the process, we also give an extension of Reiersøl's result to complex matrices. Our main interest is in exploring recently studied convex optimization problems that approximate rank minimization by use of suitable surrogates. In particular, we study iterative schemes for treating the general Frisch problem and focus on certificates that guarantee optimality. In parallel, we consider a viewpoint that serves as an alternative to the Frisch problem where now, instead of a maximal number of simultaneous linear relations, we seek a uniformly optimal estimator for the unobserved data under the independence assumption of the Frisch scheme. The optimal estimator is obtained as a solution to a min-max optimization problem. Rank-regularized and min-max alternatives are discussed and an example is given to highlight the potential and limitations of the techniques.

The remainder of this paper is organized as follows. We first introduce the errors-in-variables problem in Section 3. In Section 4, we revisit the Frisch problem, and a related problem due to Shapiro, and provide a geometric interpretation of Reiersøl's result along with a generalization to complex-valued covariances.
In Section 5, we present an iterative trace-minimization scheme for solving the Frisch problem and provide computable lower bounds for the minimum rank. In Section 7, we bring up the question of estimation in the context of the Frisch scheme and motivate a suitable rank-regularized min-max optimization problem in Section 8. Some concluding remarks are provided in Section 10.
2. Notation.

R(·), N(·): range space, null space
Π_X: orthogonal projection onto X
M > 0 (resp., M ≥ 0): M is positive definite (resp., positive semi-definite)
S_n = {M | M ∈ R^{n×n}, M = M′}
S_{n,+} = {M | M ∈ S_n, M ≥ 0}
H_n = {M | M ∈ C^{n×n}, M = M*}
H_{n,+} = {M | M ∈ H_n, M ≥ 0}
[·]_{kℓ} (resp., [·]_k): the (k,ℓ)-th entry (resp., k-th entry)
|M|: determinant of M ∈ R^{n×n}
n_+(·): number of positive eigenvalues
diag: R^{n×n} → R^n: M ↦ d where [d]_i = [M]_{ii} for i = 1,...,n
diag*: R^n → R^{n×n}: d ↦ D where D is diagonal and [D]_{ii} = [d]_i for i = 1,...,n
M ≻_e 0 (resp., ⪰_e 0, ≺_e 0, ⪯_e 0): the off-diagonal entries of M are > 0 (resp., ≥ 0, < 0, ≤ 0), possibly after a sign transformation M ↦ TMT with T diagonal having diagonal entries ±1
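For readers experimenting in code, the maps diag and diag* above both correspond to numpy's np.diag, which extracts the diagonal of a matrix and builds a diagonal matrix from a vector (the use of numpy here is our own illustration, not part of the original notation):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])

d = np.diag(M)    # diag : R^{n x n} -> R^n, extracts the diagonal of M
D = np.diag(d)    # diag*: R^n -> R^{n x n}, builds a diagonal matrix from d

assert np.allclose(d, [1.0, 4.0])
assert np.allclose(D, [[1.0, 0.0],
                       [0.0, 4.0]])
assert np.allclose(np.diag(D), d)   # diag(diag*(d)) = d
```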
3. Data and basic assumptions.
Consider a Gaussian vector x taking values in R^{n×1} having zero mean and covariance Σ. We assume that it represents an additive mixture of a Gaussian "noise-free" vector ˆx and a "noise component" ˜x, thus

x = ˆx + ˜x. (3.1)

The entries of ˜x are assumed independent of one another and independent of the entries of ˆx, with both vectors having zero mean and covariances ˆΣ and ˜Σ, respectively. Thus,

E(˜x ˜x′) =: ˜Σ is diagonal, (3.2a)
E(ˆx ˜x′) = 0. (3.2b)

Throughout, E(·) denotes the expectation operation and 0 denotes the zero vector/matrix of appropriate size. The noise-free entries of ˆx are assumed to satisfy a set of q simultaneous linear relations. Hence, M′ˆx = 0, with M ∈ R^{n×q} and n > rank(M) = q > 0. It follows that E(ˆx ˆx′) =: ˆΣ has

rank(ˆΣ) = n − q (3.2c)

and ˆΣM = 0. Statistics are typically estimated from observation records. To this end, consider a sequence

x_t ∈ R^{n×1}, t = 1,...,T,

of independent measurements (realizations) of x and, likewise, let ˆx_t and ˜x_t represent the corresponding values of the noise-free variable and noise components. Denote by

X = [x_1 x_2 ... x_T] ∈ R^{n×T}

the matrix of observations of x and similarly denote by ˆX and ˜X the corresponding matrices of the noise-free and noise entries, respectively. Data for identifying relations among the noise-free variables are typically limited to the observation matrix X and, neglecting a scaling factor of 1/T, the data is typically abstracted in the form of a sample covariance XX′. For the most part we will assume that sample covariances are accurate approximations of true covariances, and hence the modeling assumptions amount to

˜X ˜X′ ≃ diagonal, (3.3a)
ˆX ˜X′ ≃ 0, (3.3b)
rank(ˆX) = n − q, (3.3c)

since M′ˆX = 0. The number of possible linear relations among the noise-free variables and the corresponding coefficient matrix need to be determined from either X or Σ. This motivates the Frisch and Shapiro problems discussed in Section 4.
An alternative set of problems can be motivated by the need to determine ˆX from X via a suitable decomposition

X = ˆX + ˜X (3.4)

in a way that is consistent with the existence of a set of q linear relations. We will return to this in Section 8.
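As an illustration of the model (3.1)-(3.4), the following sketch (our own; the sizes n = 4, q = 2, the noise levels, and the sample length T are arbitrary choices) generates noise-free data satisfying q linear relations, adds independent noise, and checks the structure (3.3) of the resulting sample covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, T = 4, 2, 200_000

# Noise-free data confined to an (n - q)-dimensional subspace: each column
# of Xhat satisfies q independent linear relations M' xhat_t = 0.
L = rng.standard_normal((n, n - q))
Xhat = L @ rng.standard_normal((n - q, T))

# Independent noise in each variable, i.e., a diagonal noise covariance.
sigma = np.array([0.5, 1.0, 0.7, 0.3])
Xtil = np.diag(sigma) @ rng.standard_normal((n, T))

X = Xhat + Xtil                      # observed data, as in (3.1) and (3.4)

# Sample covariances scaled by 1/T; check the modeling assumptions (3.3).
C_noise = (Xtil @ Xtil.T) / T        # approximately diagonal, cf. (3.3a)
C_cross = (Xhat @ Xtil.T) / T        # approximately zero, cf. (3.3b)
C_hat = (Xhat @ Xhat.T) / T          # rank n - q, cf. (3.3c)

off = C_noise - np.diag(np.diag(C_noise))
assert np.abs(off).max() < 0.05
assert np.abs(C_cross).max() < 0.05
assert np.linalg.matrix_rank(C_hat) == n - q
```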
4. The problems of Frisch and Shapiro.
We begin with the Frisch problem concerning the decomposition of a covariance matrix Σ that is consistent with the assumptions in Section 3. The fact that, in practice, Σ is an empirical sample covariance motivates relaxing (3.2a-3.2c) in various ways. In particular, relaxation of the constraint ˜Σ ≥ 0 gives rise to a problem introduced by Shapiro, stated below.

Problem 1 (The Frisch problem). Given Σ ∈ S_{n,+}, determine

mr_+(Σ) := min{rank(ˆΣ) | Σ = ˜Σ + ˆΣ, ˜Σ, ˆΣ ≥ 0, ˜Σ is diagonal}. (4.1)

Problem 2 (The Shapiro problem). Given Σ ∈ S_{n,+}, determine

mr(Σ) := min{rank(ˆΣ) | Σ = ˜Σ + ˆΣ, ˆΣ ≥ 0, ˜Σ is diagonal}. (4.2)

The Frisch problem was studied by several researchers; see, e.g., [23, 31, 44, 45] and the references therein. On the other hand, Shapiro [37] introduced the above relaxed version, removing the requirement that ˜Σ ≥ 0, in an attempt to gain understanding of the algebraic constraints imposed by the off-diagonal elements of Σ on the decomposition. We refer to mr_+(·) as the Frisch minimum rank and mr(·) as the Shapiro minimum rank. The former is lower semicontinuous whereas the latter is not, as stated next. This difference is crucial if one wants to apply this type of methodology to real data, namely, some sort of continuity is necessary.
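Before stating this result, the difference between the two problems can be explored by brute force on a tiny example; the grid search below (our own sketch, not an algorithm from the text) approximates mr_+ and mr for a 3 × 3 matrix by scanning diagonal matrices D and keeping the smallest rank of Σ − D that remains positive semi-definite:

```python
import numpy as np
from itertools import product

def min_rank_grid(Sigma, allow_negative, grid):
    # Brute-force sketch of Problems 1 and 2: scan diagonal D over a grid
    # and keep the smallest rank of Sigma - D that is PSD (and, for the
    # Frisch problem, additionally D >= 0).
    best = Sigma.shape[0]
    for d in product(grid, repeat=Sigma.shape[0]):
        if not allow_negative and min(d) < 0:
            continue
        w = np.linalg.eigvalsh(Sigma - np.diag(d))
        if w.min() < -1e-9:
            continue                       # Sigma - D not PSD
        best = min(best, int((w > 1e-7).sum()))
    return best

Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])        # Sigma = I + 11'
grid = np.linspace(-2.0, 2.0, 21)          # step 0.2, includes d_i = 1
assert min_rank_grid(Sigma, allow_negative=True, grid=grid) == 1
assert min_rank_grid(Sigma, allow_negative=False, grid=grid) == 1
```

For this particular Σ both values equal 1, attained at D = I, so the Frisch and Shapiro ranks coincide here.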
Proposition 1. mr_+(·) is lower semicontinuous whereas mr(·) is not.

Proof: Assume that, for a given Σ > 0, there is a sequence Σ_1, Σ_2, ... of positive definite matrices such that Σ_i → Σ while

mr_+(Σ_i) < mr_+(Σ) = r, for all i = 1, 2, ....

Decompose Σ_i = ˆΣ_i + D_i with rank(ˆΣ_i) < r, ˆΣ_i ≥ 0, D_i ≥ 0, and D_i diagonal. Then there exist convergent subsequences ˆΣ_{i_k} → ˆΣ and D_{i_k} → D, as k → ∞. Since Σ_{i_k} → Σ, we have ˆΣ + D = Σ and, by the lower semicontinuity of the rank,

rank(ˆΣ) ≤ lim inf_{k→∞} rank(ˆΣ_{i_k}) < r = mr_+(Σ).

This is a contradiction. On the other hand, to see that mr(·) is not lower semicontinuous, consider

Σ = [ 2 −1 −1 ; −1 1 0 ; −1 0 1 ],  Σ_ǫ = [ 2 −1 −1 ; −1 1 ǫ ; −1 ǫ 1 ],  ˆΣ_ǫ = [ 1/ǫ −1 −1 ; −1 ǫ ǫ ; −1 ǫ ǫ ]

for ǫ > 0. Clearly mr(Σ) = 2. Also lim_{ǫ→0} Σ_ǫ = Σ. Yet Σ_ǫ = ˆΣ_ǫ + D_ǫ, where ˆΣ_ǫ has rank 1 and D_ǫ = diag*(2 − 1/ǫ, 1 − ǫ, 1 − ǫ) is diagonal, so that mr(Σ_ǫ) = 1.

Assuming that the off-diagonal entries of Σ > 0 of size n × n are known with absolute certainty, any "minimum rank" (mr_+(·) and mr(·)) is bounded below by the so-called Ledermann bound, i.e.,

(2n + 1 − √(8n + 1))/2 ≤ mr(Σ) ≤ mr_+(Σ), (4.3)

which holds on a generic set of positive definite matrices Σ, that is, on a (Zariski open) subset of positive definite matrices. Equivalently, the set of matrices Σ for which mr(Σ) is lower than the Ledermann bound is non-generic; their entries satisfy algebraic equations which fail under small perturbation. To see this, consider any factorization

Σ = D + FF′, with F ∈ R^{n×r}.

There are (n − r)r + r(r + 1)/2 independent entries in F (when accounting for the action of an orthogonal transformation of F on the right), whereas the values of the off-diagonal entries of Σ impose n(n − 1)/2 constraints. Thus, the number of independent entries in F exceeds the number of constraints when (n − r)² ≤ n + r, which then leads to the inequality (2n + 1 − √(8n + 1))/2 ≤ r. The bound was first noted in [29] while the independence of the constraints has been detailed in [4]. In general, the computation of the exact value of mr_+(Σ) and mr(Σ) is a non-trivial matter. Thus, it is rather surprising that an exact analytic result is available for both, in the special case when r = n − 1. We review this next in the form of two theorems.
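Before turning to these theorems, the failure of lower semicontinuity of mr(·) in Proposition 1 can be checked numerically. The 3 × 3 family below is our own reconstruction of such a counterexample and is meant purely as an illustration:

```python
import numpy as np

def psd_rank(M, tol=1e-9):
    # Numerical rank of a symmetric PSD matrix (asserts PSD-ness).
    w = np.linalg.eigvalsh(M)
    assert w.min() > -tol, "matrix is not PSD"
    return int((w > tol).sum())

Sigma = np.array([[ 2.0, -1.0, -1.0],
                  [-1.0,  1.0,  0.0],
                  [-1.0,  0.0,  1.0]])

eps = 1e-3
# Rank-one term vv' with v = (1/sqrt(eps), -sqrt(eps), -sqrt(eps))'.
v = np.array([1.0 / np.sqrt(eps), -np.sqrt(eps), -np.sqrt(eps)])
Shat = np.outer(v, v)
# Diagonal correction; note the large negative first entry, which the
# Shapiro problem permits but the Frisch problem would not.
D = np.diag([2.0 - 1.0 / eps, 1.0 - eps, 1.0 - eps])
Sigma_eps = Shat + D

assert psd_rank(Shat) == 1                        # so mr(Sigma_eps) <= 1
assert np.abs(Sigma_eps - Sigma).max() < 2 * eps  # Sigma_eps -> Sigma
assert psd_rank(Sigma) == 2                       # while mr(Sigma) = 2
```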
Theorem 2 (Reiersøl's theorem [35]). Let Σ ∈ S_{n,+} and Σ > 0. Then

mr_+(Σ) = n − 1 ⇔ Σ⁻¹ ≻_e 0.

Theorem 3 (Shapiro's theorem [38]). Let Σ ∈ S_{n,+} be irreducible. Then

mr(Σ) = n − 1 ⇔ Σ ⪯_e 0.

The characterization of covariance matrices Σ for which mr_+(Σ) = n − 1 is developed below. We begin with two basic lemmas for irreducible matrices M ∈ S_{n,+}. Recall that a matrix is reducible if by permutation of rows and columns it can be brought into a block-diagonal form; otherwise it is irreducible.

Lemma 4.1. Let M > 0 be irreducible. Then,

M ⪯_e 0 ⇒ M⁻¹ ≻_e 0. (4.4)

Lemma 4.2. Let M ≥ 0 be irreducible. Then,

M ⪯_e 0 ⇒ nullity(M) ≤ 1. (4.5)

Proof (of Lemma 4.1):
It is easy to verify that (4.4) holds true for matrices of size 2 × 2. Assume that the statement also holds true for matrices of size up to k × k, for a certain value of k, and consider a matrix M of size (k + 1) × (k + 1) with M > 0 and M ⪯_e 0; without loss of generality (absorbing the sign transformation) we take the off-diagonal entries of M to be ≤ 0. Partition

M = [ A b ; b′ c ]

so that c is a scalar and, hence, A is of size k × k. Partitioning conformably,

M⁻¹ = [ F g ; g′ h ]

where F = (A − bc⁻¹b′)⁻¹, g = −A⁻¹bh, and h = (c − b′A⁻¹b)⁻¹ > 0.

For the case where A is irreducible: because A has size k × k and A ⪯_e 0, invoking our hypothesis we conclude that A⁻¹ ≻_e 0. Now, since b has only non-positive entries and b ≠ 0, g = −A⁻¹bh has positive entries. Since −bc⁻¹b′ ⪯_e 0 and A ⪯_e 0, then A − bc⁻¹b′ ⪯_e 0 and F = (A − bc⁻¹b′)⁻¹ has positive entries by hypothesis.

For the case where A is reducible: permutation of columns and rows brings A into a block-diagonal form with irreducible blocks. Thus, A⁻¹ is also a block-diagonal matrix with each block entry-wise positive. Because M is irreducible, b must have at least one non-zero entry corresponding to the rows of each diagonal block of A. Then A − bc⁻¹b′ is irreducible and ⪯_e 0, and A⁻¹b has all of its entries negative. Therefore F = (A − bc⁻¹b′)⁻¹ and g = −A⁻¹bh have positive entries. Therefore M⁻¹ ≻_e 0.

Proof (of Lemma 4.2):
Rearrange rows and columns of M and partition

M = [ A B ; B′ C ]

so that A is nonsingular and of maximal size, equal to the rank of M. Then

C = B′A⁻¹B, (4.6)

and C has size equal to the nullity of M. We first show that B′A⁻¹B ⪰_e 0. Assume that A is irreducible. Then A⁻¹ ≻_e 0 by Lemma 4.1, while B has non-positive entries, not all zero (since M is irreducible). In this case, B′A⁻¹B ≻_e 0. If on the other hand A is reducible, Lemma 4.1 applied to the (irreducible) blocks of A implies that A⁻¹ ⪰_e 0. Therefore, in this case, B′A⁻¹B ⪰_e 0. Now C ⪯_e 0 while B′A⁻¹B ⪰_e 0; in view of (4.6), either C is a scalar (and hence there are no off-diagonal negative entries), or both C and B′A⁻¹B are diagonal. The latter contradicts the assumption that M is irreducible. Hence, the nullity of M can be at most 1.

Lemma 4.2 provides the following geometric insight, stated as a corollary.

Corollary 4. In any Euclidean space of dimension n, there can be at most n + 1 vectors forming an obtuse angle with one another.

Proof: The Grammian M = [v′_k v_ℓ]_{k,ℓ=1}^{n+q} of a selection {v_k | k = 1,...,n+q} of such vectors has off-diagonal entries which are negative. Hence, by Lemma 4.2, the nullity of M cannot exceed 1. Since the Grammian of n + q vectors in an n-dimensional space has rank at most n, its nullity is at least q; hence q ≤ 1.

The necessity part of Theorem 3 is also a direct corollary of Lemma 4.2.

Corollary 5. Let Σ ∈ S_{n,+} be irreducible. Then Σ ⪯_e 0 ⇒ mr(Σ) = n − 1.

Proof: Let Σ = ˆΣ + ˜Σ, with ˜Σ diagonal and ˆΣ ≥ 0. Then ˆΣ is irreducible since Σ is irreducible, and ˆΣ ⪯_e 0 since it shares its off-diagonal entries with Σ. From Lemma 4.2, the nullity of ˆΣ is at most 1. Thus mr(Σ) = n − 1.

4.2. A dual decomposition. The matrix inversion lemma provides a correspondence between an additive decomposition of a positive-definite matrix and a decomposition of its inverse, albeit with a different sign in one of the summands. This is stated next.
Lemma 4.3.
Let

Σ = D + FF′ (4.7)

with Σ, D ∈ S_{n,+}, Σ, D > 0, and F ∈ R^{n×r}. Then

S := Σ⁻¹ = E − GG′ (4.8)

for E = D⁻¹ and G = D⁻¹F(I + F′D⁻¹F)^{−1/2}. Conversely, if (4.8) holds with G ∈ R^{n×r}, then so does (4.7) for D = E⁻¹ and F = E⁻¹G(I − G′E⁻¹G)^{−1/2}.

Proof: This follows from the identity (I ± MM′)⁻¹ = I ∓ M(I ± M′M)⁻¹M′.

Application of the lemma suggests the following variation to Frisch's problem.

Problem 3 (The dual Frisch problem). Given a positive-definite n × n symmetric matrix S, determine the dual minimum rank:

mr_dual(S) := min{rank(ˆS) | S = E − ˆS, ˆS, E ≥ 0, E is diagonal}.

Clearly, if S = Σ⁻¹ = E − GG′ (as in (4.8)), then E > 0. Furthermore, a decomposition of S always gives rise to a decomposition Σ = D + FF′ (as in (4.7)) with the terms FF′ and GG′ having the same rank. Thus, it is clear that

mr_+(Σ) ≤ mr_dual(Σ⁻¹), (4.9)

and that the above holds with equality when an optimal choice of D ≡ ˜Σ in (4.1) is invertible. However, if D is allowed to be singular, the ranks of the summands FF′ and GG′ may not agree. This can be seen using the following example. Take

Σ = [ 1 1 1 ; 1 2 1 ; 1 1 2 ].

It is clear that Σ admits a decomposition Σ = ˜Σ + ˆΣ, in correspondence with (4.7), where ˜Σ = D = diag{0, 1, 1} while ˆΣ = FF′, with F′ = [1, 1, 1], is of rank one. On the other hand,

S = Σ⁻¹ = [ 3 −1 −1 ; −1 1 0 ; −1 0 1 ].

Taking E = diag{e_1, e_2, e_3} in (4.8), it is evident that the rank of

GG′ = E − S = [ e_1 − 3 1 1 ; 1 e_2 − 1 0 ; 1 0 e_3 − 1 ]

cannot be less than 2 without violating the non-negativity assumption for the summand GG′. The minimal rank for the factor G is 2 and is attained by taking e_2 = e_3 = 2 and e_1 = 5.

On the other hand, in general, if we perturb Σ to Σ + ǫI and, accordingly, D to D + ǫI, then

mr_dual((Σ + ǫI)⁻¹) ≤ mr_+(Σ), ∀ǫ > 0. (4.10)

Equality in (4.10) holds for sufficiently small values of ǫ. Thus, mr_+ and mr_dual are closely related. However, it should be noted that mr_dual(·) fails to be lower semicontinuous since a small perturbation of the off-diagonal entries can reduce mr_dual(·). Yet, interestingly, an exact characterization of the case mr_dual(S) = n − 1 is available; this parallels the characterizations of mr_+ and mr being equal to n − 1, and the condition for mr_dual will be used to prove the Reiersøl and Shapiro theorems.
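The correspondence of Lemma 4.3 and the possible gap in (4.9) when D is singular can be confirmed numerically; the matrices below reconstruct the 3 × 3 example just discussed and should be read as an illustration:

```python
import numpy as np

def psd_rank(M, tol=1e-9):
    # Numerical rank of a symmetric PSD matrix (asserts PSD-ness).
    w = np.linalg.eigvalsh(M)
    assert w.min() > -tol
    return int((w > tol).sum())

# Sigma = F F' + D with F = [1, 1, 1]' and singular D = diag(0, 1, 1).
F = np.ones((3, 1))
D = np.diag([0.0, 1.0, 1.0])
Sigma = F @ F.T + D
assert psd_rank(F @ F.T) == 1          # a rank-one Frisch decomposition

S = np.linalg.inv(Sigma)               # S = Sigma^{-1}, cf. (4.8)
assert np.allclose(S, [[ 3.0, -1.0, -1.0],
                       [-1.0,  1.0,  0.0],
                       [-1.0,  0.0,  1.0]])

# Dual side: with E = diag(5, 2, 2) the summand G G' = E - S is PSD of
# rank 2, and no admissible diagonal E does better, so the inequality
# (4.9) is strict here because D above is singular.
E = np.diag([5.0, 2.0, 2.0])
GGt = E - S
assert psd_rank(GGt) == 2
```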
Theorem 6.
For S ∈ S_{n,+}, with S > 0 and irreducible,

mr_dual(S) = n − 1 ⇔ S ⪰_e 0. (4.11)

Proof: If S ⪰_e 0 and E is a diagonal matrix with E ≥ 0 and ˆS := E − S ≥ 0, then ˆS ⪯_e 0 and is irreducible, so by Lemma 4.2 its nullity is at most 1. Choosing E so that E − S is singular, rank(ˆS) = n − 1. Hence, mr_dual(S) = n − 1.

To show that mr_dual(S) = n − 1 ⇒ S ⪰_e 0, we assume that the condition S ⪰_e 0 fails and prove that mr_dual(S) < n − 1. We first argue the case of a 3 × 3 matrix S = [s_{ij}]_{i,j=1}^3. Provided S is not ⪰_e 0, set

e_i = s_{ii} − s_{ij}s_{ki}/s_{jk}

for i ∈ {1, 2, 3} and (i, j, k) being permutations of (1, 2, 3), and let ˜S = diag*(e_1, e_2, e_3). It can be seen that ˜S − S ≥ 0 and rank(˜S − S) = 1. To verify the latter, observe that ˜S − S = vv′ for

v′ = [ √(e_1 − s_{11}), √(e_2 − s_{22}), √(e_3 − s_{33}) ].

This establishes the reverse implication for matrices of size 3 × 3. Assume now that it also holds for matrices of size up to (n − 1) × (n − 1) for some n ≥ 4, and let S, ˜S be of size n × n with S not ⪰_e 0 and ˜S diagonal. We need to prove that mr_dual(S) < n − 1. We partition

S = [ A b ; b′ c ], ˜S = [ E 0 ; 0 e ]

with A, E being (n − 1) × (n − 1) and E diagonal, and seek ˜S such that ˜S − S ≥ 0 with nullity exceeding 1. Note that e cannot be equal to c, since otherwise ˜S − S ≥ 0 forces b = 0 and S is reducible. Further, ˜S − S ≥ 0 requires e > c and

M := E − (A + b(e − c)⁻¹b′) ≥ 0.

The nullity of ˜S − S coincides with that of M. To prove our claim, it suffices to show that A_e := A + b(e − c)⁻¹b′ is not ⪰_e 0, or that A_e is reducible, for some e > c. (Since, in either case, by our induction hypothesis, the nullity of M for a suitable E exceeds 1.)

We now consider the two possible cases where S is not ⪰_e 0. The first is that A is not ⪰_e 0. Then obviously A_e is not ⪰_e 0 for e − c sufficiently large. The second possibility is that S is not ⪰_e 0 while A ⪰_e 0. But if A is (transformed into) element-wise non-negative, then bb′ must have at least one pair of negative off-diagonal entries. Then, consider A_e = A + λbb′ for λ = (e − c)⁻¹ ∈ (0, ∞). Evidently, for certain values of λ entries of A_e change sign. If a whole row becomes zero for a particular value of λ, then A_e is reducible. In all other cases, there are values of λ for which A_e is not ⪰_e 0.

Proof of Theorem 2: We first show that Σ⁻¹ ≻_e 0 implies mr_+(Σ) = n − 1. From the continuity of the inverse, (Σ + ǫI)⁻¹ ≻_e 0 for sufficiently small ǫ > 0. Applying Theorem 6, we conclude that

mr_dual((Σ + ǫI)⁻¹) = n − 1.

Since mr_+(Σ) ≥ mr_dual((Σ + ǫI)⁻¹) as in (4.10), we conclude that mr_+(Σ) = n − 1. To show that mr_+(Σ) = n − 1 ⇒ Σ⁻¹ ≻_e 0, we show that assuming Σ⁻¹ is not ≻_e 0 while mr_+(Σ) = n − 1 leads to a contradiction. By the lower semicontinuity of mr_+(·) (Proposition 1), there exist a symmetric matrix ∆ and an ǫ > 0 such that (Σ + ǫ∆)⁻¹ is not ⪰_e 0, while mr_+(Σ + ǫ∆) = n − 1. Then, from Theorem 6, mr_dual((Σ + ǫ∆)⁻¹) < n − 1, while from (4.9),

mr_+(Σ + ǫ∆) ≤ mr_dual((Σ + ǫ∆)⁻¹).

Thus, we have a contradiction and therefore Σ⁻¹ ≻_e 0.

Proof of Theorem 3: Given Σ ≥ 0, take λ > 0 such that λI − Σ ≥ 0, a diagonal D, and let E := λI − D. Since Σ − D = E − (λI − Σ),

mr(Σ) = mr_dual(λI − Σ). (4.12)

If Σ is irreducible and Σ ⪯_e 0, then λI − Σ is irreducible and λI − Σ ⪰_e 0. It follows (Theorem 6) that mr_dual(λI − Σ) = n − 1, and therefore mr(Σ) = n − 1. Conversely, if mr(Σ) = n − 1, then mr_dual(λI − Σ) = n − 1, which implies that λI − Σ ⪰_e 0, i.e., Σ ⪯_e 0.

The original proof in [38] claims that for any Σ ≥ 0 of size n × n with n > 3 that is not ⪯_e 0, there exists an (n − 1) × (n − 1) principal submatrix that is not ⪯_e 0. This statement fails for the following sign pattern:

[ + 0 − − ; 0 + − + ; − − + 0 ; − + 0 + ].

This matrix cannot be transformed to have all non-positive off-diagonal entries, yet all of its 3 × 3 principal submatrices are ⪯_e 0.

For either the Frisch or the Shapiro problem, a solution is not unique in general. The parametrization of solutions to the Frisch problem when mr_+(Σ) = n − 1 is given next.

Proposition 7.
Let Σ ∈ S_{n,+} with Σ > 0 and Σ⁻¹ ≻_e 0. The following hold:
i) For D ≥ 0 diagonal with Σ − D ≥ 0 and singular, there is a probability vector ρ (ρ has entries ≥ 0 that sum up to 1) such that (Σ − D)Σ⁻¹ρ = 0.
ii) For any probability vector ρ,

D = diag*( [ρ]_i / [Σ⁻¹ρ]_i, i = 1,...,n )

satisfies Σ − D ≥ 0 and Σ − D is singular.

Proof: See [22, 26].

Thus, solutions of Frisch's problem under Reiersøl's conditions are in bijective correspondence with probability vectors. A very similar result holds true for Shapiro's problem.
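Proposition 7(ii) is straightforward to test numerically. In the sketch below, P and the random probability vector ρ are our own choices; Σ = P⁻¹ satisfies Reiersøl's condition since Σ⁻¹ = P is entry-wise positive:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

# A covariance whose inverse is entry-wise positive (Reiersol's condition):
P = np.eye(n) + 0.1 * np.ones((n, n))    # P entry-wise positive and P > 0
Sigma = np.linalg.inv(P)

# Proposition 7(ii): any probability vector rho yields an admissible
# singular decomposition via D = diag*([rho]_i / [Sigma^{-1} rho]_i).
rho = rng.random(n)
rho /= rho.sum()
D = np.diag(rho / (P @ rho))             # here Sigma^{-1} rho = P rho

Shat = Sigma - D
w = np.linalg.eigvalsh(Shat)
assert w.min() > -1e-9                   # Sigma - D is PSD ...
assert abs(w[0]) < 1e-9                  # ... and singular,
assert np.linalg.matrix_rank(Shat, tol=1e-9) == n - 1   # of rank n - 1
```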
Proposition 8.
Let Σ ∈ S_{n,+} be irreducible and have ≤ 0 off-diagonal entries. The following hold:
i) For D diagonal with Σ − D ≥ 0 and singular, there is a strictly positive vector v such that (Σ − D)v = 0.
ii) For any strictly positive vector v ∈ R^{n×1},

D = diag*( [Σv]_i / [v]_i, i = 1,...,n ) (4.13)

satisfies Σ − D ≥ 0 and Σ − D is singular.

Proof: To prove (i), we note that if (Σ − D)v = 0 with v ≠ 0, then v ≻_e 0 (up to an overall sign). To see this, consider (Σ − D + ǫI)⁻¹ for ǫ > 0. From Lemma 4.1, (Σ − D + ǫI)⁻¹ ≻_e 0, and since v is an eigenvector corresponding to its largest eigenvalue, a power iteration argument concludes that v ≻_e 0. Regarding (ii), it is easy to verify that the diagonal matrix D in (4.13) for v ≻_e 0 satisfies (Σ − D)v = 0. We only need to prove that Σ − D ≥ 0. Without loss of generality we assume that all the entries of v are equal. (This can always be done by scaling the entries of v and scaling accordingly rows and columns of Σ.) Since v is a null vector of M := Σ − D and since M has ≤ 0 off-diagonal entries,

[M]_{ii} = Σ_{j≠i} |[M]_{ij}|.

The Gersgorin Circle Theorem (e.g., see [43]) now states that every eigenvalue of M lies within at least one of the closed discs {Disk([M]_{ii}, Σ_{j≠i}|[M]_{ij}|), i = 1,...,n}. No disc intersects the negative real line. Therefore Σ − D ≥ 0.

4.3. Complex-valued covariances. Complex-valued covariance matrices are commonly used in radar and antenna arrays [42]. The rank of Σ − D, for a noise covariance D as in the Frisch problem, is an indication of the number of (dominant) scatterers in the scattering field. If this is of the same order as the number of array elements (e.g., n − 1), the relevant case is mr_+(Σ) = n − 1. Consider complex-valued observation vectors

x_t = y_t + i z_t, t = 1,...,T,

where i = √−1, y_t, z_t ∈ R^{n×1}, and set X = [x_1,...,x_T] = Y + iZ with Y = [y_1,...,y_T], Z = [z_1,...,z_T]. The (scaled) sample covariance is

Σ = XX* = Σ_r + iΣ_i ∈ H_{n,+},

where the real part Σ_r := YY′ + ZZ′ is symmetric, the imaginary part Σ_i := ZY′ − YZ′ is anti-symmetric, and "*" denotes complex-conjugate transposition. As before, we consider a decomposition Σ = ˆΣ + D with ˆΣ ≥ 0 and diagonal D ≥ 0. In this section we present a sufficient condition for a Reiersøl-type case where mr_+(Σ) = n − 1. The real representation

R := [ Σ_r Σ_i ; Σ_i′ Σ_r ] ∈ S_{2n,+}

does not allow taking advantage of earlier results. The structure of R, with anti-symmetric off-diagonal blocks, implies that if [a′, b′]′ is a null vector then so is [−b′, a′]′ (since, accordingly, a + ib and ia − b are both null vectors of Σ). Thus, in general, the nullity of R is not 1 and the theorem of Reiersøl is not applicable. Further, the corresponding noise covariance is diagonal with repeated blocks. The following lemmas for the complex case echo Lemma 4.1 and Lemma 4.2.
Lemma 4.4.
Let M ∈ H_{n,+} be irreducible. If the argument of each non-zero off-diagonal entry of −M is in (−π/n, π/n), then each entry of M⁻¹ has argument in (−π + π/n, π − π/n).

Proof: It is easy to verify the lemma for 2 × 2 matrices. Assume that it holds for matrices of size up to n × n and consider an (n + 1) × (n + 1) matrix M that satisfies the conditions of the lemma. Partition

M = [ A b ; b* c ]

with A of size n × n and, conformably,

M⁻¹ = [ F g ; g* h ].

By assumption, the non-zero entries of −A and −b have their argument in (−π/(n+1), π/(n+1)). Then, by bounding the possible contribution of the respective terms, it follows that the argument of each of the entries of −(A − bc⁻¹b*) is in (−π/n, π/n). Then, the argument of each entry of F = (A − bc⁻¹b*)⁻¹ is in (−π + π/n, π − π/n); this follows by the induction hypothesis since A − bc⁻¹b* is n × n. Clearly, (−π + π/n, π − π/n) ⊂ (−π + π/(n+1), π − π/(n+1)). Regarding g, by bounding the possible contribution of the respective terms, we similarly conclude that the argument of each of its non-zero entries is in (−π + π/(n+1), π − π/(n+1)).

Lemma 4.5.
Let M ∈ H_{n,+} be irreducible. If the argument of each non-zero off-diagonal entry of −M is in (−π/n, π/n), then rank(M) ≥ n − 1.

Proof: First rearrange rows and columns of M, and partition as

M = [ A B ; B* C ]

so that A is nonsingular and of size equal to the rank of M, which we denote by r. Then

C = B*A⁻¹B (4.14)

and C has size equal to the nullity of M. We now compare the arguments of the off-diagonal entries of C and B*A⁻¹B, and show that they cannot be equal unless C is a scalar. Since the off-diagonal entries of −A have their argument in (−π/n, π/n) ⊂ (−π/r, π/r), the off-diagonal entries of A⁻¹ have their argument in (−π + π/r, π − π/r) from Lemma 4.4. Now, the (k, ℓ) entry of B*A⁻¹B is

[B*A⁻¹B]_{kℓ} = Σ_{i,j} [B*]_{ki} [A⁻¹]_{ij} [B]_{jℓ}

and the phase of each summand is

arg([B*]_{ki} [A⁻¹]_{ij} [B]_{jℓ}) ∈ (−π + π/r − 2π/n, π − π/r + 2π/n).

Thus, the non-zero off-diagonal entries of B*A⁻¹B have positive real part, while

arg(−[C]_{kℓ}) ∈ (−π/n, π/n).

Hence, either the off-diagonal entries of B*A⁻¹B and C are zero, in which case these are diagonal matrices and M must be reducible, or B*A⁻¹B and C are both scalars. This concludes the proof.

Theorem 9.
Let Σ ∈ H_{n,+} be irreducible. If the argument of each non-zero off-diagonal entry of −Σ is in (−π/n, π/n), then mr(Σ) = n − 1.

Proof: The matrix Σ − D is irreducible since D is diagonal, and −(Σ − D) has the same off-diagonal entries as −Σ. If Σ − D ≥ 0, then, since the argument of each non-zero off-diagonal entry of −(Σ − D) is in (−π/n, π/n), Lemma 4.5 applies and gives that rank(Σ − D) ≥ n − 1.

Since mr_+(Σ) ≥ mr(Σ), under the condition of Theorem 9 we also have mr_+(Σ) = n − 1. It is also clear that for S ∈ H_{n,+} irreducible with all non-zero off-diagonal entries having argument in (−π/n, π/n), we also conclude that mr_dual(S) = n − 1.
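Closing the discussion of the complex case, the structure of Σ = XX* and the null-vector pairing in the real representation R noted above can be confirmed on random complex data (the sizes below are our own choice; taking T < n makes Σ singular so that the pairing is visible):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 4, 3                      # T < n, so Sigma (and hence R) is singular

Y = rng.standard_normal((n, T))
Z = rng.standard_normal((n, T))
X = Y + 1j * Z
Sigma = X @ X.conj().T           # complex sample covariance, in H_{n,+}

Sigma_r, Sigma_i = Sigma.real, Sigma.imag
assert np.allclose(Sigma_r, Y @ Y.T + Z @ Z.T)    # symmetric real part
assert np.allclose(Sigma_i, Z @ Y.T - Y @ Z.T)    # anti-symmetric imag part
assert np.allclose(Sigma_i, -Sigma_i.T)

# Real representation R; null vectors come in pairs [a; b] and [-b; a],
# so the nullity of R is even and the nullity-one case cannot occur.
R = np.block([[Sigma_r, Sigma_i], [Sigma_i.T, Sigma_r]])
w, V = np.linalg.eigh(R)
a, b = V[:n, 0], V[n:, 0]        # a null vector of R (smallest eigenvalue)
assert abs(w[0]) < 1e-8
paired = np.concatenate([-b, a])
assert np.linalg.norm(R @ paired) < 1e-8
```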
5. Trace minimization heuristics.
The rank of a matrix is a non-convex function of its elements and the problem to find the matrix of minimal rank within a given set is a difficult one, in general. Therefore, certain heuristics have been developed over the years to obtain approximate solutions. In particular, in the context of factor analysis, trace minimization has been pursued as a suitable heuristic [30, 37, 38], thereby relaxing the Frisch problem into

min_{D: Σ ≥ D ≥ 0} trace(Σ − D),

for a diagonal matrix D; relaxing the constraint D ≥ 0 gives the corresponding heuristic for the Shapiro problem. The use of the trace as a surrogate for rank was justified by Fazel et al. [13], who proved that these constitute convex envelopes of the rank function on bounded sets of matrices.

The relation between minimum-trace factor analysis and minimum-rank factor analysis goes back to Ledermann [28] (see [9] and [36]). Herein we only refer to two propositions which characterize minimizers for the two problems, Frisch's and Shapiro's, respectively.

Proposition 10 ([9]). Let Σ = ˆΣ_1 + D_1 > 0 for a diagonal D_1 ≥ 0. Then,

(ˆΣ_1, D_1) = arg min{trace(ˆΣ) | Σ = ˆΣ + D, ˆΣ ≥ 0, diagonal D ≥ 0} (5.1a)

⇔ ∃ Λ_1 ≥ 0 with ˆΣ_1Λ_1 = 0 and

[Λ_1]_{ii} = 1, if [D_1]_{ii} > 0,
[Λ_1]_{ii} ≥ 1, if [D_1]_{ii} = 0.

Proposition 11 ([36]). Let Σ = ˆΣ_2 + D_2 > 0 for a diagonal D_2. Then,

(ˆΣ_2, D_2) = arg min{trace(ˆΣ) | Σ = ˆΣ + D, ˆΣ ≥ 0, diagonal D} (5.1b)

⇔ ∃ Λ_2 ≥ 0 with ˆΣ_2Λ_2 = 0 and [Λ_2]_{ii} = 1 ∀i.

Evidently, when the solutions to these two problems differ and D_1 ≠ D_2, then there exists k ∈ {1,...,n} such that [D_2]_{kk} < [D_1]_{kk} = 0. Further, the essence of Proposition 11 is that a singular ˆΣ originates from such a minimization problem if and only if there is a correlation matrix in its null space. The matrices Λ_1 and Λ_2 appear as Lagrange multipliers in the respective problems.

Factor analysis is closely related to low-rank matrix completion as well as to sparse and low-rank decomposition problems. Typically, low-rank matrix completion asks for a matrix X which satisfies a linear constraint A(X) = b and has low/minimal rank (A(·) denotes a linear map A: R^{n×n} → R^p). Thus, factor analysis corresponds to the special case where A(·) maps X onto its off-diagonal entries. In recent work by Recht et al. [34], the nuclear norm of X was considered as a convex relaxation of rank(X) for such problems and a sufficient condition for exact recovery was provided. However, this sufficient condition amounts to the requirement that the null space of A(·) contains no matrix of low rank. Therefore, since in factor analysis diagonal matrices are in fact contained in the null space of A(·) and include matrices of low rank, the condition in [34] does not apply directly. Other works on low-rank matrix completion (see, e.g., [34, 6]) mainly focus on assessing the probability of exact recovery and on constructing efficient computational algorithms for large-scale low-rank completion problems [24, 25]. On the other hand, since diagonal matrices are sparse (most of their entries are zero), the work on matrix decomposition into sparse and low-rank components by Chandrasekaran et al. [7] is very pertinent.
In this, the ℓ_1 and nuclear norms were used as surrogates for sparsity and rank, respectively, and a sufficient condition for exact recovery was provided which captures a certain "rank-sparsity incoherence"; an analogous but stronger sufficient "incoherence" condition which applies to problem (5.1b) is given in [36].

Both mr_+(Σ) and mr(Σ) in (4.1) and (4.2), respectively, remain invariant under scaling of rows and the corresponding columns of Σ by the same coefficients. On the other hand, the minimizers in (5.1a) and (5.1b) and their respective ranks are not invariant under scaling. This fact motivates weighted-trace minimization,

min{trace(WˆΣ) | Σ = ˆΣ + D, ˆΣ ≥ 0, diagonal D ≥ 0}, (5.2)

given Σ > 0 and a diagonal weight W > 0. As before, the characterization of minimizers relates to a suitable condition for the corresponding Lagrange multipliers:
Proposition 12 ([38]). Let Σ = ˆΣ + D > 0 for a diagonal matrix D ≥ 0 and consider a diagonal W > 0. Then

(ˆΣ, D) = arg min { trace(W ˆΣ) | Σ = ˆΣ + D, ˆΣ ≥ 0, D ≥ 0 diagonal } (5.3)
⇔ ∃ Λ ≥ 0 with ΛˆΣ = 0 and [Λ]_ii = [W]_ii if [D]_ii > 0, [Λ]_ii ≥ [W]_ii if [D]_ii = 0.

A corresponding necessary and sufficient condition for (ˆΣ, D) to be a minimizer in Shapiro's problem is that there exists a Grammian in the null space of ˆΣ whose diagonal entries are equal to the diagonal entries of W.

Minimum-rank solutions may be recovered as solutions to (5.3) using suitable choices of weight. However, these choices depend on Σ and are not known in advance; this motivates selecting certain canonical Σ-dependent weights, as well as iteratively improving the choice of weight. One should note that, since D is diagonal, letting W be a not-necessarily-diagonal matrix does not change the problem: only the diagonal entries of W determine the minimizer.

We first consider taking W = Σ⁻¹. A rationale for this choice is that the minimal value in (5.2) bounds mr₊(Σ) from below, since for any decomposition Σ = ˆΣ + D,

rank(ˆΣ) = trace(ˆΣ♯ ˆΣ) ≥ trace((ˆΣ + D)⁻¹ ˆΣ) = trace(Σ⁻¹ ˆΣ), (5.4)

where ♯ denotes the Moore-Penrose pseudo-inverse. Continuing with this line of analysis,

rank(ˆΣ) = trace(ˆΣ♯ ˆΣ) ≥ trace((ˆΣ + ǫI)⁻¹ ˆΣ) (5.5)

for any ǫ > 0, which suggests the iterative re-weighting process

D^(k+1) := arg min_D trace( (Σ − D^(k) + ǫI)⁻¹ (Σ − D) ) (5.6)

for k = 1, 2, . . . and D^(0) := 0. In fact, as pointed out in [14], (5.6) corresponds to minimizing log det(Σ − D + ǫI) by local linearization.

Next we provide a sufficient condition for ˆΣ to be a stationary point of (5.6), i.e., for ˆΣ to satisfy

arg min_D trace( (ˆΣ + ǫI)⁻¹ (ˆΣ − D) ) = 0. (5.7)

The notation ◦ used below denotes the element-wise (Schur) product between vectors or matrices [20]; for vectors a, b ∈ Rⁿ, a ◦ b ∈ Rⁿ with [a ◦ b]_i = [a]_i [b]_i.

Proposition 13.
Let ˆΣ ∈ S_{n,+} and let the columns of U form a basis of R(ˆΣ). If

R(U ◦ U) ⊂ R(Π_{N(ˆΣ)} ◦ Π_{N(ˆΣ)}), (5.8)

then ˆΣ satisfies (5.7) for all ǫ ∈ (0, ǫ₀) and some ǫ₀ > 0.

We first need the following result, which generalizes [39, Theorem 3.1].
Lemma 5.1.
For A ∈ R^{n×p} and B ∈ R^{n×q} having columns a₁, . . . , a_p and b₁, . . . , b_q, respectively, let

C = [a₁ ◦ b₁, a₁ ◦ b₂, . . . , a_p ◦ b_q] ∈ R^{n×pq},
φ : Rⁿ → Rⁿ, d ↦ diag(AA′ diag*(d) BB′), and
ψ : R^{p×q} → Rⁿ, ∆ ↦ diag(A ∆ B′).

Then R(φ) = R(ψ) = R((AA′) ◦ (BB′)) = R(C).

Proof: Since diag(AA′ diag*(d) BB′) = ((AA′) ◦ (BB′)) d, it follows that R(φ) = R((AA′) ◦ (BB′)). Moreover, diag(A ∆ B′) = Σ_{i=1}^{p} Σ_{j=1}^{q} (a_i ◦ b_j)[∆]_ij, and hence R(ψ) = R(C). It only remains to show that R(C) = R((AA′) ◦ (BB′)). This follows from

(AA′) ◦ (BB′) = Σ_{i=1}^{p} Σ_{j=1}^{q} (a_i a_i′) ◦ (b_j b_j′) = Σ_{i=1}^{p} Σ_{j=1}^{q} (a_i ◦ b_j)(a_i ◦ b_j)′ = C C′.

Thus R(C) = R((AA′) ◦ (BB′)).

Proof of Proposition 13:
We show that ˆΣ satisfies (5.7). Let rank(ˆΣ) = r and let ˆΣ = U S U′ be the eigendecomposition of ˆΣ with S = diag*(s), s ∈ Rʳ. Let the columns of V form an orthonormal basis of the null space of ˆΣ, i.e., Π_{N(ˆΣ)} = V V′. Then

(ˆΣ + ǫI)⁻¹ = (ˆΣ + ǫ Π_{R(ˆΣ)} + ǫ Π_{N(ˆΣ)})⁻¹ = (ˆΣ + ǫ Π_{R(ˆΣ)})♯ + (1/ǫ) Π_{N(ˆΣ)},

and

arg min_{D : ˆΣ ≥ D} trace( (ˆΣ + ǫI)⁻¹ (ˆΣ − D) ) = arg min_{D : ˆΣ ≥ D} trace( ( ǫ (ˆΣ + ǫ Π_{R(ˆΣ)})♯ + Π_{N(ˆΣ)} ) (ˆΣ − D) ).

From Proposition 12, (5.7) holds if there is M ≥ 0 such that

diag(V M V′) = diag( ǫ (ˆΣ + ǫ Π_{R(ˆΣ)})♯ + Π_{N(ˆΣ)} ). (5.9)

Obviously, if ǫ = 0 then M = I satisfies the above equation. We consider M of the form M = I + ∆. For (5.9) to hold, we need diag((ˆΣ + ǫ Π_{R(ˆΣ)})♯) to be in the range of ψ for

ψ : ∆ ↦ diag(V ∆ V′).

From Lemma 5.1, R(ψ) = R(Π_{N(ˆΣ)} ◦ Π_{N(ˆΣ)}). On the other hand, since

ǫ (ˆΣ + ǫ Π_{R(ˆΣ)})♯ = U diag*( [ ǫ/([s]₁ + ǫ), . . . , ǫ/([s]_r + ǫ) ] ) U′,

it follows that diag( ǫ (ˆΣ + ǫ Π_{R(ˆΣ)})♯ ) ∈ R(U ◦ U). So if (5.8) holds, there is always a ∆ such that M = I + ∆ satisfies (5.9). Moreover, it is also required that I + ∆ ≥ 0. Since the map from ǫ to ∆ is continuous, for small enough ǫ, i.e., in an interval (0, ǫ₀), the condition I + ∆ ≥ 0 can always be satisfied.

We note that (5.8) is a sufficient condition for ˆΣ to be a stationary point in (5.7) in both Frisch's and Shapiro's settings.
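The rationale behind the weight W = Σ⁻¹, namely the bound (5.4), is easy to confirm numerically. The following sketch (numpy; the dimensions, random seed, and noise levels are our own illustrative choices) builds a covariance with known low-rank-plus-diagonal structure and checks that trace(Σ⁻¹ ˆΣ) never exceeds rank(ˆΣ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 3

# a rank-r "signal" covariance plus a positive diagonal "noise" covariance
F = rng.standard_normal((n, r))
Sigma_hat = F @ F.T                      # PSD, rank r
D = np.diag(rng.uniform(0.5, 1.5, n))    # diagonal noise covariance
Sigma = Sigma_hat + D                    # full-rank data covariance

# bound (5.4): rank(Sigma_hat) >= trace(Sigma^{-1} Sigma_hat),
# since every eigenvalue of Sigma^{-1} Sigma_hat lies in [0, 1)
bound = np.trace(np.linalg.solve(Sigma, Sigma_hat))
assert bound <= np.linalg.matrix_rank(Sigma_hat)
```

The gap between the bound and the true rank is what the re-weighting iteration (5.6) progressively tightens.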
6. Certificates of minimum rank.
We are interested in obtaining bounds on the minimal rank for the Frisch problem, so as to ensure optimality when candidate solutions are obtained by the earlier optimization approach in (5.6). The following two bounds were proposed in [44], and follow from Theorem 2. However, both of these bounds require an exhaustive search, which may be prohibitively expensive when n is large.

Corollary 14.
Let Σ ∈ S_{n,+} and Σ > 0. If there is an s × s principal submatrix of Σ whose inverse is entry-wise positive, then

mr₊(Σ) ≥ s − 1. (6.1a)

If there is an s × s principal submatrix of Σ⁻¹ which is entry-wise positive, then

mr₊(Σ) ≥ s − 1. (6.1b)

Next we discuss three other bounds that are computationally more tractable; the first two were proposed by Guttman [18]. Guttman's bounds are based on a conservative assessment of the admissible range of each of the diagonal entries of D = Σ − ˆΣ.

Proposition 15.
Let Σ ∈ S_{n,+} and let

D₁ := diag*(diag(Σ)),
D₂ := ( diag*(diag(Σ⁻¹)) )⁻¹.

Then the following hold:

mr₊(Σ) ≥ n₊(Σ − D₁), (6.1c)
mr₊(Σ) ≥ n₊(Σ − D₂). (6.1d)

Further, n₊(Σ − D₁) ≤ n₊(Σ − D₂).

Proof: The proof follows from the fact that Σ ≥ D implies D ≤ D₂ ≤ D₁. See [18] for details.

It is also easy to see that mr(Σ) ≥ n₊(Σ − D₁), which provides a lower bound for the minimum rank in Shapiro's problem. Next we return to a bound which we noted earlier in (5.4).

Proposition 16.
Let Σ ∈ S_{n,+}. Then the following holds:

mr₊(Σ) ≥ min { trace(Σ⁻¹(Σ − D)) | Σ ≥ D ≥ 0, D diagonal }. (6.1e)

Proof:
The statement follows readily from (5.4). Evidently, an analogous statement holds for mr(Σ). We note that (6.1c) and (6.1d) remain invariant under scaling of rows and corresponding columns, whereas (6.1e) does not; hence (6.1e) cannot be compared directly with the other two.
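The Guttman bounds require only a diagonal extraction and an eigenvalue count, so they are cheap to compute. A minimal sketch (numpy; the dimensions, seed, and eigenvalue tolerance are our own assumptions) is:

```python
import numpy as np

def guttman_bounds(Sigma, tol=1e-9):
    """Lower bounds (6.1c) and (6.1d) on mr_+(Sigma)."""
    D1 = np.diag(np.diag(Sigma))                       # D1 = diag*(diag(Sigma))
    D2 = np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))  # D2 = (diag*(diag(Sigma^{-1})))^{-1}
    n_plus = lambda M: int((np.linalg.eigvalsh(M) > tol).sum())
    return n_plus(Sigma - D1), n_plus(Sigma - D2)

rng = np.random.default_rng(1)
F = rng.standard_normal((6, 2))
Sigma = F @ F.T + np.diag(rng.uniform(0.5, 1.0, 6))    # mr_+(Sigma) <= 2 by construction

b1, b2 = guttman_bounds(Sigma)
assert b1 <= b2 <= 2   # (6.1d) is at least as tight as (6.1c); both are valid here
```

Since a rank-2 decomposition of this Σ exists by construction, both bounds are indeed below mr₊(Σ) ≤ 2, and the ordering b1 ≤ b2 illustrates the last claim of Proposition 15.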
7. Correspondence between decompositions.
We now return to the decomposition of the data matrix X = ˆX + ˜X as in (3.4) and its relation to the corresponding sample covariances. The decomposition of X into "noise-free" and "noisy" components implies a corresponding decomposition of the sample covariance. In the converse direction, a decomposition Σ = ˆΣ + ˜Σ leads to a family of compatible decompositions of X, which corresponds to the boundary of a matrix ball. This is discussed next.

Proposition 17.
Let X ∈ R^{n×T} and Σ := XX′. If

Σ = ˆΣ + ˜Σ (7.1)

with ˆΣ, ˜Σ symmetric and non-negative definite, there exists a decomposition

X = ˆX + ˜X (7.2a)

for which

ˆX ˜X′ = 0, (7.2b)
ˆΣ = ˆX ˆX′, (7.2c)
˜Σ = ˜X ˜X′. (7.2d)

Further, all pairs (ˆX, ˜X) that satisfy (7.2a-7.2d) are of the form

ˆX = ˆΣ Σ⁻¹ X + R^{1/2} V, ˜X = ˜Σ Σ⁻¹ X − R^{1/2} V, (7.3)

with

R := ˆΣ − ˆΣ Σ⁻¹ ˆΣ (7.4a)
   = ˜Σ − ˜Σ Σ⁻¹ ˜Σ (7.4b)
   = ˆΣ Σ⁻¹ ˜Σ
   = ˜Σ Σ⁻¹ ˆΣ,

and V ∈ R^{n×T} such that V V′ = I, X V′ = 0.

Proof: The proof relies on a standard lemma ([10, Theorem 2]) which states that if A ∈ R^{n×T}, B ∈ R^{n×m} with m ≤ T such that AA′ = BB′, then A = BU for some U ∈ R^{m×T} with UU′ = I. Thus, we let A := X,

S := [ ˆΣ 0 ; 0 ˜Σ ],

and B := [ I I ] S^{1/2}, where S^{1/2} is the matrix square root of S. It follows that there exists a matrix U as above for which A = BU, and therefore we can take

[ ˆX ; ˜X ] := S^{1/2} U.

This establishes the existence of the decomposition (7.2a).

In order to parameterize all such pairs (ˆX, ˜X), let U_o be an orthogonal (square) matrix such that X U_o = [ Σ^{1/2} 0 ]. Then ˆX U_o and ˜X U_o must be of the form

ˆX U_o =: [ ˆX₀ ∆ ], ˜X U_o =: [ ˜X₀ −∆ ], (7.5)

with ˆX₀, ˜X₀ square matrices. Since

[ ˆX ; ˜X ] [ ˆX′ ˜X′ ] = [ ˆΣ 0 ; 0 ˜Σ ],

then

ˆX₀ ˆX₀′ + ∆∆′ = ˆΣ, (7.6a)
ˆX₀ ˜X₀′ − ∆∆′ = 0, (7.6b)
˜X₀ ˜X₀′ + ∆∆′ = ˜Σ. (7.6c)

Substituting ˆX₀ ˜X₀′ for ∆∆′ into (7.6a) and using the fact that ˜X₀ = X₀ − ˆX₀ with X₀ = Σ^{1/2}, we obtain that ˆX₀ = ˆΣ Σ^{−1/2}. Similarly, using (7.6c) instead, we obtain that ˜X₀ = ˜Σ Σ^{−1/2}. Substituting into (7.6b), (7.6a) and (7.6c), we obtain the following three relations:

∆∆′ = ˆΣ Σ⁻¹ ˜Σ = ˆΣ − ˆΣ Σ⁻¹ ˆΣ = ˜Σ − ˜Σ Σ⁻¹ ˜Σ.

Since ∆∆′ and the Σ's are all symmetric, ∆∆′ = ˜Σ Σ⁻¹ ˆΣ as well. Thus, ∆ = R^{1/2} V with V V′ = I.
The proof is completed by substituting the expressions for ˆX₀ and ∆ into (7.5).

Interestingly,

rank(R) + rank(Σ) = rank( [ ˆΣ ˆΣ ; ˆΣ Σ ] ) = rank( [ ˆΣ 0 ; 0 ˜Σ ] ) = rank(ˆΣ) + rank(˜Σ),

and hence the rank of the "uncertainty radius" R of the corresponding ˆX- and ˜X-matrix spheres is

rank(R) = rank(ˆΣ) + rank(˜Σ) − rank(Σ).

When identifying ˆX from the data matrix X, different criteria may be used to quantify uncertainty. One such criterion is the rank of R, while another is its trace, which is the variance of the estimation error in determining ˆX. This topic is considered next and its relation to the Frisch decomposition highlighted.
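Proposition 17 is constructive, and the parameterization (7.3) can be checked directly. In the sketch below (numpy; for simplicity we take the admissible splitting ˜Σ = αΣ, and the sizes, seed, and α are our own choices, with T ≥ 2n so that enough null-space directions of X exist) we build one pair (ˆX, ˜X) and verify (7.2b)-(7.2d):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 4, 12
X = rng.standard_normal((n, T))
Sigma = X @ X.T

# an admissible splitting: Sigma_tilde = alpha * Sigma (both parts PSD)
alpha = 0.3
Sigma_hat, Sigma_tilde = (1 - alpha) * Sigma, alpha * Sigma

# V: n orthonormal rows orthogonal to the rows of X (requires T >= 2n)
_, _, W = np.linalg.svd(X)           # rows of W form an orthonormal basis of R^T
V = W[n:2 * n, :]                    # directions in the null space of X

# R^{1/2} via eigendecomposition of R = Sigma_hat - Sigma_hat Sigma^{-1} Sigma_hat
R = Sigma_hat - Sigma_hat @ np.linalg.solve(Sigma, Sigma_hat)
w, U = np.linalg.eigh(R)
R_half = (U * np.sqrt(np.clip(w, 0, None))) @ U.T

X_hat = Sigma_hat @ np.linalg.solve(Sigma, X) + R_half @ V     # (7.3)
X_tilde = Sigma_tilde @ np.linalg.solve(Sigma, X) - R_half @ V

assert np.allclose(X_hat + X_tilde, X)           # (7.2a)
assert np.allclose(X_hat @ X_tilde.T, 0)         # (7.2b)
assert np.allclose(X_hat @ X_hat.T, Sigma_hat)   # (7.2c)
assert np.allclose(X_tilde @ X_tilde.T, Sigma_tilde)  # (7.2d)
```

Different choices of the rows of V sweep out the matrix ball of compatible decompositions described above.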
8. Uncertainty and worst-case estimation.
The basic premise of the decomposition (7.1) is that, in principle, no probabilistic description of the data is needed. Thus, under the assumptions of Proposition 17, R represents a deterministic radius of uncertainty in interpreting the data. On the other hand, when data and noise are probabilistic in nature and represent samples of jointly Gaussian random vectors x, x̂, x̃ as in (3.1 - 3.2a), the conditional expectation of x̂ given x is E{x̂ | x} = ˆΣ Σ⁻¹ x, while the variance of the error,

E{ (x̂ − ˆΣΣ⁻¹x)(x̂ − ˆΣΣ⁻¹x)′ } = ˆΣ − ˆΣ Σ⁻¹ ˆΣ = R,

is the radius of the deterministic uncertainty set. Either way, it is of interest to assess how this radius depends on the decomposition of Σ. Since the decomposition of Σ in the Frisch problem is not unique, it is natural to seek a uniformly optimal choice of the estimate Kx for x̂ over all admissible decompositions. To this end, we denote the mean-squared-error loss function

L(K, ˆΣ, ˜Σ) := trace( E((x̂ − Kx)(x̂ − Kx)′) ) = trace( ˆΣ − KˆΣ − ˆΣK′ + K(ˆΣ + ˜Σ)K′ ), (8.1)

and define

S(Σ) := { (ˆΣ, ˜Σ) : Σ = ˆΣ + ˜Σ, ˆΣ ≥ 0, ˜Σ ≥ 0 diagonal }

as the set of all admissible pairs. Thus, a uniformly-optimal decomposition of X into signal plus noise relates to the following min-max problem:

min_K max_{(ˆΣ,˜Σ) ∈ S(Σ)} L(K, ˆΣ, ˜Σ). (8.2)

The minimizer of (8.2) is the uniformly optimal estimator gain K. Analogous min-max problems, over different uncertainty sets, have been studied in the literature [12]. In our setting,

min_K max_{(ˆΣ,˜Σ) ∈ S(Σ)} L(K, ˆΣ, ˜Σ) ≥ max_{(ˆΣ,˜Σ) ∈ S(Σ)} min_K L(K, ˆΣ, ˜Σ) (8.3a)
   = max_{(ˆΣ,˜Σ) ∈ S(Σ)} trace( ˆΣ − ˆΣ Σ⁻¹ ˆΣ ) (8.3b)
   = max_{(ˆΣ,˜Σ) ∈ S(Σ)} trace( ˜Σ − ˜Σ Σ⁻¹ ˜Σ ). (8.3c)

The functions to maximize in (8.3b) and (8.3c) are both strictly concave in ˆΣ and ˜Σ, respectively. Therefore the maximizer is unique.
Thus, we denote

(K_opt, ˆΣ_opt, ˜Σ_opt) := arg max_{(ˆΣ,˜Σ) ∈ S(Σ)} min_K L(K, ˆΣ, ˜Σ), (8.4)

where, clearly, K_opt = ˆΣ_opt Σ⁻¹.

In general, the decomposition suggested by the uniformly optimal estimation problem does not lead to a singular signal covariance ˆΣ. The condition for when that happens is given next. Interestingly, it is expressed in terms of half the candidate noise covariance utilized in obtaining one of the Guttman bounds (Proposition 15).

Proposition 18.
Let Σ > 0, and let

D := (1/2) ( diag*(diag(Σ⁻¹)) )⁻¹ (8.5)

(which equals one half of D₂ defined in Proposition 15). If Σ − D ≥ 0, then

˜Σ_opt = D and ˆΣ_opt = Σ − D. (8.6a)

Otherwise,

˜Σ_opt ≤ D and ˆΣ_opt is singular. (8.6b)

Proof:
From (8.3c),

L(K_opt, ˆΣ_opt, ˜Σ_opt) = max { trace(˜Σ − ˜ΣΣ⁻¹˜Σ) | Σ ≥ ˜Σ ≥ 0, ˜Σ diagonal }
   ≤ max { trace(˜Σ − ˜ΣΣ⁻¹˜Σ) | ˜Σ diagonal } (8.7)
   = (1/2) trace(D),

with the maximum attained for ˜Σ = D. Then (8.6a) follows. In order to prove (8.6b), consider the Lagrangian corresponding to (8.3c),

L(˜Σ, Λ₁, Λ₂) = trace( ˜Σ − ˜ΣΣ⁻¹˜Σ + Λ₁(Σ − ˜Σ) + Λ₂˜Σ ),

where Λ₁, Λ₂ are Lagrange multipliers. The optimal values satisfy

[ I − 2Σ⁻¹˜Σ_opt − Λ₁ + Λ₂ ]_kk = 0, for all k = 1, . . . , n, (8.8a)
Λ₁ ˆΣ_opt = 0, Λ₁ ≥ 0, (8.8b)
Λ₂ ˜Σ_opt = 0, Λ₂ ≥ 0. (8.8c)

If Σ − D fails to be positive semidefinite, we claim that ˆΣ_opt is singular. Assume the contrary, i.e., that ˆΣ_opt > 0. Then from (8.8b), Λ₁ = 0, while from (8.8a), [ I − 2Σ⁻¹˜Σ_opt ]_kk ≤ 0. This gives that [˜Σ_opt]_kk ≥ [D]_kk for all k = 1, . . . , n, and hence Σ − D ≥ Σ − ˜Σ_opt = ˆΣ_opt > 0, which contradicts the fact that Σ − D fails to be positive semidefinite. Therefore ˆΣ_opt is singular.

We now assume that ˜Σ_opt ≰ D. Then there exists k such that [˜Σ_opt]_kk > [D]_kk. From (8.8c) and (8.8a), we have that

[Λ₂]_kk = 0 and [ I − 2Σ⁻¹˜Σ_opt ]_kk ≥ 0,

which contradicts the assumption that [˜Σ_opt]_kk > [D]_kk. Therefore ˜Σ_opt ≤ D, and (8.6b) has been established.

We remark that, while

E((x̂ − Kx)(x̂ − Kx)′) = ˆΣ − KˆΣ − ˆΣK′ + KΣK′ = (ˆΣΣ^{−1/2} − KΣ^{1/2})(ˆΣΣ^{−1/2} − KΣ^{1/2})′ + ˆΣ − ˆΣΣ⁻¹ˆΣ

is matrix-convex in K with a unique minimum at K = ˆΣΣ⁻¹, the error covariance ˆΣ − ˆΣΣ⁻¹ˆΣ may not have a unique maximum in the positive semi-definite sense. To see this, consider

Σ = [ 2 1 ; 1 2 ].

In this case D = (3/4) I, ˆΣ_opt = [ 5/4 1 ; 1 5/4 ], and

ˆΣ_opt − ˆΣ_opt Σ⁻¹ ˆΣ_opt = [ 3/8 3/16 ; 3/16 3/8 ]. (8.9)

On the other hand, for ˆΣ = [ 3/2 1 ; 1 3/2 ],

ˆΣ − ˆΣ Σ⁻¹ ˆΣ = [ 1/3 1/12 ; 1/12 1/3 ],

which is neither larger nor smaller than (8.9) in the sense of semi-definiteness. This is a key reason for considering scalar loss functions of the error covariance, as in (8.1).

Next we note that there is no gap between the min-max and max-min values on the two sides of (8.3a).

Proposition 19.
For Σ ∈ S_{n,+},

min_K max_{(ˆΣ,˜Σ) ∈ S(Σ)} L(K, ˆΣ, ˜Σ) = max_{(ˆΣ,˜Σ) ∈ S(Σ)} min_K L(K, ˆΣ, ˜Σ). (8.10)

Proof:
We observe that, for fixed K, the function L(K, ˆΣ, ˜Σ) is a linear function of (ˆΣ, ˜Σ), and that, for fixed (ˆΣ, ˜Σ), it is a convex function of K. Under these conditions it is standard that (8.10) holds; see, e.g., [5, page 281].

We remark that when D = (1/2)(diag*(diag(Σ⁻¹)))⁻¹ is admissible as a noise covariance, i.e., Σ − D ≥ 0, the optimal signal covariance is ˆΣ_opt = Σ − D, and the gain matrix K_opt = ˆΣ_opt Σ⁻¹ = I − DΣ⁻¹ has all diagonal entries equal to 1/2. Thus, with K_opt in (8.1), the mean-square-error loss is independent of ˆΣ and equal to trace(K_opt Σ K_opt′) for any admissible decomposition of Σ.

We also remark that the key condition of Proposition 18,

Σ ≥ (1/2)(diag*(diag(Σ⁻¹)))⁻¹ ⇔ 2 diag*(diag(Σ⁻¹)) ≥ Σ⁻¹,

can be equivalently written as Σ⁻¹ ◦ (2I − 𝟙𝟙′) ≥ 0 (𝟙 denoting the vector of all ones) and, interestingly, amounts to the positive semi-definiteness of the matrix formed by changing the signs of all off-diagonal entries of Σ⁻¹. The set of all such matrices,

{ S | S ≥ 0, S ◦ (2I − 𝟙𝟙′) ≥ 0 },

is convex, invariant under scaling of rows and corresponding columns, and contains the set of diagonally dominant matrices { S | S ≥ 0, [S]_ii ≥ Σ_{j≠i} |[S]_ij| for all i }.

We conclude this section by noting that trace(R_opt), with

R_opt := ˆΣ_opt − ˆΣ_opt Σ⁻¹ ˆΣ_opt,

quantifies the distance between admissible decompositions of Σ. This is stated next.

Proposition 20.
For Σ > 0 and any pair (ˆΣ, ˜Σ) ∈ S(Σ),

trace( (ˆΣ − ˆΣ_opt) Σ⁻¹ (ˆΣ − ˆΣ_opt)′ ) ≤ trace(R_opt).

Proof:
Clearly 0 ≤ trace(ˆΣ − ˆΣΣ⁻¹ˆΣ), while from Proposition 19,

L(K_opt, ˆΣ, ˜Σ) = trace( ˆΣ − 2 ˆΣ_opt Σ⁻¹ ˆΣ + ˆΣ_opt Σ⁻¹ ˆΣ_opt′ ) (8.11)
   ≤ trace(R_opt).

Thus, trace( ˆΣΣ⁻¹ˆΣ − 2 ˆΣ_opt Σ⁻¹ ˆΣ + ˆΣ_opt Σ⁻¹ ˆΣ_opt′ ) ≤ trace(R_opt), which is the claim.

A decomposition of Σ in accordance with the min-max estimation problem just discussed often produces an invertible signal covariance ˆΣ. On the other hand, it is often the case, and it is the premise of factor analysis, that ˆΣ is singular of low rank and thereby allows identifying linear relations in the data. We now consider combining the mean-square-error loss function with a regularization term promoting a low rank for the signal covariance ˆΣ [13]. More specifically, we consider

J = min_K max_{(ˆΣ,˜Σ) ∈ S(Σ)} ( L(K, ˆΣ, ˜Σ) − λ · trace(ˆΣ) ), (8.12)

for λ ≥ 0, and properties of its solutions.

As noted in Proposition 19 (see [5, page 281]), here too there is no gap between the min-max and the max-min, which becomes

max_{(ˆΣ,˜Σ) ∈ S(Σ)} min_K ( L(K, ˆΣ, ˜Σ) − λ · trace(ˆΣ) )
= max_{(ˆΣ,˜Σ) ∈ S(Σ)} min_K trace( (1 − λ)ˆΣ − KˆΣ − ˆΣK′ + K(ˆΣ + ˜Σ)K′ )
= max_{(ˆΣ,˜Σ) ∈ S(Σ)} trace( (1 − λ)ˆΣ − ˆΣ(ˆΣ + ˜Σ)⁻¹ˆΣ ) (8.13a)
= max_{(ˆΣ,˜Σ) ∈ S(Σ)} trace( −λΣ + (1 + λ)˜Σ − ˜Σ(ˆΣ + ˜Σ)⁻¹˜Σ ). (8.13b)

Since (8.13a) and (8.13b) are strictly concave functions of ˆΣ and ˜Σ, respectively, there is a unique set of optimal values (K_{λ,opt}, ˆΣ_{λ,opt}, ˜Σ_{λ,opt}).

Proposition 21.
Let Σ > 0, let D = (1/2)(diag*(diag(Σ⁻¹)))⁻¹ as in (8.5), let λ_min be the smallest eigenvalue of D^{−1/2} Σ D^{−1/2}, and let (K_{λ,opt}, ˆΣ_{λ,opt}, ˜Σ_{λ,opt}) be as above, for λ ≥ 0. For any λ ≥ λ_min − 1, ˆΣ_{λ,opt} is singular.

Proof: The trace of ( −λΣ + (1 + λ)˜Σ − ˜ΣΣ⁻¹˜Σ ) is maximal for the diagonal choice ˜Σ = (1 + λ)D. For any λ ≥ λ_min − 1, Σ − (1 + λ)D fails to be positive definite. Thus, the constraint Σ − ˜Σ ≥ 0 is active and ˆΣ_{λ,opt} is singular.

Note that λ_min ≤ 2, regardless of whether Σ − D ≥ 0. Hence, for λ ≥ 1, ˆΣ_{λ,opt} is singular. When λ → 0 we recover the solution in Proposition 18, whereas when λ → ∞ we recover the solution in Proposition 10.

9. Accounting for statistical errors. From an applications standpoint, Σ represents an empirical covariance, estimated on the basis of a finite observation record in X. Hence (3.3a) and (3.3b) are only approximately valid, as already suggested in Section 3. Thus, in order to account for sampling errors, we can introduce a penalty for the size of C := ˆX ˜X′, conditioned so that

Σ = ˆΣ + ˜Σ + C + C′,

and a penalty for the distance of ˜Σ from the set { D | D diagonal }. Alternatively, we can use the Wasserstein 2-distance [33, 32] between the respective Gaussian probability density functions, which can be written in the form of a semidefinite program:

d(ˆΣ + D, Σ) = min_C { trace(Σ + ˆΣ + D − C − C′) | [ ˆΣ + D C ; C′ Σ ] ≥ 0 }.

Returning to the uncertainty radius of Section 7 and the problem discussed in Section 8, we note that the problem

max min_K L(K, ˆΣ, D) = max trace( ˆΣ − ˆΣ(ˆΣ + D)⁻¹ˆΣ )

can be expressed as the semidefinite program

max_Q { trace(ˆΣ − Q) | [ Q ˆΣ ; ˆΣ ˆΣ + D ] ≥ 0 }.

Thus, putting the above together, a formulation that incorporates the various tradeoffs between the dimension of the signal subspace, mean-square-error loss, and statistical errors is to maximize

trace(ˆΣ − Q) − λ₁ trace(ˆΣ) − λ₂ trace(ˆΣ + D − C − C′) (9.1)

subject to

[ Q ˆΣ ; ˆΣ ˆΣ + D ] ≥ 0, [ ˆΣ + D C ; C′ Σ ] ≥ 0,

with D ≥ 0 diagonal. The parameters λ₁, λ₂ dictate the relative importance that we place on the various terms and determine the tradeoffs in the problem.

We conclude with an example to highlight the potential and limitations of the techniques. We generate data X in the form

X = F V + ˜X,

where F ∈ R^{n×r}, V ∈ R^{r×T}, and ˜X ∈ R^{n×T}, with n = 50, r = 10, T = 100. The elements of F and V are generated from normal distributions with mean zero and unit variance.
The columns of ˜X are generated from a normal distribution with mean zero and diagonal covariance, itself having (diagonal) entries which are uniformly drawn from an interval [1, ·]. Σ := XX′ is subsequently scaled so that trace(Σ) = 1. We determine

(ˆΣ, Q, D) = arg max { trace(ˆΣ − Q) − λ · trace(ˆΣ) }

subject to

[ Q ˆΣ ; ˆΣ ˆΣ + D ] ≥ 0, d(ˆΣ + D, Σ) ≤ ǫ,

with ˆΣ, D ≥ 0, D diagonal, and tabulate below a typical set of values for the rank of ˆΣ (Table 1) as a function of λ and ǫ. We observe a "plateau" where the rank stabilizes at 10 over a small range of values for ǫ and λ. Naturally, such a plateau may be taken as an indication of a suitable range of parameters. Although in the current setting a small perturbation of the empirical covariance Σ is allowed, the bounds for the rank in (6.1d) and (6.1e) are still pertinent. In fact, for this example, in 7/10 instances where rank(ˆΣ) = 10, the bound in (6.1d) (computed based on the perturbed covariance ˆΣ + D) has been tight and is thus a valid certificate. For the same range of parameters, the bound in (6.1e) has been lower than the actual rank of ˆΣ. In general, the bounds in (6.1d) and (6.1e) are not comparable, as either one may be tighter than the other.

λ \ ǫ    0.08  0.10  0.12  0.14  0.16    ·
1         46    26    24    23    22    22
5         46    17    14    10    10     9
10        45    16    12    10    10     8
20        45    15    12    10    10     8
50        45    15    12    10    10     8
100       45    15    11    10    10     8

Table 1: rank(ˆΣ) as a function of λ and ǫ
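The data-generation step of this example can be sketched as follows (numpy; the seed and the upper end of the noise-variance interval are our own assumptions). For orientation, we also evaluate the Guttman certificate (6.1d) directly on the resulting empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, T = 50, 10, 100

# X = F V + noise, with independent noise of diagonal covariance
F = rng.standard_normal((n, r))
V = rng.standard_normal((r, T))
noise_var = rng.uniform(1.0, 2.0, n)   # upper bound 2 is an assumption
X = F @ V + np.sqrt(noise_var)[:, None] * rng.standard_normal((n, T))

Sigma = X @ X.T
Sigma /= np.trace(Sigma)               # normalize so that trace(Sigma) = 1

# Guttman bound (6.1d): a cheap lower bound on mr_+(Sigma)
D2 = np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))
bound = int((np.linalg.eigvalsh(Sigma - D2) > 1e-12).sum())
```

Solving the regularized semidefinite program itself requires an SDP solver and is not shown here; the bound above is the kind of certificate tabulated alongside Table 1.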
10. Conclusions.
In this paper we considered the general problem of identifying linear relations among variables based on noisy measurements, a classical problem of major importance in the current era of "Big Data." Novel numerical techniques and increasingly powerful computers have made it possible to treat a number of key issues in this topic in a unified manner. Thus, the goal of the paper has been to present and develop, in a unified manner, key ideas of the theory of noise-in-variables linear modeling.

More specifically, we considered two different viewpoints on the linear-model problem under the assumption of independent noise. From an estimation viewpoint, we quantified the uncertainty in estimating "noise-free" data based on noise-in-variables linear models. We proposed a min-max estimation problem which aims at a uniformly optimal estimator; the solution can be obtained using convex optimization. From the modeling viewpoint, we also derived several classical results for the Frisch problem, which asks for the maximum number of simultaneous linear relations. Our results provide geometric insight into the Reiersøl theorem, a generalization to complex-valued matrices, an iterative re-weighted trace-minimization scheme for obtaining solutions of low rank along with a characterization of its fixed points, and certain computationally tractable lower bounds to serve as certificates for identifying the minimum rank. Finally, we considered regularized min-max estimation problems which integrate various objectives (low rank, minimal worst-case estimation error) and illustrated their effectiveness in a numerical example.

In recent years, techniques such as the ones presented in this work have become increasingly important in subjects where one encounters very large noisy datasets, including medical imaging, genomics/proteomics, and finance. It is our hope that the material presented in this paper will prove useful in these areas.
It must be noted that throughout the present work we emphasized independence of the noise in the individual variables. Evidently, more general and versatile structures for the noise statistics can be treated in a similar manner, and these may become important when dealing with large databases.

A very important topic for future research is that of dealing with statistical errors in estimating empirical statistics. It is common to quantify distances using standard matrix norms, as is done in the present paper as well. Alternative distance measures, such as the Wasserstein distance mentioned in Section 9 and others (see, e.g., [32]), may become increasingly important in quantifying statistical uncertainty.

Finally, we raise the question of the asymptotic performance of certificates such as those presented in Section 6. It is important to know how the tightness of the certificate relative to the minimal rank of linear models relates to the size of the problem.

Acknowledgments.
This work was supported in part by grants from NSF, NIH, AFOSR, ONR, and MDA. This work is part of the National Alliance for Medical Image Computing (NA-MIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics. Finally, this project was supported by grants from the National Center for Research Resources (P41-RR-013218) and the National Institute of Biomedical Imaging and Bioengineering (P41-EB-015902) of the National Institutes of Health.
REFERENCES[1]
B. D. O. Anderson and M. Deistler , Identification of dynamic systems from noisy data ,Institute for Econometrics and Operations Research, Technical University, Vienna, (1988).[2] ,
Generalized linear dynamic factor models-a structure theory , in 47th IEEE Conferenceon Decision and Control, 2008, pp. 1980–1985.[3]
T. Anderson and H. Rubin , Statistical inference in factor analysis , in Proceedings of the thirdBerkeley symposium on mathematical statistics and probability, vol. 5, 1956, pp. 111–150.[4]
P. A. Bekker and J. M. F. ten Berge, Generic global identification in factor analysis, Linear Algebra and its Applications, 264 (1997), pp. 255–263.
[5]
S. Boyd and L. Vandenberghe , Convex optimization , Cambridge university press, 2004.[6]
E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[7]
V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky , Rank-sparsity inco-herence for matrix decomposition , SIAM Journal on Optimization, 21 (2011), pp. 572–596.[8]
M. Deistler and B. D. O. Anderson , Linear dynamic errors-in-variables models: Somestructure theory , Journal of Econometrics, 41 (1989), pp. 39–63.[9]
G. Della Riccia and A. Shapiro , Minimum rank and minimum trace of covariance matrices ,Psychometrika, 47 (1982), pp. 443–448.[10]
R. G. Douglas , On majorization, factorization, and range inclusion of operators on hilbertspace , Proceedings of the American Mathematical Society, 17 (1966), pp. 413–415.[11]
J. Durbin , Errors in variables , Revue de l’Institut international de statistique, 22 (1954),pp. 23–32.[12]
Y. Eldar and N. Merhav , A competitive minimax approach to robust estimation of randomparameters , IEEE Transactions on Signal Processing, 52 (2004), pp. 1931–1946.[13]
M. Fazel, H. Hindi, and S. P. Boyd , A rank minimization heuristic with application tominimum order system approximation , in Proceedings of the 2001 American Control Con-ference, vol. 6, 2001, pp. 4734–4739.[14] ,
Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices, in Proceedings of the 2003 American Control Conference, vol. 3, 2003, pp. 2156–2162.
[15]
M. Forni, M. Hallin, M. Lippi, and L. Reichlin , The generalized dynamic-factor model:Identification and estimation , Review of Economics and Statistics, 82 (2000), pp. 540–554.[16]
R. Frisch , Statistical confluence analysis by means of complete regression systems , vol. 5,Universitetets Økonomiske Instituut, 1934.[17]
R. P. Guidorzi , Identification of the maximal number of linear relations from noisy data ,Systems & control letters, 24 (1995), pp. 159–165.[18]
L. Guttman , Some necessary conditions for common-factor analysis , Psychometrika, 19(1954), pp. 149–161.[19]
H. Harman and W. Jones , Factor analysis by minimizing residuals (minres) , Psychometrika,31 (1966), pp. 351–368.[20]
R. A. Horn and C. R. Johnson , Matrix Analysis , Cambridge University Press, 1990.[21]
K. Jöreskog, A general approach to confirmatory maximum likelihood factor analysis, Psychometrika, 34 (1969), pp. 183–202.
[22]
R. E. Kalman , System identification from noisy data , in Dynamical Systems II, A. Bednarekand L. Cesari, eds., Academic Press, New York, 1982, pp. 135–164.[23] ,
Identification of noisy systems , Russian Mathematical Surveys, 40 (1985), p. 25.[24]
R. H. Keshavan, A. Montanari, and S. Oh , Matrix completion from a few entries , IEEETransactions on Information Theory, 56 (2010), pp. 2980–2998.[25] ,
Matrix completion from noisy entries , The Journal of Machine Learning Research, 99(2010), pp. 2057–2078.[26]
S. Klepper and E. E. Leamer , Consistent sets of estimates for regressions with errors in allvariables , Econometrica: Journal of the Econometric Society, 52 (1984), pp. 163–183.[27]
T. C. Koopmans , Linear regression analysis of economic time series , Netherlands EconomicInstitute, Harrlem-de Erwen F. Bohn N.V., 1937.[28]
L. L. Ledermann , On a problem concerning matrices with variable diagonal elements , Pro-ceedings of the Royal Society of Edinburgh, 60 (1940), pp. 1–17.[29]
W. Ledermann , On the rank of the reduced correlational matrix in multiple-factor analysis ,Psychometrika, 2 (1937), pp. 85–93.[30] ,
On a problem concerning matrices with variable diagonal elements , Williams and Nor-gate, 1940.[31]
C. A. Los , Identification of a linear system from inexact data: a three-variable example ,Computers & Mathematics with Applications, 17 (1989), pp. 1285–1304.[32]
L. Ning, X. Jiang, and T. Georgiou , Geometric methods for estimation of structured covari-ances , arXiv:1110.3695, (2011).[33]
I. Olkin and F. Pukelsheim , The distance between two random vectors with given dispersionmatrices , Linear Algebra and Its Applications, 48 (1982), pp. 257–263.[34]
B. Recht, M. Fazel, and P. A. Parrilo , Guaranteed minimum-rank solutions of linear matrixequations via nuclear norm minimization , SIAM review, 52 (2010), pp. 471–501.[35]
O. Reiersøl , Confluence analysis by means of lag moments and other methods of confluenceanalysis , Econometrica: Journal of the Econometric Society, 9 (1941), pp. 1–24.[36]
J. Saunderson, V. Chandrasekaran, P. Parrilo, and A. Willsky , Diagonal and low-rankmatrix decompositions, correlation matrices, and ellipsoid fitting , arXiv:1204.1220, (2012).[37]
A. Shapiro , Rank-reducibility of a symmetric matrix and sampling theory of minimum tracefactor analysis , Psychometrika, 47 (1982), pp. 187–199.[38] ,
Weighted minimum trace factor analysis , Psychometrika, 47 (1982), pp. 243–264.[39] ,
Identifiability of factor analysis: Some results and open problems , Linear algebra andits applications, 70 (1985), pp. 1–7.[40]
T. Söderström, Errors-in-variables methods in system identification, Automatica, 43 (2007), pp. 939–958.
[41]
C. Spearman , General intelligence, objectively determined and measured , The American Jour-nal of Psychology, 15 (1904), pp. 201–292.[42]
H. L. Van Trees , Optimum array processing , Wiley-Interscience, 2002.[43]
R. Varga, Geršgorin and his circles, Springer Verlag, 2004.
[44]
K. G. Woodgate , An upper bound on the number of linear relations identified from noisy databy the Frisch scheme , Systems & control letters, 24 (1995), pp. 153–158.[45] ,