Fundamental Limits of Weak Recovery with Applications to Phase Retrieval
Marco Mondelli∗ and Andrea Montanari†

July 27, 2018

∗ Department of Electrical Engineering, Stanford University
† Department of Electrical Engineering and Department of Statistics, Stanford University
Abstract
In phase retrieval we want to recover an unknown signal x ∈ C^d from n quadratic measurements of the form y_i = |⟨a_i, x⟩|² + w_i, where a_i ∈ C^d are known sensing vectors and w_i is measurement noise. We ask the following weak recovery question: what is the minimum number of measurements n needed to produce an estimator x̂(y) that is positively correlated with the signal x? We consider the case of Gaussian vectors a_i. We prove that, in the high-dimensional limit, a sharp phase transition takes place, and we locate the threshold in the regime of vanishingly small noise. For n ≤ d − o(d) no estimator can do significantly better than random and achieve a strictly positive correlation. For n ≥ d + o(d) a simple spectral estimator achieves a positive correlation. Surprisingly, numerical simulations with the same spectral estimator demonstrate promising performance with realistic sensing matrices. Spectral methods are used to initialize non-convex optimization algorithms in phase retrieval, and our approach can boost the performance in this setting as well.

Our impossibility result is based on classical information-theory arguments. The spectral algorithm computes the leading eigenvector of a weighted empirical covariance matrix. We obtain a sharp characterization of the spectral properties of this random matrix using tools from free probability and generalizing a recent result by Lu and Li. Both the upper and lower bound generalize beyond phase retrieval to measurements y_i produced according to a generalized linear model. As a byproduct of our analysis, we compare the threshold of the proposed spectral method with that of a message passing algorithm.

Introduction

In this work, we consider the problem of recovering a signal x of dimension d, given n generalized linear measurements. More specifically, the measurements are drawn independently according to the conditional distribution

y_i ∼ p(y | ⟨x, a_i⟩), i ∈ {1, . . . , n}, (1)

where ⟨·, ·⟩ denotes the inner product, {a_i}_{1≤i≤n} is a set of known sensing vectors, and p(· | ⟨x, a_i⟩) is a known probability density function. This model appears in many problems in signal processing and statistical estimation, e.g., photon-limited imaging [UE88, YLSV12], signal recovery from quantized measurements [RG01], and phase retrieval [Fie82, SEC+15]. In phase retrieval, the model (1) is specialized to

y_i = |⟨x, a_i⟩|² + w_i, i ∈ {1, . . . , n}, (2)

where w_i is noise. Applications of phase retrieval arise in several areas of science and engineering, including X-ray crystallography [Mil90, Har93], microscopy [MISE08], astronomy [FD87], optics [Wal63], acoustics [BCE06], interferometry [DJ17], and quantum mechanics [Cor06].

Popular methods to solve the phase retrieval problem are based on semi-definite programming relaxations [CESV15, CLS15a, CSV13, WdM15]. However, these algorithms rapidly become prohibitive from a computational point of view as the dimension d of the signal increases, which makes them impractical in most real-world applications.
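To make the setup concrete, here is a minimal simulation sketch of the measurement models (1)-(2). This is our own illustration, not code from the paper; all variable names and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 200, 600, 0.1   # signal dimension, number of measurements, noise level

# Signal uniform on the complex sphere of radius sqrt(d).
x = rng.normal(size=d) + 1j * rng.normal(size=d)
x *= np.sqrt(d) / np.linalg.norm(x)

# Sensing vectors a_i ~ CN(0, I_d / d): each entry has E|a_ij|^2 = 1/d.
A = (rng.normal(size=(n, d)) + 1j * rng.normal(size=(n, d))) / np.sqrt(2 * d)

# Phase retrieval measurements (2): y_i = |<a_i, x>|^2 + w_i, w_i ~ N(0, sigma^2).
y = np.abs(A @ x.conj()) ** 2 + sigma * rng.normal(size=n)
```

With this normalization, E|⟨x, a_i⟩|² = 1, which matches the scaling used throughout the paper.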
For this reason, several algorithms have been developed in order to solve directly the non-convex least-squares problem, including the error reduction schemes dating back to Gerchberg-Saxton and Fienup [Ger72, Fie82], alternating minimization [NJS13], approximate message passing (AMP) [SR15], Wirtinger Flow [CLS15b], iterative projections [LGL15], the Kaczmarz method [Wei15], and a number of other approaches [CC17, ZL16, CLM16, WGE16, WG16, Sol17, DR17, WGSC17]. Furthermore, a convex relaxation that operates in the natural domain of the signal was recently and independently proposed by two groups of authors [GS16, BR17]. All these techniques require an initialization step, whose goal is to provide a solution x̂ that is positively correlated with the unknown signal x. To do so, spectral methods are widely employed: the estimate x̂ is given by the principal eigenvector of a suitable matrix constructed from the data. A similar strategy (initialization step followed by an iterative algorithm) has proved successful for many other estimation problems, e.g., matrix completion [KMO10, JNS13], blind deconvolution [LLJB17, LLSW16], sparse coding [AGMM15], and joint alignment from pairwise noisy observations [CC16].

We focus on a regime in which both the number of measurements n and the dimension of the signal d tend to infinity, but their ratio n/d tends to a positive constant δ. The weak recovery problem requires us to provide an estimate x̂(y) that has a positive correlation with the unknown vector x:

lim inf_{n→∞} E{ |⟨x̂(y), x⟩| / (‖x̂(y)‖ ‖x‖) } > ε, (3)

for some ε > 0.

In this paper, we consider either x ∈ R^d or x ∈ C^d and assume that the measurement vectors a_i are standard Gaussian (either real or complex). In the general setting of model (1), we present two types of results:

1. We develop an information-theoretic lower bound δ_ℓ: for δ < δ_ℓ, no estimator can output non-trivial estimates. In other words, the weak recovery problem cannot be solved.

2. We establish an upper bound δ_u based on a spectral algorithm: for δ > δ_u, we can achieve weak recovery (see (3)) by letting x̂ be the principal eigenvector of a matrix suitably constructed from the data. We also show that δ_u is the optimal threshold for spectral methods.

The values of the thresholds δ_ℓ and δ_u depend on the conditional distribution p(· | ⟨x, a_i⟩). For the special case of phase retrieval (see (2)), we evaluate these bounds and we show that they coincide in the limit of vanishing noise.
Theorem. Let x be uniformly distributed on the d-dimensional complex sphere with radius √d and assume that {a_i}_{1≤i≤n} ∼ i.i.d. CN(0_d, I_d/d). Let y ∈ R^n be given by (2), with {w_i}_{1≤i≤n} ∼ N(0, σ²), and n, d → ∞ with n/d → δ ∈ (0, +∞). Then:

• For δ < 1, no algorithm can provide non-trivial estimates of x;

• For δ > 1, there exists σ₀(δ) > 0 and a spectral algorithm that returns an estimate x̂ satisfying (3), for any σ ∈ [0, σ₀(δ)].

The assumption that x is uniform on the sphere can be dropped for the upper bound part. We also show that σ₀(δ) scales as √(δ − 1) when δ is close to 1. In the 'real case' x ∈ R^d with ‖x‖² = d and {a_i}_{1≤i≤n} ∼ i.i.d. N(0_d, I_d/d), we prove that an analogous result holds and that the threshold moves from 1 to 1/2. This is reminiscent of how the injectivity thresholds are δ = 4 and δ = 2 in the complex and the real case, respectively [BCE06, BCMN14, CEHV15]. A possible intuition for this halving phenomenon comes from the fact that the complex problem has twice as many variables but the same number of equations as the real problem. Hence, it is reasonable that the complex case requires twice the amount of data with respect to the real case.

Let us emphasize that we are considering the problem of weak recovery. Therefore, we may need fewer than d samples in order to obtain a positive correlation on d unknowns. For instance, in the linear case y_i = ⟨a_i, x⟩ + w_i, weak recovery is possible for any δ > 0. Consequently, it is not surprising that for phase retrieval in the real case weak recovery can be achieved for δ below one.

Our information-theoretic lower bound is proved by estimating the conditional entropy via the second moment method. In general, this might not match the spectral upper bound. We provide an example in which there is a strictly positive gap between δ_ℓ and δ_u in Remark 3 at the end of Section 3.

As in earlier work (see Section 1.1), we consider spectral algorithms that compute the eigenvector corresponding to the largest eigenvalue of a matrix of the form

D_n = (1/n) Σ_{i=1}^n T(y_i) a_i a_i*, (4)

where T : R → R is a pre-processing function. For δ large enough (and a suitable choice of T), we expect the resulting eigenvector x̂(y) to be positively correlated with the true signal x. The recent paper [LL17] computed exactly the threshold value δ_u, under the assumption that the measurement vectors are real Gaussian and T is non-negative.

Here, we generalize the result of [LL17] by removing the assumption that T(y) ≥ 0 and by considering the complex case. The main technical lemma of this generalization consists in the computation of the largest eigenvalue of a matrix of the form U M_n U*, where the entries of U are ∼ i.i.d. CN(0, 1) and M_n is independent of U and has known empirical spectral distribution. The case in which M_n is PSD is handled in [BY12]. In this paper, by using tools from free probability, we solve the case in which M_n is not necessarily PSD. To do so, it is not sufficient to compute the weak limit of the empirical spectral distribution of U M_n U*, but we also need to compute the almost sure limit of its principal eigenvalue. Armed with this result, we compute the optimal pre-processing function T*_δ(y) for the general model (1). This pre-processing function is optimal in the sense that it provides the smallest possible weak recovery threshold for the spectral method.
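As a concrete companion to (4), the following sketch (our own, with illustrative names; the pre-processing function T is left as a parameter) builds D_n and extracts its principal eigenvector:

```python
import numpy as np

def spectral_estimate(A, y, T):
    """Principal eigenvector of D_n = (1/n) sum_i T(y_i) a_i a_i^*.

    A has the sensing vectors a_i as rows; T maps measurements to weights.
    """
    n = len(y)
    D = (A.T * T(y)) @ A.conj() / n      # d x d weighted empirical covariance, Hermitian
    _, eigvecs = np.linalg.eigh(D)
    return eigvecs[:, -1]                # eigenvector of the largest eigenvalue

def overlap(xhat, x):
    """The normalized correlation |<xhat, x>| / (||xhat|| ||x||) appearing in (3)."""
    return abs(np.vdot(xhat, x)) / (np.linalg.norm(xhat) * np.linalg.norm(x))
```

For instance, T(y) = y recovers the classical spectral initialization analyzed in [NJS13], while the optimal choice T*_δ is derived below.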
Our upper bound δ_u is the phase transition location for this optimal spectral method. In the case of phase retrieval (as σ → 0), the optimal pre-processing function is given by

T*_δ(y) = (y − 1) / (y + √δ − 1), (5)

and achieves weak recovery for any δ > δ_u = 1. In the limit δ ↓ 1, this converges to the limiting function T*(y) = 1 − (1/y).

While the expression (5) is remarkably simple, it is somewhat counter-intuitive. Earlier methods [CLS15b, CC15, LL17] use T(y) ≥ 0 and try to extract information from the large values of y_i. The function (5) has a large negative part for small y, in particular when δ is close to 1. Furthermore, it extracts useful information from data points with y_i small. One possible interpretation is that the points in which the measurement vector is basically orthogonal to the unknown signal are not informative, hence we penalize them.

[Figure 1: Performance comparison between the proposed spectral method and the truncated spectral initialization of [CC17] for the recovery of a digital photograph from coded diffraction patterns. Panels: (a) original image; (b), (d), (f) proposed method for δ = 4, 6, 12; (c), (e), (g) truncated method for δ = 4, 6, 12.]

Our analysis applies to Gaussian measurement matrices. However, the proposed spectral method works well also on real images and realistic measurement matrices. To illustrate this fact, in Figure 1 we test our algorithm on a digital photograph of the painting "The Birth of Venus" by Sandro Botticelli. We consider a type of measurements that falls under the category of coded diffraction patterns (CDP) [CLS15a, CC17]: the measurement matrix is given by the product of δ copies of a Fourier matrix and a diagonal matrix with entries i.i.d. and uniform in {1, −1, i, −i}, where i denotes the imaginary unit. We compare our method with the truncated spectral initialization proposed in [CC17], which consists in discarding the measurements larger than an assigned threshold and leaving the others untouched. The proposed choice of the pre-processing function allows us to recover a good estimate of the original image already when δ = 4, while the truncated spectral initialization of [CC17] requires δ = 12 to obtain similar results.

In general, our proposed spectral method can be thought of as the first step of the following two-round algorithm: first, use spectral initialization to perform weak recovery, and then improve the solution with an iterative algorithm, e.g., AMP or Wirtinger Flow. By using the optimal pre-processing function, the weak recovery threshold is smaller, which means that fewer measurements are required in order to successfully complete the first step of the algorithm. If a different truncation is used, the resulting performance is limited by the corresponding weak recovery threshold.

Note that the pre-processing function (5) is optimal in the sense that it minimizes the weak recovery threshold associated to the spectral method. Hence, for a given correlation ε̄ ∈ (0, 1), the exact expression of the optimal pre-processing function that allows one to obtain a correlation ε̄ between x̂(y) and x might be different, and it might depend on ε̄. However, we observe that (5) provides excellent empirical performance and outperforms state-of-the-art methods for a wide range of target correlations (see the simulation results of Section 7).

The rest of the paper is organized as follows. In Section 2, after introducing the necessary notation, we define formally the problem.
We then state our general information-theoretic lower bound and our spectral upper bound for the case of complex signal x and complex measurement vectors a_i. The main results for the real case are stated in Section 3. In Sections 4 and 5, we present the proofs of the information-theoretic lower bound and of the spectral upper bound, respectively. In Section 6, we compare the spectral approach to a message passing algorithm. In particular, we show that the latter cannot have a better threshold than δ_u and that δ_u is the threshold for a linearized version of message passing. In Section 7, we present some numerical simulations that illustrate the behavior of the proposed spectral method for the phase retrieval problem. The proofs of several results are deferred to the various appendices.

Related Work

Precise asymptotic information on high-dimensional regression problems has been obtained by several groups in recent years [DMM11, BM12, OTH13, BLM15, DM16, Kar13, SC16, ZK16, PV16, TAH15, NWL16]. In particular, information-theoretically optimal estimation was considered for compressed sensing [DJM13] and random linear estimation [RP16, BMDK17]. Minimax optimal estimation is considered, among others, in [DMM11, SC16, VJ17].

The performance of spectral methods for phase retrieval was first considered in [NJS13]. In the present notation, [NJS13] uses T(y) = y and proves that there exists a constant c₀ such that weak recovery can be achieved for n > c₀ · d · log³ d. The same paper also gives an iterative procedure to improve over the spectral method, but the bottleneck is in the spectral step. The sample complexity of weak recovery using spectral methods was improved to n > c₁ · d · log d in [CLS15b] and then to n > c₂ · d in [CC17], for some constants c₁ and c₂. Both of these papers also prove guarantees for exact recovery by suitable descent algorithms. The guarantees on the spectral initialization are proved by matrix concentration inequalities, a technique that typically does not return exact threshold values.

In [GS16], the authors introduce the PhaseMax relaxation and prove an exact recovery result for phase retrieval, which depends on the correlation between the true signal and the initial estimate given to the algorithm. The same idea was independently proposed in [BR17]. Furthermore, the analysis in [BR17] allows one to use the same set of measurements for both initialization and convex programming, whereas the analysis in [GS16] requires fresh extra measurements for convex programming. By using our spectral method to obtain the initial estimate, it should be possible to improve the existing upper bounds on the number of samples needed for exact recovery.

As previously mentioned, our analysis of spectral methods builds on the recent work of Lu and Li [LL17], which computes the exact spectral threshold for a matrix of the form (4) with T(y) ≥ 0. Here we generalize this result to signed pre-processing functions T(y), and construct a function of this type that achieves the information-theoretic threshold for phase retrieval. Our proof indeed implies that non-negative pre-processing functions lead to an unavoidable gap with respect to the ideal threshold.

Finally, while this paper was under completion, two works appeared that address related problems. In [BKM+ ...]

Main Results: Complex Case

We use [n] as a shortcut for {1, . . . , n}. We use upper-case letters (e.g., X, Y, Z, . . .) to denote random variables when we are taking operators such as expectation, variance, or mutual information.
We denote by 0_n the vector consisting of n zeros. Given a vector x, we denote by ‖x‖ its ℓ₂ norm. Given a matrix A, we denote by ‖A‖_F its Frobenius norm, by ‖A‖_op its operator norm, by A^T its transpose, and by A* its conjugate transpose. Given two vectors x, y ∈ C^d, we denote by ⟨x, y⟩ = Σ_{i=1}^d x_i y_i* their scalar product. We take logarithms in the natural base and we measure entropies in nats. Given c ∈ C, we denote by ℜ(c) and ℑ(c) its real and imaginary part, respectively. We use →^P and →^{a.s.} to denote convergence in probability and almost sure convergence, respectively.

Let x ∈ C^d be chosen uniformly at random on the d-dimensional complex sphere with radius √d, i.e.,

x ∼ Unif(√d · S_C^{d−1}). (6)

Let the sensing vectors {a_i}_{1≤i≤n}, with a_i ∈ C^d, be independent and identically distributed according to a circularly-symmetric complex normal distribution with variance 1/d, i.e.,

{a_i}_{1≤i≤n} ∼ i.i.d. CN(0_d, I_d/d). (7)

Given g_i = ⟨x, a_i⟩, the vector of measurements y ∈ R^n is obtained by drawing each component independently according to the following distribution:

y_i ∼ p(y | |g_i|²), i ∈ [n]. (8)

For the special case of phase retrieval, the measurements are given by the squared scalar product corrupted by additive Gaussian noise with variance σ², i.e.,

p_PR(y | |g_i|²) = (1/(σ√(2π))) exp( −(y − |g_i|²)² / (2σ²) ). (9)

Let δ_n = n/d and assume that, as n → ∞, δ_n → δ for some δ ∈ (0, ∞).

The main result of this section establishes the following: there is a critical value δ_ℓ such that, for any δ < δ_ℓ, the optimal estimator has the same performance as a trivial estimator that does not have access to any measurement. The value of δ_ℓ depends on the distribution (8) of the measurements, and we provide an expression to compute it.

In order to state the result formally, we need to introduce a few definitions. Consider the function f : [0, 1] → R, given by

f(m) = ∫_R ( E_{G₁,G₂}{p(y | |G₁|²) p(y | |G₂|²)} / E_G{p(y | |G|²)} ) dy, (10)

with G ∼ CN(0, 1),

(G₁, G₂) ∼ CN( 0, ( 1 c ; c* 1 ) ), (11)

and m = |c|². Note that the RHS of (10) depends only on m = |c|². Indeed, by applying the transformation (G₁, G₂) → (e^{iθ₁}G₁, e^{iθ₂}G₂), f(m) does not change, but the correlation coefficient c is mapped into c·e^{i(θ₂−θ₁)}. A more explicit formula for f(m) is provided by Lemma 6 in Appendix A. The function f(m) is related to the conditional entropy H(Y₁, . . . , Y_n | A₁, . . . , A_n), as clarified in the proof of Lemma 1 in Section 4.1. Furthermore, set

F_δ(m) = δ log f(m) + log(1 − m). (12)

Note that, when m = 0, G₁ and G₂ are independent. Hence, f(0) = 1, which implies that F_δ(0) = 0 for any δ > 0. We define the information-theoretic threshold δ_ℓ as the largest value of δ such that the maximum of F_δ(m) is attained at m = 0, i.e.,

δ_ℓ = sup{ δ | F_δ(m) < 0 for m ∈ (0, 1] }. (13)

Let us now define the error metric. The setting is the following: we observe the vector of n measurements y and, given a new sensing vector a_{n+1}, we want to estimate some function φ(|⟨x, a_{n+1}⟩|) given by

φ(|⟨x, a_{n+1}⟩|) = ∫_R ϕ(y) p(y | |⟨x, a_{n+1}⟩|²) dy. (14)
Then, the minimum mean square error is defined as

MMSE(δ_n) = E{ ( φ(|⟨X, A_{n+1}⟩|) − E{ φ(|⟨X, A_{n+1}⟩|) | Y, {A_i}_{1≤i≤n+1} } )² }, (15)

where E{φ(|⟨X, A_{n+1}⟩|) | Y, {A_i}_{1≤i≤n+1}} represents the optimal estimator of the quantity φ(|⟨x, a_{n+1}⟩|), and the expectation of the squared error is taken over all the randomness in the system, i.e., over X, Y, and {A_i}_{1≤i≤n+1}. Note that this error metric depends on the choice of the function φ. Furthermore, observe that, if we do not have access to the vector of measurements Y, the trivial estimator E{φ(|⟨X, A_{n+1}⟩|)} has a mean square error given by

E{ ( φ(|⟨X, A_{n+1}⟩|) − E{φ(|⟨X, A_{n+1}⟩|)} )² } = Var{φ(|⟨X, A_{n+1}⟩|)}. (16)

At this point we are ready to state our main result, which is proved in Section 4.1.

Theorem 1 (Information-Theoretic Lower Bound for General Complex Sensing Model). Let x, {a_i}_{1≤i≤n+1}, and y be distributed according to (6), (7), and (8), respectively. Let n/d → δ and define δ_ℓ as in (13). Furthermore, assume that the function ϕ that appears in (14) is bounded. Then, for any δ < δ_ℓ, we have that

lim_{n→∞} MMSE(δ_n) = Var{φ(|⟨X, A_{n+1}⟩|)}. (17)

Let us point out that the requirement that the function ϕ is bounded can be relaxed when the tails of the distribution of Y are sufficiently light (e.g., sub-Gaussian). Indeed, this is what happens for the special case of phase retrieval, which is considered immediately below.

For the special case of phase retrieval, a more explicit error metric is given by the matrix minimum mean square error, defined as

MMSE_PR(δ_n) = (1/d²) E{ ‖ XX* − E{XX* | Y, {A_i}_{1≤i≤n}} ‖²_F }. (18)

Indeed, the vector x can be recovered only up to a global phase, since we observe a function of the scalar products |⟨x, a_i⟩|². Clearly, MMSE_PR(δ_n) ∈ [0, 1], and MMSE_PR(δ_n) = 1 implies that the optimal estimator coincides with the trivial estimator that outputs the all-0 vector.

The corollary below provides the exact value of δ_ℓ for the case of phase retrieval; it is proved in Appendix A.

Corollary 1 (Information-Theoretic Lower Bound for Phase Retrieval). Let x, {a_i}_{1≤i≤n}, and y be distributed according to (6), (7), and (9), respectively. Let n/d → δ. Then, for any δ < 1, we have that

lim_{σ→0} lim_{n→∞} MMSE_PR(δ_n) = 1. (19)

The main result of this section establishes the following: there is a critical value δ_u such that, for any δ > δ_u, the principal eigenvector of a suitably constructed matrix, call it D_n, provides an estimate x̂ that satisfies (3). The threshold δ_u is defined as

δ_u = 1 / ∫_R ( ( E_G{ p(y | |G|²)(|G|² − 1) } )² / E_G{ p(y | |G|²) } ) dy, (20)

with G ∼ CN(0, 1). Given the measurements {y_i}_{1≤i≤n}, we construct the matrix D_n as

D_n = (1/n) Σ_{i=1}^n T(y_i) a_i a_i*, (21)

where T : R → R is a pre-processing function.

At this point we are ready to state our main result, which is proved in Section 5.

Theorem 2 (Spectral Upper Bound for Complex General Sensing Model).
Let x, {a_i}_{1≤i≤n}, and y be distributed according to (6), (7), and (8), respectively. Let n/d → δ and define δ_u as in (20). Let x̂ be the principal eigenvector of the matrix D_n defined in (21). For any δ > δ_u, set the pre-processing function T to the function T*_δ given by

T*_δ(y) = √δ_u · T*(y) / ( √δ − (√δ − √δ_u) T*(y) ), (22)

where

T*(y) = 1 − E_G{p(y | |G|²)} / E_G{p(y | |G|²) · |G|²}. (23)

Then, we have that, almost surely,

lim_{n→∞} |⟨x̂, x⟩| / (‖x̂‖ ‖x‖) > ε, (24)

for some ε > 0. Furthermore, for any δ ≤ δ_u, there is no pre-processing function T such that, almost surely, (24) holds.

Let us highlight that the pre-processing function (22) provides the optimal threshold among spectral methods that use matrices of the form (4), in the sense that it achieves weak recovery for δ > δ_u and no function achieves weak recovery for δ ≤ δ_u. Note also that the assumption that x is uniform on the sphere can be dropped (see the beginning of the proof of Lemma 2 in Section 5).

As a byproduct of our analysis, we also give guarantees on the value of δ sufficient to achieve an assigned correlation with the ground truth using the spectral method; see (84) in the statement of Lemma 2 in Section 5. Hence, we can combine our upper bound with existing non-convex optimization algorithms, in order to obtain provable performance guarantees.

The corollary below provides the exact value of δ_u and an explicit expression for T*_δ(y) for the case of phase retrieval. Its proof is contained in Appendix B. Note that, for phase retrieval, δ_u = δ_ℓ = 1, i.e., the spectral upper bound matches the information-theoretic lower bound.

Corollary 2 (Spectral Upper Bound for Phase Retrieval). Let x, {a_i}_{1≤i≤n}, and y be distributed according to (6), (7), and (9), respectively. Let n/d → δ. Let x̂ be the principal eigenvector of the matrix D_n defined in (21). For any δ > 1, set the pre-processing function T to the function T*_δ given by (with y₊ ≡ max(0, y))

T*_δ(y) = (y₊ − 1) / (y₊ + √δ − 1). (25)

Then, we have that, almost surely,

lim_{σ→0} lim_{n→∞} |⟨x̂, x⟩| / (‖x̂‖ ‖x‖) > ε, (26)

for some ε > 0.

Notice that this statement is stronger than the claim that δ_u(σ) → 1 as σ → 0, where δ_u(σ) is the spectral threshold at noise level σ. Indeed, it requires proving that the scalar product |⟨x̂, x⟩| stays bounded away from 0 as σ → 0. Furthermore, this is achieved with the pre-processing function (25), which does not require an estimate of σ, a quantity that can be challenging to obtain from real data. We also characterize the scaling between δ_u and σ when σ is close to 0: δ_u(σ) = 1 + σ² + o(σ²) (see Lemma 8 in Appendix B).
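Before moving to the real case, here is a small numerical check of the threshold formula (20) in the noisy phase retrieval model (9). This is our own sketch, not code from the paper: for q = |G|² ∼ Exp(1) and Gaussian noise, the mixtures E_q{N(y; q, σ²)} and E_q{q · N(y; q, σ²)} admit closed forms (a standard exponential-Gaussian convolution computation, used below without proof), and the output should approach δ_u = 1 as σ → 0.

```python
import numpy as np
from scipy.special import ndtr                  # standard normal CDF, vectorized

def delta_u_phase_retrieval(sigma):
    """Evaluate (20) for p(y | q) = N(y; q, sigma^2), with q = |G|^2 ~ Exp(1)."""
    y = np.linspace(-8 * sigma, 40.0, 400_000)
    u = (y - sigma**2) / sigma
    base = np.exp(sigma**2 / 2 - y)
    den = base * ndtr(u)                                         # E{p(y | |G|^2)}
    mom = base * ((y - sigma**2) * ndtr(u)
                  + sigma * np.exp(-u**2 / 2) / np.sqrt(2 * np.pi))  # E{p * |G|^2}
    num = mom - den                                              # E{p * (|G|^2 - 1)}
    integrand = np.where(den > 1e-290, num**2 / np.maximum(den, 1e-290), 0.0)
    return 1.0 / (integrand.sum() * (y[1] - y[0]))

for s in (0.5, 0.2, 0.05):
    print(f"sigma = {s}: delta_u ~ {delta_u_phase_retrieval(s):.3f}")
```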
Main Results: Real Case

Let us now briefly discuss what happens in the real case. Let x ∈ R^d be chosen uniformly at random on the d-dimensional real sphere with radius √d, i.e.,

x ∼ Unif(√d · S_R^{d−1}). (27)

Let the sensing vectors {a_i}_{1≤i≤n}, with a_i ∈ R^d, be independent and identically distributed according to a normal distribution with zero mean and variance 1/d, i.e.,

{a_i}_{1≤i≤n} ∼ i.i.d. N(0_d, I_d/d). (28)

Given g_i = ⟨x, a_i⟩, the vector of measurements y ∈ R^n is obtained by drawing each component independently according to the following distribution:

y_i ∼ p(y | g_i), i ∈ [n]. (29)

We can define the "real" phase retrieval model, whereby the measurements are given by the squared scalar product corrupted by additive Gaussian noise with variance σ², i.e.,

p_PR(y | g_i) = (1/(σ√(2π))) exp( −(y − g_i²)² / (2σ²) ). (30)

We first present the information-theoretic lower bound. Consider the function f : [−1, 1] → R, given by

f(m) = ∫_R ( E_{G₁,G₂}{p(y | G₁) p(y | G₂)} / E_G{p(y | G)} ) dy, (31)

with G ∼ N(0, 1),

(G₁, G₂) ∼ N( 0, ( 1 m ; m 1 ) ). (32)

Furthermore, set

F_δ(m) = δ log f(m) + (1/2) log(1 − m²). (33)

Again, F_δ(0) = 0 for any δ > 0. We define the information-theoretic threshold δ_ℓ as the largest value of δ such that the maximum of F_δ(m) is attained at m = 0, i.e.,

δ_ℓ = sup{ δ | F_δ(m) < 0 for m ∈ [−1, 1] \ {0} }. (34)

As for the error metric, we observe the vector of n measurements y and, given a new sensing vector a_{n+1}, we want to estimate some function φ(⟨x, a_{n+1}⟩) given by

φ(⟨x, a_{n+1}⟩) = ∫_R ϕ(y) p(y | ⟨x, a_{n+1}⟩) dy. (35)

Then, the minimum mean square error is defined as

MMSE(δ_n) = E{ ( φ(⟨X, A_{n+1}⟩) − E{ φ(⟨X, A_{n+1}⟩) | Y, {A_i}_{1≤i≤n+1} } )² }. (36)

Recall that, if we do not have access to the vector of measurements y, the trivial estimator E{φ(⟨X, A_{n+1}⟩)} has a mean square error given by

E{ ( φ(⟨X, A_{n+1}⟩) − E{φ(⟨X, A_{n+1}⟩)} )² } = Var{φ(⟨X, A_{n+1}⟩)}. (37)

At this point we are ready to state the information-theoretic lower bound, which is proved in Section 4.2.

Theorem 3 (Information-Theoretic Lower Bound for Real General Sensing Model). Let x, {a_i}_{1≤i≤n+1}, and y be distributed according to (27), (28), and (29), respectively. Let n/d → δ and define δ_ℓ as in (34). Furthermore, assume that the function ϕ that appears in (35) is bounded. Then, for any δ < δ_ℓ, we have that

lim_{n→∞} MMSE(δ_n) = Var{φ(⟨X, A_{n+1}⟩)}. (38)

Remark 1 (Information-Theoretic Lower Bound for Real Phase Retrieval). For the special case of phase retrieval, a more explicit error metric is given by the matrix minimum mean square error, defined as
MMSE_PR(δ_n) = (1/d²) E{ ‖ XX^T − E{XX^T | Y, {A_i}_{1≤i≤n}} ‖²_F }. (39)

By calculations similar to those in Lemma 7 contained in Appendix A, one can prove that, if the distribution p(· | G) appearing in (31) is given by (30), then

lim_{σ→0} δ_ℓ(σ) = 1/2. (40)

Consequently, by following a proof analogous to that of Corollary 1 in Appendix A, we conclude that, for any δ < 1/2,

lim_{σ→0} lim_{n→∞} MMSE_PR(δ_n) = 1. (41)

Let us now move to the spectral upper bound. The threshold δ_u is defined as

δ_u = 1 / ∫_R ( ( E_G{ p(y | G)(G² − 1) } )² / E_G{ p(y | G) } ) dy, (42)

with G ∼ N(0, 1). Given the measurements {y_i}_{1≤i≤n}, we construct the matrix D_n as

D_n = (1/n) Σ_{i=1}^n T(y_i) a_i a_i^T, (43)

where T : R → R is a pre-processing function.

The proof of the following spectral upper bound is discussed in Remark 7 at the end of Section 5.

Theorem 4 (Spectral Upper Bound for Real General Sensing Model). Let x, {a_i}_{1≤i≤n}, and y be distributed according to (27), (28), and (29), respectively. Let n/d → δ and define δ_u as in (42). Let x̂ be the principal eigenvector of the matrix D_n defined in (43). For any δ > δ_u, set the pre-processing function T to the function T*_δ given by

T*_δ(y) = √δ_u · T*(y) / ( √δ − (√δ − √δ_u) T*(y) ), (44)

where

T*(y) = 1 − E_G{p(y | G)} / E_G{p(y | G) · G²}. (45)

Then, we have that, almost surely,

lim_{n→∞} |⟨x̂, x⟩| / (‖x̂‖ ‖x‖) > ε, (46)

for some ε > 0. Furthermore, for any δ ≤ δ_u, there is no pre-processing function T such that, almost surely, (46) holds.

Remark 2 (Spectral Upper Bound for Real Phase Retrieval). By calculations similar to those in Lemma 8 contained in Appendix B, one can prove that, if the distribution p(· | G) appearing in (31) is given by (30), then

lim_{σ→0} δ_u(σ) = 1/2. (47)

Furthermore, by following a proof analogous to that of Corollary 2 in Appendix B, one can prove the following result. For any δ > 1/2, set the pre-processing function T to the function T*_δ given by (with y₊ = max(y, 0))

T*_δ(y) = (y₊ − 1) / (y₊ + √(2δ) − 1). (48)

Then, we have that, almost surely,

lim_{σ→0} lim_{n→∞} |⟨x̂, x⟩| / (‖x̂‖ ‖x‖) > ε, (49)

for some ε > 0. Note that, for real phase retrieval, the spectral upper bound matches the information-theoretic lower bound.

In the following remark, we provide an example in which there is a strictly positive gap between δ_ℓ and δ_u.

Remark 3 (Gap between δ_ℓ and δ_u). Let us define

H(a) = E_G{ tanh²(aG) (G² − 1) }, (50)

where G ∼ N(0, 1). Note that H(0) = 0 and lim_{a→∞} H(a) = 0. Hence, there exist a₂ > a₁ > 0 such that H(a₁) = H(a₂). Consider the following distribution for the components of the vector of measurements y:

p(y | g) = { tanh²(a₂ g) − tanh²(a₁ g), for y ∈ [1, 2],
             1 − (tanh²(a₂ g) − tanh²(a₁ g)), for y ∈ [−2, −1]. (51)

Then, we have that, for any y ∈ R,

E_G{ p(y | G) (G² − 1) } = 0, (52)

which, by definition (42), immediately implies that δ_u = ∞. Note that this argument works when we substitute tanh²(x) with any function which is even, increasing for x ≥ 0, and bounded between 0 and 1.
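As a quick numerical illustration of this construction (our own sketch; sample sizes and the level choice are arbitrary), one can verify by Monte Carlo that pairs a₂ > a₁ with H(a₁) = H(a₂) indeed exist:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=500_000)

def H(a):
    """Monte Carlo estimate of H(a) = E{tanh^2(aG)(G^2 - 1)} from (50)."""
    return np.mean(np.tanh(a * G) ** 2 * (G ** 2 - 1.0))

# H vanishes at a = 0 and as a -> infinity, and is positive in between, so any
# level below its maximum is attained on both sides of the peak.
grid = np.linspace(0.05, 25.0, 250)
vals = np.array([H(a) for a in grid])
k = vals.argmax()
level = 0.5 * vals[k]
a1 = grid[:k][np.abs(vals[:k] - level).argmin()]    # left of the peak
a2 = grid[k:][np.abs(vals[k:] - level).argmin()]    # right of the peak
print(a1, a2, H(a1) - H(a2))                        # difference ~ 0 (up to MC error)
```

Any such pair makes (52) hold for every y, so δ_u = ∞; the next paragraph shows that δ_ℓ nevertheless remains finite.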
Let us now show that δ_ℓ is finite. Consider the function f(m) defined in (31). As previously mentioned, f(0) = 1. Furthermore,

f(1) = ∫_R ( E_G{(p(y | G))²} / E_G{p(y | G)} ) dy
     = ∫_R ( ( (E_G{p(y | G)})² + Var{p(y | G)} ) / E_G{p(y | G)} ) dy
     = 1 + ∫_R ( Var{p(y | G)} / E_G{p(y | G)} ) dy > 1. (53)

Consequently, there exists m* ∈ (0, 1) such that f(m*) > 1. Set

δ* = ( 2 − log(1 − m*²) ) / ( 2 log f(m*) ). (54)

Then, we have that, for any δ ≥ δ*,

F_δ(m*) ≥ F_{δ*}(m*) = 1 > 0. (55)

Hence, by definition (34), we conclude that δ_ℓ < δ*, which implies that δ_ℓ is finite. Note that this upper bound on δ_ℓ applies to any p(y | G) which is not constant in G on a set of positive measure. As a result, there is a strictly positive gap between δ_ℓ and δ_u. This gap is not due to the looseness of our lower bound; indeed, this can be seen by using the result of [BKM+ ...].

Proof of Theorems 1 and 3: Information-Theoretic Lower Bound

The crucial point of the proof consists in the computation of the conditional entropy H(Y | A), which is contained in Lemma 1. Then, we use this result to compute the mutual information for the considered model. Finally, we provide the proof of Theorem 1.

Lemma 1 (Conditional Entropy). Let x ∼ Unif(√d · S_C^{d−1}), A = (a₁, . . . , a_n) with {a_i}_{1≤i≤n} ∼ i.i.d. CN(0_d, I_d/d), and y = (y₁, . . . , y_n) with y_i ∼ p(· | |g_i|²) and g_i = ⟨x, a_i⟩. Let n/d → δ and define δ_ℓ as in (13). Then, for any δ < δ_ℓ, we have that

lim_{n→∞} (1/n) H(Y | A) = H(Y₁). (56)
Proof. We divide the proof into two steps. The first step consists in showing that

−(1/n) ( ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy − 1 ) ≤ (1/n) H(Y | A) − H(Y₁) ≤ 0, (57)

which holds for all n ∈ N and for all δ > 0. The proof of (57) does not require any assumption on the distribution of x and on the distribution of {a_i}_{1≤i≤n} (as long as the vectors {a_i}_{1≤i≤n} are independent). The second step consists in showing that

lim_{n→+∞} (1/n) ( ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy − 1 ) = 0. (58)

It is clear that (57) and (58) imply the thesis.
First step. By definition of conditional entropy, we have that

(1/n) H(Y | A) = (1/n) ∫_{R^n} E_A{ −p(y | A) log p(y | A) } dy. (59)

By using the definition of the y_i and the fact that they are conditionally independent given (A, X), we can rewrite E_A{p(y | A)} as follows:

E_A{p(y | A)} = E_{A,X}{p(y | A, X)} = E_{A,X}{ Π_{i=1}^n p(y_i | |⟨X, A_i⟩|²) } = Π_{i=1}^n E_{G_i}{ p(y_i | |G_i|²) }, (60)

where we set G_i = ⟨X, A_i⟩ (conditionally on X, the G_i are i.i.d. CN(0, 1), and their law does not depend on X).

Let us now give an upper bound on the RHS of (59):

(1/n) ∫_{R^n} E_A{ −p(y | A) log p(y | A) } dy
(a)≤ (1/n) ∫_{R^n} −E_A{p(y | A)} log E_A{p(y | A)} dy
(b)= (1/n) ∫_{R^n} − Π_{i=1}^n E_{G_i}{p(y_i | |G_i|²)} Σ_{j=1}^n log E_{G_j}{p(y_j | |G_j|²)} dy
= (1/n) Σ_{i=1}^n ∫_R − E_{G_i}{p(y_i | |G_i|²)} log E_{G_i}{p(y_i | |G_i|²)} dy_i = H(Y₁),

where in (a) we apply Jensen's inequality, as the function g(x) = −x log x is concave, and in (b) we use (60). This immediately implies that

(1/n) H(Y | A) − H(Y₁) ≤ 0. (61)

Note that the upper bound (61) is based on the inequality E_A{−p(y | A) log p(y | A)} − ( −E_A{p(y | A)} log E_A{p(y | A)} ) ≤ 0. This difference can also be bounded from below:

E_A{−p(y | A) log p(y | A)} − ( −E_A{p(y | A)} log E_A{p(y | A)} )
= E_A{ −p(y | A) log ( p(y | A) / E_A{p(y | A)} ) }
(a)= E_A{p(y | A)} · E_Z{−Z log Z}
(b)= E_A{p(y | A)} · E_Z{−Z log Z + Z − 1}
(c)≥ −E_A{p(y | A)} · E_Z{(Z − 1)²}
(d)= −E_A{p(y | A)} ( E_Z{Z²} − 1 )
= − E_A{(p(y | A))²} / E_A{p(y | A)} + E_A{p(y | A)}, (62)

where in (a) we set Z = p(y | A)/E_A{p(y | A)}, in (b) we use that E_Z{Z} = 1, in (c) we use that −z log z + z − 1 ≥ −(z − 1)² for any z ≥ 0, and in (d) we use again that E_Z{Z} = 1. Therefore,

(1/n) H(Y | A) − H(Y₁) = (1/n) ∫_{R^n} ( E_A{−p(y | A) log p(y | A)} − (−E_A{p(y | A)} log E_A{p(y | A)}) ) dy
(a)≥ −(1/n) ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} − E_A{p(y | A)} ) dy
(b)= −(1/n) ( ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy − 1 ),

where in (a) we use (62) and in (b) we use that the integral of E_A{p(y | A)} is 1. This concludes the proof of (57).

Second step. As X ∼ Unif(√d · S_C^{d−1}) and A_i ∼ CN(0_d, I_d/d), we have that {G_i}_{1≤i≤n} ∼ i.i.d. CN(0, 1). Let us rewrite the quantity E_A{(p(y | A))²} as follows:

E_A{(p(y | A))²} = E_A{ ( E_X{ Π_{i=1}^n p(y_i | |⟨X, A_i⟩|²) } )² }
(a)= E_A{ E_{X₁,X₂}{ Π_{i=1}^n p(y_i | |⟨X₁, A_i⟩|²) · p(y_i | |⟨X₂, A_i⟩|²) } }
(b)= E_C{ Π_{i=1}^n E_{G_{i,1},G_{i,2}}{ p(y_i | |G_{i,1}|²) · p(y_i | |G_{i,2}|²) } }, (63)

where in (a) X₁ and X₂ are independent copies of X, and in (b) we set G_{i,1} = ⟨X₁, A_i⟩, G_{i,2} = ⟨X₂, A_i⟩, and

C = ⟨X₁, X₂⟩ / (‖X₁‖ ‖X₂‖).

Then, given C = c, as X₁, X₂ ∼ i.i.d. Unif(√d · S_C^{d−1}) and A_i ∼ CN(0_d, I_d/d), we have that

{(G_{i,1}, G_{i,2})}_{1≤i≤n} ∼ i.i.d. CN( 0, ( 1 c ; c* 1 ) ).
Hence, since 1/n → 0, to prove (58) it suffices to show that (1/n) ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy → 0. We compute:

(1/n) ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy
(a)= (1/n) ∫_{R^n} E_C{ Π_{i=1}^n ( E_{G_{i,1},G_{i,2}}{p(y_i | |G_{i,1}|²) p(y_i | |G_{i,2}|²)} / E_{G_i}{p(y_i | |G_i|²)} ) } dy
= (1/n) E_C{ Π_{i=1}^n ∫_R ( E_{G_{i,1},G_{i,2}}{p(y_i | |G_{i,1}|²) p(y_i | |G_{i,2}|²)} / E_{G_i}{p(y_i | |G_i|²)} ) dy_i }
(b)= (1/n) E_M{ (f(M))^n }
(c)= ((d − 1)/n) ∫₀¹ (f(m))^n (1 − m)^{d−2} dm,

where in (a) we use (60) and (63), in (b) we use the fact that f depends only on m = |c|², which is clear from the explicit expression provided by Lemma 6 contained in Appendix A, and in (c) we use that M ∼ Beta(1, d − 1) by Lemma 9 contained in Appendix C.

Set d' = d − 2 and δ'_n = n/d'. Thus,

∫₀¹ (f(m))^n (1 − m)^{d−2} dm = ∫₀¹ exp( d' · F_{δ'_n}(m) ) dm, (64)

where F_{δ'_n}(m) is given by (12). Define
F̃_δ(m) = δ max(log f(m), 0) + log(1 − m). (65)

As δ < δ_ℓ and n/d' → δ, there exists δ* ∈ (δ, δ_ℓ) such that δ'_n < δ* for n sufficiently large. As F_{δ'_n}(m) ≤ F̃_{δ'_n}(m) and F̃_δ(m) is non-decreasing in δ, we have that, for n sufficiently large,

∫₀¹ exp( d' · F_{δ'_n}(m) ) dm ≤ ∫₀¹ exp( d' · F̃_{δ*}(m) ) dm. (66)

Note that F̃_{δ*}(m) < 0 if and only if F_{δ*}(m) < 0. Thus, by definition of δ_ℓ, we have that F̃_{δ*}(m) < 0 for m ∈ (0, 1]. Furthermore, F̃_{δ*}(0) = 0 and F̃_{δ*} is a continuous function. As a result, by Lemma 11, the integral in (66) tends to 0 as n → ∞; since (d − 1)/n → 1/δ, the claim immediately follows.

Remark 4 (Mutual Information). An immediate consequence of Lemma 1 is that one can compute the mutual information I(X; Y, A) for any δ < δ_ℓ:

lim_{n→+∞} (1/n) I(X; Y, A) = H( E_G{p(· | |G|²)} ) − E_G{ H( p(· | |G|²) ) }, (67)

where G ∼ CN(0, 1) and H(q) denotes the differential entropy of the density q.

Proof of Theorem 1. Define Y^n = (Y₁, . . . , Y_n) and A^n = (A₁, . . . , A_n). We divide the proof into two steps. The first step consists in showing that the mutual information between the next observation Y_{n+1} and the previous observations Y^n tends to 0. More formally, we will prove that

I(Y_{n+1}; Y^n, A^n | A_{n+1}) = o_n(1). (68)

The second step consists in showing that the estimate of φ(|⟨x, a_{n+1}⟩|) given the observations y^n is similar to the estimate of φ(|⟨x, a_{n+1}⟩|) when no observation is available. This means that the observations y^n do not provide any help. More formally, we will prove that

E_{Y^n, A^{n+1}}{ ( E{φ(|⟨X, A_{n+1}⟩|)} − E{φ(|⟨X, A_{n+1}⟩|) | Y^n, A^{n+1}} )² } = o_n(1), (69)

where φ is defined in (14).

Furthermore, we have that

E{ ( φ(|⟨X, A_{n+1}⟩|) − E{φ(|⟨X, A_{n+1}⟩|) | Y^n, A^{n+1}} )² }
= E{ ( φ(|⟨X, A_{n+1}⟩|) − E{φ(|⟨X, A_{n+1}⟩|)} )² } − E{ ( E{φ(|⟨X, A_{n+1}⟩|)} − E{φ(|⟨X, A_{n+1}⟩|) | Y^n, A^{n+1}} )² }
= Var{φ(|⟨X, A_{n+1}⟩|)} − E{ ( E{φ(|⟨X, A_{n+1}⟩|)} − E{φ(|⟨X, A_{n+1}⟩|) | Y^n, A^{n+1}} )² }. (70)

By applying (69) and (70), the proof of Theorem 1 follows.
First step. By using the chain rule of entropy and the fact that Y_i is independent of A_{i+1}, . . . , A_{n+1}, we obtain that

(1/(n+1)) H(Y^{n+1} | A^{n+1}) = (1/(n+1)) Σ_{i=1}^{n+1} H(Y_i | Y^{i−1}, A^{n+1}) = (1/(n+1)) Σ_{i=1}^{n+1} H(Y_i | Y^{i−1}, A^i).

The sequence s_n = H(Y_n | Y^{n−1}, A^n) is decreasing, as conditioning reduces entropy. Hence, s_n has a limit; since its Cesàro mean equals (1/(n+1)) H(Y^{n+1} | A^{n+1}), this limit must be equal to H(Y₁) by Lemma 1. Since the Y_i are identically distributed, we obtain that

H(Y_{n+1} | Y^n, A^{n+1}) = H(Y_{n+1}) + o_n(1).

By using again that conditioning reduces entropy, we also obtain that

H(Y_{n+1} | A_{n+1}) = H(Y_{n+1}) + o_n(1).

By putting these last two equations together, we deduce that (68) holds.
Second step. Given two probability distributions p and q, let D_KL(p‖q) and ‖p − q‖_TV denote their Kullback-Leibler divergence and their total variation distance, respectively. Then,

I(Y_{n+1}; Y^n, A^n | A_{n+1})
= E_{Y^n, A^{n+1}}{ D_KL( p(y_{n+1} | Y^n, A^{n+1}) ‖ p(y_{n+1} | A_{n+1}) ) }
(a)≥ 2 · E_{Y^n, A^{n+1}}{ ‖ p(y_{n+1} | Y^n, A^{n+1}) − p(y_{n+1} | A_{n+1}) ‖²_TV }
(b)≥ (1/(2K²)) · E_{Y^n, A^{n+1}}{ ( ∫_R p(y_{n+1} | Y^n, A^{n+1}) ϕ(y_{n+1}) dy_{n+1} − ∫_R p(y_{n+1} | A_{n+1}) ϕ(y_{n+1}) dy_{n+1} )² }
(c)= (1/(2K²)) · E_{Y^n, A^{n+1}}{ ( ∫_{C^d} p(x | Y^n, A^n) ∫_R p(y_{n+1} | x, A_{n+1}) ϕ(y_{n+1}) dy_{n+1} dx − ∫_{C^d} p(x) ∫_R p(y_{n+1} | x, A_{n+1}) ϕ(y_{n+1}) dy_{n+1} dx )² }
(d)= (1/(2K²)) · E_{Y^n, A^{n+1}}{ ( E{φ(|⟨X, A_{n+1}⟩|) | Y^n, A^{n+1}} − E{φ(|⟨X, A_{n+1}⟩|)} )² }, (71)

where in (a) we use Pinsker's inequality, in (b) we use that ϕ is bounded and we set ‖ϕ‖_∞ = K, in (c) we use that X and A_{n+1} are independent (and that Y_{n+1} depends on (X, A^{n+1}) only through ⟨X, A_{n+1}⟩), and in (d) we use the definition (14). By combining (68) and (71), (69) immediately follows.

Proof of Theorem 3

The proof is very similar to the one provided in Section 4.1 for the complex case. In particular, the crucial point consists in showing that

lim_{n→∞} (1/n) H(Y | A) = H(Y₁), (72)

where x ∼ Unif(√d · S_R^{d−1}), A = (a₁, . . . , a_n) with {a_i}_{1≤i≤n} ∼ i.i.d. N(0_d, I_d/d), and y = (y₁, . . . , y_n) with y_i ∼ p(· | g_i) and g_i = ⟨x, a_i⟩. Then, the proof of Theorem 3 follows the same passages as the proof of Theorem 1.

In order to prove (72), we show that (57) and (58) hold. The proof of (57) follows the same passages as the first step of the proof of Lemma 1, hence it is omitted. The proof of (58) is slightly different, and we detail what changes in the remaining part of this section.

Similarly to (60), we have that

E_A{p(y | A)} = Π_{i=1}^n E_{G_i}{p(y_i | G_i)},

where G_i = ⟨X, A_i⟩ ∼ N(0, 1). Furthermore, similarly to (63), we also have that

E_A{(p(y | A))²} = E_M{ Π_{i=1}^n E_{G_{i,1},G_{i,2}}{p(y_i | G_{i,1}) · p(y_i | G_{i,2})} },

where G_{i,1} = ⟨X₁, A_i⟩, G_{i,2} = ⟨X₂, A_i⟩, and we define

M = ⟨X₁, X₂⟩ / (‖X₁‖ ‖X₂‖).

Then, given M = m, as X₁, X₂ ∼ i.i.d. Unif(√d · S_R^{d−1}) and A_i ∼ N(0_d, I_d/d), we have that

{(G_{i,1}, G_{i,2})}_{1≤i≤n} ∼ i.i.d. N( 0, ( 1 m ; m 1 ) ).

Hence,

(1/n) ∫_{R^n} ( E_A{(p(y | A))²} / E_A{p(y | A)} ) dy
(a)= (1/n) E_M{ (f(M))^n }
(b)= (1/n) · ( Γ(d/2) / (√π Γ((d−1)/2)) ) ∫_{−1}^{1} (f(m))^n (1 − m²)^{(d−3)/2} dm, (73)

where in (a) we use the definition (31) of f and in (b) we plug in the distribution of M obtained from Lemma 10 contained in Appendix C. Note that

lim_{d→∞} Γ(d/2) / ( √(d/2) · Γ((d−1)/2) ) = 1.

Therefore, by showing that the integral in the RHS of (73) tends to 0, the claim immediately follows.

Set d' = d − 3 and δ'_n = n/d'. Thus,

∫_{−1}^{1} (f(m))^n (1 − m²)^{(d−3)/2} dm = ∫_{−1}^{1} exp( d' · F_{δ'_n}(m) ) dm, (74)

where F_{δ'_n}(m) is defined in (33). Define
F̃_δ(m) = δ max(log f(m), 0) + (1/2) log(1 − m²). (75)

As δ < δ_ℓ and n/d' → δ, there exists δ* ∈ (δ, δ_ℓ) such that δ'_n < δ* for n sufficiently large. As F_{δ'_n}(m) ≤ F̃_{δ'_n}(m) and F̃_δ(m) is non-decreasing in δ, we have that

∫_{−1}^{1} exp( d' · F_{δ'_n}(m) ) dm ≤ ∫_{−1}^{1} exp( d' · F̃_{δ*}(m) ) dm. (76)

Note that F̃_{δ*}(m) < 0 if and only if F_{δ*}(m) < 0. Thus, by definition of δ_ℓ, we have that F̃_{δ*}(m) < 0 for m ≠ 0. Furthermore, F̃_{δ*}(0) = 0 and F̃_{δ*} is a continuous function. As a result, by Lemma 11, the integral in (76) tends to 0 as n → ∞ and the claim immediately follows.
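As a numerical companion to this lower bound, the following Monte Carlo sketch (our own, with deliberately small and therefore noisy sample sizes; all names are illustrative) evaluates f(m) from (31) for the real phase retrieval density (30) and estimates δ_ℓ from (34). Remark 1 predicts a value close to 1/2 for small σ.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_mc = 0.1, 10_000
y = np.linspace(-1.0, 12.0, 500)
dy = y[1] - y[0]

def kernel(g):
    """p(y | g) = N(y; g^2, sigma^2) from (30), on the grid, one row per sample."""
    return np.exp(-(y[None, :] - g[:, None] ** 2) ** 2 / (2 * sigma ** 2)) \
           / (sigma * np.sqrt(2 * np.pi))

den = kernel(rng.normal(size=n_mc)).mean(axis=0)            # E_G{p(y | G)}

def f(m):
    g1 = rng.normal(size=n_mc)
    g2 = m * g1 + np.sqrt(1.0 - m ** 2) * rng.normal(size=n_mc)  # corr(G1, G2) = m
    num = (kernel(g1) * kernel(g2)).mean(axis=0)            # E{p(y | G1) p(y | G2)}
    return (num / np.maximum(den, 1e-300)).sum() * dy

# delta_ell = inf over m != 0 of -0.5*log(1 - m^2)/log f(m), restricted to f(m) > 1.
estimates = []
for m in np.linspace(0.3, 0.95, 14):
    fm = f(m)
    if fm > 1.05:            # guard against Monte Carlo noise around f = 1
        estimates.append(-0.5 * np.log(1.0 - m ** 2) / np.log(fm))
print("delta_ell estimate:", min(estimates))
```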
Proof of Theorems 2 and 4: Spectral Upper Bound

We will consider the complex case. The proof for the real case is essentially the same, and it is briefly discussed in Remark 7 at the end of this section.

A crucial ingredient of the proof is Lemma 2, which is a generalization of Theorem 1 of [LL17]. Before stating this result, we need some definitions. Let G ∼ CN(0, 1), Y ∼ p(· | |G|²), and Z = T(Y). Assume that Z has bounded support and let τ be the supremum of this support, i.e.,

τ = inf{ z : P(Z ≤ z) = 1 }. (77)

For λ ∈ (τ, ∞) and δ ∈ (0, ∞), define

φ(λ) = λ · E{ Z |G|² / (λ − Z) }, (78)

and

ψ_δ(λ) = λ ( 1/δ + E{ Z / (λ − Z) } ). (79)

Note that φ(λ) is a monotone non-increasing function and that ψ_δ(λ) is a convex function. Let λ̄_δ be the point at which ψ_δ attains its minimum, i.e.,

λ̄_δ = arg min_{λ > τ} ψ_δ(λ). (80)

For λ ∈ (τ, ∞), define also

ζ_δ(λ) = ψ_δ( max(λ, λ̄_δ) ). (81)

Lemma 2 (Generalization of Theorem 1 of [LL17]). Let x ∼ Unif(S_C^{d−1}), {a_i}_{1≤i≤n} ∼ i.i.d. CN(0_d, I_d), and y be distributed according to (8). Let n/d → δ, G ∼ CN(0, 1), and define Z = T(Y) for Y ∼ p(· | |G|²). Assume that Z satisfies P(Z = 0) < 1 and that it has bounded support. Let τ be defined as in (77). Assume further that, as λ approaches τ from the right, we have

lim_{λ→τ⁺} E{ Z² / (λ − Z)² } = lim_{λ→τ⁺} E{ Z |G|² / (λ − Z) } = ∞. (82)

Let x̂ be the principal eigenvector of the matrix D_n, defined as in (21). Then, the following results hold:

(1) The equation

ζ_δ(λ) = φ(λ) (83)

admits a unique solution, call it λ*_δ, for λ > τ.

(2) As n → ∞,

|⟨x̂, x⟩|² / (‖x̂‖² ‖x‖²) →^{a.s.} 0, if ψ'_δ(λ*_δ) ≤ 0,
|⟨x̂, x⟩|² / (‖x̂‖² ‖x‖²) →^{a.s.} ψ'_δ(λ*_δ) / ( ψ'_δ(λ*_δ) − φ'(λ*_δ) ), if ψ'_δ(λ*_δ) > 0, (84)

where ψ'_δ and φ' denote the derivatives of these two functions.
(3) Let λ₁^{D_n} ≥ λ₂^{D_n} denote the two largest eigenvalues of D_n. Then, as n → ∞,

λ₁^{D_n} →^{a.s.} ζ_δ(λ*_δ), λ₂^{D_n} →^{a.s.} ζ_δ(λ̄_δ). (85)

Before proceeding with the proof, we discuss these results in more detail and we describe in what sense Lemma 2 provides a generalization of Theorem 1 of [LL17].

Remark 5 (Two different regimes). The results of Lemma 2 imply that, according to the value of δ, we can distinguish between two possible regimes.

On the one hand, suppose that φ(λ̄_δ) > ψ_δ(λ̄_δ). Recall that φ(λ) is non-increasing and that λ̄_δ is the point at which ψ_δ(λ) attains its minimum. Thus, λ̄_δ < λ*_δ, which implies that ψ'_δ(λ*_δ) > 0 and that ζ_δ(λ*_δ) > ζ_δ(λ̄_δ). This means that the scalar product |⟨x̂, x⟩| is bounded away from zero and that there is a strictly positive gap between the two largest eigenvalues of D_n. In this regime, the spectral method that outputs x̂ solves the weak recovery problem and (24) holds for some ε > 0.

On the other hand, suppose that φ(λ̄_δ) ≤ ψ_δ(λ̄_δ). Thus, λ̄_δ ≥ λ*_δ, which implies that ψ'_δ(λ*_δ) ≤ 0 and that ζ_δ(λ*_δ) = ζ_δ(λ̄_δ). In words, this means that the scalar product |⟨x̂, x⟩| converges to zero and that there is no strictly positive gap between the two largest eigenvalues of D_n. In this regime, the spectral method that outputs x̂ does not solve the weak recovery problem.

Remark 6 (Lemma 2 and Theorem 1 of [LL17]). Lemma 2 generalizes Theorem 1 of [LL17] in the following two regards:

• x and {a_i}_{1≤i≤n} are complex vectors, while Theorem 1 of [LL17] considers the real case;

• Z can also be negative, while Theorem 1 of [LL17] assumes that Z ≥ 0.

The first generalization does not require additional work, as the whole argument of [LL17] generalizes in the natural way to the complex case: Gaussian random variables become circularly-symmetric complex Gaussian random variables, transposes of vectors and matrices become conjugate transposes, squares become modulus squares, and so on.

On the contrary, the second generalization is more challenging, as it requires the result of Lemma 3, which is stated below and proved in Appendix D.

As a final observation, let us point out that Theorem 1 of [LL17] also assumes that E{Z · |G|²} > E{Z}. A careful check shows that this hypothesis is never used in the proof of that theorem; it is required only in the proof of some additional results of [LL17].

Lemma 3 (Generalization of [BY12] to non-PSD matrices). Consider the random matrix

S_n = (1/n) U M_n U*, (86)

where the entries of U ∈ C^{(d−1)×n} are ∼ i.i.d. CN(0, 1), and M_n ∈ C^{n×n} is a Hermitian matrix independent of U. Let λ₁^{M_n} denote the largest eigenvalue of M_n. Assume that the empirical spectral measure of the eigenvalues of M_n almost surely converges weakly to the probability distribution H, where H is the law of the random variable Z. Let Γ_H be the support of H and let τ be the supremum of Γ_H. Assume also that, as n → ∞,

λ₁^{M_n} →^{a.s.} α* ∉ Γ_H. (87)

Let n/d → δ, denote by λ₁^{S_n} the largest eigenvalue of the matrix (86), and define ψ_δ as in (79). Then, as n → ∞,

λ₁^{S_n} →^{a.s.} ψ_δ(α*), if ψ'_δ(α*) > 0,
λ₁^{S_n} →^{a.s.} min_{λ>τ} ψ_δ(λ), if ψ'_δ(α*) ≤ 0. (88)
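The following small experiment (our own sketch; parameter values are arbitrary) illustrates the non-PSD setting: for a two-point signed weight distribution with no spike (heuristically, the case ψ'_δ(α*) ≤ 0 of (88)), the top eigenvalue of (1/n) U M_n U* concentrates at the bulk edge min_{λ>τ} ψ_δ(λ), with ψ_δ from (79).

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 3000, 3.0
d = int(n / delta)
p = 0.3
z = np.where(rng.random(n) < p, 1.0, -1.0)   # signed weights: Z = +1 w.p. p, else -1

U = (rng.normal(size=(d, n)) + 1j * rng.normal(size=(d, n))) / np.sqrt(2)  # CN(0,1)
S = (U * z) @ U.conj().T / n                 # (1/n) U M_n U^* with M_n = diag(z)
lam_top = np.linalg.eigvalsh(S)[-1]

# psi_delta(lambda) = lambda (1/delta + E{Z/(lambda - Z)}), here in closed form;
# tau = sup supp(Z) = 1, so we minimize over lambda > 1.
lam = np.linspace(1.001, 8.0, 200_000)
psi = lam * (1.0 / delta + p / (lam - 1.0) - (1.0 - p) / (lam + 1.0))
print(lam_top, psi.min())                    # the two numbers should nearly agree
```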
Proof of Lemma 2. In this proof, we follow closely the approach detailed in Section III of [LL17]. First of all, let us write the matrix D_n defined in (21) as

D_n = (1/n) A Z A*, (89)

where A = [a₁, . . . , a_n], Z is a diagonal matrix with entries z_i = T(y_i) for i ∈ [n], the random variables y_i are independent and distributed according to p(· | |g_i|²), and {g_i}_{1≤i≤n} ∼ i.i.d. CN(0, 1). As the sensing vectors {a_i}_{1≤i≤n} are drawn from the circularly-symmetric complex normal distribution, we can assume without loss of generality that x = e₁, where e₁ is the first element of the canonical basis of C^d.

Consider a matrix U ∈ C^{(d−1)×n} independent of {g_i}_{1≤i≤n} and Z. Let the elements of U be ∼ i.i.d. CN(0, 1). Define

P_n = (1/n) U Z U*, (90)

and

q_n = (1/n) U v, (91)

where v = [z₁g₁, . . . , z_ng_n]*. Then, (89) can be rewritten as

D_n = ( a_n q_n* ; q_n P_n ), (92)

where a_n = Σ_{i=1}^n z_i |g_i|² / n is a scalar that converges almost surely to E(Z · |G|²) as n → ∞, with G ∼ CN(0, 1).

Next, consider the parametric family of matrices {P_n + μ q_n q_n*} and let L_n(μ) denote their largest eigenvalues, i.e., L_n(μ) = λ₁(P_n + μ q_n q_n*). The idea is to compute the largest eigenvalue of D_n, call it λ₁^{D_n}, and the scalar product between x̂ and e₁ via a fixed-point equation involving L_n(μ).

To do so, we first need an intermediate result holding for any matrix D that can be written in the form

D = ( a q* ; q P ),

where a ∈ R, P ∈ C^{(d−1)×(d−1)} is a Hermitian matrix, and q ∈ C^{d−1} is such that ‖q‖ ≠ 0. Note that the matrix D_n defined in (21) fulfills such requirements, since the matrix P_n defined in (90) is Hermitian and q_n defined in (91) is such that ‖q_n‖ ≠ 0 with high probability, as P(Z = 0) < 1.

Let λ₁^P ≥ λ₂^P ≥ · · · ≥ λ_{d−1}^P be the set of eigenvalues of P, and let w₁, w₂, . . . , w_{d−1} be a corresponding set of eigenvectors. For λ ∈ (max{λ_i^P : ⟨q, w_i⟩ ≠ 0}, ∞), define

R(λ) = q* (P − λI)^{−1} q = Σ_{i=1}^{d−1} |⟨q, w_i⟩|² / (λ_i^P − λ). (93)

Note that R(λ) increases monotonically from −∞ to 0. Hence, it admits an inverse, call it R^{−1}(x), for x < 0. Then, the maximum eigenvalue L(μ) = λ₁(P + μ q q*) is given by

L(μ) = max( R^{−1}(−1/μ), λ₁^P ). (94)

The proof of (94) is standard, cf. e.g. Lemma 1 in [LL17]. Note that L(μ) is a non-decreasing function such that lim_{μ→∞} L(μ) = ∞. Indeed, by construction, R^{−1}(−1/μ) is strictly increasing and lim_{μ→∞} R^{−1}(−1/μ) = ∞. Furthermore, L(μ) is convex, since it is the maximum of a set of functions that are linear in μ:

L(μ) = λ₁(P + μ q q*) = max_{x : ‖x‖=1} x* (P + μ q q*) x.

Let μ* > 0 be the solution of the fixed-point equation

μ = ( L(μ) − a )^{−1}. (95)

This solution is unique, since L(μ) is a non-decreasing function with lim_{μ→∞} L(μ) = ∞. Then,

λ₁^D = L(μ*), (96)

and

|⟨x̂, e₁⟩|² ∈ [ ∂⁻L(μ*) / ( ∂⁻L(μ*) + 1/(μ*)² ), ∂⁺L(μ*) / ( ∂⁺L(μ*) + 1/(μ*)² ) ], (97)

where ∂⁻L(μ*) and ∂⁺L(μ*) denote the left and right derivatives of L(μ) at μ*, respectively. In particular, if L(μ) is differentiable at μ*, then

|⟨x̂, e₁⟩|² = L'(μ*) / ( L'(μ*) + 1/(μ*)² ).
(98)

The proofs of (96), (97), and (98) use the characterization (94) and are analogous to the proof of Proposition 2 in [LL17].

At this point, we need to compute L_n(μ) for the matrix D_n defined in (92). The eigenvalues of a low-rank perturbation of a random matrix are studied in [BGN11]. However, we cannot apply those results, as P_n and q_n are dependent. Hence, we write

P_n + μ q_n q_n* = (1/n) U M_n U*,

where M_n is independent of U, with

M_n = Z + (μ/n) v v*.

We start by studying the spectrum of M_n. Let λ₁^{M_n} ≥ λ₂^{M_n} ≥ · · · ≥ λ_n^{M_n} be the set of eigenvalues of M_n and let

f_{M_n} = (1/(n−1)) Σ_{i=2}^n δ_{λ_i^{M_n}}

be the empirical spectral measure of the last n − 1 eigenvalues. Then, standard interlacing theorems (see [HJ12, Section 4.3]) yield that f_{M_n} almost surely converges weakly to the probability law of Z. Furthermore, by using the characterization (94), we can show that

λ₁^{M_n} →^{a.s.} λ_μ = Q^{−1}(1/μ), (99)

where Q^{−1} is the inverse of the function

Q(λ) = E{ Z² |G|² / (λ − Z) }.

The proof of these results is the same as the proof of Proposition 3 in [LL17].

Note that Q(λ) is defined for λ ∈ (τ, ∞), and it is continuous and strictly decreasing with Q(∞) = 0. Furthermore, by hypothesis (82), we have that lim_{λ→τ⁺} Q(λ) = ∞. Thus, Q(λ) admits an inverse and Q^{−1}(1/μ) is well-defined for all μ > 0.

Let us now consider the matrix (1/n) U M_n U*. First, if Z ≥ 0, then M_n is positive semi-definite (PSD) and we can apply results from [BY12] to compute the limit of L_n(μ). If M_n is not necessarily PSD, we use Lemma 3 with α* = λ_μ to conclude that

L_n(μ) →^{a.s.} ψ_δ(λ_μ), if ψ'_δ(λ_μ) > 0,
L_n(μ) →^{a.s.} min_{λ>τ} ψ_δ(λ), if ψ'_δ(λ_μ) ≤ 0. (100)

The remaining part of the proof follows the argument of Section III-D in [LL17]. For the sake of readability, we reproduce it below.

We start by proving the first claim of the lemma. For n ≥ 1, let μ_n be the unique solution of the fixed-point equation (95). Then,

L_n(μ_n) − 1/μ_n = a_n.

Now, fix any μ > 0. Then, by using the definition (81) and the fact that λ_μ = Q^{−1}(1/μ), (100) immediately implies that, as n → ∞,

L_n(μ) − 1/μ →^{a.s.} ζ_δ(Q^{−1}(1/μ)) − 1/μ. (101)

Note that, as n → ∞, a_n →^{a.s.} E(Z · |G|²). Furthermore, as L_n(μ) and ζ_δ(Q^{−1}(1/μ)) are non-decreasing in μ, the two functions on both sides of (101) are strictly increasing in μ. Consequently, by Lemma 3 in Appendix E of [LL17], we conclude that

μ_n →^{a.s.} μ*, (102)

where μ* is the unique fixed point such that

ζ_δ(Q^{−1}(1/μ*)) = E(Z · |G|²) + 1/μ*. (103)

Define

λ* = Q^{−1}(1/μ*). (104)

Then, (103) can be rewritten as

ζ_δ(λ*) = E(Z · |G|²) + Q(λ*) = φ(λ*), (105)

where φ is defined in (78). By construction, ζ_δ(λ) is a non-decreasing continuous function on (τ, ∞) and φ(λ) is a strictly decreasing continuous function. Furthermore, by hypothesis (82), we have that lim_{λ→τ⁺} φ(λ) = ∞. Hence, the existence and the uniqueness of λ* satisfying (105) are guaranteed. This suffices to prove the first claim of the lemma.

Let us now move on to the proof of the second claim of the lemma. Suppose that ζ_δ(Q^{−1}(1/μ)) is differentiable at μ = μ*. Then, as L_n(μ) is convex for any n ≥ 1, by Lemma 4 in Appendix E of [LL17], we have that

∂⁻L_n(μ_n) →^{a.s.} ( d ζ_δ(Q^{−1}(1/μ)) / dμ ) |_{μ=μ*} = − ζ'_δ(Q^{−1}(1/μ*)) / ( Q'(Q^{−1}(1/μ*)) · (μ*)² ).

Similarly,
−→ − ζ (cid:48) δ ( Q − (1 /µ ∗ )) Q (cid:48) ( Q − (1 /µ ∗ )) · ( µ ∗ ) . By using (97), we obtain that |(cid:104) ˆ x , e (cid:105)| a.s. −→ ζ (cid:48) δ ( Q − (1 /µ ∗ )) ζ (cid:48) δ ( Q − (1 /µ ∗ )) − Q (cid:48) ( Q − (1 /µ ∗ )) = ζ (cid:48) δ ( λ ∗ ) ζ (cid:48) δ ( λ ∗ ) − φ (cid:48) ( λ ∗ ) , where the equality follows from the definition (104) of λ ∗ and from the fact that Q (cid:48) ( λ ) = φ (cid:48) ( λ ) . In orderto prove the second claim of the lemma, it suffices to note that, by its definition in (81), ζ (cid:48) δ ( λ ) = ψ (cid:48) δ ( λ ) if ψ (cid:48) δ ( λ ) > , and ζ (cid:48) δ ( λ ) = 0 if ψ (cid:48) δ ( λ ) < .Finally, let us prove the third claim of the lemma. By using (96), we immediately obtain that λ D n = L n ( µ n ) . By applying (102) and Lemma 3 in Appendix E of [LL17], we conclude that λ D n a.s. −→ ζ δ ( λ ∗ ) . As P n is obtained by deleting the first row and column of D n , by applying Cauchy interlacing theorem (see,e.g., [HJ12, Theorem 4.3.17]), we also have that λ P n ≤ λ D n ≤ λ P n . Furthermore, the upper edge of the support of the limiting spectral distribution of P n is given by [SC95,Section 4] and [BY12, Lemma 3.1] min λ>τ ψ δ ( λ ) = ζ δ (¯ λ δ ) , where ¯ λ δ is defined in (80). Therefore, λ D n a.s. −→ ζ δ (¯ λ δ ) , which concludes the proof.At this point, we are ready to prove our spectral upper bound. Proof of Theorem 2.
Note that the normalization of x and { a i } ≤ i ≤ n required in Lemma 2 is different fromthe normalization required in Theorem 2. However, the scalar product (cid:104) x , a i (cid:105) is the same and the datamatrix D n changes by a factor d . Hence, the principal eigenvector ˆ x is not affected by this change in thenormalization.Let G ∼ CN (0 , , Y ∼ p ( · | | G | ) and Z = T ( Y ) , where p is defined in (8) and T is some pre-processing function that we will choose later on. We will assume that the supremum τ of the support of Z is strictly positive and that conditions (82) are satified, and will verify later that our choice of the function T satisfies these requirements. Recall that the function ψ δ ( λ ) defined in (79) is convex and that it attains25ts minimum at the point ¯ λ δ . Since by condition (82) ψ δ ( λ ) ↑ ∞ as λ ↓ , we have ¯ λ δ ∈ ( τ, ∞ ) . Hence, ψ (cid:48) δ (¯ λ δ ) = 0 . By calculating the derivative of ψ δ ( λ ) and setting it to , we have E (cid:26) Z (¯ λ δ − Z ) (cid:27) = 1 δ . (106)Furthermore, as pointed out in Remark 5, (24) holds for some (cid:15) > if and only if φ (¯ λ δ ) > ψ δ (¯ λ δ ) . (107)As τ > , we also have that ¯ λ δ > . Consider now the matrix D (cid:48) n = D n /α for some α > . Then, theprincipal eigenvector of D (cid:48) n is equal to the principal eigenvector of D n . Hence, we can assume without lossof generality that ¯ λ δ = 1 . Consequently, the conditions (106) and (107) can be respectively rewritten as E (cid:26) Z (1 − Z ) (cid:27) = 1 δ , (108) E (cid:26) Z ( | G | − − Z (cid:27) > δ . (109)Furthermore, as Z = T ( Y ) , we also obtain that E (cid:26) Z (1 − Z ) (cid:27) = (cid:90) R (cid:18) T ( y )1 − T ( y ) (cid:19) E G { p ( y | | G | ) } d y, E (cid:26) Z ( | G | − − Z (cid:27) = (cid:90) R T ( y )1 − T ( y ) E G (cid:8) p ( y | | G | ) · ( | G | − (cid:9) d y. (110)Let T ∗ ( y ) be defined in (23). Note that, if we substitute T ( y ) = T ∗ ( y ) into the RHS of (110), then E (cid:26) Z (1 − Z ) (cid:27) = E (cid:26) Z ( | G | − − Z (cid:27) = 1 δ u , where δ u is defined in (20). Let T ∗ δ ( y ) be defined in (22). Then, T ∗ δ ( y )1 − T ∗ δ ( y ) = (cid:114) δ u δ T ∗ ( y )1 − T ∗ ( y ) , which immediately implies that E (cid:26) ( T ∗ δ ( Y )) (1 − T ∗ δ ( Y )) (cid:27) = 1 δ , (111) E (cid:26) T ∗ δ ( Y )( | G | − − T ∗ δ ( Y ) (cid:27) = 1 √ δ · δ u > δ . (112)As a result, we need to show that the function T ∗ δ ( y ) fulfills the following requirements:(1) T ∗ δ ( y ) is bounded;(2) P ( T ∗ δ ( Y ) = 0) < ;(3) the supremum τ of the support of T ∗ δ ( Y ) is strictly positive;264) the condition (82) holds.Note that T ∗ δ ( y ) is bounded, as T ∗ ( y ) ≤ . Furthermore, if E G { p ( y | | G | ) } = E G (cid:8) p ( y | | G | ) | G | (cid:9) , (113)identically, then δ u = ∞ and the claim of Theorem 2 trivially holds. Hence, we can assume that (113)does not hold, which implies that the function T ∗ is not equal to the constant value . Consequently, P ( T ∗ δ ( Y ) = 0) < .By definition (23) of T ∗ , we have that E Y (cid:26) − T ∗ ( Y ) (cid:27) = (cid:90) R E G (cid:8) p ( y | | G | ) · | G | (cid:9) d y = E G (cid:8) | G | (cid:9) = 1 . (114)Hence, P ( T ∗ ( Y ) > > , which implies that P ( T ∗ δ ( Y ) > > . Consequently, the supremum τ of thesupport of T ∗ δ ( Y ) is strictly positive.If P ( T ∗ δ ( Y ) = τ ) > , then the condition (82) is satisfied. Suppose now that P ( T ∗ δ ( Y ) = τ ) = 0 . 
Then,for any (cid:15) > , there exists ∆ ( (cid:15) ) such that < P (cid:0) T ∗ δ ( Y ) ∈ ( τ − ∆ ( (cid:15) ) , τ ) (cid:1) ≤ (cid:15) . (115)Define T ∗ δ ( y, (cid:15) ) = T ∗ δ ( y ) , if T ∗ δ ( y ) ≤ τ − ∆ ( (cid:15) ) ,τ − ∆ ( (cid:15) ) , otherwise . (116)Clearly, the random variable T ∗ δ ( Y, (cid:15) ) has a point mass, hence the condition (82) is satisfied.As a final step, we show that we can take (cid:15) = 0 . Define D n ( (cid:15) ) = 1 n n (cid:88) i =1 T ∗ δ ( y i , (cid:15) ) a i a ∗ i . Define also D n = 1 n n (cid:88) i =1 T ∗ δ ( y i ) a i a ∗ i . Let ˆ x ( (cid:15) ) and ˆ x be the principal eigenvectors of D n ( (cid:15) ) and of D n , respectively. Then, (cid:107) D n ( (cid:15) ) − D n (cid:107) op ≤ C · ∆ ( (cid:15) ) , (117)where the constant C depends only on n/d . By Lemma 2, there is a strictly positive gap, call it θ , betweenthe first and the second eigenvalue of D n ( (cid:15) ) . Consequently, by the Davis-Kahan theorem [DK70], weconclude that (cid:107) ˆ x ( (cid:15) ) − ˆ x (cid:107) ≤ C · ∆ ( (cid:15) ) , (118)where the constant C depends only on n/d and on θ . In words, for any n , as (cid:15) tends to , the principaleigenvector of D n ( (cid:15) ) tends to the principal eigenvector of D n . This means that we can set T = T ∗ δ andhave that, almost surely, (24) holds.In order to conclude the proof, it remains to show that δ u is the optimal threshold for the spectral method,namely, for any δ < δ u , there is no pre-processing function T such that, (24) holds almost surely. To do27o, note that, (24) holds almost surely if and only if (108) and (109) are satisfied. By setting u ( y ) = T ( y ) / (1 − T ( y )) and using (110), we have that these conditions can be rewritten as (cid:90) R ( u ( y )) E G { p ( y | | G | ) } d y = 1 δ , (119) E (cid:26) Z ( | G | − − Z (cid:27) = (cid:90) R u ( y ) (cid:112) E G { p ( y | | G | ) } E G (cid:8) p ( y | | G | ) · ( | G | − (cid:9)(cid:112) E G { p ( y | | G | ) } d y > δ . (120)By Cauchy-Schwarz inequality, we also have that (cid:90) R u ( y ) (cid:112) E G { p ( y | | G | ) } E G (cid:8) p ( y | | G | ) · ( | G | − (cid:9)(cid:112) E G { p ( y | | G | ) } d y ≤ (cid:115)(cid:90) R ( u ( y )) E G { p ( y | | G | ) } d y (cid:115)(cid:90) R ( E G { p ( y | | G | ) · ( | G | − } ) E G { p ( y | | G | ) } d y. (121)By combining (119), (120) and (121) with the definition (20) of δ u , we conclude that √ δ u √ δ > δ , (122)which implies that δ > δ u . Consequently, for δ ≤ δ u , no pre-processing function achieves weak recoveryand the proof is complete. Remark 7 (Proof of Spectral Upper Bound for the Real Case) . First, we need to prove a result analogousto that of Lemma 2, where x ∼ Unif( S d − R ) , { a i } ≤ i ≤ n ∼ i.i.d. N ( d , I d ) , y is distributed according to (29) , and G ∼ N (0 , . To do so, one can follow the proof of Theorem 1 of [LL17]. The technical difficultyconsists in the fact that the matrix M n is not necessarily PSD. In order to solve this issue, we apply theversion of Lemma 3 for the real case discussed in Remark 8 at the end of Appendix D. At this point, theproof of Theorem 4 follows from the same argument as the proof of Theorem 2. Message passing algorithms have proved successful in a broad range of statistical estimation problems, in-cluding high-dimensional regression [BM12], robust regression [DM16], low-rank matrix estimation [DM13,KMZ13, MR16, KKM + δ u the spectral threshold defined in (42)):1. We prove that, for δ < δ u (i.e. 
in the regime in which the spectral approach fails), message passingconverges to an un-informative fixed point, even if initialized in a state that is correlated with the truesignal x . 28. Vice versa, for δ > δ u (when the spectral algorithm achieves weak recovery), we consider a linearizedmessage passing algorithm, and prove that the un-informative fixed point is unstable. The proof ofthis fact builds on the analysis contained in the previous pages.Let us point out that the techniques described in Section 5 to compute the spectral threshold δ u are differentfrom those described in this section to analyze message passing algorithms. Hence, we find very interestingthe fact that the spectral threshold is closely related to the performance of message passing. In particular, ourfindings suggest the conjecture that δ u represents the fundamental limit for all polynomial-time algorithms.Note also that message passing often allows to further refine the spectral estimate, in order to provide anexact recovery of the signal. Hence, combining the analyses of message passing and of the spectral methodto provide a threshold for exact recovery constitutes an interesting direction for future research (see [MV17]for an example in which this program is carried out).For the sake of simplicity, we will assume that the signal x and the measurement matrix A are real.Of particular interest for the present setting is approximate message passing (AMP) [DMM09, BM11]:this is a broad class of iterative methods that operates with dense random matrices (as the sensing matrix A in the present case). In particular, in [Ran11] it was proposed a “generalized approximate messagepassing” (GAMP) scheme, which is an AMP algorithm for Bayesian estimation in non-linear regressionmodels. This approach was further developed in the context of phase retrieval in [SR15]. We will follow thesame Bayesian formulation here, by considering an AMP algorithm that is equivalent to GAMP althoughsomewhat simpler.In order to minimize technical overhead, we assume throughout this section that the conditional density p ( y | g ) is bounded and two times differentiable with respect to g . Denote by ∂ g p ( y | g ) and ∂ g p ( y | g ) thefirst and the second derivative of p ( y | g ) , respectively. Let G ∼ N (0 , and define the function F ( x, y ; ¯ q ) = E G { ∂ g p ( y | ¯ q x + √ ¯ qG ) } E G { p ( y | ¯ q x + √ ¯ qG ) } . (123)We further define the following “state evolution” recursion: µ t +1 = δ · h ( q t ) ,q t = µ t µ t , (124)where h ( q ) = (cid:90) R E G (cid:40) (cid:0) E G { ∂ g p ( y | √ qG + √ − qG ) } (cid:1) E G { p ( y | √ qG + √ − qG ) } (cid:41) d y , (125)with G , G ∼ i.i.d. N (0 , .Given the sensing matrix A = ( a , . . . , a n ) T ∈ R n × d , and the vector of measurements y = ( y , . . . , y n ) ∈ R n , the message passing algorithm updates iteratively the estimate z t ∈ R d of the signal x ∈ R d , with (cid:107) x (cid:107) = √ d , according to the iteration z t +1 = A T f t ( ˆ z t ; y ) − b t z t , ˆ z t = Az t − f t − ( ˆ z t − ; y ) . (126)Here, the function f t ( ˆ z ; y ) = ( f t (ˆ z ; y ) , . . . , f t (ˆ z n ; y n )) is understood to be applied component-wise to itsarguments and b t it is defined as f t (ˆ z ; y ) = F (ˆ z, y ; 1 − q t ) , (127)29nd the “Onsager coefficient” b t is defined as b t = δ · E { f (cid:48) t ( µ t G + √ µ t G ; Y ) } , (128)where f (cid:48) t (ˆ z ; y ) denotes the derivative of f t (ˆ z ; y ) with respect to ˆ z , and the expectation is with respect to G , G ∼ i.i.d. 
N (0 , and Y ∼ p ( · | G ) . The recursion (126) is initialized with z ∈ R d and it is under-stood that f − ( · ; · ) = n .State evolution precisely tracks the asymptotics of AMP. The next statement is a consequence of [BM11,JM13]. We refer to Appendix E for its proof. Lemma 4 (State Evolution for AMP Iteration (126)) . Let x ∈ R d denote the unknown signal such that (cid:107) x (cid:107) = √ d , A = ( a , . . . , a n ) T ∈ R n × d with { a i } ≤ i ≤ n ∼ i.i.d. N ( d , I d /d ) , and y = ( y , . . . , y n ) with y i ∼ p ( · | (cid:104) x , a i (cid:105) ) . Consider the AMP iterates z t , ˆ z t defined in (126) , where f t (ˆ z ; y ) and b t are given by (127) and (128) , respectively. Assume that the initialization z is independent of A and that, almost surely, lim n →∞ d (cid:104) x , z (cid:105) = µ , lim n →∞ d (cid:107) z (cid:107) = µ + µ . (129) Let the state evolution recursion q t , µ t be defined as in (124) with initialization µ . Then, for any t , and forany function ψ : R → R such that | ψ ( u ) − ψ ( v ) | ≤ L (1 + (cid:107) u (cid:107) + (cid:107) v (cid:107) ) (cid:107) u − v (cid:107) for some L ∈ R , wehave that, almost surely, lim n →∞ n n (cid:88) i =1 ψ ( x i , z ti ) = E { ψ ( X , µ t X + √ µ t G ) } , (130) where the expectation is taken with respect to X , G ∼ i.i.d. N (0 , . Informally, this lemma states that z t is a noisy version of the signal x , namely z t ≈ µ t x + √ µ t g , with g ∼ N ( d , I d ) , and that this approximation holds for empirical averages. In order to obtain a non-vanishing weak recovery threshold, we assume that the observation model satisfiesthe condition E G { ∂ g p ( y | G ) } = 0 , (131)where the expectation is with respect to G ∼ N (0 , . Notice that this implies h (0) = 0 , therefore µ t = q t = 0 is a fixed point of state evolution. Furthermore, F (0 , y ; 1) = 0 , therefore q t = 0 , z t = d is a fixedpoint of the message passing algorithm. We will refer to this as to the “un-informative fixed point”. Notethat the condition (131) holds – among others – for the phase retrieval problem.Vice versa, if E G { ∂ g p ( y | G ) } (cid:54) = 0 , then µ > even if µ = 0 , for any δ > . Thanks to Lemma 4,this implies that weak recovery is possible for all δ > . Hence, we will assume that the condition (131)holds.The first result of this section establishes the following: for δ < δ u , the message passing algorithm failseven if the initial condition has a positive correlation with the unknown signal. We refer to Appendix E forits proof. Theorem 5 (Message Passing Fails for δ < δ u ) . Let x ∈ R d denote the unknown signal such that (cid:107) x (cid:107) = √ d . Let A = ( a , . . . , a n ) T ∈ R n × d with { a i } ≤ i ≤ n ∼ i.i.d. N ( d , I d /d ) , and y = ( y , . . . , y n ) with i ∼ p ( · | (cid:104) x , a i (cid:105) ) . Let n/d → δ and define δ u as in (42) . Let G ∼ N (0 , and assume that thecondition (131) holds for any y ∈ R .Consider the AMP algorithm defined in (126), and assume that the initial condition z is such that lim n →∞ (cid:104) z , x (cid:105)(cid:107) z (cid:107) (cid:107) x (cid:107) = (cid:15). (132) Then, for any δ < δ u , there exists (cid:15) ( δ ) such that for any (cid:15) ∈ (0 , (cid:15) ( δ )) , almost surely, lim t →∞ lim n →∞ d (cid:13)(cid:13) z t (cid:13)(cid:13) = d . (133)Next, we consider the case δ > δ u and we linearize the iteration (126) around the non-informative fixedpoint. Lemma 5 (Linearized AMP Equations) . Consider the one-step map defined in (126) and assume that con-dition (131) holds. 
Define R t = (cid:107) z t (cid:107) + (cid:107) ˆ z t − (cid:107) . Then, as R t → and q t → , we obtain (cid:18) z t +1 ˆ z t (cid:19) = L n (cid:18) z t ˆ z t − (cid:19) + o ( R t ) + R t o q t (1) , (134) where L n ∈ R ( n + d ) × ( n + d ) is defined as L n = (cid:18) A T J A − A T J A − J (cid:19) , (135) and J ∈ R n × n is a diagonal matrix with entries j i = F (cid:48) (0 , y i ; 1) for i ∈ [ n ] , with F (cid:48) denoting the derivativeof F with respect to the first argument. The second result of this section establishes the following: for δ > δ u , the un-informative fixed pointis unstable for the iteration (126), i.e., the matrix L n has an eigenvalue that is larger than 1 in modulus. Todo so, we will relate the matrix J appearing in (135) to the optimal pre-processing function defined in (45)(see equation (237) in Appendix F). We refer to Appendix F for the complete proof. Theorem 6 (Message Passing Escapes from Un-informative Fixed Point for δ > δ u ) . Let x ∈ R d denotethe unknown signal such that (cid:107) x (cid:107) = √ d . Let A = ( a , . . . , a n ) T ∈ R n × d with { a i } ≤ i ≤ n ∼ i.i.d. N ( d , I d /d ) , and y = ( y , . . . , y n ) with y i ∼ p ( · | (cid:104) x , a i (cid:105) ) . Let n/d → δ and define δ u as in (42) .Furthermore, assume that (131) holds for any y ∈ R . Define L n ∈ R ( n + d ) × ( n + d ) as in (135) , where J ∈ R n × n is a diagonal matrix with entries j i = G (0 , y i ; 1) for i ∈ [ n ] . Then, the eigenvalues of L n arereal and the largest of them, call it λ L n , is such that, for any δ > δ u , lim n →∞ λ L n > . (136) We focus on the phase retrieval problem and present some numerical results to illustrate the performanceachieved by the proposed spectral method. First, we consider the case in which the unknown vector ischosen uniformly at random and the sensing vectors are Gaussian. Then, we consider the more practicalscenario in which the unknown vector is an image and the sensing vectors come from a coded diffractionmodel. 31 .1 Gaussian Sensing Vectors for Synthetic Data
Let us consider the complex case. In our experiments, the vector x is chosen uniformly at random on the d -dimensional complex sphere with radius √ d , the sensing vectors { a i } ≤ i ≤ n are i.i.d. circularly-symmetricnormal with variance /d and, for i ∈ [ n ] , the measurement y i is equal to |(cid:104) x , a i (cid:105)| . We take d = 4096 and the numerical simulations are averaged over n sample = 40 independent trials. The results are plotted inFigure 2a.The red curve corresponds to the proposed pre-processing function given by T ( y ) = y − y + (cid:112) ˜ δ − . (137)We pick ˜ δ = 1 . and, as shown by the figure, weak recovery is possible for values of δ very close to .The green curve corresponds to the pre-processing function given by T ( y ) = max (cid:32) y − y + (cid:112) ˜ δ − , (cid:33) . (138)We add this plot in order to show that, by enforcing non-negativity of the pre-processing function, we incurin a degradation of the performance of the spectral method.The black curve corresponds to the pre-processing function given by T ( y ) = (cid:40) , for y > t, , otherwise . (139)This choice was proposed in [WGE16] and it is also considered in [LL17], where the authors refer to it asthe “subset algorithm”. For each value of t , we can compute the smallest value of δ , call it δ ∗ ( t ) , that yieldsa strictly positive scalar product according to the result of Lemma 2. Hence, we pick t = 2 that correspondsto the smallest value of δ ∗ ( t ) over t ∈ { . , . , . , . . . , } .The blue curve corresponds to the pre-processing function given by T ( y ) = (cid:40) y, for y ≤ t, , otherwise . (140)This choice corresponds to the truncated spectral initialization proposed in [CC17] and it is also consideredin [LL17], where the authors refer to it as the “trimming algorithm”. For each value of t , we can compute thesmallest value of δ , call it δ ∗ ( t ) , that yields a strictly positive scalar product according to the result of Lemma2. Hence, we pick t = 5 . that corresponds to the smallest value of δ ∗ ( t ) over t ∈ { . , . , . , . . . , } .Note that the numerical simulations follow closely the theoretical prediction given by (84). Furthermore,the choice of the pre-processing function (137) yields a remarkable performance gain with respect to boththe subset algorithm and the trimming algorithm.Similar considerations apply to the real case. Here, the vector x is chosen uniformly at random on the d -dimensional real sphere with radius √ d and the sensing vectors { a i } ≤ i ≤ n are i.i.d. normal with zeromean and variance /d . We pick d = 4096 and n sample = 40 . The results are plotted in Figure 2b. Again,the numerical simulations follow closely the theoretical prediction. The red curve corresponds to the pre-processing function given by (137), where we pick ˜ δ = 1 . . Note that weak recovery is possible forvalues of δ very close to / . The green curve corresponds to the pre-processing function given by (138).32he blue curve corresponds to the pre-processing function given by (139), where we pick t = 2 whichyields the smallest value of δ ∗ ( t ) over t ∈ { . , . , . , . . . , } . The black curve corresponds to thepre-processing function given by (140), where we pick t = 7 which yields the smallest value of δ ∗ ( t ) over t ∈ { . , . , . , . . . , } . We consider a model of coded diffraction patterns in which the sensing vectors { a r } ≤ r ≤ n are obtained asfollows. For t ∈ [ d ] and t ∈ [ d ] , denote by a r ( t , t ) the ( t , t ) -th component of the vector a r ∈ C d ,with d = d · d . 
Then, a r ( t , t ) = d (cid:96) ( t , t ) · e i πk t /d · e i πk t /d , (141)where i denotes the imaginary unit. The index r ∈ [ n ] is associated to a pair ( (cid:96), k ) , with (cid:96) ∈ [ L ] ; the index k ∈ [ d ] is associated to a pair ( k , k ) with k ∈ [ d ] and k ∈ [ d ] . As usual, the measurement y r ofan unknown d -dimensional vector x is equal to |(cid:104) x , a r (cid:105)| . As an immediate consequence, the number ofmeasurements n is equal to L · d , therefore δ = L ∈ N . In words, for a fixed (cid:96) , we collect the magnitudeof the diffraction pattern of x modulated by d (cid:96) . By varying (cid:96) and changing the modulation pattern d (cid:96) , wegenerate L distinct views. The vectors { d (cid:96) } ≤ (cid:96) ≤ L are i.i.d. and their entries are also i.i.d. drawn uniformlyfrom the set { , − , i, − i } .We test the spectral method on the digital photograph represented in Figure 1a. Each color image canbe viewed as a d × d × array. We run the spectral algorithm separately on the vectors x j ∈ R d , where j ∈ { , , } . In our example, d = 820 and d = 1280 . Let ˆ x j be the estimate of x j provided by thespectral method. Then, we employ as a performance metric the average squared normalized scalar product (cid:88) j =1 |(cid:104) ˆ x j , x j (cid:105)| (cid:107) ˆ x j (cid:107) (cid:107) x j (cid:107) . (142)Note that the scalar product between the input and the measurement vectors can be interpreted as a 2-dimensional Fourier transform, hence it can be computed with an FFT algorithm. In order to evaluate theprincipal eigenvector of the data matrix, we use the power method with a random initialization, as describedin Appendix B of [CLS15b]. As a stopping criterion, we require that one of the following two conditions isfulfilled: either the number of iterations reaches the maximum value of , or the modulus of the scalarproduct between the estimate at the current iteration T and at the iteration T − is larger than − − .The results are summarized in Figure 3. The red curve corresponds to the proposed pre-processing func-tion. In this case, the eigenvalues of the data matrix can be negative. Recall that the power method outputsthe eigenvector associated to the largest eigenvalue in modulus , while we are interested in the eigenvectorassociated to the largest eigenvalue. To address this issue, we add to the data matrix a multiple α of theidentity matrix. However, as α grows, the convergence of the power method becomes slower and slower. Inorder to reduce the negative tail of the distribution of eigenvalues and, consequently, the value of α , we pickthe pre-processing function given by T ( y ) = max( T ( y ) , − M ) , (143)where T ( y ) is defined in (137), ˜ δ = 1 . , and M = 40 . In this way, by taking α = 100 , the largesteigenvalue in modulus has positive sign.The blue curve corresponds to the truncated spectral initialization in [CC17], i.e., the pre-processingfunction is given by (140) with t = 9 . 33 (a) Complex case. (b) Real case. Figure 2: Performance of the spectral method for the phase retrieval problem where the unknown vector isuniformly random on the sphere and the sensing vectors are Gaussian. On the x -axis, we have the ratio δ between the number of samples and the dimension of the signal; on the y -axis, we have the square of thenormalized scalar product between the unknown signal x and the estimate ˆ x . 
Note that the proposed choiceof the pre-processing function (red curve) provides a significant performance improvement with respect tothe subset algorithm considered in [WGE16, LL17] (black curve) and the truncated spectral initializationconsidered in [CC17, LL17] (blue curve). 34 Figure 3: Performance of the spectral method for the phase retrieval problem where the unknown vector isa digital photograph and the sensing vectors are obtained from a coded diffraction model. On the x -axis,we have the ratio δ between the number of samples and the dimension of the signal; on the y -axis, we havethe square of the normalized scalar product between the unknown signal x and the estimate ˆ x (averaged onthe three RGB components of the image). Note that the proposed choice of the pre-processing function (redcurve) provides a significant performance improvement with respect to the truncated spectral initializationconsidered in [CC17] (blue curve).The numerical simulations for the optimal pre-processing function follow closely the theoretical pre-dictions (84) obtained for a Gaussian measurement matrix, with the exception of the point δ = 2 . On thecontrary, the numerical simulations for the truncated spectral initialization show a different behavior withrespect to the Gaussian model. Our algorithm provides weak recovery of the original image for δ ≥ ,while the truncated spectral initialization requires δ ≥ . Furthermore, for any value of δ , the proposedchoice of the pre-processing function yields a better performance than the choice in [CC17]. For a visualrepresentation of these results, see Figure 1. Acknowledgement
M. M. was supported by an Early Postdoc.Mobility fellowship from the Swiss National Science Foundation.A. M. was partially supported by grants NSF DMS-1613091 and NSF CCF-1714305.
A Proof of Corollary 1
We start by providing in Lemma 6 a less compact, but more explicit form of the expression (10). Thismore explicit expression is employed to prove Lemma 7, which yields the value of δ (cid:96) for the case of phaseretrieval. Finally, we provide the proof of Corollary 1.35 emma 6 (Explicit Formula for f ( m ) - Complex Case) . Consider the function f : [0 , → R defined in (10) . Then, f ( m ) is given by the following expression: f ( m ) = (cid:90) R − m (cid:90) + ∞ (cid:90) + ∞ r r · p ( y | r ) p ( y | r ) · exp (cid:18) − r + r − m (cid:19) · I (cid:18) r r √ m − m (cid:19) d r d r (cid:90) + ∞ r · p ( y | r ) · exp (cid:0) − r (cid:1) d r d y, (144) where I denotes the modified Bessel function of the first kind, given by I ( x ) = 1 π (cid:90) π exp ( x cos θ ) d θ. (145) Proof.
Let us rewrite G as G = G (R) + jG (I) , with ( G (R) , G (I) ) ∼ N (cid:18) d , I (cid:19) , i.e., G (R) and G (I) are i.i.d. Gaussian random variables with mean 0 and variance / . Set R = (cid:113)(cid:0) G (R) (cid:1) + (cid:0) G (I) (cid:1) . Then, R follows a Rayleigh distribution with scale parameter / √ , hence E G { p ( y | | G | ) } = E R { p ( y | R ) } = (cid:90) + ∞ r · p ( y | r ) · exp (cid:0) − r (cid:1) d r. (146)Let us rewrite ( G , G ) as ( G , G ) = ( G (R)1 + jG (I)1 , G (R)2 + jG (I)2 ) , with ( G (R)1 , G (R)2 , G (I)1 , G (I)2 ) ∼ N d , (cid:60) ( c ) 0 −(cid:61) ( c ) (cid:60) ( c ) 1 (cid:61) ( c ) 00 (cid:61) ( c ) 1 (cid:60) ( c ) −(cid:61) ( c ) 0 (cid:60) ( c ) 1 , and consider the following change of variables: G (R)1 = R cos θ G (R)2 = R cos θ G (I)1 = R sin θ G (I)2 = R sin θ . Then, after some algebra, we have that E G ,G { p ( y | | G | ) · p ( y | | G | ) } = 1 π (1 − | c | ) (cid:90) + ∞ (cid:90) + ∞ (cid:90) π (cid:90) π r r · p ( y | r ) p ( y | r ) · exp (cid:18) − r + r − r r ( (cid:60) ( c ) cos( θ − θ ) − (cid:61) ( c ) sin( θ − θ ))1 − | c | (cid:19) d r d r d θ d θ . (147)36y writing ( (cid:60) ( c ) , (cid:61) ( c )) = ( | c | cos θ c , | c | sin θ c ) and by using the definition (145), we can further simplifythe RHS of (147) as − | c | (cid:90) + ∞ (cid:90) + ∞ r r · p ( y | r ) p ( y | r ) · exp (cid:18) − r + r − | c | (cid:19) · I (cid:18) r r | c | − | c | (cid:19) d r d r . (148)From (146) and (148), the claim easily follows. Lemma 7 (Computation of δ (cid:96) for Phase Retrieval) . Let δ (cid:96) ( σ ) be defined as in (13) and assume that thedistribution p ( · | | G | ) appearing in (10) is given by (9) . Then, lim σ → δ (cid:96) ( σ ) = 1 . (149) Proof.
For the special case of phase retrieval, it is possible to compute explicitly the function f ( m ) definedin (10) and simplified in Lemma 6. Indeed, (cid:90) + ∞ r · p PR ( y | r ) · exp (cid:0) − r (cid:1) d r (a) = (cid:90) + ∞ p PR ( y | √ z ) · exp ( − z ) d z (b) = (cid:90) + ∞−∞ σ √ π exp (cid:18) − ( y − z ) σ (cid:19) · exp ( − z ) · H ( z ) d z (c) = E Z { exp( − Z ) H ( Z ) } , (150)where in (a) we do the change of variables z = r ; in (b) we use the definition (9) and we define H ( x ) = 1 if x ≥ and H ( x ) = 0 otherwise; and in (c) we define Z ∼ N ( y, σ ) . In the limit σ → , we have that E Z { exp( − Z ) H ( Z ) } → exp( − y ) · H ( y ) , (151)by Lebesgue’s dominated convergence theorem. Similarly, (cid:90) + ∞ (cid:90) + ∞ r r · p PR ( y | r ) p PR ( y | r ) · exp (cid:18) − r + r − m (cid:19) · I (cid:18) r r √ m − m (cid:19) d r d r (a) = (cid:90) + ∞ (cid:90) + ∞ p PR ( y | √ z ) p PR ( y | √ z ) · exp (cid:18) − z + z − m (cid:19) · I (cid:18) √ z z √ m − m (cid:19) d z d z (b) = (cid:90) + ∞−∞ (cid:90) + ∞−∞ (cid:18) σ √ π (cid:19) exp (cid:18) − ( y − z ) + ( y − z ) σ (cid:19) · exp (cid:18) − z + z − m (cid:19) · I (cid:18) √ z z √ m − m (cid:19) · H ( z ) H ( z )d z d z (c) = E Z ,Z (cid:26) exp (cid:18) − Z + Z − m (cid:19) · I (cid:18) √ Z Z √ m − m (cid:19) · H ( Z ) H ( Z ) (cid:27) , where in (a) we do the change of variables z = r and z = r ; in (b) we use the definition (9) and wedefine H ( x ) = 1 if x ≥ and H ( x ) = 0 otherwise; and in (c) we define ( Z , Z ) ∼ i.i.d. N ( y, σ ) . In thelimit σ → , we have that E Z ,Z (cid:26) exp (cid:18) − Z + Z − m (cid:19) · I (cid:18) √ Z Z √ m − m (cid:19) · H ( Z ) H ( Z ) (cid:27) → exp (cid:18) − y − m (cid:19) · I (cid:18) y √ m − m (cid:19) · H ( y ) ,
37y Lebesgue’s dominated convergence theorem. As a result, by using (145), we obtain that f ( m ) σ → −→ − m (cid:90) + ∞ exp (cid:18) − y − m (cid:19) · I (cid:18) y √ m − m (cid:19) · exp( y ) d y = 1 π (1 − m ) (cid:90) π (cid:90) + ∞ exp (cid:18) − y (cid:18) m − √ m cos θ − m (cid:19)(cid:19) d y d θ = 1 π (cid:90) π
11 + m − √ m cos θ d θ = 11 − m . Consequently, F δ ( m ) σ → −→ (1 − δ ) log(1 − m ) , which implies the desired result. Proof of Corollary 1.
We follow the proof of Theorem 1 presented in Section 4. The first step is exactly thesame, i.e., by applying Lemma 1, we show that (68) holds. On the contrary, the second step requires somemodifications, since the definition of the error metric is different. In particular, we will prove that d E Y n , A n (cid:110) (cid:107) E { XX ∗ } − E { XX ∗ | Y n , A n }(cid:107) F (cid:111) = o n (1) . (152)Furthermore, we have that (cid:107) E { XX ∗ } − E { XX ∗ | Y n , A n }(cid:107) F + E (cid:110) (cid:107) XX ∗ − E { XX ∗ | Y n , A n }(cid:107) F (cid:111) (a) ≥ E (cid:110) (cid:107) E { XX ∗ } − XX ∗ (cid:107) F (cid:111) (b) = E (cid:110) (cid:107) I d − XX ∗ (cid:107) F (cid:111) (c) = E { trace ( I d − XX ∗ + XX ∗ XX ∗ ) } (d) = d − d + d = d − d, (153)where in (a) we use the triangle inequality, in (b) we use that E { XX ∗ } = I d by Lemma 12, in (c) weuse that, for any matrix A , (cid:107) A (cid:107) F = (cid:112) trace( AA ∗ ) , and in (d) we use that E { trace ( XX ∗ XX ∗ ) } = E (cid:110) (cid:107) X (cid:107) (cid:111) = d . By applying (152) and (153), the proof of Corollary 1 is complete.Let us now give the proof of (152). Similarly to (71), we have that I ( Y n +1 ; Y n , A n | A n +1 ) ≥ K · E Y n , A n +1 (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:90) C d p ( x | Y n , A n ) (cid:90) R p PR ( y n +1 | x , Y n , A n +1 ) ϕ PR ( y n +1 )d y n +1 d x − (cid:90) C d p ( x ) (cid:90) R p PR ( y n +1 | x , A n +1 ) ϕ PR ( y n +1 ) d y n +1 d x (cid:12)(cid:12)(cid:12)(cid:12) (cid:41) , (154)38here we define ϕ PR ( x ) = x for | x | ≤ M , and ϕ PR ( x ) = M · sign( x ) otherwise. Then, (cid:90) C d p ( x | Y n , A n ) (cid:90) R p PR ( y n +1 | x , Y n , A n +1 ) · ϕ PR ( y n +1 ) d y n +1 d x (a) = (cid:90) C d p ( x | Y n , A n ) (cid:90) R p PR ( y n +1 | x , Y n , A n +1 ) · y n +1 d y n +1 d x + E (b) = (cid:90) C d p ( x | Y n , A n ) · |(cid:104) A n +1 , x (cid:105)| d x + E (c) = (cid:104) A n +1 , (cid:18)(cid:90) C d p ( x | Y n , A n ) · xx ∗ d x (cid:19) A n +1 (cid:105) + E = (cid:104) A n +1 , E { XX ∗ | Y n , A n } A n +1 (cid:105) + E , (155)where in (a) we set E = (cid:90) C d p ( x | Y n , A n ) (cid:90) R p PR ( y n +1 | x , Y n , A n +1 ) · ( ϕ PR ( y n +1 ) − y n +1 ) d y n +1 d x , in (b) we use the definition (9), and in (c) we use that |(cid:104) A n +1 , x (cid:105)| = (cid:104) A n +1 , xx ∗ A n +1 (cid:105) . Similarly, wehave that (cid:90) C d p ( x ) (cid:90) R p PR ( y n +1 | x , A n +1 ) ϕ PR ( y n +1 ) d y n +1 d x = (cid:104) A n +1 , E { XX ∗ } A n +1 (cid:105) + E , (156)with E = (cid:90) C d p ( x ) (cid:90) R p PR ( y n +1 | x , A n +1 ) · ( ϕ PR ( y n +1 ) − y n +1 ) d y n +1 d x . By applying (155) and (156), we can rewrite the RHS of (154) as K · E Y n , A n E A n +1 (cid:110) |(cid:104) A n +1 , ( E { XX ∗ | Y n , A n } − E { XX ∗ } ) A n +1 (cid:105) + E − E | (cid:111) ≥ K · E Y n , A n (cid:16) E A n +1 (cid:110) |(cid:104) A n +1 , M A n +1 (cid:105)| (cid:111) − E A n +1 (cid:8) | E | (cid:9) − E A n +1 (cid:8) | E | (cid:9)(cid:17) , (157)where we define M = E { XX ∗ | Y n , A n } − E { XX ∗ } . As K goes large, E A n +1 (cid:8) | E i | (cid:9) tends to ,for i ∈ { , } . 
Furthermore, we have that E A n +1 (cid:110) |(cid:104) A n +1 , M A n +1 (cid:105)| (cid:111) (a) = d (cid:88) i,j,k,l =1 M ij M ∗ kl · d ( δ ij · δ kl + δ il · δ jk )= 1 d (cid:16) | trace( M ) | + (cid:107) M (cid:107) F (cid:17) (b) = 1 d (cid:107) M (cid:107) F , (158)where in (a) we use the following definition of the Kronecker delta: δ ab = (cid:26) , if a = b, , otherwise , (159)39nd in (b) we use that trace( M ) = d (cid:88) i =1 (cid:0) E (cid:8) | X i | | Y n , A n (cid:9) − E (cid:8) | X i | (cid:9)(cid:1) = E (cid:40) d (cid:88) i =1 | X i | | Y n , A n (cid:41) − E (cid:40) d (cid:88) i =1 | X i | (cid:41) = 0 . (160)As a result, we conclude that (152) holds. B Proof of Corollary 2
First, we evaluate the RHS of (20), as well as the scaling between δ u and σ when σ → . Then, we givethe proof of Corollary 2. Lemma 8 (Computation of δ u for Phase Retrieval) . Let δ u ( σ ) be defined as in (20) and assume that thedistribution p ( · | | g | ) is given by (9) . Then, δ u ( σ ) = 1 + σ + o ( σ ) . (161) Proof.
The proof boils down to computing expected values and integrals. By using (146) and (150), weimmediately obtain that E G { p PR ( y | | G | ) } = (cid:90) + ∞ r · p PR ( y | r ) · exp (cid:0) − r (cid:1) d r = exp( − y ) E X { exp( − σX ) H ( y + σX ) } , where X ∼ N (0 , . By computing explicitly the expectation, we deduce that E G { p PR ( y | | G | ) } = 12 exp (cid:18) − y + σ (cid:19) erfc (cid:18) √ (cid:16) − yσ + σ (cid:17)(cid:19) , (162)where erfc( · ) is the complimentary error function. Similarly, we have that E G (cid:8) p PR ( y | | G | )( | G | − (cid:9) = (cid:90) + ∞ r − r ) p PR ( y | r ) · exp (cid:0) − r (cid:1) d r = exp( − y ) E X { exp( − σX ) H ( y + σX )( y − σX ) } = σ √ π exp (cid:18) − y σ (cid:19) + 12 ( y − − σ ) exp (cid:18) − y + σ (cid:19) erfc (cid:18) √ (cid:16) − yσ + σ (cid:17)(cid:19) . (163)Thus, by using (162) and (163), after some manipulations, we obtain that δ u = (cid:90) R (cid:0) E G (cid:8) p PR ( y | | G | )( | G | − (cid:9)(cid:1) E G { p ( y PR | | G | ) } d y = (cid:90) R σ π exp (cid:18) y − σ − y σ (cid:19) (cid:16) √ (cid:0) − yσ + σ (cid:1)(cid:17) d y + (cid:90) R σ √ π exp (cid:18) − y σ (cid:19) ( y − − σ )d y + (cid:90) R
12 exp (cid:18) − y + σ (cid:19) ( y − − σ ) erfc (cid:18) √ (cid:16) − yσ + σ (cid:17)(cid:19) d y. (164)40y performing the change of variables t = − y/σ + σ , we simplify the first integral in the RHS of (164) as (cid:90) R σ π exp (cid:18) y − σ − y σ (cid:19) (cid:16) √ (cid:0) − yσ + σ (cid:1)(cid:17) d y = 2 σ π (cid:90) R exp (cid:0) − t (cid:1) erfc (cid:16) t √ (cid:17) exp (cid:18) σt − σ (cid:19) d t = o ( σ ) , (165)where in the last equality we use that the integral (cid:90) R exp (cid:0) − t (cid:1) erfc (cid:16) t √ (cid:17) d t is finite. The other two integrals in the RHS of (164) can be expressed in closed form as (cid:90) R σ √ π exp (cid:18) − y σ (cid:19) ( y − − σ )d y = − σ (1 + σ ) , (166) (cid:90) R
12 exp (cid:18) − y + σ (cid:19) ( y − − σ ) erfc (cid:18) √ (cid:16) − yσ + σ (cid:17)(cid:19) d y = σ (cid:18) − σ (cid:19) (cid:90) R exp ( σt ) ( σt + 1) erfc (cid:18) t √ (cid:19) d t = 1 + σ + σ . (167)By combining (164), (165), (166) and (167), the result follows. Proof of Corollary 2.
Pick σ sufficiently small. Let G ∼ CN (0 , , Y ∼ p PR ( · | | G | ) and Z = T ( Y ) ,where p PR is defined in (9) and T is a pre-processing function (possibly dependent on σ ) that we willchoose later on. Assume that(1) T ( y ) is upper and lower bounded by constants independent of σ ;(2) P ( Z = 0) ≤ c < and c is independent of σ ;(3) the condition (82) holds.Then, by Lemma 2, we have that, as n → ∞ , |(cid:104) ˆ x , x (cid:105)| (cid:107) ˆ x (cid:107) (cid:107) x (cid:107) a.s. −→ ρ = , if ψ (cid:48) δ ( λ ∗ δ ) ≤ ,ψ (cid:48) δ ( λ ∗ δ ) ψ (cid:48) δ ( λ ∗ δ ) − φ (cid:48) ( λ ∗ δ ) , if ψ (cid:48) δ ( λ ∗ δ ) > , (168)where λ ∗ δ is the unique solution of the equation ζ δ ( λ ) = φ ( λ ) , and φ , ψ δ and ζ δ are defined in (78), (79) and(81), respectively.Let τ be the supremum of the support of Z . Assume also that, for ¯ λ δ < λ ∗ δ ,(4) τ ≥ c > and c is independent of σ ;(5) φ (cid:48) ( λ ∗ δ ) is lower bounded by a constant independent of σ ;(6) min λ ∈ (min(¯ λ δ ,λ ∗ δ )) ψ (cid:48)(cid:48) δ ( λ ) is lower bounded by a strictly positive constant independent of σ .41et ¯ λ δ be the point at which ψ δ attains its minimum. Then, φ (¯ λ δ ) − ψ δ (¯ λ δ ) (a) = φ (¯ λ δ ) − φ ( λ ∗ δ ) + ζ δ ( λ ∗ δ ) − ψ δ (¯ λ δ ) (b) = φ (¯ λ δ ) − φ ( λ ∗ δ ) + ζ δ ( λ ∗ δ ) − ζ δ (¯ λ δ ) (c) = (cid:0) ζ (cid:48) δ ( x ) − φ (cid:48) ( x ) (cid:1) · ( λ ∗ δ − ¯ λ δ ) (d) = (cid:0) ζ (cid:48) δ ( x ) − φ (cid:48) ( x ) (cid:1) ψ (cid:48)(cid:48) δ ( x ) · (cid:0) ψ (cid:48) δ ( λ ∗ δ ) − ψ (cid:48) δ (¯ λ δ ) (cid:1) (e) = (cid:0) ζ (cid:48) δ ( x ) − φ (cid:48) ( x ) (cid:1) ψ (cid:48)(cid:48) δ ( x ) · ψ (cid:48) δ ( λ ∗ δ ) (f) ≤ c · ψ (cid:48) δ ( λ ∗ δ ) , (169)where in (a) we use that ζ δ ( λ ∗ δ ) = φ ( λ ∗ δ ) , in (b) we use that ζ δ (¯ λ δ ) = ψ δ (¯ λ δ ) , (c) holds for some x ∈ (¯ λ δ , λ ∗ δ ) by the mean value theorem, (d) holds for some x ∈ (¯ λ δ , λ ∗ δ ) by the mean value theorem, and in (e)we use that ψ (cid:48) δ (¯ λ δ ) = 0 . Note that (f) holds for some constant c independent of σ , as ζ (cid:48) δ ( x ) ≥ , ψ (cid:48)(cid:48) δ ( x ) is bounded, and φ (cid:48) ( x ) < since P ( Z = 0) < .As φ (cid:48) ( λ ∗ δ ) is bounded, from (168) and (169) we deduce that ρ ≥ c · (cid:0) φ (¯ λ δ ) − ψ δ (¯ λ δ ) (cid:1) , (170)for some constant c independent of σ . Notice that, if λ ∗ δ ≤ ¯ λ δ , then the right hand side is non-positive andhence the lower bound still holds.As τ > , we also have that ¯ λ δ > . Consider now the matrix D (cid:48) n = D n /α for some α > . Then, theprincipal eigenvector of D (cid:48) n is equal to the principal eigenvector of D n . Hence, we can assume without lossof generality that ¯ λ δ = 1 . This condition can be rewritten as E (cid:26) Z (1 − Z ) (cid:27) = 1 δ , (171)and (170) can be rewritten as ρ ≥ c · (cid:18) E (cid:26) Z ( | G | − − Z (cid:27) − δ (cid:19) . (172)We set T ( y ) = T ∗ δ ( y, σ ) (cid:44) y + − y + + (cid:112) δ c ( σ ) − , (173)where y + = max( y, and c ( σ ) is a function of σ to be set as to satisfy Eq. (171). By substituting (173)into (171), we get E (cid:26) Z (1 − Z ) (cid:27) = 1 δc ( σ ) E (cid:8) ( Y + − (cid:9) . (174)Hence, Eq. (171) is satisfied by c ( σ ) = E (cid:8) ( Y + − (cid:9) = E (cid:110)(cid:0) ( | G | + σW ) + − (cid:1) (cid:111) , (175)42here W ∼ N (0 , is independent of G . Therefore, c ( σ ) is always well defined and, by dominated conver-gence, c ( σ ) → c (0) = 1 , as σ → . Furthermore, E (cid:26) Z ( | G | − − Z (cid:27) = 1 (cid:112) δc ( σ ) E (cid:8) ( Y + − | G | − (cid:9) . 
(176)By applying again dominated convergence, we get lim σ → E (cid:26) Z ( | G | − − Z (cid:27) = 1 √ δ E (cid:8) ( | G | − (cid:9) = 1 √ δ . (177)Hence, by using (170), we get that, for δ > and σ ≤ σ ( δ ) , lim inf n →∞ |(cid:104) ˆ x σ , x (cid:105)| (cid:107) ˆ x σ (cid:107) (cid:107) x (cid:107) ≥ c (cid:16) √ δ − δ (cid:17) > , (178)where ˆ x σ denotes the spectral estimator corresponding to the pre-processing function (173). Let us nowverify that, by setting T = T ∗ δ , the requirements stated above are fulfilled. As δ > , the function T isbounded by constants independent of σ . It is also clear that the conditions (2) and (4) hold. Furthermore,the conditions (5) and (6) follow by showing that φ ( λ ) , ψ δ ( λ ) have well defined uniform limits as σ → that satisfy those conditions: this can be proved by one more application of dominated convergence.In order to show that the condition (3) holds, we follow the argument presented at the end of the proof ofLemma 2. First, we add a point mass with associated probability at most (cid:15) , which immediately implies that(82) is satisfied. Then, by applying the Davis-Kahan theorem [DK70], we show that we can take (cid:15) = 0 .This proves the claim of the corollary for the pre-processing function T ∗ δ ( y, σ ) , defined in (173). Let usnow prove that the same conclusion holds for T ∗ δ ( y ) defined in (25). Let f a ( x ) = x + 1 x + a . (179)Then, for any x, a ∈ R ≥ , | f (cid:48) a ( x ) | = | a − | ( x + a ) ≤ max (cid:18) , x (cid:19) . (180)Therefore, since T ∗ δ ( y, σ ) = 1 − f y + ( (cid:112) δc ( σ ) − , we have that sup y ∈ R (cid:12)(cid:12) T ∗ δ ( y, σ ) − T ∗ δ ( y ) (cid:12)(cid:12) ≤ (cid:0) min( (cid:112) δc ( σ ) − , √ δ − , (cid:1) √ δ · (cid:12)(cid:12)(cid:112) c ( σ ) − (cid:12)(cid:12) . (181)Denote by D n ( σ ) and D n the matrices constructed with the pre-processing functions T ∗ δ ( y, σ ) and T ∗ δ ( y ) ,respectively. It follows that, for any δ > , there exists a function ∆( σ ) with ∆( σ ) → as σ → such that (cid:107) D n ( σ ) − D n (cid:107) op ≤ ∆( σ ) . (182)Hence, by applying again the Davis-Kahan theorem, we conclude that, for all δ > and σ ≤ σ ( δ ) , lim inf n →∞ |(cid:104) ˆ x , x (cid:105)| (cid:107) ˆ x (cid:107) (cid:107) x (cid:107) ≥ c (cid:16) √ δ − δ (cid:17) > , (183)where ˆ x is the estimator corresponding to the pre-processing function T ∗ δ ( y ) .43 Auxiliary Lemmas
Lemma 9 (Distribution of Scalar Product of Two Unit Complex Vectors) . Let x , x ∼ i.i.d. Unif( S d − C ) and define M = |(cid:104) x , x (cid:105)| . Then, M ∼ Beta(1 , d − . (184) Proof.
Without loss of generality, we can pick x to be the first element of the canonical base of C d . Thus, M is equal to the squared modulus of the first component of x . Furthermore, we can think to x as beingchosen uniformly at random on the d -dimensional real sphere with radius . Note that, by taking a vector ofi.i.d. standard Gaussian random variables and dividing it by its norm, we obtain a vector uniformly randomon the sphere of radius . Hence, M = U + U (cid:80) di =1 U i , with { U i } ≤ i ≤ d ∼ i.i.d. N (0 , . Set A = U + U and B = (cid:80) di =3 U i . Then, A and B are independent, A follows a Gamma distributionwith shape and scale , i.e., A ∼ Γ(1 , , and B follows a Gamma distribution with shape d − and scale , i.e., B ∼ Γ( d − , . Thus, we conclude that M = AA + B ∼ Beta(1 , d − , which proves the claim. Lemma 10 (Distribution of Scalar Product of Two Unit Real Vectors) . Let x , x ∼ i.i.d. Unif( S d − R ) anddefine M = (cid:104) x , x (cid:105) . Then, the distribution of M is given by p ( m ) = Γ( d ) √ π Γ( d − ) (1 − m ) d − , m ∈ [ − , . (185) Proof.
Without loss of generality, we can pick x to be the first element of the canonical base of R d . Thus, M is equal to the first component of x . Note that, by taking a vector of i.i.d. standard Gaussian randomvariables and dividing it by its norm, we obtain a vector uniformly random on the sphere of radius . Hence, M = U (cid:80) di =1 U i , with { U i } ≤ i ≤ d ∼ i.i.d. N (0 , . Set A = U and B = (cid:80) di =2 U i . Then, A and B are independent, A follows a Gamma distribution withshape / and scale , i.e., A ∼ Γ(1 / , , and B follows a Gamma distribution with shape ( d − / andscale , i.e., B ∼ Γ(( d − / , . Thus, we obtain that M = AA + B ∼ Beta(1 / , ( d − / . A change of variable and the observation that the distribution of M is symmetric around immediately letus conclude that, for m ∈ [ − , , p ( m ) = c · (1 − m ) d − , (186)where the normalization constant c is given by c = (cid:18)(cid:90) − (1 − m ) d − d m (cid:19) − = Γ( d ) √ π Γ( d − ) . emma 11 (Laplace’s Method) . Let F : [0 , → R be such that • F is continuous; • F ( x ) < for x ∈ (0 , ; • F (0) = 0 .Then, lim n → + ∞ (cid:90) exp ( n · F ( x )) d x = 0 . (187) Proof.
Pick (cid:15) > and separate the integral into two parts: (cid:90) exp ( n · F ( x )) d x = (cid:90) (cid:15) exp ( n · F ( x )) d x + (cid:90) (cid:15) exp ( n · F ( x )) d x. Now, the first integral is at most (cid:15) since F ( x ) ≤ for any x ∈ [0 , , and the second integral tends to as n → + ∞ since F ( x ) < for x ∈ (0 , . Thus, the claim immediately follows. Lemma 12 (Second Moment of Uniform Vector on Complex Sphere) . Let x ∼ Unif( √ d S d − C ) . Then, E { XX ∗ } = I d . (188) Proof.
Let z ∼ CN ( d , I d ) and note that, by taking a vector of i.i.d. standard complex normal randomvariables and dividing it by its norm, we obtain a vector uniformly random on the complex sphere of radius . Then, x = √ d z / (cid:107) z (cid:107) .For i ∈ [ d ] , denote by x i and by z i the i -th component of x and z , respectively. Then, for i (cid:54) = j , E (cid:8) X i X ∗ j (cid:9) = d · E (cid:26) Z i Z ∗ j (cid:107) Z (cid:107) (cid:27) = 0 , where the last equality holds by symmetry. Furthermore, E (cid:8) | X i | (cid:9) = d · E (cid:26) | Z i | (cid:107) Z (cid:107) (cid:27) = 1 , as | Z i | / (cid:107) Z (cid:107) ∼ Beta(1 , d − by the argument of Lemma 9. As a result, the thesis is readily proved. D Proof of Lemma 3
Before presenting the proof of the lemma, let us introduce some basic definitions and well-known results.Let H be a probability measure on [0 , + ∞ ) . Denote by Γ H the support of H and by τ the supremum of Γ H .Let s H ( g ) denote the Stieltjes transform of H , which is defined as s H ( g ) = (cid:90) t − g d H ( t ) , (189)and let g H ( s ) denote its inverse.Consider a matrix S n = 1 d U M n U ∗ , (190)and assume that 451) M n is PSD for all n ∈ N ;(2) U ∈ C d × n is a random matrix whose entries { u i,j } ≤ i ≤ d, ≤ j ≤ n are i.i.d. such that E { U i,j } = 0 , E (cid:8) | U i,j | (cid:9) = 1 , and E (cid:8) | U i,j | (cid:9) < ∞ (this includes the cases in which the entries are ∼ i.i.d. CN (0 , or are ∼ i.i.d. N (0 , );(3) The sequence of empirical spectral distributions of M n ∈ C n × n converges weakly to a probabilitydistribution H , as n → + ∞ ;(4) n/d → δ ∈ (0 , + ∞ ) , as n → ∞ ;(5) The sequence of spectral norms of M n is bounded.Note that the normalization of (190) differs from the normalization of (86) by a factor of δ . However, sincethe form (190) is more common in the literature, we will stick to it for the rest of this section. In order toobtain the desired result for the matrix (86), it suffices to incorporate a factor /δ in the definition of thefunction ψ δ .Let F δ,H be the probability measure on [0 , + ∞ ) such that the inverse g F δ,H of its Stieltjes transform s F δ,H is given by g F δ,H ( s ) = − s + δ (cid:90) t ts d H ( t ) , s ∈ { z ∈ C : (cid:61) ( z ) > } . (191)Then, the sequence of empirical spectral distributions of S n converges weakly to F δ,H [MP67], [SB10,Chapter 4].For α (cid:54)∈ Γ H and α (cid:54) = 0 , let us also define ψ F δ,H ( α ) = g F δ,H (cid:18) − α (cid:19) . (192)The function ψ F δ,H links the support of F δ,H with the support of the generating measure H (see [SC95,Section 4] and [BY12, Lemma 3.1]). In particular, if λ (cid:54)∈ Γ F δ,H , then s F δ,H ( λ ) (cid:54) = 0 and α = − /s F δ,H ( λ ) satisfies(1) α (cid:54)∈ Γ H and α (cid:54) = 0 (so that ψ F δ,H ( α ) is well-defined);(2) ψ (cid:48) F δ,H ( α ) > .Conversely, if α satisfies (1) and (2), then λ = ψ F δ,H ( α ) (cid:54)∈ Γ F δ,H .Let λ M n denote the largest eigenvalue of M n and assume that, as n → ∞ , λ M n a.s. −→ α ∗ (cid:54)∈ Γ H . (193)Denote by λ S n the largest eigenvalue of S n . Then, the results in [BY12] prove that λ S n a.s. −→ λ ∗ = ψ F δ,H ( α ∗ ) , if ψ (cid:48) F δ,H ( α ∗ ) > ,λ S n a.s. −→ min α>τ ψ F δ,H ( α ) , if ψ (cid:48) F δ,H ( α ∗ ) ≤ . (194)Informally, the eigenvalue λ M n is mapped into the point ψ F δ,H ( α ∗ ) , where α ∗ = − /s F δ,H ( λ ∗ ) . This pointemerges from the support of F δ,H if and only if ψ (cid:48) F δ,H ( α ∗ ) > .In what follows, we relax the first hypothesis, i.e., we consider the case in which the matrix M n is notPSD. We will show that (194) still holds, which implies the claim of Lemma 3.46 roof of Lemma 3. As U is drawn from a rotationally invariant distribution, we can assume without loss ofgenerality that M n is diagonal. Then, we have that S n = ( U + , U − ) (cid:18) M + n k n − k − M − n (cid:19) (cid:18) U ∗ + U ∗− (cid:19) = 1 d U + M + n U ∗ + − d U − M − n U ∗− , (195)where M + n ∈ R k × k is the diagonal matrix containing the positive eigenvalues of M n , M − n ∈ R ( n − k ) × ( n − k ) is the diagonal matrix containing the negative eigenvalues of M n with the sign changed, U + contains thefirst k columns of U , and U − contains the remaining n − k columns of U .Note that U + and U − are independent. 
Furthermore, if H is a unitary matrix, then U − and HU − havethe same distribution. Hence, we can rewrite the matrix S n as S n = 1 d U M + n U ∗ − d HU M − n U ∗ H ∗ , (196)where U and U are independent with entries ∼ i.i.d. CN (0 , , and H is a random unitary matrix distributedaccording to the Haar measure.Recall that, by hypothesis, the sequence of empirical spectral distributions of M n converges weaklyto the probability distribution H , where H is the law of the random variable Z . Then, the sequence ofempirical spectral distributions of M + n converges weakly to the probability distribution H + , where H + isthe law of Z + = max( Z, . Let F δ,H + be the probability measure on [0 , + ∞ ) such that the inverse g F δ,H + of its Stieltjes transform s F δ,H + is given by g F δ,H + ( s ) = − s + δ (cid:90) t ts d H + ( t ) . (197)Define S + n = d U M + n U ∗ . Then, as M + n is PSD, the sequence of empirical spectral distributions of S + n converges weakly to F δ,H + [MP67], [SB10, Chapter 4].Similarly, the sequence of empirical spectral distributions of M − n converges weakly to the probabilitydistribution H − , where H − is the law of Z − = − min( Z, . Let F δ,H − be the probability measure on [0 , + ∞ ) such that the inverse g F δ,H − of its Stieltjes transform s F δ,H − is given by g F δ,H − ( s ) = − s + δ (cid:90) t ts d H − ( t ) . (198)Define S − n = d U M − n U ∗ . Then, as M − n is PSD, the sequence of empirical spectral distributions of S − n converges weakly to F δ,H − [MP67], [SB10, Chapter 4]. Furthermore, the sequence of empirical spectraldistributions of − S − n converges weakly to the probability measure F δ,H − inv such that g F δ,H − inv ( s ) = − g F δ,H − ( − s ) , (199)where g F δ,H − inv denotes the inverse of the Stieltjes transform s F δ,H − inv of F δ,H − inv .Define F δ,H = F δ,H + (cid:1) F δ,H − inv , (200)47here (cid:1) denotes the free additive convolution. Recall the decomposition (196). Then, the sequence ofempirical spectral distributions of S n converges weakly to F δ,H [Voi91, Spe93]. Consequently, the inverse g F δ,H of the Stieltjes transform s F δ,H of F δ,H can be computed as g F δ,H ( s ) (a) = g F δ,H + (cid:1) F δ,H − inv ( s ) (b) = g F δ,H + ( s ) + g F δ,H − inv ( s ) + 1 s (c) = − s + δ (cid:90) t ts d H + ( t ) − δ (cid:90) t − ts d H − ( t ) (d) = − s + δ (cid:90) t ts d H + ( t ) + δ (cid:90) t ts d H − ( − t ) (e) = − s + δ (cid:90) t ts d H ( t ) , (201)where in (a) we use (200), in (b) we use that the R -transform of the free convolution is the sum of the R -transforms of the addends, in (c) we use (197), (198), and (199), in (d) we perform the change of variable t → − t in the second integral; and in (e) we use the fact that H + ( t ) is the law of max( Z, , H − ( − t ) is thelaw of min( Z, , and that t/ (1 + ts ) = 0 for t = 0 .By hypothesis, λ M n a.s. −→ α ∗ (cid:54)∈ Γ H . First, we establish under what condition the largest eigenvalueof S + n , call it λ S + n , converges to a point outside the support of F δ,H + . To do so, define ψ F δ,H + ( α ) = g F δ,H + ( − /α ) . Then, λ S + n a.s. −→ ψ F δ,H + ( α ∗ ) , if ψ (cid:48) F δ,H + ( α ∗ ) > ; and λ S + n converges almost surely to apoint inside the support of F δ,H + , otherwise [BY12].For the moment, assume that ψ (cid:48) F δ,H + ( α ∗ ) > . We now establish under what condition the largesteigenvalue of S n , call it λ S n , converges to a point outside the support of F δ,H . 
To do so, let ω and ω denote the subordination functions corresponding to the free convolution F δ,H + (cid:1) F δ,H − inv . These functionssatisfy the following analytic subordination property: s F δ,H + (cid:1) F δ,H − inv ( z ) = s F δ,H + ( ω ( z )) = s F δ,H − inv ( ω ( z )) . (202)Then, by Theorem 2.1 of [BBCF15], we have that the spike ψ F δ,H + ( α ∗ ) is mapped into ω − ( ψ F δ,H + ( α ∗ )) .The Stieltjes transform at this point is given by s F δ,H + (cid:1) F δ,H − inv ( ω − ( ψ F δ,H + ( α ∗ ))) (a) = s F δ,H + ( ψ F δ,H + ( α ∗ )) (b) = s F δ,H + ( g F δ,H + ( − /α ∗ )) (c) = − /α ∗ , where in (a) we use (202), in (b) we use the definition of ψ F δ,H + , and in (c) we use that g F δ,H + is thefunctional inverse of the Stieltjes transform s F δ,H + . As a result, by [SC95, Section 4], we conclude that ω − ( ψ F δ,H + ( α ∗ )) (cid:54)∈ Γ F δ,H if and only if ψ (cid:48) F δ,H ( α ∗ ) > . Furthermore, the condition ψ (cid:48) F δ,H ( α ∗ ) > ismore restrictive than the condition ψ (cid:48) F δ,H + ( α ∗ ) > since ψ (cid:48) F δ,H + ( α ∗ ) = 1 − δ (cid:90) (cid:18) tα ∗ − t (cid:19) d H + ≥ − δ (cid:90) (cid:18) tα ∗ − t (cid:19) d H = ψ (cid:48) F δ,H ( α ∗ ) . λ S n converges to a point outside the support of F δ,H if and only if ψ (cid:48) F δ,H ( α ∗ ) > and the proof iscomplete. Remark 8 (Lemma 3 for the Real Case) . Consider the random matrix n U M n U T , where U ∈ R ( d − × n isa random matrix whose entries are ∼ i.i.d. N (0 , and M n ∈ R n × n . Then, the claim of Lemma 3 still holds.Let us briefly explain why this is the case.If M n is PSD, then the results of [BY12] allow us to conclude. If M n is not PSD, we can write anexpression analogous to (196) : d U M n U T = 1 d U M + n U T − d HU M − n U T H ∗ , (203) where M + n is the diagonal matrix containing the positive eigenvalues of M n , M − n is the diagonal matrixcontaining the negative eigenvalues of M n with the sign changed, U and U are independent with entries ∼ i.i.d. N (0 , , H is a random unitary matrix distributed according to the Haar measure, and we have usedthe fact that the eigenvalues of U M − n U T are the same as the eigenvalues of HU M − n U T H ∗ since H isunitary. Hence, the proof follows from the same argument of Lemma 3. E Proof of Lemma 4 and Theorem 5
We start by proving a result similar to Lemma 4 for a general AMP iteration, where the function f t (ˆ z ; y ) isgeneric. Lemma 13 (State Evolution for General AMP Iteration) . Let x ∈ R d denote the unknown signal such that (cid:107) x (cid:107) = √ d , A = ( a , . . . , a n ) T ∈ R n × d with { a i } ≤ i ≤ n ∼ i.i.d. N ( d , I d /d ) , and y = ( y , . . . , y n ) with y i ∼ p ( · | (cid:104) x , a i (cid:105) ) . Consider the AMP iterates z t , ˆ z t defined in (126) for some function f t (ˆ z ; y ) , with b t given by b t = δ · E { f (cid:48) t ( µ t G + τ t G ; Y ) } , (204) where the expectation is with respect to G , G ∼ i.i.d. N (0 , and Y ∼ p ( · | G ) . Assume that theinitialization z is independent of A and that, almost surely, lim n →∞ d (cid:104) x , z (cid:105) = µ , lim n →∞ d (cid:107) z (cid:107) = µ + τ . (205) Let the state evolution recursion τ t , µ t be defined as µ t +1 = δ (cid:90) R E (cid:8) ∂ g p ( y | X ) f t ( µ t X + τ t G ; y ) (cid:9) d y ,τ t +1 = δ · E (cid:110)(cid:0) f t ( µ t X + τ t G ; Y ) (cid:1) (cid:111) , (206) with initialization µ and τ , where the expectation is taken with respect to X , G ∼ i.i.d. N (0 , . Then, forany t , and for any function ψ : R → R such that | ψ ( u ) − ψ ( v ) | ≤ L (1 + (cid:107) u (cid:107) + (cid:107) v (cid:107) ) (cid:107) u − v (cid:107) for some L ∈ R , we have that, almost surely, lim n →∞ n n (cid:88) i =1 ψ ( x i , z ti ) = E { ψ ( X , µ t X + τ t G ) } . (207)49 roof. For g ∈ R , let H ( · ; g ) : [0 , → R ∪ { + ∞ , −∞} be the generalized inverse of F ( y | g ) ≡ (cid:90) y −∞ p ( y (cid:48) | g ) d y (cid:48) , namely, H ( w ; g ) ≡ inf (cid:8) y ∈ R : F ( y | g ) ≥ w (cid:9) . (208)With this definition, the model y i ∼ p ( · | (cid:104) a i , x (cid:105) ) is equivalent to y i = H ( w i ; (cid:104) a i , x (cid:105) ) for { w i } ≤ i ≤ n ∼ i.i.d. Unif([0 , independent of A and x . Let w = ( w , . . . , w n ) ∈ R n and denote by [ v | · · · | v k ] ∈ R m × k the matrix obtained by stacking column vectors v , . . . , v k ∈ R m .For t ≥ , define r t = d , ˆ r t = Ax , and introduce the extended state variables s t ∈ R d × and ˆ s t ∈ R n × , defined as s t = [ z t | r t ] , ˆ s t = [ ˆ z t | ˆ r t ] . (209)We further define the functions h t = [ h t, | h t, ] : R × R → R and ˆ h t = [ˆ h t, | ˆ h t, ] : R × R → R bysetting h t ( s , s ; x ) ≡ [ s | x ] , ˆ h t (ˆ s , ˆ s ; w ) ≡ [ f t (ˆ s ; H ( w ; ˆ s )) | . (210)With these notations, the iteration (126) is equivalent to s t +1 = A T ˆ h t (ˆ s t ; w ) − h t ( s t ; x )ˆ B t , ˆ s t = A h t ( s t ; x ) − ˆ h t − (ˆ s t − ; w ) B t − , (211)where the functions h t ( s t ; x ) and ˆ h t (ˆ s t ; w ) are understood to be applied component-wise to their argumentsand B t , ˆ B t ∈ R × are defined by (ˆ B t ) j,k = δ · E (cid:40) ∂ ˆ h t,k ∂ ˆ s j ( µ t X + τ t G, X ; W ) (cid:41) , ( B t ) j,k = δ · E (cid:26) ∂h t,k ∂s j ( µ t X + τ t G, X ) (cid:27) . (212)The iteration (211) satisfies the assumptions of [JM13][Proposition 5]. By applying that result, the claimfollows.At this point, first we present the proof of Lemma 4 and then of Theorem 5. Proof of Lemma 4.
At this point, we first present the proof of Lemma 4 and then that of Theorem 5.

Proof of Lemma 4. Consider the state evolution recursion defined in (124) with initialization $\mu_0$. Let $f_t$ be defined as in (127) with $F$ given by (123). Suppose that, for any $t$, (206) holds with $\mu_t = \tau_t^2$. Then, by Lemma 13, the claim immediately follows.

The remaining part of the proof is devoted to showing that (206) holds with $\mu_t = \tau_t^2$ for $t \ge 0$. First, we prove by induction that $\mu_t = \tau_t^2$ for $t \ge 0$. The base case of the induction, i.e., $\mu_0 = \tau_0^2$, is true by the hypothesis of the lemma. Now, we assume that $\mu_t = \tau_t^2$ and we show that $\mu_{t+1} = \tau_{t+1}^2$. Set
$$
Z = \mu_t X_0 + \tau_t G, \tag{213}
$$
and note that $Z \sim N(0, \mu_t^2 + \tau_t^2)$. Then, we can re-write $X_0$ as $X_0 = aZ + b\widetilde G$ for some $a, b \in \mathbb{R}$, where $\widetilde G \sim N(0,1)$ is independent of $Z$. In order to compute the coefficients $a$ and $b$, we evaluate $\mathbb{E}\{X_0^2\}$ and $\mathbb{E}\{X_0 \cdot Z\}$, thus obtaining the equations
$$
a^2(\mu_t^2 + \tau_t^2) + b^2 = 1, \qquad a(\mu_t^2 + \tau_t^2) = \mu_t,
$$
which can be solved as
$$
a = \frac{\mu_t}{\mu_t^2 + \tau_t^2}, \qquad b = \frac{\tau_t}{\sqrt{\mu_t^2 + \tau_t^2}}.
$$
Furthermore, by using the inductive hypothesis $\mu_t = \tau_t^2$ and the fact that $q_t = \mu_t/(1 + \mu_t)$, we obtain that
$$
X_0 = (1 - q_t)Z + \sqrt{1 - q_t}\,\widetilde G. \tag{214}
$$
Hence, the following chain of equalities holds:
$$
\begin{aligned}
\tau_{t+1}^2 &\overset{\rm (a)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\big\{p(y \mid X_0)\cdot\big(f_t(\mu_t X_0 + \tau_t G; y)\big)^2\big\}\, {\rm d}y \\
&\overset{\rm (b)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\big\{p\big(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G\big)\cdot\big(f_t(Z; y)\big)^2\big\}\, {\rm d}y \\
&\overset{\rm (c)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Big\{\big(f_t(Z; y)\big)^2 \cdot \mathbb{E}\big\{p\big(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G\big) \,\big|\, Z\big\}\Big\}\, {\rm d}y \\
&\overset{\rm (d)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Bigg\{\bigg(\frac{\mathbb{E}\{\partial_g p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}}{\mathbb{E}\{p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}}\bigg)^2 \cdot \mathbb{E}\big\{p\big(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G\big) \,\big|\, Z\big\}\Bigg\}\, {\rm d}y \\
&= \delta \int_{\mathbb{R}} \mathbb{E}\Bigg\{\frac{\mathbb{E}\{\partial_g p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}}{\mathbb{E}\{p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}} \cdot \mathbb{E}\big\{\partial_g p\big(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G\big) \,\big|\, Z\big\}\Bigg\}\, {\rm d}y \\
&\overset{\rm (e)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Big\{f_t(Z; y) \cdot \mathbb{E}\big\{\partial_g p\big(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G\big) \,\big|\, Z\big\}\Big\}\, {\rm d}y \\
&\overset{\rm (f)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\big\{f_t(\mu_t X_0 + \tau_t G; y) \cdot \partial_g p(y \mid X_0)\big\}\, {\rm d}y = \mu_{t+1},
\end{aligned} \tag{215}
$$
where in (a) we use that $Y \sim p(\cdot \mid X_0)$, in (b) we use (213) and (214), in (c) we condition with respect to $Z$, in (d) we use the definition (127) of $f_t$, in (e) we use again the definition (127) of $f_t$, and in (f) we use again (213) and (214).

Finally, we verify that $\mu_{t+1}$, as given by (206), coincides with the recursion (124). Indeed, the following chain of equalities holds:
$$
\begin{aligned}
\mu_{t+1} &\overset{\rm (a)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Bigg\{\frac{\big(\mathbb{E}\{\partial_g p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}\big)^2}{\mathbb{E}\{p(y \mid (1-q_t)Z + \sqrt{1-q_t}\,\widetilde G) \mid Z\}}\Bigg\}\, {\rm d}y \\
&\overset{\rm (b)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Bigg\{\frac{\big(\mathbb{E}\{\partial_g p(y \mid (1-q_t)\sqrt{\mu_t^2 + \tau_t^2}\, G_1 + \sqrt{1-q_t}\, G_2) \mid G_1\}\big)^2}{\mathbb{E}\{p(y \mid (1-q_t)\sqrt{\mu_t^2 + \tau_t^2}\, G_1 + \sqrt{1-q_t}\, G_2) \mid G_1\}}\Bigg\}\, {\rm d}y \\
&\overset{\rm (c)}{=} \delta \int_{\mathbb{R}} \mathbb{E}\Bigg\{\frac{\big(\mathbb{E}\{\partial_g p(y \mid \sqrt{q_t}\, G_1 + \sqrt{1-q_t}\, G_2) \mid G_1\}\big)^2}{\mathbb{E}\{p(y \mid \sqrt{q_t}\, G_1 + \sqrt{1-q_t}\, G_2) \mid G_1\}}\Bigg\}\, {\rm d}y = \delta \cdot h(q_t),
\end{aligned} \tag{216}
$$
where in (a) we use (215), in (b) we set $G_2 = \widetilde G$ and $G_1 = Z/\sqrt{\mu_t^2 + \tau_t^2}$, and in (c) we use that $\mu_t = \tau_t^2$ and that $q_t = \mu_t/(1 + \mu_t)$.
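The only probabilistic ingredient in the chain (215) is the conditional decomposition (213)-(214). A short Monte Carlo check, with an arbitrary illustrative value of $\mu_t$, confirms that $(1-q_t)Z + \sqrt{1-q_t}\,\widetilde G$ has unit variance and the same correlation with $Z$ as $X_0$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 0.8                      # illustrative value of mu_t
tau = np.sqrt(mu)             # inductive hypothesis: mu_t = tau_t^2
q = mu / (1.0 + mu)
m = 1_000_000

X0 = rng.normal(size=m)
G = rng.normal(size=m)
Z = mu * X0 + tau * G         # Eq. (213)

Gt = rng.normal(size=m)       # independent of Z
X_alt = (1.0 - q) * Z + np.sqrt(1.0 - q) * Gt   # Eq. (214)

# Both representations must agree in second moments jointly with Z.
print(np.var(X_alt), np.mean(X_alt * Z), np.mean(X0 * Z))
# -> approximately 1.0, mu, mu
```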
Proof of Theorem 5. In view of Lemma 4, it is sufficient to show that $(q, \mu) = (0, 0)$ is an attractive fixed point of the recursion (124).

First of all, let us check that $(q, \mu) = (0, 0)$ is a fixed point. This happens if and only if
$$
h(0) = \int_{\mathbb{R}} \frac{\big(\mathbb{E}_G\{\partial_g p(y \mid G)\}\big)^2}{\mathbb{E}_G\{p(y \mid G)\}}\, {\rm d}y = 0, \tag{217}
$$
which holds because of the condition (131).

Let us now prove that this fixed point is stable. We start by re-writing the function $h(q)$ defined in (125) as
$$
h(q) = \int_{\mathbb{R}} \mathbb{E}_{G_1}\Bigg\{\frac{\big(h_{\rm num}(\sqrt{q}, y)\big)^2}{h_{\rm den}(\sqrt{q}, y)}\Bigg\}\, {\rm d}y, \tag{218}
$$
where
$$
h_{\rm num}(x, y) = \mathbb{E}_{G_2}\Big\{\partial_g p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big)\Big\}, \qquad h_{\rm den}(x, y) = \mathbb{E}_{G_2}\Big\{p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big)\Big\}. \tag{219}
$$
Note that $h_{\rm num}(0, y) = 0$ by assumption (131). Then,
$$
h_{\rm num}(\sqrt{q}, y) = \sqrt{q}\,\frac{\partial h_{\rm num}(x, y)}{\partial x}\bigg|_{x=0} + \frac{q}{2}\,\frac{\partial^2 h_{\rm num}(x, y)}{\partial x^2}\bigg|_{x=x_1}, \qquad h_{\rm den}(\sqrt{q}, y) = h_{\rm den}(0, y) + \sqrt{q}\,\frac{\partial h_{\rm den}(x, y)}{\partial x}\bigg|_{x=x_2}, \tag{220}
$$
for some $x_1, x_2 \in [0, \sqrt{q}]$. Furthermore, by applying Stein's lemma, we have that
$$
h_{\rm num}(x, y) = \frac{1}{\sqrt{1-x^2}}\, \mathbb{E}_{G_2}\Big\{G_2 \cdot p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big)\Big\}. \tag{221}
$$
By using (221), we can re-write (220) as
$$
\begin{aligned}
h_{\rm num}(\sqrt{q}, y) &= \sqrt{q}\, G_1 \cdot \mathbb{E}_{G_2}\{G_2 \cdot \partial_g p(y \mid G_2)\} + \frac{q}{2}\,(1 - x_1^2)^{-5/2}\, \mathbb{E}_{G_2}\{f_{\rm num}(G_1, G_2, x_1)\}, \\
h_{\rm den}(\sqrt{q}, y) &= \mathbb{E}_{G_2}\{p(y \mid G_2)\} + \sqrt{q}\, \mathbb{E}_{G_2}\{f_{\rm den}(G_1, G_2, x_2)\},
\end{aligned}
$$
where
$$
\begin{aligned}
f_{\rm num}(G_1, G_2, x) = G_2\,\bigg( &(1 + 2x^2)\, p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big) \\
&- \big(2 G_1\, x(x^2 - 1) + G_2 \sqrt{1-x^2}\,(1 + 2x^2)\big)\, \partial_g p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big) \\
&+ (x^2 - 1)\big(G_1^2 (x^2 - 1) - G_2^2\, x^2 + 2 G_1 G_2\, x\sqrt{1-x^2}\big)\, \partial_g^2 p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big)\bigg), \\
f_{\rm den}(G_1, G_2, x) = \bigg(&G_1 - \frac{x}{\sqrt{1-x^2}}\, G_2\bigg) \cdot \partial_g p\big(y \mid x\cdot G_1 + \sqrt{1-x^2}\, G_2\big). 
\end{aligned} \tag{222}
$$
By applying Stein's lemma again and by using that the conditional density $p(y \mid g)$ is bounded, we note that $\mathbb{E}_{G_2}\{f_{\rm num}(G_1, G_2, x)\}$ and $\mathbb{E}_{G_2}\{f_{\rm den}(G_1, G_2, x)\}$ are bounded. Hence, by dominated convergence, we obtain that
$$
h(q) = q \cdot \int_{\mathbb{R}} \frac{\big(\mathbb{E}_G\{\partial_g^2 p(y \mid G)\}\big)^2}{\mathbb{E}_G\{p(y \mid G)\}}\, {\rm d}y + o(q).
$$
Therefore, in a neighborhood of the fixed point we have
$$
q_t = \mu_t + o(\mu_t), \qquad \mu_{t+1} = \delta \cdot q_t \cdot \int_{\mathbb{R}} \frac{\big(\mathbb{E}_G\{\partial_g^2 p(y \mid G)\}\big)^2}{\mathbb{E}_G\{p(y \mid G)\}}\, {\rm d}y + o(q_t). \tag{223}
$$
Furthermore, by applying Stein's lemma twice, we also have that
$$
\mathbb{E}_G\{\partial_g^2 p(y \mid G)\} = \mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}. \tag{224}
$$
By using (223), (224) and by recalling the definition (42) of $\delta_u$, we conclude that
$$
q_t = \mu_t + o(\mu_t), \qquad \mu_{t+1} = \frac{\delta}{\delta_u}\, q_t + o(q_t). \tag{225}
$$
As $\delta < \delta_u$, the fixed point is stable.
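The identity (224), obtained by applying Stein's lemma twice, can be verified numerically once a concrete channel is fixed. The sketch below assumes a Gaussian channel $p(y \mid g) = \phi(y - g)$, for which $\partial_g p(y \mid g) = (y - g)\,p(y \mid g)$ and $\partial_g^2 p(y \mid g) = ((y - g)^2 - 1)\,p(y \mid g)$; both the channel and the observation value $y$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 2_000_000
G = rng.normal(size=m)
y = 0.4  # fixed observation value

phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
lhs = np.mean(((y - G)**2 - 1.0) * phi(y - G))  # E{ d^2_g p(y|G) }
rhs = np.mean(phi(y - G) * (G**2 - 1.0))        # E{ p(y|G)(G^2 - 1) }
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```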
F Proof of Lemma 5 and Theorem 6

For the proofs in this section, it is convenient to introduce the function
$$
\mathcal{G}(x, y; \bar q) = \frac{\mathbb{E}_G\{\partial_g^2\, p(y \mid \bar q x + \sqrt{\bar q}\, G)\}}{\mathbb{E}_G\{p(y \mid \bar q x + \sqrt{\bar q}\, G)\}} - \Bigg(\frac{\mathbb{E}_G\{\partial_g\, p(y \mid \bar q x + \sqrt{\bar q}\, G)\}}{\mathbb{E}_G\{p(y \mid \bar q x + \sqrt{\bar q}\, G)\}}\Bigg)^2. \tag{226}
$$
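The function $\mathcal{G}$ is, up to the factor $\bar q$, the derivative in $x$ of the ratio $F(x, y; \bar q) = \mathbb{E}_G\{\partial_g p(y \mid \bar q x + \sqrt{\bar q}\, G)\}\,/\,\mathbb{E}_G\{p(y \mid \bar q x + \sqrt{\bar q}\, G)\}$; this is the content of (229) below. A finite-difference check of that relation, under the illustrative assumption of a Gaussian channel $p(y \mid g) = \phi(y - g)$ and with common random numbers across evaluations, looks as follows.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 2_000_000
G = rng.normal(size=m)          # common random numbers for all evaluations
phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

# Illustrative Gaussian channel: d/dg p = (y - g) p, d^2/dg^2 p = ((y - g)^2 - 1) p.
y, qbar = 0.3, 0.6

def F(x):
    u = qbar * x + np.sqrt(qbar) * G
    return np.mean((y - u) * phi(y - u)) / np.mean(phi(y - u))

def G_fun(x):  # the function in (226)
    u = qbar * x + np.sqrt(qbar) * G
    den = np.mean(phi(y - u))
    d1 = np.mean((y - u) * phi(y - u)) / den
    d2 = np.mean(((y - u)**2 - 1.0) * phi(y - u)) / den
    return d2 - d1**2

x0, h = 0.5, 1e-4
fd = (F(x0 + h) - F(x0 - h)) / (2 * h)   # finite-difference estimate of F'
print(fd, qbar * G_fun(x0))              # the two values agree
```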
First, we present the proof of Lemma 5 and then that of Theorem 6.

Proof of Lemma 5. The condition (131) implies that
$$
F(0, y; 1) = 0. \tag{227}
$$
Furthermore, we have that
$$
\mathbb{E}_Y\{\mathcal{G}(0, Y; 1)\} \overset{\rm (a)}{=} \mathbb{E}_Y\Bigg\{\frac{\mathbb{E}_G\{\partial_g^2 p(Y \mid G)\}}{\mathbb{E}_G\{p(Y \mid G)\}}\Bigg\} \overset{\rm (b)}{=} \int_{\mathbb{R}} \mathbb{E}_G\{\partial_g^2 p(y \mid G)\}\, {\rm d}y = \mathbb{E}_G\bigg\{\partial_g^2 \int_{\mathbb{R}} p(y \mid G)\, {\rm d}y\bigg\} = 0, \tag{228}
$$
where in (a) we use (227) and the definition (226) of $\mathcal{G}(0, y; 1)$, and in (b) we use the fact that $y$ has density $\mathbb{E}_G\{p(y \mid G)\}$.

Denote by $F'(x, y; \bar q)$ the derivative of $F$ with respect to its first argument. Then, we have
$$
F'(x, y; \bar q) = \bar q\, \mathcal{G}(x, y; \bar q). \tag{229}
$$
Hence,
$$
b_t = \delta \cdot (1 - q_t) \cdot \mathbb{E}\{\mathcal{G}(\mu_t G_1 + \sqrt{\mu_t}\, G_2, Y; 1 - q_t)\} = \delta \cdot \mathbb{E}\{\mathcal{G}(0, Y; 1)\} + o_{q_t}(1) = o_{q_t}(1). \tag{230}
$$
By using (229) and (230), we linearize the recursion (126) around the fixed point $z^t = 0_d$ and $\hat z^{t-1} = 0_n$ as
$$
z^{t+1} = A^T J \hat z^t + o_{q_t}(1)\big(\|z^t\|_2 + \|\hat z^t\|_2\big) + o(\|\hat z^t\|_2), \tag{231}
$$
$$
\hat z^t = A z^t - J \hat z^{t-1} + o_{q_t}(1)\|\hat z^{t-1}\|_2 + o(\|\hat z^{t-1}\|_2), \tag{232}
$$
where $J \in \mathbb{R}^{n\times n}$ is a diagonal matrix with entries $J_{ii} = F'(0, y_i; 1)$ for $i \in [n]$. By substituting the expression (232) for $\hat z^t$ into the RHS of (231), the result follows.
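The proof of Theorem 6 below rests on the block-determinant (Schur complement) formula stated in (234). Since the formula is exact, a short numerical check on random blocks (the sizes are arbitrary illustrative choices) is enough to sanity-check the subsequent manipulations.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 4
A11 = rng.normal(size=(d, d))
B = rng.normal(size=(d, n))
C = rng.normal(size=(n, d))
D = rng.normal(size=(n, n))  # generic, hence invertible almost surely

M = np.block([[A11, B], [C, D]])
lhs = np.linalg.det(M)
rhs = np.linalg.det(D) * np.linalg.det(A11 - B @ np.linalg.inv(D) @ C)
print(lhs, rhs)  # equal up to floating-point error
```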
Proof of Theorem 6. By definition, $\alpha$ is an eigenvalue of $L_n$ if and only if
$$
\det(L_n - \alpha I_{n+d}) = 0. \tag{233}
$$
Recall that, when $D$ is invertible,
$$
\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(D)\cdot\det(A - BD^{-1}C). \tag{234}
$$
Then, after some calculations, we obtain that (233) is equivalent to
$$
\alpha^d \cdot \det(-J - \alpha I_n) \cdot \det\big(I_d - A^T(I_n + \alpha J^{-1})^{-1}A\big) = 0. \tag{235}
$$
From (235), we immediately deduce that the eigenvalues of $L_n$ are real if and only if all the solutions to
$$
\det\big(I_d - A^T(I_n + \alpha J^{-1})^{-1}A\big) = 0 \tag{236}
$$
are real. We will prove that in fact this equation does not have any solution for $\alpha \in \mathbb{C}\setminus\mathbb{R}$.

Let $U\Sigma V^T$ be the SVD of $A$. Then, (236) is equivalent to
$$
\det\big(\Sigma^{-2} - U^T(I_n + \alpha J^{-1})^{-1}U\big) = 0.
$$
Since $\det(\Sigma^2) \ne 0$ and $\det(I_n + \alpha J^{-1}) \ne 0$ for $\alpha \in \mathbb{C}\setminus\mathbb{R}$, Eq. (236) is equivalent to
$$
\det\big((I_n + \alpha J^{-1}) - U\Sigma^2 U^T\big) = 0,
$$
or, equivalently,
$$
\det\big(I_n + \alpha J^{-1} - AA^T\big) = \det\big(I_n + \alpha J^{-1} - U\Sigma^2 U^T\big) = 0.
$$
Given that the solutions of this equation are generalized eigenvalues for the pair of symmetric matrices $AA^T - I_n$ and $J^{-1}$, they must be real. We conclude that the eigenvalues of $L_n$ are real.

Note that
$$
\mathcal{G}(0, y; 1) \overset{\rm (a)}{=} \frac{\mathbb{E}_G\{\partial_g^2 p(y \mid G)\}}{\mathbb{E}_G\{p(y \mid G)\}} \overset{\rm (b)}{=} \frac{\mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}}{\mathbb{E}_G\{p(y \mid G)\}} \overset{\rm (c)}{=} \frac{T^*(y)}{1 - T^*(y)}, \tag{237}
$$
where in (a) we use that $F(0, y; 1) = 0$ as (131) holds, in (b) we apply Stein's lemma twice, and in (c) we use the definition (45) of $T^*$. Then, (236) can be re-written as
$$
\det\Bigg(I_d - \sum_{i=1}^n \frac{T^*(y_i)}{T^*(y_i) + \alpha(1 - T^*(y_i))}\, a_i a_i^T\Bigg) = 0. \tag{238}
$$
Let $\lambda_1^{D_n^*(\alpha)}$ be the largest eigenvalue of the matrix $D_n^*(\alpha)$ defined as
$$
D_n^*(\alpha) = \sum_{i=1}^n \frac{T^*(y_i)}{T^*(y_i) + \alpha(1 - T^*(y_i))}\, a_i a_i^T. \tag{239}
$$
Note that, as $\alpha \to +\infty$, the entries of $D_n^*(\alpha)$ tend to $0$ with high probability. Since the eigenvalues of a matrix are continuous functions of its entries, we also obtain that $\lim_{\alpha\to+\infty} \lambda_1^{D_n^*(\alpha)} = 0$. Hence, if there exists $\bar\alpha > 0$ such that $\lambda_1^{D_n^*(\bar\alpha)} > 1$, then there exists also $\bar\alpha_1 > \bar\alpha > 0$ such that $\lambda_1^{D_n^*(\bar\alpha_1)} = 1$. Consequently, there exists $\alpha > 0$ that satisfies (238), which implies the result of the theorem.

The rest of the proof consists in showing that $\bar\alpha = \sqrt{\delta/\delta_u}$ satisfies the desired requirements. First of all, note that $\sqrt{\delta/\delta_u} > 1$, as $\delta > \delta_u$. Furthermore, we have that
$$
D_n^*(\bar\alpha) = \sum_{i=1}^n T_\delta^*(y_i)\, a_i a_i^T, \tag{240}
$$
where $T_\delta^*$ is defined in (44). Recall that, by hypothesis, $x$ is such that $\|x\|_2 = \sqrt{d}$ and $\{a_i\}_{1\le i\le n} \sim_{\rm i.i.d.} N(0_d, I_d/d)$. Let $\tilde x = x/\sqrt{d}$ and $\tilde a_i = \sqrt{d}\cdot a_i$. Then, $\langle x, a_i\rangle = \langle \tilde x, \tilde a_i\rangle$. Let $\lambda_1^{\widetilde D_n}$ be the largest eigenvalue of the matrix $\widetilde D_n$ defined as
$$
\widetilde D_n = \frac{1}{n}\sum_{i=1}^n T_\delta^*(y_i)\, \tilde a_i \tilde a_i^T. \tag{241}
$$
Since $\widetilde D_n = D_n^*(\bar\alpha)/\delta$, it remains to prove that
$$
\lambda_1^{\widetilde D_n} \xrightarrow{\rm a.s.} \tilde\lambda > \frac{1}{\delta}. \tag{242}
$$
To do so, we apply a result analogous to that of Lemma 2 for the real case, with $T = T_\delta^*$. For the moment, assume that $T_\delta^*$ fulfills the hypotheses of Lemma 2 (we will prove later that this is the case). Then, $\lambda_1^{\widetilde D_n}$ converges almost surely to $\zeta_\delta(\lambda_\delta^*)$.

Recall that $\zeta_\delta(\lambda) = \psi_\delta(\max(\lambda, \bar\lambda_\delta))$, where $\bar\lambda_\delta$ is the point of minimum of the convex function $\psi_\delta(\lambda)$ defined as
$$
\psi_\delta(\lambda) = \lambda\bigg(\frac{1}{\delta} + \mathbb{E}\bigg\{\frac{T_\delta^*(Y)}{\lambda - T_\delta^*(Y)}\bigg\}\bigg).
$$
Notice also that this minimum is the unique local minimizer, since $\psi_\delta$ is convex and analytic. Furthermore, $\lambda_\delta^*$ is the unique solution to the equation $\zeta_\delta(\lambda_\delta^*) = \phi(\lambda_\delta^*)$, where $\phi(\lambda)$ is defined as
$$
\phi(\lambda) = \lambda\cdot\mathbb{E}\bigg\{\frac{T_\delta^*(Y)\cdot G^2}{\lambda - T_\delta^*(Y)}\bigg\}.
$$
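The objects just introduced, namely $\psi_\delta$, its minimizer $\bar\lambda_\delta$, the function $\zeta_\delta$, and the crossing point $\lambda_\delta^*$, can all be approximated by replacing expectations with empirical averages. The sketch below does this for a toy joint law of $(G, Y)$ and a bounded surrogate for $T_\delta^*(Y)$; all three choices are illustrative assumptions, not the model of the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
delta = 1.5
m = 100_000

# Toy joint law for (G, Y) and a bounded surrogate for T*_delta(Y).
G = rng.normal(size=m)
Y = G + 0.5 * rng.normal(size=m)
T = 0.8 * np.tanh(Y**2)

# Evaluate psi_delta and phi on a grid to the right of sup supp(T).
lam = np.linspace(T.max() + 1e-3, T.max() + 5.0, 1000)
psi = lam * (1.0 / delta + np.array([np.mean(T / (l - T)) for l in lam]))
phi = lam * np.array([np.mean(T * G**2 / (l - T)) for l in lam])

i_bar = np.argmin(psi)                               # minimizer: lambda_bar_delta
zeta = np.where(lam <= lam[i_bar], psi[i_bar], psi)  # zeta = psi(max(lam, lam_bar))
i_star = np.argmin(np.abs(phi - zeta))               # crossing point: lambda*_delta
print(lam[i_bar], lam[i_star], zeta[i_star])
```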
By setting the derivative of $\psi_\delta(\lambda)$ to $0$, we have that
$$
\mathbb{E}\bigg\{\frac{(T_\delta^*(Y))^2}{(\bar\lambda_\delta - T_\delta^*(Y))^2}\bigg\} = \frac{1}{\delta}.
$$
By using the definition (44) of $T_\delta^*$ and the definition (45) of $T^*$, we verify that
$$
\frac{T_\delta^*(Y)}{1 - T_\delta^*(Y)} = \sqrt{\frac{\delta_u}{\delta}}\; \frac{\mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}}{\mathbb{E}_G\{p(y \mid G)\}}. \tag{243}
$$
Hence, by using the definition (42) of $\delta_u$, we obtain that
$$
\mathbb{E}\bigg\{\frac{(T_\delta^*(Y))^2}{(1 - T_\delta^*(Y))^2}\bigg\} = \frac{\delta_u}{\delta}\int_{\mathbb{R}} \frac{\big(\mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}\big)^2}{\mathbb{E}_G\{p(y \mid G)\}}\, {\rm d}y = \frac{1}{\delta},
$$
which immediately implies that
$$
\bar\lambda_\delta = 1. \tag{244}
$$
By using (243), one also obtains that
$$
\mathbb{E}\bigg\{\frac{T_\delta^*(Y)}{1 - T_\delta^*(Y)}\bigg\} = \sqrt{\frac{\delta_u}{\delta}}\int_{\mathbb{R}} \mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}\, {\rm d}y = \sqrt{\frac{\delta_u}{\delta}}\; \mathbb{E}_G\{G^2 - 1\} = 0,
$$
which implies that
$$
\psi_\delta(1) = \frac{1}{\delta}. \tag{245}
$$
Furthermore, we have that
$$
\mathbb{E}\bigg\{\frac{T_\delta^*(Y)(G^2 - 1)}{1 - T_\delta^*(Y)}\bigg\} = \sqrt{\frac{\delta_u}{\delta}}\int_{\mathbb{R}} \frac{\big(\mathbb{E}_G\{p(y \mid G)(G^2 - 1)\}\big)^2}{\mathbb{E}_G\{p(y \mid G)\}}\, {\rm d}y = \frac{1}{\sqrt{\delta\cdot\delta_u}} > \frac{1}{\delta},
$$
hence
$$
\phi(1) = \frac{1}{\sqrt{\delta\cdot\delta_u}} > \frac{1}{\delta}, \tag{246}
$$
as $\delta > \delta_u$. By putting (244), (245), and (246) together, we obtain that
$$
\phi(\bar\lambda_\delta) > \zeta_\delta(\bar\lambda_\delta). \tag{247}
$$
Recall that $\zeta_\delta(\lambda)$ is monotone non-decreasing and $\phi(\lambda)$ is monotone non-increasing. Consequently, (247) implies that $\lambda_\delta^* > \bar\lambda_\delta$. Thus, we conclude that
$$
\lim_{n\to\infty} \lambda_1^{\widetilde D_n} = \zeta_\delta(\lambda_\delta^*) = \psi_\delta(\lambda_\delta^*) > \psi_\delta(\bar\lambda_\delta) = \psi_\delta(1) = \frac{1}{\delta}. \tag{248}
$$
Now, we show that $T_\delta^*$ fulfills the hypotheses of Lemma 2 by using arguments similar to those at the end of the proof of Theorem 2. First of all, since $T^*(y) \le 1$, we have that $T_\delta^*(y)$ is bounded. Furthermore, if $T_\delta^*(y)$ is equal to the constant value $0$, then $\delta_u = \infty$ and the claim of Theorem 6 trivially holds. Hence, we can assume that $\mathbb{P}(T_\delta^*(Y) = 0) < 1$. Let $\tau$ be the supremum of the support of $T_\delta^*(Y)$. If $\mathbb{P}(T_\delta^*(Y) = \tau) > 0$, then the condition (82) is satisfied and the proof is complete. Otherwise, for any $\epsilon > 0$, there exists $\Delta(\epsilon)$ such that Eq. (115) holds. Define $T_\delta^*(y, \epsilon)$ as in (116). Clearly, the random variable $T_\delta^*(Y, \epsilon)$ has a point mass at the supremum of its support, hence the condition (82) is satisfied. As a final step, we show that we can take $\epsilon \downarrow 0$. Define
$$
\widetilde D_n(\epsilon) = \frac{1}{n}\sum_{i=1}^n T_\delta^*(y_i, \epsilon)\, \tilde a_i \tilde a_i^*.
$$
Then,
$$
\big\|\widetilde D_n(\epsilon) - \widetilde D_n\big\|_{\rm op} \le C\cdot\Delta(\epsilon), \tag{249}
$$
where the constant $C$ depends only on $n/d$. Consequently, by using (249) and Weyl's inequality, we conclude that
$$
\big|\lambda_1^{\widetilde D_n(\epsilon)} - \lambda_1^{\widetilde D_n}\big| \le C\cdot\Delta(\epsilon). \tag{250}
$$
Hence, for any $n$, as $\epsilon$ tends to $0$, the largest eigenvalue of $\widetilde D_n(\epsilon)$ tends to the largest eigenvalue of $\widetilde D_n$, which concludes the proof.

References
[AGMM15] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra, Simple, efficient, and neural algorithms for sparse coding, Conference on Learning Theory (COLT) (Paris, France), July 2015, pp. 113–149.

[BBCF15] Serban T. Belinschi, Hari Bercovici, Mireille Capitaine, and Maxime Février, Outliers in the spectrum of large deformed unitarily invariant models, arXiv:1412.4916, 2015.

[BCE06] Radu Balan, Pete Casazza, and Dan Edidin, On signal reconstruction without phase, Applied and Computational Harmonic Analysis (2006), no. 3, 345–356.

[BCMN14] Afonso S. Bandeira, Jameson Cahill, Dustin G. Mixon, and Aaron A. Nelson, Saving phase: Injectivity and stability for phase retrieval, Applied and Computational Harmonic Analysis (2014), no. 1, 106–125.

[BGN11] Florent Benaych-Georges and Raj R. Nadakuditi, The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Advances in Mathematics (2011), no. 1, 494–521.

[BKM+17] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová,
Phase transitions, optimal errors and optimality of message-passing in generalized linear models, arXiv:1708.03395, 2017.

[BLM15] Mohsen Bayati, Marc Lelarge, and Andrea Montanari, Universality in polytope phase transitions and message passing algorithms, Annals of Applied Probability (2015), no. 2, 753–822.

[BM11] Mohsen Bayati and Andrea Montanari, The dynamics of message passing on dense graphs, with applications to compressed sensing, IEEE Trans. Inform. Theory (2011), 764–785.

[BM12] Mohsen Bayati and Andrea Montanari, The LASSO risk for Gaussian matrices, IEEE Trans. Inform. Theory (2012), no. 4, 1997–2017.

[BMDK17] Jean Barbier, Nicolas Macris, Mohamad Dia, and Florent Krzakala, Mutual information and optimality of approximate message-passing in random linear estimation, arXiv:1701.05823, 2017.

[BR17] Sohail Bahmani and Justin Romberg, Phase retrieval meets statistical learning theory: A flexible convex relaxation, Proc. of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) (Fort Lauderdale, FL), June 2017, pp. 252–260.

[BY12] Zhidong Bai and Jianfeng Yao,
On sample eigenvalues in a generalized spiked population model, Journal of Multivariate Analysis (2012), 167–177.

[CC15] Yuxin Chen and Emmanuel J. Candès, Solving random quadratic systems of equations is nearly as easy as solving linear systems, Advances in Neural Information Processing Systems, 2015, pp. 739–747.

[CC16] Yuxin Chen and Emmanuel J. Candès, The projected power method: An efficient algorithm for joint alignment from pairwise differences, arXiv:1609.05820, 2016.

[CC17] Yuxin Chen and Emmanuel J. Candès, Solving random quadratic systems of equations is nearly as easy as solving linear systems, Communications on Pure and Applied Mathematics (2017), 822–883.

[CEHV15] Aldo Conca, Dan Edidin, Milena Hering, and Cynthia Vinzant, An algebraic characterization of injectivity in phase retrieval, Applied and Computational Harmonic Analysis (2015), no. 2, 346–356.

[CESV15] Emmanuel J. Candès, Yonina C. Eldar, Thomas Strohmer, and Vladislav Voroninski, Phase retrieval via matrix completion, SIAM Review (2015), no. 2, 225–251.

[CLM16] T. Tony Cai, Xiaodong Li, and Zongming Ma, Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow, The Annals of Statistics (2016), no. 5, 2221–2251.

[CLS15a] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi, Phase retrieval from coded diffraction patterns, Applied and Computational Harmonic Analysis (2015), no. 2, 277–299.

[CLS15b] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi, Phase retrieval via Wirtinger flow: Theory and algorithms, IEEE Trans. Inform. Theory (2015), no. 4, 1985–2007.

[Cor06] John V. Corbett, The Pauli problem, state reconstruction and quantum-real numbers, Reports on Mathematical Physics (2006), no. 1, 53–68.

[CSV13] Emmanuel J. Candès, Thomas Strohmer, and Vladislav Voroninski, PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming, Communications on Pure and Applied Mathematics (2013), no. 8, 1241–1274.

[DJ17] Laurent Demanet and Vincent Jugnon, Convex recovery from interferometric measurements, IEEE Trans. Computational Imaging (2017), no. 2, 282–295.

[DJM13] David L. Donoho, Adel Javanmard, and Andrea Montanari, Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing, IEEE Trans. Inform. Theory (2013), no. 11, 7434–7464.

[DK70] Chandler Davis and William M. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis (1970), no. 1, 1–46.

[DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications, Physical Review E (2011), no. 6, 066106.

[DL17] Oussama Dhifallah and Yue M. Lu, Fundamental limits of PhaseMax for phase retrieval: A replica analysis, arXiv:1708.03355, 2017.

[DM13] Yash Deshpande and Andrea Montanari, Finding hidden cliques of size $\sqrt{N/e}$ in nearly linear time, Foundations of Computational Mathematics (2013), 1–60.

[DM16] David L. Donoho and Andrea Montanari, High dimensional robust M-estimation: Asymptotic variance via approximate message passing, Probability Theory and Related Fields (2016), no. 3–4, 935–969.

[DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari, Message passing algorithms for compressed sensing, Proceedings of the National Academy of Sciences (2009), 18914–18919.

[DMM11] David L. Donoho, Arian Maleki, and Andrea Montanari, The noise-sensitivity phase transition in compressed sensing, IEEE Trans. Inform. Theory (2011), no. 10, 6920–6941.

[DR17] John C. Duchi and Feng Ruan, Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, arXiv:1705.02356, 2017.

[DTL17] Oussama Dhifallah, Christos Thrampoulidis, and Yue M. Lu, Phase retrieval via linear programming: Fundamental limits and algorithmic improvements, 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.

[FD87] J. R. Fienup and J. C. Dainty, Phase retrieval and image reconstruction for astronomy, Image Recovery: Theory and Application (1987), 231–275.

[Fie82] J. R. Fienup, Phase retrieval algorithms: A comparison, Applied Optics (1982), no. 15, 2758–2769.

[Ger72] Ralph W. Gerchberg, A practical algorithm for the determination of the phase from image and diffraction plane pictures, Optik (1972), 237–246.

[GS16] Tom Goldstein and Christoph Studer, PhaseMax: Convex phase retrieval via basis pursuit, arXiv:1610.07531, 2016.

[Har93] Robert W. Harrison, Phase problem in crystallography, J. Optical Soc. America A (1993), no. 5, 1046–1055.

[HJ12] Roger A. Horn and Charles R. Johnson, Matrix analysis, Cambridge University Press, 2012.

[JM13] Adel Javanmard and Andrea Montanari, State evolution for general approximate message passing algorithms, with applications to spatial coupling, Information and Inference (2013), 115–144.

[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi, Low-rank matrix completion using alternating minimization, Proc. of the 45th Ann. ACM Symp. on Theory of Computing (STOC) (Palo Alto, CA), ACM, June 2013, pp. 665–674.

[Kar13] Noureddine El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results, arXiv:1311.2445, 2013.

[KKM+16] Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, and Lenka Zdeborová, Phase transitions and sample complexity in Bayes-optimal matrix factorization, IEEE Trans. Inform. Theory (2016), no. 7, 4228–4265.

[KMO10] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh, Matrix completion from a few entries, IEEE Trans. Inform. Theory (2010), no. 6, 2980–2998.

[KMZ13] Florent Krzakala, Marc Mézard, and Lenka Zdeborová, Phase diagram and approximate message passing for blind calibration and dictionary learning, Proc. of the IEEE Int. Symposium on Inform. Theory (ISIT) (Istanbul, Turkey), July 2013, pp. 659–663.

[LGL15] Gen Li, Yuantao Gu, and Yue M. Lu, Phase retrieval using iterative projections: Dynamics in the large systems limit, Proc. of the 53rd Annual Allerton Conf. on Commun., Control, and Computing (Allerton) (Monticello, IL), Oct. 2015, pp. 1114–1118.

[LL17] Yue M. Lu and Gen Li, Phase transitions of spectral initialization for high-dimensional non-convex estimation, arXiv:1702.06435, 2017.

[LLJB17] Kiryung Lee, Yanjun Li, Marius Junge, and Yoram Bresler, Blind recovery of sparse signals from subsampled convolution, IEEE Trans. Inform. Theory (2017), no. 2, 802–821.

[LLSW16] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei, Rapid, robust, and reliable blind deconvolution via nonconvex optimization, arXiv:1606.04933, 2016.

[Mil90] Rick P. Millane, Phase retrieval in crystallography and optics, J. Optical Soc. America A (1990), no. 3, 394–411.

[MISE08] Jianwei Miao, Tetsuya Ishikawa, Qun Shen, and Thomas Earnest, Extending X-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes, Annu. Rev. Phys. Chem. (2008), 387–410.

[MNS14] Elchanan Mossel, Joe Neeman, and Allan Sly, Belief propagation, robust reconstruction and optimal recovery of block models, Conference on Learning Theory (COLT) (Barcelona, Spain), June 2014, pp. 356–370.

[MP67] Vladimir A. Marčenko and Leonid A. Pastur, Distribution of eigenvalues in certain sets of random matrices, Mat. Sb. (N.S.) (1967), 457–483 (in Russian).

[MR16] Andrea Montanari and Emile Richard, Non-negative principal component analysis: Message passing algorithms and sharp asymptotics, IEEE Trans. Inform. Theory (2016), no. 3, 1458–1484.

[MV17] Andrea Montanari and Ramji Venkataramanan, Estimation of low-rank matrices via approximate message passing, arXiv:1711.01682, 2017.

[MX16] Elchanan Mossel and Jiaming Xu, Density evolution in the degree-correlated stochastic block model, Conference on Learning Theory (COLT) (New York, NY), June 2016, pp. 1319–1356.

[NJS13] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi, Phase retrieval using alternating minimization, Advances in Neural Information Processing Systems, 2013, pp. 2796–2804.

[NWL16] Matey Neykov, Zhaoran Wang, and Han Liu, Agnostic estimation for misspecified phase retrieval models, Advances in Neural Information Processing Systems, 2016, pp. 4089–4097.

[OTH13] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi, The squared-error of generalized LASSO: A precise analysis, Proc. of the 51st Annual Allerton Conf. on Commun., Control, and Computing (Allerton) (Monticello, IL), Oct. 2013, pp. 1002–1009.

[PV16] Yaniv Plan and Roman Vershynin, The generalized lasso with non-linear observations, IEEE Trans. Inform. Theory (2016), no. 3, 1528–1537.

[Ran11] Sundeep Rangan, Generalized approximate message passing for estimation with random linear mixing, Proc. of the IEEE Int. Symposium on Inform. Theory (ISIT) (St. Petersburg), Aug. 2011, pp. 2168–2172.

[RG01] Sundeep Rangan and Vivek K. Goyal, Recursive consistent estimation with bounded noise, IEEE Trans. Inform. Theory (2001), no. 1, 457–464.

[RP16] Galen Reeves and Henry D. Pfister, The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact, Proc. of the IEEE Int. Symposium on Inform. Theory (ISIT) (Barcelona, Spain), July 2016, pp. 665–669.

[SB10] Jack W. Silverstein and Zhidong Bai, Spectral Analysis of Large Dimensional Random Matrices (2nd edition), Springer, 2010.

[SC95] Jack W. Silverstein and Sang-Il Choi, Analysis of the limiting spectral distribution of large-dimensional random matrices, Journal of Multivariate Analysis (1995), no. 2, 295–309.

[SC16] Weijie Su and Emmanuel J. Candès, SLOPE is adaptive to unknown sparsity and asymptotically minimax, Annals of Statistics (2016), no. 3, 1038–1068.

[SEC+15] Yoav Shechtman, Yonina C. Eldar, Oren Cohen, Henry N. Chapman, Jianwei Miao, and Mordechai Segev, Phase retrieval with application to optical imaging: A contemporary overview, IEEE Signal Processing Magazine (2015), no. 3, 87–109.

[Sol17] Mahdi Soltanolkotabi, Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization, arXiv:1702.06175, 2017.

[Spe93] Roland Speicher, Free convolution and the random sum of matrices, Publ. Res. Inst. Math. Sci. (1993), 731–744.

[SR15] Philip Schniter and Sundeep Rangan, Compressive phase retrieval via generalized approximate message passing, IEEE Trans. Signal Processing (2015), no. 4, 1043–1055.

[TAH15] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Lasso with non-linear measurements is equivalent to one with linear measurements, Advances in Neural Information Processing Systems, 2015, pp. 3420–3428.

[UE88] Michael Unser and Murray Eden, Maximum likelihood estimation of linear signal parameters for Poisson processes, IEEE Trans. Acoust., Speech, and Signal Process. (1988), no. 6, 942–945.

[VJ17] Ramji Venkataramanan and Oliver Johnson, Strong converse bounds for high-dimensional estimation, arXiv:1706.04410, 2017.

[Voi91] Dan Voiculescu, Limit laws for random matrices and free products, Inventiones Mathematicae (1991), 201–220.

[Wal63] Adriaan Walther, The question of phase retrieval in optics, Journal of Modern Optics (1963), no. 1, 41–49.

[WdM15] Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat, Phase recovery, MaxCut and complex semidefinite programming, Mathematical Programming (2015), no. 1–2, 47–81.

[Wei15] Ke Wei,
Solving systems of phaseless equations via Kaczmarz methods: A proof of concept study, Inverse Problems (2015), no. 12.

[WG16] Gang Wang and Georgios B. Giannakis, Solving random systems of quadratic equations via truncated generalized gradient flow, Advances in Neural Information Processing Systems, 2016, pp. 568–576.

[WGE16] Gang Wang, Georgios B. Giannakis, and Yonina C. Eldar, Solving systems of random quadratic equations via truncated amplitude flow, arXiv:1605.08285, 2016.