Phase retrieval using alternating minimization in a batch setting
Teng Zhang
Department of Mathematics, University of Central Florida
[email protected]
September 17, 2018
Abstract
This paper considers the problem of phase retrieval, where the goal is to recover a signal $z \in \mathbb{C}^n$ from the observations $y_i = |a_i^* z|$, $i = 1, 2, \cdots, m$. While many algorithms have been proposed, the alternating minimization algorithm is still one of the most commonly used and simplest methods. Existing works have proved that when the observation vectors $\{a_i\}_{i=1}^m$ are sampled from a complex normal distribution $\mathcal{CN}(0, I)$, the alternating minimization algorithm recovers the underlying signal with a good initialization when $m = O(n)$, or with random initialization when $m = O(n^2)$, and it is conjectured that random initialization succeeds with $m = O(n)$ [26]. This work proposes a modified alternating minimization method in a batch setting and proves that when $m = O(n \log n)$, the proposed algorithm with random initialization recovers the underlying signal with high probability. The proof is based on the observation that after each iteration of alternating minimization, with high probability, the correlation between the direction of the estimated signal and the direction of the underlying signal increases.

1 Introduction

This article concerns the phase retrieval problem as follows: let $z \in \mathbb{C}^n$ be an unknown vector; given $m$ known sensing vectors $\{a_i\}_{i=1}^m \subseteq \mathbb{C}^n$ and the observations $y_i = |a_i^* z|$, $i = 1, 2, \cdots, m$, can we reconstruct $z$ from the observations $\{y_i\}_{i=1}^m$? This problem is motivated by applications in imaging science, and we refer interested readers to [21] for more detailed discussions of the background in engineering. In addition, this problem has applications in other areas of science and engineering as well, as discussed in [6].

Because of the practical ubiquity of the phase retrieval problem, many algorithms and theoretical analyses have been developed for this problem. For example, an interesting recent approach is based on convex relaxation [8, 7, 27], which replaces the non-convex measurements by convex measurements through relaxation. Since the associated optimization problem is convex, it has desirable properties such as convergence to the global minimizer, and it has been shown that under some assumptions on the sensing vectors, this method recovers the correct $z$ [5, 15]. However, since these algorithms involve semidefinite programming for $n \times n$ positive semidefinite matrices, the computational cost could be prohibitive when $n$ is large. Recently, several works [1, 14, 16, 17] proposed and analyzed an alternative convex method that uses linear programming instead of semidefinite programming, which is more computationally efficient, but the program itself requires an "anchor vector", which needs to be a good approximate estimation of $z$.

Another line of works is based on Wirtinger flows, i.e., gradient flow in the complex setting [6, 9, 29, 30, 4, 28, 22]. Some theoretical justifications are also provided [6, 22]. However, since the objective functions are nonconvex, these algorithms require careful initializations, which are usually only justified when the measurement vectors follow a very specific model, for example, when the observation vectors $\{a_i\}_{i=1}^m$ are sampled from a complex normal distribution $\mathcal{CN}(0, I)$, that is, when both the real component and the imaginary component follow a real Gaussian distribution $N(0, I/2)$. Sun et al. [23] showed that when $m = O(n \log^3 n)$, their cost function has no bad critical point, and as a result, arbitrary initialization is sufficient and a trust-region method (TRM) can be applied to obtain the solution.
However, this method is more complicated than the alternating minimization algorithm described below, due to its specific objective function and the associated trust-region method.

The most widely used method is perhaps the alternating minimization algorithm and its variants [13, 11, 12], which is based on alternating projections onto nonconvex sets [3]. This method is very simple to implement and is parameter-free. However, since it is a nonconvex algorithm, its properties such as convergence are only partially known. Netrapalli et al. [19] studied a resampling version of this algorithm and established its convergence as the number of measurements $m$ goes to infinity when the measurement vectors are independent standard complex normal vectors. Marchesini et al. [18] studied and demonstrated necessary and sufficient conditions for the local convergence of this algorithm. Recently, Waldspurger [26] showed that when $m \ge Cn$ for sufficiently large $C$, the alternating minimization algorithm succeeds with high probability, provided that the algorithm is carefully initialized. In addition, with random initialization, the algorithm succeeds with $m \ge Cn^2$. That work also conjectured that the alternating minimization algorithm with random initialization succeeds with $m \ge Cn$.

The contribution of this work is to show that a modified version of the alternating minimization algorithm with random initialization succeeds with high probability when $m = O(n \log n)$, which partially verifies the conjecture that the alternating minimization algorithm succeeds with high probability when $m = O(n)$. Compared with the previous methods based on Wirtinger flows and linear programming, the proposed algorithm is more practical since it does not require a good initialization, and compared with the existing works that also do not depend on good initializations, such as semidefinite programming and [23], the proposed alternating minimization algorithm is simpler and easier to implement.

The paper is organized as follows. Section 2 presents the algorithm and the main results of the paper, and the proof of the key component, Theorem 2.4, is given in Section 3. We run simulations to verify Theorem 2.4 in Section 4.

2 Algorithm and main results

The alternating minimization method is one of the earliest methods introduced for phase retrieval problems [13, 11, 12], and it is based on alternating projections onto nonconvex sets [3]. Let $A \in \mathbb{C}^{m \times n}$ be the matrix with rows given by $a_1^*, a_2^*, \cdots, a_m^*$. The goal of this algorithm is to find a vector in $\mathbb{C}^m$ that lies in both the set $S = \operatorname{range}(A) \subseteq \mathbb{C}^m$ and the set of correct amplitudes $\mathcal{A} = \{w \in \mathbb{C}^m : |w_i| = y_i\}$. For this purpose, the algorithm picks an initial guess in $\mathbb{C}^m$ and alternately projects it onto both sets. The projections $P_S, P_{\mathcal{A}} : \mathbb{C}^m \to \mathbb{C}^m$ can be defined by
$$P_S(w) = A(A^* A)^{-1} A^* w, \qquad [P_{\mathcal{A}}(w)]_i = y_i \frac{w_i}{|w_i|},$$
and the alternating minimization algorithm is given by
$$w^{(k+1)} = P_S P_{\mathcal{A}} w^{(k)}. \qquad (1)$$
In fact, the alternating minimization method can be written down explicitly as follows. Writing $w^{(k)} = A x^{(k)}$ and letting $e_i \in \mathbb{C}^m$ be the indicator vector of the $i$-th coordinate, the update formula is
$$A x^{(k+1)} = A(A^*A)^{-1} A^* \left( \sum_{i=1}^m |a_i^* z| \frac{a_i^* x^{(k)}}{|a_i^* x^{(k)}|} e_i \right) = A(A^*A)^{-1} \left( \sum_{i=1}^m \frac{|a_i^* z|}{|a_i^* x^{(k)}|} (a_i^* x^{(k)})\, a_i \right),$$
which implies
$$x^{(k+1)} = (A^*A)^{-1} \left( \sum_{i=1}^m \frac{|a_i^* z|}{|a_i^* x^{(k)}|}\, a_i a_i^* x^{(k)} \right). \qquad (2)$$
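To make the update (2) concrete, here is a minimal NumPy sketch of a single iteration. The function name `am_update` and the use of a least-squares solve in place of forming $(A^*A)^{-1}$ explicitly are our own illustrative choices, and we assume for simplicity that no entry of $Ax^{(k)}$ is exactly zero.

```python
import numpy as np

def am_update(A, y, x):
    """One alternating minimization step, following update (2).

    A : (m, n) complex sensing matrix whose rows are a_i^*.
    y : (m,) observed amplitudes, y_i = |a_i^* z|.
    x : (n,) current estimate x^(k).
    """
    w = A @ x                      # entries a_i^* x
    v = y * w / np.abs(w)          # keep the phases of a_i^* x, impose amplitudes y
    # Projection onto range(A): solve min_u ||A u - v||, i.e. u = (A^* A)^{-1} A^* v.
    x_new, *_ = np.linalg.lstsq(A, v, rcond=None)
    return x_new
```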
Define
$$g_i(x) = \frac{|a_i^* z|}{|a_i^* x|}\, a_i a_i^* x, \qquad g(x) = \sum_{i=1}^m g_i(x), \qquad T(x) = (A^*A)^{-1} g(x); \qquad (3)$$
then the algorithm (2) can be written as
$$x^{(k+1)} = T(x^{(k)}). \qquad (4)$$
In this work, we consider the algorithm (2) in a batch setting. Similar to AltMinPhase [19], we divide the sampling vectors $a_i$ (the rows of the matrix $A$) and the corresponding observations $y_i$ into $B$ disjoint blocks $(y^{(1)}, A^{(1)}), \cdots, (y^{(B)}, A^{(B)})$ of roughly equal size, and perform the alternating minimization (1) on the disjoint blocks cyclically. The procedure is summarized as Algorithm 1, where $T^{(k)}$ represents the alternating minimization operator associated with the $k$-th block $(y^{(k)}, A^{(k)})$. We remark that while it is similar to AltMinPhase, this algorithm uses the partitions cyclically, rather than using each partition only once. As a result, it only requires finitely many observations to estimate $z$ exactly, which is different from the method in [19].
Algorithm 1 Alternating minimization in a batch setting
Input: The sampling vectors $A \in \mathbb{C}^{m \times n}$ and corresponding observations $y \in \mathbb{C}^m$, partitioned into $B$ disjoint blocks $(y^{(1)}, A^{(1)}), \cdots, (y^{(B)}, A^{(B)})$ of roughly equal size.
Output: An estimator of the underlying signal $z$.
Steps:
1: Let $x^{(0)}$ be a random unit vector in $\mathbb{C}^n$, $k = 0$.
2: Repeat: $x^{(k+1)} \leftarrow T^{(\operatorname{mod}(k, B) + 1)}(x^{(k)})$, $k = k + 1$, until convergence.
3: Output $\lim_{k \to \infty} x^{(k)}$.
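The following is a minimal, self-contained sketch of Algorithm 1 in NumPy; it is our own illustration, with a fixed iteration budget standing in for the convergence test and `numpy.array_split` providing the blocks of roughly equal size.

```python
import numpy as np

def batch_alt_min(A, y, B, n_iter=200, seed=0):
    """Sketch of Algorithm 1: alternating minimization in a batch setting."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    # Step 1: a random unit vector in C^n.
    x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    x /= np.linalg.norm(x)
    # B disjoint blocks of roughly equal size, used cyclically.
    blocks = np.array_split(np.arange(m), B)
    for k in range(n_iter):
        idx = blocks[k % B]
        Ab, yb = A[idx], y[idx]
        w = Ab @ x
        x, *_ = np.linalg.lstsq(Ab, yb * w / np.abs(w), rcond=None)
    return x
```

With $a_i$ drawn i.i.d. from $\mathcal{CN}(0, I)$ and $m$ a modest multiple of $n \log n$, one can check empirically that $|z^* x| / (\|z\| \|x\|)$ approaches 1.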
2.1 Main Result

Before we state our main result, we present an auxiliary function and its related properties as follows. We remark that in the following statements and proofs, we use $c, c_0, C, C_0$ to denote fixed constants as $m, n \to \infty$. Depending on the context, they might denote different values in different equations and expressions.

Theorem 2.1.
There exist constants $C_1, C_2, C_3, C_4, C_5$ that do not depend on $n$ and $m$ such that, when $m > C_1 C_2 n \log n$ and $B = C_2 \log n$ satisfy $n > C_3 \log m$ and $m/B > C_1 n$, then with probability at least
$$1 - C_4/\log n - \exp(-C_4 n) - B/\log n - B C_5 \exp(-C_5 m/B),$$
Algorithm 1 recovers the underlying $z$ up to multiplication by a global phase, in the sense that $\lim_{k \to \infty} |z^* x^{(k)}| = 1$.

We remark that when $n$ and $m$ go to infinity together under the assumptions that $m/(n \log n) \to \infty$ and $n/\log m \to \infty$, the conditions in Theorem 2.1 are satisfied and the probability in Theorem 2.1 goes to 1.

The proof of the main result, Theorem 2.1, can be divided into three steps. First, the random initialization in step 1 of Algorithm 1 exhibits a slight correlation with the ground truth. Then one may run a batched version of alternating projections by partitioning the measurements into $O(\log n)$ batches. Since the batches are independent of each other, the second step proves that projecting onto the measurements of each batch will (with high probability) iteratively improve the estimation until it has a constant correlation with the ground truth. Finally, Theorem 3.1 of [26] gives that (with high probability) alternating projections converges to the ground truth provided the seed has a constant correlation with the ground truth.

Throughout the paper, we define $\theta(x)$ by $\sin^{-1}(|x^* z| / (\|x\| \|z\|))$, which can be understood as the "angle" between $x$ and the hyperplane that is orthogonal to $z$ (though here the angle is not well defined since $x$ and $z$ are complex-valued). For example, when $\theta(x) = \pi/2$, there exists a constant $c \in \mathbb{C}$ such that $x = cz$; when $\theta(x) = 0$, $x$ is orthogonal to $z$ in the sense that $x^* z = 0$.
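As a small illustration (our own code, not part of the original analysis), the following computes $\theta(x)$ and shows the angle of a random direction, which is typically on the order of $\sin^{-1}(1/\sqrt{n})$:

```python
import numpy as np

def theta(x, z):
    """theta(x) = arcsin(|x^* z| / (||x|| ||z||))."""
    return np.arcsin(np.abs(np.vdot(x, z)) /
                     (np.linalg.norm(x) * np.linalg.norm(z)))

n = 1000
rng = np.random.default_rng(0)
z = np.zeros(n, dtype=complex); z[0] = 1.0
x0 = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # random direction
print(theta(x0, z))     # typically on the order of arcsin(1/sqrt(n)) ~ 0.03
```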
For Algorithm 1, the random initialization has a slight correlation with $z$, as follows:

Lemma 2.2. For any fixed $z \in \mathbb{C}^n$ and random unit vector $x^{(0)} \in \mathbb{C}^n$, with probability $1 - C/\log n - \exp(-Cn)$,
$$\theta(x^{(0)}) > \sin^{-1}\left(\frac{1}{2 \log n\, \sqrt{n}}\right).$$

Proof of Lemma 2.2. WLOG assume $z = (1, 0, \cdots, 0)$, and note that $x^{(0)}$ can be generated by normalizing a vector with i.i.d. $\mathcal{CN}(0, 1)$ entries, so that $|x^{(0)*} z| = |x^{(0)}_1| / \|x^{(0)}\|$ (where $x^{(0)}$ on the right-hand side denotes the unnormalized vector). Using the Hanson-Wright inequality [20] with $\|x^{(0)}\|^2 = x^{(0)*} I x^{(0)}$, we have that with probability $1 - \exp(-Cn)$, $\|x^{(0)}\| < 2\sqrt{n}$. In addition, with probability at least $1 - C/\log n$, $|x^{(0)}_1| > 1/\log n$. Combining these two observations, Lemma 2.2 is proved. We remark that while [20] presents the Hanson-Wright inequality for real-valued vectors and matrices, it is straightforward to generalize it to complex-valued vectors and matrices by writing any complex number as a pair of real numbers.

In the second step, we prove that the correlation between $x^{(i)}$ and $z$ improves over each iteration (with high probability). We first introduce a function $h(\theta): \mathbb{R} \to \mathbb{R}$ and an auxiliary lemma on the properties of $h(\theta)$.

Lemma 2.3. Let $a_1$ and $a_2$ be two complex variables independently sampled from a complex normal distribution $\mathcal{CN}(0, 1)$, and let
$$h(\theta) = E_{a_1, a_2 \sim \mathcal{CN}(0,1)}\, |a_1|\, |a_1 \sin\theta + a_2 \cos\theta|.$$
Then there exists $c_0 > 0$ such that for all $0 < \theta < \pi/2$, $h'(\theta) \ge c_0 \min(\theta, \pi/2 - \theta)$. In addition, there exists $c_1 > 0$ such that $h(\theta) < c_1$ for all $0 \le \theta < \pi/2$.

For the main result in this step, we investigate $T(x)$ as defined in (4), rather than $T^{(k)}$ as defined in Algorithm 1. However, $T$ is a random operator that exhibits the same distribution as each $T^{(k)}$.
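Since $h$ has no known closed form in the complex case (see the remark at the end of the appendix), a Monte Carlo estimate is a natural way to evaluate it; below is a sketch with our own parameter choices. Here $\mathcal{CN}(0, 1)$ is simulated with independent $N(0, 1/2)$ real and imaginary parts, and $h'$ is approximated by a central finite difference.

```python
import numpy as np

def h_mc(t, n_samples=10**6, seed=0):
    """Monte Carlo estimate of h(t) = E |a1| |a1 sin(t) + a2 cos(t)|."""
    rng = np.random.default_rng(seed)
    a1, a2 = (rng.standard_normal((2, n_samples)) +
              1j * rng.standard_normal((2, n_samples))) / np.sqrt(2)
    return np.mean(np.abs(a1) * np.abs(a1 * np.sin(t) + a2 * np.cos(t)))

def h_prime_mc(t, eps=1e-3):
    """Central finite-difference approximation of h'(t)."""
    return (h_mc(t + eps) - h_mc(t - eps)) / (2 * eps)
```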
Theorem 2.4. Assume that $\{a_i\}_{i=1}^m$ are i.i.d. sampled from the complex normal distribution $\mathcal{CN}(0, I)$. Then there exist $C_1, C_2, C_3 > 0$ such that if $m > C_1 n$ and $n > C_2 \log m$, then for any fixed $x \in \mathbb{C}^n$, with probability at least $1 - 1/\log n$,
$$\theta(T(x)) > \left(1 - \frac{C_3 \sqrt{n}\, \theta(x)}{\sqrt{m}}\right)\left(\theta(x) + \tan^{-1}\frac{h'(\theta(x))}{h(\theta(x))}\right).$$

Theorem 2.4 is the key element of this work since it describes the performance of the alternating minimization in each iteration. Its proof is rather technical and is deferred to Section 3.
To complete the proof of Theorem 2.1, we apply the following lemma, which is a result of [26, Theorem 3.1]. Similar to Theorem 2.4, it is a result for the operator $T$ defined in (4), instead of $T^{(i)}$ as defined in Algorithm 1.

Lemma 2.5 (Theorem 3.1 in [26]). Assume that $\{a_i\}_{i=1}^m$ are i.i.d. sampled from the complex normal distribution $\mathcal{CN}(0, I)$. Then there exist $\epsilon, C_1, C_2, M > 0$ and $0 < \delta < 1$ such that if $m \ge Mn$, then with probability $1 - C_1 \exp(-C_2 m)$, for all $x$ such that $\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x\| \le \epsilon \|z\|$,
$$\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - T(x)\| \le \delta \inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x\|.$$

Combining this result with the previous steps, we can prove Theorem 2.1.
Proof of Theorem 2.1.
Applying Lemma 2.5 to the $B$ operators $T^{(i)}$ with $i = 1, \cdots, B$, we have the following result: if $m/B > Mn$, then with probability $1 - B C_1 \exp(-C_2 m/B)$, for all $x$ such that $\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x\| \le \epsilon \|z\|$ and for all $1 \le i \le B$,
$$\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - T^{(i)}(x)\| \le \delta \inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x\|. \qquad (5)$$
Then as long as
$$\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x^{(i)}\| \le \epsilon \|z\| \quad \text{for some } 0 \le i \le B, \qquad (6)$$
the sequence $\inf_{\varphi \in \mathbb{R}} \|e^{i\varphi} z - x^{(k)}\|$, $k = i, i+1, \cdots$, converges linearly to zero. Considering that the operators $T^{(i)}(x)$ are invariant to the scale of $x$ and $\inf_{\varphi \in \mathbb{R}, c \in \mathbb{C}} \|e^{i\varphi} z - c\, x^{(i)}\| \le (\pi/2 - \theta(x^{(i)})) \|z\|$, the sufficient condition in (6) can be further reduced to
$$\pi/2 - \theta(x^{(i)}) \le \epsilon \quad \text{for some } 0 \le i \le B. \qquad (7)$$
That is, it is sufficient to prove that (7) holds with high probability. Suppose that for all $0 \le i < B$, $\theta(x^{(i)}) < \pi/2 - \epsilon$; then Lemma 2.3 implies that there exists $c_\epsilon > 0$ such that
$$\tan^{-1}\frac{h'(\theta(x^{(i)}))}{h(\theta(x^{(i)}))} > c_\epsilon\, \theta(x^{(i)})$$
for all $0 \le i < B$. Since each batch has $m/B$ observations, $T^{(i+1)}$ is independent of $x^{(i)}$, $m/B > C_1 n$, and $n > C_2 \log m$, Theorem 2.4 implies that for each $0 \le i < B$, with probability $1 - 1/\log n$,
$$\theta(x^{(i+1)}) > (1 + c_\epsilon)\left(1 - \frac{C \sqrt{nB}\, \theta(x^{(i)})}{\sqrt{m}}\right)\theta(x^{(i)}). \qquad (8)$$
We choose $C_2$ such that $(1 + c_\epsilon/2)^{C_2 \log n} \sin^{-1}\left(\frac{1}{2\log n \sqrt{n}}\right) > \pi/2 - \epsilon$, and $C_1$ such that, using $\theta(x^{(i)}) \le \pi/2$ and $\sqrt{nB}/\sqrt{m} = 1/\sqrt{C_1}$,
$$(1 + c_\epsilon)\left(1 - \frac{C \sqrt{nB}\, \theta(x^{(i)})}{\sqrt{m}}\right) \ge (1 + c_\epsilon)\left(1 - \frac{C \pi}{2\sqrt{C_1}}\right) > 1 + \frac{c_\epsilon}{2}. \qquad (9)$$
Then when $B = C_2 \log n$ and $m = C_1 C_2 n \log n$, applying (8) and induction, we can prove that with probability $1 - B/\log n$,
$$\theta(x^{(B)}) > \left(1 + \frac{c_\epsilon}{2}\right)^B \theta(x^{(0)}) > \pi/2 - \epsilon, \qquad (10)$$
which contradicts the assumption that (7) does not hold, i.e., that $\theta(x^{(B)})$ is not larger than $\pi/2 - \epsilon$. Therefore, there exists $0 \le i \le B$ such that $\theta(x^{(i)}) > \pi/2 - \epsilon$, and Theorem 2.1 is proved. The probabilistic estimate in Theorem 2.1 comes from the union bound over Lemma 2.2, (5), and (10).

Theorem 2.1 has several interesting connections with results in the current literature. First of all, it complements the analysis of AltMinPhase in [19]. While the analysis of AltMinPhase in [19] is one of the first theoretical guarantees for the alternating minimization algorithm, that work gives no instruction on how we should divide the samples into distinct blocks, or how we should choose the number or size of the blocks. In addition, the analysis requires infinitely many observations to recover $z$ exactly. In comparison, Theorem 2.1 gives an estimate of the number of blocks to use. In addition, Theorem 2.4 also shows that when the size of each block is on the order of $O(n)$ up to a logarithmic factor, each iteration of the algorithm improves the estimation of $z$, in the sense that every iteration decreases the angle between $z$ and the estimator.

Our work also partially answers the conjecture from [26] that when the initialization is randomly chosen and $m = O(n)$, the alternating minimization algorithm succeeds with high probability. In comparison, we proved that the alternating minimization algorithm in a batch setting succeeds with $m = O(n \log n)$, which is an improvement over the estimate $m = O(n^2)$ in [26] (though we remark that the result in [26] is for the non-batch setting).

An interesting observation from [26] is the existence of stationary points when $m < O(n^2)$. In comparison, Theorem 2.1 shows that the algorithm avoids these stationary points from random initialization.
In this sense, Theorem 2.1 is very different from most existing theoretical guarantees for phase retrieval, which are based on the observation that there is no stationary point (or there is no stationary point within a neighborhood of $z$).

We also emphasize that the result in this work can be applied to settings other than $a_i \sim \mathcal{CN}(0, I)$. In fact, most existing works on algorithms that succeed with $m = O(n)$ require a good initialization, which is constructed under the setting $a_i \sim \mathcal{CN}(0, I)$. For example, [19] uses the top eigenvector of $\sum_{i=1}^m |a_i^* z|^2 a_i a_i^*$, and [9] applies a similar estimator with a thresholding-based scheme by using the top eigenvector of
$$\sum_{i=1}^m |a_i^* z|^2\, a_i a_i^*\, 1\left\{|a_i^* z|^2 \le \frac{\alpha}{m} \sum_{j=1}^m |a_j^* z|^2\right\}$$
for a suitable threshold $\alpha$; a similar scheme is also used in [4]. The only exception that we are aware of is [28], which introduces an orthogonality-promoting initialization that is obtained with a few simple power iterations, and this initialization works when the distribution of $a_i$ is heavy-tailed.
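For contrast, here is a hedged sketch of the kind of spectral initialization discussed above: power iterations for the top eigenvector of $\sum_i y_i^2 a_i a_i^*$, applied without forming the $n \times n$ matrix. The iteration count and the names are our own choices, and the truncated variants of [9, 4] would only reweight the terms.

```python
import numpy as np

def spectral_init(A, y, n_power_iter=50, seed=0):
    """Top eigenvector of sum_i y_i^2 a_i a_i^* by power iteration.

    A has rows a_i^*, so the matrix-vector product is A^* diag(y^2) (A v).
    """
    _, n = A.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(n_power_iter):
        v = A.conj().T @ ((y ** 2) * (A @ v))
        v /= np.linalg.norm(v)
    return v
```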
In comparison, random initialization is a much simpler procedure, and it can be used in the setting where $\{a_i\}_{i=1}^m$ are i.i.d. sampled from the complex normal distribution $\mathcal{CN}(0, \Sigma)$, as in Corollary 2.6 below, which suggests that Theorem 2.1 still holds under the setting $a_i \sim \mathcal{CN}(0, \Sigma)$.

Corollary 2.6.
Assume that $a_i \sim \mathcal{CN}(0, \Sigma)$, $\frac{\|\Sigma z\|}{\|\Sigma^{1/2} z\| \sqrt{\operatorname{tr}(\Sigma)}} > \frac{c}{\sqrt{n}}$, $\operatorname{tr}(\Sigma) \ge \|\Sigma\|_F \log n$, and $\Sigma$ is nonsingular. Then Algorithm 1 converges to the underlying $z$ under the assumptions stated in Theorem 2.1.

Proof. The proof is based on the observation that this setting is equivalent to the setting where $a_i \sim \mathcal{CN}(0, I)$. If we let $\tilde{a}_i = \Sigma^{-1/2} a_i$, $\tilde{x}^{(k)} = \Sigma^{1/2} x^{(k)}$, and $\tilde{z} = \Sigma^{1/2} z$, then the update formula (2) is equivalent to the setting of estimating $\tilde{z}$ with sensing vectors $\{\tilde{a}_i\}_{i=1}^m$, with initialization $\tilde{x}^{(0)} = \Sigma^{1/2} x^{(0)}$ sampled from $\mathcal{CN}(0, \Sigma)$.

Now let us investigate the angle between $\tilde{x}^{(0)}$ and $\tilde{z}$:
$$\frac{|\tilde{x}^{(0)*} \tilde{z}|}{\|\tilde{x}^{(0)}\| \|\tilde{z}\|} = \frac{|x^{(0)*} \Sigma z|}{\|\Sigma^{1/2} x^{(0)}\| \|\Sigma^{1/2} z\|}. \qquad (11)$$
WLOG we may assume that all elements of $x^{(0)}$ are i.i.d. sampled from the complex normal distribution $\mathcal{CN}(0, 1)$. Then $x^{(0)*} \Sigma z$ is distributed according to $\mathcal{CN}(0, \|\Sigma z\|^2)$, and $|x^{(0)*} \Sigma z| > \|\Sigma z\| / \log n$ with probability $1 - C/\log n$. In addition, the Hanson-Wright inequality implies that
$$\Pr\left\{\left|\|\Sigma^{1/2} x^{(0)}\|^2 - \operatorname{tr}(\Sigma)\right| > t\right\} \le 2\exp\left(-c \min\left(\frac{t^2}{\|\Sigma\|_F^2}, \frac{t}{\|\Sigma\|}\right)\right).$$
Since $\operatorname{tr}(\Sigma) \ge \|\Sigma\|_F \log n \ge \|\Sigma\| \log n$, with probability at least $1 - 2\exp(-c\log n)$, $\left|\|\Sigma^{1/2} x^{(0)}\|^2 - \operatorname{tr}(\Sigma)\right| \le c_0 \operatorname{tr}(\Sigma)$. As a result, the RHS of (11) is larger than
$$\frac{c\, \|\Sigma z\|}{\log n\, \|\Sigma^{1/2} z\| \sqrt{\operatorname{tr}(\Sigma)}}.$$
If $\frac{\|\Sigma z\|}{\|\Sigma^{1/2} z\| \sqrt{\operatorname{tr}(\Sigma)}} > \frac{c}{\sqrt{n}}$, this recovers Lemma 2.2. Following the proof of Theorem 2.1, $\tilde{x}^{(k)}$ converges to $\tilde{z}$. Since $\Sigma$ is nonsingular, Corollary 2.6 is proved.

At last, we emphasize that Theorem 2.1 does not apply to the standard alternating minimization algorithm (i.e., not in a batch setting). The reason is that the probabilistic estimate in Theorem 2.4 only holds for a fixed $x$ that is independent of $A$. However, in the standard alternating minimization algorithm, $x^{(k)}$ for $k > 0$ depends on $A$, and Theorem 2.4 cannot be used to estimate $\theta(x^{(k+1)})$. In comparison, Theorem 3.1 in [26] applies to all $x$ as long as $x^{(k)}$ is sufficiently close to $z$. It is unclear how one can generalize Theorem 2.1 to the standard alternating minimization algorithm by "decoupling" the dependence of $x^{(k)}$ and $A$. This is an open question and we consider it an interesting future direction.
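The change of variables in the proof of Corollary 2.6 can be sketched numerically as follows (our own illustration); the point is only that the whitened vectors $\tilde{a}_i = \Sigma^{-1/2} a_i$ produce the same observations for $\tilde{z} = \Sigma^{1/2} z$.

```python
import numpy as np

def whiten(A, Sigma, z):
    """Rows of A are a_i^*; return (A~, z~) with a~_i = Sigma^{-1/2} a_i, z~ = Sigma^{1/2} z."""
    evals, U = np.linalg.eigh(Sigma)                 # Sigma Hermitian positive definite
    S_half = (U * np.sqrt(evals)) @ U.conj().T
    S_inv_half = (U / np.sqrt(evals)) @ U.conj().T
    return A @ S_inv_half, S_half @ z

# Sanity check that the observations coincide: |a_i^* z| = |a~_i^* z~|.
rng = np.random.default_rng(0)
n, m = 5, 20
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Sigma = M @ M.conj().T + n * np.eye(n)               # a generic nonsingular covariance
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
A_t, z_t = whiten(A, Sigma, z)
print(np.max(np.abs(np.abs(A @ z) - np.abs(A_t @ z_t))))   # ~ 1e-14
```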
3 Proof of Theorem 2.4

To prove Theorem 2.4, we first present Lemma 3.1, which gives the exact formula for the expectation of $g_i(x)$ for $g_i$ defined in (3). We also present Lemma 3.2, which shows that the expectation of $T(x)$ is approximately a scalar multiple of the expectation of $g_i(x)$, and Lemma 3.4, which shows that $T(x)$ has a small variance. Combining these three results, we prove Theorem 2.4. These lemmas apply the probabilistic setting of Theorem 2.4 by assuming that $\{a_i\}_{i=1}^m \sim \mathcal{CN}(0, I)$ and $x$ is fixed. In the proofs, we assume WLOG that $\|x\| = \|z\| = 1$.

Lemma 3.1.
Let $\eta \in [0, 2\pi)$ and $w$ be chosen such that $\|w\| = 1$, $w \perp z$ (i.e., $w^* z = 0$), and $x = \sin(\theta)\, z \exp(i\eta) + \cos(\theta)\, w$. Then for $g_i$ defined in (3),
$$E\, g_i(x) = h(\theta)\, x + h'(\theta)\, d, \quad \text{where } d = \cos(\theta)\, z \exp(i\eta) - \sin(\theta)\, w.$$

Lemma 3.2.
For any $1 \le i \le m$ and $\Sigma_i = \sum_{1 \le j \le m,\, j \ne i} a_j a_j^*$,
$$\left\| E\, T(x) - m\, E\left(\frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})} \Sigma_i^{-1}\right) E\, g_i(x) \right\| < Cn/m, \qquad (12)$$
$$\left| z^* E\, T(x) - m\, z^* E\left(\frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})} \Sigma_i^{-1}\right) E\, g_i(x) \right| < C\sqrt{n}/m. \qquad (13)$$

Lemma 3.3.
For $g(x)$ defined in (3), there exists $C > 0$ such that $\Pr(\|g(x)\| > Ctm) < \exp(-t)$.

Lemma 3.4.
There exists $C > 0$ such that
$$E\left[\|T(x) - E\, T(x)\|^2\right] < Cn/m \quad \text{and} \quad \operatorname{Var}[z^* T(x)] < C/m.$$

We first prove Theorem 2.4, with the proofs of the lemmas deferred.
Proof of Theorem 2.4.
Applying Chebyshev's inequality to Lemma 3.4, we have that with probability at least $1 - 1/\log n$,
$$\|T(x) - E\, T(x)\| < C\sqrt{n \log n}/\sqrt{m}, \qquad |z^* T(x) - z^* E\, T(x)| < C\sqrt{\log n}/\sqrt{m}. \qquad (14)$$
In addition, $E\left(\frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})} \Sigma_i^{-1}\right)$ is a scalar matrix, and (17) implies that with probability $1/2$, the largest singular value and the smallest singular value of $\Sigma_i^{-1}$ are both on the order of $1/m$, so there exists some $c = O(1)$ such that its diagonal entries are larger than $c/m$.

Lemma 3.1 implies that the angle between $z$ and $E\, g_i(x)$ satisfies
$$\frac{|z^* E\, g_i(x)|}{\|E\, g_i(x)\|} = \sin\left(\theta(x) + \tan^{-1}\frac{h'(\theta(x))}{h(\theta(x))}\right).$$
Combining this with $\|E\, g_i(x)\| \ge c_0$, $E\left(\frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})} \Sigma_i^{-1}\right) = \frac{c}{m} I$ with $c = O(1)$, (14), and Lemma 3.2,
$$\frac{|z^* T(x)|}{\|T(x)\|} \ge \frac{c \sin\left(\theta(x) + \tan^{-1}\frac{h'(\theta(x))}{h(\theta(x))}\right) - C\sqrt{\log n}/\sqrt{m}}{c + C\sqrt{n \log n}/\sqrt{m}}.$$
Then Theorem 2.4 is proved by applying $\theta(T(x)) = \sin^{-1}\left(\frac{|z^* T(x)|}{\|T(x)\|}\right)$.
3.1 Proof of Auxiliary Lemmas for Theorem 2.4

Proof of Lemma 3.1. The proof is based on the observation that $g_i(x)$ is the derivative of $|a_i^* x|\, |a_i^* z|$. In particular, this work defines the derivatives of real-valued functions over complex variables as follows: $\nabla f(x)$ is chosen such that
$$f(x + \Delta x) = f(x) + \operatorname{re}(\nabla f(x)^* \Delta x) + o(|\Delta x|).$$
We then define $G(x) = \sum_{i=1}^m G_i(x)$ with $G_i(x) = |a_i^* x|\, |a_i^* z|$, so that $g_i(x) = \nabla G_i(x)$ and $g(x) = \nabla G(x)$.

In addition, we can calculate $E\, G_i(x)$. Since the expectation is invariant under unitary transformations of $x$ and $z$, and $\theta(x) = \sin^{-1}\left(\frac{|x^* z|}{\|x\| \|z\|}\right)$, WLOG we may phase $z$ so that $x^* z$ is nonnegative and assume that $z = (1, 0, \cdots, 0)$ and $x = (\sin(\theta), \cos(\theta), 0, \cdots, 0)$. Then
$$E[G_i(x)] = E_{[a_1, a_2] \sim \mathcal{CN}(0, I)}\left[|a_1|\, |a_1 \sin\theta + a_2 \cos\theta|\right] = h(\theta).$$
Since $E[G_i(x)]$ only depends on $\theta(x)$ and $\|x\|$, its derivative is nonzero only in two directions: $x$, and the direction in which $\theta(x)$ changes most. Since the function $G_i$ has the property $G_i(x + tx) = (1 + t) G_i(x)$, we have
$$x^* \nabla E[G_i(x)] = E[G_i(x)].$$
By definition, $d$ is the direction in which $\theta(x)$ changes most, that is,
$$d = \arg\max_{\|y\| = 1,\, y \in \mathbb{C}^n} \lim_{t \to 0} \frac{\theta(x + ty) - \theta(x)}{t},$$
and $\theta(x + td) = \theta(x) + t + O(t^2)$. Combining this with $\|x + td\| = \|x\| + O(t^2)$, we have
$$d^* \nabla E[G_i(x)] = h'(\theta)\big|_{\theta = \theta(x)}.$$
Combining the above observations, Lemma 3.1 is proved.
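Lemma 3.1 is easy to probe numerically; the following sketch (our own check, with arbitrary parameter values) estimates $E\, g_i(x)$ by averaging over many draws of $a_i \sim \mathcal{CN}(0, I)$ and reads off the two coefficients, which should approximate $h(\theta)$ and $h'(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, eta = 20, 200_000, 0.7, 0.3
z = np.zeros(n, dtype=complex); z[0] = 1.0
w = np.zeros(n, dtype=complex); w[1] = 1.0            # ||w|| = 1 and w ⟂ z
x = np.sin(t) * np.exp(1j * eta) * z + np.cos(t) * w
d = np.cos(t) * np.exp(1j * eta) * z - np.sin(t) * w
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
scal = np.abs(A @ z) / np.abs(A @ x) * (A @ x)        # |a_i^* z| / |a_i^* x| * (a_i^* x)
g_mean = (A.conj() * scal[:, None]).mean(axis=0)      # average of g_i(x) = scal_i * a_i
print(np.vdot(x, g_mean), np.vdot(d, g_mean))         # ≈ h(t), h'(t) (up to noise)
```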
Proof of Lemma 3.2.
The proof of Lemma 3.2 is based on an upper bound of $\|\Sigma^{-1}\|$ for $\Sigma = A^* A$. To start, we apply the result from [24, Theorem 1.1] that for any $n \times n$ complex normal matrix $A_0$,
$$\Pr\left(\sigma_{\min}(A_0) \le t/\sqrt{n}\right) < t. \qquad (15)$$
For any $m \times n$ complex normal matrix $A$, we denote its smallest singular value by $\sigma_{\min}(A)$. Since $A$ contains $\lfloor \frac{m}{n} \rfloor$ independent submatrices of size $n \times n$, and $\sigma_{\min}(A)$ is larger than the smallest singular value of any submatrix of $A$, we have
$$\Pr\left(\sigma_{\min}(A) \le t/\sqrt{n}\right) < t^{\lfloor \frac{m}{n} \rfloor}. \qquad (16)$$
We may also apply the result from [10, Theorem II.13] that for any $m \times n$ matrix $\Gamma$ with entries i.i.d. sampled from the real Gaussian distribution $N(0, 1)$,
$$\Pr\left(\sqrt{m} + \sqrt{n} + t \ge \sigma_1(\Gamma) \ge \sigma_n(\Gamma) \ge \sqrt{m} - \sqrt{n} - t\right) > 1 - 2\exp(-t^2/2).$$
Combining this with $\sigma_n(A) \ge \sigma_n(\operatorname{im}(A))$ and $\sigma_1(A) \le \sigma_1(\operatorname{re}(A)) + \sigma_1(\operatorname{im}(A))$,
$$\Pr\left\{\sigma_n(A) \ge \frac{1}{\sqrt{2}}(\sqrt{m} - \sqrt{n} - t),\ \sigma_1(A) \le \sqrt{2}(\sqrt{m} + \sqrt{n} + t)\right\} > 1 - 4\exp(-t^2/2). \qquad (17)$$
As a result, we have
$$\Pr\left(\sigma_{\min}(A) \le t\right) \le \min\left\{(t\sqrt{n})^{\lfloor \frac{m}{n} \rfloor},\ 4\exp\left(-\frac{(\sqrt{m} - \sqrt{n} - \sqrt{2}\, t)^2}{2}\right)\right\} \qquad (18)$$
$$\le \begin{cases} (t\sqrt{n})^{\lfloor \frac{m}{n} \rfloor}, & \text{if } t < \exp(-n), \\ 4\exp\left(-\frac{(\sqrt{m} - \sqrt{n} - \sqrt{2}\, t)^2}{2}\right), & \text{if } t \ge \exp(-n). \end{cases} \qquad (19)$$
Now let us estimate an upper bound of $E\|\Sigma^{-1}\|^2$. Since $\|\Sigma^{-1}\| = \sigma_{\min}(A)^{-2}$,
$$E\|\Sigma^{-1}\|^2 \le \frac{64}{m^2} + \int_{t = 64/m^2}^{\infty} \Pr\left(\|\Sigma^{-1}\|^2 > t\right) dt = \frac{64}{m^2} + \int_{t = 64/m^2}^{\infty} \Pr\left(\sigma_{\min}(A) < t^{-1/4}\right) dt$$
$$\le \frac{64}{m^2} + \int_{t = 64/m^2}^{\exp(n/2)} 4\exp\left(-\frac{(\sqrt{m}/2 - \sqrt{n})^2}{2}\right) dt + \int_{t = \exp(n/2)}^{\infty} \left(t^{-1/4}\sqrt{n}\right)^{\lfloor \frac{m}{n} \rfloor} dt$$
$$\le \frac{64}{m^2} + 4\exp(n/2)\exp\left(-\frac{(\sqrt{m}/2 - \sqrt{n})^2}{2}\right) + \exp(-n/2) \le \frac{C}{m^2},$$
where the last two steps use the assumptions that $m > Cn$ and $n > C\log m$.

In addition, for any fixed $n \times n$ matrix $\Sigma_0$, the Hanson-Wright inequality [20]
$$\Pr\left(|a_i^* \Sigma_0 a_i - \operatorname{tr}(\Sigma_0)| > t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{\|\Sigma_0\|_F^2}, \frac{t}{\|\Sigma_0\|}\right)\right)$$
implies
$$\Pr\left(|a_i^* \Sigma_0 a_i - \operatorname{tr}(\Sigma_0)| > t\sqrt{n}\, \|\Sigma_0\|\right) \le 2\exp\left(-c\min(t^2, t\sqrt{n})\right). \qquad (20)$$
Applying the Sherman–Morrison formula,
$$T(x) = \Sigma^{-1}\sum_{i=1}^m g_i(x) = \sum_{i=1}^m \left[\Sigma_i^{-1} - \frac{\Sigma_i^{-1} a_i a_i^* \Sigma_i^{-1}}{1 + a_i^* \Sigma_i^{-1} a_i}\right] g_i(x) = \sum_{i=1}^m \frac{1}{1 + a_i^* \Sigma_i^{-1} a_i}\, \Sigma_i^{-1} g_i(x),$$
we have
$$\left\|E\, T(x) - \sum_{i=1}^m E\left(\frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})}\Sigma_i^{-1}\right) E\, g_i(x)\right\| = m\left\|E\left[\left(\frac{1}{1 + a_i^* \Sigma_i^{-1} a_i} - \frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})}\right)\Sigma_i^{-1} g_i(x)\right]\right\|$$
$$\le m\, E\left\|\left(\frac{1}{1 + a_i^* \Sigma_i^{-1} a_i} - \frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})}\right)\Sigma_i^{-1} g_i(x)\right\| \le m\, E\left\|\left(a_i^* \Sigma_i^{-1} a_i - \operatorname{tr}(\Sigma_i^{-1})\right)\Sigma_i^{-1} g_i(x)\right\|, \qquad (21)$$
where the last inequality follows from the fact that
$$\left|\frac{1}{1 + a_i^* \Sigma_i^{-1} a_i} - \frac{1}{1 + \operatorname{tr}(\Sigma_i^{-1})}\right| \le \left|a_i^* \Sigma_i^{-1} a_i - \operatorname{tr}(\Sigma_i^{-1})\right|.$$
Applying [2, Proposition 2.2(1)], for any $t > 1$: with probability $1 - t\exp(-t + 1)$, $|a_i^* z| < t$, and with probability $1 - t^n\exp(-(t - 1)n)$, $\|a_i\| < t\sqrt{n}$. This means that with probability $1 - t^n\exp(-(t-1)n) - t\exp(-t + 1)$, $\|g_i(x)\| = |a_i^* z|\, \|a_i\| < t^2\sqrt{n}$. Combining this with (20), the RHS of (21) can be estimated by
$$m\, E\left\|\left(a_i^* \Sigma_i^{-1} a_i - \operatorname{tr}(\Sigma_i^{-1})\right)\Sigma_i^{-1} g_i(x)\right\| = m\, E_{\{a_j\}_{1 \le j \le m, j \ne i}}\left[E_{a_i}\left\|\left(a_i^* \Sigma_i^{-1} a_i - \operatorname{tr}(\Sigma_i^{-1})\right)\Sigma_i^{-1} g_i(x)\right\|\right] \le Cnm\, E_{\{a_j\}_{1 \le j \le m, j \ne i}}\|\Sigma_i^{-1}\|^2.$$
Combining this with (18) and the estimate $E\|\Sigma_i^{-1}\|^2 \le C/m^2$ above, we have (12). The estimate (13) is almost identical, with $\|g_i(x)\|$ replaced by $|z^* g_i(x)|$, which saves a factor of $\sqrt{n}$. For $|z^* g_i(x)|$ we have
$$|z^* g_i(x)| = \left|\frac{|a_i^* z|}{|a_i^* x|}\, (z^* a_i)(a_i^* x)\right| = |a_i^* z|^2,$$
and $a_i^* z \sim \mathcal{CN}(0, 1)$, so with probability $1 - 2\exp(-t/2)$, $|z^* g_i(x)| < t$.
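The Sherman–Morrison decomposition of $T(x)$ above is easy to verify numerically; the following small check (entirely our own, with arbitrary sizes) compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 8
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
scal = np.abs(A @ z) / np.abs(A @ x) * (A @ x)
g = A.conj() * scal[:, None]                 # row i is g_i(x)
Sigma = A.conj().T @ A
lhs = np.linalg.solve(Sigma, g.sum(axis=0))  # T(x) = Sigma^{-1} sum_i g_i(x)
rhs = np.zeros(n, dtype=complex)
for i in range(m):
    a = A[i].conj()                          # the vector a_i
    Si_inv = np.linalg.inv(Sigma - np.outer(a, a.conj()))
    rhs += (Si_inv @ g[i]) / (1 + A[i] @ Si_inv @ a)
print(np.linalg.norm(lhs - rhs))             # ~ 1e-13
```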
Proof of Lemma 3.3. Let $\operatorname{Sp}(x, z)$ be the two-dimensional subspace spanned by $x$ and $z$, and let $P_{\operatorname{Sp}(x,z)^\perp} \in \mathbb{C}^{(n-2) \times n}$ be a projection matrix onto the $(n-2)$-dimensional orthogonal complement of $\operatorname{Sp}(x, z)$. Then
$$P_{\operatorname{Sp}(x,z)^\perp}\, g(x) = \sum_{i=1}^m |a_i^* z| \frac{a_i^* x}{|a_i^* x|}\, P_{\operatorname{Sp}(x,z)^\perp}\, a_i,$$
where $P_{\operatorname{Sp}(x,z)^\perp}\, a_i \in \mathbb{C}^{n-2}$ is i.i.d. sampled from $\mathcal{CN}(0, I)$ and is independent of $|a_i^* z| \frac{a_i^* x}{|a_i^* x|}$. As a result, conditioning on $\{|a_i^* z|\}_{i=1}^m$, $P_{\operatorname{Sp}(x,z)^\perp}\, g(x) \in \mathbb{C}^{n-2}$ is a vector whose elements are i.i.d. sampled from $\mathcal{CN}(0, \sum_{i=1}^m |a_i^* z|^2)$. Applying the Hanson-Wright inequality [20], we have
$$\Pr\left(\|P_{\operatorname{Sp}(x,z)^\perp}\, g(x)\|^2 > tn \sum_{i=1}^m |a_i^* z|^2\right) < \exp(-Cnt),$$
and, by the Cauchy–Schwarz inequality,
$$\|P_{\operatorname{Sp}(x,z)}\, g(x)\| \le \sum_{i=1}^m |a_i^* z|\, \|P_{\operatorname{Sp}(x,z)}\, a_i\| \le \left(\sum_{i=1}^m |a_i^* z|^2\right)^{1/2}\left(\sum_{i=1}^m \|P_{\operatorname{Sp}(x,z)}\, a_i\|^2\right)^{1/2}.$$
In addition, Bernstein's inequality implies that there exists $C > 0$ such that
$$\Pr\left(\sum_{i=1}^m |a_i^* z|^2 > Ctm\right) < \exp(-t), \qquad \Pr\left(\sum_{i=1}^m \|P_{\operatorname{Sp}(x,z)}\, a_i\|^2 > Ctm\right) < \exp(-t). \qquad (22)$$
Combining these estimates with $\|g(x)\| \le \|P_{\operatorname{Sp}(x,z)}\, g(x)\| + \|P_{\operatorname{Sp}(x,z)^\perp}\, g(x)\|$, the lemma is proved.

Proof of Lemma 3.4.
First, we apply the following lemma, which is a straightforward generalization of the tensorization of variance theorem [25, Theorem 2.3] to the complex setting:
Lemma 3.5.
For complex random variables $X_1, \cdots, X_n$ and $f: \mathbb{C}^n \to \mathbb{C}$, we have
$$\operatorname{Var}[f(X_1, \cdots, X_n)] \le E\left[\sum_{i=1}^n \operatorname{Var}_i(f(X_1, \cdots, X_n))\right],$$
where $\operatorname{Var}_i$ is the variance of $f$ with respect to the variable $X_i$ only, the remaining variables being kept fixed.

Proof. Applying $\operatorname{Var}[f] = \operatorname{Var}[\operatorname{re}(f)] + \operatorname{Var}[\operatorname{im}(f)]$ and the same argument as in the proof of [25, Theorem 2.3] for both the real and the imaginary parts, the lemma is proved.

Applying Lemma 3.5 and denoting by $\operatorname{Var}_j(z^* T(x))$ the variance when $\{a_i\}_{i \ne j}$ are fixed, we have
$$\operatorname{Var}(z^* T(x)) \le E\sum_{j=1}^m \left[\operatorname{Var}_j(z^* T(x))\right]. \qquad (23)$$
Then
$$\operatorname{Var}_j(z^* T(x)) \le E_{a_j}\left|z^* \Sigma^{-1}\sum_{i=1}^m g_i(x) - z^* \Sigma_j^{-1}\sum_{i=1, i \ne j}^m g_i(x)\right|^2$$
$$= E_{a_j}\left|z^* \Sigma^{-1} g_j(x) - \frac{1}{1 + a_j^* \Sigma_j^{-1} a_j}\, z^* \Sigma_j^{-1} a_j a_j^* \Sigma_j^{-1}\sum_{i=1, i \ne j}^m g_i(x)\right|^2$$
$$\le 2\, E_{a_j}\left|z^* \Sigma_j^{-1} a_j\right|^2 |a_j^* z|^2 + 2\, E_{a_j}\left|z^* \Sigma_j^{-1} a_j\right|^2\left|a_j^* \Sigma_j^{-1}\sum_{i=1, i \ne j}^m g_i(x)\right|^2,$$
where the last step uses $z^* \Sigma^{-1} g_j(x) = \frac{1}{1 + a_j^* \Sigma_j^{-1} a_j}\, z^* \Sigma_j^{-1} a_j\, \frac{|a_j^* z|}{|a_j^* x|}\, a_j^* x$ and $\frac{1}{1 + a_j^* \Sigma_j^{-1} a_j} \le 1$. Considering that when $\{a_i\}_{i \ne j}$ are fixed and $a_j \sim \mathcal{CN}(0, I)$, we have $z^* \Sigma_j^{-1} a_j \sim \mathcal{CN}(0, \|z^* \Sigma_j^{-1}\|^2)$, $a_j^* z \sim \mathcal{CN}(0, \|z\|^2) = \mathcal{CN}(0, 1)$, and $a_j^* \Sigma_j^{-1}\sum_{i \ne j} g_i(x) \sim \mathcal{CN}(0, \|\Sigma_j^{-1}\sum_{i \ne j} g_i(x)\|^2)$, so
$$E\operatorname{Var}_j(z^* T(x)) \le C\, E\left[\|z^* \Sigma_j^{-1}\|^2 + \|z^* \Sigma_j^{-1}\|^2\left\|\Sigma_j^{-1}\sum_{i=1, i \ne j}^m g_i(x)\right\|^2\right] \le C\, E\left[\|\Sigma_j^{-1}\|^2 + \|\Sigma_j^{-1}\|^4\left\|\sum_{i=1, i \ne j}^m g_i(x)\right\|^2\right].$$
Combining this with the estimate of $\|\Sigma_j^{-1}\|$ in (18) and the estimate of $\|\sum_{i \ne j} g_i(x)\|$ in Lemma 3.3 (note that the estimate of $\|\sum_{i \ne j} g_i(x)\|$ is identical to that of $\|g(x)\| = \|\sum_{i=1}^m g_i(x)\|$), we have $E\operatorname{Var}_j(z^* T(x)) < C/m^2$. Applying (23), we have $\operatorname{Var}[z^* T(x)] < C/m$.

Similarly, we can prove the other inequality by showing that for any vector $e_i$, whose $i$-th element is 1 and whose other elements are zero, $\operatorname{Var}[e_i^* T(x)] < C/m$.

4 Simulations

This section aims to verify the result in Theorem 2.4. In particular, we would like to investigate whether, empirically, $\theta(x)$ and $\theta(T(x))$ have the relation predicted by Theorem 2.4 and its proof:
$$\theta(T(x)) \approx \theta(x) + \tan^{-1}\frac{h'(\theta(x))}{h(\theta(x))}. \qquad (24)$$

[Figure 1: Comparison between the predicted and the empirical values of $\theta(T(x))$, with various settings of $(n, m)$.]

For this purpose, we run simulations and compare the empirically observed $\theta(T(x))$ with the predicted values. We run two simulations with different settings of $n, m$. For each setting and each $\theta(x)$, we repeat the alternating minimization step with 1000 random draws and visualize the 10%, 50%, and 90% quantiles of the observed $\theta(T(x))$ in Figure 1, as well as the predicted value in (24).
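A sketch of this experiment in NumPy (with our own, much smaller parameter values) is given below; it draws fresh measurement matrices, applies one step of (2), and compares the empirical quantiles of $\theta(T(x))$ with the prediction (24), using a Monte Carlo estimate of $h$ as in the sketch following Lemma 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def h_mc(t, n_samples=10**6):
    a1, a2 = (rng.standard_normal((2, n_samples)) +
              1j * rng.standard_normal((2, n_samples))) / np.sqrt(2)
    return np.mean(np.abs(a1) * np.abs(a1 * np.sin(t) + a2 * np.cos(t)))

n, m, t0, eps = 64, 640, 0.5, 1e-3
z = np.zeros(n, dtype=complex); z[0] = 1.0
w = np.zeros(n, dtype=complex); w[1] = 1.0
x = np.sin(t0) * z + np.cos(t0) * w                  # theta(x) = t0
pred = t0 + np.arctan((h_mc(t0 + eps) - h_mc(t0 - eps)) / (2 * eps) / h_mc(t0))
obs = []
for _ in range(200):                                  # repeated random draws of A
    A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
    Ax = A @ x
    Tx, *_ = np.linalg.lstsq(A, np.abs(A @ z) * Ax / np.abs(Ax), rcond=None)
    obs.append(np.arcsin(np.abs(Tx[0]) / np.linalg.norm(Tx)))
print(pred, np.quantile(obs, [0.1, 0.5, 0.9]))
```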
The figure clearly indicates that the predicted value is close to the empirical values, and as a result, $\theta(T(x)) > \theta(x)$ with high probability as long as $\theta(x)$ is not too small, which means that with high probability, the alternating minimization algorithm monotonically reduces the angle between the estimated and the underlying signal. In addition, the standard deviation of the distribution of $\theta(T(x))$ is shown to be on the order of $1/\sqrt{m}$.

5 Conclusion

This work analyzes the performance of the alternating minimization algorithm for phase retrieval. Theoretical analysis shows that the angle between the current iterate and the underlying signal is reduced at each iteration with high probability. Based on this observation, it is shown that alternating minimization in a batch setting with random initialization can recover the underlying signal as long as $m = O(n \log n)$.

A future direction is the analysis of standard alternating minimization without the batch setting. The current work only analyzes the performance of phase retrieval per iteration; as discussed at the end of Section 2, it does not apply to the standard alternating minimization algorithm. We hope to find a way to decouple the correlation between $x^{(k)}$ and $A$, in order to prove the conjecture that the alternating minimization algorithm succeeds with $m = O(n)$. It is also interesting to improve the probabilistic estimates in this work, for example, finding the exact values of the constants and possibly removing the logarithmic factors from the current estimates.

Proof of Lemma 2.3.
Writing it in terms of real variables, we have
$$h(\theta) = E_{a_1, a_2, b_1, b_2 \sim N(0, 1/2)}\, \sqrt{a_1^2 + b_1^2}\, \sqrt{(a_1\sin\theta + a_2\cos\theta)^2 + (b_1\sin\theta + b_2\cos\theta)^2}.$$
Using $(\sqrt{f(x)})'' = \left(\frac{f'(x)}{2\sqrt{f(x)}}\right)' = \frac{f''(x)}{2\sqrt{f(x)}} - \frac{f'(x)^2}{4 f(x)^{3/2}}$ and, writing $f(\theta) = (a_1\sin\theta + a_2\cos\theta)^2 + (b_1\sin\theta + b_2\cos\theta)^2$,
$$f'(\theta) = 2(a_1^2 - a_2^2 + b_1^2 - b_2^2)\sin\theta\cos\theta + 2(a_1 a_2 + b_1 b_2)(\cos^2\theta - \sin^2\theta),$$
$$f''(\theta) = 2(a_1^2 - a_2^2 + b_1^2 - b_2^2)(\cos^2\theta - \sin^2\theta) - 8(a_1 a_2 + b_1 b_2)\cos\theta\sin\theta,$$
we have
$$h''(0) = E\, \frac{\sqrt{a_1^2 + b_1^2}}{(a_2^2 + b_2^2)^{3/2}}\left[(a_2^2 + b_2^2)(a_1^2 + b_1^2 - a_2^2 - b_2^2) - (a_1 a_2 + b_1 b_2)^2\right].$$
If $a_1^2 + b_1^2$ and $a_2^2 + b_2^2$ are fixed, then under this conditional distribution $E[(a_1 a_2 + b_1 b_2)^2] = \frac{1}{2}[a_1^2 + b_1^2][a_2^2 + b_2^2]$, so we have
$$h''(0) = E\, \frac{\sqrt{a_1^2 + b_1^2}}{(a_2^2 + b_2^2)^{3/2}}\left[\frac{1}{2}(a_1^2 + b_1^2)(a_2^2 + b_2^2) - (a_2^2 + b_2^2)^2\right] = E\left[\frac{1}{2}\frac{(a_1^2 + b_1^2)^{3/2}}{(a_2^2 + b_2^2)^{1/2}} - \sqrt{a_1^2 + b_1^2}\sqrt{a_2^2 + b_2^2}\right].$$
Applying
$$E(a_1^2 + b_1^2)^k = \frac{1}{\pi}\int_{x, y} (x^2 + y^2)^k e^{-x^2 - y^2}\, dx\, dy = 2\int_{r=0}^{\infty} r^{2k+1} e^{-r^2}\, dr = \int_{z=0}^{\infty} z^k e^{-z}\, dz = \Gamma(k + 1),$$
we obtain
$$h''(0) = \frac{1}{2}\Gamma\left(\frac{5}{2}\right)\Gamma\left(\frac{1}{2}\right) - \Gamma\left(\frac{3}{2}\right)^2 = \frac{\pi}{8} > 0.$$
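The closed-form value just derived is easy to double-check numerically (our own check):

```python
from math import gamma, pi

# (1/2) * Gamma(5/2) * Gamma(1/2) - Gamma(3/2)^2 should equal pi/8.
print(0.5 * gamma(2.5) * gamma(0.5) - gamma(1.5) ** 2, pi / 8)
```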
Using the fact that
$$h''(\varphi) = \frac{d^2}{d\theta^2}\, E\, \sqrt{(a_1\cos\varphi - a_2\sin\varphi)^2 + (b_1\cos\varphi - b_2\sin\varphi)^2}\, \sqrt{(a_1\sin\theta + a_2\cos\theta)^2 + (b_1\sin\theta + b_2\cos\theta)^2}\,\Big|_{\theta = 0}$$
and applying the same procedure as in the calculation of $h''(0)$, we have
$$h''(\varphi) = E\, \frac{\sqrt{(a_1\cos\varphi - a_2\sin\varphi)^2 + (b_1\cos\varphi - b_2\sin\varphi)^2}}{(a_2^2 + b_2^2)^{3/2}}\left[(a_2^2 + b_2^2)(a_1^2 + b_1^2 - a_2^2 - b_2^2) - (a_1 a_2 + b_1 b_2)^2\right], \qquad (25)$$
and as a special case,
$$h''(\pi/2) = E\left[(a_1^2 + b_1^2 - a_2^2 - b_2^2) - \frac{1}{2}(a_1^2 + b_1^2)\right] = -\frac{1}{2}\Gamma(2) = -\frac{1}{2}.$$
Next, we show that $h''(\theta)$ is well-defined and Lipschitz continuous. In fact, applying (25) and the fact that
$$\sqrt{(a_1\cos\varphi_1 - a_2\sin\varphi_1)^2 + (b_1\cos\varphi_1 - b_2\sin\varphi_1)^2} - \sqrt{(a_1\cos\varphi_2 - a_2\sin\varphi_2)^2 + (b_1\cos\varphi_2 - b_2\sin\varphi_2)^2} < |\varphi_1 - \varphi_2|\left(\sqrt{a_1^2 + b_1^2} + \sqrt{a_2^2 + b_2^2}\right),$$
we have
$$|h''(\varphi_1) - h''(\varphi_2)| \le E\, |\varphi_1 - \varphi_2|\, \frac{\sqrt{a_1^2 + b_1^2} + \sqrt{a_2^2 + b_2^2}}{(a_2^2 + b_2^2)^{3/2}}\left[(a_2^2 + b_2^2)(a_1^2 + b_1^2) + (a_2^2 + b_2^2)^2 + (a_1 a_2 + b_1 b_2)^2\right]$$
$$\le E\, |\varphi_1 - \varphi_2|\, \frac{\sqrt{a_1^2 + b_1^2} + \sqrt{a_2^2 + b_2^2}}{(a_2^2 + b_2^2)^{3/2}}\left[\frac{3}{2}(a_1^2 + b_1^2)(a_2^2 + b_2^2) + (a_2^2 + b_2^2)^2\right].$$
We thus obtain the Lipschitz continuity of $h''(\theta)$ with Lipschitz constant
$$L = E\, \frac{\sqrt{a_1^2 + b_1^2} + \sqrt{a_2^2 + b_2^2}}{(a_2^2 + b_2^2)^{3/2}}\left[\frac{3}{2}(a_1^2 + b_1^2)(a_2^2 + b_2^2) + (a_2^2 + b_2^2)^2\right] = \frac{3}{2}\Gamma\left(\frac{1}{2}\right)\Gamma\left(\frac{5}{2}\right) + \Gamma\left(\frac{3}{2}\right)^2 + \frac{5}{2}\Gamma(2).$$
Then, to prove that $h'(\theta) \ge c_0\min(\theta, \pi/2 - \theta)$ for all $0 < \theta < \pi/2$, first note that near the endpoints the claim follows from the values of $h''$: since $h'(0) = 0$ (by the symmetry $h(-s) = h(s)$), $h''(0) = \pi/8$, and $h''$ is $L$-Lipschitz, we have $h'(\theta) \ge \frac{\pi}{16}\theta$ for $0 \le \theta \le \frac{\pi}{16L}$; similarly, since $h'(\pi/2) = 0$ (by the symmetry $h(\pi/2 + s) = h(\pi/2 - s)$) and $h''(\pi/2) = -\frac{1}{2}$, we have $h'(\theta) \ge \frac{1}{4}(\pi/2 - \theta)$ for $\frac{\pi}{2} - \frac{1}{4L} \le \theta \le \frac{\pi}{2}$. It is therefore sufficient to verify that
$$\min_{\frac{\pi}{16L} \le \theta \le \frac{\pi}{2} - \frac{1}{4L}} h'(\theta) > c \quad \text{for some } c > 0. \qquad (26)$$
Since $|h''(\theta)| \le |h''(\pi/2)| + \pi L \le 1 + \pi L$, $h'(\theta)$ is also Lipschitz continuous with Lipschitz constant $1 + \pi L$. Therefore, (26) can be verified numerically by checking the values of $h'(\theta)$ on a grid of spacing $\delta/(1 + \pi L)$ over this interval and verifying that $h'(\theta) > c + \delta$ at every grid point. Using a computer with $\delta = 1/10$, this is verified as shown in Figure 2.

Based on the Lipschitz continuity of $h''(\theta)$, we can similarly verify the Lipschitz continuity of $h(\theta)$ on $[0, \pi/2]$ and that there exists $c_1 > 0$ such that $h(\theta) < c_1$ for all $0 \le \theta < \pi/2$, by checking a few functional values of $h(\theta)$ for $\theta \in [0, \pi/2]$.

[Figure 2: $h(\theta)$ and $h'(\theta)$, calculated from the average of 10 simulations.]

To visualize Lemma 2.3, we randomly generate 10 batches of samples of $(a_1, a_2)$, calculate the average values of $h(\theta)$ and $h'(\theta)$, and plot them in Figure 2. The right figure verifies that Lemma 2.3 holds. We remark that if $a_1$ and $a_2$ are sampled from the real Gaussian distribution $N(0, 1)$, then there is an explicit formula $h(\theta) = \frac{2}{\pi}(\theta\sin\theta + \cos\theta)$; but in the complex setting, the calculation is more complicated and there is no known explicit formula.

References

[1] S. Bahmani and J. Romberg. Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation. In A. Singh and J. Zhu, editors,
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 252–260, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
[2] A. Barvinok. Math 710: Measure Concentration, 2005.
[3] H. H. Bauschke, P. L. Combettes, and D. R. Luke. Hybrid projection–reflection method for phase retrieval. J. Opt. Soc. Am. A, 20(6):1025–1034, Jun 2003.
[4] T. T. Cai, X. Li, and Z. Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. Ann. Statist., 44(5):2221–2251, 2016.
[5] E. J. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017–1026, 2014.
[6] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, April 2015.
[7] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
[8] A. Chai, M. Moscoso, and G. Papanicolaou. Array imaging using intensity-only measurements. Inverse Problems, 27(1):015005, 2011.
[9] Y. Chen and E. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 739–747. Curran Associates, Inc., 2015.
[10] K. Davidson and S. Szarek. Local operator theory, random matrices and Banach spaces. In J. Lindenstrauss, editor, Handbook of the Geometry of Banach Spaces, volume 1, pages 317–366. Elsevier Science, 2001.
[11] J. R. Fienup. Reconstruction of an object from the modulus of its Fourier transform. Opt. Lett., 3(1):27–29, Jul 1978.
[12] J. R. Fienup. Phase retrieval algorithms: a comparison. Appl. Opt., 21(15):2758–2769, Aug 1982.
[13] R. W. Gerchberg and W. O. Saxton. A practical algorithm for the determination of the phase from image and diffraction plane pictures. Optik (Jena), 35:237+, 1972.
[14] T. Goldstein and C. Studer. PhaseMax: Convex phase retrieval via basis pursuit. 2016.
[15] D. Gross, F. Krahmer, and R. Kueng. A partial derandomization of PhaseLift using spherical designs. Journal of Fourier Analysis and Applications, 21(2):229–266, 2015.
[16] P. Hand and V. Voroninski. An elementary proof of convex phase retrieval in the natural parameter space via the linear program PhaseMax. 2016.
[17] P. Hand and V. Voroninski. Corruption robust phase retrieval via linear programming. Dec 2016.
[18] S. Marchesini, Y.-C. Tu, and H.-T. Wu. Alternating projection, ptychographic imaging and phase synchronization. Applied and Computational Harmonic Analysis, 41(3):815–851, 2016.
[19] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. IEEE Transactions on Signal Processing, 63(18):4814–4826, Sept 2015.
[20] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab., 18:9 pp., 2013.
[21] Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: A contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, May 2015.
[22] M. Soltanolkotabi. Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization. Feb 2017.
[23] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2379–2383, July 2016.
[24] T. Tao and V. Vu. Random matrices: the distribution of the smallest singular values. Geometric and Functional Analysis, 20(1):260–297, 2010.
[25] R. van Handel. Probability in high dimension. Technical report, Princeton University, 2014.
[26] I. Waldspurger. Phase retrieval with random Gaussian sensing vectors by alternating projections. IEEE Transactions on Information Theory, 64(5):3301–3312, May 2018.
[27] I. Waldspurger, A. d'Aspremont, and S. Mallat. Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1):47–81, 2015.
[28] G. Wang and G. Giannakis. Solving random systems of quadratic equations via truncated generalized gradient flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 568–576. Curran Associates, Inc., 2016.
[29] H. Zhang, Y. Chi, and Y. Liang. Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 1022–1031. JMLR.org, 2016.
[30] H. Zhang and Y. Liang. Reshaped Wirtinger flow for solving quadratic system of equations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.