Constructing Confidence Intervals for the Signals in Sparse Phase Retrieval
Yisha Yao ∗ Rutgers University
Abstract
In this paper, we provide a general methodology to draw statistical inferences on individual signal coordinates, or linear combinations of them, in sparse phase retrieval. Given an initial estimator for the target parameter (some simple function of the signal), which is generated by some existing algorithm, we modify it in such a way that the modified version is asymptotically normal and unbiased. Confidence intervals and hypothesis tests can then be constructed based on this asymptotic normality. For conciseness, we focus on confidence intervals in this work, while a similar procedure can be adopted for hypothesis testing. Under some mild assumptions on the signal and sample size, we establish theoretical guarantees for the proposed method. These assumptions are generally weak in the sense that the dimension may exceed the sample size and many nonzero small coordinates are allowed. Furthermore, the theoretical analysis reveals that the modified estimators for individual coordinates have uniformly bounded variance, and hence simultaneous interval estimation is possible. Numerical simulations in a wide range of settings are supportive of our theoretical results.
Keywords:
Sparse phase retrieval; statistical inference; confidence interval; high dimension.
∗ The research of Yisha Yao is partially supported by NSF grants DMS-1721495 and IIS-1741390.

1 Introduction

The problem of recovering a signal from its transformed measurements, referred to as phase retrieval, is fundamental in various applications, including optical imaging, X-ray crystallography, speech recognition, etc. [28]. It can be formulated as model (1.1):

y_j = |x_j^* β|² + ε_j,  j = 1, ..., n,    (1.1)

where ε_j is a random noise with mean zero, x_j, β ∈ C^p or R^p, and x_j^* denotes the conjugate transpose of x_j. Given the noise-contaminated squared magnitudes y_j and the design vectors x_j, we need to recover the signal β. The x_j could be Fourier basis vectors, Gaussian vectors, or other sensing vectors, depending on the specific scenario. The problem is difficult because phase information is totally lost in the data-acquisition process. Extensive literature is available on the theory and algorithms for estimating β.

The early-stage algorithms pioneered by Gerchberg and Saxton [15] and extended by Fienup [14] start with an arbitrary guess, then refine it by transforming back and forth between the signal domain and the Fourier domain until all the constraints are satisfied. Since the violation between the iterate and the a priori knowledge is monotonically non-increasing, this type of algorithm gets the name error reduction algorithms [13]. Such a scheme is equivalent to alternating projections onto nonconvex sets [21, 1], but its convergence nature remains unknown. Besides, error reduction algorithms rely heavily on prior information about the signal. Following the spirit of the Gerchberg-Saxton algorithm, alternating minimization was recently proposed [24]. It divides the data into a number of independent parts and uses a new part in each minimization step. Nevertheless, this strategy is of little practical value.

In most of the literature, phase retrieval is translated into a nonconvex minimization problem with various objective functions, e.g., (1.2)-(1.4):

minimize_{b ∈ C^p/R^p}  f(b) = (1/(4n)) Σ_{j=1}^n ( |x_j^* b|² − y_j )²    (1.2)

minimize_{b ∈ C^p/R^p}  f(b) = (1/(4n)) Σ_{j=1}^n ( |x_j^* b| − φ_j )²,  where φ_j = |x_j^* β| + ε_j    (1.3)

minimize_{b ∈ C^p/R^p}  f(b) = (1/(4n)) Σ_{j=1}^n | |x_j^* b|² − y_j |    (1.4)

Existing methods for solving (1.2)-(1.4) can be categorized into convex-optimization-type and gradient-descent-type. The former is based on Shor's convex relaxation [3]. It relaxes (1.2) to a convex minimization problem (see below) and solves this convex problem via semidefinite programming (SDP) [8, 6, 30]:

minimize_B  trace(B)
subject to  B ⪰ 0,
            y_j = trace(x_j x_j^* B),  j = 1, ..., n,

where B = bb^*. Under noiseless Gaussian designs, SDP achieves exact recovery with sample size O(p) [6]. Later, a modified version of SDP was proposed with the trace norm replaced by a reweighted trace norm, which is equivalent to minimizing a log-det function [7, 12]. Despite its reasonable performance and theoretical guarantees, SDP is computationally expensive because it optimizes over p² variables.

The second category comprises various types of gradient descent methods. The "Wirtinger flow" algorithm [5] targets the objective function (1.2). It obtains the starting point by spectral initialization, and refines the iterates via Wirtinger derivatives. A sample size O(p log p) is claimed to guarantee reasonable accuracy.
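For concreteness, the real Gaussian instance of model (1.1) and the loss (1.2), studied later in this paper, can be simulated in a few lines. The sketch below is illustrative only; the helper names and parameter values are ours, not part of any reference implementation:

```python
import numpy as np

def simulate_phase_retrieval(n, p, s, sigma, rng):
    """Generate (X, y) from model (1.1) with a real s-sparse signal."""
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = rng.standard_normal(s)      # nonzero coordinates
    X = rng.standard_normal((n, p))             # rows x_j ~ N(0, I_p)
    eps = sigma * rng.standard_normal(n)        # mean-zero noise
    y = (X @ beta) ** 2 + eps                   # y_j = |x_j^T beta|^2 + eps_j
    return X, y, beta

def objective(b, X, y):
    """Empirical loss f(b) in (1.2)."""
    n = len(y)
    return np.sum(((X @ b) ** 2 - y) ** 2) / (4 * n)

rng = np.random.default_rng(0)
X, y, beta = simulate_phase_retrieval(n=3000, p=1000, s=50, sigma=1.0, rng=rng)
print(objective(beta, X, y), objective(-beta, X, y))  # equal: the global sign is lost
```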
The "truncated Wirtinger flow" [9] eliminates abnormal data points generated during the process to obtain a more reliable starting point as well as to control the search direction. It exhibits more stable performance than the plain Wirtinger flow algorithm while improving the sample complexity to O(p). The "truncated amplitude flow" algorithm [31] targets the objective function (1.3). It is also a two-stage procedure, with orthogonality-promoting initialization followed by regularized gradient descent [31]. During its gradient descent stage, the signs of the components x_j^* b are scrutinized to ensure a correct search direction. In a recent paper [11], the objective function (1.4) is transformed into the composition of a convex function and a smooth function, and the smooth function is further approximated by a linear function. The resulting objective function is convex and amenable to gradient descent. This method has slightly broader applicability than other methods because it works with certain complex design vectors besides Gaussian designs. We do not give a complete bibliography here due to the vast amount of literature on this topic.

In many real-world applications, the signal has few nonzero coordinates, and far fewer measurements than the dimension of the signal are available. Phase retrieval in such a context is referred to as sparse phase retrieval. The community has shown extensive interest in sparse phase retrieval during the past two decades [10]. Many of the algorithms for sparse phase retrieval are obtained by modifying existing algorithms for the non-sparse case. For example, certain norm regularizations are added to the trace function in SDPs to promote sparsity [25, 22, 26], and a thresholding step is incorporated into each iteration of the gradient-descent-type algorithms [4, 32]. These modifications do not work for Fourier phase retrieval due to the ambiguity of translation and conjugate reflection. A novel method proposed in [18] demonstrated good performance in sparse Fourier phase retrieval. It first estimates the support via the autocorrelation function, and then solves an SDP over the support. Another recent algorithm is based on greedy local search [27]. It updates the signal support by interchanging the on-support coordinate with the smallest gradient value and the off-support coordinate with the largest gradient value. The objective function is also updated accordingly in each iteration. There are some other established methods for (sparse) phase retrieval [19, 16, 17], which we will not elaborate on here to avoid unnecessary details.

Despite such intensive study of algorithms for phase retrieval and sparse phase retrieval, statistical inference about the signal is rarely touched. All the foregoing methods merely generate a point estimator for β and establish its convergence rate. No statistical inferences can be drawn on β, or a function of β, based on these point estimators. Yet in many real-life applications, statistical inferences on the sparse signal are very much desired. For example, researchers might seek the 95% confidence interval of a certain coordinate β_k in order to adjust the receiver bandwidth. One major obstacle that thwarts statistical inference in sparse phase retrieval is that the estimators generated by these algorithms cannot be written as an explicit function of the data.
Thus, their sampling distributions, or asymptotic distributions, are in general intractable.

In this paper, we propose a general method to construct confidence intervals for some simple function θ(β) of the signal, e.g., β_k. We also show that the resulting confidence interval approximately attains the preassigned coverage probability when the sample size satisfies n ≫ (log p)²/s. Suppose we have an estimator for θ(β) available which is asymptotically normal with mean θ(β). Then confidence intervals can easily be built based on this estimator and its asymptotic normality. Therefore, the key is to construct such an estimator, which we obtain as follows. First, we pick an initial estimator output by some existing phase-retrieval algorithm; second, we modify the initial estimator in such a way that the resulting estimator possesses all the desired properties (asymptotically normal and centered at the true θ(β)). The choice of the initial estimator will be discussed in Section 2. This method is inspired by the "debiased LASSO" introduced in high-dimensional linear regression [34, 29, 20, 2]. The LASSO is a shrinkage/thresholded estimator and hence biased. By adding a bias-correction term to the initial LASSO estimator, the authors obtain an asymptotically unbiased and normally distributed estimator, the debiased LASSO. Similarly, in the case of sparse phase retrieval, existing algorithms always generate biased estimators because they are designed to promote sparsity. We will eliminate the bias of the chosen initial estimator by adding a correction term different from that of the debiased LASSO. The correction term is the core of our method, and will be derived in Section 2. To the best of our knowledge, this is the first effort on statistical inference in the realm of sparse phase retrieval.

We organize the rest of this paper as follows. Section 2 elaborates our methodology and explains the rationale behind it. Section 3 presents the main theoretical guarantees for our method. Section 4 displays its empirical performance. Further discussions and perspectives are left to Section 5. And we leave all the proofs and technicalities to Section 6.

2 Methodology

Throughout the rest of the paper, we carry out the discussion based on model (1.1) and objective function (1.2). Wherever f(·) appears, it means the function in (1.2). However, the idea extends naturally to other phase retrieval models and objective functions. For conciseness, we focus on the real case, and consider model (1.1) with β ∈ R^p, ‖β‖₀ = s, the x_j i.i.d. N(0, I_p), and ε₁, ε₂, ..., ε_n ∈ R i.i.d. Gaussian noise N(0, σ²). Given the measurements y and the design matrix X, we aim to construct confidence intervals with approximately preassigned coverage probabilities for a one-dimensional parameter θ = θ(β) = β_k, one coordinate of the signal.

This work is motivated by the celebrated debiasing techniques [33, 34, 29, 20, 2] proposed for high-dimensional linear regression and the various phase retrieval algorithms described in Section 1. Provided an initial estimator for β_k, which is biased and whose sampling distribution is intractable, we correct its bias so that the resulting estimator is asymptotically normally distributed and centered at β_k.
Confidence intervals can then be constructed based on this asymptotic normality. Although we restrict our discussion to the sparse phase retrieval problem, the method is applicable to any M-estimation problem in which (i) the Hessian matrix exists in a large enough neighborhood of the global maximizer and is invertible, (ii) θ(β) is differentiable almost everywhere, and (iii) a good enough initial estimator is available.

2.1 The initial estimator

Thresholded Wirtinger Flow (TWF) was recently proposed for sparse phase retrieval [4]. It first generates a starting point by spectral initialization and then applies thresholded gradient descent to refine it. We choose the TWF output as our initial estimator because it is claimed to achieve the optimal minimax rate of convergence in the sparse phase retrieval setting [4]. More specifically, the authors showed that with high probability the TWF estimator at iteration t, β̃^(t), falls within a tiny ball centered at β (Theorem A.1). Written in mathematical formulae,

inf_{‖β‖₀=s} P_{(X,y|β)} { min_{i=0,1} ‖β̃^(t) − (−1)^i β‖ ≤ (1/6)(1 − µ/16)^t ‖β‖ + C₁ (σ/‖β‖) √(s log p / n) } > 1 − 25/n − 5e^{−s} − t/(np)    (2.1)

for some absolute constant C₁ > 0, provided the sample size satisfies n ≥ K(1 + σ/‖β‖²)² s² log(np) for some absolute constant K > 0. Here µ is the step size of the gradient descent, which can be regarded as an absolute constant once it is decided. If

t ≍ log( ‖β‖²√n / (σ√(s log p)) ),

one can obtain from the preceding result that with high probability

min_{i=0,1} ‖β̃^(t) − (−1)^i β‖ ≲ (σ/‖β‖) √(s log p / n).    (2.2)

This error rate is crucial for our bias-correction scheme to work, as will be revealed in the proofs.
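For intuition, the refinement stage can be sketched as a gradient step on (1.2) followed by coordinatewise soft-thresholding. This is only a schematic of the thresholded-gradient idea; the actual TWF algorithm of [4] uses data-driven thresholds and specific tuning that we do not reproduce here, and mu, tau, iters below are placeholder values:

```python
import numpy as np

def soft_threshold(v, t):
    """Coordinatewise soft-thresholding, the sparsity-promoting step."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def twf_refine(b0, X, y, mu=0.1, tau=0.05, iters=200):
    """Thresholded gradient descent on f in (1.2), starting from b0.

    mu, tau, iters are illustrative tuning values; [4] derives
    data-driven choices with provable guarantees."""
    n = X.shape[0]
    b = b0.copy()
    for _ in range(iters):
        r = (X @ b) ** 2 - y                    # residuals of squared measurements
        grad = (X.T @ (r * (X @ b))) / n        # gradient of (1.2)
        b = soft_threshold(b - mu * grad, tau)  # gradient step + thresholding
    return b
```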
2.2 Bias correction

In this subsection, we explain the bias-correction procedure in detail. Suppose we are interested in a one-dimensional parameter θ = θ(β), which is a continuously differentiable function of β. In our particular case, θ = e_k^T β = β_k, k = 1, 2, ..., p. The TWF solution, denoted by β̃, is biased due to the shrinkage nature of TWF, and so is θ(β̃). To correct the bias of θ(β̃), we adopt the idea of the Low-Dimensional Projection Estimator (LDPE) proposed in [33]. The authors consider a more general semi-low-dimensional approach, where a high-dimensional (HD) model is decomposed as

HD model = LD component + HD component.
In our problem, such a decomposition amounts to

β − β̃ = u(β̃) ( θ(β) − θ(β̃) ) + Q(β̃)(β − β̃).    (2.3)

Here u(β̃) is the least favorable one-dimensional sub-model at β̃, normalized such that ∇θ(β̃)^T u(β̃) = 1. It is called least favorable because it gives the minimum Fisher information for estimating θ. The formula for u(β̃) is

u(β̃) = I(β̃)⁻¹ ∇θ(β̃) / ( ∇θ(β̃)^T I(β̃)⁻¹ ∇θ(β̃) ),

where I(β̃) is the Fisher information at β̃,

I(β̃) := −E_β̃ [ ∂²l(X, y | b) / ∂b ∂b^T ]_{b=β̃},

while Q(β̃) projects β − β̃ onto a space of nuisance parameters. To simplify the notation, let ũ = u(β̃). The LDPE searches, in the least favorable direction ũ, for a parameter value that maximizes the likelihood of the observed data sample:

θ̂ = θ(β̃) + arg max_d Σ_{j=1}^n l(x_j, y_j | β̃ + ũd) = θ(β̃) + arg min_d f(β̃ + ũd).    (2.4)

The second equality above holds because under model (1.1) with Gaussian noise,

Σ_{j=1}^n l(x_j, y_j | b) ∝ −(1/(4n)) Σ_{j=1}^n ( y_j − (x_j^T b)² )² = −f(b).

The score equation for the minimization problem in (2.4) is

ũ^T ∇f(β̃ + ũ d̂) = 0,

where d̂ = arg min_d f(β̃ + ũd). By Taylor expansion,

∇f(β̃ + ũ d̂) − ∇f(β̃) ≈ d̂ [ ∂²f(b)/∂b∂b^T ]_{b=β̃} ũ ≈ d̂ I(β̃) ũ = d̂ ∇θ(β̃) / ( ∇θ(β̃)^T I(β̃)⁻¹ ∇θ(β̃) ).

Thus we have

0 = ũ^T ∇f(β̃ + ũ d̂) ≈ ũ^T ∇f(β̃) + d̂ ũ^T ∇θ(β̃) / ( ∇θ(β̃)^T I(β̃)⁻¹ ∇θ(β̃) ) = ũ^T ∇f(β̃) + d̂ / ( ∇θ(β̃)^T I(β̃)⁻¹ ∇θ(β̃) ),

which solves to

d̂ ≈ −∇θ(β̃)^T I(β̃)⁻¹ ∇f(β̃).    (2.5)

Plugging (2.5) into (2.4), our corrected estimator is given by

θ̂ ≈ θ(β̃) − ∇θ(β̃)^T I(β̃)⁻¹ ∇f(β̃).    (2.6)

Later we will apply the bias-corrected estimator (2.6) to the case θ(β) = e_k^T β = β_k, and demonstrate some nice properties of this estimator. For consistency, hereinafter we denote the corrected estimator by β̂ and the initial estimator by β̃.

Before elaborating on the analysis, let us acknowledge one fact. The exact sign of an individual β_k cannot be recovered, because the measurements only provide magnitude information (|x_j^T β|² contaminated with noise). If β̃ is a solution to the minimization problem (1.2), then −β̃ is also a solution. The signal β can be recovered only up to a global sign. Given an initial estimator β̃, we define β∗ to be whichever of {β, −β} is closer to β̃:

β∗ := { β    if ‖β − β̃‖ ≤ ‖−β − β̃‖;
        −β   otherwise. }

We are only interested in gaining information about β∗. This is the best we can do under model (1.1), in the sense that the global sign can never be retrieved.

To prepare for the proofs in Section 6, we derive some formulae and simplify a few notations here. The Fisher information matrices at β∗ and β̃ are

I(β∗) = E[ 2(x_j^T β∗)² x_j x_j^T − ε_j x_j x_j^T ] = 2( ‖β∗‖² I + 2β∗β∗^T ),
I(β̃) = E[ 2(x_j^T β̃)² x_j x_j^T − ε_j x_j x_j^T ] = 2( ‖β̃‖² I + 2β̃β̃^T ).    (2.7)

In our specific context, θ(β∗) = e_k^T β∗ and ∇θ(β∗) = e_k.
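The Fisher information matrices in (2.7) are rank-one perturbations of scaled identities, so their inverses have closed forms. For completeness, a short derivation via the Sherman–Morrison formula (the extra factor 2 in (2.7) only contributes a factor 1/2 to the inverse):

```latex
\left(\|\beta\|^2 I + 2\beta\beta^T\right)^{-1}
  = \frac{1}{\|\beta\|^2}\left(I + \frac{2\beta\beta^T}{\|\beta\|^2}\right)^{-1}
  = \frac{1}{\|\beta\|^2}\left(I - \frac{2\beta\beta^T/\|\beta\|^2}{1 + 2\beta^T\beta/\|\beta\|^2}\right)
  = \frac{1}{\|\beta\|^2}\left(I - \frac{2}{3}\,\frac{\beta\beta^T}{\|\beta\|^2}\right).
```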
If we define

w_k := w(β∗, θ = β_k) = −I(β∗)⁻¹ ∇θ(β∗) = −[ 2(‖β∗‖² I + 2β∗β∗^T) ]⁻¹ e_k = −(1/(2‖β∗‖²)) ( I − (2/3) β∗β∗^T/‖β∗‖² ) e_k,    (2.8)

and

w̃_k := w(β̃, θ = β_k) = −I(β̃)⁻¹ ∇θ(β̃) = −[ 2(‖β̃‖² I + 2β̃β̃^T) ]⁻¹ e_k = −(1/(2‖β̃‖²)) ( I − (2/3) β̃β̃^T/‖β̃‖² ) e_k,    (2.9)

the estimator (2.6) takes the simpler form

β̂_k = β̃_k + w̃_k^T ∇f(β̃).    (2.10)

Remark 2.1. Equation (2.7) is derived under the assumption that β̃ is independent of the design vectors {x_j}_{j=1,...,n}. This assumption is realistic in a certain specially-designed procedure. For example, we can randomly split the i.i.d. data (x_j, y_j)_{j=1,...,n} into two parts, use the first part to obtain β̃, and use the second part to construct the bias-correction term. The detailed procedure is addressed in Section 2.3.

2.3 Data splitting and swapping

In the analysis so far, we implicitly assume that β̃ is independent of the design vectors x_j and the noises ε_j, which especially eases the derivation of (2.7). However, this assumption fails if β̃ is obtained from the same set of design vectors x_j and measurements y_j by the TWF iterations. To validate the independence between β̃ and ({x_j}_j, {ε_j}_j), we design a three-step procedure: first, split the whole data set randomly into two parts, use the first part of the data (X₁, y₁) to obtain β̃₁ via TWF, and use the second part of the data (X₂, y₂) to debias β̃₁; second, swap the two parts of the data, using the second part to generate β̃₂ and the first part to debias β̃₂; finally, combine the two debiased estimators. Theoretically, the sample size required to obtain a good initial estimator β̃ differs from the sample size required for efficient bias-correction. But we set the two parts to be of equal size n, since the two parts will be swapped and used for both purposes.
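In code, the single-split construction of (2.10) is a one-liner once w̃_k and ∇f(β̃) are available. A minimal sketch, with our own function names and assuming the data-generation and TWF helpers sketched in Sections 1 and 2.1:

```python
import numpy as np

def w_tilde(b, k):
    """Closed form of (2.9): w_k(b) = -(1/(2||b||^2)) (e_k - (2/3) b_k b / ||b||^2)."""
    nrm2 = b @ b
    e_k = np.zeros_like(b)
    e_k[k] = 1.0
    return -(e_k - (2.0 / 3.0) * b[k] * b / nrm2) / (2.0 * nrm2)

def grad_f(b, X, y):
    """Gradient of (1.2) evaluated on the half-sample (X, y)."""
    n = X.shape[0]
    Xb = X @ b
    return X.T @ ((Xb ** 2 - y) * Xb) / n

def debias_coordinate(b_init, X2, y2, k):
    """Bias-corrected estimator (2.10): beta_hat_k = beta_tilde_k + w_k^T grad f,
    with the gradient computed on the independent second half (X2, y2)."""
    return b_init[k] + w_tilde(b_init, k) @ grad_f(b_init, X2, y2)
```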
3 Theoretical Results

In this section, we establish the asymptotic normality of the bias-corrected estimator (2.10). The confidence interval based on this asymptotic distribution is shown to achieve roughly the preassigned coverage probability. Furthermore, the theoretical analysis reveals that the confidence interval has sharp width and attains approximately the preassigned coverage probability simultaneously for all β_k. Note that we assume σ is known, because in many applications people have prior knowledge about the noise level. Even if the noise level is unknown, there are efficient ways to estimate it, for example, the method proposed in [4].

Theorem 3.1 provides theoretical guarantees for the data-splitting scheme described below. The i.i.d. data generated from model (1.1) are randomly split into two halves (X₁, y₁), (X₂, y₂). We use (X₁, y₁) to obtain the initial estimator and (X₂, y₂) to correct the bias. To fully extract the information contained in the data, we further apply the data-swapping scheme, whose theoretical properties are given in Theorem 3.2. More precisely, it goes through two rounds of construction, generating a bias-corrected estimator in each round. In the first round, we obtain an initial estimator from (X₁, y₁) and correct its bias using (X₂, y₂). In the second round, the initial estimator is generated from (X₂, y₂) while (X₁, y₁) is used to correct the bias. Finally, we combine the two estimators in a way that the resulting estimator has the smallest asymptotic variance. Each half of the data has size n.

Before stating the theorems, we introduce some global assumptions. Suppose the design matrix and the true signal satisfy (3.1)-(3.3):

n ≥ K(1 + σ/‖β∗‖²)² s² log(np),  K an absolute constant;    (3.1)

log p / √(ns) = o(1);    (3.2)

‖β∗‖ = O(√s).    (3.3)

Assumption (3.1) guarantees the quality of the initial estimator, the TWF estimator, as stated in [4]. For the sake of bias-correction, assumptions (3.2) and (3.3) suffice. If s ≪ log p, assumptions (3.2) and (3.3) would imply assumption (3.1). However, we do not impose any restrictions on s here. Our method can work in cases where the signal sparsity is weak, i.e., there could be many nonzero small coordinates.

Throughout the rest of the paper, let

ε'_n = C₁ (log p)^{3/2}/(sn) + C₂ (log p)²/(sn)^{3/2} + C₃ log p/(sn)^{1/2} + C₄ √n/p    (3.4)

for some absolute constants C₁, C₂, C₃, C₄, and let

ε''_n = 50/n + 10e^{−s} + M log n/(np)    (3.5)

for some absolute constant M. It is easy to see that ε'_n, ε''_n → 0.

Theorem 3.1 If we randomly split the data into two halves, using the first half to obtain the TWF estimator with enough iterations such that ‖β̃ − β∗‖ ≤ C (σ/‖β∗‖) √(s log p / n), and the second half to construct the bias-correction term, then

P{ | √n(β̂_k − β∗_k) − Z_k | ≤ ε'_n } ≥ 1 − ε''_n    (3.6)

with

Z_k = −(1/√n) Σ_{j=1}^n ε_j (x_j^T β̃)(x_j^T w̃_k),    (3.7)

where β̂_k, w̃_k, ε'_n, ε''_n are given by (2.10), (2.9), (3.4), (3.5), respectively.
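The limiting variance of Z_k, computed in Remark 3.1 below, follows from the Gaussian fourth-moment (Isserlis) identity: for x ~ N(0, I_p) and fixed vectors u, v,

```latex
\mathbb{E}\,(x^T u)^2 (x^T v)^2 = \|u\|^2\|v\|^2 + 2\,(u^T v)^2,
\qquad\text{hence}\qquad
\operatorname{Var}(Z_k \mid \tilde\beta)
  = \sigma^2\left(\|\tilde\beta\|^2\|\tilde w_k\|^2
      + 2(\tilde\beta^T\tilde w_k)^2\right)
  = \sigma^2 \tau_k^2 .
```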
Z k haslimiting distribution N (0 , σ τ k ), where τ k = (cid:107) ˜ β (cid:107) (cid:107) ˜ w k (cid:107) + 2( ˜ β T ˜ w k ) = (cid:107) ˜ β (cid:107) (cid:107) ˜ w k (cid:107) + 2 (cid:18) − ˜ β k (cid:107) ˜ β (cid:107) + ˜ β k (cid:107) ˜ β (cid:107) (cid:19) = (cid:107) ˜ β (cid:107) (cid:107) ˜ w k (cid:107) + ˜ β k (cid:107) ˜ β (cid:107) ≤ (cid:18) (cid:107) β ∗ (cid:107) + (cid:107) ˜ β − β ∗ (cid:107) (cid:19) (cid:20) (cid:107) β ∗ (cid:107) + O ( (cid:107) ˜ β − β ∗ (cid:107) (cid:107) β ∗ (cid:107) ) (cid:21) + 118 (cid:107) ˜ β (cid:107) (cid:16) s + O ( (cid:114) log pns )by (3.3), (2.9), (2.2) and (3.19). Such sharp width is uniformly achievable for all coordinates, whichallows the possibility of constructing simultaneous confidence intervals.To fully utilize the data, we further carry out the data-swap procedure. It is composed of two roundsof data-splitting procedures: dividing the data randomly into two halves ( X , y ) and ( X , y ); inthe first round, we obtain the initial estimator ˜ β from ( X , y ), and correct the bias using ( X , y ),resulting in ˆ β k = ˜ β k + ˜ w T k ∇ f ( ˜ β ); in the second round, we obtain the initial estimator ˜ β from( X , y ) and correct the bias using ( X , y ), resulting in ˆ β k = ˜ β k + ˜ w T k ∇ f ( ˜ β ). Here˜ w k := w ( ˜ β , θ = β k ) = − (cid:18) (cid:107) ˜ β (cid:107) I + 2 ˜ β ˜ β T (cid:19) − e k ˜ w k := w ( ˜ β , θ = β k ) = − (cid:18) (cid:107) ˜ β (cid:107) I + 2 ˜ β ˜ β T (cid:19) − e k ∇ f ( ˜ β ) = 1 n n (cid:88) j =1 (cid:2) ( x T j ˜ β ) − y j (cid:3) ( x T j ˜ β ) x j ∇ f ( ˜ β ) = 1 n n (cid:88) j =1 (cid:2) ( x T j ˜ β ) − y j (cid:3) ( x T j ˜ β ) x j Finally, we linearly combine ˆ β k and ˆ β k , resulting in a better estimator ˆ β swapk that contains moreinformation of the data than ˆ β k or ˆ β k alone. 9 heorem 3.2 ( i ) Under the global assumptions (3.1)-(3.3), the estimator ˆ β swapk = τ k τ k + τ k · (cid:32) ˜ β k + ˜ w T k ∇ f ( ˜ β ) (cid:33) + τ k τ k + τ k · (cid:32) ˜ β k + ˜ w T k ∇ f ( ˜ β ) (cid:33) (3.8) achieves the smallest asymptotic variance among all asymptotically unbiased estimators that areconvex combinations of ˆ β k and ˆ β k . Furthermore, P (cid:40) (cid:12)(cid:12)(cid:12)(cid:12) √ n ( ˆ β swapk − β ∗ k ) − (cid:18) τ k τ k + τ k Z k + τ k τ k + τ k Z k (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:15) (cid:48) n (cid:41) ≥ − (cid:15) (cid:48)(cid:48) n (3.9) where Z k = 1 √ n n (cid:88) j =1 ε j ( x T j ˜ β )( x T j ˜ w k ) Z k = 1 √ n n (cid:88) j =1 ε j ( x T j ˜ β )( x T j ˜ w k ) τ k = (cid:107) ˜ β (cid:107) (cid:107) ˜ w k (cid:107) + 2( ˜ β T ˜ w k ) τ k = (cid:107) ˜ β (cid:107) (cid:107) ˜ w k (cid:107) + 2( ˜ β T ˜ w k ) ( ii ) P (cid:40) (cid:12)(cid:12)(cid:12) ˆ β swapk − β ∗ k (cid:12)(cid:12)(cid:12) ≤ σr √ n (cid:115) τ k τ k τ k + τ k + O ( 1 p ) + (cid:15) (cid:48) n √ n (cid:41) ≥ r ) − − (cid:15) (cid:48)(cid:48) n − n (3.10) ( iii ) lim inf n →∞ P (cid:40) max k ∈ [ p ] (cid:12)(cid:12)(cid:12) √ n ( ˆ β swapk − β ∗ k ) (cid:12)(cid:12)(cid:12) ≤ (cid:114) s σ Φ − (1 − α (cid:41) ≥ − α (3.11)lim inf n →∞ P (cid:40) (cid:12)(cid:12)(cid:12) h T ( ˆ β swap − β ∗ ) (cid:12)(cid:12)(cid:12) ≤ (cid:114) σ n χ p,α h T V h (cid:41) ≥ − α f or ∀ h (cid:54) = 0 , h ∈ R p . (3.12) where V kl = a k a l (cid:20) (cid:107) ˜ β (cid:107) ( ˜ w T k ˜ w l )+2( ˜ β T ˜ w k )( ˜ β T ˜ w l ) (cid:21) +(1 − a k )(1 − a l ) (cid:20) (cid:107) ˜ β (cid:107) ( ˜ w T k ˜ w l )+2( ˜ β T ˜ w k )( ˜ β T ˜ w l ) (cid:21) a k = τ k τ k + τ k , a l = τ l τ l + τ l . Remark . 
Note that, asymptotically, τ₁k², τ₂k² ≍ 1/s for all k = 1, ..., p by Remark 3.1. Hence τ₁k²τ₂k²/(τ₁k²+τ₂k²) ≤ max{τ₁k², τ₂k²}/2 ≲ 1/(2s). The asymptotic variance of β̂^swap_k is shrunk by a factor of 2 compared to that of β̂₁k or β̂₂k. The uniformly bounded variance of √n(β̂^swap_k − β∗_k) for all k in (3.11) allows a Bonferroni adjustment to control the familywise error rate in simultaneous interval estimation, while (3.12) provides the theoretical guarantee for Scheffé-type simultaneous confidence intervals.
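A sketch of the two-round construction (3.8) and the interval of Theorem 3.2 (ii), ignoring the asymptotically negligible O(1/p) and ε'_n/√n terms as is done in Section 4 below; tau_sq, w_tilde and debias_coordinate are the illustrative helpers from Section 2, not a reference implementation:

```python
import numpy as np
from scipy.stats import norm

def tau_sq(b, w):
    """tau_k^2 = ||b||^2 ||w||^2 + 2 (b^T w)^2."""
    return (b @ b) * (w @ w) + 2.0 * (b @ w) ** 2

def swap_ci(k, b1, b2, X1, y1, X2, y2, sigma, alpha=0.05):
    """Combine the two debiased estimates with inverse-variance weights (3.8)
    and return the confidence interval of Theorem 3.2 (ii)."""
    n = X1.shape[0]                             # each half has size n
    hat1 = debias_coordinate(b1, X2, y2, k)     # round 1: fit on half 1, debias on half 2
    hat2 = debias_coordinate(b2, X1, y1, k)     # round 2: roles swapped
    t1 = tau_sq(b1, w_tilde(b1, k))
    t2 = tau_sq(b2, w_tilde(b2, k))
    est = (t2 * hat1 + t1 * hat2) / (t1 + t2)   # weights tau2^2 and tau1^2
    half_width = sigma * norm.ppf(1 - alpha / 2) / np.sqrt(n) \
        * np.sqrt(t1 * t2 / (t1 + t2))
    return est - half_width, est + half_width
```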
4 Numerical Simulation

In this section, we implement our method in a variety of settings to assess its empirical performance. Moreover, by comparing the results across different combinations of sparsity (s), sample size (2n), and noise-to-signal ratio (NSR), we get a general idea of how the performance depends on those factors. Throughout our simulations, the signal dimension is p = 1000 and all the tuning parameters in the TWF algorithm are fixed. For each choice of (n, s, NSR), we generate the signal β by randomly picking the support and assigning the nonzero coordinates i.i.d. N(0, 1). Given β, the following procedure is repeated independently 100 times: first, generate 2n random vectors x_j i.i.d. N(0, I), 2n noises ε_j i.i.d. N(0, σ²), and the measurements y_j by (1.1); second, obtain the TWF estimator using the whole dataset (X, y) and record the errors β̃_k − β∗_k of four large, four median, and four small coordinates (grouped by the magnitude of |β∗_k|); third, construct the debiased estimator β̂^swap and record the errors β̂^swap_k − β∗_k of the same large/median/small coordinates, respectively. Every summary statistic in Table 1 comes from a pool of 400 errors, and each histogram in Figures 1-3 presents the distribution of a pool of 400 errors.

The performance of our method is assessed in several respects, including bias (Table 1), variance (Table 1), asymptotic normality (Figures 1-3), and coverage probability (Table 2).

Judging from Table 1, debiased TWF achieves close-to-zero average bias at the cost of slightly larger variance than TWF in all settings except when the sparsity s is too large (s=200) or the sample size 2n is too small (n=2000). The reason behind this phenomenon could be as follows. The bias-correction term, which has nonzero mean, is supposed to neutralize the bias of TWF. Though it brings extra variance, the amount of this extra variance is negligible as long as n is large enough. Nevertheless, when n is small or s is large, the bias-correction term does not concentrate tightly around its mean: instead of neutralizing the bias of TWF, it adds extra bias and much larger variance. Table 1 also exhibits a parallel trend between TWF and debiased TWF: their bias and variance get larger as s or NSR increases, or as n decreases. This is because the performance of debiased TWF depends on the quality of TWF, while TWF gets worse as s or NSR increases or n decreases, as demonstrated in the original paper [4]. Although there are two abnormal cases (s=200 and n=2000) in Table 1 where the debiased TWF fails, our main theory (Theorem 3.2) is not violated, because the (s, n, p) in these cases deviate substantially from assumptions (3.1) and (3.2).

Figures 1, 2, and 3 demonstrate the unbiasedness and approximate normality of the debiased TWF in all settings. Figure 1 explores the relationship between the sparsity s and the quality of debiased TWF by fixing p=1000, n=3000, NSR=0.3 while varying s=50, 100, 150, 200 from left to right. The normality and unbiasedness hold better for small s. Besides, the distributions spread wider as s increases in every row. Such a phenomenon is not surprising, because the asymptotic unbiasedness and normality of β̂^swap_k rely on √n‖∆‖²/‖β∗‖ being asymptotically negligible. When s increases to an extent that the assumption is violated, the quality of β̃ drops and ‖∆‖ = ‖β̃ − β∗‖ cannot be controlled.
Similarly, the errors of the debiased TWF are distributed more like a Gaussian and spread narrower as n increases (Figure 2) or as NSR decreases (Figure 3).

                         large coordinates     median coordinates    small coordinates
n      s    NSR          TWF      de-TWF       TWF      de-TWF       TWF      de-TWF
3000   50   0.3   bias   0.0108   0.0016       0.0421   0.0017      -0.0607   0.0020
                  sd     0.0174   0.0203       0.0186   0.0216       0.0188   0.0226
                  mae    0.0144   0.0135       0.0415   0.0155       0.0604   0.0151
3000  100   0.3   bias   0.0367  -0.0061       0.0573  -0.0054      -0.0872   0.0054
                  sd     0.0295   0.0838       0.0291   0.0540       0.0173   0.0513
                  mae    0.0375   0.0316       0.0567   0.0311       0.0987   0.0353
3000  150   0.3   bias  -0.0370  -0.0359      -0.0928   0.0475       0.0975   0.0025
                  sd     0.0483   0.1495       0.0498   0.1287       0.0084   0.1346
                  mae    0.0442   0.0838       0.0921   0.0992       0.1000   0.0939
3000  200   0.3   bias   0.1007   0.2577       0.1197  -0.0474      -0.0989   0.0153
                  sd     0.0679   0.3521       0.0681   0.2335       0.0068   0.2175
                  mae    0.0976   0.2857       0.1209   0.1630       0.1000   0.1523
2000  100   0.3   bias   0.0481  -0.0271       0.0844  -0.0128      -0.0972  -0.0143
                  sd     0.0457   0.4064       0.0475   0.2029       0.0091   0.1421
                  mae    0.0503   0.1264       0.0840   0.1179       0.1000   0.0986
4000  100   0.3   bias  -0.0216  -0.0021      -0.0499   0.0027       0.0780  -0.0044
                  sd     0.0247   0.0311       0.0252   0.0331       0.0205   0.0350
                  mae    0.0255   0.0223       0.0498   0.0241       0.0798   0.0248
5000  100   0.3   bias   0.0288   0.0018       0.0362  -0.0017      -0.0625   0.0037
                  sd     0.0193   0.0227       0.0192   0.0233       0.0190   0.0236
                  mae    0.0282   0.0156       0.0362   0.0154       0.0624   0.0159
3000  100   0.2   bias  -0.0229   0.0078      -0.0410  -0.0014       0.0662  -0.0050
                  sd     0.0196   0.0848       0.0210   0.0521       0.0209   0.0380
                  mae    0.0242   0.0221       0.0408   0.0251       0.0678   0.0243
3000  100   0.4   bias  -0.0346  -0.0083      -0.0827   0.0140       0.0953  -0.0081
                  sd     0.0425   0.0594       0.0418   0.0585       0.0137   0.0634
                  mae    0.0380   0.0420       0.0841   0.0425       0.1000   0.0412
3000  100   0.5   bias  -0.0462  -0.0076      -0.0980   0.0141       0.0972  -0.0082
                  sd     0.0480   0.0630       0.0507   0.0685       0.0097   0.0707
                  mae    0.0494   0.0445       0.0965   0.0438       0.1000   0.0487

Table 1: Summary statistics of TWF errors and debiased TWF errors for large β_k, median β_k, and small β_k under various simulation settings. "mae" stands for "median absolute error".

(a) Histograms of TWF errors for large β_k (1st row), median β_k (2nd row), small β_k (3rd row).
(b) Histograms of debiased TWF errors for large β_k (4th row), median β_k (5th row), small β_k (6th row).
Figure 1: From left to right, plots correspond to simulation settings (n=3000, s=50, NSR=0.3), (n=3000, s=100, NSR=0.3), (n=3000, s=150, NSR=0.3), (n=3000, s=200, NSR=0.3) with p=1000 fixed.
(a) Histograms of TWF errors for large β_k (1st row), median β_k (2nd row), small β_k (3rd row).
(b) Histograms of debiased TWF errors for large β_k (4th row), median β_k (5th row), small β_k (6th row).
Figure 2: From left to right, plots correspond to simulation settings (n=2000, s=100, NSR=0.3), (n=3000, s=100, NSR=0.3), (n=4000, s=100, NSR=0.3), (n=5000, s=100, NSR=0.3) with p=1000 fixed.

(a) Histograms of TWF errors for large β_k (1st row), median β_k (2nd row), small β_k (3rd row).
(b) Histograms of debiased TWF errors for large β_k (4th row), median β_k (5th row), small β_k (6th row).
Figure 3: From left to right, plots correspond to simulation settings (n=3000, s=100, NSR=0.2), (n=3000, s=100, NSR=0.3), (n=3000, s=100, NSR=0.4), (n=3000, s=100, NSR=0.5) with p=1000 fixed.

When simulating to check the accuracy of Theorem 3.2 (ii), we use the confidence interval

( β̂^swap_k − (σΦ⁻¹(0.975)/√n) √(τ₁k²τ₂k²/(τ₁k²+τ₂k²)),  β̂^swap_k + (σΦ⁻¹(0.975)/√n) √(τ₁k²τ₂k²/(τ₁k²+τ₂k²)) ).

For each setting, a new β is generated, and we repeat constructing the confidence interval 200 times with independently generated (X, ε, y). Note that the theoretical coverage probability is 2 × 0.975 − 1 − ε''_n, in which we count the term ε''_n. Asymptotically, the term ε''_n is negligible, yet in our simulations n ∈ [5000, 7500], so ε''_n is not entirely negligible. Another flaw is that we leave out the term ε'_n/√n in Theorem 3.2 (ii). Again, asymptotically ε'_n/√n is negligible compared with (σr/√n)√(τ₁k²τ₂k²/(τ₁k²+τ₂k²)), but in our simulations these two terms are of the same order. Thus, leaving the term ε'_n/√n out has diminished the coverage probability to a certain extent. Despite this flaw, the results in Table 2 imply that Theorem 3.2 (ii) is informative about the actual performance of the debiased TWF. We can see the general trends: the average coverage probabilities get better as n increases or s decreases, and are not sensitive to σ.

n      s    σ     all coor    large coor    median coor    small coor
5000   40   5     92.863      92.5          94.75          93.25
6000   40   5     93.2885     93.625        92.625         94.375
6000   50   5     92.7765     93            92.625         94.25
7500   50   5     93.4155     93.875        92.875         93.25
5000   40   10    92.912      92.375        94.125         92.375
6000   40   10    93.664      92.75         94.75          93.25
6000   50   10    92.5935     92.25         92.25          91.625
7500   50   10    93.421      93.875        93.875         95.125

Table 2: Mean coverage probabilities (%) of the confidence interval in Theorem 3.2 (ii) under various settings. "coor" is short for "coordinate".
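The coverage experiment of Table 2 can be outlined as follows. This is an illustrative harness built on the earlier sketches; twf_init is a crude stand-in for the spectral initialization of [4], not the initialization actually used there:

```python
import numpy as np

def regenerate(beta, n2, sigma, rng):
    """Fresh (X, eps, y) for a fixed signal beta, per the Table 2 protocol."""
    p = beta.size
    X = rng.standard_normal((n2, p))
    y = (X @ beta) ** 2 + sigma * rng.standard_normal(n2)
    return X, y

def twf_init(X, y):
    """Crude stand-in for the spectral initialization of [4]."""
    n2 = X.shape[0]
    M = (X * y[:, None]).T @ X / n2           # (1/n) sum_j y_j x_j x_j^T
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                           # leading eigenvector
    scale = np.sqrt(max(np.mean(y), 0.0))     # ||beta||^2 is roughly mean(y)
    return scale * v

def coverage_experiment(beta, n, sigma, reps=200, alpha=0.05, seed=0):
    """Monte Carlo coverage of the interval in Theorem 3.2 (ii) for one
    coordinate k on the support, using the earlier illustrative helpers."""
    rng = np.random.default_rng(seed)
    k = int(np.flatnonzero(beta)[0])
    hits = 0
    for _ in range(reps):
        X, y = regenerate(beta, 2 * n, sigma, rng)
        X1, y1, X2, y2 = X[:n], y[:n], X[n:], y[n:]
        b1 = twf_refine(twf_init(X1, y1), X1, y1)
        b2 = twf_refine(twf_init(X2, y2), X2, y2)
        lo, hi = swap_ci(k, b1, b2, X1, y1, X2, y2, sigma, alpha)
        # coverage is assessed for beta*, the global sign closest to the estimate
        target = beta[k] if abs(b1[k] - beta[k]) <= abs(b1[k] + beta[k]) else -beta[k]
        hits += (lo <= target <= hi)
    return hits / reps
```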
5 Discussion

We propose in this work a general approach for drawing statistical inferences on the sparse signal in phase retrieval. A new estimator β̂^swap_k for the individual signal coordinate β_k has been constructed by adding a bias-correction term to the TWF estimator. Under mild assumptions on X and β and the sample size requirement (3.2), β̂^swap_k has an asymptotic Gaussian distribution centered at β_k. This property allows the construction of confidence intervals with approximately preassigned coverage probabilities, as well as hypothesis tests on the signal of interest. Our new estimator can also be used as a point estimator. Compared with the plain TWF estimator, β̂^swap_k achieves asymptotic unbiasedness at the cost of a slightly larger variance.

There remain some open problems. For instance, can we draw statistical inferences on more complicated functions of β, such as a group of coordinates (similar to that in [23]) or a multidimensional-valued function? Can this method be extended to Fourier designs, which are more relevant in applications? The former is difficult since it involves non-convex optimization over matrices. The latter is even more challenging, since the Fourier phase retrieval problem is generally considered not solved, and hence we do not yet have an initial estimator available. In summary, there is still a long way to go before the Fourier phase retrieval problem is solved and related statistical inferences can be drawn.

6 Proofs

Proof of Theorem 3.1.
For the objective function (1.2), we have

∇f(β̃) = (1/n) Σ_{j=1}^n [ (x_j^T β̃)² − y_j ](x_j^T β̃) x_j.

Let ∆ = β̃ − β∗. Then

β̂_k − β∗_k = β̃_k − β∗_k + w̃_k^T ∇f(β̃)
= e_k^T ∆ + (1/n) Σ_{j=1}^n [ (x_j^T β̃)² − y_j ](x_j^T β̃)(x_j^T w̃_k)
= e_k^T ∆ + (1/n) Σ_{j=1}^n [ (x_j^T β̃)² − (x_j^T β∗)² − ε_j ](x_j^T β̃)(x_j^T w̃_k)
= e_k^T ∆ + (1/n) Σ_{j=1}^n (x_j^T ∆)³(x_j^T w̃_k) + (3/n) Σ_{j=1}^n (x_j^T ∆)²(x_j^T w̃_k)(x_j^T β∗)
  + (2/n) Σ_{j=1}^n (x_j^T ∆)(x_j^T w̃_k)(x_j^T β∗)² − (1/n) Σ_{j=1}^n ε_j (x_j^T β̃)(x_j^T w̃_k)
= e_k^T ∆ + (2/n) Σ_{j=1}^n (x_j^T ∆)(x_j^T w_k)(x_j^T β∗)² + (2/n) Σ_{j=1}^n (x_j^T ∆)[ x_j^T (w̃_k − w_k) ](x_j^T β∗)²
  + (1/n) Σ_{j=1}^n (x_j^T ∆)³(x_j^T w̃_k) + (3/n) Σ_{j=1}^n (x_j^T ∆)²(x_j^T w̃_k)(x_j^T β∗) − (1/n) Σ_{j=1}^n ε_j (x_j^T β̃)(x_j^T w̃_k).    (6.1)

We will bound these terms separately.

Note that supp(β̃) ⊆ S := supp(β) with probability at least 1 − 25/n − 5e^{−s} − t/(np) when t ≍ log( ‖β∗‖²√n/(σ√(s log p)) ). Thus, we have supp(∆) ⊆ S with high probability. Also,

w_k = −[ 2(‖β∗‖² I + 2β∗β∗^T) ]⁻¹ e_k = −(1/(2‖β∗‖²)) ( I − (2/3) β∗β∗^T/‖β∗‖² ) e_k = −(1/(2‖β∗‖²)) e_k + (β∗_k/(3‖β∗‖⁴)) β∗.

Then supp(w_k) ⊆ S ∪ {k}. Similarly, supp(w̃_k) ⊆ S ∪ {k}. Let S̄ = S ∪ {k}. Checking the supports of ∆, w_k and w̃_k grants the applicability of Lemma A.1 here, which implies

| w_k^T ( (1/n) Σ_{j=1}^n (x_j^T β∗)² x_j x_j^T − (‖β∗‖² I + 2β∗β∗^T) ) ∆ |
= | w_k^T ( (1/n) Σ_{j=1}^n (x_j^T β∗)² x_{jS̄} x_{jS̄}^T − (‖β∗‖² I_S̄ + 2β∗β∗^T) ) ∆ | ≤ δ ‖β∗‖² ‖∆‖ ‖w_k‖    (6.2)

with probability at least 1 − 2/n, provided n ≥ C(δ)(s+1) log(s+1). Here C(δ) is a constant depending only on δ, and I_S̄ is a diagonal matrix with the diagonal elements in S̄ equal to 1 and the others equal to 0.

By (2.8) and simple algebra, we have

e_k^T ∆ + 2 w_k^T (‖β∗‖² I_S̄ + 2β∗β∗^T) ∆ = 0.

Combining the above two formulae, the first two terms in (6.1) can be bounded by

| e_k^T ∆ + (2/n) Σ_{j=1}^n (x_j^T ∆)(x_j^T w_k)(x_j^T β∗)² | ≤ 2δ ‖β∗‖² ‖∆‖ ‖w_k‖    (6.3)

with high probability, if n ≥ C(δ)(s+1) log(s+1).

By a similar argument, we have

| w̃_k^T ( (1/n) Σ_{j=1}^n (x_j^T ∆)² x_j x_j^T − (‖∆‖² I + 2∆∆^T) ) ∆ | ≤ δ ‖∆‖³ ‖w̃_k‖

with probability at least 1 − 2/n, if n ≥ C(δ)(s+1) log(s+1).
Thus, the fourth term in (6.1) is bounded by

| (1/n) Σ_{j=1}^n (x_j^T ∆)³(x_j^T w̃_k) | ≤ 3‖∆‖² |∆^T w̃_k| + δ ‖∆‖³ ‖w̃_k‖ ≤ 3‖∆‖³ ‖w̃_k‖ + δ ‖∆‖³ ‖w̃_k‖    (6.4)

with high probability.

Again via similar reasoning, the fifth term in (6.1) falls within 3‖∆‖²(w̃_k^T β∗) + 6(∆^T w̃_k)(∆^T β∗) ± 3δ ‖∆‖² ‖w̃_k‖ ‖β∗‖ with probability at least 1 − 2/n, provided n ≥ C(δ)(s+1) log(s+1). And we bound it by

| (3/n) Σ_{j=1}^n (x_j^T ∆)²(x_j^T w̃_k)(x_j^T β∗) | ≤ 3‖∆‖² |w̃_k^T β∗| + 6|∆^T w̃_k||∆^T β∗| + 3δ ‖∆‖² ‖w̃_k‖ ‖β∗‖ ≤ 9‖∆‖² ‖w̃_k‖ ‖β∗‖ + 3δ ‖∆‖² ‖w̃_k‖ ‖β∗‖.    (6.5)

The derivations of (6.3)-(6.5) require a common condition n ≥ C(δ)(s+1) log(s+1), with exactly the same C(δ).

Next, we deal with the third term in (6.1):

(2/n) Σ_{j=1}^n (x_j^T ∆)[ x_j^T (w̃_k − w_k) ](x_j^T β∗)²
= 2∆^T ( (1/n) Σ_{j=1}^n (x_j^T β∗)² x_j x_j^T − (‖β∗‖² I + 2β∗β∗^T) )(w̃_k − w_k) + 2‖β∗‖² ∆^T (w̃_k − w_k) + 4(β∗^T ∆)( β∗^T (w̃_k − w_k) ).

Applying Lemma A.1 one more time, the third term falls within 2‖β∗‖² ∆^T (w̃_k − w_k) + 4(β∗^T ∆)( β∗^T (w̃_k − w_k) ) ± 2δ ‖β∗‖² ‖∆‖ ‖w̃_k − w_k‖ with probability at least 1 − 2/n, given n ≥ C(δ)(s+1) log(s+1).
The magnitude of ‖w̃_k − w_k‖ in terms of ‖β∗‖ and ‖∆‖ can be estimated by

w̃_k − w_k = (1/2) ( 1/‖β∗‖² − 1/‖β̃‖² ) e_k + ( β̃_k/(3‖β̃‖⁴) ) ∆ + (1/3) ( β̃_k/‖β̃‖⁴ − β∗_k/‖β∗‖⁴ ) β∗,

‖w̃_k − w_k‖ ≤ (1/2) | 1/‖β∗‖² − 1/‖β̃‖² | + ( |β̃_k|/(3‖β̃‖⁴) ) ‖∆‖ + (1/3) | β̃_k/‖β̃‖⁴ − β∗_k/‖β∗‖⁴ | ‖β∗‖ = O( ‖∆‖/‖β∗‖³ ).    (6.6)

The last equality is because

| 1/‖β∗‖² − 1/‖β̃‖² | = | ‖β̃‖² − ‖β∗‖² | / ( ‖β∗‖² ‖β̃‖² ) ≤ ( 2‖β∗‖‖∆‖ + ‖∆‖² ) / ( ‖β∗‖² ‖β̃‖² ) = O( ‖∆‖/‖β∗‖³ ),

| β̃_k/‖β̃‖⁴ − β∗_k/‖β∗‖⁴ | = | ‖β∗‖⁴ β̃_k − ‖β̃‖⁴ β∗_k | / ( ‖β∗‖⁴ ‖β̃‖⁴ )
≤ | β̃_k − β∗_k | / ‖β̃‖⁴ + | ‖β̃‖⁴ − ‖β∗‖⁴ | |β∗_k| / ( ‖β∗‖⁴ ‖β̃‖⁴ )
≤ ‖∆‖/‖β̃‖⁴ + ( (‖β∗‖ + ‖∆‖)⁴ − ‖β∗‖⁴ ) / ( ‖β∗‖³ ‖β̃‖⁴ ) = O( ‖∆‖/‖β∗‖⁴ ).

Therefore, we bound the third term in (6.1) by

| (2/n) Σ_{j=1}^n (x_j^T ∆)[ x_j^T (w̃_k − w_k) ](x_j^T β∗)² | ≤ 6‖β∗‖² ‖∆‖ ‖w̃_k − w_k‖ + 2δ ‖β∗‖² ‖∆‖ ‖w̃_k − w_k‖
= 6‖β∗‖² ‖∆‖ · O( ‖∆‖/‖β∗‖³ ) + 2δ ‖β∗‖² ‖∆‖ · O( ‖∆‖/‖β∗‖³ ).    (6.7)

‖w̃_k‖ appears in the mean terms of (6.4) and (6.5) and needs to be bounded. By (2.8) and (6.6),

‖w_k‖ ≤ 1/(2‖β∗‖²) + |β∗_k|/(3‖β∗‖³) ≤ 1/(2‖β∗‖²) + (1/3)(1/‖β∗‖²) ≤ 1/‖β∗‖²,
‖w̃_k‖ ≤ ‖w_k‖ + O( ‖∆‖/‖β∗‖³ ) ≤ 1/‖β∗‖² + O( ‖∆‖/‖β∗‖³ ).

So far, the terms that separate β̂_k − β∗_k from −(1/n) Σ_{j=1}^n ε_j (x_j^T β̃)(x_j^T w̃_k) (an asymptotically normal random variable) have been concentrated around their means, with the concentration errors governed by δ. We can set δ = 1/p, so that all concentration errors vanish at an order faster than √n/p, and we can simply ignore these terms. The only thing left to check is whether the mean terms are negligible after multiplying by √n, i.e.,
to show that √n(β̂_k − β∗_k) is approximately normal as n → ∞. The goal is to show

| √n(β̂_k − β∗_k) − Z_k | = o_p(1).

By (6.1), (6.3)-(6.5), (6.7), and (3.7), we have with probability at least 1 − 10/n,

| √n(β̂_k − β∗_k) − Z_k | ≤ √n ( 3‖∆‖³ ‖w̃_k‖ + 9‖∆‖² ‖w̃_k‖ ‖β∗‖ + 6‖β∗‖² · O( ‖∆‖²/‖β∗‖³ ) ) + O( √n/p )
≤ √n ( 3‖∆‖³/‖β∗‖² + O( ‖∆‖⁴/‖β∗‖³ ) + 9‖∆‖²/‖β∗‖ + O( ‖∆‖³/‖β∗‖² ) + O( ‖∆‖²/‖β∗‖ ) ) + O( √n/p )    (6.8)

provided n ≥ C(1/p)(s+1) log(s+1), which is satisfied under assumptions (3.1) and (3.2); here C(1/p) is a constant depending only on p. Together with (2.1) and (2.2), which have been established in [4], we obtain

| √n(β̂_k − β∗_k) − Z_k | ≤ C₁ (σ³/‖β∗‖⁵)(s log p)^{3/2}/n + C₂ (σ⁴/‖β∗‖⁷) s²(log p)²/(n√n) + C₃ (σ²/‖β∗‖³) s log p/√n + C₄ √n/p    (6.9)

with probability at least 1 − 50/n − 10e^{−s} − M log n/(np) for some absolute constants M, C₁, C₂, C₃, C₄. Plugging in ‖β∗‖ = O(√s), the right-hand side of (6.9) becomes C'₁ (log p)^{3/2}/(sn) + C'₂ (log p)²/(sn)^{3/2} + C'₃ log p/(sn)^{1/2} + C'₄ √n/p, all of which vanish as n → ∞ under assumption (3.2). And the asymptotic normality of √n(β̂_k − β∗_k) is established. □

Proof of Theorem 3.2.

(i) It is intuitive that a reasonable estimator would combine β̂₁k and β̂₂k so as to integrate both pieces of information. We know that Z₁k and Z₂k have asymptotic variances σ²τ₁k² and σ²τ₂k², respectively. By Theorem 3.1 and Remark 3.1,

β̂₁k := β∗_k + Z₁k/√n + o_p( στ₁k/√n ) = β∗_k + N(0, σ²τ₁k²/n) + o_p( στ₁k/√n ),
β̂₂k := β∗_k + Z₂k/√n + o_p( στ₂k/√n ) = β∗_k + N(0, σ²τ₂k²/n) + o_p( στ₂k/√n ).

Judging from these two formulae, any convex combination of β̂₁k and β̂₂k remains asymptotically unbiased (has asymptotic mean equal to β∗_k) and may attain a smaller asymptotic variance. Suppose we take the final estimator

β̂^swap_k = a β̂₁k + (1 − a) β̂₂k  for a ∈ (0, 1).

Then

√n(β̂^swap_k − β∗_k) = a Z₁k + (1 − a) Z₂k + o_p(1).

It is easy to verify that Cov(Z₁k, Z₂k) = 0, and hence the asymptotic variance of √n(β̂^swap_k − β∗_k) is σ²( a²τ₁k² + (1−a)²τ₂k² ), a quadratic form in a. When a = τ₂k²/(τ₁k² + τ₂k²), the asymptotic variance attains its minimum, in which case √n(β̂^swap_k − β∗_k) has approximately the asymptotic distribution N(0, σ² τ₁k²τ₂k²/(τ₁k² + τ₂k²)).

(ii) Let a = τ₂k²/(τ₁k² + τ₂k²).
By Lemma A.1, for n satisfying (3.2),

P{ a² | Var(Z₁k)/σ² − τ₁k² | ≤ (a²/p) ‖β̃₁‖² ‖w̃₁k‖² } ≥ 1 − 2/n,
P{ (1−a)² | Var(Z₂k)/σ² − τ₂k² | ≤ ((1−a)²/p) ‖β̃₂‖² ‖w̃₂k‖² } ≥ 1 − 2/n.

Let Z^swap_k = a Z₁k + (1 − a) Z₂k. Then

P{ | Var(Z^swap_k)/σ² − a²τ₁k² − (1−a)²τ₂k² | ≤ (1/p) ( a² ‖β̃₁‖² ‖w̃₁k‖² + (1−a)² ‖β̃₂‖² ‖w̃₂k‖² ) } ≥ 1 − 4/n.    (6.10)

Similar to the argument in Theorem 3.1,

‖β̃‖² ‖w̃_k‖² ≤ (‖β∗‖ + ‖∆‖)² ( 1/‖β∗‖² + O(‖∆‖/‖β∗‖³) )² ≍ 1/s + O( √(log p/(ns)) ).    (6.11)

Plugging (6.11) into (6.10), together with the fact that a² + (1−a)² ≤ 1, we have

P{ | Var(Z^swap_k)/σ² − a²τ₁k² − (1−a)²τ₂k² | ≤ (1/p) ( 1/s + O(√(log p/(ns))) ) } ≥ 1 − 4/n.

Further plugging in the value of a,

P{ Var(Z^swap_k) ≤ σ² τ₁k²τ₂k²/(τ₁k² + τ₂k²) + O(1/p) } ≥ 1 − 4/n.

By (3.9), we get

P{ | β̂^swap_k − β∗_k | ≤ (1/√n) |Z^swap_k| + ε'_n/√n } ≥ 1 − ε''_n.

Let Φ(·) be the cumulative distribution function of the standard normal distribution. For every r > 0,

P{ (1/√n) |Z^swap_k| ≤ (r/√n) √(Var(Z^swap_k)) } = 2Φ(r) − 1.

Hence the interval

( β̂^swap_k − (σr/√n) √( τ₁k²τ₂k²/(τ₁k²+τ₂k²) + O(1/p) ) − ε'_n/√n,  β̂^swap_k + (σr/√n) √( τ₁k²τ₂k²/(τ₁k²+τ₂k²) + O(1/p) ) + ε'_n/√n )

has asymptotic coverage probability at least 2Φ(r) − 1 − ε''_n − 4/n, which proves (3.10).

(iii) By Remark 3.2, τ₁k²τ₂k²/(τ₁k²+τ₂k²) ≤ max{τ₁k², τ₂k²}/2 ≍ 1/(2s) + O(√(log p/(ns))) for all k = 1, 2, ..., p. Since (3.10) holds uniformly over k, we obtain

lim inf_{n→∞} P{ max_{k∈[p]} | √n(β̂^swap_k − β∗_k) | ≤ √(2/s) σr } ≥ 1 − 2p(1 − Φ(r)),

and taking r = Φ⁻¹(1 − α/(2p)) yields (3.11).

It is easy to verify that the asymptotic distribution of β̂^swap is multivariate normal N(β∗, (σ²/n)V), where

V_kl = (1/σ²) Cov(Z^swap_k, Z^swap_l) = (1/σ²) Cov( a_k Z₁k + (1−a_k) Z₂k, a_l Z₁l + (1−a_l) Z₂l )
= (1/σ²) [ a_k a_l Cov(Z₁k, Z₁l) + (1−a_k)(1−a_l) Cov(Z₂k, Z₂l) ]
= a_k a_l [ ‖β̃₁‖²(w̃₁k^T w̃₁l) + 2(β̃₁^T w̃₁k)(β̃₁^T w̃₁l) ] + (1−a_k)(1−a_l) [ ‖β̃₂‖²(w̃₂k^T w̃₂l) + 2(β̃₂^T w̃₂k)(β̃₂^T w̃₂l) ].

By linear algebra,

sup_{h≠0, h∈R^p} | h^T (β̂^swap − β∗) |² / (h^T V h) = (β̂^swap − β∗)^T V⁻¹ (β̂^swap − β∗) →_d (σ²/n) χ²_p.

Therefore, the coverage probability in the worst direction is bounded from below:

lim inf_{n→∞} P{ sup_{h≠0, h∈R^p} | h^T (β̂^swap − β∗) |² / ( h^T (σ²V/n) h ) ≤ χ²_{p,α} } ≥ 1 − α.

This further implies, for every h ≠ 0, h ∈ R^p,

lim inf_{n→∞} P{ | h^T (β̂^swap − β∗) | ≤ √( (σ²/n) χ²_{p,α} h^T V h ) } ≥ 1 − α. □

Acknowledgements
The author would like to thank Cun-Hui Zhang for several pieces of constructive advice and enlightening discussions. The author would also like to thank Pierre Bellec for enlightening discussions.

Appendix
The theorem and lemma below are stated and proved in [4].
Theorem A.1 [4] Suppose the tuning parameters in the thresholded Wirtinger flow algorithm are suitably chosen, and the sample size satisfies n ≥ K(1 + σ/‖β‖²)² s² log(np) for some absolute constant K > 0. Let S = supp(β). Then

inf_{‖β‖₀=s} P_{(X,y|β)} { supp(β̃^(t)) ⊆ S and min_{i=0,1} ‖β̃^(t) − (−1)^i β‖ ≤ (1/6)(1 − µ/16)^t ‖β∗‖ + C₁ (σ/‖β∗‖) √(s log p / n) } > 1 − 25/n − 5e^{−s} − t/(np)

for some absolute constant C₁ > 0, where µ is the gradient descent step size. When σ/‖β∗‖² = o(√(n/log n)) and is unknown, we can estimate ‖β∗‖² by

φ := (1/n) Σ_{j=1}^n y_j

and define

σ̂ = √( (1/n) Σ_{j=1}^n y_j² − 3φ² ).

Then with probability at least 1 − 2/n, there holds σ̂/φ ≍ σ/‖β∗‖². If the sample size satisfies n ≥ K(1 + σ̂/φ)² s² log(np), then the above claim holds with the first term on the right-hand side, 25/n, replaced by 50/n.

Lemma A.1 [4] Suppose the x_j are i.i.d. N(0, I_{p×p}). Then on an event with probability at least 1 − 2/n, we have

‖ (1/n) Σ_{j=1}^n (x_j^T b)² x_{jS} x_{jS}^T − ( ‖b‖² I_S + 2bb^T ) ‖ ≤ δ ‖b‖²

provided n ≥ C(δ) s log s, where C(δ) is a constant depending only on δ. Here, I_S is a diagonal matrix with the diagonal elements in S equal to 1, whereas the others equal 0. And supp(b) ⊂ S.

References

[1] Heinz H. Bauschke, Patrick L. Combettes, and D. Russell Luke. "Hybrid projection–reflection method for phase retrieval". In: JOSA A 20.6 (2003), pp. 1025–1034.
[2] arXiv preprint arXiv:1902.08885 (2019).
[3] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Vol. 2. SIAM, 2001.
[4] T. Tony Cai, Xiaodong Li, and Zongming Ma. "Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow". In: The Annals of Statistics 44.5 (2016), pp. 2221–2251.
[5] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. "Phase retrieval via Wirtinger flow: Theory and algorithms". In: IEEE Transactions on Information Theory 61.4 (2015), pp. 1985–2007.
[6] Emmanuel J. Candès, Thomas Strohmer, and Vladislav Voroninski. "PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming". In: Communications on Pure and Applied Mathematics 66.8 (2013), pp. 1241–1274.
[7] Emmanuel J. Candès et al. "Phase retrieval via matrix completion". In: SIAM Review 57.2 (2015), pp. 225–251.
[8] Anwei Chai, Miguel Moscoso, and George Papanicolaou. "Array imaging using intensity-only measurements". In: Inverse Problems 27.1 (2011).
[9] Yuxin Chen and Emmanuel J. Candès. "Solving random quadratic systems of equations is nearly as easy as solving linear systems". In: Advances in Neural Information Processing Systems. 2015, pp. 739–747.
[10] David L. Donoho. "Compressed sensing". In: IEEE Transactions on Information Theory 52.4 (2006), pp. 1289–1306.
[11] John C. Duchi and Feng Ruan. "Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval". arXiv preprint arXiv:1705.02356 (2017).
[12] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. "Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices". In: Proceedings of the 2003 American Control Conference. Vol. 3. IEEE, 2003, pp. 2156–2162.
[13] James R. Fienup. "Phase retrieval algorithms: a comparison". In: Applied Optics 21.15 (1982), pp. 2758–2769.
[14] James R. Fienup. In: JOSA A.
[15] R. W. Gerchberg and W. O. Saxton. "A practical algorithm for the determination of phase from image and diffraction plane pictures". In: Optik 35 (1972), pp. 237–246.
[16] Mark Iwen, Aditya Viswanathan, and Yang Wang. "Robust sparse phase retrieval made easy". In: Applied and Computational Harmonic Analysis 42.1 (2017), pp. 135–142.
[17] arXiv preprint arXiv:1510.07713 (2015).
[18] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. "Recovery of sparse 1-D signals from the magnitudes of their Fourier transform". In: 2012 IEEE International Symposium on Information Theory Proceedings. IEEE, 2012, pp. 1473–1477.
[19] Kishore Jaganathan, Samet Oymak, and Babak Hassibi. "Sparse phase retrieval: Convex algorithms and limitations". In: 2013 IEEE International Symposium on Information Theory Proceedings. IEEE, 2013, pp. 1022–1026.
[20] Adel Javanmard and Andrea Montanari. "Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory". In: IEEE Transactions on Information Theory 60.10 (2014), pp. 6522–6554.
[21] In: JOSA A.
[22] Xiaodong Li and Vladislav Voroninski. "Sparse signal recovery from quadratic measurements via convex programming". In: SIAM Journal on Mathematical Analysis 45.5 (2013), pp. 3019–3033.
[23] In: Electronic Journal of Statistics.
[24] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. "Phase retrieval using alternating minimization". In: Advances in Neural Information Processing Systems. 2013, pp. 2796–2804.
[25] Henrik Ohlsson et al. "Compressive phase retrieval from squared output measurements via semidefinite programming". arXiv preprint arXiv:1111.6323 (2011), pp. 1–27.
[26] Samet Oymak et al. "Simultaneously structured models with application to sparse and low-rank matrices". In: IEEE Transactions on Information Theory 61.5 (2015), pp. 2886–2908.
[27] Yoav Shechtman, Amir Beck, and Yonina C. Eldar. "GESPAR: Efficient phase retrieval of sparse signals". In: IEEE Transactions on Signal Processing 62.4 (2014), pp. 928–938.
[28] Yoav Shechtman et al. "Phase retrieval with application to optical imaging: A contemporary overview". In: IEEE Signal Processing Magazine 32.3 (2015), pp. 87–109.
[29] Sara van de Geer et al. "On asymptotically optimal confidence regions and tests for high-dimensional models". In: The Annals of Statistics 42.3 (2014), pp. 1166–1202.
[30] Irène Waldspurger, Alexandre d'Aspremont, and Stéphane Mallat. "Phase recovery, MaxCut and complex semidefinite programming". In: Mathematical Programming 149.1-2 (2015), pp. 47–81.
[31] Gang Wang, Georgios B. Giannakis, and Yonina C. Eldar. "Solving systems of random quadratic equations via truncated amplitude flow". In: IEEE Transactions on Information Theory 64.2 (2018), pp. 773–794.
[32] In: IEEE Transactions on Signal Processing.
[33] Cun-Hui Zhang. In: Mathematisches Forschungsinstitut Oberwolfach: Very High Dimensional Semiparametric Models, Report 48 (2011), pp. 28–31.
[34] Cun-Hui Zhang and Stephanie S. Zhang. "Confidence intervals for low dimensional parameters in high dimensional linear models". In: Journal of the Royal Statistical Society: Series B 76.1 (2014), pp. 217–242.