Reconstruction of Sparse Signals under Gaussian Noise and Saturation
Reconstruction of Sparse Signals using Likelihood Maximization from Compressive Measurements with Gaussian and Saturation Noise
Shuvayan Banerjee, Radhendushka Srivastava, Ajit Rajwade
Abstract
Most compressed sensing algorithms do not account for the effect of saturation in noisy compressed measurements, though saturation is an important consequence of the limited dynamic range of existing sensors. The few algorithms that handle saturation effects either simply discard saturated measurements, or impose additional constraints to ensure consistency of the estimated signal with the saturated measurements (based on a known saturation threshold) given uniform-bounded noise. In this paper, we instead propose a new data fidelity function which is directly based on ensuring a certain form of consistency between the signal and the saturated measurements, and can be expressed as the negative logarithm of a certain carefully designed likelihood function. Our estimator works even in the case of Gaussian noise (which is unbounded) in the measurements. We prove that our data fidelity function is convex. We moreover show that it satisfies the condition of Restricted Strong Convexity, and thereby derive an upper bound on the performance of the estimator. We also show that our technique experimentally yields results superior to the state of the art under a wide variety of experimental settings, for compressive signal recovery from noisy and saturated measurements.
Compressed sensing (CS) aims to recover a signal x ∈ ℝⁿ from its 'compressive measurements' of the form y = Ax + η, where A ∈ ℝ^{m×n}, m ≪ n, is a sensing matrix representing the forward model of the compressive device, and y ∈ ℝᵐ is a vector of (possibly noisy) compressive measurements. The noise vector is η ∈ ℝᵐ. Although this problem is ill-posed for most vectors in ℝⁿ, CS theory states that it is well-posed and that the signal x can be recovered with high accuracy [4], if x is a sparse (or weakly-sparse) vector and A obeys the so-called restricted isometry property (RIP). A sensing matrix A is said to obey the RIP of order s if, for any s-sparse vector x, we have ‖Ax‖₂ ≈ ‖x‖₂. Here, the degree of approximation is given by the so-called s-order restricted isometry constant (RIC) of A. There exist precise error bounds for the recovery of x [4]. Moreover, most of the algorithms for CS recovery are also efficient in terms of computation speed, a well-known example being the LASSO [7], which seeks to minimize the objective function J(x) ≜ ‖y − Ax‖₂² + λ‖x‖₁, given a regularization parameter λ.

However, the vast majority of the literature assumes a zero-mean i.i.d. Gaussian distribution (with known variance) as the noise model. Many practical sensing systems, on the other hand, innately enforce noise of other distributions. Almost all sensors have a fixed (and usually known) dynamic range [a, b], a < b. However, the underlying signal may be such that not all measurements A_i x (where A_i is the i-th row of A) can be accommodated within this range. Such measurements then get 'clipped' to the value a if A_i x < a, or to the value b if A_i x > b. This is called the 'saturation effect', and is common in all sensing systems (not only the compressive ones).
Problem statement: In this paper, we consider the following forward model for the measurements y for a compressive device with dynamic range [−τ, τ]:

∀i ∈ {1, 2, ..., m},  y_i = C(A_i x + η_i; −τ, τ).   (1)

Here the noise values are i.i.d., with η_i ∼ N(0, σ²) with known σ. Also, C(q; a, b) is a saturation operator defined as follows:

C(q; a, b) = a if q < a;  b if q > b;  q if q ∈ [a, b].   (2)

Here q < a is called 'negative saturation' and q > b is called 'positive saturation'. Given this forward model with known A and τ, we seek to recover a sparse/weakly-sparse vector x from its compressive measurements y.

There exists a moderate-sized literature on the problem of CS recovery from saturated measurements, which we summarize here. Right through this paper, we use S₋, S₊ to denote the sets that respectively consist of the indices of negatively and positively saturated measurements, S is the set of indices of all measurements, and the set of indices of non-saturated measurements is S_ns ≜ S − S₊ − S₋. The work in [8] proposes two types of estimators for CS recovery from measurements with saturation effects and uniform quantization (i.e. bounded) noise: (1) 'saturation rejection' (SR), which weeds out saturated measurements and performs recovery only from the non-saturated measurements via the estimator: min ‖x‖₁ s.t. Σ_{i∈S_ns} (y_i − A_i x)² ≤ ε²_ns; and (2) 'saturation consistency' (SC), which imposes the added constraint in the SR estimator that the recovered signal x̂ should obey the conditions ∀i ∈ S₋, A_i x̂ ≤ −(τ − Δ) and ∀i ∈ S₊, A_i x̂ ≥ τ − Δ, where Δ denotes the quantization width. The SR method potentially ignores many useful measurements (depending on the relation between τ and ‖x‖₂), and in the worst case the remaining part of the sensing matrix may not obey the RIP due to an insufficient number of measurements. The SC method is hard to adapt to saturation effects with Gaussian noise, which is unbounded in nature. The work in [9, 10] seeks to optimize the following cost function, which is based on the assumption that saturated measurements are not too large in number:

J_ss(x) ≜ λ(‖x‖₁ + ‖r‖₁) + ‖y − (Ax + r)‖₂² = λ‖(x; r)‖₁ + ‖y − [A | I](x; r)‖₂².   (3)

Here r refers to the error due to saturation effects, (x; r) is the concatenation of the column vectors x, r; I is the identity matrix; and the ‖r‖₁ term promotes sparsity on the vector r. In this paper, we term this approach 'saturation sparsity' (SS). Although [9, 10] prove the RIP of [A | I], that property is true only in an asymptotic sense as m → ∞ (with n → ∞ and m/n approaching a constant). When m is small, we have observed that such a technique has a tendency to estimate r to be a vector of all zeroes, due to the penalty on ‖r‖₁. Recent work in [13] proposes a greedy approximation algorithm to minimize the following cost function, designed to be resilient to measurement outliers:

J_α(x) ≜ ‖y − Ax‖_p^p + λ‖x‖₀;  0 < p < 2.   (4)

An approximation algorithm to minimize such a cost function is essential, as the ‖·‖₀ pseudo-norm otherwise renders this problem NP-hard. Note that the approaches in [9, 10, 13] were designed for general impulse noise and not for saturation effects, and hence these methods do not use knowledge of the saturation threshold τ.
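To make the forward model of Eqns. (1)-(2) concrete, the following minimal NumPy sketch simulates saturated compressive measurements and identifies the index sets S₊, S₋, S_ns from the known threshold τ. All function and variable names here are our own illustrative choices, not notation from the paper or from any library.

import numpy as np

def saturate(q, a, b):
    # Saturation operator C(q; a, b) from Eqn. (2), applied elementwise.
    return np.clip(q, a, b)

rng = np.random.default_rng(0)
m, n, s, sigma, tau = 150, 256, 25, 0.1, 1.0

# s-sparse signal with a random support and +/-1 entries
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=s)

# Gaussian sensing matrix with entries N(0, 1/m), as used in the experiments
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

y = saturate(A @ x + sigma * rng.normal(size=m), -tau, tau)

# Since tau is known, the saturated index sets are identified from y itself
S_plus = np.where(y >= tau)[0]
S_minus = np.where(y <= -tau)[0]
S_ns = np.setdiff1d(np.arange(m), np.union1d(S_plus, S_minus))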
Very recent work in [5] provides theoretical bounds for the following interesting estimator, termed 'noise-cognizant ℓ₁-minimization' (NCLM):

argmin_{x,r} ‖x‖₁ such that (i) C(Ax + r; −τ, τ) = y;  (ii) ‖r‖₂ ≤ γε;  (iii) ‖x‖₁ ≤ γ′μ√m.   (5)

The parameters γ, γ′, μ need to be selected based on properties of the sensing matrix, ε is a bound on ‖y − Ax‖₂, and the vector r plays the same role as in Eqn. 3. Our method presented in this paper does not require the choice of so many parameters, nor does it require an upper bound on ‖x‖₁.

The rest of this paper is organized as follows. The main objective function and its properties are presented in Sec. 2. Several numerical results are presented and discussed in Sec. 3. We conclude in Sec. 4 with a discussion of avenues for future work.

In this section, we first present the cost function which we seek to optimize for CS recovery under saturated measurements. Although we consider the signal x to be sparse in the canonical basis, our method is easily extensible to a signal that is sparse/weakly sparse in any known orthonormal basis (see Sec. 3). In the following, Φ(·) denotes the cumulative distribution function (CDF) of a standard normal random variable, and φ(·) denotes its probability density function (PDF).

2.1 Cost function and its properties

Our cost function J_our(x) is given below:

J_our(x) = λ‖x‖₁ + L(y, Ax; τ),   (6)

where

L(y, Ax; τ) ≜ (1/2) Σ_{i∈S_ns} ((y_i − A_i x)/σ)² − Σ_{i∈S₊} log(1 − Φ((τ − A_i x)/σ)) − Σ_{i∈S₋} log(Φ((−τ − A_i x)/σ)).

The first term in L(y, Ax; τ) is due to the Gaussian noise in the unsaturated measurements; the second (third) term encourages the values of A_i x for i ∈ S₊ (likewise i ∈ S₋) to be much greater than τ (likewise much less than −τ). To understand the behaviour of the second term of L(y, Ax; τ), consider a measurement y_i such that i ∈ S₊. Referring to Eqn. 1, we have P(y_i ≥ τ) = P(η_i ≥ τ − A_i x) = 1 − Φ((τ − A_i x)/σ). The last equality is due to the Gaussian nature of η_i. Given such a measurement, we seek to find x such that A_i x > τ, which will push τ − A_i x toward −∞, i.e. push Φ((τ − A_i x)/σ) toward 0, and thus reduce the cost function. A similar argument can be made for the third term involving S₋. Consider that P(y_i < −τ) = P(η_i < −τ − A_i x) = Φ((−τ − A_i x)/σ). We seek to find x such that A_i x < −τ, which will tend to push −τ − A_i x toward +∞, i.e. push Φ((−τ − A_i x)/σ) toward 1, and thereby reduce the cost function. Assuming independence of the measurements, note that L(y, Ax; τ) is essentially the negative log of the following likelihood function:

L̃(y, Ax; τ) ≜ Π_{i∈S_ns} [e^{−(y_i − A_i x)²/(2σ²)} / (σ√(2π))] · Π_{i∈S₊} [1 − Φ((τ − A_i x)/σ)] · Π_{i∈S₋} Φ((−τ − A_i x)/σ).   (7)

We henceforth term our technique 'likelihood maximization' or LM. The tendency to push Φ((τ − A_i x)/σ) toward 0, or to push Φ((−τ − A_i x)/σ) toward 1, is counter-balanced by the sparsity-promoting term ‖x‖₁, with λ deciding the relative weightage.

We now state an important property of L(y, Ax; τ), proved in the supplemental material [1]. Theorem 1: L(y, Ax; τ) is a convex function of x. ∎
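For completeness, here is a sketch of how L(y, Ax; τ) in Eqn. (6) can be evaluated in practice. Computing 1 − Φ((τ − A_i x)/σ) directly can round to 0 in floating point, so the sketch uses SciPy's log-survival-function norm.logsf (which computes log(1 − Φ(·))) and norm.logcdf; the function names and interface are our own, not the authors' code.

import numpy as np
from scipy.stats import norm

def neg_log_likelihood(x, A, y, tau, sigma, S_plus, S_minus, S_ns):
    # L(y, Ax; tau) from Eqn. (6), up to the additive constant |S_ns|*log(sigma*sqrt(2*pi))
    Ax = A @ x
    # Gaussian data-fidelity term over the non-saturated measurements
    gauss = 0.5 * np.sum(((y[S_ns] - Ax[S_ns]) / sigma) ** 2)
    # -log(1 - Phi((tau - A_i x)/sigma)) over the positively saturated measurements
    pos = -np.sum(norm.logsf((tau - Ax[S_plus]) / sigma))
    # -log(Phi((-tau - A_i x)/sigma)) over the negatively saturated measurements
    neg = -np.sum(norm.logcdf((-tau - Ax[S_minus]) / sigma))
    return gauss + pos + neg

def J_our(x, A, y, tau, sigma, S_plus, S_minus, S_ns, lam):
    # Full objective of Eqn. (6)
    return lam * np.abs(x).sum() + neg_log_likelihood(x, A, y, tau, sigma, S_plus, S_minus, S_ns)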
For further theoretical analysis, we present an overview of the broad framework in [11] and then adapt it carefully for the analysis of our estimator in Eqn. 6. At first we state Lemma L1, Definition D1 and Theorem T1, and then we use them to prove Theorems 2, 3 and 4.

Lemma L1 (Lemma 1 of [11]): Let x̂_λ be the optimum of a general cost function L_g(y; Ax) + λ‖x‖₁ with a regularization parameter λ ≥ 2‖∇L_g(y; Ax)‖∞. Then the error vector Δ ≜ x̂_λ − x belongs to the set C(S; x) ≜ {Δ : ‖(x − x̂_λ)_{S^c}‖₁ ≤ 3‖(x − x̂_λ)_S‖₁}, where S is the set of indices of the s non-zero elements of x, and ∀i ∈ S, x_S(i) = x_i; ∀i ∉ S, x_S(i) = 0. ∎

Definition D1: A loss function L_g is said to obey the restricted strong convexity (RSC) property with curvature κ_L > 0 and tolerance τ_L(x) if the Bregman divergence δL_g(Δ, x) ≜ L_g(y; Ax̂_λ) − L_g(y; Ax) − ∇L_g(y; Ax)ᵗΔ (the error between the loss function value at x̂_λ and its first-order Taylor series expansion about x) satisfies δL_g(Δ, x) ≥ κ_L‖Δ‖₂² − τ_L²(x) for every vector Δ ∈ C(S; x). ∎

Intuitively, a loss function that obeys RSC is sharply curved around x, so that any difference in the loss function |L_g(y; Ax) − L_g(y; Ax̂_λ)| will imply a proportional estimation error ‖x − x̂_λ‖₂ for all error vectors x̂_λ − x ∈ C(S; x). We refer the reader to [11] for more details.

Theorem T1 (Theorem 1 of [11]): If L_g is convex, differentiable and obeys the RSC property with curvature κ_L and tolerance τ_L(x), if x̂_λ is as defined in Lemma L1 with λ ≥ 2‖∇L_g(y; Ax)‖∞, and if x is an s-sparse vector, then we have:

‖x̂_λ − x‖₂² ≤ 9λ²s/κ_L² + 2λτ_L²(x)/κ_L. ∎

We now state the following theorems pertaining to the cost function in Eqn. 6, and prove them in [1]:
Theorem 2: L(y, Ax; τ) from Eqn. 6 satisfies RSC with curvature κ_L = γ/(2σ²) and tolerance function τ_L(x) = 0, where γ is the restricted eigenvalue constant (REC) for A. Here, we use the structure of δL_g(Δ, x) defined in D1 to find the values of the curvature and the tolerance function for our cost function. Establishing RSC for our cost function implies that the estimator will reach the global minimum. ∎
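Definition D1 and Theorem 2 can be sanity-checked numerically: the Bregman divergence δL(Δ, x) can be evaluated with a finite-difference gradient and compared against κ_L‖Δ‖₂². The sketch below is our own numerical illustration of this RSC-type curvature, not part of the proof in [1]; it assumes neg_log_likelihood and the data from the earlier sketch are in scope.

import numpy as np
from scipy.optimize import approx_fprime

def bregman_divergence(L, x, Delta, eps=1e-7):
    # delta L(Delta, x) = L(x + Delta) - L(x) - <grad L(x), Delta>, as in Definition D1;
    # the gradient is approximated by finite differences
    g = approx_fprime(x, L, eps)
    return L(x + Delta) - L(x) - g @ Delta

# Example usage, with L_fun a closure over the data from the earlier sketch:
# L_fun = lambda v: neg_log_likelihood(v, A, y, tau, sigma, S_plus, S_minus, S_ns)
# rng = np.random.default_rng(1)
# for _ in range(5):
#     Delta = 0.05 * rng.normal(size=x.size)
#     print(bregman_divergence(L_fun, x, Delta) / np.linalg.norm(Delta) ** 2)
#     # empirical curvature; RSC asserts a positive lower bound over C(S; x)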
Theorem 3: For our noise model, and with the additional constraint on the signal that ∀i, α ≤ x_i ≤ β, we have the bound

‖∇L‖∞ ≤ (√(ϱ log(n)) / (σ√m)) {√m₃ + C₁√(m₁ + m₂)}

with probability at least 1 − 2 exp{−½(ϱ − 2) log(n)}, for a constant C₁ and any ϱ > 2. Here m₁, m₂ and m₃ respectively denote the numbers of positively saturated, negatively saturated and unsaturated measurements. Choosing λ to be at least twice this quantity ensures λ ≥ 2‖∇L‖∞, so that we can apply T1 to find the upper bound on the reconstruction error in Theorem 4. ∎
Theorem 4:
Let x̂_λ be the minimizer of the cost function in Eqn. 6 with regularization parameter λ ≥ 2‖∇L‖∞ and with the signal constraints from Thm. 3. Let x be the true s-sparse signal which gave rise to the compressive measurements in y. Then we have the following upper bound, which holds with the same probability as in Thm. 3:

‖x̂_λ − x‖₂² ≤ (144 s log(n) σ² ϱ / (γ² m)) (√m₃ + C₁√(m₁ + m₂))². ∎
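To get a feel for the behaviour of this bound, one can evaluate its right-hand side for sample parameter values. The constants below (ϱ, γ, C₁) are arbitrary placeholders chosen by us purely for illustration; the paper does not prescribe numerical values for them.

import numpy as np

def thm4_upper_bound(s, n, m, m1, m2, sigma, rho=3.0, gamma=0.5, C1=10.0):
    # Right-hand side of the Theorem 4 bound on ||xhat - x||_2^2,
    # with m3 = m - m1 - m2 unsaturated measurements
    m3 = m - m1 - m2
    return (144.0 * s * np.log(n) * sigma**2 * rho / (gamma**2 * m)) * \
           (np.sqrt(m3) + C1 * np.sqrt(m1 + m2))**2

# the bound loosens as the number of saturated measurements m1 + m2 grows:
for m_sat in (0, 5, 15, 30):
    print(m_sat, thm4_upper_bound(s=25, n=256, m=150, m1=m_sat, m2=m_sat, sigma=0.1))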
Observations related to the upper bound: The upper bound is directly proportional to s log(n), which is equivalent to the upper bound in LASSO reconstruction. So the tightness of the upper bound on the reconstruction error of our cost function is relatively close to that of LASSO reconstruction. The bound is directly proportional to σ² as well as s = ‖x‖₀, and inversely proportional to γ², where γ = REC(A; s) [12, 7], all of which is very intuitive. The bound also becomes looser with an increase in the numbers of saturated measurements m₁, m₂. If there are no saturated measurements, i.e. m₁ = m₂ = 0, then the bound reduces to the normal LASSO bound [7], except that here we consider A with unit column norm as against column norm √m in [7]. The bound also increases with m₃. However, it turns out that the constant factor C₁ for the O(√(m₁ + m₂)) term in the bound is very large. This is because it contains factors of the form φ(z)/Φ(z) or φ(z)/(1 − Φ(z)), where z stands for either α or β (see suppl. mat. [1]), which are both large in absolute value for large |α|, |β|. Hence the O(√(m₁ + m₂)) term dominates over the O(√m₃) term, which is intuitive.

Here we report results on CS recovery using our technique LM in comparison to the following existing approaches described in Sec. 1.1: (i) Saturation Rejection (SR) from [8]; (ii) Saturation Consistency (SC) from [8], with the following constraint set designed to handle Gaussian measurement noise: ∀i ∈ S₋, A_i x̂ ≤ −τ + 3σ and ∀i ∈ S₊, A_i x̂ ≥ τ − 3σ; (iii) Saturation Sparsity (SS) from [10]; (iv) Saturation Ignorance (SI), a technique which recovers x pretending there was no saturation in y; and (v) NCLM from [5]. Define ζ ≜ Σ_{i=1}^{m} |A_i x|/m, the average absolute value of the noiseless, pre-saturation measurements. For all techniques including LM, we assume knowledge of τ and thereby that of the sets S₊, S₋. For LM, we did not impose the constraints α ≤ x_i ≤ β from Thm. 3, due to their negligible impact on the results.
Experiment description: All our experiments were performed on signals of dimension n = 256 that were sparse in the 1D-DCT (discrete cosine transform) basis. The supports of the DCT coefficient vectors were chosen randomly, and each signal had a different support. The elements of the sensing matrix A were drawn i.i.d. from N(0, 1/m), so that A would obey the RIP with high probability [4]. Gaussian noise was added to the measurements, followed by application of the saturation operator C. Keeping all other parameters fixed, we studied the variation in the performance of these six techniques with regard to changes in (A) the number of measurements m; (B) the signal sparsity s, expressed as a fraction f_sp ∈ [0, 1] of the signal dimension n; (C) the noise standard deviation σ, expressed as a fraction f_σ ∈ [0, 1] of ζ; and (D) the fraction f_sat ∈ [0, 1] of the m measurements that were saturated. For the measurements experiment (i.e. (A)), m was varied over an increasing range of values with s = 25, f_sat = 0.15, f_σ = 0.1. For the sparsity experiment (i.e. (B)), f_sp was varied over four values with m = 150, f_sat = 0.15, f_σ = 0.1. For the noise experiment (i.e. (C)), we varied f_σ over a range of values with m = 150, f_sp = 25/256, f_sat = 0.15. For the saturation experiment (i.e. (D)), f_sat was varied over a range of values of the form k/150 with m = 150, f_sp = 25/256, f_σ = 0.1. The performance was measured using the relative root-mean squared error (RRMSE), defined as ‖x − x̂‖₂/‖x‖₂ where x̂ is an estimate of the signal x, computed over reconstructions from 10 noise trials.
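A sketch of this experimental pipeline (DCT-sparse signal, Gaussian sensing matrix, additive noise, saturation, RRMSE) is given below. Here recover is a placeholder for any of the six reconstruction methods, and in practice the solvers would operate on the effective matrix AΨ, with Ψ the inverse-DCT basis; all names are our own.

import numpy as np
from scipy.fft import idct

def rrmse(x, x_hat):
    # Relative root-mean squared error: ||x - x_hat||_2 / ||x||_2
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

rng = np.random.default_rng(0)
n, m, s, tau, f_sigma = 256, 150, 25, 1.0, 0.1

# signal sparse in the 1D-DCT basis, with a randomly chosen support
theta = np.zeros(n)
theta[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
x = idct(theta, norm='ortho')              # x = Psi @ theta

A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
zeta = np.mean(np.abs(A @ x))              # average |A_i x|, used to scale the noise
sigma = f_sigma * zeta
y = np.clip(A @ x + sigma * rng.normal(size=m), -tau, tau)

# x_hat = recover(A, y, tau, sigma)        # placeholder: LM, SR, SC, SS, SI or NCLM
# print(rrmse(x, x_hat))                   # averaged over 10 noise trials in the paper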
[Figure 1: Comparison of NCLM and LM for s = 15, m = 150, n = 256.]

Parameter settings: For the proposed LM technique and for SS, the regularization parameter λ was chosen using cross-validation on a set of unsaturated measurements, following the method in [14]. The size of the cross-validation set was 0.3 times the number of measurements used for reconstruction. For SR and SC, we set ε_ns = σ√(|S_ns|). For SI, we used the estimator min ‖x‖₁ s.t. ‖y − Ax‖₂ ≤ σ√m. For NCLM, the bound on ‖x‖₁ was set to be the ℓ₁-norm of the true signal (omnisciently), and that on ‖r‖₂ was set to be a statistical estimate of the magnitude of the pre-saturated noise vector. The well-known FISTA algorithm [2] was used for LM, whereas CVX was used for SS, SC, SR and SI.
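A minimal proximal-gradient (FISTA [2]) sketch for minimizing J_our(x) = λ‖x‖₁ + L(y, Ax; τ) is given below, assuming the gradient expression for L derived in the supplemental material. The ratios φ/(1 − Φ) and φ/Φ are computed in log-space for numerical stability. The fixed step size σ²/‖A‖₂² is one safe (if conservative) choice, since the Hessian of L is dominated by AᵀA/σ²; this is our own sketch under those assumptions, not the authors' released implementation.

import numpy as np
from scipy.stats import norm

def grad_L(x, A, y, tau, sigma, S_plus, S_minus, S_ns):
    # Gradient of L(y, Ax; tau) from Eqn. (6); ratios computed in log-space
    Ax = A @ x
    g = -(A[S_ns].T @ (y[S_ns] - Ax[S_ns])) / sigma**2
    u = (tau - Ax[S_plus]) / sigma
    g -= (A[S_plus].T @ np.exp(norm.logpdf(u) - norm.logsf(u))) / sigma
    v = (-tau - Ax[S_minus]) / sigma
    g += (A[S_minus].T @ np.exp(norm.logpdf(v) - norm.logcdf(v))) / sigma
    return g

def fista(A, y, tau, sigma, S_plus, S_minus, S_ns, lam, step, n_iter=500):
    # Minimize J_our(x) = lam*||x||_1 + L(y, Ax; tau) by FISTA
    n = A.shape[1]
    x, z, t = np.zeros(n), np.zeros(n), 1.0
    for _ in range(n_iter):
        g = grad_L(z, A, y, tau, sigma, S_plus, S_minus, S_ns)
        w = z - step * g
        x_new = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# usage: step = sigma**2 / np.linalg.norm(A, 2)**2
#        x_hat = fista(A, y, tau, sigma, S_plus, S_minus, S_ns, lam=0.1, step=step)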
Discussion: The results of these experiments are summarized in Fig. 2, and show that the proposed LM technique consistently outperforms the competing methods numerically. This behaviour is particularly observable for high f_sat or f_σ. We observed that SC outperformed SR for high f_sat or f_σ. We also note that our technique performed better than NCLM (our closest competitor) in the regime of high f_σ and high f_sat, as can be seen from Fig. 1. The upper bound on the reconstruction error is plotted using appropriate scaling. The empirical trends observed here clearly agree with the intuitive arguments; however, the tightness of the bound may vary depending on the constants in the upper bound.

We have presented a principled likelihood-based method for compressive signal recovery under Gaussian noise combined with saturation effects. We have proved the convexity of our estimator, derived an upper performance bound, and shown that the method numerically outperforms competing methods. The recent work in [3] handles compressive inversion under Poisson-Gaussian noise with uniform quantization, a very realistic noise model in imaging systems. Extending the numerical simulations as well as the convexity proofs to handle saturation effects in conjunction with such a Poisson-Gaussian noise model is a potential avenue for future work. Another useful avenue of research would be to derive lower performance bounds for the presented penalized estimator.

[Figure 2: Performance comparison for six methods: SR (saturation rejection), SC (saturation consistency), SI (saturation ignorance), SS (saturation sparsity), the NCLM method and the proposed LM technique, w.r.t. variation in the number of measurements m (topmost, experiment (A)), signal sparsity s (2nd from top, experiment (B)), noise σ (3rd from top, experiment (C)) and the fraction f_sat of the m measurements that were saturated (bottom-most, experiment (D)).]

Supplemental Material

This section contains the proofs of various results from the main paper. All theorem numbers refer to the corresponding ones in the main paper.
Our cost function J_our(x) is given below:

L(y, Ax; τ) ≜ (1/2) Σ_{i∈S_ns} ((y_i − A_i x)/σ)² − Σ_{i∈S₊} log(1 − Φ((τ − A_i x)/σ)) − Σ_{i∈S₋} log(Φ((−τ − A_i x)/σ)),

J_our(x) = λ‖x‖₁ + L(y, Ax; τ).

The notation A_i denotes the i-th row of the sensing matrix A. The first term in L(y, Ax; τ) is due to the Gaussian noise in the unsaturated measurements; the second (third) term encourages the values of A_i x for i ∈ S₊ (likewise i ∈ S₋) to be much greater than τ (likewise much less than −τ). To understand the behaviour of the second term of L(y, Ax; τ), consider a measurement y_i such that i ∈ S₊. We have P(y_i ≥ τ) = P(η_i ≥ τ − A_i x) = 1 − Φ((τ − A_i x)/σ). The last equality is due to the Gaussian nature of η_i. Given such a measurement, we seek to find x such that A_i x > τ, which will push τ − A_i x toward −∞, i.e. push Φ((τ − A_i x)/σ) toward 0, and thus reduce the cost function. A similar argument can be made for the third term involving S₋. Consider that P(y_i < −τ) = P(η_i < −τ − A_i x) = Φ((−τ − A_i x)/σ). We seek to find x such that A_i x < −τ, which will tend to push −τ − A_i x toward +∞, i.e. push Φ((−τ − A_i x)/σ) toward 1, and thereby reduce the cost function. Assuming independence of the measurements, note that L(y, Ax; τ) is essentially the negative log of the following likelihood function:

L̃(y, Ax; τ) ≜ Π_{i∈S_ns} [e^{−(y_i − A_i x)²/(2σ²)} / (σ√(2π))] · Π_{i∈S₊} [1 − Φ((τ − A_i x)/σ)] · Π_{i∈S₋} Φ((−τ − A_i x)/σ).   (8)

We henceforth term our technique 'likelihood maximization' or LM. The tendency to push Φ((τ − A_i x)/σ) toward 0, or to push Φ((−τ − A_i x)/σ) toward 1, is counter-balanced by the sparsity-promoting term ‖x‖₁, with λ deciding the relative weightage. Note that since we assume τ is known, we know the exact constitution of S₊, S₋ in a data-driven manner in all the techniques including ours, i.e. we assign the i-th measurement to S₊ if y_i = τ and to S₋ if y_i = −τ. We now state and prove an important property of L(y, Ax; τ).

Theorem 1: L(y, Ax; τ) is a convex function of x. ∎
Proof: For proving convexity, we show that the Hessian matrix ∂²L/∂x∂xᵗ is positive semi-definite. Define

Q₁(x) ≜ (1/2) Σ_{i∈S_ns} ((y_i − A_i x)/σ)²;
Q₂(x) ≜ −Σ_{i∈S₊} log(1 − Φ((τ − A_i x)/σ));
Q₃(x) ≜ −Σ_{i∈S₋} log(Φ((−τ − A_i x)/σ)).

It is clear that ∂²Q₁/∂x∂xᵗ = (1/σ²) Σ_{i∈S_ns} A_iᵗ A_i, which is a positive semi-definite matrix. Now, we have:

∂Q₂(x)/∂x = −(1/σ) Σ_{i∈S₊} A_iᵗ φ((τ − A_i x)/σ) / [1 − Φ((τ − A_i x)/σ)],
∂²Q₂(x)/∂x∂xᵗ = (1/σ²) Σ_{i∈S₊} A_iᵗ A_i h_i, where h_i ≜ [φ(u_i)² − (1 − Φ(u_i)) u_i φ(u_i)] / [1 − Φ(u_i)]²;  u_i ≜ (τ − A_i x)/σ.   (9)

In the expression for the Hessian of Q₂, we note that terms such as A_iᵗ A_i form a positive semi-definite matrix ∀i, and the denominator in every h_i is non-negative. If we can prove that the numerator of each term h_i is non-negative as well, then we can show Q₂ to be convex, since its Hessian would be positive semi-definite. The numerator has the form φ(u)H(u), where H(u) ≜ φ(u) − [1 − Φ(u)]u. Since φ(u) ≥ 0, it suffices to show that H(u) ≥ 0. We see that H(0) = 1/√(2π); H(∞) = lim_{u→∞} φ(u) − [1 − Φ(u)]u = 0. The latter is because as u → ∞, φ(u) → 0 and [1 − Φ(u)] → 0; moreover, the rate of convergence of [1 − Φ(u)] → 0 is faster than that of 1/u → 0 as u → ∞ on the extended real line (indeed 1 − Φ(u) ≤ φ(u)/u for u > 0), so H(∞) = 0. Also H(−∞) = ∞. Noting that φ′(u) = −uφ(u), we see that H′(u) = φ′(u) + uφ(u) − 1 + Φ(u) = Φ(u) − 1 ≤ 0. Hence H(u) is a non-increasing function bounded below by 0, which establishes that H(u) ≥ 0 ∀u ∈ ℝ, and hence ∀i, h_i ≥ 0. Since φ(u) ≥ 0, we see that φ(u)H(u) ≥ 0. This establishes that Q₂ is convex. We can establish the convexity of Q₃ along very similar lines:

∂²Q₃(x)/∂x∂xᵗ = (1/σ²) Σ_{i∈S₋} A_iᵗ A_i g_i, where g_i ≜ [φ(u_i)² + Φ(u_i) u_i φ(u_i)] / [Φ(u_i)]²;  u_i ≜ (−τ − A_i x)/σ.

Using the same technique as for Q₂, we can show that Q₃ is convex. Since Q₁, Q₂, Q₃ are all convex, the convexity of L follows. ∎

5.3 Restricted Strong Convexity of L(y, Ax; τ)

Restricted Strong Convexity (RSC) is an important property for a data fidelity function from the point of view of proving performance bounds. Let Δ ≜ x̂_λ − x* be the difference between an optimal solution and the true parameter x*, and consider the loss difference L(x̂_λ) − L(x*) as defined in [11] (Lemma 1). In the classical setting, it is expected that the loss difference goes to zero with an increase in the sample size of the data in the model. However, convergence in the loss difference is not sufficient to imply that Δ is small. This sufficiency depends on the curvature of the loss function. If the loss function is 'sharply curved' around the optimal value x̂_λ, then having a small loss difference implies having a small Δ. However, if the loss function is 'relatively constant' around the optimal value x̂_λ, then the loss difference may be small but Δ can be relatively large. If the loss function is 'too flat' around the optimal value, it may hamper the convergence of the optimal solution to the true value. Thus, to ensure that the loss function is not 'too flat', the notion of strong convexity is considered. One way to enforce that our cost function L is strongly convex is to require the existence of some positive constant κ_L such that δL(x*, Δ) ≥ κ_L‖Δ‖₂², where δL(x*, Δ) = L(x* + Δ) − L(x*) − ⟨Δ, ∇L(x*)⟩, ∀Δ ∈ ℝⁿ. This ensures that our optimal solution, if it exists, will reach the unique global minimum at a linear convergence rate. However, strong convexity is impossible for all vectors in ℝⁿ in our case, as the matrix A is low rank (size m × n, m < n). Instead, the notion of strong convexity defined over a restricted space of Δ is Restricted Strong Convexity (RSC), defined as follows (see Lemma 1 of [11]):

δL(x*, Δ) ≥ κ_L‖Δ‖₂² − τ_L²(x*)   (10)

for the curvature term κ_L > 0 and tolerance τ_L²(x*), for Δ ∈ C such that C ≜ {Δ : ‖Δ_{S^c}‖₁ ≤ 3‖Δ_S‖₁ + 4‖x*_{S^c}‖₁}, as defined in [11]. Here S stands for the set of indices of the s largest entries of the true signal x*, and S^c is the complement of S. This paper primarily considers purely sparse signals, and hence S would correspond to the s non-zero entries of x*, due to which x*_{S^c} = 0; but extensions to weakly sparse signals are also easily possible.

5.3.1 The form of the function δL(x*, Δ)

In the following, we assume without loss of generality that the measurements are ordered so that indices 1, ..., m₁ are positively saturated, indices m₁ + 1, ..., m₁ + m₂ are negatively saturated, and the remaining m₃ = m − m₁ − m₂ indices are unsaturated. Define

𝒜(x*, Δ) = L(y, A(x* + Δ); τ) − L(y, Ax*; τ)
= −Σ_{i=1}^{m₁} ln[1 − Φ((τ − A_i(x* + Δ))/σ)] − Σ_{i=m₁+1}^{m₁+m₂} ln[Φ((−τ − A_i(x* + Δ))/σ)] + (1/2) Σ_{i=m₁+m₂+1}^{m} ((y_i − A_i(x* + Δ))/σ)²
+ Σ_{i=1}^{m₁} ln[1 − Φ((τ − A_i x*)/σ)] + Σ_{i=m₁+1}^{m₁+m₂} ln[Φ((−τ − A_i x*)/σ)] − (1/2) Σ_{i=m₁+m₂+1}^{m} ((y_i − A_i x*)/σ)²
= −Σ_{i=1}^{m₁} (ln[1 − Φ((τ − A_i(x* + Δ))/σ)] − ln[1 − Φ((τ − A_i x*)/σ)])
− Σ_{i=m₁+1}^{m₁+m₂} (ln[Φ((−τ − A_i(x* + Δ))/σ)] − ln[Φ((−τ − A_i x*)/σ)])
+ (1/2) Σ_{i=m₁+m₂+1}^{m} (((y_i − A_i(x* + Δ))/σ)² − ((y_i − A_i x*)/σ)²).   (11)
In 𝒜(x*, Δ), we have 3 terms. Simplifying the 3rd term:

(1/2) Σ_{i=m₁+m₂+1}^{m} (((y_i − A_i(x* + Δ))/σ)² − ((y_i − A_i x*)/σ)²)
= (1/(2σ²)) Σ_{i=m₁+m₂+1}^{m} {y_i² + [A_i(x* + Δ)]² − 2y_i A_i(x* + Δ) − y_i² − (A_i x*)² + 2y_i A_i x*}
= (1/(2σ²)) Σ_{i=m₁+m₂+1}^{m} {Δᵗ A_iᵗ A_i Δ + 2x*ᵗ A_iᵗ A_i Δ − 2y_i A_i Δ}.   (12)

Again, define

ℬ(x*, Δ) = ⟨Δ, ∇L(x*)⟩
= {−(1/σ) Σ_{i=1}^{m₁} A_i φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)] + (1/σ) Σ_{i=m₁+1}^{m₁+m₂} A_i φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ) − (1/σ) Σ_{i=m₁+m₂+1}^{m} A_i (y_i − A_i x*)/σ} · Δ   (13)
= −(1/σ) Σ_{i=1}^{m₁} (A_i Δ) φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)] + (1/σ) Σ_{i=m₁+1}^{m₁+m₂} (A_i Δ) φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ) − (1/σ²) Σ_{i=m₁+m₂+1}^{m} (y_i A_i Δ − x*ᵗ A_iᵗ A_i Δ).   (14)

Now,

δL(x*, Δ) = 𝒜(x*, Δ) − ℬ(x*, Δ)
= Σ_{i=1}^{m₁} {−ln[1 − Φ((τ − A_i(x* + Δ))/σ)] + ln[1 − Φ((τ − A_i x*)/σ)] + (1/σ)(A_i Δ) φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)]}
+ Σ_{i=m₁+1}^{m₁+m₂} {−ln[Φ((−τ − A_i(x* + Δ))/σ)] + ln[Φ((−τ − A_i x*)/σ)] − (1/σ)(A_i Δ) φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ)}
+ (1/(2σ²)) Σ_{i=m₁+m₂+1}^{m} Δᵗ A_iᵗ A_i Δ.   (15)

δL(x*, Δ) thus consists of 3 terms, as seen in equation 15:

Term 1: Σ_{i=1}^{m₁} {−ln[1 − Φ((τ − A_i(x* + Δ))/σ)] + ln[1 − Φ((τ − A_i x*)/σ)] + (1/σ)(A_i Δ) φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)]};
Term 2: Σ_{i=m₁+1}^{m₁+m₂} {−ln[Φ((−τ − A_i(x* + Δ))/σ)] + ln[Φ((−τ − A_i x*)/σ)] − (1/σ)(A_i Δ) φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ)};
Term 3: (1/(2σ²)) Σ_{i=m₁+m₂+1}^{m} Δᵗ A_iᵗ A_i Δ.   (16)

The target now is to prove the non-negativity of each of the 3 terms separately, and thereby prove the non-negativity of δL(x*, Δ).

5.3.2 To prove that Term 1 ≥ 0

Define a function g: ℝ → ℝ such that g(u) = φ(u)/(1 − Φ(u)). Taking u_i = (τ − A_i x*)/σ and k_i = A_i Δ/σ, ∀i = 1, ..., m₁, we can write Term 1 as

Σ_{i=1}^{m₁} {ln[1 − Φ(u_i)] − ln[1 − Φ(u_i − k_i)] + k_i φ(u_i)/[1 − Φ(u_i)]}.   (17)

Define a function f: ℝ → ℝ such that f(u) = ln[1 − Φ(u)] − ln[1 − Φ(u − k)] + k g(u), where k is any constant.

Claim 1: f(·) is a monotonically increasing function.
Proof: Differentiating f w.r.t. u, we get

f′(u) = −φ(u)/(1 − Φ(u)) + φ(u − k)/(1 − Φ(u − k)) + k g′(u) = −g(u) + g(u − k) + k g′(u).   (18)

Taking the Taylor series expansion of g(·) up to the second-order term, g(u − k) = g(u) − k g′(u) + (k²/2) g″(ζ), with ζ between u − k and u and k ∈ ℝ, we obtain −g(u) + g(u − k) + k g′(u) = (k²/2) g″(ζ). Replacing this form in equation 18:

f′(u) = (k²/2) g″(ζ).   (19)

Here, g(u) is the inverse of the Mills ratio, which is proved to be a convex function in [6]. By the definition of a convex function,

g″(u) ≥ 0 ∀u ∈ ℝ.   (20)

Incorporating equation 20 in 19, we get, for any k = k_i, i = 1, 2, ..., m₁: f′(u) = (k²/2) g″(ζ) ≥ 0 ∀u ∈ ℝ. This implies that f(·) is a monotonically increasing function.

Claim 2: f(−∞) = 0.
Proof: lim_{u→−∞} Φ(u − k) = lim_{u→−∞} Φ(u) = 0, which implies lim_{u→−∞} ln[1 − Φ(u − k)] = lim_{u→−∞} ln[1 − Φ(u)] = 0. Also, lim_{u→−∞} φ(u) = 0 implies lim_{u→−∞} g(u) = 0. Hence,

f(−∞) = lim_{u→−∞} f(u) = lim_{u→−∞} ln[1 − Φ(u)] − lim_{u→−∞} ln[1 − Φ(u − k)] + k lim_{u→−∞} g(u) = 0.   (21)

Thus, from Claim 1 and Claim 2, f(·) is a monotonically increasing function with limit 0 at −∞, and is hence bounded below by 0. This implies

f(u) ≥ 0 ∀u ∈ ℝ.   (22)

Putting equation 22 in 17, we have Term 1 = Σ_{i=1}^{m₁} f(u_i) (with k = k_i in the i-th summand). Since f(u_i) ≥ 0 ∀u_i,

Term 1 ≥ 0.   (23)

5.3.3 To prove that Term 2 ≥ 0

Define a function h: ℝ → ℝ such that h(v) = φ(v)/Φ(v). Taking v_i = (−τ − A_i x*)/σ and k_i = A_i Δ/σ, we can write Term 2 as

Σ_{i=m₁+1}^{m₁+m₂} {ln[Φ(v_i)] − ln[Φ(v_i − k_i)] − k_i φ(v_i)/Φ(v_i)}.   (24)

Define a function f₂: ℝ → ℝ such that f₂(v) = ln[Φ(v)] − ln[Φ(v − k)] − k h(v), where k is any constant.

Claim 3: f₂(·) is a monotonically decreasing function.
Proof: Differentiating f₂ w.r.t. v, we get

f₂′(v) = φ(v)/Φ(v) − φ(v − k)/Φ(v − k) − k h′(v) = h(v) − h(v − k) − k h′(v).   (25)

Taking the Taylor series expansion of h(·) up to the second-order term, h(v − k) = h(v) − k h′(v) + (k²/2) h″(ζ), with ζ between v − k and v and k ∈ ℝ, we obtain h(v) − h(v − k) − k h′(v) = −(k²/2) h″(ζ). Hence

f₂′(v) = −(k²/2) h″(ζ).   (26)

Lemma 1: h(·) is a convex function.
Proof: For the standard normal distribution, consider two properties:

(i) φ(x) = φ(−x) ∀x ∈ ℝ;  (ii) 1 − Φ(x) = Φ(−x) ∀x ∈ ℝ.   (27)

We have g(x) = φ(x)/(1 − Φ(x)) = φ(−x)/Φ(−x) = h(−x) ∀x ∈ ℝ. Differentiating w.r.t. x, g′(x) = −h′(−x) ∀x ∈ ℝ. Differentiating again, g″(x) = h″(−x) ∀x ∈ ℝ. As g(·) is a convex function, h″(−x) = g″(x) ≥ 0, i.e., h″(x) ≥ 0 ∀x ∈ ℝ:

h(·) is a convex function.   (28)

Thus, putting equation 28 in 26, we get, for any k = k_i: f₂′(v) = −(k²/2) h″(ζ) ≤ 0 ∀v ∈ ℝ. This implies that f₂(·) is a monotonically decreasing function.

Claim 4: f₂(∞) = 0.
Proof: lim_{v→∞} Φ(v − k) = lim_{v→∞} Φ(v) = 1, which implies lim_{v→∞} ln[Φ(v − k)] = lim_{v→∞} ln[Φ(v)] = 0. Also, lim_{v→∞} φ(v) = 0 implies lim_{v→∞} h(v) = 0. Hence,

f₂(∞) = lim_{v→∞} f₂(v) = lim_{v→∞} ln[Φ(v)] − lim_{v→∞} ln[Φ(v − k)] − k lim_{v→∞} h(v) = 0.   (29)

Thus, from Claim 3 and Claim 4, f₂(·) is a monotonically decreasing function with limit 0 at +∞, and is hence bounded below by 0. This implies

f₂(v) ≥ 0 ∀v ∈ ℝ.   (30)

Putting equation 30 in equation 24, we have Term 2 = Σ_{i=m₁+1}^{m₁+m₂} f₂(v_i). Since f₂(v_i) ≥ 0 ∀v_i,

Term 2 ≥ 0.   (31)

5.3.4 To prove that Term 3 ≥ 0

Partition the matrix A (of size m × n) into row blocks as A = [A₁; A₂; A₃], where A₁ (of size m₁ × n) contains the positively saturated rows, A₂ (of size m₂ × n) the negatively saturated rows, and A₃ (of size m₃ × n) the unsaturated rows. Rewriting Term 3:

Term 3 = (1/(2σ²)) Σ_{i=m₁+m₂+1}^{m} Δᵗ A_iᵗ A_i Δ = (1/(2σ²)) Δᵗ A₃ᵗ A₃ Δ.

Also, Σ_{i=m₁+m₂+1}^{m} k_i² = Σ_{i=m₁+m₂+1}^{m} (A_i Δ)ᵗ(A_i Δ)/σ² = Δᵗ A₃ᵗ A₃ Δ/σ². From [7],

Δᵗ A₃ᵗ A₃ Δ = ‖A₃ Δ‖₂² > γ‖Δ‖₂² ∀Δ ∈ C, where C = {Δ ∈ ℝⁿ : ‖Δ_{S^c}‖₁ ≤ α‖Δ_S‖₁} and γ is a non-negative constant (the restricted eigenvalue constant). So

Σ_{i=m₁+m₂+1}^{m} k_i² > γ‖Δ‖₂²/σ².   (32)

From 32, we have Term 3 = (1/2) Σ_{i=m₁+m₂+1}^{m} k_i² > γ‖Δ‖₂²/(2σ²) ≥ 0, i.e.,

Term 3 > 0.   (33)

5.3.5 L(y, Ax; τ) satisfies the RSC property

Thus, from equations 23, 31 and 33, we have

δL(x*, Δ) = Term 1 + Term 2 + Term 3 ≥ γ‖Δ‖₂²/(2σ²).

This inequality holds for Δ ∈ C where C ≜ {Δ : ‖Δ_{S^c}‖₁ ≤ α‖Δ_S‖₁}. In our model, the vector x* is strictly sparse; hence ‖x*_{S^c}‖₁ = 0. Taking α = 3, the set C satisfies the condition on Δ required by RSC (cf. Lemma L1). Hence, L satisfies Restricted Strong Convexity with curvature κ_L = γ/(2σ²) and tolerance τ_L = 0.

5.4 Bounds on ‖∇L‖∞

The gradient ∇L is given by

∇L = −(1/σ) Σ_{i=1}^{m₁} A_iᵗ φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)] + (1/σ) Σ_{i=m₁+1}^{m₁+m₂} A_iᵗ φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ) − (1/σ) Σ_{i=m₁+m₂+1}^{m} A_iᵗ (y_i − A_i x*)/σ.   (34)

∇L consists of 3 terms:

Term 1 ≜ (1/σ) Σ_{i=1}^{m₁} A_iᵗ φ((τ − A_i x*)/σ)/[1 − Φ((τ − A_i x*)/σ)];
Term 2 ≜ (1/σ) Σ_{i=m₁+1}^{m₁+m₂} A_iᵗ φ((−τ − A_i x*)/σ)/Φ((−τ − A_i x*)/σ);
Term 3 ≜ −(1/σ) Σ_{i=m₁+m₂+1}^{m} A_iᵗ (y_i − A_i x*)/σ,   (35)

such that ∇L = −Term 1 + Term 2 + Term 3.
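The decomposition ∇L = −Term 1 + Term 2 + Term 3 can be checked against a finite-difference gradient; the sketch below is our own verification code (not part of the supplement), and assumes neg_log_likelihood and grad_L from the earlier sketches are in scope along with the data they close over.

import numpy as np
from scipy.optimize import approx_fprime

# closure over (A, y, tau, sigma, S_plus, S_minus, S_ns) from the earlier sketches
L_fun = lambda v: neg_log_likelihood(v, A, y, tau, sigma, S_plus, S_minus, S_ns)

x0 = np.zeros(A.shape[1])
numerical = approx_fprime(x0, L_fun, 1e-6)
analytical = grad_L(x0, A, y, tau, sigma, S_plus, S_minus, S_ns)
print(np.max(np.abs(numerical - analytical)))  # should be small (~1e-4 or less)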
5.4.1 Boundedness of the signal

For deriving this bound, we consider one condition: the signal x is bounded, i.e.,

α ≤ x ≤ β,   (36)

where all elements of the vector α equal α and all elements of β equal β.

5.4.2 Bounds on the arguments of Φ(·) and φ(·)

From (36), we have α ≤ x_j ≤ β ∀j = 1, 2, ..., n. Thus, for any i = 1, 2, ..., m (A_ij being the (i, j)-th element of A): if A_ij ≥ 0, then A_ij α ≤ A_ij x_j ≤ A_ij β, ∀j : A_ij ≥ 0; if A_ij < 0, then A_ij β ≤ A_ij x_j ≤ A_ij α, ∀j : A_ij < 0. Now A_i x = Σ_{j=1}^{n} A_ij x_j, ∀i = 1, 2, ..., m. So A_i x is bounded by

Σ_{j: A_ij ≥ 0} A_ij α + Σ_{j: A_ij < 0} A_ij β ≤ Σ_{j=1}^{n} A_ij x_j ≤ Σ_{j: A_ij ≥ 0} A_ij β + Σ_{j: A_ij < 0} A_ij α.   (37)

Let p_i = Σ_{j: A_ij ≥ 0} A_ij α + Σ_{j: A_ij < 0} A_ij β and q_i = Σ_{j: A_ij ≥ 0} A_ij β + Σ_{j: A_ij < 0} A_ij α, ∀i = 1, 2, ..., m. Thus we have

p_i ≤ A_i x ≤ q_i  ⟹  (τ − q_i)/σ ≤ (τ − A_i x)/σ ≤ (τ − p_i)/σ.   (38)

We know that Φ(·) is a non-decreasing function. Thus, from (38), ∀i = 1, 2, ..., m,

Φ((τ − q_i)/σ) ≤ Φ((τ − A_i x)/σ) ≤ Φ((τ − p_i)/σ).   (39)

Also,

−Φ((τ − p_i)/σ) ≤ −Φ((τ − A_i x)/σ) ≤ −Φ((τ − q_i)/σ)  ⟹  1 − Φ((τ − p_i)/σ) ≤ 1 − Φ((τ − A_i x)/σ) ≤ 1 − Φ((τ − q_i)/σ).   (40)

5.4.3 Bounds on φ((τ − A_i x)/σ)

Now, since φ(·) is not a monotone function, there are 3 cases depending on the values of p_i and q_i, ∀i = 1, 2, ..., m.

Case 1: −∞ < τ − q_i ≤ τ − p_i ≤ 0 for some i. φ(·) is an increasing function on (−∞, 0]. Hence,

φ((τ − q_i)/σ) ≤ φ((τ − A_i x)/σ) ≤ φ((τ − p_i)/σ)  ⟹  K_i ≤ φ((τ − A_i x)/σ) ≤ L_i,   (41)

where K_i = φ((τ − q_i)/σ) and L_i = φ((τ − p_i)/σ).

Case 2: 0 ≤ τ − q_i ≤ τ − p_i < ∞ for some i. φ(·) is a non-increasing function on [0, ∞). Hence,

φ((τ − p_i)/σ) ≤ φ((τ − A_i x)/σ) ≤ φ((τ − q_i)/σ)  ⟹  K_i ≤ φ((τ − A_i x)/σ) ≤ L_i,   (42)

where K_i = φ((τ − p_i)/σ) and L_i = φ((τ − q_i)/σ).

Case 3: −∞ < τ − q_i ≤ 0 ≤ τ − p_i < ∞ for some i. Here,

min{φ((τ − q_i)/σ), φ((τ − p_i)/σ)} ≤ φ((τ − A_i x)/σ) ≤ φ(0) = 1/√(2π)  ⟹  K_i ≤ φ((τ − A_i x)/σ) ≤ L_i,   (43)

where K_i = min{φ((τ − q_i)/σ), φ((τ − p_i)/σ)} and L_i = 1/√(2π).

5.4.4 Bound for Term 1

Combining equations 40, 41, 42 and 43, we get

K_i/[1 − Φ((τ − q_i)/σ)] ≤ φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] ≤ L_i/[1 − Φ((τ − p_i)/σ)]  ∀i = 1, 2, ..., m.   (44)

For a given A_i, if A_ij, i.e. the j-th element of the row vector A_i of the sensing matrix A, is positive, then multiplying equation 44 by A_ij, we get

U_ij ≤ A_ij φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] ≤ V_ij  ∀i = 1, 2, ..., m,   (45)

where U_ij = A_ij K_i/[1 − Φ((τ − q_i)/σ)] and V_ij = A_ij L_i/[1 − Φ((τ − p_i)/σ)]. Again, if A_ij is negative, then multiplying equation 44 by A_ij (and flipping the inequalities), we get

U_ij ≤ A_ij φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] ≤ V_ij  ∀i = 1, 2, ..., m,   (46)

where U_ij = A_ij L_i/[1 − Φ((τ − p_i)/σ)] and V_ij = A_ij K_i/[1 − Φ((τ − q_i)/σ)]. Let us define U_i = (U_i1, U_i2, ..., U_in)ᵗ and V_i = (V_i1, V_i2, ..., V_in)ᵗ. We can now join equations 45 and 46 as a componentwise vector inequality:

U_i ≤ A_iᵗ φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] ≤ V_i  ∀i = 1, 2, ..., m.   (47)

From equation 47, summing over i = 1, 2, ..., m₁ and multiplying throughout by 1/σ, we have

(1/σ) Σ_{i=1}^{m₁} U_i ≤ (1/σ) Σ_{i=1}^{m₁} A_iᵗ φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] ≤ (1/σ) Σ_{i=1}^{m₁} V_i.   (48)

These are the bounds for Term 1.

5.4.5 Bound for Term 2

In equations 39, 41, 42 and 43, replace τ by −τ. Let K′_i = K_i with τ replaced by −τ, and L′_i = L_i with τ replaced by −τ, ∀i = 1, 2, ..., m. Combining these equations together, we have

K′_i/Φ((−τ − p_i)/σ) ≤ φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ L′_i/Φ((−τ − q_i)/σ)  ∀i = 1, 2, ..., m.   (49)

For a given A_i, if A_ij is positive, then multiplying equation 49 by A_ij, we get

U′_ij ≤ A_ij φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ V′_ij  ∀i = 1, 2, ..., m,   (50)

where U′_ij = A_ij K′_i/Φ((−τ − p_i)/σ) and V′_ij = A_ij L′_i/Φ((−τ − q_i)/σ). Again, if A_ij is negative, then multiplying equation 49 by A_ij (and flipping the inequalities), we get

U′_ij ≤ A_ij φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ V′_ij  ∀i = 1, 2, ..., m,   (51)

where U′_ij = A_ij L′_i/Φ((−τ − q_i)/σ) and V′_ij = A_ij K′_i/Φ((−τ − p_i)/σ). Let us define U′_i = (U′_i1, U′_i2, ..., U′_in)ᵗ and V′_i = (V′_i1, V′_i2, ..., V′_in)ᵗ. We can now join equations 50 and 51 as a vector inequality:

U′_i ≤ A_iᵗ φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ V′_i  ∀i = 1, 2, ..., m.   (52)

From equation 52, summing over i = m₁ + 1, ..., m₁ + m₂ and multiplying throughout by 1/σ, we have

(1/σ) Σ_{i=m₁+1}^{m₁+m₂} U′_i ≤ (1/σ) Σ_{i=m₁+1}^{m₁+m₂} A_iᵗ φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ (1/σ) Σ_{i=m₁+1}^{m₁+m₂} V′_i.   (53)

These are the bounds for Term 2.

5.4.6 ℓ∞ norm bounds for Term 2 − Term 1

From equations 48 and 53, Term 2 − Term 1 is bounded componentwise by

(1/σ){−Σ_{i=1}^{m₁} V_i + Σ_{i=m₁+1}^{m₁+m₂} U′_i} ≤ −(1/σ) Σ_{i=1}^{m₁} A_iᵗ φ((τ − A_i x)/σ)/[1 − Φ((τ − A_i x)/σ)] + (1/σ) Σ_{i=m₁+1}^{m₁+m₂} A_iᵗ φ((−τ − A_i x)/σ)/Φ((−τ − A_i x)/σ) ≤ (1/σ){−Σ_{i=1}^{m₁} U_i + Σ_{i=m₁+1}^{m₁+m₂} V′_i}.   (54)

From the inequality 54, the ℓ∞ norm of Term 2 − Term 1 is bounded by

‖Term 2 − Term 1‖∞ ≤ (1/σ) max{‖Σ_{i=1}^{m₁} V_i − Σ_{i=m₁+1}^{m₁+m₂} U′_i‖∞, ‖Σ_{i=1}^{m₁} U_i − Σ_{i=m₁+1}^{m₁+m₂} V′_i‖∞} = Q/σ,   (55)

where we define Q ≜ max{‖Σ_{i=1}^{m₁} V_i − Σ_{i=m₁+1}^{m₁+m₂} U′_i‖∞, ‖Σ_{i=1}^{m₁} U_i − Σ_{i=m₁+1}^{m₁+m₂} V′_i‖∞}.

Description of Q

From equation 55, we have Q as defined above. Now, U_i, V_i, U′_i, V′_i for all i are n × 1 vectors whose entries are entries of A multiplied by some scalar. Since all elements A_ij of the matrix A are drawn from a Gaussian N(0, 1/m) distribution, each entry of one of the four vectors V_i, U_i, U′_i, V′_i is a scalar multiplied by some A_ij. Let C₁ be an upper bound on the absolute values of all the scalars multiplying the entries of all the vectors U_i, V_i, U′_i, V′_i, for all i. Now C₁A_ij ∼ N(0, C₁²/m). Each entry of the two vectors inside the max in Q is a sum of m₁ + m₂ terms of the form C₁A_ij, and hence is dominated by a variable with the distribution N(0, (m₁ + m₂)C₁²/m). To find Q, we need the maximum over the n entries of the two aforementioned vectors. To do this, we use the union bound as in Example 11.1 of [7]. Since the vectors are of dimension n × 1, putting the union bound on the two aforementioned terms brings the quantity log(n) into the structure of Q. From this, we get Q of the form C₁√((m₁ + m₂) log(n)/m).

5.4.7 ℓ∞ norm bound on Term 3
We have Term 3 = −(1/σ) Σ_{i=m₁+m₂+1}^{m} A_iᵗ (y_i − A_i x*)/σ. We know that y_i ∼ N(A_i x*, σ²) for the unsaturated indices i = m₁ + m₂ + 1, ..., m. Standardising, z_i = (y_i − A_i x*)/σ for these indices; hence z_i ∼ N(0, 1). Recall the partition of A into A₁, A₂, A₃ from Sec. 5.3.4; the entries A_ij of A₃ satisfy A_ij ∼ N(0, 1/m). Let z denote the standardised measurements corresponding to A₃. Clearly, Term 3 = −(1/σ) A₃ᵗ z. Note that A₃^{(j)ᵗ} z is the j-th element of the vector A₃ᵗ z, ∀j = 1, 2, ..., n, where A₃^{(j)} denotes the j-th column of A₃. By the property of linear combinations of normal variables,

A₃^{(j)ᵗ} z ∼ N(0, ‖A₃^{(j)}‖₂²)  ∀j = 1, 2, ..., n,   (56)

where ‖A₃^{(j)}‖₂² = Σ_{i=m₁+m₂+1}^{m} A_ij². Now,

A_ij ∼ N(0, 1/m) ⟹ √m A_ij ∼ N(0, 1) ⟹ m A_ij² ∼ χ₁² ⟹ m Σ_{i=m₁+m₂+1}^{m} A_ij² ∼ χ²_{m₃} ⟹ E[m Σ_{i=m₁+m₂+1}^{m} A_ij²] = m₃ ⟹ E[Σ_{i=m₁+m₂+1}^{m} A_ij²] = m₃/m  ∀j = 1, 2, ..., n.   (57)

We take the approximation ‖A₃^{(j)}‖₂² = m₃/m ∀j = 1, 2, ..., n. Hence,

A₃^{(j)ᵗ} z ∼ N(0, m₃/m) ⟹ −A₃^{(j)ᵗ} z ∼ N(0, m₃/m) ⟹ −A₃^{(j)ᵗ} z/σ ∼ N(0, m₃/(mσ²)).   (58)

Thus, the Gaussian tail bound is given by

P[|A₃^{(j)ᵗ} z/σ| ≥ u] ≤ 2 exp{−u²σ²m/(2m₃)}.   (59)

The union bound over j = 1, ..., n applied to equation 59 gives us

P[‖A₃ᵗ z/σ‖∞ ≥ u] ≤ 2 exp{−u²σ²m/(2m₃) + log(n)}.   (60)

Setting u = (1/σ)√(m₃ϱ log(n)/m), where ϱ > 2, we get

P[‖A₃ᵗ z/σ‖∞ ≥ (1/σ)√(m₃ϱ log(n)/m)] ≤ 2 exp{−½(ϱ − 2) log(n)},  i.e.,  ‖Term 3‖∞ ≤ (1/σ)√(m₃ϱ log(n)/m)   (61)

with probability at least 1 − 2 exp{−½(ϱ − 2) log(n)}.   (62)

5.4.8 ℓ∞ norm bound for ∇L

Let ϑ₁ = Term 3 and ϑ₂ = Term 2 − Term 1. From the reverse triangle inequality applied to equations 55 and 61, we have

‖ϑ₁ + ϑ₂‖∞ ≥ |‖ϑ₁‖∞ − ‖ϑ₂‖∞|  ⟹  ‖−Term 1 + Term 2 + Term 3‖∞ ≥ |(1/σ)√(m₃ϱ log(n)/m) − Q/σ|  ⟹  ‖∇L‖∞ ≥ (1/σ)|√(m₃ϱ log(n)/m) − Q|   (63)

with probability at least 1 − 2 exp{−½(ϱ − 2) log(n)}. Again, from the triangle inequality on 55 and 61, we have

‖ϑ₁ + ϑ₂‖∞ ≤ ‖ϑ₁‖∞ + ‖ϑ₂‖∞  ⟹  ‖−Term 1 + Term 2 + Term 3‖∞ ≤ (1/σ)√(m₃ϱ log(n)/m) + Q/σ  ⟹  ‖∇L‖∞ ≤ (1/σ){√(m₃ϱ log(n)/m) + Q}   (64)

with probability at least 1 − 2 exp{−½(ϱ − 2) log(n)}, where Q is as defined in equation 55. This upper bound will be useful in the final performance bounds.
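The tail-plus-union-bound argument for Term 3 is easy to verify by simulation: the sketch below empirically estimates P[‖A₃ᵗz/σ‖∞ ≥ u] at the threshold u = (1/σ)√(m₃ϱ log(n)/m) and compares it with 2 exp{−½(ϱ − 2) log(n)}. This is our own check, with arbitrary sample sizes.

import numpy as np

rng = np.random.default_rng(0)
m, n, m3, sigma, rho, trials = 150, 256, 100, 0.1, 3.0, 2000
u = np.sqrt(m3 * rho * np.log(n) / m) / sigma

exceed = 0
for _ in range(trials):
    A3 = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m3, n))  # unsaturated rows of A
    z = rng.normal(size=m3)                               # standardized measurements
    exceed += np.max(np.abs(A3.T @ z)) / sigma >= u
print(exceed / trials, 2.0 * np.exp(-0.5 * (rho - 2.0) * np.log(n)))
# the empirical frequency should not exceed the theoretical bound (0.125 here)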
5.5 Proof of Theorem 4

From Theorem 1 of [11], given a λ ≥ 2R*(∇L(y, Ax; τ)), for any optimal solution x̂_λ with regularizer λ, the reconstruction error of the cost function satisfies the upper bound

‖x̂_λ − x*‖₂² ≤ 9(λ²/κ_L²)ψ²(M) + (λ/κ_L)[2τ_L²(x*) + 4R(x*_M)],   (65)

where R(·) is the regularization function, R*(·) is the dual of the regularization function, ψ(M) = sup_{v∈M, v≠0} R(v)/‖v‖₂, and x*_M contains all the elements except the s largest elements of the vector x*, as defined in [11]. In our model, R(x) = ‖x‖₁ and R*(x) = ‖x‖∞. Since the true signal x* is assumed to be strictly sparse, R(x*_M) = 0. Also, ψ²(M) = s, where s is the sparsity of the original signal x* [7]. Now, from the upper bound for ‖∇L‖∞ in 64, we take λ = (2/σ)(√(m₃ϱ log(n)/m) + Q), where ϱ > 2. Recalling κ_L = γ/(2σ²) and τ_L = 0, we get

‖x̂_λ − x*‖₂² ≤ 9{(2/σ)(√(m₃ϱ log(n)/m) + Q)}² × {2σ²/γ}² × s = 144 s {√(m₃ϱ log(n)/m) + Q}² σ²/γ²,   (66)

where Q ≜ max{‖Σ_{i=1}^{m₁} V_i − Σ_{i=m₁+1}^{m₁+m₂} U′_i‖∞, ‖Σ_{i=1}^{m₁} U_i − Σ_{i=m₁+1}^{m₁+m₂} V′_i‖∞}. This proves Theorem 4. As described above in the description of Q, Q is of the order O(C₁√((m₁ + m₂) log(n)/m)). Note that the range of values in the signal x is from α to β, both of which could potentially have large absolute value. The terms V_i, U_i, U′_i, V′_i involve factors of the form φ(·)/Φ(·) or φ(·)/(1 − Φ(·)) evaluated at arguments determined by α and β. Hence, these terms can be really large, and so the coefficient C₁ in Q would also be very large. Consequently, the m₁ + m₂ term dominates in the upper bound presented in Eqn. 66, i.e., with an increase in the number of saturated measurements, the upper bound on the reconstruction error becomes looser.

References

[1] Supplemental material for this paper.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[3] P. Bohra, D. Garg, K. S. Gurumoorthy, and A. Rajwade. Variance-stabilization-based compressive inversion under Poisson or Poisson-Gaussian noise with analytical bounds. Inverse Problems, 35(10), 2019.
[4] E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 2008.
[5] S. Foucart and J. Li. Sparse recovery from inaccurate saturated measurements. Acta Applicandae Mathematicae, 158:49-66, 2018.
[6] A. Gasull and F. Utzet. Approximating Mills ratio. Journal of Mathematical Analysis and Applications, pages 1841-1844, 2014.
[7] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press, 2015.
[8] J. Laska, P. Boufounos, M. Davenport, and R. Baraniuk. Democracy in action: Quantization, saturation, and compressive sensing. Applied and Computational Harmonic Analysis, 31(3):429-443, 2011.
[9] J. Laska, M. Davenport, and R. Baraniuk. Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In IEEE Asilomar Conference on Signals, Systems and Computers, 2009.
[10] Z. Li, F. Wu, and J. Wright. On the systematic measurement matrix for compressed sensing in the presence of gross errors. In IEEE Data Compression Conference, 2010.
[11] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4), 2012.
[12] G. Raskutti, M. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 2010.
[13] G. Tzagkarakis, J. Nolan, and P. Tsakalides. Compressive sensing using symmetric alpha-stable distributions for robust sparse signal reconstruction. IEEE Transactions on Signal Processing, 67(3), 2019.
[14] J. Zhang, L. Chen, P. T. Boufounos, and Y. Gu. On the theoretical analysis of cross validation in compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.