Global Guarantees for Blind Demodulation with Generative Priors
Paul Hand∗ and Babhru Joshi†

∗[email protected], Department of Mathematics and College of Computer and Information Science, Northeastern University
†[email protected], Department of Computational and Applied Mathematics, Rice University

May 30, 2019
Abstract
We study a deep learning inspired formulation for the blind demodulation problem, which is the task of recovering two unknown vectors from their entrywise multiplication. We consider the case where the unknown vectors are in the range of known deep generative models $G^{(1)}: \mathbb{R}^n \to \mathbb{R}^\ell$ and $G^{(2)}: \mathbb{R}^p \to \mathbb{R}^\ell$. In the case when the networks corresponding to the generative models are expansive, the weight matrices are random, and the dimension of the unknown vectors satisfies $\ell = \Omega(n^2 + p^2)$, up to log factors, we show that the empirical risk objective has a favorable landscape for optimization. That is, the objective function has a descent direction at every point outside of a small neighborhood around four hyperbolic curves. We also characterize the local maximizers of the empirical risk objective and, hence, show that there do not exist any other stationary points outside of these neighborhoods around the four hyperbolic curves and the set of local maximizers. We also implement a gradient descent scheme inspired by the geometry of the landscape of the objective function. In order to converge to a global minimizer, this scheme exploits the fact that exactly one of the four hyperbolic curves corresponds to the global minimizer, so that points near this curve have a lower objective value than points close to the other, spurious curves. We show that this gradient descent scheme can effectively remove distortions synthetically introduced to the MNIST dataset.

1 Introduction

We study the problem of recovering two unknown vectors $x \in \mathbb{R}^\ell$ and $w \in \mathbb{R}^\ell$ from observations $y \in \mathbb{R}^\ell$ of the form

$$y = w \odot x, \qquad (1)$$

where $\odot$ denotes entrywise multiplication. This bilinear inverse problem (BIP) is known as the blind demodulation problem. BIPs in general have been extensively studied and include problems such as blind deconvolution/demodulation [Ahmed et al., 2014, Stockham et al., 1975, Kundur and Hatzinakos, 1996, Aghasi et al., 2016], phase retrieval [Fienup, 1982, Candès and Li, 2012, Candès et al., 2013], dictionary learning [Tosic and Frossard, 2011], non-negative matrix factorization [Hoyer, 2004, Lee and Seung, 2001], and self-calibration [Ling and Strohmer, 2015].

A fundamental challenge in BIPs like (1) is the scaling ambiguity: if $(w, x)$ solves (1), then so does $(c w, \frac{1}{c} x)$ for any $c \neq 0$. In addition to the scaling ambiguity, this BIP is difficult to solve because the solutions are non-unique even modulo scaling. For example, $(w, x)$ and $(\mathbf{1}, w \odot x)$ both satisfy (1). This structural ambiguity can be resolved by assuming a prior model on the unknown vectors. In past works relating to blind deconvolution and blind demodulation [Ahmed et al., 2014, Aghasi et al., 2019], this structural ambiguity was addressed by assuming a subspace prior, i.e. that the unknown signals belong to known subspaces. Additionally, in many applications the signals are compressible or sparse with respect to a basis like a wavelet basis or the Discrete Cosine Transform basis, and such a sparsity prior can likewise resolve the structural ambiguity.

In contrast to subspace and sparsity priors, we address the structural ambiguity by assuming that the signals $w$ and $x$ belong to the ranges of known generative models $G^{(1)}: \mathbb{R}^n \to \mathbb{R}^\ell$ and $G^{(2)}: \mathbb{R}^p \to \mathbb{R}^\ell$, respectively. That is, we assume that $w = G^{(1)}(h)$ for some $h \in \mathbb{R}^n$ and $x = G^{(2)}(m)$ for some $m \in \mathbb{R}^p$. So, to recover the unknown vectors $w$ and $x$, we first recover the latent codes $h$ and $m$ and then apply $G^{(1)}$ and $G^{(2)}$ to $h$ and $m$, respectively.
Thus, the blind demodulation problem under a generative prior that we study is: find $h \in \mathbb{R}^n$ and $m \in \mathbb{R}^p$, up to the scaling ambiguity, such that $y = G^{(1)}(h) \odot G^{(2)}(m)$.

In recent years, advances in generative modeling of images [Karras et al., 2017] have significantly increased the scope of using a generative model as a prior in inverse problems. Generative models are now used in speech synthesis [van den Oord et al., 2016], image in-painting [Iizuka et al., 2017], image-to-image translation [Zhu et al., 2017], superresolution [Sønderby et al., 2017], compressed sensing [Bora et al., 2017, Lohit et al., 2018], blind deconvolution [Asim et al., 2018], blind ptychography [Shamshad et al., 2018], and many more fields. Most of these papers empirically show that using a generative model as a prior to solve inverse problems outperforms classical methods. For example, in compressed sensing, optimization over the latent code space to recover images from compressive measurements has been shown empirically to succeed with 10x fewer measurements than classical sparsity-based methods [Bora et al., 2017]. Similarly, the authors of Asim et al. [2018] empirically show that generative priors in the image deblurring inverse problem provide a very effective regularization that produces sharp deblurred images from very blurry inputs.

In the present paper, we use generative priors to solve the blind demodulation problem (1). The generative model we consider is an expansive, fully connected, feed-forward neural network with Rectified Linear Unit (ReLU) activation functions and no bias terms. Our main contribution is to show that, for a sufficiently expansive random generative model, the empirical risk objective function has a landscape favorable for gradient-based methods to converge to a global minimizer. Our result implies that if the dimension of the unknown signals satisfies $\ell = \Omega(n^2 + p^2)$, up to log factors, then the landscape is favorable. In comparison, classical sparsity-based methods for similar BIPs, like sparse blind demodulation [Lee et al., 2017] and sparse phase retrieval [Li and Voroninski, 2013], showed that exact recovery of the unknown signals is possible if the number of measurements scales quadratically, up to a log factor, with the sparsity level of the signals. While we show a similar scaling of the number of measurements with respect to the latent code dimension, the latent code dimension can be smaller than the sparsity level of the same signal, and thus recovering the signal under a generative prior can require fewer measurements.

1.1 Problem Formulation

We study the problem of recovering two unknown signals $w$ and $x$ in $\mathbb{R}^\ell$ from observations $y = w \odot x$, where $\odot$ denotes the entrywise product. We assume, as a prior, that the vectors $w$ and $x$ belong to the ranges of $d$-layer and $s$-layer neural networks $G^{(1)}: \mathbb{R}^n \to \mathbb{R}^\ell$ and $G^{(2)}: \mathbb{R}^p \to \mathbb{R}^\ell$, respectively. The task of recovering $w$ and $x$ is then reduced to finding the latent codes $h \in \mathbb{R}^n$ and $m \in \mathbb{R}^p$ such that $G^{(1)}(h) = w$ and $G^{(2)}(m) = x$. More precisely, we consider generative networks modeled by

$$G^{(1)}(h) = \mathrm{relu}\big(W^{(1)}_d \cdots \mathrm{relu}(W^{(1)}_2\, \mathrm{relu}(W^{(1)}_1 h))\cdots\big) \quad \text{and} \quad G^{(2)}(m) = \mathrm{relu}\big(W^{(2)}_s \cdots \mathrm{relu}(W^{(2)}_2\, \mathrm{relu}(W^{(2)}_1 m))\cdots\big),$$

where $\mathrm{relu}(x) = \max(x, 0)$ applies entrywise, $W^{(1)}_i \in \mathbb{R}^{n_i \times n_{i-1}}$ for $i = 1, \dots, d$ with $n = n_0 < n_1 < \cdots < n_d = \ell$, and $W^{(2)}_i \in \mathbb{R}^{p_i \times p_{i-1}}$ for $i = 1, \dots, s$ with $p = p_0 < p_1 < \cdots < p_s = \ell$.
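To fix ideas, the following is a minimal NumPy sketch (ours, not the authors' code) of this measurement model: two bias-free expansive ReLU networks with i.i.d. Gaussian weights of variance $1/n_i$, multiplied entrywise. The layer sizes and the seed are illustrative placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def generator(weights, latent):
    """Bias-free feed-forward ReLU network: relu(W_d ... relu(W_1 z) ...)."""
    out = latent
    for W in weights:
        out = relu(W @ out)
    return out

rng = np.random.default_rng(0)
# Hypothetical expansive sizes: n = 10 -> 60 -> ell = 300 and p = 15 -> 60 -> ell = 300.
dims1, dims2 = [10, 60, 300], [15, 60, 300]
W1 = [rng.normal(0.0, 1.0 / np.sqrt(r), (r, c)) for c, r in zip(dims1, dims1[1:])]
W2 = [rng.normal(0.0, 1.0 / np.sqrt(r), (r, c)) for c, r in zip(dims2, dims2[1:])]

h0, m0 = rng.normal(size=dims1[0]), rng.normal(size=dims2[0])
y = generator(W1, h0) * generator(W2, m0)   # observation y = G1(h0) ⊙ G2(m0)
```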
The blind demodulation problem we consider is:

Let: $y \in \mathbb{R}^\ell$, $h_0 \in \mathbb{R}^n$, $m_0 \in \mathbb{R}^p$ be such that $y = G^{(1)}(h_0) \odot G^{(2)}(m_0)$.
Given: $G^{(1)}$, $G^{(2)}$, and the measurements $y$.
Find: $h_0$ and $m_0$, up to the scaling ambiguity.

In order to recover $h_0$ and $m_0$, up to the scaling ambiguity, we consider the following empirical risk minimization program:

$$\underset{h \in \mathbb{R}^n,\, m \in \mathbb{R}^p}{\text{minimize}} \quad f(h, m) := \frac12 \left\| G^{(1)}(h) \odot G^{(2)}(m) - G^{(1)}(h_0) \odot G^{(2)}(m_0) \right\|^2. \qquad (2)$$

Figure 1: Landscape of the objective function with $h_0 = 1$ and $m_0 = 1$, $s = d = 2$, expansive networks, and weight matrices $W^{(1)}_i$ and $W^{(2)}_i$ with i.i.d. Gaussian entries. (a) Landscape of the empirical risk function. (b) Note the four hyperbolic branches visible.

Clearly, the objective function in (2) is non-convex, and as a result there is no a priori guarantee that gradient-based methods will converge to a global minimum. Additionally, the objective function does not contain a regularizer, which would generally be used to resolve the scaling ambiguity, and thus every point in $\{(c h_0, \frac{1}{c} m_0) \mid c > 0\}$ is a global optimum of (2). Nonetheless, we show that under certain conditions on the networks, the minimizers of (2) lie in the neighborhood of four hyperbolic curves, one of which contains the global minimizers.

In order to define these hyperbolic neighborhoods, let

$$\mathcal{A}_{\epsilon, (\tilde h, \tilde m)} = \left\{ (h, m) \in \mathbb{R}^n \times \mathbb{R}^p \;\middle|\; \exists\, c > 0 \text{ s.t. } \left\| (h, m) - \left(c \tilde h, \tfrac{1}{c}\tilde m\right) \right\| \le \epsilon \left\| \left(c\tilde h, \tfrac{1}{c}\tilde m\right) \right\| \right\}, \qquad (3)$$

where $(\tilde h, \tilde m) \in \mathbb{R}^n \times \mathbb{R}^p$ is fixed. This set is an $\epsilon$-neighborhood of the hyperbolic set $\{(c\tilde h, \frac{1}{c}\tilde m) \mid c > 0\}$. We show that the minimizers of (2) are contained in the four hyperbolic sets given by $\mathcal{A}_{\epsilon,(h_0, m_0)}$, $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, m_0)}$, $\mathcal{A}_{\epsilon,(h_0,\, -\rho^{(2)}_s m_0)}$, and $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, -\rho^{(2)}_s m_0)}$. Here, $\epsilon$ depends on the expansivity and the number of layers of the networks, and both $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive constants close to 1. We also show that the points in the set $\{(h, 0) \mid h \in \mathbb{R}^n\} \cup \{(0, m) \mid m \in \mathbb{R}^p\}$ are local maximizers. These results hold for networks satisfying the following assumptions:

A1. The weight matrices are random.

A2. The weight matrices of the inner layers satisfy $n_i = \Omega(n_{i-1})$, up to a log factor, for $i = 1, \dots, d-1$, and $p_i = \Omega(p_{i-1})$, up to a log factor, for $i = 1, \dots, s-1$.

A3. The weight matrices of the last layer of each generator satisfy $\ell = \Omega(n_{d-1}^2 + p_{s-1}^2)$, up to log factors.

Figures 1a and 1b show the landscape of the objective function and corroborate our findings.
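As a companion to (2) and (3), here is a sketch (again ours, not from the paper) of the empirical risk and an approximate membership test for the hyperbolic neighborhoods. The log-spaced scan over the scale $c$ is a numerical stand-in for the infimum over $c > 0$, and `generator` is the helper from the previous sketch.

```python
import numpy as np

def empirical_risk(W1, W2, h, m, y):
    """Objective (2): 0.5 * || G1(h) ⊙ G2(m) - y ||^2 with y = G1(h0) ⊙ G2(m0)."""
    r = generator(W1, h) * generator(W2, m) - y
    return 0.5 * r @ r

def hyperbola_distance(h, m, h_t, m_t, grid=np.logspace(-3, 3, 601)):
    """Smallest relative distance from (h, m) to {(c*h_t, m_t/c) : c > 0},
    scanned over a grid of c; (h, m) lies in A_eps of (3) iff this is <= eps."""
    z = np.concatenate([h, m])
    best = np.inf
    for c in grid:
        ref = np.concatenate([c * h_t, m_t / c])
        best = min(best, np.linalg.norm(z - ref) / np.linalg.norm(ref))
    return best
```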
Theorem 1 (Informal). Let $\mathcal{A} = \mathcal{A}_{\epsilon,(h_0,m_0)} \cup \mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, m_0)} \cup \mathcal{A}_{\epsilon,(h_0,\, -\rho^{(2)}_s m_0)} \cup \mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,\, -\rho^{(2)}_s m_0)}$, where $\epsilon > 0$ depends on the expansivity of our networks and $\rho^{(1)}_d \to 1$, $\rho^{(2)}_s \to 1$ as $d, s \to \infty$, respectively. Suppose the networks are sufficiently expansive, so that the numbers of neurons in the inner layers and the last layers satisfy assumptions A2 and A3, respectively. Then, with high probability, there exists a descent direction, given by one of the one-sided partial derivatives of the objective function in (2), at every $(h, m) \notin \mathcal{A} \cup \{(h, 0) \mid h \in \mathbb{R}^n\} \cup \{(0, m) \mid m \in \mathbb{R}^p\}$. In addition, the elements of the set $\{(h, 0) \mid h \in \mathbb{R}^n\} \cup \{(0, m) \mid m \in \mathbb{R}^p\}$ are local maximizers.

Our main result states that the objective function in (2) does not have any spurious minimizers outside of the four hyperbolic neighborhoods. Thus, a gradient descent algorithm will converge to a point inside one of the four neighborhoods, only one of which contains the global minimizers of (2). However, this alone does not guarantee convergence to a global minimizer, and it does not resolve the inherent scaling ambiguity of the problem. So, in order to converge to a global minimizer, we implement a gradient descent scheme that exploits the landscape of the objective function. First, we exploit the fact that points near the hyperbolic curve corresponding to the global minimizer have a lower objective value than points close to the remaining three spurious hyperbolic curves. Second, in order to resolve the scaling ambiguity, we promote solutions whose factors have equal $\ell_2$ norm by normalizing the estimates in each iteration of the gradient descent scheme (see Section 2).

Theorem 1 also provides a global guarantee on the landscape of the objective function in (2) when the dimension of the unknown signals scales quadratically, up to log factors, with the dimension of the latent codes, i.e. $\ell = \Omega(n^2 + p^2)$. Our result, obtained by enforcing generative priors, may enjoy better sample complexity than classical priors like sparsity because: i) the same signal can have a latent code dimension that is smaller than its sparsity level with respect to a particular basis, and ii) existing recovery guarantees for unstructured signals require a number of measurements that scales quadratically with the sparsity level. Thus, our result may be less limiting in terms of sample complexity.

1.2 Related Work

A common approach to solving the BIP in (1) is to assume a subspace or sparsity prior on the unknown vectors. In these cases the unknown vectors $w$ and $x$ are assumed to be in the ranges of known matrices $B \in \mathbb{R}^{\ell \times n}$ and $C \in \mathbb{R}^{\ell \times p}$, respectively. In Ahmed et al. [2014], the authors assumed a subspace prior and cast the BIP as a linear low-rank matrix recovery problem. They introduced a semidefinite program based on nuclear norm minimization to recover the unknown matrix. For the case where the rows of $B$ and $C$ are Fourier and Gaussian vectors, respectively, they provide a recovery guarantee when the number of measurements satisfies $\ell = \Omega(n + p)$, up to log factors. However, because this method operates in the space of matrices, it is computationally prohibitively expensive. Another limitation of the lifted approach is that efficiently recovering a matrix that is simultaneously low rank and sparse from linear observations of the matrix has proven challenging. Recently, Lee et al. [2017] provided a recovery guarantee with near-optimal sample complexity for the low-rank and sparse matrix recovery problem using an alternating minimization method, for a class of signals that satisfy a peakiness condition. For general signals, however, the same work established a recovery result only when the number of measurements scales quadratically with the sparsity level.

In order to address the computational cost of working in the lifted domain, a recent theme has been to introduce convex and non-convex programs that operate in the natural parameter space.
For example, in Bahmani and Romberg [2016] and Goldstein and Studer [2016], the authors introduced PhaseMax, a convex program for phase retrieval based on finding a simple convex relaxation via the convex hull of the feasibility set. The authors showed that PhaseMax enjoys a rigorous recovery guarantee if a good anchor is available. This formulation was extended to the sparse case in Hand and Voroninski [2016], where the authors considered SparsePhaseMax and provided a recovery guarantee with optimal sample complexity. The idea of formulating a convex program using a simple convex relaxation via the convex hull of the feasibility set was also used in the blind demodulation problem [Aghasi et al., 2019, 2018]. In particular, Aghasi et al. [2018] introduced a convex program in the natural parameter space for the sparse blind demodulation problem in the case where the signs of the unknown signals are known. As in Lee et al. [2017], the authors in Aghasi et al. [2019] provide a recovery guarantee with optimal sample complexity for a class of signals; however, the result does not extend to unconstrained signals. Other approaches that operate in the natural parameter space are methods based on Wirtinger Flow. For example, in Candès et al. [2015], Wang et al. [2016], and Li et al. [2016], the authors use Wirtinger Flow and its variants to solve the phase retrieval and blind deconvolution problems. These methods are non-convex and require a good initialization to converge to a global solution. However, they are simple to implement and enjoy rigorous recovery guarantees.

1.3 Generative Priors for Inverse Problems

In this paper, we consider the blind demodulation problem with the unknown signals assumed to lie in the range of known generative models. Our work is motivated by experimental results in deep compressed sensing and deep blind deconvolution presented in Bora et al. [2017] and Asim et al. [2018], and by theoretical work in deep compressed sensing presented in Hand and Voroninski [2017]. In Bora et al. [2017], the authors consider the compressed sensing problem where, instead of a sparsity prior, a generative prior is used. They solve an empirical risk optimization program over the latent code space to recover images and empirically show that their method succeeds with 10x fewer measurements than previous sparsity-based methods. Following the empirical successes of deep compressed sensing, the authors in Hand and Voroninski [2017] provided a theoretical understanding of these successes by characterizing the landscape of the empirical risk objective function. In the case of random networks whose layers are sufficiently expansive, they showed that every point outside of a small neighborhood around the true solution and a negative multiple of the true solution has a descent direction with high probability. Another instance where generative models currently outperform sparsity-based methods is sparse phase retrieval [Hand et al., 2018]. In sparse phase retrieval, current algorithms that enjoy a provable recovery guarantee of an unknown $n$-dimensional $k$-sparse signal require at least $O(k^2 \log n)$ measurements; whereas, when the unknown signal is assumed to be the output of a known $d$-layer generator $G: \mathbb{R}^k \to \mathbb{R}^n$, the authors in Hand et al. [2018] showed that, under favorable conditions on the generator and with at least $O(k d^2 \log n)$ measurements, the empirical risk objective enjoys a favorable landscape. Similarly, in Asim et al.
[2018], the authors consider the blind deconvolution problem with a generative prior on the unknown signals. They empirically showed that using generative priors in the image deblurring inverse problem provides a very effective regularization that produces sharp deblurred images from very blurry inputs. The algorithm used to recover these deblurred images is an alternating minimization approach that solves the empirical risk minimization with $\ell_2$ regularization on the unknown signals. The $\ell_2$ regularization promotes solutions with least $\ell_2$ norm and resolves the scaling ambiguity present in the blind deconvolution problem. We consider a related problem, namely the blind demodulation problem with a generative prior on the unknown signals, and show that under certain conditions on the generators the empirical risk objective has a favorable landscape.

1.4 Notations

Vectors and matrices are written in boldface, while scalars and entries of vectors are written in plain font. We write $\mathbf{1}$ for the vector of all ones, with dimensionality appropriate for the context. Let $S^{n-1}$ be the unit sphere in $\mathbb{R}^n$. We write $I_n$ for the $n \times n$ identity matrix. For $x \in \mathbb{R}^K$ and $y \in \mathbb{R}^N$, $(x, y)$ is the corresponding vector in $\mathbb{R}^K \times \mathbb{R}^N$. Let $\mathrm{relu}(x) = \max(x, 0)$ apply entrywise for $x \in \mathbb{R}^n$. Let $\mathrm{diag}(Wx > 0)$ be the diagonal matrix whose $(i,i)$ entry is 1 if $(Wx)_i > 0$ and 0 otherwise. Let $A \preceq B$ mean that $B - A$ is a positive semidefinite matrix. We write $\gamma = O(\delta)$ to mean that there exists a positive constant $C$ such that $\gamma \le C\delta$, where $\gamma$ is understood to be positive. Similarly, we write $\gamma = \Omega(\delta)$ to mean that there exists a positive constant $C$ such that $\gamma \ge C\delta$. When we say that a constant depends polynomially on $\epsilon^{-1}$, we mean that it is at most $C\epsilon^{-k}$ for some positive $C$ and positive integer $k$. For notational convenience, we write $a = b + O(\epsilon)$ if $\|a - b\| \le \epsilon$, where the norm is understood to be the absolute value for scalars, the $\ell_2$ norm for vectors, and the spectral norm for matrices.

2 Gradient Descent Scheme

In this section, we propose a gradient descent scheme that solves (2). The scheme exploits the global geometry of the landscape of the objective function in (2) and avoids the regions containing spurious minimizers. It is based on two observations. The first observation is that the minimizers of (2) are close to four hyperbolic curves given by $\{(c h_0, \frac1c m_0) \mid c > 0\}$, $\{(-c\rho^{(1)}_d h_0, \frac1c m_0) \mid c > 0\}$, $\{(c h_0, -\frac1c\rho^{(2)}_s m_0) \mid c > 0\}$, and $\{(-c\rho^{(1)}_d h_0, -\frac1c\rho^{(2)}_s m_0) \mid c > 0\}$, where $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are close to 1. The second observation is that $f(c h_0, \frac1c m_0)$ is less than $f(-c h_0, \frac1c m_0)$, $f(c h_0, -\frac1c m_0)$, and $f(-c h_0, -\frac1c m_0)$ for any $c > 0$. This is because the curve $\{(c h_0, \frac1c m_0) \mid c > 0\}$ corresponds to the global minimizers of (2).

We now introduce some quantities that are useful in stating the gradient descent algorithm. For any $h \in \mathbb{R}^n$ and $W \in \mathbb{R}^{l \times n}$, define $W_{+,h} = \mathrm{diag}(Wh > 0)\,W$. That is, $W_{+,h}$ zeroes out the rows of $W$ that do not have a positive inner product with $h$ and keeps the remaining rows. We extend the definition of $W_{+,h}$ to each layer of weights $W^{(1)}_i$ in our neural network. For $W^{(1)}_1 \in \mathbb{R}^{n_1 \times n}$ and $h \in \mathbb{R}^n$, define $W^{(1)}_{1,+,h} := (W^{(1)}_1)_{+,h} = \mathrm{diag}(W^{(1)}_1 h > 0)\,W^{(1)}_1$. For each layer $i > 1$, define

$$W^{(1)}_{i,+,h} = \mathrm{diag}\!\left(W^{(1)}_i W^{(1)}_{i-1,+,h}\cdots W^{(1)}_{2,+,h} W^{(1)}_{1,+,h}\, h > 0\right) W^{(1)}_i.$$
Lastly, define $\Lambda^{(1)}_{d,+,h} := \prod_{i=d}^{1} W^{(1)}_{i,+,h} = W^{(1)}_{d,+,h}\cdots W^{(1)}_{1,+,h}$. Using the above notation, $G^{(1)}(h)$ can be compactly written as $\Lambda^{(1)}_{d,+,h}\, h$. Similarly, we may write $G^{(2)}(m)$ compactly as $\Lambda^{(2)}_{s,+,m}\, m$.

The gradient descent scheme is an alternating descent direction algorithm. We first pick an initial iterate $(h_1, m_1)$ such that $h_1 \neq 0$ and $m_1 \neq 0$. At each iteration $i = 1, 2, \dots$, we first compare the objective values at $(h_i, m_i)$, $(-h_i, m_i)$, $(h_i, -m_i)$, and $(-h_i, -m_i)$ and reset $(h_i, m_i)$ to be the point with the least objective value. Second, we descend along a direction: we compute the descent direction $\tilde g_{1,(h,m)}$, given by the partial derivative of $f$ in (2) with respect to $h$,

$$\tilde g_{1,(h,m)} := \left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\left( \mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m\right)^2 \Lambda^{(1)}_{d,+,h}\, h - \mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m \odot \Lambda^{(2)}_{s,+,m_0} m_0\right)\Lambda^{(1)}_{d,+,h_0}\, h_0 \right),$$

and take a step along this direction. Next, we compute the descent direction $\tilde g_{2,(h,m)}$, given by the partial derivative of $f$ with respect to $m$,

$$\tilde g_{2,(h,m)} := \left(\Lambda^{(2)}_{s,+,m}\right)^{\!\top}\left( \mathrm{diag}\!\left(\Lambda^{(1)}_{d,+,h} h\right)^2 \Lambda^{(2)}_{s,+,m}\, m - \mathrm{diag}\!\left(\Lambda^{(1)}_{d,+,h} h \odot \Lambda^{(1)}_{d,+,h_0} h_0\right)\Lambda^{(2)}_{s,+,m_0}\, m_0 \right),$$

and again take a step along this direction. Lastly, we normalize the iterate so that at each iteration $\|h_i\| = \|m_i\|$. We repeat this process until convergence. Algorithm 1 outlines this process; the sketch below renders one iteration in NumPy.
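The quantities above translate directly into code. The sketch below is our own minimal rendering, not the authors' implementation: `forward_states` records each layer's active rows, `jacobian` materializes $\Lambda_{+,z}$ as a dense product purely for clarity, and `descent_step` performs the sign comparison, the two partial steps along $\tilde g_{1,(h,m)}$ and $\tilde g_{2,(h,m)}$, and the norm rebalancing.

```python
import numpy as np

def forward_states(weights, z):
    """Apply the ReLU network, recording each layer's active-row mask; on the
    current linear piece the output equals Lambda_{+,z} @ z."""
    masks, out = [], z
    for W in weights:
        pre = W @ out
        masks.append(pre > 0)
        out = np.maximum(pre, 0.0)
    return out, masks

def jacobian(weights, masks):
    """Lambda_{+,z}: the product of diag(mask_i) @ W_i over the layers."""
    J = np.eye(weights[0].shape[1])
    for W, mask in zip(weights, masks):
        J = (W * mask[:, None]) @ J      # zero inactive rows, then compose
    return J

def descent_step(W1, W2, h, m, y, eta):
    f = lambda a, b: 0.5 * np.sum((forward_states(W1, a)[0]
                                   * forward_states(W2, b)[0] - y) ** 2)
    # 1) keep the sign pattern with the least objective value
    h, m = min(((sh * h, sm * m) for sh in (1, -1) for sm in (1, -1)),
               key=lambda p: f(*p))
    # 2) step along g~1, the partial derivative of f with respect to h
    G1, masks1 = forward_states(W1, h)
    G2, _ = forward_states(W2, m)
    h = h - eta * jacobian(W1, masks1).T @ (G2 * (G1 * G2 - y))
    # 3) step along g~2, recomputed at the updated h
    G1, _ = forward_states(W1, h)
    G2, masks2 = forward_states(W2, m)
    m = m - eta * jacobian(W2, masks2).T @ (G1 * (G1 * G2 - y))
    # 4) rebalance: by positive homogeneity of bias-free ReLU networks,
    #    (h/c, c*m) leaves G1(h) ⊙ G2(m) unchanged for any c > 0
    c = np.sqrt(np.linalg.norm(h) / np.linalg.norm(m))
    return h / c, m * c
```

Iterating `descent_step` until the objective stagnates reproduces the structure of Algorithm 1 below.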
Algorithm 1: Alternating descent algorithm for (2)

Input: Weight matrices $W^{(1)}_i$ and $W^{(2)}_i$, observation $y$, and step size $\eta > 0$.
Output: An estimate of a global minimizer of (2).

1: Choose an arbitrary point $(h_1, m_1)$ such that $h_1 \neq 0$ and $m_1 \neq 0$
2: for $i = 1, 2, \dots$ do
3: $\quad (h_i, m_i) \leftarrow \arg\min\left( f(h_i, m_i),\, f(-h_i, m_i),\, f(h_i, -m_i),\, f(-h_i, -m_i) \right)$
4: $\quad h_{i+1} \leftarrow h_i - \eta\, \tilde g_{1,(h_i, m_i)}$, $\quad m_{i+1} \leftarrow m_i - \eta\, \tilde g_{2,(h_{i+1}, m_i)}$
5: $\quad c \leftarrow \sqrt{\|h_{i+1}\| / \|m_{i+1}\|}$, $\quad h_{i+1} \leftarrow h_{i+1}/c$, $\quad m_{i+1} \leftarrow m_{i+1}\cdot c$
6: end for

3 Main Results

We now present our main results, which state that the objective function has a descent direction at every point outside of four hyperbolic regions. In order to state these directions, we first note that the partial derivatives of $f$ at a differentiable point $(h, m)$ are $\nabla_h f(h, m) = \tilde g_{1,(h,m)}$ and $\nabla_m f(h, m) = \tilde g_{2,(h,m)}$. The function $f$ is not differentiable everywhere because of the behavior of the ReLU activation function in the neural networks. However, since $G^{(1)}$ and $G^{(2)}$ are piecewise linear, $f$ is differentiable at $(h, m) + \delta w$ for all $(h, m)$ and $w$ and sufficiently small $\delta > 0$. The directions we consider are $g_{1,(h,m)} \in \mathbb{R}^{n+p}$ and $g_{2,(h,m)} \in \mathbb{R}^{n+p}$, where

$$g_{1,(h,m)} = \begin{bmatrix} \lim_{\delta \to 0^+} \nabla_h f((h,m) + \delta w) \\ 0 \end{bmatrix}, \qquad g_{2,(h,m)} = \begin{bmatrix} 0 \\ \lim_{\delta \to 0^+} \nabla_m f((h,m) + \delta w) \end{bmatrix}, \qquad (4)$$

and $w$ is fixed. Let $D_g f(h, m)$ be the unnormalized one-sided directional derivative of $f$ at $(h, m)$ in the direction $g$: $D_g f(h, m) = \lim_{t \to 0^+} \frac{f((h,m) + tg) - f(h,m)}{t}$.

Theorem 2. Fix $\epsilon > 0$ such that $K_1 (d^3 s + d s^3)\,\epsilon^{1/4} < 1$, $d \ge 2$, and $s \ge 2$. Assume the networks satisfy assumptions A2 and A3. Assume for each $i = 1, \dots, d-1$ that the entries of $W^{(1)}_i$ are i.i.d. $\mathcal{N}(0, 1/n_i)$, and that the $i$th row of $W^{(1)}_d$ satisfies $(w^{(1)}_d)_i^\top = w^\top \cdot \mathbb{1}_{\|w\| \le \sqrt{n_{d-1}/\ell}}$ with $w \sim \mathcal{N}(0, \frac{1}{\ell} I_{n_{d-1}})$. Similarly, assume for each $i = 1, \dots, s-1$ that the entries of $W^{(2)}_i$ are i.i.d. $\mathcal{N}(0, 1/p_i)$, and that the $i$th row of $W^{(2)}_s$ satisfies $(w^{(2)}_s)_i^\top = w^\top \cdot \mathbb{1}_{\|w\| \le \sqrt{p_{s-1}/\ell}}$ with $w \sim \mathcal{N}(0, \frac{1}{\ell} I_{p_{s-1}})$. Let $\mathcal{K} = \{(h, 0) \in \mathbb{R}^n \times \mathbb{R}^p \mid h \in \mathbb{R}^n\} \cup \{(0, m) \in \mathbb{R}^n \times \mathbb{R}^p \mid m \in \mathbb{R}^p\}$ and

$$\mathcal{A} = \mathcal{A}_{K_2 ds\,\epsilon^{1/4},\,(h_0, m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4},\,(-\rho^{(1)}_d h_0,\, m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4},\,(\rho^{(2)}_s h_0,\, -m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4},\,(-\rho^{(1)}_d \rho^{(2)}_s h_0,\, -m_0)}.$$

Then, on an event of probability at least $1 - \sum_{i=1}^{d}\tilde c\, n_i e^{-\gamma n_{i-1}} - \sum_{i=1}^{s}\tilde c\, p_i e^{-\gamma p_{i-1}} - c\, e^{-\gamma\ell}$, we have the following: for $(h, m) \neq (0, 0)$ and $(h, m) \notin \mathcal{A} \cup \mathcal{K}$, the one-sided directional derivative of $f$ in the direction $g = g_{1,(h,m)}$ or $g = g_{2,(h,m)}$, defined in (4), satisfies $D_{-g} f(h, m) < 0$. Additionally, for all $(h, m) \in \mathcal{K}$ and all $(x, y)$, $D_{(x,y)} f(h, m) \le 0$. Here, $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive numbers that converge to 1 as $d, s \to \infty$, $c$ and $\gamma^{-1}$ are constants that depend polynomially on $\epsilon^{-1}$, and $\tilde c$, $K_1$, and $K_2$ are absolute constants.

We prove Theorem 2 by showing that neural networks with random weights satisfy two deterministic conditions. These conditions are the Weight Distribution Condition (WDC) and the joint Weight Distribution Condition (joint-WDC). The WDC is a slight generalization of the condition introduced in Hand and Voroninski [2017].
We say a matrix $W \in \mathbb{R}^{\ell \times n}$ satisfies the WDC with constants $\epsilon > 0$ and $0 < \alpha \le 1$ if for all nonzero $x, y \in \mathbb{R}^n$,

$$\left\| \sum_{i=1}^{\ell} \mathbb{1}_{w_i \cdot x > 0}\,\mathbb{1}_{w_i \cdot y > 0}\; w_i w_i^\top - \alpha\, Q_{x,y} \right\| \le \epsilon, \quad \text{with} \quad Q_{x,y} = \frac{\pi - \theta}{2\pi}\, I_n + \frac{\sin\theta}{2\pi}\, M_{\hat x \leftrightarrow \hat y}, \qquad (5)$$

where $w_i \in \mathbb{R}^n$ is the $i$th row of $W$; $M_{\hat x \leftrightarrow \hat y} \in \mathbb{R}^{n \times n}$ is the matrix such that $\hat x \mapsto \hat y$, $\hat y \mapsto \hat x$, and $z \mapsto 0$ for all $z \in \mathrm{span}(\{x, y\})^\perp$; $\hat x = x/\|x\|$ and $\hat y = y/\|y\|$; $\theta = \angle(x, y)$; and $\mathbb{1}_S$ is the indicator function on $S$. If $w_i \sim \mathcal{N}(0, \frac{1}{\ell} I_n)$ for all $i$, then an elementary calculation shows that $\mathbb{E}\big[\sum_{i=1}^{\ell}\mathbb{1}_{w_i\cdot x>0}\mathbb{1}_{w_i\cdot y>0}\, w_i w_i^\top\big] = Q_{x,y}$, and if $x = y$ then $Q_{x,y}$ is an isometry up to a factor of $1/2$. Also, note that if $W$ satisfies the WDC with constants $\epsilon$ and $\alpha$, then $\frac{1}{\sqrt{\alpha}}W$ satisfies the WDC with constants $\epsilon/\alpha$ and 1.

We now state the joint Weight Distribution Condition. We say that $B \in \mathbb{R}^{\ell \times n}$ and $C \in \mathbb{R}^{\ell \times p}$ satisfy the joint-WDC with constants $\epsilon > 0$ and $0 < \alpha \le 1$ if for all nonzero $h, x \in \mathbb{R}^n$ and nonzero $m, y \in \mathbb{R}^p$,

$$\left\| B_{+,h}^\top\, \mathrm{diag}(C_{+,m}\, m \odot C_{+,y}\, y)\, B_{+,x} - \frac{\alpha}{\ell}\left(m^\top Q_{m,y}\, y\right) Q_{h,x} \right\| \le \frac{\epsilon}{\ell}\,\|m\|\|y\|, \quad \text{and} \qquad (6)$$

$$\left\| C_{+,m}^\top\, \mathrm{diag}(B_{+,h}\, h \odot B_{+,x}\, x)\, C_{+,y} - \frac{\alpha}{\ell}\left(h^\top Q_{h,x}\, x\right) Q_{m,y} \right\| \le \frac{\epsilon}{\ell}\,\|h\|\|x\|. \qquad (7)$$

We analyze networks $G^{(1)}$ and $G^{(2)}$ in which the weight matrices of the inner layers satisfy the WDC with constants $\epsilon > 0$ and 1; of the two matrices in the outer layers, one satisfies the WDC with constants $\epsilon$ and $0 < \alpha_1 \le 1$ and the other satisfies the WDC with constants $\epsilon$ and $0 < \alpha_2 \le 1$; and the two outer-layer matrices jointly satisfy the joint-WDC with constants $\epsilon > 0$ and $\alpha = \alpha_1 \cdot \alpha_2$.
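The matrix $Q_{x,y}$ in (5) is explicit enough to evaluate, which gives a quick empirical sanity check of the WDC for Gaussian matrices. The construction of $M_{\hat x \leftrightarrow \hat y}$ below and all helper names are our own; the check is illustrative, not part of the paper.

```python
import numpy as np

def M_swap(xh, yh):
    """Matrix mapping xh -> yh, yh -> xh, and span{xh, yh}^perp -> 0."""
    c = float(np.clip(xh @ yh, -1.0, 1.0))
    if abs(c) > 1.0 - 1e-12:                 # xh = ±yh: reduces to ±(xh xh^T)
        return np.sign(c) * np.outer(xh, xh)
    v = yh - c * xh
    v /= np.linalg.norm(v)                   # orthonormal complement of xh in span{xh, yh}
    s = np.sqrt(1.0 - c * c)                 # sin(theta)
    return (c * np.outer(xh, xh) + s * (np.outer(xh, v) + np.outer(v, xh))
            - c * np.outer(v, v))

def Q(x, y):
    """Q_{x,y} from (5), the expectation of the masked Gram matrix."""
    xh, yh = x / np.linalg.norm(x), y / np.linalg.norm(y)
    theta = np.arccos(np.clip(xh @ yh, -1.0, 1.0))
    return ((np.pi - theta) / (2 * np.pi)) * np.eye(x.size) \
        + (np.sin(theta) / (2 * np.pi)) * M_swap(xh, yh)

def wdc_deviation(W, x, y, alpha=1.0):
    """Left-hand side of (5) for one pair (x, y)."""
    mask = ((W @ x > 0) & (W @ y > 0)).astype(float)
    S = (W * mask[:, None]).T @ W            # sum_i 1{w_i.x>0} 1{w_i.y>0} w_i w_i^T
    return np.linalg.norm(S - alpha * Q(x, y), 2)

# For a tall Gaussian W with variance 1/ell, the deviation should be small:
rng = np.random.default_rng(1)
ell, n = 4000, 10
W = rng.normal(0.0, 1.0 / np.sqrt(ell), (ell, n))
print(wdc_deviation(W, rng.normal(size=n), rng.normal(size=n)))
```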
We now state the main deterministic result.

Theorem 3. Fix $\epsilon > 0$, $0 < \alpha_1 \le 1$, and $0 < \alpha_2 \le 1$ such that $K_1(d^3 s + d s^3)\,\epsilon^{1/4}/(\alpha_1\alpha_2) < 1$, $d \ge 2$, and $s \ge 2$. Suppose that $W^{(1)}_i \in \mathbb{R}^{n_i \times n_{i-1}}$ for $i = 1, \dots, d-1$ and $W^{(2)}_i \in \mathbb{R}^{p_i \times p_{i-1}}$ for $i = 1, \dots, s-1$ satisfy the WDC with constants $\epsilon$ and 1. Suppose $W^{(1)}_d \in \mathbb{R}^{\ell \times n_{d-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_1$, and $W^{(2)}_s \in \mathbb{R}^{\ell \times p_{s-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_2$. Also, suppose $(W^{(1)}_d, W^{(2)}_s)$ satisfies the joint-WDC with constants $\epsilon$ and $\alpha = \alpha_1 \cdot \alpha_2$. Let $\mathcal{K} = \{(h, 0) \in \mathbb{R}^n \times \mathbb{R}^p \mid h \in \mathbb{R}^n\} \cup \{(0, m) \in \mathbb{R}^n \times \mathbb{R}^p \mid m \in \mathbb{R}^p\}$ and

$$\mathcal{A} = \mathcal{A}_{K_2 ds\,\epsilon^{1/4}\alpha^{-1},\,(h_0, m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4}\alpha^{-1},\,(-\rho^{(1)}_d h_0,\, m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4}\alpha^{-1},\,(\rho^{(2)}_s h_0,\, -m_0)} \cup \mathcal{A}_{K_2 ds\,\epsilon^{1/4}\alpha^{-1},\,(-\rho^{(1)}_d\rho^{(2)}_s h_0,\, -m_0)}.$$

Then, for $(h, m) \neq (0, 0)$ and $(h, m) \notin \mathcal{A} \cup \mathcal{K}$, the one-sided directional derivative of $f$ in the direction $g = g_{1,(h,m)}$ or $g = g_{2,(h,m)}$ satisfies $D_{-g} f(h, m) < 0$. Additionally, for all $(h, m) \in \mathcal{K}$ and all $(x, y)$, $D_{(x,y)} f(h, m) \le 0$. Here, $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive numbers that converge to 1 as $d, s \to \infty$, and $K_1$ and $K_2$ are absolute constants.

We prove the theorems by showing that the descent directions $g_{1,(h,m)}$ and $g_{2,(h,m)}$ concentrate around their expectations and then characterizing the set of points where the corresponding expectations are simultaneously approximately zero. The outline of the proof is:

• The WDC and joint-WDC imply that the one-sided partial directional derivatives of $f$ concentrate uniformly, for all nonzero $h, h_0 \in \mathbb{R}^n$ and $m, m_0 \in \mathbb{R}^p$, around continuous vectors $t^{(1)}_{(h,m),(h_0,m_0)}$ and $t^{(2)}_{(h,m),(h_0,m_0)}$, respectively, defined in equations (10) and (11) in the Appendix.

• Direct analysis shows that $t^{(1)}_{(h,m),(h_0,m_0)}$ and $t^{(2)}_{(h,m),(h_0,m_0)}$ are simultaneously approximately zero only around the four hyperbolic sets $\mathcal{A}_{\epsilon,(h_0,m_0)}$, $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0, m_0)}$, $\mathcal{A}_{\epsilon,(h_0,-\rho^{(2)}_s m_0)}$, and $\mathcal{A}_{\epsilon,(-\rho^{(1)}_d h_0,-\rho^{(2)}_s m_0)}$, where $\epsilon$ depends on the expansivity and number of layers in the networks, and both $\rho^{(1)}_d$ and $\rho^{(2)}_s$ are positive constants close to 1 that also depend on the number of layers of the two networks.

• Using sphere covering arguments, Gaussian and truncated Gaussian matrices with appropriate dimensions satisfy the WDC and joint-WDC conditions.

The full proof of Theorem 3 is provided in the Appendix.
4 Experiments

We now empirically show that Algorithm 1 can remove distortions present in a dataset. We consider the image recovery task of removing distortions that were synthetically introduced to the MNIST dataset. The distortion dataset contains 8100 images in which the distortions are generated using a 2D Gaussian function,

$$g(x, y) = e^{-\frac{(x - c_1)^2 + (y - c_2)^2}{\sigma^2}},$$

where $(c_1, c_2)$ is the center and $\sigma$ controls its tail behavior. For each of the 8100 images, we fix the center and $\sigma$; both vary uniformly over fixed intervals, with $\sigma \ge 20$, while $x$ and $y$ range over a fixed symmetric interval. Prior to training the generators, the images in the MNIST dataset and the distortion dataset were resized to a common resolution. We used DCGAN [Radford et al., 2016] with a learning rate of 0.0002 and a latent code dimension of 50 to train a generator $G^{(2)}$ for the distortion images. Similarly, we used DCGAN with a learning rate of 0.0002 and a latent code dimension of 100 to train a generator $G^{(1)}$ for the MNIST images. Finally, a distorted image $y$ is generated via the pixelwise multiplication of an image $w$ from the MNIST dataset and an image $x$ from the distortion dataset, i.e. $y = w \odot x$.
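For concreteness, here is a sketch of one synthetic distortion field of the form just described. The center, width, grid extent, and image size below are illustrative placeholder values, not the paper's exact settings.

```python
import numpy as np

def gaussian_distortion(size=32, center=(2.0, -3.0), sigma=25.0, extent=16.0):
    """A 2D Gaussian distortion field g(x, y) = exp(-((x-c1)^2 + (y-c2)^2) / sigma^2)
    sampled on a size x size grid with coordinates in [-extent, extent]."""
    xs = np.linspace(-extent, extent, size)
    X, Y = np.meshgrid(xs, xs)
    return np.exp(-((X - center[0]) ** 2 + (Y - center[1]) ** 2) / sigma ** 2)

# A distorted observation is the pixelwise product of an image w with a field x:
# y = w * gaussian_distortion()
```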
Figure 2: Removing distortions from images by solving (2) using Algorithm 1. The top row shows the input distorted images. The second and third rows show the images recovered using empirical risk minimization.

Figure 2 shows the result of using Algorithm 1 to remove the distortion from $y$. In the implementation of Algorithm 1, $\tilde g_{1,(h_i, m_i)}$ and $\tilde g_{2,(h_i, m_i)}$ correspond to the partial derivatives of $f$ with the trained networks as the generators $G^{(1)}$ and $G^{(2)}$. We used the Stochastic Gradient Descent algorithm with the step size set to 1 and the momentum set to 0.9. For each image in the first row of Figure 2, the corresponding images in the second and third rows are the outputs of Algorithm 1 after 500 iterations.

References
Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Trans. Inform. Theory, 60(3):1711–1732, 2014.

Thomas G. Stockham, Thomas M. Cannon, and Robert B. Ingebretsen. Blind deconvolution through digital signal processing. Proceedings of the IEEE, 63(4):678–692, 1975.

Deepa Kundur and Dimitrios Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.

Alireza Aghasi, Barmak Heshmat, Albert Redo-Sanchez, Justin Romberg, and Ramesh Raskar. Sweep distortion removal from terahertz images via blind demodulation. Optica, 3(7):754–762, 2016.

Alireza Aghasi, Ali Ahmed, and Paul Hand. BranchHull: Convex bilinear inversion from the entrywise product of signals with known signs. Applied and Computational Harmonic Analysis, 2019. doi: 10.1016/j.acha.2019.03.002.

James R. Fienup. Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769, 1982.

E. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Found. Comput. Math., pages 1–10, 2012.

E. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Commun. Pure Appl. Math., 66(8):1241–1274, 2013.

Ivana Tosic and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2):27–38, 2011.

Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457–1469, 2004.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

Shuyang Ling and Thomas Strohmer. Self-calibration and biconvex compressive sensing. Inverse Problems, 31(11):115002, 2015.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. URL http://arxiv.org/abs/1710.10196.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.

Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, July 2017. doi: 10.1145/3072959.3073659. URL http://doi.acm.org/10.1145/3072959.3073659.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017. URL http://arxiv.org/abs/1703.10593.

Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017. URL https://openreview.net/forum?id=S1RP6GLle.

Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. 2017. URL https://arxiv.org/abs/1703.03208.

S. Lohit, K. Kulkarni, R. Kerviche, P. Turaga, and A. Ashok. Convolutional neural networks for noniterative reconstruction of compressively sensed images. IEEE Transactions on Computational Imaging, 4(3):326–340, Sep. 2018. doi: 10.1109/TCI.2018.2846413.

Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Solving bilinear inverse problems using deep generative priors. CoRR, abs/1802.04073, 2018. URL http://arxiv.org/abs/1802.04073.

Fahad Shamshad, Farwa Abbas, and Ali Ahmed. Deep Ptych: Subsampled Fourier ptychography using generative priors. CoRR, abs/1812.11065, 2018.

Kiryung Lee, Yihong Wu, and Yoram Bresler. Near optimal compressed sensing of a class of sparse low-rank matrices via sparse power factorization. arXiv preprint arXiv:1702.04342, 2017.

Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013.

Sohail Bahmani and Justin Romberg. Phase retrieval meets statistical learning theory: A flexible convex relaxation. arXiv preprint arXiv:1610.04210, 2016.

Tom Goldstein and Christoph Studer. PhaseMax: Convex phase retrieval via basis pursuit. arXiv preprint arXiv:1610.07531, 2016.

Paul Hand and Vladislav Voroninski. Compressed sensing from phaseless Gaussian measurements via linear programming in the natural parameter space. CoRR, abs/1611.05985, 2016. URL http://arxiv.org/abs/1611.05985.

Alireza Aghasi, Ali Ahmed, Paul Hand, and Babhru Joshi. A convex program for bilinear inversion of sparse vectors. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8548–8558. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8074-a-convex-program-for-bilinear-inversion-of-sparse-vectors.pdf.

Emmanuel Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inform. Theory, 61(4):1985–2007, 2015.

Gang Wang, Georgios B. Giannakis, and Yonina C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. arXiv preprint arXiv:1605.08285, 2016.

Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv preprint arXiv:1606.04933, 2016.

Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. CoRR, abs/1705.07576, 2017. URL http://arxiv.org/abs/1705.07576.

Paul Hand, Oscar Leong, and Vladislav Voroninski. Phase retrieval under a generative prior. CoRR, abs/1807.04261, 2018. URL http://arxiv.org/abs/1807.04261.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

R. Vershynin. Compressed sensing: theory and applications. Cambridge University Press, 2012.

Halyun Jeong, Xiaowei Li, Yaniv Plan, and Ozgur Yilmaz. Non-Gaussian random matrices on sets: Optimal tail dependence and applications. 2019.
Appendix
Let $\angle(h, h_0) = \bar\theta^{(1)}_0$ and $\angle(m, m_0) = \bar\theta^{(2)}_0$ for nonzero $h, h_0 \in \mathbb{R}^n$ and $m, m_0 \in \mathbb{R}^p$. In order to understand how the operators $h \mapsto W^{(1)}_{+,h}\, h$ and $m \mapsto W^{(2)}_{+,m}\, m$ distort angles, we define

$$g(\theta) = \cos^{-1}\left( \frac{(\pi - \theta)\cos\theta + \sin\theta}{\pi} \right). \qquad (8)$$

Also, for fixed $p, q \in \mathbb{R}^n$, define

$$\tilde t^{(k)}_{p,q} := \frac{1}{2^{a^{(k)}}}\left[ \left(\prod_{i=0}^{a^{(k)}-1} \frac{\pi - \bar\theta^{(k)}_i}{\pi}\right) q + \sum_{i=0}^{a^{(k)}-1} \frac{\sin\bar\theta^{(k)}_i}{\pi} \left(\prod_{j=i+1}^{a^{(k)}-1} \frac{\pi - \bar\theta^{(k)}_j}{\pi}\right) \frac{\|q\|}{\|p\|}\, p \right], \qquad (9)$$

where $\bar\theta^{(k)}_i = g(\bar\theta^{(k)}_{i-1})$ for $g$ given by (8), $\bar\theta^{(k)}_0 = \angle(p, q)$, $a^{(1)} = d$, and $a^{(2)} = s$.
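Both (8) and (9) are easy to evaluate numerically; the sketch below (our own, with hypothetical helper names) computes the iterated angles $\bar\theta_i$ and the vector $\tilde t_{p,q}$ for a depth-$a$ network.

```python
import numpy as np

def g(theta):
    """The angle map (8); angle(W_{+,p} p, W_{+,q} q) concentrates near g(angle(p, q))."""
    return np.arccos(((np.pi - theta) * np.cos(theta) + np.sin(theta)) / np.pi)

def t_tilde(p, q, a):
    """The vector (9) for a depth-a network, with theta_0 = angle(p, q)."""
    cos0 = np.clip(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)), -1.0, 1.0)
    th = [np.arccos(cos0)]
    for _ in range(a - 1):
        th.append(g(th[-1]))
    prod_q = np.prod([(np.pi - t) / np.pi for t in th])
    coef_p = sum(np.sin(th[i]) / np.pi
                 * np.prod([(np.pi - th[j]) / np.pi for j in range(i + 1, a)])
                 for i in range(a))
    return (prod_q * q + coef_p * (np.linalg.norm(q) / np.linalg.norm(p)) * p) / 2.0 ** a
```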
Theorem 4 (Restatement of Theorem 3). Under the assumptions of Theorem 3, with $\mathcal{K}$ and $\mathcal{A}$ as defined there, the following holds: for $(h, m) \neq (0,0)$ and $(h, m) \notin \mathcal{A} \cup \mathcal{K}$, the one-sided directional derivative of $f$ in the direction $g = g_{1,(h,m)}$ or $g = g_{2,(h,m)}$ satisfies $D_{-g} f(h, m) < 0$; and for all $(h, m) \in \mathcal{K}$ and all $(x, y)$, $D_{(x,y)} f(h, m) \le 0$.

Proof. Recall that

$$v^{(1)}_{(h,m),(h_0,m_0)} = \begin{cases} \nabla_h f(h,m), & G \text{ is differentiable at } (h,m), \\ \lim_{\delta \to 0^+} \nabla_h f((h,m)+\delta w), & \text{otherwise}, \end{cases} \qquad v^{(2)}_{(h,m),(h_0,m_0)} = \begin{cases} \nabla_m f(h,m), & G \text{ is differentiable at } (h,m), \\ \lim_{\delta \to 0^+} \nabla_m f((h,m)+\delta w), & \text{otherwise}, \end{cases}$$

where $G(h,m) = \Lambda^{(1)}_{d,+,h} h \odot \Lambda^{(2)}_{s,+,m} m$ is differentiable at $(h,m)+\delta w$ for sufficiently small $\delta$. Such a $\delta$ exists because of the piecewise linearity of $G(h,m)$, and any such $w$ can be selected arbitrarily. Also, recall that at any differentiable point $(h,m)$,

$$\nabla_h f(h,m) = \left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m\right)^2 \Lambda^{(1)}_{d,+,h}\, h - \left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m \odot \Lambda^{(2)}_{s,+,m_0} m_0\right)\Lambda^{(1)}_{d,+,h_0}\, h_0,$$

$$\nabla_m f(h,m) = \left(\Lambda^{(2)}_{s,+,m}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(1)}_{d,+,h} h\right)^2 \Lambda^{(2)}_{s,+,m}\, m - \left(\Lambda^{(2)}_{s,+,m}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(1)}_{d,+,h} h \odot \Lambda^{(1)}_{d,+,h_0} h_0\right)\Lambda^{(2)}_{s,+,m_0}\, m_0.$$

Let

$$g_{1,(h,m)} = \begin{bmatrix} v^{(1)}_{(h,m),(h_0,m_0)} \\ 0 \end{bmatrix} \in \mathbb{R}^{n+p}, \qquad g_{2,(h,m)} = \begin{bmatrix} 0 \\ v^{(2)}_{(h,m),(h_0,m_0)} \end{bmatrix} \in \mathbb{R}^{n+p},$$

$$t^{(1)}_{(h,m),(h_0,m_0)} = \frac{\alpha}{2^{d+s}\ell}\,\|m\|^2\, h - \frac{\alpha}{\ell}\left(m^\top \tilde t^{(2)}_{m,m_0}\right)\tilde t^{(1)}_{h,h_0}, \qquad (10)$$

$$t^{(2)}_{(h,m),(h_0,m_0)} = \frac{\alpha}{2^{d+s}\ell}\,\|h\|^2\, m - \frac{\alpha}{\ell}\left(h^\top \tilde t^{(1)}_{h,h_0}\right)\tilde t^{(2)}_{m,m_0}, \qquad (11)$$

$$S^{(1)}_{\epsilon,(h_0,m_0)} = \left\{ (h,m) \in \mathbb{R}^n \times \mathbb{R}^p \setminus \mathcal{K} \;\middle|\; \frac{\|t^{(1)}_{(h,m),(h_0,m_0)}\|}{\|m\|} \le \epsilon\, \frac{\max(\|h\|\|m\|, \|h_0\|\|m_0\|)}{2^{d+s}\ell} \right\},$$

$$S^{(2)}_{\epsilon,(h_0,m_0)} = \left\{ (h,m) \in \mathbb{R}^n \times \mathbb{R}^p \setminus \mathcal{K} \;\middle|\; \frac{\|t^{(2)}_{(h,m),(h_0,m_0)}\|}{\|h\|} \le \epsilon\, \frac{\max(\|h\|\|m\|, \|h_0\|\|m_0\|)}{2^{d+s}\ell} \right\},$$

where $\tilde t^{(2)}_{m,m_0}$ and $\tilde t^{(1)}_{h,h_0}$ are as defined in (9). For brevity of notation, write $v^{(1)}_{(h,m)} = v^{(1)}_{(h,m),(h_0,m_0)}$, $v^{(2)}_{(h,m)} = v^{(2)}_{(h,m),(h_0,m_0)}$, $t^{(1)}_{(h,m)} = t^{(1)}_{(h,m),(h_0,m_0)}$, and $t^{(2)}_{(h,m)} = t^{(2)}_{(h,m),(h_0,m_0)}$.

Since $W^{(1)}_i$ for $i = 1,\dots,d-1$ and $W^{(2)}_i$ for $i = 1,\dots,s-1$ satisfy the WDC with constants $\epsilon$ and 1, $W^{(1)}_d$ satisfies the WDC with constants $\epsilon$ and $\alpha_1$, $W^{(2)}_s$ satisfies the WDC with constants $\epsilon$ and $\alpha_2$, and $(W^{(1)}_d, W^{(2)}_s)$ satisfies the joint-WDC with constants $\epsilon$ and $\alpha = \alpha_1\alpha_2$, Lemma 2 implies that for all nonzero $h, h_0 \in \mathbb{R}^n$ and nonzero $m, m_0 \in \mathbb{R}^p$,

$$\|\nabla_h f(h,m) - t^{(1)}_{(h,m)}\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\max(\|h\|\|m\|, \|h_0\|\|m_0\|)\,\|m\|, \qquad (12)$$

$$\|\nabla_m f(h,m) - t^{(2)}_{(h,m)}\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\max(\|h\|\|m\|, \|h_0\|\|m_0\|)\,\|h\|. \qquad (13)$$

Thus we have, for all nonzero $h, h_0$ and $m, m_0$,

$$\|v^{(1)}_{(h,m)} - t^{(1)}_{(h,m)}\| = \lim_{\delta\to 0^+}\|\nabla_h f((h,m)+\delta w) - t^{(1)}_{(h,m)+\delta w}\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\max(\|h\|\|m\|, \|h_0\|\|m_0\|)\,\|m\|,$$

$$\|v^{(2)}_{(h,m)} - t^{(2)}_{(h,m)}\| = \lim_{\delta\to 0^+}\|\nabla_m f((h,m)+\delta w) - t^{(2)}_{(h,m)+\delta w}\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\max(\|h\|\|m\|, \|h_0\|\|m_0\|)\,\|h\|,$$

where the inequalities follow from (12), (13), and the continuity of $t^{(1)}$ and $t^{(2)}$. Note that the one-sided directional derivative of $f$ in a direction $(x,y) \neq 0$ at $(h,m)$ is $D_{(x,y)} f(h,m) = \lim_{t\to 0^+}\frac{1}{t}\left(f((h,m)+t(x,y)) - f(h,m)\right)$. Due to the continuity and piecewise linearity of the function $G(h,m)$, for any $(h,m) \neq (0,0)$ and $(x,y) \neq 0$ there exists a sequence $\{(h_n, m_n)\} \to (h,m)$ such that $f$ is differentiable at each $(h_n, m_n)$ and $D_{(x,y)} f(h,m) = \lim_{n\to\infty}\nabla f(h_n, m_n)\cdot(x,y)$.
Thus, as $\nabla f(h_n, m_n) = \begin{bmatrix} v^{(1)}_{(h_n,m_n)} \\ v^{(2)}_{(h_n,m_n)} \end{bmatrix}$,

$$D_{-g_{1,(h,m)}} f(h,m) = \lim_{n\to\infty}\nabla f(h_n,m_n)\cdot\frac{-g_{1,(h,m)}}{\|g_{1,(h,m)}\|} = \frac{-1}{\|g_{1,(h,m)}\|}\lim_{n\to\infty} v^{(1)}_{(h_n,m_n)}\cdot v^{(1)}_{(h,m)},$$

$$D_{-g_{2,(h,m)}} f(h,m) = \lim_{n\to\infty}\nabla f(h_n,m_n)\cdot\frac{-g_{2,(h,m)}}{\|g_{2,(h,m)}\|} = \frac{-1}{\|g_{2,(h,m)}\|}\lim_{n\to\infty} v^{(2)}_{(h_n,m_n)}\cdot v^{(2)}_{(h,m)}.$$

Now, we write

$$v^{(1)}_{(h_n,m_n)}\cdot v^{(1)}_{(h,m)} = t^{(1)}_{(h_n,m_n)}\cdot t^{(1)}_{(h,m)} + \left(v^{(1)}_{(h_n,m_n)} - t^{(1)}_{(h_n,m_n)}\right)\cdot t^{(1)}_{(h,m)} + t^{(1)}_{(h_n,m_n)}\cdot\left(v^{(1)}_{(h,m)} - t^{(1)}_{(h,m)}\right) + \left(v^{(1)}_{(h_n,m_n)} - t^{(1)}_{(h_n,m_n)}\right)\cdot\left(v^{(1)}_{(h,m)} - t^{(1)}_{(h,m)}\right)$$
$$\ge t^{(1)}_{(h_n,m_n)}\cdot t^{(1)}_{(h,m)} - \|v^{(1)}_{(h_n,m_n)} - t^{(1)}_{(h_n,m_n)}\|\,\|t^{(1)}_{(h,m)}\| - \|t^{(1)}_{(h_n,m_n)}\|\,\|v^{(1)}_{(h,m)} - t^{(1)}_{(h,m)}\| - \|v^{(1)}_{(h_n,m_n)} - t^{(1)}_{(h_n,m_n)}\|\,\|v^{(1)}_{(h,m)} - t^{(1)}_{(h,m)}\|,$$

and each of the three error terms is controlled by the concentration bounds above. Since $t^{(1)}_{(h,m)}$ is continuous in $(h,m)$ for all $(h,m)\notin\mathcal{K}$, we obtain, for all $(h,m)\notin S^{(1)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)}\cup\mathcal{K}$, writing $M = \frac{Kd^3s^3\sqrt{\epsilon}}{2^{d+s}\ell}\max(\|h\|\|m\|,\|h_0\|\|m_0\|)\|m\|$ so that $\|t^{(1)}_{(h,m)}\| > 4M$,

$$\lim_{n\to\infty} v^{(1)}_{(h_n,m_n)}\cdot v^{(1)}_{(h,m)} \ge \|t^{(1)}_{(h,m)}\|^2 - 2M\,\|t^{(1)}_{(h,m)}\| - M^2 \ge \frac12\|t^{(1)}_{(h,m)}\|\left[\|t^{(1)}_{(h,m)}\| - 4M\right] + \frac12\left[\|t^{(1)}_{(h,m)}\|^2 - (2M)^2\right] > 0. \qquad (14)$$

Similarly, we have, for all $(h,m)\notin S^{(2)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)}\cup\mathcal{K}$,

$$\lim_{n\to\infty} v^{(2)}_{(h_n,m_n)}\cdot v^{(2)}_{(h,m)} > 0. \qquad (15)$$

So, for all $(h,m)\notin \left(S^{(1)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)} \cap S^{(2)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)}\right)\cup\mathcal{K}$, at least one of (14) or (15) holds. If (14) holds, then $D_{-g_{1,(h,m)}} f(h,m) < 0$, and if (15) holds, then $D_{-g_{2,(h,m)}} f(h,m) < 0$.

It remains to prove that for all $(h,m)\in\mathcal{K}$ and all $(x,y)\in\mathbb{R}^n\times\mathbb{R}^p$, $D_{(x,y)} f(h,m)\le 0$. We first assume $h = 0$ and let $m$ be arbitrary.
Let $\tilde h_0 = \Lambda^{(1)}_{d-1,+,h_0} h_0$, $\tilde x = \Lambda^{(1)}_{d-1,+,x} x$, $\tilde m = \Lambda^{(2)}_{s-1,+,m} m$, $\tilde m_0 = \Lambda^{(2)}_{s-1,+,m_0} m_0$, and $\bar\theta^{(k)}_i = g(\bar\theta^{(k)}_{i-1})$ for $g$ given in (8), with $\bar\theta^{(1)}_0 = \angle(x, h_0)$ and $\bar\theta^{(2)}_0 = \angle(m, m_0)$. We compute

$$-D_{(x,y)} f(0, m) = \left\langle \mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m\right)\Lambda^{(1)}_{d,+,x}\, x,\; \mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m_0} m_0\right)\Lambda^{(1)}_{d,+,h_0}\, h_0 \right\rangle$$
$$= \left\langle \tilde x,\; \left(W^{(1)}_{d,+,x}\right)^{\!\top}\mathrm{diag}\!\left(W^{(2)}_{s,+,m}\tilde m \odot W^{(2)}_{s,+,m_0}\tilde m_0\right) W^{(1)}_{d,+,h_0}\,\tilde h_0 \right\rangle$$
$$= \left\langle \tilde x,\; \left(\left(W^{(1)}_{d,+,x}\right)^{\!\top}\mathrm{diag}\!\left(W^{(2)}_{s,+,m}\tilde m \odot W^{(2)}_{s,+,m_0}\tilde m_0\right) W^{(1)}_{d,+,h_0} - \frac{\alpha}{\ell}\left(\tilde m^\top Q_{\tilde m,\tilde m_0}\tilde m_0\right) Q_{\tilde x,\tilde h_0}\right)\tilde h_0 \right\rangle + \frac{\alpha}{\ell}\left(\tilde m^\top Q_{\tilde m,\tilde m_0}\tilde m_0\right)\left(\tilde x^\top Q_{\tilde x,\tilde h_0}\tilde h_0\right)$$
$$\ge -\frac{\epsilon}{\ell}\,\|\tilde m\|\|\tilde m_0\|\|\tilde x\|\|\tilde h_0\| + \frac{\alpha}{\ell}\cdot\frac{(\pi-\bar\theta^{(1)}_{d-1})\cos\bar\theta^{(1)}_{d-1} + \sin\bar\theta^{(1)}_{d-1}}{2\pi}\cdot\frac{(\pi-\bar\theta^{(2)}_{s-1})\cos\bar\theta^{(2)}_{s-1} + \sin\bar\theta^{(2)}_{s-1}}{2\pi}\,\|\tilde x\|\|\tilde h_0\|\|\tilde m\|\|\tilde m_0\|$$
$$= -\frac{\epsilon}{\ell}\,\|\tilde m\|\|\tilde m_0\|\|\tilde x\|\|\tilde h_0\| + \frac{\alpha}{4\ell}\,\cos\bar\theta^{(1)}_d\,\cos\bar\theta^{(2)}_s\,\|\tilde x\|\|\tilde h_0\|\|\tilde m\|\|\tilde m_0\| \ge \left(-\frac{\epsilon}{\ell} + \frac{\alpha}{4\pi^2\ell}\right)\|\tilde x\|\|\tilde h_0\|\|\tilde m\|\|\tilde m_0\|,$$

where the first inequality uses the joint-WDC, the following equality uses $\frac{(\pi-\theta)\cos\theta + \sin\theta}{\pi} = \cos g(\theta)$, and the last inequality uses $\cos\bar\theta^{(1)}_d \ge 1/\pi$ and $\cos\bar\theta^{(2)}_s \ge 1/\pi$ for $d, s \ge 2$. So, if $4\pi^2\epsilon/\alpha < 1$, then $D_{(x,y)} f(h,m) \le 0$ for all $(x,y) \in \mathbb{R}^n\times\mathbb{R}^p$ and $(h,m) \in \{(h,m) \mid h = 0,\, m \in \mathbb{R}^p\}$. Similarly, $D_{(x,y)} f(h,m) \le 0$ for all $(x,y) \in \mathbb{R}^n\times\mathbb{R}^p$ and $(h,m) \in \{(h,m) \mid h \in \mathbb{R}^n,\, m = 0\}$.

Let $S = S^{(1)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)} \cap S^{(2)}_{4Kd^3s^3\sqrt{\epsilon}\,,(h_0,m_0)}$. The proof is finished by applying Lemma 3 with $(d+s)\sqrt{4Kd^3s^3\sqrt{\epsilon}}/\alpha < 1$ to get

$$S \subseteq \mathcal{A}_{\tilde K ds\,\epsilon^{1/4}/\alpha,\,(h_0,m_0)} \cup \mathcal{A}_{\tilde K ds\,\epsilon^{1/4}/\alpha,\,(-\rho^{(1)}_d h_0,\, m_0)} \cup \mathcal{A}_{\tilde K ds\,\epsilon^{1/4}/\alpha,\,(\rho^{(2)}_s h_0,\, -m_0)} \cup \mathcal{A}_{\tilde K ds\,\epsilon^{1/4}/\alpha,\,(-\rho^{(1)}_d\rho^{(2)}_s h_0,\, -m_0)}$$

for some absolute constant $\tilde K$. $\blacksquare$

Concentration of $\tilde g_{1,(h,m)}$ and $\tilde g_{2,(h,m)}$

Lemma 1.
Fix $0 < \epsilon < d^{-4}/(16\pi)^2$ and $d \ge 2$. Suppose that $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ satisfies the WDC with constants $\epsilon$ and 1 for $i = 1, \dots, d$. Define

$$\tilde t_{p,q} = \frac{1}{2^d}\left[\left(\prod_{i=0}^{d-1}\frac{\pi - \bar\theta_i}{\pi}\right) q + \sum_{i=0}^{d-1}\frac{\sin\bar\theta_i}{\pi}\left(\prod_{j=i+1}^{d-1}\frac{\pi-\bar\theta_j}{\pi}\right)\frac{\|q\|}{\|p\|}\, p \right],$$

where $\bar\theta_i = g(\bar\theta_{i-1})$ for $g$ given by (8) and $\bar\theta_0 = \angle(p, q)$. For all $p \neq 0$ and $q \neq 0$,

$$\left\|\left(\prod_{i=d}^{1} W_{i,+,p}\right)^{\!\top}\left(\prod_{i=d}^{1} W_{i,+,q}\right) q - \tilde t_{p,q}\right\| \le \frac{24\, d^3\sqrt{\epsilon}}{2^d}\,\|q\|. \qquad (16)$$

We refer the reader to Hand and Voroninski [2017] for the proof of Lemma 1. We now state a related lemma.
Lemma 2. Fix $0 < \epsilon < \frac{1}{(d+s)^4(16\pi)^2}$, $d \ge 2$, and $s \ge 2$. Suppose that $W^{(1)}_i \in \mathbb{R}^{n_i \times n_{i-1}}$ for $i = 1, \dots, d-1$ and $W^{(2)}_i \in \mathbb{R}^{p_i \times p_{i-1}}$ for $i = 1, \dots, s-1$ satisfy the WDC with constants $\epsilon$ and 1. Suppose $W^{(1)}_d \in \mathbb{R}^{\ell \times n_{d-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_1$, and $W^{(2)}_s \in \mathbb{R}^{\ell \times p_{s-1}}$ satisfies the WDC with constants $\epsilon$ and $\alpha_2$. Also, suppose $(W^{(1)}_d, W^{(2)}_s)$ satisfies the joint-WDC with constants $\epsilon$ and $\alpha = \alpha_1\alpha_2$. Define

$$\tilde t^{(k)}_{p,q} = \frac{1}{2^{a^{(k)}}}\left[\left(\prod_{i=0}^{a^{(k)}-1}\frac{\pi-\bar\theta^{(k)}_i}{\pi}\right) q + \sum_{i=0}^{a^{(k)}-1}\frac{\sin\bar\theta^{(k)}_i}{\pi}\left(\prod_{j=i+1}^{a^{(k)}-1}\frac{\pi-\bar\theta^{(k)}_j}{\pi}\right)\frac{\|q\|}{\|p\|}\, p\right],$$

where $\bar\theta^{(k)}_i = g(\bar\theta^{(k)}_{i-1})$ for $g$ given by (8), $\bar\theta^{(k)}_0 = \angle(p, q)$, $a^{(1)} = d$, and $a^{(2)} = s$. For all $h \neq 0$, $x \neq 0$, $m \neq 0$, and $y \neq 0$,

$$\left\|\left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m} m \odot \Lambda^{(2)}_{s,+,y} y\right)\Lambda^{(1)}_{d,+,x}\, x - \frac{\alpha}{\ell}\left(m^\top\tilde t^{(2)}_{m,y}\right)\tilde t^{(1)}_{h,x}\right\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\|x\|\|m\|\|y\|, \qquad (17)$$

$$\left\|\left(\Lambda^{(2)}_{s,+,m}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(1)}_{d,+,h} h \odot \Lambda^{(1)}_{d,+,x} x\right)\Lambda^{(2)}_{s,+,y}\, y - \frac{\alpha}{\ell}\left(h^\top\tilde t^{(1)}_{h,x}\right)\tilde t^{(2)}_{m,y}\right\| \le \frac{K d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\|y\|\|h\|\|x\|, \qquad (18)$$

where $K$ is an absolute constant.

Proof.
We prove (17); the proof of (18) is identical. Define $h_0 = h$, $x_0 = x$, $m_0 = m$, $y_0 = y$,

$$h_d := \left(\prod_{i=d}^{1} W^{(1)}_{i,+,h}\right) h = \left(W^{(1)}_{d,+,h} W^{(1)}_{d-1,+,h}\cdots W^{(1)}_{1,+,h}\right) h = W^{(1)}_{d,+,h}\, h_{d-1} = \left(W^{(1)}_d\right)_{+,h_{d-1}} h_{d-1},$$

and analogously $x_d = \left(\prod_{i=d}^{1} W^{(1)}_{i,+,x}\right)x$, $m_s = \left(\prod_{i=s}^{1} W^{(2)}_{i,+,m}\right)m$, and $y_s = \left(\prod_{i=s}^{1} W^{(2)}_{i,+,y}\right)y$. By the WDC we have, for all $h \neq 0$ and $m \neq 0$,

$$\left\|\left(W^{(1)}_i\right)^{\!\top}_{+,h}\left(W^{(1)}_i\right)_{+,h} - \frac12 I_{n_{i-1}}\right\| \le \epsilon \quad \text{for all } i = 1, \dots, d-1, \quad \text{and} \qquad (19)$$

$$\left\|\left(W^{(2)}_i\right)^{\!\top}_{+,m}\left(W^{(2)}_i\right)_{+,m} - \frac12 I_{p_{i-1}}\right\| \le \epsilon \quad \text{for all } i = 1, \dots, s-1. \qquad (20)$$

In particular, $\left\|\left(W^{(1)}_{i,+,h}\right)^{\!\top} W^{(1)}_{i,+,h} - \frac12 I_{n_{i-1}}\right\| \le \epsilon$ and $\left\|\left(W^{(2)}_{i,+,m}\right)^{\!\top} W^{(2)}_{i,+,m} - \frac12 I_{p_{i-1}}\right\| \le \epsilon$, and consequently

$$\frac12 - \epsilon \le \left\|W^{(1)}_{i,+,h}\right\|^2 \le \frac12 + \epsilon, \qquad \frac12 - \epsilon \le \left\|W^{(2)}_{i,+,m}\right\|^2 \le \frac12 + \epsilon.$$

Hence,

$$\left\|\prod_{i=d-1}^{1} W^{(1)}_{i,+,h}\right\|\left\|\prod_{i=d-1}^{1} W^{(1)}_{i,+,x}\right\| \le \frac{1}{2^{d-1}}(1+2\epsilon)^{d-1} = \frac{1}{2^{d-1}}\, e^{(d-1)\log(1+2\epsilon)} \le \frac{1+4\epsilon(d-1)}{2^{d-1}}, \qquad (21)$$

where we used $\log(1+z) \le z$, $e^z \le 1+2z$ for $z < 1$, and $2(d-1)\epsilon \le 1$. Similarly,

$$\left\|\prod_{i=s-1}^{1} W^{(2)}_{i,+,m}\right\|\left\|\prod_{i=s-1}^{1} W^{(2)}_{i,+,y}\right\| \le \frac{1+4\epsilon(s-1)}{2^{s-1}}. \qquad (22)$$

Let $\tilde h = \Lambda^{(1)}_{d-1,+,h}h$, $\tilde x = \Lambda^{(1)}_{d-1,+,x}x$, $\tilde m = \Lambda^{(2)}_{s-1,+,m}m$, $\tilde y = \Lambda^{(2)}_{s-1,+,y}y$, and consider

$$\left\|\left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m}m \odot \Lambda^{(2)}_{s,+,y}y\right)\Lambda^{(1)}_{d,+,x}x - \frac{\alpha}{\ell}\left(m^\top\tilde t^{(2)}_{m,y}\right)\tilde t^{(1)}_{h,x}\right\|$$
$$\le \left\|\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top}\left(\left(W^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(W^{(2)}_{s,+,m}\tilde m \odot W^{(2)}_{s,+,y}\tilde y\right)W^{(1)}_{d,+,x} - \frac{\alpha}{\ell}\left(\tilde m^\top Q_{\tilde m,\tilde y}\,\tilde y\right)Q_{\tilde h,\tilde x}\right)\Lambda^{(1)}_{d-1,+,x}\, x\right\|$$
$$\quad + \frac{\alpha}{\ell}\left\|\left(\tilde m^\top Q_{\tilde m,\tilde y}\,\tilde y\right)\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top} Q_{\tilde h,\tilde x}\,\Lambda^{(1)}_{d-1,+,x}x - \left(m^\top\tilde t^{(2)}_{m,y}\right)\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top} Q_{\tilde h,\tilde x}\,\Lambda^{(1)}_{d-1,+,x}x\right\|$$
$$\quad + \frac{\alpha}{\ell}\left\|\left(m^\top\tilde t^{(2)}_{m,y}\right)\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top} Q_{\tilde h,\tilde x}\,\Lambda^{(1)}_{d-1,+,x}x - \left(m^\top\tilde t^{(2)}_{m,y}\right)\tilde t^{(1)}_{h,x}\right\|, \qquad (23)$$

where both inequalities follow from the triangle inequality. We bound the three terms separately. First,

$$\left\|\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top}\left(\left(W^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(W^{(2)}_{s,+,m}\tilde m \odot W^{(2)}_{s,+,y}\tilde y\right)W^{(1)}_{d,+,x} - \frac{\alpha}{\ell}\left(\tilde m^\top Q_{\tilde m,\tilde y}\tilde y\right)Q_{\tilde h,\tilde x}\right)\Lambda^{(1)}_{d-1,+,x}x\right\|$$
$$\le \frac{1+4\epsilon(d-1)}{2^{d-1}}\cdot\frac{\epsilon}{\ell}\,\|x\|\,\|\tilde m\|\|\tilde y\| \le \frac{(1+4\epsilon(d-1))(1+4\epsilon(s-1))}{2^{d+s-2}}\cdot\frac{\epsilon}{\ell}\,\|x\|\|m\|\|y\| \le \frac{16\,\epsilon}{2^{d+s}\ell}\,\|x\|\|m\|\|y\|, \qquad (24)$$

where the first inequality holds because the spectral norm is sub-multiplicative, together with (21) and the joint-WDC; the second holds because of (22); and the last holds provided $4\epsilon(d-1) \le 1$ and $4\epsilon(s-1) \le 1$. Second,

$$\frac{\alpha}{\ell}\left\|\left(\tilde m^\top Q_{\tilde m,\tilde y}\tilde y - m^\top\tilde t^{(2)}_{m,y}\right)\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top} Q_{\tilde h,\tilde x}\,\Lambda^{(1)}_{d-1,+,x}x\right\| \le \frac{\alpha}{\ell}\cdot\frac{1+4\epsilon(d-1)}{2^{d-1}}\,\left\|Q_{\tilde h,\tilde x}\right\|\left|\tilde m^\top Q_{\tilde m,\tilde y}\tilde y - m^\top\tilde t^{(2)}_{m,y}\right|\|x\|.$$

Writing $\tilde m^\top Q_{\tilde m,\tilde y}\tilde y = m^\top\left(\Lambda^{(2)}_{s-1,+,m}\right)^{\!\top} Q_{\tilde m,\tilde y}\,\Lambda^{(2)}_{s-1,+,y}y$ and decomposing

$$\left(\Lambda^{(2)}_{s-1,+,m}\right)^{\!\top} Q_{\tilde m,\tilde y}\,\Lambda^{(2)}_{s-1,+,y}y - \tilde t^{(2)}_{m,y} = \left(\Lambda^{(2)}_{s-1,+,m}\right)^{\!\top}\left(Q_{\tilde m,\tilde y} - \frac{1}{\alpha_2}\left(W^{(2)}_{s,+,m}\right)^{\!\top} W^{(2)}_{s,+,y}\right)\Lambda^{(2)}_{s-1,+,y}\, y + \frac{1}{\alpha_2}\left(\Lambda^{(2)}_{s,+,m}\right)^{\!\top}\Lambda^{(2)}_{s,+,y}\, y - \tilde t^{(2)}_{m,y},$$

we obtain, using (22), the fact that $\frac{1}{\sqrt{\alpha_2}}W^{(2)}_s$ satisfies the WDC with constants $\epsilon/\alpha_2$ and 1, and Lemma 1,

$$\left|\tilde m^\top Q_{\tilde m,\tilde y}\tilde y - m^\top\tilde t^{(2)}_{m,y}\right| \le \left(\frac{1+4\epsilon(s-1)}{2^{s-1}}\cdot\frac{\epsilon}{\alpha_2} + \frac{24\, s^3\sqrt{\epsilon/\alpha_2}}{2^s}\right)\|m\|\|y\|,$$

so that the second term of (23) is bounded by

$$\frac{72\, s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\|x\|\|m\|\|y\|, \qquad (25)$$

using $\alpha\sqrt{1/\alpha_2} = \alpha_1\sqrt{\alpha_2} \le 1$. Third, by the Cauchy–Schwarz inequality and $\|\tilde t^{(2)}_{m,y}\| \le \frac{s}{2^s}\|y\|$, the last term of (23) is at most

$$\frac{\alpha}{\ell}\cdot\frac{s}{2^s}\,\|m\|\|y\|\,\left\|\left(\Lambda^{(1)}_{d-1,+,h}\right)^{\!\top} Q_{\tilde h,\tilde x}\,\Lambda^{(1)}_{d-1,+,x}x - \tilde t^{(1)}_{h,x}\right\| \le \frac{\alpha}{\ell}\cdot\frac{s}{2^s}\left(\frac{1+4\epsilon(d-1)}{2^{d-1}}\cdot\frac{\epsilon}{\alpha_1} + \frac{24\, d^3\sqrt{\epsilon/\alpha_1}}{2^d}\right)\|x\|\|m\|\|y\| \le \frac{72\, s\, d^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\|x\|\|m\|\|y\|, \qquad (26)$$

where the same decomposition is applied to the $d$-layer network, with $\frac{1}{\sqrt{\alpha_1}}W^{(1)}_d$ satisfying the WDC with constants $\epsilon/\alpha_1$ and 1, and Lemma 1. Hence, combining (23), (24), (25), and (26), we get

$$\left\|\left(\Lambda^{(1)}_{d,+,h}\right)^{\!\top}\mathrm{diag}\!\left(\Lambda^{(2)}_{s,+,m}m \odot \Lambda^{(2)}_{s,+,y}y\right)\Lambda^{(1)}_{d,+,x}x - \frac{\alpha_1\alpha_2}{\ell}\left(m^\top\tilde t^{(2)}_{m,y}\right)\tilde t^{(1)}_{h,x}\right\| \le \left(\frac{16\epsilon}{2^{d+s}\ell} + \frac{72\, s^3\sqrt{\epsilon}}{2^{d+s}\ell} + \frac{72\, s\, d^3\sqrt{\epsilon}}{2^{d+s}\ell}\right)\|x\|\|m\|\|y\| \le \frac{K\, d^3 s^3\sqrt{\epsilon}}{2^{d+s}\ell}\,\|x\|\|m\|\|y\|$$

for an absolute constant $K$. $\blacksquare$

Zeros of $t^{(1)}_{(h,m),(h_0,m_0)}$ and $t^{(2)}_{(h,m),(h_0,m_0)}$

Lemma 3.
Fix < (cid:15) < and < α ≤ such that d + s ) √ (cid:15)/α < . Let K = { ( h , ) ∈ R n × p | h ∈ R n } ∪ { ( , m ) ∈ R n × p (cid:12)(cid:12) m ∈ R p } . Let S (1) (cid:15), ( h , m ) = (cid:40) ( h , x ) ∈ R n × p \K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) t (1)( h , m ) , ( h , m ) (cid:107) (cid:107) m (cid:107) ≤ (cid:15) max( (cid:107) h (cid:107) (cid:107) m (cid:107) , (cid:107) h (cid:107) (cid:107) m (cid:107) )2 d + s (cid:96) (cid:41) ,S (2) (cid:15), ( h , m ) = (cid:40) ( h , x ) ∈ R n × p \K (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) t (2)( h , m ) , ( h , m ) (cid:107) (cid:107) h , (cid:107) ≤ (cid:15) max( (cid:107) h (cid:107) (cid:107) m (cid:107) , (cid:107) h (cid:107) (cid:107) m (cid:107) )2 d + s (cid:96) (cid:41) , where d and s are an integers greater than 1. Let ˜ t ( k ) m , y = 12 a ( k ) a ( k ) − (cid:89) i =0 π − ¯ θ ( k ) i π y + a ( k ) − (cid:88) i =0 sin ¯ θ ( k ) i π a ( k ) − (cid:89) j = i +1 π − ¯ θ ( k ) j π (cid:107) y (cid:107) (cid:107) m (cid:107) m , (27) where ¯ θ ( k ) i = g (¯ θ ( k ) i − ) for g given in (8) , ¯ θ ( k )0 = ∠ ( m , y ) , a (1) = d , and a (2) = s . Let t ( h , m ) , ( h , m ) = (cid:34) t (1)( h , m ) , ( h , m ) t (2)( h , m ) , ( h , m ) (cid:35) where t (1)( h , m ) , ( h , m ) = α d + s (cid:96) (cid:107) m (cid:107) h − α(cid:96) m (cid:124) ˜ t (2) m , m ˜ t (1) h , h , (28) t (2)( h , m ) , ( h , m ) = α d + s (cid:96) (cid:107) h (cid:107) m − α(cid:96) h (cid:124) ˜ t (1) h , h ˜ t (2) m , m . (29)23 efine ρ ( k ) a ( k ) := a ( k ) − (cid:88) i =1 sin ˇ θ ( k ) i π a ( k ) − (cid:89) j = i +1 π − ˇ θ ( k ) j π , where ˇ θ ( k )0 = π and ˇ θ ( k ) i = g (ˇ θ ( k ) i − ) . If ( h , m ) ∈ S (1) (cid:15), ( h , m ) ∩ S (2) (cid:15), ( h , m ) then one of the followingholds: • (cid:12)(cid:12)(cid:12) ¯ θ (1)0 (cid:12)(cid:12)(cid:12) ≤ √ (cid:15), (cid:12)(cid:12)(cid:12) ¯ θ (2)0 (cid:12)(cid:12)(cid:12) ≤ √ (cid:15) and |(cid:107) h (cid:107) (cid:107) m (cid:107) − (cid:107) h (cid:107) (cid:107) m (cid:107) | ≤ ds √ (cid:15)α (cid:107) h (cid:107) (cid:107) m (cid:107) , • (cid:12)(cid:12)(cid:12) ¯ θ (1)0 − π (cid:12)(cid:12)(cid:12) ≤ π d √ (cid:15)/α, (cid:12)(cid:12)(cid:12) ¯ θ (2)0 (cid:12)(cid:12)(cid:12) ≤ . √ (cid:15) and (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:12)(cid:12)(cid:12) ≤ d s √ (cid:15)α (cid:107) h (cid:107) (cid:107) m (cid:107) , • (cid:12)(cid:12)(cid:12) ¯ θ (1)0 (cid:12)(cid:12)(cid:12) ≤ √ (cid:15), (cid:12)(cid:12)(cid:12) ¯ θ (2)0 − π (cid:12)(cid:12)(cid:12) ≤ π s √ (cid:15)/α and (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:12)(cid:12)(cid:12) ≤ ds √ (cid:15)α (cid:107) h (cid:107) (cid:107) m (cid:107) , • (cid:12)(cid:12)(cid:12) ¯ θ (1)0 − π (cid:12)(cid:12)(cid:12) ≤ π d √ (cid:15)α, (cid:12)(cid:12)(cid:12) ¯ θ (2)0 − π (cid:12)(cid:12)(cid:12) ≤ π d √ (cid:15)/α and (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d ρ (2) s (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:12)(cid:12)(cid:12) ≤ d s √ (cid:15)α (cid:107) h (cid:107) (cid:107) m (cid:107) . 
In particular, S (1) (cid:15), ( h , m ) ∩ S (2) (cid:15), ( h , m ) ⊆ A ds √ (cid:15)α , ( h , m ) ∪ A π d s √ (cid:15)α , (cid:16) − ρ (1) d h , m (cid:17) ∪ A π ds √ (cid:15)α , (cid:16) ρ (2) s h , − m (cid:17) ∪ A π d s √ (cid:15)α , (cid:16) − ρ (1) d ρ (2) s h , − m (cid:17) , where A (cid:15), ( h , m ) is defined in (3) .Proof. Without loss of generality, let h = e , m = e , ˆ h = cos ¯ θ (1)0 + sin ¯ θ (1)0 and ˆ m =cos ¯ θ (2)0 + sin ¯ θ (2)0 for some ¯ θ (1) i , ¯ θ (2)0 ∈ [0 , π ] . First we introduce some notation for convenience.Let ξ ( k ) = a ( k ) − (cid:89) i =0 π − ¯ θ ( k ) i π , ζ ( k ) = a ( k ) − (cid:88) i =1 sin ¯ θ ( k ) i π a ( k ) − (cid:89) j = i +1 π − ¯ θ ( k ) j π ,r (1) = (cid:107) h (cid:107) , r (2) = (cid:107) m (cid:107) , and M = max( r (1) r (2) , . Using these notation, we can rewrite t (1)( h , m ) , ( h , m ) as t (1)( h , m ) , ( h , m ) = α d + s (cid:96) (cid:18) (cid:107) m (cid:107) h − (cid:16) ξ (2) cos ¯ θ (2)0 + ζ (2) (cid:17) (cid:18) ξ (1) h (cid:107) h (cid:107) + ζ (1) h (cid:107) h (cid:107) (cid:19) (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:19) (cid:107) m (cid:107) α d + s (cid:96) (cid:18) (cid:107) m (cid:107) h − cos ¯ θ (2) s (cid:18) ξ (1) h (cid:107) h (cid:107) + ζ (1) h (cid:107) h (cid:107) (cid:19) (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:19) (cid:107) m (cid:107) = α d + s (cid:96) (cid:18) r (1) r (2) (cid:16) cos ¯ θ (1)0 e + sin ¯ θ (1)0 e ) (cid:17) − cos ¯ θ (2) s (cid:16) ξ (1) e + ζ (1) (cid:16) cos ¯ θ (1)0 e + sin ¯ θ (1)0 e (cid:17) (cid:17)(cid:19) r (2) . By inspecting the components of t (1)( h , m ) , ( h , m ) , we have that ( h , m ) ∈ S (1) (cid:15), ( h , m ) implies (cid:12)(cid:12)(cid:12) r (1) r (2) cos ¯ θ (1)0 − cos ¯ θ (2) s (cid:16) ξ (1) + ζ (1) cos ¯ θ (1)0 (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα (30) (cid:12)(cid:12)(cid:12) r (1) r (2) sin ¯ θ (1)0 − cos ¯ θ (2) s ζ (1) sin ¯ θ (1)0 (cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα (31)Similarly, by inspecting the components of t (2)( h , m ) , ( h , m ) , we have that ( h , m ) ∈ S (2) (cid:15), ( h , m ) implies (cid:12)(cid:12)(cid:12) r (1) r (2) cos ¯ θ (2)0 − cos ¯ θ (1) d (cid:16) ξ (2) + ζ (2) cos ¯ θ (2)0 (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα (32) (cid:12)(cid:12)(cid:12) r (1) r (2) sin ¯ θ (2)0 − cos ¯ θ (1) d ζ (2) sin ¯ θ (2)0 (cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα (33)Now, we record several properties. We have: ¯ θ ( k ) i ∈ [0 , π/ for i ≥ (34) ¯ θ ( k ) i ≤ ¯ θ ( k ) i − for i ≥ (35) | ξ ( k ) | ≤ (36) ˇ θ ( k ) i ≤ πi + 3 for i ≥ (37) ˇ θ ( k ) i ≥ πi + 1 for i ≥ (38) ξ ( k ) = a ( k ) − (cid:89) i =1 π − ¯ θ ( k ) i π ≥ π − ¯ θ ( k )0 π a ( k ) − (39) ¯ θ ( k )0 = π + O ( δ ) ⇒ ¯ θ ( k ) i = ˇ θ ( k ) i + O ( iδ ) (40) ¯ θ ( k )0 = π + O ( δ ) ⇒ | ξ ( k ) | ≤ δπ (41) ¯ θ ( k )0 = π + O ( δ ) ⇒ ζ ( k ) = ρ ( k ) d + O (3 a ( k )3 δ ) if a ( k )2 δπ ≤ (42) | ζ ( k ) | = | ξ ( k ) cos ¯ θ ( k )0 − cos ¯ θ ( k ) a ( k ) | ≤ (43) cos ¯ θ ( k ) i ≥ π for i ≥ (44)For a proof of (36)-(42), we refer the readers to Lemma 8 of Hand and Voroninski [2017].25lso, we note that (44) follows directly from (43).We first show that if ( h , m ) ∈ S (1) (cid:15), ( h , m ) then r (1) r (2) ≤ , and thus M ≤ . Suppose r (1) r (2) > . At least one of the following holds: | sin ¯ θ (1)0 | ≥ / √ or | cos ¯ θ (1)0 | ≥ / √ .If | sin ¯ θ (1)0 | ≥ / √ then (31) implies that (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) ≤ √ (cid:15)r (1) r (2) /α . 
Using(43), we get r (1) r (2) ≤ −√ (cid:15)/α ≤ if (cid:15)/α < / . If | cos ¯ θ (1)0 | ≥ / √ , then (30) implies (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) ≤ √ (cid:0) (cid:15)r (1) r (2) /α + | ξ (1) | (cid:1) . Using (36), (43), and (cid:15)/α < / , we get r (1) r (2) ≤ √ | ξ ( k ) | +cos ¯ θ (2) s ζ (1) −√ (cid:15)/α ≤ √ −√ (cid:15)/α ≤ . Thus, we have ( h , m ) ∈ S (1) (cid:15), ( h , m ) ⇒ r (1) r (2) ≤ ⇒ M ≤ . Similarly, we have ( h , m ) ∈ S (2) (cid:15), ( h , m ) ⇒ r (1) r (2) ≤ ⇒ M ≤ .Next we establish that we only need to consider the small angle case and the largeangle case (i.e. ¯ θ ( k )0 ≈ or π ) if ( h , m ) ∈ S (1) (cid:15), ( h , m ) ∩ S (2) (cid:15), ( h , m ) . Exactly one of thefollowing holds: (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) ≥ √ (cid:15)M/α or (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) < √ (cid:15)M/α . If (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) ≥ √ (cid:15)M/α , then by (31), we have | sin ¯ θ (1)0 | ≤ √ (cid:15) . Hence ¯ θ (1)0 = O (2 √ (cid:15) ) or ¯ θ (1)0 = π + O (2 √ (cid:15) ) , as (cid:15) < . If (cid:12)(cid:12)(cid:12) r (1) r (2) − cos ¯ θ (2) s ζ (1) (cid:12)(cid:12)(cid:12) < √ (cid:15)M/α , then by (30) and(44) we have (cid:12)(cid:12) ξ (1) (cid:12)(cid:12) ≤ π √ (cid:15)M/α . Using (39), we get ¯ θ (1)0 = π + O (2 π d √ (cid:15)M/α ) . Thus,we only need to consider the small angle case, ¯ θ (1)0 = O (2 √ (cid:15) ) and the large angle case ¯ θ (1)0 = π + O (12 π d √ (cid:15)/α ) , where we have used M ≤ . Similarly, we only need to considerthe small angle case, ¯ θ (2)0 = O (2 √ (cid:15) ) and the large angle case ¯ θ (2)0 = π + O (12 π s √ (cid:15)/α ) . Case 1: ¯ θ (1)0 ≈ and ¯ θ (2)0 ≈ . Assume ¯ θ ( k )0 = O (2 √ (cid:15) ) . As ¯ θ ( k ) i ≤ ¯ θ ( k )0 ≤ √ (cid:15) for all i , we have ξ ( k ) ≥ (cid:16) − √ (cid:15)π (cid:17) a ( k ) = 1 + O ( a ( k ) √ (cid:15)π ) provided a ( k ) √ (cid:15) ≤ / . By (43), we alsohave ζ ( k ) = O ( a ( k ) π √ (cid:15) ) = O ( a ( k ) √ (cid:15) ) . By (30), we have (cid:12)(cid:12)(cid:12) r (1) r (2) cos ¯ θ (1)0 − (cid:16) ξ (2) cos ¯ θ (2)0 + ζ (2) (cid:17) (cid:16) ξ (1) + ζ (1) cos ¯ θ (1)0 (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα where we used cos ¯ θ ( k ) a ( k ) = ξ ( k ) cos ¯ θ ( k )0 + ζ ( k ) . As cos ¯ θ ( k )0 = 1 + O ((¯ θ ( k )0 ) /
2) = 1 + O (2 (cid:15) ) , ξ (2) cos ¯ θ (2)0 + ζ (2) =1 + O (8 s(cid:15) √ (cid:15) + 4 s √ (cid:15) + 2 (cid:15) + s √ (cid:15) ) = 1 + O (15 s √ (cid:15) ) ,ξ (1) + ζ (1) cos ¯ θ (1)0 =1 + O (4 d √ (cid:15) + 2 d(cid:15) √ (cid:15) + d √ (cid:15) ) = 1 + O (7 d √ (cid:15) ) . Thus, r (1) r (2) = 1 + O (12 (cid:15) + 6 (cid:15)/α + 105 ds(cid:15) + 7 d √ (cid:15) + 15 s √ (cid:15) ) = 1 + O (145 ds √ (cid:15)/α ) . (45)We now show ( h , m ) is close to (cid:0) c h , c m (cid:1) , where c = (cid:107) m (cid:107) (cid:107) m (cid:107) . Consider (cid:13)(cid:13)(cid:13)(cid:13) h − (cid:107) m (cid:107) (cid:107) m (cid:107) h (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) m (cid:107) (cid:16) |(cid:107) h (cid:107) (cid:107) m (cid:107) − (cid:107) m (cid:107) (cid:107) h (cid:107) | + ( (cid:107) h (cid:107) (cid:107) m (cid:107) + |(cid:107) h (cid:107) (cid:107) m (cid:107) − (cid:107) m (cid:107) (cid:107) h (cid:107) | ) ¯ θ (1)0 (cid:17) (cid:107) m (cid:107) (cid:0) ds √ (cid:15) (cid:107) h (cid:107) (cid:107) m (cid:107) /α + (cid:0) (cid:107) h (cid:107) (cid:107) m (cid:107) + 145 ds √ (cid:15) (cid:107) h (cid:107) (cid:107) m (cid:107) /α (cid:1) √ (cid:15) (cid:1) ≤ ds √ (cid:15)α (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:107) m (cid:107) . Similarly, (cid:13)(cid:13)(cid:13)(cid:13) m − (cid:107) m (cid:107) (cid:107) m (cid:107) m (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:16) |(cid:107) m (cid:107) − (cid:107) m (cid:107) | + ( (cid:107) m (cid:107) + |(cid:107) m (cid:107) − (cid:107) m (cid:107) | ) ¯ θ (2)0 (cid:17) ≤ √ (cid:15) (cid:107) m (cid:107) . Hence, (cid:13)(cid:13)(cid:13)(cid:13) ( h , m ) − (cid:18) c h , c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ds √ (cid:15)α (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) c h , c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Case 2: ¯ θ (1)0 ≈ π and ¯ θ (2)0 ≈ Assume ¯ θ (1)0 = π + O ( δ ) where δ = 12 π d √ (cid:15)/α . By(41) and (42), we have ξ (1) = O ( δ/π ) , and we have ζ (1) = ρ (1) d + O (3 d δ ) if d √ (cid:15)/α ≤ .Also, assume ¯ θ (2)0 = O (2 √ (cid:15) ) . As ¯ θ (2) i ≤ ¯ θ (2)0 ≤ √ (cid:15) for all i , we have ξ (2) ≥ (cid:16) − √ (cid:15)π (cid:17) s =1 + O ( s √ (cid:15)π ) provided s √ (cid:15) ≤ / . By (43), we also have ζ (2) = O ( sπ √ (cid:15) ) = O ( s √ (cid:15) ) . By(32), we have (cid:12)(cid:12)(cid:12) r (1) r (2) cos ¯ θ (2)0 − (cid:16) ξ (1) cos ¯ θ (1)0 + ζ (1) (cid:17) (cid:16) ξ (2) + ζ (2) cos ¯ θ (2)0 (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα where we used cos ¯ θ ( k ) a ( k ) = ξ ( k ) cos ¯ θ ( k )0 + ζ ( k ) . As cos ¯ θ (1)0 = − O ((¯ θ (1)0 − π ) /
2) = − O ( δ / provided δ < and cos ¯ θ (2)0 = 1 + O ((¯ θ (2)0 ) /
2) = 1 + O (2 (cid:15) ) , ξ (1) cos ¯ θ (1)0 + ζ (1) = ρ (1) d + O ( δ π + δπ + 3 d δ ) = ρ (1) d + O (4 δd ) ,ξ (2) + ζ (2) cos ¯ θ (2)0 =1 + O (4 s √ (cid:15) + 2 s(cid:15) √ (cid:15) + s √ (cid:15) ) = 1 + O (7 s √ (cid:15) ) . Thus, r (1) r (2) = ρ (1) d + O (12 (cid:15) + 6 (cid:15)/α + 4 δd + 7 s √ (cid:15) + 28 d sδ √ (cid:15) )= ρ (1) d + O (30 d √ (cid:15)/α + 4 δd + 28 d s √ (cid:15) )= ρ (1) d + O (532 d s √ (cid:15)/α ) . where, in the second equality, we use δ < . We now show ( h , m ) is close to (cid:16) − cρ (1) d h , c m (cid:17) ,where c = (cid:107) m (cid:107) (cid:107) m (cid:107) . Consider (cid:13)(cid:13)(cid:13)(cid:13) h + (cid:107) m (cid:107) (cid:107) m (cid:107) ρ (1) d h (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) m (cid:107) (cid:16) (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:12)(cid:12)(cid:12) + (cid:16) ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) + (cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:12)(cid:12)(cid:17) ¯ θ (1)0 (cid:17) ≤ (cid:107) m (cid:107) (cid:0) d s √ (cid:15) (cid:107) m (cid:107) (cid:107) h (cid:107) /α + (cid:0) (cid:107) h (cid:107) (cid:107) m (cid:107) + 532 d s √ (cid:15) (cid:107) m (cid:107) (cid:107) h (cid:107) /α (cid:1) d √ (cid:15)/α (cid:1) ≤ (cid:107) m (cid:107) (cid:0) d s √ (cid:15) (cid:107) m (cid:107) (cid:107) h (cid:107) /α + (2 (cid:107) h (cid:107) (cid:107) m (cid:107) + 14 s (cid:107) m (cid:107) (cid:107) h (cid:107) ) 119 d √ (cid:15)/α (cid:1) ≤ π d s √ (cid:15)α ρ (1) d (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:107) m (cid:107) . Similarly, (cid:13)(cid:13)(cid:13)(cid:13) m − (cid:107) m (cid:107) (cid:107) m (cid:107) m (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:16) |(cid:107) m (cid:107) − (cid:107) m (cid:107) | + ( (cid:107) m (cid:107) + |(cid:107) m (cid:107) − (cid:107) m (cid:107) | ) ¯ θ (2)0 (cid:17) ≤ √ (cid:15) (cid:107) m (cid:107) . Hence, (cid:13)(cid:13)(cid:13)(cid:13) ( h , m ) − (cid:18) − cρ (1) d h , c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ π d s √ (cid:15)α (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) − cρ (1) d h , c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Case 3: ¯ θ (1)0 ≈ and ¯ θ (2)0 ≈ π . The analysis is similar to case 2. Using (30), we get r (1) r (2) = ρ (2) s + O (532 ds √ (cid:15)/α ) . Again, similar to case 2, we can show ( h , m ) is close to (cid:16) cρ (2) s h , − c m (cid:17) , where c = (cid:107) h (cid:107) (cid:107) h (cid:107) .We get, (cid:13)(cid:13)(cid:13)(cid:13) ( h , m ) − (cid:18) cρ (2) s h , − c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ π ds √ (cid:15)α (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) cρ (2) s h , − c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . Case 4: ¯ θ (1)0 ≈ π and ¯ θ (2)0 ≈ π . Assume ¯ θ ( k )0 = π + O ( δ ( k ) ) where δ ( k ) = 12 π a ( k )3 √ (cid:15)/α .By (41) and (42), we have ξ ( k ) = O ( δ ( k ) /π ) , and we have ζ ( k ) = ρ ( k ) d + O (3 a ( k )3 δ ( k ) ) if a ( k )2 δπ ≤ . By (30), we have (cid:12)(cid:12)(cid:12) r (1) r (2) cos ¯ θ (1)0 − (cid:16) ξ (2) cos ¯ θ (2)0 + ζ (2) (cid:17) (cid:16) ξ (1) + ζ (1) cos ¯ θ (1)0 (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:15)Mα where we used cos ¯ θ ( k ) a ( k ) = ξ ( k ) cos ¯ θ ( k )0 + ζ ( k ) . As cos ¯ θ ( k )0 = − O ((¯ θ ( k )0 − π ) /
2) = − O (( δ ( k ) ) / , ξ (2) cos ¯ θ (2)0 + ζ (2) = ρ (2) s + O ( ( δ (2) ) π + δ (2) π + 3 s δ (2) ) = ρ (2) s + O (4 δ (2) s ) ,ξ (1) + ζ (1) cos ¯ θ (1)0 = − ρ (1) d + O ( δ (1) π + 32 d ( δ (1) ) + 3 δ (1) d ) = − ρ (1) d + O (5 δ (1) d ) . r (1) r (2) = ρ (1) d ρ (2) s + O (6 (cid:15)/α + 4( δ (1) ) + 4 δ (2) s + 5 δ (1) d + 20 δ (1) δ (2) d s )= ρ (1) d ρ (2) s + O (6 (cid:15)/α + 4 δ (1) + 4 δ (2) s + 5 δ (1) d + 20 δ (2) d s )= ρ (1) d ρ (2) s + O (6 (cid:15)/α + 3909 d s √ (cid:15)/α )= ρ (1) d ρ (2) s + O (3915 d s √ (cid:15)/α ) , where, in the second equality, we used δ (1) ≤ πd < . We now show ( h , m ) is close to (cid:16) − cρ (1) d ρ (2) s h , − c m (cid:17) , where c = (cid:107) m (cid:107) (cid:107) m (cid:107) . Consider (cid:13)(cid:13)(cid:13)(cid:13) h + (cid:107) m (cid:107) (cid:107) m (cid:107) ρ (1) d ρ (2) s h (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) m (cid:107) (cid:16) (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d ρ (2) s (cid:107) m (cid:107) (cid:107) h (cid:107) (cid:12)(cid:12)(cid:12) + (cid:16) ρ (1) d ρ (2) s (cid:107) h (cid:107) (cid:107) m (cid:107) + (cid:12)(cid:12)(cid:12) (cid:107) h (cid:107) (cid:107) m (cid:107) − ρ (1) d ρ (2) s (cid:107) m (cid:107) (cid:107) h (cid:107) (cid:12)(cid:12)(cid:12) (cid:17) ¯ θ (1)0 (cid:17) ≤ (cid:107) m (cid:107) (cid:0) d s √ (cid:15) (cid:107) h (cid:107) (cid:107) m (cid:107) /α + (cid:0) (cid:107) h (cid:107) (cid:107) m (cid:107) + 3915 d s √ (cid:15) (cid:107) h (cid:107) (cid:107) m (cid:107) /α (cid:1) d √ (cid:15)/α (cid:1) ≤ (cid:107) m (cid:107) (cid:0) d s √ (cid:15) (cid:107) h (cid:107) (cid:107) m (cid:107) /α + (cid:0) (cid:107) h (cid:107) (cid:107) m (cid:107) + 104 ds (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:1) d √ (cid:15)/α (cid:1) ≤ π d s √ (cid:15)α ρ (1) d ρ (2) s (cid:107) h (cid:107) (cid:107) m (cid:107) (cid:107) m (cid:107) . Similarly, (cid:13)(cid:13)(cid:13)(cid:13) m + (cid:107) m (cid:107) (cid:107) m (cid:107) m (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:16) |(cid:107) m (cid:107) − (cid:107) m (cid:107) | + ( (cid:107) m (cid:107) + |(cid:107) m (cid:107) − (cid:107) m (cid:107) | ) ¯ θ (2)0 (cid:17) ≤ s √ (cid:15) (cid:107) m (cid:107) /α. Hence, (cid:13)(cid:13)(cid:13)(cid:13) ( h , m ) − (cid:18) − cρ (1) d ρ (2) s h , − c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) ≤ π d s √ (cid:15)α (cid:13)(cid:13)(cid:13)(cid:13)(cid:18) cρ (1) d ρ (2) s h , c m (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) . We first state a lemma that shows that the weight W ∈ R (cid:96) × n of a layer of a neural networklayer with i.i.d. N (0 , /(cid:96) ) entries satisfies the WDC with constant (cid:15) and , and we refer thereaders to Hand and Voroninski [2017] for a proof of the lemma. Lemma 4.
Fix < (cid:15) < . Let W ∈ R (cid:96) × n have i.i.d. N (0 , /(cid:96) ) entries. If (cid:96) > cn log n , thenwith probability at least − (cid:96)e − γn , W satisfies the WDC with constant (cid:15) and . Here c, γ − re constants that depend only polynomially on (cid:15) − . We now state a lemma similar to Lemma 4 which applies to truncated random variable.The proof follows the proof of lemma 4 in Hand and Voroninski [2017].
Lemma 5.
Fix < (cid:15) < . Let W ∈ R (cid:96) × n where i th row of W satisfy w (cid:124) i = w (cid:124) · (cid:107) w (cid:107) ≤ √ n/(cid:96) and w ∼ N ( , (cid:96) I n ) . If (cid:96) > cn log n , then with probability at least − ne − γn , W satisfiesthe WDC with constant (cid:15) and α . Here c, γ − are constants that depend only polynomially on (cid:15) − and α = Γ (cid:0) n +22 (cid:1) − Γ (cid:0) n +22 , n (cid:1) Γ (cid:0) n +22 (cid:1) , (46) where Γ is the Gamma function. The WDC condition with constant (cid:15) and α can be written as (cid:107) W (cid:124) + , x W + , y − α Q x , y (cid:107) ≤ (cid:15) for all nonzero x , y ∈ R n . We note that W (cid:124) + , x W + , y = (cid:96) (cid:88) i =1 w (cid:124) i x w (cid:124) i y w i w (cid:124) i and it is not continuous in x and y . So, we consider an arbitrarily good continuousapproximation of W (cid:124) + , x W + , y . Let t − (cid:15) ( z ) = z ≤ − (cid:15), z(cid:15) − (cid:15) ≤ z ≤ , z ≤ , and t (cid:15) ( z ) = z ≤ , z(cid:15) ≤ z ≤ (cid:15), z ≥ (cid:15). and define H − (cid:15) ( xy ) := (cid:96) (cid:88) i =1 t − (cid:15) ( w (cid:124) i x ) t − (cid:15) ( w (cid:124) i y ) w i w (cid:124) i ,H (cid:15) ( x , y ) := (cid:96) (cid:88) i =1 t (cid:15) ( w (cid:124) i x ) t (cid:15) ( w (cid:124) i y ) w i w (cid:124) i . The proof of Lemma 5 follows from the follow two lemmas. We first provide an upper boundon the singular values of H − (cid:15) ( x , y ) . Lemma 6.
Fix < (cid:15) < . Let W ∈ R (cid:96) × n where i th row of W satisfy w (cid:124) i = w (cid:124) · (cid:107) w (cid:107) ≤ √ n and w ∼ N ( , I n ) . If (cid:96) > cn log n , then with probability at least − ne − γn , ∀ ( x , y ) (cid:54) = ( , ) , H − (cid:15) ( x , y ) (cid:22) α(cid:96) Q x , y + 3 (cid:96)(cid:15)I n . ere, c and γ − are constants that depend only polynomially on (cid:15) − and α is α = Γ (cid:0) n +22 (cid:1) − Γ (cid:0) n +22 , n (cid:1) Γ (cid:0) n +22 (cid:1) , (47) where Γ is the Gamma function.Proof. First we bound E [ H − (cid:15) ( x , y )] for fixed x , y ∈ S n − . Noting that t − (cid:15) ( z ) ≤ z ≥− (cid:15) ( z ) = z> ( z ) + − (cid:15) ≤ z ≤ ( z ) , we have E [ H − (cid:15) ( x , y )] (cid:22) E (cid:34) (cid:96) (cid:88) i =1 w (cid:124) i x ≥− (cid:15) w (cid:124) i y ≥− (cid:15) w i w (cid:124) i (cid:35) = (cid:96) E (cid:104) w (cid:124) i x ≥− (cid:15) w (cid:124) i y ≥− (cid:15) w i w (cid:124) i (cid:105) = (cid:96) E (cid:104) (cid:16) w (cid:124) i x ≥ w (cid:124) i y ≥ w i w (cid:124) i (cid:17) (cid:105) + 2 (cid:96) E (cid:104)(cid:16) − (cid:15) ≤ w (cid:124) i x ≤ w i w (cid:124) i (cid:17)(cid:105) . We first note that E (cid:104) w (cid:124) i x ≥ w (cid:124) i y ≥ b i b (cid:124) i (cid:105) = α Q x , y where α satisfies . < α < . Also, wehave E (cid:104) − (cid:15) ≤ w (cid:124) i x ≤ w i w (cid:124) i (cid:105) (cid:22) (cid:15)α I n . Thus, E [ H − (cid:15) ( x , y )] (cid:22) α(cid:96) · m (cid:124) Q x , y y + (cid:15)α(cid:96) I n (cid:22) α(cid:96) · m (cid:124) Q x , y + (cid:15)(cid:96) I n (48)Second, we show concentration of H − (cid:15) ( x , y ) for fixed x , y ∈ S n − . Let ξ i = (cid:113) t − (cid:15) ( w (cid:124) i x ) t − (cid:15) ( w (cid:124) i y ) w i . We have H − (cid:15) ( x , y ) − E [ H − (cid:15) ( x , y )]= (cid:96) (cid:88) i =1 (cid:16) t − (cid:15) ( w (cid:124) i x ) t − (cid:15) ( w (cid:124) i y ) w i w (cid:124) i − E [ t − (cid:15) ( w (cid:124) i x ) t − (cid:15) ( w (cid:124) i y ) w i w (cid:124) i ] (cid:17) = (cid:96) (cid:88) i =1 ( ξ i ξ (cid:124) i − E [ ξ i ξ (cid:124) i ]) . Note that ξ i is sub-Gaussian for all i and that the sub-Gaussian norm of ξ i is bounded fromabove by an absolute constant which we call K . By first part of Remark 5.40 in Vershynin[2012], there exists a c K and γ K such that for all t ≥ , with probability at least − e − γ K t , (cid:107) H − (cid:15) ( x , y ) − E [ H − (cid:15) ( x , y )] (cid:107) ≤ max( δ, δ ) (cid:96), where δ = c K (cid:114) n(cid:96) + t √ (cid:96) . (cid:96) > (2 c K /(cid:15) ) n, t = (cid:15) √ (cid:96)/ , and (cid:15) < , we have (cid:107) H − (cid:15) ( x , y ) − E [ H − (cid:15) ( x , y )] (cid:107) ≤ (cid:15)(cid:96) (49)with probability at least − e − γ K (cid:15) (cid:96) .Third, we bound the Lipschitz constant of H − (cid:15) . 
For ˜ x , ˜ y ∈ R n we have H − (cid:15) ( x , y ) − H − (cid:15) ( ˜ x , ˜ y )= (cid:96) (cid:88) i =1 (cid:104) t − (cid:15) ( w (cid:124) i x ) t − (cid:15) ( w (cid:124) i y ) − t − (cid:15) ( w (cid:124) i ˜ x ) t − (cid:15) ( w (cid:124) i ˜ y ) (cid:105) w i w (cid:124) i = (cid:96) (cid:88) i =1 (cid:104) t − (cid:15) ( w (cid:124) i x ) ( t − (cid:15) ( w (cid:124) i y ) − t − (cid:15) ( w (cid:124) i ˜ y ))+ t − (cid:15) ( w (cid:124) i ˜ y ) ( t − (cid:15) ( w (cid:124) i x ) − t − (cid:15) ( w (cid:124) i ˜ x )) (cid:105) w i w (cid:124) i = W (cid:124) (cid:104) diag ( t − (cid:15) ( W x )) diag (( W y ) + − ( W ˜ y ) + )+ diag ( t − (cid:15) ( W ˜ y )) diag (( W x ) + − ( W ˜ x ) + ) (cid:105) W Thus, (cid:107) H − (cid:15) ( x , y ) − H − (cid:15) ( ˜ x , ˜ y ) (cid:107)≤(cid:107) W (cid:107) (cid:104) (cid:107) t − (cid:15) ( W y ) − t − (cid:15) ( W ˜ y ) (cid:107) ∞ + (cid:107) t − (cid:15) ( W x ) − t − (cid:15) ( W ˜ x ) (cid:107) ∞ (cid:105) ≤(cid:107) W (cid:107) (cid:34) max i ∈ [ (cid:96) ] | t − (cid:15) ( w (cid:124) i y ) − t − (cid:15) ( w (cid:124) i ˜ y ) | + max i ∈ [ (cid:96) ] | t − (cid:15) ( w (cid:124) i x ) − t − (cid:15) ( w (cid:124) i ˜ x ) | (cid:35) ≤(cid:107) W (cid:107) (cid:34) max i ∈ [ (cid:96) ] (cid:15) | w (cid:124) i ( y − ˜ y ) | + max i ∈ [ (cid:96) ] (cid:15) | w (cid:124) i ( x − ˜ x ) | (cid:35) ≤(cid:107) W (cid:107) (cid:34) (cid:15) max i ∈ [ (cid:96) ] (cid:107) w i (cid:107) (cid:107) y − ˜ y (cid:107) + 1 (cid:15) max i ∈ [ (cid:96) ] (cid:107) w i (cid:107) (cid:107) x − ˜ x (cid:107) (cid:35) ≤(cid:107) W (cid:107) (cid:104) (cid:15) √ n (cid:107) x − ˜ x (cid:107) + 9 (cid:15) √ n (cid:13)(cid:13)(cid:13) h − ˜ h (cid:13)(cid:13)(cid:13) (cid:105) where the first inequality follows because | t − (cid:15) ( z ) | ≤ for all z , and the third inequalityfollows because t − (cid:15) ( z ) is /(cid:15) -Lipschitz. Let E be the event that (cid:107) W (cid:107) ≤ √ (cid:96) . ByCorollary 5.35 in Vershynin [2012], for A ∈ R (cid:96) × n with rows of A following N ( , I n ) , we have P ( (cid:107) A (cid:107) ≤ √ (cid:96) ) ≥ − e − (cid:96)/ , if (cid:96) ≥ n . As rows of W are truncated, we have P ( E ) ≥ − e − (cid:96)/ ,if (cid:96) ≥ n as well. On E , we have (cid:107) H − (cid:15) ( x , y ) − H − (cid:15) ( ˜ x , ˜ y ) (cid:107)≤ (cid:96) √ n(cid:15) [ (cid:107) x − ˜ y (cid:107) + (cid:107) y − ˜ y (cid:107) ] (50)32or all x , y , ˜ x , ˜ y ∈ S n − .Finally, we complete the proof by a covering argument. Let N δ be a δ -net on S n − suchthat |N δ | ≤ (3 /δ ) n . Take δ = (cid:15) √ n . Combining (48) and (53), we have ∀ ( x , y ) , ∈ N δ , H − (cid:15) ( x , y ) (cid:22) E H − (cid:15) ( x , y ) + (cid:96)(cid:15)I n (cid:22) α(cid:96) Q x , y + 2 (cid:96)(cid:15)I n . with probability at least − |N δ | e − γ K (cid:15) (cid:96)/ ≥ − (cid:18) δ (cid:19) n e − γ K (cid:15) (cid:96)/ ≥ − e − γ K (cid:15) (cid:96)/ n log(3 · √ n/(cid:15) ) . If (cid:96) ≥ ˜ cn log( n ) for some ˜ c = Ω( (cid:15) log (cid:15) ) , then this probability is at least − e − ˜ γ(cid:96) for some ˜ γ = O ( (cid:15) ) . For x , y ∈ S n − , let ˜ x , ˜ y ∈ N δ be such that (cid:107) x − ˜ x (cid:107) ≤ δ , and (cid:107) y − ˜ y (cid:107) ≤ δ . By(50), we have that ∀ x , y (cid:54) = , H − (cid:15) ( x , y ) (cid:22) H − (cid:15) ( ˜ x , ˜ y ) + 27 (cid:96) √ n(cid:15) δ I n (cid:22) α(cid:96) Q x , y + 3 (cid:96)(cid:15) I n . 
In conclusion, the result of this lemma holds if (cid:96) > (2 c K /(cid:15) ) n and (cid:96) ≥ ˜ c ( n ) log n , withprobability at least − e − γ K (cid:15) (cid:96)/ − e − (cid:96)/ − e − ˜ γ(cid:96) > − e − γ(cid:96) for some γ = O ( (cid:15) ) and ˜ c = Ω( (cid:15) log (cid:15) ) .Next, we now provide an upper bound on the singular values of G (cid:15) ( h , x , m , y ) . Lemma 7.
Fix < (cid:15) < . Let W ∈ R (cid:96) × n where i th row of W satisfy w (cid:124) i = w (cid:124) · (cid:107) w (cid:107) ≤ √ n and w ∼ N ( , I n ) . If (cid:96) > cn log n , then with probability at least − ne − γn , ∀ ( x , y ) (cid:54) = ( , ) , H (cid:15) ( x , y ) (cid:23) α(cid:96) Q x , y − (cid:96)(cid:15)I n . Here, c and γ − are constants that depend only polynomially on (cid:15) − and α is α = Γ (cid:0) n +22 (cid:1) − Γ (cid:0) n +22 , n (cid:1) Γ (cid:0) n +22 (cid:1) , (51) where Γ is the Gamma function.Proof. First we bound E [ H − (cid:15) ( x , y )] for fixed x , y ∈ S n − . Noting that t (cid:15) ( z ) ≥ z> ( z ) − − (cid:15) ≤ z ≤ ( z ) for all z , we have E [ H (cid:15) ( x , y )] (cid:23) (cid:96) E (cid:104) (cid:16) w (cid:124) i x ≥ w (cid:124) i y ≥ w i w (cid:124) i (cid:17) (cid:105) − (cid:96) E (cid:104)(cid:16) − (cid:15) ≤ w (cid:124) i x ≤ w i w (cid:124) i (cid:17)(cid:105) . We first note that E (cid:104) w (cid:124) i x ≥ w (cid:124) i y ≥ b i b (cid:124) i (cid:105) = α Q x , y where α satisfies . < α < . Also, we33ave E (cid:104) − (cid:15) ≤ w (cid:124) i x ≤ w i w (cid:124) i (cid:105) (cid:22) (cid:15)α I n . Thus, E [ H − (cid:15) ( x , y )] (cid:23) α(cid:96) · m (cid:124) Q x , y y − (cid:15)α(cid:96) I n (cid:23) α(cid:96) · m (cid:124) Q x , y − (cid:15)(cid:96) I n (52)Second, the same argument as in Lemma 6 provides that for fixed x , y ∈ S n − , if (cid:96) > (2 c K /(cid:15) ) n , then we have with probability at least − e − γ K (cid:15) (cid:96) , (cid:107) H − (cid:15) ( x , y ) − E [ H − (cid:15) ( x , y )] (cid:107) ≤ (cid:15)(cid:96) (53)Third, same argument as in Lemma 6 provides on the event E , we have (cid:107) H − (cid:15) ( x , y ) − H − (cid:15) ( ˜ x , ˜ y ) (cid:107) ≤ (cid:96) √ n(cid:15) [ (cid:107) x − ˜ y (cid:107) + (cid:107) y − ˜ y (cid:107) ] for all x , y , ˜ x , ˜ y ∈ S n − .Finally, we complete the proof by an identical covering argument as in Lemma 6. Wehave if (cid:96) ≥ c n log n then with probability at least − e − γ(cid:96) , ∀ x , y (cid:54) = , H (cid:15) ( x , y ) (cid:23) α(cid:96) Q x , y − (cid:96)(cid:15) I n . We now state a result that states random gaussian matrices with truncated rows satisfyjoint-WDC.
Lemma 8.
Fix < (cid:15) < . Let B ∈ R (cid:96) × n where i th row of B satisfy b (cid:124) i = b (cid:124) · (cid:107) b (cid:107) ≤ √ n/(cid:96) and b ∼ N ( , I n /(cid:96) ) . Similarly, let C ∈ R (cid:96) × p where i th row of C satisfy c (cid:124) i = c (cid:124) · (cid:107) c (cid:107) ≤ √ p/(cid:96) and c ∼ N ( , I p /(cid:96) ) . If (cid:96) > c (( n log n ) + ( p log p ) ) , then with probability at least − e − γ(cid:96) , B and C satisfy joint-WDC with constants (cid:15) and α = α · α . Here, c and γ − are constantsthat depend only polynomially on (cid:15) − and α = Γ (cid:0) n +22 (cid:1) − Γ (cid:0) n +22 , n (cid:1) Γ (cid:0) n +22 (cid:1) and α = Γ (cid:0) p +22 (cid:1) − Γ (cid:0) p +22 , p (cid:1) Γ (cid:0) p +22 (cid:1) , (54) where Γ is the Gamma function. The proof of Lemma 8 follows directly from Lemmas 9 and ?? . Using Corollary 1, weprovide a concentration result of B (cid:124) + , h diag ( C + , m m ) diag ( C + , y y ) B + , x , which is part of thejoint-WDC condition. We note that B (cid:124) + , h diag ( C + , m m ) diag ( C + , y y ) B + , x = (cid:96) (cid:88) i =1 b (cid:124) i h > b (cid:124) i x > ( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i h and x . So, we consider an arbitrarily good continuousapproximation of B (cid:124) + , h diag ( C + , m m ) diag ( C + , y y ) B + , x . Let t − (cid:15) ( z ) = z ≤ − (cid:15), z(cid:15) − (cid:15) ≤ z ≤ , z ≤ , and t (cid:15) ( z ) = z ≤ , z(cid:15) ≤ z ≤ (cid:15), z ≥ (cid:15). and define G − (cid:15) ( h , x , m , y ) := (cid:96) (cid:88) i =1 t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i ,G (cid:15) ( h , x , m , y ) := (cid:96) (cid:88) i =1 t (cid:15) ( b (cid:124) i h ) t (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i . We now provide an upper bound on the singular values of G − (cid:15) ( h , x , m , y ) . Lemma 9.
Fix < (cid:15) < . Let B ∈ R (cid:96) × n where i th row of B satisfy b (cid:124) i = b (cid:124) · (cid:107) b (cid:107) ≤ √ n and b ∼ N ( , I n ) . Similarly, let C ∈ R (cid:96) × p where i th row of C satisfy c (cid:124) i = c (cid:124) · (cid:107) c (cid:107) ≤ √ p and c ∼ N ( , I p ) . If (cid:96) > c (( n log n ) + ( p log p ) ) , then with probability at least − e − γ(cid:96) , ∀ ( h , x ) (cid:54) = ( , ) and m , y ∈ S p − ,G − (cid:15) ( h , x , m , y ) (cid:22) α α (cid:96) Q h , x m (cid:124) Q m , y y + 4 (cid:96)(cid:15)I n . Here, c and γ − are constants that depend only polynomially on (cid:15) − and α and α is as in (54) .Proof. First we bound E [ G − (cid:15) ( h , x , m , y )] for fixed h , x ∈ S n − and m , y ∈ S p − . Notingthat t − (cid:15) ( z ) ≤ z ≥− (cid:15) ( z ) = z> ( z ) + − (cid:15) ≤ z ≤ ( z ) , we have E [ G − (cid:15) ( h , x , m , y )] (cid:22) E (cid:34) (cid:96) (cid:88) i =1 b (cid:124) i h ≥− (cid:15) b (cid:124) i x ≥− (cid:15) ( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i (cid:35) = (cid:96) E (cid:104) b (cid:124) i h ≥− (cid:15) b (cid:124) i x ≥− (cid:15) ( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i (cid:105) (cid:22) (cid:96) E (cid:104)(cid:16) b (cid:124) i h ≥ b (cid:124) i x ≥ ( c (cid:124) i m ) + ( c (cid:124) i y ) + + (cid:16) − (cid:15) ≤ b (cid:124) i h ≤ + − (cid:15) ≤ b (cid:124) i x ≤ (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:17) b i b (cid:124) i (cid:105) = (cid:96) E (cid:104) (cid:16) b (cid:124) i h ≥ b (cid:124) i x ≥ b i b (cid:124) i (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:105) + 2 (cid:96) E (cid:104)(cid:16) − (cid:15) ≤ b (cid:124) i h ≤ b i b (cid:124) i (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:105) . We first note that E (cid:104) b (cid:124) i h ≥ b (cid:124) i x ≥ b i b (cid:124) i (cid:105) = α Q h , x and E [( c (cid:124) i m ) + ( c (cid:124) i y ) + ] = α m (cid:124) Q m , y y where α i satisfies . < α i < . Also, we have (cid:12)(cid:12) m (cid:124) Q m , y y (cid:12)(cid:12) ≤ and E (cid:104) − (cid:15) ≤ b (cid:124) i h ≤ b i b (cid:124) i (cid:105) (cid:22) (cid:15)α I n . Thus, E [ G − (cid:15) ( h , x , m , y )] (cid:22) α α (cid:96) · m (cid:124) Q m , y y · Q h , x + 2 (cid:96) · E (cid:104) − (cid:15) ≤ b (cid:124) i h ≤ b i b (cid:124) i (cid:105) · α m (cid:124) Q m , y y α α (cid:96) · m (cid:124) Q m , y y · Q h , x + (cid:15)α α (cid:96) I n (cid:22) α α (cid:96) · m (cid:124) Q m , y y · Q h , x + (cid:15)(cid:96) I n (55)Second, we show concentration of G − (cid:15) ( h , x , m , y ) for fixed h , x ∈ S n − and m , y ∈ S p − .Let ξ i = (cid:112) t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i . We have G − (cid:15) ( h , x , m , y ) − E [ G − (cid:15) ( h , x , m , y )]= (cid:96) (cid:88) i =1 (cid:16) t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i − E [ t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i b (cid:124) i ] (cid:17) = (cid:96) (cid:88) i =1 ( ξ i ξ (cid:124) i − E [ ξ i ξ (cid:124) i ]) . Note that ξ i is sub-Gaussian for all i and that the sub-Gaussian norm of ξ i is boundedfrom above by K = ˜ K √ n , where ˜ K is an absolute constant. 
By Corollary 1, there exists a c = ¯ c √ n log n and γ = ¯ γn log n such that for all t ≥ , with probability at least − e − γt , (cid:107) G − (cid:15) ( h , x , m , y ) − E [ G − (cid:15) ( h , x , m , y )] (cid:107) ≤ max( δ, δ ) (cid:96), where δ = c (cid:114) n(cid:96) + t √ (cid:96) . Here, ¯ c and ¯ γ are absolute constants. If (cid:96) > (2¯ c/(cid:15) ) n (log n ) , t = (cid:15) √ (cid:96)/ , and (cid:15) < , we have (cid:107) G − (cid:15) ( h , x , m , y ) − E [ G − (cid:15) ( h , x , m , y )] (cid:107) ≤ (cid:15)(cid:96) (56)with probability at least − e − ¯ γ (cid:15) (cid:96)n log n .Third, we bound the Lipschitz constant of G − (cid:15) . For ˜ h , ˜ x ∈ R n and ˜ m , ˜ y ∈ S p − we have G − (cid:15) ( h , x , m , y ) − G − (cid:15) ( ˜ h , ˜ x , ˜ m , ˜ y )= (cid:96) (cid:88) i =1 (cid:104) t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + − t − (cid:15) ( b (cid:124) i ˜ h ) t − (cid:15) ( b (cid:124) i ˜ x )( c (cid:124) i ˜ m ) + ( c (cid:124) i ˜ y ) + (cid:105) b i b (cid:124) i = (cid:96) (cid:88) i =1 (cid:104) t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + (( c (cid:124) i y ) + − ( c (cid:124) i ˜ y ) + )+ t − (cid:15) ( b (cid:124) i h ) t − (cid:15) ( b (cid:124) i x )( c (cid:124) i ˜ y ) + (( c (cid:124) i m ) + − ( c (cid:124) i ˜ m ) + )+ t − (cid:15) ( b (cid:124) i h )( c (cid:124) i ˜ m ) + ( c (cid:124) i ˜ y ) + ( t − (cid:15) ( b (cid:124) i x ) − t − (cid:15) ( b (cid:124) i ˜ x ))+ t − (cid:15) ( b (cid:124) i ˜ x )( c (cid:124) i ˜ m ) + ( c (cid:124) i ˜ y ) + (cid:16) t − (cid:15) ( b (cid:124) i h ) − t − (cid:15) ( b (cid:124) i ˜ h ) (cid:17) (cid:105) b i b (cid:124) i = B (cid:124) (cid:104) diag ( t − (cid:15) ( Bh ) (cid:12) t − (cid:15) ( Bx ) (cid:12) ( Cm ) + ) diag (( Cy ) + − ( C ˜ y ) + )+ diag ( t − (cid:15) ( Bh ) (cid:12) t − (cid:15) ( Bx ) (cid:12) ( C ˜ y ) + ) diag (( Cm ) + − ( C ˜ m ) + )+ diag (cid:0) t − (cid:15) ( Bh ) (cid:12) ( C ˜ m ) + (cid:12) ( C ˜ y ) + (cid:1) diag ( t − (cid:15) ( Bx ) − t − (cid:15) ( B ˜ x )) diag ( t − (cid:15) ( B ˜ x ) (cid:12) ( C ˜ m ) + (cid:12) ( C ˜ y ) + ) diag (cid:16) t − (cid:15) ( Bh ) − t − (cid:15) ( B ˜ h ) (cid:17) (cid:105) B Thus, (cid:107) G − (cid:15) ( h , x , m , y ) − G − (cid:15) ( ˜ h , ˜ x , ˜ m , ˜ y ) (cid:107)≤(cid:107) B (cid:107) (cid:104) (cid:107) Cm (cid:107) ∞ (cid:107) ( Cy ) + − ( C ˜ y ) + (cid:107) ∞ + (cid:107) C ˜ y (cid:107) ∞ (cid:107) ( Cm ) + − ( C ˜ m ) + (cid:107) ∞ + (cid:107) C ˜ m (cid:107) ∞ (cid:107) C ˜ y (cid:107) ∞ (cid:107) t − (cid:15) ( Bx ) − t − (cid:15) ( B ˜ x ) (cid:107) ∞ + (cid:107) C ˜ m (cid:107) ∞ (cid:107) C ˜ y (cid:107) ∞ (cid:107) t − (cid:15) ( Bh ) − t − (cid:15) ( B ˜ h ) (cid:107) ∞ (cid:105) ≤(cid:107) B (cid:107) (cid:34) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) max i ∈ [ (cid:96) ] | ( c (cid:124) i y ) + − ( c (cid:124) i ˜ y ) + | + max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) max i ∈ [ (cid:96) ] | ( c (cid:124) i m ) + − ( c (cid:124) i ˜ m ) + | + (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] | t − (cid:15) ( b (cid:124) i x ) − t − (cid:15) ( b (cid:124) i ˜ x ) | + (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] (cid:12)(cid:12)(cid:12) t − (cid:15) ( b (cid:124) i h ) − t − (cid:15) ( b (cid:124) i ˜ h ) (cid:12)(cid:12)(cid:12) (cid:35) ≤(cid:107) B (cid:107) (cid:34) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) max i ∈ [ (cid:96) ] | c 
(cid:124) i ( y − ˜ y ) | + max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) max i ∈ [ (cid:96) ] | c (cid:124) i ( m − ˜ m ) | + (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] (cid:15) | b (cid:124) i ( x − ˜ x ) | + (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] (cid:15) (cid:12)(cid:12)(cid:12) b (cid:124) i ( h − ˜ h ) (cid:12)(cid:12)(cid:12) (cid:35) ≤(cid:107) B (cid:107) (cid:34) (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) (cid:107) y − ˜ y (cid:107) + (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) (cid:107) m − ˜ m (cid:107) + 1 (cid:15) (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] (cid:107) b i (cid:107) (cid:107) x − ˜ x (cid:107) + 1 (cid:15) (cid:18) max i ∈ [ (cid:96) ] (cid:107) c i (cid:107) (cid:19) max i ∈ [ (cid:96) ] (cid:107) b i (cid:107) (cid:13)(cid:13)(cid:13) h − ˜ h (cid:13)(cid:13)(cid:13) (cid:35) ≤(cid:107) B (cid:107) (cid:104) p (cid:107) y − ˜ y (cid:107) + 9 p (cid:107) m − ˜ m (cid:107) + 27 (cid:15) √ np (cid:107) x − ˜ x (cid:107) + 27 (cid:15) √ np (cid:13)(cid:13)(cid:13) h − ˜ h (cid:13)(cid:13)(cid:13) (cid:105) where the first inequality follows because | t − (cid:15) ( z ) | ≤ for all z , and the third inequalityfollows because t − (cid:15) ( z ) is /(cid:15) -Lipschitz and ( z ) + is -Lipschitz. Let E be the event that (cid:107) B (cid:107) ≤ √ (cid:96) . By Corollary 5.35 in Vershynin [2012], for A ∈ R (cid:96) × n with rows of A following N ( , I n ) , we have P ( (cid:107) A (cid:107) ≤ √ (cid:96) ) ≥ − e − (cid:96)/ , if (cid:96) ≥ n . As rows of B are truncated, wehave P ( E ) ≥ − e − (cid:96)/ , if (cid:96) ≥ n as well. On E , we have (cid:107) G − (cid:15) ( h , x , m , y ) − G − (cid:15) ( ˜ h , ˜ x , ˜ m , ˜ y ) (cid:107)≤ (cid:96) √ np(cid:15) (cid:104) (cid:107) y − ˜ y (cid:107) + (cid:107) m − ˜ m (cid:107) + (cid:107) x − ˜ x (cid:107) + (cid:13)(cid:13)(cid:13) h − ˜ h (cid:13)(cid:13)(cid:13)(cid:105) (57)for all ˜ h , ˜ x ∈ S n − and ˜ m , ˜ y ∈ S p − .Finally, we complete the proof by a covering argument. Let N δ be a δ -net on S n − × S p − such that |N δ | ≤ (3 /δ ) n + p . Take δ = (cid:15) √ np . Combining (55) and (56), we have ∀ ( h , m ) , ( x , y ) ∈ N δ , G − (cid:15) ( h , x , m , y ) (cid:22) E G − (cid:15) ( h , x , m , y ) + (cid:96)(cid:15)I n (cid:22) α (cid:96) Q h , x m (cid:124) Q m , y y + 3 (cid:96)(cid:15)I n . − |N δ | e − γ K (cid:15) (cid:96) n log n ≥ − (cid:18) δ (cid:19) n + p e − γ K (cid:15) (cid:96) n log n ≥ − e − γ K (cid:15) (cid:96) n log n +( n + p ) log(3 · √ np/(cid:15) ) . If (cid:96) ≥ ˜ cn ( n + p ) log( np ) for some ˜ c = Ω( (cid:15) log (cid:15) ) , then this probability is at least − e − ˜ γ(cid:96) for some ˜ γ = O ( (cid:15) ) . For ( h , m ) , ( x , y ) ∈ S n − × S p − , let ( ˜ h , ˜ m ) , ( ˜ x , ˜ y ) ∈ N δ be such that (cid:107) h − ˜ h (cid:107) ≤ δ , (cid:107) x − ˜ x (cid:107) ≤ δ , (cid:107) m − ˜ m (cid:107) ≤ δ and (cid:107) y − ˜ y (cid:107) ≤ δ . By (57), we have that ∀ ( h , x ) (cid:54) = ( , ) and m , y ∈ S p − , G − (cid:15) ( h , x , m , y ) (cid:22) G − (cid:15) ( ˜ h , ˜ x , ˜ m , ˜ y ) + 729 (cid:96) √ np(cid:15) δ I n (cid:22) α α (cid:96) Q h , x m (cid:124) Q m , y y + 4 (cid:96)(cid:15) I n . 
In conclusion, the result of this lemma holds if (cid:96) > (2¯ c/(cid:15) ) n (log n ) and (cid:96) ≥ ˜ cn ( n + p ) log( np ) ,with probability at least − e − (cid:96)/ − e − ˜ γ(cid:96) > − e − γ(cid:96) for some γ = O ( (cid:15) ) and ˜ c =Ω( (cid:15) log (cid:15) ) .Next, we now provide an upper bound on the singular values of G (cid:15) ( h , x , m , y ) . Lemma 10.
Fix < (cid:15) < . Let B ∈ R (cid:96) × n where i th row of B satisfy b (cid:124) i = b (cid:124) · (cid:107) b (cid:107) ≤ √ n and b ∼ N ( , I n ) . Similarly, let C ∈ R (cid:96) × p where i th row of C satisfy c (cid:124) i = c (cid:124) · (cid:107) c (cid:107) ≤ √ p and c ∼ N ( , I p ) . If (cid:96) > c (( n log n ) + ( p log p ) ) , then with probability at least − e − γ(cid:96) , ∀ ( h , x ) (cid:54) = ( , ) and m , y ∈ S p − ,G (cid:15) ( h , x , m , y ) (cid:23) α α (cid:96) Q h , x m (cid:124) Q m , y y − (cid:96)(cid:15)I n . Here, c and γ − are constants that depend only polynomially on (cid:15) − and α and α as in (54) .Proof. First we bound E [ G (cid:15) ( h , x , m , y )] for fixed h , x ∈ S n − and m , y ∈ S p − . Notingthat t (cid:15) ( z ) ≥ = z> ( z ) − ≤ z ≤ (cid:15) ( z ) , we have E [ G (cid:15) ( h , x , m , y )] (cid:23) (cid:96) E (cid:104)(cid:16) b (cid:124) i h ≥ b (cid:124) i x ≥ ( c (cid:124) i m ) + ( c (cid:124) i y ) + − (cid:16) ≤ b (cid:124) i h ≤ (cid:15) + ≤ b (cid:124) i x ≤ (cid:15) (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:17) b i b (cid:124) i (cid:105) = (cid:96) E (cid:104) (cid:16) b (cid:124) i h ≥ b (cid:124) i x ≥ b i b (cid:124) i (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:105) − (cid:96) E (cid:104)(cid:16) ≤ b (cid:124) i h ≤ (cid:15) b i b (cid:124) i (cid:17) ( c (cid:124) i m ) + ( c (cid:124) i y ) + (cid:105) . We first note that E (cid:104) b (cid:124) i h ≥ b (cid:124) i x ≥ b i b (cid:124) i (cid:105) = α Q h , x and E [( c (cid:124) i m ) + ( c (cid:124) i y ) + ] = α m (cid:124) Q m , y y where α i satisfies . < α i < . Also, we have (cid:12)(cid:12) m (cid:124) Q m , y y (cid:12)(cid:12) ≤ and E (cid:104) ≤ b (cid:124) i h ≤ (cid:15) b i b (cid:124) i (cid:105) (cid:22) (cid:15)α I n .Thus, E [ G (cid:15) ( h , x , m , y )] (cid:23) α α (cid:96) · m (cid:124) Q m , y y · Q h , x − (cid:96) · E (cid:104) ≤ b (cid:124) i h ≤ (cid:15) b i b (cid:124) i (cid:105) · α m (cid:124) Q m , y y (cid:23) α α (cid:96) · m (cid:124) Q m , y y · Q h , x − (cid:15)α α (cid:96) I n (cid:23) α α (cid:96) · m (cid:124) Q m , y y · Q h , x − (cid:15)(cid:96) I n (58)38econd, we show concentration of G (cid:15) ( h , x , m , y ) for fixed h , x ∈ S n − and m , y ∈ S p − and is similar to the steps shown in proof of Lemma 9. Let ξ i = (cid:112) t (cid:15) ( b (cid:124) i h ) t (cid:15) ( b (cid:124) i x )( c (cid:124) i m ) + ( c (cid:124) i y ) + b i .If (cid:96) > (2¯ c/(cid:15) ) n (log n ) , we have (cid:107) G (cid:15) ( h , x , m , y ) − E [ G (cid:15) ( h , x , m , y )] (cid:107) ≤ (cid:15)n (59)with probability at least − e − ¯ γ (cid:15) (cid:96)n log n . Here, ¯ c and ¯ γ are absolute constants.Third, we bound the Lipschitz constant of G (cid:15) , and is again similar to the steps shown inproof of Lemma 9. If (cid:96) ≥ n then we have (cid:107) G (cid:15) ( h , x , m , y ) − G (cid:15) ( ˜ h , ˜ x , ˜ m , ˜ y ) (cid:107)≤ (cid:96) √ np(cid:15) (cid:104) (cid:107) y − ˜ y (cid:107) + (cid:107) m − ˜ m (cid:107) + (cid:107) x − ˜ x (cid:107) + (cid:13)(cid:13)(cid:13) h − ˜ h (cid:13)(cid:13)(cid:13)(cid:105) (60)for all ˜ h , ˜ x ∈ S n − and ˜ m , ˜ y ∈ S p − with probability at least − e − (cid:96)/ .Finally, we complete the proof by a covering argument. Let N δ be a δ -net on S n − × S p − such that |N δ | ≤ (3 /δ ) n + p . Take δ = (cid:15) √ np . 
Combining (58) and (59), we have ∀ ( h , m ) , ( x , y ) ∈ N δ , G (cid:15) ( h , x , m , y ) (cid:23) E G (cid:15) ( h , x , m , y ) − (cid:96)(cid:15)I n (cid:23) α (cid:96) Q h , x m (cid:124) Q m , y y − (cid:96)(cid:15)I n with probability at least − |N δ | e − γ K (cid:15) (cid:96) n log n ≥ − (cid:18) δ (cid:19) n + p e − γ K (cid:15) (cid:96) n log n ≥ − e − γ K (cid:15) (cid:96) n log n +( n + p ) log(108 √ np/(cid:15) ) . If (cid:96) ≥ ˜ cn ( n + p ) log( np ) for some ˜ c = Ω( (cid:15) log (cid:15) ) , then this probability is at least − e − ˜ γ(cid:96) for some ˜ γ = O ( (cid:15) ) . For ( h , m ) , ( x , y ) ∈ S n − × S p − , let ( ˜ h , ˜ m ) , ( ˜ x , ˜ y ) ∈ N δ be such that (cid:107) h − ˜ h (cid:107) ≤ δ , (cid:107) x − ˜ x (cid:107) ≤ δ , (cid:107) m − ˜ m (cid:107) ≤ δ and (cid:107) y − ˜ y (cid:107) ≤ δ . By (60), we have that ∀ ( h , x ) (cid:54) = ( , ) and m , y ∈ S p − , G (cid:15) ( h , x , m , y ) (cid:23) α α (cid:96) Q h , x m (cid:124) Q m , y y − (cid:96)(cid:15) I n . In conclusion, the result of this lemma holds if (cid:96) > (2¯ c/(cid:15) ) n (log n ) and (cid:96) ≥ ˜ cn ( n + p ) log( np ) ,with probability at least − e − (cid:96)/ − e − ˜ γ(cid:96) > − e − γ(cid:96) for some γ = O ( (cid:15) ) and ˜ c =Ω( (cid:15) log (cid:15) ) . The proof of Lemmas 6 and 7 require results from concentration of sub-exponential randomvariables that has a better dependence on the sub-exponential parameters. To this end, weuse the following Bernstein inequality and refer the readers to Jeong et al. [2019] for a proofof the theorem.
Theorem 5.
Let a = ( a , . . . , a n ) be a fixed non-zero vector and let y , . . . , y m be independent,mean zero sub-exponential random variables satisfying E | y i | ≤ and (cid:107) y i (cid:107) ψ ≤ K i ( K i ≥ . hen for every u ≥ , we have P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m (cid:88) i =1 a i y i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ u (cid:33) ≤ (cid:104) − c min (cid:16) u (cid:80) mi =1 a i K i log K i , u (cid:107) a (cid:107) ∞ K log K (cid:17)(cid:105) , where K = max i K i and c is an absolute constant. We now state a theorem that controls the singular values of a random matrix A . TheTheorem is exactly the same as Theorem 5.39 in Vershynin [2012] with the notable differencein the dependence of the constants to the sub-gaussian parameters. We use Theorem 5 to getthis improved dependence. Theorem 6.
Let A be a N × n matrix whose rows a i are independent sub-gaussian isotropicrandom vectors in R n . Then for every t ≥ , with probability at least − − ct ) one has √ N − C √ n − t ≤ s min ( A ) ≤ s max ( A ) ≤ √ N + C √ n + t. (61) Here C = C K = K √ log K (cid:113) log 9 c , c = c K = c K log K > with c is an absolute constant and K = max i (cid:107) a i (cid:107) ψ . The proof structure of Theorem 6 is exactly the same as the proof of Theorem 5.39 inVershynin [2012], and so we provide the proof presented in Vershynin [2012] below.
Proof.
The proof is a basic version of a covering argunemt, and it has three steps. We needto control (cid:107) Ax (cid:107) for all vectors on the unit sphere. To this end, we discretize the sphereusing a N (the approximation step), establish a tight control of (cid:107) Ax (cid:107) for every fixed vector x ∈ N with high probability (the concentration step), and finish off by taking a union boundover all x in the net. Step 1: Approximation . Using Lemma 5.36 in Vershynin [2012] for the matrix B = A / √ N we see that the conclusion of the theorem is equivalent to (cid:107) N A (cid:124) A − I (cid:107) ≤ max( δ, δ ) =: (cid:15) where δ = C (cid:114) nN + t √ N . (62)Using Lemma 5.34 in Vershynin [2012], we can evaluate the operator norm in (62) on a -net N pf unit sphere S n − : (cid:107) N A (cid:124) A − I (cid:107) ≤ x ∈N (cid:12)(cid:12)(cid:12)(cid:12)(cid:28) ( 1 N A (cid:124) A − I ) x , x (cid:29)(cid:12)(cid:12)(cid:12)(cid:12) = 2 max x ∈N | N (cid:107) Ax (cid:107) − | . So to complete the proof it suffices to show that, which high probability, max x ∈N | N (cid:107) A (cid:107) − | ≤ (cid:15) . (63)By Lemma 5.2 in Vershynin [2012], we can choose the net N so that it has cardinality |N | ≤ n . 40 tep 2: Concentration Let us fix any vector x ∈ S n − . We can express (cid:107) Ax (cid:107) as asum of independent random variablies (cid:107) Ax (cid:107) = N (cid:88) i =1 (cid:104) a i , x (cid:105) =: N (cid:88) i =1 z i (64)where a i denote the rows of the matrix A . By assumption, z i = (cid:104) a i , x (cid:105) are independentsub-gaussian random variables with E z i = 1 and (cid:107) z i (cid:107) ψ ≤ K . Therefore, by Remark 5.18 andLemma 5.14 in Vershynin [2012], z i − are independent centered sub-exponential randomvariables with (cid:107) z i − (cid:107) ψ ≤ (cid:107) z i (cid:107) ψ ≤ (cid:107) z i (cid:107) ψ = 4 K .We can therefore use an exponential deviation inequality, Theorem 5, to control the sum(64). P (cid:26) | N (cid:107) A (cid:107) − | ≥ (cid:15) (cid:27) = P (cid:40) | N N (cid:88) i =1 z i − | ≥ (cid:15) (cid:41) ≤ (cid:34) − ˜ c min (cid:32) (cid:15) N / (cid:80) Ni =1 K i log 2 K i , (cid:15)N/ K log 2 K (cid:33)(cid:35) ≤ (cid:20) − ˜ c K log 2 K min (cid:0) (cid:15) , (cid:15) (cid:1) N (cid:21) = 2 exp (cid:20) − ˜ c K log 2 K δN (cid:21) ≤ (cid:20) − c K log K (cid:0) C n + t (cid:1)(cid:21) , where the last inequality follows by the definition of δ and using the inequality ( a + b ) ≥ a + b for a, b ≥ . Step 3: Union bound.
Taking the union bound over all vectors x in the net N ofcardinality |N | ≤ n , we obtain P (cid:26) max x ∈N | N (cid:107) Ax (cid:107) − | ≥ (cid:15) (cid:27) ≤ n · − c K log K (cid:0) C n + t (cid:1) ] ≤ − c K log K t ] , where the second inequality follows for C = C K sufficiently large, e.g. C = K √ log K (cid:113) log 9 c .We now state a corollary of Theorem 6 that applies to general, non-isotropic sub-gaussiandistribution. Corollary 1.
Let A be a N × n matrix whose rows a i are independent sub-gaussian randomvectors in R n with second moment matrix Σ . Then for every t ≥ , with probability at least − − ct ) one has (cid:107) N A (cid:124) A − Σ (cid:107) ≤ max( δ, δ ) where δ = C (cid:114) nN + t √ N . (65)
Here C = C K = K √ log K (cid:113) log 9 c , c = c K = c K log K > with c is an absolute constant and K = max i (cid:107) a i (cid:107) ψ ..