On the stable recovery of deep structured linear networks under sparsity constraints
Proceedings of Machine Learning Research vol 75:1–15, 2018
François Malgouyres    MALGOUYRES@MATH.UNIV-TOULOUSE.FR
Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse, CNRS UPS IMT, F-31062 Toulouse Cedex 9, France
Joseph Landsberg    JML@MATH.TAMU.EDU
Department of Mathematics, Mailstop 3368, Texas A&M University
Abstract
We study a deep linear network expressed in the form of a matrix factorization problem. It takes as input a matrix $X$ obtained by multiplying $K$ matrices (called factors and corresponding to the action of the layers). Each factor is obtained by applying a fixed linear operator to a vector of parameters satisfying a sparsity constraint. In machine learning, the error between the product of the estimated factors and $X$ (i.e., the reconstruction error) relates to the statistical risk. The stable recovery of the parameters defining the factors is required in order to interpret the factors and the intermediate layers of the network.

In this paper, we provide sharp conditions on the network topology under which the error on the parameters defining the factors (i.e., the stability of the recovered parameters) scales linearly with the reconstruction error (i.e., the risk). Therefore, under these conditions on the network topology, any successful learning task leads to robust and therefore interpretable layers.

The analysis is based on the recently proposed Tensorial Lifting. The particularity of this paper is to consider a sparse prior. As an illustration, we detail the analysis and provide sharp guarantees for the stable recovery of convolutional linear networks under a sparsity prior. As expected, the conditions are rather strong.
Keywords: stable recovery, deep linear networks, convolutional linear networks, feature robustness.
1. Introduction
Let $K \in \mathbb{N}^*$ and $m_1, \dots, m_{K+1} \in \mathbb{N}$, and write $m = m_1$ and $n = m_{K+1}$. We impose the factors to be structured matrices defined by a number $S$ of unknown parameters. More precisely, for $k = 1..K$, let
$$M_k : \mathbb{R}^S \longrightarrow \mathbb{R}^{m_k \times m_{k+1}}, \qquad h \longmapsto M_k(h)$$
be a linear map. We assume that we know the matrix $X \in \mathbb{R}^{m \times n}$ which is provided by
$$X = M_1(h_1) \cdots M_K(h_K) + e, \qquad (1)$$
for an unknown error term $e$, such that $\|e\| \le \delta$, and parameters $h = (h_k)_{k=1..K} \in \mathbb{R}^{S \times K}$. Moreover, considering a family of possible supports $\mathcal{M}$ (e.g., all the supports of size $S'$, for a given $S' \le S$), we assume that the parameters $h$ satisfy a sparsity constraint of the following form: there exists $\mathcal{S} = (\mathcal{S}_k)_{k=1..K} \in \mathcal{M}$ such that $\mathrm{supp}(h) \subset \mathcal{S}$ (i.e., for all $k$, $\mathrm{supp}(h_k) \subset \mathcal{S}_k$).

This work investigates necessary and sufficient conditions, imposed on the constituents of (1), under which we can (up to an obvious scale rearrangement) stably recover the parameters $h$ from $X$. Besides these conditions, we assume that we have a way to find $\mathcal{S}^* \in \mathcal{M}$ and $h^* \in \mathbb{R}^{S \times K}_{\mathcal{S}^*}$ such that
$$\eta = \|M_1(h^*_1) \cdots M_K(h^*_K) - X\| \quad \text{is small.} \qquad (2)$$
As we will discuss later, at the time of writing, the success of the algorithms constructing $h^*$ is mostly supported by empirical evidence and lacks theoretical justification. These aspects of the problem are out of the scope of the present paper. However, in machine learning problems, the reconstruction error $\eta$ represents the risk, and there is no point in analyzing the properties of $h^*$ if $\eta$ is large. The upper bound we establish on the recovery error of the parameters depends linearly on $\delta + \eta$. Therefore, when the learning algorithm is successful (i.e., the sum of the risk $\eta$ and the noise level $\delta$ is sufficiently small) and the deep linear network satisfies the conditions established in this paper, the estimation of the parameters is stable. The latter property is required if one wants to interpret the features provided by the machine learning algorithm. That is the main interest of the proposed analysis. Notice that we also establish that the conditions are sharp.

Also, the study considers deep linear networks instead of deep neural networks. As can be deduced from Eldan and Shamir (2016), this significantly diminishes the expressiveness of the network. The main argument for studying deep linear networks (as is done in the present paper) comes from a remark in Safran and Shamir (2016). For the rectified linear unit activation function (ReLU), which is the most common activation function, between each layer every entry is multiplied by an element of the discrete set $\{0, 1\}$. As a consequence, the parameter space $\mathbb{R}^{S \times K}$ can be partitioned into subsets such that, on every subset, the action of the non-linear network is that of a fixed deep linear network (i.e., the activation function has a constant action when $h$ varies in the subset). Therefore, the objective function optimized in deep learning is made of pieces, and on every piece it is the objective function of a deep linear network. As a consequence, properties of the objective function for deep neural networks generalize properties of the objective function for deep linear networks. Restricting the analysis to linear networks is legitimate as a step towards the study of deep neural networks.

Notice that the authors of Choromanska et al. (2015a,b); Kawaguchi (2016) use a different argument but also end up studying deep linear networks. Their simplifying assumption is the independence of the activations from the input. Taking the expectation then leads to the linear networks that the authors analyze.
As explained by the same authors in Choromanska et al. (2015b), this is however a moderately convincing argument. We prefer to state clearly that we consider deep linear networks.

Finally, $\mathcal{S}^*$ and $h^*$ are typically found by an algorithm (most often a heuristic) that tries to lower
$$\|M_1(h_1) \cdots M_K(h_K) - X\| \qquad (3)$$
while avoiding overfitting. A classical strategy is the dropout of Srivastava et al. (2014). This is perfectly compatible with the assumption (2). However, even if we ignore the overfitting issue and restrict the analysis to the minimization of (3), we see that the problem is non-convex. Again, we do not address this minimization issue, but there is significant empirical evidence suggesting that (3) can be minimized efficiently in a surprisingly large number of situations. Despite an increasing theoretical activity
related to that question, the theory explaining this phenomenon is still far from satisfactory when $K \ge 2$ (see Livni et al. (2014); Haeffele and Vidal (2015); Kawaguchi (2016); Choromanska et al. (2015a,b); Safran and Shamir (2016)).

The approach developed in this paper extends to $K \ge 3$ existing results for $K \le 2$. In particular, when $K = 1$, the considered problem boils down to a compressed sensing problem Elad (2010). When $K = 2$, and when extended to other constraints on the parameters $h$, the statements apply to already studied problems such as: low rank approximation Candes et al. (2013), non-negative matrix factorization Lee and Seung (1999); Donoho and Stodden (2003); Laurberg et al. (2008); Arora et al. (2012), dictionary learning Jenatton et al. (2012), phase retrieval Candes et al. (2013), and blind deconvolution Ahmed et al. (2014); Choudhary and Mitra (2014); Li et al. (2016). Most of these papers use the same lifting property we are using, and further propose to convexify the problem. A more general bilinear framework is considered in Choudhary and Mitra (2014). The only existing statements when $K \ge 3$ are very recent Malgouyres and Landsberg (2017). They also apply to deep linear networks but do not include a sparsity constraint.

The present work describes an alternative analysis, specialized to sparsity constraints, of the results exposed in Malgouyres and Landsberg (2017). Doing so, we obtain better bounds (defined with an analogue of the lower-RIP) and weaker constraints on the model. Its application to sparse convolutional linear networks leads to simple necessary and sufficient conditions for stable recovery, for a large class of solvers. The stability inequality (see Theorem 5) only involves explicit and simple ingredients of the problem. The condition on the network topology is rather strong, but it takes a simple form. Implementing a test that checks whether the condition is met is easy: the test only requires applying the network as many times as the network has leaves, for every couple of supports.
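For concreteness, the following NumPy sketch instantiates the model (1)-(2) on toy data. All dimensions, the random linear maps $M_k$ and the sparsity level are invented for illustration; this is only a sketch of the setting, not the authors' code.

```python
# Sketch of the model (1): K structured factors, each a fixed linear map
# applied to a sparse parameter vector, observed through their product.
import numpy as np

rng = np.random.default_rng(0)
K, S, S_prime = 3, 8, 3                  # depth, parameters per factor, sparsity
dims = [6, 5, 5, 7]                      # m_1 = m, m_2, m_3, m_{K+1} = n

# One fixed linear map per layer: h_k in R^S  ->  M_k(h_k) in R^{m_k x m_{k+1}}.
maps = [rng.standard_normal((dims[k], dims[k + 1], S)) for k in range(K)]
def M(k, h_k):
    return maps[k] @ h_k                 # linear in h_k

# Sparse parameters: supp(h_k) contained in a support S_k of size S'.
h = np.zeros((S, K))
for k in range(K):
    supp = rng.choice(S, size=S_prime, replace=False)
    h[supp, k] = rng.standard_normal(S_prime)

# Observation X = M_1(h_1) ... M_K(h_K) + e, with ||e|| <= delta.
prod = M(0, h[:, 0])
for k in range(1, K):
    prod = prod @ M(k, h[:, k])
e = 1e-3 * rng.standard_normal(prod.shape)
X = prod + e

# Reconstruction error (2) of a candidate; it equals ||e|| at h_star = h.
eta = np.linalg.norm(prod - X)
print(X.shape, eta)
```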
2. Notations and preliminaries on Tensorial Lifting
Set $\mathbb{N}_K = \{1, \dots, K\}$ and $\mathbb{R}^{S \times K}_* = \{h \in \mathbb{R}^{S \times K} \mid \forall k = 1..K, \|h_k\| \neq 0\}$. Define an equivalence relation on $\mathbb{R}^{S \times K}_*$: for any $h, g \in \mathbb{R}^{S \times K}_*$, $h \sim g$ if and only if there exists $(\lambda_k)_{k=1..K} \in \mathbb{R}^K$ such that
$$\prod_{k=1}^K \lambda_k = 1 \qquad \text{and} \qquad \forall k = 1..K, \ h_k = \lambda_k g_k.$$
Denote the equivalence class of $h \in \mathbb{R}^{S \times K}_*$ by $[h]$. For any $p \in [1, \infty]$, we denote the usual $\ell_p$ norm by $\|\cdot\|_p$ and define the mapping $d_p : (\mathbb{R}^{S \times K}_*/\sim) \times (\mathbb{R}^{S \times K}_*/\sim) \to \mathbb{R}$ by
$$d_p([h], [g]) = \inf_{h' \in [h] \cap \mathbb{R}^{S \times K}_{\mathrm{diag}}, \ g' \in [g] \cap \mathbb{R}^{S \times K}_{\mathrm{diag}}} \|h' - g'\|_p, \qquad \forall h, g \in \mathbb{R}^{S \times K}_*, \qquad (4)$$
where $\mathbb{R}^{S \times K}_{\mathrm{diag}} = \{h \in \mathbb{R}^{S \times K}_* \mid \forall k = 1..K, \|h_k\|_\infty = \|h_1\|_\infty\}$. It is proved in Malgouyres and Landsberg (2017) that $d_p$ is a metric on $\mathbb{R}^{S \times K}_*/\sim$.

The real-valued tensors of order $K$ whose axes are of size $S$ are denoted by $T \in \mathbb{R}^{S \times \dots \times S}$. The space of such tensors is abbreviated $\mathbb{R}^{S^K}$. We say that a tensor $T \in \mathbb{R}^{S^K}$ is of rank $1$ if and only if there exists a collection of vectors $h \in \mathbb{R}^{S \times K}$ such that, for any $i = (i_1, \dots, i_K) \in \mathbb{N}_S^K$, $T_i = h_{1,i_1} \cdots h_{K,i_K}$. The set of all the tensors of rank $1$ is denoted by $\Sigma$. Moreover, we parametrize $\Sigma \subset \mathbb{R}^{S^K}$ using the Segre embedding
$$P : \mathbb{R}^{S \times K} \longrightarrow \Sigma \subset \mathbb{R}^{S^K}, \qquad h \longmapsto (h_{1,i_1} h_{2,i_2} \cdots h_{K,i_K})_{i \in \mathbb{N}_S^K}. \qquad (5)$$
As stated in the next two theorems, we can control the distortion of the distance induced by $P$ and its inverse.

Theorem 1 (Stability of $[h]$ from $P(h)$, see Malgouyres and Landsberg (2017)) Let $h$ and $g \in \mathbb{R}^{S \times K}_*$ be such that $\|P(g) - P(h)\|_\infty \le \max(\|P(h)\|_\infty, \|P(g)\|_\infty)$. For all $p, q \in [1, \infty]$,
$$d_p([h], [g]) \le (KS)^{\frac{1}{p}} \min\left(\|P(h)\|_\infty^{-\frac{K-1}{K}}, \|P(g)\|_\infty^{-\frac{K-1}{K}}\right) \|P(h) - P(g)\|_q. \qquad (6)$$

Theorem 2 (Lipschitz continuity of $P$, see Malgouyres and Landsberg (2017)) For any $q \in [1, \infty]$ and any $h$ and $g \in \mathbb{R}^{S \times K}_*$,
$$\|P(h) - P(g)\|_q \le K^{\frac{1}{q}} S^{\frac{K-1}{q}} \max\left(\|P(h)\|_\infty^{\frac{K-1}{K}}, \|P(g)\|_\infty^{\frac{K-1}{K}}\right) d_q([h], [g]). \qquad (7)$$

The Tensorial Lifting (see Malgouyres and Landsberg (2017)) states that there exists a unique linear map $\mathcal{A} : \mathbb{R}^{S^K} \to \mathbb{R}^{m \times n}$ such that, for all $h \in \mathbb{R}^{S \times K}$,
$$M_1(h_1) M_2(h_2) \cdots M_K(h_K) = \mathcal{A} P(h). \qquad (8)$$
The intuition leading to this equality is that every entry of $M_1(h_1) M_2(h_2) \cdots M_K(h_K)$ is a multivariate polynomial whose variables are the entries of $h$. Moreover, every monomial of these polynomials is of the form $a_i P(h)_i$ for some $i \in \mathbb{N}_S^K$, where $a_i$ is a coefficient depending on $M_1, \dots, M_K$. The great property of the Tensorial Lifting is to express any deep linear network using the Segre embedding and a linear operator $\mathcal{A}$. The Segre embedding is non-linear and might seem difficult to deal with at first sight, but it is always the same whatever the network topology, the sparsity pattern, the action of the ReLU activation function, etc. These constituents of the problem only influence the lifting linear operator $\mathcal{A}$.

In the next section, we study which properties of $\mathcal{A}$ are required to obtain stable recovery. In Section 4, we study these properties when $\mathcal{A}$ corresponds to a sparse convolutional linear network.
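The identity (8) can be checked numerically. The sketch below implements the Segre embedding (5) as an iterated outer product and materializes the lifting operator $\mathcal{A}$ by evaluating the factor product on the canonical rank-one tensors $P(h^i)$; multilinearity then gives $\mathcal{A} P(h)$ for every $h$. The toy maps and dimensions are invented, and the construction is a sanity check rather than an implementation from the paper.

```python
# Sketch of the Tensorial Lifting (8) for a toy deep linear network.
import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(1)
K, S = 2, 4
dims = [3, 5, 3]
maps = [rng.standard_normal((dims[k], dims[k + 1], S)) for k in range(K)]

def network(h):                          # h has shape (S, K)
    out = maps[0] @ h[:, 0]
    for k in range(1, K):
        out = out @ (maps[k] @ h[:, k])
    return out

def P(h):                                # Segre embedding: P(h)_i = h_{1,i_1}...h_{K,i_K}
    return reduce(np.multiply.outer, [h[:, k] for k in range(K)])

# A's "column" at tensor index i is the network evaluated at h^i, the
# parameters with a single 1 at position i_k in every factor k.
A = np.zeros((dims[0], dims[-1]) + (S,) * K)
for i in product(range(S), repeat=K):
    basis = np.zeros((S, K))
    for k, ik in enumerate(i):
        basis[ik, k] = 1.0
    A[(slice(None), slice(None)) + i] = network(basis)

h = rng.standard_normal((S, K))
lifted = np.tensordot(A, P(h), axes=K)   # contract the K tensor axes: A P(h)
assert np.allclose(lifted, network(h))   # the identity (8) holds
```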
3. General conditions for the stable recovery under sparsity constraint
From now on, the analysis differs from the one presented in Malgouyres and Landsberg (2017). It is dedicated to models that enforce sparsity. In this particular situation, we can indeed have a different view of the geometry of the problem. In order to describe it, we first establish some notation.
We define a support by $\mathcal{S} = (\mathcal{S}_k)_{k=1..K}$, with $\mathcal{S}_k \subset \mathbb{N}_S$, and denote the set of all supports by $\mathcal{P}(\mathbb{N}_S^K)$ (the parts of $\mathbb{N}_S^K$). For supports $\mathcal{S}$ and $\mathcal{S}'$, we write $\mathcal{S} \cup \mathcal{S}' = (\mathcal{S}_k \cup \mathcal{S}'_k)_{k=1..K}$, and we write $i \in \mathcal{S}$ for an index $i \in \mathbb{N}_S^K$ when $i_k \in \mathcal{S}_k$ for all $k$. For a given support $\mathcal{S} \in \mathcal{P}(\mathbb{N}_S^K)$, we denote
$$\mathbb{R}^{S \times K}_{\mathcal{S}} = \{h \in \mathbb{R}^{S \times K} \mid h_{k,i} = 0, \text{ for all } k = 1..K \text{ and } i \notin \mathcal{S}_k\}$$
(i.e., for all $k$, $\mathrm{supp}(h_k) \subset \mathcal{S}_k$) and
$$\mathbb{R}^{S^K}_{\mathcal{S}} = \{T \in \mathbb{R}^{S^K} \mid T_i = 0 \text{ if there exists } k = 1..K \text{ such that } i_k \notin \mathcal{S}_k\}.$$
We also denote by $P_{\mathcal{S}}$ the orthogonal projection from $\mathbb{R}^{S^K}$ onto $\mathbb{R}^{S^K}_{\mathcal{S}}$. We trivially have, for all $T \in \mathbb{R}^{S^K}$ and all $i \in \mathbb{N}_S^K$,
$$(P_{\mathcal{S}} T)_i = \begin{cases} T_i, & \text{if } i \in \mathcal{S}, \\ 0, & \text{otherwise.} \end{cases}$$
As explained in the introduction, we assume that there exist a known family of admissible supports $\mathcal{M} \subset \mathcal{P}(\mathbb{N}_S^K)$, an unknown support $\mathcal{S} \in \mathcal{M}$ and unknown parameters $h \in \mathbb{R}^{S \times K}_{\mathcal{S}}$ that we would like to estimate from the noisy matrix product
$$X = M_1(h_1) \cdots M_K(h_K) + e. \qquad (9)$$
We assume that there exists $\delta \ge 0$ such that the error satisfies
$$\|e\| \le \delta. \qquad (10)$$
Also, we consider an inexact minimization and assume that we have a way to find $\mathcal{S}^* \in \mathcal{M}$ and $h^* \in \mathbb{R}^{S \times K}_{\mathcal{S}^*}$ such that
$$\eta = \|M_1(h^*_1) \cdots M_K(h^*_K) - X\| \quad \text{is small.}$$
We remind that, in machine learning problems, $\eta$ represents the risk.

In the geometrical view described in the sequel, we consider different linear operators $\mathcal{A}_{\mathcal{S}}$, with $\mathcal{S} \in \mathcal{P}(\mathbb{N}_S^K)$, such that, for all $h \in \mathbb{R}^{S \times K}_{\mathcal{S}}$,
$$\mathcal{A}_{\mathcal{S}} P(h) = M_1(h_1) \cdots M_K(h_K).$$
In order to achieve that, considering (8), we simply define, for any $\mathcal{S} \in \mathcal{P}(\mathbb{N}_S^K)$,
$$\mathcal{A}_{\mathcal{S}} = \mathcal{A} P_{\mathcal{S}}. \qquad (11)$$
The following property will turn out to be necessary and sufficient to guarantee the stable recovery property.

Definition 1 (Deep-$\mathcal{M}$-Null Space Property) Let $\gamma \ge 1$ and $\rho > 0$. We say that $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-Null Space Property (deep-$\mathcal{M}$-NSP) with constants $(\gamma, \rho)$ if and only if, for all $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, any $T \in P(\mathbb{R}^{S \times K}_{\mathcal{S}}) + P(\mathbb{R}^{S \times K}_{\mathcal{S}'})$ satisfying $\|\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} T\| \le \rho$ and any $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$, we have
$$\|T\| \le \gamma \|T - P_{\mathcal{S} \cup \mathcal{S}'} T'\|. \qquad (12)$$

Geometrically, the deep-$\mathcal{M}$-NSP does not hold when $P_{\mathcal{S} \cup \mathcal{S}'} \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$ intersects $P(\mathbb{R}^{S \times K}_{\mathcal{S}}) + P(\mathbb{R}^{S \times K}_{\mathcal{S}'})$ away from the origin or tangentially at $0$. It holds when the two sets intersect "transversally" at $0$. Despite its apparently abstract nature, we will be able to characterize precisely when the lifting operator corresponding to a convolutional linear network satisfies the deep-$\mathcal{M}$-NSP (see Section 4). We will also be able to calculate the constants $(\gamma, \rho)$.
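The ingredients of Definition 1 are straightforward to materialize. A minimal sketch of the tensor support $\mathbb{R}^{S^K}_{\mathcal{S}}$ and of the projection $P_{\mathcal{S}}$, with invented supports:

```python
# Sketch of the support mask and of the orthogonal projection P_S.
import numpy as np

K, S = 3, 5
supports = [np.array([0, 2]), np.array([1, 2]), np.array([4])]  # S_1, ..., S_K

# mask[i] is True iff i_k belongs to S_k for every k (i.e. i is in S).
mask = np.zeros((S,) * K, dtype=bool)
mask[np.ix_(*supports)] = True

def P_S(T):
    """Orthogonal projection of a tensor T onto R^{S^K}_S."""
    return np.where(mask, T, 0.0)

T = np.random.default_rng(2).standard_normal((S,) * K)
assert np.allclose(P_S(P_S(T)), P_S(T))   # P_S is a projection
assert np.allclose(T * mask, P_S(T))      # (P_S T)_i = T_i iff i in S, else 0
```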
Proposition 1 (Sufficient condition for the deep-$\mathcal{M}$-NSP) If $\mathrm{Ker}(\mathcal{A}) \cap \mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'} = \{0\}$ for all $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, then $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $(\gamma, \rho) = (1, +\infty)$.

Proof In order to prove the proposition, let us consider $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$ and $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$. We have $\mathcal{A} P_{\mathcal{S} \cup \mathcal{S}'} T' = 0$ and therefore $P_{\mathcal{S} \cup \mathcal{S}'} T' \in \mathrm{Ker}(\mathcal{A})$. Moreover, by definition, $P_{\mathcal{S} \cup \mathcal{S}'} T' \in \mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'}$. Therefore, applying the hypothesis of the proposition, we obtain $P_{\mathcal{S} \cup \mathcal{S}'} T' = 0$, and (12) holds for any $T$, with $\gamma = 1$. Therefore, $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $(\gamma, \rho) = (1, +\infty)$.

If $\mathbb{N}_S^K \in \mathcal{M}$, the condition becomes $\mathrm{Ker}(\mathcal{A}) = \{0\}$, which is sufficient but obviously not necessary for the deep-$\mathcal{M}$-NSP to hold. However, when $\mathcal{M}$ truly imposes sparsity, the condition $\mathrm{Ker}(\mathcal{A}) \cap \mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'} = \{0\}$ says that the elements of $\mathrm{Ker}(\mathcal{A})$ shall not be sparse in some (tensorial) way. This nicely generalizes the case $K = 1$.
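Under the assumption that $\mathcal{A}$ is available as a dense matrix whose columns are indexed by the tensor indices $i$, the hypothesis of Proposition 1 reduces to a rank test: $\mathrm{Ker}(\mathcal{A}) \cap \mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'} = \{0\}$ holds if and only if the columns of $\mathcal{A}$ selected by $\mathcal{S} \cup \mathcal{S}'$ are linearly independent. A sketch on toy data:

```python
# Rank test for the hypothesis of Proposition 1.
import numpy as np

rng = np.random.default_rng(3)
S, K = 3, 2
A = rng.standard_normal((40, S ** K))    # flattened toy lifting operator

def kernel_trivial_on_support(A, mask):
    """True iff Ker(A) intersects R^{S^K}_{S u S'} only at 0, i.e. iff the
    columns of A selected by the flattened support mask are independent."""
    cols = A[:, mask.ravel()]
    return np.linalg.matrix_rank(cols) == cols.shape[1]

mask = np.zeros((S,) * K, dtype=bool)    # union S u S' of two toy supports
mask[np.ix_([0, 1], [1, 2])] = True
print(kernel_trivial_on_support(A, mask))  # True for a generic A
```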
Definition 2 (Deep-lower-RIP constant) There exists a constant $\sigma_{\mathcal{M}} > 0$ such that, for any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$ and any $T$ in the orthogonal complement of $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$,
$$\sigma_{\mathcal{M}} \|P_{\mathcal{S} \cup \mathcal{S}'} T\| \le \|\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} T\|. \qquad (13)$$
We call $\sigma_{\mathcal{M}}$ the deep-lower-RIP constant of $\mathcal{A}$ with regard to $\mathcal{M}$.

Proof
The existence of $\sigma_{\mathcal{M}}$ is a straightforward consequence of the fact that the restriction of $\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}$ to the orthogonal complement of $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$ is injective. We therefore have, for all $T$ in the orthogonal complement of $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$,
$$\|\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} T\| \ge \sigma_{\mathcal{S} \cup \mathcal{S}'} \|T\| \ge \sigma_{\mathcal{S} \cup \mathcal{S}'} \|P_{\mathcal{S} \cup \mathcal{S}'} T\|,$$
where $\sigma_{\mathcal{S} \cup \mathcal{S}'} > 0$ is the smallest non-zero singular value of $\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}$. We obtain the existence of $\sigma_{\mathcal{M}}$ by taking the minimum of the constants $\sigma_{\mathcal{S} \cup \mathcal{S}'}$ over the finite family of $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$.
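When the family $\mathcal{M}$ is small, the deep-lower-RIP constant can be computed exactly as in the existence argument above: it is the minimum, over the pairs of admissible supports, of the least non-zero singular value of $\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}$. A sketch, again assuming $\mathcal{A}$ is available as a dense matrix:

```python
# Sketch of the computation of the deep-lower-RIP constant of Definition 2.
import numpy as np
from itertools import combinations_with_replacement

def lower_rip_constant(A, masks, tol=1e-10):
    sigma = np.inf
    for ma, mb in combinations_with_replacement(masks, 2):
        A_union = A * (ma | mb).ravel()          # A_{S u S'} = A P_{S u S'}
        sv = np.linalg.svd(A_union, compute_uv=False)
        sigma = min(sigma, sv[sv > tol].min())   # least non-zero singular value
    return sigma

rng = np.random.default_rng(4)
S, K = 3, 2
A = rng.standard_normal((40, S ** K))
masks = []
for supp in ([[0, 1], [1]], [[2], [0, 2]]):      # two admissible supports in M
    m = np.zeros((S,) * K, dtype=bool)
    m[np.ix_(*supp)] = True
    masks.append(m)
print(lower_rip_constant(A, masks))
```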
Theorem 3 (Sufficient condition for stable recovery) Assume $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $\gamma \ge 1$ and $\rho > 0$. For any $\mathcal{S}^* \in \mathcal{M}$ and $h^* \in \mathbb{R}^{S \times K}_{\mathcal{S}^*}$ as in (2) with $\eta + \delta \le \rho$, we have
$$\|P(h^*) - P(h)\| \le \frac{\gamma}{\sigma_{\mathcal{M}}} (\delta + \eta),$$
where $\sigma_{\mathcal{M}}$ is the deep-lower-RIP constant of $\mathcal{A}$ with regard to $\mathcal{M}$. Moreover, if $\frac{\gamma}{\sigma_{\mathcal{M}}} (\delta + \eta) \le \max(\|P(h^*)\|_\infty, \|P(h)\|_\infty)$, then
$$d_p([h^*], [h]) \le (KS)^{\frac{1}{p}} \min\left(\|P(h)\|_\infty^{-\frac{K-1}{K}}, \|P(h^*)\|_\infty^{-\frac{K-1}{K}}\right) \frac{\gamma}{\sigma_{\mathcal{M}}} (\delta + \eta).$$
Proof
We have
$$\|\mathcal{A}_{\mathcal{S}^* \cup \mathcal{S}} (P(h^*) - P(h))\| = \|\mathcal{A} P(h^*) - \mathcal{A} P(h)\| \le \|\mathcal{A} P(h^*) - X\| + \|\mathcal{A} P(h) - X\| \le \delta + \eta.$$
If we further decompose (the decomposition is unique) $P(h^*) - P(h) = T + T'$, where $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S}^* \cup \mathcal{S}})$ and $T$ is orthogonal to $\mathrm{Ker}(\mathcal{A}_{\mathcal{S}^* \cup \mathcal{S}})$, we have
$$\|\mathcal{A}_{\mathcal{S}^* \cup \mathcal{S}} (P(h^*) - P(h))\| = \|\mathcal{A}_{\mathcal{S}^* \cup \mathcal{S}} T\| \ge \sigma_{\mathcal{M}} \|P_{\mathcal{S}^* \cup \mathcal{S}} T\|,$$
where $\sigma_{\mathcal{M}}$ is the deep-lower-RIP constant of $\mathcal{A}$ with regard to $\mathcal{M}$. We finally obtain, since $P_{\mathcal{S}^* \cup \mathcal{S}} P(h^*) = P(h^*)$ and $P_{\mathcal{S}^* \cup \mathcal{S}} P(h) = P(h)$,
$$\|P(h^*) - P(h) - P_{\mathcal{S}^* \cup \mathcal{S}} T'\| = \|P_{\mathcal{S}^* \cup \mathcal{S}} T\| \le \frac{\delta + \eta}{\sigma_{\mathcal{M}}}.$$
Since $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $(\gamma, \rho)$ and $\delta + \eta \le \rho$, we have
$$\|P(h^*) - P(h)\| \le \gamma \|P(h^*) - P(h) - P_{\mathcal{S}^* \cup \mathcal{S}} T'\| \le \gamma \frac{\delta + \eta}{\sigma_{\mathcal{M}}}.$$
When $\delta + \eta$ satisfies the condition in the theorem, we can apply Theorem 1 and obtain the last inequality.

Theorem 3 differs from the analogous theorem in Malgouyres and Landsberg (2017). In particular, it is dedicated to sparsity constraints, with much weaker hypotheses on $\mathcal{A}$. The constant of the upper bound is also different.

One might again ask whether the condition "$\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP" is sharp or not. As stated in the following theorem, the answer is affirmative.

Theorem 4 (Necessary condition for stable recovery)
Assume the stable recovery property holds: there exist $\mathcal{M}$, $C$ and $\delta > 0$ such that, for any $\mathcal{S} \in \mathcal{M}$ and any $h \in \mathbb{R}^{S \times K}_{\mathcal{S}}$, any $X = \mathcal{A} P(h) + e$ with $\|e\| \le \delta$, and any $\mathcal{S}^* \in \mathcal{M}$ and $h^* \in \mathbb{R}^{S \times K}_{\mathcal{S}^*}$ such that $\|\mathcal{A} P(h^*) - X\| \le \|e\|$, we have
$$d_2([h^*], [h]) \le C \min\left(\|P(h)\|_\infty^{-\frac{K-1}{K}}, \|P(h^*)\|_\infty^{-\frac{K-1}{K}}\right) \|e\|.$$
Then $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants
$$\gamma = C S^{\frac{K-1}{2}} \sqrt{K} \, \sigma_{\max} \qquad \text{and} \qquad \rho = \delta,$$
where $\sigma_{\max}$ is the spectral radius of $\mathcal{A}$.

The proof is very similar to the proof of Theorem 6 in Malgouyres and Landsberg (2017) and to the proof of the analogous converse statement in Cohen et al. (2009). It is provided in Appendix A.

Figure 1: Example of the considered convolutional linear network (a rooted tree whose leaves carry signals in $\mathbb{R}^N$ and whose root is denoted $r$). To every edge/neuron is attached a convolution kernel. The network does not involve non-linearities or sampling.
4. Application to convolutional linear networks under a sparsity prior
We consider a sparse convolutional linear network as depicted in Figure 1. The network typically aims at performing a linear analysis or synthesis of a signal living in $\mathbb{R}^N$. The considered convolutional linear network is defined from a rooted directed acyclic graph $\mathcal{G}(\mathcal{E}, \mathcal{N})$ composed of nodes $\mathcal{N}$ and edges $\mathcal{E}$. Each edge connects two nodes. The root of the graph is denoted by $r$ and the set containing all its leaves is denoted by $\mathcal{F}$. We denote by $\mathcal{P}a$ the set of all paths connecting the leaves and the root. We assume, without loss of generality, that the length of any path between a leaf and the root is independent of the considered leaf and equal to some constant $K \ge 1$. We also assume that, for any edge $e \in \mathcal{E}$, the number of edges separating $e$ from the root is the same for all paths between $e$ and $r$. This length is called the depth of $e$. For any $k = 1..K$, we denote the set containing all the edges of depth $k$ by $\mathcal{E}(k)$. For $e \in \mathcal{E}(k)$, we also say that $e$ belongs to the layer $k$.

Moreover, to any edge $e$ is attached a convolution kernel of maximal support $\mathcal{S}_e \subset \mathbb{N}_N$. We assume (without loss of generality) that $\sum_{e \in \mathcal{E}(k)} |\mathcal{S}_e|$ is independent of $k$ (where $|\mathcal{S}_e|$ denotes the cardinality of $\mathcal{S}_e$). We take
$$S = \sum_{e \in \mathcal{E}(1)} |\mathcal{S}_e|.$$
For any edge $e$, we consider the mapping $T_e : \mathbb{R}^S \to \mathbb{R}^N$ that maps any $h \in \mathbb{R}^S$ to the convolution kernel $h_e$, attached to the edge $e$, whose support is $\mathcal{S}_e$. It simply writes at the right locations (i.e., those in $\mathcal{S}_e$) the entries of $h$ defining the kernel on the edge $e$. As in the previous section, we assume a sparsity constraint and will only consider a family $\mathcal{M}$ of possible supports $\mathcal{S} \subset \mathbb{N}_S^K$.

At each layer $k$, the convolutional linear network computes, for all $e \in \mathcal{E}(k)$, the convolution between the signal at the origin of $e$ and the kernel on $e$; then, it attaches to any ending node the sum of all the convolutions arriving at that node. Examples of such convolutional linear networks include wavelets, wavelet packets Mallat (1998) or the fast transforms optimized in Chabiron et al. (2014, 2016). The network is similar to the usual convolutional neural network, except that it does not involve any non-linearity and the supports are not fixed. It is clear that the operation performed at any layer depends linearly on the parameters $h_k \in \mathbb{R}^S$ and that its result serves as input for the next layer.
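To fix ideas, here is a minimal forward pass for such a network, written with circular convolutions over a hand-built two-layer tree. The topology and the kernels are invented; a real instance (e.g., a wavelet or wavelet packet transform) would replace them.

```python
# Sketch of the forward pass of a convolutional linear network (Figure 1):
# every leaf signal is convolved with the kernels along its paths to the
# root, and the contributions are summed.
import numpy as np

N = 16
rng = np.random.default_rng(5)

def cconv(a, b):                        # circular convolution on R^N
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# Two-layer tree: e1, e2 have depth 1 (they reach the root), e11, e12, e21
# have depth 2 (they leave the leaves); one kernel per edge.
kernels = {e: rng.standard_normal(N) for e in ["e1", "e2", "e11", "e12", "e21"]}
# paths[f]: kernels met on the unique path from leaf f up to the root r.
paths = {0: [kernels["e11"], kernels["e1"]],
         1: [kernels["e12"], kernels["e1"]],
         2: [kernels["e21"], kernels["e2"]]}

def network(x):                         # x: one input signal per leaf
    out = np.zeros(N)
    for f, edge_kernels in paths.items():
        y = x[f]
        for k in edge_kernels:          # convolve layer by layer towards r
            y = cconv(y, k)
        out += y                        # convolutions arriving at a node add up
    return out

x = rng.standard_normal((len(paths), N))
print(network(x).shape)                 # (N,): the output X x at the root
```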
The convolutional linear network therefore depends on parameters $h \in \mathbb{R}^{S \times K}$ and takes the form $X = M_1(h_1) \cdots M_K(h_K)$, where the operators $M_k$ satisfy the hypotheses of the present paper. This section applies the results of the preceding section in order to identify conditions under which any unknown parameters $h \in \mathbb{R}^{S \times K}$ satisfying $\mathrm{supp}(h) \subset \mathcal{S}$, for a given $\mathcal{S} \in \mathcal{M}$, can be stably recovered from $X = M_1(h_1) \cdots M_K(h_K)$ (possibly corrupted by an error).

In order to do so, let us define a few notations. Notice first that we apply the convolutional linear network to an input $x \in \mathbb{R}^{N|\mathcal{F}|}$, where $x$ is the concatenation of the signals $x_f \in \mathbb{R}^N$ for $f \in \mathcal{F}$. Therefore, $X$ is the (horizontal) concatenation of $|\mathcal{F}|$ matrices $X_f \in \mathbb{R}^{N \times N}$ such that
$$Xx = \sum_{f \in \mathcal{F}} X_f x_f, \qquad \text{for all } x \in \mathbb{R}^{N|\mathcal{F}|}.$$
Let us consider the convolutional linear network defined by $h \in \mathbb{R}^{S \times K}$ as well as $f \in \mathcal{F}$ and $n = 1..N$. The column of $X$ corresponding to the entry $n$ in the leaf $f$ is the translation by $n$ of
$$\sum_{p \in \mathcal{P}a(f)} T_p(h), \qquad (14)$$
where $\mathcal{P}a(f)$ contains all the paths of $\mathcal{P}a$ starting from the leaf $f$ and
$$T_p(h) = T_{e_1}(h_1) * \dots * T_{e_K}(h_K), \qquad \text{where } p = (e_1, \dots, e_K).$$
Moreover, we define for any $k = 1..K$ the mapping $e_k : \mathbb{N}_S \to \mathcal{E}(k)$ which provides, for any $i = 1..S$, the unique edge of $\mathcal{E}(k)$ such that the $i$-th entry of $h \in \mathbb{R}^S$ contributes to $T_{e_k(i)}(h)$. Also, for any $i \in \mathbb{N}_S^K$, we denote $p_i = (e_1(i_1), \dots, e_K(i_K))$ and, for any $\mathcal{S} \in \mathcal{M}$,
$$I_{\mathcal{S}} = \{i \in \mathbb{N}_S^K \mid i \in \mathcal{S} \text{ and } p_i \in \mathcal{P}a\}.$$
The latter contains all the indices of $\mathcal{S}$ corresponding to a valid path in the network. For any set of parameters $h \in \mathbb{R}^{S \times K}$ and any path $p \in \mathcal{P}a$, we also denote by $h_p$ the restriction of $h$ to its indices contributing to the kernels on the path $p$. We also define, for any $i \in \mathbb{N}_S^K$, $h^i \in \mathbb{R}^{S \times K}$ by
$$h^i_{k,j} = \begin{cases} 1, & \text{if } j = i_k, \\ 0, & \text{otherwise,} \end{cases} \qquad \text{for all } k = 1..K \text{ and } j = 1..S. \qquad (15)$$
We can deduce from (14) that, when $i \in I_{\mathcal{S}}$, $\mathcal{A} P(h^i)$ simply convolves the entries at one leaf with a Dirac delta function. Therefore, all the entries of $\mathcal{A} P(h^i)$ are in $\{0, 1\}$ and we denote
$$D_i = \{(l, j) \in \mathbb{N}_N \times \mathbb{N}_{N|\mathcal{F}|} \mid (\mathcal{A} P(h^i))_{l,j} = 1\}.$$
We also denote by $\mathbb{1} \in \mathbb{R}^S$ a vector of size $S$ with all its entries equal to $1$. For any edge $e \in \mathcal{E}$, $\mathbb{1}_e \in \mathbb{R}^S$ consists of zeroes except for the entries corresponding to the edge $e$, which are equal to $1$. For any $S' \subset \mathbb{N}_S$, we define $\mathbb{1}_{S'} \in \mathbb{R}^S$, which consists of zeroes except for the entries corresponding to the indices in $S'$. For any $p = (e_1, \dots, e_K) \in \mathcal{P}a$, the support of $M_1(\mathbb{1}_{e_1}) \cdots M_K(\mathbb{1}_{e_K})$ is denoted by $D_p$.

Finally, we remind that, because of (8), there exists a unique map $\mathcal{A} : \mathbb{R}^{S^K} \to \mathbb{R}^{N \times N|\mathcal{F}|}$ such that
$$\mathcal{A} P(h) = M_1(h_1) \cdots M_K(h_K), \qquad \text{for all } h \in \mathbb{R}^{S \times K},$$
where $P$ is the Segre embedding defined in (5).

Proposition 2 (Necessary condition for the identifiability of a sparse network)
Either $\mathbb{R}^{S \times K}$ is not identifiable or, for any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, all the entries of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ belong to $\{0, 1\}$. When the latter holds:

1. For any distinct $p$ and $p' \in \mathcal{P}a$, we have $D_p \cap D_{p'} = \emptyset$.
2. $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}) = \{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\}$.

Proof Let us first assume that:
There exist $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$ and an entry of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ that does not belong to $\{0, 1\}$.

Using (14), we know that there are $f \in \mathcal{F}$ and $n = 1..N$ such that
$$\Big( \sum_{p \in \mathcal{P}a(f)} T_p(\mathbb{1}) \Big)_n \ge 2.$$
As a consequence, there are $i$ and $j \in \mathcal{S} \cup \mathcal{S}'$ with $i \ne j$ and $T_{p_i}(h^i)_n = T_{p_j}(h^j)_n = 1$. Therefore, $\mathcal{A} P(h^i) = \mathcal{A} P(h^j)$ and the network is not identifiable. This proves the first statement.

Let us now assume that:
For any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, all the entries of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ belong to $\{0, 1\}$.

We immediately observe that (14) leads to item 1 of the proposition.

To prove the second item, we can easily check that $(P(h^i))_{i \notin I_{\mathcal{S} \cup \mathcal{S}'}}$ forms a basis of $\{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\}$. We can also easily check, using (14) and (11), that, for any $i \notin I_{\mathcal{S} \cup \mathcal{S}'}$,
$$\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} P(h^i) = \begin{cases} 0, & \text{if } i \notin \mathcal{S} \cup \mathcal{S}', \\ M_1(h^i_1) \cdots M_K(h^i_K) = 0, & \text{if } i \in \mathcal{S} \cup \mathcal{S}' \text{ and } p_i \notin \mathcal{P}a. \end{cases}$$
As a consequence, $\{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\} \subset \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$.

To prove the converse inclusion, we observe that, for any distinct $i$ and $j \in I_{\mathcal{S} \cup \mathcal{S}'}$, we have $D_i \cap D_j = \emptyset$. This implies that
$$\mathrm{rk}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}) \ge |I_{\mathcal{S} \cup \mathcal{S}'}| = S^K - \dim(\{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\}).$$
Finally, we deduce that $\dim(\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})) \le \dim(\{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\})$ and therefore
$$\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'}) = \{T \in \mathbb{R}^{S^K} \mid \forall i \in I_{\mathcal{S} \cup \mathcal{S}'}, T_i = 0\}.$$
Proposition 2 extends Proposition 8 of Malgouyres and Landsberg (2017) by considering several possible supports. Said differently, Proposition 8 of Malgouyres and Landsberg (2017) corresponds to Proposition 2 when $\mathcal{M} = \{\mathbb{N}_S^K\}$.

The interest of the condition in Proposition 2 is that it can easily be checked by applying the network to Dirac delta functions, when $|\mathcal{M}|$ is not too large (see the sketch below). Notice that, besides the known examples in blind deconvolution (i.e., when $K = 2$ and $|\mathcal{P}a| = 1$) Ahmed et al. (2014); Bahmani and Romberg (2015), there are known convolutional linear networks (with $K \ge 3$) that satisfy the condition of the first statement of Proposition 2. For instance, the convolutional linear network corresponding to the undecimated Haar wavelet transform² is a tree and, for any of its leaves $f \in \mathcal{F}$, $|\mathcal{P}a(f)| = 1$. Moreover, the support of the kernel living on the edge $e$ of depth $k$ on this path is $\{0, 2^{k-1}\}$. It is not difficult to check that the first condition of Proposition 2 holds. Otherwise, it is clear that the necessary condition will rarely be satisfied.

2. Undecimated means computed with the "algorithme à trous"; see Mallat (1998), Sections 5.5.2 and 6.3.2. The Haar wavelet is described in Mallat (1998), Section 7.2.2, p. 247, and Example 7.7, p. 235.
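A sketch of this test, for tree networks in which every leaf is connected to the root by a single path, as in the undecimated Haar example: put a $1$ at every allowed kernel position (i.e., apply the network to $\mathbb{1}_{\mathcal{S} \cup \mathcal{S}'}$, leafwise a Dirac delta) and check that the response only contains entries in $\{0, 1\}$. The generic supports below are invented; the Haar supports $\{0, 2^{k-1}\}$ pass the test.

```python
# Sketch of the {0,1}-entries test of Proposition 2 (one path per leaf).
import numpy as np

N, K = 16, 3

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def passes_prop2_test(paths_supports):
    """paths_supports: for each leaf-to-root path, its K kernel supports S_e."""
    for supports in paths_supports:
        y = np.zeros(N); y[0] = 1.0          # Dirac delta at the leaf
        for S_e in supports:
            k = np.zeros(N); k[list(S_e)] = 1.0
            y = cconv(y, k)                  # kernel 1_{S_e} on this edge
        if not np.all(np.isclose(y, 0.0) | np.isclose(y, 1.0)):
            return False
    return True

haar = [[{0, 2 ** (k - 1)} for k in range(1, K + 1)]]   # supports {0, 2^(k-1)}
print(passes_prop2_test(haar))                          # True
print(passes_prop2_test([[{0, 1}] * K]))                # False: some entries >= 2
```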
Proposition 3 If $|\mathcal{P}a| = 1$ and if, for any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, all the entries of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ belong to $\{0, 1\}$, then $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$ is the orthogonal complement of $\mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'}$ and $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $(\gamma, \rho) = (1, +\infty)$. Moreover, the deep-lower-RIP constant of $\mathcal{A}$ with regard to $\mathcal{M}$ is $\sigma_{\mathcal{M}} = \sqrt{N}$.

Proof The fact that $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$ is the orthogonal complement of $\mathbb{R}^{S^K}_{\mathcal{S} \cup \mathcal{S}'}$ is a direct consequence of Proposition 2 and of the fact that, when $|\mathcal{P}a| = 1$, $I_{\mathcal{S} \cup \mathcal{S}'} = \mathcal{S} \cup \mathcal{S}'$. We then trivially deduce that, for any $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$, $P_{\mathcal{S} \cup \mathcal{S}'} T' = 0$. A straightforward consequence is that $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with constants $(\gamma, \rho) = (1, +\infty)$.

To calculate $\sigma_{\mathcal{M}}$, let us consider $\mathcal{S}$, $\mathcal{S}' \in \mathcal{M}$ and $T$ in the orthogonal complement of $\mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$. We express $T$ under the form $T = \sum_{i \in I_{\mathcal{S} \cup \mathcal{S}'}} T_i P(h^i)$, where $h^i$ is defined in (15). Let us also remind that, applying Proposition 2, the supports of $\mathcal{A} P(h^i)$ and $\mathcal{A} P(h^j)$ are disjoint when $i \ne j$. Let us finally add that, since $\mathcal{A} P(h^i)$ is the matrix of a convolution with a Dirac mass, we have $|D_i| = N$. We finally have
$$\|\mathcal{A} T\|^2 = \Big\| \sum_{i \in I_{\mathcal{S} \cup \mathcal{S}'}} T_i \, \mathcal{A} P(h^i) \Big\|^2 = N \sum_{i \in I_{\mathcal{S} \cup \mathcal{S}'}} T_i^2 = N \|T\|^2,$$
from which we deduce the value of $\sigma_{\mathcal{M}}$.

In the sequel, we establish stability results for a convolutional linear network estimator. In order to do so, we consider a convolutional linear network of known structure $\mathcal{G}(\mathcal{E}, \mathcal{N})$, $(\mathcal{S}_e)_{e \in \mathcal{E}}$ and $\mathcal{M}$. The convolutional linear network is defined by unknown parameters $h \in \mathbb{R}^{S \times K}$ satisfying a constraint $\mathrm{supp}(h) \subset \mathcal{S}$ for an unknown support $\mathcal{S} \in \mathcal{M}$. We consider the noisy situation where
$$X = M_1(h_1) \cdots M_K(h_K) + e,$$
with $\|e\| \le \delta$, and an estimate $h^* \in \mathbb{R}^{S \times K}$ such that
$$\|M_1(h^*_1) \cdots M_K(h^*_K) - X\| \le \eta.$$
The equivalence relation $\sim$ does not suffice to group the parameters leading to the same network action. Indeed, with networks, we can rescale the kernels on different paths differently. Therefore, we say that two networks sharing the same topology and defined by the parameters $h$ and $g \in \mathbb{R}^{S \times K}$ are equivalent if and only if
$$\forall p \in \mathcal{P}a, \ \exists (\lambda_e)_{e \in p} \in \mathbb{R}^K \text{ such that } \prod_{e \in p} \lambda_e = 1 \text{ and } \forall e \in p, \ T_e(g) = \lambda_e T_e(h).$$
We trivially observe that applying the networks defined by equivalent parameters leads to the same result. The equivalence class of $h \in \mathbb{R}^{S \times K}$ is denoted by $\{h\}$. For any $p \in [1, +\infty]$, we define
$$\delta_p(\{h\}, \{g\}) = \Big( \sum_{p \in \mathcal{P}a} d_p([h_p], [g_p])^p \Big)^{\frac{1}{p}},$$
where we remind that $h_p$ (resp. $g_p$) denotes the restriction of $h$ (resp. $g$) to the path $p$, and $d_p$ is defined in (4). Since $d_p$ is a metric, we easily prove that $\delta_p$ is a metric between network classes.

Theorem 5
If, for any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$, all the entries of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ belong to $\{0, 1\}$, and if there exists $\varepsilon > 0$ such that $\|T_e(h)\|_\infty \ge \varepsilon$ for all $e \in \mathcal{E}$, then
$$\delta_p(\{h^*\}, \{h\}) \le \frac{(KS)^{\frac{1}{p}}}{\sqrt{N} \, \varepsilon^{K-1}} (\delta + \eta).$$

Proof
Let us consider a path $p \in \mathcal{P}a$. Using (14), since all the entries of $M_1(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_1}) \cdots M_K(\mathbb{1}_{(\mathcal{S} \cup \mathcal{S}')_K})$ belong to $\{0, 1\}$, the restriction of the network to $p$ satisfies the same property. Therefore, we can apply Proposition 3 and Theorem 3 to the restriction of the convolutional linear network to $p$ and obtain, for any $p \in [1, \infty]$,
$$d_p([(h^*)_p], [h_p]) \le \frac{(KS)^{\frac{1}{p}}}{\sqrt{N}} \, \varepsilon^{1-K} (\delta_p + \eta_p),$$
where $\delta_p$ and $\eta_p$ are the restrictions of the errors to $D_p$, and where we used that $\|P(h_p)\|_\infty = \prod_{e \in p} \|T_e(h)\|_\infty \ge \varepsilon^K$, so that $\|P(h_p)\|_\infty^{-\frac{K-1}{K}} \le \varepsilon^{1-K}$. Finally, using item 1 of Proposition 2,
$$\delta_p(\{h^*\}, \{h\}) \le \frac{(KS)^{\frac{1}{p}}}{\sqrt{N}} \, \varepsilon^{1-K} \Big( \sum_{p \in \mathcal{P}a} (\delta_p + \eta_p)^p \Big)^{\frac{1}{p}} \le \frac{(KS)^{\frac{1}{p}}}{\sqrt{N} \, \varepsilon^{K-1}} (\delta + \eta).$$
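The metric controlled by Theorem 5 can be evaluated explicitly. The sketch below computes $d_p$ of (4) by rescaling both parameter tuples into $\mathbb{R}^{S \times K}_{\mathrm{diag}}$ (within $\mathbb{R}^{S \times K}_{\mathrm{diag}}$, the only remaining rescalings are sign patterns with product $1$) and then aggregates over paths as in the definition of $\delta_p$. The enumeration is our reading of (4), not the authors' implementation.

```python
# Sketch of the evaluation of d_p in (4) and of the network metric delta_p.
import numpy as np
from itertools import product

def to_diag(h):
    """Representative of [h] whose factors all share the same sup norm."""
    norms = np.abs(h).max(axis=0)
    c = np.prod(norms) ** (1.0 / h.shape[1])       # common sup norm
    return h / norms * c                           # scalings multiply to 1

def d_p(h, g, p=2):
    hd, gd = to_diag(h), to_diag(g)
    K = h.shape[1]
    best = np.inf
    for signs in product([-1.0, 1.0], repeat=K):   # residual freedom in [g]
        if np.prod(signs) != 1.0:
            continue
        diff = (hd - gd * np.asarray(signs)).ravel()
        best = min(best, np.linalg.norm(diff, ord=p))
    return best

def delta_p(h_paths, g_paths, p=2):                # one (S, K) block per path
    return sum(d_p(h, g, p) ** p for h, g in zip(h_paths, g_paths)) ** (1.0 / p)

rng = np.random.default_rng(6)
h = rng.standard_normal((5, 3))
g = h * np.array([2.0, -1.0, -0.5])                # equivalent rescaling of h
print(d_p(h, g), delta_p([h], [g]))                # both ~ 0: same class
```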
Acknowledgments
Joseph Landsberg is supported by NSF DMS-1405348.
References
Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2014.

Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 145–162. ACM, 2012.

Sohail Bahmani and Justin Romberg. Lifting for blind deconvolution in random mask imaging: Identifiability and convex relaxation. SIAM Journal on Imaging Sciences, 8(4):2203–2238, 2015.

Emmanuel J. Candes, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.

Olivier Chabiron, François Malgouyres, Jean-Yves Tourneret, and Nicolas Dobigeon. Toward fast transform learning. International Journal of Computer Vision, pages 1–22, 2014.

Olivier Chabiron, François Malgouyres, Herwig Wendt, and Jean-Yves Tourneret. Optimization of a fast transform structured as a convolutional tree. Preprint HAL, (hal-01258514), 2016.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015a.

Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760, 2015b.

Sunav Choudhary and Urbashi Mitra. Identifiability scaling laws in bilinear inverse problems. arXiv preprint arXiv:1402.2637, 2014.

Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term approximation. Journal of the American Mathematical Society, 22(1):211–231, 2009.

David L. Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, 2003.

Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.

Benjamin D. Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

Rodolphe Jenatton, Rémi Gribonval, and Francis Bach. Local stability and robustness of sparse dictionary learning in the presence of noise. arXiv preprint, 2012.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

Hans Laurberg, Mads Græsbøll Christensen, Mark D. Plumbley, Lars Kai Hansen, and Søren Holdt Jensen. Theorems on positive data: On the uniqueness of NMF. Computational Intelligence and Neuroscience, 2008, 2008.

Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. CoRR, abs/1606.04933, 2016. URL http://arxiv.org/abs/1606.04933.

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

François Malgouyres and Joseph Landsberg. Stable recovery of the factors from a deep matrix product and application to convolutional network. arXiv preprint arXiv:1703.08044, 2017.

Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, Boston, 1998.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Appendix A. Proof of Theorem 4
Proof
Let $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$. Let $h \in \mathbb{R}^{S \times K}_{\mathcal{S}}$ and $h' \in \mathbb{R}^{S \times K}_{\mathcal{S}'}$ be such that $\|\mathcal{A} (P(h) - P(h'))\| \le \delta$. Throughout the proof, we also consider $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$. We assume that $\|P(h)\|_\infty \le \|P(h')\|_\infty$; if this is not the case, we simply switch the roles of $h$ and $h'$ in the definitions of $X$ and $e$ below. We denote
$$X = \mathcal{A} P(h) \qquad \text{and} \qquad e = \mathcal{A} P(h) - \mathcal{A} P(h').$$
We have $X = \mathcal{A} P(h') + e$ with $\|e\| \le \delta$. Moreover, when $\mathcal{S}$ and $h$ play the roles of $\mathcal{S}^*$ and $h^*$ in the hypothesis, since $h \in \mathbb{R}^{S \times K}_{\mathcal{S}}$ and $\|e\| \le \delta$, we have
$$d_2([h'], [h]) \le C \|P(h')\|_\infty^{-\frac{K-1}{K}} \|e\|.$$
Using the fact that $e = \mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} (P(h) - P(h'))$, we have, for any $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$,
$$\|e\| = \|\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} (P(h) - P(h') - T')\| \le \sigma_{\max} \|P_{\mathcal{S} \cup \mathcal{S}'} (P(h) - P(h') - T')\| = \sigma_{\max} \|P(h) - P(h') - P_{\mathcal{S} \cup \mathcal{S}'} T'\|,$$
where $\sigma_{\max}$ is the spectral radius of $\mathcal{A}$. Therefore,
$$d_2([h'], [h]) \le C \|P(h')\|_\infty^{-\frac{K-1}{K}} \sigma_{\max} \|P(h) - P(h') - P_{\mathcal{S} \cup \mathcal{S}'} T'\|.$$
Finally, using Theorem 2 and the fact that $\|P(h)\|_\infty \le \|P(h')\|_\infty$, we obtain
$$\|P(h') - P(h)\| \le S^{\frac{K-1}{2}} \sqrt{K} \, \|P(h')\|_\infty^{\frac{K-1}{K}} d_2([h'], [h]) \le C S^{\frac{K-1}{2}} \sqrt{K} \, \sigma_{\max} \|P(h) - P(h') - P_{\mathcal{S} \cup \mathcal{S}'} T'\| = \gamma \|P(h) - P(h') - P_{\mathcal{S} \cup \mathcal{S}'} T'\|,$$
for $\gamma = C S^{\frac{K-1}{2}} \sqrt{K} \, \sigma_{\max}$.

Summarizing, we conclude that, under the hypotheses of the theorem, for any $\mathcal{S}$ and $\mathcal{S}' \in \mathcal{M}$ and any $T \in P(\mathbb{R}^{S \times K}_{\mathcal{S}}) + P(\mathbb{R}^{S \times K}_{\mathcal{S}'})$ such that $\|\mathcal{A} T\| = \|\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'} T\| \le \delta$, we have, for any $T' \in \mathrm{Ker}(\mathcal{A}_{\mathcal{S} \cup \mathcal{S}'})$,
$$\|T\| \le \gamma \|T - P_{\mathcal{S} \cup \mathcal{S}'} T'\|.$$
In words, $\mathcal{A}$ satisfies the deep-$\mathcal{M}$-NSP with the constants of Theorem 4.