Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions
Nadav Cohen (cohennadav@ias.edu)
Institute for Advanced Study

Ronen Tamari (ronent@cs.huji.ac.il)
The Hebrew University of Jerusalem

Amnon Shashua (shashua@cs.huji.ac.il)
The Hebrew University of Jerusalem
Abstract
The driving force behind deep networks is their ability to compactly represent rich classes of functions. The primary notion for formally reasoning about this phenomenon is expressive efficiency, which refers to a situation where one network must grow unfeasibly large in order to realize (or approximate) functions of another. To date, expressive efficiency analyses focused on the architectural feature of depth, showing that deep networks are representationally superior to shallow ones. In this paper we study the expressive efficiency brought forth by connectivity, motivated by the observation that modern networks interconnect their layers in elaborate ways. We focus on dilated convolutional networks, a family of deep models delivering state of the art performance in sequence processing tasks. By introducing and analyzing the concept of mixed tensor decompositions, we prove that interconnecting dilated convolutional networks can lead to expressive efficiency. In particular, we show that even a single connection between intermediate layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not. Empirical evaluation demonstrates how the expressive efficiency of connectivity, similarly to that of depth, translates into gains in accuracy. This leads us to believe that expressive efficiency may serve a key role in the development of new tools for deep network design.
Keywords: Deep Learning, Expressive Efficiency, Dilated Convolutions, Tensor Decompositions
1. Introduction
One of the key attributes fueling the success of deep learning is the ability of deep networks to compactly represent rich classes of functions. This phenomenon has drawn considerable attention from the theoretical machine learning community in recent years. The primary notion for formally reasoning about the representational abilities of different models is expressive efficiency. Given two network architectures $A$ and $B$, with size parameters (typically the width of layers across a network) $r_A$ and $r_B$, we say that architecture $A$ is expressively efficient w.r.t. architecture $B$ if the following two conditions hold: (i) any function realized by $B$ with size $r_B$ can be realized (or approximated) by $A$ with size $r_A \in O(r_B)$; and (ii) there exist functions realized by $A$ with size $r_A$ that cannot be realized (or approximated) by $B$ unless its size meets $r_B \in \Omega(f(r_A))$ for some super-linear function $f$. The nature of the function $f$ in condition (ii) determines the type of efficiency taking place – if $f$ is exponential then architecture $A$ is said to be exponentially expressively efficient w.r.t. architecture $B$, and if $f$ is polynomial so is the expressive efficiency of $A$ over $B$.
To date, works studying expressive efficiency in the context of deep learning (e.g. Delalleau and Bengio (2011); Pascanu et al. (2013); Montufar et al. (2014); Telgarsky (2015); Eldan and Shamir (2015); Poole et al. (2016); Raghu et al. (2016); Cohen et al. (2016b); Cohen and Shashua (2016); Poggio et al. (2015); Mhaskar et al. (2016)) have focused on the architectural feature of depth, showing instances where deep networks are expressively efficient w.r.t. shallow ones. This theoretical focus is motivated by the vast empirical evidence supporting the importance of depth (see LeCun et al. (2015) for a survey of such results). However, it largely overlooks an additional architectural feature that in recent years is proving to have great impact on the performance of deep networks – connectivity. Nearly all state of the art networks these days (e.g. Szegedy et al. (2015); He et al. (2015); Huang et al. (2016b,a)) deviate from the simple feed-forward (chain) approach, connecting their layers under various schemes. Whether or not this relates to expressive efficiency remains an open question.

A specific family of deep networks gaining increased attention in the deep learning community is that of dilated convolutional networks. These models form the basis of the recent WaveNet (van den Oord et al. (2016)) and ByteNet (Kalchbrenner et al. (2016)) architectures, which provide state of the art performance in audio and text processing tasks. Dilated convolutional networks are typically applied to sequence data, and consist of multiple succeeding convolutional layers, each comprising non-contiguous filters with a different dilation (distance between neighboring elements). The choice of dilations directly affects the space of functions that may be realized by a network, and while no choice is expressively efficient w.r.t. another, we show in this work that interconnecting networks with different dilations leads to expressive efficiency, and by this demonstrate that connectivity indeed bears the potential to enhance the expressiveness of deep networks.

Our analysis follows several recent works utilizing tensor decompositions for theoretical studies of deep learning (see for example Janzamin et al. (2015); Sedghi and Anandkumar (2016)), and in particular, builds on the equivalence between hierarchical tensor decompositions and convolutional networks established in Cohen et al. (2016b) and Cohen and Shashua (2016). We show that with dilated convolutional networks, the choice of dilations throughout a network corresponds to determination of the mode (dimension) tree underlying the respective decomposition. We then define the notion of a mixed tensor decomposition, which blends together multiple mode trees, effectively creating a large ensemble of hybrid trees formed from all possible combinations. Mixed tensor decompositions correspond to mixed dilated convolutional networks, i.e. mixtures formed by connecting intermediate layers of different dilated convolutional networks. This allows studying the expressive properties of such mixtures using mathematical machinery from the field of tensor analysis. We fully analyze a particular case of dilated convolutional arithmetic circuits, showing that a single connection between intermediate layers already leads to an almost quadratic expressive efficiency, which in large-scale settings typically makes the difference between a model that is practical and one that is not.

An experiment on the TIMIT speech corpus (Garofolo et al.
(1993)) evaluates the dilated convolutional network architectures covered by our analysis. We find that interconnecting intermediate layers of different networks improves accuracy, with no additional cost in terms of computation or model capacity. This serves as an indication that with the architectural feature of connectivity, similarly to the case of depth, expressive efficiency and improved accuracies go hand in hand. Accordingly, we believe expressive efficiency may serve a key role in the development of new tools for deep network design.
2. Summary of Our Analysis and Contributions
Our analysis begins in sec. 4, where we present the dilated convolutional network underlying WaveNet (fig. 1). We consider this to be the baseline architecture and, following Cohen and Shashua (2016), facilitate its study through tensor analysis. The key to introducing tensors into the framework is a discretization of the network's input-output mapping. Namely, $f(\mathbf{x}[t-N+1],\ldots,\mathbf{x}[t])$ – a function realized by the network ($t$ here stands for a natural time index) – is conceptually evaluated on a finite (exponentially large) number of input points, generated from all possible assignments of the variables $\mathbf{x}[t-N+1],\ldots,\mathbf{x}[t]$ to each hold one of $M$ predetermined values. This gives rise to an $N$-dimensional lookup table, with length $M$ in each axis. We refer to this lookup table as a grid tensor (eq. 1). It is shown (app. A) that grid tensors brought forth by the baseline dilated convolutional network can be expressed as a hierarchical tensor decomposition, referred to as the baseline decomposition (eq. 2).

The baseline decomposition implicitly adheres to a particular tree over tensor modes (axes). This calls for a generalization, and we indeed define a general mode tree (def. 1), followed by a corresponding hierarchical tensor decomposition, referred to as the tree decomposition (eq. 3). Different choices of mode trees lead to tree decompositions characterizing networks with different dilations. We focus on the tree that corresponds to the baseline network (fig. 2(a)), and on those corresponding to networks obtained by swapping dilations of different layers (fig. 2(b), for example).

Armed with a framework for representing different dilated convolutional networks through hierarchical tensor decompositions of different mode trees, we head on in sec. 5 and introduce the notion of a mixed tensor decomposition (eq. 4). The mixed decomposition of two mode trees $T$ and $\bar{T}$ is based on a preselected set of nodes present in both trees, referred to as mixture nodes. Individual tree decompositions of $T$ and $\bar{T}$ are run in parallel, where at each mixture node, tensors from the two decompositions are swapped. If $\mathcal{N}$ and $\bar{\mathcal{N}}$ are the dilated convolutional networks characterized by $T$ and $\bar{T}$ (respectively), the mixed decomposition characterizes a mixed (interconnected) network $\mathcal{M}$, formed by rewiring intermediate layers of $\mathcal{N}$ into $\bar{\mathcal{N}}$, and vice versa (see fig. 3).

The heart of our analysis is sec. 6, where we study the expressive efficiency of the mixed network $\mathcal{M}$ over the individual networks $\mathcal{N}$ and $\bar{\mathcal{N}}$. Establishing expressive efficiency requires showing that any function realized by $\mathcal{N}$ or $\bar{\mathcal{N}}$ can be realized by $\mathcal{M}$ with no more than linear growth in size, whereas the converse does not hold, i.e. there exist functions realizable by $\mathcal{M}$ that cannot be realized by $\mathcal{N}$ or $\bar{\mathcal{N}}$ unless their size is allowed to grow super-linearly. From a tensor decomposition perspective, this translates to the following two propositions:

(i) any tensor generated by a tree decomposition of $T$ or $\bar{T}$ can be realized by their mixed decomposition with no more than linear growth in size; and

(ii) there exist tensors realizable by the mixed decomposition of $T$ and $\bar{T}$ that cannot be realized by their individual tree decompositions without a super-linear growth in size.

We address both propositions through the notion of hybrid mode trees (def. 4; fig. 4), which are simply mode trees born from combinations of $T$ and $\bar{T}$.
We prove (claim 5) that the mixed decomposition of $T$ and $\bar{T}$ can replicate, with no more than linear growth in size, the tree decomposition of any hybrid tree $H$. Since $T$ and $\bar{T}$ are in particular hybrid mode trees of themselves, we obtain an affirmative answer to proposition (i). For addressing proposition (ii), we demonstrate a case (with convolutional arithmetic circuits) where there exists a hybrid tree $H$ whose tree decomposition generates tensors that require the tree decompositions of $T$ and $\bar{T}$ to grow super-linearly. Since the mixed decomposition of $T$ and $\bar{T}$ can (by claim 5) replicate the tree decomposition of $H$ with no more than linear growth, proposition (ii) is established, and $\mathcal{M}$ is indeed expressively efficient w.r.t. $\mathcal{N}$ and $\bar{\mathcal{N}}$ (corollary 8).

The central tool for establishing proposition (ii), or more specifically, for demonstrating the existence of a hybrid tree $H$ whose tree decomposition requires those of $T$ and $\bar{T}$ to grow super-linearly, is a tight analysis of tensors generated by a tree decomposition in terms of their ranks when arranged as matrices (theorem 7). Matricization ranks under hierarchical tensor decompositions are of interest from a pure tensor analysis perspective (cf. Hackbusch (2012)), as well as in the context of deep learning (cf.
Cohen and Shashua (2017)). The bounds we provide are much tighter (exact in many cases) and far more general than those existing in the literature, and we expect them to prove useful in different applications. The key idea in deriving these bounds is to consider a matricized form of the tree decomposition, and recursively propagate outwards various matrices (for details see the proof sketch following theorem 7, as well as the complete proof in app. B.2).

To conclude this section, we list below the main contributions of the paper:

• We introduce the notion of a mixed tensor decomposition, and prove that it brings forth a representational advantage compared to the individual hierarchical decompositions it comprises. This development is of interest from a pure tensor analysis perspective, independently of convolutional networks, or machine learning in general.

• We provide the first formal evidence for the fact that interconnectivity – an architectural feature prevalent in state of the art deep learning – brings forth expressive efficiency.

• Our central theorem (theorem 7) provides the most comprehensive characterization to date of matricization ranks brought forth by hierarchical tensor decompositions.
3. Preliminaries
The constructions and analyses delivered in this paper rely on concepts from the field of tensor analysis. Below we provide the minimal background required in order to follow our arguments.

The core concept in tensor analysis is a tensor, which for our purposes may simply be thought of as a multi-dimensional array. The order of a tensor is defined to be the number of indexing entries in the array, which are referred to as modes. The dimension of a tensor in a particular mode is defined as the number of values that may be taken by the index in that mode. For example, an $M_1$-by-$M_2$ matrix is a tensor of order 2, i.e. it has two modes, with dimension $M_1$ in mode 1 and dimension $M_2$ in mode 2. If $\mathcal{A}$ is a tensor of order $N$ and dimension $M_i$ in each mode $i \in \{1,\ldots,N\}$, the space of all configurations it can take is denoted, quite naturally, by $\mathbb{R}^{M_1 \times \cdots \times M_N}$.

A fundamental operator in tensor analysis is the tensor product (also known as outer product), which we denote by $\otimes$. It is an operator that intakes two tensors $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_P}$ and $\mathcal{B} \in \mathbb{R}^{M_{P+1} \times \cdots \times M_{P+Q}}$ (orders $P$ and $Q$ respectively), and returns a tensor $\mathcal{A} \otimes \mathcal{B} \in \mathbb{R}^{M_1 \times \cdots \times M_{P+Q}}$ (order $P+Q$) defined by: $(\mathcal{A} \otimes \mathcal{B})_{d_1 \ldots d_{P+Q}} = \mathcal{A}_{d_1 \ldots d_P} \cdot \mathcal{B}_{d_{P+1} \ldots d_{P+Q}}$. In Cohen and Shashua (2016) a generalization of the tensor product is defined, by replacing multiplication with a general operator $g(\cdot)$. Specifically, for a function $g: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ that is commutative ($g(a,b) = g(b,a)$ for all $a,b \in \mathbb{R}$), the generalized tensor product, denoted $\otimes_g$, is defined to be the operator that for input tensors $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_P}$ and $\mathcal{B} \in \mathbb{R}^{M_{P+1} \times \cdots \times M_{P+Q}}$ (orders $P$ and $Q$ respectively), returns the tensor $\mathcal{A} \otimes_g \mathcal{B} \in \mathbb{R}^{M_1 \times \cdots \times M_{P+Q}}$ (order $P+Q$) given by: $(\mathcal{A} \otimes_g \mathcal{B})_{d_1 \ldots d_{P+Q}} = g(\mathcal{A}_{d_1 \ldots d_P}, \mathcal{B}_{d_{P+1} \ldots d_{P+Q}})$.
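To make these two operators concrete, here is a minimal NumPy sketch of our own (function names are ours, not the paper's):

```python
import numpy as np

def tensor_product(A, B):
    # (A ⊗ B)_{d1..d_{P+Q}} = A_{d1..dP} * B_{d_{P+1}..d_{P+Q}}
    return np.tensordot(A, B, axes=0)

def generalized_tensor_product(A, B, g):
    # (A ⊗_g B)_{d1..d_{P+Q}} = g(A_{d1..dP}, B_{d_{P+1}..d_{P+Q}}),
    # realized by broadcasting A against B over appended singleton modes
    A_exp = A.reshape(A.shape + (1,) * B.ndim)
    B_exp = B.reshape((1,) * A.ndim + B.shape)
    return g(A_exp, B_exp)

A = np.random.rand(2, 3)               # order-2 tensor
B = np.random.rand(4)                  # order-1 tensor
P = tensor_product(A, B)               # order-3 tensor, shape (2, 3, 4)
Pg = generalized_tensor_product(A, B, lambda a, b: np.maximum(a + b, 0))
assert P.shape == Pg.shape == (2, 3, 4)
```

Note that with $g(a,b)=a\cdot b$ the generalized product coincides with the ordinary tensor product.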
An additional operator we will make use of is mode permutation. Let $\mathcal{A}$ be a tensor of order $N$, and let $\sigma(\cdot)$ be a permutation over $N$ (bijective mapping from $\{1,\ldots,N\}$ to itself). The mode permutation of $\mathcal{A}$ w.r.t. $\sigma(\cdot)$, which by a slight abuse of notation is denoted $\sigma(\mathcal{A})$, is the order-$N$ tensor defined by: $\sigma(\mathcal{A})_{d_1 \ldots d_N} = \mathcal{A}_{d_{\sigma(1)} \ldots d_{\sigma(N)}}$. In words, $\sigma(\mathcal{A})$ is the tensor obtained by rearranging the modes of $\mathcal{A}$ in accordance with $\sigma(\cdot)$.

When studying tensors, it is oftentimes useful to arrange them as matrices, a procedure referred to as matricization. Let $\mathcal{A}$ be a tensor of order $N$ and dimension $M_i$ in each mode $i \in \{1,\ldots,N\}$, and let $\mathcal{I} \subset \{1,\ldots,N\}$ be a set of mode indexes, whose complement $\{1,\ldots,N\} \setminus \mathcal{I}$ we denote by $\mathcal{I}^c$. We may write $\mathcal{I} = \{i_1,\ldots,i_{|\mathcal{I}|}\}$ where $i_1 < \cdots < i_{|\mathcal{I}|}$, and similarly $\mathcal{I}^c = \{j_1,\ldots,j_{|\mathcal{I}^c|}\}$ where $j_1 < \cdots < j_{|\mathcal{I}^c|}$. The matricization of $\mathcal{A}$ w.r.t. $\mathcal{I}$, denoted $[\![\mathcal{A}]\!]_\mathcal{I}$, is the $\prod_{t=1}^{|\mathcal{I}|} M_{i_t}$-by-$\prod_{t=1}^{|\mathcal{I}^c|} M_{j_t}$ matrix holding the entries of $\mathcal{A}$ such that $\mathcal{A}_{d_1 \ldots d_N}$ is placed in row index $1 + \sum_{t=1}^{|\mathcal{I}|} (d_{i_t}-1) \prod_{t'=t+1}^{|\mathcal{I}|} M_{i_{t'}}$ and column index $1 + \sum_{t=1}^{|\mathcal{I}^c|} (d_{j_t}-1) \prod_{t'=t+1}^{|\mathcal{I}^c|} M_{j_{t'}}$. If $\mathcal{I} = \emptyset$ or $\mathcal{I} = \{1,\ldots,N\}$, then by definition $[\![\mathcal{A}]\!]_\mathcal{I}$ is a row or column (respectively) vector of dimension $\prod_{t=1}^N M_t$ holding $\mathcal{A}_{d_1 \ldots d_N}$ in entry $1 + \sum_{t=1}^N (d_t-1) \prod_{t'=t+1}^N M_{t'}$.

To conclude this section, we hereinafter establish notational conventions that will accompany us throughout the paper. We denote tensors with uppercase calligraphic letters, e.g. $\mathcal{A}$, and in some cases, with the Greek letters $\phi$, $\varphi$ or $\psi$. Subscripts are used to refer to individual tensor entries, e.g. $\mathcal{A}_{d_1 \ldots d_N} \in \mathbb{R}$, whereas superscripts indicate the location of a tensor in some annotated collection, for example $\mathcal{A}^y$ stands for the $y$'th tensor in the collection $\mathcal{A}^1 \ldots \mathcal{A}^r$. Vectors are typically denoted with boldface lowercase letters, e.g. $\mathbf{a}$, where again subscripts refer to an individual entry (e.g. $a_\alpha \in \mathbb{R}$), and superscripts to the identity of a vector within some annotated collection (e.g. $\mathbf{a}^{l,j}$ is the $(l,j)$'th vector in the set $\{\mathbf{a}^{l,j}\}_{l=1 \ldots L, j=1 \ldots N}$). We use non-boldface lowercase or uppercase letters (e.g. $l$ or $L$) to denote scalars, and in this case, both subscripts and superscripts distinguish between objects in an annotated set (e.g. $l_i, l^i, L_i, L^i \in \mathbb{R}$). Finally, for a positive integer $N \in \mathbb{N}$, we use $[N]$ as shorthand for the set $\{1,\ldots,N\}$.
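Matricization amounts to a mode permutation followed by a row-major reshape. The following sketch (ours; modes 0-indexed, unlike the 1-indexed text) realizes the row/column index formulas above:

```python
import numpy as np

def matricize(A, I):
    # [[A]]_I : rows indexed by modes in I (sorted), columns by the rest;
    # row-major reshape reproduces the index formulas of the text (0-based)
    N = A.ndim
    I = sorted(I)
    Ic = [j for j in range(N) if j not in I]     # complement, sorted
    rows = int(np.prod([A.shape[i] for i in I])) if I else 1
    return np.transpose(A, I + Ic).reshape(rows, -1)

A = np.arange(24).reshape(2, 3, 4)               # order-3 tensor
M = matricize(A, [0, 2])                         # 8-by-3 matrix
assert M.shape == (2 * 4, 3)
```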
4. Dilated Convolutional Networks
Dilated convolutional networks are a family of convolutional networks (LeCun and Bengio (1995)) gaining increased attention in the deep learning community. As opposed to more conventional convolutional architectures (see for example Krizhevsky et al. (2012)), which are applied primarily to images and videos, dilated convolutional networks thrive in sequence processing tasks. For example, they underlie Google's WaveNet (van den Oord et al. (2016)) and ByteNet (Kalchbrenner et al. (2016)) models, which provide state of the art performance in audio and text processing tasks.
The dilated convolutional network architecture considered as baseline in this paper is the one underlying the WaveNet model, depicted in fig. 1. The input to the network is a sequence of vectors $(\mathbf{x}[t])_t \subset \mathbb{R}^{r_0}$, where $t$ is a natural time index. A size-2 convolutional layer with dilation-1, i.e. with contiguous filters, maps this input into the hidden sequence $(\mathbf{h}^{(1)}[t])_t \subset \mathbb{R}^{r_1}$. Specifically, entry $\gamma \in [r_1]$ of $\mathbf{h}^{(1)}[t]$ is obtained by applying the filter formed by $\mathbf{a}^{1,\gamma,I}, \mathbf{a}^{1,\gamma,II} \in \mathbb{R}^{r_0}$ to time points $t-1, t$ of the input: $h^{(1)}[t]_\gamma = g(\langle\mathbf{a}^{1,\gamma,I}, \mathbf{x}[t-1]\rangle, \langle\mathbf{a}^{1,\gamma,II}, \mathbf{x}[t]\rangle)$. For reasons that will shortly become apparent, we use $g(\cdot)$ here to denote the binary function combining two size-1 convolutions into a single size-2 convolution with non-linearity.
Figure 1: Baseline dilated convolutional network architecture (see description in sec. 4.1).

Different choices of $g(\cdot)$ lead to different convolutional operators; for example, $g(a,b) := \max\{a+b, 0\}$ leads to standard convolution followed by rectified linear activation (ReLU, Nair and Hinton (2010)), whereas $g(a,b) = a \cdot b$ gives rise to what is known as a convolutional arithmetic circuit (Cohen et al. (2016b)). Following the first hidden layer, $L-2$ size-2 convolutional layers with increasing dilations are applied. Specifically, for $l = 2,\ldots,L-1$, hidden layer $l$ maps the sequence $(\mathbf{h}^{(l-1)}[t])_t \subset \mathbb{R}^{r_{l-1}}$ into $(\mathbf{h}^{(l)}[t])_t \subset \mathbb{R}^{r_l}$ using filters with dilation-$2^{l-1}$, i.e. with an internal temporal gap of $2^{l-1}-1$ points: $h^{(l)}[t]_\gamma = g(\langle\mathbf{a}^{l,\gamma,I}, \mathbf{h}^{(l-1)}[t-2^{l-1}]\rangle, \langle\mathbf{a}^{l,\gamma,II}, \mathbf{h}^{(l-1)}[t]\rangle)$. The last convolutional layer maps $(\mathbf{h}^{(L-1)}[t])_t$ into the network output sequence $(\mathbf{o}[t])_t \subset \mathbb{R}^{r_L}$ using filters with dilation-$2^{L-1}$: $o[t]_y = g(\langle\mathbf{a}^{L,y,I}, \mathbf{h}^{(L-1)}[t-2^{L-1}]\rangle, \langle\mathbf{a}^{L,y,II}, \mathbf{h}^{(L-1)}[t]\rangle)$.

Altogether, the architectural parameters of the network are the number of convolutional layers $L$, the convolutional operator $g(\cdot)$, the input dimension $r_0$, the number of channels $r_l$ for each hidden layer $l \in [L-1]$, and the output dimension $r_L$. The learnable parameters are the convolution weights $\mathbf{a}^{l,\gamma,I}, \mathbf{a}^{l,\gamma,II} \in \mathbb{R}^{r_{l-1}}$ for channel $\gamma \in [r_l]$ of layer $l \in [L]$.

Our interest lies in the representational abilities of the network, i.e. in the properties of the input-output mappings it can realize. As illustrated in fig. 1, for some fixed time point $t$, $\mathbf{o}[t]$ – the network output at time $t$ – is a function of $\mathbf{x}[t-2^L+1] \ldots \mathbf{x}[t]$ – the network input over the last $2^L$ time points. Taking into account the temporal stationarity of the network, and denoting for brevity $N := 2^L$, we may write $o[t]_y = f_y(\mathbf{x}[t-N+1],\ldots,\mathbf{x}[t])$ for every $y \in [r_L]$, where the functions $\{f_y(\cdot)\}_y$ are independent of the time index $t$. The latter functions, which obviously depend on the convolution weights $\{\mathbf{a}^{l,\gamma,I}, \mathbf{a}^{l,\gamma,II}\}_{l,\gamma}$, completely characterize the input-output mapping realized by the network. We will study these functions through the process of discretization. Namely, $f_y(\cdot)$ – a function of $N$ vector-variables – will be represented by a lookup table (tensor) formed by varying each vector-variable over a finite number of possible values. The size of such a lookup table is exponential in $N$, thus treating it directly is intractable. However, as we shall see, the network admits a compact parameterization of lookup tables in terms of the convolution weights $\{\mathbf{a}^{l,\gamma,I}, \mathbf{a}^{l,\gamma,II}\}_{l,\gamma}$. This parameterization (eq. 2 below) entails an algebraic structure, and will be used to study the representational properties of the baseline dilated convolutional network.

For the discretization of $f_y(\cdot)$, we choose a collection of vectors $\mathbf{v}^{(1)} \ldots \mathbf{v}^{(M)} \in \mathbb{R}^{r_0}$, and define the following tensor $\mathcal{A}^y$ of order $N$ and dimension $M$ in each mode:

$$\mathcal{A}^y_{d_1 \ldots d_N} := f_y(\mathbf{v}^{(d_1)},\ldots,\mathbf{v}^{(d_N)}) \quad \forall d_1 \ldots d_N \in [M] \qquad (1)$$
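As a toy-scale illustration, the following NumPy sketch – our own, not the authors' code – evaluates the baseline network over a window of $N=2^L$ points (where, because of the doubling dilations, the computation reduces to a perfect binary tree over contiguous pairs) and assembles the grid tensor of eq. 1 by brute force:

```python
import numpy as np
from itertools import product

def baseline_network(x, weights, g):
    # x: list of N = 2**L input vectors (oldest first);
    # weights[l-1] = (a_I, a_II), each of shape (r_l, r_{l-1});
    # layer l has dilation 2**(l-1), so the window computation is a
    # perfect binary tree combining contiguous pairs at every level
    h = [np.asarray(v, dtype=float) for v in x]
    for a_I, a_II in weights:
        h = [g(a_I @ h[2*j], a_II @ h[2*j + 1]) for j in range(len(h) // 2)]
    return h[0]                       # o[t], a vector of dimension r_L

def grid_tensor(y, weights, g, discretizers):
    # brute-force evaluation of eq. 1 -- exponential in N, toy sizes only
    M, N = len(discretizers), 2 ** len(weights)
    A = np.empty((M,) * N)
    for d in product(range(M), repeat=N):
        A[d] = baseline_network([discretizers[i] for i in d], weights, g)[y]
    return A

L, r, M = 2, 2, 3
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((r, r)), rng.standard_normal((r, r)))
           for _ in range(L)]
v = [rng.standard_normal(r) for _ in range(M)]
A0 = grid_tensor(0, weights, lambda a, b: a * b, v)   # order-4 grid tensor
print(A0.shape)                                        # (3, 3, 3, 3)
```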
Figure 2: Best viewed in color. Dilated convolutional networks (left) and the mode trees underlying their respective tensor decompositions (right). (a) Baseline architecture – dilation $2^{l-1}$ in layer $l$. (b) Architecture obtained by swapping dilations of even and odd layers.
The vectors $\mathbf{v}^{(1)} \ldots \mathbf{v}^{(M)}$ are referred to as discretizers. They generate the tensor $\mathcal{A}^y$ by assigning, in all possible combinations, the $N$ vector-variables of the function $f_y(\cdot)$. We refer to $\mathcal{A}^y$ as the grid tensor of $f_y(\cdot)$, reflecting the fact that it holds function values over a discrete grid.

The parameterization of $\{f_y(\cdot)\}_y$ discretizations mentioned above is in fact a hierarchical decomposition of the grid tensors $\{\mathcal{A}^y\}_y$. Accordingly, and for the sake of highlighting correspondence to the baseline dilated convolutional network (fig. 1), we refer to this parameterization as the baseline decomposition. For conciseness, we defer the derivation of the baseline decomposition to app. A, and hereby lay out its final form (the tensor $\phi^{l,j,\gamma}$ has order $2^l$):

$$\begin{aligned}
&\text{For } j = 1 \ldots N: &&\phi^{0,j,\gamma} = [v^{(1)}_\gamma,\ldots,v^{(M)}_\gamma]^\top \quad \forall \gamma \in [r_0]\\
&\text{For } l = 1 \ldots L,\; j = 1 \ldots N/2^l: &&\phi^{l,j,\gamma} = \Big(\textstyle\sum_{\alpha=1}^{r_{l-1}} a^{l,\gamma,I}_\alpha \cdot \phi^{l-1,2j-1,\alpha}\Big) \otimes_g \Big(\textstyle\sum_{\alpha=1}^{r_{l-1}} a^{l,\gamma,II}_\alpha \cdot \phi^{l-1,2j,\alpha}\Big) \quad \forall \gamma \in [r_l]\\
& &&\mathcal{A}^y = \phi^{L,1,y} \quad \forall y \in [r_L]
\end{aligned} \qquad (2)$$

$a^{l,\gamma,I}_\alpha$ and $a^{l,\gamma,II}_\alpha$ here stand for coordinate $\alpha$ of the convolution weights $\mathbf{a}^{l,\gamma,I}$ and $\mathbf{a}^{l,\gamma,II}$ respectively, while $v^{(i)}_\gamma$ stands for coordinate $\gamma$ of the discretizer $\mathbf{v}^{(i)}$. Notice that the tensor products here are generalized (see sec. 3) – based on the network's convolutional operator $g(\cdot)$. Therefore, strictly speaking, the baseline decomposition is a generalized tensor decomposition, as defined in Cohen and Shashua (2016).
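The recursion of eq. 2 transcribes directly into code. Here is a minimal NumPy sketch of ours (for $g(a,b)=a\cdot b$ it agrees with the brute-force grid tensor of the previous sketch):

```python
import numpy as np

def baseline_decomposition(weights, V, g=np.multiply):
    # eq. 2: weights[l-1] = (a_I, a_II) of shape (r_l, r_{l-1});
    # V: M-by-r0 matrix whose rows are the discretizers v^(1)..v^(M);
    # returns the grid tensors A^1..A^{r_L} stacked along axis 0
    def gtp(A, B):                    # generalized tensor product ⊗_g
        return g(A.reshape(A.shape + (1,) * B.ndim),
                 B.reshape((1,) * A.ndim + B.shape))
    N = 2 ** len(weights)
    # level 0: row gamma of phi[j] is [v^(1)_gamma, ..., v^(M)_gamma]
    phi = [V.T.copy() for _ in range(N)]          # each entry: (r_0, M)
    for a_I, a_II in weights:
        nxt = []
        for j in range(len(phi) // 2):
            left = np.tensordot(a_I, phi[2*j], axes=1)    # (r_l, M, ..)
            right = np.tensordot(a_II, phi[2*j + 1], axes=1)
            nxt.append(np.stack([gtp(left[c], right[c])
                                 for c in range(left.shape[0])]))
        phi = nxt
    return phi[0]                                  # shape (r_L, M, .., M)
```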
The baseline decomposition (eq. 2), corresponding to the baseline dilated convolutional network (fig. 1), implicitly adheres to a tree structure – for every $(l,j)$, there exists a group of tensors $\{\phi^{l,j,\gamma}\}_\gamma$, formed through combinations of tensors from its "child" groups $\{\phi^{l-1,2j-1,\gamma}\}_\gamma$ and $\{\phi^{l-1,2j,\gamma}\}_\gamma$. In this subsection we generalize the underlying tree structure, and show that the resulting decompositions capture networks with various dilations throughout their convolutional layers. We begin by defining a general (binary) tree over tensor modes:

Definition 1 Let $N \in \mathbb{N}$. A binary mode tree over $[N]$ is a full binary tree in which:

• Every node is labeled by a subset of $[N]$

• There are exactly $N$ leaves, labeled $\{1\} \ldots \{N\}$

• The label of an interior (non-leaf) node is the union of the labels of its children

If $T$ is a binary mode tree, we identify its nodes with their labels, i.e. with the corresponding subsets of $[N]$. The set of all interior nodes is denoted by $int(T) \subset 2^{[N]}$, the children of an interior node $\nu \subset [N]$ are denoted by $C_I(\nu;T), C_{II}(\nu;T) \subset [N]$, and the parent of a non-root node $\nu \subset [N]$ is denoted by $P(\nu;T)$. Notice that by definition, the root node is labeled $[N]$.

Binary mode trees induce hierarchical decompositions of grid tensors. Recall the definition of grid tensors in sec. 4.1 (eq. 1), and let $T$ be a binary mode tree over $[N]$. For every node $\nu \subset [N]$ in $T$, we define a collection of order-$|\nu|$ tensors $\{\phi^{\nu,\gamma}\}_{\gamma \in [r]}$, where $r \in \mathbb{N}$ is a predetermined constant, referred to as the size constant of the decomposition. We also define, for each interior node $\nu \in int(T)$, two collections of weight vectors – $\{\mathbf{a}^{\nu,\gamma,I}\}_{\gamma \in [r]} \subset \mathbb{R}^r$ and $\{\mathbf{a}^{\nu,\gamma,II}\}_{\gamma \in [r]} \subset \mathbb{R}^r$. The hierarchical grid tensor decomposition induced by $T$ traverses the tree in a depth-first fashion, assigning the tensors of node $\nu$ ($\{\phi^{\nu,\gamma}\}_\gamma$) through combinations of the tensors of its children ($\{\phi^{C_I(\nu;T),\gamma}\}_\gamma$ and $\{\phi^{C_{II}(\nu;T),\gamma}\}_\gamma$). This is laid out formally in eq. 3 below, which we refer to as the tree decomposition (the tensor $\phi^{\nu,\gamma}$ has order $|\nu|$):

$$\begin{aligned}
&\text{For } j = 1 \ldots N: &&\phi^{\{j\},\gamma} = [v^{(1)}_\gamma,\ldots,v^{(M)}_\gamma]^\top \quad \forall \gamma \in [r]\\
&\text{For } \nu \text{ in } int(T) \text{ (depth-first order):} &&\phi^{\nu,\gamma} = \sigma^{(\nu;T)}\Big(\Big(\textstyle\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha \cdot \phi^{C_I(\nu;T),\alpha}\Big) \otimes_g \Big(\textstyle\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha \cdot \phi^{C_{II}(\nu;T),\alpha}\Big)\Big) \quad \forall \gamma \in [r]\\
& &&\mathcal{A}^y = \phi^{[N],y} \quad \forall y \in [r]
\end{aligned} \qquad (3)$$

As in the baseline decomposition (eq. 2), $v^{(i)}_\gamma$ here stands for coordinate $\gamma$ of the discretizer $\mathbf{v}^{(i)}$. The permutation $\sigma^{(\nu;T)}(\cdot)$, for an interior node $\nu \in int(T)$, arranges the modes of the tensor $\phi^{\nu,\gamma}$ such that these comply with a sorted ordering of $\nu$. Specifically, if we denote by $i_1 < \cdots < i_{|C_I(\nu;T)|}$ the elements of $C_I(\nu;T) \subset [N]$, and by $j_1 < \cdots < j_{|C_{II}(\nu;T)|}$ the elements of $C_{II}(\nu;T) \subset [N]$, the permutation $\sigma^{(\nu;T)}: [|\nu|] \to [|\nu|]$ is the one that sorts the tuple $(i_1,\ldots,i_{|C_I(\nu;T)|},j_1,\ldots,j_{|C_{II}(\nu;T)|})$ in ascending order. The final outcome of the decomposition, i.e. the generated grid tensors $\{\mathcal{A}^y\}_y$, are the tensors $\{\phi^{[N],\gamma}\}_\gamma$ corresponding to the root of $T$.

Compare the general tree decomposition in eq. 3 to the baseline decomposition in eq. 2. It is not difficult to see that the latter is a special case of the former. Namely, it corresponds to a binary mode tree $T$ that is perfect (all leaves have the same depth $L = \log_2 N$), and whose depth-$l$ nodes ($l \in \{0,1,\ldots,L\}$) are $(k-1)N/2^l + [N/2^l]$ for $k \in [2^l]$. This implies that such a mode tree, when plugged into the tree decomposition (eq. 3), provides a characterization of the baseline dilated convolutional network (fig. 1), i.e. a network whose dilation in layer $l$ is $2^{l-1}$ (see illustration in fig. 2(a)). If we were to choose a different mode tree, the corresponding dilated convolutional network would change.
For example, assume that $L = \log_2 N$ is even, and consider a perfect binary mode tree $T$ whose depth-$l$ nodes ($l \in \{0,1,\ldots,L\}$) are as follows:

• Even $l$: depth-$l$ nodes are $(k-1)N/2^l + [N/2^l]$ for $k \in [2^l]$

• Odd $l$: depth-$l$ nodes are generated by splitting nodes of depth $l-1$, such that the first and third quadrants of a split node belong to one child, while the second and fourth belong to the other

In this case, the network characterized by the tree decomposition (eq. 3) is obtained by swapping dilations of even and odd layers in the baseline architecture, i.e. it has dilation in layer $l$ of $2^{l-2}$ if $l$ is even, and $2^l$ if $l$ is odd (see illustration in fig. 2(b)).
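To make the correspondence concrete, the following sketch (ours, purely illustrative) generates the node labels of the two mode trees of fig. 2 and recovers their shared nodes – which will serve as candidate mixture nodes in sec. 5:

```python
def baseline_tree(N):
    # node labels of the baseline mode tree (fig. 2(a)): contiguous blocks
    # {(k-1)*N/2**l + 1, ..., k*N/2**l} for every depth l and k in [2**l]
    nodes, size = set(), N
    while size >= 1:
        for start in range(0, N, size):
            nodes.add(frozenset(range(start + 1, start + size + 1)))
        size //= 2
    return nodes

def evenodd_tree(N):
    # node labels of the even/odd-swapped mode tree (fig. 2(b)): nodes at
    # even depth split quadrant-wise (quadrants 1,3 vs. 2,4), nodes at
    # odd depth split contiguously
    nodes = set()
    def split(nu, depth):
        nodes.add(frozenset(nu))
        if len(nu) == 1:
            return
        q = len(nu) // 4
        if depth % 2 == 0 and q >= 1:
            left, right = nu[:q] + nu[2*q:3*q], nu[q:2*q] + nu[3*q:]
        else:
            left, right = nu[:len(nu) // 2], nu[len(nu) // 2:]
        split(left, depth + 1)
        split(right, depth + 1)
    split(tuple(range(1, N + 1)), 0)
    return nodes

shared = baseline_tree(16) & evenodd_tree(16)
print(sorted(sorted(nu) for nu in shared if 1 < len(nu) < 16))
# -> the four contiguous size-4 blocks shared by the trees of fig. 2
```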
5. Mixed Tensor Decompositions
Let $T$ and $\bar{T}$ be two binary mode trees over $[N]$ (def. 1). Consider the tree decomposition of grid tensors induced by $T$ (eq. 3). This decomposition iteratively assigns a group of tensors $\{\phi^{\nu,\gamma}\}_\gamma$ for each node $\nu$ in $T$, based on weight vectors $\{\mathbf{a}^{\nu,\gamma,I}, \mathbf{a}^{\nu,\gamma,II}\}_\gamma$ defined for each interior node $\nu \in int(T)$. The tree decomposition induced by $\bar{T}$ operates similarly, but for distinction we use $\{\bar\phi^{\bar\nu,\gamma}\}_\gamma$ to denote the tensor group of node $\bar\nu \in \bar{T}$, and $\{\bar{\mathbf{a}}^{\bar\nu,\gamma,I}, \bar{\mathbf{a}}^{\bar\nu,\gamma,II}\}_\gamma$ to denote the weights of interior node $\bar\nu \in int(\bar{T})$. We will define a mixed tensor decomposition, blending together the tree decompositions of $T$ and $\bar{T}$. The latter is obtained by choosing a collection of mixture nodes – $mix(T,\bar{T}) \subset int(T) \cap int(\bar{T})$. These are nodes (subsets of $[N]$) that reside in the interior of both $T$ and $\bar{T}$, defining locations in the tree decompositions at which tensors will be exchanged. If $mix(T,\bar{T})$ is chosen as the empty set, the mixed decomposition simply sums the output tensors generated by the tree decompositions of $T$ and $\bar{T}$ ($\{\phi^{[N],y}\}_y$ and $\{\bar\phi^{[N],y}\}_y$ respectively). Otherwise, the tree decompositions of $T$ and $\bar{T}$ progress in parallel, until reaching a mixture node $\mu \in mix(T,\bar{T})$, where they exchange half the tensors corresponding to that node (half of $\{\phi^{\mu,\gamma}\}_\gamma$ is exchanged for half of $\{\bar\phi^{\mu,\gamma}\}_\gamma$). The process continues until all mixture nodes are visited and the root node (of both trees) $[N]$ is reached. At this point tensors ($\{\phi^{[N],y}\}_y$ and $\{\bar\phi^{[N],y}\}_y$) are summed and returned as output.

The formal definition of the mixed decomposition is as follows:

1: For $j = 1 \ldots N$:
2: $\quad \phi^{\{j\},\gamma} = \bar\phi^{\{j\},\gamma} = [v^{(1)}_\gamma,\ldots,v^{(M)}_\gamma]^\top \;\; \forall \gamma \in [r]$ ($r$ – decomposition size constant)
3: For $\mu$ in $mix(T,\bar{T}) \cup \{[N]\}$ (inclusion order):
4: $\quad$ For $\nu$ in $int(T) \cap 2^{\mu} \setminus \{$nodes in $T$ already visited$\}$ (inclusion order):
5: $\quad\quad \phi^{\nu,\gamma} = \sigma^{(\nu;T)}\big(\big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha \cdot \phi^{C_I(\nu;T),\alpha}\big) \otimes_g \big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha \cdot \phi^{C_{II}(\nu;T),\alpha}\big)\big) \;\; \forall \gamma \in [r]$
6: $\quad$ For $\bar\nu$ in $int(\bar{T}) \cap 2^{\mu} \setminus \{$nodes in $\bar{T}$ already visited$\}$ (inclusion order):
7: $\quad\quad \bar\phi^{\bar\nu,\gamma} = \sigma^{(\bar\nu;\bar{T})}\big(\big(\sum_{\alpha=1}^{r} \bar{a}^{\bar\nu,\gamma,I}_\alpha \cdot \bar\phi^{C_I(\bar\nu;\bar{T}),\alpha}\big) \otimes_g \big(\sum_{\alpha=1}^{r} \bar{a}^{\bar\nu,\gamma,II}_\alpha \cdot \bar\phi^{C_{II}(\bar\nu;\bar{T}),\alpha}\big)\big) \;\; \forall \gamma \in [r]$
8: $\quad$ Swap $\phi^{\mu,\gamma} \longleftrightarrow \bar\phi^{\mu,\gamma} \;\; \forall \gamma \in [r/2]$
9: $\mathcal{A}^y = \phi^{[N],y} + \bar\phi^{[N],y} \;\; \forall y \in [r]$
$\qquad$ (4)
As in the basic tree decomposition (eq. 3), the first step here (lines 1-2) is to assign tensors corresponding to the leaf nodes ($\{1\} \ldots \{N\}$) via discretizers $\mathbf{v}^{(1)} \ldots \mathbf{v}^{(M)}$. The outer loop in line 3 traverses $\mu$ through mixture nodes and the root node in inclusion order, i.e. such that a node (subset of $[N]$) is always reached after all nodes strictly contained in it. Lines 4-5 (respectively 6-7) are the same as in the tree decomposition (eq. 3), except that instead of running through the entire interior of $T$ (respectively $\bar{T}$), they cover a segment of it. This segment continues where the previous ones left off, and comprises only nodes (subsets of $[N]$) contained in $\mu$ (including $\mu$ itself). Line 8 is where the mixing takes place – here half the tensors corresponding to node $\mu$ in the decomposition of $T$ ($\{\phi^{\mu,\gamma}\}_\gamma$) are exchanged for half the tensors corresponding to $\mu$ in the decomposition of $\bar{T}$ ($\{\bar\phi^{\mu,\gamma}\}_\gamma$). Finally, after $\mu$ has reached the root node $[N]$ and the decompositions of $T$ and $\bar{T}$ have concluded, line 9 sums the output tensors of these decompositions ($\{\phi^{[N],y}\}_y$ and $\{\bar\phi^{[N],y}\}_y$ respectively), producing the grid tensors $\{\mathcal{A}^y\}_y$.

In terms of computation and memory, the requirements posed by the mixed decomposition (eq. 4) are virtually identical to those of running two separate tree decompositions (eq. 3) with $T$ and $\bar{T}$. Specifically, if the tree decompositions of $T$ and $\bar{T}$ correspond to input-output mappings computed by the dilated convolutional networks $\mathcal{N}$ and $\bar{\mathcal{N}}$ (respectively), the mixed decomposition would correspond to the computation of a mixed dilated convolutional network, formed by summing the outputs of $\mathcal{N}$ and $\bar{\mathcal{N}}$, and interconnecting their intermediate layers. The choice of mixture nodes $mix(T,\bar{T})$ in the mixed decomposition determines the locations at which the networks $\mathcal{N}$ and $\bar{\mathcal{N}}$ are interconnected, where an interconnection simply wires into $\mathcal{N}$ half the outputs of a convolutional layer in $\bar{\mathcal{N}}$, and vice versa. For example, suppose that $\mathcal{N}$ is the baseline dilated convolutional network (dilation $2^{l-1}$ in layer $l$ – see sec. 4.1), whereas $\bar{\mathcal{N}}$ is the network obtained by swapping dilations of even and odd layers (such that layer $l$ has dilation $2^{l-2}$ if $l$ is even, and $2^l$ if $l$ is odd). The mode trees corresponding to these networks, illustrated in fig. 2 (for the case $L := \log_2 N = 4$), share the interior nodes $(k-1)N/2^l + [N/2^l]$ for even depths $l$, $k \in [2^l]$. We may therefore choose $mix(T,\bar{T})$ to be all such nodes (excluding the root), and get a mixed decomposition that corresponds to a mixed network interconnecting all even layers of $\mathcal{N}$ and $\bar{\mathcal{N}}$. Illustrations of such a decomposition and network (again, for the case $L=4$) are given in fig. 3.

The main advantage of the mixed decomposition (eq. 4), and the reason for its definition, is that it leads to expressive efficiency. That is to say, the mixed dilated convolutional network, formed by interconnecting intermediate layers of networks with different dilations, can realize functions that without the interconnections would be expensive, or even impractical, to implement. We theoretically support this in the next section, providing a complete proof for a special case of convolutional arithmetic circuits ($g(a,b) = a \cdot b$).
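For readers who prefer code, here is a compact NumPy sketch of eq. 4 – our own illustration, with our own data-structure choices (trees as dicts mapping interior nodes to child pairs, inclusion order approximated by node size); it is a toy reference, not the networks' actual computation:

```python
import numpy as np

def mixed_decomposition(T, Tbar, mix, wT, wTbar, V, g=np.multiply):
    # T, Tbar: dicts {interior node (frozenset over [N]): (child_I, child_II)}
    # wT[nu], wTbar[nu]: pair of r-by-r weight matrices per interior node
    # V: M-by-r matrix of discretizers; returns grid tensors, shape (r, M,..,M)
    M, r = V.shape
    root = max(T, key=len)                        # the node labeled [N]

    def gtp(A, B):                                # generalized ⊗_g
        return g(A.reshape(A.shape + (1,) * B.ndim),
                 B.reshape((1,) * A.ndim + B.shape))

    def run_segment(phi, tree, w, mu, done):      # lines 4-5 / 6-7 of eq. 4
        todo = sorted((nu for nu in tree if nu <= mu and nu not in done),
                      key=len)                    # inclusion order
        for nu in todo:
            cI, cII = tree[nu]
            left = np.tensordot(w[nu][0], phi[cI], axes=1)   # (r, M, ..)
            right = np.tensordot(w[nu][1], phi[cII], axes=1)
            out = np.stack([gtp(left[c], right[c]) for c in range(r)])
            axes = np.argsort(sorted(cI) + sorted(cII))      # sigma^(nu)
            phi[nu] = out.transpose((0,) + tuple(1 + axes))
            done.add(nu)

    phi = {frozenset({j}): V.T.copy() for j in root}         # lines 1-2
    phibar = {nu: arr.copy() for nu, arr in phi.items()}
    done, donebar = set(), set()
    for mu in sorted(mix, key=len) + [root]:                 # line 3
        run_segment(phi, T, wT, mu, done)
        run_segment(phibar, Tbar, wTbar, mu, donebar)
        if mu != root:                                       # line 8: swap
            phi[mu][:r // 2], phibar[mu][:r // 2] = \
                phibar[mu][:r // 2].copy(), phi[mu][:r // 2].copy()
    return phi[root] + phibar[root]                          # line 9
```

Feeding this function the two trees of fig. 2, with their shared size-4 blocks as mixture nodes, mirrors the interconnected network of fig. 3.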
6. Expressive Efficiency Analysis
As in sec. 5, let $\mathcal{N}$ and $\bar{\mathcal{N}}$ be two dilated convolutional networks whose input-output mappings are characterized by the tree decomposition (eq. 3) with mode trees $T$ and $\bar{T}$ respectively. Consider the mixed decomposition (eq. 4) resulting from a particular choice of mixture nodes $mix(T,\bar{T})$ (subset of the nodes interior to both $T$ and $\bar{T}$), and denote its corresponding mixed dilated convolutional network by $\mathcal{M}$. We would like to show that $\mathcal{M}$ is expressively efficient w.r.t. $\mathcal{N}$ and $\bar{\mathcal{N}}$, meaning: (i) any function realized by $\mathcal{N}$ or $\bar{\mathcal{N}}$ can also be realized by $\mathcal{M}$ with no more than linear growth in network size (number of channels in the convolutional layers); (ii) there exist functions realizable by $\mathcal{M}$ that cannot be realized by $\mathcal{N}$ or $\bar{\mathcal{N}}$ (or a summation thereof) unless their size (number of convolutional channels) is allowed to grow super-linearly.

Figure 3: To be viewed in color. (a) Two mode trees $T$ and $\bar{T}$ (given on the right of fig. 2), along with a possible choice of mixture nodes $mix(T,\bar{T})$ for the mixed decomposition (eq. 4). (b) Mixed dilated convolutional network resulting from the mixed decomposition. The networks $\mathcal{N}$ and $\bar{\mathcal{N}}$ corresponding to $T$ and $\bar{T}$ respectively (fig. 2, left), are combined through output summation and rewiring of an intermediate convolutional layer (green).
We study the representational abilities of networks through their corresponding tensor decompositions, which as discussed in sec. 4, parameterize discretizations of input-output mappings (grid tensors). Through the lens of tensor decompositions, our objective is to address the following two propositions (stated informally):
Proposition 2 Consider a tree decomposition (eq. 3) with underlying mode tree $T$ or $\bar{T}$ and size constant $r = r_{tree}$. This decomposition can be realized by a mixed decomposition of $T$ and $\bar{T}$ (eq. 4) whose size constant $r$ is linear in $r_{tree}$.
Proposition 3 Consider a mixed decomposition of $T$ and $\bar{T}$ (eq. 4) with size constant $r = r_{mix}$. This decomposition can generate grid tensors $\{\mathcal{A}^y\}_y$ that cannot be generated by tree decompositions of $T$ or $\bar{T}$ (eq. 3), or a summation of such, unless their size constant $r$ is super-linear in $r_{mix}$.

Before heading to a formal treatment of prop. 2 and 3, we briefly convey the intuition behind our analysis. Recall from sec. 5 that the mixed decomposition (eq. 4) blends together tree decompositions (eq. 3) of different mode trees $T$ and $\bar{T}$, by traversing upwards through the trees, while exchanging tensors at each of a preselected set of mixture nodes. We may think of each mixture node as a decision point that can propagate upwards one of two computations – that carried out by $T$ or that carried out by $\bar{T}$, where in both cases, the chosen computation is propagated upwards through both $T$ and $\bar{T}$. Each combination of decisions across all mixture nodes gives rise to a computational path traversing between $T$ and $\bar{T}$, equivalent to a tree decomposition based on a hybrid mode tree (see illustration in fig. 4). The number of possible hybrid trees is exponential in the number of mixture nodes, and thus a mixed decomposition is comparable to an exponential ensemble of tree decompositions. The original tree decompositions, based on $T$ and $\bar{T}$, are included in the ensemble, thus may easily be replicated by the mixed decomposition. On the other hand, many of the hybrid trees in the mixed decomposition are significantly different from $T$ and $\bar{T}$, requiring large size constants from their tree decompositions.

As a first step in formalizing the above intuition, we define the notion of a hybrid mode tree:
Definition 4 Let $T$ and $\bar{T}$ be binary mode trees over $[N]$ (def. 1), and let $mix(T,\bar{T})$ be a corresponding collection of mixture nodes, i.e. a set of nodes (subsets of $[N]$) contained in the interior of both $T$ and $\bar{T}$. We say that $H$ is a hybrid mode tree of $T$ and $\bar{T}$ w.r.t. $mix(T,\bar{T})$, if it is a binary mode tree over $[N]$, whose interior may be generated by the following process:

$int(H) = \emptyset$
For $\mu$ in $mix(T,\bar{T}) \cup \{[N]\}$ (inclusion order):
$\quad S = int(T) \cap 2^{\mu} \setminus \{$nodes in $T$ already assigned to $S\}$
$\quad \bar{S} = int(\bar{T}) \cap 2^{\mu} \setminus \{$nodes in $\bar{T}$ already assigned to $\bar{S}\}$
$\quad int(H) = int(H) \cup S$ $\;$ or $\;$ $int(H) = int(H) \cup \bar{S}$

In words, for every $\mu$ that is either a mixture node or the root node, $int(H)$ includes a segment from either $int(T)$ or $int(\bar{T})$, where the segment comprises all descendants of $\mu$ (including $\mu$ itself) from which the path to $\mu$ does not cross any other mixture node (see illustration in fig. 4).

Claim 5 below states that with proper weight setting, a mixed decomposition of $T$ and $\bar{T}$ (eq. 4) with size constant $r = r_{mix}$ can realize any tree decomposition (eq. 3) with size constant $r = r_{mix}/2$, if the underlying mode tree is a hybrid of $T$ and $\bar{T}$. Since $T$ and $\bar{T}$ are in particular hybrid mode trees of themselves, we obtain an affirmative answer to prop. 2.
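The generative process of def. 4 is easy to mechanize. The sketch below (our own illustration; it reuses baseline_tree and evenodd_tree from the sketch in sec. 4) enumerates the interiors of all hybrid mode trees for given mixture nodes:

```python
from itertools import product

def segment(interior, mu, mix):
    # descendants of mu (incl. mu) whose path up to mu crosses no other
    # mixture node: nu ⊆ mu, and no mixture node m with nu ⊆ m ⊊ mu
    return {nu for nu in interior
            if nu <= mu and not any(nu <= m and m < mu for m in mix)}

def hybrid_interiors(int_T, int_Tbar, mix, root):
    # one binary choice (take the segment from T or from Tbar) per anchor,
    # where the anchors are the mixture nodes plus the root
    anchors = sorted(mix, key=len) + [root]
    options = [(segment(int_T, mu, mix), segment(int_Tbar, mu, mix))
               for mu in anchors]
    for pick in product((0, 1), repeat=len(anchors)):
        yield frozenset().union(*(options[i][p] for i, p in enumerate(pick)))

N = 16
root = frozenset(range(1, N + 1))
int_T = {nu for nu in baseline_tree(N) if len(nu) > 1}   # sec. 4 sketch
int_Tb = {nu for nu in evenodd_tree(N) if len(nu) > 1}
mix = {nu for nu in int_T & int_Tb if len(nu) == 4}      # fig. 3(a) choice
print(len(set(hybrid_interiors(int_T, int_Tb, mix, root))))  # 2**5 = 32
```

With four mixture nodes plus the root, there are $2^5 = 32$ hybrid trees – illustrating the exponential-ensemble intuition above.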
Claim 5 Let $T$ and $\bar{T}$ be binary mode trees over $[N]$ (def. 1), and let $mix(T,\bar{T})$ be a corresponding collection of mixture nodes (a set of nodes contained in the interior of both $T$ and $\bar{T}$). Consider a mixed decomposition of $T$ and $\bar{T}$ w.r.t. $mix(T,\bar{T})$ (eq. 4), and denote its size constant $r$ by $r_{mix}$. Let $H$ be a hybrid mode tree of $T$ and $\bar{T}$ w.r.t. $mix(T,\bar{T})$ (def. 4), and consider the respective tree decomposition (eq. 3), with size constant $r = r_{mix}/2$. For any setting of weights $\{\mathbf{a}^{\nu,\gamma,I}, \mathbf{a}^{\nu,\gamma,II}\}_{\nu,\gamma}$ leading to grid tensors $\{\mathcal{A}^y\}_y$ in this tree decomposition, there exists a setting of weights $\{\mathbf{a}^{\nu,\gamma,I}, \mathbf{a}^{\nu,\gamma,II}\}_{\nu,\gamma}$ and $\{\bar{\mathbf{a}}^{\bar\nu,\gamma,I}, \bar{\mathbf{a}}^{\bar\nu,\gamma,II}\}_{\bar\nu,\gamma}$ in the mixed decomposition, independent of the discretizers $\mathbf{v}^{(1)} \ldots \mathbf{v}^{(M)}$ (see sec. 4), that leads to the same grid tensors.

Figure 4: Best viewed in color. (a) Two mode trees $T$ and $\bar{T}$ along with a possible choice of mixture nodes (same as in fig. 3(a)). (b) Sample of the resulting hybrid mode trees (def. 4). Each hybrid tree is a combination of segments from $T$ and $\bar{T}$, where a segment comprises all non-leaf descendants of a certain root node or mixture node, from which the path to that node does not cross any other mixture node.

Proof [sketch – complete proof is given in app. B.1] We prove the claim constructively, by assigning weights to the mixed decomposition of $T$ and $\bar{T}$ such that it mimics the tree decomposition of $H$. For distinction, we add the relevant mode tree to our notation of weights, letting $\{\mathbf{a}^{T,\nu,\gamma,I}, \mathbf{a}^{T,\nu,\gamma,II} \in \mathbb{R}^{r_{mix}}\}_{\nu \in int(T), \gamma \in [r_{mix}]}$ and $\{\mathbf{a}^{\bar{T},\nu,\gamma,I}, \mathbf{a}^{\bar{T},\nu,\gamma,II} \in \mathbb{R}^{r_{mix}}\}_{\nu \in int(\bar{T}), \gamma \in [r_{mix}]}$ stand for the weights in the mixed decomposition of $T$ and $\bar{T}$, while $\{\mathbf{a}^{H,\nu,\gamma,I}, \mathbf{a}^{H,\nu,\gamma,II} \in \mathbb{R}^{r_{mix}/2}\}_{\nu \in int(H), \gamma \in [r_{mix}/2]}$ represent the weights in the tree decomposition of $H$. We also denote by $t(\nu)$, for each node $\nu \in int(H)$, the source tree ($T$ or $\bar{T}$) from which it originated (see def. 4).

The assignment of $\{\mathbf{a}^{T,\nu,\gamma,I}, \mathbf{a}^{T,\nu,\gamma,II}\}_{\nu,\gamma}$ and $\{\mathbf{a}^{\bar{T},\nu,\gamma,I}, \mathbf{a}^{\bar{T},\nu,\gamma,II}\}_{\nu,\gamma}$ proceeds as follows. All weights are initialized with zeros. Afterwards, for each $\nu \in int(H)$, $\{\mathbf{a}^{H,\nu,\gamma,I}, \mathbf{a}^{H,\nu,\gamma,II}\}_\gamma$ (weights of $\nu$ in the tree decomposition of $H$) are copied into $\{\mathbf{a}^{t(\nu),\nu,\gamma,I}, \mathbf{a}^{t(\nu),\nu,\gamma,II}\}_\gamma$ (weights of the respective node at the respective source tree in the mixed decomposition of $T$ and $\bar{T}$). There are twice as many vectors in $\{\mathbf{a}^{t(\nu),\nu,\gamma,I}, \mathbf{a}^{t(\nu),\nu,\gamma,II}\}_\gamma$ as there are in $\{\mathbf{a}^{H,\nu,\gamma,I}, \mathbf{a}^{H,\nu,\gamma,II}\}_\gamma$, and each vector is of twice the dimension. The choices of which vectors to use, and which coordinates to use in the selected vectors, are made such that computations traverse properly between trees. Specifically, if $P(\nu;H)$ (parent of $\nu$ in $H$) originated from the same tree ($T$ or $\bar{T}$) as $\nu$, higher index vectors ($r_{mix}/2 < \gamma \leq r_{mix}$) in $\{\mathbf{a}^{t(\nu),\nu,\gamma,I}, \mathbf{a}^{t(\nu),\nu,\gamma,II}\}_\gamma$ are used, as they correspond to computations (tensors) that are not exchanged between trees (see eq. 4). On the other hand, if $P(\nu;H)$ originated from the tree opposite to $t(\nu)$, lower index vectors ($1 \leq \gamma \leq r_{mix}/2$) in $\{\mathbf{a}^{t(\nu),\nu,\gamma,I}, \mathbf{a}^{t(\nu),\nu,\gamma,II}\}_\gamma$ are used, in accordance with the fact that they realize computations (tensors) that will be transferred between trees. The second aforementioned choice – which coordinates to use in the selected vectors – is made analogously, based on the trees from which $C_I(\nu;H)$ and $C_{II}(\nu;H)$ (children of $\nu$ in $H$) came from.
Figure 5: Mode tree $T$ along with a specific index set $\mathcal{I}$ and the resulting tiling $\Theta(\mathcal{I};T)$ (def. 6). In the depicted example, $\mathcal{I} = \{3,4,5,6,7,8,9,11,12\}$ and $\Theta(\mathcal{I};T) = \{\{3,4\},\{5,6,7,8\},\{9\},\{11,12\}\}$.

Claim 5 not only addresses prop. 2, but also brings forth a strategy for treating prop. 3. The strategy is to find a hybrid mode tree $H$ distinct enough from $T$ and $\bar{T}$, such that its tree decomposition, easily realized by the mixed decomposition according to claim 5, poses a significant challenge for the individual tree decompositions of $T$ and $\bar{T}$. Hereinafter we pursue this line of reasoning, focusing on the particular case where the convolutional operator $g(\cdot)$ is a simple product – $g(a,b) = a \cdot b$. In this case the tree and mixed decompositions (eq. 3 and 4 respectively) are standard (non-generalized) tensor decompositions ($\otimes_g \equiv \otimes$ – see sec. 3), and the corresponding dilated convolutional networks are convolutional arithmetic circuits. We focus on this special case since it allows the use of a plurality of algebraic tools for theoretical analysis, while at the same time corresponding to models showing promising results in practice (see for example Cohen et al. (2016a); Sharir et al. (2016)). Full treatment of additional cases, such as $g(a,b) = \max\{a+b, 0\}$, corresponding to networks with ReLU activation, is left for future work.

For establishing the difficulty experienced by the tree decompositions of $T$ and $\bar{T}$ in replicating that of a hybrid tree $H$, we analyze ranks of matricized grid tensors. Specifically, we consider the tree decomposition (eq. 3) of a general mode tree, and derive tight upper and lower bounds on the ranks of generated grid tensors when these are subject to matricization w.r.t. a general index set $\mathcal{I} \subset [N]$ (see sec. 3). The bounds we derive (theorem 7 below) highly depend on both the underlying mode tree and the index set, and this allows finding index sets for which ranks tend to be higher with the hybrid mode tree $H$ than they are with the original mode trees $T$ and $\bar{T}$. Under such index sets, the only way for $T$ and $\bar{T}$ to match ranks generated with $H$ is through a significant increase in the size constant of their tree decompositions – precisely the sought-after result.

To crisply phrase our main theorem, we define the notion of an index set tiled by a mode tree:
Definition 6 Let $T$ be a binary mode tree over $[N]$ (def. 1), and let $\mathcal{I} \subset [N]$ be a non-empty set of indexes. A tiling of $\mathcal{I}$ by $T$ is a collection of nodes in the tree, denoted $\Theta(\mathcal{I};T)$, which meets the following requirements:

• $\bigcup_{\nu \in \Theta(\mathcal{I};T)} \nu = \mathcal{I}$

• $\nu \in \Theta(\mathcal{I};T) \implies P(\nu;T) \not\subset \mathcal{I}$

In words, $\Theta(\mathcal{I};T)$ is a set of nodes in $T$ whose disjoint union gives $\mathcal{I}$, where each node is maximal, i.e. its parent in the tree is not a subset of $\mathcal{I}$ (see illustration in fig. 5).

It is not difficult to see that for any mode tree $T$ and non-empty index set $\mathcal{I}$, the tiling $\Theta(\mathcal{I};T)$ exists and is determined uniquely. As theorem 7 below states, this tiling, along with that of $\mathcal{I}$'s complement ($\mathcal{I}^c := [N] \setminus \mathcal{I}$), characterizes the ranks of grid tensors generated by the tree decomposition of $T$ when these are matricized w.r.t. $\mathcal{I}$.
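Computing the tiling is a simple top-down descent. The sketch below (ours; tree representation as in the earlier sketches) reproduces the example of fig. 5:

```python
def tiling(I, children, root):
    # Θ(I;T) of def. 6: `children` maps each interior node (frozenset)
    # to its two children. A node is kept exactly when it is contained
    # in I while its parent was not, yielding the unique maximal cover.
    I = frozenset(I)
    out, todo = set(), [root]
    while todo:
        nu = todo.pop()
        if nu <= I:
            out.add(nu)
        elif nu & I:                      # partial overlap: descend
            todo.extend(children[nu])
    return out

def baseline_children(N):
    # child map of the baseline mode tree (contiguous halving)
    ch, size = {}, N
    while size >= 2:
        for s in range(0, N, size):
            ch[frozenset(range(s + 1, s + size + 1))] = (
                frozenset(range(s + 1, s + size // 2 + 1)),
                frozenset(range(s + size // 2 + 1, s + size + 1)))
        size //= 2
    return ch

ch = baseline_children(16)
theta = tiling({3, 4, 5, 6, 7, 8, 9, 11, 12}, ch, frozenset(range(1, 17)))
print(sorted(map(sorted, theta)))
# [[3, 4], [5, 6, 7, 8], [9], [11, 12]] – matches fig. 5
```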
Theorem 7 Let $T$ be a binary mode tree over $[N]$ (def. 1), and consider the corresponding tree decomposition (eq. 3) with discretizers $\mathbf{v}^{(1)} \ldots \mathbf{v}^{(M)}$ spanning $\mathbb{R}^r$. Assume that $g(\cdot)$ is the product operator ($g(a,b) = a \cdot b$), and suppose the generated grid tensors $\{\mathcal{A}^y\}_y$ are matricized (see sec. 3) w.r.t. an index set $\mathcal{I} \subset [N]$, $\emptyset \neq \mathcal{I} \neq [N]$, whose complement we denote by $\mathcal{I}^c := [N] \setminus \mathcal{I}$. Then, the ranks of the grid tensor matricizations $\{[\![\mathcal{A}^y]\!]_\mathcal{I}\}_y$ are:

• no greater than $r^{\min\{|\Theta(\mathcal{I};T)|, |\Theta(\mathcal{I}^c;T)|\}}$

• at least $r^{|\{(\nu_1,\nu_2) \in \Theta(\mathcal{I};T) \times \Theta(\mathcal{I}^c;T):\; \nu_1 \text{ and } \nu_2 \text{ are siblings in } T \text{ with depth} > 1\}|}$ almost always, i.e. for all configurations of weights $\{\mathbf{a}^{\nu,\gamma,I}, \mathbf{a}^{\nu,\gamma,II}\}_{\nu,\gamma}$ but a set of Lebesgue measure zero

Proof [sketch – complete proof is given in app. B.2] The proof proceeds in three stages. In the first stage we matricize the tree decomposition of $T$, i.e. transform it from a tensor decomposition generating $\{\mathcal{A}^y\}_y$ to a matrix decomposition generating $\{[\![\mathcal{A}^y]\!]_\mathcal{I}\}_y$. In this transformation, instances of the tensor product $\otimes$ convert to a Kronecker product $\odot$. The second stage of the proof establishes the upper bound stated in the theorem, by showing that for each $y$, $[\![\mathcal{A}^y]\!]_\mathcal{I}$ is equal to a product of matrices, one of which has size $r^{|\Theta(\mathcal{I};T)|}$-by-$r^{|\Theta(\mathcal{I}^c;T)|}$. The key idea in this stage is the propagation of elements out of the matrix decomposition, using the relation $(AA') \odot (BB') = (A \odot B)(A' \odot B')$. The third and final stage of the proof establishes the lower bound stated in the theorem. Here again, elements are propagated out of the matrix decomposition, allowing the construction of a concrete configuration of weights ($\{\mathbf{a}^{\nu,\gamma,I}, \mathbf{a}^{\nu,\gamma,II}\}_{\nu,\gamma}$) for which the lower bound holds. The fact that the lower bound holds almost always is then a direct corollary of app. C, where it is shown that the tree decomposition admits maximal matricization ranks almost always when $g(\cdot)$ is the product operator.

The study of matricization ranks under hierarchical tensor decompositions is of significant interest, particularly in the context of deep learning. Cohen et al. (2016b) proved the lower bound in the theorem for the specific case where $T$ is the mode tree corresponding to the baseline dilated convolutional network (see fig. 2(a)), and $\mathcal{I} = \{1,3,5,\ldots,N-1\}$. The result was used to establish exponential expressive efficiency of deep convolutional arithmetic circuits w.r.t. shallow ones. Cohen and Shashua (2017) later extended the analysis by deriving upper bounds for arbitrary index sets $\mathcal{I}$, using them to study the ability of deep convolutional arithmetic circuits to model correlations among regions of their input. The bounds used in Cohen and Shashua (2017) were loose, and in fact trivial for many choices of index sets $\mathcal{I}$. We here treat arbitrary mode trees $T$ and index sets $\mathcal{I}$, proving upper and lower bounds that are tight, oftentimes exact. Such tight bounds are necessary for identifying expressive efficiency that is not exponential, as we do in this paper. The key to deriving the bounds is the aforementioned idea of propagating elements out of a matrix decomposition.
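Theorem 7 is easy to probe numerically. The snippet below (ours; it reuses baseline_decomposition from the sketch in sec. 4 and matricize from the sketch in sec. 3) checks the baseline tree with the odd-index set, where both tilings consist of singletons and every tile has its sibling in the complement's tiling, so the bound $r^{N/2}$ should be attained:

```python
import numpy as np
# empirical sanity check of theorem 7 for the baseline mode tree
L, r, M = 3, 2, 2                    # N = 8, I = odd time points
rng = np.random.default_rng(1)
weights = [(rng.standard_normal((r, r)), rng.standard_normal((r, r)))
           for _ in range(L)]
V = rng.standard_normal((M, r))      # M discretizers spanning R^r
A0 = baseline_decomposition(weights, V, g=np.multiply)[0]
mat = matricize(A0, [0, 2, 4, 6])    # 0-indexed odd modes
print(mat.shape, np.linalg.matrix_rank(mat))
# expected: (16, 16) 16 = r**(N/2), almost surely for random weights
```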
As stated previously, given two binary mode trees over $[N]$ (def. 1) – $T$ and $\bar{T}$ – with a corresponding collection of mixture nodes $mix(T,\bar{T})$ (set of nodes interior to both $T$ and $\bar{T}$), the bounds in theorem 7 can be used to find an index set $\mathcal{I} \subset [N]$ and a hybrid mode tree $H$ (def. 4), such that the tree decomposition (eq. 3) of $H$ generates grid tensors whose ranks under matricization w.r.t. $\mathcal{I}$ are much higher than those brought forth by the tree decompositions of $T$ and $\bar{T}$. Consider our exemplar mode trees illustrated in fig. 2. Specifically, let $T$ be the mode tree corresponding to the baseline dilated convolutional network (dilation $2^{l-1}$ in layer $l \in [L] = [\log_2 N]$ – see sec. 4.1), and let $\bar{T}$ be the mode tree corresponding to the network obtained by swapping dilations of even and odd layers (such that layer $l$ has dilation $2^{l-2}$ if $l$ is even, and $2^l$ if $l$ is odd). As described in sec. 4.2, $T$ is a perfect binary tree whose depth-$l$ nodes, $l \in \{0,1,\ldots,L\}$, are $(k-1)N/2^l + [N/2^l]$ for $k \in [2^l]$. $\bar{T}$ is also perfect and has the same even-depth nodes, but its odd-depth nodes differ – they are generated by splitting parents into children holding non-contiguous quadrants. Suppose we choose $mix(T,\bar{T})$ to include the set of nodes in $T$ and $\bar{T}$ whose depth is $L-2$, and consider the hybrid mode tree $H$ formed by taking the segments (see def. 4) of the first half of these nodes from $T$, and the rest of the tree from $\bar{T}$. An illustration of $T$, $\bar{T}$ and $H$ in this setting, for the case $L=4$, is given in fig. 6. Now, let the index set $\mathcal{I}$ consist of every second index in $[N/2]$, and every second pair of indexes in $\{N/2+1,\ldots,N\}$, i.e. $\mathcal{I} := \{2k-1 : k \in [N/4]\} \cup \{N/2+4k-k' : k \in [N/8], k' = 2,3\}$. As illustrated in fig. 6, the mode tree $T$ tiles (see def. 6) the lower half of $\mathcal{I}$ into singletons, and its upper half into pairs. The same applies to $T$'s tiling of $\mathcal{I}$'s complement $\mathcal{I}^c := [N] \setminus \mathcal{I}$. Moreover, for every node in the former tiling $\Theta(\mathcal{I};T)$, there exists a sibling in the latter $\Theta(\mathcal{I}^c;T)$ (and vice versa).

Figure 6: Best viewed in color. Two mode trees $T$ and $\bar{T}$ with a possible choice of mixture nodes (same as in fig. 3(a) and 4(a)), along with a particular formed hybrid tree $H$. An index set $\mathcal{I}$ and its complement $\mathcal{I}^c$ are tiled into more pieces by $H$ than they are by $T$ and $\bar{T}$ (in the depicted $N=16$ case, $\Theta(\mathcal{I};T) = \{\{1\},\{3\},\{5\},\{7\},\{9,10\},\{13,14\}\}$, $\Theta(\mathcal{I};\bar{T}) = \{\{1,3\},\{5,7\},\{9\},\{10\},\{13\},\{14\}\}$, while $\Theta(\mathcal{I};H)$ consists of singletons only, and analogously for $\mathcal{I}^c$), leading the former to generate grid tensors with higher matricization ranks (theorem 7).
By theorem 7, this implies that the tree decomposition of $T$ generates grid tensors whose matricizations w.r.t. $I$ have rank $r^{N/4+N/8}$. A similar situation occurs with the mode tree $\bar T$, under which $I$ and $I^c$ are tiled into pairs in their lower halves and into singletons in their top halves (see illustration in fig. 6). This also leads to matricized grid tensors of rank $r^{N/4+N/8}$. On the other hand, the hybrid mode tree $H$ tiles $I$ and $I^c$ entirely into singletons (see illustration in fig. 6), leading (by theorem 7) to grid tensor matricization ranks of $r^{N/2}$. This means that if we were to replicate grid tensors generated by the tree decomposition of $H$ using those of $T$ or $\bar T$ (or a summation thereof), we would need to increase the size constant $r$ super-linearly – by a power of $4/3$ (at least).

The above example can be generalized, by considering swapping the dilations of more than two layers at once. In particular, if $T$ is the mode tree corresponding to the baseline dilated convolutional network (dilation $2^{l-1}$ in layer $l$), $\bar T$ is the mode tree corresponding to the network obtained by swapping dilations of groups of $k$ layers (dilation $2^{\lceil l/k \rceil \cdot k - 1 - ((l-1) \bmod k)}$ in layer $l$), and the set of mixture nodes includes all nodes of depth $L-k$, a hybrid mode tree $H$ and an index set $I$ can be found, such that the tree decomposition of $H$ generates grid tensors whose ranks when matricized w.r.t. $I$ can only be matched by the tree decompositions of $T$ and $\bar T$ if their size constant $r$ is increased by a power of $2/(1+2^{1-k})$. Since the mixed decomposition of $T$ and $\bar T$ (eq. 4) can realize the tree decomposition of $H$ with double the size constant (claim 5), we conclude that it can, with size constant $r$, generate grid tensors whose matricization ranks (w.r.t. $I$) require the tree decompositions of $T$ and $\bar T$ to have size constant $r^{2/(1+2^{1-k})}$ – super-linearly larger. Therefore, in this particular setting, prop. 3 holds and the mixed decomposition of $T$ and $\bar T$ is indeed expressively efficient w.r.t. their tree decompositions. Taking into account the fact that the mixed decomposition admits maximal matricization ranks almost always when $g(\cdot)$ is the product operator (see app. C), we formalize the result in network terms:

Corollary 8
Let $\mathcal{N}$ be the baseline dilated convolutional network (dilation $2^{l-1}$ in layer $l$ – see sec. 4.1), and let $\bar{\mathcal{N}}$ be the network obtained by swapping dilations of groups of $k$ layers (dilation $2^{\lceil l/k \rceil \cdot k - 1 - ((l-1) \bmod k)}$ in layer $l$). Denote by $\mathcal{M}$ the mixed dilated convolutional network obtained by summing the outputs of $\mathcal{N}$ and $\bar{\mathcal{N}}$, while interconnecting their $k$'th intermediate layer (and possibly additional layers). Assume the networks' convolutional operator $g(\cdot)$ is a product. Then, besides a negligible set, all functions realized by $\mathcal{M}$ with $r$ channels in the layers of each interconnected network cannot be realized by $\mathcal{N}$ or $\bar{\mathcal{N}}$ (or a summation thereof) if the number of channels in each layer is less than $(r/2)^{2/(1+2^{1-k})}$.

Corollary 8 (along with claim 5) demonstrates that interconnecting intermediate layers of different dilated convolutional networks can bring forth expressive efficiency. That is to say, through cross-connections between networks, we are able to represent functions that would otherwise be expensive, or even impractical, to implement. The lower bound in corollary 8 – $(r/2)^{2/(1+2^{1-k})}$ – is essentially quadratic for moderately large $k$: the exponent exceeds $3/2$ already at $k = 3$, and tends to $2$ as $k$ grows. For example, if $k = 4$ (exponent $16/9$) and the number of channels $r$ in each interconnected network is $100$, the lower bound would imply that in order to maintain representational abilities with an individual network (or a summation of the networks), over $1000$ channels in each layer are required – far beyond acceptable practice in deep learning.
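For a sense of scale, the following snippet (a quick numeric illustration of ours, not from the paper) evaluates the corollary's lower bound:

```python
# The exponent 2 / (1 + 2**(1 - k)) climbs from 1 (k = 1, no swap) towards 2,
# so the channel requirement on an individual network grows almost quadratically in r.
for k in (1, 2, 3, 4, 5):
    exponent = 2 / (1 + 2 ** (1 - k))
    for r in (100, 1000):
        print(f"k={k}  exponent={exponent:.3f}  r={r}  bound={(r / 2) ** exponent:,.0f}")
```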
7. Experiment
To assess the practical implications of the expressive efficiency brought forth by mixing dilated convolutional networks, a simple experiment was conducted. We trained a baseline dilated convolutional network $\mathcal{N}$ (dilation $2^{l-1}$ in layer $l \in [L]$ – see sec. 4.1) with architectural parameters similar to those used in WaveNet (van den Oord et al. (2016)), to classify individual phonemes in the TIMIT acoustic speech corpus (Garofolo et al. (1993)). In addition to the baseline model, we also trained a companion network $\bar{\mathcal{N}}$ obtained by swapping dilations of even and odd layers (such that layer $l$ has dilation $2^{l-2}$ if $l$ is even, and $2^l$ if $l$ is odd). As discussed in sec. 5, the mode trees corresponding to these networks (illustrated in fig. 2) – $T$ and $\bar T$ – share interior nodes of even depth, thus any subset of those nodes may serve as mixture nodes for a mixed decomposition (eq. 4). We evaluate mixed dilated convolutional networks $\mathcal{M}$ corresponding to different choices of mixture nodes (see fig. 3 for illustration of a particular case). Specifically, we consider choices of the following form:

$$\mathrm{mix}(T, \bar T) := \{\nu \in int(T) \cap int(\bar T) : \text{depth of } \nu \text{ (in } T \text{ and } \bar T\text{)} \geq \text{threshold}\}$$

Varying the threshold yields mixed networks with a varying number of interconnections. In the extreme case $\mathrm{mix}(T, \bar T) = \emptyset$ (high threshold), $\mathcal{M}$ simply sums the outputs of $\mathcal{N}$ and $\bar{\mathcal{N}}$.
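The following PyTorch sketch conveys the architecture being evaluated; it is our own illustrative re-implementation, not the paper's Caffe code, and the interconnection is simplified to summing the two hidden sequences (the class name, layer count and channel width are ours):

```python
import torch
import torch.nn as nn

class MixedDilatedNet(nn.Module):
    """Two stacks of size-2 dilated convolutions -- baseline dilations 2**(l-1)
    and the even/odd-swapped variant -- mixed at chosen hidden layers and summed
    at the output. num_layers is assumed even so the dilation swap is well defined."""
    def __init__(self, channels=16, num_layers=6, mix_layers=(2, 4)):
        super().__init__()
        base = [2 ** (l - 1) for l in range(1, num_layers + 1)]
        swap = [base[i + 1] if i % 2 == 0 else base[i - 1] for i in range(num_layers)]
        self.convs_a = nn.ModuleList(nn.Conv1d(channels, channels, 2, dilation=d) for d in base)
        self.convs_b = nn.ModuleList(nn.Conv1d(channels, channels, 2, dilation=d) for d in swap)
        self.mix_layers = set(mix_layers)

    def forward(self, x):
        a = b = x
        for l, (ca, cb) in enumerate(zip(self.convs_a, self.convs_b), start=1):
            a, b = torch.relu(ca(a)), torch.relu(cb(b))
            n = min(a.size(-1), b.size(-1))     # keep the two streams time-aligned
            a, b = a[..., -n:], b[..., -n:]
            if l in self.mix_layers:            # interconnection between hidden layers
                a, b = a + b, a + b
        return a + b                            # output summation

out = MixedDilatedNet()(torch.randn(1, 16, 128))
print(out.shape)
```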
[Figure 7 (plot omitted): TIMIT individual phoneme classification – accuracy (y-axis) against connections up to layer (x-axis), with train-set and validation-set curves.]
Figure 7: Experimental results – increasing the number of interconnections between hidden layers of different dilated convolutional networks improves accuracy, with no additional cost in computation or model capacity.

As the threshold decreases, interconnections between hidden layers are added – starting from hidden layer 2, then including hidden layer 4, and so on. The intuition from our analysis (sec. 6) is that additional interconnections result in a larger ensemble of hybrid mode trees, which in turn boosts the expressive power of the mixed network $\mathcal{M}$. As fig. 7 shows, this intuition indeed complies with the results in practice – classification accuracy improves as we increase the number of interconnections, without any additional cost in terms of computation or model capacity.

It is important to stress that our objective in the experiment was to evaluate, in the most controlled setting possible, the exact models covered by our analysis. We did not compare to state of the art results, as all phoneme recognition rates reported in the literature deviate from our basic setting – they heavily rely on data pre-processing (e.g.
Mel-Frequency Cepstral Coefficients), prediction post-processing (e.g.
Conditional Random Fields), or both. The recent DeepLab model (Chen et al. (2016)) has demonstrated that when combined with other techniques, mixing dilated convolutions can lead to state of the art image segmentation performance. We are currently pursuing similar results in the context of sequence processing tasks.

To conclude this section, we briefly convey implementation details behind the experiment. The TIMIT dataset is an acoustic-phonetic corpus comprising 6300 sentences manually labeled at the phoneme level. We split the data into train and validation sets in accordance with Halberstadt (1998), and as advised by Lee and Hon (1989), mapped the 61 possible phoneme labels into 39, plus an additional "garbage" label. The task was then to classify individual phonemes into one of the latter 40 categories. In accordance with WaveNet, the baseline dilated convolutional network had ReLU activation ($g(a,b) = \max\{a+b, 0\}$ – see sec. 4.1), a fixed number of channels per layer, and input vectors of dimension 256 holding one-hot quantizations of the audio signal. The number of layers $L$ was set to 12, corresponding to an input window of $N = 2^L = 4096$ samples, spanning 256 ms of audio signal – standard practice with the TIMIT dataset. The framework chosen for running the experiment was the Caffe toolbox (Jia et al. (2014)), and we used the Adam optimizer (Kingma and Ba (2014)) for training (with default hyper-parameters: moment decay rates $\beta_1 = 0.9$, $\beta_2 = 0.999$; learning rate $\alpha = 0.001$). Weight decay and batch size were held fixed across all runs. Models were trained for a fixed number of iterations, with the learning rate decreased by a constant factor once most of the iterations took place.
8. Conclusion
Nearly all state of the art deep networks these days (e.g. Szegedy et al. (2015); He et al. (2015); Huang et al. (2016b,a)) deviate from the simple feed-forward (chain) approach, employing various connectivity schemes between their layers. In this paper we studied the representational implications of connectivity in the context of dilated convolutional networks, a family of deep models delivering state of the art performance in audio and text processing tasks, underlying Google's WaveNet (van den Oord et al. (2016)) and ByteNet (Kalchbrenner et al. (2016)). We formulated our study through the notion of expressive efficiency, which refers to a situation where one network must grow unfeasibly large to realize (or approximate) functions of another. Our analysis shows that interconnecting hidden layers of different dilated convolutional networks can bring forth a model that is expressively efficient w.r.t. the individual networks it comprises. In particular, we show that a single connection between hidden layers can already lead to an almost quadratic gap, which in large-scale settings typically makes the difference between a model that is practical and one that is not. We empirically evaluate the analyzed networks, and find that the expressive efficiency brought forth by interconnectivity coincides with improved accuracies.

To date, formal analyses studying expressive efficiency have focused on the architectural feature of depth, showing instances where deep networks are expressively efficient w.r.t. shallow ones. These studies were motivated by the vast empirical evidence supporting the importance of depth. Our work thus provides a second exemplar of an architectural feature for which expressive efficiency and superior accuracies go hand in hand. This leads us to believe that expressive efficiency may serve a key role in the development of new tools for deep network design.
Notes

… convolutions. We limit ourselves to this case for simplicity of presentation. Our formulation can easily be extended to account for convolutions of arbitrary size by considering mode trees that are not necessarily binary, and by modifying the decomposition in eq. 3 to take (generalized) tensor products between an arbitrary number of tensors (not necessarily two).

3. A full binary tree is a tree in which all interior (non-leaf) nodes have exactly two children.

4. In general the number of tensors in the collection may vary across nodes, but for simplicity of presentation we assume here that all collections comprise exactly $r$ tensors.

5. If $c$ is a scalar and $S$ is a set, $c + S$ stands for the set obtained by adding $c$ to each element in $S$.

6. It is important to stress that not all choices of mode trees lead to networks resembling ones used in practice. For example, if different leaves in a tree have different depths, different inputs in the corresponding network pass through a different number of layers. Conversely, not every type of dilated convolutional network used in practice corresponds to a mode tree – only ones in which an input is connected to the output through a single path.

7. A few remarks are in order at this point:

• The number of channels in each layer of $\mathcal{N}$ or $\bar{\mathcal{N}}$ corresponds to the constant $r$ in the respective tree decomposition (eq. 3 with underlying mode tree $T$ or $\bar T$ respectively). Similarly, the number of channels in each layer of each interconnected network in $\mathcal{M}$ corresponds to $r$ in the respective mixed decomposition (eq. 4). In both the tree and mixed decompositions, $r$, referred to as the size constant, stands for the number of tensors $\{\phi^{\nu,\gamma}\}_\gamma$ (respectively $\{\bar\phi^{\bar\nu,\gamma}\}_\gamma$) held in each node $\nu$ (respectively $\bar\nu$). We set this number uniformly across nodes, corresponding to uniformly sized layers across networks, merely for simplicity of presentation. Our formulations and analysis can easily be adapted to account for varying layer sizes, by allowing different nodes in a decomposition to hold a different number of tensors. Note that an implication of our uniform setting is that a network's input and output dimensions vary along with the size of its hidden layers. When replicating a function realized by a network using a larger network, we simply pad input vectors with zeros, and ignore the excess output coordinates.

• An additional simplification we made relates to weight sharing. In both the tree and mixed decompositions, each interior node $\nu$ (respectively $\bar\nu$) has a separate set of weights $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_\gamma$ (respectively $\{\bar a^{\bar\nu,\gamma,I}, \bar a^{\bar\nu,\gamma,II}\}_\gamma$). This implies that in the corresponding networks, convolution filters may vary through time, i.e. different weights may be used against different portions of a convolved sequence. The more commonplace setting of stationary filters (standard convolutions) is obtained by restricting different nodes in a decomposition to possess the same weights.
We do not introduce such restrictions into our formulations, as they make little difference in terms of the analysis, but on the other hand significantly burden presentation.

8. In accordance with the remark given at the beginning of this section, when using the (larger) mixed decomposition, we pad discretizers with zeros, and ignore the excess output tensors.

9. We note that in addition to the mixed dilated convolutional network $\mathcal{M}$, we also evaluated the individual networks $\mathcal{N}$ and $\bar{\mathcal{N}}$ – both reached accuracies comparable to $\mathcal{M}$ in the case of zero interconnections (output summation only).

10. The case of convolutional arithmetic circuits ($g(a,b) = a \cdot b$) was also evaluated, leading to the exact same trends as those observed with ReLU (fig. 7).

Acknowledgments
This work was supported by Intel grant ICRI-CI
References
Richard Bellman. Introduction to matrix analysis, volume 960. SIAM, 1970.

Richard Caron and Tim Traynor. The zero set of a polynomial. WSMR Report 05-02, 2005.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. International Conference on Machine Learning (ICML), 2016.

Nadav Cohen and Amnon Shashua. Inductive bias of deep convolutional networks through pooling geometry. International Conference on Learning Representations (ICLR), 2017.

Nadav Cohen, Or Sharir, and Amnon Shashua. Deep SimNets. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. Conference On Learning Theory (COLT), 2016b.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.

John S Garofolo, Lori F Lamel, William M Fisher, Jonathon G Fiscus, and David S Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer Science & Business Media, Berlin, Heidelberg, February 2012.

Andrew K Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. PhD thesis, Massachusetts Institute of Technology, 1998.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016b.

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. CoRR abs/1506.08473, 2015.

Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

Frank Jones. Lebesgue integration on Euclidean space. Jones & Bartlett Learning, 2001.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pages 1106–1114, 2012.

Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10), 1995.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

K-F Lee and H-W Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11):1641–1648, 1989.

Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of inference regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

Tomaso Poggio, Fabio Anselmi, and Lorenzo Rosasco. I-theory on depth vs width: hierarchical function composition. Technical report, Center for Brains, Minds and Machines (CBMM), 2015.

Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

Hanie Sedghi and Anima Anandkumar. Training input-output recurrent neural networks through spectral methods. arXiv preprint arXiv:1603.00954, 2016.

Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua. Tensorial mixture models. arXiv preprint arXiv:1610.04167, 2016.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CVPR, 2015.

Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
Appendix A. Derivation of the Baseline Decomposition
In this appendix we derive the baseline decomposition (eq. 2) – a parameterization of grid tensors (eq. 1) discretizing input-output mappings of the baseline dilated convolutional network (fig. 1). As discussed in sec. 4.1, $o[t]$ – the network output at time $t$ – is a function of $x[t-N+1] \ldots x[t]$ – its input over the last $N := 2^L$ time points. We would like to show that for any $d_1 \ldots d_N \in [M]$, entry $(d_1, \ldots, d_N)$ of a tensor $\mathcal{A}^y$ generated by eq. 2 is equal to coordinate $y$ of the network output $o[t]$ under the following input assignment: $x[t-N+1] = v^{(d_1)}, \ldots, x[t] = v^{(d_N)}$. To achieve this, we prove by induction that under the latter assignment, for every $l \in [L] \cup \{0\}$, $j \in [N/2^l]$ and $\gamma \in [r_l]$, coordinate $\gamma$ of the network's depth-$l$ sequence (input $(x[t])_t$ for $l = 0$; hidden sequence $(h^{(l)}[t])_t$ for $l \in [L-1]$; output $(o[t])_t$ for $l = L$) at time $t - N + j \cdot 2^l$ is equal to entry $(d_{(j-1)2^l+1}, \ldots, d_{(j-1)2^l+2^l})$ of the tensor $\phi^{l,j,\gamma}$ in the baseline decomposition (eq. 2). The desired result then follows from the case $l = L$, $j = 1$, $\gamma = y$.

When $l = 0$, the inductive hypothesis is trivial – coordinate $\gamma$ of the input sequence at time $t - N + j$, i.e. $x[t-N+j]_\gamma$, is by definition of our assignment equal to $v^{(d_j)}_\gamma$ – entry $d_j$ of the tensor $\phi^{0,j,\gamma}$ (see eq. 2). Assume now that the inductive hypothesis holds whenever $l = k$, and consider the tensor $\phi^{k+1,j,\gamma}$ for some $j \in [N/2^{k+1}]$ and $\gamma \in [r_{k+1}]$. From the baseline decomposition (eq. 2):

$$\phi^{k+1,j,\gamma} = \Big(\sum\nolimits_{\alpha=1}^{r_k} a^{k+1,\gamma,I}_\alpha \cdot \phi^{k,2j-1,\alpha}\Big) \otimes_g \Big(\sum\nolimits_{\alpha=1}^{r_k} a^{k+1,\gamma,II}_\alpha \cdot \phi^{k,2j,\alpha}\Big)$$

Focusing on entry $(d_{(j-1)2^{k+1}+1}, \ldots, d_{(j-1)2^{k+1}+2^{k+1}})$ of the left-hand side, while recalling the definition of the generalized tensor product $\otimes_g$ (sec. 3), we may write:

$$\phi^{k+1,j,\gamma}_{d_{(j-1)2^{k+1}+1}, \ldots, d_{(j-1)2^{k+1}+2^{k+1}}} = g\Big(\sum\nolimits_{\alpha=1}^{r_k} a^{k+1,\gamma,I}_\alpha \cdot \phi^{k,2j-1,\alpha}_{d_{(2j-2)2^k+1}, \ldots, d_{(2j-2)2^k+2^k}},\; \sum\nolimits_{\alpha=1}^{r_k} a^{k+1,\gamma,II}_\alpha \cdot \phi^{k,2j,\alpha}_{d_{(2j-1)2^k+1}, \ldots, d_{(2j-1)2^k+2^k}}\Big) \qquad (5)$$

By our inductive assumption:

$$\phi^{k,2j-1,\alpha}_{d_{(2j-2)2^k+1}, \ldots, d_{(2j-2)2^k+2^k}} = h^{(k)}[t - N + (2j-1) \cdot 2^k]_\alpha \quad \forall \alpha \in [r_k]$$
$$\phi^{k,2j,\alpha}_{d_{(2j-1)2^k+1}, \ldots, d_{(2j-1)2^k+2^k}} = h^{(k)}[t - N + 2j \cdot 2^k]_\alpha \quad \forall \alpha \in [r_k]$$

where we overload notation in the case $k = 0$, letting $(h^{(0)}[t])_t$ stand for the input sequence $(x[t])_t$. Plugging the latter into eq. 5, we obtain:

$$\phi^{k+1,j,\gamma}_{d_{(j-1)2^{k+1}+1}, \ldots, d_{(j-1)2^{k+1}+2^{k+1}}} = g\big(\big\langle a^{k+1,\gamma,I}, h^{(k)}[t - N + (2j-1) \cdot 2^k]\big\rangle,\; \big\langle a^{k+1,\gamma,II}, h^{(k)}[t - N + 2j \cdot 2^k]\big\rangle\big)$$

By the definition of the baseline dilated convolutional network (sec. 4.1), the latter expression is precisely equal to coordinate $\gamma$ of the sequence $(h^{(k+1)}[t])_t$ (or $(o[t])_t$ if $k = L-1$) at time $t - N + j \cdot 2^{k+1}$. This proves that our inductive hypothesis holds when $l = k + 1$, and in general.
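The induction can be checked numerically on a small instance. The sketch below (ours; $N = 4$, weights shared across time positions for brevity) builds the grid tensors of eq. 2 bottom-up and compares an entry against the network's forward pass with $g$ the product operator:

```python
import numpy as np

rng = np.random.default_rng(1)
r, M = 3, 4                              # N = 4 inputs, i.e. L = 2 layers
V = rng.standard_normal((M, r))          # row i holds the discretizer v^(i)
a1 = rng.standard_normal((2, r, r))      # layer-1 weights (I/II, gamma, alpha)
a2 = rng.standard_normal((2, r, r))      # layer-2 weights (I/II, y, alpha)

# Grid tensors A^y of the baseline decomposition (eq. 2), built bottom-up.
phi0 = V.T                                           # phi0[gamma] = (v^(1)_gamma, ..., v^(M)_gamma)
p = np.einsum('sga,ai->sgi', a1, phi0)               # weighted sums of order-1 tensors
phi1 = np.einsum('gi,gj->gij', p[0], p[1])           # tensor product -> order-2 tensors
q = np.einsum('sya,aij->syij', a2, phi1)
A = np.einsum('yij,ykl->yijkl', q[0], q[1])          # A[y] is an order-4 grid tensor

def network(d):
    """Forward pass of the baseline dilated conv net on x[t-3..t] = v^(d_1..d_4)."""
    x = V[list(d)]
    h = [(a1[0] @ x[0]) * (a1[1] @ x[1]),            # hidden layer, dilation 1
         (a1[0] @ x[2]) * (a1[1] @ x[3])]            # (g = product, applied coordinatewise)
    return (a2[0] @ h[0]) * (a2[1] @ h[1])           # output layer, dilation 2

d = (0, 2, 1, 3)                                     # 0-based indices into [M]
print(np.allclose(A[:, d[0], d[1], d[2], d[3]], network(d)))  # True
```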
Appendix B. Deferred Proofs

B.1. Proof of Claim 5
We initiate the proof by introducing notations that will allow a more compact presentation. Hereinafter, we let $\{a^{H,\nu,\gamma,I}, a^{H,\nu,\gamma,II} \in \mathbb{R}^{r_{mix}/2}\}_{\nu \in int(H), \gamma \in [r_{mix}/2]}$ stand for the weights in the tree decomposition of the hybrid mode tree $H$ (eq. 3 with size constant $r = r_{mix}/2$ and underlying mode tree given by def. 4). Similarly, we use $\{a^{T,\nu,\gamma,I}, a^{T,\nu,\gamma,II} \in \mathbb{R}^{r_{mix}}\}_{\nu \in int(T), \gamma \in [r_{mix}]}$ and $\{a^{\bar T,\nu,\gamma,I}, a^{\bar T,\nu,\gamma,II} \in \mathbb{R}^{r_{mix}}\}_{\nu \in int(\bar T), \gamma \in [r_{mix}]}$ to denote the weights, corresponding to $T$ and $\bar T$ (respectively), in the mixed decomposition (eq. 4 with size constant $r = r_{mix}$). Recall that by construction (def. 4), $int(H)$ – the interior of $H$ – consists of different segments (collections of nodes), each taken from either $int(T)$ or $int(\bar T)$. We define $t: int(H) \to \{T, \bar T\}$ to be the function indicating which tree an interior node in $H$ came from. Specifically, if the node $\nu \in int(H)$ originated from $T$ we have $t(\nu) = T$, and on the other hand, if its source is $\bar T$ then $t(\nu) = \bar T$. By convention, feeding $t(\cdot)$ with an argument outside $int(H)$ yields something that is different from both $T$ and $\bar T$. For example, if $\nu \in int(H)$ is the root node, i.e. $\nu = [N]$, then $P(\nu; H)$ – its parent in $H$ – is undefined and we have $t(P(\nu; H)) \neq t(\nu)$. Similarly, if the child $C_I(\nu; H)$ of $\nu \in int(H)$ is a leaf, it is outside the domain of $t(\cdot)$ and thus $t(\nu) \neq t(C_I(\nu; H))$.

Given a setting of weights $\{a^{H,\nu,\gamma,I}, a^{H,\nu,\gamma,II}\}_{\nu,\gamma}$ for the tree decomposition of $H$, we would like to show that there exists a setting of weights $\{a^{T,\nu,\gamma,I}, a^{T,\nu,\gamma,II}\}_{\nu,\gamma}$ and $\{a^{\bar T,\nu,\gamma,I}, a^{\bar T,\nu,\gamma,II}\}_{\nu,\gamma}$ for the mixed decomposition of $T$ and $\bar T$, such that the latter generates grid tensors identical to those of the former. More precisely, for any collection of discretizers $\{v^{(i)} \in \mathbb{R}^{r_{mix}/2}\}_{i \in [M]}$ fed into the tree decomposition of $H$, leading the latter to produce grid tensors $\{\mathcal{A}^y\}_{y \in [r_{mix}/2]}$, we would like the mixed decomposition to be such that when fed with the padded discretizers $\{[(v^{(i)})^\top\; 0^\top]^\top \in \mathbb{R}^{r_{mix}}\}_{i \in [M]}$, the first $r_{mix}/2$ grid tensors it generates are equal to $\{\mathcal{A}^y\}_{y \in [r_{mix}/2]}$. We prove existence of the sought-after weight setting constructively, by presenting an explicit procedure for assigning $\{a^{T,\nu,\gamma,I}, a^{T,\nu,\gamma,II}\}_{\nu,\gamma}$ and $\{a^{\bar T,\nu,\gamma,I}, a^{\bar T,\nu,\gamma,II}\}_{\nu,\gamma}$ based on $\{a^{H,\nu,\gamma,I}, a^{H,\nu,\gamma,II}\}_{\nu,\gamma}$:

Initialize:
$\quad a^{T,\nu,\gamma,I} = a^{T,\nu,\gamma,II} = 0 \quad \forall \nu \in int(T),\, \gamma \in [r_{mix}]$
$\quad a^{\bar T,\nu,\gamma,I} = a^{\bar T,\nu,\gamma,II} = 0 \quad \forall \nu \in int(\bar T),\, \gamma \in [r_{mix}]$
For $\nu$ in $int(H)$ (depth-first order):
$\quad a^{t(\nu),\nu,\gamma+r_{mix}/2,I} = \begin{cases} [0^\top\; (a^{H,\nu,\gamma,I})^\top]^\top, & t(\nu) = t(C_I(\nu;H)) \\ [(a^{H,\nu,\gamma,I})^\top\; 0^\top]^\top, & t(\nu) \neq t(C_I(\nu;H)) \end{cases} \quad \forall \gamma \in [r_{mix}/2]$
$\quad a^{t(\nu),\nu,\gamma+r_{mix}/2,II} = \begin{cases} [0^\top\; (a^{H,\nu,\gamma,II})^\top]^\top, & t(\nu) = t(C_{II}(\nu;H)) \\ [(a^{H,\nu,\gamma,II})^\top\; 0^\top]^\top, & t(\nu) \neq t(C_{II}(\nu;H)) \end{cases} \quad \forall \gamma \in [r_{mix}/2]$
$\quad$If $t(P(\nu;H)) \neq t(\nu)$:
$\quad\quad$Swap $a^{t(\nu),\nu,\gamma,I} \longleftrightarrow a^{t(\nu),\nu,\gamma+r_{mix}/2,I} \quad \forall \gamma \in [r_{mix}/2]$
$\quad\quad$Swap $a^{t(\nu),\nu,\gamma,II} \longleftrightarrow a^{t(\nu),\nu,\gamma+r_{mix}/2,II} \quad \forall \gamma \in [r_{mix}/2]$
$\qquad (6)$

The idea behind this assignment is as follows.
The computation corresponding to a node in the tree decomposition of $H$ is carried out, in the mixed decomposition of $T$ and $\bar T$, by the respective node in the respective source tree. That is to say, the computation of $\nu \in int(H)$ in the tree decomposition is carried out by $\nu \in int(t(\nu))$ in the mixed decomposition. $\nu \in int(t(\nu))$ uses half ($r_{mix}/2$) of its weight vectors, and in each used weight vector, half ($r_{mix}/2$) of the coordinates hold actual (non-zero) values – a copy of the respective weight from $\nu \in int(H)$. The choice of which weight vectors to use, and which coordinates to use in the active weight vectors, depends on the tree-transitioning scheme. If the parent of $\nu$ in $H$ came from the same tree as $\nu$, i.e. $t(P(\nu;H)) = t(\nu)$, then $\nu \in int(t(\nu))$ in the mixed decomposition uses weight vectors with higher indexes ($\gamma \in r_{mix}/2 + [r_{mix}/2]$), as these relate to tensors that are not exchanged (see eq. 4). On the other hand, if $t(P(\nu;H)) \neq t(\nu)$, weight vectors with lower indexes ($\gamma \in [r_{mix}/2]$) are used, so that the computations (tensors) will be sent to the opposite tree. The analogous rationale holds for the children of $\nu$ in $H$ ($C_I(\nu;H)$ and $C_{II}(\nu;H)$). If a child came from the same tree as $\nu$, upper coordinates of the appropriate weight vectors are used, so that computations (tensors) coming from the present tree are collected. On the other hand, if the child came from the opposite tree, lower coordinates are used and computations (tensors) from that tree are fetched. Altogether, the assignment in eq. 6 meets our requirements, and thus concludes the proof. □
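The selection mechanism of eq. 6 boils down to inner products with zero-padded vectors: placing the copied weight in the lower or upper half of its coordinates determines whether a padded same-tree tensor is picked up or ignored. A minimal sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
half = 4                                             # r_mix / 2
w = rng.standard_normal(half)                        # a weight of the H-decomposition
v = rng.standard_normal(half)                        # a tensor produced with r = r_mix / 2
v_pad  = np.concatenate([v, np.zeros(half)])         # padded, as in the claim
w_low  = np.concatenate([w, np.zeros(half)])         # copy in the lower coordinates
w_high = np.concatenate([np.zeros(half), w])         # copy in the upper coordinates
print(np.isclose(w_low @ v_pad, w @ v))              # True: the computation is fetched
print(np.isclose(w_high @ v_pad, 0.0))               # True: the computation is ignored
```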
B.2. Proof of Theorem 7

Since we are dealing with a single particular mode tree $T$, we omit it from our notations throughout the proof. Specifically, we denote by $C_I(\nu)$ and $C_{II}(\nu)$ (instead of $C_I(\nu;T)$ and $C_{II}(\nu;T)$) the children of an interior node $\nu \in int(T)$; by $\Theta(I)$ and $\Theta(I^c)$ (instead of $\Theta(I;T)$ and $\Theta(I^c;T)$) the tilings of $I$ and $I^c$ (respectively) w.r.t. $T$ (see def. 6); and by $\sigma^{(\nu)}(\cdot)$ (instead of $\sigma^{(\nu;T)}(\cdot)$) the permutation corresponding to $\nu \in int(T)$ in the tree decomposition (eq. 3).

The first stage of the proof is to derive a matricized form of the tree decomposition, shedding light on the manner in which grid tensor matricizations $\{[\![\mathcal{A}^y]\!]_I\}_y$ are generated. As a preparatory step in this direction, we define the notion of an index set reduction. Let $\nu \subset [N]$ be a node in $T$, whose elements we denote by $i_1 < \cdots < i_{|\nu|}$. The reduction of $I$ onto $\nu$ is defined as follows:

$$I|_\nu := \{j \in [|\nu|] : i_j \in I \cap \nu\} \qquad (7)$$

In words, it is the set of indexes corresponding to the intersection $I \cap \nu$ inside $\nu$. Besides index set reduction, an additional tool we will be using is the Kronecker product – a matrix operator we denote by $\odot$. For two matrices $A \in \mathbb{R}^{M_1 \times M_2}$ and $B \in \mathbb{R}^{N_1 \times N_2}$, $A \odot B$ is the matrix in $\mathbb{R}^{M_1 N_1 \times M_2 N_2}$ holding $A_{ij} B_{kl}$ in row index $(i-1) N_1 + k$ and column index $(j-1) N_2 + l$.

Consider the central relation in the tree decomposition (eq. 3), while noticing that $\otimes_g \equiv \otimes$ in our setting ($g(\cdot)$ is the product operator – see sec. 3):

$$\underbrace{\phi^{\nu,\gamma}}_{\text{order } |\nu|} = \sigma^{(\nu)}\Big(\Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha \cdot \phi^{C_I(\nu),\alpha}\Big) \otimes \Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha \cdot \phi^{C_{II}(\nu),\alpha}\Big)\Big) \qquad (8)$$

Suppose we would like to matricize the tensor $\phi^{\nu,\gamma}$ w.r.t. the reduction $I|_\nu$. If all elements of $C_I(\nu)$ were smaller than those of $C_{II}(\nu)$, the permutation $\sigma^{(\nu)}(\cdot)$ would be the identity (see sec. 4.2), and the following matrix relation would hold:

$$[\![\phi^{\nu,\gamma}]\!]_{I|_\nu} = \Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha \cdot [\![\phi^{C_I(\nu),\alpha}]\!]_{I|_{C_I(\nu)}}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha \cdot [\![\phi^{C_{II}(\nu),\alpha}]\!]_{I|_{C_{II}(\nu)}}\Big)$$
In general however, elements in $C_I(\nu)$ could be greater than ones in $C_{II}(\nu)$, and so eq. 8 includes a tensor mode sorting via $\sigma^{(\nu)}(\cdot)$. In matricized form, this amounts to rearranging rows and columns through appropriate permutation matrices $Q^{(\nu)}$ and $\bar Q^{(\nu)}$ respectively:

$$[\![\phi^{\nu,\gamma}]\!]_{I|_\nu} = Q^{(\nu)} \Big(\Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha \cdot [\![\phi^{C_I(\nu),\alpha}]\!]_{I|_{C_I(\nu)}}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha \cdot [\![\phi^{C_{II}(\nu),\alpha}]\!]_{I|_{C_{II}(\nu)}}\Big)\Big) \bar Q^{(\nu)}$$

We thus arrive at the following matrix form of eq. 3, referred to as the matricized tree decomposition:

For $j = 1 \ldots N$: $\;[\![\phi^{\{j\},\gamma}]\!]_{I|_{\{j\}}} = [\![\,[v^{(1)}_\gamma, \ldots, v^{(M)}_\gamma]^\top]\!]_{I|_{\{j\}}} \quad \forall \gamma \in [r]$
For $\nu$ in $int(T)$ (depth-first order):
$\quad [\![\phi^{\nu,\gamma}]\!]_{I|_\nu} = Q^{(\nu)} \Big(\Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha [\![\phi^{C_I(\nu),\alpha}]\!]_{I|_{C_I(\nu)}}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha [\![\phi^{C_{II}(\nu),\alpha}]\!]_{I|_{C_{II}(\nu)}}\Big)\Big) \bar Q^{(\nu)} \quad \forall \gamma \in [r]$
$[\![\mathcal{A}^y]\!]_I = [\![\phi^{[N],y}]\!]_{I|_{[N]}} \quad \forall y \in [r]$
$\qquad (9)$

Next, we move on to the second stage of the proof, where we establish the upper bound stated in the theorem:

$$\mathrm{rank}\, [\![\mathcal{A}^y]\!]_I \leq r^{\min\{|\Theta(I)|,\, |\Theta(I^c)|\}} \quad \forall y \qquad (10)$$

We begin by "propagating outwards" the permutation matrices $Q^{([N])}$ and $\bar Q^{([N])}$ corresponding to the root node $[N]$ in the matricized tree decomposition (eq. 9). Namely, for every $\gamma \in [r]$, we replace the matrix $[\![\phi^{[N],\gamma}]\!]_{I|_{[N]}}$ by:

$$B^{[N],\gamma} := \Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,I}_\alpha [\![\phi^{C_I([N]),\alpha}]\!]_{I|_{C_I([N])}}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,II}_\alpha [\![\phi^{C_{II}([N]),\alpha}]\!]_{I|_{C_{II}([N])}}\Big)$$

and accordingly move $Q^{([N])}$ and $\bar Q^{([N])}$ to the assignments of $\{[\![\mathcal{A}^y]\!]_I\}_y$, which become $[\![\mathcal{A}^y]\!]_I = Q^{([N])} B^{[N],y} \bar Q^{([N])}$ for every $y \in [r]$ (the remaining relations of eq. 9 are unchanged). Consider now $C_I([N])$ – a child of the root node $[N]$ – and suppose we would like to similarly propagate outwards its permutation matrices $Q^{(C_I([N]))}$ and $\bar Q^{(C_I([N]))}$.
We may define, for every $\gamma \in [r]$:

$$B^{C_I([N]),\gamma} := \Big(\sum\nolimits_{\alpha=1}^{r} a^{C_I([N]),\gamma,I}_\alpha [\![\phi^{C_I(C_I([N])),\alpha}]\!]_{I|_{C_I(C_I([N]))}}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{C_I([N]),\gamma,II}_\alpha [\![\phi^{C_{II}(C_I([N])),\alpha}]\!]_{I|_{C_{II}(C_I([N]))}}\Big)$$

which in turn implies:

$$B^{[N],\gamma} = \Big(Q^{(C_I([N]))} \Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,I}_\alpha B^{C_I([N]),\alpha}\Big) \bar Q^{(C_I([N]))}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,II}_\alpha [\![\phi^{C_{II}([N]),\alpha}]\!]_{I|_{C_{II}([N])}}\Big)$$
Now, for any matrices $A, A', B, B'$ such that $AA'$ and $BB'$ are defined, the following equality holds: $(AA') \odot (BB') = (A \odot A')(B \odot B')$ (see Bellman (1970) for proof). We may therefore write:

$$B^{[N],\gamma} = \big(Q^{(C_I([N]))} \odot I\big) \Big(\Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,I}_\alpha B^{C_I([N]),\alpha}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{[N],\gamma,II}_\alpha [\![\phi^{C_{II}([N]),\alpha}]\!]_{I|_{C_{II}([N])}}\Big)\Big) \big(\bar Q^{(C_I([N]))} \odot \bar I\big)$$

where $I$ and $\bar I$ are identity matrices of appropriate sizes. Propagating outwards the matrices $Q^{(C_I([N]))} \odot I$ and $\bar Q^{(C_I([N]))} \odot \bar I$ (while redefining $B^{[N],\gamma}$ appropriately), we arrive at a decomposition in which the assignments for the nodes $[N]$ and $C_I([N])$ are free of permutation matrices, and the grid tensor matricizations are given by $[\![\mathcal{A}^y]\!]_I = \big(Q^{([N])} (Q^{(C_I([N]))} \odot I)\big) B^{[N],y} \big((\bar Q^{(C_I([N]))} \odot \bar I) \bar Q^{([N])}\big)$ for every $y \in [r]$. Continuing this process, we propagate outwards the permutation matrices $Q^{(\nu)}$ and $\bar Q^{(\nu)}$ of all nodes $\nu$ in the tree that are not members of the tilings $\Theta(I)$ or $\Theta(I^c)$ (see def. 6), and are not descendants of such. This brings forth the following decomposition:
For $j = 1 \ldots N$: $\;[\![\phi^{\{j\},\gamma}]\!]_{I|_{\{j\}}} = [\![\,[v^{(1)}_\gamma, \ldots, v^{(M)}_\gamma]^\top]\!]_{I|_{\{j\}}} \quad \forall \gamma \in [r]$
For $\nu$ in $int(T) \cap \{\text{nodes in } \Theta(I) \text{ or } \Theta(I^c) \text{ or descendants of such}\}$ (depth-first order):
$\quad [\![\phi^{\nu,\gamma}]\!]_{I|_\nu} = Q^{(\nu)} \Big(\Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha [\![\phi^{C_I(\nu),\alpha}]\!]_{I|_{C_I(\nu)}}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha [\![\phi^{C_{II}(\nu),\alpha}]\!]_{I|_{C_{II}(\nu)}}\Big)\Big) \bar Q^{(\nu)} \quad \forall \gamma \in [r]$
For $\nu$ in $\Theta(I) \cup \Theta(I^c)$: $\;B^{\nu,\gamma} = [\![\phi^{\nu,\gamma}]\!]_{I|_\nu} \quad \forall \gamma \in [r]$
For $\nu$ in $int(T) \setminus \{\text{nodes in } \Theta(I) \text{ or } \Theta(I^c) \text{ or descendants of such}\}$ (depth-first order):
$\quad B^{\nu,\gamma} = \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha B^{C_I(\nu),\alpha}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha B^{C_{II}(\nu),\alpha}\Big) \quad \forall \gamma \in [r]$
$[\![\mathcal{A}^y]\!]_I = A \cdot B^{[N],y} \cdot \bar A \quad \forall y \in [r]$, for appropriate matrices $A$ and $\bar A$

Consider now a node $\nu \in int(T)$ whose child belongs to a tiling – without loss of generality $C_I(\nu)$ belongs to $\Theta(I)$. Notice that in this case $B^{C_I(\nu),\alpha}$ is a column vector for every $\alpha \in [r]$. We may thus define $B^{C_I(\nu)}$ to be the matrix whose $\alpha$'th column is $B^{C_I(\nu),\alpha}$, and get the following equalities:

$$B^{\nu,\gamma} = \Big(B^{C_I(\nu)} a^{\nu,\gamma,I}\Big) \odot \Big(\sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha B^{C_{II}(\nu),\alpha}\Big) = \Big(B^{C_I(\nu)} \odot I\Big) \Big(a^{\nu,\gamma,I} \odot \sum\nolimits_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha B^{C_{II}(\nu),\alpha}\Big)$$

where again, $I$ is an appropriately sized identity matrix. This implies that we can propagate outwards $B^{C_I(\nu)} \odot I$, just as we have done with permutation matrices. Applying this procedure to all nodes in the tilings $\Theta(I)$ and $\Theta(I^c)$, we arrive at the decomposition below:

For $\nu$ in $\Theta(I)$: $\;B^{\nu,\gamma} = e^{(\gamma)} \quad \forall \gamma \in [r]$
For $\nu$ in $\Theta(I^c)$: $\;B^{\nu,\gamma} = (e^{(\gamma)})^\top \quad \forall \gamma \in [r]$
For $\nu$ in $int(T) \setminus \{\text{nodes in } \Theta(I) \text{ or } \Theta(I^c) \text{ or descendants of such}\}$ (depth-first order):
$\quad B^{\nu,\gamma} = \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha B^{C_I(\nu),\alpha}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha B^{C_{II}(\nu),\alpha}\Big) \quad \forall \gamma \in [r]$
$[\![\mathcal{A}^y]\!]_I = A \cdot B^{[N],y} \cdot \bar A \quad \forall y \in [r]$, for appropriate matrices $A$ and $\bar A$

Notice that for compactness in writing we made use of the fact that $a^{\nu,\gamma,I} = \sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha e^{(\alpha)}$, where $e^{(\alpha)}$, $\alpha \in [r]$, is the vector in $\mathbb{R}^r$ holding $1$ in entry $\alpha$ and $0$ in the rest. Note also that in this decomposition, as opposed to the previous ones, the matrices $A$ and $\bar A$ are not global constants that depend only on $T$. Rather, they also depend on $[\![\phi^{\nu,\gamma}]\!]_{I|_\nu}$ for tiling nodes $\nu \in \Theta(I) \cup \Theta(I^c)$, and thus are ultimately determined through a hidden computation that is not specified above. This hidden computation is outside our scope, as we are only interested in the size of the matrices $\{B^{[N],y}\}_y$. It is not difficult to see that this size is precisely $r^{|\Theta(I)|}$-by-$r^{|\Theta(I^c)|}$, meaning that the ranks of $\{B^{[N],y}\}_y$ are no more than $r^{\min\{|\Theta(I)|,\, |\Theta(I^c)|\}}$. Since these ranks are greater than or equal to those of $\{[\![\mathcal{A}^y]\!]_I\}_y$, the sought-after upper bound (eq. 10) indeed holds.
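Both Kronecker-product facts invoked in this proof – the mixed-product property above, and the rank multiplicativity used later in the third stage – are easy to sanity-check numerically (a sketch of ours; NumPy's `kron` plays the role of $\odot$):

```python
import numpy as np

rng = np.random.default_rng(3)
A, Ap = rng.standard_normal((4, 5)), rng.standard_normal((5, 3))
B, Bp = rng.standard_normal((2, 6)), rng.standard_normal((6, 2))

# Mixed-product property used to propagate elements out of the decomposition:
print(np.allclose(np.kron(A @ Ap, B @ Bp), np.kron(A, B) @ np.kron(Ap, Bp)))  # True

# Rank multiplicativity of the Kronecker product:
print(np.linalg.matrix_rank(np.kron(A, B))
      == np.linalg.matrix_rank(A) * np.linalg.matrix_rank(B))                 # True
```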
In the third and final stage of the proof, we establish the lower bound stated in the theorem, namely, that for all configurations of weights $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$ but a set of Lebesgue measure zero:

$$\mathrm{rank}\, [\![\mathcal{A}^y]\!]_I \geq r^{|\{(\nu_1,\nu_2) \in \Theta(I) \times \Theta(I^c) \,:\, \nu_1 \text{ and } \nu_2 \text{ are siblings in } T \text{ with depth} > 1\}|} \quad \forall y \qquad (11)$$

We reduce the problem in three successive steps:

• A tree decomposition (eq. 3) with a product operator $g(\cdot)$ admits maximal matricization ranks almost always (see app. C). Therefore, to prove that eq. 11 holds for all weight settings but a set of Lebesgue measure zero, it suffices to find a particular weight setting for which the inequality holds.

• By assumption, the discretizers $\{v^{(i)}\}_{i \in [M]}$ span $\mathbb{R}^r$. Without loss of generality, assume that $\{v^{(i)}\}_{i \in [r]}$ are linearly independent, and consider the sub-tensors of $\{\mathcal{A}^y\}_y$ formed by restricting their indexes to the range $1 \ldots r$ (instead of $1 \ldots M$). The matricizations of these sub-tensors w.r.t. $I$ are sub-matrices of $\{[\![\mathcal{A}^y]\!]_I\}_y$, thus any lower bound on ranks of the former matricizations immediately translates to a lower bound on ranks of the latter. Since the sub-tensors are precisely the grid tensors that would have been generated by the tree decomposition (eq. 3) had we omitted the trailing discretizers $\{v^{(i)}\}_{i \in [M] \setminus [r]}$, establishing eq. 11 in the case $M = r$ proves that it holds in general ($M \geq r$).

• Bearing in mind that we assume $M = r$ (and linear independence of $\{v^{(i)}\}_{i \in [r]}$), denote by $V$ the $r$-by-$r$ matrix holding $v^{(i)}$ in its $i$'th row, i.e. $V := [v^{(1)} \cdots v^{(r)}]^\top$. From the tree decomposition (eq. 3) it is evident that the discretizers affect generated grid tensors only through products of the form $V a^{\nu,\gamma,I}$ or $V a^{\nu,\gamma,II}$, where $\nu$ is a parent of a leaf node in $T$. Since $V$ is invertible ($\{v^{(i)}\}_{i \in [r]}$ are linearly independent), its exact value has no effect on the class of representable grid tensors – any change it undergoes may be accounted for by the weights $a^{\nu,\gamma,I}$ and $a^{\nu,\gamma,II}$ that multiply it (these weights do not appear elsewhere in the decomposition). Accordingly, for establishing a lower bound on achievable grid tensor matricization ranks, the value of $V$ is irrelevant (so long as it is invertible), and we may assume, without loss of generality, that $V$ is the identity matrix, i.e. that $v^{(i)} = e^{(i)}$ for all $i \in [r]$.

Taking into account the above reductions, our objective is to show that there exists a setting of weights $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$, such that the following special case of the matricized tree decomposition (eq. 9) generates matricizations meeting the lower bound in eq. 11:

For $j$ in $I$: $\;[\![\phi^{\{j\},\gamma}]\!]_{I|_{\{j\}}} = e^{(\gamma)} \quad \forall \gamma \in [r]$
For $j$ in $I^c$: $\;[\![\phi^{\{j\},\gamma}]\!]_{I|_{\{j\}}} = (e^{(\gamma)})^\top \quad \forall \gamma \in [r]$
For $\nu$ in $int(T)$ (depth-first order):
$\quad [\![\phi^{\nu,\gamma}]\!]_{I|_\nu} = Q^{(\nu)} \Big(\Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha [\![\phi^{C_I(\nu),\alpha}]\!]_{I|_{C_I(\nu)}}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha [\![\phi^{C_{II}(\nu),\alpha}]\!]_{I|_{C_{II}(\nu)}}\Big)\Big) \bar Q^{(\nu)} \quad \forall \gamma \in [r]$
$[\![\mathcal{A}^y]\!]_I = [\![\phi^{[N],y}]\!]_{I|_{[N]}} \quad \forall y \in [r]$
Similarly to the procedure carried out in the second stage of the proof (establishing the upper bound in eq. 10), we now propagate outwards the permutation matrices $Q^{(\nu)}$ and $\bar Q^{(\nu)}$ corresponding to all interior nodes $\nu \in int(T)$. This brings forth the following decomposition:

For $j$ in $I$: $\;B^{\{j\},\gamma} = e^{(\gamma)} \quad \forall \gamma \in [r]$
For $j$ in $I^c$: $\;B^{\{j\},\gamma} = (e^{(\gamma)})^\top \quad \forall \gamma \in [r]$
For $\nu$ in $int(T)$ (depth-first order):
$\quad B^{\nu,\gamma} = \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,I}_\alpha B^{C_I(\nu),\alpha}\Big) \odot \Big(\sum_{\alpha=1}^{r} a^{\nu,\gamma,II}_\alpha B^{C_{II}(\nu),\alpha}\Big) \quad \forall \gamma \in [r]$
$[\![\mathcal{A}^y]\!]_I = A \cdot B^{[N],y} \cdot \bar A \quad \forall y \in [r]$, for appropriate matrices $A$ and $\bar A$
$\qquad (12)$

The matrices $A$ and $\bar A$ in the assignments of $\{[\![\mathcal{A}^y]\!]_I\}_y$ essentially collect all permutation matrices $\{Q^{(\nu)}\}_\nu$ and $\{\bar Q^{(\nu)}\}_\nu$ (respectively) that have been propagated outwards. Specifically, $A$ (respectively $\bar A$) is a product of factors, each of the form $I \odot Q^{(\nu)} \odot I'$ (respectively $I \odot \bar Q^{(\nu)} \odot I'$) for a different interior node $\nu$ and appropriately sized identity matrices $I$ and $I'$. Since permutation matrices are invertible, and since the Kronecker product between two invertible matrices is invertible as well (see Bellman (1970) for proof), we conclude that the matrices $A$ and $\bar A$ are invertible. Therefore, for every $y \in [r]$, the rank of $[\![\mathcal{A}^y]\!]_I$ is equal to that of $B^{[N],y}$. It thus suffices to find a setting of weights $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$ for which:

$$\mathrm{rank}(B^{[N],\gamma}) \geq r^{|\{(\nu_1,\nu_2) \in \Theta(I) \times \Theta(I^c) \,:\, \nu_1 \text{ and } \nu_2 \text{ are siblings in } T \text{ with depth} > 1\}|} \quad \forall \gamma \in [r] \qquad (13)$$

Disregard the trivial case where there exist siblings $\nu_1 \in \Theta(I)$ and $\nu_2 \in \Theta(I^c)$ of depth 1,¹ and consider the following weight setting:

• $\nu$ is a node in $\Theta(I)$ or $\Theta(I^c)$, or a descendant of such: $\;a^{\nu,\gamma,I} = a^{\nu,\gamma,II} = e^{(\gamma)} \quad \forall \gamma \in [r]$

• $\nu$ has one child in $\Theta(I)$ and the other in $\Theta(I^c)$: $\;a^{\nu,\gamma,I} = a^{\nu,\gamma,II} = e^{(\gamma)} \quad \forall \gamma \in [r]$

• $\nu$ is the root node $[N]$: $\;a^{\nu,\gamma,I} = a^{\nu,\gamma,II} = e^{(1)} \quad \forall \gamma \in [r]$

• $\nu$ meets neither of the above ($0$ and $1$ here denote the all-zero and all-one vectors in $\mathbb{R}^r$, respectively):

$\quad a^{\nu,1,I} = \begin{cases} 1, & C_I(\nu) \text{ has one child in } \Theta(I) \text{ and the other in } \Theta(I^c) \\ e^{(1)}, & \text{otherwise} \end{cases}$
$\quad a^{\nu,1,II} = \begin{cases} 1, & C_{II}(\nu) \text{ has one child in } \Theta(I) \text{ and the other in } \Theta(I^c) \\ e^{(1)}, & \text{otherwise} \end{cases}$
$\quad a^{\nu,\gamma,I} = a^{\nu,\gamma,II} = 0 \quad \forall \gamma \in [r] \setminus \{1\}$
1. In this case $I$ and $I^c$ are the children of the root node $[N]$, and the maximal rank of $B^{[N],\gamma}$ is $1$ for every $\gamma \in [r]$.
Plugging this into the decomposition in eq. 12, one readily sees that:

• For every $\nu \in \Theta(I)$, $\{B^{\nu,\gamma}\}_{\gamma \in [r]}$ are indicator column vectors (one entry holds $1$, the rest hold $0$) such that $B^{\nu,\gamma} \neq B^{\nu,\gamma'}$ if $\gamma \neq \gamma'$. The same holds for $\nu \in \Theta(I^c)$, but with the vectors being rows.

• If $\nu$ has one child in $\Theta(I)$ and the other in $\Theta(I^c)$, $\{B^{\nu,\gamma}\}_{\gamma \in [r]}$ are indicator matrices, where both the row and column indexes of the active entry do not repeat as $\gamma$ varies.

• The matrices $\{B^{[N],\gamma}\}_{\gamma \in [r]}$ corresponding to the root node $[N]$ are equal to one another, given by a joint Kronecker product between all of the following:
– $B^{\nu,1}$ for every node $\nu$ in either $\Theta(I)$ or $\Theta(I^c)$ which does not have a sibling in the other
– $\sum_{\alpha=1}^{r} B^{\nu,\alpha}$ for every node $\nu$ that has one child in $\Theta(I)$ and the other in $\Theta(I^c)$

According to the first observation above, $B^{\nu,1}$ has rank $1$ for every $\nu$ in $\Theta(I)$ or $\Theta(I^c)$. The second observation implies that $\sum_{\alpha=1}^{r} B^{\nu,\alpha}$ has rank $r$ for every node $\nu$ that has one child in $\Theta(I)$ and the other in $\Theta(I^c)$. In turn, and while taking into account the rank-multiplicative property of the Kronecker product ($\mathrm{rank}(A \odot A') = \mathrm{rank}(A) \cdot \mathrm{rank}(A')$ – see Bellman (1970) for proof), the third observation implies:

$$\mathrm{rank}(B^{[N],\gamma}) = r^{|\{(\nu_1,\nu_2) \in \Theta(I) \times \Theta(I^c) \,:\, \nu_1 \text{ and } \nu_2 \text{ are siblings in } T\}|} \quad \forall \gamma \in [r]\,^2$$

We thus have found weights $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$ for which eq. 13 holds. This establishes the sought-after lower bound on matricization ranks (eq. 11), completing the proof of the theorem. □
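On the $N = 4$ baseline tree the theorem can be verified directly: for $I = \{1, 3\}$ both tilings consist of singletons, every tile has a sibling of depth $2$ in the complement's tiling, and the two bounds coincide at $r^2$. Reusing the toy decomposition from the appendix A sketch (ours, with $r = 3$):

```python
import numpy as np

rng = np.random.default_rng(1)
r, M = 3, 4
V = rng.standard_normal((M, r))
a1 = rng.standard_normal((2, r, r))
a2 = rng.standard_normal((2, r, r))
p = np.einsum('sga,ai->sgi', a1, V.T)
phi1 = np.einsum('gi,gj->gij', p[0], p[1])
q = np.einsum('sya,aij->syij', a2, phi1)
A = np.einsum('yij,ykl->yijkl', q[0], q[1])

# Matricize A^1 w.r.t. I = {1, 3}: modes 1 and 3 to rows, modes 2 and 4 to columns.
mat = A[0].transpose(0, 2, 1, 3).reshape(M * M, M * M)
print(np.linalg.matrix_rank(mat))   # 9 = r**2, as theorem 7 predicts (almost always)
```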
Appendix C. Maximality of Matricization Ranks
In the proof of theorem 7 (app. B.2), and in the derivation of corollary 8 (sec. 6), we made use of the fact that a tree or mixed decomposition (eq. 3 or 4 respectively), with a product operator $g(\cdot)$, admits maximal matricization ranks almost always. That is to say, for any index set $I \subset [N]$, the ranks of generated grid tensors $\{\mathcal{A}^y\}_y$ when matricized w.r.t. $I$ attain their maximum possible values (which depend on both the decomposition and $I$) for all configurations of weights ($\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$ for the tree decomposition; $\{a^{\nu,\gamma,I}, a^{\nu,\gamma,II}\}_{\nu,\gamma}$ and $\{\bar a^{\bar\nu,\gamma,I}, \bar a^{\bar\nu,\gamma,II}\}_{\bar\nu,\gamma}$ for the mixed decomposition) but a set of Lebesgue measure zero. Hereinafter we justify this assertion.

When equipped with the product operator ($g(a,b) = a \cdot b$), a tree or mixed decomposition generates grid tensors $\{\mathcal{A}^y\}_y$ whose entries are polynomials in the decomposition weights. Therefore, for any index set $I \subset [N]$, the entries of the matricizations $\{[\![\mathcal{A}^y]\!]_I\}_y$ are, too, polynomials in the decomposition weights. Claim 9 below implies that for a particular index $y$, the rank of $[\![\mathcal{A}^y]\!]_I$ is maximal almost always, i.e. for all weight settings but a set of measure zero. Since the union of finitely many zero measure sets is itself a zero measure set (see Jones (2001) for example), we conclude that the ranks of $\{[\![\mathcal{A}^y]\!]_I\}_y$ are jointly maximal almost always, which is what we set out to prove.
2. This applies to all but the trivial case where $I$ is such that there exist siblings $\nu_1 \in \Theta(I)$ and $\nu_2 \in \Theta(I^c)$ of depth 1 ($I$ and $I^c$ are the children of the root node $[N]$). In the latter case the lower bound in eq. 13 can be met trivially.
Claim 9
Let $D, M_1, M_2 \in \mathbb{N}$, and consider a polynomial function mapping weights $\alpha \in \mathbb{R}^D$ to matrices $A(\alpha) \in \mathbb{R}^{M_1 \times M_2}$ ("polynomial" here means that all entries of $A(\alpha)$ are polynomials in $\alpha$). Denote $R = \max_{\alpha \in \mathbb{R}^D} \mathrm{rank}(A(\alpha))$, and consider the set $S := \{\alpha \in \mathbb{R}^D : \mathrm{rank}(A(\alpha)) < R\}$. Then, $S$ has Lebesgue measure zero.

Proof We disregard the trivial case where $R = 0$. Let $\alpha_0$ be a point at which $R$ is attained ($\mathrm{rank}(A(\alpha_0)) = R$), and assume without loss of generality that the top-left $R \times R$ minor of $A(\alpha_0)$, i.e. the determinant of $A(\alpha_0)_{1:R,1:R}$, is non-zero. The function $p: \mathbb{R}^D \to \mathbb{R}$ defined by $p(\alpha) = \det(A(\alpha)_{1:R,1:R})$ is a polynomial, which by construction does not vanish everywhere ($p(\alpha_0) \neq 0$). The zero set of a polynomial is either the entire space, or a set of Lebesgue measure zero (see Caron and Traynor (2005) for proof). Therefore, the zero set of $p(\cdot)$ has Lebesgue measure zero. Now, for every $\alpha \in S$:

$$\mathrm{rank}(A(\alpha)) < R \implies \mathrm{rank}(A(\alpha)_{1:R,1:R}) < R \implies p(\alpha) := \det(A(\alpha)_{1:R,1:R}) = 0$$

$S$ is thus contained in the zero set of $p(\cdot)$, and therefore too, has Lebesgue measure zero. □
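The claim is also easy to visualize empirically. In the sketch below (ours; the polynomial matrix is an arbitrary toy example), the maximal rank is $2$, attained everywhere except on the measure-zero set where $bc(a^2 - 1) = 0$:

```python
import numpy as np

def A(alpha):
    a, b, c = alpha
    return np.array([[a * b, c],
                     [b, a * c]])   # det = b*c*(a**2 - 1), not identically zero

rng = np.random.default_rng(4)
ranks = {np.linalg.matrix_rank(A(rng.standard_normal(3))) for _ in range(1000)}
print(ranks)   # {2}: random weight draws attain the maximal rank almost surely
```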