Function space analysis of deep learning representation layers
Oren Elisha and Shai Dekel
Abstract - In this paper we propose a function space approach to Representation Learning [3] and to the analysis of the representation layers in deep learning architectures. We show how to compute a 'weak-type' Besov smoothness index that quantifies the geometry of the clustering in the feature space. This approach has already been applied successfully to improve the performance of machine learning algorithms such as the Random Forest [22] and tree-based Gradient Boosting [14]. Our experiments demonstrate that in well-known and well-performing trained networks, the Besov smoothness of the training set, measured in the corresponding hidden layer feature map representation, increases from layer to layer. We also contribute to the understanding of generalization [49] by showing how the Besov smoothness of the representations decreases as we add more mis-labeling to the training data. We hope this approach will contribute to the de-mystification of some aspects of deep learning.
Index Terms - Deep Learning, Representation Learning, Wavelets, Random Forest, Besov Spaces, Sparsity.

- O. Elisha is with GE Research and the School of Mathematics, Tel-Aviv University, Tel-Aviv.
- S. Dekel is with WIX AI and the School of Mathematics, Tel-Aviv University, Tel-Aviv.

1 INTRODUCTION

An excellent starting point for this paper is the survey on Representation Learning [3]. One of the main issues raised in this survey is that simple smoothness assumptions on the data do not hold. That is, there exists a curse of dimensionality and 'close' feature representations do not map to 'similar' values. The authors write: "We advocate learning algorithms that are flexible and non-parametric but do not rely exclusively on the smoothness assumption". In this work we do in fact advocate smoothness analysis of representation layers, yet in line with [3], our notion of smoothness is flexible, adaptive and non-parametric. We rely on geometric multivariate function space theory and use a machinery of Besov 'weak-type' smoothness which is robust enough to support quantifying the smoothness of high-dimensional discontinuous functions.

Although machine learning is mostly associated with the field of statistics, we argue that popular machine learning algorithms such as Support Vector Machines, tree-based Gradient Boosting and Random Forest (see e.g. [28]) are in fact closely related to the field of multivariate adaptive approximation theory. In essence, these algorithms work best if there exists a geometric structure of clusters in the feature space. If such geometry exists, these algorithms will capture it by segmenting out the different clusters. We claim that in the absence of such geometry, these machine learning algorithms will fail.

However, this is exactly where Deep Learning (DL) comes into play. In the absence of a geometrical structure in the given initial representation space, the goal of the DL layers is to create a series of transformations from one representation space to the next, where the structure of the geometry of the clusters improves sequentially. We quote [10]: "The whole process of applying this complex geometric transformation to the input data can be visualized in 3D by imagining a person trying to uncrumple a paper ball: the crumpled paper ball is the manifold of the input data that the model starts with. Each movement operated by the person on the paper ball is similar to a simple geometric transformation operated by one layer. The full uncrumpling gesture sequence is the complex transformation of the entire model. Deep learning models are mathematical machines for uncrumpling complicated manifolds of high-dimensional data".

Let us provide an instructive example. Assume we are presented with a set of gray-scale images of dimension $\sqrt{n} \times \sqrt{n}$ with $L$ class labels. Assume further that a DL network has been successfully trained to classify these images with relatively high precision. This allows us to extract the representation of each image in each of the hidden layers. To create a representation at the input layer, we concatenate the $\sqrt{n}$ rows of pixel values of each image, to create a vector of dimension $n$. We also normalize the pixel values to the range $[0,1]$. Since we advocate a function-theoretical approach, we transform the class labels into vector values in the space $\mathbb{R}^{L-1}$ by assigning each label to a vertex of a standard simplex (see Section 2 below). Thus, the images are considered as samples of a function $f : [0,1]^n \to \mathbb{R}^{L-1}$.
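To illustrate the label encoding just described, the following minimal sketch (ours, not taken from the paper's code) constructs the $L$ vertices of a regular simplex in $\mathbb{R}^{L-1}$, all with equal pairwise distances, and maps integer class labels to their vector values; the function name and example labels are hypothetical.

import numpy as np

def simplex_vertices(L):
    # Vertices of a regular simplex with L vertices in R^(L-1): start from the
    # standard basis of R^L (all pairwise distances sqrt(2)), center it, and
    # express it in an orthonormal basis of the hyperplane it spans.
    E = np.eye(L) - np.full((L, L), 1.0 / L)
    _, _, Vt = np.linalg.svd(E)          # last right singular vector spans the constant direction
    return E @ Vt[:L - 1].T              # (L, L-1); rows are equidistant vertices

V = simplex_vertices(10)                 # e.g. 10 classes -> vertices in R^9
y = V[np.array([3, 0, 7])]               # hypothetical labels mapped to vector values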
In the general case, there is no hope that there exists a geometric clustering of the classes in this initial feature space and that $f$ has sufficient 'weak-type' smoothness (as is verified by our experiments below). Thus, a transform into a different feature space is needed. We thus associate with each $k$-th layer of a DL network a function $f_k : [0,1]^{n_k} \to \mathbb{R}^{L-1}$, where the samples are vectors created by normalizing and concatenating the feature maps computed from each of the images. Interestingly enough, although the functions $f_k$ are embedded in different dimensions $n_k$, through the simple normalizing of the features our method is able to assign smoothness indices to each layer that are comparable. We claim that for well-performing networks, the representations in general 'improve' from layer to layer and that our method captures this phenomenon and shows the increase of smoothness.

Related work is [38], where an architecture of Convolutional Sparse Coding was analyzed. The connection to this work is the emphasis on 'sparsity' analysis of hidden layers. However, there are significant differences, since we advocate a function space theoretical analysis of any neural network architecture in current use. There is also the recent work [42], where the authors take an 'information-theoretical' approach to the analysis of the stochastic gradient descent optimization of DL networks and the representation layers. One can safely say that all of the approaches, including the one presented here, need to be further evaluated on larger datasets and deeper architectures.

The paper is organized as follows: In Section 2 we review our smoothness analysis machinery, which is the wavelet decomposition of Random Forests (RF) [22]. In Section 3 we present the required geometric function space theoretical background. Since we are comparing different representations over different spaces of different dimensions, we add to the theory presented in [22] relevant 'dimension-free' results. In Section 4 we show how to apply the theory in practice; specifically, how the wavelet decomposition of an RF can be used to numerically compute a Besov 'weak-type' smoothness index of a given function in any representation space (e.g. a hidden layer). Section 5 provides experimental results that demonstrate how our theory is able to explain empirical findings in various scenarios. Finally, Section 6 presents our conclusions as well as future work.

2 WAVELET DECOMPOSITION OF RANDOM FORESTS
To measure the smoothness of a dataset at the various DL representation layers, we apply the construction of wavelet decompositions of Random Forests [22]. Wavelets [13], [34] and geometric wavelets [15], [1] are a powerful yet simple tool for constructing sparse representations of 'complex' functions. The Random Forest (RF) [4], [11], [28], introduced by Breiman [5], [6], is a very effective machine learning method that can be considered as a way to overcome the 'greedy' nature and high variance of a single decision tree. When combined, the wavelet decomposition of the RF unravels the sparsity of the underlying function and establishes an order of the RF nodes from 'important' components to 'negligible' noise. Therefore, the method provides a better understanding of any constructed RF. Furthermore, the method is a basis for a robust feature importance algorithm. We note that one can apply a similar approach to improve the performance of tree-based Gradient Boosting algorithms [14].

We begin with an overview of single trees. In statistics and machine learning [7], [2], [4], [16], [28] the construction is called a Decision Tree or the Classification and Regression Tree (CART). We are given a real-valued function $f \in L_p(\Omega_0)$ or a discrete dataset $\{x_i \in \Omega_0, f(x_i)\}_{i \in I}$, in some convex bounded domain $\Omega_0 \subset \mathbb{R}^n$. The goal is to find an efficient representation of the underlying function, overcoming the complexity, geometry and possibly non-smooth nature of the function values. To this end, we subdivide the initial domain $\Omega_0$ into two subdomains, e.g. by intersecting it with a hyper-plane. The subdivision is performed to minimize a given cost function. This subdivision process then continues recursively on the subdomains until some stopping criterion is met, which in turn determines the leaves of the tree.

We now describe one instance of the cost function which is related to minimizing variance. At each stage of the subdivision process, at a certain node of the tree, the algorithm finds, for the convex domain $\Omega \subset \mathbb{R}^n$ associated with the node:
(i) a partition by a hyper-plane into two convex subdomains $\Omega', \Omega''$, with $\Omega' \cup \Omega'' = \Omega$,
(ii) two multivariate polynomials $Q_{\Omega'}, Q_{\Omega''} \in \Pi_{r-1}(\mathbb{R}^n)$, of fixed (typically low) total degree $r-1$,
that minimize
$$\|f - Q_{\Omega'}\|_{L_p(\Omega')}^p + \|f - Q_{\Omega''}\|_{L_p(\Omega'')}^p. \quad (2.1)$$
Here, for $1 \le p < \infty$, we used the definition
$$\|g\|_{L_p(\tilde\Omega)} := \left( \int_{\tilde\Omega} |g(x)|^p dx \right)^{1/p}.$$
If the dataset is discrete, consisting of feature vectors $x_i \in \mathbb{R}^n$, $i \in I$, with response values $f(x_i)$, then a discrete functional is minimized over all partitions $\Omega' \cup \Omega'' = \Omega$:
$$\sum_{x_i \in \Omega'} |f(x_i) - Q_{\Omega'}(x_i)|^p + \sum_{x_i \in \Omega''} |f(x_i) - Q_{\Omega''}(x_i)|^p. \quad (2.2)$$
Observe that for any given subdividing hyperplane, the approximating polynomials in (2.2) can be uniquely determined for $p = 2$ by least squares minimization. For the order $r = 1$, the approximating polynomials are nothing but the means of the function values over each of the subdomains
$$Q_{\Omega'}(x) = C_{\Omega'} = \frac{1}{\#\{x_i \in \Omega'\}} \sum_{x_i \in \Omega'} f(x_i), \qquad Q_{\Omega''}(x) = C_{\Omega''} = \frac{1}{\#\{x_i \in \Omega''\}} \sum_{x_i \in \Omega''} f(x_i).$$
In many applications of decision trees, the high dimensionality of the data does not allow searching through all possible subdivisions. As in our experimental results, one may restrict the subdivisions to the class of hyperplanes aligned with the main axes.
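For concreteness, here is a minimal sketch of the axis-aligned subdivision search for the discrete cost (2.2) with $p = 2$ and $r = 1$: a naive exhaustive scan over features and thresholds, not the optimized implementation used in practice.

import numpy as np

def best_axis_aligned_split(X, Y):
    # Exhaustive search over features and thresholds for the split minimizing
    # the p=2, r=1 cost (2.2): the sum over both sides of squared deviations
    # from the side means.  X: (n_samples, n_features), Y: (n_samples, L-1).
    n_samples, n_features = X.shape
    best_feature, best_threshold, best_cost = None, None, np.inf
    for j in range(n_features):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], Y[order]
        for i in range(1, n_samples):
            if xs[i] == xs[i - 1]:
                continue                              # no threshold separates equal values
            left, right = ys[:i], ys[i:]
            cost = ((left - left.mean(axis=0)) ** 2).sum() + \
                   ((right - right.mean(axis=0)) ** 2).sum()
            if cost < best_cost:
                best_feature, best_threshold, best_cost = j, 0.5 * (xs[i - 1] + xs[i]), cost
    return best_feature, best_threshold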
In contrast, there are cases where one would like to consider more advanced forms of subdivision, which take certain hyper-surface forms or even non-linear forms through kernel Support Vector Machines. Our paradigm of wavelet decompositions can, in principle, support all of these forms.

Random Forest (RF) is a popular machine learning tool that collects decision trees into an ensemble model [5], [4]. The trees are constructed independently in a diverse fashion and prediction is done by a voting mechanism among all trees. A key element [5] is that large diversity between the trees reduces the ensemble's variance. There are many RF variations that differ in the way randomness is injected into the model, e.g. bagging, random feature subset selection and the partition criterion [11], [28]. Our wavelet decomposition paradigm is applicable to most of the RF versions known from the literature.
Bagging [6] is a method that produces partial replicates of the training data for each tree. A typical approach is to randomly select for each tree a certain percentage of the training set (e.g. 80%) or to randomly select samples with repetitions [28]. Additional methods to inject randomness can be applied at the node partitioning level. For each node, we may restrict the partition criteria to a small random subset of the parameter values (a hyper-parameter). A typical selection is to search for a partition from a random subset of $\sqrt{n}$ features [5]. This technique is also useful for reducing the amount of computation when searching for the appropriate partition for each node. Bagging and random feature selection are not mutually exclusive and can be used together.

For $j = 1, \dots, J$, one creates a decision tree $\mathcal{T}_j$, based on a subset of the data, $X_j$. One then assigns a weight (score) $w_j$ to the tree $\mathcal{T}_j$, based on the estimated performance of the tree, where $\sum_{j=1}^J w_j = 1$. In supervised learning, one typically uses the remaining data points $x_i \notin X_j$ to evaluate the performance of $\mathcal{T}_j$. For simplicity, we will mostly consider in this paper the choice of uniform weights $w_j = 1/J$. For any point $x \in \Omega_0$, the approximation associated with the $j$-th tree, denoted by $\tilde f_j(x)$, is computed by finding the leaf $\Omega \in \mathcal{T}_j$ in which $x$ is contained and then evaluating $\tilde f_j(x) := Q_\Omega(x)$, where $Q_\Omega$ is the corresponding polynomial associated with the decision node $\Omega$. One then assigns an approximate value to any point $x \in \Omega_0$ by
$$\tilde f(x) = \sum_{j=1}^J w_j \tilde f_j(x).$$

Typically, in classification problems, the response variable does not have a numeric value, but is labeled by one of $L$ classes. In this scenario, each input training point $x_i \in \Omega_0$ is assigned a class $\mathcal{C}l(x_i)$. To convert the problem to the 'functional' setting described above, one assigns to each class the value of a vertex of the regular simplex consisting of $L$ vertices in $\mathbb{R}^{L-1}$ (all with equal pairwise distances). Thus, we may assume that the input data is in the form
$$\{x_i, \mathcal{C}l(x_i)\}_{i \in I} \in \left( \mathbb{R}^n, \mathbb{R}^{L-1} \right).$$
In this case, if we choose approximation using constants ($r = 1$), then the calculated mean over any subdomain $\Omega$ is in fact a point $\vec E_\Omega \in \mathbb{R}^{L-1}$ inside the simplex. Obviously, any value inside the multidimensional simplex can be mapped back to a class, along with an estimated confidence level, by calculating the closest vertex of the simplex to it.

Following the classic paradigm of nonlinear approximation using wavelets [13], [17], [34] and the geometric function space theory presented in [15], [30], we introduced in [22] a construction of a wavelet decomposition of a forest. Let $\Omega'$ be a child of $\Omega$ in a tree $\mathcal{T}$, i.e. $\Omega' \subset \Omega$ and $\Omega'$ was created by a partition of $\Omega$. Denote by $\mathbf{1}_{\Omega'}$ the indicator function over the child domain $\Omega'$, i.e. $\mathbf{1}_{\Omega'}(x) = 1$ if $x \in \Omega'$ and $\mathbf{1}_{\Omega'}(x) = 0$ if $x \notin \Omega'$. We use the polynomial approximations $Q_{\Omega'}, Q_\Omega \in \Pi_{r-1}(\mathbb{R}^n)$, computed by the local minimization (2.1), and define
$$\psi_{\Omega'}(x) := \psi_{\Omega'}(f)(x) := \mathbf{1}_{\Omega'}(x)\left( Q_{\Omega'}(x) - Q_\Omega(x) \right), \quad (2.3)$$
as the geometric wavelet associated with the subdomain $\Omega'$ and the function $f$, or the given discrete dataset $\{x_i, f(x_i)\}_{i \in I}$. Each wavelet $\psi_{\Omega'}$ is a 'local difference' component that belongs to the detail space between two levels in the tree, a 'low resolution' level associated with $\Omega$ and a 'high resolution' level associated with $\Omega'$.
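A minimal sketch of computing the geometric wavelet norms (see (2.5)-(2.6) below) for a single fitted tree with $r = 1$; the flat node-list format with 'samples', 'parent' and 'volume' fields is a hypothetical representation we use for illustration, not the data structure of [22] or [48].

import numpy as np

def wavelet_norms(nodes, Y, p=2):
    # `nodes` is a hypothetical flat list of tree nodes, each a dict with
    # 'samples' (indices of training points falling in the node), 'parent'
    # (index of the parent node, or None for the root) and 'volume' (|Omega'|).
    means = [Y[nd['samples']].mean(axis=0) for nd in nodes]      # E_Omega per node
    norms = np.zeros(len(nodes))
    for k, nd in enumerate(nodes):
        if nd['parent'] is None:
            continue                       # the root term is Q_{Omega_0}, not a wavelet
        diff = means[k] - means[nd['parent']]                    # E_child - E_parent
        # ||psi||_p = ( ||E_child - E_parent||_{l2}^p * |Omega_child| )^(1/p)
        norms[k] = (np.linalg.norm(diff) ** p * nd['volume']) ** (1.0 / p)
    return norms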
Also, the wavelets (2.3) have the 'zero moments' property: if the response variable is sampled from a polynomial of degree $r-1$ over $\Omega$, then our local scheme will compute $Q_{\Omega'}(x) = Q_\Omega(x) = f(x)$, $\forall x \in \Omega$, and therefore $\psi_{\Omega'} = 0$.

Under certain mild conditions on the tree $\mathcal{T}$ and the function $f$, we have, by the nature of the wavelets, the 'telescopic' sum of differences
$$f = \sum_{\Omega \in \mathcal{T}} \psi_\Omega, \qquad \psi_{\Omega_0} := Q_{\Omega_0}. \quad (2.4)$$
For example, (2.4) holds in the $L_p$-sense, $1 \le p < \infty$, if $f \in L_p(\Omega_0)$ and if for any $x \in \Omega_0$ and any sequence of domains $\Omega_l \in \mathcal{T}$, each on a level $l$, with $x \in \Omega_l$, we have $\lim_{l \to \infty} \mathrm{diam}(\Omega_l) = 0$.

The norm of a wavelet is computed by
$$\|\psi_{\Omega'}\|_p^p = \int_{\Omega'} |Q_{\Omega'}(x) - Q_\Omega(x)|^p dx.$$
For the case $r = 1$, where $Q_\Omega(x) = C_\Omega$ and $Q_{\Omega'}(x) = C_{\Omega'}$, this simplifies to
$$\|\psi_{\Omega'}\|_p^p = |C_{\Omega'} - C_\Omega|^p\, |\Omega'|, \quad (2.5)$$
where $|\Omega'|$ denotes the volume of $\Omega'$. Observe that for $r = 1$, the subdivision process for partitioning a node by minimizing (2.1) is equivalent to maximizing the sum of squared norms of the wavelets that are formed in that partition (see [22]).

Recall that our approach is to convert classification problems into a 'functional' setting by assigning the $L$ class labels to vertices of a simplex in $\mathbb{R}^{L-1}$. In such cases of multi-valued functions, choosing $r = 1$, the wavelet $\psi_{\Omega'} : \mathbb{R}^n \to \mathbb{R}^{L-1}$ is
$$\psi_{\Omega'}(x) = \mathbf{1}_{\Omega'}(x)\left( \vec E_{\Omega'} - \vec E_\Omega \right),$$
and its norm is given by
$$\|\psi_{\Omega'}\|_p^p = \left\| \vec E_{\Omega'} - \vec E_\Omega \right\|_{l_2}^p |\Omega'|, \quad (2.6)$$
where for $\vec v \in \mathbb{R}^{L-1}$, $\|\vec v\|_{l_2} := \sqrt{\sum_{i=1}^{L-1} v_i^2}$.

Using any given weights assigned to the trees, we obtain a wavelet representation of the entire RF
$$\tilde f(x) = \sum_{j=1}^J \sum_{\Omega \in \mathcal{T}_j} w_j \psi_\Omega(x). \quad (2.7)$$
The theory (see [22]) tells us that sparse approximation is achieved by ordering the wavelet components based on their norm
$$w_{j(\Omega_{k_1})} \left\| \psi_{\Omega_{k_1}} \right\|_p \ge w_{j(\Omega_{k_2})} \left\| \psi_{\Omega_{k_2}} \right\|_p \ge \cdots \quad (2.8)$$
with the notation $\Omega \in \mathcal{T}_j \Rightarrow j(\Omega) = j$. Thus, the adaptive M-term approximation of an RF is
$$f_M(x) := \sum_{m=1}^M w_{j(\Omega_{k_m})} \psi_{\Omega_{k_m}}(x). \quad (2.9)$$
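As an illustration of the global ordering (2.8) and the resulting M-term sums, here is a minimal sketch; the flat wavelet list with fields weight, norm and psi(X) is a hypothetical interface, not the structure of the actual implementation [48], and the reported error is the empirical squared l2 error over the samples.

import numpy as np

def m_term_errors(wavelets, X, Y, M_max=1000):
    # `wavelets` is a hypothetical flat list over all trees of the forest; each
    # element w carries w.weight (the tree weight w_j), w.norm (||psi_Omega||_p)
    # and w.psi(X), which evaluates the wavelet on the samples X.
    # Y is assumed to be of shape (n_samples, L-1) (simplex-valued responses).
    order = sorted(wavelets, key=lambda w: w.weight * w.norm, reverse=True)   # (2.8)
    approx = np.zeros(Y.shape)
    errors = []
    for w in order[:M_max]:
        approx += w.weight * w.psi(X)      # add the next most significant term, (2.9)
        errors.append(np.mean(np.sum((approx - Y) ** 2, axis=1)))
    return np.array(errors)                # errors[M-1] is the M-term error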
Observe that, contrary to most existing tree pruning techniques, where each tree is pruned separately, the above approximation process applies a 'global' pruning strategy where the significant components can come from any node of any of the trees at any level. For simplicity, one could choose $w_j = 1/J$ and obtain
$$f_M(x) = \frac{1}{J} \sum_{m=1}^M \psi_{\Omega_{k_m}}(x). \quad (2.10)$$
Fig. 1 depicts an M-term approximation (2.10) selected from an RF ensemble. The red colored nodes illustrate the selection of the M wavelets with the highest norm values from the entire forest. Observe that they can be selected from any tree at any level, with no connectivity restrictions.

Fig. 1. Selection of an M-term approximation from the entire forest.

Figure 2 depicts how the parameter M is selected for the challenging "Red Wine Quality" dataset from the UCI repository [45]. The generation of 10 decision trees on the training set creates approximately 3500 wavelets. The parameter M is then selected by minimization of the approximation error on an OOB validation set. In contrast with other pruning methods [32], using (2.8), the wavelet approximation method may select significant components from any tree and any level in the forest. By this method, one does not need to predetermine the maximal depth of the trees, and over-fitting is controlled by the selection of significant wavelet components.

Fig. 2. "Red Wine Quality" dataset - numeric computation of M for optimal regression.

3 GEOMETRIC MULTIVARIATE FUNCTION SPACE THEORY
An important research area of approximation theory, pioneered by Pencho Petrushev, is the characterization of adaptive geometric approximation algorithms by generalizations of the classic 'isotropic' Besov space to more 'geometric' Besov-type spaces [12], [15], [30]. We first review the definitions and results of [22]. In essence, this is a generalization of a theoretical framework that has been successfully applied in the context of low dimensional and structured signal processing [17], [20]. However, in the context of machine learning, we need to analyze unstructured and possibly high dimensional datasets.

Approximation theory relates the sparsity of a function to its Besov smoothness index and supports cases where the function is not even continuous. For a function $f \in L_\tau(\Omega)$, $0 < \tau \le \infty$, $h \in \mathbb{R}^n$ and $r \in \mathbb{N}$, we recall the $r$-th order difference operator
$$\Delta_h^r(f, x) := \Delta_h^r(f, \Omega, x) := \sum_{k=0}^r (-1)^{r+k} \binom{r}{k} f(x + kh),$$
where we assume the segment $[x, x + rh]$ is contained in $\Omega$. Otherwise, we set $\Delta_h^r(f, \Omega, x) = 0$. The modulus of smoothness of order $r$ is defined by
$$\omega_r(f, t)_\tau := \sup_{|h| \le t} \left\| \Delta_h^r(f, \Omega, \cdot) \right\|_{L_\tau(\Omega)}, \quad t > 0,$$
where for $h \in \mathbb{R}^n$, $|h|$ denotes the norm of $h$. We also denote
$$\omega_r(f, \Omega)_\tau := \omega_r\!\left( f, \frac{\mathrm{diam}(\Omega)}{r} \right)_\tau. \quad (3.1)$$
Next, we define the 'weak-type' Besov smoothness of a function, subject to the geometry of a single (possibly adaptive) tree.

Definition 3.1.
For $0 < p < \infty$ and $\alpha > 0$, we set $\tau = \tau(\alpha, p)$, determined by $1/\tau := \alpha + 1/p$. For a given function $f \in L_p(\Omega_0)$, $\Omega_0 \subset \mathbb{R}^n$, and tree $\mathcal{T}$, we define the associated B-space smoothness in $B_\tau^{\alpha,r}(\mathcal{T})$, $r \in \mathbb{N}$, by
$$|f|_{B_\tau^{\alpha,r}(\mathcal{T})} := \left( \sum_{\Omega \in \mathcal{T}} \left( |\Omega|^{-\alpha} \omega_r(f, \Omega)_\tau \right)^\tau \right)^{1/\tau}, \quad (3.2)$$
where $|\Omega|$ denotes the volume of $\Omega$.

This notion of smoothness allows us to handle functions that are not even continuous. The higher the index $\alpha$ for which (3.2) is finite, the smoother the function is. This generalizes the Sobolev smoothness of differentiable functions that have their partial derivatives integrable in some $L_\tau$ space. Also, the above definition generalizes the classical function space theory of Besov spaces, where the tree partitions are non-adaptive. In fact, classical Besov spaces are a special case, where the tree is constructed by partitioning into dyadic cubes, each time using $n$ levels of the tree. We recall that a 'well clustered' function is in fact infinitely smooth in the right adaptively chosen Besov space.

Lemma 3.2.
Let $f(x) = \sum_{k=1}^K P_k(x) \mathbf{1}_{B_k}(x)$, where each $B_k \subset \Omega_0$ is a box with sides parallel to the main axes and $P_k \in \Pi_{r-1}$. We further assume that $B_k \cap B_j = \emptyset$ whenever $j \neq k$. Then, there exists an adaptive tree partition $\mathcal{T}$ such that $f \in B_\tau^{\alpha,r}(\mathcal{T})$ for any $\alpha > 0$.

Proof:
See [22].

For a given forest $\mathcal{F} = \{\mathcal{T}_j\}_{j=1}^J$ and weights $w_j = 1/J$, the $\alpha$-Besov semi-norm associated with the forest is
$$|f|_{B_\tau^{\alpha,r}(\mathcal{F})} := \left( \frac{1}{J} \sum_{j=1}^J |f|_{B_\tau^{\alpha,r}(\mathcal{T}_j)}^\tau \right)^{1/\tau}. \quad (3.3)$$

Definition 3.3.
Given a (possibly adaptive) forest representation, we define the Besov smoothness index of $f$ as the maximal index $\alpha$ for which (3.3) is finite.

Remark.
It is known that different geometric approximation schemes are characterized by different flavors of Besov-type smoothness. In this work, for example, all of our experimental results compute smoothness of representations using partitions along the main $n$ axes. This restriction may lead, in general, to potentially lower Besov smoothness of the underlying function and lower sparsity of the wavelet representation. Yet, the theoretical definitions and results of this paper also apply to more generalized schemes where, for example, tree partitions are performed using arbitrary hyper-planes. In such a case, the smoothness index of a given function may increase.

Next, for a given tree $\mathcal{T}$ and parameter $0 < \tau < p$, we denote the $\tau$-sparsity of the tree by
$$N_\tau(f, \mathcal{T}) = \left( \sum_{\Omega \neq \Omega_0,\, \Omega \in \mathcal{T}} \|\psi_\Omega\|_p^\tau \right)^{1/\tau}. \quad (3.4)$$
Let us further denote the $\tau$-sparsity of a forest $\mathcal{F}$ by
$$N_\tau(f, \mathcal{F}) := \left( \frac{1}{J} \sum_{j=1}^J \sum_{\Omega \neq \Omega_0,\, \Omega \in \mathcal{T}_j} \|\psi_\Omega\|_p^\tau \right)^{1/\tau} = \left( \frac{1}{J} \sum_{j=1}^J N_\tau(f, \mathcal{T}_j)^\tau \right)^{1/\tau}.$$
In the setting of a single tree constructed to represent a real-valued function, and under mild conditions on the partitions (see the remark after (2.4) and condition (3.7)), the theory of [15] proves the equivalence
$$|f|_{B_\tau^{\alpha,r}(\mathcal{T})} \sim N_\tau(f, \mathcal{T}). \quad (3.5)$$
This implies that there are constants $0 < C_1 < C_2 < \infty$, depending on parameters such as $\alpha, p, n, r$ and $\rho$ in condition (3.7) below, such that
$$C_1\, |f|_{B_\tau^{\alpha,r}(\mathcal{T})} \le N_\tau(f, \mathcal{T}) \le C_2\, |f|_{B_\tau^{\alpha,r}(\mathcal{T})}.$$
Therefore, we also have for the forest model
$$|f|_{B_\tau^{\alpha,r}(\mathcal{F})} \sim N_\tau(f, \mathcal{F}). \quad (3.6)$$
In the setting in which we wish to apply our function theoretical approach, we are comparing the smoothness of representations over different layers of DL networks. This implies that we are analyzing and comparing the smoothness of a set of functions $f_k$, each over a different representation space of a different dimension $n_k$. This is, in some sense, non-standard in function space theory, where the space, or at least the dimension, over which the functions have their domain is typically fixed. Specifically, observe that the equivalence (3.6) depends on the dimension $n$ of the feature space. To this end, we add to the theory a 'dimension-free' analysis for the case $r = 1$.

We begin with a Jackson-type estimate for the degree of the adaptive wavelet forest approximation, which we keep 'dimension free' for the case $r = 1$.

Theorem 3.4.
Let $\mathcal{F} = \{\mathcal{T}_j\}_{j=1}^J$ be a forest. Assume there exists a constant $0 < \rho < 1$, such that for any domain $\Omega \in \mathcal{F}$ on a level $l$ and any domain $\Omega' \in \mathcal{F}$ on the level $l+1$, with $\Omega \cap \Omega' \neq \emptyset$, we have
$$|\Omega'| \le \rho\, |\Omega|, \quad (3.7)$$
where $|E|$ denotes the volume of $E \subset \mathbb{R}^n$. For any $r \ge 1$, denote formally $f = \sum_{\Omega \in \mathcal{F}} w_{j(\Omega)} \psi_\Omega$, and assume that $N_\tau(f, \mathcal{F}) < \infty$, where $1/\tau = \alpha + 1/p$. Then, for the $M$-term approximation (2.9) we have for $r = 1$
$$\sigma_M(f) := \|f - f_M\|_p \le C(p, \alpha, \rho)\, J M^{-\alpha} N_\tau(f, \mathcal{F}), \quad (3.8)$$
and for $r > 1$
$$\sigma_M(f) := \|f - f_M\|_p \le C(p, \alpha, \rho, n)\, J M^{-\alpha} N_\tau(f, \mathcal{F}). \quad (3.9)$$

Proof:
The proof in [22] shows (3.9). To see (3.8), we observe that the dimension $n$ comes into play in the Nikolskii-type estimate for bounded convex domains $\Omega \subset \mathbb{R}^n$ and $r \ge 1$:
$$\|\psi_\Omega\|_\infty \le c(p, n, r)\, |\Omega|^{-1/p} \|\psi_\Omega\|_p.$$
However, for the special case of $r = 1$ this actually simplifies to
$$\|\psi_\Omega\|_\infty = |\Omega|^{-1/p} \|\psi_\Omega\|_p.$$
Using the equivalence (3.6), we get for any $r \ge 1$
$$\sigma_M(f) \le C(p, \alpha, \rho, n)\, J M^{-\alpha} |f|_{B_\tau^{\alpha,r}(\mathcal{F})},$$
which is not a 'dimension-free' Jackson estimate, unlike the one we will show below for $r = 1$ (see (3.11)). Next, we present a simple invariance property of the smoothness analysis under higher dimensional embedding.

Lemma 3.5.
Let $\{x_i\}$, $x_i \in [0,1]^n$, with values $f(x_i) \in \mathbb{R}^{L-1}$, $i \in I$. Let $\mathcal{F}$ be a forest approximation of the data. For any $m \ge 1$, let $\{\tilde x_i\}$ be defined by $\tilde x_i = (x_i, 0, \dots, 0) \in [0,1]^{n+m}$, $i \in I$. Let us further define $\tilde f(\tilde x_i) := f(x_i)$. Next, denote by $\tilde{\mathcal{F}}$ a forest defined over $[0,1]^{n+m}$ which is the natural extension of $\mathcal{F}$, using the same trees with the same partitions over the first $n$ dimensions. Then, for $r = 1$ and any $\tau > 0$,
$$N_\tau\!\left( \tilde f, \tilde{\mathcal{F}} \right) = N_\tau(f, \mathcal{F}).$$

Proof:
Let $\Omega' \in \mathcal{F}$ be the domains of the trees of $\mathcal{F}$, with wavelets of the type $\psi_{\Omega'}(x) = \mathbf{1}_{\Omega'}(x)\left( \vec E_{\Omega'} - \vec E_\Omega \right)$. Recall that $N_\tau(f, \mathcal{F})$ is the $l_\tau$ norm of the sequence of the wavelet norms given by (2.6). Now, for each domain $\Omega' \in \mathcal{F}$ and the corresponding domain $\tilde\Omega' \in \tilde{\mathcal{F}}$, the normalization of the feature space into $[0,1]^n$ and the higher dimensional embedding in $[0,1]^{n+m}$ ensure that
$$|\Omega'| = |\Omega'| \times \left| [0,1]^m \right| = \left| \tilde\Omega' \right|.$$
Since the vector means $\{\vec E_{\Omega'}\}$ remain unchanged under the higher dimensional embedding, we have
$$\|\psi_{\Omega'}\|_{L_p([0,1]^n)} = \|\psi_{\tilde\Omega'}\|_{L_p([0,1]^{n+m})}.$$
This gives $N_\tau\!\left( \tilde f, \tilde{\mathcal{F}} \right) = N_\tau(f, \mathcal{F})$.

Next, to allow our smoothness analysis to be 'dimension free', we modify the modulus of smoothness (3.1) for $r = 1$ and use the following form of 'averaged modulus'.

Definition 3.6.
For a function $f : \Omega_0 \to \mathbb{R}^{L-1}$ we define
$$w(f, \Omega)_\tau := \left( \int_\Omega \left\| f(x) - \vec E_\Omega \right\|_{l_2(\mathbb{R}^{L-1})}^\tau dx \right)^{1/\tau}, \quad (3.10)$$
where $\vec E_\Omega$ is the average of $f$ over $\Omega$.

It is well known that averaged forms of the modulus are equivalent to the form (3.1), but with constants that depend on the dimension. However, replacing (3.1) with (3.10) allows us to produce 'dimension-free' analysis. We use (3.10) to define
$$|f|_{\tilde B_\tau^{\alpha,1}(\mathcal{T})} := \left( \sum_{\Omega \in \mathcal{T}} \left( |\Omega|^{-\alpha} w(f, \Omega)_\tau \right)^\tau \right)^{1/\tau}.$$
We can now show
Theorem 3.7.
Let $f : \Omega_0 \to \mathbb{R}^{L-1}$. Then the following equivalence holds for the case $r = 1$:
$$|f|_{\tilde B_\tau^{\alpha,1}(\mathcal{F})} \sim N_\tau(f, \mathcal{F}),$$
where $1/\tau = \alpha + 1/p$, and the constants of equivalence depend on $\alpha, \tau, \rho$, but not on $n$.

Proof:
See the Appendix.

This equivalence, together with (3.8), implies that for $r = 1$ we do have a 'dimension-free' Jackson estimate
$$\sigma_M(f) \le C(p, \alpha, \rho)\, J M^{-\alpha} |f|_{\tilde B_\tau^{\alpha,1}(\mathcal{F})}. \quad (3.11)$$

4 SMOOTHNESS ANALYSIS OF THE REPRESENTATION LAYERS IN DEEP LEARNING NETWORKS
We now explain how the theory presented in Section 3 is used to estimate the 'weak-type' smoothness of a given function in a given representation layer. Recall from the introduction that we create a representation of the images at the input layer by concatenating the $\sqrt{n}$ rows of pixel values of each grayscale image, to create a vector of dimension $n$ (or $3 \times n$ for a color image). We also normalize the pixel values to the range $[0,1]$. We then transform the class labels into vector values in the space $\mathbb{R}^{L-1}$ by assigning each label to a vertex of a standard simplex (see Section 2). Thus, the images are considered as samples of a function $f : [0,1]^n \to \mathbb{R}^{L-1}$. In the same manner, we associate with each $k$-th layer of a DL network a function $f_k : [0,1]^{n_k} \to \mathbb{R}^{L-1}$, where $n_k$ is the number of features/neurons at the $k$-th layer. The samples of $f_k$ are obtained by applying the network to the original images up to the given $k$-th layer. For example, in a convolution layer, we capture the representations after the cycle of convolution, non-linearity and pooling. We then extract vectors created by normalizing and concatenating the feature map values corresponding to the images. Recall that although the functions $\{f_k\}$ are embedded in different dimensions $\{n_k\}$, through the simple normalizing of the features our method is able to assign smoothness indices to each layer that are comparable.

Next we describe how we estimate the smoothness of each function $f_k$. To this end, we have made several improvements and simplifications to the method of [22]. We compute an RF over the samples of $f_k$ with the choice $r = 1$ and then apply the wavelet decomposition of the RF (see Section 2). For each $k$ and $M$ one computes the discrete error of the wavelet $M$-term approximation for the case $p = 2$:
$$\sigma_M(f_k) = \frac{1}{|I|} \sum_{i \in I} \left\| S_M(f_k)(x_i) - f_k(x_i) \right\|_{l_2(\mathbb{R}^{L-1})}^2. \quad (4.1)$$
We then use the theoretical estimate (3.11) and the numeric estimate of $\sigma_M := \sigma_M(f_k)$ in (4.1) to model the error function by $\sigma_M \sim c_k M^{-\alpha_k}$ for unknown $c_k, \alpha_k$. Notice that the constant $c_k$ absorbs the terms relating to the absolute constant, the number of trees in the RF model as well as the Besov norm. Numerically, one simply models $\log(\sigma_M) \sim \log(c_k) - \alpha_k \log(M)$, $M = 1, \dots, \tilde M$, and then solves through least squares for $c_k, \alpha_k$. Finally, we set $\alpha_k$ as our estimate for the 'critical' Besov smoothness index of $f_k$.

Remarks:
(i) Observe that for the fit of $c_k$ and $\alpha_k$, we only use $\tilde M$ significant terms, so as to avoid fitting the 'noisy' tail of the exponential expression. In some cases, we allow ourselves to select $\tilde M$ adaptively, by discarding a tail of wavelet components that is over-fitting the training data, but increasing the error on validation set samples (see Figure 2). However, in cases where the goal is to demonstrate understanding of generalization, we restrict the analysis to only using the training set and then pre-select $\tilde M$ (e.g. $\tilde M = 1000$ in the experiments we review below).
(ii) Notice that since each representation space can be of a very different dimension, it is crucial that the method is invariant under different dimension embeddings.
(iii) We note that this approach to computing the geometric Besov smoothness of a labeled dataset is a significant generalization of the method used in [20] to compute the (classical) Besov smoothness of a single image. Nevertheless, there is a distinct similar underlying function space approach.
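A minimal sketch of the least squares fit described above, assuming the M-term errors (4.1) have already been computed and stored in an array `sigma` for $M = 1, \dots, \tilde M$ (the function name is ours):

import numpy as np

def estimate_alpha(sigma):
    # Model log(sigma_M) ~ log(c_k) - alpha_k * log(M), M = 1, ..., M_tilde,
    # and solve for (log c_k, alpha_k) by least squares.
    M = np.arange(1, len(sigma) + 1)
    A = np.column_stack([np.ones(len(M)), -np.log(M)])
    (log_c, alpha), *_ = np.linalg.lstsq(A, np.log(sigma), rcond=None)
    return alpha          # the estimated 'critical' Besov smoothness index alpha_k

The slope of this log-log fit is the per-layer smoothness index reported in the experiments below.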
5 APPLICATIONS AND EXPERIMENTAL RESULTS

In all of the experiments we used TensorFlow network models. We extracted the representation of any data sample (e.g. an image) in any layer of a network by simply running the TensorFlow 'Session' object with the given layer and the data sample as the parameters. The computation of the Besov smoothness index in a given feature space is implemented as explained in Section 4. We used an updated version of the code of [22], which is available via the link in [48].
For the hyper-parameter that determines the number of $M$-term errors (4.1), $M = 1, \dots, \tilde M$, which are used to model the $\alpha$-smoothness, we used $\tilde M = 1000$. The code was executed on the Amazon Web Services cloud, on r3.8xlarge configurations that have 32 virtual CPUs and 244 GB of memory. We note that computing the smoothness of all the representations of a certain dataset of images over all layers requires significant computation. One needs to create an RF approximation of the representation at each layer, sort the wavelet components based on their norms and compute the $M$-term errors (4.1), $M = 1, \dots, \tilde M$, before the numeric fit of the smoothness index can be computed. Thus, in our experiments, we computed and used for the smoothness fit only the errors $\sigma_i$, $i = 1, \dots$, to speed up the computation.

We now present results of the smoothness analysis of layer representations for some datasets and trained networks. We begin with the audio dataset "Urban Sound Classification" from [46]. We applied our smoothness analysis to representations of the "Urban8K" audio data at the layers of the DeepListen model [19], which achieves an accuracy of . The network is a simple feed-forward network of 4 fully connected layers with ReLU non-linearities. As described in Section 4, we created a functional representation of the data at each layer and estimated the Besov $\alpha$ smoothness index at each layer. In Figure 3 we see how the clustering is 'unfolded' by the network, as the Besov $\alpha$-index increases from layer to layer.

Fig. 3. Smoothness analysis of the layer representations of "Urban8K" using the DeepListen [19] fully-connected architecture.

Next we present results on image datasets. We trained the network [44] on the CIFAR10 image dataset [8]. As described in [44], the images were cropped to size $\times$ . The network has 2 convolution layers (with 9216 and 2304 features, respectively) and 2 fully connected layers (with 384 and 192 features, respectively), with an additional soft-max layer (with final layer 'logits' of 10 classes). The training set size is 50,000 and the testing set size is 10,000. As expected [44], the trained network achieves accuracy on the testing data. In Figure 4 we see a clear indication of how the smoothness begins to evolve during the training after 20 epochs and how the 'unfolding' of the clustering improves from layer to layer. We also see that the smoothness improves after 50 epochs, correlating with the improvement of the accuracy.

Fig. 4. Smoothness analysis of DL layer representations of CIFAR10.

We now describe our experiments with the well-known MNIST dataset of 60,000 training and 10,000 testing images [36]. The DL network configuration we used is the 'textbook' version of [37], which is composed of two convolution layers and two fully connected layers. Training the model for 100 epochs produces a model with . accuracy on the training data and a clear monotone increase of Besov smoothness across layers, as shown in Figure 5.

Fig. 5. Smoothness analysis of DL layer representations of MNIST.

Following [49], we applied random mis-labeling to the MNIST and CIFAR10 image sets at various levels. We randomly picked subsets of size q% of the size of the dataset, with q = 10%, 20%, 30%, 40%, and then for each image in this subset we picked a random label.
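The mis-labeling procedure just described amounts to the following sketch; the function name and the seed handling are ours, not part of [49] or of the original experiments.

import numpy as np

def mislabel(labels, q, num_classes, seed=0):
    # Randomly pick a subset of size q*len(labels) and give each picked
    # sample a label drawn uniformly at random from the available classes.
    rng = np.random.RandomState(seed)
    corrupted = labels.copy()
    idx = rng.choice(len(labels), size=int(q * len(labels)), replace=False)
    corrupted[idx] = rng.randint(0, num_classes, size=len(idx))
    return corrupted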
We then trained the network of [37] on the mis-labeled MNIST datasets and the network of [44] on the mis-labeled CIFAR10 sets. We emphasize that the goal of this experiment is to understand generalization [49] and to automatically detect the level of corruption solely from the smoothness analysis of the training data. Recall from [49] that a network can converge relatively quickly to an over-fit even on highly mis-labeled training sets. Thus, convergence is not a good indication of the generalization capabilities and specifically of the level of mis-labeling in the training data.

Next, we created a wavelet decomposition of an RF on the representation of the training set at the last inner layer of the network. Typically, this is the fully connected layer right before the softmax.

Fig. 6. Precision error decay with adaptive wavelet approximation on mis-labeled MNIST.
TABLE 1. Smoothness analysis of mis-labeled images: Besov smoothness measured at the last inner layer, for MNIST and CIFAR10, at mis-labeling levels of 0%, 10%, 20%, 30% and 40%.
In Figure 6 we see the decay of the precision error of the adaptive wavelet approximations (2.10) as we add more wavelet terms. It is clear that datasets with less mis-labeling have more 'sparsity', i.e., they are better approximated with fewer wavelet terms. We also measured for each mis-labeled training dataset the Besov $\alpha$-smoothness at the last inner layer. The results are presented in Table 1. We see a strong correlation between the amount of mis-labeling and the smoothness.

6 CONCLUSION
In this paper we presented a theoretical approach to the analysis of the performance of the hidden layers in DL architectures. We plan to continue the experimental analysis with deeper architectures and larger datasets (see e.g. [43]) and hope to demonstrate that our approach is applicable to a wide variety of machine learning and deep learning architectures. As some advanced DL architectures have millions of features in their hidden layers, we will need to overcome the problem of estimating representation smoothness in such high dimensions. Furthermore, in some of our experiments we noticed interesting phenomena within sub-components of the layers (e.g. the different operations of convolution, non-linearity and pooling). We hope to reach some understanding and share some insights regarding these aspects too.

APPENDIX: PROOF OF THEOREM 3.7
Obviously, it is sufficient to prove the equivalence for a single tree $\mathcal{T}$. Observe that condition (3.7) also implies that for any $\Omega' \in \mathcal{T}$ with parent $\Omega$, we also have $|\Omega| \le (1-\rho)^{-1} |\Omega'|$. We use this as well as (2.6) to prove the first direction of the equivalence as follows:
$$
\begin{aligned}
N_\tau(f, \mathcal{T})^\tau &= \sum_{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T}} \|\psi_{\Omega'}\|_p^\tau \\
&= \sum_{\substack{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T} \\ \Omega \text{ parent of } \Omega'}} \left( |\Omega'|^{1/p} \left\| \vec E_{\Omega'} - \vec E_\Omega \right\|_{l_2(\mathbb{R}^{L-1})} \right)^\tau \\
&= \sum_{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T}} \left( |\Omega'|^{1/p - 1/\tau} \|\psi_{\Omega'}\|_\tau \right)^\tau \\
&\le c(\tau) \sum_{\substack{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T} \\ \Omega \text{ parent of } \Omega'}} \left\{ \left( |\Omega'|^{-\alpha} \Big\| \left\| f(\cdot) - \vec E_{\Omega'} \right\|_{l_2(\mathbb{R}^{L-1})} \Big\|_{L_\tau(\Omega')} \right)^\tau + \left( |\Omega'|^{-\alpha} \Big\| \left\| f(\cdot) - \vec E_\Omega \right\|_{l_2(\mathbb{R}^{L-1})} \Big\|_{L_\tau(\Omega')} \right)^\tau \right\} \\
&\le c(\tau, \rho, \alpha) \sum_{\substack{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T} \\ \Omega \text{ parent of } \Omega'}} \left\{ \left( |\Omega'|^{-\alpha} \Big\| \left\| f(\cdot) - \vec E_{\Omega'} \right\|_{l_2(\mathbb{R}^{L-1})} \Big\|_{L_\tau(\Omega')} \right)^\tau + \left( |\Omega|^{-\alpha} \Big\| \left\| f(\cdot) - \vec E_\Omega \right\|_{l_2(\mathbb{R}^{L-1})} \Big\|_{L_\tau(\Omega)} \right)^\tau \right\} \\
&\le 2\, c(\tau, \rho, \alpha) \sum_{\Omega \in \mathcal{T}} \left( |\Omega|^{-\alpha} \Big\| \left\| f(\cdot) - \vec E_\Omega \right\|_{l_2(\mathbb{R}^{L-1})} \Big\|_{L_\tau(\Omega)} \right)^\tau \\
&= 2\, c(\tau, \rho, \alpha) \sum_{\Omega \in \mathcal{T}} \left( |\Omega|^{-\alpha} w(f, \Omega)_\tau \right)^\tau = c\, |f|_{\tilde B_\tau^{\alpha,1}}^\tau.
\end{aligned}
$$
We now prove the other direction. We assume $0 < \tau \le 1$ (the case $1 < \tau < \infty$ is similar). For any $\Omega \in \mathcal{T}$ we have
$$w(f, \Omega)_\tau^\tau \le \sum_{\Omega' \in \mathcal{T},\, \Omega' \subset \Omega} \|\psi_{\Omega'}\|_\tau^\tau, \quad (A.1)$$
by the following estimates:
$$
\begin{aligned}
w(f, \Omega)_\tau^\tau &= \int_\Omega \Big\| \sum_{\Omega' \in \mathcal{T}} \psi_{\Omega'}(x) - \vec E_\Omega \Big\|_{l_2(\mathbb{R}^{L-1})}^\tau dx \\
&= \int_\Omega \Big\| \sum_{\Omega' \in \mathcal{T}} \psi_{\Omega'}(x) - \sum_{\Omega' \in \mathcal{T},\, \Omega \subseteq \Omega'} \psi_{\Omega'}(x) \Big\|_{l_2(\mathbb{R}^{L-1})}^\tau dx \\
&= \int_\Omega \Big\| \sum_{\Omega' \in \mathcal{T},\, \Omega' \subset \Omega} \psi_{\Omega'}(x) \Big\|_{l_2(\mathbb{R}^{L-1})}^\tau dx \\
&\le \sum_{\Omega' \in \mathcal{T},\, \Omega' \subset \Omega} \|\psi_{\Omega'}\|_\tau^\tau.
\end{aligned}
$$
Also, observe that by condition (3.7), for any $\Omega' \in \mathcal{T}$,
$$\sum_{\Omega \in \mathcal{T},\, \Omega' \subset \Omega} \left( \frac{|\Omega'|}{|\Omega|} \right)^{\alpha\tau} \le \sum_{k=1}^\infty \rho^{k\alpha\tau} \le c(\rho, \alpha, \tau). \quad (A.2)$$
We apply (A.1) and (A.2) to conclude
$$
\begin{aligned}
|f|_{\tilde B_\tau^{\alpha,1}(\mathcal{T})}^\tau &\le \sum_{\Omega \in \mathcal{T}} |\Omega|^{-\alpha\tau} \sum_{\Omega' \in \mathcal{T},\, \Omega' \subset \Omega} \|\psi_{\Omega'}\|_\tau^\tau \\
&= \sum_{\Omega \in \mathcal{T}} \sum_{\Omega' \in \mathcal{T},\, \Omega' \subset \Omega} \left( \frac{|\Omega'|}{|\Omega|} \right)^{\alpha\tau} \left( |\Omega'|^{-\alpha} \|\psi_{\Omega'}\|_\tau \right)^\tau \\
&= \sum_{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T}} \left( |\Omega'|^{-\alpha} \|\psi_{\Omega'}\|_\tau \right)^\tau \sum_{\Omega \in \mathcal{T},\, \Omega' \subset \Omega} \left( \frac{|\Omega'|}{|\Omega|} \right)^{\alpha\tau} \\
&\le c(\alpha, \tau, \rho) \sum_{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T}} \left( |\Omega'|^{-\alpha} |\Omega'|^{1/\tau - 1/p} \|\psi_{\Omega'}\|_p \right)^\tau \\
&= c(\alpha, \tau, \rho) \sum_{\Omega' \neq \Omega_0,\, \Omega' \in \mathcal{T}} \|\psi_{\Omega'}\|_p^\tau = c\, N_\tau(f, \mathcal{T})^\tau.
\end{aligned}
$$

ACKNOWLEDGMENTS
The authors would like to thank Vadym Boikov, WIX AI, and Kobi Gurkan, Tel-Aviv University, for their help with running the experiments. This research was carried out with the generous support of the Amazon AWS Research Program.

REFERENCES

[1] Alani D., Averbuch A. and Dekel S., Image coding using geometric wavelets, IEEE Transactions on Image Processing.
[2] Alpaydin E., Introduction to Machine Learning, MIT Press, 2004.
[3] Bengio Y., Courville A. and Vincent P., Representation learning: a review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] Biau G. and Scornet E., A random forest guided tour, TEST, 2016.
[5] Breiman L., Random forests, Machine Learning, 45(1):5-32, 2001.
[6] Breiman L., Bagging predictors, Machine Learning, 24(2):123-140, 1996.
[7] Breiman L., Friedman J., Olshen R. and Stone C., Classification and Regression Trees, Wadsworth, 1984.
[11] Microsoft Research technical report, report TR-2011-114, 2011.
[12] Dahmen W., Dekel S. and Petrushev P., Two-level-split decomposition of anisotropic Besov spaces, Constructive Approximation.
[13] Daubechies I., Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, 1992.
[14] Dekel S., Elisha O. and Morgan O., Wavelet decomposition of Gradient Boosting, submitted.
[15] Dekel S. and Leviatan D., Adaptive multivariate approximation using binary space partitions and geometric wavelets, SIAM Journal on Numerical Analysis.
[16] In Proceedings of the 31st International Conference on Machine Learning, 2014.
[17] DeVore R., Nonlinear approximation, Acta Numerica.
[18] DeVore R. and Lorentz G., Constructive Approximation, Springer Science and Business, 1993.
[19] DeepListen, https://github.com/jaron/deep-listening
[20] DeVore R., Jawerth B. and Lucier B., Image compression through wavelet transform coding, IEEE Transactions on Information Theory.
[21] In Proceedings of the IEEE International Conference on Privacy, Security and Data Mining.
[22] Elisha O. and Dekel S., Wavelet decompositions of Random Forests - smoothness analysis, sparse approximation and applications, Journal of Machine Learning Research, 17:1-38, 2016.
[23] Feng N., Wang J. and Saligrama V., Feature-budgeted Random Forest, In Proceedings of the 32nd International Conference on Machine Learning, 1983-1991, 2015.
[24] Kelley P. and Barry R., Sparse spatial autoregressions, Statistics and Probability Letters.
[25] Pattern Recognition Letters.
[26] JMLR: Workshop and Conference Proceedings.
[27] Journal of Machine Learning Research.
[28] Hastie T., Tibshirani R. and Friedman J., The Elements of Statistical Learning, Springer, 2009.
[29] Joly A., Schnitzler F., Geurts P. and Wehenkel L., L1-based compression of random forest models, In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 375-380, 2012.
[30] Karaivanov B. and Petrushev P., Nonlinear piecewise polynomial approximation beyond Besov spaces, Applied and Computational Harmonic Analysis.
[31] In International Conference on Data Science and Engineering, 64-68, 2012.
[32] Loh W., Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
[33] Advances in Neural Information Processing Systems.
[34] Mallat S., A Wavelet Tour of Signal Processing, 3rd edition (the sparse way), Academic Press, 2009.
[35] Martínez-Muñoz G., Hernández-Lobato D. and Suárez A., An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[39] Raileanu L. and Stoffel K., Theoretical comparison between the Gini index and information gain criteria, Annals of Mathematics and Artificial Intelligence.
IEEESignal Processing Letters
IEEE transactions on image processing preprint .[43] Sun C., Shrivastava1 A., Singh S. and Gupta1 A., RevisitingUnreasonable Effectiveness of Data in Deep Learning Era, preprint .[44] AlexaNet implementation using TensorFlow,https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py[45] UCI machine learning repository, http://archive.ics.uci.edu/ml/.[46] Urban Sound Classification dataset,https://serv.cusp.nyu.edu/projects/urbansounddataset[47] Urban Sound Classification CNN implementationhttps://github.com/jaron/deep-listening/blob/master/4-us8k-cnn-salamon.ipynb[48] Wavelet-based Random Forest source code,https://github.com/orenelis/WaveletsForest.git.[49] Zhang C., Bengio S., Hardt M., Recht B. and Vinyals O., Under-standing deep learning requires rethinking generalization, In