Radius-margin bounds for deep neural networks
Mayank Sharma [email protected]
Department of Electrical Engineering, Indian Institute of Technology, Delhi, New Delhi, 110016, India
Jayadeva [email protected]
Department of Electrical Engineering, Indian Institute of Technology, Delhi, New Delhi, 110016, India
Sumit Soman [email protected]
Department of Electrical Engineering, Indian Institute of Technology, Delhi, New Delhi, 110016, India
Abstract
Explaining the unreasonable effectiveness of deep learning has eluded researchers around the globe. Various authors have described multiple metrics to evaluate the capacity of deep architectures. In this paper, we draw on the radius-margin bounds described for a support vector machine (SVM) with hinge loss, apply them to deep feed-forward architectures, and derive Vapnik-Chervonenkis (VC) bounds which differ from the earlier bounds stated in terms of the number of weights of the network. In doing so, we also relate the effectiveness of techniques like Dropout and Dropconnect in bringing down the capacity of the network. Finally, we describe the effect of maximizing the input as well as the output margin to achieve an input noise-robust deep architecture.
Keywords:
Complexity, learning theory, VC dimension, radius-margin bounds, deep neural networks (DNN)
1. Introduction
Deep neural networks (DNN) have been the method of choice owing to their great success in a plethora of machine learning tasks, such as image classification and segmentation (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Collobert et al., 2011), reinforcement learning (Mnih et al., 2015) and various other tasks (Schmidhuber, 2015; LeCun et al., 2015). It is known that the depth and width of a network play a key role in its learning abilities. Although multiple architectures of DNNs exist, such as recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997) and recursive nets (Socher et al., 2011), for the discussions in this paper we focus on feed-forward architectures. In the works of (Hornik, 1991; Cybenko, 1989) it was shown that a single hidden layer network, or a shallow architecture, can approximate any measurable Borel function given enough neurons in the hidden layer, but recently it was shown by (Montufar et al., 2014) that a deep network can divide the space into an exponential number of sets, a feat which cannot be achieved by a shallow architecture with the same number of parameters. Similarly, it was shown by (Telgarsky, 2016) that for a given depth and number of parameters there exists a DNN that can only be approximated by a shallow network whose number of parameters is exponential in the number of layers of the deep network. (Cohen et al., 2016) conclude that functions that can be implemented by DNNs are exponentially more expressive than functions implemented by a shallow network. These theoretical results showcasing the expressiveness of DNNs have been empirically backed up, with deep architectures being the current state-of-the-art in multiple applications across various domains.

Many researchers have shown the effect of depth and width on the performance of deep architectures. It is known that increasing the depth or width increases the number of parameters in the network, and often these numbers can be much larger than the number of samples used to train the network itself. These networks are currently trained using stochastic gradient descent (SGD). With such a huge number of parameters, the obvious question to ask is: why do these machines learn effectively? Researchers have tried to answer this question by proving statistical guarantees on the learning capacities of these networks. Multiple complexity measures have been proposed in the literature, namely the Vapnik-Chervonenkis (VC) dimension (Vapnik and Vapnik, 1998) and its related extensions like the pseudo-dimension and fat shattering dimension (Anthony and Bartlett, 2009), radius-margin bounds (Burges, 1998; Smola and Schölkopf, 1998), Rademacher complexity (Bartlett and Mendelson, 2002) and covering numbers (Zhou, 2002), to name a few. All these measures define a number that characterizes the complexity of the hypothesis class, which in this case is the neural network. The most popular among these is the VC dimension, which is the size of the largest set that can be shattered by the given hypothesis class.

(Bartlett et al., 1999) provided VC dimension bounds for piecewise linear neural networks. (Karpinski and Macintyre, 1995; Sontag, 1998; Baum and Haussler, 1989) gave VC bounds for general feed-forward neural networks with sigmoidal non-linear units. These bounds are defined with respect to the number of parameters and in general are quite large. Shalev-Shwartz and Ben-David (2014) presented bounds which are linear in the number of trainable parameters.
These bounds grow larger with increases in the width and depth of the networks and fail to explain the unreasonable effectiveness of the depth of neural networks. (Bartlett, 1998) showed that the VC dimension of a network can be bounded by the norm of the parameters or weights rather than the number of parameters. The norm of the weights can be much smaller than the number of weights; thus this bound could explain the rationale behind minimizing the norm of the weights. (Neyshabur et al., 2015) presented Rademacher complexity based bounds, expressing deep network bounds in terms of the norm of the weights and the number of neurons per layer. (Sun et al., 2016) also presented Rademacher average bounds for multi-class convolutional neural networks with pooling operations, in terms of the norm of the weights and the size of the pooling regions. (Xie et al., 2015) showed that the mutual angular regularizer (MAR) can greatly improve the performance of a neural network. They showed that increasing the diversity of hidden units in a neural network reduces the estimation error and increases the approximation error. The authors also presented generalization bounds in terms of Rademacher complexity. However, as mentioned in (Kawaguchi et al., 2017), the dependency on the depth of the network is exponential. (Sokolic et al., 2017) presented generalization bounds in terms of the Jacobian matrix of the network and showed better performance of networks when presented with a smaller number of samples. They provide theoretical justification for the contractive penalty used in (An et al., 2015; Rifai et al., 2011) by explaining the effect of Jacobian regularization on the input margin.

Currently, neural networks are regularized using Dropout (Srivastava et al., 2014) or Dropconnect (Wan et al., 2013) in conjunction with weight regularization. Dropout randomly drops neurons to prevent their co-adaptation, while Dropconnect randomly drops connections to achieve a similar objective. Both these methods can be thought of as ensemble averaging of multiple neural networks, done through the simple technique of using Bernoulli gating random variables to remove certain neurons and weights respectively. The properties of Dropout are studied in (Baldi and Sadowski, 2014), while the Rademacher complexity analysis of Dropconnect is given in (Wan et al., 2013).

There have also been performance analyses of various architectures of feed-forward neural networks. One such architecture is the residual network (resnet) (He et al., 2016a), whose analysis has been presented in (He et al., 2016b). It uses a direct or identity connection from the previous layer to the next layer and allows very deep architectures to be trained effectively with minimal gradient vanishing problems (Bengio et al., 1994). Several variants of residual networks have been proposed, namely wide residual networks (Zagoruyko and Komodakis, 2016), the inception residual network (Szegedy et al., 2017) and the generalization of residual networks named highway networks (Srivastava et al., 2015).

Our contribution in this work is as follows. Firstly, we present radius-margin bounds on feed-forward neural networks. Then, we show the bounds for Dropout and Dropconnect and show that these regularizers bring down the expected sample complexity of deep architectures. Next, we present the margin bounds for residual architectures. Furthermore, we compute the radius-margin bound for an input noise-robust algorithm and then show that the Jacobian regularizer along with the hinge loss approximates the input noise-robust hinge loss.
Finally, we hint at the fact that enlarging the input margin of a neural network via minimization of the Jacobian regularizer is required to obtain an input noise-robust loss function. To our knowledge, this is one of the first efforts in showing the effectiveness of various regularizers in bringing down the sample complexity of neural networks using radius-margin bounds for a margin based loss function. In this paper we make use of the binary class support vector machine (SVM) (Cortes and Vapnik, 1995) loss function at the output layer. This can also be generalized to any margin based loss function in both binary and multi-class settings.
2. Preliminaries
Given a binary class problem, we denote
$X \subseteq \Re^d$ as the input set and $Y = \{-1, +1\}$ as our label set. Here $d$ is the dimensionality of the input pattern. The training set is defined as $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$, which is a finite sequence of $m$ pairs in $X \times Y$. Let $\mathcal{D}$ denote the probability distribution over $X \times Y$. The training set $S$ is sampled i.i.d. from the distribution $\mathcal{D}$. Let $\mathcal{F}$ denote the hypothesis class. The goal of the learning problem or a learning algorithm $\mathcal{A}$ is to find a function $f \in \mathcal{F} : X \rightarrow \Re$. Let $\gamma$ denote the margin of the classifier, which is given by:
\[
\gamma = \sup_a \min_i \inf_{\mathbf{x}} \{ \|\mathbf{x}_i - \mathbf{x}\| \le a : f(\mathbf{x}) = 0;\ y_i f(\mathbf{x}_i) > 0 \}
\]
Now, consider a feed-forward neural network with one input layer, $P$ hidden layers and one output neuron. Let the number of neurons in layer $k$ be given by $h_k : k \in \{0, \ldots, P+1\}$, where $h_0 = d$ denotes the dimension of the input sample and $h_{P+1} = 1$ denotes the number of neurons in the output layer. The number of units in the output layer is one for binary classification. Let $W_{\{k,k+1\}} \in \Re^{\{h_k, h_{k+1}\}}$ denote the weights going from layer $k$ to layer $k+1$, such that $\mathbf{w}^t_{\{k,k+1\}} \in \Re^{h_k}$ denotes the weights going from layer $k$ to neuron $t \in \{1, \ldots, h_{k+1}\}$ in layer $k+1$. Let $\sigma(\cdot)$ denote the activation function, which is a Rectified Linear Unit (ReLU), tanh or any $L_\sigma$-Lipschitz continuous activation function passing through the origin, for each of the neurons in the hidden layers, and a linear activation function in the output layer. We keep the norm of the inputs bounded by $R$, i.e., $\|\mathbf{x}\| \le R, \ \forall \mathbf{x} \in X$, and the norm of the weights going into layer $k+1$ bounded by $A_{k+1}$, such that $\|\mathbf{w}^t_{\{k,k+1\}}\| \le A_{k+1}$. The function computed by the network at layer $k$ is given by $\phi_k(\mathbf{x}) = \sigma(\phi_{k-1}(\mathbf{x}) \cdot W_{\{k-1,k\}}) \in \Re^{h_k}, \ \forall k \in \{1, \ldots, P\}$, where $\phi_0(\mathbf{x}) = \mathbf{x}$, $\phi_{P+1}(\mathbf{x}) = \phi_P(\mathbf{x}) \cdot \mathbf{w}_{\{P,P+1\}}$, $\sigma(\cdot)$ is applied element-wise and the operator $\cdot$ denotes the dot product.

Thus, the hypothesis class of feed-forward neural networks with $P$ hidden layers and norm of weights bounded by $A_k$ is given by:
\[
\mathcal{F} = \sigma(\sigma(\ldots \sigma(\mathbf{x} \cdot W_{\{0,1\}}) \cdot W_{\{1,2\}} \ldots \cdot W_{\{P-2,P-1\}}) \cdot W_{\{P-1,P\}}) \cdot \mathbf{w}_{\{P,P+1\}}
\]

Lipschitz property: Let $C \subseteq \Re^d$. A function $\sigma : \Re^d \rightarrow \Re^k$ is $L_\sigma$-Lipschitz over $C$ if for every $\mathbf{x}_1, \mathbf{x}_2 \in C$, we have:
\[
\|\sigma(\mathbf{x}_1) - \sigma(\mathbf{x}_2)\| \le L_\sigma \|\mathbf{x}_1 - \mathbf{x}_2\| \tag{1}
\]
ReLU $\sigma_{ReLU}(z) = \max(0, z)$ and Leaky ReLU $\sigma_{LReLU}(z) = \max(0.01 z, z)$ are Lipschitz continuous functions with Lipschitz constant $L_\sigma = 1$. Likewise, the sigmoid $\sigma_{sig}(z) = 1/(1 + \exp(-z))$ and the hyperbolic tangent $\sigma_{tanh}(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$ are Lipschitz continuous functions with Lipschitz constants $L_\sigma = 1/4$ and $L_\sigma = 1$ respectively. We focus mostly on ReLU and on activation functions which pass through the origin, like Leaky ReLU and tanh.

For activation functions passing through the origin, eq. 1 holds for all $\mathbf{z}, \mathbf{z}_2 \in C$. Hence, eq. 1 also holds for all $\mathbf{z} \in C$ and $\mathbf{z}_2 = 0$:
\[
\|\sigma(\mathbf{z}) - \sigma(\mathbf{z}_2)\| \le L_\sigma \|\mathbf{z} - \mathbf{z}_2\|
\]
\[
\|\sigma(\mathbf{z}) - \sigma(0)\| \le L_\sigma \|\mathbf{z} - 0\|
\]
and since $\sigma(0) = 0$ for functions passing through the origin,
\[
\|\sigma(\mathbf{z})\| \le L_\sigma \|\mathbf{z}\| \tag{2}
\]

0-1 loss: The 0-1 loss of a sample $(\mathbf{x}, y) \in X \times Y$ and predictor $g(\mathbf{x}) = sign(f(\mathbf{x})) : f \in \mathcal{F}$ is given by:
\[
\ell_{0-1}(g, (\mathbf{x}, y)) = \begin{cases} 1 & g(\mathbf{x}) \ne y \\ 0 & g(\mathbf{x}) = y \end{cases}
\]
True risk of 0-1 loss: The true risk of the prediction rule $g$ is defined as:
\[
L^{0-1}_{\mathcal{D}}(g) = \mathbb{P}_{(\mathbf{x},y) \sim \mathcal{D}}[g(\mathbf{x}) \ne y]
\]
Empirical risk of 0-1 loss: The empirical risk of the prediction rule $g$ is defined as:
\[
L^{0-1}_S(g) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}(g(\mathbf{x}_i) \ne y_i)
\]
Hinge loss: The hinge loss is a convex upper bound on the 0-1 loss and is given by:
\[
\ell_{hinge}(f, (\mathbf{x}, y)) = \max(0, 1 - y f(\mathbf{x}))
\]
Clearly, $\ell_{0-1}(g, (\mathbf{x}, y)) \le \ell_{hinge}(f, (\mathbf{x}, y))$.
Empirical risk of hinge loss: The empirical risk of $\ell_{hinge}(f, (\mathbf{x}, y))$ is defined as:
\[
L^{hinge}_S(f) = \frac{1}{m} \sum_{i=1}^m \max(0, 1 - y_i f(\mathbf{x}_i))
\]
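To make the preceding definitions concrete, here is a minimal numerical sketch (Python/NumPy, with hypothetical layer sizes and randomly drawn weights) of the feed-forward map $\phi_k$ with ReLU activations and of the empirical hinge risk $L^{hinge}_S(f)$.

```python
import numpy as np

def relu(z):
    # ReLU activation: 1-Lipschitz and passes through the origin (sigma(0) = 0)
    return np.maximum(0.0, z)

def forward(x, hidden_weights, w_out):
    # phi_0(x) = x; phi_k(x) = sigma(phi_{k-1}(x) @ W_{k-1,k}) for hidden layers;
    # phi_{P+1}(x) = phi_P(x) @ w_out is the linear output unit.
    phi = x
    for W in hidden_weights:
        phi = relu(phi @ W)
    return phi @ w_out

def empirical_hinge_risk(X, y, hidden_weights, w_out):
    # L^hinge_S(f) = (1/m) * sum_i max(0, 1 - y_i f(x_i))
    scores = np.array([forward(x, hidden_weights, w_out) for x in X])
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

# Toy example with hypothetical sizes: d = 4 inputs, P = 2 hidden layers of width 8.
rng = np.random.default_rng(0)
hidden_weights = [rng.normal(size=(4, 8)) * 0.1, rng.normal(size=(8, 8)) * 0.1]
w_out = rng.normal(size=8) * 0.1
X = rng.normal(size=(16, 4))
y = np.sign(X[:, 0])  # arbitrary labels, for illustration only
print(empirical_hinge_risk(X, y, hidden_weights, w_out))
```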
3. Radius-margin bounds
In this section, we provide radius-margin bounds for feed-forward neural networks, including the effect of regularizers like Dropout and Dropconnect. The reader should refer to the Appendix for the proofs of the theorems mentioned in the main text.
Theorem 1
The upper bound on VC dimension
$VCdim(\mathcal{F})$ for a training set $S \subseteq \{(\mathbf{x}, y)^m : \|\mathbf{x}\| \le R;\ (X \times Y)^m\}$, which is fully shattered with the output margin $\gamma = 1/\|\mathbf{w}_{\{P,P+1\}}\|$ by a function $f \in \mathcal{F}$ from the hypothesis class $\mathcal{F}$ of neural networks with $P$ hidden layers, $h_k, \forall k \in \{1, \ldots, P+1\}$ neurons in each layer, an $L_\sigma$-Lipschitz activation function passing through the origin, and the norm of the weights constrained by $A_k$ for all $k \in \{1, \ldots, P+1\}$, is given by:
\[
VCdim(\mathcal{F}) \le R^2 A_{P+1}^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \tag{3}
\]
The bound given in eq. 3 captures the dependence of the VC dimension on the radius of the data and the product of the max-norm terms with the number of neurons per layer. There is always a dependence on the depth of the network through the number of product terms in eq. 3. The bound has several implications:

1. The bound is independent of the dimensionality of the input data; it depends only on the radius of the data.
2. The bound is independent of the number of weights, and instead depends on the max-norm of the weights.
3. The bound depicts the key role of depth for deep networks. Increasing the depth of the network does not always increase the VC dimension. If the product $h_k A_k^2 \le 1\ \forall k \in \{1, \ldots, P\}$, then the capacity of the network decreases as the depth increases. On the other hand, if the product $h_k A_k^2 \ge 1\ \forall k \in \{1, \ldots, P\}$, then the network capacity increases with depth. Thus, by changing the number of neurons and the max-norm constraints on the weights, one can alter the capacity of the network to the desired value, as illustrated by the sketch after this list.
4. Keeping the number of neurons in the hidden layers fixed to $h$ and using the ReLU activation function with $L_\sigma = 1$, we get a VC bound similar to Theorem 1 of (Neyshabur et al., 2015):
\[
VCdim(\mathcal{F}) \le R^2 A_{P+1}^2 h^P \prod_{k=1}^{P} A_k^2 \tag{4}
\]
The bound presented in eq. 4 shows that, keeping the number of neurons fixed in each layer, the VC dimension of the hypothesis class of neural networks can be controlled by changing the max-norm constraint on the weights of the network. However, the exponential dependency on the depth cannot be avoided.
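As an illustration of implication 3 above, the following short sketch (with hypothetical widths $h_k$ and max-norm constraints $A_k$) evaluates the bound of eq. 3 and shows how the per-layer factor $h_k A_k^2$ decides whether the bound grows or shrinks with depth.

```python
import numpy as np

def vc_bound(R, A_out, L_sigma, widths, norms):
    # Eq. 3: VCdim(F) <= R^2 * A_{P+1}^2 * L_sigma^(2P) * prod_k h_k * A_k^2
    P = len(widths)
    prod = np.prod([h * (a ** 2) for h, a in zip(widths, norms)])
    return (R ** 2) * (A_out ** 2) * (L_sigma ** (2 * P)) * prod

R, A_out, L_sigma = 1.0, 1.0, 1.0  # ReLU: L_sigma = 1
for P in (2, 4, 8):
    # Per-layer factor h * A^2 = 64 * 0.04 = 2.56 > 1: the bound grows with depth.
    grow = vc_bound(R, A_out, L_sigma, [64] * P, [0.2] * P)
    # Per-layer factor h * A^2 = 64 * 0.01 = 0.64 < 1: the bound shrinks with depth.
    shrink = vc_bound(R, A_out, L_sigma, [64] * P, [0.1] * P)
    print(P, grow, shrink)
```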
Effect of Dropout: We now show the effect of Dropout on the same network, where we multiply each neuron $n$ in layer $k$ with a Bernoulli selector random variable $\delta^i_{\{k,n\}}$ for each sample $i$. Every selector random variable $\delta^i_{\{k,n\}}$ takes the value 1 with probability $p_k$ and 0 with dropout probability $q_k = 1 - p_k$ for each layer $k$, independently of the others. The dropout mask for each layer $k$ and each sample $i$ is given by $\mathbf{u}^i_k \in \{0,1\}^{h_k}$, where each entry of the mask is $u^i_{\{k,n\}} = \delta^i_{\{k,n\}}$. The new hypothesis class $\mathcal{F}_{do}$ of neural networks is given as:
\[
\mathcal{F}_{do} = (\mathbf{u}_P \odot \sigma((\mathbf{u}_{P-1} \odot \sigma(\ldots(\mathbf{u}_1 \odot \sigma((\mathbf{u}_0 \odot \mathbf{x}) \cdot W_{\{0,1\}})) \cdot W_{\{1,2\}} \ldots \cdot W_{\{P-2,P-1\}})) \cdot W_{\{P-1,P\}})) \cdot \mathbf{w}_{\{P,P+1\}}
\]
Here, $\odot$ represents element-wise multiplication.

Theorem 2

For the same network as in Theorem 1, with dropout added to each layer $k$ for all $k \in \{0, \ldots, P\}$ with dropout probability $q_k$, the VC dimension bound $VCdim(\mathcal{F}_{do})$ is bounded by:
\[
VCdim(\mathcal{F}_{do}) \le p_0 R^2 A_{P+1}^2 L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2 \tag{5}
\]

Effect of Dropconnect: We now show the effect of Dropconnect on the same network as in Theorem 1, where we multiply the individual elements of the weight matrix $W_{\{k,k+1\}}$ with the elements of a matrix of i.i.d. Bernoulli selector random variables $U^i_{\{k,k+1\}} \in \{0,1\}^{\{h_k, h_{k+1}\}}$ for all $k \in \{0, \ldots, P\}$ and all samples $i \in \{1, \ldots, m\}$. The elements of the matrix $U^i_{\{k,k+1\}}$ are the vectors $\mathbf{u}^{\{i,t\}}_{\{k,k+1\}} \in \{0,1\}^{h_k}, \forall t \in \{1, \ldots, h_{k+1}\}$, and each vector is composed of Bernoulli random variables such that $u^{\{i,n,t\}}_{\{k,k+1\}} = \delta^{\{i,n,t\}}_{\{k,k+1\}}, \forall n \in \{1, \ldots, h_k\}$, which is 1 with probability $p_{\{k,k+1\}}$ and 0 with probability $q_{\{k,k+1\}} = 1 - p_{\{k,k+1\}}$. The hypothesis class of feed-forward neural networks with the Dropconnect regularizer is given by:
\[
\mathcal{F}_{dc} = \sigma(\sigma(\ldots \sigma(\mathbf{x} \cdot (U_{\{0,1\}} \odot W_{\{0,1\}})) \cdot (U_{\{1,2\}} \odot W_{\{1,2\}}) \ldots \cdot (U_{\{P-2,P-1\}} \odot W_{\{P-2,P-1\}})) \cdot (U_{\{P-1,P\}} \odot W_{\{P-1,P\}})) \cdot (\mathbf{u}_{\{P,P+1\}} \odot \mathbf{w}_{\{P,P+1\}})
\]
Here, $\odot$ represents element-wise multiplication.
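For concreteness, the following is a minimal sketch (Python/NumPy, hypothetical sizes and retention probabilities) of how the Bernoulli masks enter the forward pass in the two hypothesis classes $\mathcal{F}_{do}$ and $\mathcal{F}_{dc}$ defined above.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0.0, z)

def forward_dropout(x, weights, w_out, p):
    # F_do: each layer's activations are multiplied element-wise by a
    # Bernoulli(p_k) mask u_k (including the input, with mask u_0).
    phi = x * rng.binomial(1, p[0], size=x.shape)
    for k, W in enumerate(weights, start=1):
        phi = relu(phi @ W) * rng.binomial(1, p[k], size=W.shape[1])
    return phi @ w_out

def forward_dropconnect(x, weights, w_out, p):
    # F_dc: each weight matrix is multiplied element-wise by a Bernoulli mask
    # U_{k,k+1} before the dot product; the output weights receive a mask as well.
    phi = x
    for k, W in enumerate(weights):
        phi = relu(phi @ (rng.binomial(1, p[k], size=W.shape) * W))
    return phi @ (rng.binomial(1, p[-1], size=w_out.shape) * w_out)

weights = [rng.normal(size=(4, 8)) * 0.1, rng.normal(size=(8, 8)) * 0.1]
w_out = rng.normal(size=8) * 0.1
x = rng.normal(size=4)
print(forward_dropout(x, weights, w_out, p=[0.8, 0.5, 0.5]))
print(forward_dropconnect(x, weights, w_out, p=[0.8, 0.5, 0.5]))
```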
Theorem 3

For the same network as in Theorem 1, with Dropconnect added to each layer $k$ for all $k \in \{0, \ldots, P\}$ with Dropconnect probability $q_{\{k,k+1\}} = 1 - p_{\{k,k+1\}}$, the VC dimension bound $VCdim(\mathcal{F}_{dc})$ is bounded by:
\[
VCdim(\mathcal{F}_{dc}) \le p_{\{0,1\}} R^2 A_{P+1}^2 L_\sigma^{2P} \prod_{k=1}^{P} p_{\{k,k+1\}} h_k A_k^2 \tag{6}
\]

Implications of Dropout and Dropconnect: The two bounds presented in eq. 5 and eq. 6 are equivalent. Both techniques bring down the capacity of the network, thereby preventing problems like overfitting. The reason why these two methods outperform other kinds of regularizers is that they act like an ensemble of networks and allow representations to be learned from a smaller number of neurons or weights at each iteration. Details of such an interpretation are given in Srivastava et al. (2014) and Wan et al. (2013).

Bounds for a Resnet architecture: Consider a generic resnet with $T$ residual blocks having $\acute{T}$ residual units per block. Each residual unit consists of an activation function $\sigma(\cdot)$ followed by a convolution layer ($cv$), followed by dropout, $\sigma(\cdot)$ and another $cv$ layer. The final output of the $cv$ layer is added to the output of the previous layer. We use a $cv$ layer ($cv_0(\cdot)$) after the input to increase the number of filters. After every resnet block, we have a $cv$ layer, max-pool or average-pool layer for dimensionality reduction. For our discussion, we keep a $cv$ unit $cv_i, \forall i \in \{1, \ldots, T\}$ for dimensionality reduction rather than max-pool or average-pool. After the $T$ residual blocks we have $P$ fully connected layers with dropout. Lastly, we have the classifier layer, and the hinge loss is applied to the classifier layer.

Consider the input data $\mathbf{x}_i = \phi_0(\mathbf{x}_i) \in \Re^{\{h_1, h_2, h_3\}}$. Let the number of filters in each $cv$ layer in block $r$ be $N_r, \forall r \in \{1, \ldots, T\}$, and the size of the filters for the $cv$ layers in those blocks, as well as the $cv_i$ dimensionality reduction layers, be $\{v_r \times v_r\}$ with strides $\{s_r \times s_r\}$. The function $cv(\cdot)$ takes the filter size, number of filters, strides and padding as parameters alongside the input; these are not shown for brevity. The output of residual unit $\acute{r}$ in block $r$ is given by:
\[
\phi_{\{r,\acute{r}\}}(\mathbf{x}_i) = \phi_{\{r,\acute{r}-1\}}(\mathbf{x}_i) + cv_{\{r,\acute{r},2\}}(\sigma(U_r \odot cv_{\{r,\acute{r},1\}}(\sigma(\phi_{\{r,\acute{r}-1\}}(\mathbf{x}_i))))), \quad \forall (r, \acute{r}) \in \{1, \ldots, T\} \times \{2, \ldots, \acute{T}\}
\]
\[
\phi_{\{r,1\}}(\mathbf{x}_i) = cv_{r-1}(\phi_{\{r-1,\acute{T}\}}(\mathbf{x}_i)) + cv_{\{r,1,2\}}(\sigma(U_r \odot cv_{\{r,1,1\}}(\sigma(cv_{r-1}(\phi_{\{r-1,\acute{T}\}}(\mathbf{x}_i)))))), \quad \forall r \in \{2, \ldots, T\}
\]
\[
\phi_{\{1,1\}}(\mathbf{x}_i) = cv_0(\phi_0(\mathbf{x}_i)) + cv_{\{1,1,2\}}(\sigma(U_1 \odot cv_{\{1,1,1\}}(\sigma(cv_0(\phi_0(\mathbf{x}_i))))))
\]
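The residual-unit recursion above can be sketched as follows (Python/NumPy, with dense matrix products standing in for the $cv(\cdot)$ operations and hypothetical sizes): two convolutions with a dropout mask in between, added back to the unit's input.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(0.0, z)

def residual_unit(phi_prev, W1, W2, p):
    # phi_{r,r'} = phi_{r,r'-1} + cv2(sigma(U_r (.) cv1(sigma(phi_{r,r'-1}))))
    # Dense matrix products stand in for the cv(.) layers in this sketch.
    mask = rng.binomial(1, p, size=W1.shape[1])
    inner = (relu(phi_prev) @ W1) * mask      # cv1 after activation, then dropout mask U_r
    return phi_prev + relu(inner) @ W2        # cv2 after activation, plus identity shortcut

width = 16
phi = rng.normal(size=width)
for _ in range(3):  # three residual units within one block, for illustration
    W1 = rng.normal(size=(width, width)) * 0.1
    W2 = rng.normal(size=(width, width)) * 0.1
    phi = residual_unit(phi, W1, W2, p=0.7)
print(phi[:4])
```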
Theorem 4
The $VCdim(\mathcal{F})$ bound of a residual network as described above is given by:
\[
VCdim(\mathcal{F}) \le R^2 A_{P+1}^2 \left[ \left( L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2 \right) \left( \left(A_0^2 N_0 v_0^2\right) \prod_{r=1}^{T} \left(A_r^2 N_r v_r^2\right)^{2\acute{T}} L_\sigma^{2\acute{T}} p_r^{\acute{T}} \right) \right] \tag{7}
\]

Implications of the VC bound on the Resnet architecture: The bound given in eq. 7 depends on the max-norm of the weights, the size of the filters in each block, the dropout probability, the number of blocks, the number of residual units per block and the Lipschitz constant of the activation function. It shows that the bound increases exponentially in the number of residual units per block, which is expected, as the number of residual units increases the capacity of the network.
Robustness measures the variation of the loss function w.r.t. the input $(\mathbf{x}, y) \sim \mathcal{D}$. (Xu and Mannor, 2012) presented generalization bounds for an algorithm $\mathcal{A}$ being $\epsilon(\cdot)$-robust in terms of Rademacher averages. The idea that a large margin implies robustness was applied to deep networks in (Sokolic et al., 2017). Here, we present the idea of robustness of an algorithm in terms of the VC dimension by incorporating the notion of noise in a sample such that its label remains unchanged. Theorem 5 shows that for a robust algorithm the VC dimension is larger than that of a non-robust algorithm.

Theorem 5

Consider the set $\mathcal{T} = \left\{ (\Delta_1, \ldots, \Delta_m) \;\middle|\; \max_i \|\Delta_i\| \le c, \ \forall i \in \{1, \ldots, m\} \right\}$. Let this set denote the noise that can be added to the samples $\mathbf{x}_i$ to obtain $\hat{\mathbf{x}}_i = \mathbf{x}_i + \Delta_i$ such that for some $f \in \mathcal{F}_{ro}$, $f(\hat{\mathbf{x}}_i) = f(\mathbf{x}_i), \ \forall i \in \{1, \ldots, m\}$. Then the VC bound for such a hypothesis class $\mathcal{F}_{ro}$ is given by:
\[
VCdim(\mathcal{F}_{ro}) \le A_{P+1}^2 (R + c)^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2
\]

Gradient regularization: Consider the input noise-robust loss function,
\[
\max_{(\Delta_1, \ldots, \Delta_m) \in \mathcal{T}} \sum_{i=1}^m \max\left(0, 1 - y_i \left(\phi_P(\hat{\mathbf{x}}_i) \cdot \mathbf{w}_{\{P,P+1\}}\right)\right)
\]
We now use the first order approximation for $\phi_P(\hat{\mathbf{x}}_i) = \phi_P(\mathbf{x}_i + \Delta_i)$ to get,
\[
\phi_P(\mathbf{x}_i + \Delta_i) \approx \phi_P(\mathbf{x}_i) + \frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i} \cdot \Delta_i \tag{8}
\]
Using eq. 8 the objective function can be written as:
\[
\max_{(\Delta_1, \ldots, \Delta_m) \in \mathcal{T}} \sum_{i=1}^m \max\left(0, 1 - y_i \left(\phi_P(\mathbf{x}_i) + \frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i} \cdot \Delta_i\right) \cdot \mathbf{w}_{\{P,P+1\}}\right)
\]
\[
\le \sum_{i=1}^m \max\left(0, 1 - y_i (\phi_P(\mathbf{x}_i) \cdot \mathbf{w}_{\{P,P+1\}})\right) + \max_{(\Delta_1, \ldots, \Delta_m) \in \mathcal{T}} \sum_{i=1}^m \left(\frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i} \cdot \Delta_i\right) \cdot \mathbf{w}_{\{P,P+1\}}
\]
\[
\le \sum_{i=1}^m \max\left(0, 1 - y_i (\phi_P(\mathbf{x}_i) \cdot \mathbf{w}_{\{P,P+1\}})\right) + \max_{(\Delta_1, \ldots, \Delta_m) \in \mathcal{T}} \sum_{i=1}^m \left\|\frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i} \cdot \Delta_i\right\| \|\mathbf{w}_{\{P,P+1\}}\|
\]
\[
\le \sum_{i=1}^m \max\left(0, 1 - y_i (\phi_P(\mathbf{x}_i) \cdot \mathbf{w}_{\{P,P+1\}})\right) + \max_{(\Delta_1, \ldots, \Delta_m) \in \mathcal{T}} \sum_{i=1}^m \left\|\frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i}\right\| \|\Delta_i\| A_{P+1}
\]
\[
\le \sum_{i=1}^m \max\left(0, 1 - y_i (\phi_P(\mathbf{x}_i) \cdot \mathbf{w}_{\{P,P+1\}})\right) + \sum_{i=1}^m \left\|\frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i}\right\|_F c\, A_{P+1} \tag{9}
\]
The term $\left\|\frac{\partial \phi_P(\mathbf{x}_i)}{\partial \mathbf{x}_i}\right\|$ is the norm of the Jacobian matrix $J(\mathbf{x})$ of the deep neural network (DNN). We now show that minimizing this term is equivalent to maximizing the input margin of the DNN.

Input and output margin: The input margin of sample $\mathbf{x}_i$ is defined as:
\[
\gamma^i_{ip} = \sup_a \{ \|\mathbf{x}_i - \mathbf{x}\| \le a;\ sign(\phi_{P+1}(\mathbf{x})) = sign(\phi_{P+1}(\mathbf{x}_i)) \} \tag{10}
\]
whereas the output margin of the sample $\mathbf{x}_i$ is given as:
\[
\gamma^i_{op} = \sup_a \{ \|\phi_P(\mathbf{x}_i) - \phi_P(\mathbf{x})\| \le a;\ sign(\phi_{P+1}(\mathbf{x})) = sign(\phi_{P+1}(\mathbf{x}_i)) \} \tag{11}
\]
Using Theorem 3, Corollary 2 of Sokolic et al.
(2017) and the Lebesgue differentiation theorem, we get,
\[
\|\phi_P(\mathbf{x}_i) - \phi_P(\mathbf{x})\| \le \sup_{\mathbf{x}_i, \mathbf{x}, t \in [0,1]} \left\| J\left(\mathbf{x} + t(\mathbf{x}_i - \mathbf{x})\right) \right\| \left\| \mathbf{x}_i - \mathbf{x} \right\| \tag{12}
\]
Assume that the point $\mathbf{x}$ lies on the decision boundary; then the term $\|\mathbf{x}_i - \mathbf{x}\|$ is equal to $\gamma^i_{ip}$. Using this fact, one can write:
\[
\sup_{\mathbf{x}_i, \mathbf{x}, t \in [0,1]} \left\| J\left(\mathbf{x} + t(\mathbf{x}_i - \mathbf{x})\right) \right\| = \sup_{\mathbf{x} : \|\mathbf{x}_i - \mathbf{x}\| \le \gamma^i_{ip}} \|J(\mathbf{x})\| \tag{13}
\]
Using eqs. 13, 10 and 11 in eq. 12 we get,
\[
\gamma^i_{op} \le \sup_{\mathbf{x} : \|\mathbf{x}_i - \mathbf{x}\| \le \gamma^i_{ip}} \|J(\mathbf{x})\| \; \gamma^i_{ip} \implies \gamma^i_{ip} \ge \frac{\gamma^i_{op}}{\sup_{\mathbf{x} : \|\mathbf{x}_i - \mathbf{x}\| \le \gamma^i_{ip}} \|J(\mathbf{x})\|} \tag{14}
\]
Let $conv(X) = \{\mathbf{x} : \mathbf{x} + t(\mathbf{x}_i - \mathbf{x}), t \in [0,1]\}$ denote the convex set; then from eq. 12 and eq. 14 we can write,
\[
\gamma^i_{ip} \ge \frac{\gamma^i_{op}}{\sup_{\mathbf{x} \in conv(X)} \|J(\mathbf{x})\|} \tag{15}
\]
From eq. 15 we see that minimizing the norm of the Jacobian matrix $J(\mathbf{x})$ amounts to increasing the lower bound on the input margin, whereas eq. 9 shows that minimizing the norm of $J(\mathbf{x})$ along with the hinge loss approximates the input noise-robust hinge loss function. Together, these hint at the fact that maximizing the input margin is required to obtain an input noise-robust deep architecture.
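As a concrete illustration of eq. 9, the following sketch (Python/NumPy, a hypothetical two-hidden-layer ReLU network, Jacobian estimated by finite differences) augments the empirical hinge loss with the Frobenius norm of the Jacobian of $\phi_P$ with respect to the input.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def phi_P(x, W1, W2):
    # Last hidden representation of a two-hidden-layer ReLU network.
    return relu(relu(x @ W1) @ W2)

def jacobian_fro_norm(x, W1, W2, eps=1e-5):
    # Finite-difference estimate of ||d phi_P / d x||_F.
    d = x.shape[0]
    base = phi_P(x, W1, W2)
    J = np.zeros((base.shape[0], d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (phi_P(x + e, W1, W2) - base) / eps
    return np.linalg.norm(J, 'fro')

def regularized_hinge_loss(X, y, W1, W2, w_out, lam):
    # Eq. 9 (up to constants): hinge loss plus lam * average of ||J(x_i)||_F.
    hinge = np.mean([max(0.0, 1.0 - yi * (phi_P(xi, W1, W2) @ w_out))
                     for xi, yi in zip(X, y)])
    penalty = lam * np.mean([jacobian_fro_norm(xi, W1, W2) for xi in X])
    return hinge + penalty

rng = np.random.default_rng(3)
W1, W2 = rng.normal(size=(4, 8)) * 0.3, rng.normal(size=(8, 8)) * 0.3
w_out = rng.normal(size=8) * 0.3
X = rng.normal(size=(10, 4)); y = np.sign(X[:, 0])
print(regularized_hinge_loss(X, y, W1, W2, w_out, lam=0.1))
```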
4. Conclusion
This paper studies radius-margin bounds for deep architectures, both fully connected and residual convolutional networks, in the presence of a hinge loss at the output. We show that the capacity of the deep architecture can be bounded by the number of neurons, the filter size of each layer, the Dropout or Dropconnect probability, and the max-norm of the weights. We also hint at the equivalence between minimizing the norm of the Jacobian matrix of the network and robustness to input perturbations. We show that minimizing the norm of the Jacobian leads to a network with a large input margin, which in turn causes the network to be robust to perturbations in the input space. In the future, we would like to study the effect of weight quantization on the VC dimension bound of deep architectures.
Acknowledgments
We would like to acknowledge support for this project from Safran Group.

Appendix
Proof of Theorem 1

Proof: Since the set $\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$ is fully shattered by the hypothesis class, for all $\mathbf{y} = (y_1, \ldots, y_m) \in \{-1,+1\}^m$ there exists $\mathbf{w}_{\{P,P+1\}}$ such that, $\forall i \in [1, m]$,
\[
1 \le y_i (\phi_P(\mathbf{x}_i) \cdot \mathbf{w}_{\{P,P+1\}}) \tag{16}
\]
Summing up these inequalities yields,
\[
m \le \left(\sum_{i=1}^m y_i \phi_P(\mathbf{x}_i)\right) \cdot \mathbf{w}_{\{P,P+1\}} \le \|\mathbf{w}_{\{P,P+1\}}\| \left\|\sum_{i=1}^m y_i \phi_P(\mathbf{x}_i)\right\| \le A_{P+1} \left\|\sum_{i=1}^m y_i \phi_P(\mathbf{x}_i)\right\|
\]
Since the inequality holds for all $\mathbf{y} \in \{-1,+1\}^m$, it also holds in expectation over $(y_1, \ldots, y_m)$ drawn i.i.d. according to a uniform distribution over $\{-1,+1\}$. Since the distribution is uniform, for $i \ne j$ we have $\mathbb{E}[y_i y_j] = \mathbb{E}[y_i]\mathbb{E}[y_j] = 0$, and $\mathbb{E}[y_i y_j] = 1$ otherwise. This gives,
\[
m \le A_{P+1} \, \mathbb{E}_{\mathbf{y}}\left[\left\|\sum_{i=1}^m y_i \phi_P(\mathbf{x}_i)\right\|\right]
\]
Applying Jensen's inequality,
\[
m^2 \le A_{P+1}^2 \, \mathbb{E}_{\mathbf{y}}\left[\left\|\sum_{i=1}^m y_i \phi_P(\mathbf{x}_i)\right\|^2\right] = A_{P+1}^2 \sum_{i=1}^m \sum_{j=1}^m \mathbb{E}_{\mathbf{y}}[y_i y_j] \left(\phi_P(\mathbf{x}_i) \cdot \phi_P(\mathbf{x}_j)\right) = A_{P+1}^2 \sum_{i=1}^m \left(\phi_P(\mathbf{x}_i) \cdot \phi_P(\mathbf{x}_i)\right) \le A_{P+1}^2 \, m \max_i \|\phi_P(\mathbf{x}_i)\|^2
\]
\[
\implies \sqrt{m} \le A_{P+1} \max_i \|\phi_P(\mathbf{x}_i)\| \tag{17}
\]
Now, we bound $r^2 = \max_i \|\phi_P(\mathbf{x}_i)\|^2$:
\[
r^2 = \max_i \left\| \left[ \sigma(\phi_{P-1}(\mathbf{x}_i) \cdot \mathbf{w}^1_{\{P-1,P\}}), \ldots, \sigma(\phi_{P-1}(\mathbf{x}_i) \cdot \mathbf{w}^{h_P}_{\{P-1,P\}}) \right] \right\|^2
\]
Let $t$ be the index of the maximum absolute value in the vector $\phi_P(\mathbf{x}_i)$:
\[
r^2 \le \max_i h_P \left\|\sigma(\phi_{P-1}(\mathbf{x}_i) \cdot \mathbf{w}^t_{\{P-1,P\}})\right\|^2
\]
Using eq. 2 we get,
\[
r^2 \le \max_i h_P L_\sigma^2 \|\phi_{P-1}(\mathbf{x}_i)\|^2 \|\mathbf{w}^t_{\{P-1,P\}}\|^2 = h_P A_P^2 L_\sigma^2 \max_i \|\phi_{P-1}(\mathbf{x}_i)\|^2 \le \left(\max_i \|\phi_0(\mathbf{x}_i)\|^2\right) L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 = \left(\max_i \|\mathbf{x}_i\|^2\right) L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \tag{18}
\]
\[
\implies r^2 \le R^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \tag{19}
\]
Using eq. 19 in eq. 17 (after squaring eq. 17) we get,
\[
m \le A_{P+1}^2 R^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \implies VCdim(\mathcal{F}) \le A_{P+1}^2 R^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2
\]

Proof of Theorem 2

Proof: Since $\mathbf{u}^i_k$ is a vector of random variables for each sample $i$ and each layer $k$, we have to take expectations over each random variable present in order to determine the expected VC dimension of the network. Here, $\phi_k(\mathbf{x}_i) = \sigma((\mathbf{u}^i_{k-1} \odot \phi_{k-1}(\mathbf{x}_i)) \cdot W_{\{k-1,k\}})$. Following eq. 16 we get, $\forall i \in [1, m]$,
\[
1 \le y_i ((\mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i)) \cdot \mathbf{w}_{\{P,P+1\}}) \tag{20}
\]
Since the inequality holds for all $y_i \in \{-1,+1\}$ and for all $\mathbf{u}^i_P$, it also holds in expectation over $\{y_1, \ldots, y_m\}$ and $U_k = \{\mathbf{u}^1_k, \ldots, \mathbf{u}^m_k\}, \forall k \in \{0, \ldots, P\}$. The distribution over $\mathbf{u}^i_k$ is Bernoulli, thus $\mathbb{E}[u^i_{\{k,t\}}] = p_k$. This gives,
\[
m \le A_{P+1} \, \mathbb{E}_{\mathbf{y}} \mathbb{E}_{U_0, \ldots, U_P}\left[\left\|\sum_{i=1}^m y_i \, \mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i)\right\|\right]
\]
Applying Jensen's inequality,
\[
m^2 \le A_{P+1}^2 \, \mathbb{E}_{\mathbf{y}} \mathbb{E}_{U_0, \ldots, U_P}\left[\left\|\sum_{i=1}^m y_i \, \mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i)\right\|^2\right] = A_{P+1}^2 \sum_{i=1}^m \sum_{j=1}^m \mathbb{E}_{\mathbf{y}}[y_i y_j] \, \mathbb{E}_{U_0, \ldots, U_P}\left[(\mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i)) \cdot (\mathbf{u}^j_P \odot \phi_P(\mathbf{x}_j))\right]
\]
\[
= A_{P+1}^2 \sum_{i=1}^m \mathbb{E}_{U_0, \ldots, U_P}\left[(\mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i)) \cdot (\mathbf{u}^i_P \odot \phi_P(\mathbf{x}_i))\right] = A_{P+1}^2 \left[\sum_{i=1}^m \mathbb{E}_{U_0, \ldots, U_{P-1}} \mathbb{E}_{U_P}\left[(\delta^i_{\{P,1\}})^2 \phi_P(\mathbf{x}_i)_1^2 + \ldots + (\delta^i_{\{P,h_P\}})^2 \phi_P(\mathbf{x}_i)_{h_P}^2\right]\right]
\]
Using $\mathbb{E}[(\delta^i_{\{P,t\}})^2] = \mathbb{E}[\delta^i_{\{P,t\}}] = p_P$ we get,
\[
m^2 \le A_{P+1}^2 \left[\sum_{i=1}^m p_P \, \mathbb{E}_{U_0, \ldots, U_{P-1}}[\|\phi_P(\mathbf{x}_i)\|^2]\right] \le A_{P+1}^2 \left[p_P \, m \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}[\|\phi_P(\mathbf{x}_i)\|^2]\right]
\]
\[
\implies m \le A_{P+1}^2 \, p_P \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}[\|\phi_P(\mathbf{x}_i)\|^2] \tag{21}
\]
Now, we bound $r^2 = \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}[\|\phi_P(\mathbf{x}_i)\|^2]$:
\[
r^2 = \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}\left[\left\| \left[ \sigma((\mathbf{u}^i_{P-1} \odot \phi_{P-1}(\mathbf{x}_i)) \cdot \mathbf{w}^1_{\{P-1,P\}}), \ldots, \sigma((\mathbf{u}^i_{P-1} \odot \phi_{P-1}(\mathbf{x}_i)) \cdot \mathbf{w}^{h_P}_{\{P-1,P\}}) \right] \right\|^2\right]
\]
Let $t$ be the index of the maximum absolute value in the vector $\phi_P(\mathbf{x}_i)$:
\[
r^2 \le h_P \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}\left[\left\|\sigma((\mathbf{u}^i_{P-1} \odot \phi_{P-1}(\mathbf{x}_i)) \cdot \mathbf{w}^t_{\{P-1,P\}})\right\|^2\right]
\]
Using eq. 2 we get,
\[
r^2 \le h_P L_\sigma^2 \max_i \mathbb{E}_{U_0, \ldots, U_{P-2}} \mathbb{E}_{U_{P-1}}\left[\|\mathbf{u}^i_{P-1} \odot \phi_{P-1}(\mathbf{x}_i)\|^2\right] \|\mathbf{w}^t_{\{P-1,P\}}\|^2 = h_P A_P^2 L_\sigma^2 \max_i \mathbb{E}_{U_0, \ldots, U_{P-1}}\left[\|\mathbf{u}^i_{P-1} \odot \phi_{P-1}(\mathbf{x}_i)\|^2\right]
\]
Applying the expectation over $\mathbf{u}^i_{P-1}$ we get,
\[
r^2 \le h_P A_P^2 L_\sigma^2 \, p_{P-1} \max_i \mathbb{E}_{U_0, \ldots, U_{P-2}}[\|\phi_{P-1}(\mathbf{x}_i)\|^2]
\]
Applying this recursively till layer 1 we get,
\[
r^2 \le L_\sigma^{2P} \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_k\right) \left(\max_i \mathbb{E}_{U_0}[\|\mathbf{u}^i_0 \odot \phi_0(\mathbf{x}_i)\|^2]\right) = L_\sigma^{2P} \, p_0 \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_k\right) \left(\max_i \|\mathbf{x}_i\|^2\right) \tag{22}
\]
\[
\implies r^2 \le R^2 \, p_0 \, L_\sigma^{2P} \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_k\right) \tag{23}
\]
Using eq. 23 in eq. 21 we get,
\[
m \le A_{P+1}^2 \, R^2 \, p_0 \, L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2 \implies VCdim(\mathcal{F}_{do}) \le A_{P+1}^2 \, R^2 \, p_0 \, L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2
\]

Proof of Theorem 3

Proof: Following eq. 16 we get, $\forall i \in [1, m]$,
\[
1 \le y_i \left(\phi_P(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \mathbf{w}_{\{P,P+1\}})\right)
\]
\[
\implies m \le \sum_{i=1}^m y_i \left(\phi_P(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \mathbf{w}_{\{P,P+1\}})\right) = \sum_{i=1}^m y_i \left((\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i)) \cdot \mathbf{w}_{\{P,P+1\}}\right) \tag{24}
\]
Since the inequality holds for all $y_i \in \{-1,+1\}$ and for all $\mathbf{u}^{\{i,1\}}_{\{P,P+1\}}$, it also holds in expectation over $\{y_1, \ldots, y_m\}$ and $U'_{\{k,k+1\}} = \{U^1_{\{k,k+1\}}, \ldots, U^m_{\{k,k+1\}}\}, \forall k \in \{0, \ldots, P\}$. The elements of the vector $\mathbf{u}^{\{i,t\}}_{\{k,k+1\}}, \forall t \in \{1, \ldots, h_{k+1}\}$, are distributed according to the Bernoulli distribution, thus $\mathbb{E}[u^{\{i,n,t\}}_{\{k,k+1\}}] = \mathbb{E}[\delta^{\{i,n,t\}}_{\{k,k+1\}}] = p_{\{k,k+1\}}, \forall n \in \{1, \ldots, h_k\}$. This gives,
\[
m \le A_{P+1} \, \mathbb{E}_{\mathbf{y}} \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P,P+1\}}}\left[\left\|\sum_{i=1}^m y_i (\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i))\right\|\right]
\]
Using Jensen's inequality,
\[
m^2 \le A_{P+1}^2 \, \mathbb{E}_{\mathbf{y}} \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P,P+1\}}}\left[\left\|\sum_{i=1}^m y_i (\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i))\right\|^2\right] = A_{P+1}^2 \sum_{i=1}^m \sum_{j=1}^m \mathbb{E}_{\mathbf{y}}[y_i y_j] \, \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P,P+1\}}}\left[(\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i)) \cdot (\mathbf{u}^{\{j,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_j))\right]
\]
\[
= A_{P+1}^2 \sum_{i=1}^m \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P,P+1\}}}\left[(\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i)) \cdot (\mathbf{u}^{\{i,1\}}_{\{P,P+1\}} \odot \phi_P(\mathbf{x}_i))\right] = A_{P+1}^2 \left[\sum_{i=1}^m \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}} \mathbb{E}_{U'_{\{P,P+1\}}}\left[(\delta^{\{i,1,1\}}_{\{P,P+1\}})^2 \phi_P(\mathbf{x}_i)_1^2 + \ldots + (\delta^{\{i,h_P,1\}}_{\{P,P+1\}})^2 \phi_P(\mathbf{x}_i)_{h_P}^2\right]\right]
\]
Using the fact that $\mathbb{E}[(\delta^{\{i,n,1\}}_{\{P,P+1\}})^2] = \mathbb{E}[\delta^{\{i,n,1\}}_{\{P,P+1\}}] = p_{\{P,P+1\}}$ we get,
\[
m^2 \le A_{P+1}^2 \left[p_{\{P,P+1\}} \, m \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}}[\|\phi_P(\mathbf{x}_i)\|^2]\right]
\]
\[
\implies m \le A_{P+1}^2 \, p_{\{P,P+1\}} \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}}[\|\phi_P(\mathbf{x}_i)\|^2] \tag{25}
\]
Now, we bound $r^2 = \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}}[\|\phi_P(\mathbf{x}_i)\|^2]$:
\[
r^2 = \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}}\left[\left\| \left[ \sigma(\phi_{P-1}(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,1\}}_{\{P-1,P\}} \odot \mathbf{w}^1_{\{P-1,P\}})), \ldots, \sigma(\phi_{P-1}(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,h_P\}}_{\{P-1,P\}} \odot \mathbf{w}^{h_P}_{\{P-1,P\}})) \right] \right\|^2\right]
\]
Let $t$ be the index of the maximum absolute value in the vector $\phi_P(\mathbf{x}_i)$:
\[
r^2 \le h_P \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-1,P\}}}\left[\left\|\sigma(\phi_{P-1}(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,t\}}_{\{P-1,P\}} \odot \mathbf{w}^t_{\{P-1,P\}}))\right\|^2\right] \tag{26}
\]
We use the Lipschitz property from eq. 2 and also the fact that $\phi_{P-1}(\mathbf{x}_i) \cdot (\mathbf{u}^{\{i,t\}}_{\{P-1,P\}} \odot \mathbf{w}^t_{\{P-1,P\}}) = (\mathbf{u}^{\{i,t\}}_{\{P-1,P\}} \odot \phi_{P-1}(\mathbf{x}_i)) \cdot \mathbf{w}^t_{\{P-1,P\}}$ to obtain,
\[
r^2 \le h_P L_\sigma^2 \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-2,P-1\}}} \mathbb{E}_{U'_{\{P-1,P\}}}\left[\|\mathbf{u}^{\{i,t\}}_{\{P-1,P\}} \odot \phi_{P-1}(\mathbf{x}_i)\|^2\right] \|\mathbf{w}^t_{\{P-1,P\}}\|^2 = h_P A_P^2 L_\sigma^2 \max_i \mathbb{E}\left[\|\mathbf{u}^{\{i,t\}}_{\{P-1,P\}} \odot \phi_{P-1}(\mathbf{x}_i)\|^2\right]
\]
Applying the expectation over $\mathbf{u}^{\{i,t\}}_{\{P-1,P\}}$ we get,
\[
r^2 \le h_P A_P^2 L_\sigma^2 \, p_{\{P-1,P\}} \max_i \mathbb{E}_{U'_{\{0,1\}}, \ldots, U'_{\{P-2,P-1\}}}[\|\phi_{P-1}(\mathbf{x}_i)\|^2]
\]
Applying this recursively till layer 1 we get,
\[
r^2 \le L_\sigma^{2P} \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_{\{k,k+1\}}\right) \left(\max_i \mathbb{E}_{U'_{\{0,1\}}}[\|\mathbf{u}^{\{i,1\}}_{\{0,1\}} \odot \phi_0(\mathbf{x}_i)\|^2]\right) = L_\sigma^{2P} \, p_{\{0,1\}} \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_{\{k,k+1\}}\right) \left(\max_i \|\mathbf{x}_i\|^2\right)
\]
\[
\implies r^2 \le R^2 \, p_{\{0,1\}} \, L_\sigma^{2P} \left(\prod_{k=1}^{P} h_k A_k^2\right)\left(\prod_{k=1}^{P-1} p_{\{k,k+1\}}\right) \tag{27}
\]
Using eq. 27 in eq. 25 we get,
\[
m \le A_{P+1}^2 \, R^2 \, p_{\{0,1\}} \, L_\sigma^{2P} \prod_{k=1}^{P} p_{\{k,k+1\}} h_k A_k^2 \implies VCdim(\mathcal{F}_{dc}) \le A_{P+1}^2 \, R^2 \, p_{\{0,1\}} \, L_\sigma^{2P} \prod_{k=1}^{P} p_{\{k,k+1\}} h_k A_k^2
\]

Proof of Theorem 4

Proof: Following eq. 20 - eq. 22, for the $P$ fully connected layers and the classifier layer we get,
\[
m \le A_{P+1}^2 \left[ L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2 \left(\max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\|cv_T(\phi_{\{T,\acute{T}\}}(\mathbf{x}_i))\|^2\right]\right) \right] \tag{28}
\]
Let $\{t_1, t_2, t_3\}$ denote the index of the maximum value of $cv_T(\phi_{\{T,\acute{T}\}}(\mathbf{x}_i))$. Let $\mathbf{w}$ be the weight connecting the previous layer to the current layer, whose norm is bounded by $A_T$. The part of the previous layer connected to the current layer via the weight $\mathbf{w}$ is given by $\acute{\phi}_{\{T,\acute{T}\}}(\mathbf{x}_i)$. Let the filter sizes remain the same and equal to $v_T$ for block $T$. Thus we get,
\[
m \le A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right) \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\|\acute{\phi}_{\{T,\acute{T}\}}(\mathbf{x}_i)\|^2\right] \right) \right]
\]
\[
= A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right) \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{\phi}_{\{T,\acute{T}-1\}}(\mathbf{x}_i) + \acute{cv}_{\{T,\acute{T},2\}}(\sigma(U_T \odot cv_{\{T,\acute{T},1\}}(\sigma(\phi_{\{T,\acute{T}-1\}}(\mathbf{x}_i)))))\right\|^2\right] \right) \right]
\]
Again, let $\{t_1, t_2, t_3\}$ denote the index of the maximum value of $\acute{\phi}_{\{T,\acute{T}\}}(\mathbf{x}_i)$; then for the last $cv(\cdot)$ layer of the resnet unit, we get,
\[
= A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right)\left(N_T v_T^2\right) \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{\phi}_{\{T,\acute{T}-1\}}(\mathbf{x}_i)_{\{t_1,t_2,t_3\}} + \left(\acute{\sigma}(U_T \odot cv_{\{T,\acute{T},1\}}(\sigma(\phi_{\{T,\acute{T}-1\}}(\mathbf{x}_i))))\right) \cdot \dot{\mathbf{w}}\right\|^2\right] \right) \right]
\]
Neglecting the effect of $\acute{\phi}_{\{T,\acute{T}-1\}}(\mathbf{x}_i)_{\{t_1,t_2,t_3\}}$ on the sum, and considering the norm of the weights bounded by $A_T$, we get,
\[
= A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right)\left(A_T^2 N_T v_T^2\right) \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{\sigma}(U_T \odot cv_{\{T,\acute{T},1\}}(\sigma(\phi_{\{T,\acute{T}-1\}}(\mathbf{x}_i))))\right\|^2\right] \right) \right] \tag{30}
\]
Using the Lipschitzness of $\sigma(\cdot)$ we get,
\[
= A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right)\left(A_T^2 N_T v_T^2\right) L_\sigma^2 \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{U}_T \odot \acute{cv}_{\{T,\acute{T},1\}}(\sigma(\phi_{\{T,\acute{T}-1\}}(\mathbf{x}_i)))\right\|^2\right] \right) \right]
\]
We now take an expectation over the dropout variable,
\[
= A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_T^2 N_T v_T^2\right)\left(A_T^2 N_T v_T^2\right) L_\sigma^2 \, p_T \max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{cv}_{\{T,\acute{T},1\}}(\sigma(\phi_{\{T,\acute{T}-1\}}(\mathbf{x}_i)))\right\|^2\right] \right) \right]
\]
Proceeding in the same manner for the first $cv(\cdot)$ layer and the remaining activation of the unit, we obtain a bound in terms of $\max_i \mathbb{E}_{U_1}, \ldots, \mathbb{E}_{U_T}\left[\left\|\acute{\phi}_{\{T,\acute{T}-1\}}(\mathbf{x}_i)\right\|^2\right]$. Doing the above for the $\acute{T}$ residual units in each of the $T$ residual blocks, we arrive at,
\[
m \le A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_0^2 N_0 v_0^2\right) \prod_{r=1}^{T} \left(A_r^2 N_r v_r^2\right)^{2\acute{T}} L_\sigma^{2\acute{T}} \, p_r^{\acute{T}} \right) \max_i \|\mathbf{x}_i\|^2 \right]
\]
Let $R$ denote the radius of the dataset; thus we get the following bound on the VC dimension of a residual network:
\[
VCdim(\mathcal{F}) \le R^2 A_{P+1}^2 \left[ \left(L_\sigma^{2P} \prod_{k=1}^{P} p_k h_k A_k^2\right) \left( \left(A_0^2 N_0 v_0^2\right) \prod_{r=1}^{T} \left(A_r^2 N_r v_r^2\right)^{2\acute{T}} L_\sigma^{2\acute{T}} \, p_r^{\acute{T}} \right) \right]
\]

Proof of Theorem 5

Proof: Following eq. 16 - eq. 18 and replacing $\mathbf{x}_i$ with $\hat{\mathbf{x}}_i = \mathbf{x}_i + \Delta_i$, we obtain,
\[
m \le A_{P+1}^2 \left(\max_i \|\hat{\mathbf{x}}_i\|^2\right) L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 = A_{P+1}^2 \left(\max_i \|\mathbf{x}_i + \Delta_i\|^2\right) L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \tag{31}
\]
Applying the triangle inequality to the term $\|\mathbf{x}_i + \Delta_i\|$ in eq. 31 we get,
\[
m \le A_{P+1}^2 \left(\max_i (\|\mathbf{x}_i\| + \|\Delta_i\|)\right)^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \tag{32}
\]
Since $\max_i \|\mathbf{x}_i\| \le R$ and $\max_i \|\Delta_i\| \le c$, using these in eq. 32 we obtain,
\[
m \le A_{P+1}^2 (R + c)^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2 \implies VCdim(\mathcal{F}_{ro}) \le A_{P+1}^2 (R + c)^2 L_\sigma^{2P} \prod_{k=1}^{P} h_k A_k^2
\]

References
Senjian An, Munawar Hayat, Salman H Khan, Mohammed Bennamoun, Farid Boussaid, and Ferdous Sohel. Contractive rectifier networks for nonlinear maximum margin classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 2515-2523, 2015.

Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial Intelligence, 210:78-122, 2014.

Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Peter L Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems, pages 190-196, 1999.

Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in Neural Information Processing Systems, pages 81-90, 1989.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

Christopher JC Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698-728, 2016.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303-314, 1989.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016b.

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.

Marek Karpinski and Angus Macintyre. Polynomial bounds for VC dimension of sigmoidal neural networks. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, pages 200-208. ACM, 1995.

Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924-2932, 2014.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376-1401, 2015.

Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833-840, 2011.

Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117, 2015.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Alex J Smola and Bernhard Schölkopf. Learning with Kernels. GMD-Forschungszentrum Informationstechnik, 1998.

Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129-136, 2011.

Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 2017.

Eduardo D Sontag. VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences, 168:69-96, 1998.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep neural networks: A theoretical view. In AAAI, pages 2066-2072, 2016.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, pages 4278-4284, 2017.

Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

Vladimir Naumovich Vapnik and Vlamimir Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann L Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058-1066, 2013.

Pengtao Xie, Yuntian Deng, and Eric Xing. On the generalization error bounds of neural networks under diversity-inducing mutual angular regularization. arXiv preprint arXiv:1511.07110, 2015.

Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391-423, 2012.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Ding-Xuan Zhou. The covering number in learning theory. Journal of Complexity, 2002.