Regularisation of Neural Networks by Enforcing Lipschitz Continuity
Henry Gouk, Eibe Frank, Bernhard Pfahringer, Michael J. Cree
Henry Gouk [email protected]
Eibe Frank [email protected]
Bernhard Pfahringer [email protected]
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Michael J. Cree [email protected]
School of Engineering, University of Waikato, Hamilton, New Zealand
Abstract
We investigate the effect of explicitly enforcing the Lipschitz continuity of neural networks with respect to their inputs. To this end, we provide a simple technique for computing an upper bound on the Lipschitz constant of a feed-forward neural network composed of commonly used layer types, and demonstrate inaccuracies in previous work on this topic. Our technique is then used to formulate training a neural network with a bounded Lipschitz constant as a constrained optimisation problem that can be solved using projected stochastic gradient methods. Our evaluation study shows that, in isolation, our method performs comparably to state-of-the-art regularisation techniques. Moreover, when combined with existing approaches to regularising neural networks, the performance gains are cumulative. We also provide evidence that the hyperparameters are intuitive to tune, and demonstrate how the choice of norm for computing the Lipschitz constant impacts the resulting model.
Keywords:
Neural Networks, Regularisation
1. Introduction
Supervised learning is primarily concerned with the problem of approximating a function given examples of what output should be produced for a particular input. In order for the approximation to be of any practical use, it must generalise to unseen data points. Thus, we need to select an appropriate space of functions in which the machine should search for a good approximation, and select an algorithm to search through this space. This is typically done by first picking a large family of models, such as support vector machines or decision trees, and applying a suitable search algorithm. Crucially, when performing the search, regularisation techniques specific to the chosen model family must be employed to combat overfitting. For example, one could limit the depth of decision trees considered by a learning algorithm, or impose probabilistic priors on tunable model parameters.

Regularisation of neural network models is a particularly difficult challenge. The methods that are currently most effective (Srivastava et al., 2014; Ioffe and Szegedy, 2015) are heuristically motivated, which can make the process of applying these techniques to new problems nontrivial or unreliable. In contrast, well-understood regularisation approaches adapted from linear models, such as applying an ℓ_2 penalty term to the model parameters, are known to be less effective than the ad hoc approaches (Srivastava et al., 2014). This provides a clear motivation for developing well-founded and effective regularisation methods for neural networks. Following the intuition that functions are considered simpler when they vary at a slower rate, and thus generalise better, we develop a method that allows us to control the Lipschitz constant of a network, a measure of the maximum variation a function can exhibit. Our experiments show that this is a useful inductive bias to impose on neural network models.

One of the prevailing themes in the theoretical work surrounding neural networks is that the magnitude of the weights directly impacts the generalisation gap (Bartlett, 1998; Bartlett et al., 2017; Neyshabur, 2017), with larger weights being associated with poorer relative performance on new data. In several of the most recent works (Bartlett et al., 2017; Neyshabur, 2017), some of the dominant terms in these bounds are equal to the upper bound of the Lipschitz constant of neural networks as we derive it in this paper. While previous works have only considered the Lipschitz continuity of networks with respect to the ℓ_2 norm, we put a particular emphasis on working with the ℓ_1 and ℓ_∞ norms, and construct a practical algorithm for constraining the Lipschitz constant of a network during training. The algorithm takes a single hyperparameter that is used to enforce an upper bound on the Lipschitz constant of the network.

Several interesting properties of this regularisation technique are demonstrated experimentally. We show that our algorithm performs similarly to other methods in isolation. Moreover, when the method presented in this paper is combined with other commonly used regularisers, the performance gains are cumulative. We verify that the hyperparameter behaves in an intuitive manner: when set to a small value the generalisation gap is very small, and as the value of the hyperparameter is increased, the generalisation gap widens.

The paper begins with an outline of previous work related to regularisation and the Lipschitz continuity of neural networks in Section 2.
This is followed by a detailed derivation of our upper bound on the Lipschitz constant of a wide class of feed-forward neural networks in Section 3, where we give consideration to multiple choices of vector norms. Section 4 shows how this upper bound can be used to regularise the neural network in an efficient manner. Experiments showing the utility of our regularisation approach are given in Section 5, and conclusions are drawn in Section 6.
2. Related Work
One of the most widely applied regularisation techniques currently used for deep networks is dropout (Srivastava et al., 2014). By randomly setting the activations of each hidden unit to zero with some probability, p, during training, this method noticeably reduces overfitting for a wide variety of models. Various extensions have been proposed, such as randomly setting weights to zero instead of activations (Wan et al., 2013). Another modification, concrete dropout (Gal et al., 2017), allows one to directly learn the dropout rate, p, thus making the search for a good set of hyperparameters easier. Kingma et al. (2015) have also shown that the noise level in Gaussian dropout can be learned during optimisation. Srivastava et al. (2014) found that constraining the ℓ_2 norm of the weight vector for each unit in isolation, a technique that they refer to as maxnorm, slightly improved the performance of networks trained with dropout.

Batch normalisation (Ioffe and Szegedy, 2015), which was initially developed to accelerate the convergence of the training process, has also been shown to improve generalisation. It is efficient and simple to implement: it solely standardises the activations of each layer by aggregating statistics over minibatches. The activations are then rescaled and translated by model parameters learned during training. A similar technique that is more effective in the small minibatch case, when the statistics required by batch normalisation cannot be computed reliably, is weight normalisation (Salimans and Kingma, 2016). Rather than computing the mean and standard deviation, the orientation and magnitude of each weight vector are decoupled and learned separately. Interestingly, it has been shown that weight decay (adding an ℓ_2 penalty term to the loss function) provides no regularisation benefit when used in conjunction with these methods (van Laarhoven, 2017), but it does change the effective learning rate of commonly used optimisation algorithms. However, one could achieve the same effect by simply changing the learning rate directly.

The recent work on optimisation for deep learning has also contributed to our understanding of the generalisation performance of neural networks. Hardt et al. (2016) provide some theoretical justifications for common practices, such as early stopping. Neyshabur et al. (2015) present the Path-SGD optimisation algorithm, a method specifically designed for training deep networks. They show that it outperforms several conventional stochastic gradient methods for training multi-layer perceptrons.

Several papers have shown that the generalisation gap of a neural network is dependent on the magnitude of the weights (Bartlett et al., 2017; Neyshabur, 2017; Bartlett, 1998). Early results, such as Bartlett (1998), present bounds that assume sigmoidal activation functions, but nevertheless relate generalisation to the sum of the absolute values of the weights in the network. More recent work has shown that the product of spectral norms, scaled by various other weight matrix norms, can be used to construct bounds on the generalisation gap. Bartlett et al. (2017) scale the spectral norm product by a term related to the element-wise ℓ_1 norm, whereas Neyshabur et al. (2017) use the Frobenius norm. Neyshabur et al. (2017) speculate that Lipschitz continuity alone is insufficient to guarantee generalisation.
However, it appears in multiple generalisation bounds (Neyshabur, 2017; Bartlett et al., 2017), and this paper presents evidence that it is an effective means for controlling the generalisation performance of a deep network.

Yoshida and Miyato (2017) proposed a new regularisation scheme in the form of a term in the loss function that penalises the sum of spectral norms of the weight matrices. Our work differs from theirs in several ways. Firstly, we investigate norms other than ℓ_2. Secondly, Yoshida and Miyato (2017) use a penalty term, whereas we employ a hard constraint on the induced weight matrix norm. Moreover, they penalise the sum of the norms, but the Lipschitz constant is determined by the product of operator norms. Finally, Yoshida and Miyato (2017) use a heuristic to regularise convolutional layers that does not correspond to penalising their spectral norm. Specifically, they compute the largest singular value of a flattened weight tensor, as opposed to deriving the true matrix corresponding to the linear operation performed by convolutional layers, as we do in Section 3.2. Explicitly constructing this matrix and computing its largest singular value, even approximately, would be prohibitively expensive. We provide efficient methods for computing the ℓ_1 and ℓ_∞ norms of convolutional layers exactly, and show how one can approximate the spectral norm efficiently by avoiding the need to explicitly construct the matrix representing the linear operation performed by convolutional layers.

Constraining the Lipschitz continuity of a network is not only interesting for regularisation. Miyato et al. (2017) have shown that constraining the weights of the discriminator in a generative adversarial network to have a specific spectral norm can improve the quality of generated samples. They use the same technique as Yoshida and Miyato (2017) to compute these norms, and thus may benefit from the improvements presented in this paper.
3. Computing the Lipschitz Constant
A function, f : X → Y, is said to be Lipschitz continuous if it satisfies

\[ D_Y(f(x_1), f(x_2)) \leq k \, D_X(x_1, x_2) \qquad \forall\, x_1, x_2 \in X, \tag{1} \]

for some real-valued k ≥ 0 and metrics D_X and D_Y.
The value of k is known as the Lipschitz constant, and the function can be referred to as being k-Lipschitz. Generally, we are interested in the smallest possible Lipschitz constant, but it is not always possible to find it. In this section, we show how to compute an upper bound to the Lipschitz constant of a feed-forward neural network with respect to the input features. Such networks can be expressed as a series of function compositions:

\[ f(x) = (\phi_l \circ \phi_{l-1} \circ \dots \circ \phi_1)(x), \tag{2} \]

where each φ_i is an activation function, linear operation, or pooling operation. A particularly useful property of Lipschitz functions is how they behave when composed: the composition of a k_1-Lipschitz function, f_1, with a k_2-Lipschitz function, f_2, is a k_1 k_2-Lipschitz function. It is important to note that k_1 k_2 will not necessarily be the smallest Lipschitz constant of (f_1 ∘ f_2), even if k_1 and k_2 are individually the best Lipschitz constants of f_1 and f_2, respectively. Denoting the Lipschitz constant of some function, f, as L(f), repeated application of this composition property yields the following upper bound on the Lipschitz constant for the entire feed-forward network:

\[ L(f) \leq \prod_{i=1}^{l} L(\phi_i). \tag{3} \]

Thus, we can compute the Lipschitz constants of each layer in isolation and combine them in a modular way to establish an upper bound on the constant of the entire network. In the remainder of this section, we derive closed-form expressions for the Lipschitz constants of common layer types when D_X and D_Y correspond to the ℓ_1, ℓ_2, or ℓ_∞ norms, respectively. As we will see in Section 4, Lipschitz constants with respect to these norms can be constrained efficiently.

A fully connected layer, φ_fc(x), implements an affine transformation parameterised by a weight matrix, W, and a bias vector, b:

\[ \phi_{\mathrm{fc}}(x) = Wx + b. \tag{4} \]

Others have already established that, under the ℓ_2 norm, the Lipschitz constant of a fully connected layer is given by the spectral norm of the weight matrix (Miyato et al., 2017; Neyshabur, 2017). We provide a slightly more general formulation that will prove to be more useful when considering other p-norms. We begin by plugging the definition of a fully connected layer into the definition of Lipschitz continuity:

\[ \|(Wx_1 + b) - (Wx_2 + b)\|_p \leq k \|x_1 - x_2\|_p. \tag{5} \]

By setting a = x_1 − x_2 and simplifying the expression slightly, we arrive at

\[ \|Wa\|_p \leq k \|a\|_p, \tag{6} \]

which, assuming x_1 ≠ x_2, can be rearranged to

\[ \frac{\|Wa\|_p}{\|a\|_p} \leq k, \qquad a \neq 0. \tag{7} \]

The smallest Lipschitz constant is therefore equal to the supremum of the left-hand side of the inequality,

\[ L(\phi_{\mathrm{fc}}) = \sup_{a \neq 0} \frac{\|Wa\|_p}{\|a\|_p}, \tag{8} \]

which is the definition of the operator norm.

For the p-norms we consider in this paper, there exist efficient algorithms for computing operator norms on relatively large matrices. Specifically, for p = 1, the operator norm is the maximum absolute column sum norm; for p = ∞, the operator norm is the maximum absolute row sum norm. The time required to compute both of these norms is linearly related to the number of elements in the weight matrix. When p = 2, the operator norm is given by the largest singular value of the weight matrix (the spectral norm), which can be approximated relatively quickly using a small number of iterations of the power method.
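To make the three cases concrete, the sketch below computes these operator norms for a dense weight matrix. The function name and iteration count are our own choices rather than anything prescribed by the paper; the p = 2 case uses the power method described above.

    import numpy as np

    def operator_norm_dense(W, p, n_iters=50):
        """Operator norm of a dense weight matrix W for p in {1, 2, inf}."""
        if p == 1:
            return np.abs(W).sum(axis=0).max()      # maximum absolute column sum
        if p == np.inf:
            return np.abs(W).sum(axis=1).max()      # maximum absolute row sum
        if p == 2:
            x = np.random.randn(W.shape[1])         # power method on W^T W (Algorithm 1)
            for _ in range(n_iters):
                x = W.T @ (W @ x)
                x /= np.linalg.norm(x)              # rescale to keep the iterate well behaved
            return np.linalg.norm(W @ x) / np.linalg.norm(x)
        raise ValueError("p must be 1, 2, or np.inf")

    W = np.random.randn(64, 128)
    exact = np.linalg.svd(W, compute_uv=False)[0]   # exact spectral norm for comparison
    print(operator_norm_dense(W, 2), exact)

With enough iterations the power-method estimate matches the largest singular value returned by a full SVD; with very few iterations it is an underestimate, a point Section 4 returns to.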
Convolutional layers, φ_conv(X), also perform an affine transformation, but it is usually more convenient to express the computation in terms of discrete convolutions and pointwise additions. For a convolutional layer, the i-th output feature map is given by

\[ \phi_i^{\mathrm{conv}}(X) = \sum_{j=1}^{M_{l-1}} F_{i,j} * X_j + B_i, \tag{9} \]

where each F_{i,j} is a filter, each X_j is an input feature map, B_i is an appropriately shaped bias tensor exhibiting the same value in every element, and the previous layer produced M_{l-1} feature maps.

The convolutions in Equation 9 are linear operations, so one can exploit the isomorphism between linear operations and square matrices of the appropriate size to reuse the matrix norms derived in Section 3.1. To represent a single convolution operation as a matrix–vector multiplication, the input feature map is serialised into a vector, and the filter coefficients are used to construct a doubly block circulant matrix. Due to the structure of doubly block circulant matrices, each filter coefficient appears in each column and row of this matrix exactly once. Consequently, the ℓ_1 and ℓ_∞ operator norms are the same and given by ||F_{i,j}||_1, the sum of the absolute values of the filter coefficients used to construct the matrix.

Summing over several different convolutions associated with different input feature maps and the same output feature map, as done in Equation 9, can be accomplished by horizontally concatenating matrices. For example, suppose V_{i,j} is a matrix that performs a convolution of F_{i,j} with the j-th feature map serialised into a vector. Equation 9 can now be rewritten in matrix form as

\[ \phi_i^{\mathrm{conv}}(x) = [V_{i,1}, V_{i,2}, \dots, V_{i,M_{l-1}}]\, x + b_i, \tag{10} \]

where the inputs and biases, previously represented by X and B_i, have been serialised into vectors x and b_i, respectively. The complete linear transformation, W, performed by a convolutional layer to generate M_l output feature maps can be constructed by adding additional rows to the block matrix:

\[ W = \begin{bmatrix} V_{1,1} & \dots & V_{1,M_{l-1}} \\ \vdots & \ddots & \vdots \\ V_{M_l,1} & \dots & V_{M_l,M_{l-1}} \end{bmatrix}. \tag{11} \]

To compute the ℓ_1 and ℓ_∞ operator norms of W, recall that the operator norm of V_{i,j} for p ∈ {1, ∞} is ||F_{i,j}||_1. A second matrix, W′, can be constructed from W, where each block, V_{i,j}, is replaced with the corresponding operator norm, ||F_{i,j}||_1. Each of these operator norms can be thought of as a partial row or column sum for the original matrix, W. Now, based on the discussion in Section 3.1, the ℓ_1 operator norm is given by

\[ \|W\|_1 = \max_j \sum_{i=1}^{M_l} \|F_{i,j}\|_1, \tag{12} \]

and the ℓ_∞ operator norm is given by

\[ \|W\|_\infty = \max_i \sum_{j=1}^{M_{l-1}} \|F_{i,j}\|_1. \tag{13} \]

Yoshida and Miyato (2017) and Miyato et al. (2017) both investigate the effect of penalising or constraining the spectral norm of convolutional layers by reinterpreting the weight tensor of a convolutional layer as a matrix,

\[ U = \begin{bmatrix} u_{1,1} & \dots & u_{1,M_{l-1}} \\ \vdots & \ddots & \vdots \\ u_{M_l,1} & \dots & u_{M_l,M_{l-1}} \end{bmatrix}, \tag{14} \]

where each u_{i,j} contains the elements of the corresponding F_{i,j} serialised into a row vector. They then proceed to compute the spectral norm of U, rather than computing the spectral norm of W, given in Equation 11. However, it is unclear when the spectral norm of U is an accurate approximation of the spectral norm of W, and it is in fact trivial to construct cases where it is highly inaccurate: for example, by explicitly constructing W in MATLAB and checking whether svd(W) is equal to svd(U) for the single filter case. In all randomly selected kernels we tried, the results of the two spectral norm calculations were very different.
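Equations 12 and 13 can be evaluated directly from the filter tensor without ever materialising W. The sketch below (the function name is our own) does exactly that; for zero-padded or strided convolutions the quantity it returns is an upper bound rather than the exact operator norm, since some filter coefficients then appear in fewer rows and columns of W than the circulant construction assumes.

    import numpy as np

    def conv_operator_norms(F):
        """l1 and l_inf operator norms of a convolution with filter tensor F of
        shape (out_channels, in_channels, kernel_h, kernel_w)."""
        per_filter = np.abs(F).sum(axis=(2, 3))   # ||F_{i,j}||_1 for every filter
        norm_1 = per_filter.sum(axis=0).max()     # Equation 12: max over input maps j
        norm_inf = per_filter.sum(axis=1).max()   # Equation 13: max over output maps i
        return norm_1, norm_inf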
Explicitly constructing W and applying a conventional singular value decomposition to compute the spectral norm of a convolutional layer is infeasible, but we show how the power method can be adapted to use standard convolutional network primitives to compute it efficiently. Consider the usual process for computing the largest singular value of a square matrix using the power method, provided in Algorithm 1.

Algorithm 1: Power method for producing the largest singular value, σ_max, of a non-square matrix, W.
    Randomly initialise x_0
    for i = 1 to n do
        x_i ← W^T W x_{i-1}
    end for
    σ_max ← ||W x_n|| / ||x_n||

The expression of most interest to us is inside the for loop, namely

\[ x_i = W^T W x_{i-1}, \tag{15} \]

which, due to the associativity of matrix multiplication, can be broken down into two steps:

\[ x_i' = W x_{i-1} \tag{16} \]

and

\[ x_i = W^T x_i'. \tag{17} \]

When W is the matrix in Equation 11, the expressions given in Equations 16 and 17 correspond to a forward propagation and a backwards propagation through a convolutional layer, respectively. Thus, if we replace these matrix multiplications with convolution and transposed convolution operations, respectively, as implemented in many deep learning frameworks, the spectral norm can be computed efficiently. Note that only a single vector must undergo the forward and backward propagation operations, rather than an entire batch of instances. This means, for most cases, only a small increase in runtime will be incurred by using this method. It also automatically takes into account the padding and stride hyperparameters used by the convolutional layer.
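The sketch below illustrates this forward/backward trick using PyTorch's functional convolution primitives; the function name and defaults are ours, and it assumes a stride of one so that the transposed convolution is the exact adjoint of the forward convolution.

    import torch
    import torch.nn.functional as F

    def conv_spectral_norm(weight, input_shape, padding=1, n_iters=5):
        """Estimate the spectral norm of a convolutional layer without building the
        matrix W from Equation 11.  `weight` has shape (out_ch, in_ch, kH, kW) and
        `input_shape` is (in_ch, height, width) for a single input."""
        x = torch.randn(1, *input_shape)
        for _ in range(n_iters):
            y = F.conv2d(x, weight, padding=padding)            # Equation 16: x' = W x
            x = F.conv_transpose2d(y, weight, padding=padding)  # Equation 17: x = W^T x'
            x = x / x.norm()                                    # keep the iterate well scaled
        y = F.conv2d(x, weight, padding=padding)
        return (y.norm() / x.norm()).item()                     # final line of Algorithm 1

    w = torch.randn(64, 32, 3, 3)
    print(conv_spectral_norm(w, (32, 16, 16)))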
Most common activation functions and pooling operations are, at worst, 1-Lipschitz with respect to all p-norms. For example, the maximum absolute sub-gradient of the ReLU activation function is 1, which means that ReLU operations have a Lipschitz constant of one. A similar argument yields that the Lipschitz constant of max pooling layers is one. The Lipschitz constant of the softmax is one (Gao and Pavel, 2017).

Recently developed feed-forward architectures often include residual connections between non-adjacent layers (He et al., 2016). These are most commonly used to construct structures known as residual blocks:

\[ \phi^{\mathrm{res}}(x) = x + (\phi_{j+n} \circ \dots \circ \phi_{j+1})(x), \tag{18} \]

where the function composition may contain a number of different linear transformations and activation functions. In most cases the composition is formed by two convolutional layers, each preceded by a batch normalisation layer and a ReLU function. While networks that use residual blocks still qualify as feed-forward networks, they no longer conform to the linear chain of function compositions we formalised in Equation 2. Fortunately, networks with residual connections are usually built by composing a linear chain of residual blocks of the form given in Equation 18. Hence, the Lipschitz constant of a residual network will be the product of Lipschitz constants for each residual block. For a k_1-Lipschitz function, f_1, and a k_2-Lipschitz function, f_2, we are interested in the Lipschitz constant of their sum:

\[ \|(f_1(x_1) + f_2(x_1)) - (f_1(x_2) + f_2(x_2))\|_p, \tag{19} \]

which can be rearranged to

\[ \|(f_1(x_1) - f_1(x_2)) + (f_2(x_1) - f_2(x_2))\|_p. \tag{20} \]

The subadditivity property of norms and the Lipschitz constants of f_1 and f_2 can then be used to bound Equation 20 from above:

\[ \|(f_1(x_1) - f_1(x_2)) + (f_2(x_1) - f_2(x_2))\|_p \leq \|f_1(x_1) - f_1(x_2)\|_p + \|f_2(x_1) - f_2(x_2)\|_p \tag{21} \]
\[ \leq k_1 \|x_1 - x_2\|_p + k_2 \|x_1 - x_2\|_p \tag{22} \]
\[ = (k_1 + k_2) \|x_1 - x_2\|_p. \tag{23} \]

Thus, we can see that the Lipschitz constant of the addition of two functions is bounded from above by the sum of their Lipschitz constants. Setting f_1 to be the identity function and f_2 to be a linear chain of function compositions, we arrive at the definition of a residual block as given in Equation 18. Noting that the Lipschitz constant of the identity function is one, we can see that the Lipschitz constant of a residual block is bounded by

\[ L(\phi^{\mathrm{res}}) \leq 1 + \prod_{i=j+1}^{j+n} L(\phi_i), \tag{24} \]

where the property given in Equation 3 has been applied to the function compositions.
4. Constraining the Lipschitz Constant
The assumption motivating our work is that the Lipschitz constant of a feed-forward neural network enables control of how well the model will generalise to new data. Using the composition property of Lipschitz functions, we have shown that the Lipschitz constant of a network is the product of the Lipschitz constants of each layer. Thus, controlling the Lipschitz constant of a network can be accomplished by constraining the Lipschitz constant of each layer in isolation by performing constrained optimisation when training the network. In practice, we pick a single hyperparameter, λ, and use it to control the upper bound of the Lipschitz constant for each layer. This means the network as a whole will have a Lipschitz constant less than or equal to λ^d, where d is the depth of the network.

The easiest way to adapt existing deep learning methods to allow for constrained optimisation is to introduce a projection step and perform a variant of the projected stochastic gradient method. In our particular problem, because each parameter matrix is constrained in isolation, it is straightforward to project any infeasible parameter values back into the set of feasible matrices. Specifically, after each weight update step we must check that none of the weight matrices (including the filter banks in the convolutional layers) are violating the Lipschitz constant constraints. If the weight update has caused a weight matrix to leave the feasible set, we must replace the resulting matrix with the closest matrix that does lie in the feasible set. This can all be accomplished with the projection function,

\[ \pi(W, \lambda) = \frac{1}{\max\left(1, \frac{\|W\|_p}{\lambda}\right)} W, \tag{25} \]

which will leave the matrix untouched if it does not violate the constraint, and project it back to the closest matrix in the feasible set if it does. Closeness is measured by the matrix distance metric induced by taking the operator norm of the difference between two matrices. This will work with any valid operator norm, because all norms are absolutely homogeneous. In particular, it will work with the operator norms with p ∈ {1, 2, ∞}, which can be computed using the approaches outlined in Section 3. Pseudocode for this projected gradient method is given in Algorithm 2. We have observed fast convergence when using the AMSGrad update rule (Reddi et al., 2018), but other variants of the stochastic gradient method also work. For example, in our experiments, we show that stochastic gradient descent with Nesterov's momentum is compatible with our approach.

Algorithm 2: Projected stochastic gradient method to optimise a neural network subject to the Lipschitz Constant Constraint (LCC).
    t ← 0
    while W^(t)_{1:l} not converged do
        t ← t + 1
        g^(t)_{1:l} ← ∇_{W_{1:l}} f(W^(t−1)_{1:l})
        Ŵ^(t)_{1:l} ← update(W^(t−1)_{1:l}, g^(t)_{1:l})
        for i = 1 to l do
            W^(t)_i ← π(Ŵ^(t)_i, λ)
        end for
    end while
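The following sketch shows Equation 25 and one projected update of the kind performed in the inner loop of Algorithm 2; the names are ours, and the exact SVD is used in place of the power method purely to keep the example short.

    import numpy as np

    def project(W, lam):
        """Equation 25 for the spectral (l2) case: rescale W only if its operator
        norm exceeds lam, otherwise return it unchanged."""
        sigma = np.linalg.svd(W, compute_uv=False)[0]
        return W / max(1.0, sigma / lam)

    lam, lr = 4.0, 1e-2
    W = np.random.randn(256, 128)
    grad = np.random.randn(*W.shape)      # stand-in for a minibatch gradient
    W = project(W - lr * grad, lam)       # gradient step, then project back onto the feasible set

For the ℓ_1 and ℓ_∞ cases, only the norm computation inside the projection changes.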
p-norm Estimation

A natural question to ask is which p-norm should be chosen when using the training procedure given in Algorithm 2. The Euclidean (i.e., spectral) norm is often seen as the default choice, due to its special status when talking about distances in the real world. Like Yoshida and Miyato (2017), we use the power method to estimate the spectral norms of the linear operations in deep networks. The convergence rate of the power method is related to the ratio of the two highest singular values, σ_2/σ_1. If the two largest singular values are almost the same, it will converge very slowly. Because each iteration of the power method for computing the spectral norm of a convolutional layer requires both forward propagation and backward propagation, it is only feasible to perform a small number of iterations. However, regardless of the quality of the approximation, we can be certain that our approximation is an underestimate of the true norm: the expression in the final line of Algorithm 1 is maximised when x_n is the first eigenvector of W. Therefore, if the algorithm has not converged, x_n will not be an eigenvector of W and our approximation of σ_max will be an underestimate.

In contrast to the spectral norm calculation, we compute the values of the ℓ_1 and ℓ_∞ norms exactly, in time that is linear in the number of weights in a layer, so it always comprises a relatively small fraction of the overall runtime for training the network. However, it may be the case that the ℓ_1 and ℓ_∞ constraints do not provide as suitable an inductive bias as the ℓ_2 constraint. This is something we investigate in our experimental evaluation.

Constraining the Lipschitz continuity of the network will have an impact on the magnitude of the activations produced by each layer, which is what batch normalisation attempts to explicitly control (Ioffe and Szegedy, 2015). Thus, we consider whether batch normalisation is compatible with our Lipschitz Constant Constraint regulariser. Batch normalisation can be expressed as

\[ \phi^{\mathrm{bn}}(x) = \mathrm{diag}\!\left(\frac{\gamma}{\sqrt{\mathrm{Var}[x]}}\right)(x - \mathrm{E}[x]) + \beta, \tag{26} \]

where diag(·) denotes a diagonal matrix, and γ and β are learned parameters. This can be seen as performing an affine transformation with a linear transformation term

\[ \mathrm{diag}\!\left(\frac{\gamma}{\sqrt{\mathrm{Var}[x]}}\right) x. \tag{27} \]

Based on the operator norm of this diagonal matrix, the Lipschitz constant of a batch normalisation layer, with respect to p-norms where p ∈ {1, 2, ∞}, is given by

\[ L(\phi^{\mathrm{bn}}) = \max_i \left| \frac{\gamma_i}{\sqrt{\mathrm{Var}[x_i]}} \right|. \tag{28} \]

Thus, if one were to use batch normalisation in conjunction with our technique, the γ parameter must also be constrained. This is accomplished by using the expression in Equation 28 to compute the operator norm in the projection function given in Equation 25. We use a moving average estimate of the variance for performing the projection, rather than the variance computed solely on the current minibatch of training examples. This is done because the minibatch estimates of the mean and variance can be quite noisy.
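A sketch of the corresponding projection for batch normalisation, using Equation 28 with a moving-average variance estimate; the function name and the small epsilon (which follows standard batch normalisation implementations) are ours.

    import numpy as np

    def project_bn_gamma(gamma, running_var, lam, eps=1e-5):
        """Constrain the Lipschitz constant of a batch normalisation layer by
        rescaling gamma whenever max_i |gamma_i| / sqrt(Var[x_i]) exceeds lam."""
        lipschitz = np.max(np.abs(gamma) / np.sqrt(running_var + eps))  # Equation 28
        return gamma / max(1.0, lipschitz / lam)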
In the standard formulation of dropout, one corrupts activations during training by performing pointwise multiplication with vectors of Bernoulli random variables. As a consequence, when making a prediction at test time (when units are not dropped out), the activations must be scaled by the probability that they remained uncorrupted during training. This means the activation magnitude at both test time and training time is approximately the same. The majority of modern neural networks make extensive use of rectified linear units, or similar activation functions that are also homogeneous. This implies that scaling the activations at test time is equivalent to scaling the weight matrices in the affine transformation layers. By definition, this will also scale the operator norm, and therefore the Lipschitz constant, of that layer. As a result, one may expect that when using our technique in conjunction with dropout, the λ hyperparameter will need to be increased in order to maintain the desired Lipschitz constant. Note that this does not necessarily mean that the optimal value for λ, from the point of view of generalisation performance, can be found by performing hyperparameter optimisation without dropout and then dividing the best λ found on the validation set by one minus the desired dropout rate: the change in optimisation dynamics and regularisation properties of dropout make it difficult to predict analytically how these two methods interact when considering generalisation performance.
5. Experiments
In our experiments, we aim to answer several questions about the behaviour of our Lipschitz Constant Constraint (LCC) regularisation scheme. The question of most interest is how well this regularisation technique compares to the state-of-the-art, in terms of the accuracy measured on held out data. In addition to this, we perform experiments that demonstrate how sensitive the method is to the choice of λ, how it interacts with existing regularisation methods, and investigate to what extent other methods implicitly control the Lipschitz continuity of feed-forward networks.

Several different architectural design philosophies are used throughout the experiments. Specifically, we use multi-layer perceptrons, VGG-style convolutional networks, and networks with residual connections. This is done to ensure that the regularisation method works for a broad range of feed-forward architectures. Similarly, two different optimisers are used in order to verify that Algorithm 2 behaves well when used with a range of common stochastic gradient methods. Specifically, we use SGD with Nesterov momentum for some experiments, and an adaptive gradient method, AMSGrad (Reddi et al., 2018), for other experiments.

We begin by illustrating several key points using a simple synthetic dataset generated using the function

\[ f(x) = \sin(x) + \tfrac{1}{15}\cos(19x). \tag{29} \]

This function was chosen because there are two trends: one with a large amplitude that varies slowly, and another that varies quickly and has a much smaller amplitude. A model with a reasonably small Lipschitz constant should largely ignore the high frequency signal and model only the component with low frequency and high amplitude. Because the function is a scalar function of one variable, it is also very convenient to visualise.

A training dataset was generated by randomly sampling 1,000 points from a uniform distribution, and networks were trained with a range of λ values for each of the three p-norms discussed earlier. Fully connected networks with two hidden layers, each containing 1,000 hidden units, were used. Figures 1, 2, and 3 show the result of training these networks on the synthetic training set using p = 1, p = 2, and p = ∞, respectively.
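For reference, the synthetic training set can be generated as below. The 1/15 amplitude follows Equation 29; the sampling interval shown here is an assumption on our part rather than a value taken from the experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2.0, 2.0, size=1000)        # assumed interval; 1,000 uniform samples
    y = np.sin(x) + np.cos(19.0 * x) / 15.0      # Equation 29: slow, large-amplitude trend plus fast, small-amplitude trend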
For all three norms, when λ is set to one, the function approximates the training data very poorly, failing to even capture the general trend provided by the sine function. As λ is increased, the networks approximate the function used to generate the dataset with higher accuracy. As expected, for each norm, there is a window where only the low frequency component of the function is approximated.

One might expect that, because we are dealing with a scalar function, all three norms would behave the same, as they all reduce to taking the absolute value in the single dimensional case. Despite this, we clearly observe varying levels of approximation quality when comparing the use of different norms with the same value for λ. In the case of the ℓ_2 norm, this is fairly easy to explain: we are only approximating the spectral norm of each layer using the power method, which we have already explained is liable to underestimate the true spectral norm if only a few iterations are used. We suspect the difference between the ℓ_1 and ℓ_∞ norm approaches, most noticeable when λ is two, is explained by a less obvious phenomenon. Because of the difference in the operator norms for the ℓ_1 and ℓ_∞ vector norms, the set of permissible weight matrices is different. This is something that is very likely to impact the trajectory of the weights during optimisation, and hence result in quite different solutions. Optimisation dynamics are presumably also the reason the function is poorly approximated when λ is one, even though the sine function is 1-Lipschitz (L(sin) = max|∂_x sin(x)| = max|cos(x)| = 1).

The CIFAR-10 dataset (Krizhevsky and Hinton, 2009) contains 60,000 tiny images, each belonging to one of 10 classes. We follow the common protocol of using 10,000 of the images in the 50,000 image training set for tuning model hyperparameters. Two networks are considered for this dataset: a VGG19-style network, resized to be compatible with the 32 × 32 pixel images in CIFAR-10, and a Wide Residual Network (WRN) (Zagoruyko and Komodakis, 2016).
All experiments on this dataset utilise data augmentation in the form of random crops and horizontal flips, and the image intensities were rescaled to fall into the range [−1, 1]. The learning rate used for the VGG networks was decreased by a factor of 10 after epoch 100 and epoch 120. The WRNs were trained for a total of 200 epochs using the stochastic gradient method with Nesterov's momentum. The learning rate was initialised to 0.1, and decreased by a factor of 5 at epochs 60, 120, and 160. All network weights were initialised using the method of Glorot and Bengio (2010).
We begin by comparing how well the LCC regulariser improves generalisation compared to other commonly used and related methods. Specifically, dropout, batch normalisation, and the spectral decay method of Yoshida and Miyato (2017) are considered. Dropout and batch normalisation are the two most widely used regularisation schemes, often acting as key components of state-of-the-art models (Simonyan and Zisserman, 2014; He et al., 2016; Zagoruyko and Komodakis, 2016), and the spectral decay method has a similar goal to the ℓ_2 instantiation of our method: encouraging the spectral norm of the weight matrices to be small. For this particular experiment, we consider each regulariser in isolation, and also the combination of our technique with dropout and batch normalisation. Results are given in Tables 1 and 2 for the VGG and WRN architectures, respectively. Interestingly, the performance of the VGG network varies considerably more than that of the Wide Residual Network. VGG models trained with our regularisation exhibit only a small performance gain over the unregularised and spectral decay baselines; however, when combined with batch normalisation, there is a sizeable increase in accuracy. In the case of WRNs, our method performs similarly to dropout. It is noteworthy that spectral decay performs worse than LCC-ℓ_2, as expected based on our discussion in Section 3.2.

CIFAR-100, like CIFAR-10, is a dataset of 60,000 tiny images, but with 100 classes. The same data augmentation methods used for CIFAR-10 are also used for training models on CIFAR-100: random crops and horizontal flips. We train WRNs and VGG-style networks on this dataset. We found that the learning rate schedules used in the CIFAR-10 experiments also worked well on this dataset, which is not surprising given their similarities. We did, however, optimise the regularisation hyperparameters specifically for CIFAR-100.

Figure 1: Visualisation of neural networks trained on a 1-dimensional synthetic dataset with different values of λ (λ = 1, 2, 4, 8, 16, ∞). The ℓ_1 norm was used for regularising these networks. The blue dots are the training data, and the red lines are the predictions made by each network.

Figure 2: Visualisation of neural networks trained on a 1-dimensional synthetic dataset with different values of λ (λ = 1, 2, 4, 8, 16, ∞). The ℓ_2 norm was used for regularising these networks. The blue dots are the training data, and the red lines are the predictions made by each network.

Figure 3: Visualisation of neural networks trained on a 1-dimensional synthetic dataset with different values of λ (λ = 1, 2, 4, 8, 16, ∞). The ℓ_∞ norm was used for regularising these networks. The blue dots are the training data, and the red lines are the predictions made by each network.

Table 1: Performance of VGG19 networks trained with different regularisation methods on CIFAR-10. LCC-ℓ_p denotes our Lipschitz Constant Constraint method for some given p-norm.
    Method            VGG19
    None              88.39%
    Spectral Decay    87.30%
    Batchnorm         90.45%
    Dropout           89.55%

Table 3: Performance of networks trained with different regularisation methods on CIFAR-100. LCC-ℓ_p denotes our Lipschitz Constant Constraint method for some given p-norm.

    Method            VGG19
    None              56.14%
    Spectral Decay    56.21%
    Batchnorm         64.17%
    Dropout           53.50%

The Fashion-MNIST dataset (Xiao et al., 2017) is designed as a more challenging drop-in replacement for the original MNIST dataset of hand-written digits (LeCun et al., 1998). Both contain 70,000 greyscale images labelled with one of 10 possible classes.
The last 10,000 instances are used as the test set. For optimising hyperparameters we use the last 10,000 instances of the training set to measure validation accuracy. We train a small convolutional network on both of these datasets with different combinations of regularisers. The network contains only two convolutional layers, each consisting of 5 × 5 filters. These layers feed into a fully connected layer with 128 units, which is followed by the output layer with 10 units. ReLU activations are used for all layers, and each model is trained for 60 epochs using AMSGrad.

Table 5: Results of the small convolutional network on the MNIST and Fashion-MNIST datasets.

    Method            MNIST     Fashion-MNIST
    None              99.25%    91.49%
    Spectral Decay    99.12%    91.58%
    Batchnorm         99.27%    91.30%
    Dropout           99.45%    91.61%

The Street View House Numbers dataset contains over 600,000 images of digits extracted from Google's Street View platform.
Each image contains three colour channels and has a resolution of 32 × 32 pixels.
As with the previous datasets, the only preprocessing performed is to rescale the input features to the range [−1, 1]. The learning rate was decreased by a factor of 10 at epochs 15 and 18. Results demonstrating the performance of individual regularisers, and the combination of LCC with dropout and batch normalisation, are given in Table 6. We also train small WRN models on this dataset. Once again, due to the large size of the training set, it is sufficient to only train each network for 20 epochs in total. We therefore use a compressed learning rate schedule, compared to WRNs trained on CIFAR-10 and CIFAR-100. The learning rate is started at 0.1, and is decreased by a factor of 5 at epochs 6, 12, and 16. Test set performance for each of the WRN models trained on SVHN is provided in Table 7.

In isolation, LCC provides only a small improvement in test accuracy, which is in line with what has been observed on other datasets so far. Also congruent with our other results is that the performance of LCC combined with batch normalisation on the VGG networks is noticeably higher than other approaches; in this case these networks perform comparably with the WRN models. For this dataset, the LCC methods provide very little benefit to the residual networks we have considered. Spectral decay again performs worse than LCC-ℓ_2 and actually fails to converge when using wide residual networks.

Table 6: Prediction accuracy of VGG-style networks trained on the SVHN dataset using different regularisation techniques.

    Method            VGG
    None              96.95%
    Spectral Decay    96.41%
    Batchnorm         96.90%
    Dropout           97.77%

Table 7: Prediction accuracy of WRN-16-4 networks trained on the SVHN dataset using different regularisation techniques.

    Method            WRN-16-4
    None              97.92%
    Spectral Decay    9.09%

Neural networks consisting exclusively of fully connected layers have a long history of being applied to classification problems arising in data mining scenarios. To evaluate how well the LCC regularisers work on tabular data, we have trained fully connected networks on the classification datasets collected by Geurts and Wehenkel (2005). These datasets were primarily collected from the University of California at Irvine dataset repository, and the only selection criterion used by Geurts and Wehenkel (2005) was that they contain only numeric features. Each network contained two hidden layers consisting of 100 units, and used the ReLU activation function. We performed two repetitions of 5-fold cross validation for each dataset. Hyperparameters for each regulariser were tuned on a per-fold basis using grid search.
Four candidate dropout rates were considered, and separate grids of λ values were used for the ℓ_1 and ℓ_∞ approaches and for the ℓ_2 variant. Once again, we also considered the combination of each of the regularisation methods.

Because multiple estimates of the test set accuracy are available to us, we can perform hypothesis tests to determine statistically significant differences in performance of each of the regularisers over the unregularised baseline. We use the hypothesis testing procedure outlined by Bouckaert and Frank (2004) in order to overcome the problem of overlapping testing and training sets introduced by the cross validation procedure. The mean test set accuracies on each dataset are given in Table 8, with statistically significant improvements and degradations in accuracy at the 95% confidence level indicated by the + and − symbols, respectively.

Several interesting trends can be found in this table. One particularly surprising trend is that the presence of dropout is a very good indicator of a statistically significant degradation in accuracy. Interestingly, the only exceptions to this are the two synthetic datasets, where dropout is associated with an improvement in accuracy. The combination of batch normalisation with LCC is one of the more reliable approaches to regularisation on the tabular classification datasets considered. In particular, the LCC-ℓ_∞ method combined with batch normalisation achieves the highest mean accuracy on seven of the 10 datasets. On the other three datasets, either all regularisers are ineffective, or LCC-ℓ_∞ is still present in the best combination of regularisation schemes. This provides strong evidence that LCC-ℓ_∞ is a good choice for regularisation of neural network models trained on tabular data.

These results can also be visualised using a critical difference diagram (Demšar, 2006), as shown in Figure 4. The average rank of LCC-ℓ_∞ combined with batch normalisation is 2.6, which is considerably better than that of the next best method, batch normalisation, with an average rank of 5.9. However, there is insufficient evidence to be able to state that LCC-ℓ_∞ significantly outperforms batch normalisation. It can also be seen from this diagram that LCC-ℓ_∞ with batch normalisation is statistically significantly better than most of the combinations of regularisers that include dropout.

Figure 4: A critical difference diagram showing the statistically significant (95% confidence) differences between the average rank of each method. The number beside each method is the average rank of that method across all datasets. The thick black bars overlaid on groups of thin black lines indicate a clique of methods that have not been found to be statistically significantly different.

We have proposed using the Lipschitz constant of a network to control its effective model capacity. This suggests setting λ too low will cause the network to underfit the data. Similarly, if λ is set too high, the network will overfit. Figure 5 shows how the validation accuracies of the VGG19 network trained on CIFAR-10 are impacted by varying the value of λ for the LCC-ℓ_1, LCC-ℓ_2, and LCC-ℓ_∞ regularisation methods. The plot corresponding to LCC-ℓ_2 best follows our expected trend. With λ = 2, the network underfits, relative to the best validation accuracy obtained.
Increasing λ results in improved accuracy initially, but increasing it further causes the model to begin to overfit. This trend is also apparent, to a lesser extent, in the LCC-ℓ_1 and LCC-ℓ_∞ plots.

Figure 5: Validation accuracy versus λ for the VGG19 networks trained on CIFAR-10, regularised with the LCC-ℓ_1 (top), LCC-ℓ_2 (middle), and LCC-ℓ_∞ (bottom) methods.

The results presented so far have indicated that constraining the Lipschitz constant of a network provides effective regularisation. It is interesting to consider whether other commonly used regularisation schemes are implicitly constraining the Lipschitz constant.
The performance achieved by combining other methods with the LCC regularisation approach indicates that the methods are complementary, but to further investigate this, we supply plots of the Lipschitz constant of each layer of the networks trained on CIFAR-10. These plots are given in Figure 6. When the network is trained with dropout, we scale each of the operator norms by the probability of retaining an activation, for the reasons described in Section 4.3. The network trained with batch normalisation is omitted from these plots due to excessively large variance in the per-layer Lipschitz constants: often over an order of magnitude larger than that of other methods, and an order of magnitude lower in some cases.

The most obvious characteristic of the plots in Figure 6 is that networks trained without LCC have significantly larger operator norms than those trained with LCC. Interestingly, dropout does reduce both the ℓ_1 and ℓ_∞ operator norms for nearly all layers compared to the network trained without dropout or LCC. This indicates that dropout does, to some extent, penalise layers with large operator norms, but that there is also another mechanism at play. When using LCC-ℓ_2 to regularise a network, a single iteration of the power method is sufficient to estimate the operator norm during training. However, when constructing these plots, we performed power iterations until the operator norms converged to their true values. As can be seen in the middle plot, this shows that some layers were not quite constrained to have an operator norm less than four, the value chosen for λ for this network.

The technique presented in this paper incurs only a small increase in the time required to perform each minibatch of training. The exact amount of time required to perform the projection of each weight matrix, in the case of the ℓ_1 and ℓ_∞ variants, is proportional to the number of parameters in the network. In the ℓ_2 case, the runtime is approximately the same as performing both a forward and backward propagation for a single sample. For typical network architectures, one can expect an increase in runtime of only a few percent, across all metrics.

Another potential concern is the number of weight updates required for the network loss to converge. Learning curves for VGG and WRN networks trained on CIFAR-100 are provided in Figure 7. These plots alleviate the concern that the projection operation may slow convergence, due to the truncated weight update. The rate of convergence of our method follows similar trends to the other regularisers. Something interesting to note is that it is particularly dependent on dropping the learning rate. This is most evident in the WRN chart, where the models that do not use the per-layer operator norm constraint converge towards much better test accuracies in the first 120 epochs, before the learning rate has been lowered substantially.
6. Conclusion
We have presented a simple and effective regularisation technique for deep feed-forward neural networks, shown that it is applicable to a variety of feed-forward neural network architectures, and does not depend on a particular optimiser. We have also demonstrated that our technique can be used in conjunction with both batch normalisation and dropout, often with cumulative performance gains. The investigation into the differences between the three p-norms (p ∈ {1, 2, ∞}) we considered has provided some useful information about which one might be best suited to the problem at hand. In particular, the ℓ_∞ norm was best suited to tabular data, and the ℓ_2 norm consistently showed the best performance when used as a regulariser on natural image datasets. However, given that LCC-ℓ_2 is only approximately constraining the norm, if one wants a guarantee that the Lipschitz constant of the trained network is bounded below some user specified value, then using the ℓ_1 or ℓ_∞ norm would be more appropriate.

Lastly, we anticipate that the utility of constraining the Lipschitz constant of neural networks is not limited to improving classification accuracy. There is already evidence that constraining the Lipschitz constant of the discriminator networks in GANs is useful (Arjovsky et al., 2017; Miyato et al., 2017). Given the shortcomings in these previous approaches to constraining Lipschitz constants we have outlined, one might expect improvements when training GANs that are k-Lipschitz with respect to the ℓ_1 or ℓ_∞ norms, and approximately 1-Lipschitz with respect to the ℓ_2 norm. Exploring how well our technique works with recurrent neural networks would also be of great interest, but the theoretical basis for the method presented in this paper does apply in the context of recurrent neural networks. Finally, our method assumes that all layers should have the same Lipschitz constant. This is a potentially inappropriate assumption in practice, and a more sophisticated hyperparameter tuning mechanism that allows for selecting a different value of λ for each layer could provide a further improvement to performance.

References
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Peter L Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.

Remco R Bouckaert and Eibe Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 3–12. Springer, 2004.

Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3584–3593, 2017.

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

Pierre Geurts and Louis Wehenkel. Closed-form dual perturb and combine for tree-based models. In Proceedings of the 22nd International Conference on Machine Learning, pages 233–240. ACM, 2005.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. 2017.

Behnam Neyshabur. Implicit Regularization in Deep Learning. PhD thesis, Toyota Technological Institute at Chicago, 2017.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference, 2016.