Robust Learning with Frequency Domain Regularization
Weiyu Guo, Yidong Ouyang
Information School, Central University of Finance and Economics
[email protected], [email protected]
July 8, 2020
Abstract
Convolutional neural networks have achieved remarkable performance in many computer vision tasks. However, CNNs tend to be biased toward low-frequency components: they prioritize capturing low-frequency patterns, which causes them to fail when the application scenario changes, while the existence of adversarial examples implies that they are also very sensitive to high-frequency perturbations. In this paper, we introduce a new regularization method that constrains the frequency spectra of the model's filters. Unlike band-limited training, our method assumes that the valid frequency range may be entangled across different layers rather than continuous, and it learns that range end-to-end by backpropagation. We demonstrate the effectiveness of our regularization by (1) defending against adversarial perturbations; (2) reducing the generalization gap across different architectures; and (3) improving generalization in transfer-learning scenarios without fine-tuning.
Introduction

Convolutional neural networks (CNNs) [1] have achieved remarkable performance in many computer vision tasks, e.g., object detection [2, 3, 4], semantic segmentation [5], and image captioning [6], by capturing and representing multi-level features from huge volumes of data. However, existing experiments [7, 8] demonstrate that CNNs are often fragile [9]: injecting even minute perturbations, e.g., random noise, contrast changes, or blurring, can significantly degrade model performance. That is, CNN models usually lack the ability to generalize across domains.

A variety of explanations for this vulnerability have been proposed, e.g., the limited scale of data sets, inconsistency between the distributions of real and training data, and computational constraints [10], which has resulted in a variety of coping strategies, such as data augmentation [11], adversarial training [12, 13, 14], and parameter regularization [15, 16]. Indeed, these strategies propel CNN models to encode invariant features while neglecting variable information during learning. In essence, convolutions are signal-processing operations that amplify certain frequencies of the input and attenuate others. This leads us to ask: can we prompt CNNs to "remember" invariant features by explicitly constraining certain frequency components of their convolution layers? And how can we find those frequency components for different layers? In this paper, we show that the answers to these questions lead to some surprising new perspectives on model robustness and generalization.

The low-frequency components in a training set are easier to learn than the high-frequency components, because low-frequency signals are numerous but vary little.
For a finite training set, there exists a valid frequency range: information below the lower bound is usually data-set bias, while information above the upper bound is often noise. This probably explains why a CNN with common settings always first quickly captures the dominant low-frequency components but easily over-fits when the application scenario changes. Therefore, we may be able to improve the generalization and convergence of CNN models by putting frequency-range constraints on convolution layers during learning.

The architecture of CNNs is designed to abstract information layer by layer, from low to high [17]. It is generally assumed that low layers are in charge of extracting low-frequency information, such as dots, lines, and texture, while high layers are responsible for high-frequency information, such as shapes and sketches. Intuitively, we could drive a CNN model to pinpoint the valid frequency range of a training set by imposing low-frequency constraints on the early convolution layers and high-frequency constraints on the later ones. However, due to other factors, e.g., shortcut connections [18], learning methods, and sample distribution, the valid frequency range is probably entangled across layers rather than continuous.

In this paper, we propose a novel frequency-domain regularization on convolution layers, which improves the generalization and convergence of CNN models by automatically untangling the spectra of convolution layers and navigating the model to the valid frequency range of the training set. In a nutshell, our main contributions can be summarized as follows.

• We pinpoint an extremely small but valid spectral range for different layers.

• We propose a general training approach with frequency-domain regularization on convolution layers to improve the generalization and convergence of CNN models.
Compared with data augmentation and other implicit regularization techniques, our training technique improves the transferability of the model.

• We conduct a comprehensive evaluation to investigate the effectiveness of the proposed approach and demonstrate how it raises the generalization of CNN models.
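To make the frequency-range intuition concrete, the following illustrative sketch (ours, not the authors' code) applies a binary low-pass mask to the Fourier spectrum of a noisy 2D signal; keeping only the low-frequency band recovers the smooth underlying pattern while discarding most of the high-frequency noise.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 32)
x = np.outer(np.sin(t), np.cos(t))              # smooth, low-frequency pattern
noisy = x + 0.5 * rng.standard_normal(x.shape)  # add broadband noise

F = np.fft.fft2(noisy)
keep = np.abs(np.fft.fftfreq(32)) < 0.15        # low-frequency band only
mask = np.outer(keep, keep).astype(float)       # binary 2D low-pass mask
denoised = np.fft.ifft2(F * mask).real          # mask spectrum, invert

err_noisy = np.mean((noisy - x) ** 2)           # roughly the noise variance
err_masked = np.mean((denoised - x) ** 2)       # substantially smaller here
```

This is only the hand-designed, continuous-band version of the idea; the method proposed below instead learns a per-layer binary mask by backpropagation, so the kept band need not be contiguous.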
Related work
Promoting the generalization of models is very important for deep learning. Generally, there are three branches of techniques for achieving this, i.e., data augmentation, regularization, and spectrum analysis.
Data augmentation:
Data augmentation [19, 20] is commonly used to reduce overfitting by increasing the amount of training data using information only in the training data. Simple techniques, such as cropping, rotating, and flipping, are prevalent in CNN model training and usually improve validation performance a little. However, such simple techniques cannot provide any practical defense against adversarial examples [21], which has led to an emerging direction of data augmentation, i.e., adversarial training [22, 23]. Indeed, adversarial training performs unsupervised generation of new samples using GANs [24], which can provide a large number of hard examples for training. However, recent studies [11] demonstrate that adversarial training usually improves robustness to corruptions concentrated in the high-frequency domain while reducing robustness to corruptions concentrated in the low-frequency domain.
Regularization: [25] comprehensively evaluates the performance of explicit and implicit regularization techniques, i.e., dropout [26], weight decay [27, 28], batch normalization [29], and early stopping, and concludes that although regularizers can provide marginal improvements, they do not seem to be the fundamental reason for generalization; the architecture is. All of the regularization techniques mentioned above have little effect on preventing a model from quickly fitting randomly labeled data. Sharpness and norms are other perspectives on generalization [30]. There is a tight connection between the spectral norm and Lipschitz continuity, which can be used to flatten minima and bound the generalization error [31, 14, 12]. The Jacobian penalty [32] and orthogonality of weights [33] can also be used to improve generalization. But none of these regularization techniques focus on the transferability of a model to unseen domains, nor can they explicitly pinpoint the valid range of features to help the model shield against background and noise.
Spectrum analysis:
Indeed, convolution is a common method for extracting specific spectra in signal processing. Inspired by this, there is substantial recent interest in studying the spectral properties of CNNs, with applications to model compression [34], speeding up model inference [35, 36], memory reduction [37], theoretical understanding of CNN capacity [17, 38], and, eventually, better training methodologies [11, 39, 40, 41]. The works that leverage the spectral properties of CNNs to design better training methodologies are most relevant to this paper. For example, recent studies [42, 17] find that a CNN model is usually biased toward lower Fourier frequencies, while natural images tend to have the bulk of their Fourier spectrum concentrated in the low to mid-range frequencies. Following this discovery, some works try to drop the high-frequency components from the inputs to improve the generalization of the model, e.g., spectral dropout [43] and band-limited training [44]. In practice, high-frequency components are perhaps non-robust but highly predictive [11]. Therefore, although high-frequency components contain noise, we do not simply drop them in our work; the valid spectral range requires a more in-depth discussion.
Method

In this section, we introduce our regularization method, which constrains the frequency spectra of the convolutions. An overview of our method is illustrated in Figure 1.

Figure 1: Overview of our method
Fourier transform
Given a tensor $x \in \mathbb{C}^{M \times N}$, the Fourier transform maps $x$ to the spectral domain:

$$\mathcal{F}(x)_{hw} = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x_{mn}\, e^{-2\pi i \left(\frac{mh}{M} + \frac{nw}{N}\right)}, \quad \forall h \in \{0, 1, \dots, M-1\},\ \forall w \in \{0, 1, \dots, N-1\}.$$

FFT-based convolution
The convolution theorem ensures that convolution in the spatial domain equals element-wise multiplication in the spectral domain. The main intuition of frequency analysis is that an image represented in the spatial domain is significantly redundant, whereas the spectral representation allows filters to target specific length scales and orientations [40, 45]. Faster convergence and lower computational cost are additional benefits:

$$x * y = \mathcal{F}^{-1}\left(\mathcal{F}x[\omega] \cdot \mathcal{F}y[\omega]\right), \qquad S[\omega] = \mathcal{F}x[\omega] \cdot \mathcal{F}y[\omega],$$

where $S[\omega]$ is called the spectrum of the convolution.

Mask design

Mask design is the key component of our method. The mask helps us pinpoint the valid frequency range entangled across different layers, and updating it by backpropagation is the main difference from similar work.
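As background for the spectral masking that follows, the FFT-based convolution identity $x * y = \mathcal{F}^{-1}(\mathcal{F}x[\omega] \cdot \mathcal{F}y[\omega])$ can be checked numerically. The sketch below (ours, not the authors' code) compares a direct circular convolution against the inverse transform of the element-wise product of spectra:

```python
import numpy as np

def circular_conv2d(x, y):
    # Direct circular (periodic) 2D convolution in the spatial domain.
    M, N = x.shape
    out = np.zeros((M, N))
    for h in range(M):
        for w in range(N):
            for m in range(M):
                for n in range(N):
                    out[h, w] += x[m, n] * y[(h - m) % M, (w - n) % N]
    return out

rng = np.random.default_rng(0)
x, y = rng.random((4, 4)), rng.random((4, 4))
S = np.fft.fft2(x) * np.fft.fft2(y)   # S[w] = Fx[w] . Fy[w]
spatial = circular_conv2d(x, y)
spectral = np.fft.ifft2(S).real       # F^{-1}(S) matches the spatial result
```

Note that the identity holds exactly for circular convolution; FFT-based convolution layers zero-pad the inputs to avoid wrap-around effects.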
Binarized mask
Our regularization tries to mask out the frequencies of background and noise, maintaining only the frequencies that are useful for classification. $M_c[\omega]$ is the mask that limits the spectrum $S[\omega]$.

Gradient computation and accumulation
The gradients of the mask are accumulated in real-valued variables, as shown in Algorithm 1.
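A minimal numpy sketch of this binarize-and-accumulate scheme (ours, not the authors' code; the function names are illustrative): the mask is binarized in the forward pass, while the gradient flows to the underlying real-valued mask, which is clipped after each update. The initialization (mean 0.8, variance 0.2) follows the training-from-scratch setting described in the experiments.

```python
import numpy as np

def binarize(mask, threshold=0.5):
    # M_b <- Binarize(M): hard 0/1 mask used in the forward pass.
    return (mask >= threshold).astype(float)

def forward(spectrum, mask):
    # S_b[w] <- M_b * S[w]: element-wise masking in the spectral domain.
    return binarize(mask) * spectrum

def backward(spectrum, mask, d_out):
    # The gradients are real-valued even though the forward mask is binary.
    d_mask = d_out * spectrum            # dM <- dS[w] * S[w]
    d_spectrum = d_out * binarize(mask)  # dS[w] <- dS[w] * M_b
    return d_mask, d_spectrum

rng = np.random.default_rng(0)
mask = rng.normal(0.8, np.sqrt(0.2), size=(8, 8))  # scratch-training init
spectrum = rng.normal(size=(8, 8))
out = forward(spectrum, mask)
d_mask, _ = backward(spectrum, mask, np.ones_like(out))
eta = 0.01
mask = np.clip(mask - eta * d_mask, 0.0, 1.0)  # M <- Clip(Update(M, eta, dM), 0, 1)
```

Passing the binarized mask's gradient straight through to the real-valued mask is the standard straight-through-estimator trick; whether the backward pass uses the binarized or the real-valued mask for `d_spectrum` is our assumption from the algorithm text.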
Algorithm 1
Forward and back propagation. C is the cost function for the mini-batch, L is the number of layers, and ∗ indicates element-wise multiplication. The function Binarize() specifies how to binarize the masks, and Clip() how to clip them.
Require: a minibatch of inputs and targets (x, y).
Ensure: updated masks M_{t+1} and weights W_{t+1}.

1. Computing the mask gradients:
  1.1. Forward propagation:
  for k = 1 to L do
    M_k^b ← Binarize(M_k)
    S[ω]_k^b ← M_k^b ∗ S[ω]_k
  end for
  1.2. Backward propagation (note that the gradients are not binary):
  get dS[ω] by automatic differentiation
  for k = L to 1 do
    dM_k ← dS[ω]_k ∗ S[ω]_k
    dS[ω]_k ← dS[ω]_k ∗ M_k^b
    compute dx and dW_k
  end for
2. Accumulating the parameter gradients:
  for k = 1 to L do
    W_{t+1}^k ← Update(W_k, η, dW_k^t)
    M_{t+1}^k ← Clip(Update(M_k, η, dM_k^t), 0, 1)
  end for
Experiments

We demonstrate the effectiveness of our regularization method on various datasets and architectures and compare it with several state-of-the-art methods. We explore its behavior in detail to illustrate the properties of our approach.

Experimental settings
Datasets
We conduct our experiments with Cifar10 [46], which contains 10 classes, with 50,000 images for training and 10,000 images for testing.
Baseline training
All models use SGD with momentum set to 0.9. For the Cifar dataset, the learning rate is set to 0.01; for the Imagenet dataset, it is set to 0.1. If weight decay and dropout are used, weight decay is set to − and the keep-prob is set to 0.9. When training from scratch, the mask is initialized with random numbers from a normal distribution with mean 0.8 and variance 0.2. This means we do not drop any frequency at the beginning; as the model learns, the accumulated gradients of the mask pinpoint the valid frequency range. For fine-tuning, the mask is initialized with random numbers from a normal distribution with mean 0.6 and variance 0.1 to accelerate the learning of the mask.

Figure 2: Train accuracy, test accuracy, and the train-test gap when training from scratch and when fine-tuning, for the original model and our method.

It was observed that Gaussian data augmentation and adversarial training improve robustness to all noise and many of the blurring corruptions, while degrading robustness to fog and contrast [11].
Our method achieves better results against fog, contrast, and impulse noise, which shows that it alleviates the low-frequency brittleness caused by adversarial training. Given the amount of activation suppressed in Table 3 and the CAM illustration in Figure 3, our method uses the valid frequency range to capture the frequencies most important for classification.

Table 1: Summary of test accuracy on the Cifar dataset for the LeNet architecture. For dropout, DropPath, and SpatialDropout, we trained models with the best keep_prob values reported by [47].

Lenet baseline: 58.48
Lenet + normalization: 60.41
Lenet + normalization + random crop + data augmentation: 75.06
Lenet + normalization + random crop + data augmentation + weight decay: 76.23
Lenet + our method + weight decay: 66.7
Lenet + our method + normalization + weight decay: 68.3
Lenet + our method + random crop: 74.0
Lenet + our method + data augmentation: 69.8
Lenet + our method + random drop(0.2): 62.1

Table 2: Comparison between a naturally trained model, Gaussian data augmentation, adversarial training, and our method on clean images and Cifar10-C for the resnet-20 architecture (clean / impulse_noise / fog / contrast).

Natural: 93.5 / 50.436 / 85.14 / 70.858
Gauss:
Adversarial:
Our method: 94.06 / 57.344 / 86.752 / 73.432
Using a binarized mask in the spatial domain has side effects such as generating boundary artifacts. In the spectral domain, however, this kind of side effect is not obvious, so the mask can be seen as a novel way of denoising. Benefiting from the property that the spectral domain represents features in an invariant and sparse way, our method can suppress wrong activations in the spectral domain through the binarized mask.
Our method pinpoints the valid frequency range for the training set. What if we randomly drop some frequencies after applying our method: would the model learn redundant features, rely less heavily on those frequencies, and perform better? The last line of Table 1 shows that this is not a better choice, which verifies from another perspective that our method pinpoints the valid frequency range.

Table 3: The percentage of frequencies each convolution layer masks for the resnet-20 architecture.

layer1conv1.1: 0.6938
layer1conv1.2: 0.5810
layer1conv2.1: 0.5213
layer1conv2.2: 0.4889
layer2conv1.1: 0.3354
layer2conv1.2: 0.3696
layer2conv2.1: 0.3124
layer2conv2.2: 0.2644
layer3conv1.1: 0.2082
layer3conv1.2: 0.2621
layer3conv2.1: 0.2453
layer3conv2.2: 0.1477
layer4conv1.1: 0.0810
layer4conv1.2: 0.0455
layer4conv2.1: 0.0337
layer4conv2.2: 0.0175
If we trained a spectral-domain mask for each class, it might achieve better performance and transfer implicitly within the same category across different datasets. However, at test time we would need to determine the category of an image before applying such an inter-class mask, so we may need to change the architecture of the model; this is left to future work.
Conclusion

We have proposed a novel regularization method that explicitly removes unimportant frequencies at training time. 1) We pinpoint the valid frequency range entangled across different layers. 2) We demonstrate that a model trained with our regularization is more robust on unseen data.

Band-limited training [44] and spectral dropout [43] also impose restrictions in the spectral domain. Our method differs from them in two aspects. 1) Compared with energy-based compression techniques, our method does not drop high-frequency components indiscriminately; our goal is not to minimize the approximation error between the masked input and filters and the unmasked ones, but to find the frequencies most important for classification and force the model to shield against background and noise. 2) We do not use a keep-percentage hyperparameter to determine the masking threshold; our method uses backpropagation to learn the mask, so it can be used end-to-end.

Figure 3: Class activation mapping (CAM) [48] for the resnet-18 model.

Compared with self-supervised learning strategies [49, 50, 51], our method does not require a complex architecture. We try to leverage the transferability of frequencies to address the transferability of models and domain adaptation.
References

[1] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.
[2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[3] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[4] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[6] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5561–5570, 2018.
[7] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
[8] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. Pages 1–7. IEEE, 2017.
[9] Douglas Heaven. Why deep-learning AIs are so easy to fool. Nature, 574(7777):163, 2019.
[10] Sébastien Bubeck, Eric Price, and Ilya P. Razenshteyn. Adversarial examples from computational constraints. In ICML, 2018.
[11] Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A Fourier perspective on model robustness in computer vision. In NeurIPS, 2019.
[12] Farzan Farnia, Jesse M. Zhang, and David Tse. Generalizable adversarial training via spectral normalization. ArXiv, abs/1811.07457, 2018.
[13] Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.
[14] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ArXiv, abs/1802.05957, 2018.
[15] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[16] Farzan Farnia, Jesse M. Zhang, and David Tse. A spectral approach to generalization and optimization in neural networks. 2018.
[17] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Dräxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron C. Courville. On the spectral bias of neural networks. In ICML, 2018.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Pages 770–778, 2015.
[19] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A simple data processing method to improve robustness and uncertainty. Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[20] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. ArXiv, abs/1805.09501, 2018.
[21] Ekin D. Cubuk, Barret Zoph, Samuel S. Schoenholz, and Quoc V. Le. Intriguing properties of adversarial examples. arXiv preprint arXiv:1711.02846, 2017.
[22] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[23] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017.
[24] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[25] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[27] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.
[28] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.
[29] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[30] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. In NIPS, 2017.
[31] Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, 2017.
[32] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust learning with Jacobian regularization. ArXiv, abs/1908.02729, 2019.
[33] Aaditya Prakash, James A. Storer, Dinei A. F. Florêncio, and Cha Zhang. RePr: Improved training of convolutional filters. Pages 10658–10667, 2018.
[34] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing convolutional neural networks in the frequency domain. In KDD '16, 2016.
[35] Michaël Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. CoRR, abs/1312.5851, 2013.
[36] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR, abs/1412.7580, 2014.
[37] Bochen Guan, Jinnian Zhang, William A. Sethares, Richard Kijowski, and Fang Liu. SpecNet: Spectral domain convolutional neural network. ArXiv, abs/1905.10915, 2019.
[38] Yusuke Tsuzuku and Issei Sato. On the structural sensitivity of deep convolutional networks to the directions of Fourier basis functions. Pages 51–60, 2018.
[39] Shin Fujieda, Kohei Takayama, and Toshiya Hachisuka. Wavelet convolutional neural networks. ArXiv, abs/1805.08620, 2018.
[40] Oren Rippel, Jasper Snoek, and Ryan P. Adams. Spectral representations for convolutional neural networks. ArXiv, abs/1506.03767, 2015.
[41] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. Multi-level wavelet-CNN for image restoration. Pages 886–88609, 2018.
[42] Jason Jo and Yoshua Bengio. Measuring the tendency of CNNs to learn surface statistical regularities. ArXiv, abs/1711.11561, 2017.
[43] Salman H. Khan, Munawar Hayat, and Fatih Murat Porikli. Regularization of deep neural networks with spectral dropout. Neural Networks, 110:82–90, 2017.
[44] Adam Dziedzic, John Paparrizos, Sanjay Krishnan, Aaron Elmore, and Michael Franklin. Band-limited training and inference for convolutional neural networks. In International Conference on Machine Learning, pages 1745–1754, 2019.
[45] David J. Field. Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 1987.
[46] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[47] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V. Le. DropBlock: A regularization method for convolutional networks. ArXiv, abs/1810.12890, 2018.
[48] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. Pages 2921–2929, 2016.
[49] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. ArXiv, abs/1912.01991, 2019.
[50] Fei Pan, Inkyu Shin, François Rameau, Seokju Lee, and In So Kweon. Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. ArXiv, abs/2004.07703, 2020.
[51] Changdong Xu, Xingran Zhao, Xin Jin, and Xiu-Shen Wei. Exploring categorical regularization for domain adaptive object detection.