DeepCorrect: Correcting DNN models against Image Distortions
Tejas Borkar, Student Member, IEEE, and Lina Karam, Fellow, IEEE
Abstract—In recent years, the widespread use of deep neural networks (DNNs) has facilitated great improvements in performance for computer vision tasks like image classification and object recognition. In most realistic computer vision applications, an input image undergoes some form of image distortion, such as blur or additive noise, during image acquisition or transmission. Deep networks trained on pristine images perform poorly when tested on such distortions. In this paper, we evaluate the effect of image distortions like Gaussian blur and additive noise on the activations of pre-trained convolutional filters. We propose a metric to identify the most noise-susceptible convolutional filters and rank them in order of the highest gain in classification accuracy upon correction. In our proposed approach, called DeepCorrect, we apply small stacks of convolutional layers with residual connections at the output of these ranked filters and train them to correct the worst distortion-affected filter activations, whilst leaving the rest of the pre-trained filter outputs in the network unchanged. Performance results show that applying DeepCorrect models for common vision tasks like image classification (ImageNet), object recognition (Caltech-101, Caltech-256) and scene classification (SUN-397) significantly improves the robustness of DNNs against distorted images and outperforms other alternative approaches.

Index Terms—deep neural networks, image distortion, image classification, residual learning, image denoising, image deblurring.
I. INTRODUCTION

Today, state-of-the-art algorithms for computer vision tasks like image classification, object recognition and semantic segmentation employ some form of deep neural networks (DNNs). The ease of design for such networks, afforded by numerous open source deep learning libraries [1], [2], has established DNNs as the go-to solution for many computer vision applications. Even challenging computer vision tasks like image classification [3], [4], [5], [6] and object recognition [7], [8], [9], which were previously considered to be extremely difficult, have seen great improvements in their state-of-the-art results due to the use of DNNs. An important factor contributing to the success of such deep architectures in computer vision tasks is the availability of large scale annotated datasets [10], [11].

The visual quality of input images is an aspect very often overlooked while designing DNN based computer vision systems. In most realistic computer vision applications, an input image undergoes some form of image distortion, including blur and additive noise, during image acquisition, transmission or storage. However, most popular large scale
(T. Borkar and L. Karam are with the Department of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, 85281 USA. E-mail: {tsborkar, karam}@asu.edu.)

datasets do not have images with such artifacts. Dodge and Karam [12] showed that even though such image distortions do not represent adversarial samples for a DNN, they do cause a considerable degradation in classification performance. Fig. 1 shows the effect of image quality on the prediction performance of a DNN trained on high quality images devoid of distortions.

Testing distorted images with a pre-trained DNN model for AlexNet [6], we observe that adding even a small amount of distortion to the original image results in a misclassification, even though the added distortion does not hinder the human ability to classify the same images (Fig. 1). In the cases where the predicted label for a distorted image is correct, the prediction confidence drops significantly as the distortion severity increases. This indicates that features learnt from a dataset of high quality images are not invariant to image distortion or noise and cannot be directly used for applications where the quality of images is different than that of the training images. Some issues to consider while deploying DNNs in noise/distortion affected environments include the following. For a network trained on undistorted images, are all convolutional filters in the network equally susceptible to noise or blur in the input image? Are networks able to learn some filters that are invariant to input distortions, even when such distortions are absent from the training set?
Is it possible to identify and rank the convolutional filters that are most susceptible to image distortions and recover the lost performance by only correcting the outputs of such ranked filters?

In our proposed approach, called DeepCorrect, we try to address these aforementioned questions through the following novel contributions:
• Evaluating the effect of common image distortions like Gaussian blur and Additive White Gaussian Noise (AWGN) on the outputs of pre-trained convolutional filters. We find that for every layer of convolutional filters in the DNN, certain filters are more susceptible to input distortions than others and that correcting the activations of these filters can help recover lost performance.
• Measuring the susceptibility of convolutional filters to input distortions and ranking filters in the order of highest susceptibility to input distortion.
• Correcting the worst distortion-affected filter activations by appending correction units, which are small blocks of stacked convolutional layers trained using a target-oriented loss, at the output of select filters, whilst leaving the rest of the pre-trained filter outputs in a DNN unchanged.
• Representing full-rank convolutions in our DeepCorrect models with rank-constrained approximations consisting of a sequence of separable convolutions with rectangular kernels to mitigate the additional computational cost introduced by our correction units. This results in pruned DeepCorrect models that have almost the same computational cost as the respective pre-trained DNN during inference.

[Fig. 1. Effect of image quality on DNN predictions, with predicted label and confidence generated by a pre-trained AlexNet [6] model. Distortion severity increases from left to right, with the left-most image in a row having no distortion (original). Bold green text indicates correct classification, while italic red text denotes misclassification. (a) Examples from the ILSVRC2012 [10] validation set distorted by Gaussian blur; predicted labels/confidences include goldfish 0.977 (original), goldfish 0.873, 0.398, 0.100 (blur σ = 2, 4, 6), and gila monster 0.234, rock python 0.034, velvet 0.016, velvet 0.010. (b) Examples from the ILSVRC2012 validation set distorted by Additive White Gaussian Noise (AWGN); predicted labels/confidences include goldfish 0.977 (original), goldfish 0.254 (σ = 20), starfish 0.054 (σ = 60), starfish 0.027 (σ = 100), and gila monster 0.234, sidewinder 0.165, prayer rug 0.032, stole 0.019.]

Applying our DeepCorrect models for common vision tasks like image classification [10], object recognition [13], [14] and scene classification [15] significantly improves the robustness of DNNs against distorted images and also outperforms other alternative approaches, while training significantly fewer parameters. To ensure reproducibility of the presented results, the code for
DeepCorrect is made publicly available at https://github.com/tsborkar/DeepCorrect.

The remainder of the paper is organized as follows. Section II provides an overview of the related work in assessing and improving the robustness of DNNs to input image perturbations. Section III describes the distortions, network architectures and datasets we use for analyzing the distortion susceptibility of convolutional filters in a DNN. A detailed description of our proposed approach is presented in Section IV, followed, in Section V, by extensive experimental validation with different DNN architectures and multiple datasets covering image classification, object recognition and scene classification. Concluding remarks are given in Section VI.

II. RELATED WORK
DNN susceptibility to specific small magnitude perturbations which are imperceptible to humans but cause networks to make erroneous predictions with high confidence (adversarial samples) has been studied in [16] and [17]. The concept of rubbish samples proposed in [18] studies the vulnerability of DNNs to make arbitrary high confidence predictions for random noise images that are completely unrecognizable to humans, i.e., the images contain random noise and no actual object. However, both adversarial samples and rubbish samples are encountered far less often in common computer vision applications than other common distortions due to image acquisition, storage, transmission and reproduction.

Karam and Zhu [19] present QLFW, a face matching dataset consisting of images with five types of quality distortions. Basu et al. [20] present the n-MNIST dataset, which adds Gaussian noise, motion blur and reduced contrast to the original images of the MNIST dataset. Dodge and Karam [12] evaluate the impact of a variety of quality distortions such as Gaussian blur, AWGN and JPEG compression on various state-of-the-art DNNs and report a substantial drop in classification accuracy on the ImageNet (ILSVRC2012) dataset in the presence of blur and noise. A similar evaluation for the task of face recognition is presented in [21].

Rodner et al. [22] assess the sensitivity of various DNNs to image distortions like translation, AWGN and salt & pepper noise, for the task of fine grained categorization on the CUB-200-2011 [23] and Oxford flowers [24] datasets, by proposing a first-order Taylor series based gradient approximation that measures the expected change in final layer outputs for small perturbations to the input image. Since a gradient approximation assumes small perturbations, Rodner et al.'s sensitivity measure does not work well for higher levels of distortion, as shown in [22], and does not assess susceptibility at a filter level within a DNN. Furthermore, Rodner et al. do not present a solution for making the network more robust to input distortions; instead, they simply fine-tune the whole network with the distorted images added as part of data augmentation during training. Retraining large networks such as VGG16 [3] or ResNet-50 [5] on large-scale datasets is computationally expensive. Unlike Rodner et al.'s work, our proposed ranking measure assesses the sensitivity of individual convolutional filters in a DNN, is not limited to differentiable distortion processes, and holds good for both small and large perturbations.

(All figures in this paper are best viewed in color.)

Zheng et al. [25] improve the general robustness of DNNs to unseen small perturbations in the input image through the introduction of distortion agnostic stability training, which minimizes the KL-divergence between DNN output predictions for a clean image and a noise perturbed version of the same image. The perturbed images are generated by adding uncorrelated Gaussian noise to the original image. Stability training provides improved DNN robustness against JPEG compression, image scaling and cropping. However, the top-1 accuracy of stability trained models is lower than that of the original model when tested on most distortions, including motion blur, defocus blur and additive noise, among others [26]. Sun et al. [26] also propose a distortion agnostic approach to improve DNN robustness by introducing three feature quantization operations, i.e., a floor function with adaptive resolution, a power function with learnable exponents and a power function with data dependent exponents, that act on the convolutional filter activations before the ReLU non-linearity is applied. However, similar to stability training, the performance of feature quantized models is lower than that of the original model when tested on distortions like defocus blur and additive noise. Additionally, no single feature quantization function consistently outperforms the other two for all types of distortion. Although both distortion agnostic methods [25], [26] improve DNN robustness, their top-1 accuracy is much lower than the accuracy of the original DNN on distortion free images, making it difficult to deploy these models.

A non-blind approach to improve the resilience of networks trained on high quality images would be to retrain the network parameters (fine-tune) on images with observed distortion types. Vasiljevic et al. [27] study the effect of various types of blur on the performance of DNNs and show that DNN performance for the tasks of classification and segmentation drops in the presence of blur. Vasiljevic et al. [27] and Zhou et al. [28] show that fine-tuning a DNN on a dataset comprised of both distorted and undistorted images helps to recover part of the lost performance when the degree of distortion is low. Diamond et al. [29] propose a joint denoising, deblurring and classification pipeline.
This involves an image preprocessing stage that denoises and deblurs the image in a manner that preserves image features optimal for classification rather than aesthetic appearance. The classification stage has to be fine-tuned using distorted and clean images, while the denoising and deblurring stages assume a priori knowledge of camera parameters and the blur kernel, which may not be available at the time of testing.

III. EXPERIMENTAL SETUP
Here, we describe the various image distortions and network architectures used to evaluate the susceptibility of individual convolutional filters to input distortions. We use the ILSVRC-2012 validation set [10] for our experiments.
A. Distortions
We first focus on evaluating two important and conflicting types of image distortions, Gaussian blur and AWGN, over 6 levels of distortion severity. Gaussian blur, often encountered during image acquisition and compression [30], represents a distortion that eliminates high frequency discriminative object features like edges and contours, whereas AWGN is commonly used to model additive noise encountered during image acquisition and transmission. We use a noise standard deviation σ_n ∈ {10, 20, 40, 60, 80, 100} for AWGN and a blur standard deviation σ_b ∈ {1, 2, 3, 4, 5, 6} for Gaussian blur. The size of the blur kernel is set to 4 times the value of σ_b.
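These distortion settings can be sketched as below. The kernel side length of 4σ_b + 1 (odd, for a well-centred kernel) is our assumption; the paper states only that the kernel size is 4 times σ_b.

```python
import numpy as np

def gaussian_kernel(sigma_b):
    """2-D Gaussian blur kernel; side length 4*sigma_b + 1 (odd size is our
    assumption -- the paper only states the size is 4x sigma_b)."""
    size = 4 * sigma_b + 1
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_b ** 2))
    return k / k.sum()  # normalise so blurring preserves mean intensity

def add_awgn(img, sigma_n, rng):
    """AWGN with standard deviation sigma_n, on the 0-255 pixel scale."""
    return img + rng.normal(0.0, sigma_n, img.shape)

# the 6 severity levels used in the paper
BLUR_LEVELS = [1, 2, 3, 4, 5, 6]
AWGN_LEVELS = [10, 20, 40, 60, 80, 100]
```

The kernel would be applied by 2-D convolution with the input image; any standard convolution routine can be used.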
B. Network Architectures

[Fig. 2. Network architectures for our baseline models. Convolutional layers are parameterized by k×k-conv-d-s, where k×k is the spatial extent of the filter, d is the number of output filters in a layer and s represents the filter stride. Maxpooling layers are parameterized as k×k-maxpool-s, where s is the spatial stride. (a) AlexNet DNN [6]. (b) ResNet18 [5].]

We use two different network architectures as shown in Fig. 2, specifically: a shallow 8-layered DNN (AlexNet) and a deeper 18-layered network with residual connections (ResNet18) [5].

(We use the code and weights for the AlexNet DNN available online at https://github.com/heuritech/convnets-keras. Every convolutional layer is followed by a ReLU non-linearity for the AlexNet DNN; in addition, the first and second convolutional layers of the AlexNet DNN are also followed by a local response normalization operation [6]. Every convolutional layer is followed by a batch normalization operation and ReLU non-linearity for the ResNet18 model. Skip connections and residual feature maps are combined through an element-wise addition. Dashed-line skip connections perform an identity mapping using a stride of 2 to reduce feature map size and pad zero entries along the channel axis to increase dimensionality [5].)

We use the term "pre-trained" or "baseline" network to refer to any network that is trained on undistorted images.

TABLE I
TOP-1 ACCURACY OF PRE-TRAINED NETWORKS FOR DISTORTION AFFECTED IMAGES AS WELL AS UNDISTORTED IMAGES (ORIGINAL). FOR GAUSSIAN BLUR AND AWGN, ACCURACY IS REPORTED BY AVERAGING OVER ALL LEVELS OF DISTORTION.

Models    | Original | Gaussian blur | AWGN
AlexNet   | 0.5694   | 0.2305        | 0.2375
ResNet18  | 0.6912   | 0.3841        | 0.3255

IV. DEEPCORRECT
Although pre-trained networks perform poorly on test images with significantly different image statistics than those used to train these networks (Table I), it is not obvious if only some convolutional filters in a network layer are responsible for most of the observed performance gap or if all convolutional filters in a layer contribute more or less equally to the performance degradation. If only a subset of the filters in a layer is responsible for most of the lost performance, we can focus on restoring only the most severely affected activations and avoid modifying all the remaining filter activations in a DNN.
A. Ranking Filters through Correction Priority
We denote the output of a single convolutional filter φ_i,j for the input x_i by φ_i,j(x_i), where i and j correspond to the layer number and filter number, respectively. If g_i(·) is a transformation that models the distortion acting on the filter input x_i, then the output of the convolutional filter φ_i,j for the distortion affected input is given by φ̃_i,j(x_i) = φ_i,j(g_i(x_i)). It should be noted that φ̃_i,j(x_i) represents the filter activations generated by distorted inputs and φ_i,j(x_i) represents the filter activations for undistorted inputs. Assuming we have access to φ_i,j(x_i) for a given set of input images, replacing φ̃_i,j(x_i) with φ_i,j(x_i) in a deep network is akin to perfectly correcting the activations of the convolutional filter φ_i,j against input image distortions. Computing the output predictions by swapping a distortion affected filter output with its corresponding clean output, for each of the ranked filters, would improve classification performance. The extent of improvement in performance is indicative of the susceptibility of a particular convolutional filter to input distortion and its contribution to the associated performance degradation.

We now define the correction priority of a convolutional filter φ_i,j as the improvement in DNN performance on a validation set, generated by replacing φ̃_i,j(x_i) with φ_i,j(x_i) for a pre-trained network. Let the baseline performance (computed over distorted images) for a network be p_b, which can be obtained by computing the average top-1 accuracy of the network over a set of images or another task-specific performance measure. Let p_swp(i, j) denote the new improved performance of the network after swapping φ̃_i,j(x_i) with φ_i,j(x_i). As our implementation focuses on classification tasks, the average top-1 accuracy over a set of distorted images is used to measure p_b and p_swp(i, j). The correction priority for filter φ_i,j is then given by:

τ(i, j) = p_swp(i, j) − p_b    (1)

Algorithm 1
Computing Correction Priority
Input: (x_1,i, g_i(x_1,i), y_1), ..., (x_M,i, g_i(x_M,i), y_M) are given triplets with 1 ≤ i ≤ L, where i represents the layer number, x_m,i is the m-th undistorted input for layer i and g_i(x_m,i) is the corresponding distorted version, M is the total number of images in the validation set and y_m is the ground-truth label for the m-th input image.
Output: Correction priority τ

p_b := 0
for m = 1 to M do
    Predict class label ypred_m for distorted image g_1(x_m,1)
    p_b := p_b + (1/M) h(y_m, ypred_m), where h(y_m, ypred_m) = 1 if y_m = ypred_m and 0 otherwise
end for
for i = 1 to L do
    N_i ← number of filters in layer i
    φ_i,j ← j-th convolutional filter in the i-th layer
    for j = 1 to N_i do
        p_swp(j) := 0
        for m = 1 to M do
            φ_i,j(g_i(x_m,i)) ← φ_i,j(x_m,i)
            Predict class label ypred_m
            p_swp(j) := p_swp(j) + (1/M) h(y_m, ypred_m), where h(y_m, ypred_m) = 1 if y_m = ypred_m and 0 otherwise
        end for
        τ(i, j) ← p_swp(j) − p_b
    end for
end for
return τ

A higher τ(i, j) indicates higher susceptibility of the convolutional filter φ_i,j to input distortion. Using the proposed ranking measure in Equation (1) and 5000 images (i.e., 5 images per class) randomly sampled from the ILSVRC-2012 training set, we compute correction priorities for every convolutional filter in the network and rank the filters in descending order of correction priority. A detailed overview and pseudo-code for computing correction priorities is summarized in Algorithm 1.

We evaluate in Fig. 3 the effect of correcting different percentages, β_i, of the ranked filter activations in the i-th DNN layer of AlexNet for distortion affected images. For the AlexNet model, it is possible to recover a significant amount of the lost performance by correcting only 50% of the filter activations in any one layer, which indicates that a select subset of convolutional filters in each layer is indeed more susceptible to distortions than the rest. Although we show graphs for the AlexNet model, from our experiments we make similar observations for the ResNet18 model as well.

Convolutional filter visualizations from the first layer of the pre-trained AlexNet model (Fig. 4a) reveal two types of filter kernels: 1) mostly color agnostic, frequency- and orientation-selective filters that capture edges and object contours, and 2) color specific blob shaped filters that are sensitive to specific color combinations. Figs. 4b and 4c visualize the top 50% of filters in the first convolutional layer of AlexNet that are most susceptible to Gaussian blur and AWGN, respectively, as identified by our proposed ranking metric. The filters identified as most susceptible to Gaussian blur are mainly frequency- and orientation-selective filters, most of which are color agnostic, while the filters most susceptible to AWGN are a mix of both color specific blobs and frequency- and orientation-selective filters.
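The swap-and-rank procedure of Algorithm 1 can be sketched for a single layer as below. This is a toy stand-in, not the authors' code: `head` is an illustrative placeholder for the frozen downstream network mapping activations to class scores.

```python
import numpy as np

def correction_priority(clean_acts, dist_acts, head, labels):
    """tau(j) = p_swp(j) - p_b for each filter j in one layer (Eq. 1).
    clean_acts / dist_acts: (M, N) activations of N filters for M validation
    images; head: callable mapping activations to class scores (stand-in for
    the rest of the frozen network); labels: (M,) ground-truth class indices."""
    def top1(acts):
        return np.mean(np.argmax(head(acts), axis=1) == labels)

    p_b = top1(dist_acts)                 # baseline accuracy on distorted inputs
    tau = np.empty(dist_acts.shape[1])
    for j in range(dist_acts.shape[1]):
        swapped = dist_acts.copy()
        swapped[:, j] = clean_acts[:, j]  # perfectly correct filter j only
        tau[j] = top1(swapped) - p_b      # accuracy gained by correcting j
    return tau                            # rank filters by descending tau
```

Sorting the filter indices by descending `tau` gives the ranked set R_i used in Section IV-B.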
[Fig. 3. Effect of varying the percentage of corrected filter activations β_i ∈ {0.10, 0.25, 0.50, 0.75, 0.90} in the i-th convolutional layer (conv i) of pre-trained AlexNet, on top-1 accuracy for AWGN and Gaussian blur affected images, respectively; one panel per convolutional layer (conv 1 to conv 5) for each distortion type.]

[Fig. 4. (a) 96 convolutional filter kernels of size 11×11×3 in the first convolutional layer of pre-trained AlexNet. (b) Convolutional filter kernels most susceptible to Gaussian blur (top 50%), as identified by our proposed ranking metric. (c) Convolutional filter kernels most susceptible to AWGN (top 50%), as identified by our proposed ranking metric. In (b) and (c), the filters are sorted in descending order of susceptibility going row-wise from top left to bottom right.]

This is in line with our intuitive understanding that Gaussian blur mainly affects edges and object contours and not object color, while AWGN affects color as well as object contours.
B. Correcting Ranked Filter Outputs
Here, we propose a novel approach, which we refer to as
DeepCorrect, where we learn a task-driven corrective transform that acts as a distortion masker for the convolutional filters that are most susceptible to input distortion, while leaving all the other pre-trained filter outputs in the layer unchanged. Let R_i represent a set consisting of the N_i ranked filter indices in the i-th layer of the network, computed using the procedure in Section IV-A. Also let R_i,β_i represent a subset of R_i consisting of the top β_i N_i ranked filter indices in network layer i, where N_i is the total number of convolutional filters in layer i and β_i is the percentage of filters corrected in layer i, as defined in Section IV-A. If Φ_i represents the set of convolutional filters in the i-th layer, the objective is to learn a transform F_corr_i(·) such that:

F_corr_i(Φ_{R_i,β_i}(g_i(x_i))) ≈ Φ_{R_i,β_i}(x_i)    (2)

where x_i is the undistorted input to the i-th layer of convolutional filters and g_i(·) is a transformation that models the distortion acting on x_i. Since we do not assume any specific form for the image distortion process, we let the corrective transform F_corr_i(·) take the form of a shallow residual block, which is a small stack of convolutional layers (4 layers) with a single skip connection [5], such as the one shown in Fig. 5. We refer to such a residual block as a correction unit. F_corr_i(·) can now be estimated using a target-oriented loss, such as the one used to train the original network, through backpropagation [31], but with far fewer parameters.

Consider an L layered DNN Φ that has been pre-trained for an image classification task using clean images. Φ can be interpreted as a function that maps network input x to an output vector Φ(x) ∈ R^d, such that:

Φ = Φ_L ∘ Φ_{L−1} ∘ ... ∘ Φ_2 ∘ Φ_1    (3)

where Φ_i is the mapping function (set of convolutional filters) representing the i-th DNN layer and d is the dimensionality of the network output.
[Fig. 5. Correction unit based on a residual function [5], acting on the outputs of β_i N_i (0 < β_i < 1) filters out of N_i total filters in the i-th convolutional layer of a pre-trained DNN. The β_i N_i pre-trained filters' outputs pass through a stack of Batch Norm + ReLU + convolution layers (two k×k-conv-D_i-1 layers followed by a 1×1-conv-β_i N_i-1 layer), and a skip connection adds the unit's input to the result to produce the β_i N_i corrected filters' outputs. All convolutional layers in the residual block, except the first and last layer, are parameterized by k×k-conv-D_i-s, where k×k is the spatial extent of the filter, D_i (correction unit kernel depth) is the number of output filters in a layer, s represents the filter stride and i represents the layer number of the convolutional layer being corrected in the pre-trained DNN.]

Without loss of generality, if we add a correction unit that acts on the top β_1 N_1 ranked filters in the first network layer, then the resultant network Φ_corr is given by:

Φ_corr = Φ_L ∘ Φ_{L−1} ∘ ... ∘ Φ_2 ∘ Φ_corr_1    (4)

where Φ_corr_1 represents the new mapping function for the first layer, in which the corrective transform F_corr_1(·) acts on the activations of the filter subset Φ_{R_1,β_1} and all the remaining filter activations are left unchanged. If W_1 represents the trainable parameters in F_corr_1, then F_corr_1 can be estimated by minimizing:

E(W_1) = λR(W_1) + (1/M) Σ_{m=1}^{M} L(y_m, Φ_corr(x_m))    (5)

where λ is a constant, R is a regularizer such as the ℓ1 or ℓ2 norm, L is a standard cross-entropy classification loss, y_m is the target output label for the m-th input image x_m, M represents the total number of images in the training set and, since we train on a collection of both distorted and clean images, x_m represents a clean or a distorted image. The trainable parameters in Equation (5) are W_1, while all other network parameters are fixed and kept the same as those in the pre-trained models. Although Equation (5) shows correction unit estimation for only the first layer, it is possible to add such units at the output of distortion susceptible filters in any layer and in one or more layers. Fig. 6 shows an example of a DeepCorrect model for the pre-trained AlexNet model in Fig. 2a.
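As a toy illustration of the correction target in Eq. (2), one can fit a linear corrective transform to distorted activations by least squares. This is not the authors' method (they use a 4-layer residual block trained with the cross-entropy loss of Eq. (5)); it only shows what a corrective transform is asked to do.

```python
import numpy as np

def fit_linear_corrector(dist_acts, clean_acts):
    """Least-squares stand-in for F_corr in Eq. (2): find A, b such that
    dist_acts @ A + b ~ clean_acts. Rows are images, columns are the
    beta_i * N_i corrected filters."""
    M = dist_acts.shape[0]
    X = np.hstack([dist_acts, np.ones((M, 1))])       # append bias column
    W, *_ = np.linalg.lstsq(X, clean_acts, rcond=None)
    A, b = W[:-1], W[-1]
    return lambda acts: acts @ A + b                  # the learned corrector
```

In the paper the corrector is instead estimated through backpropagation with the task loss, so it preserves the features that matter for classification rather than matching clean activations exactly.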
(The correction unit is applied before the ReLU non-linearity acting upon the distortion-susceptible convolutional filter outputs. Max pooling layers following convolutional layers 1, 2 and 5 in the pre-trained AlexNet have not been shown in the ImageNet DeepCorrect model for uniformity.)
[Fig. 6. DeepCorrect model for AlexNet, with 75% of filter outputs corrected in the first two layers and 50% of filter outputs corrected in the next three layers. Correction units (containing, e.g., 3×3-conv-256-1 and 1×1-conv-128-1 layers) act on the corrected filter outputs and are concatenated with the unchanged filter outputs, followed by the original 4096-fc, 4096-fc, 1000-fc and 1000-softmax layers. Convolution layers from the original architecture in Fig. 2, shown in gray with dashed outlines, are non-trainable layers and their weights are kept the same as those of the pre-trained model.]

C. Rank-constrained DeepCorrect Models
The inference time of our DeepCorrect models (Section IV-B) can be slower than that of the respective baseline models due to the additional computational cost introduced by the correction units. To mitigate the impact of these additional computations, we propose the use of a rank-constrained approximation to the full-rank DeepCorrect model, which not only has the same computational cost as the corresponding baseline DNN but also retains almost 99% of the performance of our full-rank DeepCorrect models.
Consider the n-th full-rank 3-D convolutional filter with weights W_n ∈ R^{k×k×C} in a DNN convolutional layer with N filters, where k×k represents the filter's spatial extent and C is the number of input channels for the filter; then a rank constrained convolution is implemented by factorizing the convolution of W_n with input z into a sequence of separable convolutions (i.e., horizontal and vertical filters) as in [32]:

W_n ∗ z ≈ Σ_{p=1}^{P} h_n^p ∗ (v^p ∗ z) = Σ_{p=1}^{P} h_n^p ∗ Σ_{c=1}^{C} v_c^p ∗ z_c    (6)

where the first convolutional filter bank consists of P vertical filters {v^p ∈ R^{k×1×C} : p ∈ [1...P]} and the second convolution consists of a horizontal filter {h_n ∈ R^{1×k×P}} that operates on the P intermediate feature maps. The number of intermediate feature maps, P, controls the rank of the low-rank approximation. The computational cost of the original full-rank convolutional layer, for N output feature maps with width W′ and height H′, is O(Nk²CH′W′), whereas the rank-constrained approximation has a computational cost of O((N + C)kPH′W′), and a speedup can be achieved when NkC > (N + C)P. For the special case of the convolutional layers in our correction units, where N = C (Fig. 5), if P = N/2, the computational cost of a convolutional layer can be reduced k times (typically, k = 3). A rank-constrained DeepCorrect model is thus generated by replacing each full-rank convolutional layer (except 1×1 convolutional layers) with a rank-constrained approximation, with P for each approximation chosen such that the total computational cost of our model is the same as that of the baseline DNN. Instead of using iterative methods or training the separable filters from random weights, we use the simple yet fast matrix decomposition approach of Tai et al. [33] to obtain the exact global optimizer of the rank-constrained approximation from its respective trained full-rank DeepCorrect model.
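The cost comparison above can be checked numerically; `conv_flops` and `sep_flops` are our own helper names for the two cost expressions.

```python
def conv_flops(N, C, k, H, W):
    """Multiply-accumulates of a full-rank k x k convolution:
    N output maps over C input channels, output size H x W."""
    return N * k * k * C * H * W

def sep_flops(N, C, k, P, H, W):
    """Cost of the rank-P separable factorisation of Eq. (6):
    P vertical k x 1 filters over C channels, then N horizontal
    1 x k filters over the P intermediate maps."""
    return (N + C) * k * P * H * W
```

For the correction-unit special case N = C with P = N/2, the ratio of the two costs is exactly k, matching the k-fold reduction stated above.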
V. EXPERIMENTAL RESULTS
We evaluate DeepCorrect models against the alternative approaches of network fine-tuning and stability training [25], for the DNN architectures mentioned in Section III. The DNN models are trained and tested using a single Nvidia Titan-X GPU. Unlike common image denoising and deblurring methods like BM3D [34] and NCSR [35], which expect the distortion level to be known during both train and test phases, or learning-based methods that train separate models for each distortion level, DeepCorrect trains a single model for all distortion levels at once and, consequently, there is no need to know the distortion level at test time.
A. AlexNet Analysis

1) Finetune model:
We fine-tune the AlexNet model in Fig. 2a on a mix of distortion affected images and clean images to generate a single fine-tuned model, and refer to this model as Finetune in our results. Starting with an initial learning rate (0.001) that is 10 times lower than the one used to generate the pre-trained model in Fig. 2a, we train for a fixed number of iterations (62500 iterations ≈ 10 epochs), with the learning rate reduced by a factor of 10 after every 18750 iterations (roughly 3 epochs). We use the data augmentation method proposed in [5].
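The step schedule described above can be sketched as below; the function name is ours.

```python
def finetune_lr(iteration, base_lr=1e-3, drop_every=18750, factor=10.0):
    """Step learning-rate schedule for the Finetune model: start at 1e-3
    (10x below the pre-trained model's rate) and divide by 10 every 18750
    iterations (~3 epochs), over 62500 iterations (~10 epochs) total."""
    return base_lr / factor ** (iteration // drop_every)
```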
2) Stability trained model:
Following the stability training method outlined in [25], which assumes that unseen distortions can be modelled by adding AWGN to the input image, we fine-tune all the fully connected layers of the pre-trained AlexNet model by minimizing the KL-divergence between the classification scores for a pair of images (I, I′), where I′ = I + η and η ∼ N(0, σ²). We use the same hyper-parameters used for the classification task in [25]: σ = 0.04 and regularization coefficient α = 0.01.

TABLE II
TOP-1 ACCURACY OF ALEXNET-BASED DNN MODELS FOR DISTORTION AFFECTED IMAGES OF THE IMAGENET VALIDATION SET (ILSVRC2012), AVERAGED OVER ALL LEVELS OF DISTORTION AND CLEAN IMAGES. BOLD NUMBERS SHOW BEST ACCURACY AND UNDERLINED NUMBERS SHOW NEXT BEST ACCURACY.

Method                     Gaussian blur    AWGN
Baseline                   0.2305           0.2375
Finetune                   0.4596           0.4894
Finetune-rc                0.4549           0.4821
Deepcorr                   0.5071           0.5092
Deepcorr-b                 0.5022           0.5063
Deepcorr-rc                0.4992           0.5052
Stability [25]             0.2163           0.2305
NCSR [35] + AlexNet [6]    0.2193           -
BM3D [34] + AlexNet [6]    -                0.5032

TABLE III
COMPUTATIONAL PERFORMANCE OF ALEXNET-BASED DNN MODELS.

Metric             Baseline/Finetune    Deepcorr      Deepcorr-b    Deepcorr-rc
FLOPs              7.4 × 10^8           ≈2.2 × 10^9   ≈1.2 × 10^9   ≈7.4 × 10^8
Trainable params   60.96M               2.81M         1.03M         1.03M
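The stability training objective of [25], used for the Stability model in Section V-A2, can be sketched as below: the task loss is augmented with α times the KL-divergence between the class scores for a clean image and for its AWGN-perturbed copy. The function names and the toy logits are ours; a real implementation would operate on DNN softmax outputs.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def stability_loss(task_loss, logits_clean, logits_noisy, alpha=0.01):
    """L = L_task + alpha * KL(P(y|I) || P(y|I')), with I' = I + eta and
    eta ~ N(0, sigma^2); alpha = 0.01 and sigma = 0.04 follow [25]."""
    p = softmax(logits_clean)
    q = softmax(logits_noisy)
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return task_loss + alpha * kl

z = np.array([2.0, 0.5, -1.0])
assert stability_loss(0.0, z, z) == 0.0   # identical scores: no penalty
```

The KL term only penalizes the network when the noisy copy changes the predicted class distribution, which is what encourages distortion-stable features.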
3) DeepCorrect models:
Our main DeepCorrect model for AlexNet, shown in Fig. 6 and referred to as Deepcorr in Table II, is generated by correcting the 75% top-ranked filter outputs in the first two layers (β1 = β2 = 0.75) and the 50% top-ranked filter outputs in the next three layers (β3 = β4 = β5 = 0.5) of the pre-trained AlexNet shown in Fig. 2a. The correction units (Fig. 5) in each convolutional layer are trained using an initial learning rate of 0.1 and the same learning rate schedule, data augmentation and total iterations used for generating the Finetune model in Section V-A1. We also generate two additional variants based on our Deepcorr model: Deepcorr-b, a computationally lighter model than Deepcorr based on a bottleneck architecture for its correction units, as described later in this section, and Deepcorr-rc, a rank-constrained model (Section IV-C) derived from the full-rank Deepcorr-b model such that its test-time computational cost is almost the same as that of the Finetune model. For comparison, a rank-constrained model is also derived for the Finetune model using similar decomposition parameters as Deepcorr-rc; the resulting model is denoted by Finetune-rc.

Table II shows the superior performance of our proposed method as compared to the alternative approaches, and Table III summarizes the computational performance of these DNN models during training and testing, in terms of trainable parameters and floating point operations (FLOPs), respectively. We evaluate the training computational cost in terms of the number of trainable parameters that are updated during training. The test-time computational cost is evaluated in terms of the total FLOPs, i.e., the total multiply-add operations needed for a single image inference, as outlined by He et al. in [5]. In particular, a k × k convolutional layer operating on C input maps and producing N output feature maps of width W′ and height H′ requires Nk²CH′W′ FLOPs [5]. (Since Deepcorr-rc is derived from the Deepcorr-b model through a low-rank approximation, its number of trainable parameters, and consequently its effect on model convergence, is the same as that of Deepcorr-b.) The detailed architecture of each correction unit in our different DeepCorrect models for AlexNet is summarized in Table IV. Design choices for various correction unit architectures and their impact on inference-time computational cost are also discussed later in this section.

TABLE IV
CORRECTION UNIT ARCHITECTURES FOR ALEXNET-BASED DEEPCORRECT MODELS. THE NUMBER FOLLOWING CORR-UNIT SPECIFIES THE LAYER AT WHICH CORRECTION UNITS ARE APPLIED. CORRECTION UNIT CONVOLUTIONAL LAYERS ARE REPRESENTED AS k x k, d, WHERE k x k IS THE SPATIAL EXTENT OF A FILTER AND d IS THE NUMBER OF FILTERS IN A LAYER. STACKED CONVOLUTIONAL LAYERS ARE ENCLOSED IN BRACKETS.

             Corr-unit 1    Corr-unit 2    Corr-unit 3    Corr-unit 4    Corr-unit 5
             β1N1 = 72      β2N2 = 192     β3N3 = 192     β4N4 = 192     β5N5 = 128
Output size  55 × 55        27 × 27        13 × 13        13 × 13        13 × 13
Deepcorr     1 × 1, 72      1 × 1, 192     1 × 1, 192     1 × 1, 192     1 × 1, 128
             [5 × 5, 72]    [3 × 3, 192]   [3 × 3, 192]   [3 × 3, 192]   [3 × 3, 128]
             1 × 1, 72      1 × 1, 192     1 × 1, 192     1 × 1, 192     1 × 1, 128
Deepcorr-b   1 × 1, 36      1 × 1, 96      1 × 1, 96      1 × 1, 96      1 × 1, 64
             [3 × 3, 36]    [3 × 3, 96]    [3 × 3, 96]    [3 × 3, 96]    [3 × 3, 64]
             1 × 1, 72      1 × 1, 192     1 × 1, 192     1 × 1, 192     1 × 1, 128
Deepcorr-rc  1 × 1, 36      1 × 1, 96      1 × 1, 96      1 × 1, 96      1 × 1, 64
             [3 × 1, 27;    [3 × 1, 72;    [3 × 1, 72;    [3 × 1, 72;    [3 × 1, 48;
              1 × 3, 36]     1 × 3, 96]     1 × 3, 96]     1 × 3, 96]     1 × 3, 64]
             1 × 1, 72      1 × 1, 192     1 × 1, 192     1 × 1, 192     1 × 1, 128

As can be seen from Table II, the
Deepcorr model, whichis our best performing model in terms of top-1 classificationaccuracy, outperforms the Finetune model with ≈
10% and ≈
4% relative average improvement for Gaussian blur and AWGN, respectively, while training only a small fraction of the parameters that Finetune updates (2.81M vs. 60.96M; Table III). Deepcorr significantly improves the robustness of a pre-trained DNN, achieving an average top-1 accuracy of 0.5071 and 0.5092 for Gaussian blur and AWGN affected images, respectively, as compared to the corresponding top-1 accuracies of 0.2305 and 0.2375 for the pre-trained AlexNet DNN.

All our
DeepCorrect model variants consistently outperformthe Finetune and Stability models for both distortion types(Table II). We also observe that fine-tuning DNN modelson distortion specific data significantly outperforms DNNmodels trained through distortion agnostic stability training .For completeness, we also compare classification performancewith AlexNet when combined with a commonly used non-blind image denoising method (BM3D) proposed by [34]and a deblurring method (NCSR) proposed by [35], whereBM3D and NCSR are applied prior to the baseline AlexNet,for AWGN and Gaussian blur, respectively. Table II showstop-1 accuracy for these two methods with the
Deepcorr model outperforming each for AWGN and Gaussian blur,respectively.
4) Correction unit architectures:
Increasing the correction unit kernel depth (D_i in Fig. 5) makes our proposed correction unit fatter, whereas decreasing D_i makes the correction unit thinner. A natural choice for D_i would be to make it equal to the number of distortion susceptible filters that need correction (β_iN_i) in a DNN layer i; this is also the default parameter setting used in the correction units for Deepcorr. Since D_i is then always equal to the number of distortion susceptible filters that need correction (i.e., D_i = β_iN_i), the number of trainable parameters in the correction units of Deepcorr scales linearly with the number of corrected filters (β_iN_i) in each convolutional layer, even though Deepcorr trains significantly fewer parameters than Finetune and still achieves a better classification accuracy (Tables II and III).

As shown in Table III, during the testing phase, the correction units in Deepcorr add almost 2 times more FLOPs relative to the baseline DNN for evaluating a single image. As discussed in more detail later in this section, one way to limit the number of trainable parameters and FLOPs is to explore a bottleneck architecture for our correction units, where the convolutional kernel depth D_i (Fig. 5) is set to 50% of the number of distortion susceptible filters that need correction in a DNN layer i (i.e., D_i = 0.5 β_iN_i), as compared to D_i being set to the full number of filters to correct (i.e., D_i = β_iN_i) in Deepcorr. Replacing each correction unit in Deepcorr with such a bottleneck correction unit, which we refer to as
Deepcorr -b, results in a
DeepCorrect model that has significantly fewer trainable parameters and FLOPs than the original
Deepcorr model (Table IV). Compared to the correctionunits in Deepcorr , bottleneck correction units provide a 60%reduction in FLOPs on average, with a 85% reduction inFLOPs for Corr-unit 1 as shown in Table IV. From TablesII and III, it can be seen that
Deepcorr-b achieves a top-1 accuracy close to that of Deepcorr, with a 63% reduction in trainable parameters and a 73% reduction in FLOPs, relative to
Deepcorr . As shown in Table III, the
Deepcorr -b model still requires 58% more FLOPs relativeto the baseline DNN, for evaluating a single image at testtime. A further reduction in the computational cost at test-timecan be achieved by deriving a rank-constrained
DeepCorrect model (
Deepcorr -rc) from the
Deepcorr -b model using theapproach outlined in Section IV-C. By replacing each full rankconvolutional layer in a
Deepcorr-b correction unit with a pair of separable convolutions (Section IV-C), we can reduce the FLOPs for each correction unit by an additional 46%, as shown in Table IV.

Fig. 7. Effect of the DeepCorrect ranking metric on correction unit performance when integrated with AlexNet [6]. Dashed lines represent correction units trained on the least susceptible filters in a DNN layer and solid lines represent correction units trained on the most susceptible filters, as identified by our ranking metric (Section IV-A).

Replacing the pre-trained convolutional layers of the baseline AlexNet (which are left unchanged in the
DeepCorrect models and shown in gray in Fig. 6) by theirequivalent low-rank approximations provides an additional29% reduction in FLOPs, relative to the baseline AlexNetmodel, such that the resultant
Deepcorr -rc model now hasalmost the same FLOPs as Finetune (Table III) and still retainsalmost 99% of the accuracy achieved by
Deepcorr .
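The FLOP accounting used in Tables III and IV follows the Nk²CH′W′ rule of [5]; the sketch below (helper names ours) also checks the Section IV-C speedup condition NkC > (N + C)P for a correction-unit-sized layer with N = C and P = N/2, where the cost drops roughly k times.

```python
def conv_flops(N, C, k, H, W):
    """Full-rank k x k conv: N output maps of size H x W over C inputs."""
    return N * k * k * C * H * W

def sep_conv_flops(N, C, k, P, H, W):
    """Rank-constrained pair: P vertical k x 1 filters, then N horizontal
    1 x k filters applied over the P intermediate maps."""
    return (N + C) * k * P * H * W

# Correction-unit case N = C with P = N / 2: cost drops by a factor of k.
N = C = 192; k = 3; H = W = 13; P = N // 2
full = conv_flops(N, C, k, H, W)
sep = sep_conv_flops(N, C, k, P, H, W)
assert N * k * C > (N + C) * P     # speedup condition holds
assert full // sep == k            # ~k-fold reduction for N = C, P = N/2
```

The ratio full/sep equals Nk²C / ((N + C)kP), which for N = C and P = N/2 simplifies exactly to k; this is the "reduced k times" claim in Section IV-C.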
5) Effect of ranking on correction unit performance:
If the superior performance of our DeepCorrect model is only due to the additional network parameters provided by the correction unit, then re-training the correction unit on the least susceptible β_iN_i filter outputs in a DNN layer should also achieve the same performance as that achieved by applying a correction unit on the β_iN_i most susceptible filter outputs identified by our ranking metric (Section IV-A). β_i and N_i, as defined in Section IV-A, represent the percentage of filters corrected in the i-th layer and the total number of filters in the i-th layer, respectively. To this end, we evaluate the performance of training correction units on: 1) the β_iN_i filters most susceptible to distortion, and 2) the β_iN_i filters least susceptible to distortion, as identified by our ranking metric. For this analysis, β_i ∈ {0.5, 0.75}.

As shown in Fig. 7, correction units trained on the β_iN_i most distortion susceptible filters (solid lines) outperform those trained on the least susceptible filters (dashed lines) of the conv 1, conv 2 and conv 5 layers of the AlexNet model, for Gaussian blur and AWGN, respectively. Although we observe a similar trend for the conv 3 and conv 4 layers, we plot results for only conv 1, conv 2 and conv 5, as these show the largest difference in performance due to ranking distortion susceptible filters. Correcting the top 75% distortion susceptible filters (solid orange), as identified by our ranking measure, achieves the best performance in all layers. Similarly, for all layers and distortions, correction units trained on the top 50% susceptible filters (solid green) not only significantly outperform correction units trained on the 50% least susceptible filters (dashed blue) but also outperform correction units trained on the 75% least susceptible filters (dashed red), where 25% of the filters are shared among both approaches.

Fig. 8. Top-1 error on the ILSVRC-2012 validation set, for DeepCorrect model variants and fine-tuning.
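The selection step evaluated in the ranking experiment above can be sketched as follows: given per-filter susceptibility scores from the ranking metric of Section IV-A (the scores below are synthetic, and the function name is ours), pick the β_iN_i most susceptible filter indices to route through a correction unit.

```python
import numpy as np

def select_filters(susceptibility, beta):
    """Return indices of the top beta*N most distortion susceptible
    filters (highest score first), i.e., the set sent to a correction unit."""
    n_corr = int(beta * len(susceptibility))
    order = np.argsort(susceptibility)[::-1]   # most susceptible first
    return order[:n_corr]

scores = np.array([0.9, 0.1, 0.5, 0.7])   # synthetic per-filter scores
top = select_filters(scores, beta=0.5)
assert set(top) == {0, 3}                  # the two most susceptible filters
```

Training the correction unit on `order[-n_corr:]` instead would reproduce the "least susceptible" control condition plotted with dashed lines in Fig. 7.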
6) Accelerating training:
We analyze the evolution of thevalidation set error over training iterations for
Deepcorr , Deepcorr -b as well as the Finetune model and stop trainingfor the
Deepcorr models when their respective validation error is less than or equal to the minimum validation error achieved by the Finetune model. For any particular validation error achieved by the Finetune model in Fig. 8, both Deepcorr model variants reach that error in a much smaller number of training iterations. Thus, we conclude that just learning corrective transforms for the activations of a subset of convolutional filters in a layer accelerates training, through a reduced number of training epochs needed for convergence.
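The stopping rule described above can be sketched as a simple loop: training halts as soon as the model's validation error matches the best error reached by the Finetune baseline. The error sequence below is synthetic, and the function name is ours.

```python
def train_until_match(val_errors, target_error):
    """Return the 1-based iteration at which the validation error first
    drops to (or below) the Finetune baseline's minimum, else None."""
    for i, err in enumerate(val_errors, start=1):
        if err <= target_error:
            return i
    return None

deepcorr_errors = [0.60, 0.52, 0.47, 0.44, 0.43]   # synthetic curve
assert train_until_match(deepcorr_errors, target_error=0.47) == 3
```

Comparing the returned iteration counts across models is exactly how Fig. 8 supports the faster-convergence claim.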
7) Generalization of DeepCorrect Features to other Datasets:
TABLE V
MEAN ACCURACY PER CATEGORY OF THE PRE-TRAINED ALEXNET DEEP FEATURE EXTRACTOR FOR CLEAN IMAGES.

Caltech-101    Caltech-256    SUN-397
0.8500         0.6200         0.3100

TABLE VI
MEAN ACCURACY PER CATEGORY FOR GAUSSIAN BLUR AFFECTED IMAGES, AVERAGED OVER ALL DISTORTION LEVELS. BOLD NUMBERS SHOW BEST ACCURACY.

Dataset        Baseline    Finetune    Deepcorr
Caltech-101    0.4980      0.7710      -
Caltech-256    0.2971      0.5167      -
SUN-397        0.1393      0.2369      -

TABLE VII
MEAN ACCURACY PER CATEGORY FOR AWGN AFFECTED IMAGES, AVERAGED OVER ALL DISTORTION LEVELS. BOLD NUMBERS SHOW BEST ACCURACY.

Dataset        Baseline    Finetune    Deepcorr
Caltech-101    0.3423      0.7705      -
Caltech-256    0.1756      0.4995      -
SUN-397        0.0859      0.1617      -

To analyze the ability of distortion invariant features learnt for image classification to generalize to related tasks like object recognition and scene recognition, we evaluate the
Deepcorr and Finetune models trained on the ImageNet dataset (Section V-A) as discriminative deep feature extractors on the Caltech-101 [13], Caltech-256 [14] and SUN-397 [15] datasets. Unlike object recognition datasets like Caltech-101 and Caltech-256, which bear some similarity to an image classification/object recognition dataset like ImageNet, a scene recognition dataset like SUN-397 bears no similarity to the ImageNet dataset and is expected to be challenging for features extracted using models learnt on ImageNet [36]. Following the experimental procedure proposed by [36], we use the output of the first (Caltech-101 and Caltech-256) or second (SUN-397) fully-connected layer in these models as a deep feature extractor for images affected by distortion.

Since the above deep feature models have not been trained on any one of these datasets (Caltech-101, Caltech-256 and SUN-397), for each dataset we train linear SVMs on top of the deep features, which are extracted from a random set of training data, and evaluate the performance in terms of mean accuracy per category averaged over 5 data splits, following the training procedure adopted by [36]. The training data for each split consists of 25 training images and 5 validation images per class, sampled randomly from the considered dataset; all remaining images are used for testing. A baseline accuracy for undistorted images is first established by training linear SVMs on features extracted only from undistorted images using the AlexNet DNN shown in Fig. 2a, and results are reported in Table V. Similar to the evaluation in Section V-A, we now independently add Gaussian blur and AWGN to train and test images using the same distortion levels as reported in Section III, and report performance averaged over all distortion levels in Tables VI-VII for deep features extracted using the baseline AlexNet, Finetune and
Deepcorr models trained on ImageNet. (For each of the three models, a single set of linear SVMs is trained for images affected by different levels of distortion as well as clean images.)

TABLE VIII
TOP-1 ACCURACY OF RESNET-BASED DNN MODELS FOR DISTORTION AFFECTED IMAGES OF THE IMAGENET VALIDATION SET (ILSVRC2012), AVERAGED OVER ALL LEVELS OF DISTORTION AND CLEAN IMAGES. BOLD NUMBERS SHOW BEST ACCURACY AND UNDERLINED NUMBERS SHOW NEXT BEST ACCURACY.

Method           G.Blur    AWGN      M.Blur    D.Blur    Cam.Blur
Baseline         0.3841    0.3255    0.4436    0.3582    0.4749
Finetune         0.5617    0.5970    0.6197    0.5615    0.6041
Finetune-rc      0.5548    0.5898    0.6113    0.5537    0.5973
Deepcorr         -         -         -         -         -
Deepcorr-b       -         -         -         -         -
Deepcorr-rc      0.5821    0.6033    0.6411    0.5785    0.6276
Stability [25]   0.3412    0.3454    0.4265    0.3182    0.4720

TABLE IX
COMPUTATIONAL PERFORMANCE OF RESNET-BASED DNN MODELS.

Metric             Baseline/Finetune    Deepcorr    Deepcorr-b    Deepcorr-rc
FLOPs              1.8 × 10^9           -           ≈2.9 × 10^9   ≈1.8 × 10^9
Trainable params   11.7M                8.41M       5.5M          5.5M

Both
Gaussian blur and AWGN significantly affect the accuracy ofthe baseline feature extractor for all three datasets, with a 41%and 60% drop in respective accuracies for Caltech-101, a 52%and 71% drop in respective accuracies for Caltech-256, anda 55% and 72% drop in respective mean accuracy for SUN-397, relative to the benchmark performance for clean images.For Caltech-101, the
Deepcorr feature extractor outperforms the Finetune feature extractor with an 8.5% and a 4.2% relative improvement in mean accuracy for Gaussian blur and AWGN affected images, respectively. For Caltech-256, the
Deepcorr feature extractor outperforms the Finetune feature extractorwith a 13.8% and 9.7% relative improvement for Gaussian blurand AWGN, respectively. Similarly, features extracted usingthe
Deepcorr model significantly outperform those extractedusing the Finetune model for SUN-397, with a 28.7% and81.5% relative improvement in mean accuracy for Gaussianblur and AWGN, respectively. The large performance gapbetween
Deepcorr and Finetune feature extractors highlightsthe generic nature of distortion invariant features learnt by our
DeepCorrect models.
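The per-dataset evaluation protocol of [36] used above (25 training and 5 validation images per class, the remainder for testing, averaged over 5 random splits) can be sketched as below; a linear SVM would then be fit on the deep features of the train split. The function name and toy labels are ours.

```python
import numpy as np

def make_split(labels, n_train=25, n_val=5, seed=0):
    """Randomly split per-class image indices into train/val/test,
    following the protocol of [36]: fixed train/val counts per class,
    all remaining images used for testing."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train += list(idx[:n_train])
        val += list(idx[n_train:n_train + n_val])
        test += list(idx[n_train + n_val:])
    return train, val, test

labels = np.repeat(np.arange(3), 40)          # 3 classes, 40 images each
tr, va, te = make_split(labels)
assert (len(tr), len(va), len(te)) == (75, 15, 30)
```

Repeating this with five seeds and averaging the per-category accuracy reproduces the "5 data splits" averaging described in the text.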
B. ResNet18 Analysis
Similar to the AlexNet analysis in Section V-A, forResNet18, we evaluate the performance of our proposed
DeepCorrect models against DNN models trained through fine-tuning and stability training. In addition to Gaussian blur (GB) and AWGN, we also evaluate 3 additional distortion types used by Vasiljevic et al. [27], namely: 1) motion blur (M.Blur), 2) defocus blur (D.Blur), and 3) camera shake blur (Cam.Blur). Spatially uniform disk kernels of varying radii are used to simulate defocus blur, while horizontal and vertical box kernels of single pixel width and varying length are used to simulate motion blur (uniform linear motion) [27]. For simulating camera shake blur, we use the code in [37] to generate 100 random blur kernels and report performance averaged over all camera shake blur kernels.

Using the training procedures outlined in Sections V-A1 and V-A2, we generate fine-tuned and stability trained models from the baseline ResNet18 DNN by training all layers (11.7M parameters) of the DNN on a mix of distorted and clean images. Similarly, using the training procedure of Section V-A3, our competing DeepCorrect model is generated by training correction units that are appended at the output of the most susceptible filters in the odd-numbered convolutional layers (1 to 17) of the baseline ResNet18 model, right after each skip connection merge (Fig. 2b).

TABLE X
CORRECTION UNIT ARCHITECTURES FOR RESNET-BASED DEEPCORRECT MODELS. THE NUMBER FOLLOWING CORR-UNIT SPECIFIES THE LAYER AT WHICH CORRECTION UNITS ARE APPLIED. CORRECTION UNIT CONVOLUTIONAL LAYERS ARE REPRESENTED AS k x k, d, WHERE k x k IS THE SPATIAL EXTENT OF A FILTER AND d IS THE NUMBER OF FILTERS IN A LAYER. STACKED CONVOLUTIONAL LAYERS ARE ENCLOSED IN BRACKETS.

             Corr-unit 1    Corr-unit 3, 5   Corr-unit 7, 9   Corr-unit 11, 13   Corr-unit 15, 17
             βN = 48        βN = 48          βN = 96          βN = 192           βN = 384
Output size  112 × 112      56 × 56          28 × 28          14 × 14            7 × 7
Deepcorr     1 × 1, 48      1 × 1, 48        1 × 1, 96        1 × 1, 192         1 × 1, 384
             [3 × 3, 48]    [3 × 3, 48]      [3 × 3, 96]      [3 × 3, 192]       [3 × 3, 384]
             1 × 1, 48      1 × 1, 48        1 × 1, 96        1 × 1, 192         1 × 1, 384
Deepcorr-b   1 × 1, 31      1 × 1, 31        1 × 1, 62        1 × 1, 124         1 × 1, 248
             [3 × 3, 31]    [3 × 3, 31]      [3 × 3, 62]      [3 × 3, 124]       [3 × 3, 248]
             1 × 1, 48      1 × 1, 48        1 × 1, 96        1 × 1, 192         1 × 1, 384
Deepcorr-rc  1 × 1, 31      1 × 1, 31        1 × 1, 62        1 × 1, 124         1 × 1, 248
             [3 × 1, 24;    [3 × 1, 24;      [3 × 1, 48;      [3 × 1, 96;        [3 × 1, 192;
              1 × 3, 31]     1 × 3, 31]       1 × 3, 62]       1 × 3, 124]        1 × 3, 248]
             1 × 1, 48      1 × 1, 48        1 × 1, 96        1 × 1, 192         1 × 1, 384

Similar to the AlexNet analysis presented earlier (Section V-A), we generate three
DeepCorrect models (i.e.,
Deepcorr , Deepcorr -b and
Deepcorr -rc).For comparison, a rank-constrained model is derived for theFinetune model using similar decomposition parameters as
Deepcorr -rc and the resulting model is denoted by Finetune-rc. Table VIII summarizes the accuracy of ResNet18-basedDNN models against various distortions, while Table IX showsthe computational performance of the same DNN models,measured in terms of FLOPs and trainable parameters. Similarto the AlexNet analysis, the detailed architecture and cor-responding FLOPs for each correction unit in our different
DeepCorrect models for ResNet18 are shown in Table X.From Table VIII, we observe that our
DeepCorrect modelsnot only significantly improve the robustness of the baselineDNN model to input distortions but also outperform the alter-native approaches of model fine-tuning and stability training interms of classification accuracy.
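The defocus and motion blur simulations used for the ResNet18 evaluation above (spatially uniform disk kernels of varying radius; single-pixel-wide box kernels of varying length, following [27]) can be sketched as below; the function names are ours.

```python
import numpy as np

def defocus_kernel(radius):
    """Spatially uniform disk kernel of the given radius (defocus blur)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return disk / disk.sum()

def motion_kernel(length, horizontal=True):
    """Single-pixel-wide box kernel (uniform linear motion blur)."""
    return np.full((1, length) if horizontal else (length, 1), 1.0 / length)

k = defocus_kernel(3)
assert k.shape == (7, 7) and np.isclose(k.sum(), 1.0)
assert motion_kernel(9).shape == (1, 9)
```

Both kernels are normalized to sum to one, so convolving an image with them changes sharpness but preserves average intensity; camera shake blur, in contrast, is drawn from the random trajectory kernels of [37].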
Deepcorr, which trains almost 28% fewer parameters (Table IX) than Finetune, outperforms Finetune by 3.2% for Gaussian blur, 1.47% for AWGN, 4.86% for motion blur, 6.95% for defocus blur and 5% for camera shake blur.
Deepcorr-b, which trains 50% fewer parameters (Table IX) than Finetune, outperforms Finetune by 3.95% for Gaussian blur, 1.9% for AWGN, 4.46% for motion blur, 3.84% for defocus blur and 5.36% for camera shake blur. Similarly, our rank-constrained model (
Deepcorr-rc), which trains 50% fewer parameters than Finetune (Table IX), outperforms Finetune by 3.63% for Gaussian blur, 1% for AWGN, 3.45% for motion blur, 3% for defocus blur and 3.89% for camera shake blur. As shown in Tables IX and X, during the testing phase,
Deepcorr -b requires 60% more FLOPs relative to thebaseline DNN, for evaluating a single image. On the otherhand, for
Deepcorr -rc, just using rank-constrained correction units results in a 50% reduction in FLOPs as compared to
Deepcorr -b (Table X), with a 30% additional reduction inFLOPs relative to baseline ResNet18 achieved by replacingthe pre-trained convolutional layers of baseline ResNet18 withtheir low-rank approximations. From Tables VIII and IX, it canbe seen that
Deepcorr-rc requires almost the same number of FLOPs as Finetune but achieves superior performance to the Finetune and Stability models without sacrificing inference speed.

VI. CONCLUSION
Deep networks trained on pristine images perform poorly when tested on distorted images affected by image blur or additive noise. Evaluating the effect of Gaussian blur and AWGN on the activations of convolutional filters trained on undistorted images, we observe that select filters in each DNN convolutional layer are more susceptible to input distortions than the rest. We propose a novel objective metric to assess the susceptibility of convolutional filters to distortion and use this metric to identify the filters that maximize DNN robustness to input distortions upon correction of their activations.

We design correction units, which are residual blocks comprised of a small stack of trainable convolutional layers and a single skip connection per stack. These correction units are added at the output of the most distortion susceptible filters in each convolutional layer, whilst leaving the rest of the pre-trained (on undistorted images) filter outputs in the network unchanged. The resultant DNN models, which we refer to as DeepCorrect models, significantly improve the robustness of DNNs against image distortions and also outperform the alternative approach of network fine-tuning on common vision tasks like image classification, object recognition and scene classification, whilst training significantly fewer parameters and achieving faster convergence in training. Fine-tuning limits the ability of the network to learn invariance to severe levels of distortion, and re-training an entire network can be computationally expensive for very deep networks. By correcting the most distortion-susceptible convolutional filter outputs, we are not only able to make a DNN robust to severe distortions, but are also able to maintain a very good performance on clean images.

Although we focus on image classification and object recognition, our proposed approach is generic enough to apply to a wide selection of tasks that use DNN models, such as object detection [7], [38] or semantic segmentation [11], and could also be extended to other, less commonly occurring noise types like adversarial noise.

REFERENCES

[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[2] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[16] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[18] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 427-436.
[19] L. J. Karam and T. Zhu, "Quality labeled faces in the wild (QLFW): a database for studying face recognition in real-world environments," in Proc. SPIE 9394, Human Vision and Electronic Imaging XX, 2015.
[20] S. Basu, M. Karki, S. Ganguly, R. DiBiano, S. Mukhopadhyay, S. Gayaka, R. Kannan, and R. Nemani, "Learning sparse feature representations using probabilistic quadtrees and deep belief nets," in Neural Processing Letters. Springer, 2015, pp. 1-13.
[21] S. Karahan, M. K. Yildirum, K. Kirtac, F. S. Rende, G. Butun, and H. K. Ekenel, "How image degradations affect deep CNN-based face recognition?" in International Conference of the Biometrics Special Interest Group (BIOSIG). IEEE, 2016, pp. 1-5.
[22] E. Rodner, M. Simon, R. B. Fisher, and J. Denzler, "Fine-grained recognition in the noisy wild: Sensitivity analysis of convolutional neural networks approaches," arXiv preprint arXiv:1610.06756, 2016.
[23] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," Tech. Rep., 2011.
[25] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, "Improving the robustness of deep neural networks via stability training," CoRR, vol. abs/1604.04326, 2016. [Online]. Available: http://arxiv.org/abs/1604.04326
[26] Z. Sun, M. Ozay, Y. Zhang, X. Liu, and T. Okatani, "Feature quantization for defending against distortion of images," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[27] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich, "Examining the impact of blur on recognition by convolutional networks," CoRR, vol. abs/1611.05760, 2016. [Online]. Available: http://arxiv.org/abs/1611.05760
[28] Y. Zhou, S. Song, and N.-M. Cheung, "On classification of distorted images with deep convolutional neural networks," arXiv preprint arXiv:1701.01924, 2017.
[29] S. Diamond, V. Sitzmann, S. P. Boyd, G. Wetzstein, and F. Heide, "Dirty pixels: Optimizing image classification architectures for raw sensor data," CoRR, vol. abs/1701.06487, 2017. [Online]. Available: http://arxiv.org/abs/1701.06487
[30] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti, "TID2008 - A database for evaluation of full-reference visual quality assessment metrics," in Advances of Modern Radioelectronics, vol. 10, no. 4, 2009, pp. 30-45.
[31] D. E. Rumelhart, R. Durbin, R. Golden, and Y. Chauvin, "Backpropagation: The basic theory," in Backpropagation, Y. Chauvin and D. E. Rumelhart, Eds. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1995, pp. 1-34.
[32] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," CoRR, vol. abs/1405.3866, 2014. [Online]. Available: http://arxiv.org/abs/1405.3866
[33] C. Tai, T. Xiao, X. Wang, and W. E, "Convolutional neural networks with low-rank regularization," CoRR, vol. abs/1511.06067, 2015. [Online]. Available: http://arxiv.org/abs/1511.06067
[34] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "BM3D image denoising with shape-adaptive principal component analysis," in Proc. Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS09), 2009.
[35] W. Dong, L. Zhang, G. Shi, and X. Li, "Nonlocally centralized sparse representation for image restoration," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1620-1630, 2013.
[36] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in International Conference on Machine Learning, 2014, pp. 647-655.
[37] A. Chakrabarti, "A neural approach to blind motion deblurring,"