JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015
Single Image Deraining via Scale-space Invariant Attention Neural Network
Bo Pang, Deming Zhai, Member, IEEE, Junjun Jiang, Member, IEEE, Xianming Liu, Member, IEEE
Abstract—Image enhancement from degradation by rainy artifacts plays a critical role in outdoor visual computing systems. In this paper, we tackle the notion of scale, which deals with visual changes in the appearance of rain streaks with respect to the camera. Specifically, we revisit multi-scale representation through scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than that in the pixel domain. Moreover, to improve the modeling ability of the network, we do not treat the extracted multi-scale features equally, but design a novel scale-space invariant attention mechanism to help the network focus on parts of the features. In this way, we summarize the most activated presence of feature maps as the salient features. Extensive experimental results on synthetic and real rainy scenes demonstrate the superior performance of our scheme over the state-of-the-arts.
Index Terms—Single image deraining, multi-scale feature, scale-invariant, attention mechanism
I. INTRODUCTION
In practical applications, tasks of outdoor scene analysis inevitably involve scenarios where images are captured under bad weather conditions, such as rainy days. Such factors cause degradation in image quality, resulting in unexpected impacts on subsequent tasks like object detection [19], recognition [18] and scene analysis [7]. The image enhancement task that attempts to remove rain streaks is thus useful and necessary for outdoor visual systems, serving as a pre-processing step that helps improve detection or recognition performance.

Deraining has become an active low-level image processing problem, and many works have emerged in recent years, either video-based [10], [14] or single-image based [17], [26]. In this paper, we focus on the line of single image deraining, which can be formulated as an ill-posed problem. Early methods treat rain removal as a signal separation problem, relying on prior modeling of the background layer and the rain streak layer. However, the artifact of rain streaks is inherently a kind of signal-dependent noise, whose features intrinsically overlap with those of the background in the feature space, making this inverse problem even harder to solve. The progress of deep learning based image restoration lights the path of single image deraining [17], [26], [20]. This kind of data-driven approach is able to model more complicated mappings from rain images to clean ones, and
This work was supported by XX. (Corresponding author: Deming Zhai.) B. Pang, D. Zhai, J. Jiang and X. Liu are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Laboratory, Shenzhen 518052, China ([email protected]; [email protected]; [email protected]).

thus achieves much better deraining results than the traditional model-based approach [16], [12], [4].

In real-world scenarios, rain streaks appear at different scales, depending on their distance from the camera. This leads to rain artifacts with varying sizes, background clutter and heavy occlusions, making single image deraining a challenging task. Some works try to handle this multi-scale effect in rain modeling. For instance, Fu et al. [1] construct pyramid frameworks to exploit multi-scale knowledge for deraining. Jiang et al. [8] propose to first generate a Gaussian pyramid and then fuse the multi-scale information. These methods explicitly decompose the rain image into different pyramid levels by progressively downsampling in the pixel domain. Yet, we note that this multi-scale representation through an image pyramid is not optimal. The downsampling operation, by removing a large amount of pixels, results in blurry artifacts and resolution reduction, leaving the subsequent fusion procedure insufficiently informed about the key characteristics for identifying salient features in images.

Considering the above limitation of existing methods, in this work we revisit multi-scale representation through scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than representations in the pixel domain.
Specifically, we propose a scale-aware deep convolutional neural network for deraining, which includes a multi-scale feature extraction branch coupled with a scale-space invariant attention branch. In the feature branch, we build a multi-scale pyramid in feature space through average pooling operations with various sizes, which can efficiently suppress noise and preserve background information. Besides compact representation and robustness to noise, this manner brings other benefits, such as invariance to local translation and an enlarged receptive field. In the attention branch, we tailor a scale-space invariant attention mechanism to quantify the importance of the input multi-scale features to focus on. We achieve this by building a difference-of-Gaussian (DoG) pyramid in scale-space, which is coupled with the feature branch and able to reveal latent salient features across scales. Finally, an LSTM-based multi-stage refinement strategy is employed to progressively improve the network performance. In a nutshell, our scheme works by learning intermediate attention maps in scale-space that are used to select the most relevant pieces of information from the multi-scale features for separating the background and rain streaks.

The main contributions of this work are highlighted as follows:

• We propose a multi-scale correlation representation in feature space for single image deraining, which is more
compact and robust than the counterpart representation in the pixel domain.

• We propose a scale-space invariant attention network through a DoG pyramid, which can reveal latent salient features across scales.

• Our scheme achieves the best single image deraining performance to date, being consistently superior to a wide range of state-of-the-arts over various benchmarking datasets.

The remaining sections are organized as follows: related works are briefly overviewed in Section II. In Section III, we introduce the proposed method in detail. In Section IV, we provide extensive experimental results and an ablation study to demonstrate the superior performance of our method.

Fig. 1: The overall framework of our network, which includes multiple stages. X is the input rain image, {X_k}, k = 0, ..., n-1, are the intermediate deraining results, X_n is the final deraining result, and {R_k}, k = 1, ..., n, are the estimated rain layers. The core module in each stage is Cross-scale Feature Aggregation, in which multi-scale features are extracted and scale-space invariant attention masks are derived. Adjacent stages are connected by LSTM to propagate information and achieve progressive refinement.

II. RELATED WORK
In this section, we briefly overview existing model-based and deep learning based single image deraining works.
A. Model-based methods
A rainy image can be modeled as a linear combination of a background layer and a rain streak layer. Based on this, model-based methods conduct deraining by explicitly imposing prior models on both the background layer and the rain streak layer. The task of deraining is then transformed into a signal separation problem. Luo et al. [16] propose a dictionary learning method that sparsely approximates patches of the two layers with highly discriminative codes over a learned dictionary with a strong mutual exclusivity property. Li et al. [12] point out that both dictionary learning methods and low-rank structures tend to either leave too many rain streaks in the background image or over-smooth it; they therefore propose a prior-based method using a Gaussian mixture model (GMM) that can accommodate multiple orientations and scales of rain streaks. Gu et al. [4] combine analysis sparse representation (ASR) and synthesis sparse representation (SSR), both used for sparsity-based image modeling, into a joint convolutional analysis and synthesis (JCAS) sparse representation model. However, these model-based methods cannot well formulate the complex raining process, which limits their ability to retrieve background information.
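The signal-separation view underlying these methods can be illustrated with a hypothetical toy example of the additive model, in which the observation O is the sum of a background layer B and a sparse streak layer R. The priors above are what make recovering B from O alone tractable; the sketch below sidesteps that difficulty by assuming R is known, purely to show the composition and separation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Additive composition model: observation = background + rain streaks.
background = rng.uniform(0.2, 0.7, size=(64, 64))   # clean scene B
rain = np.zeros((64, 64))
for col in rng.choice(64, size=8, replace=False):    # a few vertical streaks
    rain[:, col] = 0.25
observation = background + rain                      # rainy image O = B + R

# Deraining as layer separation: with the streak layer known, the
# background follows by subtraction; real methods must infer R via priors.
estimate = observation - rain
print(np.allclose(estimate, background))             # True
```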
B. Deep Learning based methods
For data-driven methods, the intuitive idea for the rain removal task is to learn a mapping from the rainy image to the clear background image. However, such a solution might cause loss of background information. To better recover the clean image, Fu et al. [2] design a two-layer network with a base layer and a detail layer. The base layer mainly captures the low-frequency information of the image using a low-pass filter, while in the detail layer a CNN is designed to obtain the high-frequency information. They enhance the outputs of both layers and then combine them to obtain the clean image. Motivated by the success of deep residual networks, Fu et al. [3] concentrate on high-frequency detail by learning the residual between the rainy image and the clear image. Afterwards, considering the different scales, directions and shapes of rain streaks, Li et al. [11] adopt a dilated convolutional neural network to acquire a large receptive field and use a recurrent structure. Fu et al. [1] construct lightweight pyramid frameworks to exploit multi-scale knowledge for deraining. In Yang's work [26], contextualized dilated networks are constructed to aggregate context information at multiple scales for learning rain features. Jiang et al. [8] propose to first generate a Gaussian pyramid and then fuse the multi-scale information. In our work, we instead propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than that in the pixel domain. Since many modules are used in these networks, they become too complicated to analyze each module's function; Ren et al. [17] therefore provide a simple and strong baseline using an LSTM block and several residual blocks. For better handling of real-world rainy images, several further methods have also been proposed [21], [6], [28].

III. PROPOSED METHOD
In this section, we introduce in detail the proposed deepneural network for single image deraining.
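As a concrete sketch of the multi-stage estimation shown in Fig. 1, each stage predicts a rain layer from the original input and the previous stage's output, and subtracts it from the input. This is a schematic outline only; `estimate_rain` is a hypothetical placeholder standing in for the learned modules detailed below.

```python
import numpy as np

def estimate_rain(x, x_prev):
    """Hypothetical stand-in for the learned per-stage rain estimator,
    which in the real network sees the concatenation of X and X_{k-1}."""
    return np.maximum(x_prev - np.median(x_prev), 0.0) * 0.5

def derain(x, n_stages=6):
    """Multi-stage scheme of Fig. 1: X_0 = X, then X_k = X - R_k."""
    x_k = x.copy()                      # X_0 = X
    results = []
    for _ in range(n_stages):
        r_k = estimate_rain(x, x_k)     # R_k from (X, X_{k-1})
        x_k = x - r_k                   # X_k = X - R_k
        results.append(x_k)
    return x_k, results

x = np.random.rand(32, 32)              # a toy single-channel "rainy" input
final, intermediates = derain(x)
print(final.shape, len(intermediates))  # (32, 32) 6
```

Note that every stage subtracts its rain estimate from the original input X rather than from the previous result, which matches the per-stage update used by the network.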
Fig. 2: Illustration of rain streak layer estimation. (a) and (c) are the input rainy images; (b) and (d) are the rain streak layers estimated by our network.
A. Network Architecture
The overall architecture of our proposed deraining network is illustrated in Fig. 1. Our scheme tackles the deraining problem in a multi-stage manner, processing the input rain image X and the intermediate deraining results {X_k}, k = 0, ..., n-1, to generate the output clean image X_n progressively, where at the beginning X_0 = X. The core module, named cross-scale feature aggregation (CFA), is recursively conducted in each stage with the same network parameters, and is tailored to capture rich cross-scale information of the input data. We connect CFA modules at adjacent stages with convolutional LSTM units that propagate information across the stages. In one CFA module, to estimate the rain streak layer R_k, we learn multi-scale features {F(s)} at three scales (↓1, ↓2, ↓4), and design a scale-space invariant attention network to derive the importance masks {M(s)}. In this way, we identify salient regions in the input data for the network to focus on, i.e., {F_a(s) = F(s) + F(s) ⊙ M(s)}, which helps improve the modeling ability of the proposed deraining network. The illustration of rain streak layer estimation is shown in Fig. 2, where it can be seen that our scheme estimates the rain streak layer very well.

B. Cross-scale Feature Aggregation
According to scale-space theory [13], real-world objects have a multi-scale nature: they exist as meaningful entities over certain ranges of scales. This implies that the perception of objects depends on the scale of observation. For images of unknown scenes, the relevant scales are unlikely to be known a priori; the only reasonable approach is to represent the image data at multiple scales [13]. In rain images, the notion of "scale" deals with visual changes in the rain streaks' appearance with respect to their distance from the camera. Inspired by scale-space theory, for the task of single image deraining we consider learning multi-scale features to capture a rich representation of the image data. A straightforward approach is to build a coarse-to-fine pyramid in the pixel domain, as done in [1] and [8]. However, the image representation in the pixel domain is not compact: the downsampling operation, by decreasing the number of pixels, results in a loss of information about the key characteristics for identifying salient features in images.

Fig. 3: Cross-scale Feature Aggregation

Instead, in our work, we propose to construct the multi-scale representation in the feature domain, which is more compact and robust than that in the pixel domain. Specifically, in the k-th stage, we concatenate the input rain image X and the deraining result X_{k-1} of the last stage as the input, which is fed into a convolutional layer to extract feature maps:

F = Conv(X ⊕ X_{k-1})    (1)

where ⊕ denotes the concatenation operator and Conv(·) represents one convolutional layer. Note that here we omit the parameters of Conv(·), which belong to the parameters of the overall network. Then we build a multi-scale pyramid of CNN features through average pooling operations and one convolutional layer to obtain the multi-scale features:

F(s) = Conv(Pooling(F, s))    (2)

where Pooling(·, ·) represents the pooling operator with size s = 1, 2, 4, which means that the receptive field of each scale is of size 1×1, 2×2 and 4×4, with a stride of 1, 2 and 4, respectively.

To improve the discriminative ability of our network, we do not treat the extracted features equally, but design a novel attention mechanism, coupled with the multi-scale feature extraction, to help the network focus on parts of the features. In this way, we summarize the most activated presence of feature maps, which are expected to be those of rain streaks. As illustrated in Fig. 3, for scale s, the proposed scale-space invariant attention network (SIAN) is exploited to derive the importance mask M(s) ∈ [0, 1], according to which we identify salient changes in the latent CNN features:

F_a(s) = F(s) + F(s) ⊙ M(s)    (3)

where ⊙ denotes element-wise multiplication. Finally, all salient features are concatenated together to form the cross-scale feature:

F_a = F_a(1) ⊕ Up(F_a(2) ⊕ Up(F_a(4)))    (4)

where Up(·) denotes the upsampling operator with factor 2. After one convolutional layer followed by a ReLU activation, F_a is passed into a ResGroup, which contains two Resblocks and a convolutional layer, to obtain the estimate of the rain streak layer R_k:

R_k = ResGroup(F_a)    (5)

The deraining result of the k-th stage is:

X_k = X - R_k    (6)

Fig. 4: Scale-space Invariant Attention Network

C. Scale-space Invariant Attention Network
In this subsection, we elaborate on how we design a powerful attention mechanism that can reveal latent information from features captured at different scales. The proposed scale-space invariant attention mechanism is inspired by the classical SIFT feature extractor [15]. SIFT achieved great success in feature extraction and description before the rise of deep learning; it is invariant to image scale and rotation, and partly invariant to affine distortion, addition of noise, and changes in illumination. These properties make the underlying wisdom of SIFT particularly enlightening for the task of deraining, in which rain streaks likewise exhibit multiple scales, various rotations and affine transformations, while rain images also suffer from noise and low-contrast artifacts.

SIFT has four major stages of computation: 1) scale-space extrema detection, 2) keypoint localization, 3) orientation assignment, and 4) keypoint description. For the design of the attention network, we only care about the first two stages. A keypoint, i.e., an extremum, being a salient change in scale-space, is naturally defined as the attention. To detect extrema, a scale-space is first constructed over the CNN features. As illustrated in Fig. 4, the SIAN architecture includes three octaves, each of which corresponds to a scale-space with various smoothness levels indicated by the smoothing parameter σ and the scaling factor s·k^{l-1}, where l = 1, ..., 6 and s = 1, 2, 4.

Each octave O(s) includes six layers. We define the first layer in O(s) as f(s). The l-th layer in O(s) is then defined as:

L(f(s), s·k^{l-1}σ) = G(x, y, k^{l-1}σ) ⊗ f(s)(x, y)    (7)

where G(x, y, σ) is the Gaussian kernel:

G(x, y, σ) = (1 / (2πσ²)) · e^{-(x² + y²) / (2σ²)}    (8)

Here σ is set to 1.6 in our implementation. For the first octave O(1), the first layer f(1) is the Gaussian-smoothed version of the feature F defined in Eq. (1):

f(1) = G(x, y, σ′) ⊗ F    (9)

where σ′ is set to 1.52 in our implementation. For the remaining two octaves, the first layer is the pooled version of the third-from-last layer of the previous octave. Formally, the first layer of octave O(2) is

f(2) = Pooling(L(f(1), k³σ), 2)    (10)

and the first layer of octave O(4) is

f(4) = Pooling(L(f(2), k³σ), 2)    (11)

Note that the pooling operation used here is max pooling, which is beneficial to the subsequent local extrema detection.

The derived scale-space pyramid is coupled with the multi-scale pyramid used in feature extraction. Given the octaves, a difference-of-Gaussian (DoG) pyramid is created by taking the difference between adjacent layers:

D(f(s), s·k^l σ) = L(f(s), s·k^l σ) - L(f(s), s·k^{l-1} σ)    (12)

According to the DoG pyramid, we then detect the local extrema, i.e., the salient features. We no longer do this by comparing a point against its 26 neighbors in the spatial and scale domains, as done in SIFT, but instead turn to a learning approach. Specifically, we concatenate all the DoG layers in octave O(s), pass them through a convolutional layer followed by a ReLU activation, and finally apply the sigmoid function to obtain the attention mask M(s).

Fig. 5: Progressive Refinement by LSTM

D. Progressive Refinement by LSTM
To further improve the network performance, similar to PReNet [17], we connect adjacent stages with convolutional LSTM (ConvLSTM) units that propagate information from the previous stage [24]. LSTM [5] is good at handling time-sequence data. An LSTM contains three gates, namely the input gate, the forget gate and the output gate, together with a cell state. The key component in ConvLSTM is the cell state, which encodes the state information to be propagated to the next LSTM. In our work, as illustrated in Fig. 5, after obtaining F_a(s) according to Eq. (3), it is updated by passing through two convolutional layers, and serves as the input of the next ConvLSTM block. In Fig. 6, we provide an illustration of the effect of progressive refinement by LSTM. It can be found that the quality of the derained image improves as the stage number increases.

Fig. 6: Illustration of progressive refinement. The subjective and objective (PSNR/SSIM) results of six stages are provided: (a) Rainy (21.2dB/0.727), (b) Groundtruth, (c) Stage=1 (19.5dB/0.704), (d) Stage=2 (21.1dB/0.765), (e) Stage=3 (22.0dB/0.787), (f) Stage=4 (22.5dB/0.800), (g) Stage=5 (22.8dB/0.804), (h) Stage=6 (23.0dB/0.807). The quality of the derained image is improved gradually stage-by-stage.

Algorithm 1 Network Training Flow.
Input: Rainy images {X^i}; corresponding ground truth {Y^i}; network with initial parameters θ; initial learning rate γ;
Output: Network parameters θ*
for j = 1 : num_epochs do
  Pick a training batch {(X^i, Y^i)}, i = 1, ..., m.
  for i = 1 : m do
    X^i_0 = X^i, L(θ) = 0
    for k = 1 : 6 do
      F = Conv(X^i ⊕ X^i_{k-1});
      F(s) = Conv(Pooling(F, s)), s = 1, 2, 4;
      Generate attention mask M(s) by SIAN;
      F_a(s) = F(s) + F(s) ⊙ M(s);
      Update F_a(s) by passing it through ConvLSTM;
      F_a = F_a(1) ⊕ Up(F_a(2) ⊕ Up(F_a(4)));
      R^i_k = ResGroup(F_a);
      X^i_k = X^i - R^i_k;
      L(θ) = L(θ) - SSIM(X^i_k, Y^i);
    end for
    θ = θ - γ · A(∇_θ L(θ));
  end for
end for
θ* = θ.

E. Network Training
The network parameters θ comprise the kernel weights of the convolutional layers. Training is conducted on several public benchmarking datasets, shown in Table I; these are synthesized data and thus provide many pairs of rain images {X^i} and corresponding ground truth {Y^i}.

TABLE I: Benchmarking datasets used for network training as well as performance evaluation

Datasets        Sample Number (Train/Test)   Description
Rain12 [12]     12                           Only for testing
Rain100L [25]   200/100                      Synthesized with one type of rain streaks (light rain case)
Rain100H [25]   1,800/100                    Synthesized with five types of rain streaks (heavy rain case)
Rain1400 [3]    12,600/1,400                 1,000 clean images used to synthesize 14,000 rain images
TABLE II: Performance comparison of different loss functions

Loss            Rain100L          Rain100H
                PSNR    SSIM      PSNR    SSIM
MAE             38.54   0.981     30.12   0.904
MSE             38.52   0.981     30.03   0.902
Negative SSIM
For each image, we compute its accumulated negative SSIM loss [22] over the outputs of all K stages. As shown in Table II, the negative SSIM loss works better than the popular MAE and MSE losses. Traversing all training samples, the final training loss is:

L(θ) = - Σ_{i=1}^{N} Σ_{k=1}^{K} SSIM(X^i_k, Y^i)    (13)

where N is the number of training samples. The optimal parameters θ* are obtained by:

θ* = arg min_θ L(θ)    (14)

This minimization problem is addressed by ADAM [9]:

θ = θ - γ · A(∇_θ L(θ))    (15)

where γ is the learning rate and A(∇_θ L(θ)) represents the update value computed by ADAM. The whole network training flow is summarized in Algorithm 1.

Our network is trained on an NVIDIA GTX 1080Ti. The image patch size for all datasets is set to ×, and the batch size is set to 18. We extract image patches with strides 40, 80 and 100 for Rain100L, Rain100H and Rain1400, respectively. For Rain100L, we perform data augmentation by horizontal flipping. The training epochs for Rain100L, Rain100H and Rain1400 are set to 100, 100 and 50, respectively. The learning rate γ is set to 2e-4.

IV. EXPERIMENTS
In this section, extensive quantitative and qualitative results are provided to demonstrate the superior performance of the proposed method. An ablation study is also offered to promote a deeper understanding of our network.
A. Evaluation on Synthetic Datasets

1) Comparison with the state-of-the-arts:
Our method is comprehensively compared with state-of-the-art model-based and deep learning based works on the synthetic benchmarking datasets shown in Table I. The comparison group includes:

• Model-based: 1) Discriminative Sparse Coding, DSC [16]; 2) Gaussian Mixture Model, GMM [12]; 3) Joint Convolutional Analysis and Synthesis Sparse Representation, JCAS [4];
• Deep Learning based: 1) Clear [2]; 2) Deep Detail Network, DDN [3]; 3) Recurrent Squeeze-and-Excitation Context Aggregation Net, RESCAN [11]; 4) Progressive Recurrent Network, PReNet [17]; 5) Spatial Attentive Network, SPANet [21]; 6) Enhanced JOint Rain DEtection and Removal, JORDER-E [26]; 7) Semi-supervised Image Rain Removal, SIRR [23]; 8) Lightweight Pyramid Networks, LPNet [1].

We follow the same experimental settings as introduced in [20], [27]. Peak signal-to-noise ratio (PSNR) and SSIM are used for quantitative performance evaluation. We only consider the luminance channel, since it has the most significant impact on the human visual system when evaluating image quality. We adopt the numerical results reported in [20].

The quantitative evaluation results on Rain100L, Rain100H, Rain1400 and Rain12 are presented in Table III and Table IV, respectively. It can be found that, compared with the deep learning based methods, the three model-based methods (DSC, GMM and JCAS) achieve relatively low PSNR and SSIM values, due to their limited modeling ability. Deep learning based methods achieve great success in single image deraining: for instance, compared with the best-performing model-based method GMM, JORDER_E improves the PSNR by over 9dB on Rain100L. Among all the compared data-driven methods, our proposed scheme achieves the best PSNR and SSIM performance on the various datasets. The PSNR gains over the second best method are 0.19dB, 0.22dB, 0.12dB and 0.55dB on Rain100L, Rain100H, Rain1400 and Rain12, respectively.
These results demonstrate the superiority of our work.

We also provide a qualitative evaluation through visual quality comparison. The example images cover various scenarios, including light rain streaks, large rain streaks and dense rain accumulation. As illustrated in Fig. 7-Fig. 9, the model-based method JCAS cannot remove the rain streaks well; in its deraining results, most rain streaks remain. LPNet, which builds a coarse-to-fine pyramid in the pixel domain to exploit the multi-scale correlation, cannot preserve the image structures well; as shown in Fig. 7, it fails to remove heavy rain streaks. PReNet employs a progressive refinement manner similar to ours; however, as shown in Fig. 8, it leads to an oversmoothing effect in the building regions. JORDER_E performs second best in the quantitative evaluation, but in the subjective evaluation a rain trace is still visible in the sky region of Fig. 7, and it suffers from oversmoothing in Fig. 8, as PReNet does.

TABLE III: The quantitative evaluation results with respect to PSNR (dB)/SSIM on Rain100L and Rain100H. The best and the second best are highlighted in bold and underline.

Datasets                  Rain100L         Rain100H
Metrics                   PSNR    SSIM     PSNR    SSIM
Input                     26.90   0.838    13.56   0.371
DSC [16] (ICCV'15)        27.34   0.849    13.77   0.320
GMM [12] (CVPR'16)        29.05   0.872    15.23   0.450
JCAS [4] (ICCV'17)        28.54   0.852    14.62   0.451
Clear [2] (TIP'17)        30.24   0.934    15.33   0.742
DDN [3] (CVPR'17)         32.38   0.926    22.85   0.725
RESCAN [11] (ECCV'18)     38.52   0.981    29.62   0.872
PReNet [17] (CVPR'19)     37.45   0.979
SIRR [23] (CVPR'19)       32.37   0.926    22.47   0.716
LPNet [1] (TNNLS'20)      33.40   0.960    23.40   0.820
Ours

TABLE IV: The quantitative evaluation results with respect to PSNR (dB)/SSIM on Rain1400 and Rain12. The best and the second best are highlighted in bold and underline.

Datasets                  Rain1400         Rain12
Metrics                   PSNR    SSIM     PSNR    SSIM
Input                     25.24   0.810    30.14   0.856
DSC [16] (ICCV'15)        27.88   0.839    30.07   0.866
GMM [12] (CVPR'16)        27.78   0.859    32.14   0.916
JCAS [4] (ICCV'17)        26.20   0.847    33.10   0.931
Clear [2] (TIP'17)        26.21   0.895    31.24   0.935
DDN [3] (CVPR'17)         28.45   0.889    34.04   0.933
RESCAN [11] (ECCV'18)     32.03   0.931    36.43   0.952
PReNet [17] (CVPR'19)     32.55
SIRR [23] (CVPR'19)       28.44   0.889    34.02   0.935
LPNet [1] (TNNLS'20)      -       -        34.7    0.95
Ours
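For reference, the quantitative protocol described above (PSNR computed on the luminance channel) can be reproduced as follows. This is a generic sketch, not the evaluation code used in the paper; the BT.601 luma weights are a common choice and an assumption here.

```python
import numpy as np

def luminance(rgb):
    """BT.601 luma of an 8-bit RGB image (values in [0, 255])."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-size images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((32, 32, 3), 128.0)     # flat gray reference
noisy = ref + 10.0                    # uniform error of 10 gray levels
print(round(psnr(luminance(ref), luminance(noisy)), 2))   # 28.13
```

SSIM, the other metric reported in the tables, additionally compares local luminance, contrast and structure statistics [22] rather than raw pixel error, which is why it is computed with windowed statistics in standard implementations.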
B. Evaluation on Real Rainy Scenes
We further investigate the performance of our method in real deraining cases. In Fig. 10, we show the subjective comparison results on two real rainy images against PReNet and JORDER_E. It can be found that, for the first image, ours and JORDER_E achieve better results than PReNet, which generates large oversmoothed regions on the roof. For the second image, the result of our scheme is clearer than those of the other two methods.
Fig. 7: Visual deraining results on a sample from Rain100L: (a) Input (21.17dB/0.727), (b) JCAS (24.33dB/0.809), (c) LPNet (24.60dB/0.876), (d) JORDER_E (42.49dB/0.988), (e) PReNet (40.23dB/0.987), (f) RESCAN (40.98dB/0.987), (g) Ours (42.93dB/0.991), (h) Groundtruth.

Fig. 8: Visual deraining results on a sample from Rain100H: (a) Input (13.55dB/0.482), (b) JCAS (15.62dB/0.570), (c) LPNet (28.03dB/0.916), (d) JORDER_E (24.95dB/0.878), (e) PReNet (24.39dB/0.863), (f) RESCAN (23.31dB/0.862), (g) Ours (28.03dB/0.916), (h) Groundtruth.

Fig. 9: Visual deraining results on samples from Rain1400 and Rain12: (a) Input (21.59dB/0.771), (b) PReNet (29.76dB/0.908), (c) Ours (29.93dB/0.908), (d) Groundtruth; (e) Input (28.48dB/0.798), (f) PReNet (35.22dB/0.936), (g) Ours (36.08dB/0.948), (h) Groundtruth.

Fig. 10: Visual deraining results on real rainy images: (a)/(e) Input, (b)/(f) PReNet, (c)/(g) JORDER_E, (d)/(h) Ours.
C. Ablation Study
The main modules of our scheme include the multi-scale feature extraction, the scale-space invariant attention network (SIAN), and the LSTM-based progressive refinement. In this subsection, we provide an ablation analysis to show the contributions of these modules to the final performance. We define the ablation study group as:

• Baseline: only the first stage of Fig. 1 is used and {M(s)} = 0, i.e., no attention mechanism is employed;
• Baseline+LSTM: all stages of Fig. 1 are used and {M(s)} = 0;
• Baseline+SIAN+LSTM: the complete form of Fig. 1.

As shown in Table V, Baseline+LSTM works better than Baseline, which demonstrates that the strategy of progressive refinement is helpful. Compared with Baseline+LSTM, Baseline+SIAN+LSTM further improves the PSNR and SSIM performance, which demonstrates that the proposed attention network indeed improves the modeling ability of the network.

TABLE V: The ablation study on the contribution of each module to the final performance

Method                   Rain100L         Rain100H
                         PSNR    SSIM     PSNR    SSIM
Baseline                 37.92   0.980    28.55   0.893
Baseline+LSTM            38.68   0.982    30.20   0.906
Baseline+SIAN+LSTM
V. CONCLUSION

In this work, we presented a novel single image deraining scheme based on scale-aware deep neural networks. To aggregate features from multiple scales into our rain streak prediction, we developed a new scale-space invariant attention mechanism that learns a set of importance masks, one for each scale. Experimental results show that our proposed method achieves state-of-the-art performance with respect to both quantitative and qualitative evaluations.
REFERENCES

[1] X. Fu, B. Liang, Y. Huang, X. Ding, and J. Paisley. Lightweight pyramid networks for image deraining. IEEE Transactions on Neural Networks and Learning Systems, pages 1-14, 2019.
[2] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Transactions on Image Processing, 26(6):2944-2956, 2017.
[3] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding, and J. Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855-3863, 2017.
[4] S. Gu, D. Meng, W. Zuo, and L. Zhang. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1708-1716, 2017.
[5] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[6] X. Hu, C.-W. Fu, L. Zhu, and P.-A. Heng. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8022-8031, 2019.
[7] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, (11):1254-1259, 1998.
[8] K. Jiang, Z. Wang, P. Yi, C. Chen, B. Huang, Y. Luo, J. Ma, and J. Jiang. Multi-scale progressive fusion network for single image deraining. pages 1-8, 2020.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv, 2014.
[10] M. Li, X. Cao, Q. Zhao, L. Zhang, C. Gao, and D. Meng. Video rain/snow removal by transformed online multiscale convolutional sparse coding. arXiv, 2019.
[11] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254-269, 2018.
[12] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown. Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2736-2744, 2016.
[13] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, USA, 1994.
[14] J. Liu, W. Yang, S. Yang, and Z. Guo. Erase or fill? Deep joint recurrent rain removal and reconstruction in videos. In , 2018.
[15] D. G. Lowe. SIFT: the scale invariant feature transform. Int. J, 2:91-110, 2004.
[16] Y. Luo, Y. Xu, and H. Ji. Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pages 3397-3405, 2015.
[17] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3937-3946, 2019.
[18] J. T. Tou and R. C. Gonzalez. Pattern Recognition Principles. 1974.
[19] P. Viola, M. Jones, et al. Robust real-time object detection. International Journal of Computer Vision, 4(34-47):4, 2001.
[20] H. Wang, Y. Wu, M. Li, Q. Zhao, and D. Meng. A survey on rain removal from video and single image. arXiv preprint arXiv:1909.08326, pages 1-8, 2019.
[21] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. H. Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12270-12279, 2019.
[22] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 2004.
[23] W. Wei, D. Meng, Q. Zhao, Z. Xu, and Y. Wu. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3877-3886, 2019.
[24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802-810, 2015.
[25] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1357-1366, 2017.
[26] W. Yang, R. T. Tan, J. Feng, J. Liu, S. Yan, and Z. Guo. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[27] W. Yang, R. T. Tan, S. Wang, Y. Fang, and J. Liu. Single image deraining: From model-based to data-driven and beyond. arXiv preprint arXiv:1912.07150, pages 1-8, 2019.
[28] H. Zhang and V. M. Patel. Density-aware single image de-raining using a multi-stream dense network. In