Input Dropout for Spatially Aligned Modalities
Sébastien de Blois, Mathieu Garon, Christian Gagné*, Jean-François Lalonde
Université Laval
* Canada-CIFAR AI Chair
ABSTRACT
Computer vision datasets containing multiple modalities, such as color, depth, and thermal properties, are now commonly accessible and useful for solving a wide array of challenging tasks. However, deploying multi-sensor heads is not possible in many scenarios. As such, many practical solutions tend to be based on simpler sensors, mostly for cost, simplicity, and robustness considerations. In this work, we propose a training methodology to take advantage of these additional modalities available in datasets, even if they are not available at test time. By assuming that the modalities have a strong spatial correlation, we propose Input Dropout, a simple technique that consists in stochastic hiding of one or many input modalities at training time, while using only the canonical (e.g. RGB) modalities at test time. We demonstrate that Input Dropout trivially combines with existing deep convolutional architectures, and improves their performance on a wide range of computer vision tasks such as dehazing, 6-DOF object tracking, pedestrian detection, and object classification.
Index Terms — Machine learning, Deep learning, Computer vision, Dropout, Dehazing, Tracking, Classification, Detection
1. INTRODUCTION
The use of deeper networks and data-hungry algorithms to solve challenging computer vision problems has created the need for ever richer datasets. In addition to common image datasets such as the famed ImageNet [6], datasets containing multiple modalities have also been collected to address a variety of problems ranging from depth estimation [21], indoor scene understanding [28], 6-DOF tracking [9], multispectral object detection [15], and autonomous driving [10, 4] to haze removal [23], to name just a few. More generally, learning from multiple modalities has been explored to determine which ones are useful [27], and multiple ways of combining them have been proposed [11, 14, 20, 26].

Training deep learning models on additional modalities typically means that these extra modalities must also be available at test time. Unfortunately, capturing more modalities requires significant time and effort. Adding sensors alongside an RGB camera results in increased power consumption, less portable setups, the need to carefully calibrate and synchronize each sensor, as well as additional constraints on bandwidth and storage requirements. This may not be practical for many applications, including augmented reality, robotics, and wearable and mobile computing, where these physical constraints preclude the use of additional sensors.

Fig. 1: The Input Dropout strategy for a given RGB image and an additional modality (e.g. depth, in orange). At training time, the additional modality is concatenated to the RGB image, and with a given probability either the RGB modality or the additional modality is set to 0 (black). At test time (middle), the additional modality is always unavailable (i.e., set to 0). In training, the addit mode uses only the two leftmost cases, while the both mode uses all three cases.

This dichotomy between the advantage brought by additional modalities and the impediment they impose on real systems has attracted attention in the literature. Can we train on additional modalities without relying on them at test time? In their "learning with privileged information" paper, Vapnik et al. [25] introduce a theoretical framework which shows that this may indeed be feasible. Practical techniques have since been introduced, but they tend to be targeted towards specific network architectures and applications. For example, "modality hallucination" [13] and its variants [7, 8] train networks on different modalities independently, and show that changing the input modality of one of the networks while forcing the latent space to keep its former structure improves convergence. In [22], the authors propose to independently process multiple modalities in parallel branches within a network, and to fuse the resulting feature maps using so-called "modality dropout" to make the network invariant to missing modalities. Despite improving performance, these methods are complex to implement, may require multiple training steps, and must be adapted differently to each problem.

In this paper, we propose a technique for exploiting additional modalities at training time, without having to rely on them at test time. In contrast to previous work which requires task-specific architectures [22] or multiple training passes [8], our approach is extremely simple (it can be implemented in a few lines of code), is independent of the learning architecture used, and does not require any additional training pass. Assuming that modalities are spatially aligned and share the same spatial resolution, we propose to randomly drop out [24] entire input modalities at training time. At test time, the missing modality is simply set to 0. We demonstrate that our proposed strategy, Input Dropout, can be leveraged to obtain gains of 2–20% over training on RGB only, on a variety of applications.
2. INPUT DROPOUT

Assumptions. We assume that all input modalities are spatially aligned and can be represented as additional channels of the same input image. In our experiments, we also assume that the RGB modality is the only modality available at test time; the other modality is therefore never available during testing.
Approach. Our proposed Input Dropout strategy is illustrated in fig. 1. The additional modality is first channel-wise concatenated to the RGB image, and the resulting tensor is fed as input to the neural network. The first convolutional layer of the network must be adapted to this new input dimensionality (c.f. sec. 4). At training time, one of the input modalities is randomly set to 0 with probability P_drop ∈ [0, 1]. This effectively "drops out" [24] the corresponding modality. At test time, the additional modality is always set to 0. Implementing Input Dropout requires only a few lines of PyTorch code.

Since we assume a single additional modality is combined with an RGB image, we are faced with two options. We can randomly drop only the additional modality and always keep the RGB (we dub this option addit), or drop either the RGB or the additional modality (both). In both cases, a uniform probability distribution over the possible cases is used. For the addit mode, the probability of dropping the additional modality is therefore P_drop = 0.5. For the both mode, the probability of dropping either the RGB or the additional modality is P_drop = 2/3 (each of the three cases of fig. 1 being equally likely).

Our method is mainly related to "modality dropout" [22], which fuses the modalities in a learned latent space. Its main limitation is that specialized network branches must be learned for each modality, which adds complexity. In contrast, our method can be used on existing convolutional architectures with very little change. We compare to [22] in sec. 4.
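A minimal PyTorch sketch of this procedure is given below. The function name, the channel layout (RGB first, additional modality appended), and the per-batch drop decision are illustrative assumptions; the paper only specifies that entire modalities are zeroed with the probabilities above.

```python
import torch

def input_dropout(x, n_rgb=3, mode="both", training=True):
    """Zero out entire modalities of a (B, C, H, W) input tensor.

    Channels [0:n_rgb] are assumed to hold RGB; the remaining channels hold
    the spatially aligned additional modality (e.g. depth).
    """
    x = x.clone()
    if not training:
        # At test time the additional modality is unavailable: always set it to 0.
        x[:, n_rgb:] = 0
        return x

    # Uniform choice over the possible cases (fig. 1). The decision is made
    # once per batch here for simplicity; a per-sample decision is also valid.
    cases = ["keep_all", "drop_additional"]
    if mode == "both":
        cases.append("drop_rgb")
    case = cases[torch.randint(len(cases), (1,)).item()]

    if case == "drop_additional":
        x[:, n_rgb:] = 0
    elif case == "drop_rgb":
        x[:, :n_rgb] = 0
    return x
```

The returned tensor is then fed unchanged to the network, whose first layer has been widened to accept the extra channels (c.f. sec. 4).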
3. INPUT DROPOUT FOR IMAGE DEHAZING
We first experiment with Input Dropout on single image dehazing [17, 29], with depth (RGB+D) as the additional modality available at training time only. For this, we employ the D-Hazy dataset [1], which contains 1449 pairs of RGB+D images where haze is synthetically added to images from the NYU Depth dataset [21]. We use 1180 images for training, 69 for validation, and 200 for testing. Our model is similar to [17], the only difference being that the generator is a ResNet (with nine blocks) as in [16]. Similar to [17, 29], the network is trained on a combination of a GAN loss, a pixel-wise L1 loss, and a perceptual loss [16] to preserve the sharpness of the image:

L_generator = L_GAN + λ_1 L_1 + λ_2 L_percep ,   (1)

where λ_1 = λ_2 = 10 (obtained with a grid search on the validation set). At training time, Input Dropout uses the addit mode: it does not make sense to drop the RGB image, since that would amount to recovering a haze-free image from a depth map alone. We also experiment on single image dehazing using segmentation (RGB+S) as an additional training modality. Here we use the Foggy Cityscapes dataset [23], an extension of Cityscapes [4] which contains ground-truth scene segmentations. The same network and training procedure are used.
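For illustration, a hedged PyTorch sketch of the combined objective in eq. (1) is given below. Here, gan_loss and perceptual_loss are placeholder callables standing in for the adversarial and feature-space terms of [16, 17], whose exact formulations are not reproduced.

```python
import torch.nn as nn

l1_loss = nn.L1Loss()

def generator_objective(pred, target, disc_out_fake,
                        gan_loss, perceptual_loss,
                        lambda_1=10.0, lambda_2=10.0):
    """Eq. (1): L_generator = L_GAN + lambda_1 * L_1 + lambda_2 * L_percep."""
    adv = gan_loss(disc_out_fake)        # adversarial term (generator side)
    pix = l1_loss(pred, target)          # pixel-wise L1 term
    per = perceptual_loss(pred, target)  # perceptual (feature-space) term
    return adv + lambda_1 * pix + lambda_2 * per
```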
Quantitative dehazing results with Input Dropout are provided in tab. 1, and corresponding representative qualitative results in fig. 2. Over the RGB-only baseline, relative improvements of 3.6% and 3.4% on PSNR and SSIM respectively are observed when using Input Dropout on RGB+D, and of 4.5% on PSNR and 2.2% on SSIM for RGB+S. We also compare our method to competing techniques: "Dehazing for segmentation" (D4S) [5], which learns to dehaze so as to improve a subsequent task using a modality only available during training, and Pix2Pix GAN [3], which employs an extra generator to produce the missing modality from the RGB image. In every case, Input Dropout performs better while being simpler than the other approaches. Note that we have not compared our approach to "modality distillation" [8] here since the method cannot be applied to this scenario: it would involve training a network to dehaze a depth (or segmentation) image, which would require hallucinating scene contents.
4. INPUT DROPOUT FOR CLASSIFICATION
We evaluate the use of Input Dropout for image classification using RGB+D training data. For this, we rely on the methodology proposed by Garcia et al. [8], who use crops of individual objects from the NYU V2 dataset [21], adapted by [13] for object classification using RGB+D. We use the same split as in [8]: 4,600 RGB-D images in total, where around 50% are used for training and the remainder for testing. Here, we rely on a ResNet-34 [12], initialized with weights pretrained on ImageNet [6]. To adapt the pretrained ResNet-34 to use Input Dropout, we append additional channels to the filters of the first convolution layer and initialize the new weights randomly. Doing so preserves the pretrained weights for the RGB modality.
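A minimal sketch of this first-layer adaptation for a torchvision ResNet-34 is shown below. The helper name and the Kaiming initializer for the new channels are our own choices; the paper only states that the extra weights are initialized randomly.

```python
import torch
import torch.nn as nn
import torchvision

def adapt_first_conv(model, n_extra):
    """Widen the first conv layer by n_extra input channels, keeping the RGB weights."""
    old = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
    new = nn.Conv2d(3 + n_extra, old.out_channels,
                    kernel_size=old.kernel_size, stride=old.stride,
                    padding=old.padding, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight               # keep the pretrained RGB filters
        nn.init.kaiming_normal_(new.weight[:, 3:])   # random init for the new channels
    model.conv1 = new

model = torchvision.models.resnet34(pretrained=True)  # or the weights= argument in newer torchvision
adapt_first_conv(model, n_extra=1)  # e.g., one extra channel for a depth map
```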
Fig. 2: Qualitative examples for dehazing RGB images from (top row) D-Hazy [1] and (bottom row) Foggy Cityscapes [23, 4]. From left to right: hazy input, ground-truth haze-free image, result when trained on RGB only, result with Input Dropout, and absolute difference between the 3rd and 4th columns, shown using a color map ranging from blue (low) to yellow (high).

Table 1: Quantitative results for single image dehazing using an additional depth (RGB+D) and segmentation (RGB+S) modality at training time. Results are reported on the D-Hazy dataset [1] for RGB+D and the Foggy Cityscapes dataset [23] for RGB+S. For each technique, the average over five different training runs is reported. In all scenarios, Input Dropout, despite its simplicity, is the technique that provides the largest improvement over the RGB-only baseline. RGB+D: statistically significant results are in bold, Wilcoxon–Mann–Whitney (WMW) test with an α of 0.025. RGB+S: statistically significant results are in bold, WMW test with an α of 0.05 for PSNR and 0.025 for SSIM.

                     RGB+D            RGB+S
Methods              PSNR    SSIM     PSNR    SSIM
RGB-only             17.61   0.74     23.55   0.91
D4S [5]              N/A     N/A      23.90   0.92
Pix2Pix GAN [3]      17.70   0.75     22.90   0.91
Input Dropout

Tab. 2 shows the quantitative classification accuracy obtained with the various methods. First, we report results when the depth modality is available at test time, to provide an upper bound on performance. Next, we evaluate training a single network on the RGB modality only ("RGB-only"), the approach of [3], which relies on a GAN to hallucinate the depth at test time, and our Input Dropout strategy (in the addit mode). Our approach provides the best results, despite being the simplest. We further compare to ensemble methods. First, two networks trained on RGB only, with their answers averaged before the argmax, yield an absolute performance improvement of 3.2% over the single-network baseline. The "modality distillation" approach of Garcia et al. [8] relies on a combination of two networks: one trained on RGB only, and another, so-called "hallucination" network. That second network is trained to produce a latent representation similar to that of a proxy network trained on the depth modality only. The final output is the mean of the RGB-only and hallucination networks. We reimplemented their approach in PyTorch to ensure a direct comparison with our results, which yields a 1.8% absolute improvement over the RGB+RGB baseline.

We directly compare our technique to "modality distillation" [8] by using one network trained on RGB only, and another network trained on RGB+D with Input Dropout (instead of their "hallucination" network). This yields approximately the same performance as "modality distillation" [8], despite being simpler to train, requiring a single network architecture and a single training pass (i.e., both networks can be trained in parallel, while they must be trained sequentially for [8]).
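For clarity, the ensembling used in these comparisons simply averages the two networks' outputs before the argmax; a small sketch follows (averaging logits rather than softmax probabilities is our assumption).

```python
import torch

def ensemble_predict(logits_a, logits_b):
    """Average two classifiers' outputs, then take the argmax (class index per sample)."""
    return ((logits_a + logits_b) / 2.0).argmax(dim=1)
```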
Table 2: Classification accuracies on the NYU V2 dataset adapted by [13]. Results are the average over five different training runs. Statistically significant results are in bold, WMW test with an α of 0.05 for no ensemble and 0.025 for ensemble.

Method                   Ensemble    Accuracy
RGB+D                    No          58.9%
Depth (D) only           No          57.0%
RGB only                 No          47.5%
Pix2Pix GAN [3]          No          48.2%
ModDrop [22]             No          44.3%
Input Dropout            No
RGB+RGB                  Yes         50.7%
Mod. distillation [8]    Yes
Input Dropout + RGB      Yes
5. OTHER APPLICATIONS
We evaluate Input Dropout (in the both mode) on two additional applications: tracking in RGB+D and pedestrian detection in RGB+thermal.

Table 3: Tracking error in translation and rotation with respect to the ratio of occlusion, on the dataset of Garon et al. [9]. We observe that Input Dropout improves most of the scenarios significantly. In translation, the error with Input Dropout stabilizes after 45% occlusion, and the average relative gain in rotation is 16.7%.

                  Translation (mm)      Rotation (degrees)
Occlusion %       0–30      45–75       0–30      45–75
RGB-only
Input Dropout
Relative gain     -2.2%     33.3%       20.5%     13.4%
We first focus on the problem of tracking 3D objects in 6 degrees of freedom (DOF). To do so, we employ the methodology of Garon et al. [9], who presented a technique for tracking a known 3D object in real time using synthetic RGB+D data. They also provide an evaluation dataset containing 297 real sequences captured with a Kinect V2, with ground-truth annotations of the 6-DOF poses of 11 different objects. Here, we focus on the "occlusion" scenario proposed by [9], where the objects are rotated on a turntable while being partially hidden by a planar occluder, with (measured) occlusion varying from 0% to 75%. We evaluate Input Dropout using the same CNN architecture as in [9]. For a given pose P = [R t], where t is the translation and R the rotation matrix, the translation error is defined by its L2 norm, and the rotation distance is computed with

δ_R(R_1, R_2) = arccos( (Tr(R_1^T R_2) − 1) / 2 ) ,   (2)

where Tr is the matrix trace [9].

Quantitative 6-DOF tracking results are reported in tab. 3. We observe that Input Dropout generally improves the results for the tracking task in translation, with a relative gain as high as 33.3% in the hardest sequences, and an average of 17% relative gain in rotation. The error reported is the average of 5 training runs for each method.
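A small PyTorch sketch of these two error measures follows; the function names and the conversion to degrees are our own (eq. (2) itself returns an angle in radians).

```python
import torch

def rotation_error_deg(R1, R2):
    """Geodesic rotation distance of eq. (2) between two 3x3 rotation matrices, in degrees."""
    cos_theta = (torch.trace(R1.t() @ R2) - 1.0) / 2.0
    cos_theta = torch.clamp(cos_theta, -1.0, 1.0)  # guard against numerical round-off
    return torch.rad2deg(torch.acos(cos_theta))

def translation_error(t1, t2):
    """L2 norm between two translation vectors (in the dataset's units, mm)."""
    return torch.norm(t1 - t2)
```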
We experiment with pedestrian detection on RGB+T (thermal) images using the KAIST Multispectral Pedestrian dataset [15]. The training/validation/test sets are composed of 16,000/1,100/3,500 pairs of thermal/visible images for nighttime and 32,000/1,500/8,500 for daytime. Here, we rely on RetinaNet [19], a state-of-the-art architecture for object detection. The RetinaNet is trained with a focal loss, using a ResNet-34 [12] and a Feature Pyramid Network (FPN) [18] as the backbone for feature extraction.
Table 4: Mean average precision (mAP) at an IoU of 0.5 for nighttime and daytime pedestrian detection with RGB+T, with and without Input Dropout, using RGB only at test time. Each column reports results for the corresponding test-time scenario. The RGB-only row trains on the test modality only, while Input Dropout uses both modalities at training time. Results are the average over five different training runs. The last row is the relative performance gain resulting from using Input Dropout. Statistically significant results are in bold, WMW test with an α of 0.01 for both nighttime and daytime.

Method           Nighttime    Daytime
RGB-only         0.228        0.351
Input Dropout
Relative gain    18.9%        15.1%

The RetinaNet is initialized with weights pretrained on ImageNet [6]. As in sec. 4, additional channels are appended to the filters of the first convolutional layer to preserve the weights learned on RGB. The network is then fine-tuned on the KAIST images until convergence on the validation set. To evaluate performance, we compute the mean average precision (mAP) with an intersection-over-union (IoU) score of 0.5. Tab. 4 shows the results of the experiments in both night- and daytime scenarios. We observe that our Input Dropout strategy yields improvements in all cases: nighttime RGB pedestrian detection improves by 18.9%, and daytime RGB pedestrian detection improves by 15.1%.
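As a reminder, a predicted box counts as a correct detection when its IoU with a ground-truth box is at least 0.5; a simple sketch of the IoU computation follows (the (x1, y1, x2, y2) box format is our assumption).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```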
6. DISCUSSION
We propose Input Dropout as a simple and effective strategy for leveraging additional modalities at training time which are not available at test time. We extensively test our technique on several applications (single image dehazing, object classification, 3D object tracking, and object detection) and on several additional modalities (depth, segmentation maps, and thermal images). In all cases, using Input Dropout in training yields improved performance at test time, even if the additional modality is unavailable. Our approach, which can be implemented in a few lines of code, can be used as a drop-in replacement with no change to the network architecture, aside from the addition of one extra input dimension to the first-layer filters. The main limitation of our approach is that we have experimented with adding only a single additional modality to the RGB baseline. In the future, we plan on exploring the applicability of the approach with more modalities.
7. ACKNOWLEDGEMENTS
This research was supported by Thales, NSERC, and the NSERC/Creaform Industrial Research Chair on 3D Scanning: CREATION 3D. We thank Nvidia Corporation for the donation of the GPUs used in this research.

8. REFERENCES

[1] C. Ancuti, C. O. Ancuti, and C. De Vleeschouwer. D-Hazy: A dataset to evaluate quantitatively dehazing algorithms. In ICIP, 2016.
[2] C. Ancuti, C. O. Ancuti, and R. Timofte. NTIRE 2018 challenge on image dehazing: Methods and results. In CVPR Workshops, 2018.
[3] B. Bischke, P. Helber, F. Koenig, D. Borth, and A. Dengel. Overcoming missing and incomplete modalities with generative adversarial networks for building footprint segmentation. In CBMI, 2018.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[5] S. de Blois, I. Hedhli, and C. Gagné. Learning of image dehazing models for segmentation tasks. In EUSIPCO, 2019.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] N. C. Garcia, P. Morerio, and V. Murino. Modality distillation with multiple stream networks for action recognition. In ECCV, 2018.
[8] N. C. Garcia, P. Morerio, and V. Murino. Learning with privileged information via adversarial discriminative modality distillation. TPAMI, 2019.
[9] M. Garon, D. Laurendeau, and J.-F. Lalonde. A framework for evaluating 6-DOF object trackers. In ECCV, 2018.
[10] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[11] H. Gunes and M. Piccardi. Affect recognition from face and body: Early fusion vs. late fusion. In IEEE International Conference on Systems, Man and Cybernetics, 2005.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] J. Hoffman, S. Gupta, and T. Darrell. Learning with side information through modality hallucination. In CVPR, 2016.
[14] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang. Jointly learning heterogeneous features for RGB-D activity recognition. In CVPR, 2015.
[15] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In CVPR, 2015.
[16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[17] R. Li, J. Pan, Z. Li, and J. Tang. Single image dehazing via conditional generative adversarial network. In CVPR, 2018.
[18] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[20] O. Mees, A. Eitel, and W. Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In IROS, 2016.
[21] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[22] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. ModDrop: Adaptive multi-modal gesture recognition. TPAMI, 38(8):1692–1706, 2016.
[23] C. Sakaridis, D. Dai, and L. Van Gool. Semantic foggy scene understanding with synthetic data. IJCV, 126(9):973–992, 2018.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[25] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544–557, 2009.
[26] J. Wagner, V. Fischer, M. Herman, and S. Behnke. Multispectral pedestrian detection using deep fusion convolutional neural networks. In ESANN, 2016.
[27] Y. Wu, E. Y. Chang, K. C.-C. Chang, and J. R. Smith. Optimal multimodal fusion for multimedia data analysis. In ACM Multimedia, 2004.
[28] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV, 2013.
[29] X. Yang, Z. Xu, and J. Luo. Towards perceptual image dehazing by physics-based disentanglement and adversarial training. In AAAI, 2018.