A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

Abstract

How do we learn an object detector that is invariant to occlusions and deformations? Our current solution is to use a data-driven strategy -- collect large-scale datasets which have object instances under different conditions. The hope is that the final classifier can use these examples to learn invariances. But is it really possible to see all the occlusions in a dataset? We argue that like categories, occlusions and object deformations also follow a long-tail. Some occlusions and deformations are so rare that they hardly happen; yet we want to learn a model invariant to such occurrences. In this paper, we propose an alternative solution. We propose to learn an adversarial network that generates examples with occlusions and deformations. The goal of the adversary is to generate examples that are difficult for the object detector to classify. In our framework both the original detector and adversary are learned in a joint manner. Our experimental results indicate a 2.3% mAP boost on VOC07 and a 2.6% mAP boost on VOC2012 object detection challenge compared to the Fast-RCNN pipeline. We also release the code for this paper.


Xiaolong Wang, Abhinav Shrivastava, Abhinav Gupta
The Robotics Institute, Carnegie Mellon University


1. Introduction

The goal of object detection is to learn a visual model for concepts such as cars and use this model to localize these concepts in an image. This requires the ability to robustly model invariances to illuminations, deformations, occlusions and other intra-class variations. The standard paradigm to handle these invariances is to collect large-scale datasets which have object instances under different conditions. For example, the COCO dataset [18] has more than 10K examples of cars under different occlusions and deformations. The hope is that these examples capture all possible variations of a visual concept and the classifier can then effectively model invariances to them. We believe this has been one of the prime reasons why ConvNets have been so successful at the task of object detection: they are able to use all this data to learn invariances.

However, like object categories, we believe even occlusions and deformations follow a long-tail distribution. That is, some of the occlusions and deformations are so rare that there is a low chance that they will occur in large-scale datasets. For example, consider the occlusions shown in Figure 1. We notice that some occlusions occur more often than others (e.g., occlusion from other cars in a parking garage is more frequent than from an air conditioner). Similarly, some deformations in animals are common (e.g., sitting/standing poses) while other deformations are very rare.

Figure 1: We argue that both occlusions and deformations follow a long-tail distribution. Some occlusions and deformations are rare. In this paper, we propose to use an adversarial network to generate examples with occlusions and deformations that will be hard for an object detector to classify. Our adversarial network adapts as the object detector becomes better and better. We show boost in detection accuracy with this adversarial learning strategy empirically. (Panels: real-world occlusions, real-world deformations, and occlusions created by adversarial networks.)

So, how can we learn invariances to such rare/uncommon occlusions and deformations? While collecting even larger datasets is one possible solution, it is not likely to scale due to the long-tail statistics.

Recently, there has been a lot of work on generating images (or pixels) [4, 9, 26]. One possible way to learn about these rare occurrences is to generate realistic images by sampling from the tail distribution. However, this is not a feasible solution since image generation would require training examples of these rare occurrences to begin with. Another solution is to generate all possible occlusions and deformations and train object detectors from them. However, since the space of deformations and occlusions is huge, this is not a scalable solution. It has been shown that using all examples is often not the optimal solution [33, 39] and selecting hard examples is better. Is there a way we can generate "hard" positive examples with different occlusions and deformations and without generating the pixels themselves?

How about training another network: an adversary that creates hard examples by blocking some feature maps spatially or creates spatial deformations by manipulating feature responses. This adversary will predict what it thinks will be hard for a detector like Fast-RCNN [7] and in turn the Fast-RCNN will adapt itself to learn to classify these adversarial examples. The key idea here is to create adversarial examples in convolutional feature space and not generate the pixels directly, since the latter is a much harder problem. In our experiments, we show substantial improvements in performance of the adversarial Fast-RCNN (A-Fast-RCNN) compared to the standard Fast-RCNN pipeline.

Code: https://github.com/xiaolonw/adversarial-frcnn

2. Related Work

In recent years, significant gains have been made in the field of object detection. These recent successes build upon the powerful deep features [16] learned from the task of ImageNet classification [3]. R-CNN [8] and OverFeat [30] object detection systems led this wave with impressive results on PASCAL VOC [5]; and in recent years, more computationally efficient versions have emerged that can efficiently train on larger datasets such as COCO [18]. For example, Fast-RCNN [7] shares the convolutions across different region proposals to provide speed-up, while Faster-RCNN [28] and R-FCN [2] incorporate region proposal generation in the framework, leading to a completely end-to-end version. Building on the sliding-window paradigm of the OverFeat detector, other computationally efficient approaches have emerged such as YOLO [27], SSD [19] and DenseBox [13]. Thorough comparisons among these methods are discussed in [12].

Recent research has focused on three principal directions for developing better object detection systems. The first direction relies on changing the base architecture of these networks. The central idea is that using deeper networks should not only lead to improvements in classification [3] but also object detection [5, 18]. Some recent work in this direction includes ResNet [10], Inception-ResNet [38] and ResNeXt [43] for object detection.

The second area of research has been to use contextual reasoning, proxy tasks for reasoning and other top-down mechanisms for improving representations for object detection [1, 6, 17, 24, 32, 34, 45]. For example, [32] uses segmentation as a way to contextually prime object detectors and provide feedback to initial layers. [1] uses a skip-network architecture and uses features from multiple layers of representation in conjunction with contextual reasoning. Other approaches include using top-down features for incorporating context and finer details [17, 24, 34], which leads to improved detections.

The third direction to improve a detection system is to better exploit the data itself. It is often argued that the recent success of object detectors is a product of better visual representation and the availability of large-scale data for learning. Therefore, this third class of approaches tries to explore how to better utilize data for improving the performance. One example is to incorporate hard example mining in an effective and efficient setup for training region-based ConvNets [33]. Other examples of finding hard examples for training include [20, 35, 41].

Our work follows this third direction of research where the focus is on leveraging data in a better manner. However, instead of trying to sift through the data to find hard examples, we try to generate examples which will be hard for Fast-RCNN to detect/classify. We restrict the space of new positive generation to adding occlusions and deformations to the current existing examples from the dataset. Specifically, we learn adversarial networks which try to predict occlusions and deformations that would lead to misclassification by Fast-RCNN. Our work is therefore related to a lot of recent work in adversarial learning [4, 9, 14, 21, 22, 23, 26, 29, 37]. For example, techniques have been proposed to improve adversarial learning for image generation [26] as well as for training better image generative models [29]. [29] also highlights that adversarial learning can improve image classification in a semi-supervised setting. However, the experiments in these works are conducted on data which has less complexity than object detection datasets, where image generation results are significantly inferior. Our work is also related to a recent work on adversarial training in robotics [25]. However, instead of using an adversary for better supervision, we use the adversary to generate the hard examples.

3. Adversarial Learning for Object Detection

Our goal is to learn an object detector that is robust to different conditions such as occlusion, deformation and illumination. We hypothesize that even in large-scale datasets, it is impossible to cover all potential occlusions and deformations. Instead of relying heavily on the dataset or sifting through data to find hard examples, we take an alternative approach. We actively generate examples which are hard for the object detector to recognize. However, instead of generating the data in the pixel space, we focus on a restricted space for generation: occlusion and deformation.

Figure 2: Our network architecture of ASDN and how it combines with the Fast-RCNN approach. Our ASDN network takes as input image patches with features extracted using the RoI-pooling layer. The ASDN network then predicts an occlusion/dropout mask which is used to drop the feature values, and the result is passed on to the classification tower of Fast-RCNN.

Mathematically, let us assume the original object detector network is represented as $F(X)$, where $X$ is one of the object proposals. A detector gives two outputs: $F_c$, which represents the class output, and $F_l$, which represents the predicted bounding box location. Let us assume that the ground-truth class for $X$ is $C$ with spatial location being $L$. Our original detector loss can be written down as

$$\mathcal{L}_F = \mathcal{L}_{softmax}(F_c(X), C) + [C \notin bg] \, \mathcal{L}_{bbox}(F_l(X), L),$$

where the first term is the SoftMax loss and the second term is the loss based on the predicted bounding box location and the ground-truth box location (foreground classes only).

Let us assume the adversarial network is represented as $A(X)$ which, given a feature $X$ computed on image $I$, generates a new adversarial example. The loss function for the detector remains the same, except that the mini-batch now includes fewer original examples and some adversarial examples.

However, the adversarial network has to learn to predict the feature on which the detector would fail. We train this adversarial network via the following loss function,

$$\mathcal{L}_A = -\mathcal{L}_{softmax}(F_c(A(X)), C).$$

Therefore, if the feature generated by the adversarial network is easy for the detector to classify, we get a high loss for the adversarial network. On the other hand, if after adversarial feature generation it is difficult for the detector, we get a high loss for the detector and a low loss for the adversarial network.
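The two losses above can be made concrete with a small NumPy sketch for a single proposal. This is only an illustration under stated assumptions: the function names, the background index `BACKGROUND = 0`, and the use of Fast R-CNN's smooth-L1 form for the box term are choices made here, not details given in this section.

```python
import numpy as np

BACKGROUND = 0  # assumed index of the background class (hypothetical)

def softmax_cross_entropy(logits, label):
    """SoftMax loss for one proposal: -log p(label)."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def smooth_l1(pred_box, gt_box):
    """Smooth-L1 box regression loss (assumed form of L_bbox)."""
    diff = np.abs(pred_box - gt_box)
    return np.sum(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5))

def detector_loss(class_logits, pred_box, gt_class, gt_box):
    """L_F: SoftMax loss plus the box loss for foreground proposals only."""
    loss = softmax_cross_entropy(class_logits, gt_class)
    if gt_class != BACKGROUND:  # the indicator [C not in bg]
        loss += smooth_l1(pred_box, gt_box)
    return loss

def adversary_loss(class_logits_on_adversarial, gt_class):
    """L_A: negative SoftMax loss of the detector on the adversarial feature."""
    return -softmax_cross_entropy(class_logits_on_adversarial, gt_class)
```

The sign flip is the whole mechanism: any change to the feature that raises the detector's classification loss lowers the adversary's loss by the same amount.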

4. A-Fast-RCNN: Approach Details

We now describe the details of our framework. We first give a brief overview of our base detector Fast-RCNN. This is followed by describing the space of adversarial generation. In particular, we focus on generating different types of occlusions and deformations in this paper. Finally, in Section 5, we describe our experimental setup and show the results which indicate significant improvements over baselines.

We build upon the Fast-RCNN framework for object detection [7]. Fast-RCNN is composed of two parts: (i) a convolutional network for feature extraction; (ii) an RoI network with an RoI-pooling layer and a few fully connected layers that output object classes and bounding boxes.

Given an input image, the convolutional network of the Fast-RCNN takes the whole image as an input and produces convolutional feature maps as the output. Since the operations are mainly convolutions and max-pooling, the spatial dimensions of the output feature map will change according to the input image size. Given the feature map, the RoI-pooling layer is used to project the object proposals [40] onto the feature space. The RoI-pooling layer crops and resizes to generate a fixed-size feature vector for each object proposal. These feature vectors are then passed through fully connected layers. The outputs of the fully connected layers are: (i) probabilities for each object class including the background class; and (ii) bounding box coordinates.

For training, the SoftMax loss and regression loss are applied on these two outputs respectively, and the gradients are back-propagated through all the layers to perform end-to-end learning.
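To illustrate the "crop and resize" step, here is a minimal NumPy sketch of RoI max-pooling. It assumes the proposal has already been projected to integer feature-map cells; the function name `roi_max_pool` and the bin layout are illustrative choices, not the actual Fast-RCNN implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size):
    """Pool one region of an (H, W, C) feature map to (out_size, out_size, C).

    roi is (y0, x0, y1, x1) in feature-map cells, end-exclusive (assumed format).
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h, w, c = region.shape
    # Split the region into an out_size x out_size grid of bins.
    ys = np.linspace(0, h, out_size + 1)
    xs = np.linspace(0, w, out_size + 1)
    out = np.zeros((out_size, out_size, c), dtype=feature_map.dtype)
    for i in range(out_size):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)  # every bin covers at least one row
        for j in range(out_size):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)
            # Max over the spatial extent of the bin, separately per channel.
            out[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return out
```

Whatever the proposal size, the output has a fixed spatial layout, which is what lets the fully connected layers (and the adversarial networks described next) operate on every proposal uniformly.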

We consider two types of feature generations by adversarial networks competing against the Fast-RCNN (FRCN) detector. The first type of generation is occlusion. Here, we propose the Adversarial Spatial Dropout Network (ASDN), which learns how to occlude a given object such that it becomes hard for FRCN to classify. The second type of generation we consider in this paper is deformation. In this case, we propose the Adversarial Spatial Transformer Network (ASTN), which learns how to rotate "parts" of the objects and make them hard to recognize by the detector. By competing against these networks and overcoming the obstacles, the FRCN learns to handle object occlusions and deformations in a robust manner. Note that both the proposed networks ASDN and ASTN are learned simultaneously in conjunction with the FRCN during training. Joint training prevents the detector from overfitting to the obstacles created by the fixed policies of generation.

Figure 3: (a) Pre-training via searching: examples of occlusions that are sifted to select the hard occlusions and used as ground-truth to train the ASDN network. (b) Generated masks: examples of occlusion masks generated by the ASDN network. The black regions are occluded when passed on to the FRCN pipeline.

Instead of creating occlusions and deformations on the input images, we find that operating on the feature space is more efficient and effective. Thus, we design our adversarial networks to modify the features to make the object harder to recognize. Note that these two networks are only applied during training to improve the detector. We will first introduce the ASDN and ASTN individually and then combine them together in a unified framework.

We propose an Adversarial Spatial Dropout Network (ASDN) to create occlusions on the deep features for foreground objects. Recall that in the standard Fast-RCNN pipeline, we can obtain the convolutional features for each foreground object proposal after the RoI-pooling layer. We use these region-based features as the inputs for our adversarial network. Given the feature of an object, the ASDN will try to generate a mask indicating which parts of the feature to drop out (assigning zeros) so that the detector cannot recognize the object.

More specifically, given an object we extract the feature $X$ with size $d \times d \times c$, where $d$ is the spatial dimension and $c$ represents the number of channels (e.g., $c = 256$, $d = 6$ in AlexNet). Given this feature, our ASDN will predict a mask $M$ with $d \times d$ values which are either 0 or 1 after thresholding. We visualize some of the masks before thresholding in Fig. 3(b). We denote $M_{ij}$ as the value for the $i$-th row and $j$-th column of the mask. Similarly, $X_{ijk}$ represents the value in channel $k$ at location $(i, j)$ of the feature. If $M_{ij} = 1$, we drop out the values of all the channels in the corresponding spatial location of the feature map $X$, i.e., $X_{ijk} = 0, \forall k$.

Network Architecture.

We use the standard Fast-RCNN (FRCN) architecture. We initialize the network using pre-training from ImageNet [3]. The adversarial network shares the convolutional layers and RoI-pooling layer with FRCN and then uses its own separate fully connected layers. Note that we do not share the parameters in our ASDN with Fast-RCNN since we are optimizing two networks to do the exact opposite tasks.
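The spatial dropout rule stated earlier in this section (if $M_{ij} = 1$, then $X_{ijk} = 0$ for every channel $k$) amounts to a single broadcast multiplication. The following NumPy sketch assumes a channels-last layout of shape (d, d, c); the function name is hypothetical.

```python
import numpy as np

def apply_occlusion_mask(features, mask):
    """Zero all channels of `features` (d, d, c) wherever `mask` (d, d) equals 1."""
    keep = (mask == 0)[:, :, None]  # shape (d, d, 1), broadcast over channels
    return features * keep

# Example with the AlexNet sizes quoted in the text (d = 6, c = 256).
features = np.ones((6, 6, 256))
mask = np.zeros((6, 6), dtype=int)
mask[0:2, 0:2] = 1  # occlude the top-left 2 x 2 spatial block
occluded = apply_occlusion_mask(features, mask)
```

Because the mask is purely spatial, one predicted value per location decides the fate of all `c` channels there, which mimics an occluder covering that part of the object.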

Model Pre-training.

In our experiment, we find it important to pre-train the ASDN for creating occlusions before using it to improve Fast-RCNN. Motivated by the Faster-RCNN detector [28], we apply stage-wise training here. We first train our Fast-RCNN detector without ASDN for 10K iterations. As the detector now has a sense of the objects in the dataset, we train the ASDN model for creating the occlusions by fixing all the layers in the detector.

Initializing ASDN Network.

To initialize the ASDN network, given a feature map $X$ with spatial layout $d \times d$, we apply a sliding window with size d × d on it. We represent the sliding window process by projecting the window back to the image as in Fig. 3(a). For each sliding window, we drop out the values in all channels whose spatial locations are covered by the window and generate a new feature vector for the region proposal. This feature vector is then passed through classification layers to compute the loss. Based on the loss of all the d × d windows, we select the one with the highest loss. This window is then used to create a single $d \times d$ mask (with 1 for the window location and 0 for the other pixels). We generate these spatial masks for $n$ positive region proposals and obtain $n$ pairs of training examples $\{(X^1, \tilde{M}^1), \ldots, (X^n, \tilde{M}^n)\}$ for our adversarial dropout network. The idea is that the ASDN should learn to generate the masks which can give the detector network high losses. We apply the binary cross entropy loss in training the ASDN and it can be formulated as

$$\mathcal{L} = -\frac{1}{n} \sum_{p}^{n} \sum_{i,j}^{d} \left[ \tilde{M}^{p}_{ij} A_{ij}(X^{p}) + (1 - \tilde{M}^{p}_{ij})(1 - A_{ij}(X^{p})) \right], \tag{1}$$

where $A_{ij}(X^{p})$ represents the output of the ASDN at location $(i, j)$ given the input feature map $X^{p}$. We train the ASDN with this loss for 10K iterations. We show that the network starts to recognize which parts of the objects are significant for classification, as shown in Fig. 3(b). Also note that our output masks are different from the Attention Mask proposed in [31], where the attention mechanism is used to facilitate classification. In our case, we use the masks to occlude parts to make the classification harder.

Figure 4: Network architecture for combining the ASDN and ASTN networks. First occlusion masks are created and then the channels are rotated to generate hard examples for training.

Thresholding by Sampling.

The output generated by the ASDN network is not a binary mask but rather a continuous heatmap. Instead of using direct thresholding, we use importance sampling to select the top pixels to mask out. Note that the sampling procedure incorporates stochasticity and diversity in samples during training. More specifically, given a heatmap, we first select the top pixels with top probabilities and randomly select pixels out of them to assign the value 1, and the rest of the pixels are set to 0.

Joint Learning.

Given the pre-trained ASDN and Fast-RCNN model, we jointly optimize these two networks in each iteration of training. For training the Fast-RCNN detector, we first use the ASDN to generate the masks on the features after the RoI-pooling during forward propagation. We perform sampling to generate binary masks and use them to drop out the values in the features after the RoI-pooling layer. We then forward the modified features to calculate the loss and train the detector end-to-end. Note that although our features are modified, the labels remain the same. In this way, we create "harder" and more diverse examples for training the detector.

For training the ASDN, since we apply the sampling strategy to convert the heatmap into a binary mask, which is not differentiable, we cannot directly back-propagate the gradients from the classification loss. Instead, we take inspiration from the REINFORCE [42] approach. We compute which binary masks lead to significant drops in Fast-RCNN classification scores. We use only those hard example masks as ground-truth to train the adversarial network directly using the same loss as described in Eq. 1.
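The sampling step and the selection of hard masks can be sketched as follows. This is a hedged illustration, not the released implementation: `n_top`, `n_drop` and `min_drop` are hypothetical parameters standing in for quantities the text does not pin down here, and the function names are invented for the example.

```python
import numpy as np

def sample_binary_mask(heatmap, n_top, n_drop, rng):
    """Turn a (d, d) heatmap into a binary mask by sampling among its top cells.

    n_top:  how many highest-scoring cells are candidates (assumed parameter).
    n_drop: how many candidates are randomly set to 1 (assumed parameter).
    """
    d = heatmap.shape[0]
    flat = heatmap.ravel()
    top_idx = np.argsort(flat)[::-1][:n_top]  # indices of the n_top largest values
    chosen = rng.choice(top_idx, size=n_drop, replace=False)
    mask = np.zeros(d * d, dtype=np.int64)
    mask[chosen] = 1
    return mask.reshape(d, d)

def select_hard_masks(scores_before, scores_after, min_drop):
    """Indices of masks whose classification-score drop is at least min_drop."""
    drop = np.asarray(scores_before) - np.asarray(scores_after)
    return np.flatnonzero(drop >= min_drop)

rng = np.random.default_rng(0)
heatmap = np.arange(36).reshape(6, 6) / 36.0
mask = sample_binary_mask(heatmap, n_top=18, n_drop=12, rng=rng)
hard = select_hard_masks([0.9, 0.8, 0.7], [0.2, 0.75, 0.1], min_drop=0.3)
```

Only the masks indexed by `hard` would be used as targets for the adversary, which sidesteps the non-differentiable sampling step: the ASDN is trained by supervised regression onto masks that were observed to hurt the detector.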

We now introduce the Adversarial Spatial Transformer Network (ASTN). The key idea is to create deformations on the object features and make object recognition by the detector difficult. Our network is built upon the Spatial Transformer Network (STN) proposed in [15]. In their work, the STN is proposed to deform the features to make classification easier. Our network, on the other hand, is doing the exact opposite task. By competing against our ASTN, we can train a better detector which is robust to deformations.

STN Overview.

The Spatial Transformer Network [15] has three components: localisation network, grid generator and sampler. Given the feature map as input, the localisation network will estimate the variables for deformations (e.g., rotation degree, translation distance and scaling factor). These variables will be used as inputs for the grid generator and sampler to operate on the feature map. The output is a deformed feature map. Note that we only need to learn the parameters in the localisation network. One of the key contributions of STN is making the whole process differentiable, so that the localisation network can be directly optimized for the classification objective via back-propagation. Please refer to [15] for more technical details.
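As a worked instance of the grid generator, consider a pure rotation by an angle $\theta$. In the general STN formulation, every target location of the output map is sent to a source location of the input map by an affine transform; restricting that transform to a rotation gives the form below. The coordinate symbols are notation assumed here for illustration (coordinates measured from the center of the feature map) and do not appear elsewhere in this text.

```latex
% Target location (x^t, y^t) in the output map is mapped to a source
% location (x^s, y^s) in the input map U; the sampler then reads U there.
\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x^{t} \\ y^{t} \end{pmatrix},
\qquad
V(x^{t}, y^{t}) = \mathrm{bilinear}\bigl(U;\, x^{s}, y^{s}\bigr).
```

Both the rotation matrix and bilinear interpolation are differentiable in $\theta$, which is why gradients of the classification loss can reach the localisation network that predicts $\theta$.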

Adversarial STN.

In our Adversarial Spatial Transformer Network, we focus on feature map rotations. That is, given a feature map after the RoI-pooling layer as input, our ASTN will learn to rotate the feature map to make it harder to recognize. Our localisation network is composed of 3 fully connected layers, where the first two layers are initialized with the fc6 and fc7 layers from the ImageNet pre-trained network, as in our Adversarial Spatial Dropout Network.

We train the ASTN and the Fast-RCNN detector jointly. For training the detector, similar to the process in the ASDN, the features after RoI-pooling are first transformed by our ASTN and forwarded to the higher layers to compute the SoftMax loss. For training the ASTN, we optimize it so that the detector will classify the foreground objects as the background class. Different from training the ASDN, since the spatial transformation is differentiable, we can directly use the classification loss to back-propagate and fine-tune the parameters in the localisation network of the ASTN.

Implementation Details.

In our experiments, we find it very important to limit the rotation degrees produced by the ASTN. Otherwise it is very easy to rotate the object upside down, which is the hardest to recognize in most cases. We constrain the rotation degree within 10° clockwise and anti-clockwise. Instead of rotating all the feature maps in the same direction, we divide the feature maps on the channel dimension into 4 blocks and estimate 4 different rotation angles for different blocks. Since each of the channels corresponds to activations of one type of feature, rotating channels separately corresponds to rotating parts of the object in different directions, which leads to deformations. We also find that if we use one rotation angle for all feature maps, the ASTN will often predict the largest angle. By using 4 different angles instead of one, we increase the complexity of the task, which prevents the network from predicting trivial deformations.

The two adversarial networks ASDN and ASTN can also be combined and trained together in the same detection framework. Since these two networks offer different types of information, competing against both of them simultaneously makes our detector more robust.

We combine these two networks into the Fast-RCNN framework in a sequential manner. As shown in Fig. 4, the feature maps extracted after the RoI-pooling are first forwarded to our ASDN, which drops out some activations. The modified features are further deformed by the ASTN.
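The block-wise rotation described above can be sketched in NumPy as a rotation about the map center with bilinear sampling. Several details are assumptions made for this illustration only: the `tanh` squashing used to keep angles within ±10°, zero padding for samples that fall outside the map, the channels-last layout, and all function names.

```python
import numpy as np

MAX_ANGLE_DEG = 10.0  # rotation limit stated in the text
NUM_BLOCKS = 4        # number of channel blocks stated in the text

def rotate_feature_map(fmap, angle_deg):
    """Rotate a square (d, d, c) map about its center using bilinear sampling."""
    d = fmap.shape[0]
    theta = np.deg2rad(angle_deg)
    center = (d - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(d), np.arange(d), indexing="ij")
    yt, xt = ys - center, xs - center
    # For each target cell, compute the source location to read from.
    xsrc = np.cos(theta) * xt - np.sin(theta) * yt + center
    ysrc = np.sin(theta) * xt + np.cos(theta) * yt + center
    x0, y0 = np.floor(xsrc).astype(int), np.floor(ysrc).astype(int)
    wx, wy = xsrc - x0, ysrc - y0
    out = np.zeros(fmap.shape, dtype=float)
    corners = ((0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
               (1, 0, wy * (1 - wx)), (1, 1, wy * wx))
    for dy, dx, w in corners:
        yy, xx = y0 + dy, x0 + dx
        valid = (yy >= 0) & (yy < d) & (xx >= 0) & (xx < d)  # zero outside the map
        yy, xx = np.clip(yy, 0, d - 1), np.clip(xx, 0, d - 1)
        out += (w * valid)[:, :, None] * fmap[yy, xx, :]
    return out

def rotate_channel_blocks(fmap, raw_angles):
    """Rotate each of NUM_BLOCKS channel blocks by its own bounded angle."""
    # Assumed squashing: tanh keeps every angle inside [-10, 10] degrees.
    angles = MAX_ANGLE_DEG * np.tanh(np.asarray(raw_angles, dtype=float))
    blocks = np.array_split(np.arange(fmap.shape[2]), NUM_BLOCKS)
    out = np.empty(fmap.shape, dtype=float)
    for idx, angle in zip(blocks, angles):
        out[:, :, idx] = rotate_feature_map(fmap[:, :, idx], angle)
    return out, angles
```

In the sequential fusion of Fig. 4, a function like this would be applied to the features after the occlusion mask has zeroed some locations, so the detector sees examples that are both occluded and deformed.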

5. Experiments

We conduct our experiments on the PASCAL VOC 2007, PASCAL VOC 2012 [5] and MS COCO [18] datasets. As is standard practice, we perform most of the ablative studies on the PASCAL VOC 2007 dataset. We also report our numbers on the PASCAL VOC 2012 and COCO datasets. Finally, we perform a comparison between our method and the Online Hard Example Mining (OHEM) [33] approach.

PASCAL VOC.

For the VOC datasets, we use the 'trainval' set for training and the 'test' set for testing. We follow most of the setup in standard Fast-RCNN [7] for training. We apply SGD for 80K iterations to train our models. The learning rate starts with . and decreases to . after 60K iterations. We use the selective search proposals [40] during training.

MS COCO.

For the COCO dataset, we use the 'trainval35k' set for training and the 'minival' set for testing. When training the Fast-RCNN [7], we apply SGD for 320K iterations. The learning rate starts with . and decreases to . after 280K iterations. For object proposals, we use the DeepMask proposals [24].

In all the experiments, our minibatch size for training is 256 proposals with 2 images. We follow the Torch implementation [44] of Fast-RCNN. With these settings, our baseline numbers are slightly better than the reported numbers in [7]. To prevent the Fast-RCNN from overfitting to the modified data, we provide one image in the batch without any adversarial occlusions/deformations and apply our approach on the other image in the batch.

We report our results for using ASTN and ASDN during training Fast-RCNN in Table 1. For the AlexNet architecture [16], our implemented baseline is 57.0% mAP. Based on this setting, joint learning with our ASTN model reaches 58.1% and joint learning with the ASDN model gives a higher performance of 58.5%. As both methods are complementary to each other, combining ASDN and ASTN into our full model gives another boost to 58.9% mAP.

For the VGG16 architecture [36], we conduct the same set of experiments. Firstly, our baseline model reaches 69.1% mAP, much higher than the reported number 66.9% in [7]. Based on this implementation, joint learning with our ASTN model gives an improvement to 69.9% mAP and the ASDN model reaches 71.0% mAP. Our full model with both ASTN and ASDN improves the performance to 71.4%. Our final result gives a 2.3% boost upon the baseline.

To show that our method also works with very deep CNNs, we apply the ResNet-101 [10] architecture in training Fast-RCNN. As the last two lines in Table 1 illustrate, the performance of Fast-RCNN with ResNet-101 is 71.8% mAP. By applying the adversarial training, the result is 73.6% mAP. We can see that our approach consistently improves performance on different types of architectures.
Table 1: VOC 2007 test detection average precision (%). FRCN* refers to FRCN [7] with our training schedule.

method         arch      mAP    aero
FRCN [7]       AlexNet   55.4   67.2
FRCN*          AlexNet   57.0   67.3
Ours (ASTN)    AlexNet   58.1   68.7
Ours (ASDN)    AlexNet   58.5   67.1
Ours (full)    AlexNet   58.9   67.6
FRCN [7]       VGG       66.9   74.5
FRCN*          VGG       69.1   75.4
Ours (ASTN)    VGG       69.9   73.7
Ours (ASDN)    VGG       71.0   74.4
Ours (full)    VGG       71.4   75.7
FRCN*          ResNet    71.8   78.7
Ours (full)    ResNet    73.6   75.4

We compare our Adversarial Spatial Dropout Network with various dropout/occlusion strategies in training using the AlexNet architecture. The first simple baseline we try is random spatial dropout on the feature after RoI-pooling. For a fair comparison, we mask the activations of the same number of neurons as we do in the ASDN network. As Table 2 shows, the performance of random dropout is 57.3% mAP, which is slightly better than the baseline. Another dropout strategy we compare to is a strategy similar to the one we apply in pre-training the ASDN (Fig. 3). We exhaustively enumerate different kinds of occlusion and select the best ones for training in each iteration. The performance is 57.7% mAP (Ours (hard dropout)), which is slightly better than random dropout.

As we find the exhaustive strategy can only explore a very limited space of occlusion policies, we use the pre-trained ASDN network to replace it. However, when we fix the parameters of the ASDN, we find the performance is 57.5% mAP (Ours (fixed ASDN)), which is not as good as the exhaustive strategy. The reason is that the fixed ASDN has not received any feedback from the updating Fast-RCNN while the exhaustive search has. If we jointly learn the ASDN and the Fast-RCNN together, we can get 58.5% mAP, a 1.5% improvement compared to the baseline without dropout. This evidence shows that joint learning of ASDN and Fast-RCNN is where it makes a difference.

Table 2: VOC 2007 test detection average precision (%). Ablative analysis on the Adversarial Spatial Dropout Network. FRCN* refers to FRCN [7] with our training schedule.

method                   arch      mAP    aero
FRCN*                    AlexNet   57.0   67.3
Ours (random dropout)    AlexNet   57.3   68.6
Ours (hard dropout)      AlexNet   57.7   66.3
Ours (fixed ASDN)        AlexNet   57.5   66.3
Ours (joint learning)    AlexNet   58.5   67.1

ASTN Analysis.

We compared our Adversarial SpatialTransformer Network with random jittering on the objectproposals. The augmentations include random changes ofscale, aspect ratio and rotation on the proposals during train-ing the Fast-RCNN. With AlexNet, the performance of us-ing random jittering is . mAP while our ASTN resultsis . . With VGG16, we have . for random jitter-ing and . for the ASTN. For both architectures, themodel with ASTN works better than random jittering. Figure 5 shows the graph of how performance of each cat-egory changes with the occlusions and deformations. In- (a) Ours (ASTN) (b) Ours (ASDN) (c) Ours (Full)

Figure 5: Changes of APs compared to baseline FRCN.terestingly the categories that seemed to be helped by bothASTN and ASDN seem to be quire similar. It seems thatboth plant and bottle performance improves with ad-versarial training. However, combining the two transforma-tions together seems to improve performance on some cate-gories which were hurt by using occlusion or deformationsalone. Speciﬁcally, categories like car and aeroplane are helped by combining the two adversarial processes.able 3:

VOC 2012 test detection average precision (%). FRCN* refers to FRCN [7] with our training schedule.

method       arch  mAP   aero  bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv
FRCN [7]     VGG   65.7  80.3
FRCN*        VGG   66.4  81.8
Ours (full)  VGG   69.0  82.2

Figure 6: Some of the false positives for our approach. These are top false positives for adversarial training but not the original Fast-RCNN. Panel labels: bicycle (oth): ov=0.00, 1 − r=0.30; bird (sim): ov=0.00, 1 − r=0.54; boat (bg): ov=0.00, 1 − r=0.65; sheep (sim): ov=0.00, 1 − r=0.46.

Figure 6 shows some of the false positives of our approach with the diagnosing code [11]. These examples are hand-picked such that they only appeared in the list of false positives for adversarial learning but not the original Fast-RCNN. These results indicate some of the shortcomings of adversarial learning. In some cases, the adversary creates deformations or occlusions which are similar to other object categories, leading to over-generalization. For example, our approach hides the wheels of the bicycle, which leads to a wheelchair being classified as a bike.

We show our results with VGG16 on the PASCAL VOC 2012 dataset in Table 3, where our baseline performance is . . Our full approach with joint learning of ASDN and ASTN gives . boost to . mAP. This again shows that the performance improvement using VGG on VOC 2012 is significant. We also observe that our method improves performance of all the categories except sofa in VOC 2012. We believe this is probably because of larger diversity in VOC 2012.

We finally report the results on the MS COCO dataset. The baseline method with VGG16 architecture is . AP on the VOC metric and . AP on the standard COCO metric. By applying our method, we achieve . AP and . AP on the VOC and COCO metric respectively.
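To make the distinction between the two metrics concrete: the VOC metric counts a detection as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, while the standard COCO metric averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The following is a minimal sketch of the matching criterion only, not of the full AP computation; the helper names `iou` and `match_rate` and the example boxes are illustrative assumptions and do not come from the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# IoU thresholds: a single one for the VOC metric, ten for the COCO metric.
VOC_THRESHOLDS = [0.5]
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def match_rate(overlap, thresholds):
    """Fraction of thresholds at which a detection with this IoU is a match."""
    return sum(overlap >= t for t in thresholds) / len(thresholds)


# Hypothetical ground-truth box and a slightly too small detection.
ground_truth = (0.0, 0.0, 10.0, 10.0)
detection = (0.0, 0.0, 9.0, 8.0)
overlap = iou(ground_truth, detection)
voc_rate = match_rate(overlap, VOC_THRESHOLDS)
coco_rate = match_rate(overlap, COCO_THRESHOLDS)
```

A loosely localized detection like this one passes the single VOC threshold but fails the stricter COCO thresholds, which is why the same detector typically reports a lower number on the COCO metric than on the VOC metric.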

Our method is also related to the Online Hard Example Mining (OHEM) approach [33]. Our method allows us to sample data-points which might not exist in the dataset, whereas OHEM is bound by the dataset. However, OHEM has more realistic features since they are extracted from real images. For comparison, our approach ( . ) is better than OHEM ( . ) in VOC2007. However, our result ( . ) is not as good as OHEM ( . ) in VOC2012. Since these two approaches generate or select different types of features in training, we believe they should be complementary. To demonstrate this, we use an ensemble of these two approaches and compare it with separate ensembles of OHEM and Ours alone on VOC 2012. As a result, the ensemble of the two methods achieves . mAP, while the ensemble of two OHEM models ( . ) or two of our models ( . ) is not as good, indicating the complementary nature of the two approaches.

6. Conclusion

One of the long-term goals of object detection is to learn object models that are invariant to occlusions and deformations. Current approaches focus on learning these invariances by using large-scale datasets. In this paper, we argue that like categories, occlusions and deformations also follow a long-tail distribution: some of them are so rare that they might be hard to sample even in a large-scale dataset. We propose to learn these invariances using an adversarial learning strategy. The key idea is to learn an adversary in conjunction with the original object detector. This adversary creates examples on the fly with different occlusions and deformations, such that these occlusions/deformations make it difficult for the original object detector to classify. Instead of generating examples in pixel space, our adversarial network modifies the features to mimic occlusion and deformations. We show in our experiments that such an adversarial learning strategy provides a significant boost in detection performance on the VOC and COCO datasets.
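As a rough illustration of occlusion in feature space, the sketch below enumerates square windows on a feature map, zeroes each one in turn, and keeps the window that makes a given loss largest. This corresponds to an exhaustive search rather than the learned ASDN, so it is closer in spirit to the hard dropout baseline from the ablation than to the full method. The window size, the zero fill, and the callable `loss_fn` are assumptions made for the example; a toy loss stands in for the detector.

```python
import numpy as np


def sliding_masks(d, win):
    """Yield d x d boolean masks; True marks the win x win window to drop."""
    for top in range(d - win + 1):
        for left in range(d - win + 1):
            mask = np.zeros((d, d), dtype=bool)
            mask[top:top + win, left:left + win] = True
            yield mask


def hardest_occlusion(features, loss_fn, win):
    """Return (mask, loss) for the window whose removal maximizes loss_fn.

    features: array of shape (channels, d, d)
    loss_fn:  hypothetical callable mapping a feature array to a scalar loss
    """
    d = features.shape[-1]
    best_mask, best_loss = None, -np.inf
    for mask in sliding_masks(d, win):
        # Zero the window in every channel; the mask broadcasts over channels.
        occluded = np.where(mask, 0.0, features)
        loss = loss_fn(occluded)
        if loss > best_loss:
            best_mask, best_loss = mask, loss
    return best_mask, best_loss


# Toy setup: the "detector" is more confident the more activation it sees,
# so its loss is the negative sum of the features (an assumption).
features = np.zeros((1, 6, 6))
features[0, 2:4, 3:5] = 1.0
mask, loss = hardest_occlusion(features, lambda f: -f.sum(), win=2)
```

In this toy case the search selects the window covering the only active region, since hiding it removes all the evidence the stand-in loss relies on. The cost grows with the number of candidate windows, which is consistent with the observation in the ablation that exhaustive search explores only a limited space of occlusion policies.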

Acknowledgement: This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DoI/IBC) contract number D16PC00007. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government. AG was also supported by Sloan Fellowship.

References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143, 2015.
[2] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. NIPS, 2016.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[4] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
[6] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In ICCV, 2015.
[7] R. Girshick. Fast r-cnn. In ICCV, 2015.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.
[12] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CoRR, 2016.
[13] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. In CoRR, 2015.
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[17] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CoRR, 2017.
[18] T. Lin, M. Maire, S. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
[20] I. Loshchilov and F. Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
[21] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015.
[22] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
[23] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[24] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[25] L. Pinto, J. Davidson, and A. Gupta. Supervision via competition: Robot adversaries for learning tasks. In ICRA, 2017.
[26] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2015.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[29] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. CoRR, 2016.
[30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[31] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. In CoRR, 2016.
[32] A. Shrivastava and A. Gupta. Contextual priming and feedback for faster r-cnn. In ECCV, 2016.
[33] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[34] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. In CoRR, 2017.
[35] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, and F. Moreno-Noguer. Fracking deep convolutional image descriptors. arXiv preprint arXiv:1412.6537, 2014.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[37] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In CoRR, 2015.
[38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In CoRR, 2016.
[39] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for svms. arXiv preprint arXiv:1303.2314, 2013.
[40] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 2013.
[41] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
[42] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine learning, 1992.
[43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[44] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.
[45] X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang. Gated bi-directional cnn for object detection.