Scaling Object Detection by Transferring Classification Weights
Jason Kuen, Federico Perazzi, Zhe Lin, Jianming Zhang, Yap-Peng Tan
Nanyang Technological University, Singapore · Adobe Research
Abstract
Large-scale object detection datasets are constantly increasing in size in terms of the number of classes and the annotation count. Yet, the number of object-level categories annotated in detection datasets is an order of magnitude smaller than the number of image-level classification labels. State-of-the-art object detection models are trained in a supervised fashion, and this limits the number of object classes they can detect. In this paper, we propose a novel weight transfer network (WTN) to effectively and efficiently transfer knowledge from a classification network's weights to a detection network's weights, allowing detection of novel classes without box supervision. We first introduce input and feature normalization schemes to curb the under-fitting during training of a vanilla WTN. We then propose the autoencoder-WTN (AE-WTN), which uses a reconstruction loss to preserve the classification network's information over all classes in the target latent space, ensuring generalization to novel classes. Compared to the vanilla WTN, AE-WTN obtains absolute performance gains on two Open Images evaluation sets with 500 seen and 57 novel classes respectively, and on a Visual Genome evaluation set with 200 novel classes.
1. Introduction
State-of-the-art object detectors [12, 34] are typically trained with a large number of bounding box annotations. Large-scale datasets such as COCO [26], Pascal VOC [7] and Open Images [22] provide a substantial amount of bounding boxes, but the number of annotated object categories is often very limited. The reason is that scaling the number of bounding boxes can be semi-automated, e.g. [22], while increasing the number of classes requires significant human labor. On the other hand, image-level labels such as those available in classification datasets are much easier to collect, as they do not require costly bounding box annotations. As a consequence, several works have investigated the training of object detectors in a weakly-supervised regime, using only image-level labels. These methods leverage the variety of classes available in classification datasets or image tags found in social networks [29], but neglect the
Figure 1. Our proposed detector has no access to box-level training annotations for the object class represented by the red box, "Carbonara". It learns to detect novel object classes by transferring weight knowledge from a large-scale pre-trained image classification network.

spatial information available in object detection datasets. In contrast, partially supervised methods [16] employ both types of annotations. While existing methods [24, 33, 39, 40] that transfer knowledge from a classification network to a detection network with partial supervision achieve higher accuracy than weakly-supervised methods [3, 4, 24], they incur a significant computational cost during training and testing. The overhead comes either from joint training of the two networks [24, 33], or from performing forward passes of the classification network during testing [39, 40]. Furthermore, joint-training methods often require storage-intensive, large-scale classification datasets to be present while training the detection network.

To overcome these limitations, we propose a novel approach to transfer discriminative semantic knowledge from classification to detection with a non-linear weight-transfer network (WTN) [16]. Given a set of common classes annotated for both tasks, we learn a function, the weight-transfer network, that maps weights at the fully-connected layer of the classification network to those of the object detection network. Once trained, the WTN is used to extend the number of categories recognized by the object detector by transferring weights of unseen classes from the classification network. This strategy is advantageous because it adds only little computational and memory overhead to training and no burden to inference at all.

Compared to the vanilla weight-transfer network [16], we introduce two key components to our model. First, we insert normalization layers to account for the different amplitudes of the classification weights. Secondly, we replace the multilayer perceptron with an autoencoder. The latent space of the autoencoder corresponds to the classification weights of the object detector and is therefore trained with object-level supervision. The reconstruction loss between the input and output of the autoencoder is essential to retain semantic information of all the classes, while the detection network's classification loss facilitates the learning of a discriminative embedding of the class weights.

Extensive experimentation on the Open Images [22] and Visual Genome [20] datasets demonstrates that the proposed method significantly outperforms existing partially-supervised detection approaches on challenging detection tasks involving novel object classes. Moreover, due to the auxiliary regularization effect brought by the reconstruction loss of the autoencoder WTN, our proposed method even recovers the performance loss of the existing WTN on seen classes.

Contributions. The contributions of this work are threefold: i) we address the under-fitting issue of WTN by introducing input and feature normalization schemes; the resulting model, WTN+, achieves improved detection performance over the vanilla WTN; ii) we propose our main model, autoencoder WTN, which better preserves semantic knowledge of all object classes while learning to generate discriminative classification weights for the detection network; iii) we verify the effectiveness of our method with extensive evaluations using large-scale datasets with millions of images and several hundreds of object classes.
2. Related Works
Over the years, several convolutional network-based object detection frameworks and architectures have been proposed: R-CNN [10], Fast R-CNN [9], Faster R-CNN [34], R-FCN [5], SSD [27], YOLO [32, 33], FPN [25]. They can be roughly categorized into single-shot detectors [5, 27, 32, 33], which predict detection boxes from feature maps directly, and two-shot detectors [9, 10, 34], which first generate object proposals and then perform spatial extraction of feature maps based on the proposals for further predictions. These approaches have improved object detection from an algorithmic perspective and in a fully supervised setting. In this work, we adopt Faster R-CNN [34] because its box-level classification head learns just a single set of classification weights, resembling image-level classification (source task) networks. This allows a smoother knowledge transfer from classification to detection, compared to using single-shot detection networks, which learn multiple sets of classification weights for different anchor boxes.

Object-level annotations are time-consuming and tedious to collect, especially when the number of classes is large. With a large number of classes, it is very challenging to obtain accurate and complete annotations due to the complex overlapping meanings of classes. Thus, several approaches attempt to scale up the number of object classes handled by object detectors using image-level annotations. Transferring knowledge from image classification to object detection is an active research area tackling the lack of bounding box annotations for the target datasets and/or object classes. These knowledge transfer-based methods for scaling up object detection can be divided into two categories: weakly-supervised and partially-supervised approaches.

Weakly-supervised methods typically rely only on an image-level classification dataset and leverage class-agnostic box proposals or prior object knowledge to build object detectors. For example, Uijlings et al. [43] perform multiple instance learning with knowledge transfer (from a source dataset with bounding boxes) to produce boxes for the target training dataset. In [41], a weakly-supervised object detector is trained on a weakly-labeled web dataset to generate pseudo ground-truths for the target detection task. [37] combines region-level semantic similarity and common-sense information learned from external knowledge bases to train the detector with just image-level labels.

More closely related to our work are weight adaptation methods [15, 39, 40] that fine-tune classification networks and learn detection-specific bias vectors to adapt the networks for detection. These adaptation-based methods assume the classification power of the network is well preserved (e.g., using R-CNN [10]) when transferred to the detection task. This restricts them from being effectively applied to recent detection methods (e.g., Faster R-CNN [34], feature pyramid network [25]) that significantly modify the backbone network structure. Our method is not restricted by such constraints. In general, classification weight-based knowledge transfer [16] can be applied to any recent detection framework [27, 33, 34].

On the other hand, partially supervised approaches employ weak labels, i.e. image-level annotations, as well as bounding box-level annotations. For example, YOLO-9000 [33] extends the detector's class coverage by concurrently training on bounding box-level data and image-level data, such that the image-level data contribute only to the classification loss.
By decoupling the detection network into two branches (position-sensitive and semantic-focused), R-FCN-3K [37] is able to scale detection up to 3000 classes despite being trained on limited bounding box annotations for several object classes. In contrast to these, we focus on large-scale object detection without having access to additional (classification) data sources during training. A well-trained image classification network possesses sufficiently rich semantic knowledge about the large-scale dataset's categories, and this information is compressed in the weights of its classification layers. We argue that such weights can be effectively exploited to help build an object detector that handles a large number of categories.
3. Weight Transfer Network
Preliminaries. We consider the setting of a classification network CLN that handles object classes $C$, and a detection network DEN that handles object classes $D$. The number of categories handled by CLN is much greater than the number handled by DEN, i.e. $|C| \gg |D|$. The goal of our approach is to expand the number of categories handled by DEN through partial supervision, where we transfer weight knowledge from CLN (source task) to DEN (target task). We make use of the final fully-connected (FC) layer weights of a CLN that has been pre-trained on a large-scale image classification dataset. The final FC layer weights can be seen as a form of semantic embeddings comprising rich knowledge about the object categories and the complex class relationships. Furthermore, pre-trained large-scale image classification networks are very accessible, and many are shared publicly.

Classification knowledge from CLN is transferred to DEN using a weight transfer network (WTN) through the object categories shared ($S$) between the two tasks: $S = C \cap D$. WTN is a neural network that works as a class-generic function $T(\cdot)$ used to transform per-class classification weight vectors $W_C = [w_C^1, w_C^2, \ldots, w_C^{|C|}]$ from CLN into DEN's classification weights $W_D = [w_D^1, w_D^2, \ldots, w_D^{|D|}]$ as follows: $W_D = T(W_C)$.

WTN is trained jointly with DEN on a detection dataset with classes $D$. The gradients of WTN's network parameters come from DEN's box-level classification loss $\ell_{cls}$. Before training WTN and DEN, we 'freeze' $W_C$ (taken from the pre-trained CLN). While the weights of classes $S$ rely on WTN, for the DEN categories which are not part of $S$ (i.e., $D \setminus S$) we train the weights as in a conventional detection network. To obtain DEN's classification score predictions, we simply perform a matrix multiplication between DEN's box-level visual features and WTN's predicted weights, just as is done with conventional classification weights. Conventionally, WTN is based on a two-layer multi-layer perceptron (MLP) architecture.

Due to its class-genericness, WTN is able to carry out effective inductive learning [6]. In other words, despite only classes $S$ being seen by WTN and DEN during training, at test time WTN (and the DEN model that incorporates it) can work reasonably well with classes $N$ of CLN that are not shared with DEN, i.e. $N = C \setminus S$.
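For concreteness, the weight transfer and the score computation can be sketched in PyTorch as follows. This is an illustrative sketch rather than our exact implementation; the layer dimensions follow the defaults reported in Section 5.

```python
import torch
import torch.nn as nn

class WTN(nn.Module):
    """Vanilla weight transfer network: a class-generic two-layer MLP
    T(.) that maps per-class CLN weight vectors to DEN classification
    weights. Dimensions follow the defaults in Section 5."""

    def __init__(self, dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LeakyReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, w_cln):
        # w_cln: (num_classes, dim), frozen CLN weights. Each class is
        # transformed independently, so the same function can later be
        # applied to novel classes N that were unseen during training.
        return self.mlp(w_cln)

# Classification scores are a plain matrix product between DEN's
# box-level features and the transferred weights.
wtn = WTN()
w_cln = torch.randn(500, 2048)        # frozen weights of shared classes S
box_feats = torch.randn(64, 2048)     # features of 64 region proposals
w_den = wtn(w_cln)                    # (500, 2048)
scores = box_feats @ w_den.t()        # (64, 500) class logits
```

Because the transfer is applied per class, the classifier rows for novel classes are obtained simply by feeding their CLN weight vectors through the same trained function.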
Normalizations. Large-scale classification datasets have an unbalanced class distribution, which has strong implications for how the classification weights of CLN are trained. For instance, in one large-scale CLN, we discover that the 'highest-norm' class has a weight vector norm 28 times that of the 'lowest-norm' class. Besides, a class-generic non-linear WTN naturally cannot adapt and learn as well as (conventional) class-specific linear classification weights, for loss minimization. These pose challenges to the training and optimization of WTN. Empirically, we found that training a detection network (DEN) with existing WTN methods deteriorates the performance on $D$ classes, compared to a conventional DEN trained on the same labels but without WTN.

Figure 2. Comparison between the network architectures of WTN (FC → LeakyReLU → FC) and WTN+ (StdNorm → FC → GroupNorm → LeakyReLU → FC), both mapping $w_c \in W_{CLN}$ to $w_d \in W_{DEN}$. The white rectangles correspond to layers with learnable parameters.

Thus, drawing from recent findings in activation normalization techniques [17, 44], we introduce a new variant of WTN, WTN+, that improves performance on $D$ classes and is easier to optimize. The architectural differences between WTN and WTN+ are illustrated in Fig. 2. Standard normalization is applied to the input weights $W_C$ to enable different input channels to contribute comparably to the prediction of $W_D$, in order to curb the over-dominance/under-dominance of certain categories. Let $v_j$ denote the weights of the $j$-th feature/channel of $W_C$; we normalize $v_j$ by $\frac{v_j - \mu(v_j)}{\sigma(v_j)}$, where $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and standard deviation functions respectively. A Group Normalization [44] layer, known for its strong optimization benefits, is added to normalize intermediate features and encourage good gradient flow for easier network optimization. These small but crucial modifications are the key to training a highly effective WTN.
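The two normalizations can be sketched as follows (again illustrative; the group count of 32 follows Section 5, while the epsilon and other minor details are assumptions):

```python
import torch
import torch.nn as nn

class WTNPlus(nn.Module):
    """WTN+ sketch: StdNorm -> FC -> GroupNorm -> LeakyReLU -> FC
    (Fig. 2)."""

    def __init__(self, dim=2048, groups=32, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.fc1 = nn.Linear(dim, dim)
        self.gn = nn.GroupNorm(groups, dim)
        self.act = nn.LeakyReLU()
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, w_cln):
        # Standard normalization over the class axis: every channel
        # v_j of W_C is standardized to zero mean and unit variance,
        # so no channel over- or under-dominates the prediction.
        mu = w_cln.mean(dim=0, keepdim=True)
        sigma = w_cln.std(dim=0, keepdim=True)
        x = (w_cln - mu) / (sigma + self.eps)
        # GroupNorm on intermediate features eases optimization.
        return self.fc2(self.act(self.gn(self.fc1(x))))
```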
4. Autoencoder Weight Transfer Network
During training, only the shared classes $S$ contribute to the gradients and losses of WTN. The novel object classes $N$ are unknown to and unconsidered by WTN. The lack of knowledge of the entire class population of $C$ limits WTN's capability to effectively model the good classification space originally attained by the pre-trained CLN for handling a large number of categories. We hypothesize that by letting WTN have a narrow view of the class population, its modeling capability (relating to $N$ specifically) is severely under-exploited, and this compromises the performance of WTN on classes $N$.

Figure 3. The train and test phases of the object detector (DEN) with an Autoencoder-WTN (AE-WTN). Train phase: Before training DEN, we extract CLN's final FC layer's weights $W_C$ and discard the earlier layers. Trained simultaneously with DEN, AE-WTN learns to transform weights from CLN to DEN through the shared classes $S$. The "other" detection classes (i.e., $D \setminus S$) are trained normally as conventional classification weights. Only "other" and $S$ contribute to the detection loss $\ell_{cls}$. AE-WTN uses a reconstruction loss $\ell_{rec}$ to reconstruct the weights for both $S$ and $N$ from its encoder's outputs. Test phase (dashed polygon): CLN's weights of both the novel classes $N$ and the shared classes $S$ can be adapted offline for use in DEN through AE-WTN. With that, DEN is able to detect novel classes $N$ in addition to $S$ and the "other" classes.

To this end, we introduce the Autoencoder-WTN (AE-WTN), a novel WTN variant that attempts to preserve knowledge on all classes $C$ contained in the pre-trained $W_C$, while learning a discriminative WTN function to achieve good detection performance. AE-WTN is an autoencoder with both encoder and decoder networks, built on top of WTN+. The encoder network shares the same architecture as WTN+'s, while the decoder network (with separate network layers/parameters) is the mirrored version of the encoder. Following the existing WTN, the encoder network works as a function $T(\cdot)$ to predict $W_D$ given $W_C$ as input. During training, gradients are propagated from DEN's loss to the encoder network. The network architecture of AE-WTN and how it interacts with CLN and DEN are illustrated in Fig. 3.

AE-WTN is trained with an additional autoencoder-based training loss: a reconstruction loss [11, 14] that forces the decoder network to predict (or reconstruct) the original inputs from the output activations of the encoder network. Let $T(\cdot)$ denote the encoder network and $G(\cdot)$ denote the decoder network; the reconstruction is predicted as follows: $\hat{w}_C = G(T(w_C)),\ \forall w_C \in W_C$. Here, we adopt the smooth L1 loss [9] as the reconstruction loss to minimize the difference between the predicted reconstructions and the original inputs ($W_C$):

$$\ell_{rec} = \begin{cases} 0.5\,(\hat{w}_C - w_C)^2, & \text{if } |\hat{w}_C - w_C| < 1 \\ |\hat{w}_C - w_C| - 0.5, & \text{otherwise} \end{cases} \qquad (1)$$

Note that we apply the reconstruction loss to all CLN classes $C$ (i.e., $S \cup N$), rather than just the shared classes $S$. On the other hand, the detection loss (box-level classification) only concerns classes $S$ and the "other" detection classes. With such a formulation, we perform multi-task training based on the following mixture of training losses (excluding the Region Proposal Network's [34] losses): $\ell_{cls} + \ell_{box} + \alpha \ell_{rec}$, where $\ell_{box}$ is the box regression loss and $\alpha$ is a loss scaling hyperparameter.

The reconstruction loss penalizes intermediate network activations which do poorly at reconstructing the original weights $W_C$. Since AE-WTN's output $W_D$ (the weights for DEN) is a form of intermediate network activations, it is affected by the reconstruction loss and is expected to retain the original class information well for reconstruction purposes. In contrast, the existing WTN (or even WTN+) is solely driven by DEN's classification loss (which may not be optimal for model generalization) and is not compelled to retain more of the potentially useful class information. Reconstruction-based information preservation has been shown to help neural networks achieve better local optima [23, 45] in supervised learning. By complementing the classification loss with a reconstruction loss, AE-WTN is able to learn a non-linear mapping that achieves a good balance between class discriminability and class information retention. We find that this has a regularization effect on AE-WTN and helps improve generalization performance on the fully-annotated object categories ($D \cap S$) seen during training. This observation is aligned with the findings of [23, 45] that supervised learning can be improved with autoencoders. While we apply the reconstruction loss to all classes including $N$ (which do not have supervised annotations), [23, 45] apply the loss only to input examples with supervised annotations. Our work also resembles semi-supervised learning, where a reconstruction loss (autoencoder) [31, 46] is used as an auxiliary loss to exploit unlabeled data (in this work, classes $N$ are unlabeled) to improve model performance and generalization.

During the training of the existing WTN, $W_{C,N}$, the weights of the novel classes $N$ contained in $W_C$, are not utilized, and classes $N$ do not contribute to the training. Deep neural networks are generally known to eliminate task-irrelevant information of the inputs through training [36, 42]. Thus, it is likely that WTN learns to "dismiss" some class information about classes $N$ that is unimportant to classes $S$ but is useful for the detection of classes $N$. The reconstruction loss of AE-WTN addresses such a shortcoming of the existing WTN by explicitly involving the novel object classes $N$. The rich class information in $W_{C,N}$ (which is potentially beneficial to AE-WTN's test performance on classes $N$) is preserved in the intermediate network activations of AE-WTN.
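The AE-WTN computation and its training losses can be sketched as follows. This is illustrative: the omission of input standardization and the use of an identical (rather than strictly mirrored) decoder block are simplifications of ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AEWTN(nn.Module):
    """AE-WTN sketch. The encoder plays the role of T(.) and its output
    is DEN's classification weights; the decoder G(.) reconstructs the
    original CLN weights so that information about all classes C,
    including novel ones, is retained in the latent space."""

    def __init__(self, dim=2048, groups=32):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Linear(dim, dim), nn.GroupNorm(groups, dim),
                nn.LeakyReLU(), nn.Linear(dim, dim))
        self.encoder = block()    # T(.)
        self.decoder = block()    # G(.), separate parameters

    def forward(self, w_cln):
        w_den = self.encoder(w_cln)   # latent space = DEN weights
        w_rec = self.decoder(w_den)   # reconstruction of CLN weights
        return w_den, w_rec

# Schematic training step. The reconstruction loss covers ALL CLN
# classes C = S ∪ N, while the detection losses (omitted here) only
# involve S and the "other" detection classes. alpha = 20 (Section 5).
ae_wtn = AEWTN()
w_cln_all = torch.randn(5000, 2048)            # frozen weights of all C
w_den, w_rec = ae_wtn(w_cln_all)
ell_rec = F.smooth_l1_loss(w_rec, w_cln_all)   # Eq. (1)
# total_loss = ell_cls + ell_box + 20.0 * ell_rec
```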
5. Experiments
Training and evaluation sets for seen classes $D$. We use the official training and validation datasets (referred to as OI-500) [22] from the Open Images V4 Challenge, which contain 500 object classes, for training and evaluating DEN on classes $D$. The object classes in the Open Images dataset are hierarchically organized, and many classes are not mutually exclusive. Open Images' official evaluation metric [22], a custom version of "Average Precision (AP) @ 0.5 IoU threshold", is used for evaluation on the provided validation set. We use the same Open Images training set to train the baseline Faster R-CNN and our WTN-based models for fair comparisons on novel classes $N$.

Evaluation sets for novel classes $N$. To evaluate DEN's performance on novel classes $N$, we employ two evaluation datasets. The first evaluation set (OI-57) is a subset of the Open Images V4 complete/non-challenge dataset containing 57 novel object classes and 31,061 images. The second evaluation set (VG-200) is a subset of the Visual Genome [20] dataset containing 24,690 images spanning 200 high-frequency object classes which are novel to DEN. We adopt the same AP metric for OI-57. Since many object instances in the Visual Genome dataset are not annotated at all, we follow the practice of [2] by using Average Recall (AR) @ 100 detections per image to gauge the detection performance of DEN on this evaluation set.

Classification Network (CLN) (source). Prior to training WTN and DEN, a pre-trained large-scale CLN model has to be acquired. We use a publicly available ResNet-101 pre-trained on Open Images v2 [22] with 5000 object classes. It is trained with a multi-label (sigmoid) classification loss, given the multi-label nature of the dataset. The model is trained asynchronously with 50 GPU workers and a batch size of 32 for 620K training steps. The incoming features to the final classification layer are 2048-dimensional.

Detection Network (DEN) (target). The DEN architecture in this paper is a Faster R-CNN [34] with a backbone integrating ResNet-50 [13] and a Feature Pyramid Network (FPN) [25]. The ResNet-50 backbone is pre-trained on the ImageNet-1k [35] dataset, and its BN parameters are frozen during the training of DEN. The box-level head (for box classification and regression) is a 2-layer multi-layer perceptron (MLP) with 2048-dimensional features. DEN is trained with mini-batches of 8 images (2 images/GPU) for a total of 180K iterations. We optimize the network using SGD with a momentum of 0.9, and regularize it with weight decay. We stick closely to the original training loss functions of Faster R-CNN except for the classification loss, which we replace with a sigmoid binary cross-entropy that takes the Open Images class hierarchy and multilabel nature into account. The training class labels are expanded [1] based on the given hierarchy tree [22].

Weight Transfer Network (WTN). By default, the WTN variants have input/feature/output channels of 2048. For the Group Normalization (GN) layers in WTN+ and AE-WTN, we follow the same "number of groups" hyperparameter, which is set to 32 as found to be a good choice by [44]. The WTN networks are trained from scratch, simultaneously with DEN, using AdamW [28] with default hyperparameters and weight decay. For AE-WTN, $\alpha$ is set to 20 throughout the experiments.
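Regarding the hierarchy-aware classification loss mentioned above for DEN, a minimal sketch of label expansion followed by sigmoid binary cross-entropy is given below; the parent map and helper are hypothetical illustrations of ours, not the released implementation or the official hierarchy file.

```python
import torch
import torch.nn.functional as F

# Hypothetical fragment of the Open Images hierarchy; the real tree
# comes from the official hierarchy file [22].
PARENT = {"Limousine": "Car", "Car": "Vehicle"}

def expand_labels(labels):
    """Propagate positives up the hierarchy: a box labeled 'Limousine'
    is also a positive example for 'Car' and 'Vehicle'."""
    expanded = set(labels)
    for c in labels:
        while c in PARENT:
            c = PARENT[c]
            expanded.add(c)
    return expanded

assert expand_labels({"Limousine"}) == {"Limousine", "Car", "Vehicle"}

# With hierarchy-expanded multi-hot targets, the box classification
# loss is a per-class sigmoid binary cross-entropy, not a softmax.
logits = torch.randn(4, 500)     # 4 boxes, 500 classes
targets = torch.zeros(4, 500)    # multi-hot, hierarchy-expanded
loss = F.binary_cross_entropy_with_logits(logits, targets)
```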
To validate the effectiveness of our proposed AE-WTN model, we experimentally compare it with existing weight transfer-related methods, described in the following. Note that all these methods use the same Faster R-CNN detection framework and a ResNet-50 backbone.

• Faster R-CNN: Vanilla Faster R-CNN [34] performs fully supervised learning on seen classes. In contrast to WTN, vanilla Faster R-CNN learns conventional classification weights which are both linear and class-specific. To detect novel classes, we employ the nearest-neighbor approach (NN), taking the detections of the nearest seen classes.

• LSDA [15]: LSDA adapts CLN's weights for the detection task by learning additive class-specific biases. To make predictions for a novel class at test time, the biases of the nearest classes are averaged and added to CLN's weight vector. The visual similarity transfer variant [39] is also included.

• ZSD [2]: ZSD performs zero-shot detection through pre-trained word embeddings. In a joint visual-word embedding setting, the detector learns to output visual embeddings in the words' embedding space. Here, two kinds of embeddings are considered: CLN's weights and fastText [18].

• WTN [16]: This corresponds to the standard (existing) WTN model that makes use of neither normalization techniques nor the reconstruction loss.

• WTN+ variants: Since the reconstruction loss of AE-WTN can be seen as a regularizer, we compare it with several WTN+ variants regularized with increased weight decay [21], an activity regularizer (0.01) [30], Dropout (0.3) [38] on intermediate activations, and reduced network capacity (halving the number of channels in the hidden layer).

Table 1. Comparison with weight transfer-related methods on evaluation datasets: OI-500 (seen classes), OI-57 (novel classes), and VG-200 (novel classes).

| Method | OI-500 (Seen) AP | OI-57 (Novel) AP | VG-200 (Novel) AR |
|---|---|---|---|
| Faster R-CNN [34] | 59.55 | - | - |
| Faster R-CNN (NN) | - | 28.09 | 49.39 |
| LSDA [15] | 59.44 | 25.89 | 51.14 |
| LSDA (Visual Transfer) [39] | 59.44 | 26.43 | 53.03 |
| ZSD [2] with CLN weights | 47.37 | 34.63 | 38.04 |
| ZSD [2] with fastText [18] | 58.39 | 29.51 | 35.09 |
| WTN [16] | 52.87 | 34.94 | 41.91 |
| WTN+ (default model) | 58.82 | 39.28 | 65.60 |
| WTN+, increased weight decay | 58.46 | 40.79 | 65.87 |
| WTN+, activity regularizer [30] | 55.86 | 33.47 | 36.26 |
| WTN+, Dropout [38] | 57.14 | 40.09 | 65.52 |
| WTN+, reduced capacity | 58.80 | 37.81 | 63.16 |
| AE-WTN | | | |

The results are given in Table 1. We use ResNet-50 as the backbone for the vanilla Faster R-CNN detector; its AP on OI-500 is 59.55, which is mildly worse than that achieved by the state-of-the-art SE-ResNeXt-101 detector [1]. Overall, the WTN methods outperform the non-WTN methods by large margins on the novel classes (OI-57 and VG-200), due to the powerful weight transfer function learned by WTN that can generalize to many classes. Among the WTN methods, AE-WTN, which incorporates all the proposed improvements, achieves the best results.

WTN and WTN+ suffer from weakened performance on OI-500 (seen classes $D$) compared with the vanilla Faster R-CNN detector they are built upon. In other words, switching to WTN from conventional classification weights decreases performance on the seen classes. This phenomenon has been observed by prior works [15, 19] attempting to scale object detection with weak or partial supervision. By integrating an autoencoder into WTN (AE-WTN), the seen-class detection performance can be recovered. It is extremely challenging to train a conventional WTN from scratch. The reconstruction loss (which is more easily optimized than the detection loss) encourages AE-WTN to output weights highly representative of the original CLN weights, thus providing a good initialization to attain better local optima. Similar to prior works that find reduced supervised training loss with autoencoders [23, 45], we find that the box-level classification training loss $\ell_{cls}$ on seen classes attained by AE-WTN (0.5572) is lower than that of WTN+ (0.5754).

Moreover, the reconstruction loss explicitly involves the novel classes $N$ during training and forces AE-WTN to preserve rich class information of the novel classes in the latent and output spaces. It also encourages the visual features learned by DEN to be more "generic" (less specific to classes $D$ in the detection dataset) in order to accommodate the many classes represented by AE-WTN. Therefore, the detector equipped with AE-WTN shows improved absolute performance over WTN+ on OI-57 and VG-200. Compared to the other existing regularization techniques applied to WTN+, AE-WTN performs better across all datasets. This confirms that the advantages of the reconstruction loss cannot simply be replicated by other regularizers that do not leverage the rich class information contained in CLN's weights.
Qualitative results. We provide in Fig. 5 some qualitative results obtained by our proposed AE-WTN detector on test images of the Open Images [22] and Visual Genome [20] datasets. Only the classes with the highest scores are shown, and novel classes compete with seen classes for the same bounding boxes. Remarkably, the detector can detect a variety of novel classes at greater confidence than seen classes, despite not seeing them during training.

Figure 5. Qualitative results of the proposed AE-WTN. The blue detection boxes are seen object classes $D$, and the red boxes are novel classes $N$. Our method can handle a variety of novel classes and concepts, while performing well on seen classes.
Local neighborhood preservation.
To better understand the implications of the reconstruction loss on the local neighborhood preservation of AE-WTN, we compute the overlapping count between the nearest neighbors obtained by CLN's weights and the output weights of the WTN model of interest (AE-WTN, WTN+, or WTN), varying the number of neighbors (a standard hyperparameter of the nearest neighbor approach) for all methods. This study is performed on 20 randomly-sampled classes, and the counts are averaged across those classes. The nearest neighbors are drawn from among the 5,000 classes of CLN. The findings are presented in Fig. 4. For instance, at 100 neighbors, AE-WTN's output weights and CLN's weights have an average of 48.25 overlapping neighbors, while WTN+ and WTN have 38.0 and 31.95 overlapping neighbors respectively. As shown, AE-WTN consistently reaches greater numbers of overlapping neighbors (with CLN's neighbors) than WTN+ and WTN do, indicating that AE-WTN better preserves the local neighborhood relationships of classes than WTN+. Noticeably, the gap widens as the number of nearest neighbors increases.

Figure 4. The overlapping count (vertical axis) between CLN's nearest neighbors and the nearest neighbors obtained by the WTN model of interest (AE-WTN, WTN+, or WTN), given varying numbers of nearest neighbors (horizontal axis).
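The overlap statistic can be computed as sketched below; this is illustrative, and the similarity metric (cosine) is an assumption of ours, as the text does not specify one.

```python
import torch
import torch.nn.functional as F

def knn_indices(weights, k):
    """k nearest neighbors of each class weight vector, excluding the
    class itself (cosine similarity assumed)."""
    w = F.normalize(weights, dim=1)
    sim = w @ w.t()
    sim.fill_diagonal_(float("-inf"))
    return sim.topk(k, dim=1).indices

def mean_overlap(w_cln, w_out, sampled_classes, k):
    """Average number of neighbors shared between CLN's weights and a
    WTN model's output weights, over the sampled classes."""
    nn_cln, nn_out = knn_indices(w_cln, k), knn_indices(w_out, k)
    counts = [len(set(nn_cln[c].tolist()) & set(nn_out[c].tolist()))
              for c in sampled_classes]
    return sum(counts) / len(counts)
```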
Normalizations in WTN+. We perform an ablation study in Table 2 to understand how the performance changes with the different normalization techniques. It is crucial to combine the two normalizations of WTN+ to obtain the best results for both seen and novel classes. Furthermore, we observe worse training losses with the non-normalized WTN compared with WTN+, implying that model under-fitting is the inherent cause of WTN's under-performance.

Table 2. Ablation study on the WTN+ architecture (WTN → WTN+).

| Input Norm. | Group Norm. | OI-500 (Seen) AP | OI-57 (Novel) AP | VG-200 (Novel) AR |
|---|---|---|---|---|
| ✗ | ✗ | | | |
| ✓ | ✗ | | | |
| ✗ | ✓ | | | |
| ✓ | ✓ | | | |
Choice of feature normalization. GN [44] is chosen over the typical BatchNorm (BN) [17] because, for WTN+, BN is less robust towards novel-class inputs, which do not have detection annotations/loss [8]. We find that the post-ReLU activation (L2) norms of WTN+ with BN have an unusually large variance for the novel classes, far larger than that of the shared classes, despite allowing BN to normalize over all classes in training. Such unstable activations are not encountered by the detection network during training. This causes WTN+'s predicted weights for novel classes to interact poorly with image-region features at test time, resulting in unreliable class-score predictions. Table 3 shows the L2 norm means and variances when using GN and BN.

Table 3. Means and variances of post-ReLU activation norms.

| | Mean (shared cls.) | Mean (novel cls.) | Variance (shared cls.) | Variance (novel cls.) |
|---|---|---|---|---|
| GN [44] | 1.838 | 1.784 | 0.091 | 0.093 |
| BN [17] | 1.379 | | | |
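The statistics in Table 3 can be gathered with a few lines; this is an illustrative sketch, where `shared_mask` (a hypothetical name) marks which of CLN's classes belong to $S$.

```python
import torch

def post_relu_norm_stats(activations, shared_mask):
    """Mean and variance of per-class L2 norms of post-ReLU activations,
    reported separately for shared and novel classes (cf. Table 3).
    activations: (num_classes, dim); shared_mask: bool (num_classes,)."""
    norms = activations.norm(dim=1)
    return {
        "shared": (norms[shared_mask].mean().item(),
                   norms[shared_mask].var().item()),
        "novel": (norms[~shared_mask].mean().item(),
                  norms[~shared_mask].var().item()),
    }
```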
Computational efficiency. Computational efficiency is a major concern in the training and/or deployment of object detectors, especially for large-scale detectors. In Table 4, we show the per-iteration training time (in milliseconds, ms) and the single-GPU memory usage of training with the different models. Overall, the WTN models add very little computational cost on top of Faster R-CNN's. During testing, all the weights can be transformed offline with WTN/WTN+/AE-WTN to reach vanilla Faster R-CNN's efficiency.

Table 4. Training time and memory usage.

| | Faster R-CNN | WTN | WTN+ | AE-WTN |
|---|---|---|---|---|
| Time (ms) | 365 | 371 | 379 | 401 |
| Mem. (GB) | 4.11 | 4.15 | 4.19 | 4.26 |
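The offline transformation mentioned above amounts to a single forward pass of the trained encoder; a minimal sketch (the linear layer below is a stand-in for the trained AE-WTN encoder from Section 4):

```python
import torch
import torch.nn as nn

# Stand-in for the trained AE-WTN encoder (see the sketch in Sec. 4);
# in practice this is the trained module's encoder with frozen weights.
encoder = nn.Linear(2048, 2048)

w_cln_all = torch.randn(5000, 2048)   # frozen CLN weights, all classes C
with torch.no_grad():
    w_all = encoder(w_cln_all)        # one offline forward pass

# The transferred weights act as a fixed linear classifier, so CLN and
# the WTN can be discarded and inference cost matches vanilla Faster
# R-CNN: logits are just a matrix product with box features.
box_feats = torch.randn(64, 2048)
logits = box_feats @ w_all.t()        # (64, 5000), incl. novel classes
```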
6. Conclusion
Training large-scale object detectors is extremely resource-demanding (e.g., in data and computation). In this work, we introduce an efficient and effective WTN approach to scale up object detection, and propose novel methods to strongly push the limits of WTN through normalization techniques and an autoencoder-based reconstruction loss. The reconstruction loss adopted by AE-WTN effectively improves its capability to retain and exploit the semantically-rich class information (of all classes) learned by the pre-trained CLN. This leads to improved training of DEN and better detection performance on both seen and novel classes.
Acknowledgement
Jason Kuen was supported by the NTU Research Scholarship under the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
References

[1] Takuya Akiba, Tommi Kerola, Yusuke Niitani, Toru Ogawa, Shotaro Sano, and Shuji Suzuki. PFDet: 2nd place solution to Open Images Challenge 2018 object detection track. arXiv preprint arXiv:1809.00778, 2018.
[2] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018.
[3] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, pages 1081–1089, 2015.
[4] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, pages 2846–2854, 2016.
[5] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
[6] Ramon Lopez De Mantaras and Eva Armengol. Machine learning from examples: Inductive and lazy methods. Data & Knowledge Engineering, 25(1-2):99–123, 1998.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
[8] Angus Galloway, Anna Golubeva, Thomas Tanay, Medhat Moussa, and Graham W. Taylor. Batch normalization is a cause of adversarial vulnerability. In ICML Workshop, 2019.
[9] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[10] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[11] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[14] Geoffrey E. Hinton and Richard S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In NIPS, pages 3–10, 1994.
[15] Judy Hoffman, Sergio Guadarrama, Eric S. Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. LSDA: Large scale detection through adaptation. In NIPS, pages 3536–3544, 2014.
[16] Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. Learning to segment every thing. In CVPR, pages 4233–4241, 2018.
[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[18] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In ACL, pages 427–431, 2017.
[19] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. arXiv preprint arXiv:1812.01866, 2018.
[20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[21] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS, pages 950–957, 1992.
[22] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.
[23] Lei Le, Andrew Patterson, and Martha White. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In NeurIPS, pages 107–117, 2018.
[24] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly supervised object localization with progressive domain adaptation. In CVPR, pages 3512–3520, 2016.
[25] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[27] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[29] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[30] Stephen Merity, Bryan McCann, and Richard Socher. Revisiting activation regularization for language RNNs. arXiv preprint arXiv:1708.01009, 2017.
[31] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, pages 3546–3554, 2015.
[32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
[33] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, pages 6517–6525, 2017.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 39(6):1137–1149, 2017.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[36] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[37] Bharat Singh, Hengduo Li, Abhishek Sharma, and Larry S. Davis. R-FCN-3000 at 30fps: Decoupling detection and classification. In CVPR, pages 1081–1090, 2018.
[38] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
[39] Yuxing Tang, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, and Liming Chen. Large scale semi-supervised object detection using visual and semantic knowledge transfer. In CVPR, pages 2119–2128, 2016.
[40] Yuxing Tang, Josiah Wang, Xiaofang Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, and Liming Chen. Visual and semantic knowledge transfer for large scale semi-supervised object detection. TPAMI, 40(12):3045–3058, 2018.
[41] Qingyi Tao, Hao Yang, and Jianfei Cai. Zero-annotation object detection with web knowledge transfer. In ECCV, 2018.
[42] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
[43] Jasper Uijlings, Stefan Popov, and Vittorio Ferrari. Revisiting knowledge transfer for training object class detectors. In CVPR, 2018.
[44] Yuxin Wu and Kaiming He. Group normalization. In ECCV, pages 3–19, 2018.
[45] Yuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML, pages 612–621, 2016.
[46] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders.