Regularizing Deep Networks by Modeling and Predicting Label Structure
Mohammadreza Mostajabi, Michael Maire, Gregory Shakhnarovich
Toyota Technological Institute at Chicago
{mostajabi,mmaire,greg}@ttic.edu

Abstract
We construct custom regularization functions for use in supervised training of deep neural networks. Our technique is applicable when the ground-truth labels themselves exhibit internal structure; we derive a regularizer by learning an autoencoder over the set of annotations. Training thereby becomes a two-phase procedure. The first phase models labels with an autoencoder. The second phase trains the actual network of interest by attaching an auxiliary branch that must predict output via a hidden layer of the autoencoder. After training, we discard this auxiliary branch. We experiment in the context of semantic segmentation, demonstrating that this regularization strategy leads to consistent accuracy boosts over baselines, both when training from scratch and in combination with ImageNet pretraining. Gains are also consistent over different choices of convolutional network architecture. As our regularizer is discarded after training, our method has zero cost at test time; the performance improvements are essentially free. We are simply able to learn better network weights by building an abstract model of the label space, and then training the network to understand this abstraction alongside the original task.
1. Introduction
The recent successes of supervised deep learning rely on the availability of large-scale datasets with associated annotations for training. In computer vision, annotation is a sufficiently precious resource that it is commonplace to pretrain systems on millions of labeled ImageNet [8] examples. These systems absorb a generally useful visual representation during pretraining, before being fine-tuned to perform more specific tasks using fewer labeled examples.

Current state-of-the-art semantic segmentation methods [25, 7, 46] follow such a strategy. Its necessity is driven by the high relative cost of annotating ground-truth for spatially detailed segmentations [12, 24], and the accuracy gains achievable by combining different data sources and label modalities during training. A collection of many images, coarsely annotated with a single label per image (e.g. ImageNet [8]), is still quite informative in comparison to a smaller collection with detailed per-pixel label maps for each image (e.g. PASCAL [12] or COCO [24]).

We show that detailed ground-truth annotation of this latter form contains additional information that existing schemes for training deep convolutional neural networks (CNNs) fail to exploit. By designing a new training procedure, we are able to capture some of this information, and as a result increase accuracy at test time.

Our method is orthogonal to recent efforts, discussed in Section 2, on learning from images in an unsupervised or self-supervised manner [34, 31, 43, 22, 23, 44, 11]. It is not dependent upon the ability to utilize an external pool of data. Rather, our focus on more efficiently utilizing provided labels makes our contribution complementary to these other learning techniques. Experiments show gains both when training from scratch and in combination with pretraining on an external dataset.

Our innovation takes the form of a regularization function that is itself learned from the training set labels. This yields two distinct training phases. The first phase models the structure of the labels themselves by learning an autoencoder. The second phase follows the standard network training regime, but includes an auxiliary task of predicting the output via the decoder learned in the first phase. We view this auxiliary branch as a regularizer; it is only present during training. Figure 1 illustrates this scheme.

Section 3 further details our approach and the intuition behind it. Our regularizer can be viewed as a requirement that the system understand context, or equivalently, as a method for synthesizing context-derived labels at coarser spatial resolution. The auxiliary branch must predict this more abstract, context-sensitive representation in order to successfully interface with the decoder.

Experiments, covered in Section 4, focus on the PASCAL semantic segmentation task. We take baseline CNN architectures, the established VGG [36] network and the state-of-the-art DenseNet [16], and report the performance gains obtained by enhancing them with our custom regularizer during training. Section 4 also provides ablation studies, explores an alternative regularizer implementation, and visualizes representations learned by the label autoencoder.
Figure 1.
Exploiting label structure when training semantic segmentation.
Top:
An initial phase looks only at the ground-truth annotation of training examples, ignoring the actual images. We learn an autoencoder that approximates an identity function over segmentation label maps. It is constrained to compress and reconstitute labels by passing them through a bottleneck connecting an encoder (red) and decoder (blue).
Bottom:
The second phase trains a standard convolutional neural network (CNN) for semantic segmentation using hypercolumn [14, 29] features for per-pixel output. However, we attach an auxiliary branch (and loss) that also predicts segmentation by passing through the decoder learned in the first phase. After training, we discard this decoder branch, making the architecture appear standard.

Results demonstrate performance gains under all settings in which we applied our regularization scheme: VGG or DenseNet, with or without data augmentation, and with or without ImageNet pretraining. Performance of a very deep DenseNet, with data augmentation and ImageNet pretraining, is still further improved with use of our regularizer during training. Together, these results indicate that we have discovered a new and generally applicable method for regularizing supervised training of deep networks. Moreover, our method has no cost at test time; it produces networks architecturally identical to baseline designs.

Section 5 discusses implications of our demonstration that it is possible to squeeze more benefit from detailed label maps when training deep networks. Our results open up a new area of inquiry on how best to build datasets and design training procedures to efficiently utilize annotation.
2. Related Work
The abundance of data, but more limited availability of ground-truth supervision, has sparked a flurry of recent interest in developing self-supervised methods for training deep neural networks. Here, the idea is to utilize a large reserve of unlabeled data in order to prime a deep network to encode generally useful visual representations. Subsequently, that network can be fine-tuned on a novel target task, using actual ground-truth supervision on a smaller dataset. Pretraining on ImageNet [8] currently yields such portable representations [10], but lacks the ability to scale without requiring additional human annotation effort.

Recent research explores a diverse array of data sources and tasks for self-supervised learning. In the domain of images, proposed tasks include inpainting using context [34], solving jigsaw puzzles [31], colorization [43, 22, 23], cross-channel prediction [44], and learning a bidirectional variant [11] of generative adversarial networks (GANs) [13]. In the video domain, recent works harness temporal coherence [28, 18], co-occurrence [17], and ordering [27], as well as tracking [40], sequence modeling [37], and motion grouping [33]. Owens et al. [32] explore cross-modality self-supervision, connecting vision and sound. Agrawal et al. [3] and Nair et al. [30] examine settings in which a robot learns to predict the visual effects of its own actions.

Training a network to perform ImageNet classification or a self-supervised task, in addition to the task of interest, can be viewed as a kind of implicit regularization constraint. Zhang et al. [45] explore explicit auxiliary reconstruction tasks to regularize training. However, they focus on encoding and decoding image feature representations. Our approach differs entirely in the source of regularization.

Specifically, by autoencoding the structure of the target task labels, we utilize a different reserve of information than all of the above methods. We design a new task, but whereas self-supervision formulates the new task on external data, we derive the new task from the annotation. This separation of focus allows for possible synergistic combination of our method with pretraining of either the self-supervised or supervised (ImageNet) variety. Section 4 tests the latter.

Another important distinction from recent self-supervised work is that, as detailed in Section 3, we use a generic mechanism, based on an autoencoder, for deriving our auxiliary task. In contrast, the vast majority of effort in self-supervision has relied on using domain-specific knowledge to formulate appropriate tasks. Inpainting [34], jigsaw puzzles [31], and colorization [43, 22, 23] exemplify this mindset; BiGANs [11] are perhaps an exception, but to date their results compare less favorably [23].
Figure 2.
Informative structure in annotation.
The shape of labeled semantic regions hints at unlabeled parts (black arrows). Object co-occurrence provides a prior on scene composition.

The work of Xie et al. [41] shares similarities to our approach along the aspect of modeling label space. However, they focus on learning a shallow corrective model that essentially denoises a predicted label map using center-surround filtering. In contrast, we build a deep model of label space. Also, unlike [41], our approach has no test-time cost, as we impose it only as a regularizer during training, rather than as an ever-present denoising layer.

Inspiration for our method traces back to the era of vision prior to the pervasive use of deep learning. It was once common to consider context as important [39], reason about object parts, co-occurrence, and interactions [9], and design graphical models to capture such relationships [38]. We refer to only a few sample papers, as fully accounting for a decade of computer vision research is not possible here. In the following section, we open a pathway to pull such thinking about compositional scene priors into the modern era: simply learn, and employ, a deep model of label space.
3. Method
Figure 2 is a useful aid in explaining the intuition behind the regularization scheme outlined in Figure 1. Suppose we want to train a CNN to recognize and segment cats, but our limited training set consists only of tigers. It is conceivable that the CNN will learn an equivalence between black and orange striped texture and the cat category, as such an association suffices to classify every pixel on a tiger. It thus overfits to the tiger subclass and fails when tested on images of house cats. This behavior could arise even if trained with detailed supervision of the form shown in Figure 2.

Yet, the semantic segmentation ground-truth suggests to any human that texture should not be the primary criterion. There are no stripes in the annotation. Over the entire training set, regions labeled as cat share a distinctive shape that deforms in a manner suggestive of unlabeled parts (e.g. head, body, tail, ear). The presence or absence of other objects in the scene may also provide contextual cues as to the chance of finding a cat. How can we force the CNN to notice this wealth of information during training?

We could consider treating the ground-truth label map as an image, and clustering local patches.
Figure 3. DenseNet architectural specifications. Both DenseNet-67 and DenseNet-121 begin with initial convolution and pooling stages and then stack four dense blocks, each a repeated sequence of paired convolutions, separated by transition layers that combine a convolution with stride-2 average pooling; DenseNet-121 follows the configuration of [16].
The patch containing the skinny tail would fall in a different cluster than that containing the pointy ear. Adding the cluster identities as another semantic label, and requiring the CNN to predict them, would force the CNN to differentiate between the tail and ear by developing a representation of shape. This clustering approach is reminiscent of Poselets [6, 5].

Following this strategy, we would need to hand-craft another scheme for capturing object co-occurrence relations, perhaps by clustering descriptors spanning a larger spatial extent. We would prefer a general means of capturing features of the ground-truth annotations, and one not limited to a few hand-selected characteristics. Fortunately, deep networks are a suitable general tool for building the kind of abstract feature hierarchy we desire.
Specifically, as shown in Figure 1, we train an autoencoder on the ground-truth label maps. This autoencoder consumes a semantic segmentation label map as input and attempts to replicate it as output. By virtue of being required to pass through a small bottleneck representation, the job of the autoencoder is nontrivial. It must compress the label map into the bottleneck representation. This compression constraint will (ideally) force the autoencoder to discover and implicitly encode parts and contextual relationships.

Ground-truth semantic segmentation label maps are simpler than real images, so this autoencoder need not have as high a capacity as a network operating on natural images. We use a relatively simple autoencoder architecture, consisting of a mirrored encoder and decoder, with no skip connections. The encoder is a sequence of five convolutional layers, with max-pooling between them. The decoder uses upsampling followed by convolution. As a default, we set each layer to have 32 channels. We also experiment with two higher-capacity variants whose channel counts grow through the encoder toward wider bottlenecks; the decoder uses the same channel progression in reverse order. We refer to the three autoencoder variants by the number of channels in their respective bottleneck layers (32, 128, or 256).
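For concreteness, the following is a minimal PyTorch sketch of the default 32-channel label autoencoder. The 3x3 kernels, 2x2 max-pooling, bilinear upsampling, one-hot label input, and the class count of 21 (20 PASCAL categories plus background) are illustrative assumptions rather than details taken from the text above.

```python
import torch.nn as nn

NUM_CLASSES = 21  # 20 PASCAL categories + background (assumed encoding of the label map)

def make_label_autoencoder(channels=32):
    """Sketch of the default 32-channel label autoencoder (kernel/pool sizes assumed)."""
    relu = lambda: nn.ReLU(inplace=True)
    up = lambda: nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
    # Encoder: five conv layers with max-pooling between them -> coarse bottleneck code.
    encoder = nn.Sequential(
        nn.Conv2d(NUM_CLASSES, channels, 3, padding=1), relu(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), relu(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), relu(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), relu(), nn.MaxPool2d(2),
        nn.Conv2d(channels, channels, 3, padding=1), relu())
    # Decoder mirrors the encoder: upsampling followed by convolution, no skip connections.
    decoder = nn.Sequential(
        up(), nn.Conv2d(channels, channels, 3, padding=1), relu(),
        up(), nn.Conv2d(channels, channels, 3, padding=1), relu(),
        up(), nn.Conv2d(channels, channels, 3, padding=1), relu(),
        up(), nn.Conv2d(channels, NUM_CLASSES, 3, padding=1))  # per-pixel class logits
    return encoder, decoder

# Phase one: train encoder+decoder to reconstruct one-hot ground-truth label maps
# (e.g. with a per-pixel cross-entropy loss), then freeze both for phase two.
```

The 128- and 256-channel variants would simply widen the later encoder layers (and mirror that widening in the decoder).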
Figure 4.
Alternative regularization scheme.
Instead of predicting a representation to pass through the decoder, as in Figure 1, we can train with an auxiliary regression problem. We place a loss on directly predicting activations produced by the hidden layers of the encoder.
Convolutional neural networks for image classification gradually reduce spatial resolution with depth through a series of pooling layers [21, 36, 15, 16]. As the semantic segmentation task requires output at fine spatial resolution, some method of preserving or recovering spatial resolution must be introduced into the architecture. One option is to gradually re-expand spatial resolution via upsampling [35, 4]. Other approaches utilize some form of skip-connection to forward spatially resolved features from lower layers of the network to the final layer [26, 14, 29]. Dilated [42] or atrous convolutions [7] can also be mixed in. Alternatively, the basic CNN architecture can be reformulated in a multigrid setting [19].

Our goal is to examine the effects of a regularization scheme in isolation from major architectural design changes. Hence, we choose hypercolumn [14, 29] CNN architectures as a primary basis for experimentation, as they are minimally separated from the established classification networks in design space. They also offer the added advantage of having readily available ImageNet pretrained models, easing experimentation in this setting.

We consider hypercolumn variants of VGG-16 [36] and DenseNet [16]. These variants simply upsample and concatenate features from intermediate network layers for use in predicting semantic segmentation. As shown in Figure 1, this can equivalently be viewed as associating with each spatial location a feature formed by concatenating a local slice of every CNN layer. The label of the corresponding pixel in the output is predicted from that feature.

VGG-16 is widely used, while DenseNet [16] represents the latest high-performance evolution of ResNet [15]-like designs. We use 67-layer and 121-layer DenseNets with the architectural details specified in Figure 3; the 121-layer network is the same as in [16] and uses a growth rate of 32. We work with the same input and output spatial resolutions in both CNNs and our label autoencoder.

As shown by the large gray arrow in Figure 1, we impose our regularizer by connecting a CNN (e.g. VGG or DenseNet) to the decoder portion of our learned label autoencoder. Importantly, the decoder parameters are frozen during this training phase. The CNN now has two tasks, each with an associated loss, to perform during training. As usual, it must predict semantic segmentation using hypercolumns. It must also predict the same semantic segmentation via an auxiliary path through the decoder. Backpropagation from losses along both paths influences CNN parameter updates. Though they participate in one of these paths, parameters internal to the decoder are never updated.

We connect VGG-16 or DenseNet to the decoder by predicting input for the decoder from the output of the penultimate CNN layer prior to global pooling. This is the second-to-last convolutional layer, and is selected because its spatial resolution matches that of the expected decoder input. The prediction itself is made via a new convolutional layer, dedicated to that purpose.

If the label autoencoder learns useful abstractions, requiring the CNN to work through the decoder ensures that it learns to work with those abstractions.
The hypercolumn pathway allows the CNN to make direct predictions, while the decoder pathway ensures that the CNN has "good reasons", or a high-level abstract justification, for its predictions.

Assuming autoencoder layers gradually build up good abstractions, there exist alternative methods of connecting it as a regularizer. Figure 4 diagrams one such alternative. Here, we ask the CNN to directly predict the feature representation built by the label encoder. Encoder parameters are, of course, frozen here. An auxiliary layer attempts to predict the encoder hypercolumn from the CNN hypercolumn at the corresponding spatial location. The CNN must also still solve the original semantic segmentation task.

As Section 4 shows, this alternative scheme works well, but not quite as well as using the decoder pathway. Using the decoder is also appealing for more reasons than performance alone. Defining an auxiliary loss in terms of decoder semantic segmentation output is more interpretable than defining it in terms of mean square error (MSE) between two hypercolumn features. Moreover, the decoder output is visually interpretable; we can see the semantic segmentation predicted by the CNN via the decoder.
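A condensed sketch of the second training phase follows. The helper names (segmentation_net returning both hypercolumn logits and its penultimate feature map, aux_head as the new prediction layer feeding the decoder, loader) and the numeric hyperparameters are hypothetical placeholders, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def train_phase_two(segmentation_net, aux_head, decoder, loader,
                    aux_weight=0.1, lr=1e-4, epochs=30):
    """Joint training with the frozen decoder as an auxiliary branch (illustrative values)."""
    for p in decoder.parameters():
        p.requires_grad = False                      # decoder stays fixed in phase two
    optimizer = torch.optim.Adam(
        list(segmentation_net.parameters()) + list(aux_head.parameters()), lr=lr)
    for _ in range(epochs):
        for image, labels in loader:                 # labels: per-pixel class indices
            primary_logits, penult = segmentation_net(image)
            # Auxiliary path: predict the decoder's input from the penultimate features,
            # then decode it into a second segmentation estimate (spatial sizes assumed to match).
            aux_logits = decoder(aux_head(penult))
            loss = (F.cross_entropy(primary_logits, labels, ignore_index=255) +   # 255: PASCAL void pixels
                    aux_weight * F.cross_entropy(aux_logits, labels, ignore_index=255))
            optimizer.zero_grad()
            loss.backward()                          # gradients pass through, but never update, the decoder
            optimizer.step()
```

After training, the auxiliary head and decoder are simply dropped, leaving a standard network.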
4. Experiments
The PASCAL dataset [12] serves as our experimental testbed. We follow standard procedure for semantic segmentation, using the official PASCAL 2012 training set, and reporting performance in terms of mean intersection over union (mIoU) on the validation set (as validation ground-truth is publicly available). We explore both our decoder- and encoder-based regularization schemes in combination with multiple choices of base network, data augmentation, and pretraining. When applying the encoder as a regularizer, we task the CNN with predicting the concatenation of the encoder's activations in its conv1 and conv3 layers.
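For the encoder-based variant just described, a sketch of the auxiliary loss might look as follows; label_encoder_feats (producing the concatenated, spatially aligned conv1 and conv3 activations of the frozen encoder for the ground-truth labels) and pred_head are hypothetical helpers, not names from the paper.

```python
import torch.nn.functional as F

def encoder_regularizer_loss(cnn_hypercolumns, labels_onehot, label_encoder_feats, pred_head):
    """MSE between CNN-predicted features and frozen label-encoder activations (Figure 4)."""
    target = label_encoder_feats(labels_onehot).detach()    # frozen encoder; no gradient into it
    prediction = pred_head(cnn_hypercolumns)                # auxiliary layer on CNN hypercolumns
    return F.mse_loss(prediction, target)
```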
All experiments are done in PyTorch [1], using the Adam [20] update rule when training networks. Models trained from scratch use a fixed batch size and learning rate, with the learning rate decreased once after an initial number of epochs before training continues for additional epochs. For ImageNet pretrained models, we normalize hypercolumn features such that they have zero-mean and unit-variance; we first keep the deep network weights frozen and train only the classifier, then decrease the learning rate and train end-to-end for additional epochs.

Data augmentation, when used, includes a crop of random size (covering 0.08 to 1.0 of the original area) and random aspect ratio (3/4 to 4/3 of the original), resized to a fixed square resolution, plus a random horizontal flip. Pretrained models are based on the PyTorch torchvision library [2].

We use cross-entropy loss on auxiliary regularization branches, except where indicated by a superscript † in results tables. For these experiments, we use MSE loss.
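The augmentation just described matches the behavior of torchvision's RandomResizedCrop; a sketch is below, with the output resolution (256 here) a placeholder, since the exact training resolution is not preserved in this text.

```python
from torchvision import transforms

CROP_SIZE = 256  # placeholder; substitute the actual training resolution
# Random crop covering 8%-100% of the image area, aspect ratio in [3/4, 4/3],
# resized to a fixed square, plus a random horizontal flip.
train_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(CROP_SIZE, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(),
])
```

Note that for segmentation the identical geometric transform must also be applied to the label map, so in practice a joint image/label transform is needed rather than this image-only pipeline.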
Tables 1, 3, and 4 summarize the performance benefits of training with our regularizer. In the absence of pretraining or data augmentation, we boost performance of both VGG-16 and DenseNet-67 over their respective baselines. Regularization with our decoder still improves mIoU (from 58.8 to 60.6) for DenseNet-67 trained with data augmentation.

Architecture              Data-Aug?   Auxiliary Regularizer       mIoU
VGG-16-hypercolumn        no          none                        37.3
                          no          Encoder (conv1 & conv3)†
                          no          Decoder (32 channel)†
                          yes         none                        55.2
                          yes         Decoder (128 channel)       57.1
VGG-16-FCN8s              yes         none                        51.5
                          yes         Decoder (128 channel)       54.1
DenseNet-67-hypercolumn   no          none                        40.5
                          no          Encoder (conv1 & conv3)†
                          no          Decoder (32 channel)†
                          no          Decoder (128 channel)       42.5
                          yes         none                        58.8
                          yes         Decoder (32 channel)        59.4
                          yes         Decoder (128 channel)       60.6
                          yes         Decoder (256 channel)       59.8

Table 1. PASCAL mIoU without ImageNet pretraining. In each experimental setting (choice of architecture, and presence or absence of data augmentation), training with any of our regularizers improves performance over the corresponding baseline (the rows with no auxiliary regularizer).

Architecture              Data-Aug   Auxiliary Regularizer       mIoU
DenseNet-67-hypercolumn   yes        none                        58.8
                          yes        Decoder (128 channel)       60.6
                          yes        Unfrozen Decoder            60.2
                          yes        Random Init. Decoder        58.8

Table 2. Ablation study. PASCAL mIoU deteriorates if the decoder parameters are not held fixed while training the main CNN.

Architecture              Data-Aug?   Auxiliary Regularizer       mIoU
VGG-16-hypercolumn        no          none                        67.1
                          no          Decoder (32 channel)        68.8
DenseNet-121-hypercolumn  yes         none                        71.6
                          yes         Decoder (128 channel)       71.9
ResNet-101-PSPNet         yes         none                        75.4
                          yes         Decoder (128 channel)       75.9

Table 3. PASCAL mIoU with ImageNet pretraining.

Architecture              Data-Aug   Auxiliary Regularizer       mIoU
DenseNet-67-hypercolumn   yes        none                        72.3
                          yes        Decoder (128 channel)       73.6

Table 4. PASCAL mIoU with COCO pretraining.
To further show the robustness of our regularization scheme to the choice of architecture, we also experiment with an FCN [26] version of VGG-16, as included in Table 1.

Table 2 demonstrates the necessity of our two-phase training procedure. If we unfreeze the decoder and update its parameters in the second training phase, test performance of the primary output deteriorates. Likewise, if we skip the first phase, and train from scratch with an unfrozen, randomly initialized decoder, the accuracy gain disappears. Thus, the regularization effect is due to a transfer of information from the learned label model, rather than stemming from an architectural design of dual output pathways.
Figure 5.
Auxiliary loss weighting.
We plot test performance as a function of the relative weight of the losses on the auxiliary vs. primary output branches when training with the setup in Figure 1. Weighting is important, but the optimal balance appears consistent when changing architecture from VGG-16 (green) to DenseNet-67 (magenta). Performance is mIoU on PASCAL, without ImageNet pretraining or data augmentation. Note that any nonzero weight on the auxiliary loss (any regularization) improves over the baseline.
Table 3 shows that our regularization scheme synergizes with ImageNet pretraining. It improves VGG-16 performance, and even provides some benefit to a very deep 121-layer DenseNet pretrained on ImageNet, while using data augmentation. A baseline 71.6 mIoU for DenseNet appears near state-of-the-art for networks that do not employ additional tricks (e.g. custom pooling layers [46], use of multi-scale, or post-processing with CRFs [7]). Our improvement to 71.9 mIoU may be nontrivial. Expanding trials in combination with pretraining, our regularizer improves results when pretraining on COCO, as shown in Table 4.

We also combine our regularizer with the latest network design for semantic segmentation: a dilated ResNet-101 augmented with the pyramid pooling module of PSPNet [46]. We use the output of the pyramid pooling layer to predict input for the decoder and semantic segmentation. Table 3 shows a gain over the corresponding PSPNet baseline.

Beyond autoencoder architecture choice, application of our regularizer involves one free parameter: the relative weight of the auxiliary branch loss with respect to the primary loss. Figure 5 shows how performance of the trained network varies with this parameter, when using our 32-channel bottleneck layer decoder with MSE loss on the auxiliary branch. We have also run similar experiments with cross-entropy loss on the auxiliary branch, over a different range of weight values; the range is changed due to the difference in dynamic range between MSE loss and cross-entropy loss. Behaving similarly to Figure 5, a particular relative weighting achieves the highest accuracy, and we use this weight value across all of the experiments using our decoder with 128-channel bottleneck layer. While the regularizer always provides a benefit, placing a proper relative weight on the auxiliary loss is important.

Figure 6 visualizes the impact of training with our learned label decoder as a regularizer. Most notably, the network trained with regularization appears to correct some global or large-scale semantic errors in comparison to the baseline. Contrast such behavior to CRF-based post-processing, which typically achieves impact through fixing local mistakes. Also notable is that our auxiliary output itself is quite reasonable. This suggests that the autoencoder training phase is successful in creating encoders and decoders that model label structure.

To further investigate what the autoencoder learns, we consider using the bottleneck representation produced by the encoder as defining features by which we can perform queries in label space. Specifically, we pick a region of a training image label and represent that region with features extracted from the bottleneck layer. As the bottleneck layer is low resolution, we are selecting features at a coarse, but corresponding, spatial location. Next, we perform nearest neighbor search over all regions in the validation set and find the two closest regions to the query region. Figure 7 shows the results of this experiment. Returned regions not only have the same object class types as the query regions, but also share similar shapes to that of the query. This reveals that our label autoencoder has learned to capture object shape characteristics.

We also repeat this experiment, except with queries starting from images. Here the bottleneck representation is produced by a CNN, which was trained with both hypercolumn and decoder prediction pathways; the latter yields the required features.
As shown in the top-right of Figure 7, returned regions have similar context and shape to the query.
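A minimal sketch of this label-space query, assuming the per-region bottleneck feature vectors have already been extracted at the corresponding coarse spatial locations; the Euclidean metric is an assumption, as the text does not specify the distance used.

```python
import torch

def nearest_regions(query_feat, candidate_feats, k=2):
    """Return indices of the k candidate regions closest to a query region.

    query_feat:      (C,) bottleneck feature at the query's coarse spatial location.
    candidate_feats: (N, C) bottleneck features for candidate regions in the validation set.
    """
    dists = torch.cdist(query_feat[None, None], candidate_feats[None]).squeeze(0).squeeze(0)
    return torch.topk(dists, k, largest=False).indices
```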
5. Conclusion
Our novel regularization method, when applied to training deep networks for semantic segmentation, consistently improves their generalization performance. The intuition behind our work, that additional supervisory signal can be squeezed from highly detailed annotation, is supported by the types of errors this regularizer corrects, as well as our efforts at introspection into our learned label model.

Our results also indicate that one should now reevaluate the relative utility of different forms of annotation; our method makes detailed labeling more useful than previously believed. This observation may be especially important for applications of computer vision, such as self-driving cars, that demand detailed scene understanding, and for which large-scale dataset construction is essential.
Acknowledgements.
This work was in part supported by the DARPA Lifelong Learning Machines program.

(Figure 6 columns, left to right: Image; Auxiliary Output and Primary Output of our system, DenseNet-67 trained with regularizer; Ground-truth; Baseline DenseNet-67.)

Figure 6.
Semantic segmentation results on PASCAL.
We show the output of a baseline 67-layer hypercolumn DenseNet (rightmost column) compared to that of the same architecture trained with our auxiliary decoder branch as a regularizer (middle columns). All examples are from the validation set. While we can discard the auxiliary branch after training, we include its output here to display the decoder's operation. Our network provides high-level signals to the decoder which, in turn, produces reasonable segmentations. To best illustrate the effect of regularization, all results shown are for networks trained from scratch, without ImageNet pretraining or data augmentation. This corresponds to the 40.5 to 42.5 jump in mIoU reported in Table 1, between the baseline and our primary output.

Figure 7. Finding regions with similar representations.
For each query image (green border) and region (green dot), the next two images to the right are those in the validation set containing the nearest regions to the query region. All query images are from the training set. For examples on red background, search is conducted not by looking at images, but via matching features produced by the encoder run on ground-truth label maps. The bottom-right shows failure cases, such as matching a cat's arm to the car rear door. For examples on gray background, our
DenseNet-67-hypercolumn CNN is used to predict the label space search representations from images.

References

[1] PyTorch. https://github.com/pytorch/pytorch.
[2] PyTorch torchvision. https://github.com/pytorch/vision.
[3] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. NIPS, 2016.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI, 2017.
[5] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. ECCV, 2010.
[6] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. ICCV, 2009.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.
[9] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 2011.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. ICML, 2014.
[11] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. ICLR, 2017.
[12] M. Everingham, L. van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 2010.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014.
[14] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. CVPR, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
[16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. CVPR, 2017.
[17] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson. Learning visual groups from co-occurrences in space and time. ICLR workshop, 2016.
[18] D. Jayaraman and K. Grauman. Slow and steady feature analysis: Higher order temporal coherence in video. CVPR, 2016.
[19] T.-W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. CVPR, 2017.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
[22] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016.
[23] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. CVPR, 2017.
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. ECCV, 2014.
[25] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. ICCV, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
[27] I. Misra, C. L. Zitnick, and M. Hebert. Unsupervised learning using sequential verification for action recognition. ECCV, 2016.
[28] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. ICML, 2009.
[29] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. CVPR, 2015.
[30] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. ICRA, 2017.
[31] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. ECCV, 2016.
[32] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. ECCV, 2016.
[33] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning features by watching objects move. CVPR, 2017.
[34] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016.
[35] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
[37] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. ICML, 2015.
[38] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Learning hierarchical models of scenes, objects, and parts. ICCV, 2005.
[39] A. Torralba. Contextual priming for object detection. IJCV, 2003.
[40] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. ICCV, 2015.
[41] S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. ECCV, 2016.
[42] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.
[43] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016.
[44] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. CVPR, 2017.
[45] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. ICML, 2016.
[46] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. CVPR, 2017.