A critical analysis of self-supervision, or what we can learn from a single image
Published as a conference paper at ICLR 2020
Yuki M. Asano Christian Rupprecht Andrea Vedaldi
Visual Geometry Group, University of Oxford
{yuki, chrisr, vedaldi}@robots.ox.ac.uk

ABSTRACT
We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels. We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training. We conclude that: (1) the weights of the early layers of deep networks contain limited information about the statistics of natural images, (2) such low-level statistics can be learned through self-supervision just as well as through strong supervision, and (3) these low-level statistics can be captured via synthetic transformations instead of using a large image dataset.
1 INTRODUCTION
Despite tremendous progress in supervised learning, learning without external supervision remains difficult. Self-supervision has recently emerged as one of the most promising approaches to address this limitation. Self-supervision builds on the fact that convolutional neural networks (CNNs) transfer well between tasks (Shin et al., 2016; Oquab et al., 2014; Girshick, 2015; Huh et al., 2016). The idea then is to pre-train networks via pretext tasks that do not require expensive manual annotations and can be automatically generated from the data itself. Once pre-trained, networks can be applied to a target task by using only a modest amount of labelled data.

Early successes in self-supervision have encouraged authors to develop a large variety of pretext tasks, from colorization to rotation estimation and image autoencoding. Recent papers have shown performance competitive with supervised learning by learning complex neural networks on very large image datasets. Nevertheless, for a given model complexity, pre-training by using an off-the-shelf annotated image dataset such as ImageNet remains much more efficient.

In this paper, we aim to investigate the effectiveness of current self-supervised approaches by characterizing how much information they can extract from a given dataset of images. Since deep networks learn a hierarchy of representations, we further break down this investigation on a per-layer basis. We are motivated by the fact that the first few layers of most networks extract low-level information (Yosinski et al., 2014), and thus learning them may not require the high-level semantic information captured by manual labels.

Concretely, in this paper we answer the following simple question: "is self-supervision able to exploit the information contained in a large number of images in order to learn different parts of a neural network?"

We contribute two key findings. First, we show that as little as a single image is sufficient, when combined with self-supervision and data augmentation, to learn the first few layers of standard deep networks as well as using millions of images and full supervision (Figure 1). Hence, while self-supervised learning works well for these layers, this may be due more to the limited complexity of such features than the strength of the supervisory technique. This also confirms the intuition that early layers in a convolutional network amount to low-level feature extractors, analogous to early learned and hand-crafted features for visual recognition (Olshausen & Field, 1997; Lowe, 2004; Dalal & Triggs, 2005). Finally, it demonstrates the importance of image transformations in learning such low-level features, as opposed to image diversity.

Our second finding is about the deeper layers of the network. For these, self-supervision remains inferior to strong supervision even if millions of images are used for training. Our finding is that this is unlikely to change with the addition of more data. In particular, we show that training these layers with self-supervision and a single image already achieves as much as two thirds of the performance that can be achieved by using a million different images.

We show that these conclusions hold true for three different self-supervised methods, BiGAN (Donahue et al., 2017), RotNet (Gidaris et al., 2018) and DeepCluster (Caron et al., 2018), which are representative of the spectrum of techniques that are currently popular. We find that performance as a function of the amount of data is dependent on the method, but all three methods can indeed leverage a single image to learn the first few layers of a deep network almost "perfectly".

Overall, while our results do not improve self-supervision per se, they help to characterize the limitations of current methods and to better focus on the important open challenges.

Figure 1: Single-image self-supervision. We show that several self-supervision methods can be used to train the first few layers of a deep neural network using a single training image, such as Image A, B or even C (above), provided that sufficient data augmentation is used. The accompanying plot reports linear-classifier accuracy on ImageNet for conv1-conv5, as a percentage of supervised performance, for Random, RotNet, 1-RotNet, BiGAN, 1-BiGAN, DeepCluster and 1-DeepCluster.
2 RELATED WORK
Our paper relates to three broad areas of research: (a) self-supervised/unsupervised learning, (b) learning from a single sample, and (c) designing/learning low-level feature extractors. We discuss closely related work for each.
Self-supervised learning:
A wide variety of proxy tasks, requiring no manual annotations, have been proposed for the self-training of deep convolutional neural networks. These methods use various cues and tasks, namely in-painting (Pathak et al., 2016), patch context and jigsaw puzzles (Doersch et al., 2015; Noroozi & Favaro, 2016; Noroozi et al., 2018; Mundhenk et al., 2017), clustering (Caron et al., 2018), noise-as-targets (Bojanowski & Joulin, 2017), colorization (Zhang et al., 2016; Larsson et al., 2017), generation (Jenni & Favaro, 2018; Ren & Lee, 2018; Donahue et al., 2017), geometry (Dosovitskiy et al., 2016; Gidaris et al., 2018) and counting (Noroozi et al., 2017). The idea is that the pretext task can be constructed automatically and easily on images alone. Thus, methods often modify information in the images and require the network to recover it. In-painting or colorization techniques fall in this category. However, these methods have the downside that the features are learned on modified images, which potentially harms the generalization to unmodified ones. For example, colorization uses a grayscale image as input, thus the network cannot learn to extract color information, which can be important for other tasks.

Slightly less related are methods that use additional information to learn features. Here, often temporal information is used in the form of videos. Typical pretext tasks are based on temporal context (Misra et al., 2016; Wei et al., 2018; Lee et al., 2017; Sermanet et al., 2018) and other spatio-temporal cues.

Example applications that only rely on low-level feature extractors include template matching (Kat et al., 2018; Talmi et al., 2017) and style transfer (Gatys et al., 2016; Johnson et al., 2016), which currently rely on pre-training with millions of images.
Learning from a single sample:
In some applications of computer vision, the bold idea of learning from a single sample comes out of necessity. For general object tracking, methods such as max margin correlation filters (Rodriguez et al., 2013) learn robust tracking templates from a single sample of the patch. A single image can also be used to learn and interpolate multi-scale textures with a GAN framework (Rott Shaham et al., 2019). Single-sample learning was pursued by the semi-parametric exemplar SVM model (Malisiewicz et al., 2011), which learns one SVM per positive sample, separating it from all negative patches mined from the background. While only one sample is used for the positive set, the negative set consists of thousands of images and is a necessary component of the method. The negative space was approximated by a multi-dimensional Gaussian in the Exemplar LDA (Hariharan et al., 2012). These SVMs, one per positive sample, are pooled together using a max aggregation. We differ from both of these approaches in that we do not use a large collection of negative images to train our model. Instead we restrict ourselves to a single or a few images with a systematic augmentation strategy.
Classical learned and hand-crafted low-level feature extractors:
Learning and hand-crafting features pre-dates modern deep learning approaches and self-supervision techniques. For example, the classical work of Olshausen & Field (1997) shows that edge-like filters can be learned via sparse coding of just 10 natural scene images. SIFT (Lowe, 2004) and HOG (Dalal & Triggs, 2005) were used extensively before the advent of convolutional neural networks and, in many ways, they resemble the first layers of these networks. The scatter transform of Bruna & Mallat (2013); Oyallon et al. (2017) is a handcrafted design that aims at replacing at least the first few layers of a deep network. While these results show that effective low-level features can be handcrafted, this is insufficient to clarify the power and limitation of self-supervision in deep networks. For instance, it is not obvious whether deep networks can learn better low-level features than these, how many images may be required to learn them, and how effective self-supervision may be in doing so. As we also show in the experiments, replacing the low-level layers of a convolutional network with handcrafted features such as those of Oyallon et al. (2017) may still decrease the overall performance of the model. Furthermore, this says little about deeper layers, which we also investigate.

In this work we show that current deep learning methods learn slightly better low-level representations than handcrafted features such as the scattering transform. Additionally, these representations can be learned from a single image with augmentations and without supervision. The results show how current self-supervised learning approaches that use one million images yield only relatively small gains compared to what can be achieved from one image and augmentations, and motivate a renewed focus on augmentations and on incorporating prior knowledge into feature extractors.
3 METHODS
We first discuss our data and data augmentation strategy (Section 3.1) and then summarize the three different methods for unsupervised feature learning used in the experiments (Section 3.2).

3.1 DATA
Our goal is to understand the performance of representation learning methods as a function of the image data used to train them. To make comparisons as fair as possible, we develop a protocol where only the nature of the training data is changed, but all other parameters remain fixed.

In order to do so, given a baseline method trained on d source images, we replace those with another set of d images. Of these, now only N ≪ d are source images (i.e. i.i.d. samples), while the remaining d − N are augmentations of the source ones. Thus, the amount of information in the training data is controlled by N and we can generate a continuum of datasets that vary from one extreme, utilizing a single source image (N = 1), to the other, using all N = d original training set images. For example, if the baseline method is trained on ImageNet, then d = 1,281,167. When N = 1, we train the method using a single source image and generate the remaining 1,281,166 images via augmentation. Other baselines use CIFAR-10/100 images, so in those cases d = 50,000 instead.

The data augmentation protocol is an extreme version of the augmentations already employed by most deep learning protocols. Each method we test, in fact, already performs some data augmentation internally. Thus, when the method is applied on our augmented data, this can be equivalently thought of as incrementing these "native" augmentations by concatenating them with our own. A minimal sketch of this dataset construction is given below.
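To make the protocol concrete, the following minimal sketch shows one way such an augmented dataset could be wired up in PyTorch; the class name and interface are illustrative assumptions, not the authors' released code.

```python
import random
from PIL import Image
from torch.utils.data import Dataset


class AugmentedSourceDataset(Dataset):
    """Exposes `d` samples per epoch built from only `n` source images.

    Each __getitem__ call reads one of the n source images and applies a
    random augmentation, so the training loop sees a dataset of the usual
    size while the underlying information content is controlled by n.
    """

    def __init__(self, source_paths, d, transform):
        assert len(source_paths) >= 1
        self.sources = [Image.open(p).convert("RGB") for p in source_paths]
        self.d = d                  # nominal dataset size (e.g. 1,281,167)
        self.transform = transform  # heavy augmentation pipeline

    def __len__(self):
        return self.d

    def __getitem__(self, index):
        # Deterministic assignment of indices to source images keeps the
        # proportion of each source fixed; the augmentation itself is random.
        img = self.sources[index % len(self.sources)]
        return self.transform(img), 0  # dummy label; pretext tasks ignore it
```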
Choice of augmentations.

Next, we describe how the N source images are expanded to an additional d − N images so that the models can be trained on exactly d images, independent of the choice of N. The idea is to use an aggressive form of data augmentation involving cropping, scaling, rotation, contrast changes, and adding noise. These transformations are representative of invariances that one may wish to incorporate in the features. Augmentation can be seen as imposing a prior on how we expect the manifold of natural images to look. When training with very few images, these priors become more important since the model cannot extract them directly from data.

Given a source image of size H × W, we first extract a certain number of random patches of size (w, h), where w ≤ W and h ≤ H satisfy the additional constraints β ≤ wh/(WH) and γ ≤ h/w ≤ γ⁻¹. Thus, the smallest size of the crops is limited to be at least β·WH and at most the whole image. Additionally, changes to the aspect ratio are limited by γ. In practice β is set to a very small value, so that crops may cover only a tiny fraction of the source image, while γ keeps aspect-ratio distortions moderate.

Second, good features should not change much under small image rotations, so images are rotated (before cropping, to avoid border artifacts) by a small random angle α sampled symmetrically around zero. Due to symmetry in image statistics, images are also flipped left-to-right with 50% probability.

Illumination changes are common in natural images; we thus expect image features to be robust to color and contrast changes. We therefore employ a set of linear transformations in RGB space to model this variability in real data. Additionally, the color/intensity of single pixels should not affect the feature representation, as this does not change the contents of the image. To this end, color jitter with additive brightness, contrast and saturation changes, sampled from three uniform distributions, and hue noise from a narrow uniform range, is applied to the image patches. Finally, the cropped and transformed patches are scaled to the color range (−1, 1) and then rescaled to full S × S resolution to be supplied to each representation learning method, using bilinear interpolation. This formulation ensures that the patches are created in the target resolution S, independent of the size and aspect ratio W, H of the source image. A sketch of this pipeline is given below.
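A rough torchvision sketch of the pipeline just described follows; the rotation range and jitter strengths are placeholders, since the paper's exact values are not reproduced here.

```python
import torchvision.transforms as T

S = 256  # target resolution fed to the representation learning method

# Sketch of the augmentation above; numeric hyper-parameters are assumptions.
augment = T.Compose([
    T.RandomRotation(degrees=30),                       # rotate before cropping
    T.RandomResizedCrop(S, scale=(1e-4, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToTensor(),                                       # maps pixels to [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # roughly (-1, 1)
])
```

This `augment` object can be passed directly as the `transform` of the dataset sketch above.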
Real samples.
The images used for the N = 1 and N = 10 experiments are shown in Figure 1 and the appendix respectively (this is all the training data used in such experiments). For the special case of using a single training image, i.e. N = 1, we have chosen one photographic and one drawn image, which we call Image A and Image B, respectively. The two images were manually selected as they contain rich texture and are diverse, but their choice was not optimized for performance. We test only two images due to the cost of running a full set of experiments (each image is expanded up to 1.2M times for training some of the models, as explained above). However, this is sufficient to prove our main points. We also test another Image C to ablate the "crowdedness" of an image, as this latter contains large areas covering no objects. While resolution matters to some extent, as a bigger image contains more pixels, the information within is still far more correlated, and thus more redundant, than sampling several smaller images. In particular, the resolution difference between Image A and B appears to be negligible in our experiments. For CIFAR-10, where S = 32, we only use Image B due to the resolution difference; in this case d = 50,000. For N > 1, we select the source images randomly from each method's training set.

3.2 REPRESENTATION LEARNING METHODS
Generative models.
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) learn to generate images using an adversarial objective: a generator network maps noise samples to image samples, approximating a target image distribution, and a discriminator network is tasked with distinguishing generated and real samples. Generator and discriminator are pitted against each other and learned together; when an equilibrium is reached, the generator produces images indistinguishable (at least from the viewpoint of the discriminator) from real ones.

Bidirectional Generative Adversarial Networks (BiGAN) (Donahue et al., 2017; Dumoulin et al., 2016) are an extension of GANs designed to learn a useful image representation as an approximate inverse of the generator, through joint inference on an encoding and the image. This method's native augmentation uses random crops and random horizontal flips to learn features from S = 128 sized images. As opposed to the other two methods discussed below, it employs leaky ReLU non-linearities, as is typical in GAN discriminators.
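As a reference point, the BiGAN objective can be summarized with the following toy sketch; the tiny fully connected networks and the joint-discriminator losses below are illustrative only, not the architecture or code used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy-sized encoder E, generator G and joint discriminator D(x, z); the real
# models are convolutional and operate on 128x128 crops.
z_dim, x_dim = 64, 3 * 32 * 32
E = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, z_dim))
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))


def bigan_losses(x):
    """Losses for one batch of flattened images x under the BiGAN value function."""
    z = torch.randn(x.size(0), z_dim)
    x_fake, z_real = G(z), E(x)
    d_real = D(torch.cat([x, z_real], dim=1))   # encoder pair  (x, E(x))
    d_fake = D(torch.cat([x_fake, z], dim=1))   # generator pair (G(z), z)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, ones) +
              F.binary_cross_entropy_with_logits(d_fake, zeros))
    # Generator and encoder are trained jointly to fool the discriminator;
    # in practice the two losses are optimized alternately with separate optimizers.
    loss_ge = (F.binary_cross_entropy_with_logits(d_real, zeros) +
               F.binary_cross_entropy_with_logits(d_fake, ones))
    return loss_d, loss_ge
```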
Rotation.

Most image datasets contain pictures that are 'upright' as this is how humans prefer to take and look at them. This photographer bias can be understood as a form of implicit data labelling. RotNet (Gidaris et al., 2018) exploits this by tasking a network with predicting the upright direction of a picture after applying to it a random rotation by a multiple of 90 degrees (in practice this is formulated as a 4-way classification problem). The authors reason that the concept of 'upright' requires learning high-level concepts in the image and hence this method is not vulnerable to exploiting low-level visual information, encouraging the network to learn more abstract features. In our experiments, we test this hypothesis by learning from impoverished datasets that may lack the photographer bias. The native augmentations that RotNet uses on the S = 256 inputs only comprise horizontal flips and non-scaled random crops to 224 × 224.
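The rotation pretext task itself is simple to reproduce; a minimal sketch of how the 4-way classification data could be generated is shown below (the helper name is an assumption of this sketch).

```python
import torch


def rotnet_batch(images):
    """Create the 4-way rotation pretext task for a batch of images (B, C, H, W).

    Each image is rotated by 0, 90, 180 and 270 degrees; the network is then
    trained with a standard cross-entropy loss to predict the rotation index.
    """
    rotated, labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)


# Usage: logits = model(x_rot); loss = F.cross_entropy(logits, y_rot)
x = torch.randn(8, 3, 224, 224)
x_rot, y_rot = rotnet_batch(x)  # shapes (32, 3, 224, 224) and (32,)
```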
Clustering.

DeepCluster (Caron et al., 2018) is a recent state-of-the-art unsupervised representation learning method. This approach alternates k-means clustering, to produce pseudo-labels for the data, and feature learning, to fit the representation to these labels. The authors attribute the success of the method to the prior knowledge ingrained in the structure of the convolutional neural network (Ulyanov et al., 2018).

The method alternates between a clustering step, in which k-means is applied on the PCA-reduced features with k = 10,000, and a learning step, in which the network is trained to predict the cluster ID for each image under a set of augmentations (random resized crops with β = 0.08, γ = 3/4 and horizontal flips) that constitute its native augmentations used on top of the S = 256 input images.
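A simplified sketch of one DeepCluster round is shown below; it assumes a non-shuffled data loader, a `model` that returns flattened feature vectors and a linear `classifier` head, all of which are assumptions of this illustration rather than details of the authors' implementation.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def deepcluster_round(model, classifier, loader, optimizer, k=10000, pca_dim=256):
    """One simplified DeepCluster round: cluster the features, then fit the labels."""
    # 1) Clustering step: compute features for the whole (non-shuffled) dataset.
    model.eval()
    with torch.no_grad():
        feats = torch.cat([model(x) for x, _ in loader]).cpu().numpy()
    feats = PCA(n_components=pca_dim, whiten=True).fit_transform(feats)
    pseudo = KMeans(n_clusters=k, n_init=1).fit_predict(feats)

    # 2) Learning step: predict the cluster id of each (augmented) image.
    #    Pseudo-labels are aligned with batches because the loader is not shuffled.
    model.train()
    for batch_idx, (x, _) in enumerate(loader):
        start = batch_idx * loader.batch_size
        y = torch.from_numpy(pseudo[start:start + x.size(0)]).long()
        loss = F.cross_entropy(classifier(model(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```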
4 EXPERIMENTS

We evaluate the representation learning methods on ImageNet and CIFAR-10/100 using linear probes (Section 4.1). After ablating various choices of transformations in our augmentation protocol (Section 4.2), we move to the core question of the paper: whether a large dataset is beneficial to unsupervised learning, especially for learning early convolutional features (Section 4.3).
4.1 LINEAR PROBES AND BASELINE ARCHITECTURE
In order to quantify whether a neural network has learned useful feature representations, we follow the standard approach of using linear probes (Zhang et al., 2017). This amounts to solving a difficult task, such as ImageNet classification, by training a linear classifier on top of pre-trained feature representations, which are kept fixed. Linear classifiers heavily rely on the quality of the representation since their discriminative power is low.

We apply linear probes to all intermediate convolutional layers of networks and train on the ImageNet LSVRC-12 (Deng et al., 2009) and CIFAR-10/100 (Krizhevsky, 2009) datasets, which are the standard benchmarks for evaluation in self-supervised learning. Our base encoder architecture is AlexNet (Krizhevsky et al., 2012) with BatchNorm, since this is a good representative model and is most often used in other unsupervised learning work for the purpose of benchmarking.
Table 1: Ablating data augmentation using MonoGAN (left). Training a linear classifier on the features extracted at different depths of the network for CIFAR-10.

CIFAR-10             conv1   conv2   conv3   conv4
(a) Fully sup.
(b) Random feat.
(c) No aug.
(d) Jitter
(e) Rotation
(f) Scale            67.9    69.3    67.9    59.1
(g) Rot. & jitter
(h) Rot. & scale
(i) Jitter & scale
(j) All
Table 2: ImageNet LSVRC-12 linear probing evaluation (below). A linear classifier is trained on the (downsampled) activations of each layer in the pretrained model. We report classification accuracy averaged over 10 crops. The ‡ indicates that numbers are taken from (Zhang et al., 2017).

ILSVRC-12
Method, Reference                          # images    conv1   conv2   conv3   conv4   conv5
(a) Full supervision ‡                     1,281,167
(b) (Oyallon et al., 2017): Scattering             0      -               -       -       -
(c) Random ‡
(d) (Krähenbühl et al., 2016): k-means ‡        ≈160
(e) (Donahue et al., 2017): BiGAN ‡        1,281,167
(f)   mono, Image A                                1
(g)   mono, Image B                                1
(h)   deka                                        10
(i)   kilo                                     1,000
(j) (Gidaris et al., 2018): RotNet         1,281,167
(k)   mono, Image A                                1
(l)   mono, Image B                                1
(m)   deka                                        10
(n)   kilo                                     1,000
(o) (Caron et al., 2018): DeepCluster      1,281,167
(p)   mono, Image A                                1
(q)   mono, Image B                                1
(r)   mono, Image C                                1
(s)   deka                                        10
(t)   kilo                                     1,000
Table 3: CIFAR-10/100. Accuracy of linear classifiers on different network layers.

Dataset               CIFAR-10                           CIFAR-100
Model                 conv1   conv2   conv3   conv4      conv1   conv2   conv3   conv4
Fully supervised
Random
RotNet
GAN (CIFAR-10)
GAN (CIFAR-100)         -       -       -       -
MonoGAN

This model has five convolutional blocks (each comprising a linear convolution followed by ReLU and optionally max pooling). We insert the probes right after the ReLU layer in each block, and denote these entry points conv1 to conv5. Applying the linear probes at each convolutional layer allows studying the quality of the representation learned at different depths of the network.
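A minimal sketch of such a probe is given below: the backbone is frozen, an intermediate activation is read off with a forward hook, pooled, and fed to a trainable linear head. Names and the pooling size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProbe(nn.Module):
    """Linear classifier on top of a frozen intermediate activation.

    `layer_name` selects the module whose output is probed (e.g. the ReLU
    after a convolutional block); only `self.head` receives gradients.
    """

    def __init__(self, backbone, layer_name, num_classes=1000, pool_size=3):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False
        self._feat = {}
        dict(backbone.named_modules())[layer_name].register_forward_hook(
            lambda mod, inp, out: self._feat.update(out=out))
        self.pool_size = pool_size
        self.num_classes = num_classes
        self.head = None  # lazily created once the feature size is known

    def forward(self, x):
        with torch.no_grad():
            self.backbone(x)
        f = F.adaptive_max_pool2d(self._feat["out"], self.pool_size).flatten(1)
        if self.head is None:
            self.head = nn.Linear(f.size(1), self.num_classes)
        return self.head(f)
```

One forward pass should be run before building the optimizer, so that the lazily created head exists; the optimizer is then constructed on the head's parameters only.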
Details.

While linear probes are conceptually straightforward, there are several technical details that affect the final accuracy by a few percentage points. Unfortunately, prior work has used several slightly different setups, so comparing results of different publications must be done with caution. To make matters more difficult, not all papers released evaluation source code. We provide standardized testing code at https://github.com/yukimasano/linear-probes.

Activations are downsampled to comparable dimensionalities for conv1-5 using adaptive max-pooling, and we absorb the batch normalization weights into the preceding convolutions. For evaluation on ImageNet we follow RotNet to train linear probes: images are resized such that the shorter edge has a length of 256 pixels, random crops of 224 × 224 are computed and flipped horizontally with 50% probability. The learning rate schedule starts from an initial value that is divided by five at fixed intervals during training. The top-1 accuracy of the linear classifier is then measured on the ImageNet validation subset. This uses DeepCluster's protocol, extracting 10 crops for each validation image (four at the corners and one at the center, along with their horizontal flips) and averaging the prediction scores before the accuracy is computed. For CIFAR-10/100 data, we follow the same learning rate schedule, and for both training and evaluation we do not reduce the dimensionality of the representations and keep the images' original size of 32 × 32.
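One of these details, absorbing the batch-normalization statistics into the preceding convolution, can be written out explicitly; the sketch below shows the standard test-time fusion and is not taken from the released evaluation code.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a conv layer equivalent to conv followed by bn at test time.

    y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
      = conv'(x)  with  w' = w * gamma / sqrt(var + eps)
                  and   b' = (b - mean) * gamma / sqrt(var + eps) + beta
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```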
4.2 EFFECT OF AUGMENTATIONS

In order to better understand which image transformations are important to learn a good feature representation, we analyze the impact of augmentation settings. For speed, these experiments are conducted using the CIFAR-10 images (d = 50,000 in the training set), the smaller source Image B, and a GAN using the Wasserstein GAN formulation with gradient penalty (Gulrajani et al., 2017). The encoder is a smaller AlexNet-like CNN consisting of four convolutional layers followed by a single fully connected layer as the discriminator. Given that the GAN is trained on a single image (with augmentations), we call this setting MonoGAN.

Table 1 reports all combinations of the three main augmentations (scale, rotation, and jitter) and a randomly initialized network baseline (see Table 1 (b)), using the linear probes protocol discussed above. Without data augmentation the model only achieves marginally better performance than the random network (which itself achieves a non-negligible level of performance (Ulyanov et al., 2017; Caron et al., 2018)). This is understandable since the dataset literally consists of a single training image cloned d times. Color jitter and rotation slightly improve the performance of all probes by 1-2 points, but random rescaling adds at least ten points at every depth (see Table 1 (f,h,i)) and is the most important single augmentation. A similar conclusion can be drawn when two augmentations are combined, although there are diminishing returns as more augmentations are added. Overall, we find all three types of augmentations are important when training in the ultra-low data setting.
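For reference, the gradient-penalty term of the WGAN-GP formulation used here can be sketched as follows; the function below is a generic implementation of the penalty of Gulrajani et al. (2017), not the MonoGAN training code itself.

```python
import torch


def gradient_penalty(discriminator, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm towards 1 on random
    interpolates between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = discriminator(interp)
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                 create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```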
4.3 BENCHMARK EVALUATION

We analyze how performance varies as a function of N, the number of actual samples that are used to generate the augmented datasets, and compare it to the gold-standard setup (in terms of choice of training data) defined in the papers that introduced each method. The evaluation is again based on linear probes (Section 4.1).

Mono is enough.
From Table 2 we make the following observations. Training with just a single source image (f,g,k,l,p,q) is much better than random initialization (c) for all layers. Notably, these models also outperform Gabor-like filters from Scattering networks (Bruna & Mallat, 2013), which are hand-crafted image features, replacing the first two convolutional layers as in (Oyallon et al., 2017). Using the same protocol as in that paper, the scattering variant achieves a markedly lower conv2 accuracy than (p).

More importantly, when comparing within each pretext task, even with one image we are able to improve the quality of conv1-conv3 features compared to full (unsupervised) ImageNet training for GAN-based self-supervision (e-i). For the other methods (j-n, o-s) we reach, and for the first layer also surpass, the full-data performance, and we remain within a few points for the second layer; our method using a single source Image A (Table 2, p) is remarkably close to the best unsupervised conv2 performance.

Image contents.
While we surpass the GAN-based approach of (Donahue et al., 2017) for both single source images, we find more nuanced results for the other two methods. For RotNet, as expected, the photographic bias cannot be extracted from a single image. Thus its performance is low with little training data and increases together with the number of images (Table 2, j-n). When comparing Image A and B trained networks for RotNet, we find that the photograph yields better performance than the hand-drawn animal image. This indicates that the method can extract rotation information from low-level image features such as patches, which is at first counter-intuitive. Considering that the hand-drawn image does not work well, we can assume that lighting and shadows, even in small patches, can indeed give important cues on the up direction, which can be learned even from a single (real) image. DeepCluster shows poor performance in conv1, which we can improve upon in the single-image setting (Table 2, o-r). Naturally, the image content matters: a trivial image without any image gradient (e.g. a picture of a white wall) would not provide enough signal for any method. To better understand this issue, we also train DeepCluster on the much less cluttered Image C to analyze how much the image influences our claims. We find that even though this image contains large parts of sky and sea, the performance is only slightly lower than that of Image A. This finding indicates that the augmentations can even compensate for large untextured areas and the exact choice of image is not critical.

Figure 2: conv1 filters trained using a single image. The 96 learned 3 × 11 × 11 filters for the first layer of AlexNet are shown for each single training image and method, along with their linear classifier performance. For visualization, each filter is normalized to the range (−1, 1). Linear-probe conv1 accuracies: Image A: BiGAN 20.4, RotNet 19.9, DeepCluster 20.7; Image B: BiGAN 20.5, RotNet 17.8, DeepCluster 19.7.

More than one image.
While BiGAN fails to converge for N ∈ {10, 1000}, most likely due to issues in learning from a distribution which is neither whole images nor only patches, we find that both RotNet and DeepCluster improve their performance in deeper layers when increasing the number of training images. However, for conv1 and conv2, a single image is enough. In deeper layers, DeepCluster seems to require large amounts of source images to yield the reported results, as the deka and kilo variants start improving over the single-image case (Table 2, o-t). This need for data also explains the gap between the two input images, which have different resolutions. Summarizing Table 2, we can conclude that learning conv1, conv2 and, for the most part, conv3 on over 1M images does not yield a significant performance increase over using one single training image, a highly unexpected result.

Generalization.
In Table 3, we show the results of training linear classifiers for the CIFAR-10 dataset and compare against various baselines. We find that the GAN trained on the smaller Image B outperforms all other methods, including the fully supervised one, for the first convolutional layer. We also outperform the same architecture trained on the full CIFAR-10 training set using RotNet, which might be because CIFAR images either do not contain much information about the orientation of the picture or do not contain as many objects as ImageNet images. While the GAN trained on the whole dataset outperforms the MonoGAN on the deeper layers, the gap stays very small until the last layer. These findings are also reflected in the experiments on the CIFAR-100 dataset shown in Table 3. We find that our method obtains the best performance for the first two layers, even against the fully supervised version. The gap between our mono variant and the other methods increases again with deeper layers, hinting at the fact that we cannot learn very high-level concepts in deeper layers from just one single image. These results corroborate the finding that our method allows learning very generalizable early features that are not domain dependent.

4.4 QUALITATIVE ANALYSIS
Visual comparison of weights.
In Figure 2, we compare the learned filters of all first-layer convolutions of an AlexNet trained with the different methods and a single image. First, we find that the filters closely resemble those obtained via supervised training: Gabor-like edge detectors and various color blobs. Second, we find that the look of the filters is not easily predictive of their performance: e.g. while the generatively learned filters (BiGAN) show many edge detectors, their linear-probe performance is about the same as that of DeepCluster, which seems to learn many somewhat redundant point features. However, we also find that some edge detectors are required, as we can confirm from RotNet and DeepCluster trained on Image B, which yield less crisp filters and worse performances.

Table 4: Finetuning experiments. The pre-trained model's first two convolutions are left frozen (or replaced by the Scattering transform) and the network is retrained using the ImageNet LSVRC-12 training set.

Model             Top-1
Full sup.
Random
Scattering
BiGAN, A
RotNet, A
DeepCluster, A
Figure 3: Style transfer with single-image pre-training. We show two style transfer results using the Image A trained BiGAN and the ImageNet pretrained AlexNet.

Fine-tuning instead of freezing.
In Tab. 4, we show the results of retraining a network with the first two convolutional filters, or the scattering transform from (Oyallon et al., 2017), left frozen. We observe that our single-image trained DeepCluster and BiGAN models achieve performances close to the supervised benchmark. Notably, the scattering transform as a replacement for conv1-2 performs slightly worse than the analyzed single-image methods. We also show in the appendix the results of retraining a network initialized with the first two convolutional layers obtained from a single image and subsequently linearly probing the model. The results are shown in Appendix Tab. 5 and we find that we can recover the performance of fully supervised networks, i.e. the first two convolutional filters trained from just a single image generalize well and do not get stuck in an image-specific minimum.
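A minimal sketch of this fine-tuning setup is shown below; it uses torchvision's plain AlexNet (without BatchNorm) and a stand-in checkpoint, both assumptions made only to keep the snippet self-contained.

```python
import torch
import torch.nn as nn
import torchvision.models as models

model = models.alexnet(num_classes=1000)

# Hypothetical checkpoint from single-image pre-training; here we simply reuse
# the freshly initialised weights so the snippet runs stand-alone.
pretrained_state = model.state_dict()
model.load_state_dict(pretrained_state, strict=False)

# Freeze the first two convolutional layers; everything else stays trainable.
frozen = 0
for module in model.features:
    if isinstance(module, nn.Conv2d) and frozen < 2:
        for p in module.parameters():
            p.requires_grad = False
        frozen += 1

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                            lr=0.01, momentum=0.9)
```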
Neural style transfer.
Lastly, we show how our features trained on only a single image can be used for other applications. In Figure 3 we show two basic style transfers using the method of (Gatys et al., 2016) from an official PyTorch tutorial. Image content and style are separated and the style is transferred from the source to the target image using all CNN features, not just the shallow layers. We visually compare the results of using our features and features from full ImageNet supervision. We find almost no visual differences in the stylized images and can conclude that our early features are equally powerful as fully supervised ones for this task.
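The style term of Gatys et al. (2016) only needs Gram matrices of CNN activations, which is one reason any feature extractor with reasonable filters can be plugged in; a minimal sketch is given below, with the lists of layer activations left to the caller.

```python
import torch
import torch.nn.functional as F


def gram_matrix(feat):
    """Channel-by-channel correlation of a feature map of shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def style_loss(feats_generated, feats_style):
    """Sum of squared Gram-matrix differences over a list of CNN layers, as in
    Gatys et al. (2016); the feature extractor can be the single-image
    pretrained network or the supervised one."""
    return sum(F.mse_loss(gram_matrix(fg), gram_matrix(fs))
               for fg, fs in zip(feats_generated, feats_style))
```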
5 CONCLUSIONS

We have made the surprising observation that we can learn good and generalizable features through self-supervision from one single source image, provided that sufficient data augmentation is used. Our results complement recent works (Mahajan et al., 2018; Goyal et al., 2019) that have investigated self-supervision in the very large data regime. Our main conclusion is that these methods succeed perfectly in capturing the simplest image statistics, but that for deeper layers a gap exists with strong supervision which is compensated only in a limited manner by using large datasets. This novel finding motivates a renewed focus on the role of augmentations in self-supervised learning and a critical rethinking of how to better leverage the available data.

ACKNOWLEDGEMENTS

We thank Aravindh Mahendran for fruitful discussions. Yuki Asano gratefully acknowledges support from the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines & Systems (EP/L015897/1). The work is supported by ERC IDIU-638009.

https://pytorch.org/tutorials/advanced/neural_style_tutorial.html

REFERENCES
Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In
Proc. ICCV , pp.37–45. IEEE, 2015. 3R. Arandjelovi´c and A. Zisserman. Look, listen and learn. In
Proc. ICCV , 2017. 3Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In
Proc. ICML, pp. 517–526. PMLR, 2017. 2 Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks.
IEEE transactions onpattern analysis and machine intelligence , 35(8):1872–1886, 2013. 3, 7M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning ofvisual features. In
Proc. ECCV , 2018. 2, 3, 5, 6, 7, 14N. Dalal and B Triggs. Histogram of Oriented Gradients for Human Detection. In
Proc. CVPR ,volume 2, pp. 886–893, 2005. 2, 3Virginia R de Sa. Learning classification with unlabeled data. In
NIPS , pp. 112–119, 1994. 3J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchicalimage database. In
Proc. CVPR , 2009. 5Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning bycontext prediction. In
Proc. ICCV, pp. 1422–1430, 2015. 2 Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning.
Proc. ICLR ,2017. 2, 3, 5, 6, 7A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative un-supervised feature learning with exemplar convolutional neural networks.
IEEE PAMI , 38(9):1734–1747, Sept 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2015.2496141. 2V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville.Adversarially learned inference. arXiv preprint arXiv:1606.00704 , 2016. 5D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deepnetwork. Technical Report 1341, University of Montreal, Jun 2009. 14Chuang Gan, Boqing Gong, Kun Liu, Hao Su, and Leonidas J Guibas. Geometry guided convolu-tional neural networks for self-supervised video representation learning. In
Proc. CVPR , 2018.3Rouhan Gao, Dinesh Jayaraman, and Kristen Grauman. Object-centric representation learning fromunlabeled videos. In
Proc. ACCV , 2016. 3Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutionalneural networks. In
Proc. CVPR , pp. 2414–2423, 2016. 2, 9R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.In
International Conference on Learning Representations , 2019. 15Spyros Gidaris, Praveen Singh, and Nikos Komodakis. Unsupervised representation learning bypredicting image rotations. In
Proc. ICLR , 2018. 2, 3, 5, 6R. B. Girshick. Fast R-CNN. In
Proc. ICCV , 2015. 1Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
NIPS, pp. 2672–2680, 2014. 5 Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235, 2019. 9 Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In
NIPS , pp. 5767–5777, 2017. 7B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classifica-tion. In
Proc. ECCV , 2012. 3M. Huh, P. Agrawal, and A. A. Efros. What makes imagenet good for transfer learning? arXivpreprint arXiv:1608.08614 , 2016. 1Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H Adelson. Learning visual groups fromco-occurrences in space and time. In
Proc. ICLR , 2015. 3Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In
Proc. ICCV , 2015. 3Dinesh Jayaraman and Kristen Grauman. Slow and steady feature analysis: higher order temporalcoherence in video. In
Proc. CVPR , 2016. 3Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. In
Proc. CVPR , 2018. 2J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In
Proc. ECCV , pp. 694–711, 2016. 2R. Kat, R. Jevnisek, and S. Avidan. Matching pixels using co-occurrence statistics. In
Proc. ICCV, pp. 1751–1759, 2018. 2 Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks.
Proc. ICLR , 2016. 6A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutionalneural networks. In
NIPS , pp. 1106–1114, 2012. 5Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer,2009. 5Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visualunderstanding. In
Proc. CVPR , 2017. 2Hsin-Ying Lee, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Unsupervised rep-resentation learning by sorting sequence. In
Proc. ICCV , 2017. 2D. Lowe. Distinctive image features from scale-invariant keypoints.
IJCV , 60(2):91–110, 2004. 2,3Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li,Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervisedpretraining. In
Proceedings of the European Conference on Computer Vision (ECCV) , pp. 181–196, 2018. 9A. Mahendran, J. Thewlis, and A. Vedaldi. Cross pixel optical-flow similarity for self-supervisedlearning. In
Proc. ACCV , 2018. 3T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection andbeyond. In
Proc. ICCV , 2011. 3Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learningusing temporal order verification. In
Proc. ECCV , 2016. 2T Mundhenk, Daniel Ho, and Barry Y. Chen. Improvements to context based self-supervised learn-ing. In
Proc. CVPR , 2017. 2Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsawpuzzles. In
Proc. ECCV, pp. 69–84. Springer, 2016. 2 Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In
Proc. ICCV , 2017. 2Mehdi Noroozi, Ananth Vinjimoor, Paolo Favaro, and Hamed Pirsiavash. Boosting self-supervisedlearning via knowledge transfer. In
Proc. CVPR , 2018. 2Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, andAlexander Mordvintsev. The building blocks of interpretability.
Distill , 3(3):e10, 2018. 14B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employedby V1?
Vision Research , 37(23):3311–3325, 1997. 2, 3M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and Transferring Mid-Level Image Repre-sentations using Convolutional Neural Networks. In
Proc. CVPR , 2014. 1Andrew Owens, Phillip Isola, Josh H. McDermott, Antonio Torralba, Edward H. Adelson, andWilliam T. Freeman. Visually indicated sounds. In
Proc. CVPR , pp. 2405–2413, 2016. 3Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko. Scaling the scattering transform:Deep hybrid networks. pp. 5618–5627, 2017. 3, 6, 7, 9Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Contextencoders: Feature learning by inpainting. In
Proc. CVPR , pp. 2536–2544, 2016. 2Deepak Pathak, Ross Girshick, Piotr Doll´ar, Trevor Darrell, and Bharath Hariharan. Learning fea-tures by watching objects move. In
Proc. CVPR , 2017. 3Zhongzheng Ren and Yong Jae Lee. Cross-domain self-supervised multi-task feature learning usingsynthetic imagery. In
Proc. CVPR , 2018. 2A. Rodriguez, V. Naresh Boddeti, BVK V. Kumar, and A. Mahalanobis. Maximum margin cor-relation filter: A new approach for localization and classification.
IEEE Transactions on ImageProcessing , 22(2):631–643, 2013. 3Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. Singan: Learning a generative model from asingle natural image. In
Computer Vision (ICCV), IEEE International Conference on , 2019. 3Pierre Sermanet et al. Time-contrastive networks: Self-supervised learning from video. In
Proc.Intl. Conf. on Robotics and Automation , 2018. 2H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deepconvolutional neural networks for computer-aided detection: Cnn architectures, dataset charac-teristics and transfer learning.
IEEE Trans. on Medical Imaging , 35(5):1285–1298, May 2016.ISSN 0278-0062. 1N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representationsusing lstms. In
Proc. ICML , 2015. 3I. Talmi, R. Mechrez, and L. Zelnik-Manor. Template matching with deformable diversity similarity.In
Proc. CVPR , pp. 175–183, 2017. 2D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality anddiversity in feed-forward stylization and texture synthesis. In
Proc. CVPR , 2017. 7D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In
Proc. CVPR , 2018. 5X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In
Proc.ICCV , pp. 2794–2802, 2015. 3Xiaolong Wang, Kaiming He, and Abhinav Gupta. Transitive invariance for self-supervised visualrepresentation learning. In
Proc. ICCV , 2017. 3D. Wei, J. Lim, A. Zisserman, and W. T. Freeman. Learning and using the arrow of time. In
Proc. CVPR, 2018. 2 J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In
NIPS , pp. 3320–3328, 2014. 1M. D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
Proc.ECCV , 2014. 14Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In
Proc. ECCV , pp.649–666. Springer, 2016. 2Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learningby cross-channel prediction. In
Proc. CVPR, 2017. 5, 6, 7, 14
A APPENDIX

A.1 IMAGENET TRAINING IMAGES
Figure 4: ImageNet images for the N = 10 experiments.

The images used for the N = 10 experiments are shown in fig. 4.

A.2 VISUAL COMPARISON OF FILTERS
Figure 5:
Filter visualization.
We show activation maximization (left) and retrieval of the top 9 activated images from the training set of ImageNet (right) for four random, non-cherry-picked target filters. From top to bottom: conv1-5 of the BiGAN trained on a single Image A. The filter visualization is obtained by learning a (regularized) input image that maximizes the response to the target filter using the library Lucid (Olah et al., 2018).

In order to understand what deeper neurons are responding to in our model, we visualize random neurons via activation maximization (Erhan et al., 2009; Zeiler & Fergus, 2014) in each layer. Additionally, we retrieve the top-9 images in the ImageNet training set that activate each neuron most in Figure 5. Since the mono networks are not trained on the ImageNet dataset, it can be used here for visualization. From the first convolutional layer we find typical neurons strongly reacting to oriented edges. In layers 2-4 we find patterns such as grids (conv2:3), and textures such as leopard skin (conv2:2) and round grid cover (conv4:4). Confirming our hypothesis that the neural network is only extracting patterns and not semantic information, we do not find any neurons particularly specialized to certain objects even in higher layers, such as the dog faces that can be found in supervised networks. This finding aligns with the observations of other unsupervised methods (Caron et al., 2018; Zhang et al., 2017). As most neurons extract simple patterns and textures, the surprising effectiveness of training a network using a single image can be explained by the recent finding that even CNNs trained on ImageNet rely on texture (as opposed to shape) information to classify (Geirhos et al., 2019).

Table 5: Finetuning experiments. Models are initialized using conv1 and conv2 from various single-image trained models and the whole network is fine-tuned using the ImageNet LSVRC-12 training set. Accuracy is averaged over 10 crops.

Model              c1    c2    c3    c4    c5
Full sup.
BiGAN, A
RotNet, A
DeepCluster, A

A.3 RETRAINING FROM SINGLE IMAGE INITIALIZATION
In Table 5, we initialize AlexNet models using the first two convolutional filters learned from a single image and retrain them using ImageNet. We find that the networks recover their performance fully and the first filters do not make the network get stuck in a bad local minimum, despite having been trained on a single image from a different distribution. The difference between the BiGAN and the full-supervision model is likely due to the BiGAN using a smaller input resolution (112 instead of 224), as the BiGAN's output resolution is limited.

A.4 LINEAR PROBES ON IMAGENET

We show two plots of the ImageNet linear probes results (Table 2 of the paper) in fig. 6. On the left we plot performance per layer in absolute scale. Naturally the performance of the supervised model improves with depth, while all unsupervised models degrade after conv3. From the relative plot on the right, it becomes clear that with our training scheme we can even slightly surpass supervised performance on conv1, presumably since our model is trained with sometimes very small patches, thus receiving an emphasis on learning good low-level filters. The gap between all self-supervised methods and the supervised baseline increases with depth, due to the fact that the supervised model is trained for this specific task, whereas the self-supervised models learn from a surrogate task without labels.
Figure 6:
Linear Classifiers on ImageNet.
Classification accuracies of linear classifiers trained on the representations from Table 2 are shown in absolute scale.

A.5 EXAMPLE AUGMENTED TRAINING DATA
In figs. 7 to 10 we show example patches generated by our augmentation strategy for the datasets with different N. Even though the images and patches are very different in color and shape distribution, our model learns weights that perform similarly in the linear probes benchmark (see Table 2 in the paper).

Figure 7: Example crops of Image A (N = 1) dataset.

Figure 8: Example crops of Image B (N = 1) dataset. 50 samples were selected randomly.

Figure 9: Example crops of deka (N = 10) dataset. 50 samples were selected randomly.

Figure 10: Example crops of kilo (N = 1000) dataset.