Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Longlong Jing and Yingli Tian ∗ , Fellow, IEEE
Abstract—Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, as a subset of unsupervised learning methods, self-supervised learning methods have been proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. At last, the paper concludes with a set of promising future directions for self-supervised visual feature learning.
Index Terms—Self-supervised Learning, Unsupervised Learning, Convolutional Neural Network, Transfer Learning, Deep Learning.
• L. Jing is with the Department of Computer Science, The Graduate Center, The City University of New York, NY, 10016. E-mail: [email protected]
• Y. Tian is with the Department of Electrical Engineering, The City College, and the Department of Computer Science, The Graduate Center, The City University of New York, NY, 10031. E-mail: [email protected]
∗ Corresponding author. This material is based upon work supported by the National Science Foundation under award number IIS-1400802.

1 INTRODUCTION

Due to the powerful ability to learn different levels of general visual features, deep neural networks have been used as the basic structure for many computer vision applications such as object detection [1], [2], [3], semantic segmentation [4], [5], [6], image captioning [7], etc. The models trained on large-scale image datasets like ImageNet are widely used as pre-trained models and fine-tuned for other tasks for two main reasons: (1) the parameters learned from large-scale diverse datasets provide a good starting point, therefore networks trained on other tasks can converge faster; (2) a network trained on large-scale datasets has already learned a hierarchy of features which can help to reduce the over-fitting problem during the training of other tasks, especially when the datasets of other tasks are small or training labels are scarce.

The performance of deep convolutional neural networks (ConvNets) greatly depends on their capability and the amount of training data. Different kinds of network architectures have been developed to increase the capacity of network models, and larger and larger datasets have been collected. Various networks including AlexNet [8], VGG [9], GoogLeNet [10], ResNet [11], and DenseNet [12] and large-scale datasets such as ImageNet [13] and OpenImage [14] have been proposed to train very deep ConvNets. With the sophisticated architectures and large-scale datasets, the performance of ConvNets keeps breaking the state-of-the-art for many computer vision tasks [1], [4], [7], [15], [16].

However, the collection and annotation of large-scale datasets are time-consuming and expensive. As one of the most widely used datasets for pre-training very deep 2D convolutional neural networks (2DConvNets), ImageNet [13] contains about 1.3 million labeled images covering 1,000 classes, where each image is labeled by human workers with one class label. Compared to image datasets, the collection and annotation of video datasets are more expensive due to the temporal dimension. The Kinetics dataset [17], which is mainly used to train ConvNets for video human action recognition, consists of around 500,000 videos belonging to 600 categories, and each video lasts around 10 seconds. It took many Amazon Mechanical Turk workers a lot of time to collect and annotate a dataset at such a large scale.

To avoid time-consuming and expensive data annotation, many self-supervised methods have been proposed to learn visual features from large-scale unlabeled images or videos without using any human annotations. To learn visual features from unlabeled data, a popular solution is to propose various pretext tasks for networks to solve; the networks are trained by optimizing the objective functions of the pretext tasks, and the features are learned through this process.
Various pretext tasks have been proposed for self-supervised learning, including colorizing gray-scale images [18], image inpainting [19], image jigsaw puzzles [20], etc. The pretext tasks share two common properties: (1) visual features of images or videos need to be captured by ConvNets to solve the pretext tasks, and (2) pseudo labels for the pretext task can be automatically generated based on the attributes of images or videos.
Fig. 1. The general pipeline of self-supervised learning. The visual features are learned through the process of training ConvNets to solve a pre-defined pretext task. After the self-supervised pretext task training is finished, the learned parameters serve as a pre-trained model and are transferred to other downstream computer vision tasks by fine-tuning. The performance on these downstream tasks is used to evaluate the quality of the learned features. During the knowledge transfer to downstream tasks, only the general features from the first several layers are usually transferred.
The general pipeline of self-supervised learning is shown in Fig. 1. During the self-supervised training phase, a pre-defined pretext task is designed for ConvNets to solve, and the pseudo labels for the pretext task are automatically generated based on some attributes of the data. Then the ConvNet is trained to optimize the objective function of the pretext task. After the self-supervised training is finished, the learned visual features can be further transferred to downstream tasks (especially when only relatively small amounts of data are available) as pre-trained models to improve performance and overcome over-fitting. Generally, shallow layers capture general low-level features like edges, corners, and textures, while deeper layers capture task-related high-level features. Therefore, visual features from only the first several layers are transferred during the supervised downstream task training phase.
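To make this pipeline concrete, below is a minimal PyTorch-style sketch of the two training stages in Fig. 1. It is an illustrative example rather than any specific reviewed method; the backbone choice, the class counts, and the random tensors standing in for real unlabeled/labeled data are all assumptions.

```python
import torch
import torch.nn as nn
import torchvision

num_pretext_classes, num_downstream_classes = 4, 10

backbone = torchvision.models.resnet18(weights=None)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()              # keep only the feature extractor
criterion = nn.CrossEntropyLoss()

# ---- Stage 1: self-supervised pretext task training on unlabeled data ----
pretext_head = nn.Linear(feat_dim, num_pretext_classes)
opt = torch.optim.SGD(list(backbone.parameters()) + list(pretext_head.parameters()), lr=0.01)
images = torch.randn(8, 3, 224, 224)                           # unlabeled images
pseudo_labels = torch.randint(0, num_pretext_classes, (8,))    # generated automatically, no annotation
loss = criterion(pretext_head(backbone(images)), pseudo_labels)
opt.zero_grad(); loss.backward(); opt.step()

# ---- Stage 2: transfer the backbone and fine-tune on a downstream task ----
downstream_head = nn.Linear(feat_dim, num_downstream_classes)
opt = torch.optim.SGD(list(backbone.parameters()) + list(downstream_head.parameters()), lr=0.001)
labeled_images = torch.randn(8, 3, 224, 224)
human_labels = torch.randint(0, num_downstream_classes, (8,))  # human-annotated labels
loss = criterion(downstream_head(backbone(labeled_images)), human_labels)
opt.zero_grad(); loss.backward(); opt.step()
```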
To make this survey easy to read, we first define the terms used in the remaining sections. • Human-annotated label:
Human-annotated labels refer to labels of data that are manually annotated by human workers. • Pseudo label:
Pseudo labels are automatically generated labels based on data attributes for pretext tasks. • Pretext Task:
Pretext tasks are pre-designed tasks for networks to solve, and visual features are learned by learning the objective functions of pretext tasks. • Downstream Task:
Downstream tasks are computer vision applications that are used to evaluate the quality of features learned by self-supervised learning. These applications can greatly benefit from the pre-trained models when training data are scarce. In general, human-annotated labels are needed to solve the downstream tasks. However, in some applications, the downstream task can be the same as the pretext task without using any human-annotated labels. • Supervised Learning:
Supervised learning indicates learning methods using data with fine-grained human-annotated labels to train networks. • Semi-supervised Learning:
Semi-supervised learning refers to learning methods using a small amount of labeled data in conjunction with a large amount of unlabeled data. • Weakly-supervised Learning:
Weakly supervised learning refers to learning methods that learn with coarse-grained labels or inaccurate labels. The cost of obtaining weak supervision labels is generally much cheaper than that of fine-grained labels for supervised methods. • Unsupervised Learning:
Unsupervised learning refers to learning methods without using any human-annotated labels. • Self-supervised Learning:
Self-supervised learning is a subset of unsupervised learning methods. Self-supervised learning refers to learning methods in which ConvNets are explicitly trained with automatically generated labels. This review only focuses on self-supervised learning methods for visual feature learning with ConvNets in which the features can be transferred to multiple different computer vision tasks.

Since no human annotations are needed to generate pseudo labels during self-supervised training, very large-scale datasets can be used for self-supervised training. Trained with these pseudo labels, self-supervised methods have achieved promising results, and the performance gap with supervised methods on downstream tasks has become smaller. This paper provides a comprehensive survey of deep ConvNet-based self-supervised visual feature learning methods. The key contributions of this paper are as follows:
• To the best of our knowledge, this is the first comprehensive survey about self-supervised visual feature learning with deep ConvNets, which will be helpful for researchers in this field.
• An in-depth review of recently developed self-supervised learning methods and datasets is provided.
• Quantitative performance analysis and comparison of the existing methods are provided.
• A set of possible future directions for self-supervised learning is pointed out.
2 FORMULATION OF DIFFERENT LEARNING SCHEMAS
Based on the training labels, visual feature learning methods can be grouped into the following four categories: supervised, semi-supervised, weakly supervised, and unsupervised. In this section, the four types of learning methods are compared and key terminologies are defined.
For supervised learning, given a dataset X, for each data X_i in X there is a corresponding human-annotated label Y_i. For a set of N labeled training data D = {X_i}_{i=0}^{N}, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N}\sum_{i=1}^{N} loss(X_i, Y_i). \qquad (1)$$

Trained with accurate human-annotated labels, supervised learning methods obtained break-through results on different computer vision applications [1], [4], [8], [16]. However, data collection and annotation usually are expensive and may require special skills. Therefore, semi-supervised, weakly supervised, and unsupervised learning methods were proposed to reduce the cost.

For semi-supervised visual feature learning, given a small labeled dataset X and a large unlabeled dataset Z, for each data X_i in X there is a corresponding human-annotated label Y_i. For a set of N labeled training data D_1 = {X_i}_{i=0}^{N} and M unlabeled training data D_2 = {Z_i}_{i=0}^{M}, the training loss function is defined as:

$$loss(D_1, D_2) = \min_{\theta} \frac{1}{N}\sum_{i=1}^{N} loss(X_i, Y_i) + \frac{1}{M}\sum_{i=1}^{M} loss(Z_i, R(Z_i, X)), \qquad (2)$$

where R(Z_i, X) is a task-specific function to represent the relation between each unlabeled training data Z_i and the labeled dataset X.

For weakly supervised visual feature learning, given a dataset X, for each data X_i in X there is a corresponding coarse-grained label C_i. For a set of N training data D = {X_i}_{i=0}^{N}, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N}\sum_{i=1}^{N} loss(X_i, C_i). \qquad (3)$$

Since the cost of weak supervision is much lower than that of fine-grained labels for supervised methods, large-scale datasets are relatively easier to obtain. Recently, several papers proposed to learn image features from web-collected images using hashtags as category labels [21], [22], and obtained very good performance [21].

Unsupervised learning refers to learning methods that do not need any human-annotated labels. This type of methods includes fully unsupervised learning methods, in which the methods do not need any labels at all, as well as self-supervised learning methods, in which networks are explicitly trained with automatically generated pseudo labels without involving any human annotation.
Recently, many self-supervised learning methods for visual feature learning have been developed without using any human-annotated labels [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35]. Some papers refer to this type of learning methods as unsupervised learning [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48]. Compared to supervised learning methods which require a data pair X_i and Y_i, where Y_i is annotated by human labor, self-supervised learning is also trained with data X_i along with its pseudo label P_i, while P_i is automatically generated for a pre-defined pretext task without involving any human annotation. The pseudo label P_i can be generated by using attributes of images or videos such as the context of images [18], [19], [20], [36], or by traditional hand-designed methods [49], [50], [51].

Given a set of N training data D = {P_i}_{i=0}^{N}, the training loss function is defined as:

$$loss(D) = \min_{\theta} \frac{1}{N}\sum_{i=1}^{N} loss(X_i, P_i). \qquad (4)$$

As long as the pseudo labels P are automatically generated without involving human annotations, the methods belong to self-supervised learning. Recently, self-supervised learning methods have achieved great progress. This paper focuses on the self-supervised learning methods that are mainly designed for visual feature learning, while the features have the ability to be transferred to multiple visual tasks and to perform new tasks by learning from limited labeled data. This paper summarizes these self-supervised feature learning methods from different perspectives including network architectures, commonly used pretext tasks, datasets, and applications, etc.

3 COMMON DEEP NETWORK ARCHITECTURES
Regardless of the category of learning methods, they share similar network architectures. This section reviews common architectures for learning both image and video features.
Various 2DConvNets have been designed for image feature learning. Here, five milestone architectures for image feature learning, including AlexNet [8], VGG [9], GoogLeNet [10], ResNet [11], and DenseNet [12], are reviewed.
AlexNet obtained a big improvement in the performance of image classification on the ImageNet dataset compared to the previous state-of-the-art methods [8]. With the support of powerful GPUs, AlexNet, which has about 60 million parameters, was trained on ImageNet with about 1.2 million images. As shown in Fig. 2, the architecture of AlexNet has 8 layers, of which 5 are convolutional layers and 3 are fully connected layers. The ReLU is applied after each convolutional layer. The vast majority of the network parameters come from the fully connected layers. With this scale of parameters, the network can easily over-fit. Therefore, different kinds of techniques are applied to avoid the over-fitting problem, including data augmentation, dropout, and normalization.
Fig. 2. The architecture of AlexNet [8]. The numbers indicate the number of channels of each feature map. Figure is reproduced based on AlexNet [8].
VGG was proposed by Simonyan and Zisserman and won the first place in the localization task and the second place in the classification task of the ILSVRC-2014 competition [9]. Simonyan and Zisserman proposed networks of various depths, while the 16-layer VGG is the most widely used one due to its moderate model size and its superior performance. The architecture of VGG-16 is shown in Fig. 3. It has 13 convolutional layers belonging to five convolution blocks. The main difference between VGG and AlexNet is that AlexNet has large convolution strides and large kernel sizes, while all the convolution kernels in VGG have the same small size (3 × 3) and small convolution stride (1 × 1). The large kernel size leads to too many parameters and a large model size, while the large convolution stride may cause the network to miss some fine features in the lower layers. The smaller kernel size makes the training of very deep convolutional neural networks feasible while still preserving the fine-grained information in the network.

VGG demonstrated that deeper networks are able to obtain better performance. However, deeper networks are more difficult to train due to two problems: gradient vanishing and gradient explosion. ResNet was proposed by He et al. to use skip connections in convolution blocks by sending the previous feature map to the next convolution block to overcome gradient vanishing and gradient explosion [11].
Fig. 3. The architecture of VGG [9]. Figure is reproduced based on VGG [9].

The details of the skip connection are shown in Fig. 4. With the skip connection, the training of very deep neural networks on GPUs becomes feasible.
Fig. 4. The architecture of the residual block [11]. The identity mapping can effectively reduce gradient vanishing and explosion, which makes the training of very deep networks feasible. Figure is reproduced based on ResNet [11].
In ResNet [11], He et al. also evaluated networks with different depths for image classification. Due to its smaller model size and superior performance, ResNet is often used as the base network for other computer vision tasks. The convolution blocks with skip connections are also widely used as basic building blocks.
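As a reference, the following is a minimal PyTorch-style sketch of the residual connection shown in Fig. 4. It is a simplified basic block rather than the exact ResNet implementation; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Simplified residual block: y = ReLU(F(x) + x), as in Fig. 4."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first weight layer + ReLU
        out = self.bn2(self.conv2(out))            # second weight layer
        return self.relu(out + x)                  # identity skip connection

x = torch.randn(1, 64, 56, 56)
y = BasicResidualBlock(64)(x)   # same spatial size and channel count as the input
```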
GoogLeNet, a 22-layer deep network, was proposed by Szegedy et al. and won the ILSVRC-2014 classification challenge with a top-5 test accuracy of 93.3% [10]. Compared to previous work that aimed to build deeper networks, Szegedy et al. explored building a wider network in which each layer has multiple parallel convolution layers. The basic block of GoogLeNet is the inception block, which consists of parallel convolution branches with different kernel sizes together with 1 × 1 convolutions for dimension reduction. The architecture of the inception block of GoogLeNet is shown in Fig. 5. With a carefully crafted design, they increased the depth and width of the network while keeping the computational cost constant.
Fig. 5. The architecture of the inception block [10]. Figure is reproduced based on GoogLeNet [10].
Fig. 6. The architecture of the dense block proposed in DenseNet [12]. Figure is reproduced based on [12].
Most of the networks including AlexNet, VGG, and ResNet follow a hierarchical architecture. The images are fed to the network and features are extracted by different layers. The shallow layers extract low-level general features, while the deep layers extract high-level task-specific features [52]. However, when a network goes deeper, the deeper layers may suffer from memorizing the low-level features needed by the network to accomplish the task.

To alleviate this problem, Huang et al. proposed the dense connection, which sends all the feature maps before a convolution block as the input to the next convolution block in the neural network [12]. As shown in Fig. 6, the output features of all the previous convolution blocks serve as the input to the current block. In this way, the shallower blocks can focus on the low-level general features while the deeper blocks can focus on the high-level task-specific features.
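The following is a minimal PyTorch-style sketch of the dense connectivity pattern in Fig. 6. It is a simplified illustration rather than the official DenseNet implementation; the growth rate and layer count are arbitrary.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps (Fig. 6)."""
    def __init__(self, in_channels, growth_rate=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate        # the next layer sees all previous outputs

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense connection: concatenate
            features.append(out)
        return torch.cat(features, dim=1)

y = TinyDenseBlock(64)(torch.randn(1, 64, 32, 32))  # 64 + 3*32 = 160 output channels
```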
To extract both spatial and temporal information from videos, several architectures have been designed for video feature learning, including 2DConvNet-based methods [53], 3DConvNet-based methods [16], and LSTM-based methods [54]. The 2DConvNet-based methods apply a 2DConvNet on every single frame and the image features of multiple frames are fused as video features. The 3DConvNet-based methods employ 3D convolution operations to simultaneously extract both spatial and temporal features from multiple frames. The LSTM-based methods employ LSTMs to model the long-term dynamics within a video. This section briefly summarizes these three types of architectures for video feature learning.
Fig. 7. The general architecture of the two-stream network, which includes one spatial stream and one temporal stream. Figure is reproduced based on [53].
Videos generally are composed of various numbers of frames. To recognize actions in a video, networks are required to capture appearance features as well as temporal dynamics from frame sequences. As shown in Fig. 7, a two-stream 2DConvNet-based network was proposed by Simonyan and Zisserman for human action recognition, using one 2DConvNet to capture spatial features from the RGB stream and another 2DConvNet to capture temporal features from the optical flow stream [53]. Optical flow encodes the boundaries of moving objects; therefore, it is relatively easier for the temporal stream ConvNet to capture the motion information within the frames.

Experiments showed that the fusion of the two streams can significantly improve action recognition accuracy. Later, this work was extended to multi-stream networks [55], [56], [57], [58], [59] to fuse features from different types of inputs such as dynamic images [60] and difference of frames [61].
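A minimal sketch of the two-stream idea follows; it is an illustrative simplification of [53], with the backbones, the flow-stack size, and the averaging fusion as assumptions: two 2DConvNets process an RGB frame and a stack of optical flow frames, and their class scores are fused by averaging.

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 101

# Spatial stream: a single RGB frame (3 channels).
spatial = torchvision.models.resnet18(weights=None, num_classes=num_classes)

# Temporal stream: a stack of 10 optical flow frames (2 channels each -> 20 input channels).
temporal = torchvision.models.resnet18(weights=None, num_classes=num_classes)
temporal.conv1 = nn.Conv2d(20, 64, kernel_size=7, stride=2, padding=3, bias=False)

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)

# Late fusion: average the class scores of the two streams.
scores = (spatial(rgb) + temporal(flow)) / 2
```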
The 3D convolution operation was first proposed in 3DNet [62] for human action recognition. Compared to 2DConvNets, which individually extract the spatial information of each frame and then fuse them together as video features, 3DConvNets are able to simultaneously extract both spatial and temporal features from multiple frames.

C3D [16] is a VGG-like 11-layer 3DConvNet designed for human action recognition. The network contains 8 convolutional layers and 2 fully connected layers followed by a softmax output layer. All the kernels have the size of 3 × 3 × 3, and the convolution stride is fixed to 1 pixel. Due to its powerful ability of simultaneously extracting both spatial and temporal features from multiple frames, the network achieved state-of-the-art results on several video analysis tasks including human action recognition [63], action similarity labeling [64], scene classification [65], and object recognition in videos [66].

The input of C3D is 16 consecutive RGB frames, from which the appearance and temporal cues of 16-frame clips are extracted. However, the paper of long-term temporal convolutions (LTC) [67] argues that, for long-lasting actions, 16 frames are insufficient to represent whole actions which last longer. Therefore, larger numbers of frames were employed to train 3DConvNets and achieved better performance than C3D [67], [68].

With the success of applying 3D convolution to video analysis tasks, various 3DConvNet architectures have been proposed [69], [70], [71]. Hara et al. proposed 3DResNet by replacing all the 2D convolution layers in ResNet with 3D convolution layers and showed comparable performance with the state-of-the-art on the action recognition task on several datasets [70].
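A minimal sketch of a C3D-style 3D convolution block is shown below (illustrative, not the full C3D network): 3 × 3 × 3 kernels convolve jointly over time and space, so a clip of stacked frames is processed as a single 5D tensor.

```python
import torch
import torch.nn as nn

# One C3D-style block: 3x3x3 kernels with stride 1, followed by spatial pooling.
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 3, 3), stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # keep the temporal length early, pool spatially
)

clip = torch.randn(2, 3, 16, 112, 112)     # (batch, channels, 16 frames, height, width)
features = block(clip)                     # -> (2, 64, 16, 56, 56)
```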
Fig. 8. The architecture of long-term recurrent convolutional networks (LRCN) [54]. LSTM is employed to model the long-term temporal information within a frame sequence. Figure is reproduced based on [54].
Due to their ability to model the temporal dynamics within a sequence, recurrent neural networks (RNNs) are often applied to videos as ordered frame sequences. Compared to the standard RNN [72], long short-term memory (LSTM) uses memory cells to store, modify, and access internal states, to better model the long-term temporal relationships within video frames [73].

Based on the advantages of the LSTM, Donahue et al. proposed long-term recurrent convolutional networks (LRCN) for human action recognition [54]. The framework of the LRCN is shown in Fig. 8. The LSTM is sequentially applied to the features extracted by ConvNets to model the temporal dynamics in the frame sequence. With the LSTM modeling a video as a frame sequence, this model is able to explicitly model the long-term temporal dynamics within a video. Later on, this model was extended to a deeper LSTM for action recognition [74], [75], video captioning [76], and gesture recognition tasks [77].
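A minimal LRCN-style sketch follows; it is our illustrative simplification of [54], with the backbone, hidden size, and average fusion over time as assumptions: a 2DConvNet extracts per-frame features, an LSTM models their temporal order, and predictions are averaged over time.

```python
import torch
import torch.nn as nn
import torchvision

num_classes, num_frames = 101, 16

cnn = torchvision.models.resnet18(weights=None)
cnn.fc = nn.Identity()                                   # per-frame 512-d features
lstm = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)
classifier = nn.Linear(256, num_classes)

clip = torch.randn(2, num_frames, 3, 224, 224)           # (batch, time, C, H, W)
frame_feats = cnn(clip.flatten(0, 1)).view(2, num_frames, 512)
hidden, _ = lstm(frame_feats)                            # temporal modeling
scores = classifier(hidden).mean(dim=1)                  # average fusion over time
```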
Deep ConvNets have demonstrated great potential in various computer vision tasks, and visualizations of the learned image and video features have shown that these networks truly learn meaningful features required by the corresponding tasks [52], [78], [79], [80]. However, one common drawback is that these networks can easily over-fit when training data are scarce, since there are millions of parameters in each network.

Take 3DResNet as an example: the performance of an 18-layer 3DResNet on the UCF101 action recognition dataset [63] is relatively low when trained from scratch. However, with a model pre-trained in a supervised manner on the large-scale Kinetics dataset (around 500,000 videos of 600 classes) with human-annotated class labels and then fine-tuned on the UCF101 dataset, the performance increases dramatically. Pre-trained models on large-scale datasets can speed up the training process and improve the performance on relatively small datasets. However, the cost of collecting and annotating large-scale datasets is very expensive and time-consuming.

In order to obtain pre-trained models from large-scale datasets without expensive human annotations, many self-supervised learning methods have been proposed to learn image and video features from pre-designed pretext tasks. The next section describes the general pipeline of self-supervised image and video feature learning.

4 COMMONLY USED PRETEXT AND DOWNSTREAM TASKS
Fig. 9. The self-supervised visual feature learning schema. The ConvNet is trained by minimizing the errors between the pseudo labels P and the predictions O of the ConvNet. Since the pseudo labels are automatically generated, no human annotations are involved during the whole process.

Most existing self-supervised learning approaches follow the schema shown in Fig. 9. Generally, a pretext task is defined for ConvNets to solve and visual features can be learned through the process of accomplishing this pretext task. The pseudo labels P for the pretext task can be automatically generated without human annotations. The ConvNet is optimized by minimizing the error between the prediction O of the ConvNet and the pseudo labels P. After the training on the pretext task is finished, ConvNet models that can capture visual features for images or videos are obtained.

To relieve the burden of large-scale dataset annotation, a pretext task is generally designed for networks to solve while pseudo labels for the pretext task are automatically generated based on data attributes. Many pretext tasks have been designed and applied for self-supervised learning such as foreground object segmentation [81], image inpainting [19], clustering [44], image colorization [82], temporal order verification [40], visual-audio correspondence verification
[25], and so on. Effective pretext tasks ensure that semantic features are learned through the process of accomplishing the pretext tasks.

Take image colorization as an example: image colorization is a task to colorize gray-scale images into colorful images. To generate realistic colorful images, networks are required to learn the structure and context information of images. In this pretext task, the data X are the gray-scale images, which can be generated by performing a linear transformation on RGB images, while the pseudo label P is the RGB image itself. The training pair X_i and P_i can be generated in real time with negligible cost. Self-supervised learning with other pretext tasks follows a similar pipeline.

According to the data attributes used to design pretext tasks, as shown in Fig. 10, we summarize the pretext tasks into four categories: generation-based, context-based, free semantic label-based, and cross modal-based.

Fig. 10. Categories of pretext tasks for self-supervised visual feature learning: generation-based, context-based, free semantic label-based, and cross modal-based.
Generation-based Methods:
This type of methods learns visual features by solving pretext tasks that involve image or video generation. • Image Generation:
Visual features are learned through the process of image generation tasks. This type of methods includes image colorization [18], image super-resolution [15], image inpainting [19], and image generation with Generative Adversarial Networks (GANs) [83], [84]. • Video Generation:
Visual features are learned through the process of video generation tasks. This type of methods includes video generation with GANs [85], [86] and video prediction [37].
Context-based pretext tasks:
The design of context-based pretext tasks mainly employs the context features of images or videos such as context similarity, spatial structure, temporal structure, etc. • Context Similarity:
Pretext tasks are designed based on the context similarity between image patches. This type of methods includes image clustering-based methods [34], [44] and graph constraint-based methods [43]. • Spatial Context Structure:
Pretext tasks used to train ConvNets are based on the spatial relations among image patches. This type of methods includes image jigsaw puzzles [20], [87], [88], [89], context prediction [41], and geometric transformation recognition [28], [36], etc. • Temporal Context Structure:
The temporal order of frames from videos is used as the supervision signal. The ConvNet is trained to verify whether the input frame sequence is in the correct order [40], [90] or to recognize the order of the frame sequence [39].
Free Semantic Label-based Methods:
This type of pretext tasks trains networks with automatically generated semantic labels. The labels are generated by traditional hard-coded algorithms [50], [51] or by game engines [30]. The pretext tasks include moving object segmentation [81], [91], contour detection [30], [47], relative depth prediction [92], etc.
Cross Modal-based Methods:
This type of pretext tasks trains ConvNets to verify whether two different channels of input data correspond to each other. This type of methods includes visual-audio correspondence verification [25], [93], RGB-flow correspondence verification [24], and egomotion [94], [95].
To evaluate the quality of the image or video features learned by self-supervised methods, the parameters learned by self-supervised learning are employed as pre-trained models and then fine-tuned on downstream tasks such as image classification, semantic segmentation, object detection, and action recognition, etc. The performance of the transfer learning on these high-level vision tasks demonstrates the generalization ability of the learned features. If the ConvNets of self-supervised learning can learn general features, then the pre-trained models can be used as a good starting point for other vision tasks that require capturing similar features from images or videos.

Image classification, semantic segmentation, and object detection usually are used as the tasks to evaluate the generalization ability of the image features learned by self-supervised learning methods, while human action recognition in videos is used to evaluate the quality of the video features obtained from self-supervised learning methods. Below are brief introductions of the commonly used high-level tasks for visual feature evaluation.
Semantic segmentation, the task of assigning semantic labels to each pixel in images, is of great importance for many applications such as autonomous driving, human-machine interaction, and robotics. The community has recently made promising progress and various networks have been proposed such as Fully Convolutional Networks (FCN) [4], DeepLab [5], and PSPNet [6], as well as datasets such as PASCAL VOC [96], CityScape [97], and ADE20K [98].

Among all these methods, FCN [4] is a milestone work for semantic segmentation since it started the era of applying fully convolutional networks (FCNs) to solve this task. The architecture of FCN is shown in Fig. 11. A 2DConvNet such as AlexNet, VGG, or ResNet is used as the base network for feature extraction, while the fully connected layers are replaced by transposed convolution layers to obtain the dense prediction. The network is trained end-to-end with pixel-wise annotations.

When using semantic segmentation as a downstream task to evaluate the quality of image features learned by self-supervised learning methods, the FCN is initialized with the parameters trained with the pretext task and fine-tuned on the semantic segmentation dataset; then the performance on the semantic segmentation task is evaluated and compared with that of other self-supervised methods.
Fig. 11. The framework of the fully convolutional network proposed for semantic segmentation [4]. Figure is reproduced based on [4].
Object detection, the task of localizing the positions of objects in images and recognizing the categories of the objects, is also very important for many computer vision applications such as autonomous driving, robotics, scene text detection, and so on. Recently, many datasets such as MSCOCO [99] and OpenImage [14] have been proposed for object detection, and many ConvNet-based models [1], [2], [3], [100], [101], [102], [103], [104] have been proposed and obtained great performance.

Fast-RCNN [2] is a two-stage network for object detection. The framework of Fast-RCNN is shown in Fig. 12. Object proposals are generated based on feature maps produced by a convolutional neural network, then these proposals are fed to several fully connected layers to generate the bounding boxes of objects and the categories of these objects.
Fig. 12. The pipeline of Fast-RCNN for object detection. Figure is reproduced based on [3].
When using object detection as a downstream task to evaluate the quality of the self-supervised image features, the network trained with the pretext task on unlabeled large-scale data serves as the pre-trained model for Fast-RCNN [2] and is then fine-tuned on object detection datasets; then the performance on the object detection task is evaluated to demonstrate the generalization ability of the self-supervised learned features.
Image classification is a task of recognizing the category of objects in each image. Many networks have been designed for this task such as AlexNet [8], VGG [9], ResNet [11],
GoogLeNet [10], DenseNet [12], etc. Usually, only one class label is available for each image although the image may contain different classes of objects.

When choosing image classification as a downstream task to evaluate the quality of the image features learned from self-supervised learning methods, the self-supervised learned model is applied to each image to extract features, which then are used to train a classifier such as a Support Vector Machine (SVM) [105]. The classification performance on testing data is compared with that of other self-supervised models to evaluate the quality of the learned features.
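A minimal sketch of this evaluation protocol is given below; it is illustrative only, and the frozen backbone, the random stand-in data, and the scikit-learn classifier choice are assumptions: features are extracted with the frozen self-supervised model and a linear SVM is trained on top of them.

```python
import torch
import torchvision
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Frozen backbone assumed to hold self-supervised weights (random here for illustration).
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()
backbone.eval()

def extract(images):
    with torch.no_grad():
        return backbone(images).numpy()

# Stand-ins for a labeled evaluation set (e.g., a small classification dataset).
train_x, train_y = torch.randn(100, 3, 224, 224), torch.randint(0, 10, (100,)).numpy()
test_x, test_y = torch.randn(20, 3, 224, 224), torch.randint(0, 10, (20,)).numpy()

svm = LinearSVC(C=1.0).fit(extract(train_x), train_y)
print("linear evaluation accuracy:", accuracy_score(test_y, svm.predict(extract(test_x))))
```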
Human action recognition is a task of identifying what people are doing in videos for a list of pre-defined action classes. Generally, videos in human action recognition datasets contain only one action in each video [17], [63], [106]. Both spatial and temporal features are needed to accomplish this task.

The action recognition task is often used to evaluate the quality of video features learned by self-supervised learning methods. The network is first trained on unlabeled video data with pretext tasks, then it is fine-tuned on action recognition datasets with human annotations to recognize the actions. The testing performance on the action recognition task is compared with that of other self-supervised learning methods to evaluate the quality of the learned features.
In addition to these quantitative evaluations of the learned features, there are also some qualitative visualization methods to evaluate the quality of self-supervised learning features. Three methods are often used for this purpose: kernel visualization, feature map visualization, and image retrieval visualization [28], [36], [41], [44].
Kernel Visualization:
The kernels of the first convolution layer learned with the pretext tasks are qualitatively visualized and compared with the kernels from supervised models. The similarity of the kernels learned by supervised and self-supervised models is compared to indicate the effectiveness of the self-supervised methods [28], [44].
Feature Map Visualization:
Feature maps are visualized to show the attention of networks. A larger activation means the neural network pays more attention to the corresponding region in the image. Feature maps are usually qualitatively visualized and compared with those of supervised models [28], [36].
Nearest Neighbor Retrieval:
In general, images with similar appearance usually are closer in the feature space. The nearest neighbor method is used to find the top K nearest neighbors in the feature space of the features learned by the self-supervised learned model [40], [41], [43].

5 DATASETS
This section summarizes the commonly used image and video datasets for training and evaluating self-supervised visual feature learning methods. Self-supervised learning methods can be trained with images or videos by discarding human-annotated labels; therefore, any dataset collected for supervised learning can be used for self-supervised visual feature learning without using its human-annotated labels. The evaluation of the quality of the learned features is normally conducted by fine-tuning on high-level vision tasks with relatively small datasets (normally with accurate labels) such as video action recognition, object detection, semantic segmentation, etc. It is worth noting that networks that use synthetic datasets for visual feature learning are considered as self-supervised learning in this paper since the labels of synthetic datasets are automatically generated by game engines and no human annotations are involved. Table 1 summarizes the commonly used image and video datasets. • ImageNet:
The ImageNet dataset [13] contains about 1.3 million images uniformly distributed into 1,000 classes and is organized according to the WordNet hierarchy. Each image is assigned only one class label. ImageNet is the most widely used dataset for self-supervised image feature learning. • Places:
The Places dataset [107] is proposed for scene recognition and contains more than 2.5 million images covering more than 205 scene categories with more than 5,000 images per category. • Places365:
Places365 is the second generation of the Places database, which is built for high-level visual understanding tasks such as scene context, object recognition, action and event prediction, and theory-of-mind inference [108]. There are more than 10 million images covering more than 400 classes, with 5,000 to 30,000 training images per class. • SUNCG:
The SUNCG dataset is a large synthetic 3D scene repository for indoor scenes which consists of over 45,000 different scenes with manually created realistic room and furniture layouts [109]. The synthetic depth, object-level semantic labels, and volumetric ground truth are available. • MNIST:
MNIST is a dataset of handwritten digits consisting of 70,000 images, of which 60,000 images belong to the training set and the remaining 10,000 images are for testing [110]. All digits have been size-normalized and centered in fixed-size images. • SVHN:
SVHN is a dataset for recognizing digits and numbers in natural scene images, obtained from house numbers in Google Street View images [111]. The dataset consists of over 600,000 images and all digits have been resized to a fixed resolution of 32 × 32 pixels. • CIFAR10:
The CIFAR10 dataset is a collection of tiny images for the image classification task [112]. It consists of 60,000 images of size 32 × 32 covering 10 different classes. The classes include airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is balanced and there are 6,000 images of each class. • STL-10:
The STL-10 dataset is specifically designed for developing unsupervised feature learning methods [113]. It consists of 5,000 labeled training images, 8,000 testing images, and 100,000 unlabeled images covering 10 classes, which include airplane, bird, car, cat, deer, dog, horse, monkey, ship, and truck.

TABLE 1: Summary of commonly used image and video datasets. Note that image datasets can be used to learn image features, while video datasets can be used to learn both image and video features.
Dataset | Data Type | Size | Synthetic | Label
ImageNet [13] | Image | 1.3 million images | ✗ | Object category label
Places [107] | Image | 2.5 million images | ✗ | Scene category label
Places365 [108] | Image | 10 million images | ✗ | Scene category label
SUNCG [109] | Image | 45,000+ scenes | ✓ | Depth, volumetric data
MNIST [110] | Image | 70,000 images | ✗ | Digit class label
SVHN [111] | Image | 600,000 images | ✗ | Digit class label
CIFAR10 [112] | Image | 60,000 images | ✗ | Object category label
STL-10 [113] | Image | 113,000 images | ✗ | Object category label
PASCAL VOC [96] | Image | 11,530 images | ✗ | Category label, bounding box, segmentation mask
YFCC100M [114] | Image/Video | 100 million media items | ✗ | Hashtags
SceneNet RGB-D [115] | Video | 5 million images | ✓ | Depth, instance segmentation, optical flow
Moment-in-Time [116] | Video | 1 million 3-second videos | ✗ | Video category class
Kinetics [17] | Video | 0.5 million 10-second videos | ✗ | Human action class
AudioSet [117] | Video | 2 million 10-second videos | ✗ | Audio event class
KITTI [118] | Video | — | ✗ | Data captured by various sensors
UCF101 [63] | Video | 13,320 videos | ✗ | Human action class
HMDB51 [106] | Video | ~7,000 videos | ✗ | Human action class

• PASCAL Visual Object Classes (VOC):
The VOC 2012 dataset [96] contains 20 object categories including vehicles, household objects, animals, and others: aeroplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, TV/monitor, bird, cat, cow, dog, horse, sheep, and person. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. This dataset has been widely used as a benchmark for object detection, semantic segmentation, and classification tasks. The PASCAL VOC dataset is split into three subsets: training, validation, and a private testing set [96]. All the self-supervised image representation learning methods are evaluated on this dataset with the three tasks. • YFCC100M:
The Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) is a large public multimedia collection from Flickr, consisting of 100 million media items, of which around 99.2 million are images and around 0.8 million are videos [114]. The statistics on hashtags used in the YFCC100M dataset show that the data distribution is severely unbalanced [119]. • SceneNet RGB-D:
The SceneNet RGB-D dataset is a large indoor synthetic video dataset which consists of 5 million rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses [115]. It provides pixel-level annotations for scene understanding problems such as semantic segmentation, instance segmentation, and object detection, and also for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation, and 3D reconstruction [115]. • Moment in Time:
The Moments-in-Time dataset is a large, balanced, and diverse dataset for video understanding [116]. The dataset consists of 1 million video clips that cover 339 classes, and each video lasts around 3 seconds. The average number of video clips for each class is 1,757, with a median of 2,775. The dataset contains videos that capture visual and/or audible actions produced by humans, animals, objects, or nature [116]. • Kinetics:
The Kinetics dataset is a large-scale, high-quality dataset for human action recognition in videos [17]. The dataset consists of around 500,000 video clips covering 600 human action classes with at least 600 video clips for each action class. Each video clip lasts around 10 seconds and is labeled with a single action class. • AudioSet:
AudioSet consists of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos, covering an ontology of 632 audio event classes [117]. The event classes cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. This dataset is mainly used for self-supervised learning from video and audio consistency [26]. • KITTI:
The KITTI dataset is collected by driving a car around a city, where the car is equipped with various sensors including a high-resolution RGB camera, a gray-scale stereo camera, a 3D laser scanner, and high-precision GPS measurements and IMU accelerations from a combined GPS/IMU system [118]. Videos with various modalities captured by these sensors are available in this dataset. • UCF101:
UCF101 is a widely used video dataset for human action recognition [63]. The dataset consists of 13,320 video clips with more than 27 hours of video belonging to 101 categories. The videos in this dataset have a spatial resolution of 320 × 240 pixels and a 25 FPS frame rate. This dataset has been widely used for evaluating the performance of human action recognition. In the self-supervised scenario, the self-supervised models are fine-tuned on this dataset and the accuracy of action recognition is reported to evaluate the quality of the learned features.

TABLE 2: Summary of self-supervised image feature learning methods based on the category of pretext tasks. Multi-task means the method explicitly or implicitly uses multiple pretext tasks for image feature learning.
Method | Category | Code | Contribution
GAN [83] | Generation | ✓ | Forerunner of GAN
DCGAN [120] | Generation | ✓ | Deep convolutional GAN for image generation
WGAN [121] | Generation | ✓ | Proposed WGAN which makes the training of GAN more stable
BiGAN [122] | Generation | ✓ | Bidirectional GAN to project data into latent space
SelfGAN [123] | Multiple | ✗ | Use rotation recognition and GAN for self-supervised learning
ColorfulColorization [18] | Generation | ✓ | Posing image colorization as a classification task
Colorization [82] | Generation | ✓ | Using image colorization as the pretext task
AutoColor [124] | Generation | ✓ | Training ConvNet to predict per-pixel color histograms
Split-Brain [42] | Generation | ✓ | Using split-brain auto-encoder as the pretext task
Context Encoder [19] | Generation | ✓ | Employing ConvNet to solve image inpainting
CompletNet [125] | Generation | ✓ | Employing two discriminators to guarantee local and global consistency
SRGAN [15] | Generation | ✓ | Employing GAN for single image super-resolution
SpotArtifacts [126] | Generation | ✓ | Learning by recognizing synthetic artifacts in images
ImproveContext [33] | Context | ✗ | Techniques to improve context-based self-supervised learning methods
Context Prediction [41] | Context | ✓ | Learning by predicting the relative position of two patches from an image
Jigsaw [20] | Context | ✓ | Image patch jigsaw puzzle as the pretext task for self-supervised learning
Damaged Jigsaw [89] | Multiple | ✗ | Learning by solving jigsaw puzzle, inpainting, and colorization together
Arbitrary Jigsaw [88] | Context | ✗ | Learning with jigsaw puzzles with arbitrary grid size and dimension
DeepPermNet [127] | Context | ✓ | A new method to solve image patch jigsaw puzzle
RotNet [36] | Context | ✓ | Learning by recognizing rotations of images
Boosting [34] | Multiple | ✗ | Using clustering to boost the self-supervised learning methods
JointCluster [128] | Context | ✓ | Jointly learning of deep representations and image clusters
DeepCluster [44] | Context | ✓ | Using clustering as the pretext task
ClusterEmbegging [129] | Context | ✓ | Deep embedded clustering for self-supervised learning
GraphConstraint [43] | Context | ✓ | Learning with image pairs mined with Fisher Vector
Ranking [38] | Context | ✓ | Learning by ranking video frames with a triplet loss
PredictNoise [46] | Context | ✓ | Learning by mapping images to a uniform distribution over a manifold
MultiTask [32] | Multiple | ✓ | Using multiple pretext tasks for self-supervised feature learning
Learning2Count [130] | Context | ✓ | Learning by counting visual primitives
Watching Move [81] | Free Semantic Label | ✓ | Learning by grouping pixels of moving objects in videos
Edge Detection [81] | Free Semantic Label | ✓ | Learning by detecting edges
Cross Domain [81] | Free Semantic Label | ✓ | Utilizing synthetic data and its labels rendered by game engines

• HMDB51:
Compared to other datasets, the HMDB51 dataset is a smaller video dataset for human action recognition. There are around 7,000 video clips in this dataset belonging to 51 human action categories [106]. In the self-supervised scenario, the self-supervised models are fine-tuned on this dataset to evaluate the quality of the learned video features.
6 IMAGE FEATURE LEARNING
In this section, three groups of self-supervised image feature learning methods are reviewed, including generation-based methods, context-based methods, and free semantic label-based methods. A list of the self-supervised image feature learning methods can be found in Table 2. Since the cross modal-based methods mainly learn features from videos and most methods of this type can be used for both image and video feature learning, cross modal-based methods are reviewed in the video feature learning section.
Generation-based self-supervised methods for learning image features involve the process of generating images, including image generation with GANs (to generate fake images), super-resolution (to generate high-resolution images), image inpainting (to predict missing image regions), and image colorization (to colorize gray-scale images into colorful images). For these tasks, the pseudo training labels P usually are the images themselves and no human-annotated labels are needed during training; therefore, these methods belong to self-supervised learning methods.

The pioneering work on image generation-based methods is the autoencoder [131], which learns to compress an image into a low-dimensional vector which is then uncompressed, with a stack of layers, into an image close to the original image. With an autoencoder, networks can reduce the dimension of an image into a lower-dimensional vector that contains the main information of the original image. The current image generation-based methods follow a similar idea but with different pipelines to learn visual features through the process of image generation.

A Generative Adversarial Network (GAN) is a type of deep generative model that was proposed by Goodfellow et al. [83]. A GAN model generally consists of two kinds of networks: a generator which generates images from latent vectors, and a discriminator which distinguishes whether the input image is generated by the generator. By playing the two-player game, the discriminator forces the generator to generate realistic images, while the generator forces the discriminator to improve its differentiation ability. During the training, the two networks compete against each other and make each other stronger.

The common architecture for the image generation from a latent variable task is shown in Fig. 13. The generator is trained to map any latent vector sampled from the latent space into an image, while the discriminator is forced to distinguish whether an image comes from the real data distribution or the generated data distribution. Therefore, the discriminator is required to capture the semantic features of images to accomplish the task. The parameters of the discriminator can serve as a pre-trained model for other computer vision tasks.
Fig. 13. The pipeline of Generative Adversarial Networks [83]. By playing the two-player game, the discriminator forces the generator to generate realistic images, while the generator forces the discriminator to improve its differentiation ability.
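A minimal GAN training-step sketch follows (illustrative, not any specific reviewed method); after adversarial training on unlabeled images, the discriminator can serve as a pre-trained feature extractor. The toy architectures, image size, and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

latent_dim = 100
generator = nn.Sequential(nn.Linear(latent_dim, 64 * 64 * 3), nn.Tanh())      # toy generator
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 1))        # toy discriminator
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

real = torch.randn(16, 3 * 64 * 64)                   # stand-in for real unlabeled images
z = torch.randn(16, latent_dim)
fake = generator(z)

# Discriminator step: push real images toward 1 and generated images toward 0 (Eq. 5, max over D).
d_loss = bce(discriminator(real), torch.ones(16, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(16, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator (min over G).
g_loss = bce(discriminator(fake), torch.ones(16, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```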
Mathematically, the generator G is trained to learn a distribution p_z over real-world image data in order to generate data that are indistinguishable from the real data, while the discriminator D is trained to distinguish the real data distribution p_data from the data distribution p_z generated by the generator G. The min-max game between the generator G and the discriminator D is formulated as:

$$\min_{G}\max_{D}\; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))], \qquad (5)$$

where x is the real data and G(z) is the generated data. The discriminator D is trained to maximize the probability assigned to the real data x (that is, to maximize E_{x∼p_data(x)}[log D(x)]) and to minimize the probability assigned to the generated data G(z) (that is, to maximize E_{z∼p_z(z)}[log(1 − D(G(z)))]). The generator is trained to generate data close to the real data x, so that the output of the discriminator on the generated data, D(G(z)), is maximized.

Most of the methods for image generation from random variables do not need any human-annotated labels. However, the main purpose of this type of task is to generate realistic images instead of obtaining better performance on downstream applications. Generally, the inception scores of the generated images are used to evaluate the quality of the generated images [132], [133]. Only a few methods evaluated the quality of the features learned by the discriminator on high-level tasks and compared with others [120], [122], [123].

Adversarial training can help the network to capture the real distribution of the data and generate realistic data, and it has been widely used in computer vision tasks such as image generation [134], [135], video generation [85], [86], super-resolution [15], image translation [136], and image inpainting [19], [125]. When no human-annotated labels are involved, the method falls into self-supervised learning.

Fig. 14. Qualitative illustration of the image inpainting task. Given an image with a missing region (a), a human artist has no trouble inpainting it (b). Automatic inpainting using the context encoder proposed in [19], trained with an L2 reconstruction loss and an adversarial loss, is shown in (c). Figure is reproduced based on [19].
Image inpainting is a task of predicting arbitrary missing regions based on the rest of an image. A qualitative illustration of the image inpainting task is shown in Fig. 14. Fig. 14(a) is an image with a missing region, while Fig. 14(c) is the prediction of the network. To correctly predict missing regions, networks are required to learn common knowledge including the color and structure of common objects. Only with this knowledge can networks infer missing regions based on the rest of the image.

By analogy with auto-encoders, Pathak et al. made the first step to train a ConvNet to generate the contents of an arbitrary image region based on the rest of the image [19]. Their contributions are two-fold: using a ConvNet to tackle the image inpainting problem, and using an adversarial loss to help the network generate a realistic hypothesis. Most of the recent methods follow a similar pipeline [125]. Usually, there are two kinds of networks: a generator network that generates the missing region with a pixel-wise reconstruction loss, and a discriminator network that distinguishes whether the input image is real with an adversarial loss. With the adversarial loss, the network is able to generate a sharper and more realistic hypothesis for the missing image region. Both kinds of networks are able to learn semantic features from images which can be transferred to other computer vision tasks. However, only Pathak et al. [19] studied the performance of transfer learning for the learned parameters of the generator from the image inpainting task.

The generator network, which is a fully convolutional network, has two parts: an encoder and a decoder. The input of the encoder is the image that needs to be inpainted, and the context encoder learns the semantic features of the image. The context decoder predicts the missing region based on these features. The generator is required to understand the content of the image in order to generate a plausible hypothesis, and the discriminator is trained to distinguish whether the input image is the output of the generator. To accomplish the image inpainting task, both networks are required to learn semantic features of images.

Image super-resolution (SR) is a task of enhancing the resolution of images. With the help of fully convolutional networks, finer and more realistic high-resolution images can be generated from low-resolution images. SRGAN is a generative adversarial network for single-image super-resolution proposed by Ledig et al. [15]. The insight of this approach is to take advantage of a perceptual loss which consists of an adversarial loss and a content loss. With the perceptual loss, SRGAN is able to recover photo-realistic textures from heavily downsampled images and shows significant gains in perceptual quality.

There are two networks: one is the generator which enhances the resolution of the input low-resolution image, and the other is the discriminator which distinguishes whether the input image is the output of the generator. The loss function for the generator is the pixel-wise L2 loss plus the content loss, which is the similarity between the features of the predicted high-resolution image and those of the original high-resolution image, while the loss for the discriminator is a binary classification loss.
Image super-resolution (SR) is a task of enhancing the resolution of images. With the help of fully convolutional networks, finer and more realistic high-resolution images can be generated from low-resolution images. SRGAN is a generative adversarial network for single image super-resolution proposed by Ledig et al. [15]. The insight of this approach is to take advantage of a perceptual loss which consists of an adversarial loss and a content loss. With the perceptual loss, SRGAN is able to recover photo-realistic textures from heavily downsampled images and shows significant gains in perceptual quality. There are two networks: a generator that enhances the resolution of the input low-resolution image, and a discriminator that distinguishes whether the input image is the output of the generator. The loss function for the generator is the pixel-wise L2 loss plus the content loss, which measures the similarity between the features of the predicted high-resolution image and those of the original high-resolution image, while the loss for the discriminator is the binary classification loss. Compared to a network that only minimizes the Mean Squared Error (MSE), which generally leads to high peak signal-to-noise ratios but lacks high-frequency details, SRGAN is able to recover the fine details of the high-resolution image, since the adversarial loss pushes the output toward the natural image manifold through the discriminator network. The networks for the image super-resolution task are able to learn semantic features of images. Similar to other GANs, the parameters of the discriminator network can be transferred to other downstream tasks; however, no one has tested this transfer learning performance on other tasks yet, and the quality of the enhanced images is mainly compared to evaluate the performance of the network.
Fig. 15. The architecture of image colorization proposed in [18]. The figure is from [18] with author's permission.
Image colorization is a task of predicting a plausible color version of a photograph given a gray-scale photograph as input. A qualitative illustration of the image colorization task is shown in Fig. 15. To correctly colorize each pixel, networks need to recognize objects and to group pixels of the same part together; therefore, visual features can be learned in the process of accomplishing this task. Many deep learning-based colorization methods have been proposed in recent years [18], [137], [138]. A straightforward idea is to employ a fully convolutional neural network, consisting of an encoder for feature extraction and a decoder for color hallucination, and to optimize it with an L2 loss between the predicted color and the original color. Zhang et al. proposed to handle the uncertainty by posing the task as a classification problem and used class-rebalancing to increase the diversity of predicted colors [18]. The framework for image colorization proposed by Zhang et al. is shown in Fig. 15. Trained on large-scale image collections, the method shows great results and fools humans on 32% of the trials during the colorization test. Some work specifically employs the image colorization task as the pretext for self-supervised image representation learning [18], [42], [82], [124]. After the image colorization training is finished, the features learned through the colorization process are evaluated on other downstream high-level tasks with transfer learning.
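As a rough sketch of the classification formulation described above, the loss below assigns each pixel's ab color to a bin on a uniform grid and applies cross-entropy; the uniform binning is an assumed simplification of the in-gamut quantization and class rebalancing used in [18].

```python
import torch
import torch.nn.functional as F

def colorization_loss(pred_logits, target_ab, bins=16, ab_range=110.0):
    """Colorization posed as per-pixel classification over quantized ab colors.
    pred_logits: (N, bins*bins, H, W) network outputs; target_ab: (N, 2, H, W)
    ground-truth ab channels of the Lab image, roughly in [-ab_range, ab_range]."""
    scale = (bins - 1) / (2.0 * ab_range)
    a_idx = ((target_ab[:, 0] + ab_range) * scale).long().clamp(0, bins - 1)
    b_idx = ((target_ab[:, 1] + ab_range) * scale).long().clamp(0, bins - 1)
    target_bins = a_idx * bins + b_idx          # (N, H, W) integer class per pixel
    return F.cross_entropy(pred_logits, target_bins)
```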
The context-based pretext tasks mainly employ the context features of images, including context similarity, spatial structure, and temporal structure, as the supervision signal. Features are learned by the ConvNet through the process of solving pretext tasks designed based on these attributes of the image context.

Fig. 16. The architecture of DeepCluster [44]. The features of images are iteratively clustered and the cluster assignments are used as pseudo-labels to learn the parameters of the ConvNet. The figure is from [44] with author's permission.
Clustering is a method of grouping sets of similar data into the same clusters. Due to its powerful ability to group data by their attributes, it is widely used in many fields such as machine learning, image processing, and computer graphics, and many classical clustering algorithms have been proposed for various applications [139]. In the self-supervised scenario, clustering methods are mainly employed as a tool to cluster image data. A naive method would be to cluster the image data based on hand-designed features such as HOG [140], SIFT [141], or Fisher Vectors [49]. After clustering, several clusters are obtained in which images within one cluster have a smaller distance in feature space and images from different clusters have a larger distance in feature space; the smaller the distance in feature space, the more similar the images are in appearance in RGB space. Then a ConvNet can be trained to classify the data by using the cluster assignments as pseudo class labels. To accomplish this task, the ConvNet needs to learn the invariance within one class and the variance among different classes; therefore, the ConvNet is able to learn the semantic meaning of images. The existing methods that use clustering variants as the pretext task follow these principles [34], [43], [44], [128], [129]. First, the images are clustered into different clusters such that images from the same cluster have a smaller distance and images from different clusters have a larger distance. Then a ConvNet is trained to recognize the cluster assignment [34], [44] or to recognize whether two images are from the same cluster [43]. The pipeline of DeepCluster, a clustering-based method, is shown in Fig. 16. DeepCluster iteratively clusters images with K-means and uses the resulting assignments as supervision to update the weights of the network, and it is the current state-of-the-art for self-supervised image representation learning.
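A minimal sketch of one iteration of this pseudo-labeling loop is given below, assuming scikit-learn's KMeans and a feature matrix already extracted by the current ConvNet; the cluster count is a hyper-parameter assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features, n_clusters=1000, seed=0):
    """One clustering step in the spirit of DeepCluster [44]: k-means on the
    current ConvNet features of all training images produces pseudo-labels
    that serve as classification targets for the next training epoch.
    features: (N, D) array of image descriptors."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed)
    pseudo_labels = kmeans.fit_predict(features)   # (N,) cluster index per image
    return pseudo_labels
```

Alternating between this clustering step and standard supervised training on the resulting pseudo-labels is the core of the approach.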
Images contain rich spatial context information, such as the relative positions of different patches from an image, which can be used to design pretext tasks for self-supervised learning. The pretext task can be to predict the relative positions of two patches from the same image [41], or to recognize the order of a shuffled sequence of patches from the same image [20], [88], [89]. The context of full images can also be used as a supervision signal to design pretext tasks, such as recognizing the rotation angles of whole images [36]. To accomplish these pretext tasks, ConvNets need to learn spatial context information such as the shape of objects and the relative positions of different parts of an object.

Fig. 17. The visualization of the Jigsaw Image Puzzle [20]. (a) is an image with sampled image patches, (b) is an example of shuffled image patches, and (c) shows the correct order of the sampled patches. Figure is reproduced based on [20].

The method proposed by Doersch et al. is one of the pioneering works using spatial context cues for self-supervised visual feature learning [41]. Random pairs of image patches are extracted from each image, then a ConvNet is trained to recognize the relative positions of the two image patches. To solve this puzzle, ConvNets need to recognize objects in images and learn the relationships among different parts of objects. To prevent the network from learning trivial solutions, such as simply using edges in patches to accomplish the task, heavy data augmentation is applied during the training phase. Following this idea, more methods were proposed to learn image features by solving more difficult spatial puzzles [20], [27], [87], [88], [89]. As illustrated in Fig. 17, one typical work proposed by Noroozi et al. attempted to solve an image Jigsaw puzzle with a ConvNet [20]. Fig. 17(a) is an image with sampled image patches, Fig. 17(b) is an example of shuffled image patches, and Fig. 17(c) shows the correct order of the sampled patches. The shuffled image patches are fed to the network, which is trained to recognize the correct spatial locations of the input patches by learning the spatial context structure of images such as object color, structure, and high-level semantic information. Given 9 image patches from an image, there are 9! = 362,880 possible permutations, and a network is very unlikely to recognize all of them because of the ambiguity of the task. To limit the number of permutations, the Hamming distance is usually employed to choose, among all the permutations, only a subset with relatively large pairwise Hamming distances; only the selected permutations are used to train the ConvNet to recognize the permutation of shuffled image patches [20], [35], [88], [89]. The main principle of designing puzzle tasks is to find a suitable task that is neither too difficult nor too easy for a network to solve: if it is too difficult, the network may not converge due to the ambiguity of the task, while if it is too easy, the network can easily learn trivial solutions. Therefore, a reduction of the search space is usually employed to reduce the difficulty of the task.
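A rough sketch of this permutation-subset selection is shown below, greedily keeping permutations that are far apart in Hamming distance; the greedy scheme and subset size are assumptions in the spirit of [20] rather than the exact published procedure.

```python
import itertools
import numpy as np

def select_permutations(n_patches=9, n_select=100, seed=0):
    """Greedily pick a subset of patch permutations with large pairwise Hamming
    distance for the Jigsaw pretext task: repeatedly add the permutation whose
    distance to the closest already-selected permutation is largest."""
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_patches))))
    selected = [all_perms[rng.integers(len(all_perms))]]
    # Running Hamming distance of every candidate to its nearest selected permutation
    min_dists = (all_perms != selected[0]).sum(axis=1)
    for _ in range(n_select - 1):
        idx = int(min_dists.argmax())
        selected.append(all_perms[idx])
        min_dists = np.minimum(min_dists, (all_perms != all_perms[idx]).sum(axis=1))
    return np.array(selected)   # (n_select, n_patches) permutation lookup table
```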
Free semantic labels refer to labels with semantic meanings that are obtained without involving any human annotation. Generally, free semantic labels such as segmentation masks, depth images, optical flow, and surface normal images can be rendered by game engines or generated by hard-coded methods. Since these semantic labels are automatically generated, methods using such synthetic datasets, or using them in conjunction with large unlabeled image or video datasets, are considered self-supervised learning methods.

Given models of various objects and layouts of environments, game engines are able to render realistic images and provide accurate pixel-level labels. Since game engines can generate large-scale datasets with negligible cost, various game engines such as Airsim [142] and Carla [143] have been used to generate large-scale synthetic datasets with high-level semantic labels including depth, contours, surface normals, segmentation masks, and optical flow for training deep networks. An example of an RGB image with its generated accurate labels is shown in Fig. 18.
Fig. 18. An example of an indoor scene generated by a game engine [115]. For each synthetic image, the corresponding depth, instance segmentation, and optical flow can be automatically generated by the engine.
Game engines can generate realistic images with accurate pixel-level labels at very low cost. However, due to the domain gap between synthetic and real-world images, a ConvNet trained purely on synthetic images cannot be directly applied to real-world images. To utilize synthetic datasets for self-supervised feature learning, the domain gap needs to be explicitly bridged so that the ConvNet trained with the semantic labels of the synthetic dataset can be effectively applied to real-world images. To overcome this problem, Ren and Lee proposed an unsupervised feature space domain adaptation method based on adversarial learning [30]. As shown in Fig. 19, the network predicts surface normals, depth, and instance contours for the synthetic images, while a discriminator network D is employed to minimize the difference between the feature space domains of real-world and synthetic data. Aided by adversarial training and the accurate semantic labels of synthetic images, the network is able to capture visual features for real-world images.
Fig. 19. The architecture for utilizing synthetic and real-world images for self-supervised feature learning [30]. Figure is reproduced based on [30].
Compared to other pretext tasks, which only implicitly force ConvNets to learn semantic features, this type of method is trained with accurate semantic labels that explicitly force ConvNets to learn features highly related to the objects in images.
Applying hard-coded programs is another way to automatically generate semantic labels such as salience, foreground masks, contours, and depth for images and videos. With these methods, very large-scale datasets with generated semantic labels can be used for self-supervised feature learning. This type of method generally has two steps: (1) label generation by employing hard-coded programs on images or videos, and (2) training ConvNets with the generated labels. Various hard-coded programs have been applied to generate labels for self-supervised learning, including methods for foreground object segmentation [81], edge detection [47], and relative depth prediction [92]. Pathak et al. proposed to learn features by training a ConvNet to segment foreground objects in each frame of a video, where the label is the mask of moving objects in the videos [81]. Li et al. proposed to learn features by training a ConvNet for edge prediction, where the labels are motion edges obtained from the flow fields of videos [47]. Jing et al. proposed to learn features by training a ConvNet to predict relative scene depth, where the labels are generated from optical flow [92]. No matter what kind of labels are used to train the ConvNet, the general idea of this type of method is to distill knowledge from a hard-coded detector, which can be an edge detector, a salience detector, a relative depth generator, etc. As long as no human annotations are involved in the design of the detectors, the detectors can be used to generate labels for self-supervised training. Compared to other self-supervised learning methods, the supervision signal in these pretext tasks consists of semantic labels which can directly drive the ConvNet to learn semantic features. However, one drawback is that the semantic labels generated by hard-coded detectors are usually very noisy, which needs to be specifically handled.
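As a toy illustration of the label-generation step, the snippet below derives a crude moving-object mask from simple frame differencing. The surveyed methods rely on much stronger programs (unsupervised video segmentation [81], motion edges from optical flow [47], relative depth from flow [92]); this is only a stand-in for how pseudo ground truth can be produced without human labels.

```python
import numpy as np

def frame_difference_mask(frame_t, frame_t1, threshold=25.0):
    """Hard-coded pseudo-label generator: a binary foreground mask obtained by
    thresholding the absolute difference of two consecutive frames.
    frame_t, frame_t1: (H, W) or (H, W, C) uint8 arrays."""
    diff = np.abs(frame_t.astype(np.float32) - frame_t1.astype(np.float32))
    if diff.ndim == 3:                      # collapse color channels if present
        diff = diff.mean(axis=2)
    return (diff > threshold).astype(np.uint8)   # pseudo ground-truth mask
```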
VIDEO FEATURE LEARNING
This section reviews the self-supervised methods for learning video features; as listed in Table 3, they can be categorized into four classes: generation-based methods, context-based methods, free semantic label-based methods, and cross modal-based methods. Video features can be obtained by various kinds of networks, including 2DConvNets, 3DConvNets, and LSTMs combined with 2DConvNets or 3DConvNets. When a 2DConvNet is employed for video self-supervised feature learning, it is able to extract both image and video features after the self-supervised pretext task training is finished.
Learning from video generation refers to methods in which visual features are learned through the process of video generation without using any human-annotated labels. This type of method includes video generation with GANs [85], video colorization [145], and video prediction [37]. For these pretext tasks, the pseudo training label P is usually the video itself and no human-annotated labels are needed during training; therefore, these methods belong to self-supervised learning.
Fig. 20. The architecture of the generator in VideoGAN for video generation with GAN proposed in [85]. The figure is from [85] with author's permission.

TABLE 3. Summary of self-supervised video feature learning methods based on the category of pretext tasks. The Code column indicates whether an implementation has been released.
Method | Sub-Category | Code | Contribution
VideoGAN [85] | Generation | Yes | Forerunner of video generation with GAN
MocoGAN [86] | Generation | Yes | Decomposing motion and content for video generation with GAN
TemporalGAN [144] | Generation | Yes | Decomposing temporal and image generators for video generation
Video Colorization [145] | Generation | Yes | Employing video colorization as the pretext task
Un-LSTM [37] | Generation | Yes | Forerunner of video prediction with LSTM
ConvLSTM [146] | Generation | Yes | Employing convolutional LSTM for video prediction
MCNet [147] | Generation | Yes | Disentangling motion and content for video prediction
LSTMDynamics [148] | Generation | No | Learning by predicting long-term temporal dynamics in videos
Video Jigsaw [87] | Context | No | Learning by jointly reasoning about spatial and temporal context
Transitive [31] | Context | No | Learning inter- and intra-instance variations with a triplet loss
3DRotNet [28] | Context | No | Learning by recognizing rotations of video clips
CubicPuzzles [27] | Context | No | Learning by solving video cubic puzzles
ShuffleLearn [40] | Context | Yes | Employing temporal order verification as the pretext task
LSTMPermute [149] | Context | Yes | Learning by temporal order verification with LSTM
OPN [39] | Context | Yes | Using frame sequence order recognition as the pretext task
O3N [29] | Context | No | Learning by identifying odd video sequences
ArrowTime [90] | Context | Yes | Learning by recognizing the arrow of time in videos
TemporalCoherence [150] | Context | No | Learning with the temporal coherence of features of frame sequences
FlowNet [151] | Cross Modal | Yes | Forerunner of optical flow estimation with ConvNet
FlowNet2 [152] | Cross Modal | Yes | Better architecture and better performance on optical flow estimation
UnFlow [153] | Cross Modal | Yes | An unsupervised loss for optical flow estimation
CrossPixel [23] | Cross Modal | No | Learning by predicting motion from a single image as the pretext task
CrossModel [24] | Cross Modal | No | Optical flow and RGB correspondence verification as the pretext task
AVTS [25] | Cross Modal | No | Visual and audio correspondence verification as the pretext task
AudioVisual [26] | Cross Modal | Yes | Jointly modeling visual and audio as a fused multisensory representation
LookListenLearn [93] | Cross Modal | Yes | Forerunner of audio-visual correspondence for self-supervised learning
AmbientSound [154] | Cross Modal | No | Predicting a statistical summary of the sound from a video frame
EgoMotion [155] | Cross Modal | Yes | Learning by predicting camera motion and scene structure from videos
LearnByMove [94] | Cross Modal | Yes | Learning by predicting the camera transformation from a pair of images
TiedEgoMotion [95] | Cross Modal | No | Learning from ego-motor signals and video sequences
GoNet [156] | Cross Modal | Yes | Jointly learning monocular depth, optical flow, and ego-motion estimation from videos
DepthFlow [157] | Cross Modal | Yes | Depth and optical flow learning using cross-task consistency from videos
VisualOdometry [158] | Cross Modal | Yes | An unsupervised paradigm for deep visual odometry learning
ActiveStereoNet [159] | Cross Modal | Yes | End-to-end self-supervised learning of depth from active stereo systems
After GAN-based methods obtained breakthrough results in image generation, researchers employed GANs to generate videos [85], [86], [144]. One pioneering work of video generation with GAN is VideoGAN [85], and the architecture of its generator network is shown in Fig. 20. To model the motion of objects in videos, a two-stream network is proposed for video generation, in which one stream models the static regions of videos as background and the other stream models moving objects as foreground [85]. Videos are generated by combining the foreground and background streams. The underlying assumption is that each random variable in the latent space represents one video clip. This method is able to generate videos with dynamic content. However, Tulyakov et al. argue that this assumption increases the difficulty of generation; instead, they proposed MocoGAN, which uses the combination of two subspaces to represent a video by disentangling the content and motion in videos [86]. One is the content space, in which each variable represents one identity, and the other is the motion space, in which a trajectory represents the motion of that identity. With the two subspaces, the network is able to generate videos with a higher inception score. The generator learns to map latent vectors from the latent space into videos, while the discriminator learns to distinguish real-world videos from generated videos; therefore, the discriminator needs to capture semantic features from videos to accomplish this task. Since no human-annotated labels are used in these frameworks, they belong to the self-supervised learning methods. After the video generation training on a large-scale unlabeled dataset is finished, the parameters of the discriminator can be transferred to other downstream tasks [85].
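The two-stream composition used by the VideoGAN generator, blending a generated foreground video and a static background through a predicted mask, can be sketched as follows; the tensor shapes are assumptions.

```python
import torch

def compose_video(foreground, background, mask):
    """VideoGAN-style composition: video = m * f + (1 - m) * b, where the static
    background is replicated over time.
    foreground: (N, C, T, H, W), mask: (N, 1, T, H, W) in [0, 1],
    background: (N, C, 1, H, W)."""
    background = background.expand(-1, -1, foreground.size(2), -1, -1)
    return mask * foreground + (1.0 - mask) * background
```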
Temporal coherence in videos refers to the fact that consecutive frames within a short time window have a similar, coherent appearance. The coherence of color can be used to design pretext tasks for self-supervised learning; one way to utilize color coherence is to use video colorization as a pretext task for self-supervised video feature learning. Video colorization is a task of colorizing gray-scale frames into colorful frames. Vondrick et al. proposed to constrain colorization models to solve video colorization by learning to copy colors from a reference frame [145]. Given a reference RGB frame and a gray-scale image, the network needs to learn the internal connection between the reference RGB frame and the gray-scale image in order to colorize it. Another perspective is to tackle video colorization by employing a fully convolutional neural network. Tran et al. proposed a U-shaped convolutional neural network for video colorization [160]. The network is an encoder-decoder 3DConvNet: the input of the network is a gray-scale video clip, while the output is a colorful video clip. The encoder is a stack of 3D convolution layers that extract features, while the decoder is a stack of 3D deconvolution layers that generate colorful video clips from the extracted features. The color coherence in videos is a strong supervision signal; however, only a few works have studied employing it for self-supervised video feature learning [145], and more can be done on using color coherence as a supervision signal for self-supervised video feature learning.
Fig. 21. The architecture for the video prediction task proposed by [147]. Figure is reproduced based on [147].
Video prediction is a task of predicting future frame sequences based on a limited number of frames of a video. To predict future frames, a network must learn how appearance changes within a given frame sequence. The pioneering work applying deep learning to video prediction is Un-LSTM [37]; due to the powerful ability of LSTMs to model long-term dynamics in videos, an LSTM is used in both the encoder and the decoder [37]. Many methods have been proposed for video prediction [37], [147], [161], [162], [163], [164], [165]. Due to their superior ability to model temporal dynamics, most of them use an LSTM or an LSTM variant to encode temporal dynamics in videos or to infer the future frames [37], [146], [147], [164], [165]. These methods can be employed for self-supervised feature learning without using human annotations. Most of the frameworks follow an encoder-decoder pipeline in which the encoder models spatial and temporal features from the given video clip and the decoder generates future frames based on the features extracted by the encoder. Fig. 21 shows the pipeline of MCNet proposed by Villegas et al. [147]. MCNet is built on an encoder-decoder convolutional neural network with a convolutional LSTM for video prediction. It has two encoders: a content encoder that captures the spatial layout of an image, and a motion encoder that models the temporal dynamics within video clips. The spatial and temporal features are concatenated and fed to the decoder to generate the next frame. By separately modeling temporal and spatial features, this model can effectively generate future frames recursively. Video prediction is a self-supervised learning task and the learned features can be transferred to other tasks; however, no work has yet studied the generalization ability of features learned by video prediction. Generally, the Structural Similarity Index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) are employed to evaluate the difference between the generated frame sequence and the ground truth frame sequence.
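For reference, PSNR between a predicted frame and its ground truth can be computed as below; SSIM follows an analogous per-frame comparison but uses local image statistics instead of a global mean squared error.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a predicted and a ground-truth frame,
    one of the standard metrics for evaluating video prediction."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```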
Fig. 22. The pipeline of Shuffle and Learn [40]. The network is trained to verify whether the input frames are in the correct temporal order. Figure is reproduced based on [40].
Videos consist of frame sequences of various lengths that carry rich spatial and temporal information. The inherent temporal information within videos can be used as a supervision signal for self-supervised feature learning. Various pretext tasks have been proposed that utilize temporal context relations, including temporal order verification [29], [40], [90] and temporal order recognition [27], [39]. Temporal order verification is to verify whether a sequence of input frames is in the correct temporal order, while temporal order recognition is to recognize the order of a sequence of input frames. As shown in Fig. 22, Misra et al. proposed to use temporal order verification as the pretext task to learn image features from videos with a 2DConvNet [40]. The approach has two main steps: (1) frames with significant motion are sampled from videos according to the magnitude of optical flow, and (2) the sampled frames are shuffled and fed to the network, which is trained to verify whether the input data is in the correct order. To successfully verify the order of the input frames, the network is required to capture subtle differences between the frames, such as the movement of a person; therefore, semantic features can be learned through the process of accomplishing this task. The temporal order recognition tasks use networks of similar architecture. However, these methods usually suffer from a massive dataset preparation step: the frame sequences used to train the network are selected based on the magnitude of the optical flow, and the computation of optical flow is expensive and slow. Therefore, more straightforward and time-efficient methods are needed for self-supervised video feature learning.
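A minimal sketch of how positive and negative training tuples for such order verification can be sampled from frame indices is shown below; the optical-flow-based selection of high-motion frames used in [40] is omitted and the sampling ranges are assumptions.

```python
import random

def sample_order_tuple(n_frames, positive=True):
    """Frame-index triple for temporal order verification: a positive tuple keeps
    the temporal order (a < b < c); a negative tuple replaces the middle frame
    with one taken from outside the interval [a, c]."""
    assert n_frames >= 5, "need enough frames to form a negative tuple"
    a = random.randint(1, n_frames - 4)
    c = random.randint(a + 2, n_frames - 2)
    if positive:
        b = random.randint(a + 1, c - 1)               # middle frame inside (a, c)
    else:
        outside = list(range(0, a)) + list(range(c + 1, n_frames))
        b = random.choice(outside)                      # middle frame outside [a, c]
    return a, b, c
```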
Cross modal-based learning methods usually learn video features from the correspondence of multiple data streams, including RGB frame sequences, optical flow sequences, audio data, and camera poses. In addition to the rich temporal and spatial information in videos, optical flow sequences can be generated to specifically indicate the motion in videos, and frame differences can be computed with negligible time and space complexity to indicate the boundaries of moving objects. Similarly, audio data also provide useful hints about the content of videos. Based on the type of data used, these methods fall into three groups: (1) methods that learn features by using the RGB and optical flow correspondence [23], [24], (2) methods that learn features by utilizing the video and audio correspondence [25], [93], and (3) methods that learn features by utilizing the correspondence between egocentric video and ego-motor sensor signals [94], [95]. Usually, the network is trained either to recognize whether two kinds of input data correspond to each other [24], [25], or to learn the transformation between different modalities [94].

Optical flow encodes object motion between adjacent frames, while RGB frames contain appearance information. The correspondence of the two types of data can be used to learn general features [23], [24], [151], [152]. This type of pretext task includes optical flow estimation [151], [152] and RGB and optical flow correspondence verification [23]. Sayed et al. proposed to learn video features by verifying whether the input RGB frames and the optical flow correspond to each other; two networks are employed, one for extracting features from the RGB input and the other for extracting features from the optical flow input [24]. To verify whether two inputs correspond to each other, the network needs to capture the mutual information between the two modalities. The mutual information across different modalities usually has a higher semantic meaning compared to information that is modality specific; through this pretext task, the mutual information that is invariant to a specific modality can be captured by the ConvNet. Optical flow estimation is another type of pretext task that can be used for self-supervised video feature learning. Fischer et al. proposed FlowNet, an end-to-end convolutional neural network for optical flow estimation from two consecutive frames [151], [152]. To correctly estimate the optical flow between two frames, the ConvNet needs to capture the appearance changes between them. Optical flow estimation can be used for self-supervised feature learning because the flow labels can be automatically generated by simulators such as game engines or by hard-coded programs without human annotation.
Recently, some researchers proposed to use the correspondence between visual and audio streams to design visual-audio correspondence learning tasks [25], [26], [93], [154]. The general framework of this type of pretext task is shown in Fig. 23.
Fig. 23. The architecture of the video and audio correspondence verification task [93].

There are two subnetworks: the vision subnetwork and the audio subnetwork. The input of the vision subnetwork is a single frame or a stack of image frames, and the vision subnetwork learns to capture visual features of the input data. The audio subnetwork is a 2DConvNet whose input is the Fast Fourier Transform (FFT) of the audio from the video. Positive data are sampled by extracting video frames and audio from the same time of one video, while negative training data are generated by extracting video frames and audio from different videos or from different times of one video. Therefore, the networks are trained to discover the correlation between video data and audio data in order to accomplish this task. Since the inputs are two kinds of data, the networks are able to learn the two kinds of information jointly by solving the pretext task, and the two networks obtained very good performance on downstream applications [25].
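A rough sketch of the positive/negative sampling described above is given below; the `Video` interface (duration, frames_at, audio_at) is hypothetical and only illustrates how correspondence labels come for free from the data itself.

```python
import random

def sample_av_pair(videos, clip_len=1.0, positive=True):
    """Sample a (frames, audio, label) example for audio-visual correspondence
    verification: positives take frames and audio from the same moment of the
    same video, negatives take the audio from a different video."""
    v = random.choice(videos)                     # hypothetical Video objects
    t = random.uniform(0.0, v.duration - clip_len)
    frames = v.frames_at(t, clip_len)
    if positive:
        audio = v.audio_at(t, clip_len)
    else:
        other = random.choice([u for u in videos if u is not v])
        audio = other.audio_at(random.uniform(0.0, other.duration - clip_len), clip_len)
    return frames, audio, int(positive)
```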
Since self-driving cars are usually equipped with various sensors, large-scale egocentric video along with ego-motor signals can easily be collected at very low cost by driving the car in the street. Recently, some researchers proposed to use the correspondence between the visual signal and the motor signal for self-supervised feature learning [94], [95], [155].
Fig. 24. The architecture of camera pose transformation estimation from egocentric videos [94].
The underlying intuition of this type of method is that a self-driving car can be treated as a camera moving in a scene, and thus the egomotion of the visual data captured by the camera is the same as that of the car. Therefore, the correspondence between visual data and egomotion can be utilized for self-supervised feature learning. A typical network using the ego-motor signal, proposed by Agrawal et al. for self-supervised image feature learning, is shown in Fig. 24 [94]. The inputs to the network are two frames sampled from an egocentric video within a short time interval. The labels for the network indicate the rotation and translation relation between the two sampled images, which can be derived from the odometry data of the dataset. With this task, the ConvNet is forced to identify visual elements that are present in both sampled images. The ego-motor signal is a type of accurate supervision signal. In addition to directly applying it for self-supervised feature learning, it has also been used for unsupervised learning of depth and ego-motion [155]. All these networks can be used for self-supervised feature learning and transferred to downstream tasks.
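A minimal sketch of how such a pseudo-label can be derived from two odometry readings is shown below, assuming poses are given as world-from-camera rotations and translations.

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Relative camera transformation used as a pseudo-label for ego-motion
    pretext tasks: the rotation and translation mapping points from the first
    camera frame into the second. R1, R2: (3, 3) rotations; t1, t2: (3,) translations."""
    R_rel = R2.T @ R1                 # rotation from camera 1 to camera 2
    t_rel = R2.T @ (t1 - t2)          # translation expressed in the second camera frame
    return R_rel, t_rel
```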
PERFORMANCE COMPARISON

This section compares the performance of image and video feature self-supervised learning methods on public datasets. For image feature self-supervised learning, the performance on downstream tasks including image classification, semantic segmentation, and object detection is compared. For video feature self-supervised learning, the performance on the downstream task of human action recognition in videos is reported.
As described in Section 4.3, the quality of features learned by self-supervised models is evaluated by fine-tuning them on downstream tasks such as semantic segmentation, object detection, and image classification. This section summarizes the performance of the existing image feature self-supervised learning methods. Table 4 lists the image classification performance on the ImageNet [13] and Places [107] datasets. During self-supervised pretext task training, most of the methods are trained on the ImageNet dataset with AlexNet as the base network without using the category labels. After the pretext task self-supervised training is finished, a linear classifier is trained on top of different frozen convolutional layers of the ConvNet on the training splits of the ImageNet and Places datasets. The classification performance on the two datasets is used to demonstrate the quality of the learned features. As shown in Table 4, the overall performance of the self-supervised models is lower than that of models trained either with ImageNet labels or with Places labels. Among all the self-supervised methods, DeepCluster [44] achieved the best performance on the two datasets. Three conclusions can be drawn from the table: (1) Features from different layers always benefit from the self-supervised pretext task training; the performance of self-supervised learning methods is always better than that of a model trained from scratch. (2) All of the self-supervised methods perform well with features from the conv3 and conv4 layers, but worse with features from the conv1, conv2, and conv5 layers. This is probably because shallow layers capture general low-level features, while deep layers capture pretext task-related features. (3) Even when there is a domain gap between the dataset used for pretext task training and the dataset of the downstream task, the self-supervised learning methods are able to reach performance comparable to that of the model trained with ImageNet labels. In addition to image classification, object detection and semantic segmentation are also used as downstream tasks to evaluate the quality of the features learned by self-supervised learning. Usually, ImageNet is used for self-supervised pretext task pre-training by discarding the category labels, while AlexNet is used as the base network and fine-tuned on the three tasks. Table 5 lists the performance of image classification, object detection, and semantic segmentation on the PASCAL VOC dataset. The performance of classification and detection is obtained by testing the model on the test split of the PASCAL VOC 2007 dataset, while the performance of semantic segmentation is obtained by testing the model on the validation split of the PASCAL VOC 2012 dataset. As shown in Table 5, the performance of the self-supervised models on the segmentation and detection tasks is very close to that of the supervised method trained with ImageNet labels during pre-training. Specifically, the margins of the performance differences on the object detection and semantic segmentation tasks are within a few percentage points, which indicates that the features learned by self-supervised learning have a good generalization ability. Among all the self-supervised learning methods, DeepCluster [44] obtained the best performance on all the tasks. For self-supervised video feature learning methods, the human action recognition task is used to evaluate the quality of learned features.
Various video datasets have been used for self-supervised pre-training, and different network architectures have been used as the base network. Usually, after the pretext task pre-training is finished, networks are fine-tuned and tested on the commonly used UCF101 and HMDB51 datasets for the human action recognition task. Table 6 compares the performance of existing self-supervised video feature learning methods on the UCF101 and HMDB51 datasets. As shown in Table 6, the best performance of the fine-tuned models on UCF101 is still considerably lower than the accuracy that a supervised model trained with Kinetics labels can easily obtain; more effective self-supervised video feature learning methods are desired. Based on these results, conclusions can be drawn about the performance and reproducibility of the self-supervised learning methods.
Performance:
For image feature self-supervised learning, due to the well-designed pretext tasks, the performance of self-supervised methods is comparable to that of supervised methods on some downstream tasks, especially object detection and semantic segmentation. The margins of the performance differences on the object detection and semantic segmentation tasks are within a few percentage points, which indicates that the features learned by self-supervised learning have a good generalization ability.

TABLE 4. Linear classification on the ImageNet and Places datasets using activations from the convolutional layers of an AlexNet as features. "Convn" means the linear classifier is trained on top of the n-th convolution layer of AlexNet. "Places labels" and "ImageNet labels" indicate supervised models trained with human-annotated labels used as pre-trained models. Methods compared: Places labels [8], ImageNet labels [8], Random (scratch) [8], ColorfulColorization [18] (Generation), BiGAN [122] (Generation), SplitBrain [42] (Generation), ContextEncoder [19] (Context), ContextPrediction [41] (Context), Jigsaw [20] (Context), Learning2Count [130] (Context), and DeepCluster [44] (Context).

TABLE 5. Comparison of the self-supervised image feature learning methods on classification, detection, and segmentation on the PASCAL VOC dataset. "ImageNet labels" indicates a supervised model trained with human-annotated labels used as the pre-trained model. Methods compared: ImageNet labels [8], Random (scratch) [8], ContextEncoder [19] (Generation), BiGAN [122] (Generation), ColorfulColorization [18] (Generation), SplitBrain [42] (Generation), RankVideo [38] (Context), PredictNoise [46] (Context), JigsawPuzzle [20] (Context), ContextPrediction [41] (Context), Learning2Count [130] (Context), DeepCluster [44] (Context; 73.7 / 55.4 / 45.1 on classification / detection / segmentation), WatchingVideo [81] (Free Semantic Label), CrossDomain [30] (Free Semantic Label), AmbientSound [154] (Cross Modal), TiedToEgoMotion [95] (Cross Modal), and EgoMotion [94] (Cross Modal).

TABLE 6. Comparison of the existing self-supervised methods for action recognition on the UCF101 and HMDB51 datasets. * indicates the average accuracy over three splits. "Kinetics labels" indicates a supervised model trained with human-annotated labels used as the pre-trained model.
Methods compared in Table 6: Kinetics labels* [70], VideoGAN [85] (Generation), VideoRank [38] (Context), ShuffleLearn [40] (Context), OPN [29] (Context), RL [35] (Context), AOT [90] (Context), 3DRotNet [28] (Context), CubicPuzzles [27] (Context; 65.8 on UCF101 and 33.7 on HMDB51), RGB-Flow [24] (Cross Modal), and PoseAction [48] (Cross Modal).

However, the performance of video feature self-supervised learning methods is still much lower than that of the supervised models on downstream tasks. The best performance of the 3DConvNet-based methods on the UCF101 dataset is still far below that of the supervised model [70]. The poor performance of 3DConvNet-based self-supervised learning methods is probably because 3DConvNets usually have more parameters, which makes them prone to over-fitting, and because video feature learning is more complex due to the temporal dimension of videos.

Reproducibility:
As we can observe, for the image feature self-supervised learning methods, most of the networks use AlexNet as the base network, pre-train on the ImageNet dataset, and then evaluate on the same downstream tasks for quality evaluation. Also, the code of most methods is released, which is a great help for reproducing results. However, for video self-supervised learning, various datasets and networks have been used for self-supervised pre-training; therefore, it is unfair to directly compare the different methods. Furthermore, some methods use UCF101, a relatively small video dataset, as the self-supervised pre-training dataset. With a dataset of this size, the power of a more powerful model such as a 3DConvNet may not be fully exploited, and the model may suffer from severe over-fitting. Therefore, larger datasets should be used for video feature self-supervised pre-training.
Evaluation Metrics:
Another fact is that more evaluation metrics are needed to evaluate the quality of the learned features at different levels. The current solution is to use the performance on downstream tasks to indicate the quality of the features. However, this evaluation metric does not give insight into what the network learns through the self-supervised pre-training. More evaluation metrics such as network dissection [78] should be employed to analyze the interpretability of the self-supervised learned features.

FUTURE DIRECTIONS
Self-supervised learning methods have been achieving great success and obtaining performance close to that of supervised models on some computer vision tasks. Here, some future directions of self-supervised learning are discussed.
Learning Features from Synthetic Data:
A rising trend of self-supervised learning is to train networks with synthetic data, which can be easily rendered by game engines with very limited human involvement. With the help of game engines, millions of synthetic images and videos with accurate pixel-level annotations can be easily generated. With accurate and detailed annotations, various pretext tasks can be designed to learn features from synthetic data. One problem that needs to be solved is how to bridge the domain gap between synthetic data and real-world data. Only a few works have explored self-supervised learning from synthetic data by using GANs to bridge the domain gap [30], [166]. With more available large-scale synthetic data, more self-supervised learning methods will be proposed.
Learning from Web Data:
Another rising trend is to train networks with web-collected data [22], [167], [168] based on their existing associated tags. With search engines, millions of images and videos can be downloaded from websites like Flickr and YouTube at negligible cost. In addition to the raw data, titles, keywords, and reviews are also available as part of the data and can be used as extra information to train networks. With carefully curated queries, the web data retrieved by reliable search engines can be relatively clean. With large-scale web data and their associated metadata, the performance of self-supervised methods may be boosted. One open problem in learning from web data is how to handle the noise in the web data and its associated metadata.
Learning Spatiotemporal Features from Videos:
Self-supervised image feature learning has been well studied, and the performance margin between supervised models and self-supervised models is very small on some downstream tasks such as semantic segmentation and object detection. However, self-supervised video spatiotemporal feature learning with 3DConvNets is not well addressed yet. More effective pretext tasks that are specifically designed to learn spatiotemporal features from videos are needed.
Learning with Data from Different Sensors:
Most existing self-supervised visual feature learning methods focus on only images or videos. However, if other types of data from different sensors are available, the constraints between the different types of data can be used as additional sources to train networks to learn features [155]. Self-driving cars are usually equipped with various sensors including RGB cameras, gray-scale cameras, 3D laser scanners, and high-precision GPS and IMU units. Very large-scale datasets can easily be obtained through driving, and the correspondence of data captured by different devices can be used as a supervision signal for self-supervised feature learning.
Learning with Multiple Pretext Tasks:
Most existing self-supervised visual feature learning methods learn features by training a ConvNet to solve one pretext task. Different pretext tasks provide different supervision signals, which can help the network learn more representative features. Only a few works have explored learning with multiple pretext tasks for self-supervised feature learning [30], [32]. More work can be done on multi-pretext-task self-supervised feature learning.
10 CONCLUSION
Self-supervised image feature learning with deep convolutional neural networks has achieved great success, and the margin between the performance of self-supervised methods and that of supervised methods on some downstream tasks has become very small. This paper has extensively reviewed recent deep convolutional neural network-based methods for self-supervised image and video feature learning from all perspectives, including common network architectures, pretext tasks, algorithms, datasets, performance comparisons, discussions, and future directions. The comparative summaries of the methods, datasets, and performance in tabular form clearly demonstrate their properties, which will benefit researchers in the computer vision community.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in
CVPR , pp. 580–587, 2014.[2] R. Girshick, “Fast R-CNN,” in
ICCV , 2015.[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in
NIPS ,pp. 91–99, 2015.[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-works for semantic segmentation,” in
CVPR , pp. 3431–3440, 2015.[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.Yuille, “Deeplab: Semantic image segmentation with deep con-volutional nets, atrous convolution, and fully connected crfs,”
TPAMI , 2018.[6] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsingnetwork,” in
CVPR , pp. 2881–2890, 2017.[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Aneural image caption generator,” in
CVPR , pp. 3156–3164, 2015.[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet clas-sification with deep convolutional neural networks,” in
NIPS ,pp. 1097–1105, 2012.[9] K. Simonyan and A. Zisserman, “Very deep convolutional net-works for large-scale image recognition,”
ICLR , 2015.[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper withconvolutions,” CVPR, 2015.[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
CVPR , pp. 770–778, 2016.[12] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten,“Densely connected convolutional networks,” in
CVPR , vol. 1,p. 3, 2017.[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,“Imagenet: A large-scale hierarchical image database,” in
CVPR ,pp. 248–255, IEEE, 2009.[14] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, and T. Duerig, “Theopen images dataset v4: Unified image classification, object de-tection, and visual relationship detection at scale,” arXiv preprintarXiv:1811.00982 , 2018. [15] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham,A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi,“Photo-realistic single image super-resolution using a generativeadversarial network,” in CVPR .[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri,“Learning spatiotemporal features with 3D convolutional net-works,” in
ICCV , 2015.[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vi-jayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. ,“The kinetics human action video dataset,” arXiv preprintarXiv:1705.06950 , 2017.[18] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,”in
ECCV , pp. 649–666, Springer, 2016.[19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros,“Context encoders: Feature learning by inpainting,” in
CVPR ,pp. 2536–2544, 2016.[20] M. Noroozi and P. Favaro, “Unsupervised learning of visualrepresentions by solving jigsaw puzzles,” in
ECCV , 2016.[21] D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri,Y. Li, A. Bharambe, and L. van der Maaten, “Exploring the limitsof weakly supervised pretraining,” in
ECCV , pp. 185–201, 2018.[22] W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool, “Webvisiondatabase: Visual learning and understanding from web data,” arXiv preprint arXiv:1708.02862 , 2017.[23] A. Mahendran, J. Thewlis, and A. Vedaldi, “Cross pixel op-tical flow similarity for self-supervised learning,” arXiv preprintarXiv:1807.05636 , 2018.[24] N. Sayed, B. Brattoli, and B. Ommer, “Cross and learn: Cross-modal self-supervision,” arXiv preprint arXiv:1811.03879 , 2018.[25] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning ofaudio and video models from self-supervised synchronization,”in
NIPS , pp. 7773–7784, 2018.[26] A. Owens and A. A. Efros, “Audio-visual scene analy-sis with self-supervised multisensory features,” arXiv preprintarXiv:1804.03641 , 2018.[27] D. Kim, D. Cho, and I. S. Kweon, “Self-supervised video repre-sentation learning with space-time cubic puzzles,” arXiv preprintarXiv:1811.09795 , 2018.[28] L. Jing and Y. Tian, “Self-supervised spatiotemporal featurelearning by video geometric transformations,” arXiv preprintarXiv:1811.11387 , 2018.[29] B. Fernando, H. Bilen, E. Gavves, and S. Gould, “Self-supervisedvideo representation learning with odd-one-out networks,” in
CVPR , 2017.[30] Z. Ren and Y. J. Lee, “Cross-domain self-supervised multi-taskfeature learning using synthetic imagery,” in
CVPR , 2018.[31] X. Wang, K. He, and A. Gupta, “Transitive invariance for self-supervised visual representation learning,” in
ICCV , 2017.[32] C. Doersch and A. Zisserman, “Multi-task self-supervised visuallearning,” in
ICCV , 2017.[33] T. N. Mundhenk, D. Ho, and B. Y. Chen, “Improvements tocontext based self-supervised learning,” in
CVPR , 2018.[34] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash, “Boostingself-supervised learning via knowledge transfer,” arXiv preprintarXiv:1805.00385 , 2018.[35] U. B ¨uchler, B. Brattoli, and B. Ommer, “Improving spatiotempo-ral self-supervision by deep reinforcement learning,” in
ECCV ,pp. 770–786, 2018.[36] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised represen-tation learning by predicting image rotations,” in
ICLR , 2018.[37] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsuper-vised Learning of Video Representations using LSTMs,” in
ICML ,2015.[38] X. Wang and A. Gupta, “Unsupervised learning of visual repre-sentations using videos,” in
ICCV , 2015.[39] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, “Unsupervisedrepresentation learning by sorting sequences,” in
ICCV , pp. 667–676, IEEE, 2017.[40] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: unsu-pervised learning using temporal order verification,” in
ECCV ,pp. 527–544, Springer, 2016.[41] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual rep-resentation learning by context prediction,” in
ICCV , pp. 1422–1430, 2015.[42] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders:Unsupervised learning by cross-channel prediction,” in
CVPR ,2017. [43] D. Li, W.-C. Hung, J.-B. Huang, S. Wang, N. Ahuja, and M.-H.Yang, “Unsupervised visual representation learning by graph-based consistent constraints,” in
ECCV , 2016.[44] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clus-tering for unsupervised learning of visual features,” in
ECCV ,2018.[45] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learn-ing through spatial contrasting,” arXiv preprint arXiv:1610.00243 ,2016.[46] P. Bojanowski and A. Joulin, “Unsupervised learning by predict-ing noise,” arXiv preprint arXiv:1704.05310 , 2017.[47] Y. Li, M. Paluri, J. M. Rehg, and P. Doll´ar, “Unsupervised learningof edges,”
CVPR , pp. 1619–1627, 2016.[48] S. Purushwalkam and A. Gupta, “Pose from action: Unsuper-vised learning of pose features based on motion,” arXiv preprintarXiv:1609.05420 , 2016.[49] J. S´anchez, F. Perronnin, T. Mensink, and J. Verbeek, “Imageclassification with the fisher vector: Theory and practice,”
IJCV ,vol. 105, no. 3, pp. 222–245, 2013.[50] A. Faktor and M. Irani, “Video segmentation by non-local con-sensus voting.,” in
BMVC , vol. 2, 2014.[51] O. Stretcu and M. Leordeanu, “Multiple frames matching forobject discovery in video.,” in
BMVC , vol. 1, 2015.[52] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Under-standing deep learning requires rethinking generalization,” arXivpreprint arXiv:1611.03530 , 2016.[53] K. Simonyan and A. Zisserman, “Two-Stream ConvolutionalNetworks for Action Recognition in Videos,” in
NIPS , 2014.[54] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach,S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrentconvolutional networks for visual recognition and description,”in
CVPR , 2015.[55] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal resid-ual networks for video action recognition,” in
NIPS , pp. 3468–3476, 2016.[56] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporalmultiplier networks for video action recognition,” in
CVPR .[57] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in
CVPR ,pp. 1933–1941, 2016.[58] L. Wang, Y. Qiao, and X. Tang, “Action recognition withtrajectory-pooled deep-convolutional descriptors,” in
CVPR ,pp. 4305–4314, 2015.[59] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, “Towards goodpractices for very deep two-stream convnets,” arXiv preprintarXiv:1507.02159 , 2015.[60] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould,“Dynamic image networks for action recognition,” in
CVPR, 2016.
[61] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in ECCV, 2016.
[62] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” TPAMI, vol. 35, no. 1, pp. 221–231, 2013.
[63] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CRCV-TR, vol. 12-01, 2012.
[64] X. Peng, L. Wang, X. Wang, and Y. Qiao, “Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice,” CVIU, vol. 150, pp. 109–125, 2016.
[65] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Bags of spacetime energies for dynamic scene recognition,” in CVPR, pp. 2681–2688, 2014.
[66] X. Ren and M. Philipose, “Egocentric recognition of handled objects: Benchmark and analysis,” in CVPRW, pp. 1–8, IEEE, 2009.
[67] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” TPAMI, 2017.
[68] L. Jing, X. Yang, and Y. Tian, “Video you only look once: Overall temporal convolutions for action recognition,” JVCIR, vol. 52, pp. 58–65, 2018.
[69] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in CVPR, pp. 4724–4733, IEEE, 2017.
[70] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?,” in CVPR, pp. 18–22, 2018.
[71] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in ICCV, 2017.
[72] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[73] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[74] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
[75] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “VideoLSTM convolves, attends and flows for action recognition,” CVIU, vol. 166, pp. 41–50, 2018.
[76] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence – video to text,” in ICCV, 2015.
[77] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, “Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network,” in CVPR, pp. 4207–4215, 2016.
[78] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in CVPR, 2017.
[79] D. Bau, J.-Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, “GAN dissection: Visualizing and understanding generative adversarial networks,” arXiv preprint arXiv:1811.10597, 2018.
[80] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, pp. 818–833, Springer, 2014.
[81] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning features by watching objects move,” in CVPR, vol. 2, 2017.
[82] G. Larsson, M. Maire, and G. Shakhnarovich, “Colorization as a proxy task for visual understanding,” in CVPR, 2017.
[83] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, pp. 2672–2680, 2014.
[84] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
[85] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in NIPS, pp. 613–621, 2016.
[86] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “MoCoGAN: Decomposing motion and content for video generation,” CVPR, 2018.
[87] U. Ahsan, R. Madhok, and I. Essa, “Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition,” arXiv preprint arXiv:1808.07507, 2018.
[88] C. Wei, L. Xie, X. Ren, Y. Xia, C. Su, J. Liu, Q. Tian, and A. L. Yuille, “Iterative reorganization with weak spatial constraints: Solving arbitrary jigsaw puzzles for unsupervised representation learning,” arXiv preprint arXiv:1812.00329, 2018.
[89] D. Kim, D. Cho, D. Yoo, and I. S. Kweon, “Learning image representations by completing damaged jigsaw puzzles,” arXiv preprint arXiv:1802.01880, 2018.
[90] D. Wei, J. Lim, A. Zisserman, and W. T. Freeman, “Learning and using the arrow of time,” in CVPR, pp. 8052–8060, 2018.
[91] I. Croitoru, S.-V. Bogolin, and M. Leordeanu, “Unsupervised learning from video to detect foreground objects in single images,” arXiv preprint arXiv:1703.10901, 2017.
[92] H. Jiang, G. Larsson, M. Maire, G. Shakhnarovich, and E. Learned-Miller, “Self-supervised relative depth learning for urban scene understanding,” in ECCV, pp. 19–35, 2018.
[93] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in ICCV, pp. 609–617, IEEE, 2017.
[94] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by moving,” in ICCV, pp. 37–45, 2015.
[95] D. Jayaraman and K. Grauman, “Learning image representations tied to ego-motion,” in ICCV, pp. 1413–1421, 2015.
[96] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
[97] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in CVPR, pp. 3213–3223, 2016.
[98] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” arXiv preprint arXiv:1608.05442, 2016.
[99] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, pp. 740–755, Springer, 2014.
[100] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, pp. 779–788, 2016.
[101] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in CVPR, 2017.
[102] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in ECCV, pp. 21–37, Springer, 2016.
[103] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in CVPR, vol. 1, p. 4, 2017.
[104] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” TPAMI, 2018.
[105] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[106] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre, “HMDB51: A large video database for human motion recognition,” in HPCSE, pp. 571–582, Springer, 2013.
[107] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, pp. 487–495, 2014.
[108] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An image database for deep scene understanding,” arXiv preprint arXiv:1610.02055, 2016.
[109] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in CVPR, pp. 190–198, IEEE, 2017.
[110] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[111] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPSW, vol. 2011, p. 5, 2011.
[112] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.
[113] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.
[114] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “The new data and new challenges in multimedia research,” arXiv preprint arXiv:1503.01817, vol. 1, no. 8, 2015.
[115] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, “SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation,” in ICCV, vol. 4, 2017.
[116] M. Monfort, B. Zhou, S. A. Bargal, T. Yan, A. Andonian, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al., “Moments in time dataset: One million videos for event understanding.”
[117] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in ICASSP, pp. 776–780, IEEE, 2017.
[118] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in CVPR, pp. 3354–3361, IEEE, 2012.
[119] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, “Learning visual features from large weakly supervised data,” in ECCV, pp. 67–84, 2016.
[120] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[121] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
[122] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
[123] T. Chen, X. Zhai, and N. Houlsby, “Self-supervised GAN to counter forgetting,” arXiv preprint arXiv:1810.11598, 2018.
[124] G. Larsson, M. Maire, and G. Shakhnarovich, “Learning representations for automatic colorization,” in ECCV, pp. 577–593, Springer, 2016.
[125] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” SIGGRAPH, 2017.
[126] S. Jenni and P. Favaro, “Self-supervised feature learning by learning to spot artifacts,” arXiv preprint arXiv:1806.05024, 2018.
[127] R. Santa Cruz, B. Fernando, A. Cherian, and S. Gould, “Visual permutation learning,” TPAMI, 2018.
[128] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in CVPR, pp. 5147–5156, 2016.
[129] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML, pp. 478–487, 2016.
[130] M. Noroozi, H. Pirsiavash, and P. Favaro, “Representation learning by learning to count,” in ICCV, 2017.
[131] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[132] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in NIPS, pp. 2234–2242, 2016.
[133] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in NIPS, pp. 6626–6637, 2017.
[134] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
[135] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” arXiv preprint arXiv:1812.04948, 2018.
[136] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2016.
[137] R. Zhang, J.-Y. Zhu, P. Isola, X. Geng, A. S. Lin, T. Yu, and A. A. Efros, “Real-time user-guided image colorization with learned deep priors,” arXiv preprint arXiv:1705.02999, 2017.
[138] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” TOG, vol. 35, no. 4, p. 110, 2016.
[139] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
[140] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, vol. 1, pp. 886–893, IEEE, 2005.
[141] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the Fisher vector: Theory and practice,” IJCV, vol. 105, no. 3, pp. 222–245, 2013.
[142] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017.
[143] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16, 2017.
[144] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in ICCV, vol. 2, p. 5, 2017.
[145] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in ECCV, 2018.
[146] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in NIPS, pp. 802–810, 2015.
[147] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” in ICLR, 2017.
[148] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei, “Unsupervised learning of long-term motion dynamics for videos,” in CVPR, 2017.
[149] B. Brattoli, U. Büchler, A.-S. Wahl, M. E. Schwab, and B. Ommer, “LSTM self-supervision for detailed behavior analysis,” in CVPR, pp. 3747–3756, IEEE, 2017.
[150] D. Jayaraman and K. Grauman, “Slow and steady feature analysis: Higher order temporal coherence in video,” in CVPR, pp. 3852–3861, 2016.
[151] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning optical flow with convolutional networks,” in ICCV, pp. 2758–2766, 2015.
[152] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” in CVPR, pp. 1647–1655, IEEE, 2017.
[153] S. Meister, J. Hur, and S. Roth, “UnFlow: Unsupervised learning of optical flow with a bidirectional census loss,” in AAAI, 2018.
[154] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, “Ambient sound provides supervision for visual learning,” in ECCV, pp. 801–816, Springer, 2016.
[155] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, vol. 2, p. 7, 2017.
[156] Z. Yin and J. Shi, “GeoNet: Unsupervised learning of dense depth, optical flow and camera pose,” in CVPR, vol. 2, 2018.
[157] Y. Zou, Z. Luo, and J.-B. Huang, “DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency,” in ECCV, pp. 38–55, Springer, 2018.
[158] G. Iyer, J. K. Murthy, G. Gupta, K. M. Krishna, and L. Paull, “Geometric consistency for self-supervised end-to-end visual odometry,” arXiv preprint arXiv:1804.03789, 2018.
[159] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle, V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello, “ActiveStereoNet: End-to-end self-supervised learning for active stereo systems,” in ECCV, pp. 784–801, 2018.
[160] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Deep end2end voxel2voxel prediction,” in CVPRW, pp. 17–24, 2016.
[161] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in ICLR, 2016.
[162] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro, “SDC-Net: Video prediction using spatially-displaced convolution,” in ECCV, pp. 718–733, 2018.
[163] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” arXiv preprint arXiv:1710.11252, 2017.
[164] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN for future-flow embedded video prediction,” in ICCV, vol. 1, 2017.
[165] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS, pp. 64–72, 2016.
[166] P. Krähenbühl, “Free supervision from video games,” in CVPR, June 2018.
[167] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “YouTube-8M: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[168] L. Gomez, Y. Patel, M. Rusiñol, D. Karatzas, and C. Jawahar, “Self-supervised learning of visual features through embedding images into text topic spaces,” in