Disentangle Perceptual Learning through Online Contrastive Learning
Kangfu Mei, Yao Lu, Qiaosi Yi, Haoyu Wu, Juncheng Li, Rui Huang
The Chinese University of Hong Kong, Shenzhen · East China Normal University
Abstract
Pursuing realistic results according to human visual perception is the central concern in image transformation tasks. Perceptual learning approaches like [13] are empirically powerful for such tasks, but they usually rely on a pre-trained classification network to provide features, which are not necessarily optimal in terms of the visual perception of image transformation. In this paper, we argue that, among the feature representations from the pre-trained classification network, only limited dimensions are related to human visual perception, while others are irrelevant, although both will affect the final image transformation results. Under such an assumption, we try to disentangle the perception-relevant dimensions from the representation through our proposed online contrastive learning. The resulting network includes the pre-training part and a feature selection layer, followed by the contrastive learning module, which utilizes the transformed results, target images, and task-oriented distorted images as the positive, negative, and anchor samples, respectively. The contrastive learning aims at activating the perception-relevant dimensions and suppressing the irrelevant ones by using the triplet loss, so that the original representation can be disentangled for better perceptual quality. Experiments on various image transformation tasks demonstrate the superiority of our framework, in terms of human visual perception, over existing approaches that use pre-trained networks and empirically designed losses.
Introduction

Image transformation aims at transforming images from one condition/scenario into another, e.g., low-resolution into high-resolution, low-lighting into normal-lighting, etc. Recent deep learning-based methods [7][16][26] have achieved significant improvements in transforming the contents of images, but the visual quality of the transformed images is often not perfect, especially in terms of human perception.

More recent works [15][21][14] introduce perceptual learning to address this issue. They utilize a pre-trained classification network Ψ to extract high-dimensional features as the representations of both the generated images X̃ and the target images Y, and then measure the distance between these two representations as the loss function, formulated as:

L_perceptual(X̃, Y) = ||Ψ(X̃) − Ψ(Y)||.  (1)

Compared with distance metrics like MAE or MSE used in earlier works, the perceptual distance is measured in the feature space instead of the pixel space, which is considered to be more compact and more relevant to human perception. Furthermore, Mechrez et al. [18] propose the contextual loss, which measures the distance in a feature contextual space, defined as:

L_CX(X̃, Y) = −log( (1/N) Σ_j max_i CX_ij ),  (2)

where CX_ij is the similarity between features Ψ(X̃) = {x̃_i} and Ψ(Y) = {y_j}, usually calculated using the normalized cosine distance. Compared to L_perceptual(X̃, Y), L_CX(X̃, Y) is calculated in the feature contextual space, which is supposed to be more robust when the training images X̃ and Y are not aligned. The impact of these methods is analyzed in more detail in Mechrez et al. [17] and Yang et al. [23].

In general, the extracted features from pre-trained networks can be regarded as probability distributions over the input images X̃ and Y, denoted as P_X̃ and P_Y. Then we can demonstrate that minimizing the distance between Ψ(X̃) and Ψ(Y) using L_perceptual or L_CX is similar to minimizing the Kullback-Leibler (KL) divergence between the two distributions P_X̃ and P_Y, as shown in Eq. (3):

D_KL(P_X̃ || P_Y) = ∫ P_X̃ log (P_X̃ / P_Y).  (3)

Since Ψ(X̃) and Ψ(Y) are features extracted by the pre-trained network Ψ (usually a classification network trained on large-scale classification datasets), we can consider P_X̃ and P_Y as mappings from the pixel space into the semantic manifold of natural images learned by the pre-trained network Ψ. Therefore, the images generated using these perceptual learning approaches can be more realistic.

It should be noted, however, that these two loss functions often account for relatively small roles in the final loss function of the entire network training, even though they might be important for human perception. In practice, they are often combined with traditional pixel-wise losses or adversarial losses, and only work as auxiliaries to avoid artifacts. We argue that these perceptual losses do not work well alone because of the irrelevant features contained in the pre-trained representations and the poor generalization ability of the feature projections when used on new datasets.
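For concreteness, the following is a minimal PyTorch sketch of how a perceptual loss such as Eq. (1) and a contextual-style loss in the spirit of Eq. (2) can be computed from pre-trained VGG features. The layer slice (`features[:16]`), the bandwidth `h`, and the exact normalization details are illustrative assumptions, not the precise settings of [13] or [18].

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen pre-trained classifier used as the feature extractor Psi.
vgg = models.vgg19(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(x_gen, y_target):
    # Eq. (1): distance between feature maps of generated and target images.
    return F.l1_loss(vgg(x_gen), vgg(y_target))

def contextual_loss(x_gen, y_target, h=0.5):
    # Eq. (2), sketched: pairwise cosine similarities between feature vectors,
    # turned into normalized affinities CX_ij, then max over i and mean over j.
    fx = vgg(x_gen).flatten(2).transpose(1, 2)     # (B, N, C) source features
    fy = vgg(y_target).flatten(2).transpose(1, 2)  # (B, N, C) target features
    fx = F.normalize(fx - fy.mean(1, keepdim=True), dim=-1)
    fy = F.normalize(fy - fy.mean(1, keepdim=True), dim=-1)
    dist = 1.0 - torch.bmm(fx, fy.transpose(1, 2))             # (B, N, N)
    dist = dist / (dist.min(dim=2, keepdim=True).values + 1e-5)
    cx = F.softmax((1.0 - dist) / h, dim=2)                     # CX_ij
    return -torch.log(cx.max(dim=1).values.mean(dim=1) + 1e-5).mean()
```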
Figure 1: The left part: the image quality tradeoff between the contextual loss and L1 loss. The right part: changes in feature distance between different seasons before and after learning.

For the first problem, we conduct several experiments combining the contextual loss with the L1 loss under different weights, and observe that directly using the contextual loss leads to unexpected artifacts in the generated images, e.g., color offset, ripple artifacts, blurring, etc. These results are shown in the left part of Figure 1. Increasing the weight of the L1 loss reduces the artifacts but causes blur in the generated images. Furthermore, in the middle part of Figure 1, we visualize the dimension-reduced representations of three images from two scenes and two seasons, mapped using the pre-trained network. We notice that the two images from the same scene but different seasons have a smaller distance than the two images from different scenes but the same season, which means the pre-trained features are not suitable for transformation tasks like season transfer. To achieve season transfer, we need to push the images of different seasons apart and pull the same-season images closer, as shown in the right part of Figure 1.

In other words, we believe that features pre-trained for classification tasks contain information that is irrelevant to other image transformation tasks. The perceptual losses designed upon these features, though they capture more perceptual information than simple pixel-wise losses, also introduce much irrelevant information that misleads the training of the image transformation networks. Such features, therefore, should not be directly applied in the image transformation network training process. To overcome this drawback, in this paper we propose to fine-tune the pre-trained network using task-oriented instance triplet samples, so that the traditional pre-trained representations can be disentangled into a set of task-relevant dimensions and a set of task-irrelevant dimensions. To achieve this goal, we introduce a novel online contrastive learning scheme to activate the task-relevant dimensions and suppress the task-irrelevant dimensions.

Related Work

Pursuing realistic image transformation has been addressed by several recent works. Both adversarial-based methods, e.g., SRGAN [15], and perceptual-based methods, e.g., Perceptual Loss [13], are proposed to improve the realism of generated images by minimizing the distance between high-dimensional features, which represent either semantic or classic information from pre-trained or parallel-trained deep networks. Compared with pixel-based loss functions, e.g., the MAE and MSE losses, which usually lead to over-smooth results, these works tend to generate finer texture details. However, they suffer from ambiguous convergence during adversarial training or unpleasant artifacts, mainly caused by the deep features representing both realism-relevant dimensions and irrelevant dimensions.
To better reduce the artifacts, some works, e.g., ESRGAN [21], resort to selecting stronger features from the pre-trained network to represent the images, and enhance the generator and discriminator architectures to reduce the convergence difficulties. Apart from representation enhancement, the idea of optimizing the distance measure has also been discussed in recent works; e.g., the Contextual Loss [18] computes similarities on contextual features, which can be seen as maximizing the similarity with the images' real distributions via non-parametric estimation, instead of minimizing the Euclidean distance as discussed previously. Even though these methods are robust to unaligned data thanks to the optimized distance measure and tend to generate pleasant results, there are still many failure cases. Moreover, these methods lack convincing justification for their feature selection or analysis of what makes the results realistic, which further increases the difficulty of improving them, so they can only be applied as black boxes.
Extracting an efficient representation from images is crucial for measuring the distance between transformed images and target images. Although Perceptual Loss [13] and Contextual Loss [18] claim that the features extracted from the pre-trained VGG-19 or AlexNet can best represent human perception of images, many works have been proposed to refine representations from pre-training. Pre-training networks on unsupervised common tasks and then fine-tuning the representations on specific sub-tasks has shown its superiority in both vision and language tasks, e.g., MoCo [8], SimCLR [4], GPT [19], and BERT [6]. These methods use contrastive learning to pre-train on common tasks so that they can capture the relations between similar and dissimilar objects without labels. For example, the triplet loss [9][20] maximizes the distance from anchor samples to negative samples and minimizes the distance from anchor samples to positive samples. Under such a process, the optimized networks can learn useful representations from data. However, these vision-related pre-training methods are usually applied to recognition-related tasks [4] rather than generation-related tasks. Regarding image generation related tasks, most works use the representation of images as the latent code between the encoder and decoder, and a discriminator is then used to regularize the generated results with a specific latent code. Chen et al. [5] introduce InfoGAN, which uses unsupervised learning to learn disentangled representations; it decomposes the input into incompressible noise and a latent code so that the representation is related to the latent code only. Even though these methods can generate diversified results using the encoded representation, the quality of the generated results cannot be ensured. Jolicoeur-Martineau [14] and Wang et al. [21] apply the probabilities of generated images to stabilize the training process and enhance the quality of images. The probability similarities are measured as feature differences of the discriminator, which can be seen as a special case of measuring representation differences.
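As a concrete reference for the triplet objective mentioned above, a minimal PyTorch sketch follows; the embedding network, tensor shapes, and margin value are placeholders, with `torch.nn.TripletMarginLoss` standing in for the generic formulation of [9][20].

```python
import torch
import torch.nn as nn

# Generic triplet objective: pull anchor toward positive, push it from negative.
triplet = nn.TripletMarginLoss(margin=1.0, p=2)

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # placeholder encoder
anchor, positive, negative = (torch.randn(8, 3, 64, 64) for _ in range(3))

loss = triplet(embed(anchor), embed(positive), embed(negative))
loss.backward()  # gradients flow into the embedding network only
```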
Disentangled Perceptual Learning

In this paper, we adopt the perceptual loss as the main loss function for training networks, i.e., we use a deep feature-based loss instead of a pixel loss, an adversarial loss, or any other handcrafted loss function. Deep feature-based losses tend to generate images with more realistic details and provide a more stable training process empirically, but they are usually incorporated only as auxiliary loss functions in previous works due to their inexplicable and uncontrollable behavior. To overcome these issues, in this section we describe the details of our proposed Disentangled Perceptual Learning (DPL) with online contrastive learning as a new general framework of perceptual learning. We separate DPL into three different components: Online Contrastive Learning, Feature Selection as Fine-tune, and Task-Oriented Disentanglement. We summarize the overall method in Algorithm 1.

Algorithm 1: Our proposed Disentangled Perceptual Learning.
input: source images X, target images Y, generator network F(·), pre-trained network Ψ(·), feature selection layer Φ(·), accumulate interval N, random crop function f_c, task-oriented distortion function f_d
for sampled mini-batch {x_k}_{k=1}^{N}, {y_k}_{k=1}^{N} do
    for all k ∈ {1, ..., N} do
        freeze parameters of F
        x̃_k = F(x_k)
        <y', y'', x̃'>_k = f_c(y_k), f_c(f_d(y_k)), f_c(x̃_k)
        <h_n, h_a, h_p>_k = Ψ(<y', y'', x̃'>_k)
        <e_n, e_a, e_p>_k = Φ(<h_n, h_a, h_p>_k)
        d_c ← d_c + max(||e_a − e_p|| − ||e_a − e_n|| + margin, 0)
        <h_n, h_p>_k = Ψ(<y, x̃>_k)
        <e_n, e_p>_k = Φ(<h_n, h_p>_k)
        unfreeze parameters of F
        F ← F + Adam(F, ||e_n − e_p||)
    end for
    Φ ← Φ + Adam(Φ, d_c)
end for
return generator network F(·), and throw away Ψ(·) and Φ(·)

Online Contrastive Learning

The superiority of perceptual learning mainly comes from the applied pre-trained classification network. A classification network Ψ pre-trained on a large-scale image classification dataset can map an input image into a high-level feature space, where images with similar contents are projected onto similar embeddings. Past works [13][21] state that the distance calculated in this high-level embedding space is more similar to human perception, since the pre-trained Ψ can omit information that is unhelpful for human recognition. The perceptual loss can thus be viewed as minimizing a pixel loss on more compact images without trivial information. Here we formulate the perceptual loss in terms of the widely used MSE loss in the feature manifold as

min_{θ_F} ||Ψ(F(X; θ_F)) − Ψ(Y)||,  (4)

where F is the generator with parameters θ_F that learns to transform the input images X into the target images Y, and the weights of Ψ are fixed during the training phase. However, the images generated using the original pre-trained network usually have various artifacts. Here we assume the representation extracted from Ψ is not powerful enough to represent the images, due to the distribution divergence between the pre-trained classification dataset and the transformation dataset. One straightforward modification is to fine-tune the pre-trained Ψ for the image transformation task, but this tends to be impractical due to the lack of a labeled dataset.

Here we introduce online contrastive learning to perform simultaneous learning on both the pre-trained Ψ and the generator F. It aims to learn distinctiveness, where similar objects are close in feature space and different objects are far apart. Learning in this self-supervised manner does not require categorical labels during training. Inspired by the unsupervised learning methods [4] in recognition tasks, the triplet is constructed using the random crop function f_c on instance images, which produces a different cropped result at each call. The final loss function can be formulated as:

min_{θ_Ψ} max(||Ψ(f_c(Y)) − Ψ(f_c(Y))|| − ||Ψ(f_c(Y)) − Ψ(f_c(X̃))|| + margin, 0),  (5)

where θ_Ψ denotes the parameters of the pre-trained network and margin is the margin enforced between the positive and negative pairs. During training, the optimization of F and the optimization of Ψ are conducted simultaneously, using two different optimizers. However, it is difficult to find a meaningful triplet among randomly sampled images, so we apply gradient accumulation when updating the parameters of Ψ. To be specific, we update the parameters of Ψ only after every K forward iterations of F.
A more general understanding of this process is that we learn a pre-trained network to distinguish the generated images from the ground truth. Thus the pre-trained network Ψ can be thought of as the discriminator used in the relativistic GAN [14], except that the weights of the discriminator are transferred from pre-training and the training difficulties at the initial stage are greatly reduced.

Feature Selection as Fine-tune

In the previous section, we introduced online contrastive learning, which updates the parameters of the pre-trained network Ψ during training. It should be noted, however, that such an operation overfits easily, and the representation would become related to classification only. Hence the contrastive learning, as well as the generator optimization, becomes difficult to converge as the adversarial learning degenerates. To overcome this, we introduce a non-linear feature selection layer Φ after the pre-trained network Ψ and freeze the parameters of Ψ during the fine-tuning process. The feature selection layer consists of two convolution layers with one activation layer inserted between them. Such an architecture is also used in other representation networks, e.g., SimCLR [4]. The difference is that in our work, the parameters of the pre-trained network are frozen, and only the parameters of the feature selection layer are trainable. With the introduced feature selection layer, we can combine the features from different channels and learn to activate the features per channel. Thus the features related to the image differences are activated, and the representation is further disentangled with less disturbance from irrelevant features. Besides, a channel reduction is also used to further compress the dimension of the output features.

Task-Oriented Disentanglement

Beyond disentangling irrelevant dimensions from the extracted representation, we further extend the idea to decomposing perception-relevant factors, e.g., color, sharpness, and other perceptual factors that affect images, which we call task-oriented disentanglement. To implement this, we construct instance triplet samples in the online contrastive learning, but the anchor samples are generated from Y using a task-specific distortion. For example, the perceptual factor of color accuracy is hard to measure explicitly, but a human can easily tell which of two distorted images is more accurate in color given a reference image, even if the reference image is blurred or affected by other factors. That is to say, humans disentangle perceptual factors using contrastive samples. Inspired by this intuition, we introduce Task-Oriented Disentanglement to separate each perceptual factor from the network implicitly.

More specifically, we disentangle the perception-relevant factors by constructing anchor samples from the target images Y using a specific random distortion f_d, where the distortion is related to the separated factor only. Online contrastive learning is then performed with the representation network E, which is composed of the online contrastive learning and the feature selection as fine-tune components, and is optimized by minimizing

max(||E(f_d(f_c(Y))) − E(f_c(X̃))|| − ||E(f_d(f_c(Y))) − E(f_c(Y))|| + margin, 0).  (6)

The whole convergence process is also illustrated in Figure 1.
It can be summarized as first maximizing the distance between E(f_d(f_c(Y))) and E(f_c(Y)) to optimize E, so that the representations vary with the difference in the factor, and then minimizing the distance between E(f_d(f_c(Y))) and E(f_c(X̃)) to optimize E, which is similar to enlarging the distance between E(f_c(X̃)) and E(f_d(f_c(Y))) in the specific dimensions related to the perceptual factors.
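A minimal PyTorch sketch of this fine-tuning step is given below, assuming VGG-19 features as Ψ, a two-layer 1×1-convolution head with channel reduction as the feature selection layer Φ, and a margin of 1.0; the layer indices, channel widths, margin, and accumulation interval are illustrative assumptions rather than the exact configuration of the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Psi: frozen pre-trained classification network (assumed: VGG-19 up to relu3_4).
psi = models.vgg19(pretrained=True).features[:18].eval()
for p in psi.parameters():
    p.requires_grad_(False)

# Phi: trainable feature selection layer -- two convs with an activation in
# between, plus a channel reduction (widths are assumptions).
phi = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 64, kernel_size=1),
)
opt_phi = torch.optim.Adam(phi.parameters(), lr=1e-4)

def embed(img):
    return phi(psi(img)).flatten(1)

def contrastive_step(y, y_distorted, x_gen, margin=1.0):
    # Anchor: distorted crop of Y; positive: crop of the generated image;
    # negative: clean crop of Y (Eq. (6)).  The generator output is detached
    # because F is frozen while Phi is being fine-tuned.
    e_a, e_p, e_n = embed(y_distorted), embed(x_gen.detach()), embed(y)
    d_pos = (e_a - e_p).norm(dim=1)
    d_neg = (e_a - e_n).norm(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def update_phi(batches, K=4):
    # Gradient accumulation: sum the triplet loss over K generator iterations,
    # then take a single optimizer step for Phi.
    opt_phi.zero_grad()
    for y, y_distorted, x_gen in batches[:K]:
        contrastive_step(y, y_distorted, x_gen).backward()
    opt_phi.step()
```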
Experimental Results

In this section, we conduct three different experiments to validate the performance of our introduced methods. Furthermore, we explore the relationship between perceptual quality and different settings. These three experiments are performed on season image transfer [10], RAW low-light image illumination [2], and RAW image super-resolution [25], respectively. Before training, we apply data augmentations including random flipping, random rotation, and random cropping to the above datasets. During training, we use the Adam optimizer to update the networks with parameters (β1 = 0., β2 = 0.). A learning rate of 1e-4 is used throughout training, and the mini-batch size is 1. Moreover, all experiments are conducted using PyTorch 1.4 and CUDA 10.0 on Ubuntu 18.04 with 8×2080Ti GPUs (11 GB version). The source code of our implementation is available in the supplementary material.

Figure 2: Transformation results comparisons in Winter → Summer and Summer → Winter.

Season image transfer is one of the most representative tasks in unpaired image transformation. During learning, the transformation network is trained on a set of winter images and a set of summer images without correspondences, and GAN-based networks are usually used to learn the cycle relation, i.e., winter → summer → winter. In order to enhance the quality of the generated images, perceptual learning is applied to maximally preserve the perceptual similarity between the original images X and the transformed images X̃.

Here we utilize MUNIT proposed by Huang et al. [10] as the baseline of our method. More specifically, for online contrastive learning, we construct instance triplet samples between the source images and generated images as <f_c(X), f_c(X), f_c(X̃)>, where f_c is the random crop operation and the triplet is used as <Anchor, Positive, Negative>. With such settings, the online contrastive learning aims to maximize the distance in a feature subspace that is related only to the season, and to minimize the distance in a feature subspace that is unrelated to the season. In other words, the representation network is optimized to focus on the scene content that most affects the human-perceived difference between winter and summer. Since there is a recognizable divergence between the winter images and the summer images, we only update the perceptual network after every 100 iterations of the generator, and we do not use the feature selection and task-oriented augmentation. The generated results are shown in Figure 2. The left part of Figure 2 shows the Winter → Summer results: applying the plain perceptual loss generates brighter results but has no distinct effect on the scene, whereas applying the perceptual loss with our introduced online contrastive learning produces results containing more green plants and less white snow. Similar differences appear in the right part of Figure 2, which shows the Summer → Winter results. Therefore, we can conclude that the proposed online contrastive learning is beneficial for preserving the perceptual similarity between images even when they belong to different seasons.
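As an illustration of how such instance triplets can be built for the season-transfer setting, below is a small sketch assuming torchvision-style random crops; the crop size is an illustrative assumption.

```python
from torchvision import transforms

crop = transforms.RandomCrop(128)  # f_c: returns a different crop at every call

def instance_triplet(x_src, x_gen):
    # <Anchor, Positive, Negative> = <f_c(X), f_c(X), f_c(X~)>:
    # two independent crops of the source image, one crop of the generated image.
    return crop(x_src), crop(x_src), crop(x_gen)
```

In this setting, the perceptual network would then be updated only every 100 generator iterations, accumulating the triplet loss in between, as in the earlier sketch.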
In the area of low-light enhancement, RAW images have received a lot of attention in recent works [3][2]. Compared with the previously used RGB images, RAW images provide more information and suffer less information loss, since they come directly from the CMOS sensor. However, these images differ notably in appearance from the normal images that humans find acceptable, e.g., color space offset, optical noise, optical distortion, and so on. Therefore, enhancing the perceptual quality of the generated normal-light images has crucial practical significance.

Figure 3: Transformation results comparisons in the low-light image enhancement task.

Here we adopt SMD proposed by Chen et al. [2] as the baseline of our method, together with EnlightenGAN [12] for comparison, which is the state-of-the-art method in low-light image enhancement. SMD utilizes features extracted from different layers of a pre-trained VGG19 network and calculates the distance between the illuminated images and target images in the extracted feature spaces; the final training loss is the combination of the VGG19 loss and the L1 loss. However, as Figure 3 shows, the results generated with such a loss function tend to have an uneven illumination appearance. In order to boost the visual quality of the generated results, we construct task-oriented instance triplet samples to fine-tune the pre-trained VGG19 network with online contrastive learning. In Table 1 we add each component of our method incrementally, aiming to provide a detailed analysis. The evaluation metrics include the pixel-based PSNR and the perception-related MS-SSIM [22] and LPIPS [24]. For convenience, we use FS to denote whether feature selection as fine-tune is used. Our method not only achieves state-of-the-art performance on the perceptual metrics, but also generates results with the best visual quality and more realistic details, even compared to the handcrafted texture loss.

Table 1: Quantitative comparisons on the low-light enhancement dataset.

| Methods           | Backbone Loss      | FS | Task | PSNR ↑ | MS-SSIM ↑ | LPIPS ↓ |
| Traditional       | -                  | -  | -    | 17.096 | 0.8039    | 0.4185  |
| EnlightenGAN [12] | Adversarial        | -  | -    | 20.556 | 0.9168    | 0.2525  |
| SMD [2]           | VGG                | -  | -    | 23.541 | 0.9147    | 0.1946  |
| SMD [2]           | VGG + Texture Loss | -  | -    | 22.147 | 0.8791    | 0.2218  |
| Ours              | VGG                | ✓  | -    | 23.710 |           |         |
| Ours              | VGG                | ✓  | Blur |        |           |         |
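To make the task-oriented triplet construction concrete, the following sketch shows how a distortion function f_d could be instantiated for the factors used in our experiments (blur for the low-light task, color jitter for the super-resolution task); the specific distortion parameters are assumptions chosen for illustration.

```python
from torchvision import transforms

# f_d: task-specific distortions applied to the target image Y.
distort_blur = transforms.GaussianBlur(kernel_size=9, sigma=(1.0, 3.0))          # "Blur" factor
distort_color = transforms.ColorJitter(brightness=0.3, saturation=0.3, hue=0.1)  # "Color" factor

def task_oriented_triplet(y_target, x_gen, f_c, f_d):
    # Following Eq. (6): anchor = distorted crop of Y, positive = crop of the
    # generated image, negative = clean crop of Y.
    return f_c(f_d(y_target)), f_c(x_gen), f_c(y_target)
```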
Recent works [25][1] have shown that remarkable differences exist between the real-world super-resolution problem and simulated super-resolution, especially in the way the low-resolution images are degraded. However, even though training on their proposed super-resolution datasets improves the results, especially when perceptual learning is applied, two problems arise that cannot be ignored. The first is the misalignment between the low-resolution images and the corresponding high-resolution images during collection. The second is the color space divergence between RAW images and RGB images. Both seriously affect the final results and make the pixel-based loss function and the contextual loss function work defectively. In the left part of Figure 1 we show the current trade-off solution, which adjusts the weights between the L1 loss and the contextual loss to balance color and sharpness.

Table 2: Quantitative performance comparison between different super-resolution methods and ours (columns: Methods, Backbone Loss, FS, Task, PSNR ↑, MS-SSIM ↑, LPIPS ↓).

To address this long-standing weight adjustment problem, we apply online contrastive learning to the representation network used by the contextual loss. It should be noted that even though the contextual loss calculates the distance in the feature contextual space instead of the Euclidean space, our proposed online contrastive learning remains robust, since the triplet loss depends on the difference between <Anchor, Positive> and <Anchor, Negative> rather than the difference between samples. In Table 2 we show the quantitative performance comparisons between the baseline methods and our method. Here we apply random color jitter to the high-resolution images as the negative samples, which makes the results trained with the contextual loss only appear more colorful. The qualitative visual results are also shown in Figure 4. It is easy to conclude that our method achieves the best trade-off between sharpness and color, even though no elaborate weights are used. Note that since the images used for validation have obvious misalignment, pixel-based metrics like PSNR have limited reference value.

Figure 4: Transformation results comparisons in RAW image super-resolution.

Conclusion

In this paper, we argue that even though perceptual learning approaches using perceptual losses on pre-trained features can capture perceptual information better than approaches using pixel-wise losses, they also bring irrelevant information into the image transformation tasks. We then introduce an online contrastive learning scheme to fine-tune the pre-trained representation so that the learned representation can better represent the relationships between the results and the target images. Specifically, we propose a feature selection layer, while freezing the pre-trained network to preserve the natural image statistics from pre-training and reduce the irrelevant features. Furthermore, we construct task-oriented triplet samples during fine-tuning, which drive the feature selection layer to be more sensitive to the task-related statistics. Finally, the proposed disentangled representation achieves more realistic results in many image transformation tasks. Our future work will focus on disentangling the representation of human perception with finer control during image transformation.
Additional Network Details
Instance Triplet.
Constructing instance triplets plays an essential role in online contrastive learning. It utilizes the self-similarity property to enhance the learning that distinguishes the positive samples from the anchor samples. To help readers better understand it, we illustrate its process in Figure 5, where the random crop function is applied twice.

Figure 5: Visualization of constructing the instance triplet during online contrastive learning.
Task-Oriented Instance Triplet.
Different from online contrastive learning, which depends on self-similarity, task-oriented disentanglement focuses more on specific perceptual factors instead of the differences between the target images and generated images. To achieve finer disentanglement during contrastive learning, we apply different distortion algorithms to the target images to obtain the anchor samples. To help readers better understand it, we illustrate the distorted results in Figure 6 with two different distortion algorithms.

Figure 6: Visualization of constructing the task-oriented instance triplet during online contrastive learning.
Details of the Compared Handcrafted Losses
In order to compare our method with handcrafted loss functions that can disentangle specific perceptual factors, we have conducted several comparisons in the experiments. In this section we describe the details of the handcrafted loss functions used: the color loss and the texture loss proposed by Ignatov et al. [11].
Color Loss.
We use the color loss function to measure the difference between images in brightness, contrast, and color instead of the pixels themselves. Denote the transformed image and the target image as X̃ and Y; they are first processed by a Gaussian blur G(·) to remove high-level details, e.g., edges and textures. After this operation, the remaining parts G(X̃) and G(Y) contain only the information related to brightness, contrast, and color. Hence the distance in color and the other related perceptual factors can be calculated via the Euclidean distance as:

L_color = ||G(X̃) − G(Y)||.
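A minimal sketch of this color loss is shown below; the Gaussian kernel size and sigma are illustrative assumptions.

```python
import torch.nn.functional as F
from torchvision import transforms

blur = transforms.GaussianBlur(kernel_size=21, sigma=3.0)  # G(.): removes edges and texture

def color_loss(x_gen, y_target):
    # Compare only brightness/contrast/color by blurring away high-frequency detail.
    return F.mse_loss(blur(x_gen), blur(y_target))
```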
Texture Loss.

Similar to the motivation of the color loss, the texture loss is calculated on the grayscale versions of the two images, which aims to eliminate the effect of color. The formula for the texture loss is:

L_texture = ||X̃_gray − Y_gray||.
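Correspondingly, a sketch of the texture loss on grayscale images, assuming standard luminance weights for the RGB-to-grayscale conversion:

```python
import torch
import torch.nn.functional as F

def to_gray(img):
    # Standard luminance weights; img is (B, 3, H, W) in RGB order.
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def texture_loss(x_gen, y_target):
    # Compare structure/texture only, with color removed.
    return F.l1_loss(to_gray(x_gen), to_gray(y_target))
```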
Additional Result Images

On the basis of the season image transfer task, we also conduct experiments on the edges2shoes dataset. On edges2shoes, we still utilize MUNIT proposed by Huang et al. [10] as the baseline, and the method of constructing instance triplet samples is the same as in the season image transfer task. The results are shown in Figure 7. The left is the input and the right is the ground truth output; each following column shows three random outputs from a method. As can be seen from Figure 7, both the MUNIT and MUNIT + Perceptual methods produce some artifacts, while the results of our introduced online contrastive learning are more in line with human perception and contain more details.

Figure 7: Transformation results comparisons on the edges2shoes dataset.

Figure 8: Transformation results comparisons in RAW image super-resolution.

Broader Impact
This paper introduces an online contrastive learning scheme to disentangle the classification-oriented pre-trained image representations for better perceptual learning in image transformation tasks. The framework enables the transformed images to be more realistic with fewer artifacts. Artists, photographers, creative workers, as well as every end-user can benefit from it. It is possible that this technique could be used to make more realistic "fake" images. However, we believe that ultimately it will help people understand the mechanism of image transformation more deeply. Besides, no consequences of system failure exist, and no biases in the data are leveraged.
References

[1] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3086–3095, 2019.
[2] Chen Chen, Qifeng Chen, Minh N. Do, and Vladlen Koltun. Seeing motion in the dark. In Proceedings of the IEEE International Conference on Computer Vision, pages 3185–3194, 2019.
[3] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3291–3300, 2018.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[7] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 2015.
[8] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[9] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[10] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172–189, 2018.
[11] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3277–3285, 2017.
[12] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. EnlightenGAN: Deep light enhancement without paired supervision. arXiv preprint arXiv:1906.06972, 2019.
[13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[14] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.
[15] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[16] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[17] Roey Mechrez, Itamar Talmi, Firas Shama, and Lihi Zelnik-Manor. Maintaining natural image statistics with the contextual loss. In Asian Conference on Computer Vision, pages 427–443. Springer, 2018.
[18] Roey Mechrez, Itamar Talmi, and Lihi Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.
[19] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf, 2018.
[20] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[21] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[22] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
[23] Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, and Qingmin Liao. Deep learning for single image super-resolution: A brief review. IEEE Transactions on Multimedia, 21(12):3106–3121, 2019.
[24] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[25] Xuaner Zhang, Qifeng Chen, Ren Ng, and Vladlen Koltun. Zoom to learn, learn to zoom. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3762–3770, 2019.
[26] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.