Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
Xun Huang, Serge Belongie
Department of Computer Science & Cornell Tech, Cornell University
{xh258, sjb344}@cornell.edu

Abstract
Gatys et al. recently introduced a neural algorithm that renders a content image in the style of another image, achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical application. Fast approximations with feed-forward neural networks have been proposed to speed up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed set of styles and cannot adapt to arbitrary new styles. In this paper, we present a simple yet effective approach that for the first time enables arbitrary style transfer in real-time. At the heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. Our method achieves speed comparable to the fastest existing approach, without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as content-style trade-off, style interpolation, and color & spatial controls, all using a single feed-forward neural network.
1. Introduction
The seminal work of Gatys et al. [16] showed that deep neural networks (DNNs) encode not only the content but also the style information of an image. Moreover, the image style and content are somewhat separable: it is possible to change the style of an image while preserving its content. The style transfer method of [16] is flexible enough to combine the content and style of arbitrary images. However, it relies on an optimization process that is prohibitively slow.

Significant effort has been devoted to accelerating neural style transfer. [24, 51, 31] attempted to train feed-forward neural networks that perform stylization with a single forward pass. A major limitation of most feed-forward methods is that each network is restricted to a single style. There are some recent works addressing this problem, but they are either still limited to a finite set of styles [11, 32, 55, 5], or much slower than the single-style transfer methods [6].

In this work, we present the first neural style transfer algorithm that resolves this fundamental flexibility-speed dilemma. Our approach can transfer arbitrary new styles in real-time, combining the flexibility of the optimization-based framework [16] with speed comparable to the fastest feed-forward approaches [24, 52]. Our method is inspired by the instance normalization (IN) [52, 11] layer, which is surprisingly effective in feed-forward style transfer. To explain the success of instance normalization, we propose a new interpretation: instance normalization performs style normalization by normalizing feature statistics, which have been found to carry the style information of an image [16, 30, 33]. Motivated by this interpretation, we introduce a simple extension to IN, namely adaptive instance normalization (AdaIN). Given a content input and a style input, AdaIN simply adjusts the mean and variance of the content input to match those of the style input. Through experiments, we find that AdaIN effectively combines the content of the former and the style of the latter by transferring feature statistics. A decoder network is then learned to generate the final stylized image by inverting the AdaIN output back to the image space. Our method is nearly three orders of magnitude faster than [16], without sacrificing the flexibility of transferring inputs to arbitrary new styles. Furthermore, our approach provides abundant user controls at runtime, without any modification to the training process.
2. Related Work
Style transfer.
The problem of style transfer has its origin in non-photo-realistic rendering [28], and is closely related to texture synthesis and transfer [13, 12, 14]. Some early approaches include histogram matching on linear filter responses [19] and non-parametric sampling [12, 15]. These methods typically rely on low-level statistics and often fail to capture semantic structures. Gatys et al. [16] for the first time demonstrated impressive style transfer results by matching feature statistics in convolutional layers of a DNN. Recently, several improvements to [16] have been proposed. Li and Wand [30] introduced a framework based on Markov random fields (MRFs) in the deep feature space to enforce local patterns. Gatys et al. [17] proposed ways to control the color preservation, the spatial location, and the scale of style transfer. Ruder et al. [45] improved the quality of video style transfer by imposing temporal constraints.

The framework of Gatys et al. [16] is based on a slow optimization process that iteratively updates the image to minimize a content loss and a style loss computed by a loss network. It can take minutes to converge even with modern GPUs. On-device processing in mobile applications is therefore too slow to be practical. A common workaround is to replace the optimization process with a feed-forward neural network that is trained to minimize the same objective [24, 51, 31]. These feed-forward style transfer approaches are about three orders of magnitude faster than the optimization-based alternative, opening the door to real-time applications. Wang et al. [53] enhanced the granularity of feed-forward style transfer with a multi-resolution architecture. Ulyanov et al. [52] proposed ways to improve the quality and diversity of the generated samples. However, the above feed-forward methods are limited in the sense that each network is tied to a fixed style. To address this problem, Dumoulin et al. [11] introduced a single network that is able to encode 32 styles and their interpolations. Concurrent to our work, Li et al. [32] proposed a feed-forward architecture that can synthesize multiple textures and transfer styles. Still, the two methods above cannot adapt to arbitrary styles that are not observed during training.

Very recently, Chen and Schmidt [6] introduced a feed-forward method that can transfer arbitrary styles thanks to a style swap layer. Given feature activations of the content and style images, the style swap layer replaces the content features with the closest-matching style features in a patch-by-patch manner. Nevertheless, their style swap layer creates a new computational bottleneck: most of the computation is spent on the style swap. Our approach also permits arbitrary style transfer, while being 1-2 orders of magnitude faster than [6].

Another central problem in style transfer is which style loss function to use. The original framework of Gatys et al. [16] matches styles by matching the second-order statistics between feature activations, captured by the Gram matrix. Other effective loss functions have been proposed, such as the MRF loss [30], adversarial loss [31], histogram loss [54], CORAL loss [41], MMD loss [33], and the distance between channel-wise mean and variance [33]. Note that all the above loss functions aim to match some feature statistics between the style image and the synthesized image.

Deep generative image modeling.
There are several alternative frameworks for image generation, including variational auto-encoders [27], auto-regressive models [40], and generative adversarial networks (GANs) [18]. Remarkably, GANs have achieved the most impressive visual quality. Various improvements to the GAN framework have been proposed, such as conditional generation [43, 23], multi-stage processing [9, 20], and better training objectives [46, 1]. GANs have also been applied to style transfer [31] and cross-domain image generation [50, 3, 23, 38, 37, 25].
3. Background
The seminal work of Ioffe and Szegedy [22] introduced a batch normalization (BN) layer that significantly eases the training of feed-forward networks by normalizing feature statistics. BN layers were originally designed to accelerate the training of discriminative networks, but have also been found effective in generative image modeling [42]. Given an input batch x ∈ ℝ^{N×C×H×W}, BN normalizes the mean and standard deviation for each individual feature channel:

$$\mathrm{BN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta \quad (1)$$

where γ, β ∈ ℝ^C are affine parameters learned from data; μ(x), σ(x) ∈ ℝ^C are the mean and standard deviation, computed across batch size and spatial dimensions independently for each feature channel:

$$\mu_c(x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw} \quad (2)$$

$$\sigma_c(x) = \sqrt{\frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_c(x) \right)^2 + \epsilon} \quad (3)$$

BN uses mini-batch statistics during training and replaces them with population statistics during inference, introducing a discrepancy between training and inference. Batch renormalization [21] was recently proposed to address this issue by gradually using population statistics during training. As another interesting application of BN, Li et al. [34] found that BN can alleviate domain shifts by recomputing population statistics in the target domain. Recently, several alternative normalization schemes have been proposed to extend BN's effectiveness to recurrent architectures [35, 2, 47, 8, 29, 44].

In the original feed-forward stylization method [51], the style transfer network contains a BN layer after each convolutional layer. Surprisingly, Ulyanov et al. [52] found that significant improvement could be achieved simply by replacing BN layers with IN layers:

$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta \quad (4)$$

Different from BN layers, here μ(x) and σ(x) are computed across spatial dimensions independently for each channel and each sample:

$$\mu_{nc}(x) = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw} \quad (5)$$
Figure 1. To understand the reason for IN's effectiveness in style transfer, we train an IN model and a BN model with (a) original images in MS-COCO [36], (b) contrast normalized images, and (c) style normalized images using a pre-trained style transfer network [24]. The improvement brought by IN remains significant even when all training images are normalized to the same contrast, but is much smaller when all images are (approximately) normalized to the same style. Our results suggest that IN performs a kind of style normalization.

$$\sigma_{nc}(x) = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( x_{nchw} - \mu_{nc}(x) \right)^2 + \epsilon} \quad (6)$$

Another difference is that IN layers are applied unchanged at test time, whereas BN layers usually replace mini-batch statistics with population statistics.

Instead of learning a single set of affine parameters γ and β, Dumoulin et al. [11] proposed a conditional instance normalization (CIN) layer that learns a different set of parameters γ^s and β^s for each style s:

$$\mathrm{CIN}(x; s) = \gamma^{s} \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta^{s} \quad (7)$$

During training, a style image together with its index s are randomly chosen from a fixed set of styles s ∈ {1, 2, ..., S} (S = 32 in their experiments). The content image is then processed by a style transfer network in which the corresponding γ^s and β^s are used in the CIN layers. Surprisingly, the network can generate images in completely different styles by using the same convolutional parameters but different affine parameters in IN layers.

Compared with a network without normalization layers, a network with CIN layers requires 2FS additional parameters, where F is the total number of feature maps in the network [11]. Since the number of additional parameters scales linearly with the number of styles, it is challenging to extend their method to model a large number of styles (e.g., tens of thousands). Also, their approach cannot adapt to arbitrary new styles without re-training the network.
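To make the distinction between these normalization layers concrete, the following is a minimal NumPy sketch (not the Torch code used in the papers above) of how the statistics in Eqs. 1-7 are computed for activations of shape (N, C, H, W). All function and variable names are illustrative only.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # BN (Eqs. 1-3): mean/std per channel, pooled over the batch and spatial dims (N, H, W).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

def instance_norm(x, gamma, beta, eps=1e-5):
    # IN (Eqs. 4-6): mean/std per channel *and* per sample, pooled over (H, W) only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

def conditional_instance_norm(x, gammas, betas, s, eps=1e-5):
    # CIN (Eq. 7): the same normalization as IN, with affine parameters
    # looked up in per-style tables by the style index s.
    return instance_norm(x, gammas[s], betas[s], eps)

if __name__ == "__main__":
    x = np.random.randn(4, 8, 16, 16).astype(np.float32)   # (N, C, H, W)
    gamma = np.ones((1, 8, 1, 1), np.float32)
    beta = np.zeros((1, 8, 1, 1), np.float32)
    gammas = np.ones((32, 1, 8, 1, 1), np.float32)          # one set per style, S = 32
    betas = np.zeros((32, 1, 8, 1, 1), np.float32)
    print(batch_norm(x, gamma, beta).shape,
          instance_norm(x, gamma, beta).shape,
          conditional_instance_norm(x, gammas, betas, s=3).shape)
```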
4. Interpreting Instance Normalization
Despite the great success of (conditional) instance normalization, the reason why it works particularly well for style transfer remains elusive. Ulyanov et al. [52] attribute the success of IN to its invariance to the contrast of the content image. However, IN takes place in the feature space, therefore it should have more profound impacts than a simple contrast normalization in the pixel space. Perhaps even more surprising is the fact that the affine parameters in IN can completely change the style of the output image.

It has been known that the convolutional feature statistics of a DNN can capture the style of an image [16, 30, 33]. While Gatys et al. [16] use the second-order statistics as their optimization objective, Li et al. [33] recently showed that matching many other statistics, including the channel-wise mean and variance, is also effective for style transfer. Motivated by these observations, we argue that instance normalization performs a form of style normalization by normalizing feature statistics, namely the mean and variance. Although a DNN serves as an image descriptor in [16, 33], we believe that the feature statistics of a generator network can also control the style of the generated image.

We run the code of improved texture networks [52] to perform single-style transfer, with IN or BN layers. As expected, the model with IN converges faster than the BN model (Fig. 1 (a)). To test the explanation in [52], we then normalize all the training images to the same contrast by performing histogram equalization on the luminance channel. As shown in Fig. 1 (b), IN remains effective, suggesting the explanation in [52] to be incomplete. To verify our hypothesis, we normalize all the training images to the same style (different from the target style) using a pre-trained style transfer network provided by [24]. According to Fig. 1 (c), the improvement brought by IN becomes much smaller when images are already style normalized. The remaining gap can be explained by the fact that the style normalization with [24] is not perfect. Also, models with BN trained on style normalized images can converge as fast as models with IN trained on the original images. Our results indicate that IN does perform a kind of style normalization.

Since BN normalizes the feature statistics of a batch of samples instead of a single sample, it can be intuitively understood as normalizing a batch of samples to be centered around a single style. Each single sample, however, may still have a different style. This is undesirable when we want to transfer all images to the same style, as is the case in the original feed-forward style transfer algorithm [51]. Although the convolutional layers might learn to compensate for the intra-batch style difference, doing so poses additional challenges for training. On the other hand, IN can normalize the style of each individual sample to the target style. Training is facilitated because the rest of the network can focus on content manipulation while discarding the original style information. The reason behind the success of CIN also becomes clear: different affine parameters can normalize the feature statistics to different values, thereby normalizing the output image to different styles.
5. Adaptive Instance Normalization
If IN normalizes the input to a single style specified by the affine parameters, is it possible to adapt it to arbitrarily given styles by using adaptive affine transformations? Here, we propose a simple extension to IN, which we call adaptive instance normalization (AdaIN). AdaIN receives a content input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Unlike BN, IN or CIN, AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input:

$$\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y) \quad (8)$$

in which we simply scale the normalized content input with σ(y), and shift it with μ(y). Similar to IN, these statistics are computed across spatial locations.

Intuitively, let us consider a feature channel that detects brushstrokes of a certain style. A style image with this kind of strokes will produce a high average activation for this feature. The output produced by AdaIN will have the same high average activation for this feature, while preserving the spatial structure of the content image. The brushstroke feature can be inverted to the image space with a feed-forward decoder, similar to [10]. The variance of this feature channel can encode more subtle style information, which is also transferred to the AdaIN output and the final output image.

In short, AdaIN performs style transfer in the feature space by transferring feature statistics, specifically the channel-wise mean and variance. Our AdaIN layer plays a similar role as the style swap layer proposed in [6]. While the style swap operation is very time-consuming and memory-consuming, our AdaIN layer is as simple as an IN layer, adding almost no computational cost.
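For illustration, a minimal NumPy sketch of the AdaIN operation in Eq. 8 is given below. Our released implementation is in Torch (Sec. 6); this sketch only restates the formula, with illustrative names.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Align the channel-wise mean/std of content features to those of style features (Eq. 8).

    Both inputs are feature maps of shape (N, C, H, W); statistics are computed
    per sample and per channel, over the spatial dimensions only.
    """
    c_mean = content_feat.mean(axis=(2, 3), keepdims=True)
    c_std = np.sqrt(content_feat.var(axis=(2, 3), keepdims=True) + eps)
    s_mean = style_feat.mean(axis=(2, 3), keepdims=True)
    s_std = np.sqrt(style_feat.var(axis=(2, 3), keepdims=True) + eps)
    # Normalize the content features, then scale and shift with the style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Note that, unlike IN and CIN, no γ and β are learned here; σ(y) and μ(y) take their place.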
Figure 2. An overview of our style transfer algorithm. We use the first few layers of a fixed VGG-19 network to encode the content and style images. An AdaIN layer is used to perform style transfer in the feature space. A decoder is learned to invert the AdaIN output to the image space. We use the same VGG encoder to compute a content loss L_c (Equ. 12) and a style loss L_s (Equ. 13).
6. Experimental Setup
Fig. 2 shows an overview of our style transfer network based on the proposed AdaIN layer. Code and pre-trained models (in Torch [7]) are available at: https://github.com/xunhuang1995/AdaIN-style

6.1. Architecture

Our style transfer network T takes a content image c and an arbitrary style image s as inputs, and synthesizes an output image that recombines the content of the former and the style of the latter. We adopt a simple encoder-decoder architecture, in which the encoder f is fixed to the first few layers (up to relu4_1) of a pre-trained VGG-19 [48]. After encoding the content and style images in feature space, we feed both feature maps to an AdaIN layer that aligns the mean and variance of the content feature maps to those of the style feature maps, producing the target feature maps t:

$$t = \mathrm{AdaIN}(f(c), f(s)) \quad (9)$$

A randomly initialized decoder g is trained to map t back to the image space, generating the stylized image T(c, s):

$$T(c, s) = g(t) \quad (10)$$

The decoder mostly mirrors the encoder, with all pooling layers replaced by nearest up-sampling to reduce checkerboard effects. We use reflection padding in both f and g to avoid border artifacts. Another important architectural choice is whether the decoder should use instance, batch, or no normalization layers. As discussed in Sec. 4, IN normalizes each sample to a single style while BN normalizes a batch of samples to be centered around a single style. Both are undesirable when we want the decoder to generate images in vastly different styles. Thus, we do not use normalization layers in the decoder. In Sec. 7.1 we will show that IN/BN layers in the decoder indeed hurt performance.

6.2. Training

We train our network using MS-COCO [36] as content images and a dataset of paintings mostly collected from WikiArt [39] as style images, following the setting of [6]. Each dataset contains roughly 80,000 training examples. We use the Adam optimizer [26] and a batch size of 8 content-style image pairs. During training, we first resize the smallest dimension of both images to 512 while preserving the aspect ratio, then randomly crop regions of size 256 × 256. Since our network is fully convolutional, it can be applied to images of any size during testing.

Similar to [51, 11, 52], we use the pre-trained VGG-19 [48] to compute the loss function to train the decoder:

$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_s \quad (11)$$

which is a weighted combination of the content loss L_c and the style loss L_s with the style loss weight λ. The content loss is the Euclidean distance between the target features and the features of the output image. We use the AdaIN output t as the content target, instead of the commonly used feature responses of the content image. We find this leads to slightly faster convergence and also aligns with our goal of inverting the AdaIN output t.

$$\mathcal{L}_c = \left\| f(g(t)) - t \right\|_2 \quad (12)$$

Since our AdaIN layer only transfers the mean and standard deviation of the style features, our style loss only matches these statistics. Although we find the commonly used Gram matrix loss can produce similar results, we match the IN statistics because it is conceptually cleaner. This style loss has also been explored by Li et al. [33].

$$\mathcal{L}_s = \sum_{i=1}^{L} \left\| \mu(\phi_i(g(t))) - \mu(\phi_i(s)) \right\|_2 + \sum_{i=1}^{L} \left\| \sigma(\phi_i(g(t))) - \sigma(\phi_i(s)) \right\|_2 \quad (13)$$

where each φ_i denotes a layer in VGG-19 used to compute the style loss. In our experiments we use the relu1_1, relu2_1, relu3_1, relu4_1 layers with equal weights.
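As a schematic summary of this training objective (Eqs. 9-13), the NumPy sketch below restates the AdaIN helper and computes the combined loss for one content-style pair. The callables encode (the fixed VGG encoder f up to relu4_1), decode (the decoder g), and phi (a list of VGG layer extractors, e.g. relu1_1 to relu4_1) are assumed to be provided; this is a simplified illustration, not the released Torch training code.

```python
import numpy as np

def mean_std(feat, eps=1e-5):
    # Channel-wise mean and std per sample, over the spatial dimensions (H, W).
    mu = feat.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(feat.var(axis=(2, 3), keepdims=True) + eps)
    return mu, sigma

def adain(content_feat, style_feat):
    c_mu, c_sigma = mean_std(content_feat)
    s_mu, s_sigma = mean_std(style_feat)
    return s_sigma * (content_feat - c_mu) / c_sigma + s_mu

def style_transfer_loss(content_img, style_img, encode, decode, phi, style_weight):
    """Compute L = L_c + lambda * L_s (Equ. 11) for one content-style pair."""
    t = adain(encode(content_img), encode(style_img))   # Equ. 9
    output = decode(t)                                   # Equ. 10
    # Content loss: Euclidean distance to the AdaIN target t (Equ. 12).
    content_loss = np.linalg.norm(encode(output) - t)
    # Style loss: match channel-wise mean and std at several VGG layers (Equ. 13).
    style_loss = 0.0
    for extract in phi:                                  # e.g. relu1_1 .. relu4_1
        o_mu, o_sigma = mean_std(extract(output))
        s_mu, s_sigma = mean_std(extract(style_img))
        style_loss += np.linalg.norm(o_mu - s_mu) + np.linalg.norm(o_sigma - s_sigma)
    return content_loss + style_weight * style_loss
```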
7. Results
In this subsection, we compare our approach with three types of style transfer methods: 1) the flexible but slow optimization-based method [16], 2) the fast feed-forward method restricted to a single style [52], and 3) the flexible patch-based method of medium speed [6]. If not mentioned otherwise, the results of compared methods are obtained by running their code with the default configurations. We run the optimization of [16] using Johnson's public implementation: https://github.com/jcjohnson/neural-style
Figure 3. Quantitative comparison of different methods in terms of (a) style loss and (b) content loss. Numbers are averaged over style and content images randomly chosen from our test set.

For [6], we use a pre-trained inverse network provided by the authors. All the test images are of size 512 × 512.

Qualitative Examples.
In Fig. 4 we show example style transfer results generated by the compared methods. Note that all the test style images are never observed during the training of our model, while the results of [52] are obtained by fitting one network to each test style. Even so, the quality of our stylized images is quite competitive with [52] and [16] for many images. In some other cases our method is slightly behind the quality of [52] and [16]. This is not unexpected, as we believe there is a three-way trade-off between speed, flexibility, and quality. Compared with [6], our method appears to transfer the style more faithfully for most compared images. The last example clearly illustrates a major limitation of [6], which attempts to match each content patch with the closest-matching style patch. However, if most content patches are matched to a few style patches that are not representative of the target style, the style transfer would fail. We thus argue that matching global feature statistics is a more general solution, although in some cases the method of [6] can also produce appealing results.

Quantitative evaluations.
Does our algorithm trade off some quality for higher speed and flexibility, and if so by how much? To answer this question quantitatively, we compare our approach with the optimization-based method [16] and the fast single-style transfer method [52] in terms of the content and style loss. Because our method uses a style loss based on IN statistics, we also modify the loss function in [16] and [52] accordingly for a fair comparison (their results in Fig. 4 are still obtained with the default Gram matrix loss). The content loss shown here is the same as in [52, 16]. The numbers reported are averaged over style and content images randomly chosen from the test set of the WikiArt dataset [39] and MS-COCO [36].

As shown in Fig. 3, the average content and style loss of our synthesized images are slightly higher than, but comparable to, those of the single-style transfer method of Ulyanov et al. [52]. In particular, both our method and [52] obtain a style loss similar to that of [16] between 50 and 100 iterations of optimization. This demonstrates the strong generalization ability of our approach, considering that our network has never seen the test styles during training while each network of [52] is specifically trained on a test style. Also, note that our style loss is much smaller than that of the original content image.

Figure 4. Example style transfer results (columns, left to right: style, content, ours, Chen and Schmidt [6], Ulyanov et al. [52], Gatys et al. [16]). All the tested content and style images are never observed by our network during training.
Speed analysis.
Most of our computation is spent on content encoding, style encoding, and decoding, each roughly taking one third of the time. In some application scenarios such as video processing, the style image needs to be encoded only once and AdaIN can use the stored style statistics to process all subsequent images. In some other cases (e.g., transferring the same content to different styles), the computation spent on content encoding can be shared.

In Tab. 1 we compare the speed of our method with previous ones [16, 52, 11, 6]. Excluding the time for style encoding, our algorithm runs at real-time frame rates for both 256 × 256 and 512 × 512 images, making it possible to process arbitrary user-uploaded styles in real-time. Among algorithms applicable to arbitrary styles, our method is nearly three orders of magnitude faster than [16] and 1-2 orders of magnitude faster than [6]. The speed improvement over [6] is particularly significant for images of higher resolution, since the style swap layer in [6] does not scale well to high resolution style images. Moreover, our approach achieves comparable speed to feed-forward methods limited to a few styles [52, 11]. The slightly longer processing time of our method is mainly due to our larger VGG-based network, instead of methodological limitations. With a more efficient architecture, our speed can be further improved.

Method | Time (s), 256 × 256 px | Time (s), 512 × 512 px | # Styles
Gatys et al. [16] | – (–) | – (–) | ∞
Chen and Schmidt [6] | – (–) | – (–) | ∞
Ulyanov et al. [52] | – (N/A) | – (N/A) | 1
Dumoulin et al. [11] | – (N/A) | – (N/A) | 32
Ours | – (–) | – (–) | ∞

Table 1. Speed comparison (in seconds) for 256 × 256 and 512 × 512 images. Our approach achieves comparable speed to methods limited to a small number of styles [52, 11], while being much faster than other existing algorithms applicable to arbitrary styles [16, 6]. We show the processing time both excluding and including (in parentheses) the style encoding procedure. Results are obtained with a Pascal Titan X GPU and averaged over the test images.

In this subsection, we conduct experiments to justify our important architectural choices. We denote our approach described in Sec. 6 as Enc-AdaIN-Dec. We experiment with a model named Enc-Concat-Dec that replaces AdaIN with concatenation, which is a natural baseline strategy to combine information from the content and style images. In addition, we run models with BN/IN layers in the decoder, denoted as Enc-AdaIN-BNDec and Enc-AdaIN-INDec respectively. Other training settings are kept the same.

In Fig. 5 and 6, we show examples and training curves of the compared methods. In the image generated by the Enc-Concat-Dec baseline (Fig. 5 (d)), the object contours of the style image can be clearly observed, suggesting that the network fails to disentangle the style information from the content of the style image. This is also consistent with Fig. 6, where Enc-Concat-Dec can reach a low style loss but fails to decrease the content loss. Models with BN/IN layers also obtain qualitatively worse results and consistently higher losses. The results with IN layers are especially poor. This once again verifies our claim that IN layers tend to normalize the output to a single style and thus should be avoided when we want to generate images in different styles.

Figure 5. Comparison with baselines: (a) style, (b) content, (c) Enc-AdaIN-Dec, (d) Enc-Concat-Dec, (e) Enc-AdaIN-BNDec, (f) Enc-AdaIN-INDec. AdaIN is much more effective than concatenation in fusing the content and style information. Also, it is important not to use BN or IN layers in the decoder.

Figure 6. Training curves of style and content loss.
To further highlight the flexibility of our method, we show that our style transfer network allows users to control the degree of stylization, interpolate between different styles, transfer styles while preserving colors, and use different styles in different spatial regions. Note that all these controls are only applied at runtime using the same network, without any modification to the training procedure.
Content-style trade-off.
The degree of style transfer can be controlled during training by adjusting the style weight λ in Equ. 11. In addition, our method allows content-style trade-off at test time by interpolating between the feature maps that are fed to the decoder. Note that this is equivalent to interpolating between the affine parameters of AdaIN:

$$T(c, s, \alpha) = g\big((1 - \alpha) f(c) + \alpha \, \mathrm{AdaIN}(f(c), f(s))\big) \quad (14)$$

The network tries to faithfully reconstruct the content image when α = 0, and to synthesize the most stylized image when α = 1. As shown in Fig. 7, a smooth transition between content-similarity and style-similarity can be observed by changing α from 0 to 1.
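A minimal sketch of this runtime control (Equ. 14), assuming an adain helper as sketched in Sec. 5 and hypothetical encode/decode callables standing in for f and g:

```python
def stylize(content_img, style_img, encode, decode, adain, alpha=1.0):
    """Content-style trade-off of Equ. 14: alpha=0 reconstructs content, alpha=1 is full stylization."""
    f_c = encode(content_img)                       # content features f(c)
    t = adain(f_c, encode(style_img))               # AdaIN(f(c), f(s))
    return decode((1.0 - alpha) * f_c + alpha * t)  # decode the interpolated features
```

Because only the features fed to the decoder change, the same trained network handles every value of α.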
Style interpolation.

To interpolate between a set of K style images s_1, s_2, ..., s_K with corresponding weights w_1, w_2, ..., w_K such that Σ_{k=1}^{K} w_k = 1, we similarly interpolate between feature maps (results shown in Fig. 8):

$$T(c, s_{1,...,K}, w_{1,...,K}) = g\Big(\sum_{k=1}^{K} w_k \, \mathrm{AdaIN}(f(c), f(s_k))\Big) \quad (15)$$
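Under the same assumptions (the adain, encode, and decode helpers sketched above), style interpolation (Equ. 15) can be expressed as decoding a convex combination of AdaIN outputs:

```python
def interpolate_styles(content_img, style_imgs, weights, encode, decode, adain):
    """Equ. 15: decode a weighted combination of AdaIN outputs; weights should sum to 1."""
    f_c = encode(content_img)
    mixed = sum(w * adain(f_c, encode(s)) for w, s in zip(weights, style_imgs))
    return decode(mixed)
```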
Spatial and color control.

Gatys et al. [17] recently introduced user controls over the color information and spatial locations of style transfer, which can be easily incorporated into our framework. To preserve the color of the content image, we first match the color distribution of the style image to that of the content image (similar to [17]), then perform a normal style transfer using the color-aligned style image as the style input. Example results are shown in Fig. 9.

In Fig. 10 we demonstrate that our method can transfer different regions of the content image to different styles. This is achieved by performing AdaIN separately on different regions in the content feature maps using statistics from different style inputs, similar to [4, 17] but in a completely feed-forward manner. While our decoder is only trained on inputs with homogeneous styles, it generalizes naturally to inputs in which different regions have different styles.

Figure 7. Content-style trade-off. At runtime, we can control the balance between content and style by changing the weight α in Equ. 14.

Figure 8. Style interpolation. By feeding the decoder with a convex combination of feature maps transferred to different styles via AdaIN (Equ. 15), we can interpolate between arbitrary new styles.

Figure 9. Color control. Left: content and style images. Right: color-preserved style transfer result.

Figure 10. Spatial control. Left: content image. Middle: two style images with corresponding masks. Right: style transfer result.
8. Discussion and Conclusion
In this paper, we present a simple adaptive instance normalization (AdaIN) layer that for the first time enables arbitrary style transfer in real-time. Beyond the fascinating applications, we believe this work also sheds light on our understanding of deep image representations in general.

It is interesting to consider the conceptual differences between our approach and previous neural style transfer methods based on feature statistics. Gatys et al. [16] employ an optimization process to manipulate pixel values to match feature statistics. The optimization process is replaced by feed-forward neural networks in [24, 51, 52]. Still, the network is trained to modify pixel values to indirectly match feature statistics. We adopt a very different approach that directly aligns statistics in the feature space in one shot, then inverts the features back to the pixel space.

Given the simplicity of our approach, we believe there is still substantial room for improvement. In future work we plan to explore more advanced network architectures such as the residual architecture [24] or an architecture with additional skip connections from the encoder [23]. We also plan to investigate more complicated training schemes like incremental training [32]. Moreover, our AdaIN layer only aligns the most basic feature statistics (mean and variance). It is possible that replacing AdaIN with correlation alignment [49] or histogram matching [54] could further improve quality by transferring higher-order statistics. Another interesting direction is to apply AdaIN to texture synthesis.
Acknowledgments
We would like to thank Andreas Veit for helpful discussions. This work was supported in part by a Google Focused Research Award, AWS Cloud Credits for Research and a Facebook equipment donation.

Figure 11. More examples of style transfer. Each row shares the same style while each column represents the same content. As before, the network has never seen the test style and content images.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.
[4] A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.
[5] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. StyleBank: An explicit representation for neural image style transfer. In CVPR, 2017.
[6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[7] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In NIPS Workshop, 2011.
[8] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. In ICLR, 2017.
[9] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[10] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
[11] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
[12] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, 2001.
[13] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
[14] M. Elad and P. Milanfar. Style-transfer via texture-synthesis. arXiv preprint arXiv:1609.03057, 2016.
[15] O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In CVPR, 2016.
[16] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[17] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In CVPR, 2017.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[19] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In SIGGRAPH, 1995.
[20] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
[21] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.
[22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In JMLR, 2015.
[23] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[24] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[25] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[28] J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg. State of the "art": A taxonomy of artistic stylization techniques for images and video. TVCG, 2013.
[29] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. In ICASSP, 2016.
[30] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, 2016.
[31] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
[32] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In CVPR, 2017.
[33] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
[34] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
[35] Q. Liao, K. Kawaguchi, and T. Poggio. Streaming normalization: Towards simpler and more biologically-plausible normalizations for online and recurrent learning. arXiv preprint arXiv:1610.06160, 2016.
[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[37] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
[38] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
[39] K. Nichol. Painter by numbers, WikiArt, 2016.
[40] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[41] X. Peng and K. Saenko. Synthetic to real adaptation with deep generative correlation alignment networks. arXiv preprint arXiv:1701.05524, 2017.
[42] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[44] M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2017.
[45] M. Ruder, A. Dosovitskiy, and T. Brox. Artistic style transfer for videos. In GCPR, 2016.
[46] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[47] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
[48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[49] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[50] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
[51] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
[52] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR, 2017.
[53] X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. arXiv preprint arXiv:1612.01895, 2016.
[54] P. Wilmot, E. Risser, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.
[55] H. Zhang and K. Dana. Multi-style generative network for real-time transfer. arXiv preprint arXiv:1703.06953, 2017.