Style Mixer: Semantic-aware Multi-Style Transfer Network
Zixuan HUANG ∗ City University of Hong Kong [email protected]
Jinghuai Zhang ∗ City University of Hong Kong [email protected]
Jing Liao † City University of Hong Kong [email protected]
∗ indicates equal contribution. † indicates corresponding author.
Figure 1: In SST, our backbone model transfers style faithfully based on semantic correspondence. In MST, our proposed framework Style Mixer incorporates multiple styles based on regional semantics, and therefore best preserves the characteristics of individual styles and generates more favorable results than existing methods. LB indicates linear blending.
Abstract
Recent neural style transfer frameworks have obtained astonishing visual quality and flexibility in Single-style Transfer (SST), but little attention has been paid to Multi-style Transfer (MST), which refers to simultaneously transferring multiple styles to the same image. Compared to SST, MST has the potential to create more diverse and visually pleasing stylization results. In this paper, we propose the first MST framework to automatically incorporate multiple styles into one result based on regional semantics. We first improve the existing SST backbone network by introducing a novel multi-level feature fusion module and a patch attention module to achieve better semantic correspondences and preserve richer style details. For MST, we design a conceptually simple yet effective region-based style fusion module to insert into the backbone. It assigns corresponding styles to content regions based on semantic matching, and then seamlessly combines multiple styles together. Comprehensive evaluations demonstrate that our framework outperforms existing works in SST and MST.
1. Introduction
The target of style transfer is to confer the style of a reference image to another image while preserving the content of the latter. The seminal work of Gatys et al. [5, 6] demonstrated that the correlation between deep features is superior in capturing visual style, which opened up the era of neural style transfer. Significant effort has since been devoted to improving the speed, flexibility, and visual quality of neural style transfer. The most recent works
[12, 19, 17, 3, 25] support efficient arbitrary style transfer with a single convolutional neural network model and serve as the state-of-the-art baselines.

However, most studies in neural style transfer focus on SST, i.e., the image is transferred by a single style reference. To generate more diverse and visually pleasing results, two straightforward attempts have been proposed to extend existing techniques to MST, allowing the user to transfer the content into an aggregation of multiple styles. One is linear blending [6, 9, 19, 25, 23, 3], which interpolates the features of different styles linearly by given weights. However, as shown in Fig. 1, this method tends to generate muddled results, since the colors and textures of different styles are simply mixed, and also dull results, since the combination is spatially invariant. The other method spatially combines multiple styles by asking users to provide a mask and manually assign the styles to different regions [23, 19], which achieves the desired effect but involves tedious work.

In this paper, we propose a semantic-aware MST network: Style Mixer. It can automatically incorporate multiple styles into one result according to the regional semantics. Our Style Mixer consists of a backbone SST network and a multi-style fusion module. The backbone network achieves semantic-level SST by learning the semantic correlations between the content and style features. It is inspired by two arbitrary style transfer networks: Avatar-Net [25] and SANet [23]. To build correspondences, Avatar-Net uses a fixed patch-swap module while SANet uses a learnable attention module. We incorporate the merits of both methods (leveraging patch information while allowing learnable parameters) by proposing a novel patch attention (PA) module for more accurate correspondences. PA improves the traditional attention module by making the size of the receptive field controllable, which can benefit work in other fields as well. Besides, we further improve the richness of style features by introducing multi-level feature fusion (MFF). Compared to state-of-the-art style transfer networks, our backbone network is better at both capturing semantic correspondences and preserving style richness.

In the inference stage, we design an efficient region-based multi-style fusion module embedded in the middle of the backbone network. The module first segments the content feature map into regions based on semantic information, and then assigns the most suitable style to each region according to the correspondence confidences generated by the PA module. After decoding this hybrid feature map, our network creates a seamless and coherent MST result. Comprehensive evaluations show that our approach can produce more vivid and diverse results than existing SST and MST methods.

In summary, the contributions of this paper are threefold:
(1) We propose the first MST framework to automatically and spatially incorporate different styles into one result based on semantic information.
(2) We design a patch attention module for semantic correspondence, which broadens the form of the attention module and makes the size of the receptive field controllable.
(3) We propose a conceptually simple yet effective region-based multi-style fusion module for MST that assigns multiple styles to their semantically related regions and then seamlessly fuses them.
2. Related Work
Neural style transfer.
Starting from the seminal work of Gatys et al. [5, 6], Convolutional Neural Networks (CNNs) have demonstrated a remarkable ability to transfer style by matching statistical information between the features of content and style images. The framework of Gatys et al. [6] is based on iterative updates of the image by optimizing content and style losses, which is applicable to arbitrary images but computationally expensive. Numerous studies have since been developed to improve style transfer in different aspects such as visual quality [15, 30, 24], perceptual control [7], and stroke control [13, 34]. A great number of researchers have tried to accelerate the transfer [14, 26, 27, 16, 2, 18, 4] by approximating the iterative optimization with a feed-forward network. Although speed is improved dramatically, flexibility is compromised since each network is restricted to a single style or a finite set of styles. The dilemma between speed, flexibility, and quality [35] impedes the further development of style transfer. Recently, some fast Arbitrary-Style-Per-Model methods have been proposed to resolve the dilemma. The idea is to train a style-agnostic autoencoder and convert the content feature into a given style domain while preserving content structures. [12, 19, 17] transfer the global style by coordinating the statistical distributions between content and style features, while [3] swaps each content feature patch with the nearest style feature patch in terms of cosine similarity, which achieves locally semantic-aware style transfer results. Avatar-Net [25] further extends AdaIN [12] to multi-scale style adaptation and loosens the restrictions of Style Swap [3] by performing projection before matching.

Despite the success in SST, little attention has been paid to the field of MST, which is likely to create more vibrant and distinctive artistic effects. Some works extend their SST framework to MST as a simple add-in by linearly blending the features from different styles [6, 9, 19, 25, 23, 3] or by manually specifying masks [23, 19]. They either generate undesired results or require tedious user effort. The challenge of MST is how to automatically combine the features of different styles harmoniously without damaging the characteristics of each style. We effectively resolve this challenge by regional semantic matching and produce state-of-the-art MST results.

Figure 2: An overview of our proposed network. Multi-level features $F_C$ and $F_S$ are first fused in the MFF module by channel-wise attention. The fused style feature $F_s^{fused}$ is then reassembled into $F_s^{fused'}$, guided by the semantic correspondence between $F_c^{relu4\_1}$ and $F_s^{relu4\_1}$. In the SST case, $F_s^{fused'}$ is merged with $F_c^{fused}$ and decoded, while in the MST case, as shown by the dashed line, multiple $F_s^{fused'}$ and $conf_{corresp}$ are fed to our multi-style fusion module and integrated based on regional correspondence confidence.

Attention Module.
Recently, attention mechanisms have become a key ingredient for models that need to incorporate global dependency [1, 8, 32, 33]. They allow the model to look globally but attend selectively to the data. In particular, self-attention [10, 22] calculates the correlation between every two positions in a sequence. Such a mechanism has proved exceptionally effective in machine translation [28, 1], image classification [31, 37], visual question answering [32], and image generation [36]. Recently, [23] introduced style-attention to capture the correspondence between the content image and the style image, outperforming prior works in terms of visual quality. Compared to [23], we further improve the capability of semantic matching to catalyze the performance of our multi-style fusion module.
3. Proposed Method
The architecture of Style Mixer is shown in Fig. 2. The backbone style transfer model comprises an encoder and a decoder, with the multi-level feature fusion (MFF) module and the patch attention (PA) module in the middle. In the case of MST, a multi-style fusion module is further embedded to distribute style features from different style references.
A pretrained VGG-19 network is employed as a feed-forward encoder to extract features of the input pairs. To incorporate the multi-level features produced by the encoder, an MFF module is placed after the encoder, taking features from 3 different layers as input.

Being able to classify objects correctly despite huge low-level variations, VGG-19 has proved its efficiency and robustness in extracting semantic information. Therefore, by calculating patch attention between the high-level features of the content and style images, we can obtain a meaningful semantic attention map and reassemble the style features accordingly. At last, we merge the reassembled style feature $F_s^{fused'}$ with $F_c^{fused}$ and decode them into an artistic image.

Since the problems of multi-level feature fusion and semantic correspondence are common to both SST and MST, these two modules can be trained with SST and then applied to MST. In MST, Style Mixer processes multiple styles in a parallel manner and incorporates them with our region-based style fusion strategy. The correspondence confidence $conf_{corresp}$ produced by the PA module guides the distribution of the different styles based on semantic matching. In this way, every style is assigned to its most semantically related region with local consistency.

Multi-level feature fusion (MFF). Features from different layers of VGG carry information of different scales and abstractness levels. To incorporate multi-level information, Avatar-Net [25] introduces multi-level AdaIN [12] to conduct style adaptation progressively. However, holistic statistic alignment sometimes creates unpleasant artifacts. After that, SANet [23] integrates two separate style-attention modules to extract style features at layers relu4_1 and relu5_1 to improve style richness, but it also introduces an expensive computational cost. To obtain faithful stylization at an affordable computational cost (which is especially critical when adopting PA), we design an MFF module to coalesce the features from 3 different layers adaptively.

The whole process of our MFF module is depicted in Fig. 3. Features from relu3_1, relu4_1, and relu5_1 are first recalibrated by channel-wise attention [11] and then squeezed by a 1 × 1 convolution into $F^{fused}$. The comparison between different choices of input layers is shown in Fig. 4.

Figure 3: The process of multi-level feature fusion can be summarized as: concatenate, attend, and squeeze.

Figure 4: The second image, the result of a single-level feature, is rendered in large strokes and lacks style patterns. After adding the feature from relu5_1, the result (the 3rd column) is richer in spiral patterns, for instance, at the nose and the upper-right corner. If we further integrate the feature from a lower level, which has a smaller receptive field, the high-frequency areas, i.e., the hair and eyes of the woman, become finer, while the cheek remains coarse. Features of different scales are combined pleasingly.
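To make the concatenate-attend-squeeze pipeline concrete, the following is a minimal PyTorch sketch of an MFF-style module. It reflects our reading of the description above rather than the released code: the channel sizes, the SE-style reduction ratio, and aligning all features to the relu4_1 grid are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFF(nn.Module):
    """Concatenate, attend, and squeeze multi-level VGG features (sketch)."""
    def __init__(self, channels=(256, 512, 512), out_channels=512, reduction=16):
        super().__init__()
        total = sum(channels)  # 1280 for VGG-19 relu3_1 / relu4_1 / relu5_1
        # SE-style channel-wise attention [11]: squeeze -> excite -> sigmoid gate
        self.attend = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(total, total // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(total // reduction, total, 1), nn.Sigmoid())
        # 1x1 convolution squeezes the recalibrated stack into F^fused
        self.squeeze = nn.Conv2d(total, out_channels, 1)

    def forward(self, relu3_1, relu4_1, relu5_1):
        size = relu4_1.shape[-2:]  # assumption: align everything to the relu4_1 grid
        stack = torch.cat([
            F.interpolate(relu3_1, size=size, mode='bilinear', align_corners=False),
            relu4_1,
            F.interpolate(relu5_1, size=size, mode='bilinear', align_corners=False),
        ], dim=1)                        # concatenate
        stack = stack * self.attend(stack)   # attend (channel-wise recalibration)
        return self.squeeze(stack)           # squeeze to F^fused
```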
Patch attention (PA). Style Swap [3] is a pioneering work that introduces local pattern matching to style transfer. However, due to the fixed cosine similarity metric and the overlap between patches, it produces undesired, overly smooth results with mismatches. SANet [23] proposed a novel style-attention mechanism to replace the fixed cosine similarity with a flexible, learnable similarity kernel. Following the tradition of self-attention [28] and the non-local block [29], it conducts point-wise attention between content and style features.

Due to the limited size of the receptive field and the local variation of the input image, point-wise attention performs unstably despite the learnable similarity kernel. To solve this problem, we extend the attention module to a more generic form, patch attention (PA), which makes the size of the receptive field controllable and better grasps structural information. The mechanism of our PA module is illustrated in Fig. 6. Together with the abundant semantic information in the high-level features of VGG-19, our PA module achieves robust semantic matching. It is also worth noting that the style-attentional module in SANet is a special case of PA.

The PA module takes the content feature $F_c$, the style feature $F_s$, and $F_s^{fused}$ from the MFF module as its inputs. It should be noted that in SANet [23], attention is carried out between the content feature and the style feature that will be reassembled. On the contrary, we calculate patch attention on the original VGG-19 features from layer relu4_1, to best preserve the semantic information, and use the resulting pair-wise correspondence to guide the rearrangement of the fused style feature $F_s^{fused}$.

Figure 5: Unfolding operation.

Figure 6: Patch attention module. In order to best preserve the semantic information, we calculate the correspondence score between the original features from VGG-19 to guide the reassembling of the fused multi-level feature $F_s^{fused}$.

PA starts with channel-wise normalization to put $F_c$ and $F_s$ into a common domain. This can be regarded as style normalization [12, 20] and encourages the matching to rely only on structural and semantic similarities. Then we perform a 1 × 1 convolution on the normalized features so that a suitable similarity kernel can be learned. To improve the matching accuracy, we take neighboring information into consideration by unfolding patches at each position; the unfold operation is demonstrated in Fig. 5. In Eq. 1, $\bar{F}$ represents the channel-wise normalized feature, and $P^i$ indicates the vectorized patch feature at the $i$-th position, which consists of the information of the $i$-th position and its neighborhood:

$$P_k^i = \mathrm{Unfold}(\theta_k(\bar{F}_k))_i, \quad \text{where } k \in \{s, c\}. \quad (1)$$

Next, the correspondence score $S$ and the semantic attention map $M$ are calculated with the patch attention mechanism as in Eq. 2. After performing a softmax operation on each row of $S$, we obtain the attention map needed for the reallocation of $F_s^{fused}$:

$$M_{i,j} = \frac{\exp(S_{ij})}{\sum_{j=1}^{N} \exp(S_{ij})}, \quad \text{where } S_{ij} = (P_c^i)^{T} P_s^j \text{ and } N \text{ is the spatial size of } F_s. \quad (2)$$

Driven by the contextual loss and the identity loss, similar features obtain a larger correspondence score, resulting in a larger attention value in $M$. Thanks to the rich semantic information provided by the encoder, the correspondence score can be interpreted as a semantic affinity. Thus, in the reallocation process, as depicted in Eq. 3, style features that are more semantically related are emphasized. $F_s^{fused'}$ refers to the reassembled style feature from the PA module:

$$F_s^{fused'} = M\,\theta_{fused}(F_s^{fused}). \quad (3)$$

To measure the confidence that $F_s^{fused'}$ has the same semantic implication as $F_c$, we further conduct element-wise multiplication between the correspondence score $S$ and the semantic attention map $M$ to derive a correspondence confidence $conf_{corresp}$. In essence, $conf_{corresp}$ is the weighted average correspondence score of $F_s^{fused'}$, representing the semantic correspondence between a given style feature $F_s$ and $F_c$. $conf_{corresp}$ plays a critical role in the distribution of styles in MST.
We define it as:

$$conf_{corresp}^i = \sum_{j=1}^{N} S_{i,j} M_{i,j}, \quad (4)$$

where $conf_{corresp}^i$ indicates the correspondence confidence of $F_s^{fused'}$ at location $i$.

The size of the receptive field is an intrinsic characteristic of a chosen layer and is always fixed. PA makes the receptive field adjustable and further releases the potential of the attention mechanism. From Fig. 7, we can see how different patch sizes affect the matching and stylization results. In all 3 cases, 1 × 1 PA (traditional point-wise attention) fails to capture semantic correspondence correctly. In the first pair, the bird is wrongly rendered in the style of the portrait, while in the other two pairs, the styles of the bird and the flower respectively dominate the whole image, disregarding the semantic meaning of the different objects. On the contrary, both 3 × 3 and 5 × 5 PA demonstrate an excellent capability of semantic matching. However, a larger patch size tends to compromise detail. For instance, in the third image of the second row, some flowers in the background disappear, probably because the neighboring information dominates the matching, so the flowers wrongly match with the background of the style. In addition, with computational cost in mind, we choose 3 × 3 PA in our model.

Figure 7: Investigation of patch size. 1 × 1 PA completely fails to differentiate the bird from the people or the flowers. While 5 × 5 PA obtains good overall matching accuracy, it sometimes mismatches objects due to noisy neighboring information, i.e., some flowers in the second example are mistakenly identified as background and disappear.
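The following is a minimal PyTorch sketch of patch attention (Eqs. 1-4), again our reading rather than the authors' implementation: the channel count, the mean/std form of the channel-wise normalization, and the tensor layout are assumptions, and the patch size defaults to the 3 × 3 setting chosen above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_norm(x, eps=1e-5):
    # Normalize each channel over spatial positions (style normalization)
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    return (x - mean) / std

class PatchAttention(nn.Module):
    def __init__(self, channels=512, patch=3):
        super().__init__()
        self.theta_c = nn.Conv2d(channels, channels, 1)      # learnable 1x1 kernels
        self.theta_s = nn.Conv2d(channels, channels, 1)
        self.theta_fused = nn.Conv2d(channels, channels, 1)
        self.patch = patch

    def forward(self, f_c, f_s, f_fused_s):
        b, c, h, w = f_c.shape
        pad = self.patch // 2
        # Eq. 1: vectorized patches around every position, B x (C*k*k) x HW
        p_c = F.unfold(self.theta_c(channel_norm(f_c)), self.patch, padding=pad)
        p_s = F.unfold(self.theta_s(channel_norm(f_s)), self.patch, padding=pad)
        # Eq. 2: correspondence score S and row-softmax attention map M
        S = torch.bmm(p_c.transpose(1, 2), p_s)   # B x HW_c x HW_s
        M = F.softmax(S, dim=-1)
        # Eq. 3: reassemble the fused style feature onto the content grid
        v = self.theta_fused(f_fused_s).flatten(2)            # B x C x HW_s
        f_out = torch.bmm(v, M.transpose(1, 2)).view(b, c, h, w)
        # Eq. 4: correspondence confidence at each content position
        conf = (S * M).sum(dim=-1).view(b, 1, h, w)
        return f_out, conf
```

Setting `patch=1` recovers point-wise style attention, matching the observation above that SANet's style-attentional module is a special case of PA.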
Multi-style fusion. In MST, the most challenging problem lies in how to harmoniously incorporate different styles without hurting the characteristics of each style. This has two underlying implications.

Firstly, styles should not be mixed; otherwise, they will obfuscate each other and compromise style integrity. What is worse, mixing distinctive styles may produce disturbing and nondescript patterns. Thus, the assignment of different styles should be mutually exclusive. Secondly, a metric needs to be defined to decide the distribution of multiple styles. Semantic correspondence is a natural choice since, with semantic consideration, the overall effect will look more reasonable and intuitive. The correspondence confidence $conf_{corresp}$ is precisely such an objective measure of semantic correspondence among different styles.

Given the two considerations above, a straightforward idea is to assign the style with the highest confidence to each position. However, local variation and noise sometimes interfere with the calculation of correspondence, inducing false matches and producing unpleasant discrete patterns. In Fig. 8, we can see that the discrete strategy produces many scattered patterns and deteriorates local consistency. To resolve the problem, we utilize clustering to segment our content feature map (relu4_1) and calculate the regional correspondence confidence. This regional voting strategy increases the robustness of matching by fixing individual mismatches. As mentioned before, high-level features comprise abundant semantic information, so clustering in the high-dimensional feature space is efficient in distinguishing objects with different semantic implications. Specifically, we apply K-means to cluster all feature vectors together with their spatial locations in Euclidean distance to ensure the spatial affinity of the result.

Figure 8: Comparison between the region-based strategy and the discrete strategy. The discrete strategy introduces noise in certain regions, eliminates the characteristics of the style features, and therefore fails to generate high-quality stylized images.

The pipeline of MST is depicted as the dashed line in Fig. 2. The MFF module and the PA module process multiple style references in a parallel way and pass all the reassembled style features $F_s^{fused'}$ and correspondence confidences $conf_{corresp}$ to the multi-style fusion module.

To allocate the semantically nearest style to each region, we calculate the regional sum of correspondence confidence and choose the style with the highest value for each region. The assignment policy is conceptually simple but proves its robustness in comprehensive evaluation. Formally, let $R$ be a specific region; we calculate the sum of correspondence confidence in $R$ for every style, and the style $k$ with the highest sum becomes the assignment result $I_R$ for region $R$:

$$I_R = \operatorname*{argmax}_k \Big( \sum_{i \in R} conf_{corresp}^{i,k} \Big), \quad (5)$$

where $conf_{corresp}^{i,k}$ indicates the correspondence confidence of style $k$ at position $i$.

Compared to the straightforward discrete strategy, our proposed region-based strategy improves visual quality and matching robustness. In Fig. 8, the results of the discrete strategy suffer from mismatches and local inconsistency, such as the blemishes on the horse and the grassland in the upper-right pair. By conducting regional voting, those flaws are fixed automatically: both the horse and the grass are faithfully transformed according to the reference image. Fig. 9 gives a better idea of how the styles are distributed.

Figure 9: By changing different styles of architecture, the corresponding region in the content image changes simultaneously.
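As a rough illustration of the region-based strategy, the sketch below clusters content feature vectors (plus their spatial coordinates) with K-means and applies the regional argmax of Eq. 5. The use of scikit-learn, the coordinate weighting, and K = 6 (within the 5-7 range discussed in Sec. 5) are our assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def region_based_fusion(f_c, styles, k=6, coord_weight=1.0):
    """f_c: content feature, C x H x W (e.g. relu4_1).
    styles: list of (f_fused_prime_s, conf) pairs, C x H x W and H x W each."""
    c, h, w = f_c.shape
    # Cluster feature vectors together with their (y, x) locations so that
    # the resulting regions stay spatially coherent.
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.concatenate([
        f_c.reshape(c, -1).T,
        coord_weight * np.stack([ys.ravel(), xs.ravel()], axis=1)], axis=1)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats).reshape(h, w)

    fused = np.zeros_like(styles[0][0])
    for r in range(k):
        mask = labels == r
        # Eq. 5: pick the style whose confidence sums highest over region r
        winner = int(np.argmax([conf[mask].sum() for _, conf in styles]))
        # Mutually exclusive assignment: region r takes only the winner's feature
        fused[:, mask] = styles[winner][0][:, mask]
    return fused, labels
```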
4. Experiments and Results
We train our network using the MSCOCO and WikiArt datasets as content images and style images, respectively, both of which contain roughly 80,000 images. We use an Adam optimizer to train the backbone model with a batch size of 6 content-style pairs and a learning rate initially set to $1 \times 10^{-4}$. During the training process, we first resize the smaller dimension to 512 pixels while preserving the aspect ratio, and then randomly crop regions of 256 × 256 pixels for end-to-end training.

Our loss function is defined as below to drive the training process:

$$L = \lambda_c L_c + \lambda_s L_s + L_{identity} + \lambda_{cx} L_{cx}. \quad (6)$$

Similar to [12], our perceptual loss $L_c$ is defined as the Euclidean distance between the channel-wise normalized VGG-19 features extracted from the content image and the synthesized image. Feature layers relu3_1, relu4_1, and relu5_1 are used to compute the perceptual loss. For the style loss $L_s$, we apply the same style loss as AdaIN [12] to drive the global style transfer.

We also apply the contextual loss proposed by [21] to facilitate the semantic matching between style features and content features. The cosine distances $d^L_{i,j}$ are calculated between each pair of feature vectors in the feature maps of the style and synthesized images. After $d^L_{i,j}$ is normalized as

$$\bar{d}^L(i,j) = \frac{d^L_{i,j}}{\min_k d^L_{i,k} + \epsilon},$$

the affinity between any two feature points in layer $L$ is represented as:

$$A^L(i,j) = \mathrm{softmax}_j\big((1 - \bar{d}^L(i,j))/bw\big),$$

where $bw$ is the bandwidth, typically set to 0.1. The contextual loss is defined to maximize such affinity between the synthesized image and the semantically nearest style feature:

$$L_{cx} = \sum_L \Big[ -\log \Big( \frac{1}{N_L} \sum_i \max_j A^L(i,j) \Big) \Big], \quad (7)$$

where $N_L$ is the number of feature vectors at layer $L$.

In order to guide the network to gain a powerful ability of semantic matching and image reconstruction, an advanced identity loss proposed by [23] is employed, as shown in Fig. 2. Two symmetric pairs of content and style images are fed to the network with the hope that the network should be able to reconstruct the original images, and the results are denoted as $I_{cc}$ and $I_{ss}$ separately. Formally, the identity loss is defined as below:

$$L_{identity} = \lambda_{identity1}\big(\|I_{cc} - I_c\|_2 + \|I_{ss} - I_s\|_2\big) + \lambda_{identity2} \sum_i \big(\|VGG_i(I_{cc}) - VGG_i(I_c)\|_2 + \|VGG_i(I_{ss}) - VGG_i(I_s)\|_2\big). \quad (8)$$

In addition, we change the behavior of the merging module during identity loss calculation to:

$$F_{cs} = k \times F_s^{fused'}, \quad (9)$$

where $k$ is a learnable scale factor, and we name the module the Amplifier. The advantage of the Amplifier is further discussed in Sec. 5.1.

The weight parameters $\lambda_c$, $\lambda_s$, $\lambda_{cx}$, $\lambda_{identity1}$, and $\lambda_{identity2}$ are set to 3, 3, 3, 1, and 50, respectively, according to our experiments.

To evaluate the effectiveness of our backbone model and region-based style fusion strategy, we conduct a comparison with existing methods. All the inputs are chosen outside the training set. For a fair comparison, we generate results by running the released code of the aforementioned works with the default configuration, except for SANet (we use the official demo page). The visual comparisons of SST and MST methods are shown in Fig. 10 and Fig. 11, respectively. Additional examples of our work can be found in Fig. 18.

Figure 10: Visual comparison with existing works on SST.
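Before turning to the comparisons, here is a minimal PyTorch sketch of the contextual loss of Eq. 7 for a single layer, under our own flattening convention of one feature vector per spatial position; it is not the authors' code, and the epsilon constant is an assumption.

```python
import torch
import torch.nn.functional as F

def contextual_loss(feat_syn, feat_style, bw=0.1, eps=1e-5):
    """feat_syn, feat_style: C x N tensors (one column per spatial position)."""
    # Cosine distance d_{i,j} between every synthesized / style vector pair
    syn = F.normalize(feat_syn, dim=0)      # C x N_syn
    sty = F.normalize(feat_style, dim=0)    # C x N_sty
    d = 1.0 - syn.t() @ sty                 # N_syn x N_sty
    # Normalize by the closest style vector: d_bar(i,j) = d_ij / (min_k d_ik + eps)
    d_bar = d / (d.min(dim=1, keepdim=True).values + eps)
    # Affinity A(i,j) = softmax_j((1 - d_bar(i,j)) / bw)
    A = F.softmax((1.0 - d_bar) / bw, dim=1)
    # Eq. 7 for one layer: maximize each position's best affinity
    return -torch.log(A.max(dim=1).values.mean())
```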
Single-style transfer.
Single-style performance comparison results are available in Fig. 10. The optimization-based method [6] is unstable since it is likely to get stuck in a local minimum for some pairs, as can be seen in columns 3 and 4 of Gatys et al. in Fig. 10: the two faces suffer heavily from the loss of detail and deviation of style. Both AdaIN [12] and WCT [19] holistically adjust the content features to match the global statistics of the style features, which leads to blurring effects and textural distortion in some local regions (e.g., in the last column of AdaIN and WCT, the pattern of trees grows indiscriminately into the sky). Although Avatar-Net [25] shrinks the domain gap between content and style features and utilizes patch-wise semantics, it tends to produce fuzzy effects due to overlapping patches, and repeated patterns because of global statistical alignment (e.g., columns 1, 2, and 7 of Avatar in Fig. 10). LST [17] originates from [19] and generates some good results, but it is vulnerable to wash-out artifacts (e.g., columns 3 and 4 of LST in Fig. 10) and halation around the edges (e.g., columns 1 and 6). Besides, this method fails to display the desired stylized effect for some images (e.g., columns 2 and 5 of LST). SANet [23] applies a style-attention mechanism to flexibly conduct style transfer. However, false matching and distortions still occur for this method, such as the pink pattern on the trees in the first column.

Our method achieves the most balanced performance among all the above models. It greatly improves content preservation by incorporating content features from relu3_1, relu4_1, and relu5_1, which can be seen in columns 1, 5, 6, and 7 of Fig. 10. At the same time, it presents rich style patterns that are both appealing and meaningful (e.g., columns 2 and 4 of Ours in Fig. 10). Besides, the learnable patch attention module takes contextual information into consideration and flexibly reassembles style patterns, which makes a breakthrough in semantic feature transfer (e.g., columns 1, 3, and 6).
Multi-style Transfer.
To illustrate the effectiveness of our region-based strategy for MST, we compare it with the traditional linear blending strategy implemented by AdaIN [12] and Avatar-Net [25], as well as by our backbone model. All the results are shown in Fig. 11. Generally speaking, linear blending mixes different styles, so the characteristics of individual styles are not preserved; it tends to produce muddled results with fade-out effects. When applying the linear blending strategy, our model and AdaIN [12] fail to retain the characteristics of individual styles, as the structural and color information is fused indiscriminately (columns 2, 3, 5, and 7 in Fig. 11). Although Avatar-Net [25] preserves the style patterns for certain images, it seriously suffers from fade-out effects (columns 2, 5, 6, and 7 of Avatar in Fig. 11). On the other hand, Style Mixer eliminates the interference between different styles with a spatially exclusive transfer strategy. In the last column of Fig. 11, the three linear-blending-based methods produce results with colors that do not exist in the style references, while our proposed Style Mixer faithfully transfers the field, mountain, and sky of the style references to the result.
To validate our work, we further conduct two user studies to evaluate the SST performance of our backbone model and the MST performance of Style Mixer. Both studies are conducted among 40 participants, ranging uniformly from university students to office workers. For each question, we display the results of all methods in random order and ask the participants to choose the one that best conforms to the given metric. All the questions are presented in random order, and the participants are given unlimited time to finish them. Unlike the settings of regular user studies, we do not choose the test images randomly. Instead, we handpicked semantically related content and style image pairs to evaluate the performance on semantic matching. Each user study involves 36 pairs of images in total, and each user is presented with six randomly chosen ones.

Figure 11: Visual comparison with existing works on MST.

Figure 12: User preference towards different SST algorithms in terms of different metrics.

Figure 13: User preference towards different MST strategies.
Single-style Transfer.
Firstly, we assess the ability of our backbone model on SST. Five state-of-the-art models [6, 12, 19, 25, 23] are chosen for comparison. We follow [34] to evaluate content preservation and style faithfulness. Besides, we introduce semantic matching ability as a new metric, indicating whether the styles are transferred according to semantic matching, i.e., tree-to-tree, face-to-face. We manually provide explicit instructions with exemplar images to define the criteria for each metric. For a fair comparison, we run the released code with the default settings for the aforementioned models. As we can see in Fig. 12, our model obtains the most impressive performance from the visual perspective, especially in content preservation. Even in terms of style faithfulness, our model is competitive with the iteration-based method [6]. Also, the semantic matching score of our proposed method is the highest among the six models, which should be credited to the PA module. The extraordinary visual quality and semantic matching of our backbone model serve as the cornerstone of our MST framework.

Figure 14: Exemplar images in the MST survey. The results of our region-based fusion strategy demonstrate the best style faithfulness with local consistency.
Multi-style transfer.
In order to evaluate user preference towards different MST strategies, we eliminate the effect of the backbone model by using the same one (our model) for all strategies. Our region-based strategy is compared with linear blending as well as the discrete strategy in the user study. The result illustrated in Fig. 13 shows that our region-based strategy is superior to the other two methods. Linear blending is the least favored, probably because of the muddled results and insipid colors, as shown in Fig. 14. The discrete strategy produces more vivid results but with some flaws due to unstable local matching (i.e., the green color on the horse in the first image and the mottled sky in the second image of Fig. 14), while our proposed method fixes those false matches with the regional voting mechanism and thus obtains more decent results.

Method           | SST Time | MST Time
Gatys et al. [6] | 51.04    | -
AdaIN [12]       | 0.014    | 0.032 (Linear blending)
WCT [19]         | 0.933    | -
Avatar-Net [25]  | 0.330    | 0.526 (Linear blending)
SANet [23]       | 0.034    | -
Ours             | 0.045    | 0.371 (Region-based)

Table 1: Execution time comparison (in seconds).

Figure 15: MST with more style references. Our Style Mixer is able to handle a potentially arbitrary number of style references.

Figure 16: Comparison between the add operation and the Amplifier as the merging module in the calculation of the identity loss.
A run-time evaluation has also been conducted, and the results are displayed in Tab. 1. All the inputs are rescaled to 512 px × 512 px. In SST, due to the adoption of PA, our model is slightly slower than SANet [23] but is still very competitive compared to WCT [19] and Avatar-Net [25]. In terms of MST, our region-based feature fusion strategy can run at near real-time speed, faster than WCT [19] but slower than AdaIN [12] due to the expensive cost of clustering.

Figure 17: Investigation of different numbers of clusters.
Fig. 15 shows examples of MST with three references. Our region-based strategy is able to assign different styles to appropriate regions according to semantic correspondence, and it can potentially handle an arbitrary number of references.
5. Discussion

5.1. Amplifier

[23] introduces the identity loss to improve the content preservation and matching ability of the style-attention module. When calculating the identity loss, SANet merges the content feature $F_c$ with the swapped style feature $F_s$ by $F_{cs} = F_c + F_s$, the same as in the normal inference process. However, $F_c$ already contains the necessary information to complete the reconstruction. The chances are that although the network is capable of rebuilding the image, the weights of the attention module are wrongly trained to be 0, which means it has no effect at all. To fix this vulnerability, we apply Eq. 9 to replace the original add operation. Without the supply of the content image, the PA module is confronted with a bigger challenge and forced to learn more accurate correspondences, which is corroborated by experiments. For example, in Fig. 16, with the add operation as the merging module, the wings of the bird are wrongly matched with the background of the flower reference. On the other hand, when the Amplifier is utilized, the wings of the bird are transferred to a green color in accordance with the bird reference.

5.2. Number of Clusters

To investigate how the number of clusters (K) affects the MST results, we carry out experiments with various content-style pairs, two of which are shown in Fig. 17. The experimental results illustrate that the quality of the synthesized result is not sensitive to K when K lies in a restricted range. Typically, K between 5 and 7 inclines to produce appealing results. When K is relatively large, the content image is segmented into smaller regions possessing similar characteristics, which are very likely to be assigned the same style. However, if we further increase K, unpleasant patterns will occur, since small segments are easily influenced by local features and noise, thus producing false matches. It should also be noted that when K is set to a small number, the results are sensitive to the initial seeds of K-means and are not consistent with the semantic information of the content image.
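For completeness, below is a minimal sketch of the Amplifier of Sec. 5.1 (Eq. 9): a single learnable scalar replaces the usual $F_c + F_s$ merge during identity-loss training. The initialization of $k$ is our assumption.

```python
import torch
import torch.nn as nn

class Amplifier(nn.Module):
    def __init__(self, init_scale=1.0):
        super().__init__()
        # A single learnable scalar: F_cs = k * F_s^{fused'} (Eq. 9)
        self.k = nn.Parameter(torch.tensor(init_scale))

    def forward(self, f_fused_prime_s):
        # No content feature is added, so the PA module must carry all
        # the information needed to reconstruct the image.
        return self.k * f_fused_prime_s
```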
Semantic mismatch.
This phenomenon can be attributed to the limitation of the encoder. Since VGG-19 is pretrained on ImageNet, it may not be able to handle objects that are beyond the predefined categories. Also, there is a distinct domain gap between photos and paintings. As a consequence, some style patterns may be too abstract for VGG-19 to extract accurate semantic information from. For example, the cloud in the 4th column of Fig. 9 is wrongly transformed into the pattern of the ground rather than that of the cloud in the style reference. We believe the development of a more suitable encoder for style images will help to alleviate this problem.
Halos near the boundary.
The segmentation we apply on features is coarser than a segmentation of the original image due to the shrinking of the spatial size, and this deviation is amplified by the upsampling process, leading to halos. A progressive fusion strategy may be a good direction to resolve this problem.
6. Conclusion
In this work, we propose an advanced style transfer network and an efficient region-based multi-style transfer strategy. The proposed patch attention module dramatically elevates the ability of semantic style transfer and is applicable to any current attention-based model. Also, we come up with the first region-based strategy for MST, which proves to be efficient and capable of improving the consistency of multi-style transfer. Comprehensive experiments demonstrate that our proposed method compares favorably to other existing methods.

Figure 18: More results of Style Mixer. The upper rows are SST results which showcase the semantic matching competency of our backbone model, i.e., the eyes of the lady are transferred accordingly in the 5th column. The two rows below provide more examples to demonstrate the superiority of Style Mixer in terms of MST.
Acknowledgement
We thank the anonymous reviewers for helping us to improve this paper, and we are grateful to the authors of our image and style examples. This work was partly supported by CityU start-up grant 7200607 and Hong Kong ECS grant 21209119.
References

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[2] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua. StyleBank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1906, 2017.
[3] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[4] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. Proc. of ICLR, 2, 2017.
[5] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[6] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[7] L. A. Gatys, A. S. Ecker, M. Bethge, A. Hertzmann, and E. Shechtman. Controlling perceptual factors in neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3985–3993, 2017.
[8] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[9] S. Gu, C. Chen, J. Liao, and L. Yuan. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8222–8231, 2018.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[12] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
[13] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song. Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 238–254, 2018.
[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[15] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[16] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[17] X. Li, S. Liu, J. Kautz, and M.-H. Yang. Learning linear transformations for fast arbitrary style transfer. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[18] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Diversified texture synthesis with feed-forward networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3920–3928, 2017.
[19] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
[20] Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
[21] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.
[22] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
[23] D. Y. Park and K. H. Lee. Arbitrary style transfer with style-attentional networks. arXiv preprint arXiv:1812.02342, 2018.
[24] E. Risser, P. Wilmot, and C. Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.
[25] L. Sheng, Z. Lin, J. Shao, and X. Wang. Avatar-Net: Multi-scale zero-shot style transfer by feature decoration. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pages 1–9, 2018.
[26] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, volume 1, page 4, 2016.
[27] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6924–6932, 2017.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[29] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[30] X. Wang, G. Oxholm, D. Zhang, and Y.-F. Wang. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5239–5247, 2017.
[31] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.
[32] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[33] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
[34] Y. Yao, J. Ren, X. Xie, W. Liu, Y.-J. Liu, and J. Wang. Attention-aware multi-stroke style transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[35] C. Zhang, Y. Zhu, and S.-C. Zhu. MetaStyle: Three-way trade-off among speed, flexibility, and quality in neural style transfer. arXiv preprint arXiv:1812.05233, 2018.
[36] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.