Image Synthesis and Style Transfer
Somnuk Phon-Amnuaisuk, Media Informatics Special Interest Group, Centre for Innovative Engineering, Universiti Teknologi Brunei; School of Computing & Informatics, Universiti Teknologi Brunei. [email protected]
Abstract.
Affine transformation, layer blending, and artistic filters are popular processes that graphic designers employ to transform the pixels of an image to create a desired effect. Here, we examine various approaches that synthesize new images: pixel-based compositing models and, in particular, distributed representations of deep neural network models. This paper focuses on synthesizing new images from a learned representation model obtained from the VGG network. This approach offers an interesting creative process because of the distributed representation of information in the hidden layers of a deep VGG network, i.e., information such as contour, shape, etc. is effectively captured in the hidden layers of the network. Conceptually, if Φ is the function that transforms input pixels into the distributed representation h of the VGG layers, a new synthesized image X can be generated from its inverse function, X = Φ^{-1}(h). We describe the concept behind the approach and present some representative synthesized images and style-transferred image examples.

Keywords: Image synthesis, Creative generative process, Distributed representations, VGG network
Computers have been extensively employed to process 2D vector graphics and raster graphics. Human designers apply various techniques such as cropping, compositing, transforming (e.g., scaling and rotating), and applying various visual effects using filtering techniques. We may further divide the processing of raster graphics content into two main approaches: (i) manually or algorithmically composing existing images into new graphical content; and (ii) creating new graphical content using generative models that are algorithmically crafted [1–4] (one of the pioneers in this area is Harold Cohen, aaronshome.com/aaron/index.html) or learnt from image examples, e.g., deep dream (https://github.com/google/deepdream). The first approach is the more popular contemporary technique, where graphic designers employ off-the-shelf graphics authoring tools to create new content from existing images. In this work, we are particularly interested in the second approach, where new graphical content is synthesized at the pixel level using a generative model learnt from examples. This approach has received much interest in recent years.

Recent advances in deep neural networks have shed some interesting light on knowledge representation and learning issues [5, 6]. It has been found that the hidden nodes of a deep convolutional neural network (CNN) trained with audio or visual stimuli can represent the fundamental frequency of a sound or basic visual patterns [7, 8]. These ideas motivate us to look into the synthesis of an image from a distributed representation. In this paper, we explore a generative model that synthesizes new images using a distributed representation obtained from the VGG16 network [9]. More information about the VGG16 deep convolutional neural network will be discussed in Section 4.

The rest of the paper is organized as follows: Section 2 discusses the background and some representative related works; Section 3 discusses our approach and gives the details of the techniques behind it; Section 4 provides a critical discussion of the output from the proposed approach; and finally, the conclusion and further research are presented in Section 5.

A two-dimensional image is a projection of the three-dimensional world onto a 2D plane, with the pixel as the primitive unit. This process abstracts the world to only pixel intensity and colour. Features such as histograms, edges, contours, corner points, object skeletons after erosion, lines, regions, Fourier coefficients, convolution filter coefficients, etc. have been extensively exploited to infer the original information of the world from pixel information. Much research effort in computer vision has been poured into object recognition and the understanding of scenic content from features derived from pixel information. It is interesting to explore whether it would be possible to reverse the process and synthesize an image from the features mentioned.

Abstract patterns have been commonly generated using mathematical functions [10]. Random patterns, fractals, and abstract patterns can normally be expressed using rather short finite-length programs. Although many insights have been observed and formulated in the graphic design domain, such as the rule of thirds and the golden ratio, there is still a big gap in our understanding of how a non-abstract image can be automatically generated by a computer program, i.e., creating an outdoor scene on a blank canvas.
Due to this complexity, researchers have often chosen to explicitly describe the generative process as a computer program [1] or to describe a sequence of image processing operations to be applied to existing images, e.g., non-photorealistic rendering (NPR) [11]. In the more popular contemporary approach, a new image is created by modifying an original image using transformations, artistic filters, and various composition tactics. The existing images go through various processes in which their colour, shape, and texture are modified, and they are then re-composited into a new image [12]. (In this paper, the terms synthesis and generate are used interchangeably.) In this style, the content is consciously modified by a human designer who asserts all extraneous information to the composited content. Various commercial graphics software packages have been designed to assist human designers in this approach. Humans employ both top-down and bottom-up creative processes here.

One of the early works attempting to automate the above creative process is [14], where the authors propose the concept named image analogies. Given images A, A′ and B where f(A) → A′, f is a function that transforms A to A′; e.g., f could be an artistic filter function. The authors show that the transforming function f can be learnt from an example pair (A, A′) and then applied to generate a new output B′ from B. This approach provides an interesting style transfer process.

With the recent advances in deep learning, it has been shown that the convolution filters of a convolutional neural network (CNN) exhibit important characteristics of the way our brain responds to visual patterns; e.g., simple patterns such as lines, dots, and colours emerge in early layers, while complex patterns such as textures and compound structures emerge in deeper layers [7, 8]. The weights of a CNN can be seen as a function f : I → h that re-represents an input image I in the hidden nodes h of the CNN. Given an input image to a trained CNN, information residing in the trained CNN can be transferred to the input image by enhancing the activation signal of the hidden nodes that respond strongly to the input image. The gradient of the hidden nodes with respect to the input image can be used to modify the input image. Iteratively repeating this process enhances, in the resulting image, the features strongly correlated with those hidden nodes, e.g., abstract geometrical patterns and complex patterns that have been learnt by the network. Google's deep dream is one of the influential works that employs this technique to generate images from a network trained with images of cats and dogs. Feeding it a new input image such as the sky, the network produces a kind of hallucinative flavour by embedding parts of cats and dogs in the output image. This has sparked many subsequent works in image synthesis and style transfer [15].

A composite model generates a new image by modifying existing pixel information: through the modification of pixel intensity, through affine transformation or convolution, or by compositing pixels from various sources. We highlight some important operations in the categories below.
Convolutional Filters:
Let X_i^{r×c} be the matrix of pixels of image i, where x_i(r, c) is the pixel intensity at row r and column c. We can express the convolution operation between image i and a convolution kernel K as

X′ ← K ∗ X_i    (1)

where ∗ is the convolution operator and X′ is the output image.

Layers Blending: We can express the composition between images i and j as

X′ ← α X_i + (1 − α) X_j    (2)

where α ∈ [0, 1] controls the blending percentage of images i and j. Various operations such as affine transformation, intensity/colour manipulation, and convolution can be expressed as a sequence of these basic processes.

Fig. 1. Composite models: the image in column two is generated by applying a filter to emulate the pencil sketch effect (see [12]). The image in column three is generated by compositing two images together.
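For readers who want to experiment with this behaviour, a minimal NumPy/SciPy sketch of Eqs. 1 and 2 might look as follows; the kernel and the blending weight α are illustrative choices, not the settings used to produce Fig. 1.

```python
import numpy as np
from scipy.signal import convolve2d

def convolve_image(x_i, kernel):
    """Eq. 1: X' <- K * X_i for a single-channel image."""
    return convolve2d(x_i, kernel, mode="same", boundary="symm")

def blend_images(x_i, x_j, alpha=0.5):
    """Eq. 2: X' <- alpha*X_i + (1 - alpha)*X_j, with alpha in [0, 1]."""
    return alpha * x_i + (1.0 - alpha) * x_j

# Illustrative usage with random "images" and a 3x3 edge-emphasis kernel
# (hypothetical values; any kernel or alpha could be substituted).
x_i = np.random.rand(256, 256)
x_j = np.random.rand(256, 256)
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)

filtered = convolve_image(x_i, kernel)
blended = blend_images(x_i, x_j, alpha=0.7)
```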
Fig. 2. Automated compositing models: the images in columns one and two are generated using cellular automata [16] with rule 30 and rule 90 respectively. The image in column three is generated using a compact mathematical formula, while the image in column four is generated using an elaborate procedure that draws sunflower seeds, petals and their compositions (see [1]).
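As an illustration of how short such generative programs can be, the following sketch implements a generic elementary cellular automaton; rules 30 and 90 are the ones mentioned in the caption of Fig. 2, although this is not the exact code used to produce that figure.

```python
import numpy as np

def elementary_ca(rule=30, width=256, steps=128):
    """Evolve a 1-D binary cellular automaton; each row becomes one
    scanline of the output pattern (rule 30 and rule 90 are the rules
    mentioned for Fig. 2)."""
    # Rule table: neighbourhood (left, centre, right) encoded as 0..7.
    table = [(rule >> i) & 1 for i in range(8)]
    row = np.zeros(width, dtype=np.uint8)
    row[width // 2] = 1                       # single seed cell
    rows = [row.copy()]
    for _ in range(steps - 1):
        left, right = np.roll(row, 1), np.roll(row, -1)
        idx = (left << 2) | (row << 1) | right
        row = np.array([table[i] for i in idx], dtype=np.uint8)
        rows.append(row.copy())
    return np.array(rows)                     # shape (steps, width), values {0, 1}

pattern30 = elementary_ca(rule=30)
pattern90 = elementary_ca(rule=90)
```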
Figure 1 shows the output of the composite approach. Columns one and three show the original images, and columns two and four show the processed images. The outputs in columns two and four are generated using Eqs. 1 and 2 respectively. The generated images closely resemble the original images. In this style, the variation is manually controlled by a human graphic designer. The process above could be automated using computers. By weaving many small generative units together into algorithms, complex processes such as drawing sunflower seeds and petals can be automated [1]. Figure 2 shows more complex composite models generated by computer programs.
Let X^{r×c} be a matrix of input pixels and G_h^{r×c} be a gradient matrix computed from the hidden nodes' parameters. Conceptually speaking, a generative model that optimizes the activations of hidden nodes can be expressed as:

X ← X + δ G_h    (3)

where δ is the step size; that is, an image X is gradually transformed according to the added gradient. Images produced in this style are interesting, but their contents are random in nature, since the generated image depends on the input pixels and on the contents learnt by the network. There is no means to control the relationships among the components of the generated contents.

A generative model can be viewed as a function. If this function can be precisely determined then, given an input image, a synthesized image can be computed; and vice versa, given a synthesized image, the original image can be computed using the inverse function. Here, the functionality of this generative function is emulated in an artificial neural network architecture using the process explained below. Let Φ be the function that transforms input pixels into a feature vector h derived from hidden nodes (e.g., weights or activations) distributed in different layers of the network, h ← Φ(X). Conceptually, the original image can be reproduced from the distributed representation h [17]:

X ← Φ^{-1}(h)    (4)

This concept can be extended to the generation of X from weighted representations h_i of multiple transform functions Φ_i:

X ← Σ_i w_i Φ_i^{-1}(h_i)    (5)

We exploit the VGG deep neural network in our image synthesis task. VGG is the acronym of the Visual Geometry Group at the University of Oxford. The group has released two fully trained deep convolutional neural networks, VGG16 and VGG19, to the public. Here, we experiment with image synthesis using parameters read from the convolution layers of the VGG16 network. Fig. 3 shows the architecture of the VGG16 network. Each block represents a hidden layer; for example, the first block denotes 64 convolution filters of size 3 × 3.

Fig. 3. The graphical abstract representation of the VGG16 network.
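For concreteness, a minimal PyTorch/torchvision sketch of reading activations h^l from the convolution layers of a pre-trained VGG16 is given below. The particular layer indices correspond to the commonly used conv1_1 through conv5_1 layers and are an assumption for illustration, not necessarily the configuration used in our experiments.

```python
import torch
import torchvision.models as models

# Load a pre-trained VGG16 and freeze it; we only read activations.
# (Newer torchvision versions use the `weights=` argument instead.)
vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_activations(x, layer_ids=(0, 5, 10, 17, 24)):
    """Return the activations h^l of selected convolution layers
    for an input batch x of shape (1, 3, H, W)."""
    acts = {}
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in layer_ids:
            acts[idx] = x
    return acts

# Illustrative call on a random image-sized tensor.
h = vgg_activations(torch.rand(1, 3, 224, 224))
```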
Feeding an image X_i^{r×c} to the VGG network, we observe the activations h_i^l in all hidden convolution layers l. In other words, the activations in all hidden layers represent the pixel information of the input image and, conceptually, a copy of X should be reproducible by reversing the process, X′_i ← Φ^{-1}(h_i^l). In this work, instead of analytically solving for the function Φ and its inverse Φ^{-1}, we adopt an optimization method that gradually adjusts the activations of hidden nodes towards the desired h_i^l.

In [18], the authors approach the style transfer task by minimizing two kinds of loss functions: content loss and style loss. Let P^l ∈ R^{N_l×M_l}, S^l ∈ R^{N_l×M_l} and F^l ∈ R^{N_l×M_l} be three matrices derived from layer l of the VGG network fed with the content image, the style image and the noise input respectively. N_l denotes the number of feature maps and M_l denotes the size of a feature map. The style loss and the content loss are defined as follows:

L_style = (1 / (4 N_l² M_l²)) Σ_{i=1}^{N_l} Σ_{j=1}^{N_l} (G^l_{ij} − A^l_{ij})²    (6)

L_content = (1/2) Σ_{i=1}^{N_l} Σ_{j=1}^{M_l} (F^l_{ij} − P^l_{ij})²    (7)

A^l_{ij} = Σ_{k=1}^{M_l} S^l_{ik} S^l_{jk}    (8)

G^l_{ij} = Σ_{k=1}^{M_l} F^l_{ik} F^l_{jk}    (9)

where A^l_{ij} and G^l_{ij} are the Gram matrices calculated from the inner products of S^l_{ik} S^l_{jk} and F^l_{ik} F^l_{jk} respectively.

Synthesized Images

Figure 4 presents twelve images in two groups. The first and third rows are the original images; the second and fourth rows are the images synthesized from them. These twelve images are chosen to highlight the characteristics of images synthesized using Eq. 6. Four patterns (in the two rightmost columns) are chosen to display strong long-range dependency. It is clear that all synthesized images successfully capture local dependency using the L_style loss. However, long-range dependency is not successfully captured. It is clear from the synthesized images in the two rightmost columns that, although the synthesized patterns seem to capture the style (local dependency) in general, the dependencies among components over longer distances are lost and a lot of information is missing.

Fig. 4. Synthesized images using equation 6 display good similarities at the local level, but relationships among image components over a spatial distance are lost (see the synthesized images in the last two columns).
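To make Eqs. 6–9 concrete, the Gram matrix and the two per-layer losses can be sketched as follows; the normalization follows the standard formulation in [18], and the variable names are illustrative.

```python
import torch

def gram_matrix(feat):
    """feat: feature maps of one layer, shape (N_l, H, W).
    Returns the N_l x N_l Gram matrix of Eqs. 8/9."""
    n_l, h, w = feat.shape
    f = feat.reshape(n_l, h * w)          # N_l x M_l
    return f @ f.t()

def style_loss(f_gen, f_style):
    """Eq. 6 for a single layer."""
    n_l, h, w = f_gen.shape
    m_l = h * w
    g = gram_matrix(f_gen)
    a = gram_matrix(f_style)
    return ((g - a) ** 2).sum() / (4.0 * n_l ** 2 * m_l ** 2)

def content_loss(f_gen, f_content):
    """Eq. 7 for a single layer."""
    return 0.5 * ((f_gen - f_content) ** 2).sum()
```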
An image synthesized using the loss function from equation 7 is an exact replica of the content image, since the generation is based on the content loss alone. By combining the content loss and the style loss, the relationships among components can be obtained via the content loss while the texture is obtained via the style loss. Figure 5 shows images synthesized from equations 6 and 7, which represent the style loss and the content loss respectively. Combining weighted losses from both functions produces an interesting output, since pixel information from two different sources is blended together. The blending is not according to the spatial positions of the pixels (as in Eq. 2) but according to a deeper abstraction obtained from the hidden layers of a deep neural network. This gives a kind of control known as style transfer, where one image provides information about the content and the other image provides information about the style. The newly synthesized images successfully capture both content and style information. This provides a new and interesting generative approach.

Fig. 5. Images of a mask (top row) and images of a child (bottom row) are synthesized using a combination of the style loss (Eq. 6) and the content loss (Eq. 7).
Reflection & Discussion
In [14], image analogies learn a generative function between a pair of images, g : A → A′. The generative function g(·) learns a specific transformation which can then be applied to other images. The transformation is, however, limited to the specific learnt function g(·). Leveraging recent advances in deep learning, pre-trained models (e.g., deep dream, VGG networks) are employed in the generative process. This allows a richer transformation style, since the deep neural network acts as a transform function that re-represents the information of a given image in its hidden layers. In [15], two classes of loss functions, the content loss L_content and the style loss L_style, are proposed. This allows different combinations of style and content to be realized with ease.

We offer a summary of the creative process using a distributed representation as follows. Let N^{r×c} and T^{r×c} be the matrices of input pixels from white noise and from the target image T respectively. Feeding N^{r×c} and T^{r×c} to the VGG network produces two sets of activations, h_n^l and h_t^l, in the hidden layers. Gradually reducing the discrepancy between h_n^l and h_t^l should, conceptually, synthesize an image based on information from the image T^{r×c}. Let G_L^{r×c} be a gradient matrix computed from the loss function L(h_n, h_t); the generative model can then be expressed as an iterative update:

N^{r×c} ← N^{r×c} + δ G_L^{r×c}    (10)

where δ is the step size. That is, an image N is gradually transformed into a new image using information from T. The synthesized image will share many characteristics with the original image, depending on the loss functions. The content loss is, in essence, the difference between the synthesized image (initialized with white noise) and the target image:

L_c(h_n, h_t) ∝ k_c Σ_l (h_n^l − h_t^l)²    (11)

where k_c is a constant normalizing the loss. L_c minimizes a one-to-one relationship between the nodes in the hidden layers and thus preserves the original content. On the other hand, the style loss L_s minimizes the difference between the Gram matrices in the hidden layers. Minimizing the Gram matrix difference abstracts away spatial information, since the inner product only correlates the feature maps as a whole and not the detail inside each feature map:

L_s(h_n, h_t) ∝ k_s Σ_l ((h_n^l)ᵀ h_n^l − (h_t^l)ᵀ h_t^l)²    (12)

where k_s is a constant normalizing the loss. In [19], the authors argue that the essence of style transfer is to match the feature distributions of the style image and the generated image, and they show that minimizing the Gram matrix difference is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel.
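A compact sketch of the iterative update in Eq. 10, reusing the activation-extraction and loss functions sketched earlier, could look as follows; the optimizer, step count, and loss weights are illustrative assumptions rather than the settings used for the figures above.

```python
import torch

def synthesize(vgg_forward, content_acts, style_acts,
               shape=(1, 3, 224, 224), steps=200, lr=0.05,
               w_content=1.0, w_style=1e3):
    """Eq. 10: start from a white-noise image N and iteratively update it
    so that its VGG activations h_n move towards the targets h_t.
    `vgg_forward`, `content_loss` and `style_loss` are the functions from
    the earlier sketches; the weights are illustrative."""
    n = torch.rand(shape, requires_grad=True)        # white-noise image N
    opt = torch.optim.Adam([n], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        acts = vgg_forward(n)
        loss = sum(w_content * content_loss(acts[l][0], content_acts[l][0]) +
                   w_style * style_loss(acts[l][0], style_acts[l][0])
                   for l in acts)
        loss.backward()     # gradient of the loss w.r.t. the pixels of N
        opt.step()          # nudge N by a small step, as in Eq. 10
    return n.detach()
```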
Conclusion & Future Work

The synthesis of an image using information obtained from distributed VGG layers has many strengths: (i) the approach often produces visually appealing images, more appealing than those produced by a filter technique, e.g., artistic filters; and (ii) the approach offers a flexible means of combining different content images and style images. The synthesized output convincingly shows that the style loss produces an image with a clear local texture, but one that often lacks a clear relationship among texture components over a long spatial distance. Source images with a strong local texture, such as pebbles or line drawings, produce impressive outcomes.

The issue of long-range dependency is a universal issue in all domains, and researchers have approached it differently in different domains. For example, Long Short-Term Memory (LSTM) [20] is an enhanced recurrent neural network that has been successfully applied to speech, text and image processing. Combining content loss and style loss to synthesize a new image offers a means of dealing with the long-range dependency issue in images. The approach always produces interesting output, since the content loss preserves the content while the style loss decorates the existing content with the style texture. In future work, we wish to further explore how to assert controls into the generative process [1, 21].
Acknowledgments
We would like to thank the GSR office for the partial financial support given to this research.