Transform the Set: Memory Attentive Generation of Guided and Unguided Image Collages


Nikolay Jetchev

Zalando Research, Berlin, Germany

Urs Bergmann

Zalando Research, Berlin, Germany

Gokhan Yildirim

Zalando Research, Berlin, Germany

Abstract

Cutting and pasting image segments feels intuitive: the choice of source templates gives artists flexibility in recombining existing source material. Formally, this process takes an image set as input and outputs a collage of the set elements. Such selection from sets of source templates does not fit easily in classical convolutional neural models requiring inputs of fixed size. Inspired by advances in attention and set-input machine learning, we present a novel architecture that can generate, in one forward pass, image collages of source templates using set-structured representations. This paper has the following contributions: (i) a novel framework for image generation called Memory Attentive Generation of Image Collages (MAGIC), which gives artists new ways to create digital collages; (ii) from the machine-learning perspective, we show a novel Generative Adversarial Network (GAN) architecture that uses Set-Transformer layers and set-pooling to blend sets of random image samples, a hybrid non-parametric approach. Upon publication of the paper, we will release code allowing artists and researchers to use and modify MAGIC.

Figure 1 (panels: Content, Templates, Weights, Blended collage, Refined output, Zoom 3.5x): Target content $I_C$ is a human portrait of size 1200x1600 pixels. The set $M$ of 50 memory templates is sampled by cropping from a few satellite images of Berlin (4 shown). MAGIC outputs a collage by predicting blending weights $A$ (visualized with a random color for each element). The collage $I_M$ is a convex combination of the memory templates, and the output $I$ is further refined by a convolutional network for better perceptual quality. Trained using patches of size 256px.

Collages are a classical technique. Throughout the ages, artists have stitched pieces of paper together for a surreal abstraction or mosaic-like effect to paint a target image [16]. Modern digital collage methods [9] solve an optimization problem to paint a target image by combining a set of memory templates. However, the output has non-overlapping tiles with clear borders. Non-parametric quilting [1] combines patches to reconstruct a target image ("texture transfer"). However, stitching errors happen and there is no way to ensure perceptually plausible output. The methods of [12, 13] use neural patch matching (find and copy the closest patch in feature space) combined with a content loss, but rely on pre-trained networks, which can negatively impact performance if the image distributions differ too much. As a further drawback, all of [1, 12, 13] are slow due to exhaustive patch search routines, and optimization is done from scratch for one input memory set and one output image only, with no learning. In contrast, our method MAGIC learns to reason and generalize over the memory sets and their interactions, and so makes collages in a single forward pass, amortizing the expensive optimization.

Figure 2 (panels: collage output, warped templates, copied segments, weights): Using MAGIC for unsupervised generative model learning: given randomly sampled sets of images $M$ (256x256 pixels), we learn to predict coefficients $A$ to create a blended collage $I_M$, subsequently refined into face $I$.

Convolutional networks can be trained to represent in their parameters the statistics of a training image distribution. Learning is adversarial (GANs [3]) or supervised [2]. They are also used to create smooth mosaics stylizing a target content, using optimization [2, 6] or, more efficiently, a single forward pass [4]. However, such stylization is not a collage, since parametric methods do not recombine a set of clearly visible style memory templates. While GANs trained properly (big data, big model, long training times) can generate convincing images, this is expensive and data hungry, and can fail to reproduce perfectly all the details of the training data. In contrast, collaging copies image patches and preserves visual details, without model capacity limitations. We propose an architecture that combines the collaging approach (non-parametric: preserves visual details) and the GAN approach (parametric: differentiable training, perceptually plausible, fast inference). Our method MAGIC is such a hybrid generative model: using end-to-end learning, it learns how to recombine sampled sets of input memory templates into collages, and it learns how to refine the collages in a smooth way.

Creating a collage from source images is an inherently set-structured problem. The order of the images should not matter: the method needs to be permutation-invariant. We propose to use the Set Transformer (ST) [11] as a way to give the MAGIC model the ability to operate permutation-invariantly on randomly sampled sets of memory templates.
Formally, in a non-parametric collage setting the inputs are style images $I_T$, from which we can define a cropping patch distribution $I_t \sim P_T$ of patches of fixed size $H \times W$. It is used to sample independently $K$ elements ($K$ can also vary randomly), forming the memory template set $M$. Sampling such sets $M \sim P_T$ provides the input to the generator $G$: they replace the usual prior noise distributions for GANs (e.g. Gaussian). The generator first calculates the set of spatial blending weights $A$, and then uses them to generate the collage image $I_M$ as a convex combination of the elements of $M$. Please see Appendix I for details on the generator structure and training loss.

The training of MAGIC optimizes an adversarial loss [3], ensuring that the generated output distribution is perceptually close to the training style distribution $P_T$. Once the model is trained, we can sample a new memory set $M$, and a single forward pass of MAGIC quickly infers the output collage. This can be useful for creative exploration of various sampled sets and their combinations, or for smooth stylization of animations and videos. We can also view MAGIC as performing relaxed set-structured amortized quilting, since it learns a model to predict blending weights over the whole distribution of memory templates $P_T$.

We illustrate how MAGIC works in two different tasks. First, we can generate seamless collages in a content-guided way (Fig. 1). We add as additional input a set of content images $I_C$, from which we can define the cropped content patch distribution $I_c \sim P_C$, with the same size as the style patches. Given a patch $I_c$, we replicate it $K$ times and concatenate it to the templates $M$, i.e. we add 3 extra image channels. For training we need a content reconstruction loss $L_{content}$ [4, 6], in addition to the adversarial loss. The augmented set $M$ informs $G$ how to minimize the loss $L_{content}$ and is informative for the interactions of the template set and the target content.
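The blending step described above (predict weights $A$, then form $I_M$ as a convex combination of the elements of $M$) can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the function names `softmax_over_set` and `blend_collage`, the nested-list tensor layout and the toy values are all assumptions made for this example. It also checks that permuting the set leaves the collage unchanged.

```python
import math

def softmax_over_set(logits):
    """Softmax across the K set elements, independently at each pixel.

    logits: nested list of shape [K][H][W]; returns weights of the same shape.
    """
    K, H, W = len(logits), len(logits[0]), len(logits[0][0])
    weights = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for h in range(H):
        for w in range(W):
            column = [logits[k][h][w] for k in range(K)]
            peak = max(column)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for k in range(K):
                weights[k][h][w] = exps[k] / total
    return weights

def blend_collage(templates, weights):
    """Convex combination of templates [K][3][H][W] with weights [K][H][W]."""
    K, C = len(templates), len(templates[0])
    H, W = len(weights[0]), len(weights[0][0])
    return [[[sum(weights[k][h][w] * templates[k][c][h][w] for k in range(K))
              for w in range(W)] for h in range(H)] for c in range(C)]

# Toy example (hypothetical values): K = 2 constant-colour templates of size 1 x 2.
templates = [
    [[[1.0, 1.0]], [[0.0, 0.0]], [[0.0, 0.0]]],  # pure red
    [[[0.0, 0.0]], [[0.0, 0.0]], [[1.0, 1.0]]],  # pure blue
]
logits = [[[0.0, 5.0]], [[0.0, -5.0]]]  # stand-in for the network output
A = softmax_over_set(logits)
collage = blend_collage(templates, A)

# Permuting templates and weights together leaves the collage unchanged.
collage_permuted = blend_collage(templates[::-1], A[::-1])
```

At the first pixel both templates receive weight 0.5, while at the second pixel the first template dominates. A practical implementation would use batched tensor operations rather than Python loops.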
Once trained on patches, MAGIC inference can be rolled out on any image size due to the fully convolutional generator. The fully convolutional model can create very large images at inference time: all calls to $G$ can be efficiently split into small chunks seamlessly forming a whole image [7].

Second, we can learn a collage-based statistical model of data in an unsupervised way; see Fig. 2 for an example on the CelebA dataset [14]. Such generative learning is an interesting alternative to traditional parametric generative models: MAGIC learns to generate images by recombining elements from the randomly sampled memory template set ("copy and paste"), rather than requiring large model capacity to learn the full data statistics. In addition to the blending coefficients $A$, the generator can also predict transformation parameters and warp each element of the memory set $M$ before blending them into a collage. The integration of warping and blending allows an even more expressive collage model.

In related work, FAMOS [8] also introduced a hybrid (non-)parametric model to blend templates. However, MAGIC is a more general and advanced method that can cut memory templates in a much finer way; see Appendix III for a detailed comparison.

Acknowledgements

The authors would like to give special thanks to Kashif Rasul, whose PyTorch expertise greatly aided the project, and to Roland Vollgraf for the many useful generative model discussions.

Ethical Considerations

The MAGIC algorithm we presented here is a tool for digital collages. It allows artist practitioners to experiment and iterate faster when exploring various artistic choices, e.g. the choice of source material memory templates as style or which content image to stylize. While the new tool amortizes the costs of the collage process (it is faster than either manually cutting paper or using traditional optimization-based tools), it is inherently meant to be part of a collaboration between artist and AI machine. In that sense, we would say that our tool is ethical and does not risk radical negative disruption of the artistic landscape (which is a risk when tools completely automate and replace processes). MAGIC is rather a subtle evolution towards finer and faster intelligent control of specific steps involved in the artistic process for a specific artform: collages and mosaics. Such a tool empowers artists to explore and create more interesting artworks.

References

[1] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2001.
[2] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[3] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, 2014.
[4] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. CoRR, abs/1611.07004, 2016.
[5] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems 28, 2015.
[6] Nikolay Jetchev, Urs Bergmann, and Calvin Seward. GANosaic: Mosaic creation with generative texture manifolds. CoRR, abs/1712.00269, 2017.
[7] Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. CoRR, abs/1611.08207, 2016.
[8] Nikolay Jetchev, Urs Bergmann, and Gökhan Yildirim. Copy the old or paint anew? An adversarial framework for (non-) parametric image stylization. CoRR, abs/1811.09236, 2018.
[9] J. Kim and F. Pellacini. Jigsaw image mosaics. In Proc. of the 29th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, 2002.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[11] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer. CoRR, abs/1810.00825, 2018.
[12] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, pages 2479–2486. IEEE Computer Society, 2016.
[13] Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. Visual attribute transfer through deep image analogy. ACM Trans. Graph., 36(4):120:1–120:15, 2017.
[14] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
[15] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[16] Samuel Reilly. Stick 'em up! A surprising history of collage. 2019.

Figure 3 (diagram labels: Generator G, consisting of a Set-Transformer U-Net with Conv + SetTransformer blocks, Set-Pooling, and a Refinement U-Net with convolutional blocks; Discriminator D: patch discrimination, true/fake?): The generator $G$ has as input a random set $M \sim P_T$ consisting of $K$ images sampled independently, the template memory. The architecture of $G$ has three components. (i) A U-Net with ST blocks (see Figure 4) can reason about interactions between set elements and outputs the blending weights $A$. (ii) The permutation-invariant pooling operation creates the collage image $I_M \in \mathbb{R}^{3 \times H \times W}$ as a convex combination over the $K$ set elements using softmax on $A$. (iii) A purely convolutional U-Net refines $I_M \mapsto I$. The discriminator $D$ distinguishes true patches $I_t \sim P_T$ from the generated patches $I \sim G(M)$.
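The ST blocks of component (i) treat the $K$ feature vectors found at one spatial position as a set (see Figure 4). As a rough illustration of that data layout only, the sketch below splits a nested-list tensor of shape [K][C][H][W] into $HW$ sets of $K$ vectors and merges them back. The helper names `spatial_sets` and `merge_sets` are hypothetical; in a tensor library the same regrouping would typically be a permute followed by a reshape.

```python
def spatial_sets(x):
    """Split a tensor x of shape [K][C][H][W] into H*W sets.

    Each set holds K feature vectors of length C, all taken from the same
    spatial position (h, w), i.e. the slice x[:, :, h, w].
    """
    K, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[x[k][c][h][w] for c in range(C)] for k in range(K)]
            for h in range(H) for w in range(W)]

def merge_sets(sets, H, W):
    """Inverse of spatial_sets: rebuild the [K][C][H][W] tensor."""
    K, C = len(sets[0]), len(sets[0][0])
    return [[[[sets[h * W + w][k][c] for w in range(W)] for h in range(H)]
             for c in range(C)] for k in range(K)]

# H = W = K = 2 as in Figure 4, with C = 1 (an assumption for brevity).
# Each entry encodes its own index as 100*k + 10*(h+1) + (w+1).
H = W = K = 2
x = [[[[100 * k + 10 * (h + 1) + (w + 1) for w in range(W)] for h in range(H)]]
     for k in range(K)]
sets = spatial_sets(x)          # 4 sets, each with K = 2 elements
restored = merge_sets(sets, H, W)
```

A set layer would be applied to each of the `H * W` sets independently before the result is merged back into the original layout.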

Figure 4 (diagram labels: Convolutional Block, made of Conv+IN+ReLU applied twice; SetTransformer ST Block, made of reshape, ST, reshape; grid cells indexed 11, 12, 21, 22): The structure of the blocks of the U-Nets used inside $G$. For illustration, we show how the operation works for $H = W = K = 2$. The input tensor is $X \in \mathbb{R}^{K \times C \times H \times W}$ with $C$ channel feature maps. The set-transformer layer operates on $HW$ sets, each of $K$ elements, taken from the same spatial position inside the memory tensor $M$. Each slice $X_{:,:,h,w}$ across the $K$ dimension is a set.

Appendix I: Architecture and loss details

The structure of the generator $G$ is shown in Figure 3. The following tensors are used for the generator $G$ (for simplicity we skip the minibatch index and write the tensor sizes for a single-instance minibatch):

• All spatial patch dimensions are $H \times W$ for training.
• W.l.o.g. we can apply the generator on spatially larger input tensors $M$, since all convolutional and ST layers can handle such size adjustment.
• Let $K$ be the number of memory template set elements.
• The input is the sampled memory set $M \in \mathbb{R}^{K \times 3 \times H \times W}$.
• The output of the first ST-U-Net is $A \in \mathbb{R}^{K \times H \times W}$, a set of convex combination weights. By applying softmax on it we ensure that $\sum_{k=1}^{K} A_{khw} = 1$.
• The collage $I_M \in \mathbb{R}^{3 \times H \times W}$ is the convex combination of $M$ with $A$ as weights. This is a form of permutation-invariant set-pooling, since the output does not change if we permute $A$ and $M$ jointly along the first dimension, i.e. along the $K$ set elements.
• The second U-Net takes $I_M$ as input and outputs $I$, using the parametric architecture of stacked convolutional blocks to correct the collage results and make them more perceptually plausible.

$G$ takes as input randomly sampled memory sets $M$ with $K$ elements (optionally combined with content guidance patches $I_c$). This is fed to a U-Net (with skip connections as in [4]). However, in addition to standard convolutional blocks, we also add blocks that can process the set structure of the data using Set Transformer (ST) layers; we call this network architecture ST-U-Net. It outputs mixing coefficients $A$ and processes the set elements permutation-equivariantly (the order does not matter) while also taking care of the interactions. See Figure 4 for an illustration of exactly how the set operation is used. Note that the usual convolutional block is permutation-equivariant by definition, since it works on each element independently. Afterwards, by applying softmax on $A$, we can calculate a convex combination of the memories from $M$ and produce the collage image $I_M$ with permutation-invariant set pooling.

In addition, we can optionally apply spatial warping to each image in the memory set $M$, using a parametrization like the Spatial Transformer [5] or directly a full optical flow. For this purpose, we predict for each set element $k$ its deformation parameters $\theta_k$. We calculate these parameters $\{\theta_k\}_{k=1}^{K}$ as output of the ST-U-Net together with $A$, and apply the warping deformation before the set-pooling.

The discriminator $D$ uses a classical patchGAN [12] approach: it should discriminate the sampled training patches from the generated image patches. The overall loss for training MAGIC (and finding a good generator $G^*$) combines adversarial and content guidance terms (for the content-guidance use case, see [4, 6]):

$$L_{adv}(G, D) = \mathbb{E}_{I_t \sim P_T} \log D(I_t) + \mathbb{E}_{I_c \sim P_C, M \sim P_T} \log(1 - D(G(M, I_c))), \qquad (1)$$

$$L_{content} = \mathbb{E}_{I_c \sim P_C, M \sim P_T} \| \phi(I_c) - \phi(G(M, I_c)) \|, \qquad (2)$$

$$G^* = \arg\min_G \arg\max_D L_{adv} + \lambda_C L_{content}. \qquad (3)$$

Figure 5 shows visually how patch-based training works for our model, similar to other GAN methods. Figure 6 shows how much larger sizes are possible at inference by using convolutional roll-out. We can flexibly do inference on any image size, as long as the memory templates and guided images are cut to the right size. Stitching allows us to go beyond the GPU memory limits. E.g. for a poster of size 2000x2000 pixels we would usually take a 6x6 grid of overlapping squares of size 384x384 pixels, and trim their edges accordingly (see [7]).

Figure 5 (panels: sampled content patches, sampled style patches, generated patches with MAGIC (guided)): Illustrating the patch-wise training of MAGIC in guided mode. Training distribution: patches of size $H = W = 256$ px.

Figure 6: Illustrating the inference roll-out of the model. Sampled collages of size 2100x1600px using seamless stitching of rectangular components. Each component can be cut according to memory constraints.

Figure 7 (panels: collage output, templates, copied segments, weights): The effect of entropy regularization of the blending coefficients $A$. A large regularization term weight $\lambda_E = 5$ leads to solutions with coefficient values close to 0 or 1: low entropy, "hard attention" sparse solutions. E.g. the top row shows how we copy the hat and the left and right face parts from 3 different memory templates. Conversely, a weaker weight $\lambda_E < 1$ does not constrain the generator to predict low entropy coefficients $A$; it allows the generator to soften the blending weights, which leads to less sparse collages that softly blend the whole memory images. Example using the CelebA dataset at 256 pixels resolution, with $K = 3$ for inference.

Appendix II: Finer collage control by imposing additional GAN generator constraints

The outputs of the set-transformer U-Net are the convex combination weights $A$. These determine how the memory templates $M$ are blended (a form of set-pooling). By constraining the generator to output weights $A$ with different statistics, we gain a way to enforce different artistic choices for the collage generation. We experimented with three additional constraints that influence the generator $G$:

• The entropy $L_E(A)$ determines how sparse the convex combination weights are. Low entropy implies that one memory template is fully copied in a spatial region with weight $A_k = 1$, and the others are left out with weight $A_k = 0$. Conversely, high entropy gives softer results, blending different templates more gently with intermediate weights $0 < A_k < 1$. Please see Figure 7 for an example of how this changes the face blending for unsupervised MAGIC.
• The total variation $L_{TV}$ determines whether we have bigger segments with small borders, or many small segments that vary spatially. In Figure 8 we show the effects of such regularization, using as an example a large guided mosaic collage.
• It is desirable to have a collage using a varied selection of the memory template elements $M$. To achieve this, we propose to penalize the spatial size of the largest memory template over the whole spatial region. Formally, we define this term as $L_M(A) = \max_k \sum_{hw} A_{khw}$. This term is required especially for the unguided case, where a trivial solution would be to set $A_k = 1$ everywhere spatially and completely copy a single memory element. This would ideally fool the discriminator, but is a failure mode for the collage purpose.
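To make the three regularization terms listed above more concrete, here is a small pure-Python sketch. Only $L_M(A) = \max_k \sum_{hw} A_{khw}$ is given explicitly in the text; the exact forms used here for the entropy (mean per-pixel Shannon entropy) and the total variation (anisotropic, summed absolute neighbour differences) are assumptions, as are the function names and the toy weight maps.

```python
import math

def entropy_loss(A, eps=1e-12):
    """Mean per-pixel entropy of blending weights A of shape [K][H][W] (assumed form)."""
    K, H, W = len(A), len(A[0]), len(A[0][0])
    total = 0.0
    for h in range(H):
        for w in range(W):
            total -= sum(A[k][h][w] * math.log(A[k][h][w] + eps) for k in range(K))
    return total / (H * W)

def total_variation_loss(A):
    """Anisotropic total variation of A (assumed form): summed absolute neighbour differences."""
    K, H, W = len(A), len(A[0]), len(A[0][0])
    tv = 0.0
    for k in range(K):
        for h in range(H):
            for w in range(W):
                if h + 1 < H:
                    tv += abs(A[k][h + 1][w] - A[k][h][w])
                if w + 1 < W:
                    tv += abs(A[k][h][w + 1] - A[k][h][w])
    return tv

def max_usage_loss(A):
    """L_M(A) = max_k sum_{h,w} A[k][h][w], the area covered by the most-used template."""
    return max(sum(sum(row) for row in A_k) for A_k in A)

# Toy 2 x 2 weight maps for K = 2 templates (hypothetical values).
hard = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]    # each pixel copies one template
soft = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]    # uniform blending
single = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]  # trivial copy of template 0
```

In this toy setting the `hard` map has near-zero entropy but non-zero total variation, the `soft` map has entropy $\log 2$ and zero total variation, and the `single` map has the largest possible value of $L_M$, which is the trivial solution the third term is meant to penalize.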
Our design of $L_M(A)$ prevents this trivial single-template solution. By tuning the scalar weight $\lambda$ of each regularization loss term, we can tune its contribution to the total loss, in addition to the adversarial $L_{adv}$ and content $L_{content}$ terms already defined in the previous section:

$$G^* = \arg\min_G \arg\max_D L_{adv} + \lambda_C L_{content} + \lambda_{TV} L_{TV}(A) + \lambda_E L_E(A) + \lambda_M L_M(A) \qquad (4)$$

Appendix III: Detailed comparison with Fully Adversarial Mosaics (FAMOS)

We can compare MAGIC, the method presented in this paper, with FAMOS [8], another approach proposing a hybrid combination of non-parametric patch copying and fully adversarial end-to-end learning. The novel ST architecture of MAGIC improves image generation quality and convergence, allowing more flexible cutting of regions of interest from memory templates. Using the set structure allows MAGIC to flexibly generalize to randomly sampled sets $M$, unlike FAMOS, where a predefined ordered tensor $M$ was required and patch coordinates were "memorized" explicitly. We can compare FAMOS and MAGIC visually on the guided mosaic task, supported by both models. We visualize the results of FAMOS, in both fully parametric and hybrid memory copying mode, and of MAGIC, our novel method. For training we used patches cropped from 4 Milan city satellite images for the template distribution, and an Arcimboldo portrait of size 1200x1800 pixels as the content image.

• The fully parametric convolutional approach (top row) is smooth, but the city image lacks visual details and has some distortion: the training distribution of Milan city maps is not accurately learned.
• The hybrid memory template copying and refining approach (middle row) shows a different mosaic result. $K = 80$ memory templates were available, fixed for the whole training procedure. A few of the memory templates were copied as background, and then the convolutional layers added some more details on top of them, mainly depicting the content image more accurately. However, this collage has a quite rough structure: overly large segments are copied from the memory templates; see the plot of the mixing coefficients $A$ at the bottom right. While such large memory segment cutting also has a certain charming visual look, it gives only imprecise control over exact patch cutting and placement for the collage.
• For comparison, see the respective MAGIC results (bottom row), with $K = 40$ memory templates randomly sampled from a whole distribution. The size of the cut and pasted segments is much smaller, and MAGIC can control much better what is copied and pasted where. In addition, despite training with $K = 40$, MAGIC can work with a different number of set elements and sample them flexibly, an advantage of its set transformer generator architecture.

While all 3 tested methods can produce beautiful mosaics with good stylization and content properties, the aesthetic quality of a mosaic is a subjective estimate of the artist or audience. We think that the fine control that MAGIC offers over the placement and cutting of memory templates makes it a worthwhile addition to an AI artist toolbox.

Figure 8 (panel label: Zoom-in 3x): Example of a guided collage of a human portrait content (size 1200x1600 pixels) and Berlin city fragments used to sample a set of 50 memory templates. We illustrate the effect of TV regularization of the blending coefficients $A$ on the look of the generated collages $I$ and $I_M$. A small value $\lambda_{TV} = 1$ does not constrain the generator to output smooth $A$ and leads to smaller segment cuts, copying smaller details spatially and painting the content more accurately (top row). Conversely, a large regularization term weight $\lambda_{TV} = 100$ leads to solutions with small total variation, implying bigger segments with smoother borders (bottom row).
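The remark that a set transformer generator can work with a different number of set elements can be illustrated with plain scaled dot-product self-attention, the core operation inside Set Transformer layers [11]. The sketch below is a toy under strong simplifying assumptions (no learned projections, a single head, inputs reused as queries, keys and values) and is not the MAGIC generator; the function name and input vectors are hypothetical. The same function accepts any $K$, and permuting the inputs permutes the outputs in the same way.

```python
import math

def self_attention(elements):
    """Scaled dot-product self-attention over a set of K vectors.

    elements: list of K vectors (lists of floats) of equal length d.
    Queries, keys and values are the inputs themselves (no learned weights).
    """
    d = len(elements[0])
    outputs = []
    for query in elements:
        # One score per (query, key) pair: K * K pairs in total.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in elements]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        outputs.append([sum(e / norm * value[i] for e, value in zip(exps, elements))
                        for i in range(d)])
    return outputs

set_of_three = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
set_of_five = set_of_three + [[0.5, -0.5], [-1.0, 2.0]]

out_three = self_attention(set_of_three)           # K = 3
out_five = self_attention(set_of_five)             # K = 5, same function
out_reversed = self_attention(set_of_three[::-1])  # permuted input set
```

The double loop over queries and keys also shows why the cost of such a layer grows with the square of the set size.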

Figure 9 (rows: FAMOS parametric GAN, FAMOS collage, MAGIC collage; panels: Content, Example template, Mosaic, Zoom-in 3x, Weights): Illustration of three adversarial guided mosaic stylization approaches, trained on Milan city images as style templates and an Arcimboldo portrait of size 1200x1800 pixels as content. (i) A fully convolutional U-Net generator. (ii) A hybrid method that blends from fixed templates and refines them with another U-Net. (iii) MAGIC collage using the same data. Arguably, MAGIC has the finest control over source material recombination; see text for details.

Appendix IV: Technical details

We implemented our code using PyTorch 1.0, and ran experiments on a Tesla V100 GPU. Each convolutional block had a convolution with a 5x5 kernel, instance normalization and a ReLU nonlinearity. As is typical for U-Nets [4], we use downsampling and upsampling to form an hourglass shape. Channels were 48 at the largest spatial resolution and doubled when the spatial resolution was halved. For the discriminator, we could use many more channels: 128 at the first layer, doubling after every layer. We also used FP16 precision in order to lower memory costs and fit larger set sizes $K$; note that the complexity of the ST block is quadratic in $K$. The U-Nets we used had skip connections.

For training we used the usual cross-entropy GAN loss [15], with minibatches of size $B$, effectively meaning that the sampled memory templates were 5-dimensional tensors $M \in \mathbb{R}^{B \times K \times C \times H \times W}$. Since we use instance normalisation (and not batch normalisation), the batch size $B$ can be chosen flexibly depending on the GPU memory constraints. We trained on 3 V100 GPUs.

For the two experiments shown in Figures 1 and 2 we had the following settings:

• Guided generation (Fig. 1): training data distribution for memory templates: randomly cropped 256x256 pixel patches from 6 Berlin city map images (each of resolution 1800x900 pixels). $K = 50$ memory templates, U-Nets of depth 5. Discriminator depth 6. Batch size $B = 6$. For inference of the large output mosaic we can unroll on any size (typically for posters many megapixels can be used), and decide how to split the rendering spatially given the system GPU memory constraints.
• Unguided generation (Fig. 2): image size 256x256 pixels, same size for training and inference. U-Nets of depth 5. Discriminator depth 7. Batch size $B = 12$. Element count of memory set K ∈ { , }
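As a rough illustration of how a large rendering might be split spatially, the sketch below places tiles for the example given in Appendix I: a 2000x2000 pixel poster covered by a 6x6 grid of overlapping 384x384 squares. The equal-spacing rule and the function name `tile_offsets` are assumptions for this example; the edge trimming and seamless stitching described in [7] are not reproduced here.

```python
def tile_offsets(total, tile, count):
    """Offsets of `count` equally spaced tiles of width `tile` covering `total` pixels.

    The first tile starts at 0 and the last one ends exactly at `total`.
    """
    if count == 1:
        return [0]
    step = (total - tile) / (count - 1)
    return [round(i * step) for i in range(count)]

# Example from Appendix I: 2000 px per side, 6 tiles of 384 px per side.
offsets = tile_offsets(total=2000, tile=384, count=6)

# Overlap in pixels between each pair of neighbouring tiles.
overlaps = [offsets[i] + 384 - offsets[i + 1] for i in range(5)]

# Top-left corners of all 6 x 6 = 36 tiles of the poster.
grid = [(top, left) for top in offsets for left in offsets]
```

With these numbers neighbouring tiles overlap by about 60 pixels, which leaves room to trim the tile borders before assembling the full image. The tile size itself would be chosen to fit the available GPU memory.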