Channel Decomposition into Painting Actions
Shih-Chieh Su∗
Microsoft, San Diego, CA 92130
[email protected]
Abstract
This work presents a method to decompose a convolution layer of a deep neural network into painting actions. The pre-trained knowledge in the appointed operation layer is used to guide the neural painter. To behave like a human painter, the actions are driven by costs simulating the hand movement, the paint color change, the stroke shape and the stroking style. To help planning, Mask R-CNN is applied to detect the object areas and decide the painting order. The proposed painting system introduces a variety of extensions in artistic styles, based on the chosen parameters. Further experiments are performed to evaluate the channel penetration and the channel sensitivity of the strokes.
1 Introduction

For years, convolutional neural networks have been digesting the visual world and serving a wide range of applications. Neural style transfer is one of the most popular applications. Soon after the pioneering work by Gatys et al. [1], feed-forward networks incorporating convolutional layers were introduced to perform near-realtime style transfer [2, 3]. In more recent work [4, 5], stroke and attention factors have been considered. Although fast, current style transfer work generates the whole transferred frame in one feed. This leaves the audience wondering in which stroke order a neural painter would paint the art that leads to the style transfer output.

On the other hand, the stroke composure process has been studied in the literature. The steps to paint an image (or to write letters) are typically referred to as stroke-based rendering (or inverse graphics). To learn the painting behavior without paired stroke-wise training data, reinforcement learning is applied to help stroke planning in recent work like SPIRAL [6], StrokeNet [7] and LearningToPaint [8]. While the strokes can typically be arranged in the coarse-to-fine order of the designed painting architecture, the stroke shape and stroke order may differ from those of a human painter.

Decomposing the target into a reasonable amount of stroking actions with a reasonable amount of stroke shapes remains challenging. Which parts of the target need to be painted, and in which kind of artistic style? How does a human painter plan and compose the painting with strokes? If different painters paint the same object or the same scene, how different would their approaches be? We try to find clues from the generator network.

This work presents a strategy to decompose the channel response of the generator networks into stroke actions, called the channel stroke. The channel stroke considers the burden of the human painter in changing paint brushes and changing colors.
Leveraging the channel depth of the generator networks, the proposed strategy strokes through the same channels continuously over the regions with high receptive field response.

∗ In Workshop of Knowledge Representation and Reasoning. Demo and code are available and updated at https://github.com/jessysu/cpia.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(a) big, l = 9  (b) small, l = 9  (c) small, l = 9  (d) small, l = 9  (e) small, l = 8  (f) small, l = 7
Figure 1: CPIA over the style transfer transformer of the MSG [10]: l indicates the operation layer and big/small indicates the stroke size (see N in Section 3.2). The top row consists of the content image, the style images (from left to right, Udnie, Starry Night and The Great Wave off Kanagawa), and the original style transfer outputs. The remaining rows are the intermediate outputs, the intermediate stroke maps, and the final outputs. (a) and (b) are styled by Udnie. (c) is styled by Starry Night. (d-f) are styled by The Great Wave off Kanagawa. MSG has 13 layers, where layers 2-9 are the operable bottleneck layers.

Experiments are performed over the generator of the GANs and the transformer network of the style transfer. The chosen layer in any of these networks decides the stroke-able space (c, h, w) for the channel stroke. The cost of actions in the stroke-able space then quantifies the burden of each action and drives the decision to either continue the stroke, change color, or stop painting. Depending on the learned knowledge in the pre-trained CNN layers, the channel stroking location and style vary. When applied to the style transfer network, the stroke style varies on top of the neural style (Fig. 1).

Mask R-CNN [9] is used to help the neural painter plan via understanding what it is painting. With the knowledge of the recognized objects, the neural painter then focuses on painting the object regions one by one. The plan thus covers what to paint, whether the background is painted, and to what detail each region is painted, where the stroke detail is already parameterized in the channel stroke above. The tune-able planning helps to put appropriate focus on different regions. The whole system is called channel painter in action, CPIA (Fig. 4).

The paper details the method of channel decomposition and rendering over a limited amount of channels. Qualitative results are presented with quantified channel coverage at the operation layer. Based on existing pre-trained networks, the proposed CPIA provides: 1. stroke composure actions, 2. additional tune-able artistic outcomes, 3. controllable brush shape and movement. It is unsupervised and requires no additional training data.

2 Related Work
Generative neural networks provide the tensor space for channel decomposition. The back-propagation concept of the autoencoders was introduced in the 1980s [11, 12], making the neurons learn from the errors between the generated target and the real target. Infrastructural improvements in parallel computing later led to two popular generative approaches: variational autoencoders (VAEs [13]) and generative adversarial networks (GANs [14]). VAEs use a framework of probabilistic graphical models to generate the output by maximizing a lower bound of the likelihood of the data, while GANs leverage a discriminative network to judge and improve the output of the generative network. After the adoption of deep convolutional nets (DCGAN [15]), task-oriented GANs have been applied to image-to-image translation (Pix2Pix [16], CycleGAN [17], GDWCT [18]), concept-to-image translation (GAWWN [19], PG2 [20], StyleGAN [21]) and text-to-image translation (StackGAN [22], BigGAN [23]), among other domain-specific GANs [24, 25, 26]. In this paper, the BigGAN is used to generate the CPIA painting targets from keywords, as in Fig. 1.

Neural style transfer is one major domain where CPIA can be applied. In the work by Gatys et al. [1], the authors formulate the style transfer cost as a combination of the content loss and the style loss. The loss is measured over the pre-trained VGGnet [27], from the generated image to both the content image and the style image. Transformer networks with deep convolutional layers were introduced in [2, 3] to speed up the style transfer, with the whole transformer trained on a particular style. Then came transformers attempting to learn multiple styles in one single network, such as [28, 10]. In the following sections, the transformer of MSG [10] is decomposed into CPIA actions.
Stroke-based rendering, or inverse graphics, without a training stroke sequence is challenging. To cope without the training stroke sequence, a discriminative network guides the distributed reinforcement learners to make meaningful progress in SPIRAL [6]. The computation cost is high for deep reinforcement learners with a large and continuous action space. That can be mitigated by creating a differentiable environment, like the ones in WorldModels [29], PlaNet [30] and StrokeNet [7]. Ongoing research has delved into various stroking agents that generate very different output styles. For cartoon-like stroking, LearningToPaint [8] efficiently generates simple strokes to compose a complex image. On the other hand, NeuralPainter [31] abstracts and recreates the image into a sketch-like output.
3 Channel Decomposition

Let Λ^(l)(Y) denote the layer operation of the l-th layer on its input Y. The generator network of L layers can be expressed as

    Y^(l) = Λ^(l)(Y^(l−1)),  ∀ l ∈ {1, 2, ..., L},    (1)

where Y^(0) is the input of the network and Y^(L) is the output. The operations and corresponding weights in Λ^(l) were trained with or without the input Y^(0).

Extending the forward path of the neural network to allow additional layer operations Φ^(l) leads to

    Y^(l) = Φ^(l)(Λ^(l)(Y^(l−1))),  ∀ l ∈ {1, 2, ..., L}.    (2)

Next, we provide implementations of the layer operations Φ^(l).

3.1 Channel Flush

One key finding in our experiments is that, for the intermediate layer images Y^(1), ..., Y^(L−1), the decomposed representation preserves the spatial information of the output image Y^(L), while the detail at each location can be truncated and represented by the high-response channel(s). By only keeping τ channels out of the total C channels at layer l, we force the later layers to respond only to the top τ channels at each location of layer l.

(a) raw image generated from [23]  (b) channel flush l = 3, τ = 512  (c) channel flush l = 6, τ = 512  (d) channel stroke l = 4, τ = 256  (e) channel stroke l = 3, τ = 512  (f) channel stroke l = 6, τ = 512
Figure 2: Channel decomposition of the BigGAN [23]: (a) is generated using the keyword "seashore". From (a), we use the channel flush in Section 3.1 to generate (b) and (c), and the channel stroke in Section 3.2 to generate (d-f). BigGAN has 15 layers and 14 of them are generative bottleneck blocks.

This observation sheds light on the following operation:

    Φ^(l)(Y) = M(Y, τ) ⊙ Y,
    M_{c,h,w}(Y, τ) = 1 if ‖{ c′ ∈ {1, ..., C} : Y_{c,h,w} < Y_{c′,h,w} }‖ ≤ τ, and 0 otherwise,    (3)

where ⊙ is the Hadamard product, ‖·‖ is the cardinality of a set, and τ is the channel limit out of the C channels at layer l per location (h, w). The tensor M masks the original layer output Y and picks only the top τ channels of Y at each location (h, w) for rendering into the later layers.

The channel flush provides a way to focus the image rendering on the high-response channels. At any spatial location of the operating layer, the lower-response channels are muted. The number of channels to choose from also matters. In CNNs [32, 33], the depth (channels) is traded with the breadth (spatial size). As a consequence, the operation layer of the channel flush is better placed in the middle of the network, avoiding the last layers, which have low channel variety, and the initial layers, which have low spatial resolution.

The result of the channel decomposition is shown in Fig. 2. We start with an image generated from the BigGAN [23], then apply the channel flush and the channel stroke, which is described next.

3.2 Channel Stroke

Sometimes the channel flush incurs unnecessary discontinuities over the output image. To deal with this issue, we extend the Φ^(l)(·) in Eq. 3 into an operation set of channel strokes at layer l. Let (C, H, W) denote the channel depth, height and width of its output Y^(l). We define N(c, h, w; m) as the set of neighborhood pixels (c′, h′, w′) near pixel (c, h, w), where Y_{c′,h′,w′} ≥ m · Y_{c,h,w} for every (c′, h′, w′) in N(c, h, w; m). The quantifier m is a real number in (0, 1], which represents the sensitivity of the stroke.
When m is close to one, the stroke sensitivity is high; the channel stroke can then only turn on the neighboring pixels with a response highly similar to that of the stroke pixel. The simplest case of N(c, h, w; m) is a square box centered at (c, h, w) with each side of 2z + 1 pixels on channel c. In this case, the parameter z is the stroke size.

The channel stroke algorithm (Alg. 1) updates the mask tensor M in Z^{C×H×W} and the cost tensor G in R^{C×H×W} on each of its iterations. The mask M then filters the layer-l response Y in R^{C×H×W}. The stopping criterion S terminates the procedure when the current response Y_{c,h,w} is lower than a fraction of Y_max. Other possible choices for S include the number of strokes and the fraction of painted locations.

Algorithm 1
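The box-shaped neighborhood can be sketched as follows; `neighborhood` is a hypothetical helper that enumerates the pixels N(c, h, w; m) inside a (2z+1)-sided box, assuming non-negative responses:

```python
import numpy as np

def neighborhood(Y, c, h, w, m, z):
    """Pixels in the (2z+1)-sided square box around (h, w) on channel c
    whose response is at least m times the stroke pixel's response."""
    C, H, W = Y.shape
    pts = []
    for hh in range(max(0, h - z), min(H, h + z + 1)):
        for ww in range(max(0, w - z), min(W, w + z + 1)):
            if Y[c, hh, ww] >= m * Y[c, h, w]:
                pts.append((c, hh, ww))
    return pts
```

A larger z widens the brush, while a larger m prunes the box down to the pixels that closely match the stroke pixel's response.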
Channel Stroke

  procedure ChannelStroke(S, N, m, τ)
  1:  mask M ← 0
  2:  cost G ← 0
  3:  while true do
  4:    feasible pixels: M_{c,h,w} = 0 and Σ_{c′=1}^{C} M_{c′,h,w} < τ
  5:    choose pixel (c, h, w) ← argmax_{c,h,w} (1 − G) ⊙ Y over the feasible pixels
  6:    if stopping criterion S met then
  7:      break
  8:    extend stroke: M_{c′,h′,w′} ← 1 for all (c′, h′, w′) in N(c, h, w; m)
  9:    update G according to the chosen pixel (c, h, w)

The intuition of the channel stroke is that, for a human painter to paint, one needs to select a color of paint, the painting brush, and the pattern to paint. Once these items are selected, the painter can stroke on the canvas and then extend the stroke over a certain region. In Alg. 1, the color of paint and the pattern to paint are controlled by channel c, which is chosen in Step 5 according to the currently most responsive pixel (c, h, w). The neighborhood N(c, h, w; m) decides the stroke shape, which can be related to the painting brush of the human painter.

At the end of each stroke, the human painter can either continue to use the same color to paint other areas on the canvas, or switch to another color. To be effective, it is desirable to keep using the brush of the current color as much as possible at the same level of painting detail. We factor this behavior into the cost G at the stroke pixel (c, h, w). The channel stroke will continue the stroke into a nearby stroke pixel on the same channel. Note that the neighborhood extension in Step 8 is about the shape of the stroke, while the stroke continuation in Step 9 models the behavior of the human painter in switching colors.

We first consider the channel cost, which reflects the cost of changing the brush color for a human. Let g_c be the constant cost accounting for changing channel. The channel cost tensor J is

    J_{c′,h′,w′}(c) = 0 if c′ = c, where c is the current stroking channel; g_c otherwise.    (4)

The next cost factor is the stroke movement.
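Under one reading of Eq. 4 (staying on the current channel is free, switching costs g_c), the channel cost can be sketched as a tiny hypothetical helper:

```python
import numpy as np

def channel_cost(C, c, g_c):
    """Eq. 4 (one reading): zero cost to keep stroking on the current
    channel c, a constant g_c to switch to any other channel.  The
    (C, 1, 1) shape broadcasts over all locations (h', w')."""
    J = np.full((C, 1, 1), float(g_c))
    J[c] = 0.0
    return J
```

Because J is flat across space, the spatial preference comes entirely from the movement cost K that follows.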
Naturally, the stroke continues into its nearby region. Therefore, the movement cost increases as the distance from the current stroke location increases. The movement cost K is

    K_{c′,h′,w′}(h, w; σ) = 1 − e^{−((h′−h)² + (w′−w)²) / (2σ²)},    (5)

where σ is the standard deviation of the Gaussian kernel centered at the current stroking location (h, w).

Figure 3: Comparison between the channel stroke (left of pair) and the channel flush (right of pair). A histogram of channel coverage floats over each image. From top to bottom, the results in each row are (1) BigGAN [23] from keywords "pineapple", "vase" and "volcano" respectively, (2) l = 4, τ = 256, (3) l = 4, τ = 512, (4) l = 5, τ = 256, (5) l = 5, τ = 512, (6) l = 6, τ = 512, (7) l = 7, τ = 128.

The overall cost is then the Hadamard product of the individual cost components, G = J ⊙ K. This cost is updated in every channel stroke iteration in Alg. 1, Step 9. Further stroke behavior modeling may incorporate a location cost for top-to-bottom and left-to-right handwriting behavior. It is also possible to add a directional cost to make the stroke continue in the same direction and attain a certain artistic feel. Here we focus on quantifying the burden of choosing between continuing with the current brush and changing color. The cost G provides the next stroking location and channel.

In Fig. 3, we compare the results from the channel flush and the channel stroke over different operation layers l and different channel limits τ. We also provide a histogram of channel coverage for each outcome image. For an arbitrary channel, the coverage means the fraction of locations having their mask M turned on. The results from the channel flush have more concentrated coverage compared to those from the channel stroke.
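With both cost factors stated, Alg. 1 can be sketched end-to-end. This is a compact hypothetical NumPy reading (our `channel_stroke`, not the released code), assuming non-negative responses (e.g. post-ReLU) and a response-ratio stopping criterion:

```python
import numpy as np

def channel_stroke(Y, tau, m=0.8, z=1, g_c=0.5, sigma=2.0, stop_frac=0.1):
    """Greedily pick the most responsive feasible pixel, discounted by
    the cost G; extend the stroke over a square neighborhood on the
    same channel; update G = J * K (Eqs. 4-5).  Y has shape (C, H, W)."""
    C, H, W = Y.shape
    M = np.zeros_like(Y)                      # stroke mask
    G = np.zeros_like(Y)                      # cost tensor
    y_max = Y.max()
    hh, ww = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    while True:
        # Steps 4-5: unpainted pixels at locations under the limit tau.
        feasible = (M == 0) & (M.sum(axis=0, keepdims=True) < tau)
        if not feasible.any():
            break
        score = np.where(feasible, (1.0 - G) * Y, -np.inf)
        c, h, w = np.unravel_index(np.argmax(score), score.shape)
        # Steps 6-7: stopping criterion S as a response-ratio threshold.
        if Y[c, h, w] < stop_frac * y_max:
            break
        # Step 8: extend over the (2z+1)-sided box on channel c, keeping
        # pixels with at least m times the stroke pixel's response.
        h0, h1 = max(0, h - z), min(H, h + z + 1)
        w0, w1 = max(0, w - z), min(W, w + z + 1)
        box = Y[c, h0:h1, w0:w1]
        M[c, h0:h1, w0:w1] = np.maximum(
            M[c, h0:h1, w0:w1], (box >= m * Y[c, h, w]).astype(Y.dtype))
        # Step 9: continuing on channel c near (h, w) stays cheap.
        J = np.full((C, 1, 1), g_c)
        J[c] = 0.0                            # no cost to keep the channel
        K = 1.0 - np.exp(-((hh - h) ** 2 + (ww - w) ** 2)
                         / (2.0 * sigma ** 2))
        G = J * K[None, :, :]                 # G = J (Hadamard) K
    return M
```

Each iteration paints at least the chosen pixel, so the loop terminates once every location reaches its channel limit or the response falls below the threshold. Note the neighborhood extension can push a location past τ painted channels; the feasibility test in Steps 4-5 only constrains the chosen stroke pixel.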
Figure 4: Block diagram for applying the CPIA over a generator network.

(a) raw  (b) styled  (c) painted @ l = 9  (d) painted @ l = 7
Figure 5: CPIA result over the style transfer network in [10]. Images are painted with τ = 20 in (b) and with τ = 128 in (c). Each row of the stroke maps illustrates the painting actions of the (c) image in the following row.

Because the channel stroke extends the stroke into its neighborhood and continues onto the nearby stroke-able region, some channels tend to have higher coverage than others. That results in the more dispersed channel coverage.

With neighborhood extension and stroke continuation, the channel stroke overcomes the occasional discontinuity issue of the channel flush, while keeping the artistic look from blocking out the low-response channels at each location (h, w). When the channel limit τ becomes closer to the number of channels at the operation layer, the output image becomes closer to the original output without channel operations.

3.4 Channel Painting in Action

Based on the channel stroke, we propose the channel painting-in-action (CPIA) framework. This framework first analyzes the input image into painting regions and comes up with a painting plan. The painting plan contains a list of step images, which are composed of certain masked regions of the input image. The step images are then sequentially fed into the pre-trained generator network. The operation layer carries out the stroke actions and paints on the output canvas. The framework is presented in Fig. 4. In the implementation of this work, we use the Mask R-CNN to mark out the objects as regions of interest (ROIs). The step images are prepared according to these ROIs.

The step images mask out the regions to ensure channel continuity on those regions. When the first step image is fed, the channel stroke algorithm is applied to paint on the masked regions of the current step image, until the stopping criterion S is met.
The output canvas contains strokes only on the regions that have been exposed to the channel stroke so far. Then the next step image is fed. The painting process continues until the list of step images is exhausted.

Because of the convolution kernels, the stroked response will propagate into its nearby regions in the next convolutional layer of the CNNs. The cascaded propagation tends to tint the whole canvas at the last layer. Therefore, the stroke maps are used as post-generator masks to wipe out the response of the non-stroking regions at the last layer.

Note that although there can be as many as τ channels stroked for each location (h, w), the painting process does not guarantee that every location has exactly τ channels stroked. In fact, at the operation layer, it is very likely that the locations with more high-response channels have reached the maximum of τ stroked channels, while the less responsive locations have fewer than τ channels stroked.

Fig. 5 presents the CPIA result over the style transfer network in [10]. The original style transfer offers the sharper result in general. On the other hand, the CPIA introduces additional artistic styling components on top of the existing style transfer. Such components are defined by the parameters S, N, m, l and τ. At the same time, the step images in the painting plan can also be seen as a controlling factor that decides the painting priority of each masked region.

Working with the stroke penetration parameter, a visually plausible stroke outcome requires a proper stroke mask after the last layer. With low penetration in channels, the response can be very dim and needs to be masked with low opacity on the background. The mask opacity increases as the location collects more stroked channels.

In the current implementation, the painting area ordering is based on the object and its detection score for each ROI from the Mask R-CNN.
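The step-image feeding loop above can be sketched abstractly. Everything here is a hypothetical interface (the callables `make_step_image`, `paint_step` and `stopping_met` stand in for the ROI masking, the channel-stroke pass, and the criterion S):

```python
def cpia_paint(image, rois, make_step_image, paint_step, stopping_met):
    """Sketch of the CPIA painting loop.

    rois            -- painting plan: ROIs ordered by detection score
    make_step_image -- builds a step image exposing the ROIs so far
    paint_step      -- applies channel strokes for one iteration
    stopping_met    -- stopping criterion S for the current ROI
    """
    canvas = None
    exposed = []
    for roi in rois:                          # feed step images in order
        exposed.append(roi)
        step_image = make_step_image(image, exposed)
        while not stopping_met(canvas, roi):  # stroke until S is met
            canvas = paint_step(step_image, canvas)
    return canvas
```

Accumulating `exposed` reflects the text: each step image exposes all regions painted so far, so earlier strokes stay on the canvas while new ROIs are added.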
We can optionally add side information about the semantics to the planning, such as painting the persons at the end, or painting the large objects first.

To stop the painting at each step-image feed of the CPIA, we leverage the stopping criterion S in Alg. 1. S can be a global threshold cutting off the response ratio at each ROI, or a stopping condition tailored toward different object types or sizes. The former is adopted in this work.

4 Conclusion

The proposed channel stroke strategy utilizes the knowledge learned and stored within the deep convolutional generator network. On top of that, the CPIA leverages the learned object and segmentation knowledge in frameworks like Mask R-CNN to plan the painting regions. The CPIA works with existing generator networks and existing image segmentation tools, without additional training data on stroking order.

Looking ahead, we seek to drive the stroking factors in a more adaptive way. The stroke size (bundled in the neighborhood N) can vary based on the stroking location. Depending on the stroking channel c, the stroke penetration can possibly change in a responsive fashion. In this work, one single layer of the CNN is chosen as the operation layer. A further expansion is to investigate multi-layer coordination of the channel stroke. The channel decomposition for the generator networks still has much to explore, and the applications may go beyond artistic rendering and step-wise painting.

References

[1] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
[2] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 694–711, 2016.
[3] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In International Conference on Machine Learning (ICML), volume 1, page 4, 2016.
[4] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems (NIPS), pages 386–396, 2017.
[5] Yongcheng Jing, Yang Liu, Yezhou Yang, Zunlei Feng, Yizhou Yu, Dacheng Tao, and Mingli Song. Stroke controllable fast style transfer with adaptive receptive fields. In Proceedings of the European Conference on Computer Vision (ECCV), pages 238–254, 2018.
[6] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. International Conference on Learning Representations (ICLR), 2019.
[7] Ningyuan Zheng, Yifan Jiang, and Dingjiang Huang. StrokeNet: A neural painting environment. In International Conference on Learning Representations (ICLR), 2019.
[8] Zhewei Huang, Wen Heng, and Shuchang Zhou. Learning to paint with model-based deep reinforcement learning. arXiv preprint arXiv:1903.04411, 2019.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
[10] Hang Zhang and Kristin Dana. Multi-style generative network for real-time transfer. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[11] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[12] Dana H Ballard. Modular learning in neural networks. In Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 1, pages 279–284. AAAI Press, 1987.
[13] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
[15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
[17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2223–2232, 2017.
[18] Wonwoong Cho, Sungha Choi, David Keetae Park, Inkyu Shin, and Jaegul Choo. Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10639–10647, 2019.
[19] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems (NIPS), pages 217–225, 2016.
[20] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems (NIPS), pages 406–416, 2017.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.
[22] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5907–5915, 2017.
[23] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
[24] Yanghua Jin, Jiakai Zhang, Minjun Li, Yingtao Tian, Huachun Zhu, and Zhihao Fang. Towards the automatic anime characters creation with generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), 2017.
[25] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. CartoonGAN: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9465–9474, 2018.
[26] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8798–8807, 2018.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[28] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. International Conference on Learning Representations (ICLR), 2017.
[29] David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NIPS), pages 2450–2462, 2018.
[30] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pages 2555–2565, 2019.
[31] Reiichiro Nakano. Neural painters: A learned differentiable constraint for generating brushstroke paintings. arXiv preprint arXiv:1904.08410, 2019.
[32] Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 97–104, 2004.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
A Appendix
In this section, we discuss additional parameters that can further fine-tune the channel stroke and the CPIA result.
A.1 Stroke Sensitivity
In the channel stroke (Alg. 1), the parameter m is used to qualify which of the neighboring pixels can be turned on during the current channel stroke. Therefore, both the scope of the neighborhood N and the sensitivity m decide the shape of the current stroke. In Fig. 6, different m's are compared over the generated images from the BigGAN [23] network.

Picking a smaller m can further extend the current stroke into its neighboring pixels and create a wider stroke shape. For any pixel at the operation layer, being turned on at one channel reduces the chance of being turned on at other channels. The channel coverage histograms also confirm that smaller m's cause more channels not to be turned on. The channels not being turned on are the less responsive channels across the operation layer. Thus the output image from the smaller m's focuses more on the globally responsive channels.

A.2 Stroke Penetration
A stroke typically affects multiple channels. For example, the last layer in the generator consists of RGB channels; a stroke in yellow impacts at least the red and the green channels. Reconsidering the channel strokes, another option is to enable the stroke to penetrate through several channels. The penetration happens when there exist several channels having a similarly high response as the stroking channel c at location (h, w).

When stroking at the pixel (c, h, w), the other p − 1 top-response channels at the same location are also turned on, conditioned on their being previously muffled. The penetrating stroke then follows Alg. 1 to complete the stroking process. The set of the penetrating channels decides the color and the pattern of the stroke. Depending on the operation layer, there can be more than a thousand paint-able channels. The depth of the layer gives room for the stroke penetration. We evaluate the stroke penetration in Fig. 7, where the operation layer has 1024 channels.

Figure 6: Comparison of various stroke sensitivities m (alongside the channel flush and the raw output) over the BigGAN [23] outputs. The channel operations are on layer l = 5 with τ = 512 (out of 1024 available channels).

Figure 7: Comparison of various stroke penetrations p ∈ {1, 2, 4, 8, 16, 32, 64, 128} over the BigGAN [23] outputs. The channel operations are on layer l = 5 with τ = 512 (out of 1024 available channels).

The passing channels are less coupled at lower stroke penetration. This induces variety in the channel coverage and generates high-contrast, bold results. On the other hand, higher stroke penetration brings the output closer to the original full-channel output. This can be verified with the histograms of channel coverage. While the stroke sensitivity m maintains the spatial continuity, the channel continuity is controlled by the stroke penetration p. Note that in Fig. 6 the penetration is fixed at p = 2, and in Fig. 7 the sensitivity m is fixed at a single value.
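The penetration step can be sketched as a hypothetical helper (`penetrate` is our name) that turns on the top-p response channels at the stroke location, i.e. the stroking channel plus its p − 1 closest competitors:

```python
import numpy as np

def penetrate(Y, M, h, w, p):
    """Turn on the top-p response channels at location (h, w):
    the stroking channel plus the p - 1 next most responsive
    channels.  Channels already on are simply left on."""
    top = np.argsort(-Y[:, h, w])[:p]   # channels by descending response
    M[top, h, w] = 1.0
    return M
```

At p = 1 this reduces to the single-channel stroke of Alg. 1, while a large p approaches the full-channel output, matching the trend shown in Fig. 7.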