Patch-Based Image Inpainting with Generative Adversarial Networks
Ugur Demir, Istanbul Technical University, [email protected]
Gozde Unal, Istanbul Technical University, [email protected]
Abstract
The area of image inpainting over relatively large missing regions has recently advanced substantially through the adaptation of dedicated deep neural networks. However, current network solutions still introduce undesired artifacts and noise into the repaired regions. We present an image inpainting method that is based on the celebrated generative adversarial network (GAN) framework. The proposed PGGAN method includes a discriminator network that combines a global GAN (G-GAN) architecture with a PatchGAN approach. PGGAN first shares network layers between the G-GAN and PatchGAN, then splits paths to produce two adversarial losses that feed the generator network, in order to capture both the local continuity of image texture and pervasive global features in images. The proposed framework is evaluated extensively, and the results, including a comparison to the recent state of the art, demonstrate that it achieves considerable improvements in both visual and quantitative evaluations.
1. Introduction
Image inpainting is a reconstruction technique widely used by advanced photo and video editing applications for repairing damaged images or filling in missing parts. The aim of inpainting can be stated as the reconstruction of an image without introducing noticeable changes. Although fixing small deteriorations is relatively simple, filling large holes or removing an object from a scene is still challenging due to the huge variability and complexity of the high-dimensional image texture space. We propose a neural network model and a training framework that completes large blanks in images. As the damaged area(s) take up a large space, and hence the loss of information is considerable, the CNN model needs to handle both local and global harmony and conformity to produce realistic outputs.

Recent advances in generative models show that deep neural networks can synthesize remarkably realistic-looking images in applications such as super-resolution [15, 18, 6], deblurring [28], denoising [39], and inpainting [25, 34, 11, 21]. One of the essential questions about realistic texture synthesis is: how can we measure "realism" or "naturalness"? One would need a formulation or an algorithm, which does not yet exist, that determines precisely whether an image is real or artificially constructed. Primitive objective functions like the Euclidean distance assist in measuring and comparing information on the general structure of images; however, they tend to converge to the mean of possible intensity values, which causes blurry outputs. To address this challenging problem, Goodfellow et al. proposed Generative Adversarial Networks (GAN) [7], a synthesis model trained based on a comparison of real images with generated outputs. A discriminative network is included to classify whether an image comes from the real distribution or from the generator network. During training, the generative network is scored by an adversarial loss that is calculated by the discriminator network.

Grading a whole image as real or fake can work for small images [25]; however, high-resolution synthesis needs to pay more attention to local details along with the global structure [34, 11, 21]. Isola et al. introduced PatchGAN, which reformulates the discriminator in the GAN setting to evaluate local patches of the input [13]. That work showed that PatchGAN improves the quality of the generated images; however, it has not yet been explored for image inpainting. For that purpose, we design a new discriminator that aggregates local and global information by combining the global GAN (G-GAN) and PatchGAN approaches.

In this paper, we propose an image inpainting architecture with the following contributions:

• Combination of PatchGAN and G-GAN that first shares network layers and later uses split paths with two separate adversarial losses, in order to capture both local continuity and holistic features in images;

• Addition of dilated and interpolated convolutions to ResNet [14] in an overall end-to-end training network created for high-resolution image inpainting;

• Analysis of different network components through ablation studies;

• A detailed comparison to the latest state-of-the-art inpainting methods.
2. Related works
The idea of AutoEncoders (AE) dominated the generative modeling literature in the last decade. Theoretical developments connecting probabilistic inference with efficient approximate optimization, as in Variational AutoEncoders [17], and the intuitive extension of AEs to Denoising Autoencoders (DAE) [31] constitute the building blocks of image synthesis models, both in terms of theory and of neural network (NN) implementations. In particular, the design of NN architectures has a crucial effect on texture generation, as it shapes the information flow through the layers as desired. The AE framework transforms the input image into an abstract representation, then recovers the image from the learnt features. Skip connections are added in [26] to improve gradient flow in backpropagation and thereby synthesis quality. Residual connections [9, 10, 37, 29, 33], which enhance gradient flow, have also been adapted to generative models [14, 13, 39, 8, 19]. Apart from architectural design, recently introduced components such as batch normalization [12], instance normalization [30], dilated convolution [36], and interpolated convolution [24] have had promising effects on the image generation process [14, 26, 18, 15, 11].

Figure 1: PatchGAN discriminator. Each value of the output matrix represents the probability of whether the corresponding image patch is real or artificially generated.
Adversarial training has become a vital step for texture-generating Convolutional Neural Networks (CNNs). It provides substantial gradients to drive generative networks toward producing more realistic images without any human supervision. However, it suffers from unstable discriminator behavior during training, which frustrates generator convergence. Furthermore, the GAN considers images holistically and focuses solely on realistic image generation rather than on generating an image patch well-matched to the global image. That property of the GAN is incompatible with the original goal of inpainting. Numerous GAN-like architectures have been proposed during the last years to solve those issues to some degree [40, 23, 27, 4, 13]. The recently proposed PatchGAN [13, 20] provides a simple framework that can be adapted to various image generation problems. Instead of grading the whole image, it slides a window over the input and produces a score that indicates whether each patch is real or fake. As local continuity is preserved, a generative network can reveal more detail from the available context, as illustrated in the cover figure, which presents some results of the proposed technique. To our knowledge, our work is the first to adapt the PatchGAN approach to the inpainting problem.

Figure 2: Generative ResNet architecture and PGGAN discriminator, which is formed by combining PatchGAN and G-GAN: shared layers split into a PatchGAN discriminator path and a global discriminator path.
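To make the patch-level scoring concrete, the following is a minimal PyTorch sketch of a PatchGAN-style discriminator: a fully convolutional network whose output is a grid of real/fake logits, one per receptive-field patch. The layer widths, depths, and kernel sizes here are illustrative assumptions, not the exact configuration of [13].

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: each output cell scores one
    receptive-field patch of the input as real or fake."""
    def __init__(self, in_channels=3, base_width=64):
        super().__init__()
        self.net = nn.Sequential(
            # Each strided conv halves the spatial resolution.
            nn.Conv2d(in_channels, base_width, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_width, base_width * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base_width * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_width * 2, base_width * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base_width * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # 1-channel map: one real/fake logit per local patch.
            nn.Conv2d(base_width * 4, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # shape: (B, 1, H', W'), one score per patch

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 31, 31])
```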
Inpainting: Early inpainting studies, which worked on a single image [2, 3, 22, 1], typically created solutions by filling the missing region with texture from similar or nearby image areas; hence, they suffered from a lack of global structural information.

A pioneering study that incorporated CNNs into inpainting was proposed by Pathak et al. [25]. They developed the Context-Encoder (CE) architecture and applied adversarial training [7] to learn features while regressing the missing part of the images. Although the CE showed promising results, the inadequate representation generation skills of the AutoEncoder network in the CE also led to a substantial amount of implausible results.

An importance-weighted context loss that considers closeness to the corrupted region is utilized in [35]. In Yang et al. [34], a CE-like network is trained with an adversarial and a Euclidean loss to obtain the global structure of the input. Then, the style transfer method of [20] is used, which forces features of small patches from the masked area to be close to those of the undamaged region in order to improve texture details.

Two recent studies on arbitrary region completion [21, 11] add a new discriminator network that considers only the filled region, to emphasize the adversarial loss on top of the global GAN discriminator (G-GAN). This additional network, called the local discriminator (L-GAN), facilitates exposing local structural details. Although those works have shown prominent results for the large-hole-filling problem, their main drawback is the L-GAN's conditioning on the location of the mask. It is observed that this leads to disharmony between the masked area, where the L-GAN is focused, and the uncorrupted texture in the unmasked area. The same problem is indicated in [11] and solved by applying post-processing methods to the synthesized image. In [21], the L-GAN pushes the generative network to produce independent textures that are incompatible with the overall image semantics. This problem is solved by adding an extension network that corrects the imperfections. Our proposed method, on the other hand, explores every possible local region as well as the dependencies among them, to exploit local information to the fullest degree.
3. Proposed Method
We introduce a generative CNN model and a training procedure for the arbitrary and large hole filling problem. The generator network takes the corrupted image and tries to reconstruct the repaired image. We utilize the ResNet [14] architecture as our generator model, with a few alterations. During training, we employ an adversarial loss to obtain realistic-looking outputs. The key point of our work is the following: we design a novel discriminator network, which we call PGGAN, that combines the G-GAN structure with the PatchGAN approach. The proposed network architecture is shown in Figure 2.
The generative ResNet that we compose consists of down-sampling, residual block, and up-sampling parts, following the architectural guidelines introduced in [14]. Down-sampling layers are implemented using strided convolutions without pooling layers. Residual blocks do not change the width or height of the activation maps. Since our network performs the completion operation in an end-to-end manner, the output must have the same dimensions as the input. Thus, in the configuration of all our experiments, the numbers of down-sampling and up-sampling layers are selected to be equal.

Receptive field sizes, which dictate the dependency between distant regions, have a critical effect on texture generation. If the amount of sub-sampling is raised to increase the receptive field, the up-sampling part of the generator network faces a more difficult problem that typically leads to low-quality or blurry outputs. The dilated convolution operation is utilized in [36] to increase the receptive field size without applying sub-sampling or adding an excessive number of convolution layers. Dilated convolution spreads the convolution weights over a wider area, expanding the receptive field size significantly without increasing the number of parameters. It was first used for inpainting by [11]. We also investigate the effect of dilated convolution on the texture synthesis problem. Three different residual block types are used in our experiments, as shown in Figure 3. The first residual block, called type-a, contains only two standard convolutions, normalization, activation, and a residual connection. The other types introduce a dilated convolution: the type-b block places the dilated convolution before the normalization layer, and the type-c block uses dilation after the activation layer. When dilation is used in our network, the dilation rate is doubled in each successive residual block, starting from one.

Figure 3: Residual block types. (a) standard residual block; (b) dilated convolution placed first; (c) dilated convolution placed second.
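As an illustration, here is a minimal PyTorch sketch of the type-b residual block described above, with the dilated convolution placed first and the dilation rate doubling across blocks. The channel width and the use of instance normalization are assumptions for illustration, not the exact configuration of our experiments.

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Type-b residual block: dilated conv first, then a standard conv.
    Spatial size is preserved, so the identity shortcut needs no projection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.body = nn.Sequential(
            # padding = dilation keeps a 3x3 output the same size as the input
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

# Dilation rate doubles in each successive block: 1, 2, 4, 8, ...
blocks = nn.Sequential(*[DilatedResBlock(256, 2 ** i) for i in range(4)])
out = blocks(torch.randn(1, 256, 64, 64))  # spatial size unchanged
```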
Interpolated convolution was proposed by Odena et al. [24] to overcome the well-known checkerboard artifacts caused by transposed convolution (also known as deconvolution) during the up-sampling operation. Instead of learning a direct mapping from a low-resolution feature map to a high-resolution one, the input is first resized to the desired size and then a convolution is applied. Figure 5 shows how elegantly the interpolated convolution affects image synthesis.
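A minimal sketch of this resize-then-convolve upsampling, assuming nearest-neighbor interpolation and a 3x3 convolution (both illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpolatedConv(nn.Module):
    """Upsample by resizing first, then convolving, which avoids the
    checkerboard artifacts of transposed convolution."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.conv(x)

up = InterpolatedConv(256, 128)
print(up(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 128, 128, 128])
```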
The discriminator network D takes generated and real images and aims to distinguish them, while the generator network G makes an effort to fool it. As long as D successfully classifies its input, G benefits from the gradient provided by the D network via its adversarial loss.

We achieve our goal of obtaining an objective value that measures the quality of the image as a whole, as well as the consistency of local details, through our PGGAN approach depicted in Figure 2. Rather than training two separate networks simultaneously, we design a weight-sharing architecture in which the first few layers are shared so that they learn common low-level visual features. After a certain layer, the network is split into two pathways. The first path ends in a binary output that decides whether the whole image is real or not. The second path evaluates local texture details, similar to the PatchGAN. Fully connected layers are added at the end of the second path of our discriminator network to reveal the full dependency across the local patches. The overall architecture hence provides an objective evaluation of the naturalness of the whole image as well as the coherence of the local texture.
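This structure can be sketched as a single shared trunk followed by two heads, as below. Only the overall layout (shared low-level layers, a global real/fake head, and a patch-level head that ends in fully connected layers) follows the description above; all layer counts, widths, the 256x256 input size, and the exact head shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PGGANDiscriminator(nn.Module):
    def __init__(self, in_channels=3, width=64):
        super().__init__()
        def down(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True))
        # Shared trunk: common low-level features for both heads.
        self.shared = nn.Sequential(
            down(in_channels, width),
            down(width, width * 2),
            down(width * 2, width * 4))
        # Global head: one scalar logit judging the whole image.
        self.global_head = nn.Sequential(
            down(width * 4, width * 8),
            down(width * 8, width * 8),
            nn.Flatten(),
            nn.Linear(width * 8 * 8 * 8, 1))
        # Patch head: a grid of local scores, then fully connected layers
        # that model dependencies across patches (output shape assumed).
        self.patch_head = nn.Sequential(
            nn.Conv2d(width * 4, 1, 4, stride=2, padding=1),  # 16x16 patch grid
            nn.Flatten(),
            nn.Linear(16 * 16, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1))

    def forward(self, x):          # x: (B, 3, 256, 256)
        h = self.shared(x)         # (B, 256, 32, 32)
        return self.global_head(h), self.patch_head(h)

g_logit, p_logit = PGGANDiscriminator()(torch.randn(2, 3, 256, 256))
```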
At the training stage, we use a combination of three loss functions. They are optimized jointly via backpropagation using the Adam optimizer [16]. We describe each loss function briefly as follows.
Reconstruction loss computes the pixel-wise L1 distance between the synthesized image and the ground truth. Even though it forces the network to produce a blurry output, it guides the network to roughly predict texture colors and low-frequency details. It is defined as:

$$\mathcal{L}_{rec} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{WHC} \, \lVert y_n - x_n \rVert_1 \tag{1}$$

where $N$ is the number of samples, $x$ is the ground truth, $y$ is the generated output image, and $W$, $H$, $C$ are the width, height, and channel size of the images, respectively.

Adversarial loss is computed by both paths of the PGGAN discriminator network D introduced in the training phase. The generator G and D are trained simultaneously by solving $\arg\min_G \max_D \mathcal{L}_{GAN}(G, D)$:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x \sim p(x)}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p(\tilde{x})}[\log(1 - D(G(\tilde{x})))] \tag{2}$$

where $\tilde{x}$ is the corrupted image.

Joint loss function defines the objective used in the training phase. Each component of the loss function is governed by a coefficient $\lambda_i$:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{g\,adv} + \lambda_3 \mathcal{L}_{p\,adv} \tag{3}$$

where $\mathcal{L}_{g\,adv}$ and $\mathcal{L}_{p\,adv}$ refer to $\mathcal{L}_{GAN}$ in Equation 2 for the two output paths of the PGGAN (see Figure 2). We update the generator parameters with the joint loss $\mathcal{L}$, the unshared G-GAN layers with $\mathcal{L}_{g\,adv}$, the unshared PatchGAN layers with $\mathcal{L}_{p\,adv}$, and the shared layers with $\mathcal{L}_{g\,adv} + \mathcal{L}_{p\,adv}$.
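For concreteness, here is a hedged sketch of a single generator update under the joint objective of Equation 3. The discriminator update (including the per-pathway parameter updates described above) is omitted; `G`, `D`, the optimizer, and the BCE-with-logits formulation of the adversarial terms are illustrative assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def generator_step(G, D, opt_G, x, x_tilde, lambdas):
    """One generator update with the joint loss
    L = l1 * L_rec + l2 * L_g_adv + l3 * L_p_adv (Equation 3)."""
    l1, l2, l3 = lambdas
    y = G(x_tilde)                       # inpainted image from corrupted input
    # Reconstruction loss (Equation 1): mean per-pixel L1 distance.
    loss_rec = F.l1_loss(y, x)
    # Adversarial losses (Equation 2): the generator wants both heads of
    # the PGGAN discriminator to label its output as real.
    g_logit, p_logit = D(y)
    loss_g_adv = F.binary_cross_entropy_with_logits(
        g_logit, torch.ones_like(g_logit))
    loss_p_adv = F.binary_cross_entropy_with_logits(
        p_logit, torch.ones_like(p_logit))
    loss = l1 * loss_rec + l2 * loss_g_adv + l3 * loss_p_adv
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```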
4. Results
In this section, we evaluate the performance of our method and compare PGGAN with recent inpainting methods through ablation studies, quantitative measurements, perceptual scores, and visual evaluations.
Paris Street View [5] has 14900 training images and 100 test images, collected from Paris. Comparisons and our ablation study are mostly performed on this dataset.
Google Street View [38] consists of 62058 high-quality images, divided into 10 parts. We use the first and tenth parts as the test set and the ninth part for validation; the remaining parts are included in the training set. In this way, 46200 images are used for training.
Places [41] is one of the largest datasets for visual tasks, with nearly 8 million training images. Since there is a considerable amount of data in the set, it is helpful for testing the generalizability of our networks.
All of the experimental setup is implemented using PyTorch (http://pytorch.org/) with GPU support. Our networks are trained separately on four NVIDIA Tesla P100 cards and a K40 graphics card.

In order to obtain comparable results from our generative ResNet implementation, we use 3 subsampling blocks when type-a blocks are used. If dilated convolution is used in the residual blocks, subsampling is set to two, since the dilation parameter makes it possible to reach wider regions without subsampling. While training our networks with the PGGAN discriminator, we set $\lambda_1 = 0.$, $\lambda_2 = 0.$, and $\lambda_3 = 0.$ in Equation 3.

In order to analyze the effects of the different components introduced, we perform several experiments by changing parameters one at a time. First, we compare the different discriminator architectures on the same generator network, ResNet. All networks are trained until no significant change occurs. Figure 4 shows sample results. It can be observed, for instance in the last column, that the window details are reconstructed differently across the methods. As expected, the G-GAN discriminator aids in completing only the coarse image structures. PatchGAN demonstrates a significant improvement compared to G-GAN, but reconstructed images still show signs of global misconception. PGGAN blends both local and global structure and provides visually more plausible results.

Figure 4: Results obtained by training the same generator network with different discriminator architectures (G-GAN, PatchGAN, PGGAN).

Along with the discriminator design, another important factor for image synthesis is the choice of layers in the generator network. In this study, we prefer interpolated convolution to transposed convolution because it provides smoother outputs. To illustrate the impact of the interpolated convolution, we tested the same PGGAN with only the upsampling layer changed, as demonstrated in Figure 5. The impact of the interpolated convolution can be clearly observed by zooming into the results of Figure 5: it removes the noise, also known as checkerboard artifacts, caused by transposed convolution. However, there are examples where the transposed convolution yields more consistent structures (e.g., see the first column of the figure). These layers have distinct characteristics, each directing the generator to a different point in the solution space. Both layers should be analyzed further, which is beyond the scope of this study.

Figure 5: Sample outputs; top: transposed convolution (tconv), bottom: interpolated convolution (iconv) [24].
Figure 6: Perceptual comparison (naturalness %) of Paris [5] images inpainted by different approaches: Original, CE, GLGAN, PGGAN-Res, PGGAN-DRes.
We compare our PGGAN with ResNet (PGGAN-Res) and PGGAN with dilated-convolution ResNet (PGGAN-DRes) to three current inpainting methods: (i) the Context-Encoder (CE), adapted from [25] to work with 256x256 images, where full images are reconstructed; (ii) GLGAN [11], over 256x256 images; and (iii) Neural Patch Synthesis (NPS) [34], over 512x512 images.
Speed: As PGGAN and GLGAN are both end-to-end texture generators, their computation times are similar, on the order of milliseconds. On the other hand, the NPS approach takes several seconds due to its local texture constraint.
PSNR and SSIM [32] are the two most commonly used evaluation criteria in the image generation community, although it is known that they are not sufficient for quality assessment. Nonetheless, in order to quantitatively compare our method with current works, we report PSNR, SSIM, mean L1 loss, and mean L2 loss in Table 1 and Table 2 for 256x256 and 512x512 images, respectively.

Method        L1 Loss   L2 Loss   PSNR (dB)   SSIM
CE [25]       6.21      1.34      18.12       0.838
GLGAN [11]    5.82      2.33      18.28       0.863
PGGAN-DRes
PGGAN-Res     5.46      1.2       18.92       0.865

Table 1: Performance comparison on 256x256 images from the Paris Street View evaluation set.

Method        L1 Loss   L2 Loss   PSNR (dB)   SSIM
NPS [34]      10.01     2.21      18.0        -
PGGAN-DRes

Table 2: Performance comparison on 512x512 images.
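For reference, the reported PSNR and mean L1/L2 columns can be computed as in the following NumPy sketch, assuming images in the [0, 255] range; SSIM is omitted here, as it is typically taken from an existing implementation of [32]. The exact scaling of the paper's L1/L2 columns is not specified, so these are the plain per-pixel means.

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mean_l1(x, y):
    """Mean absolute per-pixel difference."""
    return np.mean(np.abs(x.astype(np.float64) - y.astype(np.float64)))

def mean_l2(x, y):
    """Mean squared per-pixel difference."""
    return np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
```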
We perform a perceptual evaluation among PGGAN-Res, PGGAN-DRes, CE, and GLGAN. Twelve voters from our laboratory scored the naturalness (as natural/not natural) of the original images and of the inpainting results of the methods. Overall, each tester evaluated 500 randomly sorted and blinded images (5 x 100 images of the Paris Street View validation set). Figure 6 shows the boxplot of the percent naturalness score accumulated over users for each method.

The results indicate that CE, presented for 128x128 images, has low performance on the 256x256 test images, as also reported in [25]. The rest of the methods performed similarly; however, slightly better scores were obtained for PGGAN. This suggests that further emphasis on local coherence along with global structure can help to generate more plausible textures.
We compare the visual performance of PGGAN, NPS, and GLGAN on the common Paris Street View dataset. Figures 7 and 8 show the results for images of size 256x256 and 512x512, respectively. Some failure cases can be seen in Figure 9. Results from the Places and Google Street View datasets are shown in Figures 10 and 11. See the supplementary materials for extensive results.

Figure 7: Visual comparison on the 256x256 Paris Street View dataset [5] (Input, CE [25], GLGAN [11], PGGAN-DRes (ours), PGGAN-Res (ours)).
5. Conclusion
The image inpainting results in this paper suggest that low-level merging and then high-level splitting of a patch-based technique such as PatchGAN with a traditional GAN network can aid in acquiring local continuity of image texture while conforming to the holistic nature of the images. This merger produces visually and quantitatively better results than the current inpainting methods. However, the inpainting problem, which is tightly coupled to the generative modeling problem, is still open to further progress.

Figure 8: Visual comparison between PGGAN-Res and NPS [34] on the 512x512 Paris Street View dataset [5].

Figure 9: Non-cherry-picked results from PGGAN-DRes.

Figure 10: Sample outputs of PGGAN-DRes on the Places dataset [41].

Figure 11: Sample outputs of PGGAN-DRes on the Google Street View dataset [38].
References

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24:1–24:11, July 2009.

[2] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 417–424, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.

[3] A. Criminisi, P. Perez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, September 2004.

[4] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1486–1494, Cambridge, MA, USA, 2015. MIT Press.

[5] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What makes Paris look like Paris? ACM Trans. Graph., 31(4):101:1–101:9, July 2012.

[6] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[8] Y. Han, J. J. Yoo, and J. C. Ye. Deep residual learning for compressed sensing CT reconstruction via persistent homology analysis. CoRR, abs/1611.06391, 2016.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Computer Vision - ECCV 2016 - 14th European Conference, Proceedings, Part IV, pages 630–645, 2016.

[11] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (Proc. of SIGGRAPH 2017), 36(4), 2017.

[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456, 2015.

[13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv, 2016.

[14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer International Publishing, Cham, 2016.

[15] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.

[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

[18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.

[19] L. Lettry, K. Vanhoey, and L. V. Gool. DARN: a deep adversarial residual network for intrinsic image decomposition. CoRR, abs/1612.07899, 2016.

[20] C. Li and M. Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In CVPR, pages 2479–2486, 2016.

[21] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[22] Y. Liu and V. Caselles. Exemplar-based image inpainting using multiscale graph cuts. IEEE Transactions on Image Processing, 22(5):1699–1711, May 2013.

[23] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.

[24] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.

[25] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. (Available on arXiv:1505.04597 [cs.CV].)

[27] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.

[28] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring for hand-held cameras. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[29] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.

[30] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

[31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, Dec. 2010.

[32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, Apr. 2004.

[33] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[34] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. arXiv preprint arXiv:1611.09969, 2016.

[35] R. A. Yeh*, C. Chen*, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. (* equal contribution.)

[36] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

[37] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[38] A. Zamir and M. Shah. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[39] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. CoRR, abs/1608.03981, 2016.

[40] J. J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. CoRR, abs/1609.03126, 2016.

[41] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Supplementary Materials: Patch-Based Image Inpainting with Generative Adversarial Networks
1. Additional visual results
The following figures show the visual results obtained by the proposed PGGAN algorithm. Input images are taken from the ImageNet (http://image-net.org), Google Street View (http://crcv.ucf.edu/data/GMCP Geolocalization), and Places2 (http://places2.csail.mit.edu) datasets.

1.1. ImageNet

We perform high-resolution inpainting experiments on the ImageNet dataset. Input images are scaled to 512x512 and randomly located regions are cropped. Our model can successfully fill the blank areas, as demonstrated in the following figures (input/output pairs).

1.2. Google Street View

Images from the Google Street View dataset are scaled to 256x256, and 128x128-sized center patches are extracted from the inputs. Our network reconstructs whole images without using the mask location (input/output pairs).

1.3. Places2