Image Completion and Extrapolation with Contextual Cycle Consistency
Sai Hemanth Kasaraneni, Abhishek Mishra
Samsung Research Institute - Noida, [email protected], [email protected]
ABSTRACT
Image Completion refers to the task of filling in the missing regions of an image, and Image Extrapolation refers to the task of extending an image at its boundaries while keeping it coherent. Many recent GAN-based works have shown progress on these problem statements but lack adaptability across the two cases, i.e., a neural network trained to complete interior masked regions does not generalize well to extrapolating over the boundaries, and vice versa. In this paper, we present a technique to train both completion and extrapolation networks concurrently so that they benefit each other. We demonstrate our method's efficiency in completing large missing regions and compare it with a contemporary state-of-the-art baseline.
Index Terms — Image Completion, Image Extrapolation, Generative Adversarial Networks, Image Inpainting, Image Outpainting
1. INTRODUCTION
Image Completion and Extrapolation (often referred to as image inpainting and outpainting, respectively) become extremely difficult when the area to be filled is large. The task often requires completing highly structured objects and reproducing the textural characteristics of the image. Previously, many patch-based image synthesis methods [1, 2, 3] were proposed to complete missing regions in images, but those methods lack the capability to generate new content from the available context of the image. They also fail to complete objects with high-level structural characteristics.

Recently, deep learning based methods [4, 5, 6, 7, 8] made significant progress on filling missing regions of images based on context. All of these methods build on Generative Adversarial Networks (GANs) [9], which train a generator and a discriminator adversarially. [4] trained a convolutional neural network (CNN) based encoder-decoder to encode the context information of the masked image and utilize it to generate content over the masked regions. [5] demonstrated an iterative approach that searches for the noise prior of a GAN trained on a real image dataset to find the semantically closest complete image to the masked input; however, this comes at the cost of significant inference time due to the iterative search. [6, 7, 8] employed two discriminators trained along with the completion network: a global discriminator and a local discriminator. The global discriminator was trained to distinguish inpainted images from real images, while the local discriminator was trained to distinguish the network-completed regions from random patches of real images. However, these methods incur heavy distortions and blurriness when the area of the regions to be filled is large.

Although image extrapolation is a subcase of image completion in which the masked regions lie at the boundaries of the image, it is hard to extrapolate at all the boundaries without being explicitly trained for it, because of the long-distance textural dependencies.
2. PRELIMINARIES
GANs [9] are deep generative models that contain two networks competing with each other, a Generator G and a Discriminator D, trained adversarially. The generator is optimized to generate images distributed similarly to the real image dataset distribution, so that the discriminator cannot distinguish them. The discriminator is concurrently optimized to distinguish between real and synthetic images. Their objectives are contrary, and both optimizations occur concurrently over a loss function, as in a minimax game:

\min_G \max_D \mathcal{L}(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)

Eq. (1) is the loss function of a typical GAN. CycleGAN [10] used two such generator-discriminator pairs to perform unpaired image-to-image translation, where one generator translates a domain 'A' image to domain 'B' while the other generator translates a domain 'B' image back to domain 'A'. Problem domains like image-to-image translation, inpainting, and extrapolation often require reproducing the existing semantics and complex textures of the subject image.

[10] employed a cycle consistency loss to achieve transitivity in unpaired image-to-image domain translation. They achieved semantically plausible results and were able to preserve the context of the image throughout the cycle. Motivated by the encouraging results of [10], we use a similar approach to train the completion and extrapolation networks together to maintain contextual cycle consistency.

Fig. 1: The left figure shows the architecture of the forward cycle and the right figure shows the architecture of the backward cycle. The forward cycle performs completion and then extrapolation, and vice versa. Both cycle losses are optimized together.
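As a concrete reference, below is a minimal PyTorch sketch of one alternating update of the minimax game in Eq. (1). The function name gan_step, the optimizers, and the small epsilon stabilizer are illustrative choices, not part of the paper:

```python
import torch


def gan_step(G, D, opt_G, opt_D, real, z, eps=1e-8):
    """One alternating update of the minimax objective in Eq. (1)."""
    # Discriminator update: maximize log D(x) + log(1 - D(G(z))).
    fake = G(z).detach()  # detach so gradients do not flow into G here
    d_loss = -(torch.log(D(real) + eps).mean()
               + torch.log(1 - D(fake) + eps).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: minimize log(1 - D(G(z))).
    g_loss = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```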
3. METHOD
Our architecture contains a completion network C, an extrapolation network E, and a discriminator D. The completion network is trained to fill the missing portions of the input image, the extrapolation network is trained simultaneously to extrapolate the input image, and the discriminator is trained to differentiate the distributions of synthetic and real images.
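The paper does not pin the layer configuration down at this point (Section 4 derives C and E from the CycleGAN generator), so the following PyTorch sketch uses deliberately tiny stand-in modules; TinyGenerator and TinyDiscriminator are hypothetical names for illustration only:

```python
import torch
import torch.nn as nn


class TinyGenerator(nn.Module):
    """Stand-in for the CycleGAN-style generator used for both C and E.
    Input: RGB image concatenated with its binary mask (4 channels)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, x, mask):
        return self.body(torch.cat([x, mask], dim=1))


class TinyDiscriminator(nn.Module):
    """Stand-in global discriminator D scoring whole images in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1))

    def forward(self, x):
        return torch.sigmoid(self.body(x)).mean(dim=(1, 2, 3))


C, E, D = TinyGenerator(), TinyGenerator(), TinyDiscriminator()
```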
Similarity with existing baselines: [6], [7], [8] used a global discriminator and a local discriminator along with the completion network to complete masked images. The global discriminator was a regular discriminator that distinguishes inpainted images from real images. The local discriminator was trained to discriminate between random patches of real images and the inpainted portions of the completed images. We use only one global discriminator, which distinguishes whole inpainted images from real images. Instead of training another local discriminator, we train the extrapolation network to extrapolate based on the inpainted region and to reconstruct the original image. We define this procedure as the forward cycle. Both networks are optimized over the reconstruction error, driving them to propagate the textural characteristics as well as the context of the image throughout the cycle. We aim to propagate the context of the image between inside-masked and outside-masked images via the completion and extrapolation networks by driving context-based content creation. When both networks succeed in propagating the context of the image through the generated regions, the final reconstructed image (the cycle output) will be semantically and visually similar to the original image, i.e., contextually consistent with it. We also perform training on the reverse cycle and optimize both networks over the loss functions of both cycles. In Section 4, we demonstrate that our approach is suitable for semantically filling large missing regions in the image. Fig. 1 illustrates the training procedure of our architecture.

We define a binary mask matrix M that has the same spatial resolution as the input image and is filled with ones within a region surrounded by zeros. We experimented with square masks whose area is chosen randomly in the range of 25 to 35 percent of the image. This is a critical parameter, as both networks require sufficient unmasked content to extract the context of the image. Pixels with value 1 denote the region that is to be painted synthetically, and 0s denote the unmasked regions of the image. Let the input image be x, drawn from the training dataset. The inside-masked image is then (1 - M) \odot x, where 1 stands for the all-ones matrix of the same size as the mask.

The masked image, concatenated with the binary mask representing the region to be filled, is given as input to the completion network to obtain the inpainted image. We take the adversarial loss over the inpainted region so that the completed image converges to being perceptually plausible. This perceptual loss ensures that the inpainted image lies near the real data manifold. The perceptual loss function associated with the inpainted image is:

\mathcal{L}_{adv}(C, D, M) = \mathbb{E}[\log D(x) + \log(1 - D(C((1 - M) \odot x)))]    (2)

We also consider the contextual loss, or fidelity loss, incurred in the completion of the image to preserve fidelity. It is computed as the L1 norm of the difference between the pixel values of the original image and the output image over the completed region:

\mathcal{L}_{ctx}(C, M) = \| M \odot (C((1 - M) \odot x) - x) \|    (3)

The inpainted image is given as input to the extrapolation network after inverse masking and concatenation with the inverse of the defined mask M.
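Below is a hedged PyTorch sketch of the mask construction and the contextual loss of Eq. (3), reusing the stand-in generator from the sketch above. The function names random_square_mask and contextual_loss are illustrative; the 25-35 percent area range follows the text:

```python
import torch


def random_square_mask(h, w, frac_lo=0.25, frac_hi=0.35):
    """Binary mask M: ones inside a random square (the region to be
    synthesized), zeros elsewhere, covering 25-35% of the image area."""
    frac = torch.empty(1).uniform_(frac_lo, frac_hi).item()
    side = int((frac * h * w) ** 0.5)
    top = torch.randint(0, h - side + 1, (1,)).item()
    left = torch.randint(0, w - side + 1, (1,)).item()
    M = torch.zeros(1, 1, h, w)
    M[..., top:top + side, left:left + side] = 1.0
    return M


def contextual_loss(C, x, M):
    """Eq. (3): L1 fidelity computed over the inpainted region only."""
    x_masked = (1 - M) * x        # inside-masked input image
    y = C(x_masked, M)            # inpainted image from the completion net
    return (M * (y - x)).abs().mean()
```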
We compute the L1 reconstruction loss between the final output and the original input image and optimize both networks over this loss to preserve the context of the original image throughout the cycle:

\mathcal{L}_{rec}(C, E, M) = \| (1 - M) \odot (E(M \odot C((1 - M) \odot x)) - x) \|    (4)

In the above equations, \| \cdot \| denotes the L1 norm and \odot denotes element-wise multiplication of matrices. We optimize over the adversarial loss of the extrapolation network as well. The total forward cycle loss \mathcal{L}_{cyc}(C, E, M) is the weighted sum of the above three losses:

\mathcal{L}_{cyc}(C, E, M) = \mathcal{L}_{adv}(C, D, M) + \alpha \, \mathcal{L}_{ctx}(C, M) + \beta \, \mathcal{L}_{rec}(C, E, M)    (5)

Similarly, we compute the losses for the reverse cycle, i.e., we give the outside-masked input to the extrapolation network and then to the completion network after inverse masking, and finally compute the total loss for the reverse cycle, \mathcal{L}_{cyc}(C, E, 1 - M):

\mathcal{L}_{cyc}(C, E, 1 - M) = \mathcal{L}_{adv}(E, D, 1 - M) + \alpha \, \mathcal{L}_{ctx}(E, 1 - M) + \beta \, \mathcal{L}_{rec}(C, E, 1 - M)    (6)

The completion, extrapolation, and discriminator networks are then optimized over the combined loss function:

\min_{C,E} \max_D \mathcal{L}(C, E, D, M) = \min_{C,E} \max_D [\mathcal{L}_{cyc}(C, E, M) + \mathcal{L}_{cyc}(C, E, 1 - M)]    (7)
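Continuing the sketches above, a possible implementation of the forward-cycle loss of Eq. (5) is shown below; the backward cycle of Eq. (6) falls out by swapping the roles of C and E and of M and 1 - M, which is how the combined objective of Eq. (7) is assembled. The non-saturating form of the generator's adversarial term is a common substitution, not something the paper specifies:

```python
import torch


def forward_cycle_loss(C, E, D, x, M, alpha=10.0, beta=10.0, eps=1e-8):
    """Eq. (5): adversarial + contextual + cycle-reconstruction terms for
    the completion-then-extrapolation direction."""
    y = C((1 - M) * x, M)                       # inpainted image
    # Eq. (2): non-saturating generator form of the adversarial loss.
    l_adv = -torch.log(D(y) + eps).mean()
    # Eq. (3): L1 fidelity over the synthesized (masked) region.
    l_ctx = (M * (y - x)).abs().mean()
    # Eq. (4): inverse-mask the completion, extrapolate back, compare.
    x_rec = E(M * y, 1 - M)
    l_rec = ((1 - M) * (x_rec - x)).abs().mean()
    return l_adv + alpha * l_ctx + beta * l_rec


def total_cycle_loss(C, E, D, x, M):
    """Eq. (7): sum of the forward and backward cycles; the backward
    cycle swaps C with E and M with (1 - M)."""
    return (forward_cycle_loss(C, E, D, x, M)
            + forward_cycle_loss(E, C, D, x, 1 - M))
```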
4. EXPERIMENTS
The completion and extrapolation network architectures are derived from the generator architecture of CycleGAN. We concur with [6] in using dilated convolutions in the middle layers of the generator networks to accommodate the exponential growth of the receptive field.
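A minimal sketch of such a dilated middle block is given below; the channel count of 256 and the dilation schedule 2-4-8-16 are assumptions in the spirit of [6], not values taken from the paper:

```python
import torch.nn as nn

# Dilated middle layers: stacking increasing dilation rates grows the
# receptive field exponentially without downsampling. With kernel size 3,
# setting padding equal to the dilation preserves the spatial resolution.
dilated_block = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=8, dilation=8), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=16, dilation=16), nn.ReLU(inplace=True),
)
```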
Mask M:
Our architecture is suitable for masks of any enclosed shape. Nevertheless, for all of the comparisons, we trained and tested our models with an enclosed square mask. The area of the inside-masked region plays a crucial role in the convergence of the model. A large masked region imposes a large area to be filled by the completion network and decelerates its convergence. On the contrary, small masked regions result in large inverse-masked inputs to the extrapolation network and require it to reproduce the structural and textural characteristics from the limited available content in the image. In our experiments, we set the mask area randomly in the range of 25 to 35 percent of the input image area and observed stable convergence of both networks.

α and β: α and β are hyperparameters that impose relative importance between the perceptual and contextual losses formulated in Eq. (5) and Eq. (6). As proposed in [10], we set α = β = 10 in all our experiments.

As suggested in [6], general inpainting and outpainting methods restore the pixel values of the unmasked region from the input image to the output so as to avoid changes in the existing data. We show that our model can replicate the existing data of the input while completing and extrapolating images. Fig. 2 shows images inpainted and outpainted by the completion and extrapolation networks trained on the Oxford Flowers 102 [11] dataset. We present the immediate outputs of the completion and extrapolation networks and compare them with the restored outputs. It is evident from this figure that our model is able to reproduce the existing shapes and textural characteristics of the input image.

Fig. 2: Completion and extrapolation results on the Flowers 102 test dataset. The first three rows demonstrate extrapolation and the last two rows demonstrate completion. First column: masked inputs. Second column: outputs from the extrapolation and completion networks, respectively, before restoring the unmasked data from the input image. Third column: outputs after restoring the unmasked data. Fourth column: ground truth images (GT).
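The restoration step described above amounts to a simple composite of network output and input; a short sketch (hypothetical function name) follows:

```python
def restore_unmasked(x, y, M):
    """Keep the network output y only inside the masked region M and
    copy the known pixels of the input x everywhere else."""
    return M * y + (1 - M) * x
```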
Fig. 3: Visual comparison of completion and extrapolation results. First column: masked inputs. Second column: outputs of completion and extrapolation networks trained independently, without the cycle reconstruction loss, on inside-masked and outside-masked images respectively. Third column: inference outputs of GIP [7]; we inferred their pre-trained model on the CelebA-HQ dataset, and for results on flowers and facades we trained their model using their official implementation. Fourth column: outputs from our completion and extrapolation networks. Fifth column: ground truth images.

Table 1: Average PSNR (in dB) values on the test sets of different datasets. We considered both outpainting and inpainting results for each test image. Higher values mean better reconstruction.

Dataset     | Ours (w/o cycle loss) | GIP [7] | Ours (full)
CelebA-HQ   | 22                    | 22.3    |
Flowers102  | 23.1                  |         |
Qualitative Analysis:
We evaluate our model's performance on the CelebA-HQ faces [12], Oxford 102 Category Flowers [11], and CMP Facade [13] datasets. In all our experiments we trained with input images of resolution 256 × 256, and no post-processing or image blending techniques were applied after image completion or extrapolation. We emphasize that the cycle consistency loss (Eq. (4)) is critical for the transitivity and convergence of the model. Fig. 3 shows a visual comparison of the completion and extrapolation of masked images with and without the cycle consistency loss. We observed that the completion network was not able to converge on the celebrity faces dataset without the cycle consistency loss, and we observed similar behavior for the extrapolation of facades. Notice that GIP [7] particularly fails to give comprehensible results when outpainting images at the boundaries. Fig. 4 shows more inferences on celebrity face inpainting and outpainting; we used the official repository of GIP [7] and their model pre-trained on the CelebA-HQ dataset.
Fig. 4: Visual comparison of face completion and extrapolation results on the CelebA-HQ dataset. First column: masked inputs. Second column: outputs from GIP [7]; we used their official repository and pre-trained model on the CelebA-HQ dataset for inference. Third column: our results using the completion and extrapolation networks. Fourth column: ground truth images (GT). [Notice that GIP failed to reconstruct hair at the boundaries. Best viewed zoomed in.]
Quantitative Analysis:
Several metrics, such as PSNR and the Inception Score [14], exist to evaluate image inpainting and outpainting techniques, though none is flawless. We report PSNR values on the test sets of the mentioned datasets and compare them with those of [7] in Table 1.
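For reference, a minimal sketch of the PSNR computation used in Table 1, assuming images scaled to [0, 1]:

```python
import torch


def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images x and y in [0, 1]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```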
5. CONCLUSION
In this paper, we first discussed the problem of compatibility between image inpainting and outpainting models. We then proposed a technique to train completion and extrapolation networks together so that they benefit each other. Several qualitative and quantitative comparisons were made with a contemporary state-of-the-art solution, and our approach outperformed it. Future work will focus on further improving the consistency loss to make it more appropriate for image completion.
6. REFERENCES

[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman, "PatchMatch: A randomized correspondence algorithm for structural image editing," ACM Transactions on Graphics (Proc. SIGGRAPH), vol. 28, no. 3, Aug. 2009.

[2] Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen, "Image Melding: Combining inconsistent images using patch-based synthesis," ACM Transactions on Graphics (Proc. SIGGRAPH 2012), vol. 31, no. 4, pp. 82:1–82:10, 2012.

[3] Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf, "Image completion using planar structure guidance," ACM Transactions on Graphics, vol. 33, pp. 1–10, July 2014.

[4] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context encoders: Feature learning by inpainting," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[5] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do, "Semantic image inpainting with deep generative models," in CVPR, 2017, pp. 5485–5493.

[6] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, "Globally and locally consistent image completion," ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, pp. 107, 2017.

[7] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang, "Generative image inpainting with contextual attention," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[8] X. Wu, R. Li, F. Zhang, J. Liu, J. Wang, A. Shamir, and S. Hu, "Deep portrait image completion and extrapolation," IEEE Transactions on Image Processing, vol. 29, pp. 2344–2355, 2020.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672–2680.

[10] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in The IEEE International Conference on Computer Vision (ICCV), Oct. 2017.

[11] Maria-Elena Nilsback and Andrew Zisserman, "Automated flower classification over a large number of classes," in Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.

[12] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," 2017.

[13] Radim Tyleček and Radim Šára, "Spatial pattern templates for recognition of objects with regular structure," in Proc. GCPR, Saarbrücken, Germany, 2013.

[14] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," in NIPS, 2016.