Image Morphing with Perceptual Constraints and STN Alignment
Noa Fish, Richard Zhang, Lilach Perry, Daniel Cohen-Or, Eli Shechtman, Connelly Barnes
Tel Aviv University, Israel; Adobe Research, USA
Figure 1: A morphing sequence generated by our approach.
Abstract
In image morphing, a sequence of plausible frames is synthesized and composited together to form a smooth transformation between given instances. Intermediates must remain faithful to the input, stand on their own as members of the set, and maintain a well-paced visual transition from one to the next. In this paper, we propose a conditional GAN morphing framework operating on a pair of input images. The network is trained to synthesize frames corresponding to temporal samples along the transformation, and learns a proper shape prior that enhances the plausibility of intermediate frames. While individual frame plausibility is boosted by the adversarial setup, a special training protocol producing sequences of frames, combined with a perceptual similarity loss, promotes smooth transformation over time. Explicit specification of correspondences is replaced with a grid-based freeform-deformation spatial transformer that predicts the geometric warp between the inputs, instituting the smooth geometric effect by bringing the shapes into an initial alignment. We provide comparisons to classic as well as latent-space morphing techniques, and demonstrate that, given a set of images for self-supervision, our network learns to generate visually pleasing morphing effects featuring believable in-betweens, with robustness to changes in shape and texture, requiring no correspondence annotation.
CCS Concepts: • Computing methodologies → Image processing; Neural networks
Keywords: image morphing, generative adversarial networks, spatial transformers, perceptual similarity
1. Introduction
Morphing is the process of transformation between states of appearance, and may involve operations ranging from basic translation and rotation, to changes in color and texture, and, perhaps most iconically, shape shifting. In the era of big data and deep learning, the ability to morph between objects could have an impact beyond the generation of the visual effect itself. For instance, synthesized intermediate frames depicting novel variations of input objects may be added to existing datasets for densification and enrichment.

Traditional morphing techniques rely on correspondences between relevant features across the participating instances to drive an operation of warp and cross-dissolve [BN92]. These methods are mostly invariant to the semantics of the underlying objects and are therefore prone to artifacts such as ghosting and implausible intermediates. Correspondence points are normally user-provided, or are automatically computed assuming sufficient similarity. Recently, an abundance of available data has given rise to its utilization as guidance proxies for the extraction of short or smooth paths between the two endpoints [AECOK16], thus providing more plausible in-betweens.

In this paper, we aim to further tap into the data-driven morphing paradigm, and leverage the power of deep neural networks to learn a shape prior befitting a given source dataset, catering to the task of image morphing. Specifically, we employ a generative adversarial network (GAN) [GPAM∗14] combined with a spatial transformer [JSZ∗15] for shape alignment, to mitigate the challenges associated with morphing. GANs are known for displaying impressive generative capabilities through their capacity to learn and model a given distribution, a particularly attractive attribute for a task in which realism and plausibility are crucial. Accordingly, we design a GAN framework to learn the space of natural images of a given class, so that intermediate frames appear realistic, and to enforce sufficient similarity between sufficiently close frames to maintain smoothness of transformation.

We present a conditional GAN framework trained to generate sequences of transformations between two or more inputs, and further integrate it with a grid-based freeform spatial transformer network to alleviate large discrepancies in shape. The generated output sequences are constrained by a perceptual loss, culminating in an end-to-end solution that encourages transitions that are both plausible and smooth, with a gradual and realistic change in shape and texture.

The result is a trained generator specializing in a certain family of shapes, that, given a pair of inputs and a desired point in time, outputs the appropriate in-between frame. A full morphing effect can then be synthesized by requesting a reasonably dense sequence of frames, which yields a smooth transformation (see Figure 1).

During training, each sampled set of inputs is first processed by a spatial transformer network, which computes an alignment allowing a feature-based warp operation to map each input to the other. Next, our conditional generator processes the warped inputs, and outputs a sequence of frames, each corresponding to a given point in time. A reconstruction loss encourages the two endpoint frames to match the inputs. Meanwhile, a GAN loss pushes the generated frames towards the natural image manifold of the training set. Finally, a perceptual transition loss [ZIE∗18] constrains the transformation over time to be smooth and gradual.

We demonstrate the competence of our generator and its ability to produce visually pleasing morphing effects with smooth transitions and plausible in-betweens, on different sets of objects, both real and computer rendered. We conduct a thorough ablation study to examine the individual contributions of our design components, and perform comparisons to traditional morphing, as well as GAN-based latent space interpolation. We show that our framework, uniting the GAN paradigm with shape alignment and perceptually constrained transitions, provides a solution that is robust to significant changes in shape, a challenging setup that commonly induces ghosting artifacts in morphs.
2. Related Work

Classic morphing.
Pioneering morphing techniques combine correspondence-driven bidirectional warping with blending operations to generate a sequence of images depicting a transformation between the entities in play. The approach by Beier and Neely [BN92] leverages user-defined line segments to establish corresponding feature points that are used to distort each endpoint towards the other, and proceeds to apply a cross-dissolve operation on respective pairs of warped images to obtain a transformation sequence. More recently, Liao et al. [LLN∗14] automatically extract correspondences for morphs by performing an optimization of a term similar to structural image similarity [WBS∗04]. Aberman et al. [ALS∗18] focus on cross-domain correspondences extracted using a coarse-to-fine search of mutual nearest neighbor features, and show that this can produce cross-domain morphs. Shechtman et al. [SRAIS10] introduced an alternative way to morph between different images using patch-based synthesis that does not rely on correspondences and cross-blending, and Darabi et al. [DSB∗12] extended it by allowing patches to rotate and scale. While these methods produce pleasing transitions that look different from the traditional warp-and-blend effect, they are limited to patches drawn from the two sources and do not produce new content.
Deep interpolations.
Neural networks have been previously trained to synthesize novel views of objects using interpolation. Given two images of the same object from two different viewpoints as input, Ji et al. [JKMS17] generate a new image of the object from an in-between viewpoint. The images are first brought into horizontal alignment, and are then processed by an encoder-decoder network that predicts dense correspondences used to compute an interpolated view. General image interpolations are commonly demonstrated within the VAE and GAN realms. A notable by-product of a trained GAN is its rich latent embedding space, which facilitates linear interpolation between data points. Such interpolations drive the generation of morphing sequences, by producing a series of interpolated latent vectors that are decoded to images that appear to smoothly morph from source to target [BDS18, BSM17, KALL17, DS19]. To perform interpolation between existing instances, one must obtain their corresponding latent codes in order to compute interpolated vectors and their decoded images. This is commonly accomplished with an optimization process that starts from a random code, which is updated to minimize a loss such as the L2 distance between the decoded image and the target [ZKSE16, WRSJ19].
3. Method
Our system combines several key components that together provide a robust solution for morphing effect generation. We henceforth present these components and address the manner in which they cater to the three requirements, namely, frame realism with respect to the training set, smooth transitions, and input fidelity at the endpoints.
3.1. Conditional morphing GAN

We use a convolutional GAN approach [GPAM∗14, RMC15] for our morphing. GANs have been demonstrated to perform highly sophisticated modeling of image training data [BDS18]. This characteristic is appealing for our endeavor, as we seek to create sequences of transformation between entities belonging to a specific family of objects, i.e., our target distribution. Therefore, a GAN loss can help fulfill our first requirement of realism. In our implementation, we combine the Least-Squares GAN loss [MLX∗17] with two discriminators: a local PatchGAN [LW16] discriminator and a global discriminator. We denote by L_D and L_G the GAN losses used to train D and G respectively, each by minimization of the corresponding sum:

L_D = L_D^{real,local} + L_D^{real,global} + L_D^{fake,local} + L_D^{fake,global}   (1)

L_G = L*_G^{fake,local} + L*_G^{fake,global}   (2)

The asterisk in Equation 2 indicates the inversion of labels when D is used to evaluate L_G.

Common image morphing operates on existing instances given as input; accordingly, we opt for a special type of GAN known as the conditional GAN [MO14, IZZE17, ZPIE17, KCK∗17]. Our network is conditioned on a pair of input images I_A, I_B, as well as a scalar t specifying the desired time sample of the output in-between frame. We note that this could also be generalized to an arbitrary number k ≥ 2 of inputs, with t replaced by a vector of weights with an L1 norm of unity. Our conditional GAN consists of an encoder followed by a generator.

Our second requirement is smoothness of transitions. This is dealt with by a combination of a special training protocol and a suitable loss component. To better control and guide the generation to comply with our aim, at training time, for each input pair I_A, I_B, we generate a sequence of frames of length k. Each of these frames corresponds to a predetermined time sample, and the samples are spaced uniformly on the unit interval [0, 1]. This approach allows us to apply a loss component, L_T, designed to constrain the similarity between frames and encourage smooth transitions. More specifically, we make use of a pretrained neural network (VGG-16 [SZ14]) to obtain deep features of generated frames, upon which perceptual similarity (PS) is computed [ZIE∗18]: PS_{4,5}(I_A, I_B) = MSE(VGG_{4,5}(I_A), VGG_{4,5}(I_B)), where VGG_{4,5}(I_A) are all VGG features of I_A extracted from layer groups 4 and 5 (out of 5). Using that, we define L_T as:

L_T = \max_{i=2..k} \| PS_{4,5}(I_{i-1}, I_i) - (t_i - t_{i-1}) \cdot PS_{4,5}(I_A, I_B) \|   (3)

That is, we constrain each frame to be a certain distance, in semantic feature space, from its preceding frame. This distance should ideally match the feature distance PS_{4,5}(I_A, I_B) between the input images, after rescaling by the time between adjacent frames t_i - t_{i-1}.
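To make the transition loss concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of the perceptual similarity measure and L_T; the VGG-16 layer-group boundaries and the absence of feature normalization are assumptions.

```python
import torch
import torchvision

# Frozen VGG-16 feature extractor; indices 23 and 30 are the ends of
# layer groups 4 and 5 in torchvision's vgg16().features (an assumption
# about the paper's "layer groups", not a confirmed detail).
vgg = torchvision.models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x, group_ends=(23, 30)):
    """Collect feature maps at the chosen layer-group boundaries."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in group_ends:
            feats.append(x)
    return feats

def perceptual_similarity(a, b):
    """PS_{4,5}: MSE between deep VGG features of two image batches."""
    return sum(torch.nn.functional.mse_loss(fa, fb)
               for fa, fb in zip(vgg_features(a), vgg_features(b)))

def transition_loss(frames, ts, i_a, i_b):
    """L_T (Eq. 3): the largest deviation of an adjacent-frame feature
    distance from the time-rescaled endpoint distance PS(I_A, I_B)."""
    ref = perceptual_similarity(i_a, i_b)
    gaps = [torch.abs(perceptual_similarity(frames[i - 1], frames[i])
                      - (ts[i] - ts[i - 1]) * ref)
            for i in range(1, len(frames))]
    return torch.stack(gaps).max()
```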
The final component in our basic setup is a reconstruction loss, which encourages the endpoint frames in the sequence to match the inputs:

L_R = MSE(I_1, I_A) + MSE(I_k, I_B)   (4)

Figure 2: Grid-based freeform spatial transformer. The two inputs are concatenated and processed by the network, which outputs a 5x5 grid aligning the first to the second. The deformed first image is compared against the second image using perceptual similarity. The grid is compared to the identity grid for regularization.

3.2. Spatial transformer alignment

The characteristic locality of convolutional networks is a known hindrance in situations where changes in shape are required. To support a wide range of inputs of varying shapes, we recognize the need for higher-level, semantic information to establish the relationship between the inputs, much like classic morphing techniques that rely on correspondences between points and features to drive a warping operation. Manually collecting correspondence points between instances in large datasets such as ours is intractable. Although it is possible to incorporate an automatic correspondence computation [ALS∗18], we instead learn the alignment jointly with the morphing task. A spatial transformer network (STN) [JSZ∗15] is a component that can be added to a neural network as a means to learn and apply transformations to the data to assist the main learning task. In our setting, we seek to compute an alignment between the inputs, and apply it onto them before they are given to the generator for further processing. For greater flexibility and range of deformation, we add a spatial transformer component that computes a grid-based freeform deformation warp field [HFW∗18]. The two inputs I_A, I_B are concatenated channel-wise before passing through this component, which outputs two grids (for x and y) indicating the warp from I_A to I_B, denoted W_AB. Likewise, W_BA is obtained by switching the order of I_A and I_B. See Figure 2 for an illustration, and our supplementary material for specific design details.

We combine the STN with our sequence generation scheme by applying a series of partial deformations to the inputs, each corresponding to a certain time stamp. The partial deformation for W_AB at time t is W_AB^t = I + t \cdot (W_AB - I), and W_BA^t = I + (1 - t) \cdot (W_BA - I) for W_BA, where I is the identity warp grid. The grids are upsampled to the input image size using bilinear interpolation, and are applied onto I_A and I_B to obtain sequences of warped inputs {I_A^t}_{t=t_1}^{t_k}, {I_B^t}_{t=t_1}^{t_k} that are passed on to the encoder. See Figure 3 for three examples of partial to full deformations computed by our STN.
Figure 3: STN warp examples. We show three examples of bidirectional warps computed by our STN. For each example, the first input is in the top row on the left and the second input is in the bottom row on the right. The STN computes a full warp, shown in the top row on the right for the first input and in the bottom row on the left for the second. The warped instances in between have all been deformed with corresponding partial warps.
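As an illustration of the grid-based warp and its partial application, here is a hedged PyTorch sketch: a toy regressor predicts a coarse 5x5 offset grid from the concatenated inputs, the partial warp W^t = I + t(W - I) is formed, upsampled bilinearly, and applied with grid_sample. The architecture is a placeholder, not the paper's exact STN design.

```python
import torch
import torch.nn.functional as F

class GridSTN(torch.nn.Module):
    """Toy grid-based freeform STN: maps a concatenated image pair to a
    5x5 offset grid (x and y channels). A sketch, not the paper's design."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(6, 32, 4, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 4, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(5),
            torch.nn.Conv2d(64, 2, 1),  # two channels: x and y offsets
        )

    def forward(self, i_a, i_b):
        # Offsets represent W - I, i.e. the displacement from the identity grid.
        return self.net(torch.cat([i_a, i_b], dim=1))  # (B, 2, 5, 5)

def identity_grid(b, h, w, device):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=device),
                            torch.linspace(-1, 1, w, device=device),
                            indexing="ij")
    return torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)

def partial_warp(image, offsets, t):
    """Apply W^t = I + t * (W - I): upsample the coarse offsets bilinearly
    and resample the image with the partially deformed grid."""
    b, _, h, w = image.shape
    dense = F.interpolate(offsets, size=(h, w), mode="bilinear",
                          align_corners=True)            # (B, 2, H, W)
    grid = identity_grid(b, h, w, image.device) + t * dense.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=True)
```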
We add two losses tailored to our spatial transformer network. The first is a shape warp loss, L_W, comparing the warped I_A, denoted by I_A^{t_k}, to I_B, and the second, L_I, compares the predicted grid to the identity grid, for regularization. L_W makes another use of perceptual similarity, using the deep VGG features of layer group 5. These provide a higher level of abstraction that encourages the overall shape of the warped image to match the other endpoint, as opposed to stylistic details. The two losses are given by:

L_W = PS_5(I_A^{t_k}, I_B),   L_I = MSE(W_AB, I)   (5)

We note that the losses we have described thus far do not directly bind the inner frames to the inputs I_A, I_B. With the addition of the alignment computation, we are able to add a final perceptual similarity loss, L_E, that draws a connection between each frame and its corresponding warped inputs, without restricting the ability of the frame to shift the shape of its underlying object. We choose layer group 4 for this purpose, to benefit from a combination of abstraction and a notion of finer detail, and compute a blend of similarities dependent upon the time stamp:

L_E = \sum_{i=1}^{k} (1 - t_i) \cdot PS_4(I_{t_i}, I_A^{t_i}) + t_i \cdot PS_4(I_{t_i}, I_B^{t_i})   (6)

The total loss function of our generator is thus:

L = λ_G L_G + λ_T L_T + λ_R L_R + λ_W L_W + λ_I L_I + λ_E L_E   (7)

3.3. Architecture and training

The architectures of G and D are similar to those of DiscoGAN [KCK∗17]. G is composed of an encoder containing blocks of conv and ReLU, followed by a decoder containing blocks of tconv (transposed convolution) and ReLU. Both the local and global D contain blocks of conv and ReLU with a final Sigmoid. In both G and D, the number of blocks depends on the input image resolution. For more details please refer to our supplementary material.

We employ a late fusion protocol, where the inputs I_A^{t_i}, I_B^{t_i} are first processed separately by the encoder of G, which outputs feature maps F_A^{t_i}, F_B^{t_i} respectively. An adaptive instance normalization component [HB17] blends the mean and standard deviation of the feature maps according to the input time stamp t_i. That is, for given statistics µ_A^{t_i}, µ_B^{t_i} and σ_A^{t_i}, σ_B^{t_i}, we compute the blended statistics for time t_i:

µ^{t_i} = (1 - t_i) \cdot µ_A^{t_i} + t_i \cdot µ_B^{t_i},   σ^{t_i} = \sqrt{(1 - t_i) \cdot (σ_A^{t_i})^2 + t_i \cdot (σ_B^{t_i})^2}   (8)

F_A^{t_i} is then updated as F_A^{t_i*} = σ^{t_i} \cdot (F_A^{t_i} - µ_A^{t_i}) / σ_A^{t_i} + µ^{t_i}, and F_B^{t_i} similarly. Next, F_A^{t_i*}, F_B^{t_i*} are concatenated channel-wise, along with an additional channel containing the time stamp t_i expanded to the appropriate spatial resolution, F^{t_i}. The resulting block of data, F_A^{t_i*} F_B^{t_i*} F^{t_i}, is processed by the decoder, which outputs the corresponding generated frame. During training we generate k frames; thus we prepare k such blocks {F_A^{t_i*} F_B^{t_i*} F^{t_i}}_{i=1}^{k}, all of which are passed through the decoder.

At train time, we randomly draw another instance from within the set for each input in the batch, and together these make up the input pairs. At each iteration, we also draw at random a pool of images to be shown to D as real data. Since each pair of inputs spawns k frames, the real pool for each pair is of size k as well. See Figure 4 for a high-level illustration of our pipeline.

Figure 4: Method pipeline. A pair of inputs is first processed by the STN. The predicted warp is applied onto the inputs to obtain a sequence of warped images corresponding to different time stamps. An encoder outputs a feature map per warped image, and every respective pair of feature maps undergoes a weighted adaptive instance normalization dependent on its time stamp, and proceeds, after channel-wise concatenation, into the decoder. All the resulting frames are evaluated by the GAN loss, and a reconstruction loss compares the two endpoints to the inputs. Local and global perceptual similarity losses compare each pair of adjacent frames, and each frame to the warped inputs, respectively.
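The time-weighted adaptive instance normalization of Equation 8 can be sketched as follows (assumed shapes and epsilon handling; not the authors' code):

```python
import torch

def blended_adain(f_a, f_b, t, eps=1e-5):
    """Eq. 8: renormalize feature maps F_A, F_B of shape (B, C, H, W)
    towards per-channel statistics blended by time stamp t in [0, 1]."""
    mu_a = f_a.mean(dim=(2, 3), keepdim=True)
    mu_b = f_b.mean(dim=(2, 3), keepdim=True)
    var_a = f_a.var(dim=(2, 3), keepdim=True, unbiased=False)
    var_b = f_b.var(dim=(2, 3), keepdim=True, unbiased=False)
    mu = (1 - t) * mu_a + t * mu_b
    sigma = torch.sqrt((1 - t) * var_a + t * var_b + eps)
    f_a_star = sigma * (f_a - mu_a) / torch.sqrt(var_a + eps) + mu
    f_b_star = sigma * (f_b - mu_b) / torch.sqrt(var_b + eps) + mu
    return f_a_star, f_b_star
```

The decoder input for time t_i is then the channel-wise concatenation of the two renormalized maps and a constant plane holding t_i.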
3.4. Content and style separation

We extend our solution to address the problem of content and style separation [GEB15, JAFF16, DSK17, HB17] within the morphing scope, to allow greater control over the desired outcome and provide increased freedom of creativity. Instead of a single axis of transformation between our two inputs I_A, I_B, we seek to engage two axes corresponding to disentangled transitions of content and style. This can be viewed as a 2D morphing effect taking place within the unit square, such that at coordinate (t_i^c, t_j^s), the content of the generated frame reflects an interpolation of (1 - t_i^c) \cdot I_A^c + t_i^c \cdot I_B^c and its style a similar interpolation of (1 - t_j^s) \cdot I_A^s + t_j^s \cdot I_B^s, where t_1^c, t_1^s = 0 and t_k^c, t_k^s = 1 (k samples along both axes), and I_A^c, I_B^c and I_A^s, I_B^s are the content and style characteristics of the inputs respectively.

We recognize the inherent capacity of the various components in our pipeline towards the distinction between the manifestation of content in our setup, i.e., overall shape and geometric detail, and stylistic attributes such as color and texture. Specifically, we observe that our local and global perceptual similarity losses can be employed in such a way as to encourage one aspect or the other on demand. Combining these with the initial warping mechanism catering to content (shape) rather than style, and the adaptive instance normalization component favoring style over content, we are able to formulate a disentangled solution dependent upon the two axes of transformation.

Alignment.
Initial alignment is carried out as before, but is only governed by the content axis, disregarding the style axis completely.
Training.
The new training protocol resembles our original one in that for each input pair, we generate k frames. We randomly sample k - 2 coordinate pairs (t_i^c, t_i^s) within the unit square, in addition to the fixed endpoints t_1^c, t_1^s = 0 and t_k^c, t_k^s = 1. When the feature maps of frame i, F_A^{t_i^c}, F_B^{t_i^c}, exit the encoder, we perform adaptive instance normalization according to the style axis alone, such that t_i in Equation 8 is replaced with t_i^s. We then concatenate the two samples associated with frame i, t_i^c and t_i^s, each expanded to the appropriate spatial resolution as before, to the normalized feature stack. The stack given to the decoder is thus: F_A^{t_i^c*} F_B^{t_i^c*} F^{t_i^c} F^{t_i^s}.

Perceptual similarity losses.
We create a hard separation between the authorities of the two PS losses with respect to content and style. The local PS loss L_T is assigned to content, whereas the global loss L_E is assigned to style. For L_T, t_i in Equation 3 is replaced with t_i^c. Similarly, for L_E, t_i in Equation 6 is replaced with t_i^s. Additionally, to increase the emphasis upon stylistic elements, we compute L_E with VGG layer group 3 instead of 4.
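Assuming a trained generator with the interface described above, a 2D content-style morphing grid (as shown later in Figure 8) could be sampled with a loop of this form; generate_frame is a hypothetical wrapper around the STN, encoder, style-axis AdaIN, and decoder, not an API the paper defines.

```python
import torch

@torch.no_grad()
def morph_grid(generate_frame, i_a, i_b, n=6):
    """Sample an n x n grid of frames over the (content, style) unit square.
    Cell (i, j) interpolates content by t_c = i/(n-1) and style by
    t_s = j/(n-1); generate_frame(i_a, i_b, t_c, t_s) is hypothetical."""
    ts = torch.linspace(0.0, 1.0, n)
    return [[generate_frame(i_a, i_b, t_c.item(), t_s.item()) for t_s in ts]
            for t_c in ts]
```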
4. Evaluation
In this section we perform various experiments to evaluate our method, both within its own scope (4.1) and externally (4.2). We experiment on four datasets: boots [YG14, YG17], handbags [ZKSE16], and cars and planes rendered from ShapeNet [CFG∗15].

4.1. Ablation study
We explore the individual contributions of our various design components by conducting an ablation study. For this purpose, we train six variations of our network, aside from the proposed solution. Each variation excludes one component: GAN loss (adversary), local perceptual similarity, global perceptual similarity, reconstruction loss, adaptive instance normalization, and STN (which also excludes global perceptual similarity, see Subsection 3.2).

We compute the Fréchet Inception Distance (FID) [HRU∗17] between the generated frames of each version in each dataset and its respective training set, resized to a resolution of 96x96. The overall trend of the scores, summarized in Table 1, indicates that our main solution generates images that are generally in line with the training set distribution. Additionally, in Figure 5, which contains visual examples of generated sequences obtained with the six variations, we note the various shortcomings characterizing the six ablation variations. The "w/o GAN" version does not preserve object detail, the "w/o PS" versions do not appropriately combine characteristics from both inputs, the "w/o recon" version does not adhere to the two endpoints and neither does the "w/o adaIn" version, and the "w/o STN" version is characterized by a serious degeneration, exhibiting little to no deformation in shape, resulting in a preference of one endpoint over the other. Note that as part of our earlier experiments, we did not experience a similar degeneration with a baseline system that did not incorporate an STN. However, these earlier versions naturally produced substantially lower quality results (due to their lack of advanced image alignment), and their far-removed architecture places them outside the scope of this ablation study.
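For reference, FID between a folder of generated frames and the corresponding training set can be computed with the pytorch-fid package; this sketch assumes that package, placeholder directory names, and that images were already resized to 96x96.

```python
from pytorch_fid.fid_score import calculate_fid_given_paths

# One call per (variant, dataset) pair; paths are placeholders.
fid = calculate_fid_given_paths(
    ["generated/boots_main", "training/boots"],
    batch_size=50,
    device="cuda",
    dims=2048,  # standard InceptionV3 pool3 feature dimension
)
print(f"FID: {fid:.2f}")
```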
Ablation        Bags    Boots   Cars    Planes  Mean
Main            31.96   27.75   34.90   44.18   34.70
w/o GAN         30.71   27.98   37.10   44.52   35.08
w/o local PS    31.67   27.32   29.72   43.79   33.13
w/o global PS   36.17   31.85   38.61   49.19   38.96
w/o recon       33.18   29.03   36.13   41.17   34.88
w/o adaIn       34.40   32.57   40.29   44.51   37.94
w/o STN         53.68   57.72   64.18   57.26   58.21

Table 1: Ablation FID scores on four datasets. We compute FID scores for different versions of our method, on a test set of 100 input pairs per dataset with 9 frames each, totaling 900 frames per dataset. These generated frames are compared against the corresponding training set.
4.2. Comparisons

We compare our results to three other methods. The first is simple linear blending: we take the two sequences of warped inputs that our STN outputs, and blend each pair of corresponding frames according to their respective time stamp. The second is the morphing method of Liao et al. [LLN∗14] (termed "Halfway" in Table 2 and Figures 6 and 7). The final method is GAN-based latent space interpolation. Although recent high resolution GAN solutions such as BigGAN [BDS18] have been shown to produce impressively high quality generation and interpolation results, they are not as readily available to train; thus we opt for the well-known WGAN-GP [GAA∗17] solution, for which we make use of the official implementation. We also experimented with VAE-GAN [LSLW15] and IntroVAE [HHS∗18].
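The linear blending baseline reduces to a cross-dissolve of the STN-warped frames; a minimal sketch (our illustration, not the authors' code), where warped_a[i] and warped_b[i] are the warped inputs at time ts[i]:

```python
def linear_blend(warped_a, warped_b, ts):
    """Cross-dissolve each pair of STN-warped frames by its time stamp."""
    return [(1.0 - t) * w_a + t * w_b
            for w_a, w_b, t in zip(warped_a, warped_b, ts)]
```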
Method          Bags    Boots   Cars    Planes  Mean
WGAN-GP         68.91   83.71   54.38   55.53   65.63

Table 2: Comparing FID scores on four datasets. All methods were given the same set of 100 input pairs yielding morphing sequences of length 9, totaling 900 frames per method.
Figures 6 and 7 present qualitative examples of our generated sequences compared to those of the other methods. We observe that classic techniques exhibit excellent adherence to the original inputs as well as smooth transitions; however, at times they suffer from ghosting artifacts and exaggerated deformations due to incorrect correspondences. In contrast, our method is able to overcome differences in the overall shape, supporting a plausible transformation between the inputs.
Figure 5: Qualitative ablation. We present 11 frames generated for an input pair of planes by our main solution and the six ablation variations. All variations in this study operate on a resolution of 96x96, and were trained for 200 epochs.

Our experiments with WGAN-GP [GAA∗17] show that generation quality, as well as latent space encoding of existing instances, is still insufficient for high quality morphing effect creation. However, despite the artifacts that often appear in the generated frames, a strong advantage of latent space interpolation is its manner of frame creation. Frames are generated independently of one another, unlike approaches that are based on warp and cross-dissolve operations, and thus ghosting artifacts are naturally avoided.
To obtain user perspective, we designed a survey that presents the user with 36 pairs of morphing effects (9 from each dataset), such that each pair is composed of our result vs. that of one of the compared methods (in arbitrary order). For each pair, the users were asked to select the one they preferred of the two (subjectively), as well as the one that exhibits a more plausible transformation of shape (objectively). Users were able to select 'no preference' whenever they wished. A total of 50 participants took part in our study. The results are shown in Table 3, where the column 'Ours' contains the portion of morphing effects where our method was selected over the other method (appearing in the 'Compared to' column). Likewise, the column 'Theirs' contains the portion where the other method triumphed, and the 'Tie' column specifies the remaining portion, where 'no preference' was selected. The statistics of the two questions appear in the same cell in the format q1/q2, such that q1 corresponds to the statistics of the first question. These results show that in all sets except for Planes, users prefer Liao et al. [LLN∗14] (Halfway) over ours, with larger margins in the real image datasets (Bags and Boots), where faithfulness to the original image statistics is more crucial. The Planes dataset contains instances with highly distinct silhouettes that prove challenging for all methods, but are slightly better handled by our method, which is able to reliably compute the alignment between the inputs. Our method had the upper hand over Linear blend and WGAN-GP in all datasets, with a smaller margin against Linear blend, whose performance is satisfactory when the two inputs are sufficiently similar in shape, but which otherwise produces ghosting artifacts. Note that all morphing effects were taken from the pool of 100 effects per dataset that we generated from the test set, all of which are available for viewing in our supplementary material.

While the classic method of Liao et al. [LLN∗14] has the overall upper hand in terms of user preference, the advantage of our method is its consistency and robustness to different shape silhouettes and textures, and its speedy inference time (see our supplementary material for run time comparisons). Our main limitation is individual frame quality, which relies on network generation; thus, the latest and future advances in neural generation may help alleviate this, although at a probable training time penalty.
Figure 8 contains two examples of content and style disentangled morphing, as described in Subsection 3.4. For a given input pair, we generate each frame in a 6x6 grid, such that for cell (i, j), the coordinate i represents the desired location on the content axis, and similarly for coordinate j on the style axis.
Figure 6: Qualitative comparisons. Morphing sequences of bags and boots, generated by our method vs. three others (rows: Ours, Linear blend, Halfway, WGAN-GP).
Set     Compared to     Ours        Theirs  Tie
Bags    Halfway         0.25/0.27

Table 3: User study results. Refer to the main text for details.

For more results, please see our supplementary material. For our full implementation, please see our GitHub page.
5. Conclusion
We presented a new approach for morphing effect generation, combining the conditional GAN paradigm with a grid-based freeform deformation STN and a set of perceptual similarity losses. The components that make up our pipeline have been carefully curated to promote the generation of realistic in-betweens with smooth and gradual transitions, resulting in a solution that is robust to inputs exhibiting differences in shape and texture. In particular, shape misalignments are overcome automatically by the integrated STN, which learns a strong shape prior based on semantic features, rather than on potentially misleading low-level features.

In a world that is constantly hungry for more visual data, the ability to generate high-fidelity image instances is particularly beneficial. These can be used not only for artistic purposes, but also to enrich and augment existing datasets in support of various endeavors requiring substantial amounts of information. Moreover, as a frame generation framework, a natural and potentially advantageous connection ties us to the field of video processing and synthesis, one that may establish a bidirectional exchange of ideas with the prospect of mutual gain.

We note that our current setup is composed of simple building blocks: a no-frills generator and discriminator that maintain a balance of good performance with low computational cost. Despite that, potential improvements and extensions to these components may further increase the quality of the generated frames, which are not always free of common morphing maladies such as ghosting and blurring. The addition of supervision to the pipeline may broaden the scope of our approach, and allow various types of transitions such as rotations. Similarly, morphing between images with arbitrary backgrounds may call for the integration of a dedicated segmentation component, one that is either pretrained, or trained within the entire framework in an end-to-end manner.
Figure 7: Qualitative comparisons. Morphing sequences of cars and planes, generated by our method vs. three others (rows: Ours, Linear blend, Halfway, WGAN-GP).
Acknowledgements
This work was supported by Adobe and the Israel Science Foundation (grants no. 2366/16 and 2472/17).
References

[AECOK16] Averbuch-Elor H., Cohen-Or D., Kopf J.: Smooth image sequences for data-driven morphing. In Computer Graphics Forum (2016), vol. 35, Wiley Online Library, pp. 203–213.

[ALS∗18] Aberman K., Liao J., Shi M., Lischinski D., Chen B., Cohen-Or D.: Neural best-buddies: sparse cross-domain correspondence. ACM Transactions on Graphics (TOG) 37, 4 (2018), 69.

[BCW∗17] Bao J., Chen D., Wen F., Li H., Hua G.: CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2745–2754.

[BDS18] Brock A., Donahue J., Simonyan K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018).

[BN92] Beier T., Neely S.: Feature-based image metamorphosis. Computer Graphics 26, 2 (1992), 35–42.

[BSM17] Berthelot D., Schumm T., Metz L.: BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017).

[CFG∗15] Chang A. X., Funkhouser T., Guibas L., Hanrahan P., Huang Q., Li Z., Savarese S., Savva M., Song S., Su H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015).

[DS19] Donahue J., Simonyan K.: Large scale adversarial representation learning. In Advances in Neural Information Processing Systems (2019), pp. 10541–10551.

[DSB∗12] Darabi S., Shechtman E., Barnes C., Goldman D. B., Sen P.: Image melding: Combining inconsistent images using patch-based synthesis. ACM Trans. Graph. 31, 4 (2012), 82.

[DSK17] Dumoulin V., Shlens J., Kudlur M.: A learned representation for artistic style. Proc. of ICLR 2 (2017).

[GAA∗17] Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A. C.: Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (2017), pp. 5767–5777.

[GEB15] Gatys L. A., Ecker A. S., Bethge M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).

[GPAM∗14] Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y.: Generative adversarial nets. In Advances in Neural Information Processing Systems (2014), pp. 2672–2680.

[HB17] Huang X., Belongie S.: Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 1501–1510.

[HFW∗18] Hanocka R., Fish N., Wang Z., Giryes R., Fleishman S., Cohen-Or D.: ALIGNet: Partial-shape agnostic alignment via unsupervised learning. ACM Transactions on Graphics (TOG) 38, 1 (2018), 1.

[HHS∗18] Huang H., He R., Sun Z., Tan T., et al.: IntroVAE: Introspective variational autoencoders for photographic image synthesis. In Advances in Neural Information Processing Systems (2018), pp. 52–63.

[HRU∗17] Heusel M., Ramsauer H., Unterthiner T., Nessler B., Hochreiter S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (2017), pp. 6626–6637.
Figure 8: Content and style disentangled morphing. A 2D 6x6 morphing grid between an input pair of boots (shown at the top left and bottom right corners) appears on the left, and similarly for cars on the right.

[IZZE17] Isola P., Zhu J.-Y., Zhou T., Efros A. A.: Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1125–1134.

[JAFF16] Johnson J., Alahi A., Fei-Fei L.: Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (2016), Springer, pp. 694–711.

[JKMS17] Ji D., Kwon J., McFarland M., Savarese S.: Deep view morphing. In Computer Vision and Pattern Recognition (CVPR) (2017), vol. 2.

[JSZ∗15] Jaderberg M., Simonyan K., Zisserman A., et al.: Spatial transformer networks. In Advances in Neural Information Processing Systems (2015), pp. 2017–2025.

[KALL17] Karras T., Aila T., Laine S., Lehtinen J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).

[KCK∗17] Kim T., Cha M., Kim H., Lee J. K., Kim J.: Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), JMLR.org, pp. 1857–1865.

[LLN∗14] Liao J., Lima R. S., Nehab D., Hoppe H., Sander P. V., Yu J.: Automating image morphing using structural similarity on a halfway domain. ACM Transactions on Graphics (TOG) 33, 5 (2014), 168.

[LSLW15] Larsen A. B. L., Sønderby S. K., Larochelle H., Winther O.: Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300 (2015).

[LW16] Li C., Wand M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision (2016), Springer, pp. 702–716.

[LYY∗17] Liao J., Yao Y., Yuan L., Hua G., Kang S. B.: Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088 (2017).

[MLX∗17] Mao X., Li Q., Xie H., Lau R. Y., Wang Z., Paul Smolley S.: Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2794–2802.

[MO14] Mirza M., Osindero S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).

[RMC15] Radford A., Metz L., Chintala S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

[SRAIS10] Shechtman E., Rav-Acha A., Irani M., Seitz S.: Regenerative morphing. In Computer Vision and Pattern Recognition (CVPR) (2010), IEEE, pp. 615–622.

[SZ14] Simonyan K., Zisserman A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[WBS∗04] Wang Z., Bovik A. C., Sheikh H. R., Simoncelli E. P., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.

[WRSJ19] Webster R., Rabin J., Simon L., Jurie F.: Detecting overfitting of deep generative networks via latent recovery. arXiv preprint arXiv:1901.03396 (2019).

[YG14] Yu A., Grauman K.: Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR) (Jun 2014).

[YG17] Yu A., Grauman K.: Semantic jitter: Dense supervision for visual comparisons via synthetic images. In International Conference on Computer Vision (ICCV) (Oct 2017).

[ZIE∗18] Zhang R., Isola P., Efros A. A., Shechtman E., Wang O.: The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 586–595.

[ZKSE16] Zhu J.-Y., Krähenbühl P., Shechtman E., Efros A. A.: Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision (2016), Springer, pp. 597–613.

[ZPIE17] Zhu J.-Y., Park T., Isola P., Efros A. A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2223–2232.