Unsupervised Novel View Synthesis from a Single Image
Pierluigi Zama Ramirez, [email protected], University of Bologna
Alessio Tonioni, [email protected], Google Inc.
Federico Tombari, [email protected], Technische Universität München
Abstract
Novel view synthesis from a single image aims at generating novel views from a single input image of an object. Several works have recently achieved remarkable results, though they require some form of multi-view supervision at training time, therefore limiting their deployment in real scenarios. This work relaxes this assumption, enabling the training of a conditional generative model for novel view synthesis in a completely unsupervised manner. We first pre-train a purely generative decoder model using a GAN formulation, while at the same time training an encoder network to invert the mapping from latent code to images. Then we swap encoder and decoder and train the network as a conditional GAN with a mixture of an auto-encoder-like objective and self-distillation. At test time, given a view of an object, our model first embeds the image content in a latent code and regresses its pose w.r.t. a canonical reference system, then generates novel views of it by keeping the code fixed and varying the pose. We show that our framework achieves results comparable to the state of the art on ShapeNet and that it can be employed on unconstrained collections of natural images, where no competing method can be trained.
1. Introduction
Novel View Synthesis (NVS) aims at generating novel viewpoints of an object or a scene given only one or a few images of it. Given its general formulation, it is at the same time one of the most challenging and well studied problems in computer vision, with several applications ranging from robotics, image editing and animation to 3D immersive experiences. Traditional solutions are based on multi-view reconstruction using a geometric formulation [11, 14, 49]. Given several views of an object, these methods build an explicit 3D representation (e.g., mesh, point cloud, voxels...) and render novel views of it. A more challenging problem is novel view synthesis given just a single view of the object as input. In this case multi-view geometry cannot be leveraged, making this an intrinsically ill-posed problem. Deep learning, however, has made it approachable by relying on inductive biases, similarly to what humans do. For instance, when provided with the image of a car, we have the ability to picture how it would look from a different viewpoint. We can do it because we have, unconsciously, learnt an implicit model of the 3D structure of a car; therefore we can easily fit the current observation to our mental model and use it to hallucinate a novel view of the car from a different angle. This work tries to replicate a similar behaviour using a deep generative model. Several works have recently started to explore this line of research [62, 31, 42] by designing category-specific generative models that, at test time, can render novel views of an object given a single input view. However, all these methods require expensive supervision at training time, such as images with camera poses or multiple views of the same object, which practically makes them usable only on academic benchmarks. We want to overcome this limitation and train a model for NVS from a single image using only natural images at training time (such as those available from the web) without any additional supervision. This would allow deployment to any object category without being constrained to synthetic benchmarks or well-crafted and annotation-intensive datasets. The underlying research question that we address in this work is: how can we learn to hallucinate novel views of a certain object from natural images only?
Inspiration for our work comes from the recent introduction of several 3D-aware generative models [37, 38, 29, 48], which allow to disentangle the content of a scene from the viewpoint from which it is observed. All these methods can synthesise novel views of a generated object; however, they cannot be directly used to render novel views of a real object, i.e., they cannot be used for NVS. We leverage similar techniques to design a completely unsupervised novel framework for NVS that can be trained on natural images of an object category. We rely on an encoder network that projects an input view to a latent vector and an estimated pose, and on a decoder architecture that decodes the encoder outputs to recreate the input view (by keeping the estimated pose) or to generate novel views (by changing the pose). We design a two-stage training procedure for this system. First, we pre-train the decoder in a generative fashion to produce realistic images from a random latent vector and pose. At the same time we also pre-train the encoder to invert this transformation and estimate the latent vector and pose of a generated image. Then, in a second stage, we swap the position of encoder and decoder, turning the network into a conditional GAN for NVS. Given the change of order between encoder and decoder in the two stages of our approach, we named it Disassembled Hourglass. In the first stage the decoder learns an implicit 3D model of the category, while the encoder is pre-trained to project images to a corresponding latent vector and estimated pose. Given the lack of explicit supervision, we allow the decoder to figure out a representation and a canonical orientation that make novel view generation easier, while the encoder is trained seamlessly to use the same convention to regress the pose of an input image. In the second stage, instead, we train the system to reconstruct real images in an auto-encoder fashion. This objective might result in a degenerate solution that fits the input view but does not maintain the 3D structure of the object. Therefore, to counteract this effect, we propose to self-distill the knowledge of the generative model obtained in the first step. While training the auto-encoder to reconstruct real images, we also generate random samples at different rotations. The encoder processes a source generated view but, instead of reconstructing the same image, it reconstructs the image from a different viewpoint. Supported by experimental results, we argue that this is key to learning a network that is able to produce meaningful novel views. Finally, to further increase reconstruction fidelity, we add an optional third step that relies on an image-specific self-supervised fine-tuning to recover fine details with just a few optimization steps.

We demonstrate the validity of our framework with thorough experiments on the standard ShapeNet [7] benchmark for NVS, showing comparable results with state-of-the-art NVS methods that require more supervision. Moreover, we also test it on real datasets of people [32] and cars [64], where competitors cannot be trained due to the lack of annotations. Some qualitative results of our method can be seen in Figure 1.

Figure 1. Novel view synthesis from a single input image. We train our method on a set of natural images of a certain category (e.g., faces, cars...) without any 3D, multi-view or pose supervision. Given an input image (left column) our model estimates a global latent representation of the object and its pose θ to reconstruct the input image (images framed in red in each row). By changing the pose θ we can synthesize novel views of the same object (columns 2-8).

To summarize, our contributions are:
• We introduce Disassembled Hourglass, a novel framework for NVS that reconstructs novel views of a given object from a single input image, while at the same time regressing the pose of the image with respect to a learned reference system.
• Our framework can be trained on natural images without any form of explicit supervision besides depicting objects of a certain category.
• We introduce a two-stage training approach that allows us to self-distill the knowledge of a 3D-aware generative model to make it conditional while preserving the disentanglement of pose and content.

2. Related Works
Novel View Synthesis. Traditional approaches to the NVS problem are based on multi-view geometry [11, 14, 49]. They estimate an explicit 3D representation (e.g., a mesh) of the object, then render novel views from it. More recently, and thanks to the advent of CNNs, several learning-based approaches have been proposed. Some of them, inspired by traditional techniques in 3D computer vision, train networks to explicitly estimate the 3D representation of the object and then use it for rendering. Different works have considered different 3D representations: mesh [45], point cloud [30], voxel map [20], depth map [16, 61]. All these methods, however, need some kind of 3D supervision. Recently, [60] relaxed this assumption and proposed to infer 3D properties such as depth, normals and camera pose from a single image to reconstruct a 3D mesh of the object, relying on the assumption of dealing with quasi-symmetric objects. A different line of work is represented by learning-based approaches which attempt to estimate novel views by directly learning the mapping from a source to a target view [24, 13, 66, 43, 55, 56, 53, 54, 62, 63] without explicitly estimating a 3D model. Among them, we find approaches suitable for a single object instance [33, 51, 35] or able to generalize to novel scenes/objects [47, 52, 8, 59, 31, 62, 19, 65, 54, 66, 42, 69, 15]. Training an instance-specific model generally produces higher quality results at the cost, however, of a long training time for each object, a constraint often prohibitive for real applications. General approaches are, instead, trained once per category and can therefore be more easily deployed on real problems. However, all the proposed solutions require poses, 3D models or multiple views as a source of supervision during training. To the best of our knowledge, we are the first to relax these assumptions and learn a category-specific generative network for NVS from natural images.
Generative Latent Space. Generative Adversarial Networks (GANs) [17] have been successfully applied to many computer vision applications, making them the most popular generative models. Thanks to recent advances in architectures [6, 25, 22, 44], loss functions [34, 5] and regularization [36, 18], GANs are able to synthesise images of impressive realism [25, 6]. A recent line of work has started to explore the possibility of modifying pre-existing generators to use them for image manipulation tasks. These approaches investigate a way to go from image space to the generator latent space, and how to navigate the latent space to produce variants of the original picture. Two main families of solutions have been proposed for this task: passing the input image through an encoder neural network (e.g., a Variational Auto-Encoder [27]), or optimizing a random initial latent code by backpropagation to match the input image [2, 3, 10, 67]. Recently, Config [28] proposed to use an encoder to embed face images into a latent space that disentangles pose and appearance. By modifying values of the embedded appearance, it enables a remarkable variety of explicit image manipulations (e.g., color, haircut, pose, etc.). However, the method needs a 3D synthetic parametric model to generate training data and has therefore been tested only on faces. In our work, along the lines of this research avenue, we embed real images into the latent space of a generative model, and we do so without supervision and by disentangling pose and appearance.

3D-Aware Generative Models. Generative models have recently been extended with different forms of 3D representation embedded directly in the network architecture [29, 4, 37, 57, 68, 48] to synthesise objects from multiple views. The majority of methods require 3D supervision [57, 68] or 3D information as input [41, 4]. [21, 37] train generative models without supervision, from only collections of images without constraints. [21] learns a textured 3D voxel representation from 2D images using a differentiable renderer, while HoloGAN [37] and related works [29, 38, 39] learn a low-dimensional 3D feature combined with a learnable 3D-to-2D projection. GRAF [48] leverages neural radiance fields [35] conditioned on a random variable to enable generation of novel objects. We build on these recent findings and extend 3D-aware generative models to perform novel view synthesis from natural images. In particular, we employ the HoloGAN architecture since it can be trained on a collection of natural images.
3. Method
The key concept behind the proposed framework is to exploit the implicit knowledge of 3D-aware generative models in order to perform novel view synthesis. Figure 2 shows an overview of the proposed approach.
Let θ_s and θ_t denote an unknown source and target viewpoint; our goal is to synthesize a novel view I_t from θ_t given only a single source image I_s taken from θ_s. Our framework, namely Disassembled Hourglass, is composed of two main components: an encoder E and a decoder D. E takes as input I_s and estimates a global latent representation z_s jointly with the source view θ_s. D, instead, takes z_s and θ_t as input to synthesise the novel view I_t. A naive solution to our problem could be to train an auto-encoder to reconstruct images, i.e., θ_t = θ_s, hoping that D would learn to correctly take θ into account. Something similar was successfully used by [42], but relying on multi-view supervision at training time. However, in our scenario, without multiple views of the same object, the network is free to learn incomplete representations that are good to reconstruct the input image but cannot be used to generate novel views, i.e., it learns to ignore the pose. We will show experimental results to support this claim in subsection 5.4.
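In code, the desired test-time behaviour reduces to a single encode-decode round trip. The following minimal sketch (illustrative names and interfaces, not the actual implementation) shows how the two components are meant to be used:

```python
def synthesize_novel_view(encoder, decoder, image, target_pose):
    """Single-image NVS with pre-trained E and D (illustrative interface).

    `encoder` maps an image batch to (z, theta); `decoder` maps (z, theta)
    back to an image batch. Keeping z and swapping theta yields a novel view.
    """
    z, theta_src = encoder(image)             # content code and estimated source pose
    reconstruction = decoder([z, theta_src])  # re-render the input view
    novel_view = decoder([z, target_pose])    # same content, different pose
    return reconstruction, novel_view
```

The rest of this section describes how to train E and D so that this simple interface actually produces 3D-consistent novel views.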
Figure 2. We propose Disassembled Hourglass, a two-stage framework to invert the knowledge of 3D-aware generative models for unsupervised novel view synthesis. During the first stage (left) we train a 3D-aware decoder network using a generative formulation, while simultaneously training an encoder to invert the generation process and regress the latent code z and the pose θ. During the second stage (right), we swap the encoder with the decoder and train the model in an auto-encoder fashion to reconstruct real images, and we self-distill the pre-trained generative model to keep multi-view consistency.

To overcome this problem, we propose a two-stage training framework. In the first stage, subsection 3.2, we train the decoder D in a generative manner to create realistic views starting from randomly sampled latent representations and poses. This stage helps D to learn a good initial representation for the object category. In parallel, we also pre-train the encoder to invert the process, learning to go from images to a latent space and pose. We argue that this step is fundamental to achieve good NVS results and to prevent the network from finding undesirable shortcuts. In the second stage, subsection 3.3, we swap the decoder with the encoder, rebuilding the original auto-encoder structure, and we train the network to reconstruct real images. Moreover, we jointly train the auto-encoder on samples generated by a frozen D from multiple views. We experimentally show that this process, defined as self-distillation of the generative model, is extremely important to generate meaningful novel views. Finally, we propose an optional final step, subsection 3.4, where we employ one-shot learning to fine-tune our network on a specific target image with a self-supervised protocol.

As outlined previously, in the first stage we train D in a generative way. Thus, we have to learn a decoder able to synthesise realistic images from a random latent variable z_rand and a random pose θ_rand. Moreover, we would like images generated from the same latent vector but with different poses to have 3D consistency. 3D-aware generative models explicitly address this problem, trying to find meaningful 3D representations, typically from single images only and without any additional supervision. Usually these works train the generator to fool a discriminator, Disc, in order to craft realistic images using an adversarial loss, L_adv. In this work we use HoloGAN [37] as our D since it meets the previous requirements. Additional implementation details of D can be found in section 4.

Together with the decoder, we pre-train an encoder network E to learn the mapping from a generated image to its latent space. Thus, we use D to generate an image I_rand = D(z_rand, θ_rand) from a randomly sampled latent code and pose; then we optimize E to invert the generation process and estimate the corresponding view and latent vector by minimizing the following losses:

$L_z = \mathbb{E}\left[ \lVert z_{rand} - \hat{z}_{rand} \rVert \right]$ (1)

$L_\theta = \mathbb{E}\left[ \lVert \theta_{rand} - \hat{\theta}_{rand} \rVert \right]$ (2)
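A simplified Stage-1 iteration can be sketched as follows. The snippet assumes Keras-style encoder, decoder and discriminator models with the interfaces described above, a DCGAN-style adversarial loss, and uniform sampling ranges for z and θ; all of these are illustrative assumptions rather than the exact original implementation:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def stage1_step(decoder, encoder, disc, real_images, batch_size,
                opt_dec, opt_enc, opt_disc, z_dim=128):
    """One simplified Stage-1 iteration.

    The decoder is trained as a 3D-aware GAN (HoloGAN-style) against `disc`,
    while the encoder is trained in parallel to invert it, i.e. to regress
    (z, theta) from generated images (Eqs. 1 and 2).
    """
    z_rand = tf.random.uniform([batch_size, z_dim], -1.0, 1.0)
    theta_rand = tf.random.uniform([batch_size, 3], 0.0, 1.0)  # azimuth, elevation, scale in [0, 1]

    with tf.GradientTape(persistent=True) as tape:
        fake = decoder([z_rand, theta_rand])

        # Adversarial objectives for discriminator and decoder (generator).
        d_real, d_fake = disc(real_images), disc(fake)
        loss_disc = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        loss_dec = bce(tf.ones_like(d_fake), d_fake)

        # Encoder pre-training: invert the generation process (Eqs. 1 and 2).
        z_hat, theta_hat = encoder(fake)
        loss_enc = (tf.reduce_mean(tf.abs(z_rand - z_hat)) +
                    tf.reduce_mean(tf.abs(theta_rand - theta_hat)))

    for loss, model, opt in [(loss_disc, disc, opt_disc),
                             (loss_dec, decoder, opt_dec),
                             (loss_enc, encoder, opt_enc)]:
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
    del tape
    return loss_disc, loss_dec, loss_enc
```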
As outlined previously, during the second stage we train our system as an auto-encoder. Given a real source image I_s, we propagate it through the encoder E to get E(I_s) = (ẑ_s, θ̂_s), i.e., the predicted latent vector and view respectively. At this point, we use D to reconstruct the original image Î_s = D(ẑ_s, θ̂_s). We use the pre-trained D and E from stage 1. To measure the accuracy of the reconstruction we use a mixture of losses. Precisely, we use a combination of pixel-wise L1, perceptual [23] L_vgg, and SSIM [58] L_ssim losses:

$L_1 = \mathbb{E}\left[ \lVert I_s - \hat{I}_s \rVert_1 \right]$ (3)

$L_{vgg} = \sum_i \mathbb{E}\left[ \lVert V_i(I_s) - V_i(\hat{I}_s) \rVert \right]$ (4)

$L_{ssim} = \mathbb{E}\left[ 1 - SSIM(I_s, \hat{I}_s) \right]$ (5)

where V_i are the features extracted at the i-th layer of a VGG16 [50] backbone pre-trained on ImageNet [12]. We experimentally achieved the best reconstruction results by using the output of the block2_conv2 layer. Moreover, to increase the quality and details of the reconstruction, we follow the example of [22] and train the generator adversarially with Disc using the DCGAN [46] loss, L_adv, computed using I_s as real samples and Î_s as fake samples. Since our generator is pre-trained, to stabilize the adversarial learning we start from the pre-trained Disc of stage 1.

We experimentally verified that by naively minimizing the above losses the network may forget the learned 3D representations, i.e., it is able to reconstruct the input view nicely, but not to generate novel views. However, we know that D has a consistent 3D representation at the end of stage 1. Thus, we force our network to preserve this crucial property by self-distilling the knowledge from a frozen version of itself, namely D_f. In practice, we train the model on a set of images generated by D_f. For these images we have both "labels" for the latent code and pose that generated them, as well as multiple views of the same object. We use all this information to train the encoder and decoder with direct supervision. For the training of E, given a random latent vector z_rand and a random view θ_rand, we generate an image from the frozen decoder D_f as I_rand = D_f(z_rand, θ_rand). Then, we can predict ẑ_rand and θ̂_rand from E(I_rand) and apply the consistency losses described in Equation 1 and Equation 2. For the training of D, given z_rand and θ_s, we use D_f to generate a novel view of the same object, Î. Instead of applying the standard auto-encoder loss, we use E and D to reconstruct a novel view of the object and provide direct supervision on it. This has the double effect of preserving the meaning of the 3D representations and of better disentangling the pose from the latent representation. To do so, the encoder predicts the latent vector ẑ_rand from Î and generates a novel view as Î_rand = D(ẑ_rand, θ_rand). Then, we can apply the previous reconstruction losses to the novel view:

$L^{gen}_1 = \mathbb{E}\left[ \lVert \hat{I}_{rand} - I_{rand} \rVert_1 \right]$ (6)

$L^{gen}_{vgg} = \sum_i \mathbb{E}\left[ \lVert V_i(\hat{I}_{rand}) - V_i(I_{rand}) \rVert \right]$ (7)

$L^{gen}_{ssim} = \mathbb{E}\left[ 1 - SSIM(\hat{I}_{rand}, I_{rand}) \right]$ (8)

Finally, the total auto-encoder objective is the sum of the previously defined losses.

At test time, given an input image I of an object, we can use E to predict the corresponding (z, θ); by keeping z fixed and changing θ we can easily synthesize novel views of the specific object depicted in I.
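The Stage-2 objective can be summarized in the following sketch, which combines the auto-encoding losses on real images (Eqs. 3-5) with the self-distillation losses on samples from the frozen decoder D_f (Eqs. 6-8). The VGG layer follows the choice stated above (block2_conv2); images in [0, 1], equal loss weights and the omitted adversarial term are simplifying assumptions:

```python
import tensorflow as tf

# Perceptual features from an ImageNet-pretrained VGG16 (block2_conv2, as stated above).
_vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_vgg_feat = tf.keras.Model(_vgg.input, _vgg.get_layer("block2_conv2").output)

def recon_losses(target, pred):
    """Pixel-wise L1 + VGG-perceptual + SSIM losses (Eqs. 3-5 / 6-8), images in [0, 1]."""
    l1 = tf.reduce_mean(tf.abs(target - pred))
    t = tf.keras.applications.vgg16.preprocess_input(target * 255.0)
    p = tf.keras.applications.vgg16.preprocess_input(pred * 255.0)
    vgg = tf.reduce_mean(tf.abs(_vgg_feat(t) - _vgg_feat(p)))
    ssim = tf.reduce_mean(1.0 - tf.image.ssim(target, pred, max_val=1.0))
    return l1 + vgg + ssim

def stage2_losses(encoder, decoder, frozen_decoder, real_images,
                  z_rand, theta_rand, theta_src):
    """Auto-encoding of real images plus self-distillation from the frozen decoder D_f."""
    # (a) Auto-encoder branch on real images (the adversarial term is omitted here).
    z_s, theta_s = encoder(real_images)
    rec = decoder([z_s, theta_s])
    loss_auto = recon_losses(real_images, rec)

    # (b) Self-distillation: D_f renders a source and a target view of the same object;
    #     the encoder sees only the source view, and the decoder must reproduce the
    #     target view, which preserves 3D consistency and pose disentanglement.
    source_view = frozen_decoder([z_rand, theta_src])
    target_view = frozen_decoder([z_rand, theta_rand])
    z_hat, _ = encoder(source_view)
    pred_view = decoder([z_hat, theta_rand])
    loss_distill = recon_losses(target_view, pred_view)

    return loss_auto + loss_distill
```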
Figure 3. Qualitative pose estimation on CelebA. We use the latent z corresponding to the I_z images in the first column and mix them with the pose θ estimated from the I_θ images in the first row. The result is a set of views with the same appearance and different poses.

[Table 1: competing methods (including MV3D [54] and AUTO3D [31]) and ours, with the supervision they require (poses, multi-views) and their L1 ↓ / SSIM ↑ scores on ShapeNet-Sofa.]

Table 1. Evaluation of NVS using a single input image on the ShapeNet-Sofa dataset. The best results are highlighted in bold. When computing the L1 error, pixel values are in the range [0, 255].

To improve the quality of the reconstruction we can also use the auto-encoder losses L_1, L_ssim, L_vgg, and L_adv to optimize the value of (z, θ) by back-propagation, starting from the initial values provided by E. This step is optional, but we will show how it can improve the reconstruction by adding fine details to the reconstructed images without degrading the ability to generate novel views.
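A sketch of this optional refinement is given below: (z, θ) are initialized from E and optimized by back-propagation through the decoder. Only a pixel-wise L1 term is shown, and the learning rate is a placeholder; the full version also uses the SSIM, VGG and adversarial terms and can update E and D as well:

```python
import tensorflow as tf

def finetune_on_image(encoder, decoder, image, steps=100, learning_rate=1e-3):
    """Optional third stage (simplified): refine (z, theta) for one test image."""
    z0, theta0 = encoder(image)
    z = tf.Variable(z0)
    theta = tf.Variable(theta0)
    opt = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            reconstruction = decoder([z, theta])
            loss = tf.reduce_mean(tf.abs(image - reconstruction))  # pixel-wise L1 only
        grads = tape.gradient(loss, [z, theta])
        opt.apply_gradients(zip(grads, [z, theta]))
    return z, theta
```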
4. Experimental Settings
Training Details.
Our pipeline is implemented in TensorFlow [1] and trained on one NVIDIA 1080 Ti with 11 GB of memory. During Stage 1 we train HoloGAN following the procedure of the original paper. We weight all losses equally. We train our networks for 50 and 30 epochs for stage 1 and stage 2 respectively. We use the Adam optimizer [26], and after 25 epochs we start to linearly decay the learning rate. During the optional third stage we fine-tune for 100 steps.
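For reference, the schedule above can be collected in a small configuration sketch; values not stated in the text (batch size, base learning rate) are left as placeholders rather than guessed:

```python
# Training schedule as described above; None marks values not specified here.
TRAIN_CONFIG = {
    "resolution": 128,            # training resolution for all datasets
    "stage1_epochs": 50,          # generative (HoloGAN-style) pre-training
    "stage2_epochs": 30,          # conditional stage with self-distillation
    "lr_decay_start_epoch": 25,   # linear learning-rate decay afterwards
    "optimizer": "adam",          # Adam [26]
    "loss_weights": 1.0,          # all losses weighted equally
    "finetune_steps": 100,        # optional per-image third stage
    "batch_size": None,           # placeholder: not specified here
    "base_learning_rate": None,   # placeholder: not specified here
}
```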
[Table 2: ablation on ShapeNet Cars. Rows (Id 1-7) toggle Stage 1 pre-training, auto-encoder training, self-distillation, multi-view supervision on generated samples, the adversarial loss and the z & θ consistency losses; columns report L1 ↓ and SSIM ↑.]

Table 2. Ablation study on ShapeNet Cars. When computing the L1 error, pixel values are in the [0, 255] range.

Network Architectures. We implement D and Disc with the same architectures as HoloGAN [37], except that for up-sampling we use a 3x3 convolution followed by bilinear sampling instead of deconvolutions. We do so to reduce the well-known grid artifact problem [40]. HoloGAN learns 3D representations assuming the object is at the center of the image. It encodes the pose θ as the azimuth and elevation of the object with respect to a canonical reference system which is learnt directly by the network. Moreover, HoloGAN uses a scale parameter representing the object dimension, which is inversely related to the distance between the object and the virtual camera. Thus, we use the same parametrization and train E to estimate θ as azimuth, elevation, and scale. E is composed of three stride-2 residual blocks with reflection padding, followed by two heads for pose and z regression. The pose head is a sequence of 3x3 convolution, average pooling, fully connected and sigmoid layers. The estimated pose is remapped from [0, 1] into a given range depending on the dataset. The z head has a similar architecture, except for the tanh activation at the end. z is a 1D vector; its dimensionality differs when training on the Real Cars dataset.
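A compact Keras-style sketch of the encoder E described above follows; the filter counts are illustrative assumptions, and the reflection padding of the residual blocks is omitted for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Simplified stride-2 residual block (reflection padding omitted)."""
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    y = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.ReLU()(layers.add([shortcut, y]))

def build_encoder(z_dim=128, pose_dim=3, image_size=128):
    """Encoder E: three stride-2 residual blocks, a pose head (sigmoid in [0, 1])
    and a z head (tanh). Filter counts and z_dim are illustrative."""
    inputs = layers.Input((image_size, image_size, 3))
    x = inputs
    for filters in (64, 128, 256):
        x = residual_block(x, filters)

    # Pose head: 3x3 convolution, average pooling, fully connected, sigmoid.
    p = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    p = layers.GlobalAveragePooling2D()(p)
    pose = layers.Dense(pose_dim, activation="sigmoid")(p)  # azimuth, elevation, scale

    # z head: same structure, tanh activation at the end.
    q = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    q = layers.GlobalAveragePooling2D()(q)
    z = layers.Dense(z_dim, activation="tanh")(q)

    return tf.keras.Model(inputs, [z, pose], name="encoder_E")
```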
Datasets. We evaluate our framework on a wide range of datasets including ShapeNet [7], CelebA [32] and Real Cars [64]. We train at 128x128 resolution for all datasets. Even when available, we did not make use of any annotations during training. ShapeNet [7] is a synthetic dataset of 3D models belonging to various categories. We tested our method on the cars and sofa categories using the renderings from [9], as done in [62, 31]. For each category, following the standard splits defined in [62, 31], we use 70% of the models for training, 20% for testing and 10% to validate our approach. Even though the dataset provides multiple views of each object, we did not use this information at training time since our method does not require any multi-view assumption. CelebA [32] is a dataset composed of 202,599 images of celebrity faces. We use the aligned version of the dataset, center-cropping the images. Real Cars [64] is a large-scale dataset of real cars composed of 136,726 images. For the real datasets we use 80% of the images for training and 20% for testing. All quantitative and qualitative results are on the test set. We make use of the bounding boxes provided by the dataset to crop out the cars: we crop a square with the same center as the bounding box and size equal to the largest side of the bounding box. If the square exceeds the image boundaries, we crop the largest possible square. Then, we resize the images to 128x128.
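The cropping procedure for Real Cars can be sketched as follows; the bounding-box format is an assumption, and "largest possible square" is interpreted here as shrinking the centered square until it fits the image:

```python
from PIL import Image

def crop_and_resize(image, bbox, out_size=128):
    """Square crop around a car bounding box, then resize to out_size x out_size.

    `image` is a PIL.Image and `bbox` is (left, top, right, bottom) in pixels
    (assumed format). The square is centered on the box with side equal to its
    largest dimension; if it would exceed the image, it is shrunk to the
    largest centered square that fits (one possible reading of the text).
    """
    w, h = image.size
    left, top, right, bottom = bbox
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    side = max(right - left, bottom - top)
    side = min(side, 2 * min(cx, w - cx), 2 * min(cy, h - cy))
    half = side / 2.0
    crop = image.crop((int(cx - half), int(cy - half), int(cx + half), int(cy + half)))
    return crop.resize((out_size, out_size), Image.BILINEAR)
```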
5. Experimental Results
We investigate the performance of our framework for NVS under object rotation. We follow the protocol defined in VIGAN [62] and Auto3D [31] and benchmark single-input NVS on the Sofa class of ShapeNet [7]. For each pair of views of an object, (I_s, I_t), we evaluate the L1 (lower is better) and SSIM [58] (higher is better) reconstruction metrics. Given an input image I_s, we first estimate the global latent vector z_s = E(I_s). Then, we generate the target view I_t = D(z_s, θ_t) using the ground-truth target pose θ_t. When computing the L1 error we use pixel values in the range [0, 255]. Since the pose estimated by our network is aligned to a learnt reference system, which may differ from the one of ShapeNet, we train a linear regressor on the training set to map the ground-truth reference system into the learnt one. We use this mapping at test time to be able to compare our method to the already published works. We report results in Table 1. Our method achieves results comparable to MV3D [54], with a lower L1 error but a slightly lower SSIM. Compared to the other works we achieve slightly worse performance; however, all competitors require multiple views of the same object to be trained, while we do not. Moreover, [54, 66, 62] also require ground-truth poses either at training or test time, while, once again, we do not.

With our method we are able to directly infer the pose of an object from a single image. Since for real datasets we are not able to estimate the pose error quantitatively, we evaluate the pose qualitatively with a simple experiment. Given two images I_z and I_θ, we infer the latent representation z from I_z and the pose θ from I_θ. Then, we can generate a new image with the appearance of I_z and the pose of I_θ by simply combining the estimated z and θ. Figure 3 shows the results of this experiment on the test set of CelebA. We employ the network after fine-tuning for this experiment. Notably, our method is able to estimate with high accuracy even challenging poses, such as faces seen from the side, which are underrepresented in the training set.
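The evaluation protocol described above can be sketched as follows; the linear pose mapping is fitted with an off-the-shelf regressor, images are assumed in [0, 1], and the L1 error is rescaled to the [0, 255] convention:

```python
import tensorflow as tf
from sklearn.linear_model import LinearRegression

def fit_pose_mapping(gt_poses, learnt_poses):
    """Linear map from the dataset's ground-truth poses to the network's learnt
    reference system, fitted on the training set (both arrays of shape (n, d))."""
    return LinearRegression().fit(gt_poses, learnt_poses)

def evaluate_pair(encoder, decoder, img_src, img_target, gt_target_pose, pose_map):
    """L1 (pixels rescaled to [0, 255]) and SSIM for one (source, target) test pair."""
    z_s, _ = encoder(img_src)
    theta_t = tf.constant(pose_map.predict(gt_target_pose), dtype=tf.float32)
    pred = decoder([z_s, theta_t])
    l1 = float(tf.reduce_mean(tf.abs(img_target - pred))) * 255.0
    ssim = float(tf.reduce_mean(tf.image.ssim(img_target, pred, max_val=1.0)))
    return l1, ssim
```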
Figure 4. Qualitative comparison between the results of stage 2 (first and third rows) and fine-tuning (second and fourth rows) on the CelebA and Real Cars datasets. Reconstructed images are framed.
Figure 5. Qualitative comparison between different strategies for fitting a test image. From top to bottom: fitting HoloGAN by [10], fitting D after stage 1 by [10], our fine-tuning. From left to right: input, then azimuth rotations with respect to the input of -25°, 0° (reconstruction), and 25°.

Fine-Tuning.
In Figure 4 we show the effect of the third fine-tuning stage on the CelebA and Real Cars datasets. We compare the results after stage 2 (first and third rows) against the results after fine-tuning (second and fourth rows). After stage 2, we are already able to produce images that resemble the original one and that can be rotated meaningfully. However, there is an identity gap between the input and the reconstructed image, which gets filled by the fine-tuning procedure. For instance, we notice that the eye color, skin tone, and face details of the input are slightly different without fine-tuning (first row), while they are recovered after fine-tuning (second row). Notably, we achieve better performance by updating both E and D during fine-tuning, differently from what is typically done when inverting generative models [2, 3, 10, 67].

Fitting the Latent Space.
Recent work on inverting generative models [2, 3, 10, 67] for image manipulation has shown visually appealing results by directly optimizing a latent representation z that, once fed to the network, allows the reconstruction of the desired image. To do so, they freeze the parameters of the generator network and use back-propagation to explore the generative latent space until a suitable z that reconstructs the input image is found. Here we compare the results achieved by our method with those obtained by directly fitting the latent space of a generative model. In Figure 5, we show a qualitative comparison between the fitting procedure (first and second rows) and the same image processed by our method after fine-tuning, for models trained on CelebA. For the fitting procedure, we optimize z and the pose simultaneously (i.e., the inputs of D) by back-propagation, using a sum of the L1, L_ssim, and L_vgg reconstruction losses as supervision. We try to fit the latent space of a HoloGAN trained following the protocol reported in the original paper (first row) or the latent space of our D after stage 1 (second row). The latter case has extra regularization coming from E. We use the same data pre-processing as in the other stages, learning rate 0.1 and the Adam optimizer. We notice that our method recovers fine details such as hair, eyes and skin color well, while still being able to rotate the object meaningfully. Notably, our method requires only 100 steps of fine-tuning compared to the 300 steps of the fitting procedure, and it takes only 3 seconds on an NVIDIA 1080 Ti. We also tried to alternate the fitting of z and the pose to avoid ambiguities in the optimization process, achieving comparable results. Our intuition is that methods that directly fit the latent space by back-propagation assume a generative model so powerful (e.g., StyleGAN [25]) that its latent space is able to represent the full natural image manifold. However, this is not always true, thus they may not achieve good results without a strong generative model or when applied to out-of-training-distribution images. Moreover, 3D-aware generative models have the additional difficulty of inverting not only z, but also the pose. For all the previous reasons, they lead to poor results in our scenario.

We conduct an ablation study to show the impact of every component of our framework. For these experiments we train our framework on ShapeNet Cars for 50 epochs in Stage 1 and 10 epochs in Stage 2, and we evaluate on the test set using the L1 and SSIM metrics as done in subsection 5.1. We report the quantitative and qualitative results in Table 2 and Figure 6 respectively. In Figure 6, from left to right, we show the reconstruction using the predicted θ from E and three azimuth rotations of 36°, 108° and 288° respectively. In the first row, we train our framework as a simple auto-encoder with disentangled pose in the latent space. As shown in the picture, although the reconstruction results are good, we cannot meaningfully rotate the object. Even if our decoder explicitly rotates 3D features with a rigid geometric transformation as done in [42], without multi-view information we are not able to produce meaningful 3D features. In the second row we show the results of the auto-encoder after pre-training with stage 1. Even though the objects can rotate correctly, the car appearance is different and the pose in the reconstruction is wrong. In the third row we add the stage 1 pre-training before rebuilding the auto-encoder.
Figure 6. Ablation qualitative results on ShapeNet Cars. The left-most column reports the experiment ids corresponding to Table 2.

Unfortunately, the model is still not able to rotate the object reasonably. Nevertheless, every pose generates a realistic car, even if with the wrong orientation. We argue that, during the second stage, the network forgets the concept of the 3D representation learnt during the first stage. To prevent forgetting, we propose to self-distill the knowledge of the generative model learnt during stage 1. Thus, in the fourth row, we also reconstruct the samples generated by D_f and we use the consistency losses L_z and L_θ. We can see that we can generate novel views of the object, but they are not consistent. Thus, to improve the 3D representations, in the fifth row we add the multi-view supervision on the generated samples, which can be obtained for free. Notably, we are now able to rotate the object, showing that this is the crucial step for learning 3D features. To further improve the quality of the generated samples, we add an adversarial loss similar to [22] on the real samples. However, the instability of the adversarial training makes the training prone to collapsing (row 6). Finally, we notice that with the consistency losses L_z and L_θ (row 7) we can stabilize the adversarial training. From our experiments, the adversarial loss is fundamental on real datasets to achieve high-definition results, therefore we keep it in our formulation, even if it does not lead to major improvements on synthetic datasets.

6. Method Limitations

As mentioned before, we use HoloGAN as the architecture for our D. However, HoloGAN sometimes fails to synthesize 3D-consistent novel views. During our experiments, we noticed that in such cases we inherit the same artifacts during stage 2. For instance, when training on Real Cars, HoloGAN is not able to rotate the object by 360° around the azimuth, learning incomplete representations that allow only partial rotations. In such a case, our stage 2 learns the same incomplete rotations while distilling the generative knowledge. Nevertheless, we argue that, with the advent of new generative models with stronger 3D consistency, our method will also become more effective in dealing with these specific circumstances.
7. Conclusions
In this paper we have presented a novel method for NVS from a single view. Our method is the first of its kind that can be trained without requiring any form of annotation or pairing in the training set, the only requirement being a collection of images of objects belonging to the same macro-category. Our method achieves performance comparable to other alternatives on synthetic datasets, while also working on real datasets like CelebA or Real Cars, where no ground truth is available and the competitors cannot be trained. We tested our framework with a HoloGAN-inspired decoder, though in the future we plan to extend our experiments to different 3D-aware generative models like those based on the recently proposed neural radiance fields [48].
References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[3] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[4] Hassan Abu Alhaija, Siva Karthik Mustikovela, Andreas Geiger, and Carsten Rother. Geometric image synthesis. In Computer Vision – ACCV 2018, pages 85–100. Springer International Publishing, 2019.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 214–223. JMLR.org, 2017.
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
[7] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
[8] Xu Chen, Jie Song, and Otmar Hilliges. Monocular neural image based rendering with continuous view control. In Proceedings of the IEEE International Conference on Computer Vision, pages 4090–4100, 2019.
[9] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[10] A. Creswell and A. A. Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2019.
[11] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 11–20, 1996.
[12] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[13] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
[14] Andrew Fitzgibbon, Yonatan Wexler, and Andrew Zisserman. Image-based rendering using image-based priors. International Journal of Computer Vision, 63(2):141–151, 2005.
[15] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[16] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5524, 2016.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pages 5767–5777. Curran Associates, Inc., 2017.
[19] Nicolai Häni, Selim Engin, Jun-Jee Chao, and Volkan Isler. Continuous object representation networks: Novel view synthesis without target view supervision. In Proc. NeurIPS, 2020.
[20] Paul Henderson and Vittorio Ferrari. Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, pages 1–20, 2019.
[21] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato's cave: 3D shape from adversarial rendering. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[24] Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. Deep view morphing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2155–2163, 2017.
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[28] Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. CONFIG: Controllable neural face image generation. In European Conference on Computer Vision (ECCV), 2020.
[29] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learning of generative models for 3D controllable image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[30] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3D object reconstruction. In AAAI, 2018.
[31] Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, and Jane You. Auto3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[32] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
[33] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):65, 2019.
[34] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.
[35] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
[37] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[38] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. BlockGAN: Learning 3D object-aware scene representations from unlabelled images. arXiv, 2020.
[39] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[40] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[41] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
[42] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, and Linjie Luo. Transformable bottleneck networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 7648–7657, 2019.
[43] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3500–3509, 2017.
[44] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[45] Jhony K. Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. Image2Mesh: A learning framework for single image 3D reconstruction. In Asian Conference on Computer Vision, pages 365–381. Springer, 2018.
[46] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations (ICLR), 2016.
[47] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[48] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. In European Conference on Computer Vision (ECCV), 2020.
[49] Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, volume 1, pages 519–528. IEEE, 2006.
[50] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[51] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. DeepVoxels: Learning persistent 3D feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2437–2446, 2019.
[52] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pages 1121–1132, 2019.
[53] Shao-Hua Sun, Minyoung Huh, Yuan-Hong Liao, Ning Zhang, and Joseph J. Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 155–171, 2018.
[54] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
[55] Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, and Dimitris N. Metaxas. CR-GAN: Learning complete representations for multi-view generation. arXiv, 2018.
[56] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1415–1424, 2017.
[57] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In Computer Vision – ECCV 2016. Springer International Publishing, 2016.
[58] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[59] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Interpretable transformations with encoder-decoder networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5726–5735, 2017.
[60] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–10, 2020.
[61] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In European Conference on Computer Vision, pages 842–857. Springer, 2016.
[62] Xiaogang Xu, Ying-Cong Chen, and Jiaya Jia. View independent generative adversarial network for novel view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pages 7791–7800, 2019.
[63] Jimei Yang, Scott E. Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
[64] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
[65] Mingyu Yin, Li Sun, and Qingli Li. Novel view synthesis on unpaired data by conditional deformable variational auto-encoder. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[66] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View synthesis by appearance flow. In European Conference on Computer Vision, 2016.
[67] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In Computer Vision – ECCV 2016. Springer International Publishing, 2016.
[68] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum, and Bill Freeman. Visual object networks: Image generation with disentangled 3D representations. In Advances in Neural Information Processing Systems 31, pages 118–129. Curran Associates, Inc., 2018.
[69] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, pages 217–225, 2014.