BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images
Thu Nguyen-Phuoc
University of Bath
Christian Richardt
University of Bath
Long Mai
Adobe Research
Yong-Liang Yang
University of Bath
Niloy Mitra
Adobe Research & UCL
Abstract
We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Current work on scene representation learning either ignores scene background or treats the whole scene as one object. Meanwhile, work that considers scene compositionality treats scene objects only as image patches or 2D layers with alpha maps. Inspired by the computer graphics pipeline, we design BlockGAN to learn to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. This allows BlockGAN to reason over occlusion and interaction between objects' appearance, such as shadow and lighting, and provides control over each object's 3D pose and identity, while maintaining image realism. BlockGAN is trained end-to-end, using only unlabelled single images, without the need for 3D geometry, pose labels, object masks, or multiple views of the same scene. Our experiments show that using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects (foreground and background) and their properties (pose and identity). Our code is available at https://github.com/thunguyenphuoc/BlockGAN.
1 Introduction

The computer graphics pipeline has achieved impressive results in generating high-quality images, while offering users a great level of freedom and controllability over the generated images. This has many applications in creating and editing content for the creative industries, such as films, games, scientific visualisation, and more recently, in generating training data for computer vision tasks. However, the current pipeline, ranging from generating 3D geometry and textures, rendering, compositing and image post-processing, can be very expensive in terms of labour, time, and costs. Recent image generative models, in particular generative adversarial networks [GANs; 16], have greatly improved the visual fidelity and resolution of generated images [7, 25, 26]. Conditional GANs [39] allow users to manipulate images, but require labels during training. Recent work on unsupervised disentangled representations using GANs [11, 26, 41] relaxes this need for labels. The ability to produce high-quality, controllable images has made GANs an increasingly attractive alternative to the traditional graphics pipeline for content generation. However, most work focuses on property disentanglement, such as shape, pose and appearance, without considering the compositionality of images, i.e., scenes being made up of multiple objects. Therefore, they do not offer control over individual objects in a way that respects the interaction of objects, such as consistent lighting and shadows. This is a major limitation of current image generative models, compared to the graphics pipeline, where 3D objects are modelled individually in terms of geometry and appearance, and combined into 3D scenes with consistent lighting.

Even when considering object compositionality, most approaches treat objects as 2D layers combined using alpha compositing [14, 53, 56]. Moreover, they also assume that each object's appearance is independent [4, 8, 14]. While this layering approach has led to good results in terms of object separation and visual fidelity, it is fundamentally limited by the choice of 2D representation. Firstly, it is hard to manipulate properties that require 3D understanding, such as pose or perspective. Secondly, object layers tend to bake in appearance and cannot adequately represent view-specific appearance, such as shadows or material highlights changing as objects move around in the scene. Finally, it is non-trivial to model the appearance interactions between objects, such as scene lighting that affects objects' shadows on a background.

We introduce BlockGAN, a generative adversarial network that learns 3D object-oriented scene representations directly from unlabelled 2D images. Instead of learning 2D layers of objects and combining them with alpha compositing, BlockGAN learns to generate 3D object features and to combine them into deep 3D scene features that are projected and rendered as 2D images. This process closely resembles the computer graphics pipeline where scenes are modelled in 3D, enabling reasoning over occlusion and interaction between object appearance, such as shadows or highlights. During test time, each object's pose can be manipulated using 3D transforms directly applied to the object's deep 3D features. We can also add new objects and remove existing objects in the generated image by changing the number of 3D object features in the 3D scene features at inference time.
This shows that BlockGAN has learnt a non-trivial representation of objects and their interaction, instead of merely memorizing images. BlockGAN is trained end-to-end in an unsupervised manner directly from unlabelled 2D images, without any multi-view images, paired images, pose labels, or 3D shapes. We experiment with BlockGAN on a variety of synthetic and natural image datasets. In summary, our main contributions are:

• BlockGAN, an unsupervised image generative model that learns an object-aware 3D scene representation directly from unlabelled 2D images, disentangling both between objects and individual object properties (pose and identity);
• showing that BlockGAN can learn to separate objects even from cluttered backgrounds; and
• demonstrating that BlockGAN's object features can be added, removed and manipulated to create novel scenes that are not observed during training.
2 Related work

GANs. Unsupervised GANs learn to map samples from a latent distribution to data categorised as real by a discriminator network. Conditional GANs enable control over the generated image content, but require labels during training. Recent work on unsupervised disentangled representation learning using GANs provides controllability over the final images without the need for labels. Loss functions can be designed to maximize mutual information between generated images and latent variables [11, 22]. However, these models do not guarantee which factors can be learnt, and have limited success when applied to natural images. Network architectures can play a vital role in both improving training stability [9] and controllability of generated images [26, 41]. We also focus on designing an appropriate architecture to learn object-level disentangled representations. We show that injecting inductive biases about how the 3D world is composed of 3D objects enables BlockGAN to learn 3D object-aware scene representations directly from 2D images, thus providing control over both 3D pose and appearance of individual objects.
Introducing 3D structures into neural networks can improve the quality [40, 44, 47, 51] and controllability of the image generation process [41, 42, 62]. This can be achieved with explicit 3D representations, like appearance flow [61], occupancy voxel grids [46, 62], meshes, or shape templates [30, 49, 59], in conjunction with handcrafted differentiable renderers [10, 19, 34, 36]. Renderable deep 3D representations can also be learnt directly from images [41, 50, 51]. HoloGAN [41] further shows that adding inductive biases about the 3D structure of the world enables unsupervised disentangled feature learning between shape, appearance and pose. However, these learnt representations are either object-centric (i.e., no background), or treat the whole scene as one object. Thus, they do not consider scene compositionality, i.e., components that can move independently. BlockGAN, in contrast, is designed to learn object-aware 3D representations that are combined into a unified 3D scene representation.
Object-aware image synthesis.
Recent methods decompose image synthesis into generating components like layers or image patches, and combining them into the final image [31, 53, 56]. This includes conditional GANs that use segmentation masks [43, 52], scene graphs [24], object labels, key points or bounding boxes [20, 45], which have shown impressive results for natural image datasets. Recently, unsupervised methods [3, 14, 15, 29, 53, 58] learned object disentanglement for multi-object scenes on simpler synthetic datasets (single-colour objects, simple lighting, and material). Other approaches successfully separate foreground from background objects in natural images, but make strong assumptions about the size of objects [56] or independent object appearance [4, 8]. These methods treat object components as image patches or 2D layers with corresponding masks, which are combined via alpha compositing at the pixel level to generate the final image. The work closest to ours learns to generate multiple 3D primitives (cuboids, spheres and point clouds), renders them into separate 2D layers with a handcrafted differentiable renderer, and alpha-composes them based on their depth ordering to create the final image [32]. Despite the explicit 3D geometry, this method does not handle cluttered backgrounds and requires extra supervision in the shape of labelled images with and without foreground objects. BlockGAN takes a different approach. We treat objects as learnt 3D features with corresponding 3D poses, and learn to combine them into 3D scene features. Not only does this provide control over 3D pose, but it also enables learning of realistic lighting and shadows. Our approach allows adding more objects into the 3D scene features to generate images with multiple objects, which are not observed at training time.

Figure 1: BlockGAN's generator network. Each noise vector z_i is mapped to deep 3D object features, which are transformed to the desired 3D pose θ_i. Object features are combined into 3D scene features, where the camera pose θ_cam is applied, before projection to 2D features that produce the image x.
3 Method

Inspired by the computer graphics pipeline, we assume that each image x is a rendered 2D image of a 3D scene composed of K 3D foreground objects {O_1, ..., O_K} in addition to the background O_0:

x = p( f( O_0, O_1, ..., O_K ) ),    (1)

where O_0 is the background, O_1, ..., O_K are the foreground objects, and the function f combines multiple objects into unified scene features that are projected to the image x by p. We assume each object O_i is defined in a canonical orientation and generated from a noise vector z_i by a function g_i before being individually posed using parameters θ_i: O_i = g_i(z_i, θ_i).

We inject the inductive bias of compositionality of the 3D world into BlockGAN in two ways. (1) The generator is designed to first generate 3D features for each object independently, before transforming and combining them into unified scene features, in which objects interact. (2) Unlike other methods that use 2D image patches or layers to represent objects, BlockGAN directly learns from unlabelled images how to generate objects as 3D features. This allows our model to disentangle the scene into separate 3D objects and allows the generator to reason over 3D space, enabling object pose manipulation and appearance interaction between objects. BlockGAN, therefore, learns to both generate and render the scene features into images that can fool the discriminator.

Figure 1 illustrates the BlockGAN generator architecture. Each noise vector z_i is mapped to 3D object features O_i. Objects are then transformed according to their pose θ_i using a 3D similarity transform, before being combined into 3D scene features using the scene composer f. The scene features are transformed into the camera coordinate system before being projected to 2D features to render the final images using the camera projector function p. During training, we randomly sample both the noise vectors z_i and poses θ_i. During test time, objects can be generated with a given identity z_i in the desired pose θ_i.

BlockGAN is trained end-to-end using only unlabelled 2D images, without the need for any labels, such as poses, 3D shapes, multi-view inputs, masks, or geometry priors like shape templates, symmetry or smoothness terms. We next explain each component of the generator in more detail.

Each object O_i ∈ R^{H_o × W_o × D_o × C_o} is a deep 3D feature grid generated by O_i = g_i(z_i, θ_i), where g_i is an object generator that takes as input a noise vector z_i controlling the object appearance, and the object's 3D pose θ_i = (s_i, R_i, t_i), which comprises its uniform scale s_i ∈ R, rotation R_i ∈ SO(3) and translation t_i ∈ R^3. The object generator g_i is specific to each category of objects, and is shared between objects of the same category. We assume that 3D scenes consist of at least two objects: the background O_0 and one or more foreground objects {O_1, ..., O_K}. This is different to object-centric methods that only assume a single object with a simple white background [50], or only deal with static scenes whose object components cannot move independently [41]. We show that, even when BlockGAN is trained with only one foreground and one background object, we can add an arbitrary number of foreground objects to the scene at test time.
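As a rough, hedged sketch of this factorisation (not the authors' implementation), the composition in Eq. (1) can be written as nested function calls; the grid size, channel count and the stub bodies of g, f and p below are purely illustrative assumptions.

```python
import numpy as np

# Shapes are illustrative assumptions: 16^3 feature grids with 64 channels.
H, W, D, C = 16, 16, 16, 64

def g(z, theta):
    """Stub object generator g_i: noise vector + pose -> posed 3D feature grid O_i."""
    rng = np.random.default_rng(int(np.abs(z).sum() * 1e6) % (2**32))
    return rng.standard_normal((H, W, D, C)).astype(np.float32)

def f(*objects):
    """Scene composer: element-wise maximum over all object feature grids."""
    return np.maximum.reduce(objects)

def p(scene):
    """Stub camera/projection: collapse the depth axis to a 2D feature map."""
    return scene.max(axis=2)                      # (H, W, C)

K = 2                                             # number of foreground objects
zs = [np.random.randn(60) for _ in range(K + 1)]  # z_0 (background), z_1, ..., z_K
thetas = [None] * (K + 1)                         # poses theta_i (ignored by the stub)
x = p(f(*[g(z, th) for z, th in zip(zs, thetas)]))
print(x.shape)                                    # (16, 16, 64)
```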
Figure 2: BlockGAN's object generator. Each object starts with a constant tensor that is learnt with the rest of the network.

To generate 3D object features, BlockGAN implements the style-based strategy, which helps to disentangle between pose and identity [41] while improving training stability [26]. As illustrated in Figure 2, the noise vector z_i is mapped to affine parameters – the "style controller" – for adaptive instance normalization [AdaIN; 21] after each 3D convolution layer. However, unlike HoloGAN [41], which learns 3D features directly for the whole scene, BlockGAN learns 3D features for each object, which are then transformed to their target poses using similarity transforms, and combined into 3D scene features. We implement these 3D similarity transforms by trilinear resampling of the 3D features according to the translation, rotation and scale parameters θ_i; samples falling outside the feature tensor are clamped to zero. This allows BlockGAN to not only separate object pose from identity, but also to disentangle multiple objects in the same scene.

We combine the 3D object features {O_i} into scene features S = f(O_0, O_1, ..., O_K) ∈ R^{H_s × W_s × D_s × C_s} using a scene composer function f. For this, we use the element-wise maximum as it achieves the best image quality compared to element-wise summation and a multi-layer perceptron (MLP); please see our supplemental document for an ablation. Additionally, the maximum is invariant to permutation and allows a flexible number of input objects, so that new objects can be added into the scene features during test time, even when trained with fewer objects (see Section 4.3).

Instead of using a handcrafted differentiable renderer, we aim to learn rendering directly from unlabelled images. HoloGAN showed that this approach is more expressive as it is capable of handling unlabelled, natural image data. However, their projection model is limited to a weak perspective, which does not support foreshortening – an effect that is observed when objects are close to real (perspective) cameras. We therefore introduce a graphics-based perspective projection function p: R^{H_s × W_s × D_s × C_s} → R^{H_c × W_c × C_c} that transforms the 3D scene features into camera space using a projective transform, and then learns the projection of the 3D features to a 2D feature map.
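To make the object generator and the pose transform above more concrete, here is a hedged PyTorch sketch under our own assumptions: the layer count, channel sizes, the AdaIN mapping and the use of affine_grid/grid_sample for trilinear resampling are illustrative stand-ins, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN3d(nn.Module):
    """Adaptive instance norm: per-channel scale/shift predicted from the noise vector z."""
    def __init__(self, z_dim, channels):
        super().__init__()
        self.affine = nn.Linear(z_dim, 2 * channels)   # the "style controller"

    def forward(self, feat, z):                        # feat: (N, C, D, H, W)
        normed = F.instance_norm(feat)
        gamma, beta = self.affine(z).chunk(2, dim=1)   # (N, C) each
        gamma = gamma.reshape(*gamma.shape, 1, 1, 1)
        beta = beta.reshape(*beta.shape, 1, 1, 1)
        return gamma * normed + beta

class ObjectGenerator(nn.Module):
    """Learnt constant tensor followed by 3D convolutions, each modulated by AdaIN(z)."""
    def __init__(self, z_dim=60, channels=64, size=16):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, channels, size, size, size))
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1)
        self.ada1 = AdaIN3d(z_dim, channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1)
        self.ada2 = AdaIN3d(z_dim, channels)

    def forward(self, z):
        x = self.const.expand(z.shape[0], -1, -1, -1, -1)
        x = F.relu(self.ada1(self.conv1(x), z))
        x = F.relu(self.ada2(self.conv2(x), z))
        return x                                       # canonical-pose object features O_i

def pose_object(feat, theta_inv):
    """Trilinearly resample O_i into its pose; samples outside the grid become zero.

    theta_inv: (N, 3, 4) matrix of the *inverse* similarity transform
    (scale, rotation, translation), as required by grid_sample.
    """
    grid = F.affine_grid(theta_inv, list(feat.shape), align_corners=False)
    return F.grid_sample(feat, grid, mode='bilinear',
                         padding_mode='zeros', align_corners=False)

# Scene composer f: element-wise maximum over background + foreground objects.
gen = ObjectGenerator()
identity = torch.eye(3, 4).unsqueeze(0)
objects = [pose_object(gen(torch.randn(1, 60)), identity) for _ in range(3)]
scene = torch.stack(objects, dim=0).max(dim=0).values   # (1, C, D, H, W) scene features
```

Note that grid_sample resamples with the inverse of the forward pose transform, and its zero padding plays the role of clamping samples that fall outside the feature tensor.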
Figure 3: Left: The camera's viewing volume (frustum) overlaid on scene-space features; we trilinearly resample the scene features based on the viewing volume at the orange dots. Right: The resulting camera-space features before projection to 2D.

The computer graphics pipeline implements perspective projection using a projective transform that converts objects from world coordinates (our scene space) to camera coordinates [37]. We implement this camera transform like the similarity transforms used to manipulate objects in Section 3.1, by resampling the 3D scene features according to the viewing volume (frustum) of the virtual perspective camera (see Figure 3). For correct perspective projection, this transform must be a projective transform, the superset of similarity transforms [55]. Specifically, the viewing frustum, in scene space, can be defined relative to the camera's pose θ_cam using the angle of view, and the distance of the near and far planes. The camera-space features are a new 3D tensor of features, of size H_c × W_c × D_c × C_s, whose corners are mapped to the corners of the camera's viewing frustum using the unique projective 3D transform computed from the coordinates of corresponding corners using the direct linear transform [18].

In practice, we combine the object and camera transforms into a single transform by multiplying both transform matrices and resampling the object features in a single step, directly from object to camera space. This is computationally more efficient than resampling twice, and advantageous from a sampling theory point of view, as the features are only interpolated once, not twice, and thus less information is lost by the resampling. The combined transform is a fixed, differentiable function with parameters (θ_i, θ_cam). The individual objects are then combined in camera space before the final projection.

After the camera transform, the 3D features are projected into view-specific 2D feature maps using the learnt camera projection p': R^{H_c × W_c × D_c × C_s} → R^{H_c × W_c × C_c}. This function ensures that occlusion correctly shows nearby objects in front of distant objects. Following the RenderNet projection unit [40], we reshape the 3D camera-space features (with depth D_c and C_s channels) into a 2D feature map with (D_c · C_s) channels, followed by a per-pixel MLP (i.e., 1×1 convolutions) with C_c output channels. We choose to use this learnt renderer following HoloGAN, which shows the effectiveness of the renderer in learning powerful 3D representations directly from unlabelled images. This is different from the supervised multi-view setting with pose labels in the renderer of DeepVoxels [50], which learns occlusion values, or Neural Volumes [35] and NeRF [38], which learn explicit density values.
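The learnt projection just described can be sketched as follows; this is an illustrative approximation, and the activation choice and layer count of the per-pixel MLP are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ProjectionUnit(nn.Module):
    """RenderNet-style learnt projection: fold depth into channels, then a per-pixel MLP."""
    def __init__(self, depth, in_channels, out_channels):
        super().__init__()
        # 1x1 convolutions act as an MLP applied independently at every pixel.
        self.per_pixel_mlp = nn.Sequential(
            nn.Conv2d(depth * in_channels, out_channels, kernel_size=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
        )

    def forward(self, cam_feat):
        # cam_feat: (N, C_s, D_c, H_c, W_c) camera-space features.
        n, c, d, h, w = cam_feat.shape
        flat = cam_feat.reshape(n, c * d, h, w)   # depth folded into the channel axis
        return self.per_pixel_mlp(flat)           # (N, C_c, H_c, W_c) 2D feature map

proj = ProjectionUnit(depth=16, in_channels=64, out_channels=64)
print(proj(torch.randn(2, 64, 16, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```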
We train BlockGAN adversarially using the non-saturating GAN loss [16]. For natural images with cluttered backgrounds, we also add a style discriminator loss [41]. In addition to classifying the images as real or fake, this discriminator also looks at images at the feature level. Given image features Φ_l at layer l, the style discriminator classifies the mean µ(Φ_l) and standard deviation σ(Φ_l) over the spatial dimensions, which describe the image "style" [21]. This more powerful discriminator discourages the foreground generator from including parts of the background within the foreground object(s). We provide detailed network and loss definitions in the supplemental material.
4 Experiments

Datasets. We train BlockGAN on images at 64 × 64 pixels, with increasing complexity in terms of the number of foreground objects (1–4) and texture (from synthetic images with simple shapes and simple texture to natural images with complex texture and cluttered backgrounds). These datasets include the synthetic CLEVRn [23], SYNTH-CARn and SYNTH-CHAIRn, and the real REAL-CAR [57], where n is the number of foreground objects. Additional details and results are included in the supplementary material.
Implementation details. We assume a fixed and known number of objects of the same type. Fore- and background generators have similar architectures and the same number of output channels, but foreground generators have twice as many channels in the learnt constant tensor. Since foreground objects are smaller than the background, we set the scale to 1 for the background object, and randomly sample smaller scales for the foreground objects (the per-dataset ranges are listed in the supplemental material).

Despite being trained with only unlabelled images, Figure 4 shows that BlockGAN learns to disentangle different objects within a scene: foreground from background, and between multiple foreground objects. More importantly, BlockGAN also provides explicit control and enables smooth manipulation of each object's pose θ_i and identity z_i. Figure 6 shows results on natural images with a cluttered background, where BlockGAN is still able to separate objects and enables 3D object-centric modifications. Since BlockGAN combines deep object features into scene features, changes in an object's properties also influence its shadows, and highlights adapt to the object's movement. These effects can be better observed in the supplementary animations.

Figure 4: BlockGAN enables explicit spatial manipulation of individual objects (rotation, translation) and changing the identity of background or foreground objects across different datasets: (a) SYNTH-CAR1, (b) SYNTH-CHAIR1, (c) SYNTH-CAR2, (d) SYNTH-CAR3, (e) CLEVR2 and (f) CLEVR3. Notice how the shadows and highlights change as objects move around in the scene, and how changing the background lighting affects the appearance of foreground objects. Figure 6 shows similar results on natural images. Please refer to the supplemental material for animated results.

We evaluate the visual fidelity of BlockGAN's results using Kernel Inception Distance [KID; 5], which has an unbiased estimator and works even for a small number of images. Note that KID does not measure the quality of object disentanglement, which is the main contribution of BlockGAN. We first compare with a vanilla GAN [WGAN-GP; 17] using a publicly available implementation (https://github.com/LynnHo/DCGAN-LSGAN-WGAN-WGAN-GP-Tensorflow). Secondly, we compare with LR-GAN [56], a 2D-based method that learns to generate image background and foregrounds separately and recursively. Finally, we compare with HoloGAN, which learns 3D scene representations that separate camera pose and identity, but does not consider object disentanglement. For LR-GAN and HoloGAN, we use the authors' code. We tune hyperparameters and then compute the KID for 10,000 images generated by each model (samples by all methods are included in the supplementary material). Table 1 shows that BlockGAN generates images with competitive or better visual fidelity than other methods.

Table 1: KID estimates (mean ± std), lower is better, between real images and images generated by BlockGAN and other GANs (WGAN-GP [17], LR-GAN [56] and HoloGAN [41]) at 64 × 64 resolution on SYNTH-CAR, SYNTH-CHAIR, REAL-CAR and CLEVR2. BlockGAN achieves competitive KID scores while providing control of each object in the generated images (which is not measured by KID).

We show that at test time, 3D object features learnt by BlockGAN can be realistically manipulated in ways that have not been observed during training time. First, the learnt 3D object features can be reused to add more objects to the scene at test time, thanks to the compositionality inductive bias and our choice of scene composer function: we use BlockGAN trained on datasets with only one foreground object and one background, and show that more foreground objects of the same category can be added to the same scene at test time. Figure 6 shows that 2–4 new objects are added and manipulated just like the original objects while maintaining realistic shadows and highlights.
Figure 5: Removing/adding objects. The red box shows the original scene.

In Figure 5, we use BlockGAN trained on CLEVR4 and then remove (top) and add (bottom) more objects to the scene. Note how BlockGAN generates realistic shadows and occlusion for scenes that the model has never seen before.

Secondly, we apply spatial manipulations that were not part of the similarity transform used during training, such as horizontal stretching, or slicing and combining different foreground objects. Figure 6 shows that object features can be geometrically modified intuitively, without needing explicit 3D geometry or multi-view supervision during training.
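Because the scene composer is an element-wise maximum, adding or removing objects at test time only changes how many posed feature grids are reduced; a minimal hedged sketch (shapes and the compose helper are illustrative):

```python
import torch

def compose(object_feats):
    """Element-wise max over any number of posed object feature grids (incl. background)."""
    return torch.stack(object_feats, dim=0).max(dim=0).values

C, D, H, W = 64, 16, 16, 16
background = torch.randn(1, C, D, H, W)
foregrounds = [torch.randn(1, C, D, H, W) for _ in range(3)]  # trained with 1, used with 3

scene_small = compose([background, foregrounds[0]])            # as seen during training
scene_large = compose([background] + foregrounds)              # extra objects at test time
assert scene_small.shape == scene_large.shape                  # scene size is unchanged
```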
LR-GAN [56] first generates a 2D background layer, and then generates and combines foreground layers with the generated background using alpha-compositing. Both BlockGAN and LR-GAN show the importance of combining objects in a contextually relevant manner to generate visually realistic images (see Table 1). However, LR-GAN does not offer explicit control over object location. More importantly, LR-GAN learns an entangled representation of the scene: sampling a different background noise vector also changes the foreground (Figure 7). Finally, unlike BlockGAN, LR-GAN does not allow adding more foreground objects during test time. This demonstrates the benefits of learning disentangled object features compared to a 2D-based approach.
Figure 6: Left: REAL-CAR. Even for natural images with cluttered backgrounds, BlockGAN can still disentangle objects in a scene well. Note that interpolating the background (z_0) affects the appearance of the car in a meaningful way, showing the benefit of 3D scene features. Right: Test-time geometric modification of the learnt 3D object features (unless stated, the background is fixed): stretching (top), splitting and combining (middle), and adding and manipulating new objects after training (bottom). The bottom row shows (i) the original scene, (ii) a new object added, (iii) manipulated, (iv, v) different background appearance, and (vi) more objects added. Note the realistic lighting and shadows.

Figure 7: Comparison between (i) LR-GAN [56] and (ii) BlockGAN for SYNTH-CAR (left) and REAL-CAR (right).

Figure 8: Different manipulations applied to BlockGAN trained on (a) a dataset with imbalanced rotations, and (b) a balanced dataset.

For the natural REAL-CAR dataset, we observe that BlockGAN has difficulties learning the full 360° rotation of the car, even though fore- and background are disentangled well. We hypothesise that this is due to the mismatch between the true (unknown) pose distribution of the car, and the uniform pose distribution we assume during training. To test this, we create a synthetic dataset similar to SYNTH-CAR1, but with an imbalanced, non-uniform pose distribution. To generate the imbalanced rotation dataset, we sample the rotation uniformly from the front/left/back/right viewing directions ±15°. In other words, the car is only seen from the front/left/back/right 30°, respectively, and there are four evenly spaced gaps of 60° that are never observed, for example views from the front-right. With the imbalanced dataset, Figure 8 (bottom) shows correct disentangling of foreground and background. However, rotation of the car only produces images with (near-)frontal views (top), while depth translation results in cars that are randomly rotated sideways (middle). We observe similar behaviour for the natural REAL-CAR dataset. This suggests that learning object disentanglement and full 3D pose rotation might be two independent problems. While assuming a uniform pose distribution already enables good object disentanglement, learning the pose distribution from the training data would likely improve the quality of 3D transforms.

In our supplemental material, we include comparisons to HoloGAN [41] as well as additional ablation studies comparing different scene composer functions, using a perspective camera versus a weak-perspective camera, adopting the style discriminator for scenes with cluttered backgrounds, and training on images with an incorrect number of objects.

5 Conclusion

We introduced BlockGAN, an image generative model that learns 3D object-aware scene representations from unlabelled images. We show that BlockGAN can learn a disentangled scene representation both in terms of objects and their properties, which allows geometric manipulations not observed during training. Most excitingly, even when BlockGAN is trained with fewer or even single objects, additional 3D object features can be added to the scene features at test time to create novel scenes with multiple objects. In addition to computer graphics applications, this opens up exciting possibilities, such as combining BlockGAN with models like BiGAN [12] or ALI [13] to learn powerful object representations for scene understanding and reasoning.

Future work can adopt more powerful relational learning models [28, 48, 54] to learn more complex object interactions such as inter-object shadowing or reflections. Currently, we assume prior knowledge of the object category and the number of objects for training. We also assume object poses are uniformly distributed and independent from each other. Therefore, the ability to learn this information directly from training images would allow BlockGAN to be applied to more complex datasets with a varying number of objects and different object categories, such as COCO [33] or LSUN [60].
Acknowledgments and Disclosure of Funding
We received support from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 665992, the EPSRC Centre for Doctoral Training in Digital Entertainment (EP/L016540/1), RCUK grant CAMERA (EP/M023281/1), an EPSRC-UKRI Innovation Fellowship (EP/S001050/1), and an NVIDIA Corporation GPU Grant. We received a gift from Adobe.
Broader Impact
BlockGAN is an image generative model that learns an object-oriented 3D scene representation directly from unlabelled 2D images. Our approach is a new machine learning technique that makes it possible to generate unseen images from a noise vector, with unprecedented control over the identity and pose of multiple independent objects as well as the background. In the long term, our approach could enable powerful tools for digital artists that facilitate artistic control over realistic procedurally generated digital content. However, any tool can in principle be abused, for example by adding new, manipulating or removing existing objects or people from images.

At training time, our network performs a task somewhat akin to scene understanding, as our approach learns to disentangle between multiple objects and individual object properties (specifically their pose and identity). At test time, our approach enables sampling new images with control over pose and identity for each object in the scene, but does not directly take any image input. However, it is possible to embed images into the latent space of generative models [1]. A highly realistic generative image model and a good image fit would then make it possible to approximate the input image and, more importantly, to edit the individual objects in a pictured scene. Similar to existing image editing software, this enables the creation of image manipulations that could be used for ill-intended misinformation (fake news), but also for a wide range of creative and other positive applications. We expect the benefits of positive applications to clearly outweigh the potential downsides of malicious applications.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In
ICCV , 2019.[2] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky.Neural point-based graphics. In
ECCV, 2020. [3] Titas Anciukevičius, Christoph Lampert, and Paul Henderson. Object-centric image generation with factored depths, locations, and appearances. arXiv:2004.00642, 2020. [4] Adam Bielski and Paolo Favaro. Emergence of object segmentation in perturbed generative models. In
NeurIPS , 2019.[5] Mikołaj Bi´nkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. DemystifyingMMD GANs. In
ICLR , 2018.[6] Blender Online Community.
Blender – a 3D modelling and rendering package. Blender Foundation, 2020. [7] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In
ICLR , 2019.[8] Mickaël Chen, Thierry Artières, and Ludovic Denoyer. Unsupervised object segmentation byredrawing. In
NeurIPS , 2019.[9] Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generativeadversarial networks. In
ICLR , 2019.[10] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and SanjaFidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. In
NeurIPS , 2019.[11] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN:Interpretable representation learning by information maximizing generative adversarial nets. In
NIPS , 2016.[12] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In
ICLR ,2017.[13] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropi-etro, and Aaron C. Courville. Adversarially learned inference. In
ICLR , 2017.[14] Martin Engelcke, Adam R. Kosiorek, Oiwi Parker Jones, and Ingmar Posner. GENESIS: Genera-tive scene inference and sampling with object-centric latent representations. In
ICLR , 2020.[15] S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, KorayKavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding withgenerative models. In
NIPS , 2016.[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
NIPS , 2014.[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville.Improved training of Wasserstein GANs. In
NIPS , 2017.[18] Richard Hartley and Andrew Zisserman.
Multiple View Geometry in Computer Vision . CambridgeUniversity Press, 2004. ISBN 0521540518. doi: 10.1017/CBO9780511811685.[19] Philipp Henzler, Niloy J. Mitra, and Tobias Ritschel. Escaping Plato’s cave: 3D shape fromadversarial rendering. In
ICCV , 2019.[20] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. Generating multiple objects at spatiallydistinct locations. In
ICLR , 2019.[21] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instancenormalization. In
ICCV , 2017.[22] Insu Jeon, Wonkwang Lee, and Gunhee Kim. IB-GAN: Disentangled representation learningwith information bottleneck GAN. https://openreview.net/forum?id=ryljV2A5KX , 2018.[23] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, andRoss Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visualreasoning. In
CVPR, 2017. [24] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In
CVPR ,2018.[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs forimproved quality, stability, and variation. In
ICLR , 2018.[26] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generativeadversarial networks. In
CVPR , 2019.[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
ICLR , 2015.[28] Thomas Kipf, Elise van der Pol, and Max Welling. Contrastive learning of structured worldmodels. In
ICLR , 2020.[29] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer,repeat: Generative modelling of moving objects. In
NeurIPS , 2018.[30] Jean Kossaifi, Linh Tran, Yannis Panagakis, and Maja Pantic. GAGAN: Geometry-awaregenerative adverserial networks. In
CVPR , 2018.[31] Hanock Kwak and Byoung-Tak Zhang. Generating images part by part with composite generativeadversarial networks. arXiv:1607.05387, 2016.[32] Yiyi Liao, Katja Schwarz, Lars Mescheder, and Andreas Geiger. Towards unsupervised learningof generative models for 3D controllable image synthesis. In
CVPR , 2020.[33] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, JamesHays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO:common objects in context. In
ECCV , 2014.[34] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer forimage-based 3D reasoning. In
ICCV , 2019.[35] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, andYaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images.
ACMTransactions on Graphics , 38(4):65:1–14, July 2019.[36] Matthew M. Loper and Michael J. Black. OpenDR: An approximate differentiable renderer. In
ECCV , 2014.[37] Stephen Marschner, Peter Shirley, Michael Ashikhmin, Michael Gleicher, Naty Hoffman, GarrettJohnson, Tamara Munzner, Erik Reinhard, William B. Thompson, Peter Willemsen, and BrianWyvill.
Fundamentals of Computer Graphics . A K Peters/CRC Press, 4th edition, 2015.[38] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi,and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In
ECCV ,2020.[39] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784,2014.[40] Thu Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yongliang Yang. RenderNet: A deepconvolutional network for differentiable rendering from 3D shapes. In
NeurIPS , 2018.[41] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN:Unsupervised learning of 3D representations from natural images. In
ICCV , 2019.[42] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, and Linjie Luo. Transformablebottleneck networks.
ICCV , 2019.[43] Dim P. Papadopoulos, Youssef Tamaazousti, Ferda Ofli, Ingmar Weber, and Antonio Torralba.How to make a pizza: Learning a compositional layer-based GAN model. In
CVPR , 2019.[44] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In
CVPR, 2017. [45] Scott E. Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In
NIPS , 2016.[46] Konstantinos Rematas and Vittorio Ferrari. Neural voxel renderer: Learning an accurate andcontrollable rendering tool. In
CVPR , 2020.[47] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representationfor 3D human pose estimation. In
ECCV , 2018.[48] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, PeterBattaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In
NIPS , 2017.[49] Zhixin Shu, Mihir Sahasrabudhe, Riza Alp Guler, Dimitris Samaras, Nikos Paragios, and IasonasKokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In
ECCV , 2018.[50] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and MichaelZollhöfer. DeepVoxels: Learning persistent 3D feature embeddings. In
CVPR , 2019.[51] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks:Continuous 3D-structure-aware neural scene representations. In
NeurIPS , 2019.[52] Mehmet Özgür Türko˘glu, Luuk Spreeuwers, William Thong, and Berkay Kicanaoglu. A layer-based sequential framework for scene generation with GANs. In
AAAI , 2019.[53] Sjoerd van Steenkiste, Karol Kurach, and Sylvain Gelly. A case for object compositionality indeep generative models of images. In
NeurIPS Workshops , 2018.[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
NIPS , 2017.[55] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformernets: Learning single-view 3D object reconstruction without 3D supervision. In
NIPS , 2016.[56] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN: Layered recursivegenerative adversarial networks for image generation.
ICLR , 2017.[57] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset forfine-grained categorization and verification. In
CVPR , 2015.[58] Yanchao Yang, Yutong Chen, and Stefano Soatto. Learning to manipulate individual objects in animage. In
CVPR , 2020.[59] Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T.Freeman, and Joshua B. Tenenbaum. 3D-aware scene manipulation via inverse graphics. In
NeurIPS , 2018.[60] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of alarge-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.[61] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. Viewsynthesis by appearance flow. In
ECCV , pages 286–301, 2016.[62] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Josh Tenenbaum,and Bill Freeman. Visual object networks: Image generation with disentangled 3D representations.In
NeurIPS, 2018.
A Additional results
A.1. Comparison to entangled 3D scene representation. We compare BlockGAN with HoloGAN [41], which also learns deep 3D scene features but does not consider object disentanglement. In particular, HoloGAN only considers one noise vector z for identity and one pose θ for the entire scene, and does not consider translation t as part of θ. While HoloGAN works well with object-centred scenes, it struggles with moving foreground objects. Figure 9 shows that HoloGAN tends to associate each pose θ with a fixed object identity (i.e., moving objects erroneously changes the identity of both foreground and background), while changing z only changes a small part of the background. BlockGAN, on the other hand, can separate identity and pose for each object, while being able to learn scene-level effects such as lighting and shadows.

Figure 9: Samples from HoloGAN [41] trained on the datasets (a) SYNTH-CHAIR1, (b) SYNTH-CAR1 and (c) REAL-CAR. HoloGAN tends to associate each pose θ with a fixed object identity, i.e., moving objects erroneously changes the identity of both foreground and background (see top left, bottom left and right), while changing the noise vector z only changes a small part of the background (top right).
A.2. Increasing the number of foreground objects. To test the capability of our method, we train BlockGAN on CLEVR6 (6 foreground objects). As shown in Figure 10, BlockGAN is still capable of generating and manipulating 3D object features, although the background generator now also produces foreground objects (Figure 10f). Moreover, rotating an individual object leads to changes in the object's depth (Figure 10a).

Interestingly, we notice that BlockGAN now generates images with more or fewer than 6 objects, despite being trained with images that contain exactly 6 objects (Figure 10g). We hypothesise that BlockGAN's failure in this case is due to our assumption that the poses of all objects are independent from each other (during training, we randomly sample the pose θ_i for each object). This is not true in the physical world (and also in the CLEVR dataset), where objects do not intersect. The more objects there are in the scene, the stronger the interdependence between objects' poses becomes. Therefore, for future work, we hope to adopt more powerful relational learning structures to learn objects' poses directly from training images. Another interesting direction is to design object-aware discriminators, which are capable of recognising fake images when the generators produce samples with more objects than the training images.

B Additional ablation studies
B.1. Learning without the perspective camera.
Here we show the advantage of implementing the perspective camera explicitly, compared to using a weak-perspective projection like HoloGAN [41]. Since a perspective camera directly affects foreshortening, it provides strong cues for BlockGAN to solve the scale/depth ambiguity. This is especially important for BlockGAN to learn to project and reason over occlusion by concatenating the depth and channel dimensions, followed by an MLP. Since the MLP is flexible, BlockGAN trained without a perspective camera therefore tends to learn to associate an object's identity with scale and depth, while changing depth only changes the object's appearance (see Figure 11).
B.2. Scene composer function.
We consider and compare three scene composer functions: (i) element-wise summation, (ii) element-wise maximum, and (iii) an MLP (multi-layer perceptron). We train BlockGAN with each function and compare their performance in terms of visual quality (KID score) in Table 2. While all three functions can successfully combine objects into a scene, the element-wise maximum performs best and easily generalises to multiple objects. Therefore, we use the element-wise maximum for BlockGAN.

Table 2: KID estimates for different scene composer functions (Sum, Max and MLP) on SYNTH-CAR and SYNTH-CHAIR.

Figure 10: Qualitative results of BlockGAN trained on CLEVR6. (a) Rotating a foreground object, (b) horizontal translation, (c) depth translation, (d) changing foreground object 1, (e) changing foreground object 3, (f) changing the background object, and (g) random samples.

Figure 11: The effect of modelling the perspective camera explicitly (b) compared to using a weak-perspective camera (a). Note that with the weak-perspective camera (a), translation along the depth dimension (top) leads to identity changes without any translation in depth, while changing the noise vector z (bottom) changes both depth translation and, to a lesser extent, the object identity. Using a perspective camera correctly disentangles position and identity (b).

B.3. Style discriminator. When BlockGAN is trained with a standard discriminator on datasets with a cluttered background, such as the REAL-CAR dataset, the foreground object features tend to include part of the background object. This creates visual artefacts when objects move in the scene (indicated by red arrows in Figure 12a). We hypothesise that these artefacts should be picked up by the discriminator, since the generated images should look unrealistic. Therefore, we add more powerful style discriminators [41] to the original discriminator at different layers (see Section D for details). Figure 12b shows that the generator is indeed discouraged from adding background information to the foreground object features, leading to cleaner results.

Figure 12: With a standard discriminator (a), a part of the background appearance is baked into the foreground object (see red arrows). Adding the style discriminator (b) cleanly separates the car from the background.
B.4. Incorrect number of objects. We next investigate the performance of BlockGAN when the training data contains fewer or more objects than expected. In Figure 13, we show BlockGAN configured with 2 foreground object generators when trained with images containing 1 or 3 foreground objects. If only a single object is present (Figure 13, left), changing either of the two foreground generators changes the object's appearance and pose (top), while changing the background works as expected (bottom). If there are three objects present (Figure 13, right), changing one foreground generator changes one object as expected (top), while changing the background generator simultaneously changes one foreground object and the background (bottom).
Figure 13: Results for BlockGAN with 2 foreground (FG) object generators when trained on 1 or 3 foreground objects. 1 object (left): Changing either FG object changes the object's appearance and pose; changing the background works as expected. 3 objects (right): Changing one FG object changes one object as expected; changing the background changes one FG object and the background.
C Comparison to other methods
In Figures 14, 15, 16 and 17, we show samples generated by a vanilla GAN (WGAN-GP [17]), the 2D object-aware LR-GAN [56], the 3D-aware HoloGAN [41] and our BlockGAN. Compared to other models, BlockGAN produces samples with competitive or better quality, and offers explicit control over the poses of objects in the generated images. Notice that although LR-GAN is designed to handle foreground and background objects explicitly, for CLEVR2 with two foreground objects, this method struggles and tends to always place one foreground object at the image centre (see Figure 16).
Implementation details
For WGAN-GP, we use a publicly available implementation (https://github.com/LynnHo/DCGAN-LSGAN-WGAN-WGAN-GP-Tensorflow). For LR-GAN and HoloGAN, we use the code provided by the authors. We conduct a hyperparameter search for these models, and report the best results for each method. Note that for HoloGAN, we modify the 3D transformation to add translation during training, since this method assumes that foreground objects are at the image centre.
Figure 14: Samples from WGAN-GP, LR-GAN, HoloGAN and our BlockGAN trained on SYNTH-CAR1.

Figure 15: Samples from WGAN-GP, LR-GAN, HoloGAN and our BlockGAN trained on SYNTH-CHAIR1.

Figure 16: Samples from WGAN-GP, LR-GAN, HoloGAN and our BlockGAN trained on CLEVR2.

Figure 17: Samples from WGAN-GP, LR-GAN, HoloGAN and our BlockGAN trained on REAL-CARS.
D Loss function and style discriminator
For datasets with cluttered backgrounds like the natural REAL-CAR dataset, we adopt style discriminators in addition to the normal image discriminator (see the benefit in Figure 12). Style discriminators perform the same real/fake classification task as the standard image discriminator, but at the feature level across different layers. In particular, style discriminators classify the mean µ and standard deviation σ of the features Φ_l at different levels l (which are believed to describe the image "style"). The mean µ(Φ_l(x)) and standard deviation σ(Φ_l(x)) of the features Φ_l(x) are computed across batch and spatial dimensions independently using:

µ(Φ_l(x)) = 1/(N·H·W) · Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} Φ_l(x)_{nhw},    (2)

σ(Φ_l(x)) = sqrt( 1/(N·H·W) · Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} ( Φ_l(x)_{nhw} − µ(Φ_l(x)) )² + ε ).    (3)

The style discriminators are implemented as MLPs with sigmoid activation functions for binary classification. A style discriminator at layer l is written as

L^l_style(G) = E_{z,θ} [ −log D_l( G(z, θ) ) ].    (4)

The total loss can therefore be written as

L_total(G) = L_GAN(G) + λ_s · Σ_l L^l_style(G).    (5)

We set λ_s = 1 for all natural datasets and λ_s = 0 for synthetic datasets.
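A hedged PyTorch sketch of the style statistics in Eqs. (2)–(3) and the per-layer style discriminator of Eq. (4); the MLP width and the way features are extracted from the discriminator are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

def style_stats(feat, eps=1e-6):
    """Mean and std of features over batch and spatial dims, per channel (Eqs. 2-3)."""
    # feat: (N, C, H, W) activations Phi_l(x) from some discriminator layer l.
    mu = feat.mean(dim=(0, 2, 3))                                   # (C,)
    sigma = torch.sqrt(feat.var(dim=(0, 2, 3), unbiased=False) + eps)
    return mu, sigma

class StyleDiscriminator(nn.Module):
    """Binary classifier on the "style" (mean/std) of one feature layer (Eq. 4)."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feat):
        mu, sigma = style_stats(feat)
        return self.mlp(torch.cat([mu, sigma]).unsqueeze(0))        # probability of "real"

# Total generator loss (Eq. 5): GAN loss plus a weighted sum of per-layer style losses.
# lambda_s = 1 for natural datasets and 0 for synthetic ones, as stated above.
def total_g_loss(gan_loss, style_probs, lambda_s=1.0):
    style_loss = sum(-torch.log(p + 1e-8).mean() for p in style_probs)
    return gan_loss + lambda_s * style_loss
```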
E Datasets

We modify the CLEVR dataset [23] to add a larger variety of colours and primitive shapes. Additionally, we use the scene setups provided by CLEVR to render the remaining synthetic datasets (SYNTH-CARn and SYNTH-CHAIRn, with n foreground objects each). These include a fixed, grey background, a virtual camera with fixed parameters but random location jittering, and random lighting. We also use the render script from CLEVR to randomly place foreground objects into the scene and render them. We render all images at a resolution of 128 × 128 and downsample them to 64 × 64 for training.

For the natural CAR dataset, each image is first scaled such that the smaller side is 64 pixels, and then cropped to produce a 64 × 64 pixel crop. During training, we randomly move the 64 × 64 cropping window before cropping the image. Figure 18 includes samples from our generated datasets, and Table 3 lists the range of pose parameters used for each dataset during training.
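The scaling and random cropping described above could look roughly like the following (our own illustrative code using Pillow, not the authors' data pipeline):

```python
import random
from PIL import Image

def random_crop_64(path):
    """Scale the shorter side to 64 px, then take a randomly placed 64x64 crop."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 64 / min(w, h)
    img = img.resize((max(64, round(w * scale)), max(64, round(h * scale))), Image.BILINEAR)
    w, h = img.size
    left = random.randint(0, w - 64)   # the crop window only moves along the longer axis
    top = random.randint(0, h - 64)
    return img.crop((left, top, left + 64, top + 64))
```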
Link for 3D textured chair models: https://keunhong.com/publications/photoshape/
Link for CLEVR: https://github.com/facebookresearch/clevr-dataset-gen
Link for natural CAR dataset: http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/

Table 3: Datasets used in our paper (n = number of foreground objects). 'Azimuth' describes object rotation about the up-axis. 'Elevation' refers to the camera's elevation above ground. 'Scaling' is the scale factor applied to foreground objects. 'Horiz. transl.' and 'Depth transl.' are horizontal/depth translation of objects relative to the global origin. Ranges represent uniform random distributions.

Name             Images    Azimuth      Elevation    Scaling      Horiz. transl.   Depth transl.
SYNTH-CARn
SYNTH-CHAIRn
CLEVRn [23]      100,000   0° – 359°    45°          0.5 – 0.6    –4 – 4           –4 – 4
REAL-CARS [57]   139,714   0° – 359°    0° – 35°     0.5 – 0.8    –3 – 4           –5 – 6
Figure 18: Samples from the synthetic datasets.
F Implementation
F.1. Training details.
Virtual camera model. We assume a virtual camera with a focal length of 35 mm and a sensor size of 32 mm (Blender's default values), which corresponds to an angle of view of 2 arctan(32 mm / (2 × 35 mm)) ≈ 49.1 degrees (we use the same setup for natural images).
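As a quick, illustrative check of that angle-of-view arithmetic:

```python
import math

focal_mm, sensor_mm = 35.0, 32.0
fov = 2 * math.degrees(math.atan((sensor_mm / 2) / focal_mm))
print(f"{fov:.1f} degrees")  # 49.1 degrees
```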
Sampling. We initialise all weights using a zero-mean Gaussian distribution, and biases with a constant. For CLEVRn, we use noise vector dimensions of |z_0| = 20 for the background, and |z_i| = 60 (for i = 1, ..., n) for the foreground objects, to account for their relative visual complexity. Similarly, for SYNTH-CARn and SYNTH-CHAIRn, we use |z_0| = 30 and |z_i| = 90 (for i = 1, ..., n). For the natural REAL-CAR dataset, we use |z_0| = 100 and |z_1| = 200. Note that we only feed z_i to the 3D features of each object, and not to the 3D scene features and 2D features. Table 3 provides the ranges we use for sampling the pose θ_i of foreground objects during training.

Training. We train BlockGAN using the Adam optimiser [27]. We use the same learning rate for both the discriminator and the generator. Empirically, we find that updating the generator twice for every update of the discriminator achieves images with the best visual fidelity. We use a learning rate of 0.0001 for all synthetic datasets. For the natural REAL-CARS dataset, we use a learning rate of 0.00005. We train all datasets with a batch size of 64 for 50 epochs. Training takes 1.5 days for the synthetic datasets and 3 days for the natural REAL-CARS dataset.
Infrastructure.
All models were trained using a single GeForce RTX 2080 GPU.
F.2. Network architecture.
We describe the network architecture of the BlockGAN foreground object generator in Table 4, the BlockGAN background generator in Table 5, and the overall BlockGAN generator in Tables 6 and 7 for synthetic and real datasets, respectively. Note that we use ReLU for the synthetic datasets and LReLU for the natural CAR dataset after the AdaIN layer. The discriminator is described in Table 8. In terms of the notation in Section 3 of the main paper, object features have dimensions H_o × W_o × D_o × C_o (with H_o = 16), scene features have the same dimensions H_s × W_s × D_s × C_s, and camera features have dimensions H_c × W_c = 16 × 16 (before up-convolutions to 64 × 64), with C_c = 64 channels for synthetic datasets and C_c = 256 channels for natural image datasets. As GANs empirically tend to perform better on category-specific datasets, we decided to start with this assumption. A promising future direction is to adopt a shared rendering layer for objects generated by different category-specific generators, similar to Aliev et al. [2].

Table 4: Network architecture of the BlockGAN foreground (FG) object generator. Layers in order: a learnt constant tensor followed by AdaIN, two 3D up-convolutions each followed by AdaIN, and a final 3D transformation.

Table 5: Network architecture of the BlockGAN background (BG) object generator. Layers in order: a learnt constant tensor followed by AdaIN, two 3D up-convolutions each followed by AdaIN, and a final 3D transformation.

Table 6: Network architecture of the BlockGAN generator for synthetic datasets. Layers in order: n × FG generator (Table 4) and one BG generator (Table 5), both with ReLU activations; element-wise maximum over all object features; concatenation of the depth and channel dimensions; one convolution; and three up-convolutions.

Table 7: Network architecture of the BlockGAN generator for the REAL-CARS dataset. It follows Table 6, but uses LReLU activations; differences to the synthetic generator in Table 6 are highlighted in blue.

Table 8: Network architecture of the BlockGAN discriminator for both synthetic and real datasets. Layers in order: four convolutions, followed by a fully connected layer with sigmoid activation (normalisation: none/spectral).