Generation High resolution 3D model from natural language by Generative Adversarial Network
Kentaro Fukamizu, Masaaki Kondo, Ryuichi Sakamoto
Abstract
Since creating 3D models with 3D designing tools is a heavy task for humans, there is a need to generate high quality 3D shapes quickly from text descriptions. In this paper, we propose a method of generating high resolution 3D shapes from natural language descriptions. Our method is based on prior work [2] in which relatively low resolution shapes are generated by neural networks. Generating high resolution 3D shapes is difficult because of the restriction of the memory size of a GPU and the time required for neural network training. To overcome this challenge, we propose a neural network model with two steps: first, a low resolution shape which roughly reflects a given text is generated, and second, the corresponding high resolution shape which reflects the details of the text is generated. To generate high resolution 3D shapes, we use the framework of Conditional Wasserstein GAN. We perform quantitative evaluation with several numerical metrics for the generated 3D models. We found that the proposed method can improve both the quality of shapes and their faithfulness to the text.
1 Introduction

People usually use their natural languages to communicate their thoughts and emotions to others. Recently, artificial intelligence (AI) technology allows us to use natural languages for operating and controlling computer systems. For example, it is becoming possible for an AI engine to generate 3D shapes from natural language descriptions.

When creating 3D models, 3D designers commonly rely on expensive modeling software, such as Maya, Blender, and 3DSMAX, and they need to spend a very long time to create 3D models of satisfactory quality with these tools. Even after becoming an expert in these tools, it takes a long time to create one 3D model. Compared with using 3D modeling software, if a human already has an image of the target object in her/his mind, it is very easy to express the outline of the shape in the form of a text description. If an AI engine can generate 3D shapes quickly from text descriptions, it is possible to reduce the time taken for these heavy tasks.

There have been research efforts to generate 3D models from vectors of encoded text descriptions by using Generative Adversarial Networks (GANs) [3]. For example, in [7], it is possible to generate 3D shapes of rough categories, but it is not possible to generate fine models with colors or details of each part.

Prior work [2] successfully generates 3D shapes with colors from text descriptions. However, the resolution of the shapes is still low. In that work, text is first converted to a vector representation via an encoder. Next, the corresponding 3D shape is generated by the Generator from the vector. This Generator is based on a deep-learning methodology and a GAN framework. When the vectorized text is input to the Generator, it is combined with noise to ensure the flexibility of the Generator.

The output 3D shape is created through 3D deconvolution operations and then input to a Critic network, and the Wasserstein distance between the actual 3D shape and the generated one is calculated.
On one hand, the Critic network attempts to calculate the Wasserstein distance accurately. On the other hand, the Generator attempts to minimize the distance calculated by the Critic. This ensures that the probability distribution of the actual 3D shapes and that of the generated ones become closer. The details of the algorithm will be described later in Section ??.

In the previous work, the output is generated in the form of 3D voxels. As for voxel generation, learning was done with voxel sizes of (x, y, z) = (32, 32, 32) and (64, 64, 64), each with a color channel.

Throughout this paper, we propose several GAN models in which the role of the Critic network differs, for example, whether or not it focuses on preciseness of the text descriptions. Since these models have advantages and disadvantages for various indices, we also introduce several metrics to compare the effectiveness of these proposed models.

The rest of this paper is organized as follows. In the next section, .....

In the Generative Adversarial Network (GAN) framework, two networks called the Generator and the Discriminator are learned. The Generator is trained to generate data similar to the training data from a latent vector. The Discriminator learns to discriminate between the training data and the data generated by the Generator. The Generator and the Discriminator are trained alternately with this mechanism, and finally it is expected that the Generator can produce data similar to the training data. These processes can be expressed by the following mathematical expression:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where G, D, and x are the Generator, the Discriminator, and the training data, respectively. This expression is the objective function of D and G. Here, z is a noise vector used as the latent vector, and G generates data from the noise z. D(x) is the probability that x is regarded as training data.
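The value function above can be evaluated numerically on toy samples. The sigmoid discriminator and the linear generator below are illustrative assumptions, not the networks used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # toy discriminator: probability that x is real
    return 1.0 / (1.0 + np.exp(-x))

def G(z):
    # toy generator: maps noise to the data space
    return 0.5 * z - 1.0

x_real = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from p_data
z = rng.normal(size=1000)                           # samples from p_z

# V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print(round(float(V), 3))
```

Since both expectations are logarithms of probabilities, V is always negative; the Discriminator tries to push it toward 0 while the Generator pushes it down.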
In the original GAN, all elements of the latent vector input to G are noise drawn from a certain probability distribution. However, if only noise is taken as the input, it is difficult to specify and generate the particular data which you want to generate. In contrast, CGAN can express what we want to generate by inputting a label as a latent vector into the Generator and the Discriminator. In CGAN, the latent vector of what you want to generate is encoded (from an image, text, etc.) before generation.

In CGAN, the latent vector l, which is appropriately generated from the training data, is combined with the noise vector z and input to the Generator. The reason for combining it with the noise vector is that this ensures the Generator has diversity. Otherwise, the possible output may be limited, determined only by the latent vector. In addition, this helps make the output more robust even for parts which the latent vector cannot describe sufficiently (such as details of the background of an image).

It is common in usual GAN frameworks to use the combined latent vector as an input to the Generator so that the Discriminator recognizes the generated data as training data. The difference in CGAN is that it also uses the latent vector l as an input to the Discriminator. By doing so, the Discriminator can judge whether the output from the Generator is actually linked with the corresponding latent vector description. To implement the above point, the objective function of CGAN is expressed as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z),\, l \sim p_l(l)}[\log(1 - D(G(l, z)))]

In order to strengthen the capability of identifying whether the output is generated with the corresponding latent vector, the literature [10] proposes training so that D outputs 0 for training data that does not match the description of the latent vector.
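The CGAN input construction, combining the condition latent vector l with the noise z before feeding the Generator, can be sketched with simple array operations. The dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, noise_dim = 128, 16
l = rng.normal(size=latent_dim)   # condition embedding (e.g. encoded text)
z = rng.normal(size=noise_dim)    # noise vector, gives the Generator diversity

g_input = np.concatenate([l, z])  # the combined input to G, i.e. G(l, z)
print(g_input.shape)
```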
In this case, the objective function is expressed as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_{\mathrm{mis}}(x)}[\log(1 - D(x))] + \mathbb{E}_{z \sim p_z(z),\, l \sim p_l(l)}[\log(1 - D(G(l, z)))]

where p_mis(x) is the probability distribution of the training data mismatched with the latent vector.

The aim of Wasserstein GAN (WGAN) is to bring the probability distribution of the generated voxels, P_g(x), close to the probability distribution of the training data, P_r(x). The simplest way of doing this is minimizing the Kullback-Leibler (KL) divergence, one of the metrics measuring the distance between two probability distributions. The KL divergence between the probability distributions P_r and P_g is calculated as follows:

KL(P_r \| P_g) = \int \log\!\left(\frac{P_r(x)}{P_g(x)}\right) P_r(x)\, dx

In the case of GANs, it is not possible to numerically calculate KL, and hence to compute a loss function from it, because no specific probability distribution is assumed in GANs. Instead of using KL directly, the GAN tries to minimize the Jensen-Shannon divergence (JSD). Although KL is asymmetric, JSD is symmetric. The mathematical expression of JSD is as follows:

P_A = \frac{P_r + P_g}{2}

JSD(P_r \| P_g) = \frac{1}{2} KL(P_r \| P_A) + \frac{1}{2} KL(P_g \| P_A)

For the loss of the GAN shown below,

V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

V becomes maximum with respect to D in the case of

D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}

In this condition, the objective function becomes as follows:

V(D^*, G) = 2\, JSD(P_r \| P_g) - 2 \log 2

If D accurately approximates the JSD, G learns to minimize the JSD which is calculated by D. However, in training GANs, it is very important to carefully adjust the learning balance between the Discriminator and the Generator and their learning rates. If the Discriminator's training is insufficient, the Generator will minimize an incorrect JSD.
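The KL and JSD definitions above can be checked numerically for small discrete distributions; this standalone sketch confirms that JSD is symmetric and bounded by log 2:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions (assumes q > 0 wherever p > 0)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    # JSD(p || q) = 0.5 KL(p || m) + 0.5 KL(q || m), with m = (p + q) / 2
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

# KL is asymmetric, JSD is symmetric and at most log 2
assert abs(kl(p, q) - kl(q, p)) > 1e-6
assert abs(jsd(p, q) - jsd(q, p)) < 1e-12
assert 0.0 < jsd(p, q) <= np.log(2)
print(round(jsd(p, q), 4))
```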
On the other hand, if the Discriminator's training progresses too far, the gradient passed to the Generator becomes small, making the Generator's training infeasible.

As discussed, the success of learning in GANs depends on the learning parameters. Especially when learning a model with a large number of parameters, such as for 3D voxel creation in this research, adjustment of those learning parameters is extremely difficult. To overcome this challenge, prior work introduces another index to measure the distance between probability distributions instead of JSD. In Wasserstein GAN [1], the Wasserstein distance is introduced as the distance metric. The Wasserstein distance is expressed as follows:

W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\, \|x - y\| \,]

where \Pi(P_r, P_g) denotes the set of all joint distributions \gamma(x, y) whose marginals are P_r and P_g, respectively. Intuitively, \gamma(x, y) indicates how much "mass" must be transported from x to y to convert the probability distribution P_r into the probability distribution P_g. Originally, this is a metric used for optimal transport problems. It makes measuring the distance between low dimensional manifolds possible.

The Wasserstein distance can describe the distance better than JSD, but it is difficult to calculate. According to the Kantorovich-Rubinstein duality [9], it can be expressed using a 1-Lipschitz function f as follows:

W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]

where x \in \chi and f: \chi \to \mathbb{R}. The 1-Lipschitz condition means that the slope of the line between arbitrary x, x' \in \chi does not exceed 1.
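As a concrete illustration of the Wasserstein distance, in one dimension and with equally many samples the optimal transport problem has a closed form: sort both sample sets and average the element-wise distances. This is a standalone sketch, unrelated to the paper's networks:

```python
import numpy as np

def wasserstein_1d(x, y):
    # 1D Wasserstein-1 distance between two equal-size empirical samples:
    # the optimal transport plan simply matches sorted samples pairwise.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(3.0, 1.0, size=5000)

# For two Gaussians differing only by a shift, W1 equals the shift (about 3)
print(round(wasserstein_1d(a, b), 2))
```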
Here, if f is a function {f_w}_{w \in W} represented by some parameters w, and x \sim P_g is obtained through g_\theta(z): \mathbb{R}^n \to \chi, we can express W(P_r, P_g) as follows:

W(P_r, P_g) = \max_{w \in W} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(g_\theta(z))]

In order to satisfy the 1-Lipschitz condition, it suffices that the parameters fit into a compact space, that is, the absolute value of each weight parameter w is clipped to a certain value c. The Discriminator is called the Critic to distinguish it from that of the original GAN.

In WGAN, the Critic and the Generator networks learn alternately until the Wasserstein distance converges. While the Critic attempts to calculate the Wasserstein distance between the training data and the generated data accurately, the Generator attempts to bring the probability distribution of the generated data close to that of the training data by minimizing the calculated Wasserstein distance. Given that the gradients do not disappear even if the Critic is trained completely, WGAN is very stable in learning. Therefore, adjustment of the learning balance between the Critic and the Generator is unnecessary.

In conventional GANs, since the Critic and the Generator networks use different loss functions, the loss values do not converge even if training progresses fairly. The problem here is that the timing for finishing training is difficult to find. However, the loss value of the Critic always tends toward convergence in WGAN. Therefore, one of the advantages of WGAN is that a decrease of the loss value always correlates with improvement in the quality of the generated data.

However, one of the disadvantages of WGAN is the clipping of the weight parameters. If the weight parameters are clipped, the weights become polarized at the clipping boundary values, resulting in gradient explosion or disappearance. This causes delay in learning. Here, the optimized Critic has the characteristic that its gradient has norm 1 at almost all points under P_r and P_g [4]. Based on this feature, WGAN-gp was proposed [4].
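The weight clipping step can be sketched as follows: after each Critic update, every weight is clipped to [-c, c] so the parameters stay in a compact space. The layer shape and c = 0.01 here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.01  # clipping threshold (an assumed value)

# toy Critic parameters, e.g. one weight matrix per layer
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(2)]

# after each parameter update, clip every weight in place
for w in weights:
    np.clip(w, -c, c, out=w)

print(float(max(np.max(np.abs(w)) for w in weights)))
```

As the surrounding text notes, this clipping tends to push many weights onto the boundary values -c and c, which motivates the gradient penalty described next.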
In WGAN-gp, a penalty term is introduced into the loss function so that the Critic has a gradient whose norm is 1 at almost all points under P_r and P_g. This allows the Critic to be optimized without clipping the weights. Let D(x) be the output of the Critic. Then the loss is expressed as follows:

L_{WGANGP} = \mathbb{E}_{x \sim P_g}[D(x)] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{x' \sim P_{gp}}[(\|\nabla_{x'} D(x')\| - 1)^2]

where x' = \epsilon x + (1 - \epsilon)\hat{x}, with \epsilon \sim U[0, 1], x \sim P_r, and \hat{x} \sim P_g. By penalizing the gradient, the weights take diversified values without polarizing, giving higher model performance.

As mentioned in Section 1, this paper is based on prior work which generates 3D voxels from texts [2]. In text2shape GAN (TSGAN), 3D voxels are generated stably by combining the CGAN and WGAN-gp techniques. In TSGAN, the Critic not only evaluates how realistic the generated voxels look, but also how faithfully the generated voxels reflect the texts. The objective function of TSGAN is as follows:

L_{CWGAN} = \mathbb{E}_{t \sim p_\tau}[D(t, G(t))] + \mathbb{E}_{(t,s) \sim p_{mis}}[D(t, s)] - \mathbb{E}_{(t,s) \sim p_{mat}}[D(t, s)] + \lambda_{GP} L_{GP}

L_{GP} = \mathbb{E}_{(t,s) \sim p_{GP}}[(\|\nabla_t D(t, s)\| - 1)^2 + (\|\nabla_s D(t, s)\| - 1)^2]

where t is the text embedding, s is the 3D voxel, and p_\tau is the probability distribution of the text embedding. In addition, p_mat and p_mis are the probability distributions of matching text-voxel pairs and mismatching text-voxel pairs, respectively. Note that the gradients for all the input variables of D are summed up to make the gradient penalty.

Figure 1: The model of TSGAN

Fig. 1 shows the model of TSGAN. First, it combines the text embedding with a noise vector. This is the input to the Generator, and a set of 3D voxels is output by deconvolution. As the last layer of the Generator, it has a sigmoid layer so that the output value is restricted to between 0 and 1. The generated 3D voxels are input to the Critic and transformed into a one-dimensional vector via convolutional layers.
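The interpolation and penalty term can be sketched for a toy critic. Here D(x) = w·x is an assumed linear critic, so its input gradient is w in closed form; a real Critic would obtain the gradient through automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam = 8, 10.0

w = rng.normal(size=dim)       # parameters of the assumed linear critic
x_real = rng.normal(size=dim)  # x ~ P_r
x_fake = rng.normal(size=dim)  # x_hat ~ P_g
eps = rng.uniform()            # eps ~ U[0, 1]

# interpolate between a real and a generated sample: x' = eps*x + (1-eps)*x_hat
x_interp = eps * x_real + (1 - eps) * x_fake

# for D(x) = w.x the gradient w.r.t. x' is simply w
grad = w
penalty = lam * (np.linalg.norm(grad) - 1.0) ** 2
print(round(float(penalty), 3))
```

The penalty is zero exactly when the critic's gradient norm is 1 at the interpolate, which is the condition the surrounding text describes.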
The vector is combined with the text embedding and then passed to fully connected layers. The output of the final fully connected layer is a scalar value. The output of the Generator is a set of voxels of size (x, y, z) = (32, 32, 32) with a color channel.

As the simplest approach to generate high resolution 3D shapes, it is conceivable to add a higher resolution deconvolution layer to the model of TSGAN. However, the number of parameters to be added becomes too large to compute, because the model needs to deal with three-dimensional data for learning. In this research, we assume a GPU is used for faster learning, but the number of parameters that a GPU can store in its memory for learning is limited, depending on the memory size of the GPU.

If the number of parameters is large, the problem is not only that learning is sometimes terminated by the lack of memory, but also that the number of epochs required for training dramatically increases. Since the training cost for three-dimensional data grows cubically with the resolution, training will not finish in realistic time. Even if we limit the number of epochs, the generated voxels may be collapsed. For these reasons, simply adding a higher resolution layer does not work well.

To overcome the challenges described above, our approach is to divide the task into two steps: one generates a low resolution shape which roughly reflects the target text (StageI), and the other generates the corresponding high resolution shape (StageII), using the knowledge of StackGAN [12]. In StackGAN, StageI generates a low resolution image, roughly deciding its color distribution and placement. In StageII, the low resolution image generated in StageI is input to some convolution layers, then combined with the text embedding, which is sent to the residual layers. Finally, a high resolution output image is generated via deconvolution layers.
In the proposed method, we use TSGAN as StageI and construct a new model for StageII to generate high resolution voxels. The following sections describe the details of these two stages.
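The cubic growth that motivates this two-stage design can be made concrete with a quick calculation. The channel count of 4 (RGB plus occupancy) is an assumption for illustration, not a value taken from the paper:

```python
# Doubling the voxel resolution multiplies the number of voxels (and hence
# activation memory) by 2^3 = 8, independent of the channel count.
channels = 4                    # assumed: RGB + occupancy
low = 32 ** 3 * channels        # values per low resolution sample
high = 64 ** 3 * channels       # values per high resolution sample

print(low, high, high // low)   # the high resolution grid is 8x larger
```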
The Generator tries to create rough shapes of the resulting voxels at this stage. Unlike TSGAN, StageI in this research does not need to generate voxels which are strictly faithful to the input text, since the details described in the text are reshaped in StageII. Instead of using latent vectors combined with additional noise as proposed in [2], we use only latent vectors as the input to the Critic, since we found that this can create a sufficient level of voxels and achieve faster convergence of training.

In addition, we found a problem in TSGAN. In TSGAN, generated voxels are first converted to one-dimensional vectors through the convolutional and fully connected layers of the Critic, and then combined with the text latent vectors. In some cases this loses the spatial information of the voxels, resulting in the lack of a meaningful connection between the voxels and the corresponding texts. Therefore, we spatially duplicate the text embedding vectors and combine them with the convolved voxels to keep the spatial features, as in StackGAN [12]. Fig. 2 shows our proposed network model for StageI.

Figure 2: The network model of StageI
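The spatial duplication of the text embedding can be sketched with array operations: the embedding is broadcast over every spatial location of the convolved voxel feature map before concatenation, so the spatial structure is preserved. The shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

emb = rng.normal(size=128)              # text embedding vector
feat = rng.normal(size=(256, 8, 8, 8))  # convolved voxel features (C, x, y, z)

# duplicate the embedding at every (x, y, z) location
emb_tiled = np.broadcast_to(emb[:, None, None, None], (128, 8, 8, 8))

# concatenate along the channel axis, keeping the spatial layout intact
combined = np.concatenate([feat, emb_tiled], axis=0)
print(combined.shape)
```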
In the high resolution task, the role of the Critic network varies depending on the type of input variables and the loss functions. In contrast with StageI, which decides the rough color and shape of the voxels, we can consider two training models in StageII: focusing only on heightening the resolution, or extending the existing training models by refining the prior work. Therefore, we propose the following two models for StageII:

• (v0): This model supposes that the faithfulness to the text is sufficiently established in StageI, so the binding to the text is relaxed. The Critic network focuses on whether the generated high resolution shapes are correct or not.

• (v1): Like TSGAN, the Critic network focuses on whether the voxel is accurately generated from the text description of the shape.

In the v0 model, the Critic monitors whether higher resolution shapes can be appropriately achieved from the low resolution voxels. Fig. 3 shows the flow of the v0 model.

Figure 3: The flow of high resolution task v0

A vector of text embedding is first input to the generator of StageI to generate low resolution voxels. The output voxels are input to the generator of StageII to generate higher resolution voxels. In the v0 model, the input of the Critic network is both the high resolution voxels and the low resolution voxels, so that the Critic can evaluate the high resolution voxels based on the information of the low resolution ones. We do not use vectors of text embedding for tasks after StageI because StageII concentrates solely on heightening the resolution. Therefore, high resolution tasks can be completely separated from low resolution tasks by not using the text information in StageII.

Figure 4: The model of high resolution task v0

Fig. 4 shows the training model of v0. We introduce a residual layer as a hidden layer of the Generator network of StageII.
With the residual layer, we can optimize each layer by learning the residual function with respect to the layer input, instead of learning the optimum output of each layer directly [5]. The loss function for the v0 model is defined as follows:

L_{CWGAN} = \mathbb{E}_{t \sim p_\tau}[D(G_2(G_1(t)), G_1(t))] - \mathbb{E}_{(t,s) \sim p_{mat}}[D(s, G_1(t))] + \lambda_{GP} L_{GP}

L_{GP} = \mathbb{E}_{(t,s) \sim p_{GP}}[(\|\nabla_s D(s, G_1(t))\| - 1)^2 + (\|\nabla_{G_1} D(s, G_1(t))\| - 1)^2]

where the generators for the low resolution and high resolution tasks are denoted by G_1 and G_2, respectively. By removing the term involving p_mis from L_{CWGAN}, the model suppresses learning the aspect of whether the shape matches the text description. The degree of matching between the text description and the generated shape depends on the quality of training in StageI. However, by concentrating solely on heightening the resolution, the number of parameter updates in the training phase can be greatly reduced, resulting in faster learning.
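The residual layer used in the StageII Generator can be sketched as follows: the layer learns a residual function F(x) and outputs F(x) + x, so the identity mapping is trivially representable [5]. The ReLU-of-linear form of F is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    # F(x) = ReLU(W x) is the learned residual; the block outputs F(x) + x
    return np.maximum(W @ x, 0.0) + x

x = rng.normal(size=16)
W = np.zeros((16, 16))  # with F == 0, the block reduces to the identity

out = residual_block(x, W)
print(bool(np.allclose(out, x)))
```

Because "do nothing" is easy to learn, stacking such blocks lets StageII refine the low resolution input rather than regenerate it from scratch.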
In this model, the Critic network monitors whether the voxels are faithfully generated from the text descriptions, as in TSGAN. Fig. 5 shows the flow of the v1 model.

Figure 5: The flow of high resolution model v1

A vector of text embedding is first input to the generator of StageI to generate low resolution voxels. Here, StageI is the same as the Generator of Fig. 2. The generated low resolution voxels are input to StageII, and a one-dimensional vector is created through the convolution layers. The vector is combined with the vector of text embedding and converted to voxels again by the deconvolution layers. The reason for combining a vector representation of the voxels with the vector of text embedding is to generate higher resolution voxels which reflect the details described in the text. The Critic network tries to determine both how realistic the generated voxels look and how closely they match the text by using the Wasserstein distance. Fig. 6 shows the training model of v1.

Figure 6: The training model of high resolution task v1

The loss function for v1 is defined as follows:

L_{CWGAN} = \mathbb{E}_{t \sim p_\tau}[D(t, G_2(G_1(t)))] + \mathbb{E}_{(t,s) \sim p_{mis}}[D(t, s)] - \mathbb{E}_{(t,s) \sim p_{mat}}[D(t, s)] + \lambda_{GP} L_{GP}

L_{GP} = \mathbb{E}_{(t,s) \sim p_{GP}}[(\|\nabla_t D(t, s)\| - 1)^2 + (\|\nabla_s D(t, s)\| - 1)^2]

This is the same loss function as TSGAN. Therefore, the Critic network evaluates whether the generated voxels are properly created from the text. The cost for voxels mismatched with the text is included in the loss function to check how accurately the voxels match the text. However, many parameter updates are needed in training the network, since the Generator has the two roles of heightening the resolution and making the generated voxels closely match the text. Therefore, a long training time is expected for this model.

In this section, we evaluate the proposed high resolution tasks v0 and v1. We compare the shapes generated by the models trained for 18000 epochs.
In this evaluation, we use two object classes, table and chair, for generating 3D shapes. For this experiment, we used the same dataset as in TSGAN [2]. We created train/validation/test data by randomly splitting the dataset at a ratio of 80% / / .

• Accuracy of classification (Class acc.): The first index is the accuracy rate of classification, which evaluates how correctly the generated shapes are classified. We created another classifier network model to classify the two target objects represented by the voxels generated from the text descriptions. All the generated 3D voxels are input to the classifier and its accuracy rate is evaluated. The model of the classifier is based on the prior work [7]. We expect that this accuracy metric can reflect how realistic the generated voxels look.

• Mean squared error (mse): The second index is the mean squared error of the text embedding vector. We created an encoder that generates a 128 dimensional vector from a resulting voxel. The dimension of the vector is the same as that of the text embedding vector. We train this encoder to minimize the mean squared error between the output vectors and the original text embedding vectors. We input the resulting voxels into the encoder network and then calculate the mse between the result and the corresponding latent vectors. We expect that this can evaluate how well each voxel matches the text representation for it.
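The mse metric can be sketched as follows. The encoder output here is simulated with a noisily perturbed copy of the embedding, which is purely an illustrative assumption standing in for the trained encoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

def embedding_mse(predicted, target):
    # mean squared error between a 128-d encoder output and the text embedding
    return float(np.mean((predicted - target) ** 2))

text_emb = rng.normal(size=128)                        # original text embedding
enc_out = text_emb + rng.normal(scale=0.1, size=128)   # simulated encoder output

mse = embedding_mse(enc_out, text_emb)
print(round(mse, 4))
```

A lower mse indicates the generated voxel encodes back to a vector closer to its source text embedding, i.e. the shape is more faithful to the text.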
Fig. 7 and Fig. 8 show examples of the generated results with the v0 and v1 models.

Figure 7: The example of generation in high resolution task v0 (18000 epochs)

Figure 8: The example of generation in high resolution task v1 (18000 epochs)

Table 1 shows the evaluation results of the two indices for the v0 and v1 models. The higher the classification accuracy rate, the better the quality of the generated voxels. As for the mse, the lower the better. From the table, we see that the classification accuracy rate of v1 is larger than that of v0, and the mse of v1 is smaller than that of v0.

Fig. 9 and Fig. 10 show the trends of the training losses of the Critic and the Generator networks for the two high resolution tasks, respectively. The overall trends of the loss for v0 and v1 in the Critic network are similar. However, the variance of the loss trend for v0 is larger than that of v1 in the Generator network.

Table 1: Evaluation of each model in two indices

Method    Class acc.    mse
DataSet   1.0           0.153
v0        0.97          0.156
v1
Figure 9: Critic's loss of high resolution task

Figure 10: Generator's loss of high resolution task
As stated above, we compare the results of the networks trained for 18000 epochs. At this epoch, both the v0 and v1 models show almost no 3D shape collapse. According to Table 1, the v1 model is more faithful to the text than the v0 model. As for the accuracy, both the v0 and v1 models have high classification accuracy, but the v1 model achieves a slightly higher accuracy rate than the v0 model. This is because the Critic in the v1 model can calculate the distance between the probability distributions of the training data and the generated voxels more accurately by referring to the text description. Thus, the Generator could generate more realistic voxels. The mean squared error for v1 is smaller than that of v0, meaning that v1 achieves better encoding results. This indicates that the v1 model can generate voxels more faithfully to the text. Since the cost for voxels mismatched with the text is used, the Critic can calculate the Wasserstein distance more properly.

As can be seen from Fig. 7 and Fig. 8 (and also the figures in the Appendix), the shape resolution is appropriately increased in both the v0 and v1 models. However, a small change in the color distribution can be seen compared with the low resolution cases. We consider this is because the constraint of "how realistic and faithful the generated voxel is" is applied in the loss function, but we do not set any constraint regarding "how faithful a high resolution shape is to the corresponding low resolution shape". As in the previous study [11], introducing a constraint on the mean and variance of the color distribution into the loss function may suppress the change in color.
In this paper, we extended the prior work [2] and proposed new GAN models that can generate high resolution voxels. In the proposed model, we also improved the previous method so as to generate even low resolution voxels more faithfully to a given text.

The contributions of this research are three-fold. First, we proposed models which generate high resolution voxels faithful to a given text from low resolution voxels. The evaluation results show that it is possible to generate high resolution voxels with good visual quality. Second, we contrived multiple roles for the Critic network and configured multiple models. We showed that there is a difference in accuracy depending on whether the higher resolution task is separated from the consideration of the text latent vector. Third, we introduced multiple indices to compare the performance of the models. As described in the discussion, there is a possibility for our proposed model to generate voxels that are more faithful to a given text with higher quality shapes.
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214-223, International Convention Centre, Sydney, Australia, 06-11 Aug 2017. PMLR.

[2] Kevin Chen, Christopher B. Choy, Manolis Savva, Angel X. Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. arXiv preprint arXiv:1803.08495, 2018.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672-2680. Curran Associates, Inc., 2014.

[4] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, June 2016.

[6] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph., 21(4):163-169, August 1987.

[7] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 922-928, September 2015.

[8] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[9] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009.

[10] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316-1324. IEEE Computer Society, 2018.

[11] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, page 1, 2017.

[12] Han Zhang, Tao Xu, and Hongsheng Li. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pages 5908-5916. IEEE Computer Society, 2017.