Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong†  Dingdong Yang†  Jongwook Choi†  Honglak Lee‡,†
†University of Michigan  ‡Google Brain
†{hongseu,didoyang,jwook,honglak}@umich.edu  ‡[email protected]

Abstract
We propose a novel hierarchical approach for text-to-image synthesis by inferring semantic layout. Instead of learning a direct mapping from text to image, our algorithm decomposes the generation process into multiple steps, in which it first constructs a semantic layout from the text by the layout generator and converts the layout to an image by the image generator. The proposed layout generator progressively constructs a semantic layout in a coarse-to-fine manner by generating object bounding boxes and refining each box by estimating object shapes inside the box. The image generator synthesizes an image conditioned on the inferred semantic layout, which provides a useful semantic structure of an image matching the text description. Our model not only generates semantically more meaningful images, but also allows automatic annotation of generated images and a user-controlled generation process by modifying the generated scene layout. We demonstrate the capability of the proposed model on the challenging MS-COCO dataset and show that the model can substantially improve the image quality, interpretability of output and semantic alignment to input text over existing approaches.
1. Introduction
Generating images from text descriptions has been an active research topic in computer vision. By allowing users to describe visual concepts in natural language, it provides a natural and flexible interface for conditioning image generation. Recently, approaches based on conditional Generative Adversarial Networks (GANs) have shown promising results on the text-to-image synthesis task [22, 36, 24]. By conditioning both generator and discriminator on text, these approaches are able to generate realistic images that are both diverse and relevant to the input text. Based on the conditional GAN framework, recent approaches further improve the prediction quality by generating high-resolution images [36] or augmenting text information [6, 4]. However, the success of existing approaches has been mainly limited to simple datasets such as birds [35] and flowers [19], while generation of complicated, real-world images such as MS-COCO [14] remains an open challenge.
Input Text : People riding on elephants that are walking through a river.
Figure 1. Overall framework of the proposed algorithm. Given a text description, our algorithm sequentially constructs a semantic structure of a scene and generates an image conditioned on the inferred layout and text. Best viewed in color.

As illustrated in Figure 1, generating an image from a general sentence "people riding on elephants that are walking through a river" requires multiple reasonings on various visual concepts, such as object category (people and elephants), spatial configurations of objects (riding), scene context (walking through a river), etc., which is much more complicated than generating a single, large object as in simpler datasets [35, 19]. Existing approaches have not been successful in generating reasonable images for such complex text descriptions, because of the complexity of learning a direct text-to-pixel mapping from general images.

Instead of learning a direct mapping from text to image, we propose an alternative approach that constructs a semantic layout as an intermediate representation between text and image. A semantic layout defines the structure of a scene based on object instances and provides fine-grained information about the scene, such as the number of objects, object category, location, size, shape, etc. (Figure 1). By introducing a mechanism that explicitly aligns the semantic structure of an image to the text, the proposed method can generate complicated images that match complex text descriptions. In addition, conditioning the image generation on semantic structure allows our model to generate semantically more meaningful images that are easy to recognize and interpret.

Our model for hierarchical text-to-image synthesis consists of two parts: the layout generator that constructs a semantic label map from a text description, and the image generator that converts the estimated layout to an image using the text. Since learning a direct mapping from text to fine-grained semantic layout is still challenging, we further decompose the task into two manageable subtasks: we first estimate the bounding box layout of an image using the box generator, and then refine the shape of each object inside the box by the shape generator. The generated layout is then used to guide the image generator for pixel-level synthesis. The box generator, shape generator and image generator are implemented by independent neural networks, and trained in parallel with corresponding supervisions.

Generating semantic layout not only improves the quality of text-to-image synthesis, but also provides a number of potential benefits. First, the semantic layout provides instance-wise annotations on generated images, which can be directly exploited for automated scene parsing and object retrieval. Second, it offers an interactive interface for controlling the image generation process; users can modify the semantic layout to generate a desired image by removing/adding objects, changing the size and location of objects, etc.

The contributions of this paper are as follows:

• We propose a novel approach for synthesizing images from complicated text descriptions. Our model explicitly constructs a semantic layout from the text description, and guides image generation using the inferred semantic layout.

• By conditioning image generation on explicit layout prediction, our method is able to generate images that are semantically meaningful and well-aligned with input descriptions.
• We conduct extensive quantitative and qualitative evaluations on the challenging MS-COCO dataset, and demonstrate substantial improvement in generation quality over existing works.

The rest of the paper is organized as follows. We briefly review related work in Section 2, and provide an overview of the proposed approach in Section 3. Our models for layout and image generation are introduced in Sections 4 and 5, respectively. We discuss the experimental results on the MS-COCO dataset in Section 6.
2. Related Work
Generating images from text descriptions has recently drawn a lot of attention from the research community. Formulating the task as a conditional image generation problem, various approaches have been proposed based on Variational Auto-Encoders (VAE) [16], auto-regressive models [23], optimization techniques [18], etc. Recently, approaches based on conditional Generative Adversarial Networks (GAN) [7] have shown promising results in text-to-image synthesis [22, 24, 36, 6, 4]. Reed et al. [22] proposed to learn both generator and discriminator conditioned on a text embedding. Zhang et al. [36] improved the image quality by increasing image resolution with a two-stage GAN. Other approaches include improving conditional generation by augmenting text data with synthesized captions [6], or adding conditions on class labels [4]. Although these approaches have demonstrated impressive generation results on datasets of specific categories (e.g., birds [35] and flowers [19]), the perceptual quality of generation tends to substantially degrade on datasets with complicated images (e.g., MS-COCO [14]). We investigate a way to improve text-to-image synthesis on general images by conditioning generation on an inferred semantic layout.

The problem of generating images from pixel-wise semantic labels has been explored recently [3, 10, 12, 23]. In these approaches, the task of image generation is formulated as translating semantic labels to pixels. Isola et al. [10] proposed a pixel-to-pixel translation network that converts dense pixel-wise labels to an image, and Chen et al. [3] proposed a cascaded refinement network that generates high-resolution output from dense semantic labels. Karacan et al. [12] employed both dense layouts and attribute vectors for image generation using conditional GAN. Notably, Reed et al. [23] utilized sparse label maps like our method. Unlike previous approaches that require ground-truth layouts for generation, our method infers the semantic layout, and thus is more generally applicable to various generation tasks. Note that our main contribution is complementary to these approaches, and we can integrate existing segmentation-to-pixel generation methods to generate an image conditioned on a layout inferred by our method.

The idea of inferring scene structure for image generation is not new, as it has been explored by some recent works in several domains. For example, Wang et al. [34] proposed to infer a surface normal map as an intermediate structure to generate indoor scene images, and Villegas et al. [31] predicted human joints for future frame prediction. The most relevant work to our method is Reed et al. [24], which predicted local key-points of birds or humans for text-to-image synthesis. Contrary to these previous approaches that predict specific types of structure for image generation, our proposed method aims to predict semantic label maps, which are a general representation of natural images.
3. Overview
Figure 2. Overall pipeline of the proposed algorithm. Given a text embedding, our algorithm first generates a coarse layout of the image by placing a set of object bounding boxes using the box generator (Section 4.1), and further refines the object shape inside each box using the shape generator (Section 4.2). Combining outputs from the box and the shape generators leads to a semantic label map defining the semantic structure of the scene. Conditioned on the inferred semantic layout and the text, a pixel-wise image is finally generated by the image generator (Section 5).

The overall pipeline of the proposed framework is illustrated in Figure 2. Given a text description, our model progressively constructs a scene by refining the semantic structure of an image using the following sequence of generators:
• Box generator takes a text embedding $\mathbf{s}$ as input, and generates a coarse layout by composing object instances in an image. The output of the box generator is a set of bounding boxes $B_{1:T} = \{B_1, \dots, B_T\}$, where each bounding box $B_t$ defines the location, size and category label of the $t$-th object (Section 4.1).

• Shape generator takes the set of bounding boxes generated by the box generator, and predicts the shapes of the objects inside the boxes. The output of the shape generator is a set of binary masks $M_{1:T} = \{M_1, \dots, M_T\}$, where each mask $M_t$ defines the foreground shape of the $t$-th object (Section 4.2).

• Image generator takes the semantic label map $\mathbf{M}$ obtained by aggregating the instance-wise masks, together with the text embedding, as inputs, and generates an image by translating the semantic layout to pixels matching the text description (Section 5).

By conditioning the image generation process on explicitly inferred semantic layouts, our method is able to generate images that preserve detailed object shapes and are therefore easier to recognize in their semantic contents. In our experiments, we show that the images generated by our method are semantically more meaningful and better aligned with the input text, compared to those generated by previous approaches [22, 36] (Section 6).
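To make the three-stage decomposition concrete, below is a minimal inference-time sketch in PyTorch-style Python; the module interfaces (box_gen.sample, shape_gen, img_gen), noise dimensions and tensor shapes are our own assumptions rather than details taken from the paper.

```python
import torch

def generate_image(text_embedding, box_gen, shape_gen, img_gen,
                   H=64, W=64, num_classes=80):
    """Hypothetical end-to-end inference pipeline; module interfaces and shapes are assumptions."""
    # 1) Box generator: sample a variable-length set of labeled boxes B_1..B_T (Section 4.1).
    boxes, labels = box_gen.sample(text_embedding)            # (T, 4) coordinates, (T,) class ids

    # 2) Shape generator: predict a binary mask M_t inside each box (Section 4.2).
    z_mask = torch.randn(boxes.size(0), 128)                  # per-instance noise z_t
    masks = shape_gen(boxes, labels, z_mask)                  # (T, H, W), values in [0, 1]

    # 3) Aggregate instance masks into a semantic label map M in {0,1}^{L x H x W}.
    layout = torch.zeros(num_classes, H, W)
    for mask, label in zip(masks, labels):
        layout[label] = torch.maximum(layout[label], (mask > 0.5).float())

    # 4) Image generator: translate (layout, text, noise) into pixels (Section 5).
    z_img = torch.randn(1, 128)
    image = img_gen(layout.unsqueeze(0), text_embedding, z_img)
    return image, layout
```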
4. Inferring Semantic Layout from Text
4.1. Box Generator

Given an input text embedding $\mathbf{s}$, we first generate a coarse layout of the image in the form of object bounding boxes. We associate each bounding box $B_t$ with a class label to define which class of object to place and where, which plays a critical role in determining the global layout of the scene. Specifically, we denote the labeled bounding box of the $t$-th object as $B_t = (\mathbf{b}_t, \mathbf{l}_t)$, where $\mathbf{b}_t = [b_{t,x}, b_{t,y}, b_{t,w}, b_{t,h}] \in \mathbb{R}^4$ represents the location and size of the bounding box, and $\mathbf{l}_t \in \{0,1\}^{L+1}$ is a one-hot class label over $L$ categories. We reserve the $(L+1)$-th class as a special indicator for the end-of-sequence.

The box generator $G_{\text{box}}$ defines a stochastic mapping from the input text $\mathbf{s}$ to a set of $T$ object bounding boxes $B_{1:T} = \{B_1, \dots, B_T\}$:

$$\widehat{B}_{1:T} \sim G_{\text{box}}(\mathbf{s}). \qquad (1)$$
Model. We employ an auto-regressive decoder for the box generator, decomposing the conditional joint bounding box probability as $p(B_{1:T} \mid \mathbf{s}) = \prod_{t=1}^{T} p(B_t \mid B_{1:t-1}, \mathbf{s})$, where the conditionals are approximated by an LSTM [9]. In the generative process, we first sample a class label $\mathbf{l}_t$ for the $t$-th object and then generate the box coordinates $\mathbf{b}_t$ conditioned on $\mathbf{l}_t$, i.e., $p(B_t \mid \cdot) = p(\mathbf{b}_t, \mathbf{l}_t \mid \cdot) = p(\mathbf{l}_t \mid \cdot)\, p(\mathbf{b}_t \mid \mathbf{l}_t, \cdot)$. The two conditionals are modeled by a categorical distribution and a Gaussian Mixture Model (GMM) [8], respectively:

$$p(\mathbf{l}_t \mid B_{1:t-1}, \mathbf{s}) = \mathrm{Softmax}(\mathbf{e}_t), \qquad (2)$$

$$p(\mathbf{b}_t \mid \mathbf{l}_t, B_{1:t-1}, \mathbf{s}) = \sum_{k=1}^{K} \pi_{t,k}\, \mathcal{N}\big(\mathbf{b}_t;\, \boldsymbol{\mu}_{t,k}, \boldsymbol{\Sigma}_{t,k}\big), \qquad (3)$$

where $K$ is the number of mixture components. The softmax logit $\mathbf{e}_t$ in Eq. (2) and the parameters of the Gaussian mixture, $\pi_{t,k} \in \mathbb{R}$, $\boldsymbol{\mu}_{t,k} \in \mathbb{R}^4$ and $\boldsymbol{\Sigma}_{t,k} \in \mathbb{R}^{4\times 4}$ in Eq. (3), are computed from the LSTM output at each step $t$. Please see Section A.1 in the appendix for details.
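The decoding step above can be sketched as follows; this is a simplified illustration assuming a PyTorch LSTMCell and diagonal covariances for the mixture components (the paper uses full 4x4 covariances, and the appendix further decomposes them into two bivariate mixtures), with all layer names and sizes being assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoxDecoderStep(nn.Module):
    """One auto-regressive step of a box decoder: p(l_t | .) and p(b_t | l_t, .), cf. Eq. (2)-(3)."""

    def __init__(self, num_classes=81, hidden=256, K=20):
        super().__init__()
        self.K = K
        # Input is the previous box B_{t-1}: one-hot label concatenated with 4 coordinates.
        self.lstm = nn.LSTMCell(num_classes + 4, hidden)
        self.to_label = nn.Linear(hidden, num_classes)                 # softmax logits e_t, Eq. (2)
        # Mixture weights, means and (diagonal) log-stddevs for the 4-d box coordinates.
        self.to_gmm = nn.Linear(hidden + num_classes, K * (1 + 4 + 4))

    def forward(self, prev_box, state):
        h, c = self.lstm(prev_box, state)
        label_logits = self.to_label(h)
        label_idx = torch.distributions.Categorical(logits=label_logits).sample()
        label = F.one_hot(label_idx, label_logits.size(-1)).float()    # sampled one-hot l_t
        gmm = self.to_gmm(torch.cat([h, label], dim=-1))
        pi_logits, mu, log_sigma = gmm.split([self.K, 4 * self.K, 4 * self.K], dim=-1)
        # Ancestral sampling of b_t: pick a mixture component, then sample its Gaussian (Eq. (3)).
        k = torch.distributions.Categorical(logits=pi_logits).sample()
        idx = torch.arange(len(k))
        mu_k = mu.view(-1, self.K, 4)[idx, k]
        sigma_k = log_sigma.view(-1, self.K, 4)[idx, k].exp()
        box = mu_k + sigma_k * torch.randn_like(mu_k)
        return box, label, (h, c)
```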
Training. We train the box generator by minimizing the negative log-likelihood of the ground-truth bounding boxes:

$$\mathcal{L}_{\text{box}} = -\frac{\lambda_l}{T} \sum_{t=1}^{T} \mathbf{l}_t^{*} \log p(\mathbf{l}_t) - \frac{\lambda_b}{T} \sum_{t=1}^{T} \log p(\mathbf{b}_t^{*}), \qquad (4)$$

where $T$ is the number of objects in an image, and $\lambda_l$, $\lambda_b$ are balancing hyper-parameters, which are set to 4 and 1 in our experiments, respectively. $\mathbf{b}_t^{*}$ and $\mathbf{l}_t^{*}$ are the ground-truth bounding box coordinates and label of the $t$-th object, respectively, which are ordered by their bounding box locations from left to right. Note that we drop the conditioning in Eq. (4) for notational brevity.

At test time, we generate bounding boxes via ancestral sampling of the class label and box coordinates by Eq. (2) and (3), respectively. We terminate the sampling when the sampled class label corresponds to the termination indicator $(L+1)$; thus the number of objects is determined adaptively based on the text.

4.2. Shape Generator

Given the set of bounding boxes obtained by the box generator, the shape generator predicts more detailed image structure in the form of object masks. Specifically, for each object bounding box $B_t$ obtained by Eq. (1), we generate a binary mask $M_t \in \mathbb{R}^{H\times W}$ that defines the shape of the object inside the box. To this end, we first convert the discrete bounding box outputs $\{B_t\}$ to a binary tensor $\mathbf{B}_t \in \{0,1\}^{H\times W\times L}$, whose element is 1 if and only if it is contained in the corresponding class-labeled box. Using the notation $M_{1:T} = \{M_1, \dots, M_T\}$, we define the shape generator $G_{\text{mask}}$ as

$$\widehat{M}_{1:T} = G_{\text{mask}}(\mathbf{B}_{1:T}, \mathbf{z}_{1:T}), \qquad (5)$$

where $\mathbf{z}_t \sim \mathcal{N}(0, I)$ is a random noise vector.

Generating an accurate object shape should meet two requirements: (i) each instance-wise mask $M_t$ should match the location and class information of $B_t$ and be recognizable as an individual instance (instance-wise constraints); (ii) each object shape must be aligned with its surrounding context (global constraints). To satisfy both, we design the shape generator as a recurrent neural network trained with two conditional adversarial losses, as described below.
Model. We build the shape generator $G_{\text{mask}}$ using a convolutional recurrent neural network [26], as illustrated in Figure 2. At each step $t$, the model takes $\mathbf{B}_t$ through an encoder CNN, and encodes the information of all object instances using a bi-directional convolutional LSTM (Bi-convLSTM). On top of the convLSTM output at the $t$-th step, we add noise $\mathbf{z}_t$ by spatial tiling and concatenation, and generate a mask $M_t$ by forwarding it through a decoder CNN.
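The binary box tensor $\mathbf{B}_t$ consumed by the encoder CNN can be rasterized from a labeled box as in the sketch below; the convention that box coordinates are normalized to [0, 1] with (x, y) as the top-left corner is our assumption.

```python
import torch

def box_to_tensor(box, label, H=64, W=64, num_classes=80):
    """Rasterize one labeled box B_t = (b_t, l_t) into a {0,1}^{H x W x L} tensor.

    box:   (x, y, w, h) normalized to [0, 1], with (x, y) the top-left corner (assumed convention).
    label: integer class index l_t in [0, num_classes).
    """
    x, y, w, h = box
    x0, y0 = int(x * W), int(y * H)
    x1, y1 = min(W, int((x + w) * W)), min(H, int((y + h) * H))
    tensor = torch.zeros(H, W, num_classes)
    tensor[y0:y1, x0:x1, label] = 1.0   # 1 inside the class-labeled box, 0 elsewhere
    return tensor
```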
Training. Training of the shape generator is based on the GAN framework [7], in which the generator and discriminators are alternately trained. To enforce both the global and the instance-wise constraints discussed earlier, we employ two conditional adversarial losses [17] with an instance-wise discriminator $D_{\text{inst}}$ and a global discriminator $D_{\text{global}}$.

First, we encourage each object mask to be compatible with the class and location information encoded by its object bounding box. We train the instance-wise discriminator $D_{\text{inst}}$ by optimizing the following instance-wise adversarial loss:

$$\mathcal{L}_{\text{inst}}^{(t)} = \mathbb{E}_{(\mathbf{B}_t, M_t)}\big[\log D_{\text{inst}}(\mathbf{B}_t, M_t)\big] + \mathbb{E}_{\mathbf{B}_t, \mathbf{z}_t}\big[\log\big(1 - D_{\text{inst}}(\mathbf{B}_t, G_{\text{mask}}^{(t)}(\mathbf{B}_{1:T}, \mathbf{z}_{1:T}))\big)\big], \qquad (6)$$

where $G_{\text{mask}}^{(t)}(\mathbf{B}_{1:T}, \mathbf{z}_{1:T})$ denotes the $t$-th output of the mask generator. The instance-wise loss is applied to each of the $T$ instance-wise masks and aggregated over all instances as $\mathcal{L}_{\text{inst}} = (1/T)\sum_t \mathcal{L}_{\text{inst}}^{(t)}$.

On the other hand, the global loss encourages all the instance-wise masks to form a globally coherent context. To consider relations between different objects, we aggregate them into a global mask $G_{\text{global}}(\mathbf{B}_{1:T}, \mathbf{z}_{1:T}) = \sum_t G_{\text{mask}}^{(t)}(\mathbf{B}_t, \mathbf{z}_t)$ (computed by addition to model overlap between objects), and compute a global adversarial loss analogous to Eq. (6) as

$$\mathcal{L}_{\text{global}} = \mathbb{E}_{(\mathbf{B}_{1:T}, M_{1:T})}\big[\log D_{\text{global}}(\mathbf{B}_{\text{global}}, M_{\text{global}})\big] + \mathbb{E}_{\mathbf{B}_{1:T}, \mathbf{z}_{1:T}}\big[\log\big(1 - D_{\text{global}}(\mathbf{B}_{\text{global}}, G_{\text{global}}(\mathbf{B}_{1:T}, \mathbf{z}_{1:T}))\big)\big], \qquad (7)$$

where $M_{\text{global}} \in \mathbb{R}^{H\times W}$ is an aggregated mask obtained by element-wise addition over $M_{1:T}$, and $\mathbf{B}_{\text{global}} \in \mathbb{R}^{H\times W\times L}$ is an aggregated bounding box tensor obtained by element-wise maximum over $\mathbf{B}_{1:T}$.

Finally, we additionally impose a reconstruction loss $\mathcal{L}_{\text{rec}}$ that encourages the predicted instance masks to be similar to the ground-truths. We implement this idea using a perceptual loss [11, 3, 33, 2], which measures the distance between real and fake images in the feature space of a pre-trained CNN:

$$\mathcal{L}_{\text{rec}} = \sum_l \big\|\Phi_l(G_{\text{global}}) - \Phi_l(M_{\text{global}})\big\|, \qquad (8)$$

where $\Phi_l$ is the feature extracted from the $l$-th layer of a CNN. We use the VGG-19 network [27] pre-trained on ImageNet [5] in our experiments. Since our input to the pre-trained network is a binary mask, we replicate the mask along the channel dimension and use the converted mask to compute Eq. (8). We found that using the perceptual loss significantly improves the stability of GAN training and the quality of object shapes, as discussed in [3, 33, 2].

Combining Eq. (6), (7) and (8), the overall training objective for the shape generator becomes

$$\mathcal{L}_{\text{shape}} = \lambda_i \mathcal{L}_{\text{inst}} + \lambda_g \mathcal{L}_{\text{global}} + \lambda_r \mathcal{L}_{\text{rec}}, \qquad (9)$$

where $\lambda_i$, $\lambda_g$ and $\lambda_r$ are hyper-parameters that balance the different losses, set to 1, 1 and 10 in our experiments, respectively. We provide more details of the training and network architecture in the appendix (Section A.2).
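A minimal sketch of the perceptual reconstruction loss of Eq. (8) with a pre-trained VGG-19 from torchvision is given below; the specific feature layers and the use of a mean absolute difference in feature space are our assumptions, and input normalization is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

class MaskPerceptualLoss(nn.Module):
    """Perceptual loss between predicted and ground-truth mask layouts (sketch of Eq. (8))."""

    def __init__(self, layer_ids=(3, 8, 17, 26)):     # assumed VGG-19 feature layers
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def features(self, mask):
        # Replicate the single-channel mask (N, 1, H, W) to 3 channels to fit the VGG input.
        x = mask.repeat(1, 3, 1, 1)
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, pred_mask, gt_mask):
        loss = 0.0
        for f_pred, f_gt in zip(self.features(pred_mask), self.features(gt_mask)):
            loss = loss + (f_pred - f_gt).abs().mean()   # distance in feature space (assumed L1)
        return loss
```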
Figure 3. Architecture of the image generator. Conditioned on the text description and the semantic layout generated by the layout generator, it generates an image that matches both inputs.
5. Synthesizing Images from Text and Layout
The outputs of the layout generator define the location, size, shape and class information of objects, which provide the semantic structure of a scene relevant to the text. Given the semantic structure and the text, the objective of the image generator is to generate an image that conforms to both conditions. To this end, we first aggregate the binary object masks $M_{1:T}$ into a semantic label map $\mathbf{M} \in \{0,1\}^{H\times W\times L}$, such that $\mathbf{M}_{ijk} = 1$ if and only if there exists an object of class $k$ whose mask $M_t$ covers the pixel $(i, j)$. Then, given the semantic layout $\mathbf{M}$ and the text $\mathbf{s}$, the image generator is defined by

$$\widehat{X} = G_{\text{img}}(\mathbf{M}, \mathbf{s}, \mathbf{z}), \qquad (10)$$

where $\mathbf{z} \sim \mathcal{N}(0, I)$ is a random noise. In the following, we describe the network architecture and training procedure of the image generator.
Model. Figure 3 illustrates the overall architecture of the image generator. Our generator network is based on a convolutional encoder-decoder network [10] with several modifications. It first encodes the semantic layout $\mathbf{M}$ through several down-sampling layers to construct a layout feature $\mathbf{A} \in \mathbb{R}^{h\times w\times d}$. We consider that the layout feature encodes various context information of the input layout along the channel dimension. To adaptively select a context relevant to the text, we apply attention to the layout feature. Specifically, we compute a $d$-dimensional vector from the text embedding, and spatially replicate it to construct $\mathbf{S} \in \mathbb{R}^{h\times w\times d}$. Then we apply gating on the layout feature by $\mathbf{A}^{g} = \mathbf{A} \odot \sigma(\mathbf{S})$, where $\sigma$ is the sigmoid nonlinearity and $\odot$ denotes element-wise multiplication. To further encode text information for the background, we compute another text embedding with separate fully-connected layers and spatially replicate it to size $h \times w$. The gated layout feature $\mathbf{A}^{g}$, the text embedding and noise are then combined by concatenation along the channel dimension, and subsequently fed into several residual blocks and a decoder to be mapped to an image. We employ a cascaded network [3] for the decoder, which takes the semantic layout $\mathbf{M}$ as an additional input to every upsampling layer. We found that the cascaded network enhances conditioning on the layout structure and produces better object boundaries.

For the discriminator network $D_{\text{img}}$, we first concatenate the generated image $X$ and the semantic layout $\mathbf{M}$. It is fed through a series of down-sampling blocks, resulting in a feature map of size $h' \times w'$. We concatenate it with a spatially tiled text embedding, from which we compute the decision score of the discriminator.
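The text-driven gating $\mathbf{A}^{g} = \mathbf{A} \odot \sigma(\mathbf{S})$ can be sketched as below; the dimensions and layer names are assumptions.

```python
import torch
import torch.nn as nn

class TextGatedLayout(nn.Module):
    """Gate the encoded layout feature A with a spatially replicated text vector: A_g = A * sigmoid(S)."""

    def __init__(self, text_dim=128, feat_dim=512):
        super().__init__()
        self.to_gate = nn.Linear(text_dim, feat_dim)    # d-dimensional vector from the text embedding

    def forward(self, layout_feat, text_emb):
        # layout_feat: (N, d, h, w) encoded semantic layout A
        # text_emb:    (N, text_dim) sentence embedding s
        s = self.to_gate(text_emb)                       # (N, d)
        s = s[:, :, None, None].expand_as(layout_feat)   # spatial replication to (N, d, h, w)
        return layout_feat * torch.sigmoid(s)            # element-wise gating A_g
```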
Training. Conditioned on both the semantic layout $\mathbf{M}$ and the text embedding $\mathbf{s}$, the image generator $G_{\text{img}}$ is jointly trained with the discriminator $D_{\text{img}}$. We define the objective function by $\mathcal{L}_{\text{img}} = \lambda_a \mathcal{L}_{\text{adv}} + \lambda_r \mathcal{L}_{\text{rec}}$, where

$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{(\mathbf{M}, \mathbf{s}, X)}\big[\log D_{\text{img}}(\mathbf{M}, \mathbf{s}, X)\big] + \mathbb{E}_{(\mathbf{M}, \mathbf{s}), \mathbf{z}}\big[\log\big(1 - D_{\text{img}}(\mathbf{M}, \mathbf{s}, G_{\text{img}}(\mathbf{M}, \mathbf{s}, \mathbf{z}))\big)\big], \qquad (11)$$

$$\mathcal{L}_{\text{rec}} = \sum_l \big\|\Phi_l(G_{\text{img}}(\mathbf{M}, \mathbf{s}, \mathbf{z})) - \Phi_l(X)\big\|, \qquad (12)$$

where $X$ is a ground-truth image associated with the semantic layout $\mathbf{M}$. As in the mask generator, we apply the same perceptual loss $\mathcal{L}_{\text{rec}}$, which is found to be effective. We set the hyper-parameters $\lambda_a = 1$ and $\lambda_r = 10$ in our experiments. More details on the network architecture and training procedure are provided in the appendix (Section A.3).
6. Experiments
Dataset.
We use the MS-COCO dataset [14] to evaluate our model. It contains 164,000 training images over 80 semantic classes, where each image is associated with instance-wise annotations (i.e., object bounding boxes and segmentation masks) and 5 text descriptions. The dataset has complex scenes with many objects in diverse contexts, which makes generation very challenging. We use the official train and validation splits from MS-COCO 2014 for training and evaluating our model, respectively.
Evaluation metrics.
We evaluate text-conditional image generation performance using various metrics: Inception score, caption generation, and human evaluation.

Table 1. Quantitative evaluation based on caption generation and the Inception score [25]. The Box and Mask columns indicate whether ground-truth (GT) or predicted (Pred.) layouts are used.

Method | Box | Mask | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr | Inception
Reed et al. [22] | - | - | 0.470 | 0.253 | 0.136 | 0.077 | 0.122 | 0.160 | 7.88
Ours (control experiment) | GT | Pred. | 0.556 | 0.353 | 0.219 | 0.139 | 0.162 | 0.400 | 11.94
Figure 4. Qualitative examples of generated images conditioned on text descriptions from the MS-COCO validation set, using our method and baselines (StackGAN [36] and Reed et al. [22]). The input text and ground-truth image are shown in the first row. For each method, we provide a reconstructed caption conditioned on the generated image.

Table 2. Human evaluation results.

Method | ratio of ranking 1st | vs. Ours
StackGAN [36] | 18.4% | 29.5%
Reed et al. [22] | 23.3% | 32.3%
Ours | | -
Inception score —
We compute the Inception score [25] by applying a pre-trained classifier to the synthesized images and investigating the statistics of their score distributions. It measures the recognizability and diversity of generated images, and is known to be correlated with human perception of visual quality [20]. We use the Inception-v3 network [28] pre-trained on ImageNet [5] for evaluation, and measure the score over all validation images.
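Concretely, the Inception score is $\exp(\mathbb{E}_x[\mathrm{KL}(p(y\mid x)\,\|\,p(y))])$ computed from the classifier's class posteriors; a minimal sketch over a batch of softmax outputs is shown below (image preprocessing and the usual split into multiple folds are omitted).

```python
import torch

def inception_score(probs, eps=1e-12):
    """Inception score from class posteriors p(y|x) of shape (N, num_classes).

    IS = exp( E_x KL( p(y|x) || p(y) ) ), where p(y) is the marginal over the batch.
    """
    marginal = probs.mean(dim=0, keepdim=True)                                        # p(y)
    kl = (probs * (torch.log(probs + eps) - torch.log(marginal + eps))).sum(dim=1)    # per-image KL
    return torch.exp(kl.mean()).item()
```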
Caption generation —
In addition to the Inception score, assessing the performance of text-conditional image generation necessitates measuring the relevance of the generated image to the input text. To this end, we generate sentences from the synthesized image and measure the similarity between the input text and the predicted sentence. The underlying intuition is that if the generated image is relevant to the input text and its contents are recognizable, one should be able to guess the original text from the synthesized image. We employ an image caption generator [32] trained on MS-COCO to generate sentences, where one sentence is generated per image by greedy decoding. We report three standard language similarity metrics: BLEU [21], METEOR [1] and CIDEr [30].
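A sketch of this caption-based evaluation for the BLEU metrics is given below; the captioning interface caption_model.generate is hypothetical, and the tokenization and smoothing choices are assumptions (METEOR and CIDEr would be computed with their respective toolkits).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def caption_similarity(input_text, generated_image, caption_model):
    """Score relevance of a generated image to its input text via caption generation.

    caption_model.generate is a hypothetical interface returning one greedy-decoded
    sentence per image; BLEU-1..4 are computed against the input description.
    """
    predicted = caption_model.generate(generated_image)
    reference = [input_text.lower().split()]
    hypothesis = predicted.lower().split()
    smooth = SmoothingFunction().method1
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    return [sentence_bleu(reference, hypothesis, weights=w, smoothing_function=smooth)
            for w in weights]
```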
Human evaluation —
Evaluation based on caption generation is beneficial for large-scale evaluation but may introduce unintended bias from the caption generator. To verify the effectiveness of caption-based evaluation, we conduct a human evaluation using Amazon Mechanical Turk. For each text randomly selected from the MS-COCO validation set, we presented 5 images generated by different methods, and asked users to rank the methods based on the relevance of the generated images to the text. We collected results for 1000 sentences, each of which is annotated by 5 users. We report results based on the ratio of each method being ranked as the best, and one-to-one comparisons between ours and the baselines.
We compare our method with two state-of-the-art approaches [22, 36] based on conditional GANs. Tables 1 and 2 summarize the quantitative evaluation results.
Comparisons to other methods.
We first present systematic evaluation results based on the Inception score and caption generation performance. The results are summarized in Table 1.
Figure 5. Image generation results of our method. Each column corresponds to generation results conditioned on (a) predicted box and mask layouts, (b) ground-truth boxes and predicted mask layouts, and (c) ground-truth box and mask layouts. Classes are color-coded for illustration purposes. See Figure 11 for more examples of generated layouts and images. Best viewed in color.

The proposed method substantially outperforms existing approaches on both evaluation metrics. In terms of Inception score, our method outperforms the existing approaches by a substantial margin, presumably because it generates more recognizable objects. The caption generation performance shows that captions generated from our synthesized images are more strongly correlated with the input text than those of the baselines. This indicates that images generated by our method are better aligned with the descriptions and that their semantic contents are easier to recognize.

Table 2 summarizes the comparison results based on human evaluation. When users are asked to rank images based on their relevance to the input text, they choose images generated by our method as the best substantially more often than either baseline. This is consistent with the caption generation results in Table 1, in which our method substantially outperforms the baselines while their performances are comparable to each other.

Figure 4 illustrates qualitative comparisons. Due to adversarial training, images generated by the other methods, especially StackGAN [36], tend to be sharp and exhibit high-frequency details. However, it is difficult to recognize contents from these images, since they often fail to predict important semantic structure of objects and scenes. As a result, the reconstructed captions from the generated images are usually not relevant to the input text. Compared to them, our method generates much more recognizable and semantically meaningful images by conditioning the generation on the inferred semantic layout, and is able to reconstruct descriptions that better align with the input sentences.
Input Text : A man is jumping and throwing a frisbee
Input Text : two skiers on a big snowy hill in the woods
Input Text : A man flying a kite at the beach while several people walk by
Figure 6. Multiple samples generated from a text description. See Figure 12 for more results.
Figure 7. Generation results by manipulating captions. The manipulated parts of the texts are highlighted in bold, where the type of manipulation is indicated by color (Blue: scene context; Magenta: spatial location; Red: the number of objects; Green: object category).
Ablative Analysis.
To understand the quality and impact of the predicted semantic layout, we conduct an ablation study by gradually replacing the bounding box and mask layouts predicted by the layout generator with the ground-truths. Table 1 summarizes the quantitative evaluation results. Replacing the predicted layouts with the ground-truths leads to gradual performance improvements, which indicates that there are prediction errors in both the bounding box and mask layouts.
Figure 5 shows qualitative results of our method. For each text, we present the generated images alongside the predicted semantic layouts. As in the previous section, we also present results conditioned on ground-truth layouts. Our method generates a reasonable semantic layout and image matching the input text: it generates bounding boxes corresponding to the fine-grained scene structure implied in the text (i.e., object categories and the number of objects), and object masks capturing class-specific visual attributes as well as relations to other objects. Given the inferred layouts, our image generator produces correct object appearances and backgrounds compatible with the text. Replacing the predicted layouts with the ground-truths makes the generated images have a context similar to the original images.
Diversity of samples.
To assess the diversity in generation, we sample multiple images while fixing the input text. Figure 6 illustrates example images generated by our method. Our method generates diverse semantic structures
given the same text description, while preserving semantic details such as the number of objects and object categories.

Figure 8. Examples of controllable image generation: (a) generation results obtained by adding new objects; (b) generation results obtained by changing the spatial configuration of objects. See Figures 13 and 14 for more results.
Text-conditional generation.
To see how our model incorporates the text description in the generation process, we generate images while modifying parts of the descriptions. Figure 7 illustrates example results. When we change the context of a description, such as the object class, the number of objects, the spatial composition of objects or the background patterns, our method correctly adapts the semantic structure and images based on the modified part of the text.
Controllable image generation.
We demonstrate controllable image generation by modifying the bounding box layout. Figure 8 illustrates example results. Our method updates object shapes and context based on the modified semantic layout (e.g., adding new objects, changing the spatial configuration of objects) and generates reasonable images. See Figures 13 and 14 for more examples of various types of layout modifications.
7. Conclusion
We proposed an approach for text-to-image synthesis which explicitly infers and exploits a semantic layout as an intermediate representation between text and image. Our model hierarchically constructs a semantic layout in a coarse-to-fine manner by a series of generators. By conditioning image generation on explicit layout prediction, our method generates complicated images that preserve semantic details and are highly relevant to the text description. We also showed that the predicted layout can be used to control the generation process. We believe that end-to-end training of layout and image generation would be interesting future work.

Acknowledgments
This work was supported in part by ONR N00014-13-1-0762, NSF CAREER IIS-1453651, and the DARPA Explainable AI (XAI) program.
References

[1] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL, 2005.
[2] M. Cha, Y. Gwon, and H. T. Kung. Adversarial nets with perceptual losses for text-to-image synthesis. arXiv preprint arXiv:1708.09321, 2017.
[3] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017.
[4] A. Dash, J. C. B. Gamboa, S. Ahmed, M. Z. Afzal, and M. Liwicki. TAC-GAN: Text conditioned auxiliary classifier generative adversarial network. arXiv preprint arXiv:1703.06412, 2017.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] H. Dong, J. Zhang, D. McIlwraith, and Y. Guo. I2T2I: Learning text to image synthesis with textual data augmentation. In ICIP, 2017.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NIPS, 2014.
[8] D. Ha and D. Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[11] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[12] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. CoRR, 2016.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In ECCV, 2014.
[15] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[16] E. Mansimov, E. Parisotto, and J. Ba. Generating images from captions with attention. In ICLR, 2016.
[17] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[18] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.
[19] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[20] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[22] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[23] S. Reed, A. Oord, N. Kalchbrenner, S. Gómez, Z. Wang, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017.
[24] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[26] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[29] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[30] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[31] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017.
[32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[33] C. Wang, C. Xu, C. Wang, and D. Tao. Perceptual adversarial networks for image-to-image transformation. arXiv preprint arXiv:1706.09138, 2017.
[34] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
[35] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[36] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.

Appendix

A. Implementation Details
A.1. Box Generator
This section describes the details of the box generator. Denoting the bounding box of the $t$-th object as $B_t = (b_t^x, b_t^y, b_t^w, b_t^h, \mathbf{l}_t)$, the joint probability of sampling $B_t$ from the box generator is given by

$$p(b_t^x, b_t^y, b_t^w, b_t^h, \mathbf{l}_t) = p(\mathbf{l}_t)\, p(b_t^x, b_t^y, b_t^w, b_t^h \mid \mathbf{l}_t). \qquad (13)$$

We drop the conditioning variables for notational brevity. As described in the main paper, we implement $p(\mathbf{l}_t)$ by a categorical distribution and $p(b_t^x, b_t^y, b_t^w, b_t^h \mid \mathbf{l}_t)$ by a mixture of four-dimensional Gaussians. However, modeling the full covariance matrix of a four-dimensional Gaussian is expensive as it involves many parameters. Therefore, we decompose the box coordinate probability as $p(b_t^x, b_t^y, b_t^w, b_t^h \mid \mathbf{l}_t) = p(b_t^x, b_t^y \mid \mathbf{l}_t)\, p(b_t^w, b_t^h \mid b_t^x, b_t^y, \mathbf{l}_t)$, and approximate it with two bivariate Gaussian mixtures:

$$p(b_t^x, b_t^y \mid \mathbf{l}_t) = \sum_{k=1}^{K} \pi_{t,k}^{xy}\, \mathcal{N}\big(b_t^x, b_t^y;\, \mu_{t,k}^{xy}, \Sigma_{t,k}^{xy}\big),$$

$$p(b_t^w, b_t^h \mid b_t^x, b_t^y, \mathbf{l}_t) = \sum_{k=1}^{K} \pi_{t,k}^{wh}\, \mathcal{N}\big(b_t^w, b_t^h;\, \mu_{t,k}^{wh}, \Sigma_{t,k}^{wh}\big).$$

The parameters for Eq. (13) are obtained from the LSTM outputs at each step by

$$[h_t, c_t] = \mathrm{LSTM}(B_{t-1}; [h_{t-1}, c_{t-1}]), \qquad (14)$$

$$\mathbf{l}_t = W^{l} h_t + b^{l}, \qquad (15)$$

$$\theta_t^{xy} = W^{xy} [h_t, \mathbf{l}_t] + b^{xy}, \qquad (16)$$

$$\theta_t^{wh} = W^{wh} [h_t, \mathbf{l}_t, b^x, b^y] + b^{wh}, \qquad (17)$$

where $\theta_t^{\cdot} = [\pi_{t,1:K}^{\cdot}, \mu_{t,1:K}^{\cdot}, \Sigma_{t,1:K}^{\cdot}]$ are the parameters of the corresponding GMM concatenated into a vector.

For training, we employ an Adam optimizer [13] with a learning rate of 0.001, and exponentially decay the learning rate with rate 0.5 at every epoch after the initial 10 epochs.
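The decomposed coordinate sampling can be sketched as follows, first drawing (x, y) from one bivariate mixture and then (w, h) from a second mixture conditioned on the sampled (x, y); the wh_head callable stands in for the network head of Eq. (17), and the interface is an assumption.

```python
import torch
from torch.distributions import Categorical, MultivariateNormal

def sample_box_coordinates(pi_xy, mu_xy, cov_xy, wh_head):
    """Sample (x, y) from a bivariate GMM, then (w, h) from a GMM conditioned on (x, y).

    pi_xy:  (K,) mixture logits, mu_xy: (K, 2), cov_xy: (K, 2, 2) for p(x, y | l_t).
    wh_head: callable mapping the sampled (x, y) to (pi_wh, mu_wh, cov_wh) of
             p(w, h | x, y, l_t); a stand-in for the corresponding network head.
    """
    k = Categorical(logits=pi_xy).sample()
    xy = MultivariateNormal(mu_xy[k], covariance_matrix=cov_xy[k]).sample()

    pi_wh, mu_wh, cov_wh = wh_head(xy)
    j = Categorical(logits=pi_wh).sample()
    wh = MultivariateNormal(mu_wh[j], covariance_matrix=cov_wh[j]).sample()
    return torch.cat([xy, wh])   # b_t = (x, y, w, h)
```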
A.2. Shape Generator

We provide the detailed architecture of the shape generator $G_{\text{mask}}$ and the two discriminators $D_{\text{inst}}$ and $D_{\text{global}}$ in Figure 9. At each step $t$, we encode the box tensor $\mathbf{B}_t$ by a series of downsampling layers, where each downsampling layer is implemented by a stride-2 convolution followed by instance normalization [29] and ReLU. The encoded feature is fed into the bidirectional convolutional LSTM (bi-convLSTM), and combined with features from all object instances. On top of the bi-convLSTM output at each step $t$, we add a noise $\mathbf{z}_t$ by spatial replication and depth concatenation, and apply a masking operation so that regions outside the object bounding box $B_t$ are all set to 0. The masked feature is fed into several residual blocks, and mapped to a binary mask $M_t$ by a series of upsampling layers. Similar to the downsampling layers, we implement an upsampling layer by a stride-2 deconvolution followed by instance normalization and ReLU, except for the last one, which is a convolution followed by the sigmoid nonlinearity.

The instance-wise discriminator $D_{\text{inst}}$ and the global discriminator $D_{\text{global}}$ share the same architecture but have separate parameters. The input to the instance-wise discriminator is constructed by concatenating the box tensor $\mathbf{B}_t$ and the corresponding binary mask $M_t$ along the channel dimension, while the one for the global discriminator is constructed by concatenating the aggregated box tensor $\mathbf{B}_{\text{global}}$ and the aggregated masks $M_{\text{global}}$. Both discriminators encode the input by a series of downsampling layers, which are implemented by stride-2 convolutions followed by instance normalization and Leaky-ReLU [15].

For training, we employ an Adam optimizer [13] with a learning rate of 0.0002, and linearly decrease the learning rate after the first 50 epochs of training.
A.3. Image Generator

The detailed architecture of the image generator is illustrated in Figure 10. The architectures of the downsampling and residual blocks are the same as the ones used in the shape generator. To encourage the model to generate images that match the input layout, we implement the upsampling layers based on a cascaded refinement network [3]. Each upsampling layer takes the output from the previous layer and the semantic layout resized to the same spatial size as inputs, and combines them by depth concatenation followed by convolution. The combined feature map is then spatially upscaled by bilinear upsampling followed by instance normalization and ReLU, and subsequently fed into the next upsampling layer.

To encourage the model to generate images that match the input text descriptions, we employ the matching-aware loss proposed in [22]. Denoting a ground-truth training example as $(\mathbf{M}, \mathbf{s}, X)$, where $\mathbf{M}$, $\mathbf{s}$ and $X$ denote the semantic layout, text embedding and image, respectively, we construct an additional mismatching triple $(\mathbf{M}, \tilde{\mathbf{s}}, X)$ by sampling a random text embedding $\tilde{\mathbf{s}}$ that is not relevant to the image. We consider it as an additional fake example in adversarial training, and extend the conditional adversarial loss for the image generator (Eq. (11) in the main paper) as

$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{(\mathbf{M}, \mathbf{s}, X)}\big[\log D_{\text{img}}(\mathbf{M}, \mathbf{s}, X)\big] + \mathbb{E}_{(\mathbf{M}, \tilde{\mathbf{s}}, X)}\big[\log\big(1 - D_{\text{img}}(\mathbf{M}, \tilde{\mathbf{s}}, X)\big)\big] + \mathbb{E}_{(\mathbf{M}, \mathbf{s}), \mathbf{z}}\big[\log\big(1 - D_{\text{img}}(\mathbf{M}, \mathbf{s}, G_{\text{img}}(\mathbf{M}, \mathbf{s}, \mathbf{z}))\big)\big]. \qquad (18)$$
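A sketch of the discriminator side of this matching-aware objective is given below; the interface D(layout, text, image) returning a probability in (0, 1) is an assumption, and in practice the mismatching text embedding can be obtained, for example, by shuffling the text embeddings within a minibatch (also an assumption).

```python
import torch

def matching_aware_d_loss(D, layout, text, mismatched_text, real_img, fake_img, eps=1e-8):
    """Discriminator loss following Eq. (18): real/matching, real/mismatching, fake/matching."""
    real_score = D(layout, text, real_img)                 # real image with matching text -> real
    mismatch_score = D(layout, mismatched_text, real_img)  # real image with wrong text -> fake
    fake_score = D(layout, text, fake_img)                 # generated image -> fake
    loss = -(torch.log(real_score + eps)
             + torch.log(1 - mismatch_score + eps)
             + torch.log(1 - fake_score + eps)).mean()
    return loss
```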
Figure 9. Architecture of the shape generator.
Figure 10. Architecture of the image generator.

We found that employing the matching-aware loss substantially improves text-conditional generation and stabilizes the overall GAN training.

For training, we employ an Adam optimizer [13] with a learning rate of 0.0002, and linearly decrease the learning rate after the first 30 epochs of training.

B. Additional Experiment Results
B.1. Ablative Analysis
To understand the impact of each component of the proposed framework, we conduct an ablation study by varying the configuration of the proposed model. Table 3 summarizes the results based on caption generation performance.
Impact of shape generator
We first investigate the impact of the shape generator. To this end, we remove the shape generator from our generation pipeline, and modify the image generator to generate images directly from the box generator outputs. Specifically, we feed the aggregated bounding box tensor $\mathbf{B}_{\text{global}}$ as an input to the image generator, which is constructed by taking the pixel-wise maximum over all box tensors as $\mathbf{B}_{\text{global}}(i,j,l) = \max_t \mathbf{B}_t(i,j,l)$. (Note that the aggregated box tensor $\mathbf{B}_{\text{global}}$ can be considered as a semantic layout $\mathbf{M}$ in which the shape of each object is a rectangular box.) The result is presented in the second row of Table 3. Removing the shape generator leads to substantial performance degradation, since predicting accurate object shapes and textures directly from bounding boxes is a complicated task; the image generator tends to miss detailed object shapes, such as body parts, which are critical for humans to recognize the image content. By explicitly inferring object shapes, the full model improves the overall image quality and interpretability of the content.

Impact of perceptual loss in shape generator
To see the effectiveness of the perceptual loss in the shape generator, we train the model after replacing the reconstruction loss in Eq. (8) with a pixel-wise loss. The result is presented in the third row of Table 3. Adding the perceptual loss to the shape generator improves the accuracy of object shapes and leads to more recognizable images and improved caption generation performance.

Impact of perceptual loss in image generator
Similar to the previous experiment, we replace the perceptual loss in the image generator (Eq. (12)) with a pixel-wise loss. The fourth row in Table 3 summarizes the results. Employing the perceptual loss in the image generator critically improves the performance by reducing visual differences between real and synthesized images.

Impact of attention in image generator
Our image generator combines features from the text embedding $\mathbf{s}$ and the semantic layout $\mathbf{M}$ by an attention mechanism. To see its impact
Image and layout generation.
We present the end-to-endimage generation results of our method in Figure 11, includ-ing object bounding boxes and masks obtained by the layoutgenerator. As illustrated in the figure, our model generatesobject bounding boxes that match content of the input text,and shapes capturing class-specific visual attributes and re-lation with other objects ( e.g . person riding a motorcycle,person swinging a bat, etc). Given the layout, the imagegenerator correctly predicts object textures and backgroundmatch the description.
Diversity of samples.
Figure 12 presents a set of samples generated by our method, which corresponds to Figure 6 in the main paper. Our method generates diverse samples by generating semantic layouts that are both diverse and highly related to the input text description.
Controllable image generation.
Semantic layout provides a natural and interactive interface for image editing. By modifying the bounding box layout of a scene, our model can generate object shapes and images compatible with the modified layout. Figure 13 illustrates the generated images obtained by adding new objects to an existing semantic layout. By placing new object bounding boxes in a scene, our model not only creates the corresponding object instances but also modifies the surrounding context adaptively to the change. For instance, adding cars and pedestrians in front of a tower leads the model to generate a street in the background (the 4th row in Figure 13). Similarly, one can modify the semantic layout by changing the size and spatial location of existing objects. Figure 14 illustrates the results. Modifying the spatial configuration of objects sometimes changes the relationship between objects and leads to images in a different context. For instance, changing the locations of a soccer ball and players leads to various images such as dribbling, shooting and competing for the ball (the first row in Figure 14).

Table 3. Ablation study of the proposed method. The first row corresponds to the performance of our model presented in the main paper.

Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr
Ours (Full) | | | | | |
w/o shape generator | 0.494 | 0.291 | 0.170 | 0.102 | 0.131 | 0.215
w/o perceptual loss in shape generator | 0.536 | 0.329 | 0.196 | 0.119 | 0.151 | 0.324
w/o perceptual loss in image generator | 0.491 | 0.272 | 0.146 | 0.080 | 0.126 | 0.170
w/o attention in image generator | 0.522 | 0.315 | 0.184 | 0.112 | 0.148 | 0.324
Figure 11. Illustrations of end-to-end prediction results of our method. Best viewed in color.
Figure 12. Examples of multiple samples generated from a text description.
Figure 13. Examples of controllable image generation by adding new objects.
Figure 14. Examples of controllable image generation by changing the spatial configuration of existing objects.