Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement
Tianren Wang
ITEE, The University of Queensland
Brisbane, Australia
[email protected]

Teng Zhang
ITEE, The University of Queensland
Brisbane, Australia

Brian C. Lovell
ITEE, The University of Queensland
Brisbane, Australia

Fig. 1: Several examples of images synthesised by our model. We select four groups of images which are arranged with respect to gender and age. The highlighted features in the textual descriptions are all rendered in the images. The images also exhibit diversity in terms of the other unspecified features.
Abstract—Text-to-Face (TTF) synthesis is a challenging task with great potential for diverse computer vision applications. Compared to Text-to-Image (TTI) synthesis tasks, the textual description of faces can be much more complicated and detailed due to the variety of facial attributes and the parsing of high-dimensional abstract natural language. In this paper, we propose a Text-to-Face model that not only produces images in high resolution (1024 × 1024) with text-to-image consistency, but also generates a group of diverse faces from the same description to cover the facial attributes left unspecified by the text.

Index Terms—text-to-face synthesis, multi-label, disentanglement, high-resolution, diversity
I. INTRODUCTION
With the advent of Generative Adversarial Networks (GANs) [1], image generation has made huge strides in terms of both image quality and diversity. However, the original GAN model [1] cannot generate images tailored to meet design specifications. To this end, many conditional GAN models have been proposed to fit different task scenarios [2]–[8]. Among these works, Text-to-Image (TTI) synthesis is a challenging yet less studied topic. TTI refers to generating a photo-realistic image which matches a given text description. As an inverse image captioning task, TTI aims to establish an interpretable mapping between the image space and the text semantic space. TTI has huge potential and can be used in many applications including photo editing and computer-aided design. However, natural language is high-dimensional information which is often less specific but also much more abstract than images. Therefore, this research problem is quite challenging.

Just like TTI synthesis, the sub-topic of Text-to-Face (TTF) synthesis also has practical value in areas such as crime investigation and biometric research. For example, the police often need professional artists to sketch pictures of suspects based on the descriptions of eyewitnesses. This task is time-consuming, requires great skill, and often results in inferior images. Many police departments may not have access to such professionals. However, with a well-trained Text-to-Face model, we could quickly produce a wide diversity of high-quality photo-realistic pictures based simply on the descriptions of eyewitnesses. Moreover, TTF can be used to address the emerging issue of data scarcity arising from the growing ethical concerns regarding informed consent for the use of faces in biometrics research.

A major challenge of the TTF task is that the linkage between face images and their text descriptions is much looser than for, say, the bird and flower images commonly used in TTI research. A few sentences of description are hardly adequate to cover all the variations of human facial features. Also, for the same face image, different people may use quite different descriptions. This increases the challenge of finding mappings between these descriptions and the facial features. Therefore, in addition to the aforementioned two criteria, a TTF model should have the ability to produce a group of images with high diversity conditioned on the same text description. In a real-world application, a witness could choose the picture among these output images which they think is closest to the appearance of the suspect. This feature is also important for biometric researchers seeking sufficient data from rare ethnicities and demographics when synthesising ethical face datasets that do not require informed consent.

To meet these demands, we propose a model built around a novel TTF framework satisfying: 1) high image quality; 2) improved consistency between synthesised images and their descriptions; and 3) the ability to generate a group of diverse faces from the same text description.

To achieve these goals, we propose a pre-trained BERT [9] multi-label model for natural language processing. This model outputs sparse text embeddings of length 40. We also fine-tune a pre-trained MobileNets [10] model using CelebA's [11] training data, where images have paired labels, and use it to predict the labels of input images. Next, we structure a feature space with 40 orthogonal axes based on the noise vectors and the predicted labels.
After this operation, the input noise vectors can be moved along a specified axis to render output images which have the desired features. Last but certainly not least, we use the state-of-the-art image generator, StyleGAN2 [12], which maps the noise vectors into a feature-disentangled latent space, to generate high-resolution images. As Fig. 1 shows, the synthesised images match the features of the description with good diversity and image quality.

Our work has the following main contributions.
• We propose a novel TTF-HD framework that comprises a text multi-label classifier, an image label encoder, and a feature-disentangled image generator to generate high-quality faces with a wide range of diversity.
• In addition, we add a novel design to the framework: a 40-label orthogonal coordinate system to guide the trajectory of the input noise vector.
• Last but not least, we use the state-of-the-art StyleGAN2 [12] as our generator to map the manipulated noise vectors into the disentangled feature space to generate our 1024 × 1024 images.

II. RELATED WORKS
A. Text-to-image Synthesis
In the area of TTI, Reed et al. [6] first proposed to take advantage of GANs with a model comprising a text encoder and an image generator, where the text embedding is concatenated to the noise vector as input. Unfortunately, the model failed to establish good mappings between the keywords and the corresponding image features. Besides, because the final results are generated directly from the concatenated vectors, the image quality was poor, so the images could easily be discerned as fake. To address these two issues, StackGAN [7] proposed to generate images hierarchically by utilising two pairs of generators and discriminators. Later, Xu et al. proposed AttnGAN [8]. By introducing the attention mechanism, the model successfully matched the keywords with the corresponding image features. Their interpolation experiments indicated that the model could correctly render the image features according to the selected keywords. The model works remarkably well in translating bird and flower descriptions. However, such descriptions are mostly just one sentence. If the descriptions have more sentences, the efficacy of the text encoder deteriorates because the attention map becomes harder to train.
B. Text-to-face Synthesis
Compared to the number of works in TTI, the published works in TTF are far fewer. The main reason is that a face description has a much weaker connection to facial features compared to that of, say, bird or flower images. Typically, the descriptions of birds and flowers are mostly about the colour of feathers and petals. Descriptions of faces can be much more complicated, covering gender, age, ethnicity, pose, and other facial attributes. Moreover, most TTI models are trained with Oxford-102 [13], CUB [14], and COCO [15], which are not face image datasets. On the other hand, the only suitable face dataset is Face2text [16], which has just five thousand pairs of samples; this is not sufficient for training a satisfactory model.

Fig. 2: TTF-HD diagram. The text is fed into the multi-label classifier T, which outputs the text vector l_trg representing 40 facial attributes. The image generator G first synthesises an image from the random noise vector z. Then the image encoder E outputs the image embedding l_org. The differentiated embedding l_diff is used to manipulate the original noise vector from z to ẑ. Finally, the generator synthesises an image with the desired features from ẑ.

With all the challenges mentioned above, there are still several inspiring works engaging in text-to-face synthesis. In a project named T2F [17], Akanimax proposed to encode the text descriptions into a summary vector using an LSTM network. ProGAN [18] was adopted as the generator of the model. Unfortunately, the final output images exhibited poor image quality. Later, the author improved his work, named T2F 2.0, by replacing the ProGAN with MSG-GAN [19]. As a result, the image quality and image-text consistency improved considerably, but the output showed low diversity in facial appearance. To address the data scarcity issue, O. R. Nasir et al. [20] proposed to utilise the labels of CelebA [11] to produce structured pseudo text descriptions automatically. In this way, the samples in the dataset are paired with sentences which contain the positive feature names separated by conjunctions and punctuation. The results are 64 × 64 pixel images showing a certain degree of diversity in appearance. The best output image quality so far is from [23], which also adopted the model structure of AttnGAN [8]. Therefore, this work has the same issues with text encoding mentioned previously.
C. Feature-disentangled Latent Space
Conventionally, the generator produces random images from noise vectors sampled from a normal distribution. However, we desire to control the rendering of the images in response to the feature labels. To do this, Chen et al. [24] proposed to disentangle the desired features by maximising the mutual information between the latent code c of the desired features and the noise vector x. In their experiments, they introduced a variational distribution Q(c|x) to approximate P(c|x). The learned latent code was shown to capture interpretable information when the value of a certain dimension was changed. However, the latent code in that work has only 3 or 4 dimensions, whereas we require 40 features, which is much more complicated. Later, Karras et al. [21] established a novel style-based generator architecture, named StyleGAN, which does not take the noise vector as input like the previous works. The input vector is mapped into an intermediate latent space through a non-linear network before being fed into the generator network. The non-linear network consists of eight fully connected layers. A benefit of such a setting is that the latent space does not have to support sampling according to any fixed distribution [21]. In other words, we have more freedom to combine the desired features.

III. PROPOSED METHOD
Our proposed model, named TTF-HD, comprises a multi-label classifier T, an image encoder E, and a generator G, as shown in Fig. 2. Details are discussed in the following subsections.

A. Multi-label Text Classification
To conduct the TTF task, it is of vital importance to have sufficient facial attribute labels to best describe a face. We propose to use the CelebA [11] dataset, which includes 40 facial attribute labels for each face. To map the free-form natural language descriptions to the 40 facial attributes, we fine-tune a multi-label text classifier T to obtain text embeddings of length 40. With these considerations, we adopt the state-of-the-art natural language processing model, the Bidirectional Transformer (BERT) [9]. In light of the fact that this is a 40-label classification task, we choose the large BERT network for its stronger fitting ability on high-dimensional training data. Some features have different names for their opposites. For example, when training the model T, the feature "age" could be represented by either "young" or "old", where "young" would map to a value close to 0 and "old" to a value close to 1. If a feature is not specified, it is set to 0. This process is shown in Fig. 3. Finally, the classifier outputs a text vector of length 40 for each description.

Fig. 3: A possible classification result of the text classifier T.

Note that the text classifier has one advantage over the traditional text encoders of previous works: there is no restriction on the length of the text descriptions. In previous works, the input to the text encoder is mostly crammed into one or two sentences. But face descriptions are longer than bird and flower descriptions, which makes traditional text encoders less appropriate.
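The paper does not include implementation details for T, so the following is only a minimal sketch of how such a 40-label classifier could be fine-tuned, assuming the HuggingFace transformers library and its built-in multi-label head; the example description and the placeholder labels are illustrative.

```python
# Minimal sketch: fine-tuning BERT as a 40-way multi-label classifier.
# Assumes the HuggingFace `transformers` library; dataset wiring is omitted.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=40,                                # one logit per CelebA attribute
    problem_type="multi_label_classification",    # uses BCEWithLogitsLoss
)

description = "A young woman with blond hair and a big smile."
inputs = tokenizer(description, return_tensors="pt", truncation=True)

# Training step: labels are 40-dim multi-hot float vectors from CelebA.
labels = torch.zeros(1, 40)                       # placeholder target
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# At inference, sigmoid turns logits into per-attribute probabilities l_trg.
with torch.no_grad():
    l_trg = torch.sigmoid(model(**inputs).logits).squeeze(0)  # shape: (40,)
```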
B. Image Multi-label Embeddings

In the proposed framework, an image encoder E is required to predict the feature labels of the generated images. To do this, we fine-tune a MobileNet model [10] with the samples of CelebA [11]. The reason for choosing MobileNet is that it is a light-weight network model which offers a good trade-off between accuracy and speed. With this model, we can obtain embeddings for the images generated from the noise vectors, with the same length as the text vectors.
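The paper cites MobileNets [10] without further detail; as a sketch under that assumption, the snippet below adapts torchvision's MobileNetV2 (a close, readily available variant) into the 40-attribute encoder E with an independent sigmoid per attribute. The dummy batch stands in for CelebA samples.

```python
# Sketch: a MobileNet-based 40-attribute image encoder E.
import torch
import torch.nn as nn
from torchvision import models

encoder = models.mobilenet_v2(weights="IMAGENET1K_V1")
# Replace the ImageNet classification head with a 40-logit attribute head.
encoder.classifier[1] = nn.Linear(encoder.last_channel, 40)

criterion = nn.BCEWithLogitsLoss()                # one sigmoid per attribute

images = torch.randn(8, 3, 224, 224)              # dummy batch of face crops
targets = torch.randint(0, 2, (8, 40)).float()    # CelebA-style multi-hot labels

logits = encoder(images)
loss = criterion(logits, targets)
loss.backward()

# At inference, l_org = sigmoid(E(G(z))) gives the image embedding.
l_org = torch.sigmoid(logits.detach())
```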
C. Feature Axes

After training the image encoder, we can find the relationship between the noise vectors and the predicted feature labels by logistic regression. The length of the noise vectors is 512 (x ∈ R^512) and that of the feature vectors is 40 (y ∈ R^40). Therefore, we can obtain:

y = x \cdot B    (1)

where B is a matrix to be solved with dimension 512 × 40. The columns of B can be viewed as feature directions in the latent space; since they are generally not orthogonal, we orthogonalise them with the Gram-Schmidt process:

\mathrm{proj}_{u}(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle} u    (2)

where v is the axis to be orthogonalised and u is the reference axis. Then, we can obtain:

u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k), \quad w_k = \frac{u_k}{\lVert u_k \rVert}, \quad k = 1, 2, \ldots, 40.    (3)

In (3), the matrix W = [w_1, w_2, \ldots, w_{40}] is normalised so that W becomes unitary.

After these steps, we have the feature axes which guide the update direction of the input noise vectors to obtain the desired features in the output images.
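To make the construction concrete, the NumPy sketch below fits the linear map B of (1) and orthogonalises its columns via (2)-(3). A least-squares fit stands in for the logistic regression named in the text, and the data arrays are placeholders.

```python
# Sketch: build 40 orthonormal feature axes from (noise vector, label) pairs.
# Least-squares stands in for the paper's logistic regression; shapes follow
# the text: z in R^512, labels in R^40.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((10000, 512))     # noise vectors fed to G
Y = rng.uniform(0.0, 1.0, (10000, 40))    # E(G(z)) predictions (placeholder)

# Solve Y ~= Z @ B for B (512 x 40); each column is one feature direction.
B, *_ = np.linalg.lstsq(Z, Y, rcond=None)

def gram_schmidt(V: np.ndarray) -> np.ndarray:
    """Orthogonalise and normalise the columns of V, as in eqs. (2)-(3)."""
    U = np.zeros_like(V)
    for k in range(V.shape[1]):
        v = V[:, k].copy()
        for j in range(k):
            u = U[:, j]
            v -= (v @ u) / (u @ u) * u    # subtract proj_u(v) = <v,u>/<u,u> u
        U[:, k] = v
    return U / np.linalg.norm(U, axis=0)  # w_k = u_k / ||u_k||

W = gram_schmidt(B)                       # 512 x 40, orthonormal columns
assert np.allclose(W.T @ W, np.eye(40), atol=1e-6)
```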
D. Noise Vector Manipulation

Manipulating the noise vectors is vital to our work because this determines whether the output images will have the features described in the text corpus. In the model diagram of Fig. 2, this is the process of changing the random noise vector from z to ẑ by (4), where l is a column vector that determines the direction and magnitude of the movement along the feature axes:

\hat{z} = z + W \cdot l    (4)

To ensure that the model produces an image with the desired features no matter where the noise vector lies in the latent space, we introduce four operations.
Differentiation. As shown in Fig. 2, the text classifier embedding output is denoted as l_trg and the predicted embedding of the initial random vector is denoted as l_org = E(G(z)). Intuitively, we could use l_trg to guide the movement of the noise vectors along the feature axes. However, the value range of l_trg is [0, 1]. This means that the model cannot render features in opposite directions, say, young versus old, because there are no labels of negative value. To solve this, we guide the feature editing with the differentiated embedding l_diff, obtained by (5):

l_{\mathrm{diff}} = l_{\mathrm{trg}} - l_{\mathrm{org}}.    (5)

In this way, the noise vectors can be moved in both the positive and negative directions along the feature axes because the value range of the differentiated embedding is [-1, 1]. For features which have a similar probability value in the text embedding and the image embedding, their probability values cancel out and they will not be rendered repeatedly in the output images. This operation is shown in Fig. 2: for each feature, according to its probability level in l_trg and l_org, the movement direction can be positive, negative, or cancelled out.

Note that to minimise interference from features unspecified in the text descriptions, we do not apply the differentiation operation to such features and keep their values at zero in the differentiated embedding.
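In code, the differentiation step of (5) is a subtraction plus a mask. A minimal sketch, assuming l_trg and l_org are 40-dimensional NumPy arrays of attribute probabilities and that a boolean mask tracks which attributes the description actually specifies:

```python
import numpy as np

def differentiate(l_trg: np.ndarray, l_org: np.ndarray,
                  specified: np.ndarray) -> np.ndarray:
    """Eq. (5): l_diff = l_trg - l_org, zeroed for unspecified attributes."""
    l_diff = l_trg - l_org       # value range becomes [-1, 1]
    l_diff[~specified] = 0.0     # leave unmentioned features untouched
    return l_diff
```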
Fig. 4: Images produced with single-sentence input. With fewer specified labels in the text, the model can generate samples with higher diversity.

Reweighting. In the differentiated embedding, the labels whose values approach -1 or 1 are the specified features, which the text descriptions may specify in a positive or negative way. Apart from these labels, there may be other labels with values between -1 and 1 which interfere with the rendering of the desired features. Therefore, we need to give higher weights to the values of the specified features. To do this, we map the value range of the differentiated embedding from [-1, 1] to [-π/3, π/3] and then compute the tangent of every element of the mapped embedding. As a result, values approaching the ends of the range receive higher weight. In our scenario, the reweighted value range is [-√3, √3].
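The reweighting step then reduces to one line; scaling by π/3 before taking the tangent reproduces the stated output range, since tan(±π/3) = ±√3. A sketch:

```python
import numpy as np

def reweight(l_diff: np.ndarray) -> np.ndarray:
    """Map [-1, 1] to [-pi/3, pi/3], then take tan: output in [-sqrt(3), sqrt(3)]."""
    return np.tan(l_diff * np.pi / 3.0)
```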
Normalisation. As the noise vectors are sampled from a normal distribution, they are most likely to be sampled near the origin of the axes, where the probability density is high. However, the more steps we move a vector along different feature axes, the larger the L2 distance between the vector and the origin may become, which leads to more artifacts in the generated images. That is why we renormalise the vectors after every movement along the axes. For the noise vector x = [x_1, x_2, \ldots, x_N], we obtain x' = [x'_1, x'_2, \ldots, x'_N] with (6):

\lVert x \rVert = \sqrt{\sum_{i=1}^{N=512} |x_i|^2}, \quad x'_i = \frac{x_i}{\lVert x \rVert}, \quad i = 1, 2, \ldots, N.    (6)
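Eq. (6) likewise reduces to a one-line helper; this sketch implements the plain L2 renormalisation exactly as written above:

```python
import numpy as np

def renormalise(z: np.ndarray) -> np.ndarray:
    """Eq. (6): pull the moved vector back to unit L2 norm."""
    return z / np.linalg.norm(z)
```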
Feature lock. To make the face morphing process more stable, we apply a feature lock step every time we move the vectors along a certain axis. In other words, the model only uses the axes along which the vectors have already been moved as the basis axes to disentangle the next feature axis. For the other axes, those of attributes unspecified in the textual descriptions, the movement direction and step size are not fixed, which ensures diversity in the generated images. In this way, the noise vectors are locked only in terms of the features mentioned in the descriptions. A sketch combining the four operations is given below.
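Putting the four operations together, one plausible reading of the manipulation loop is sketched below. This is our reconstruction from the text, not the authors' released code: W, differentiate, reweight, and renormalise are the objects defined above, and the default step size of 1.2 anticipates the value used in the experiments section.

```python
import numpy as np

def manipulate(z: np.ndarray, W: np.ndarray, l_diff: np.ndarray,
               specified: np.ndarray, step_size: float = 1.2) -> np.ndarray:
    """Move z along each specified feature axis (eq. (4)), renormalising
    after every step; unspecified axes are never moved (feature lock)."""
    z_hat = z.copy()
    weights = np.tan(l_diff * np.pi / 3.0)       # reweighting
    for k in np.flatnonzero(specified):          # feature lock: only the
        z_hat = z_hat + step_size * weights[k] * W[:, k]  # specified axes
        z_hat = z_hat / np.linalg.norm(z_hat)    # normalisation, eq. (6)
    return z_hat
```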
E. High Resolution Generator
The generator G we use is a pre-trained StyleGAN2 [12] model. Building on StyleGAN's mapping of noise vectors sampled from the normal distribution into an intermediate latent space, StyleGAN2 reduces the small artifacts of its predecessor by revisiting the structure of the network. With this generator, not only can the model synthesise high-resolution images, but it can also render the desired features from the manipulated input vectors.
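For completeness, here is a sketch of how the manipulated vector could be pushed through a pre-trained StyleGAN2 generator, assuming NVIDIA's stylegan2-ada-pytorch release; the mapping/synthesis split comes from that codebase, not from the paper, and the checkpoint filename is a hypothetical local path.

```python
# Sketch: generate a 1024x1024 face from the manipulated vector z_hat,
# assuming the stylegan2-ada-pytorch repo is on the Python path and an
# FFHQ checkpoint is available locally (both are assumptions).
import torch
import dnnlib, legacy  # modules shipped with stylegan2-ada-pytorch

with dnnlib.util.open_url("ffhq-1024.pkl") as f:   # hypothetical path
    G = legacy.load_network_pkl(f)["G_ema"].eval()

z_hat = torch.randn(1, G.z_dim)      # stand-in for the manipulated vector
w = G.mapping(z_hat, None)           # z -> intermediate latent space
img = G.synthesis(w)                 # (1, 3, 1024, 1024), values in [-1, 1]
```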
IV. EXPERIMENTS & EVALUATION

Dataset. The dataset we use is CelebA [11], which contains over 200k face images. Each sample is paired with a binary attribute label vector of length 40. In addition, there is a paired text description corpus in which every description has almost 10 sentences. There may be some redundant sentences in some of them, but every description includes all the features that the paired label vector indicates. We use this dataset to fine-tune the pre-trained multi-label text classifier and the pre-trained image encoder.
Experimental setting. In our evaluation experiments, we randomly choose 100 text descriptions. For each of them, the model randomly generates 10 images. Therefore, the test set has 1000 images in total. As the experiments show, there is significant image morphing when the noise vector moves twice along a certain feature-disentangled axis. Thus, we set the step size to 1.2, which multiplies the reweighted output of the differentiated vector. This guarantees that the final weight used to move along an axis is around 2 (√3 × 1.2 ≈ 2.08).

Fig. 5: Image morphing process of each group in the ablation study. (A) A group with all operations (the default setting for TTF-HD). (B) A group with the reweighting, differentiation, and normalisation operations. (C) A group with the reweighting and differentiation operations. (D) A group with the reweighting operation. (E) Blank group. We fix the noise vector input of each group. The figure shows the morphing process from the random image in the left column to the final output in the right column.

A. Qualitative Evaluation
Image quality. Fig. 1 also shows the paired descriptions of each group. We can see that most of the generated images are correctly rendered with the described features.
Image diversity. To show that the proposed method has strong feature generalisation ability, we conduct image synthesis conditioned on single-sentence descriptions. In other words, apart from the key features that the sentence refers to, the model should generalise the other features in the output. As Fig. 4 shows, for each single-sentence description, the proposed model can produce images with high diversity.
B. Quantitative Evaluation
In this section, we use three metrics to evaluate the above three criteria respectively: the Inception Score (IS) [25], which is used in many previous works; the Learned Perceptual Image Patch Similarity (LPIPS) [22], which evaluates the diversity of the generated images; and Cosine Similarity (CS), which is widely used in natural language processing to evaluate the similarity of two chunks of a corpus. Due to the lack of source code for most works in the TTF area, such as T2F 2.0 [17], we compare our experimental results with the TTF implementation of AttnGAN [8].

TABLE I: Evaluation results of different models

Methods         IS    CS*    LPIPS
TTF-HD (ours)   ±     ±      ±

Table I shows the evaluation results of the different models. We can see that the proposed method outperforms one of the state-of-the-art methods, AttnGAN [8], in terms of image quality and text-to-image similarity.
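The CS computation itself is standard; since the paper does not spell out which embeddings are compared, the pairing suggested in the comment below (the text embedding l_trg against the encoder output for the generated image) is our assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed usage: consistency of one generated face with its description,
# e.g. cosine_similarity(l_trg, l_out) where l_out = E(G(z_hat)).
```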
C. Ablation Study
In Section III, we proposed four operations to manipulate the noise vector to obtain the desired features. In this subsection, we conduct an ablation study and discuss the effects of the different operations.

To conduct the ablation study, we use five experimental settings. We choose one face description and produce 100 random images under each experimental setting. Then, we use the above three metrics to evaluate the effect of the different operations.

Fig. 5 shows the morphing process of the generated images. We can see that with the proposed four manipulation operations, Group A finally obtains an output with all the desired features. For the other groups, the final morphed images all suffer from artifacts in the rendering of the face and the background. This is because, with too many movement steps along the feature axes, the noise vector is moved to a low-density region of the latent space distribution, which also leads to a mode collapse problem.

TABLE II: Ablation study evaluation results
Exp. Settings   IS         CS*    LPIPS
Group A         1.122 ±    ±      ±
Group B         1.116 ±    ±      ±
Group C         ±          ±      ±
Group D         ±          ±      ±
Group E         ±          ±      ±

Table II shows the quantitative evaluation metrics for the different groups of TTF-HD. We can see that Group A has the best diversity score as well as the second-best performance in terms of the IS and CS scores. This suggests that applying all operations leads to a good trade-off between image quality, text-to-face similarity, and diversity.

V. CONCLUSION