Faces à la Carte: Text-to-Face Generation via Attribute Disentanglement
Tianren Wang
ITEE, The University of Queensland
Brisbane, Australia
[email protected]

Teng Zhang
ITEE, The University of Queensland
Brisbane, Australia

Brian C. Lovell
ITEE, The University of Queensland
Brisbane, Australia

Fig. 1: Several examples of images synthesised by our model. We select four groups of images which are arranged with respect to gender and age. The highlighted features in the textual descriptions are all rendered in the images. The images also exhibit diversity in terms of the other unspecified features.
Abstract—Text-to-Face (TTF) synthesis is a challenging task with great potential for diverse computer vision applications. Compared to Text-to-Image (TTI) synthesis tasks, the textual description of faces can be much more complicated and detailed due to the variety of facial attributes and the parsing of high-dimensional abstract natural language. In this paper, we propose a Text-to-Face model that not only produces images in high resolution (1024 × 1024) with text-to-image consistency, but also generates a group of diverse faces from the same description to cover the facial attributes left unspecified by the text.

Index Terms—text-to-face synthesis, multi-label, disentanglement, high-resolution, diversity
I. INTRODUCTION
With the advent of Generative Adversarial Networks (GANs) [1], image generation has made huge strides in terms of both image quality and diversity. However, the original GAN model [1] cannot generate images tailored to meet design specifications. To this end, many conditional GAN models have been proposed to fit different task scenarios [2]–[8]. Among these works, Text-to-Image (TTI) synthesis is a challenging yet less studied topic. TTI refers to generating a photo-realistic image which matches a given text description. As an inverse image captioning task, TTI aims to establish an interpretable mapping between the image space and the text semantic space. TTI has huge potential and can be used in many applications including photo editing and computer-aided design. However, natural language is high-dimensional information which is often less specific but also much more abstract than images. Therefore, this research problem is quite challenging.

Just like TTI synthesis, the sub-topic of Text-to-Face (TTF) synthesis also has practical value in areas such as crime investigation and biometric research. For example, the police often need professional artists to sketch pictures of suspects based on the descriptions of eyewitnesses. This task is time-consuming, requires great skill, and often results in inferior images. Many police departments may not have access to such professionals. However, with a well-trained Text-to-Face model, we could quickly produce a wide diversity of high-quality photo-realistic pictures based simply on the descriptions of eyewitnesses. Moreover, TTF can be used to address the emerging issue of data scarcity arising from the growing ethical concerns regarding informed consent for the use of faces in biometrics research.

A major challenge of the TTF task is that the linkage between face images and their text descriptions is much looser than for, say, the bird and flower images commonly used in TTI research. A few sentences of description are hardly adequate to cover all the variations of human facial features. Also, for the same face image, different people may use quite different descriptions. This increases the challenge of finding mappings between these descriptions and the facial features. Therefore, in addition to the aforementioned two criteria, a TTF model should have the ability to produce a group of images with high diversity conditioned on the same text description. In a real-world application, a witness could choose the picture among these output images which they think is closest to the appearance of the suspect. This feature is also important for biometric researchers seeking sufficient data from rare ethnicities and demographics when synthesising ethical face datasets that do not require informed consent.

To meet these demands, we propose a model built around a novel TTF framework satisfying: 1) high image quality; 2) improved consistency between synthesised images and their descriptions; and 3) the ability to generate a group of diverse faces from the same text description.

To achieve these goals, we propose a pre-trained BERT [9] multi-label model for natural language processing. This model outputs sparse text embeddings of length 40. We also fine-tune a pre-trained MobileNets [10] model using CelebA's [11] training data, where images have paired labels, and use it to predict the labels of input images. Next, we structure a feature space with 40 orthogonal axes based on the noise vectors and the predicted labels.
After this operation, the input noise vectors can be moved along a specified axis to render output images which have the desired features. Last but certainly not least, we use the state-of-the-art image generator, StyleGAN2 [12], which maps the noise vectors into a feature-disentangled latent space, to generate high-resolution images. As Fig. 1 shows, the synthesised images match the features of the description with good diversity and image quality.

Our work has the following main contributions.
• We propose a novel TTF-HD framework that comprises a text multi-label classifier, an image label encoder, and a feature-disentangled image generator to generate high-quality faces with a wide range of diversity.
• In addition, we add a novel design to the framework: a 40-label orthogonal coordinate system to guide the trajectory of the input noise vector.
• Last but not least, we use the state-of-the-art StyleGAN2 [12] as our generator to map the manipulated noise vectors into the disentangled feature space to generate our 1024 × 1024 images.

II. RELATED WORKS
A. Text-to-image Synthesis
In the area of TTI, Reed et al. [6] first proposed to take advantage of GANs with a model comprising a text encoder and an image generator, where the text embedding is concatenated to the noise vector as input. Unfortunately, the model failed to establish good mappings between the keywords and the corresponding image features. Besides, because the final results are generated directly from the concatenated vectors, the image quality was poor, so the images could easily be discerned as fake. To address these two issues, StackGAN [7] proposed to generate images hierarchically by utilising two pairs of generators and discriminators. Later, Xu et al. proposed AttnGAN [8]. By introducing the attention mechanism, the model successfully matched the keywords with the corresponding image features. Their interpolation experiments indicated that the model could correctly render the image features according to the selected keywords. The model works remarkably well in translating bird and flower descriptions. However, such descriptions are mostly just one sentence. If the descriptions have more sentences, the efficacy of the text encoder deteriorates because the attention map becomes harder to train.
B. Text-to-face Synthesis
Compared to the number of works in TTI, the published works in TTF are far fewer. The main reason is that a face description has a much weaker connection to facial features compared to that of, say, bird or flower images. Typically, the descriptions of birds and flowers are mostly about the colour of feathers and petals. Descriptions of faces can be much more complicated, covering gender, age, ethnicity, pose, and other facial attributes. Moreover, most TTI models are trained with Oxford-102 [13], CUB [14], and COCO [15], which are not face image datasets. On the other hand, the only suitable face dataset is Face2text [16], which has just five thousand pairs of samples; this is not sufficient for training a satisfactory model.

Fig. 2: TTF-HD diagram. The text is fed into the multi-label classifier T, which outputs the text vector l_trg representing 40 facial attributes. The image generator G first synthesises an image from the random noise vector z. Then the image encoder E outputs the image embedding l_org. The differentiated embedding l_diff is used to manipulate the original noise vector from z to ẑ. Finally, the generator synthesises an image with the desired features from ẑ.

With all the challenges mentioned above, there are still several inspiring works engaging in text-to-face synthesis. In a project named T2F [17], Akanimax proposed to encode the text descriptions into a summary vector using an LSTM network. ProGAN [18] was adopted as the generator of the model. Unfortunately, the final output images exhibited poor image quality. Later, the author improved his work, named T2F 2.0, by replacing the ProGAN with MSG-GAN [19]. As a result, the image quality and image-text consistency improved considerably, but the output showed low diversity in facial appearance. To address the data scarcity issue, O. R. Nasir et al. [20] proposed to utilise the labels of CelebA [11] to produce structured pseudo text descriptions automatically. In this way, the samples in the dataset are paired with sentences which contain the positive feature names separated by conjunctions and punctuation. The results are 64 × 64 pixel images showing a certain degree of diversity in appearance. The best output image quality so far is from [23], which also adopted the model structure of AttnGAN [8]. Therefore, this work has the same issues with text encoding mentioned previously.
C. Feature-disentangled Latent Space
Conventionally, the generator produces random images from noise vectors sampled from a normal distribution. However, we desire to control the rendering of the images in response to the feature labels. To do this, Chen et al. [24] proposed to disentangle the desired features by maximising the mutual information between the latent code c of the desired features and the noise vector x. In their experiments, they introduced a variational distribution Q(c|x) to approximate P(c|x). The learned latent code was shown to capture interpretable information when the value of a certain dimension was changed. However, the latent code in that work has only 3 or 4 dimensions, whereas we require 40 features, which is much more complicated. Later, Karras et al. [21] established a novel style-based generator architecture, named StyleGAN, which does not take the noise vector as input like the previous works. The input vector is mapped into an intermediate latent space through a non-linear network before being fed into the generator network. The non-linear network consists of eight fully connected layers. A benefit of such a setting is that the latent space does not have to support sampling according to any fixed distribution [21]. In other words, we have more freedom to combine the desired features.

III. PROPOSED METHOD
Our proposed model, named TTF-HD, comprises a multi-label classifier T, an image encoder E, and a generator G, as shown in Fig. 2. Details are discussed in the following subsections.

A. Multi-label Text Classification
To conduct the TTF task, it is of vital importance to have sufficient facial attribute labels to best describe a face. We propose to use the CelebA [11] dataset, which includes 40 facial attribute labels for each face. To map the free-form natural language descriptions to the 40 facial attributes, we fine-tune a multi-label text classifier T to obtain text embeddings of length 40. With these considerations, we adopt the state-of-the-art natural language processing model, the Bidirectional Transformer (BERT) [9]. In light of the fact that this is a 40-label classification task, we choose the large BERT network for its stronger fitting ability on high-dimensional training data. Some features have different names for their opposites. For example, when training the model T, the feature "age" could be represented by either "young" or "old", where "young" would map to a value close to 0 and "old" to a value close to 1. If a feature is not specified, it is set to 0. This process is shown in Fig. 3. Finally, the classifier outputs a text vector of length 40 for each description.

Fig. 3: A possible classification result of the text classifier T.

Note that the text classifier has one advantage over the traditional text encoders of previous works: there is no restriction on the length of the text descriptions. In previous works, the input to the text encoder is mostly crammed into one or two sentences. But face descriptions are longer than bird and flower descriptions, which makes traditional text encoders less appropriate.
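The paper does not include implementation details for T, so the following is only a minimal sketch of how such a 40-label classifier could be fine-tuned, assuming the HuggingFace transformers library and its built-in multi-label head; the example description and the placeholder labels are illustrative.

```python
# Minimal sketch: fine-tuning BERT as a 40-way multi-label classifier.
# Assumes the HuggingFace `transformers` library; dataset wiring is omitted.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=40,                                # one logit per CelebA attribute
    problem_type="multi_label_classification",    # uses BCEWithLogitsLoss
)

description = "A young woman with blond hair and a big smile."
inputs = tokenizer(description, return_tensors="pt", truncation=True)

# Training step: labels are 40-dim multi-hot float vectors from CelebA.
labels = torch.zeros(1, 40)                       # placeholder target
outputs = model(**inputs, labels=labels)
outputs.loss.backward()

# At inference, sigmoid turns logits into per-attribute probabilities l_trg.
with torch.no_grad():
    l_trg = torch.sigmoid(model(**inputs).logits).squeeze(0)  # shape: (40,)
```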
B. Image Multi-label Embeddings

In the proposed framework, an image encoder E is required to predict the feature labels of the generated images. To do this, we fine-tune a MobileNet model [10] with the samples of CelebA [11]. The reason for choosing MobileNet is that it is a light-weight network model which offers a good trade-off between accuracy and speed. With this model, we can obtain embeddings for the images generated from the noise vectors, with the same length as the text vectors.
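The paper cites MobileNets [10] without further detail; as a sketch under that assumption, the snippet below adapts torchvision's MobileNetV2 (a close, readily available variant) into the 40-attribute encoder E with an independent sigmoid per attribute. The dummy batch stands in for CelebA samples.

```python
# Sketch: a MobileNet-based 40-attribute image encoder E.
import torch
import torch.nn as nn
from torchvision import models

encoder = models.mobilenet_v2(weights="IMAGENET1K_V1")
# Replace the ImageNet classification head with a 40-logit attribute head.
encoder.classifier[1] = nn.Linear(encoder.last_channel, 40)

criterion = nn.BCEWithLogitsLoss()                # one sigmoid per attribute

images = torch.randn(8, 3, 224, 224)              # dummy batch of face crops
targets = torch.randint(0, 2, (8, 40)).float()    # CelebA-style multi-hot labels

logits = encoder(images)
loss = criterion(logits, targets)
loss.backward()

# At inference, l_org = sigmoid(E(G(z))) gives the image embedding.
l_org = torch.sigmoid(logits.detach())
```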
C. Feature Axes

After training the image encoder, we can find the relationship between the noise vectors and the predicted feature labels by logistic regression. The length of the noise vectors is 512 (x ∈ R^512) and that of the feature vectors is 40 (y ∈ R^40). Therefore, we can obtain:

y = x \cdot B    (1)

where B is a matrix to be solved with dimension 512 × 40. The columns of B can be viewed as feature directions in the latent space; since they are generally not orthogonal, we orthogonalise them with the Gram-Schmidt process:

\mathrm{proj}_{u}(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle} u    (2)

where v is the axis to be orthogonalised and u is the reference axis. Then, we can obtain:

u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k), \quad w_k = \frac{u_k}{\lVert u_k \rVert}, \quad k = 1, 2, \ldots, 40.    (3)

In (3), the matrix W = [w_1, w_2, \ldots, w_{40}] is normalised so that W becomes unitary.

After these steps, we have the feature axes which guide the update direction of the input noise vectors to obtain the desired features in the output images.
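To make the construction concrete, the NumPy sketch below fits the linear map B of (1) and orthogonalises its columns via (2)-(3). A least-squares fit stands in for the logistic regression named in the text, and the data arrays are placeholders.

```python
# Sketch: build 40 orthonormal feature axes from (noise vector, label) pairs.
# Least-squares stands in for the paper's logistic regression; shapes follow
# the text: z in R^512, labels in R^40.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((10000, 512))     # noise vectors fed to G
Y = rng.uniform(0.0, 1.0, (10000, 40))    # E(G(z)) predictions (placeholder)

# Solve Y ~= Z @ B for B (512 x 40); each column is one feature direction.
B, *_ = np.linalg.lstsq(Z, Y, rcond=None)

def gram_schmidt(V: np.ndarray) -> np.ndarray:
    """Orthogonalise and normalise the columns of V, as in eqs. (2)-(3)."""
    U = np.zeros_like(V)
    for k in range(V.shape[1]):
        v = V[:, k].copy()
        for j in range(k):
            u = U[:, j]
            v -= (v @ u) / (u @ u) * u    # subtract proj_u(v) = <v,u>/<u,u> u
        U[:, k] = v
    return U / np.linalg.norm(U, axis=0)  # w_k = u_k / ||u_k||

W = gram_schmidt(B)                       # 512 x 40, orthonormal columns
assert np.allclose(W.T @ W, np.eye(40), atol=1e-6)
```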
D. Noise Vector Manipulation

Manipulating the noise vectors is vital to our work because this determines whether the output images will have the features described in the text corpus. In the model diagram of Fig. 2, this is the process of changing the random noise vector from z to ẑ by (4), where l is a column vector that determines the direction and magnitude of the movement along the feature axes:

\hat{z} = z + W \cdot l    (4)

To ensure that the model produces an image with the desired features no matter where the noise vector lies in the latent space, we introduce four operations.
Differentiation. As shown in Fig. 2, the text classifier embedding output is denoted as l_trg and the predicted embedding of the initial random vector is denoted as l_org = E(G(z)). Intuitively, we could use l_trg to guide the movement of the noise vectors along the feature axes. However, the value range of l_trg is [0, 1]. This means that the model cannot render features in opposite directions, say, young versus old, because there are no labels of negative value. To solve this, we guide the feature editing with the differentiated embedding l_diff, obtained by (5):

l_{\mathrm{diff}} = l_{\mathrm{trg}} - l_{\mathrm{org}}.    (5)

In this way, the noise vectors can be moved in both the positive and negative directions along the feature axes because the value range of the differentiated embedding is [-1, 1]. For features which have a similar probability value in the text embedding and the image embedding, their probability values cancel out and they will not be rendered repeatedly in the output images. This operation is shown in Fig. 2: for each feature, according to its probability level in l_trg and l_org, the movement direction can be positive, negative, or cancelled out.

Note that to minimise interference from features unspecified in the text descriptions, we do not apply the differentiation operation to such features and keep their values at zero in the differentiated embedding.
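In code, the differentiation step of (5) is a subtraction plus a mask. A minimal sketch, assuming l_trg and l_org are 40-dimensional NumPy arrays of attribute probabilities and that a boolean mask tracks which attributes the description actually specifies:

```python
import numpy as np

def differentiate(l_trg: np.ndarray, l_org: np.ndarray,
                  specified: np.ndarray) -> np.ndarray:
    """Eq. (5): l_diff = l_trg - l_org, zeroed for unspecified attributes."""
    l_diff = l_trg - l_org       # value range becomes [-1, 1]
    l_diff[~specified] = 0.0     # leave unmentioned features untouched
    return l_diff
```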
Fig. 4: Images produced with single-sentence input. With fewer specified labels in the text, the model can generate samples with higher diversity.

Reweighting. In the differentiated embedding, the labels whose values approach -1 or 1 are the specified features, which the text descriptions may specify in a positive or negative way. Apart from these labels, there may be other labels with values between -1 and 1 which interfere with the rendering of the desired features. Therefore, we need to give higher weights to the values of the specified features. To do this, we map the value range of the differentiated embedding from [-1, 1] to [-π/3, π/3] and then compute the tangent of every element of the mapped embedding. As a result, values approaching the ends of the range receive higher weight. In our scenario, the reweighted value range is [-√3, √3].
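The reweighting step then reduces to one line; scaling by π/3 before taking the tangent reproduces the stated output range, since tan(±π/3) = ±√3. A sketch:

```python
import numpy as np

def reweight(l_diff: np.ndarray) -> np.ndarray:
    """Map [-1, 1] to [-pi/3, pi/3], then take tan: output in [-sqrt(3), sqrt(3)]."""
    return np.tan(l_diff * np.pi / 3.0)
```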
Normalisation. As the noise vectors are sampled from a normal distribution, they are most likely to be sampled near the origin of the axes, where the probability density is high. However, the more steps we move a vector along different feature axes, the larger the L2 distance between the vector and the origin may become, which leads to more artifacts in the generated images. That is why we renormalise the vectors after every movement along the axes. For the noise vector x = [x_1, x_2, \ldots, x_N], we obtain x' = [x'_1, x'_2, \ldots, x'_N] with (6):

\lVert x \rVert = \sqrt{\sum_{i=1}^{N=512} |x_i|^2}, \quad x'_i = \frac{x_i}{\lVert x \rVert}, \quad i = 1, 2, \ldots, N.    (6)
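Eq. (6) likewise reduces to a one-line helper; this sketch implements the plain L2 renormalisation exactly as written above:

```python
import numpy as np

def renormalise(z: np.ndarray) -> np.ndarray:
    """Eq. (6): pull the moved vector back to unit L2 norm."""
    return z / np.linalg.norm(z)
```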
Feature lock. To make the face morphing process more stable, we apply a feature lock step every time we move the vectors along a certain axis. In other words, the model only uses the axes along which the vectors have already been moved as the basis axes to disentangle the next feature axis. For the other axes, those of attributes unspecified in the textual descriptions, the movement direction and step size are not fixed, which ensures diversity in the generated images. In this way, the noise vectors are locked only in terms of the features mentioned in the descriptions. A sketch combining the four operations is given below.
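Putting the four operations together, one plausible reading of the manipulation loop is sketched below. This is our reconstruction from the text, not the authors' released code: W, differentiate, reweight, and renormalise are the objects defined above, and the default step size of 1.2 anticipates the value used in the experiments section.

```python
import numpy as np

def manipulate(z: np.ndarray, W: np.ndarray, l_diff: np.ndarray,
               specified: np.ndarray, step_size: float = 1.2) -> np.ndarray:
    """Move z along each specified feature axis (eq. (4)), renormalising
    after every step; unspecified axes are never moved (feature lock)."""
    z_hat = z.copy()
    weights = np.tan(l_diff * np.pi / 3.0)       # reweighting
    for k in np.flatnonzero(specified):          # feature lock: only the
        z_hat = z_hat + step_size * weights[k] * W[:, k]  # specified axes
        z_hat = z_hat / np.linalg.norm(z_hat)    # normalisation, eq. (6)
    return z_hat
```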
E. High Resolution Generator
The generator G we use is a pre-trained StyleGAN2 [12] model. Building on StyleGAN's mapping of noise vectors sampled from the normal distribution into an intermediate latent space, StyleGAN2 reduces the small artifacts of its predecessor by revisiting the structure of the network. With this generator, not only can the model synthesise high-resolution images, but it can also render the desired features from the manipulated input vectors.
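For completeness, here is a sketch of how the manipulated vector could be pushed through a pre-trained StyleGAN2 generator, assuming NVIDIA's stylegan2-ada-pytorch release; the mapping/synthesis split comes from that codebase, not from the paper, and the checkpoint filename is a hypothetical local path.

```python
# Sketch: generate a 1024x1024 face from the manipulated vector z_hat,
# assuming the stylegan2-ada-pytorch repo is on the Python path and an
# FFHQ checkpoint is available locally (both are assumptions).
import torch
import dnnlib, legacy  # modules shipped with stylegan2-ada-pytorch

with dnnlib.util.open_url("ffhq-1024.pkl") as f:   # hypothetical path
    G = legacy.load_network_pkl(f)["G_ema"].eval()

z_hat = torch.randn(1, G.z_dim)      # stand-in for the manipulated vector
w = G.mapping(z_hat, None)           # z -> intermediate latent space
img = G.synthesis(w)                 # (1, 3, 1024, 1024), values in [-1, 1]
```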
IV. EXPERIMENTS & EVALUATION

Dataset. The dataset we use is CelebA [11], which contains over 200k face images. Each sample is paired with a binary attribute label vector of length 40. In addition, there is a paired text description corpus in which every description has almost 10 sentences. There may be some redundant sentences in some of them, but every description includes all the features that the paired label vector indicates. We use this dataset to fine-tune the pre-trained multi-label text classifier and the pre-trained image encoder.
Experimental setting. In our evaluation experiments, we randomly choose 100 text descriptions. For each of them, the model randomly generates 10 images. Therefore, the test set has 1000 images in total. As the experiments show, there is significant image morphing when the noise vector moves twice along a certain feature-disentangled axis. Thus, we set the step size to 1.2, which multiplies the reweighted output of the differentiated vector. This guarantees that the final weight used to move along an axis is around 2 (√3 × 1.2 ≈ 2.08).

Fig. 5: Image morphing process of each group in the ablation study. (A) A group with all operations (the default setting for TTF-HD). (B) A group with the reweighting, differentiation, and normalisation operations. (C) A group with the reweighting and differentiation operations. (D) A group with the reweighting operation. (E) Blank group. We fix the noise vector input of each group. The figure shows the morphing process from the random image in the left column to the final output in the right column.

A. Qualitative Evaluation
Image quality. Fig. 1 also shows the paired descriptions of each group. We can see that most of the generated images are correctly rendered with the described features.
Image diversity. To show that the proposed method has strong feature generalisation ability, we conduct image synthesis conditioned on single-sentence descriptions. In other words, apart from the key features that the sentence refers to, the model should generalise the other features in the output. As Fig. 4 shows, for each single-sentence description, the proposed model can produce images with high diversity.
B. Quantitative Evaluation
In this section, we use three metrics to evaluate the above three criteria respectively: the Inception Score (IS) [25], which is used in many previous works; the Learned Perceptual Image Patch Similarity (LPIPS) [22], which evaluates the diversity of the generated images; and Cosine Similarity (CS), which is widely used in natural language processing to evaluate the similarity of two chunks of a corpus. Due to the lack of source code for most works in the TTF area, such as T2F 2.0 [17], we compare our experimental results with the TTF implementation of AttnGAN [8].

TABLE I: Evaluation results of different models

Methods         IS    CS*    LPIPS
TTF-HD (ours)   ±     ±      ±

Table I shows the evaluation results of the different models. We can see that the proposed method outperforms one of the state-of-the-art methods, AttnGAN [8], in terms of image quality and text-to-image similarity.
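The CS computation itself is standard; since the paper does not spell out which embeddings are compared, the pairing suggested in the comment below (the text embedding l_trg against the encoder output for the generated image) is our assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed usage: consistency of one generated face with its description,
# e.g. cosine_similarity(l_trg, l_out) where l_out = E(G(z_hat)).
```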
C. Ablation Study
In Section III, we proposed four operations to manipulate the noise vector to obtain the desired features. In this subsection, we conduct an ablation study and discuss the effects of the different operations.

To conduct the ablation study, we use five experimental settings. We choose one face description and produce 100 random images under each experimental setting. Then, we use the above three metrics to evaluate the effect of the different operations.

Fig. 5 shows the morphing process of the generated images. We can see that with the proposed four manipulation operations, Group A finally obtains an output with all the desired features. For the other groups, the final morphed images all suffer from artifacts in the rendering of the face and the background. This is because, with too many movement steps along the feature axes, the noise vector is moved to a low-density region of the latent space distribution, which also leads to a mode collapse problem.

TABLE II: Ablation study evaluation results
Exp. Settings   IS         CS*    LPIPS
Group A         1.122 ±    ±      ±
Group B         1.116 ±    ±      ±
Group C         ±          ±      ±
Group D         ±          ±      ±
Group E         ±          ±      ±

Table II shows the quantitative evaluation metrics for the different groups of TTF-HD. We can see that Group A has the best diversity score as well as the second-best performance in terms of the IS and CS scores. This suggests that applying all operations leads to a good trade-off between image quality, text-to-face similarity, and diversity.

V. CONCLUSION