A3GAN: An Attribute-aware Attentive Generative Adversarial Network for Face Aging
Yunfan Liu, Qi Li, Zhenan Sun, Senior Member, IEEE, and Tieniu Tan, Fellow, IEEE

• Y. Liu is with the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100190, China. E-mail: yunfan.liu@cripac.ia.ac.cn.
• Q. Li, Z. Sun, and T. Tan are with the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing 100190, China. E-mail: {qli, znsun, tnt}@nlpr.ia.ac.cn.
Abstract — Face aging, which aims at aesthetically rendering a given face to predict its future appearance, has received significant research attention in recent years. Although great progress has been achieved with the success of Generative Adversarial Networks (GANs) in synthesizing realistic images, most existing GAN-based face aging methods have two main problems: 1) unnatural changes of high-level semantic information (e.g., facial attributes) due to the insufficient utilization of prior knowledge of input faces, and 2) distortions of low-level image content including ghosting artifacts and modifications in age-irrelevant regions. In this paper, we introduce A3GAN, an Attribute-aware Attentive face aging model, to address the above issues. Facial attribute vectors are regarded as the conditional information and embedded into both the generator and discriminator, encouraging synthesized faces to be faithful to attributes of corresponding inputs. To improve the visual fidelity of generation results, we leverage the attention mechanism to restrict modifications to age-related areas and preserve image details. Moreover, the wavelet packet transform is employed to capture textural features at multiple scales in the frequency space. Extensive experimental results demonstrate the effectiveness of our model in synthesizing photorealistic aged face images and achieving state-of-the-art performance on popular face aging datasets.

Index Terms — Generative adversarial networks, face aging, facial attribute, attention mechanism, wavelet packet transform
1 INTRODUCTION

Face aging, also known as age progression, refers to rendering a given face image with realistic aging effects while still preserving personalized features [1], [2], [3]. Applications of face aging techniques range from social security to digital entertainment, including predicting the contemporary appearance of lost individuals or wanted suspects based on outdated photos and bringing improvements to face recognition systems in cross-age verification scenarios. Because of its significant practical value, face aging has received considerable research attention but remains challenging due to its intrinsic complexity.

In the last two decades, face aging has witnessed impressive progress, and a large number of approaches have been proposed to address this problem. They could generally be divided into two categories: physical model-based methods [4], [5], [6], [7], [8], [9] and prototype-based methods [10], [11], [12], [13], [14]. Physical model-based methods simulate the profile growth mechanically via parameterized models of facial shape and texture, while prototype-based methods render aging effects by applying learned translation patterns between prototype faces (averaged faces of pre-defined age groups) to the test image. Although obvious aging signs could be synthesized using these two types of methods, they suffer from extremely high computational expenses and limited generalization ability [3].

In recent years, with the great success of Generative Adversarial Networks (GANs) [15] in image synthesis and translation tasks, many studies resort to GAN-based frameworks to solve the face aging problem [3], [16], [17], [18]. These methods model the mapping function between the distributions of young and old face images and directly translate test faces into the target age group via learned mappings. The most remarkable advantage of GAN-based methods over previous conventional approaches (physical model-based and prototype-based methods) is that synthesized face images are much more visually plausible and have fewer ghosting artifacts. Moreover, GAN-based models could be trained in an end-to-end manner, which significantly reduces the overall complexity of the algorithm.

Since multiple face images of the same subject at different ages are prohibitively expensive to collect in practice, most GAN-based methods resort to unpaired face aging data to train the model. However, these approaches mainly focus on simulating mappings between image contents while neglecting other critical semantic conditional information of the input (e.g., facial attributes), and thus fail to regulate the training process accordingly.
Concretely, a given young face image might map to multiple elderly face candidates in unpaired scenarios, which may mislead the model into establishing translation patterns other than aging if no high-level conditional information is considered. Consequently, serious ghosting artifacts and even incorrect facial attributes may appear in synthesized face images, which seriously reduces the authenticity and rationality of generation results. For example, Fig. 1 shows several face aging results with mismatched attributes. In the rightmost face aging result under 'Gender', a beard is mistakenly attached to the input female face image, which is almost impossible to happen in the natural aging process.
Fig. 1. Examples of face aging with mismatched facial attributes generated by a face aging model without facial attribute embedding. Four attributes (Race, Gender, Glasses, and Bald) are considered and three sample results are presented for each. Labels of 'Race' and 'Gender' are obtained via the public face analysis API of Face++ [19] and placed under each image.

This is because the model learns that growing a beard is a typical sign of aging, but fails to recognize that this does not happen to a woman, since no conditional information of the test face is involved in the training process.

In order to preserve personalized characteristics of input faces, many recent face aging studies attempt to regulate generation results by enforcing identity consistency [3], [16], [17], [18]. However, as shown in Fig. 1, the identity of the test face is well preserved in the output for all sample results; nevertheless, unnatural changes of facial attributes could still be observed. This suggests that well-maintained identity information does NOT imply reasonable aging results when training with unpaired data. Therefore, merely enforcing identity consistency is insufficient to eliminate matching ambiguities, and thus fails to achieve satisfactory face aging performance in unpaired training scenarios.

In addition to undesired changes of facial attributes, another critical problem of existing GAN-based face aging methods is that image contents irrelevant to age progression (e.g., image background) are not well preserved in the output, resulting in obvious ghosting artifacts and color distortions. Basically, from the perspective of conditional image translation, face aging could be considered as adding representative signs of aging (e.g., wrinkles, eye bags, and laugh lines) to the input face image. Therefore, image modifications are supposed to be restricted to those regions highly relevant to age changes, and image contents should be well preserved elsewhere. However, most existing GAN-based face aging methods do not enforce this constraint on the regions of modification; instead, the pixel at each spatial location of the synthesis result is re-estimated by the generator. Consequently, unintended correspondences between image contents other than age translation (e.g., clothes and accessories) would inevitably be established, which heavily increases the chance of introducing age-irrelevant image modifications and ghosting artifacts.

To solve the above-mentioned issues, in this paper, we propose A3GAN, a GAN-based framework for Attribute-aware Attentive face aging. Different from existing methods in the literature, we involve semantic conditional information of the input by embedding facial attribute vectors into both the generator and discriminator, so that the model could be guided to output elderly face images with attributes faithful to the corresponding input. To improve the visual quality of synthesized face images, we leverage the attention mechanism to restrict modifications to age-related areas and preserve details in input images. Furthermore, to enhance aging details, based on the observation that signs of aging are mainly represented by wrinkles, eye bags, and laugh lines, which could be treated as local textures, we employ the wavelet packet transform in the critic network to extract features at multiple scales in the frequency space efficiently.

Main contributions of this study are summarized as follows:

• An effective end-to-end GAN-based network, A3GAN, is proposed to solve the face aging problem.
Specifically, facial attributes are embedded as semantic conditional information into both the generator and discriminator to enforce more fine-grained consistency between inputs and generation results. Besides, a wavelet packet transform module is adopted to extract features of aging textures at multiple scales in the frequency domain for generating more realistic details of aging effects.

• To improve the quality of synthesized images and suppress ghosting artifacts, the attention mechanism is introduced to help restrict image modifications to age-related image regions.

• Extensive experiments have been conducted to demonstrate the ability of the proposed method in rendering accurate aging effects and preserving information of both identity and facial attributes. Quantitative comparison with four other advanced face aging benchmarks indicates that our method achieves state-of-the-art performance.

Compared to our previous work in [20], this paper has the following extensions: 1) the attention mechanism is introduced to help improve the visual quality of generation results by restricting modifications to image regions closely related to age progression; 2) the generalization ability of the proposed method is investigated by testing the trained model on other widely used face datasets, including FG-NET [21] and CelebA [22]; 3) we refined our model to obtain better results, and a thorough comparison with four GAN-based benchmark methods is provided to demonstrate the effectiveness of the proposed model in achieving reasonable and lifelike face aging results.

The rest of this paper is organized as follows: Section 2 briefly reviews related works on face aging and the attention mechanism. A detailed description of the proposed method is provided in Section 3. Experimental results are reported in Section 4, and Section 5 concludes this work and discusses possible future research directions.
TABLE 1: Comparison between our model and previous GAN-based face aging methods.

| Method | Main Features | Evaluation Metrics | Remarks |
|---|---|---|---|
| Conditional Adversarial Autoencoder (CAAE) [16] | Age and identity translation are achieved by traversing on a low-dimensional manifold M. | Identity permanence and visual fidelity (user study). | Only subtle aging textures are generated, which are insufficient to reflect associated age changes. |
| Global and Local Consistent Age GAN (GLCA-GAN) [17] | Three face patches are translated via dedicated sub-networks besides the global generator. | Face verification on age progression/regression (LightCNN-29 [23]). | Extra network structures in the generator and accurate face patch cropping are required. |
| Identity Preserved Conditional GAN (IPCGAN) [18] | A pre-trained AlexNet [24] is adopted to preserve the identity information in the feature space. | Face verification and age classification (user study); Inception Score (features extracted by VGG [25]). | Training and testing face images are in low resolution. |
| Pyramid-Structured Discriminator GAN (PSD-GAN) [3] | A VGG-16 network [25] is used to extract multi-level age-related features for discrimination. | Age estimation and face verification (public face analysis tools of Face++ [19]). | The VGG-16 network in the discriminator requires pre-training on an age estimation task. |
| Attribute-aware Attentive GAN (A3GAN, our model) | Facial attributes are involved as conditional semantic information, and the attention mechanism is adopted to further refine image quality. | Age estimation, face verification, and attribute preservation (public face analysis tools of Face++ [19]). | Wavelet packet transform is employed in the discriminator to compute multi-scale textural features in the frequency space. |
2 RELATED WORKS
In the last few decades, face aging has been a very popular research topic, and a great number of algorithms have been proposed to solve this problem. In general, these methods could be divided into two categories: physical model-based methods and prototype-based methods.

Physical model-based methods are the initial explorations of face aging; they simulate changes of facial appearance w.r.t. time by modeling both geometrical and textural features of human faces. As one of the earliest attempts, Todd et al. [4] model the profile growth via the revised cardioidal strain transformation. Subsequent works investigate the problem from various biological aspects including muscles and overall facial structures [5], [6], [7], [8], [9]. However, physical model-based algorithms are computationally expensive and difficult to generalize, as they heavily depend on specific empirical aging rules.

As for data-driven prototyping approaches, Burt et al. [10] propose to divide faces into age groups, each represented by an average face (the prototype), and regard differences between average faces as aging patterns. Following [10], many prototype-based methods have been proposed to improve face aging results [11], [12], [13], [14]. However, the main problem of prototype-based methods is that personalized features are eliminated when calculating averaged faces; thus the identity information is not well preserved in aging results. Moreover, since the pattern of age progression is largely determined by the averaged face of each age group, the diversity of visual effects of face aging is quite limited.

With the rapid development of deep learning theory, deep generative models with temporal architectures have been proposed to model age progression with hierarchically learned representations [26], [27], [28]. However, most of these works require a face image sequence over a long age span for each subject, and thus their potential for practical application is limited.

Recently, Generative Adversarial Networks (GANs) [15] have achieved remarkable success in generating visually plausible images, and many efforts have been made to solve the problem of face aging by taking advantage of adversarial learning [3], [16], [17], [18]. Zhang et al. [16] propose a conditional adversarial autoencoder (CAAE) to achieve age progression and regression by traversing in a low-dimensional feature manifold. Li et al. [17] attend to three manually selected facial patches where aging effects are likely to appear, and separate generators are adopted to model the appearance change in these areas. By incorporating a conditional age vector, Wang et al. [18] achieve age progression to multiple target age groups with a single model. Using a pre-trained deep model in the discriminator network, Yang et al. [3] propose a GAN-based framework with a pyramid-structured discriminator (PSD-GAN) to render aging effects. A comprehensive summary of GAN-based face aging methods and comparisons between our method and previous state-of-the-art approaches is provided in TABLE 1.
Attention plays an important role in the human visual system, as it serves as a high-level understanding of the scene and could guide the bottom-up processing of detailed objects [29], [30], [31].
In recent years, numerous attempts have been made to embed the attention mechanism into deep neural networks to improve their performance. The attention mechanism has been successfully applied to recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to tackle problems with sequential input, including neural machine translation [32], [33], visual question answering [34], [35], [36], and caption generation [37].

As for vision-related tasks, the attention mechanism could be naturally introduced to guide the model to focus on specific image regions closely related to the target task. Wang et al. [38] propose a residual attention network which could generate attention-aware features for image classification. Woo et al. [39] explore the effectiveness of a light-weight general attention module, the Convolutional Block Attention Module (CBAM), in improving the performance of deep models on various vision tasks. Albert et al. [40] adopt the spatial attention mechanism to synthesize face images with target expressions. Attention is also widely used in solving image captioning [41] and saliency detection problems [42], [43], [44], [45], [46].

Fig. 2. An overview of the proposed A3GAN model. An hourglass-shaped generator G learns the age mapping and outputs lifelike elderly face images. A discriminator D is employed to distinguish synthesized face images from generic ones, based on multi-scale wavelet coefficients computed by the wavelet packet transform module. The N-dimensional attribute vector describing the input face image is embedded into both the generator and discriminator to reduce the matching ambiguity inherent to unpaired training data.

3 THE PROPOSED METHOD
In an unpaired face aging dataset, a given young face image might map to many elderly face candidates during the training process, which may mislead the model into learning translations other than aging if no conditional information is considered. To solve this problem, we present a GAN-based face aging model that takes both young face images and their semantic information (i.e., facial attributes) as input and outputs visually plausible aged faces with consistent facial attributes.

Our model mainly consists of two key components: an attribute-aware attentive generator G and a wavelet-based multi-pathway discriminator D. The generator G takes a young face image I_y ∈ R^{H×W×C} as input and predicts the corresponding aged face I_o, while the discriminator D encourages generation results to be indistinguishable from generic face images. Unlike most existing face aging methods, the attributes of the input (denoted as α ∈ R^N, where N is the number of facial attributes to be preserved) are considered as conditional information and embedded into both G and D to ensure attribute consistency. An overview of the proposed framework is shown in Fig. 2.

Most existing GAN-based face aging methods [3], [16], [17], [18] have two main problems:

1) Only images of young faces are taken as input to learn mappings between age groups, regardless of any prior knowledge that may have an influence on the visual pattern of age progression. Although constraints on identity information and pixel values are usually adopted to restrict modifications made to input images, facial attributes may still undergo unnatural translations (as shown in Fig. 1).

2) Although signs of aging concentrate on certain facial regions which only take up a small percentage of the entire image, the pixel at each spatial location is re-estimated in the generation result. Consequently, unintended correspondences between image contents other than age translation (e.g., background textures) would inevitably be established, which increases the chance of introducing age-irrelevant changes and ghosting artifacts.

To solve these problems, we propose an attribute-aware attentive generator G to achieve fine-grained face aging (its detailed structure is shown in TABLE 2).
Fig. 3. Demonstration of the wavelet packet transform. (a) Low-pass and high-pass decomposition filters (h_low and h_high) are applied iteratively along the rows and columns of the input at the k-th level to compute wavelet coefficients at the next level; (b) a sample face image with its wavelet coefficients at different decomposition levels.

We employ an hourglass-shaped fully convolutional network as the backbone of the generator, which has achieved success in previous image translation studies [47], [48]. Specifically, it consists of three key components: an encoder network, a decoder network, and six residual blocks in between as the bottleneck. Unlike previous works, we propose to incorporate both low-level image data (pixel values) and high-level semantic information (facial attributes) into the face aging model to regulate image translation patterns and reduce the ambiguity of mappings between unpaired young and aged faces. Concretely, the input facial attribute vector is replicated along the spatial dimensions and then concatenated with the output of the last residual block (ResBlock6), as they both contain high-level representations of the input image.

TABLE 2: Architecture of the generator G.

| Module | Layer | K / S / P* |
|---|---|---|
| Encoder | Conv1 | 7 × 7 / 1 / 3 |
| | Conv2 | 3 × 3 / 2 / 1 |
| | Conv3 | 3 × 3 / 2 / 1 |
| Bottleneck | ResBlock1-6 | 3 × 3 / 1 / 1 |
| Decoder | up↑ & Conv1† | 3 × 3 / 1 / 1 |
| | up↑ & Conv2† | 3 × 3 / 1 / 1 |
| | Conv for estimating M_A | 7 × 7 / 1 / 3 |
| | Conv for estimating M_I | 7 × 7 / 1 / 3 |

*K, S, P denote the kernel size, stride, and padding, respectively. †up↑ & Conv denotes an upsampling layer (scale factor of 2) followed by a convolutional layer.

Considering the fact that age progression is essentially the gradual emergence of aging signs, we naturally introduce the attention mechanism to guide the generator to concentrate on image regions where aging signs are likely to appear. This is achieved by estimating an attention mask describing the contribution of each pixel in the input image I_y to the final generation result. As shown in Fig. 2, the decoder network outputs two feature maps, an attention mask M_A ∈ [0, 1]^{H×W} and an image map M_I ∈ R^{H×W×C}. These two feature maps are then fed into a fusion layer along with the input image I_y to obtain the aged face image I_o, which could be formulated as

$$ I_o = M_A \odot I_y + (1 - M_A) \odot M_I, \qquad (1) $$

where ⊙ denotes the element-wise product and M_A is replicated along the channel dimension to match the size of M_I and I_y.

Intuitively, the attention mask M_A indicates the proportion of the original input image retained at each spatial location, or in other words, to what extent the image map M_I contributes to the final output. For example, as shown in Fig. 2, brighter areas in M_A suggest that those regions of the final output I_o retain more information from the input I_y; these are precisely the image contents irrelevant to age changes (e.g., background and clothes). On the other hand, darker areas in M_A refer to regions in I_o that are more closely related to the image map M_I, that is, representative signs of face aging (e.g., forehead, hair, and laugh lines, as shown in Fig. 2).
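To make the two design choices above concrete, the following PyTorch sketch shows a minimal version of the generator tail: the attribute vector is broadcast spatially and concatenated with the bottleneck features, and the decoder produces the mask/image pair that the fusion layer combines according to Eq. (1). The module name, channel widths, and activation choices are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveGeneratorTail(nn.Module):
    """Minimal sketch of the generator tail: attribute embedding + attention fusion."""

    def __init__(self, feat_ch: int, n_attr: int, img_ch: int = 3):
        super().__init__()
        # Decoder stub: two upsample-and-conv stages back to input resolution.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(feat_ch + n_attr, 128, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(inplace=True),
        )
        # Two heads: a single-channel attention mask M_A and an image map M_I.
        self.mask_head = nn.Sequential(nn.Conv2d(64, 1, 7, 1, 3), nn.Sigmoid())
        self.image_head = nn.Sequential(nn.Conv2d(64, img_ch, 7, 1, 3), nn.Tanh())

    def forward(self, feat: torch.Tensor, attr: torch.Tensor, i_y: torch.Tensor):
        # Replicate the attribute vector along the spatial dimensions and
        # concatenate it with the output of the last residual block.
        b, _, h, w = feat.shape
        attr_map = attr.view(b, -1, 1, 1).expand(-1, -1, h, w)
        x = self.decoder(torch.cat([feat, attr_map], dim=1))
        m_a = self.mask_head(x)   # M_A in [0, 1]; broadcasts over channels
        m_i = self.image_head(x)  # image map M_I
        # Fusion layer, Eq. (1): I_o = M_A * I_y + (1 - M_A) * M_I.
        return m_a * i_y + (1.0 - m_a) * m_i, m_a
```

Because the fusion is a convex combination per pixel, pixels where M_A is close to 1 are copied from the input unchanged, which is what keeps age-irrelevant content intact.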
The greatest advantage of adopting the attention mechanism is that the generator could be guided to focus only on rendering age-changing effects within specific regions, while pixels irrelevant to age progression could be directly obtained from the original input, resulting in more fine-grained image details with fewer ghosting artifacts.

In face aging tasks, a discriminator network D is introduced to distinguish synthetic aged face images from generic ones, and the generator learns to confuse D with outputs of high visual fidelity. In order to generate more accurate and lifelike aging details, Yang et al. [3] exploit a deep network with the VGG-16 structure [25] pre-trained on an age classification task to extract age-related features conveyed by faces. Although multi-scale representations could be obtained, storing and forwarding through a deep network damages the efficiency of the model. Besides, pre-training also requires extra effort and might limit the generalizability of the model due to the bias towards the training dataset.

To overcome this issue, since typical signs of aging, e.g., wrinkles, laugh lines, and eye bags, could be regarded as local image textures, we adopt the wavelet packet transform (WPT) to transform the input image to the frequency domain and capture textural features. Specifically, multi-level WPT (see Fig. 3) is performed to provide a more comprehensive analysis of textures at multiple scales in the given image. Compared to extracting multi-scale features using a sequence of convolutional layers as in [3], the advantage of using WPT is that the computational cost is significantly reduced, since wavelet coefficients could be calculated by simply forwarding through a single convolutional layer. Therefore, WPT greatly reduces the number of convolutions performed in each forward pass.
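The reduction of one decomposition level to a single convolution can be seen directly by expressing the filter bank as a strided convolution. The sketch below uses Haar filters as an assumption (the paper does not name the wavelet basis); each level maps C channels to 4C sub-band channels at half the spatial resolution, and iterating on all sub-bands yields the wavelet packet decomposition.

```python
import torch
import torch.nn.functional as F

def haar_wpt_level(x: torch.Tensor) -> torch.Tensor:
    """One wavelet packet decomposition level as a single strided convolution.

    Maps (B, C, H, W) to (B, 4*C, H/2, W/2): [LL, LH, HL, HH] per input channel.
    """
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)       # (4, 1, 2, 2)
    c = x.size(1)
    weight = bank.repeat(c, 1, 1, 1).to(x.device, x.dtype)  # (4*C, 1, 2, 2)
    # groups=C applies the four sub-band filters to every channel independently.
    return F.conv2d(x, weight, stride=2, groups=c)

def haar_wpt(x: torch.Tensor, levels: int = 2) -> list:
    """Full packet transform: re-decompose *all* sub-bands at every level."""
    coeffs = [x]  # level 0 is the raw image
    for _ in range(levels):
        coeffs.append(haar_wpt_level(coeffs[-1]))
    return coeffs
```

Under this sketch, the level-0 (raw image), level-1, and level-2 coefficient tensors would feed the three convolutional pathways of D described next.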
Although this part of the model has been simplified in terms of network structure, it still takes advantage of multi-scale image texture analysis, which helps improve the visual fidelity of generated images.

The overall structure of D is shown in TABLE 3. Specifically, WPT is applied to input images to perform wavelet decomposition, and the coefficients at each decomposition level are concatenated along the channel dimension before being fed into separate convolutional pathways for feature extraction. To make D gain the ability to tell whether attributes are preserved in generated images, the attribute vector α is also replicated and concatenated to the output of an intermediate convolutional block of each pathway. At the end of D, the same-sized outputs of all pathways are fused into a single tensor and then fed into a fully connected network to produce the final score for the authenticity of the input images.

TABLE 3: Architecture of the discriminator D.

| Pathway index | 1 | 2 | 3 |
|---|---|---|---|
| Convolutional pathway* | Conv-64 | – | – |
| | Conv-128 | Conv-128 | – |
| | Conv-256 (concat I_y) | Conv-256 (concat I_y) | Conv-256 (concat I_y) |
| | Conv-512 | Conv-512 | Conv-512 |
| | Conv-512 | Conv-512 | Conv-512 |
| | Conv-1 | Conv-1 | Conv-1 |
| FC layer | fusion of all pathway outputs → authenticity score | | |

*For all layers in the convolutional pathways, the kernel size, stride, and padding are set to 4, 2, and 1, respectively; only the number of output channels of each layer is listed.

The training objective of the proposed model consists of three parts: an adversarial loss to encourage the distribution of generated images to be indistinguishable from that of real images, an identity loss to preserve personalized characteristics of the input image, and a pixel-level loss to reduce the gap between the input and output of the generator in the image space.
The adversarial process between the generator G and discriminator D encourages synthetic results to be photo-realistic and indistinguishable from real ones. Besides visual fidelity, attribute consistency is also guaranteed by involving the attributes of input face images as conditional information in the adversarial process.

To achieve these two goals, unlike existing face aging methods, our discriminator network D is designed to take pair-wise input, i.e., aged face images and their corresponding attributes. Our goal is to make D gain the ability to discriminate generated aged face images from real ones and to tell whether the input face image carries the desired attributes. Therefore, to train the discriminator network, data pairs of real aged faces with the same attributes as I_y, denoted by {I_o, α}, are considered as positive samples. Negative samples include image pairs of synthesized aged faces G(I_y, α) and their attributes α, i.e., {G(I_y, α), α}, as well as image pairs of real aged faces and mismatched attributes, i.e., {I_o, ᾱ}.

Formally, the objective function for training the discriminator network D consists of two parts, that is, L_adv_att for checking the attribute consistency and L_adv_auth for image authenticity discrimination. Therefore, the adversarial loss could be formulated as

$$ L_{adv\_D} = \lambda_{att} L_{adv\_att} + L_{adv\_auth}, \qquad (2) $$

where the parameter λ_att controls the relative importance of L_adv_att against L_adv_auth; it is initialized as 0 and then linearly increased during the training process. This enables D to first focus on discriminating fake images from real ones and then gradually adapt to the task of checking attribute consistency, which is critical for stabilizing the training process. We follow WGAN [49] and use the Wasserstein distance to measure the discrepancy between two data distributions. Therefore, L_adv_att and L_adv_auth are formulated as follows,

$$ L_{adv\_att} = -\mathbb{E}_{(I_o,\alpha)\sim P_o(I,\alpha)}[D(I_o, \alpha)] + \mathbb{E}_{(I_o,\alpha)\sim P_o(I,\alpha)}[D(I_o, \bar{\alpha})], \qquad (3) $$

$$ L_{adv\_auth} = -\mathbb{E}_{(I_o,\alpha)\sim P_o(I,\alpha)}[D(I_o, \alpha)] + \mathbb{E}_{(I_y,\alpha)\sim P_y(I,\alpha)}[D(G(I_y, \alpha), \alpha)], \qquad (4) $$

where P_y and P_o stand for the distributions of generic face images of young and old subjects, respectively.

The generator network G is trained to confuse D with visually plausible synthetic images, and the objective function could be written as

$$ L_{adv\_G} = -\mathbb{E}_{(I_y,\alpha)\sim P_y(I,\alpha)}[D(G(I_y, \alpha), \alpha)]. \qquad (5) $$

Notably, since our model aims at rendering lifelike aging effects rather than transferring attributes of input face images, only the correct attributes of young face images are fed into the generator in the training process.
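Under the Wasserstein formulation, Eqs. (2)-(5) translate almost directly into code. The sketch below assumes D is a callable scoring (image, attribute) pairs; the gradient-penalty term discussed later is omitted, and all function names are placeholders.

```python
import torch

def d_adversarial_loss(D, i_o, attr, attr_mis, fake, lambda_att):
    """Discriminator loss, Eqs. (2)-(4) (gradient penalty omitted).

    i_o: real aged faces; attr: matching attributes; attr_mis: mismatched
    attributes for the same images; fake: G(I_y, attr), detached from G.
    """
    # Eq. (3): attribute consistency -- real pairs vs. mismatched pairs.
    l_att = -D(i_o, attr).mean() + D(i_o, attr_mis).mean()
    # Eq. (4): authenticity -- real aged faces vs. synthesized ones.
    l_auth = -D(i_o, attr).mean() + D(fake, attr).mean()
    # Eq. (2): lambda_att is ramped linearly from 0 during training.
    return lambda_att * l_att + l_auth

def g_adversarial_loss(D, fake, attr):
    """Generator adversarial loss, Eq. (5)."""
    return -D(fake, attr).mean()
```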
Although the goal of face aging is to modify a given face image to present aging effects, one key requirement is to preserve the identity-related information of the input. To this end, we adopt an identity preserving loss to minimize the distance between the input and output of the generator in a feature space embedding personalized characteristics. Specifically, we employ a pre-trained LightCNN model [23], denoted as φ_id, as the feature extractor and fix its parameters during the training process. To be concrete, the identity preserving loss is defined on the outputs of both the last pooling layer and the fully connected layer of φ_id, which could be formulated as

$$ L_{id} = \mathbb{E}_{(I_y,\alpha)\sim P_y(I,\alpha)}\Big[\big\|\phi_{id}^{pool}(G(I_y,\alpha)) - \phi_{id}^{pool}(I_y)\big\|_F\Big] + \mathbb{E}_{(I_y,\alpha)\sim P_y(I,\alpha)}\Big[\big\|\phi_{id}^{fc}(G(I_y,\alpha)) - \phi_{id}^{fc}(I_y)\big\|\Big], \qquad (6) $$

where φ_id^pool and φ_id^fc denote the outputs of the last pooling layer and the fully connected layer, respectively. Additionally, a pixel-level loss is adopted to maintain the consistency of low-level image content between the input and output of the generator, which could be written as

$$ L_{pix} = \mathbb{E}_{(I_y,\alpha)\sim P_y(I,\alpha)}\big[\|G(I_y,\alpha) - I_y\|\big]. \qquad (7) $$

To generate photo-realistic aged faces with attributes faithful to the corresponding input, the overall objective function for optimizing the discriminator D could be formulated as

$$ \min_{\|D\|_L \le 1} L_D = L_{adv\_D} = \lambda_{att} L_{adv\_att} + L_{adv\_auth}, \qquad (8) $$

where ||D||_L ≤ 1 denotes the 1-Lipschitz constraint [49] imposed on D, implemented via the gradient penalty proposed in WGAN-GP [50]. The objective for the generator G could be written as

$$ \min_{G} L_G = L_{adv\_G} + \lambda_{id} L_{id} + \lambda_{pix} L_{pix}, \qquad (9) $$

where λ_id and λ_pix are hyperparameters balancing the importance of L_id and L_pix w.r.t. the adversarial loss term. G and D are trained alternately until reaching optimality.
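A minimal rendering of Eqs. (6), (7), and (9) is given below. It assumes phi_pool and phi_fc are the frozen LightCNN feature extractors; since the norm of the pixel-level loss is not fully specified above, the l1 choice here is an assumption.

```python
import torch

def identity_loss(phi_pool, phi_fc, fake, i_y):
    """Identity preserving loss, Eq. (6), on frozen LightCNN features."""
    # The Frobenius norm of a feature map equals the 2-norm of its flattening.
    d_pool = (phi_pool(fake) - phi_pool(i_y)).flatten(1).norm(dim=1)
    d_fc = (phi_fc(fake) - phi_fc(i_y)).flatten(1).norm(dim=1)
    return d_pool.mean() + d_fc.mean()

def pixel_loss(fake, i_y):
    """Pixel-level loss, Eq. (7); the l1 norm here is an assumption."""
    return (fake - i_y).abs().mean()

def generator_loss(l_adv_g, l_id, l_pix, lambda_id, lambda_pix):
    """Overall generator objective, Eq. (9)."""
    return l_adv_g + lambda_id * l_id + lambda_pix * l_pix
```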
4 EXPERIMENTS

Extensive experiments are conducted to validate the proposed A3GAN in generating realistic and attribute-consistent aged face images. In this section, we first introduce the face aging datasets and then present implementation details of our model. After that, extensive qualitative and quantitative results are reported to demonstrate the effectiveness of the proposed method. Finally, an ablation study is conducted to further explore the contribution of each component of our model.
Two publicly available face aging datasets, MORPH [51] and CACD [52], are employed in our experiments for both training and testing.
MORPH contains 55,134 face images of 13,000 people, covering an age span of 16 to 77. Face images in MORPH capture near-frontal faces of collaborative subjects under uniform and moderate illumination with simple backgrounds.
CACD contains 163,446 photos of 2,000 celebrities obtained under much less controlled (in-the-wild) conditions compared to MORPH. Consequently, large variations in terms of pose, illumination, and expression (PIE variations) exist in CACD. Besides, due to the fact that images in CACD are collected via Google Image Search, there are mismatches between faces and the associated labels provided (i.e., name and age), making it a very challenging dataset for accurate modeling of the face aging process. Another two face datasets, FG-NET [21] and CelebA [22], are employed as test datasets to validate the generalization ability of the proposed model.
FG-NET contains 1,002 face portraits of 82 subjects and is widely adopted in the test phase of previous works [3], [12], [14], [16].
CelebA is a large-scale face dataset featuring diverse facial attributes, which contains 202,599 face images with 40 attribute annotations for each sample. Similar to CACD, images in CelebA are also captured in the wild and cover large pose variations as well as background clutter.
Image Normalization.
Following the convention of previous studies [3], [16], [17], [26], [28], we adopt an age span of 10 years for each age group and only consider adult aging, as both MORPH and CACD do not contain images of children. In this way, faces are divided into four groups in terms of age, i.e., 30-, 31-40, 41-50, and 51+, and only age translations from 30- to the other three age groups are considered (a concrete reading of this grouping rule is sketched below). All face images in MORPH and CACD are aligned according to eye locations detected using MTCNN [53] and then cropped to a uniform size. After face normalization, 51,822 and 163,355 face images from MORPH and CACD are used in the experiments, respectively. As for FG-NET and CelebA, images are normalized according to the facial landmarks provided along with the datasets. In total, 984 faces from FG-NET and 10,000 images randomly sampled from CelebA are used as testing data in the cross-dataset validation experiment.
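```python
def age_group(age: int) -> int:
    """Map an age label to the four groups used in all experiments:
    0: 30-, 1: 31-40, 2: 41-50, 3: 51+."""
    if age <= 30:
        return 0
    if age <= 40:
        return 1
    return 2 if age <= 50 else 3
```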
Facial Attribute Labeling. MORPH provides researchers with labels including age, gender, and race for each image. We choose 'gender' and 'race' as the attributes required to be preserved, since these two attributes are guaranteed to remain unchanged during the natural aging process and are relatively objective compared to attributes such as 'attractive' or 'chubby' used in CelebA. For CACD, we go through the name list of celebrities and label the corresponding images accordingly. This introduces noise in the attribute labels due to mismatches between annotated names and the actual faces presented, which further increases the difficulty for our method to achieve a satisfying performance on this dataset. Since face images with races other than 'white' only take up a small portion of the entire dataset, we only select 'gender' as the attribute to preserve. Facial attributes of FG-NET are detected via the public face analysis APIs of Face++ [19], and for CelebA, we simply adopt the facial attribute annotations provided along with the dataset. It is worthwhile to note that the proposed model is highly expandable, as researchers may choose whatever attributes to preserve by simply incorporating them in the conditional facial attribute vector.
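For illustration, a possible construction of the conditional attribute vector α on MORPH is sketched below; the paper only states that α is an N-dimensional vector of the attributes to preserve (N = 2 on MORPH), so the binary encoding is an assumption.

```python
def attribute_vector(gender: str, race: str) -> list:
    """Conditional attribute vector alpha for a MORPH image (N = 2).

    The binary encoding is an illustrative assumption, not the paper's scheme.
    """
    return [1.0 if gender == 'male' else 0.0,
            1.0 if race == 'white' else 0.0]
```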
Training Configurations. We choose Adam as the optimizer of both G and D. Appropriate values of the trade-off parameters λ_att, λ_pix, and λ_id are selected and shared across all experiments. The identity preserving loss is applied at every generator iteration, while the pixel-level loss is employed every 5 generator iterations, creating sufficient room for the generator to manipulate the input image. On both MORPH and CACD, the model is trained with a batch size of 16 for 30 epochs, and all experiments are conducted under 5-fold cross-validation.
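Putting the schedule together, a hedged sketch of the training loop is shown below. It composes the loss helpers sketched earlier, assumes G, D, phi_pool, phi_fc, and loader are already defined (with G returning the fused output and attention mask), and uses illustrative constants in place of the unpublished hyperparameter values; the WGAN-GP gradient penalty is omitted for brevity.

```python
import torch

# Illustrative constants standing in for the unpublished hyperparameter values.
LAMBDA_ATT_MAX, LAMBDA_ID, LAMBDA_PIX, RAMP_ITERS = 1.0, 1.0, 1.0, 10_000

d_opt = torch.optim.Adam(D.parameters())  # G, D, loader, phi_* assumed defined
g_opt = torch.optim.Adam(G.parameters())

for it, (i_y, i_o, attr, attr_mis) in enumerate(loader):
    # lambda_att ramps linearly from 0 so that D first learns authenticity
    # and only gradually takes on the attribute-consistency check.
    lambda_att = LAMBDA_ATT_MAX * min(1.0, it / RAMP_ITERS)

    fake, _ = G(i_y, attr)

    # Discriminator step, Eq. (8) (gradient penalty omitted here).
    d_opt.zero_grad()
    d_adversarial_loss(D, i_o, attr, attr_mis, fake.detach(), lambda_att).backward()
    d_opt.step()

    # Generator step, Eq. (9); the pixel-level loss only every 5 iterations.
    g_opt.zero_grad()
    loss_g = g_adversarial_loss(D, fake, attr) \
        + LAMBDA_ID * identity_loss(phi_pool, phi_fc, fake, i_y)
    if it % 5 == 0:
        loss_g = loss_g + LAMBDA_PIX * pixel_loss(fake, i_y)
    loss_g.backward()
    g_opt.step()
```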
1. According to the API documentation on the official website of Face++ (https://console.faceplusplus.com), the latest update of the face APIs (Analyze API and Compare API) was in March 2017. All quantitative results from the Face++ API were obtained in August 2019.
Fig. 4. Sample face aging results on MORPH (first four rows) and CACD (last four rows). The leftmost image of each result is the input test face image (age labeled below) and the subsequent 3 images are synthesized elderly face images of the same subject in age groups 31-40, 41-50, and 51+, respectively. Zoom in for a better view of aging details.
To demonstrate the effectiveness of our model in performing lifelike and accurate age progression, six benchmark methods (CONGRE [54], HFA [55], CAAE [16], GLCA-GAN [17], IPC-GAN [18], and PSD-GAN [3]) are selected for comparison. Specifically, CONGRE [54] and HFA [55] do not adopt GAN-based frameworks and thus only participate in the comparison of visual results for fairness, while results of the other methods are considered as benchmarks for both qualitative and quantitative comparisons. As for CAAE and IPC-GAN, code provided by the corresponding authors is used for reproducing results, and hyper-parameters are fine-tuned to obtain optimal results. Since GLCA-GAN requires fine-grained cropping of facial components, original experimental results were obtained from the authors and directly used for evaluation. As for PSD-GAN, we re-implemented the model, and the VGG network in the discriminator is pre-trained on the same set of training samples as the entire GAN-based framework.
Sample face aging results on MORPH and CACD are shown in Fig. 4. For each result, the leftmost image shows the test face under 30 years old, and the subsequent three images are synthesized aged faces in age groups 31-40, 41-50, and 51+, respectively.
2. CAAE: https://github.com/ZZUTK/Face-Aging-CAAE
3. IPC-GAN: https://github.com/dawei6875797/Face-Aging-with-Identity-Preserved-Conditional-Generative-Adversarial-Networks
Fig. 5. Illustration of visual fidelity for different facial components: (a) hair whitening; (b) forehead wrinkles and receding hairline; (c) eye region aging; (d) mouth region aging. Zoom in for a better view of details.

Although input face images cover a wide range of gender, race, pose, and expression, the proposed method could generate visually appealing face aging results with coherent and diverse signs of age progression, including bald foreheads, white hair, laugh lines, etc. Notably, compared to MORPH, generated faces on CACD present more fine-grained and subtler signs of aging, especially for female subjects. This observation reflects the difference in the data distributions of MORPH and CACD, as CACD mainly contains face images of celebrities with apparent make-up, making them look generally younger than faces in MORPH. Quantitative results reported in Sec. 4.5.1 also confirm this conclusion. A closer inspection of aging details in different facial regions is presented in Fig. 5.
Fig. 6. Illustration of face aging results and corresponding attention maps on MORPH (first two rows) and CACD (last two rows). Darker regions suggest that those areas of the face image receive more attention in the generation process, and brighter regions indicate that more information is retained from the original input image. Zoom in for a better view of aging details.
The attention mechanism is employed in our model to restrict image modifications to age-related regions. Fig. 6 shows sample face aging results with the corresponding attention maps, indicating the contribution of input images to the generated aged faces. It could be observed that darker regions, which represent image areas receiving more attention in the generation process, are distributed mainly over facial areas closely related to signs of aging (e.g., hair, forehead, and mouth). Image content in these areas is modified to reflect age changes. On the other hand, pixels located in brighter regions are mainly retained from the original input image. This enables the generator to focus on synthesizing signs of aging, which is helpful for both preserving fine-grained textural details of the input image and improving the visual fidelity of generation results.
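A simple way to reproduce such visualizations is to read out the decoder's two heads and render the mask as a grayscale map. The sketch below assumes a hypothetical G.decode accessor returning (M_A, M_I) and generator image outputs in [-1, 1]; both conventions belong to this sketch, not to the paper.

```python
import matplotlib.pyplot as plt
import torch

@torch.no_grad()
def show_attention(G, i_y, attr):
    """Render the input face, aged output, and attention mask side by side."""
    m_a, m_i = G.decode(i_y, attr)     # M_A: (1, 1, H, W) in [0, 1]; hypothetical accessor
    i_o = m_a * i_y + (1 - m_a) * m_i  # fusion layer, Eq. (1)
    _, axes = plt.subplots(1, 3, figsize=(9, 3))
    panels = [(i_y, 'input'), (i_o, 'aged output'), (m_a, 'attention map')]
    for ax, (img, title) in zip(axes, panels):
        data = img.squeeze(0).permute(1, 2, 0).squeeze().cpu().numpy()
        # Images are assumed to be tanh outputs in [-1, 1]; the mask is already in [0, 1].
        ax.imshow(data if title == 'attention map' else (data + 1) / 2, cmap='gray')
        ax.set_title(title)
        ax.axis('off')
    plt.show()
```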
To further demonstrate the effectiveness of our model, a performance comparison is conducted between the proposed method and prior work on MORPH and CACD. Six face aging methods, including both traditional (CONGRE [54] and HFA [55]) and GAN-based models (CAAE [16], GLCA-GAN [17], IPC-GAN [18], and PSD-GAN [3]), are considered as benchmarks, and comparison results are shown in Fig. 7.

Clearly, CONGRE and HFA only render subtle aging effects within the facial area, while our method could also vividly simulate the processes of hair whitening and hairline receding. As for CAAE, due to its incapability in jointly modeling age progression and identity translation, over-smoothed faces are generated and signs of aging could hardly be observed. Although more obvious aging effects could be seen in the results of GLCA-GAN and IPC-GAN, they are originally designed for face images of lower resolution, while our method works at a higher resolution with rich and enhanced details. In addition, GLCA-GAN adopts local generator networks to emphasize aging patterns in the facial patches of the forehead, eyes, and mouth, but it overlooks the importance of the hair region, which is also critical in reflecting age changes.

With the aid of multi-level face representations extracted by the pre-trained deep network in the discriminator, PSD-GAN is able to generate aged faces with high visual fidelity. However, its results suffer from obvious color distortions (hair and background), since the value of every single pixel is re-estimated by the generator rather than retained from the input image. Moreover, due to the lack of prior knowledge of input faces, masculine facial characteristics (e.g., stubble above the mouth) emerge in the aged faces, making the generation results look much less natural.

To evaluate the generalization ability of our model, cross-dataset validation experiments are conducted on FG-NET and CelebA with the model trained on CACD, and the results are shown in Fig. 8. Although input faces are sampled from data distributions different from the training set, visually plausible aging results could still be obtained via the proposed method, demonstrating its effectiveness in dealing with unseen face images.
Fig. 7. Performance comparison with prior work on MORPH and CACD. Sample results of six benchmark methods are presented in the second row with the target age (group) labeled below. Test face images and results obtained by the proposed model are shown in the first and last rows, respectively. Zoom in for a better comparison of aging details.
Fig. 8. Sample results achieved on CelebA (first two rows) and FG-NET (last two rows) with the model trained on CACD. For each sample, the first image is the test face and the image in the middle shows the aging result. The attention map (Att. Map) is shown on the right for each result. Zoom in for a better view of aging details.

Notably, although test faces follow different data distributions, activated regions in the attention maps still concentrate on facial areas closely related to aging effects (e.g., white hair and laugh lines), which helps improve the visual quality of generation results by restricting the image regions being modified.
Apart from visual fidelity, the performance of the proposed model could also be quantitatively evaluated in the following aspects:

• Aging Accuracy: synthesized aged faces are expected to present accurate aging signs that make them fall into the target age group.
• Identity Preservation: personal characteristics of input faces are supposed to be preserved in generation results.
• Attribute Consistency: besides identity, facial attributes that should remain stable in the natural aging process, such as gender and race, are expected to be consistent between input and generated faces.

To evaluate the performance of the proposed method objectively, measurements of all metrics are conducted via the public face analysis API of Face++ [19] (a sketch of the measurement is given below). To unify the evaluation criteria, all metrics are computed based on results obtained by the Face++ API, rather than on annotations provided along with the datasets. For each evaluation metric, the results of four other GAN-based frameworks (CAAE, GLCA-GAN, IPC-GAN, and PSD-GAN) are also reported for comparison.
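As a sketch of the verification measurement, the snippet below queries the Face++ Compare API via HTTP; the endpoint and field names follow its public documentation and are assumptions that may have changed since the evaluation date (see footnote 1).

```python
import requests

# Endpoint and field names follow the public Face++ documentation and are
# assumptions here; they may have changed since the evaluation date.
COMPARE_URL = 'https://api-us.faceplusplus.com/facepp/v3/compare'

def verification_confidence(young_path, aged_path, api_key, api_secret):
    """Face++ verification confidence between a test face and its aged result.

    The returned confidence is compared against the FAR=1e-5 threshold
    (73.395 in our experiments) to decide verification success.
    """
    with open(young_path, 'rb') as f1, open(aged_path, 'rb') as f2:
        resp = requests.post(
            COMPARE_URL,
            data={'api_key': api_key, 'api_secret': api_secret},
            files={'image_file1': f1, 'image_file2': f2},
            timeout=30,
        ).json()
    return resp['confidence']
```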
The goal of face aging is to render a given face with aging effects to predict its appearance in the future. Therefore, the age distribution of generated aged faces should match that of real faces from the same age group to represent accurate age simulation. In this experiment, age distributions of both generic and synthetic faces are estimated and compared for all three target age groups (31-40, 41-50, 51+). Results of age estimation on MORPH and CACD are shown in TABLE 4. Clearly, the mean estimated ages of synthesized faces show the trend of increasing age (38.84, 47.84, 56.68 on MORPH, and 38.26, 47.25, 54.05 on CACD) and are close to those of real faces (38.60, 47.74, 57.25 on MORPH, and 38.50, 46.53, 53.41 on CACD), demonstrating the effectiveness of our model in accurately simulating age progression with various time intervals.

TABLE 4: Results of age estimation on MORPH and CACD (mean estimated ages; each cell lists values for age groups 31-40 / 41-50 / 51+).
| Method | MORPH 31-40 / 41-50 / 51+ | CACD 31-40 / 41-50 / 51+ |
|---|---|---|
| Generic | 38.60 / 47.74 / 57.25 | 38.50 / 46.53 / 53.41 |
| CAAE | – | – |
| GLCA-GAN | – | – |
| IPC-GAN | – | – |
| PSD-GAN | – | – |
| Ours | 38.84 / 47.84 / 56.68 | 38.26 / 47.25 / 54.05 |

Fig. 9. Estimated age distributions of (a) synthetic faces on MORPH; (b) synthetic faces on CACD; (c) generic faces on MORPH; (d) generic faces on CACD.
Compared to our method, CAAE produces over-smoothed face images with subtle changes of appearance on both datasets, leading to insufficient facial changes and large errors in estimated ages. Although much more obvious aging signs could be synthesized by GLCA-GAN, on MORPH, stitching the outputs of several local aging networks introduces additional ghosting artifacts in generation results for age group 31-40, causing large errors in estimated ages. With the aid of a pre-trained VGG network adopted in the discriminator, PSD-GAN could effectively extract multi-level age-related representations and generate faces with clear aging signs. However, due to the matching ambiguity of facial attributes, translations in gender (female to male) take place when synthesizing aged faces, causing the overall age of generation results to be higher than that of generic faces, especially on CACD. This observation is confirmed by the visual comparison between our method and PSD-GAN shown in Fig. 7, as well as the preservation rate of 'Gender' on CACD reported in TABLE 6.

Further comparisons between the detailed age distributions of real and generated face images are shown in Fig. 9. For each age group, it could be observed that the age distribution on MORPH is more concentrated than that on CACD by comparing Fig. 9 (c) and Fig. 9 (d). This is because faces in CACD are obtained online via an image search engine and thus contain noisy age labels. By comparing Fig. 9 (a) with Fig. 9 (c), as well as Fig. 9 (b) with Fig. 9 (d), it is clear that the age distributions of faces generated by our model match those of real faces well, indicating the effectiveness of the proposed method in rendering accurate age translations.

Besides rendering representative signs of aging, a face aging model is also expected to preserve the personalized characteristics embedded in the input young face when synthesizing the corresponding aged face. To this end, face verification experiments are carried out to measure the similarity between real faces from age group 30- and their age-progressed counterparts in age groups 31-40, 41-50, and 51+, respectively.

Results of face verification, including confidence scores and verification rates (threshold set to 73.395@FAR=1e-5 for all experiments), are reported in TABLE 5. On MORPH, our model achieves verification rates of 100.00%, 100.00%, and 99.53% on translation to age groups 31-40, 41-50, and 51+, respectively. Although there are larger variations in pose, expression, and background textures in images of CACD, face verification rates of 99.94%, 99.61%, and 98.85% are obtained on the three age groups, demonstrating the effectiveness of the proposed method in preserving identity information. Notably, as the time interval of age progression increases, both confidence scores and verification rates gradually decrease. This is reasonable, since a larger age gap is reflected in more obvious aging signs (e.g., deeper wrinkles and eye bags), which may lower the similarity between faces from different age groups.

TABLE 5: Results of face verification on MORPH and CACD (face verification rate, %; each cell lists values for age groups 31-40 / 41-50 / 51+).
| Method | MORPH 31-40 / 41-50 / 51+ | CACD 31-40 / 41-50 / 51+ |
|---|---|---|
| CAAE | 24.28 / 20.05 / 14.42 | 9.20 / 7.04 / 5.10 |
| GLCA-GAN | 100.00 / 99.97 / 98.99 | 96.09 / 95.79 / 95.29 |
| IPC-GAN | 100.00 / 100.00 / 99.48 | 100.00 / 97.95 / 97.36 |
| PSD-GAN | 100.00 / 100.00 / 99.42 | 99.83 / 97.67 / 98.50 |
| Ours | 100.00 / 100.00 / 99.53 | 99.94 / 99.61 / 98.85 |
TABLE 6: Preservation rate of facial attributes on MORPH and CACD (%; each cell lists values for age groups 31-40 / 41-50 / 51+).
| Method | Gender, MORPH | Race, MORPH | Gender, CACD |
|---|---|---|---|
| CAAE | 51.38 / 47.07 / 54.24 | 95.45 / 95.23 / 92.37 | 87.43 / 86.53 / 85.25 |
| GLCA-GAN | 96.44 / 95.90 / 94.85 | 93.69 / 91.79 / 91.48 | 95.46 / 95.51 / 94.65 |
| IPC-GAN | 96.87 / 97.45 / 96.75 | 97.11 / 96.88 / 90.57 | 94.79 / 90.18 / 93.24 |
| PSD-GAN | 96.62 / 95.94 / 93.28 | 96.61 / 91.77 / 91.42 | 87.56 / 83.19 / 75.72 |
| Ours | 97.41 / 97.58 / 96.92 | 97.68 / 96.36 / 93.28 | 99.00 / 98.59 / 98.00 |

As for the benchmark methods, personalized facial features fail to be preserved in the heavily blurred aging results generated by CAAE, causing poor face verification performance. To preserve as many facial details of the input face image as possible, a residual connection is adopted in GLCA-GAN by adding the input face image to the output of the generator. However, separate translations of different facial components inevitably introduce extra distortions and ghosting artifacts, which lead to a slightly lower verification rate on generated face images of 51+. The performance of PSD-GAN on identity preservation is very close to ours, and the verification rate between 30- and 51+ on MORPH (99.42%) is clearly higher than that reported in [3], indicating the quality of our re-implementation.

In this experiment, the performance of facial attribute preservation is evaluated by comparing attributes of generic and synthetic faces estimated by the Face++ API, and the results are shown in TABLE 6. On MORPH, our model achieves preservation rates of 97.41%, 97.58%, and 96.92% on 'Gender' and 97.68%, 96.36%, and 93.28% on 'Race' for age mappings from 30- to 31-40, 41-50, and 51+, respectively. As for CACD, 99.00%, 98.59%, and 98.00% of generated faces in age groups 31-40, 41-50, and 51+ have facial attributes consistent with the corresponding input, respectively. Notably, similar to the results of the face verification experiments, as the age gap increases, the preservation rates of attributes decrease due to larger variations of facial appearance.

According to TABLE 6, it could be observed that the proposed method consistently outperforms the other benchmarks by a clear margin in all cases, demonstrating the effectiveness of our model in preserving facial attributes beyond identity information during the face aging process. Specifically, gender characteristics are lost along with identity information in faces generated by CAAE, causing large errors in the preservation rate on 'Gender'. Among all four benchmarks, GLCA-GAN and IPC-GAN give better performance on maintaining facial attribute consistency. This is because they are applied to face images of lower resolution with fewer textural details, which reduces the chance of introducing distortions of fine-grained image content. Although PSD-GAN achieves good results on aging accuracy and face verification, it suffers from inconsistent facial attributes between the input and generated faces, due to the lack of prior knowledge regarding the input image.

In this subsection, experiments are carried out to comprehensively analyze the contribution of each component of the proposed model, namely, facial attribute embedding (FAE), the wavelet-based multi-pathway discriminator (WMD), and the attention mechanism (AM). Specifically, 'w/o FAE' denotes the setting in which no facial attribute is considered as conditional information, and both the generator and discriminator only receive image data as input.
For 'w/o WMD', the proposed wavelet-based multi-pathway discriminator is replaced with an ordinary PatchGAN discriminator [56]. Moreover, 'w/o AM' refers to the variant without the attention mechanism, where the generator synthesizes the pixel at each location of the entire output image. The impact of excluding each of these factors is studied in terms of visual fidelity, aging accuracy, identity verification, and facial attribute preservation.

TABLE 7: Performance comparison on aging accuracy between variants of the proposed model (differences of mean ages against generic faces are shown in brackets).
| Variant | 31-40 | 41-50 | 51+ |
|---|---|---|---|
| Generic | 38.60 | 47.74 | 57.25 |
| Baseline | – | – | – |
| w/o FAE | – | – | – |
| w/o WMD | – | – | – |
| w/o AM | – | – | – |
| Proposed | 38.84 (+0.24) | 47.84 (+0.10) | 56.68 (−0.57) |

TABLE 8: Performance comparison on facial attribute preservation and face verification between variants of the proposed model.
              Gender Pre. Rate (%)    Race Pre. Rate (%)      Face Veri. Rate (%)       Face Veri. Score
Age group     31-40   41-50   51+     31-40   41-50   51+     31-40    41-50    51+     31-40  41-50  51+
Baseline      97.05   95.35   92.20   97.04   94.85   91.18   100.00   99.99    97.66   . ± .  . ± .  . ± .
w/o FAE       96.95   96.11   92.93   95.20   95.84   88.32   100.00   99.99    97.68   . ± .  . ± .  . ± .
w/o WMD       96.84   96.93   96.14   97.44   96.74   91.66   100.00   99.96    97.02   . ± .  . ± .  . ± .
w/o AM        96.90   96.27   94.95   97.69   95.89   91.57   100.00   99.93    98.44   . ± .  . ± .  . ± .
Proposed      97.41   97.58   96.92   97.68   96.36   93.28   100.00   100.00   99.63   . ± .  . ± .  . ± .
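The preservation rates in TABLE 6 and TABLE 8 reduce to a simple matching statistic over attribute labels predicted for corresponding input and generated faces (here by an external estimator such as the Face++ API). A minimal sketch, assuming the rate is computed over estimator outputs rather than ground-truth annotations:

def preservation_rate(input_attrs, output_attrs):
    # Percentage of generated faces whose predicted attribute label
    # (e.g. 'Male' / 'Female') matches that of the corresponding input.
    assert len(input_attrs) == len(output_attrs)
    hits = sum(a == b for a, b in zip(input_attrs, output_attrs))
    return 100.0 * hits / len(input_attrs)

print(preservation_rate(['Male', 'Female', 'Male'], ['Male', 'Male', 'Male']))  # ~66.67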
Fig. 10. Illustration of face aging results generated by different variants of the proposed model. For each subject, estimated ages (first row), facial attributes (second row), and confidence scores of face verification (third row) are reported. All quantitative results are obtained using the Face++ API.

The impact of excluding each of these factors is studied in terms of visual fidelity, aging accuracy, identity verification, and facial attribute preservation.

Generation results obtained by different variants of the proposed model are shown in Fig. 10. It could be observed that aged faces synthesized by the baseline model have severe ghosting artifacts (e.g., the hair area) and color distortion (e.g., facial skin). Since no semantic prior knowledge of the input face is considered in the aging process, masculine facial characteristics emerge and gender reversal takes place. Similarly, results obtained under the setting ‘w/o FAE’ also suffer from unnatural translations of facial attributes due to the lack of conditional information. However, it could be noticed that involving WMD and AM helps to capture more representative age-related facial features (e.g., the beard in Fig. 10 (a) and the white hair and mustache in Fig. 10 (b)) and to preserve image content of the input, respectively. Notably, although aged faces generated under the settings ‘Baseline’ and ‘w/o FAE’ have limitations in maintaining the consistency of facial attributes, this is not clearly reflected in the age estimation and face verification results shown in TABLE 7 and TABLE 8. Therefore, it could be concluded that merely enforcing identity consistency is insufficient for synthesizing aged faces that are reasonable in terms of facial attributes.

After adopting FAE, the unnatural translation of gender characteristics is greatly suppressed, as could be observed from the results under ‘w/o WMD’ and ‘w/o AM’ in Fig. 10. However, closer inspection clearly reveals that replacing WMD with an ordinary PatchGAN discriminator (‘w/o WMD’) damages the ability of the model to capture age-related texture details, resulting in relatively larger errors in estimated ages. As for the results obtained under ‘w/o AM’, distortions of color and image content (e.g., the facial contour and textural details of the hair area in Fig. 10 (b)) as well as ghosting artifacts (e.g., the mouth and hair regions in Fig. 10 (a)) could be observed, leading to lower verification confidence and even incorrect facial attribute recognition results. This confirms the contribution of the attention mechanism, that is, improving the visual fidelity of generation results by attending only to image regions closely related to age progression and retaining textural details from the input face in the remaining areas.

Quantitative results of the ablation study are reported in TABLE 7 and TABLE 8. According to the results in TABLE 8, removing the facial attribute embedding component (‘w/o FAE’ and ‘Baseline’) causes an obvious performance drop in preservation rates for both ‘Gender’ and ‘Race’. From another perspective, while face verification rates have already reached a high level (over 99.9% in most cases), there is still relatively large room for improvement in the preservation rates of facial attributes. Therefore, we could conclude that facial attribute consistency is complementary to identity permanence in rendering natural and reasonable face aging results with high visual fidelity.
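The attention mechanism credited above admits a compact formulation. A common choice, used for instance in GANimation [40] and sketched here under that assumption rather than as a restatement of our exact generator, is to let the network emit an aging map together with a soft mask and blend them with the input:

import torch

def compose_with_attention(x, color, attn):
    # Per-pixel blend: where attn is close to 1 the synthesized aging
    # map is used; where attn is close to 0 the input face is copied
    # through unchanged, preserving age-irrelevant details.
    return attn * color + (1.0 - attn) * x

x = torch.randn(1, 3, 224, 224)                    # input face
color = torch.randn(1, 3, 224, 224)                # synthesized aging map
attn = torch.sigmoid(torch.randn(1, 1, 224, 224))  # soft mask in [0, 1]
aged = compose_with_attention(x, color, attn)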
In addition, from the results in TABLE 7, it could be concluded that adopting the wavelet-based multi-pathway discriminator (WMD) reduces the gap between the age distributions of real and synthesized faces for all age mappings. This demonstrates the ability of WMD to capture discriminative representations of age progression, which is helpful for rendering more accurate signs of aging. Moreover, introducing the attention mechanism comprehensively improves the performance of the model across all experiments, indicating its effectiveness in generating aged faces with high visual quality.
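To make the role of the wavelet packet transform concrete, the fragment below performs a two-level Haar wavelet packet decomposition in NumPy. Unlike the plain wavelet transform, a packet transform splits every subband again, yielding 4**levels frequency bands whose textures a multi-pathway discriminator can inspect at several scales; the Haar basis and the two-level depth here are illustrative assumptions, not a restatement of our exact configuration.

import numpy as np

def haar_step(img):
    # One 2-D Haar analysis step: returns the approximation band LL and
    # the detail bands LH, HL, HH, each of half the spatial resolution.
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical low-pass
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical high-pass
    return [(a[:, 0::2] + a[:, 1::2]) / 2.0,  # LL
            (a[:, 0::2] - a[:, 1::2]) / 2.0,  # LH
            (d[:, 0::2] + d[:, 1::2]) / 2.0,  # HL
            (d[:, 0::2] - d[:, 1::2]) / 2.0]  # HH

def wavelet_packet(img, levels=2):
    # Full packet tree: every band, not only LL, is decomposed again.
    bands = [img]
    for _ in range(levels):
        bands = [sub for b in bands for sub in haar_step(b)]
    return bands

bands = wavelet_packet(np.random.rand(224, 224))
print(len(bands), bands[0].shape)  # 16 (56, 56)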
CONCLUSION AND FUTURE WORK

In this paper, an attribute-aware attentive face aging model, named A3GAN, is proposed to overcome two major limitations of existing face aging methods, i.e., unnatural translations of facial attributes and modifications to image content irrelevant to age progression. Specifically, facial attributes of input images are considered as conditional information and embedded into both the generator and the discriminator to encourage attribute consistency. Besides, the attention mechanism is adopted to restrict modifications to age-related regions and to preserve image details of the input image in the remaining areas, improving the visual fidelity of generation results. Moreover, a wavelet packet transform module is employed to extract textural features, and a multi-pathway discriminator is designed to capture age-related representations at multiple scales. Extensive experimental results on MORPH and CACD demonstrate the effectiveness of the proposed model in rendering accurate aging effects while maintaining identity permanence and facial attribute consistency.

Although the proposed method achieves state-of-the-art performance in various experiments, it does have some limitations. Since existing face aging datasets are heavily biased towards White and Black people, aging patterns of other ethnic groups (e.g., Asian and Hispanic) receive much less attention. Besides, child aging is not investigated in this work due to insufficient data on child faces at different ages. Considering the above issues, collecting large-scale, high-quality face images covering various ethnic and age groups could be one direction of future work.

REFERENCES

[1] N. Ramanathan, R. Chellappa, S. Biswas et al., “Age progression in human faces: A survey,”
Journal of Visual Languages and Computing, vol. 15, pp. 3349–3361, 2009.
[2] Y. Fu, G. Guo, and T. S. Huang, “Age synthesis and estimation via faces: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 11, pp. 1955–1976, 2010.
[3] H. Yang, D. Huang, Y. Wang, and A. K. Jain, “Learning face age progression: A pyramid architecture of gans,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 31–39, 2018.
[4] J. T. Todd, L. S. Mark, R. E. Shaw, and J. B. Pittenger, “The perception of human growth,” Scientific American, vol. 242, no. 2, pp. 132–145, 1980.
[5] A. Lanitis, C. J. Taylor, and T. F. Cootes, “Toward automatic simulation of aging effects on face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 24, no. 4, pp. 442–455, 2002.
[6] A. Golovinskiy, W. Matusik, H. Pfister, S. Rusinkiewicz, and T. Funkhouser, “A statistical model for synthesis of detailed facial geometry,” ACM Transactions on Graphics (TOG), vol. 25, no. 3, pp. 1025–1034, 2006.
[7] N. Ramanathan and R. Chellappa, “Modeling shape and textural variations in aging faces,” Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG), pp. 1–8, 2008.
[8] J. Suo, S.-C. Zhu, S. Shan, and X. Chen, “A compositional and dynamic model for face aging,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 3, pp. 385–401, 2010.
[9] Y. Tazoe, H. Gohara, A. Maejima, and S. Morishima, “Facial aging simulator considering geometry and patch-tiled texture,” ACM SIGGRAPH 2012 Posters, p. 90, 2012.
[10] D. M. Burt and D. I. Perrett, “Perception of age in adult caucasian male faces: Computer graphic manipulation of shape and colour information,” Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 259, no. 1355, pp. 137–143, 1995.
[11] B. Tiddeman, M. Burt, and D. Perrett, “Prototyping and transforming facial textures for perception research,” IEEE Computer Graphics and Applications, vol. 21, no. 5, pp. 42–50, 2001.
[12] I. Kemelmacher-Shlizerman, S. Suwajanakorn, and S. M. Seitz, “Illumination-aware age progression,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3334–3341, 2014.
[13] X. Shu, J. Tang, H. Lai, Z. Niu, and S. Yan, “Kinship-guided age progression,” Pattern Recognition (PR), vol. 59, pp. 156–167, 2016.
[14] J. Tang, Z. Li, H. Lai, L. Zhang, S. Yan et al., “Personalized age progression with bi-level aging dictionary learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 40, no. 4, pp. 905–917, 2017.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680, 2014.
[16] Z. Zhang, Y. Song, and H. Qi, “Age progression/regression by conditional adversarial autoencoder,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4352–4360, 2017.
[17] P. Li, Y. Hu, Q. Li, R. He, and Z. Sun, “Global and local consistent age generative adversarial networks,” Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG), pp. 1073–1078, 2018.
[18] Z. Wang, X. Tang, W. Luo, and S. Gao, “Face aging with identity-preserved conditional generative adversarial networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security (TIFS), vol. 13, no. 11, pp. 2884–2896, 2018.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.
[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[26] W. Wang, Z. Cui, Y. Yan, J. Feng, S. Yan, X. Shu, and N. Sebe, “Recurrent face aging,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2378–2386, 2016.
[27] C. N. Duong, K. Luu, K. G. Quach, and T. D. Bui, “Longitudinal face modeling via temporal deep restricted boltzmann machines,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5772–5780, 2016.
[28] C. N. Duong, K. G. Quach, K. Luu, T. H. N. Le, and M. Savvides, “Temporal non-volume preserving approach to facial age-progression and age-invariant face recognition,” Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3755–3763, 2017.
[29] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 20, no. 11, pp. 1254–1259, 1998.
[30] R. A. Rensink, “The dynamic representation of scenes,” Visual Cognition, vol. 7, no. 1-3, pp. 17–42, 2000.
[31] M. Corbetta and G. L. Shulman, “Control of goal-directed and stimulus-driven attention in the brain,” Nature Reviews Neuroscience, vol. 3, no. 3, p. 201, 2002.
[32] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008, 2017.
[34] H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 451–466, 2016.
[35] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21–29, 2016.
[36] D. Yu, J. Fu, T. Mei, and Y. Rui, “Multi-level attention networks for visual question answering,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4709–4717, 2017.
[37] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057, 2015.
[38] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164, 2017.
[39] S. Woo, J. Park, J.-Y. Lee, and I. So Kweon, “Cbam: Convolutional block attention module,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
[40] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833, 2018.
[41] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5659–5667, 2017.
[42] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang, “Progressive attention guided recurrent network for salient object detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 714–722, 2018.
[43] N. Liu, J. Han, and M.-H. Yang, “Picanet: Learning pixel-wise contextual attention for saliency detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3089–3098, 2018.
[44] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[45] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1448–1457, 2019.
[46] T. Zhao and X. Wu, “Pyramid feature attention network for saliency detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3085–3094, 2019.
[47] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” Proceedings of the European Conference on Computer Vision (ECCV), pp. 694–711, 2016.
[48] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251, 2017.
[49] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” Proceedings of the International Conference on Machine Learning (ICML), 2017.
[50] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” Advances in Neural Information Processing Systems (NIPS), pp. 5767–5777, 2017.
[51] K. Ricanek and T. Tesafaye, “Morph: A longitudinal image database of normal adult age-progression,” Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG), pp. 341–345, 2006.
[52] B.-C. Chen, C.-S. Chen, and W. H. Hsu, “Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset,” IEEE Transactions on Multimedia (TMM), vol. 17, no. 6, pp. 804–815, 2015.
[53] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[54] J. Suo, X. Chen, S. Shan, W. Gao, and Q. Dai, “A concatenational graph evolution aging model,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 34, no. 11, pp. 2083–2096, 2012.
[55] H. Yang, D. Huang, Y. Wang, H. Wang, and Y. Tang, “Face aging effect simulation using hidden factor analysis joint sparse representation,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 6, pp. 2493–2507, 2016.
[56] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134, 2017.
Yunfan Liu received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 2015, and the M.S. degree in electronic engineering: systems from the University of Michigan, Ann Arbor, United States, in 2017. He is currently pursuing the Ph.D. degree with the National Laboratory of Pattern Recognition, Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision, pattern recognition, and machine learning.
Qi Li received the B.E. degree in automation from the China University of Petroleum, Qingdao, China, in 2011, and the Ph.D. degree in pattern recognition and intelligent systems from the National Laboratory of Pattern Recognition, CASIA, Beijing, China, in 2016. He is currently an assistant professor with the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, CASIA. His research interests include face recognition, computer vision, and machine learning.

Zhenan Sun received the B.S. degree in industrial automation from the Dalian University of Technology, Dalian, China, in 1999, the M.S. degree in system engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2002, and the Ph.D. degree in pattern recognition and intelligent systems from the Chinese Academy of Sciences, Beijing, China, in 2006. Since 2006, he has been a Faculty Member with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, where he is currently a Professor. He has authored or coauthored more than 200 technical papers. His current research interests include biometrics, pattern recognition, and computer vision. He is a fellow of the IAPR and serves as the Chair of the IAPR Technical Committee on Biometrics. He is an Associate Editor of the IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE.