DotFAN: A Domain-transferred Face Augmentation Network for Pose and Illumination Invariant Face Recognition
Hao-Chiang Shao, Member, IEEE, Kang-Yu Liu, Chia-Wen Lin†, Fellow, IEEE, and Jiwen Lu, Senior Member, IEEE
Abstract—The performance of a convolutional neural network (CNN)-based face recognition model largely relies on the richness of labeled training data. Collecting a training set with large variations of a face identity under different poses and illumination changes, however, is very expensive, making the diversity of within-class face images a critical issue in practice. In this paper, we propose a 3D model-assisted domain-transferred face augmentation network (DotFAN) that can generate a series of variants of an input face based on the knowledge distilled from existing rich face datasets collected from other domains. DotFAN is structurally a conditional CycleGAN but has two additional subnetworks, namely a face-expert model (FEM) and a face shape regressor (FSR), for latent code control. While FSR aims to extract face attributes, FEM is designed to capture a face identity. With their aid, DotFAN can learn a disentangled face representation and effectively generate face images of various facial attributes while preserving the identity of augmented faces. Experiments show that DotFAN is beneficial for augmenting small face datasets to improve their within-class diversity so that a better face recognition model can be learned from the augmented dataset.
Index Terms—domain-transfer learning, multi-domain image-to-image translation, data augmentation, face recognition.
I. INTRODUCTION
Face recognition is one of the most actively studied topics in computer vision. Benefiting from meticulously designed CNN architectures and loss functions [14], [9], [32], the performance of face recognition models has been significantly advanced. The performance of a CNN-based face recognition model largely relies on the richness of labeled training data. However, collecting a training set with large variations of a face identity under different poses and illumination changes is very expensive, making the diversity of within-class face images a critical issue in practice. A face recognition model may fail if a test face contains what the model did not learn from an anemic training set.
Hao-Chiang Shao is with the Department of Statistics and Information Science, Fu Jen Catholic University, Taiwan (e-mail: [email protected]). Kang-Yu Liu is with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan. Chia-Wen Lin (corresponding author) is with the Department of Electrical Engineering and the Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan (e-mail: [email protected]). Jiwen Lu is with Tsinghua University, Beijing, China (e-mail: [email protected]). Manuscript uploaded to arXiv on Feb. 23rd, 2020.

Fig. 1. DotFAN aims to enrich an anemic domain via identity-preserving face generation based on the knowledge, i.e., the disentangled facial representation, distilled from data in a rich domain.

To avoid this circumstance, our idea is to distill the knowledge within a rich data domain and then transfer the distilled knowledge to enrich an incomprehensive set of training samples in a target domain via domain-transferred augmentation. Specifically, we aim to train a composite network that learns a disentangled representation of facial attributes from rich face datasets, so that the network can generate face variants of each face subject in an anemic dataset, each variant associated with a different pose angle, a shadow due to a different illumination condition, or a different facial expression, for the data augmentation purpose. Hence, we propose a
Domain-transferred Face Augmentation Network (DotFAN), whose design concept is illustrated in Fig. 1.

We regard the proposed DotFAN as a face augmentation approach in which any identity class, whether a minority class or not, can be enriched by synthesizing face samples based on the knowledge learned from rich face datasets in other domains via domain transfer. To this end, DotFAN first learns, from rich datasets, a disentangled facial representation through which the face information can be spanned by various face attribute codes. Then, exploiting the disentangled facial representation, DotFAN can generate synthetic face samples neighboring the input faces in the sample space so that the diversity of each face-identity class can be significantly enhanced. As a result, the performance of a face recognition model trained on the enriched dataset can be improved as well. Utilizing two auxiliary subnetworks, namely a face-expert model (FEM) [6], [26] and a face shape regressor (FSR), DotFAN operates intrinsically in a 3D model-assisted, data-driven fashion. FEM is a purely data-driven subnetwork pretrained on a domain rich in face identities, whereas FSR is driven by a 3D face model and pretrained on another domain with rich poses and expressions. Hence, FEM ensures that synthesized variants of an input face are of the same identity as the input, while FSR, collaborating with the illumination code, offers the model for synthesizing faces with various poses, lighting (shadow) conditions, and expressions. In addition, inspired by FaceID-GAN [30], we use a 3D face model (e.g., 3DMM [1]) to characterize face attributes with only hundreds of parameters. Thereby, the size of FSR, as well as its training set, is largely reduced, making it realizable with a light CNN. Furthermore, the loss terms related to FEM and FSR act as regularizers during the training stage. This design protects DotFAN from common issues in data-driven approaches, e.g., overfitting due to a small training dataset.

Moreover, DotFAN is distinguishable from FaceID-GAN for the following reasons. First, based on a three-player game strategy, FaceID-GAN regards its face-expert model as an additional discriminator that needs to be trained jointly with its generator and discriminator in an adversarial manner. Because its face-expert model assists its discriminator rather than its generator, FaceID-GAN guarantees only an upper bound on identity dissimilarity. This design also prevents its face-expert model from being pretrained and slows the whole training process. Furthermore, because the face-expert model cannot be pretrained on rich-domain data, transferring knowledge from a rich dataset to another dataset in an online-learning manner becomes very difficult. On the contrary, DotFAN regards its FEM as a regularizer that guarantees the identity information is not altered by the generator. Accordingly, FEM can be pretrained on a rich dataset and play the role of an inspector in charge of overseeing identity preservability. This design not only carries out the identity-preserving face generation task but also stabilizes and speeds up the training process by not intervening in the competition between the generator and the discriminator.

The main contributions of DotFAN are threefold.
• We are the first to propose a domain-transferred face augmentation scheme that can easily transfer the knowledge distilled from a rich domain to an anemic domain, while preserving the identity of augmented faces in the target domain.
• Through the well disentangled facial representation learned from existing face data, DotFAN offers a unified framework that can incorporate prominent face attributes (pose, illumination, shape, expression) for face recognition and can be easily extended to other face-related tasks.
• DotFAN beats the state of the art by a significant margin in face recognition applications with small-size training data available. This makes it a powerful tool for low-shot learning applications.

Although DotFAN can synthesize faces with various expressions, we do not particularly consider expression in this work, as we did not find a suitable labeled training set with rich expressions.
II. RELATED WORK
Recently, various algorithms have been proposed to address the issue of small sample size with dramatic variations in facial attributes in face recognition [5], [24], [23], [29]. This section reviews works on GAN-based image-to-image translation, face generation, and face frontalization/rotation techniques related to face augmentation.

(A) GAN-based image-to-image translation:
GAN and its variants have been widely adopted in a variety of fields, including image super-resolution, image synthesis, image style transfer, and domain adaptation. DCGAN [27] incorporates deep CNNs into GAN for unsupervised representation learning; it enables arithmetic operations in the feature space so that face synthesis can be controlled by manipulating attribute codes. The concept of generating images with a given condition has been adopted in succeeding works, such as Pix2pix [19] and CycleGAN [38]. Pix2pix requires pair-wise training data to derive the translation relationship between two domains, whereas CycleGAN relaxes this limitation and exploits unpaired training inputs to achieve domain-to-domain translation. After CycleGAN, StarGAN [5] addresses the multi-domain image-to-image translation issue. With the aid of a multi-task learning setting and a domain classification loss, StarGAN's discriminator minimizes only the classification error associated with a known label. As a result, the domain classifier in the discriminator can guide the generator to learn the differences among multiple domains. Recently, an attribute-guided face generation method based on a conditional CycleGAN was proposed in [24]. This method synthesizes a high-resolution face based on a low-resolution reference face and an attribute code extracted from another high-resolution face. Consequently, by regarding faces of the same identity as one sub-domain of faces, we deem that face augmentation can be formulated as a multi-domain image-to-image translation problem that can be solved with the aid of an attribute-guided face generation strategy.

(B) Face frontalization and rotation:
We regard the identity-preserving face rotation task as an inverse problem of face frontalization, which synthesizes a frontal face from a face image with arbitrary pose variation. Typical face frontalization and rotation methods synthesize a 2D face via 3D surface model manipulation, including pose angle control and facial expression control, such as FFGAN [35] and FaceID-GAN [30]. Still, some designs utilize specialized sub-networks or loss terms to reach the goal. For example, based on TPGAN [18], the pose invariant module (PIM) proposed in [37] contains an identity-preserving frontalization sub-network and a face recognition sub-network; the CNN proposed in [36] establishes a dense correspondence between paired non-frontal and frontal faces; and the face normalization model (FNM) proposed in [26] involves a face-expert network, a pixel-wise loss, and face-attention discriminators to generate faces with canonical view and neutral expression. Finally, some methods approach this issue by means of disentangled representations, such as DR-GAN [31] and CAPG-GAN [16]. The former utilizes an encoder-decoder structure to learn a disentangled representation for face rotation, whereas the latter adopts a two-discriminator framework to simultaneously learn pose and identity information.
(C) Data augmentation for face recognition:
To facilitate face recognition, there are several face normalization and data augmentation methods. Face normalization methods aim to align face images by removing the volatility resulting from illumination variations, changes of facial expression, and different pose angles [26], whereas data augmentation methods attempt to increase the richness of face images, often in the aspects of pose angle and illumination conditions, for the training routine. To deal with illumination variations, conventional approaches utilize either physical models, e.g., Retinex theory [22], or a 3D reconstruction strategy to remove or correct the shadow on a 2D image [10], [33]. Moreover, to mitigate the influence of pose angles, two categories of methods have been proposed: pose-invariant face recognition methods and face rotation methods. While the former category focuses on learning pose-invariant features from a large-scale dataset [25], [2], the latter category, including face frontalization techniques, aims to learn the relationship between the rotation angle and the resulting face image via a generative model [35], [18], [37], [31], [16], [30]. Because face rotation methods are designed to increase the diversity of the viewpoints of face image data, they are also beneficial for augmentation tasks.

Based on these designs, DotFAN is implemented as a uni-generator conditional CycleGAN, involving an encoder-decoder framework and two sub-networks for learning disentangled attribute codes, and is trained with several loss terms, such as the cycle-consistency loss and the domain classification loss, as will be elaborated later.
III. DOMAIN-TRANSFERRED FACE AUGMENTATION
The proposed DotFAN is a framework to synthesize face images of one domain based on the knowledge, i.e., the disentangled facial representation, learned from others. For a given input face $x$, the generator $G$ of DotFAN is trained to synthesize a face $G(f)$ based on an input attribute code $f$ comprising i) a general latent code $l_x = E(x)$ extracted from $x$ by the general facial encoder, ii) an identity code $f_{id}$ indicating the face identity, iii) an attribute code $f_p$ describing facial attributes including pose angle and facial expression, and iv) an illumination code $f_l$. Through this design, a face image can be embedded via a disentangled representation in an attribute code $f = [l_x, f_{id}, f_p, f_l]$. Fig. 2 depicts the flow diagram of DotFAN, where each component will be elaborated in the following subsections.

Fig. 2. Training flow of DotFAN. FEM and FSR are independently pre-trained subnetworks, whereas $E$, $G$, and $D$ are trained as a whole. $\tilde{f}_p$ and $\tilde{f}_l$ denote respectively a pose code and an illumination code randomly given in the training routine, and $f_l$ is the ground-truth illumination code provided by the training set. Note that for inference, the data flow begins from $x$ and ends at $G(\tilde{f})$.
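As a minimal sketch of how the four codes might be combined into $f$ (the paper does not state the joining operation; plain concatenation below is our assumption, not the authors' released code):

```python
import torch

def build_attribute_code(l_x, f_id, f_p, f_l):
    """Assemble f = [l_x, f_id, f_p, f_l]: the encoder latent, the FEM
    identity code, the FSR pose/expression code, and the one-hot
    illumination code. Concatenation is an assumed joining operation."""
    return torch.cat([l_x, f_id, f_p, f_l], dim=-1)
```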
A. Disentangled Facial Representation

To obtain a disentangled representation, the attribute code $f$ used by DotFAN for generating face variants is derived collaboratively by a general facial encoder $E$, a face-expert sub-network FEM, a shape-regression sub-network FSR, and an illumination code $f_l$. FEM and FSR are two well pre-trained sub-networks. FEM learns to extract identity-aware features from faces (of each identity) with various head poses and facial expressions, whereas FSR aims to learn pose features based on a 3D model. The illumination code is a one-hot vector specifying either the label-free case (corresponding to data from CASIA [34]) or one of the illumination conditions (associated with the selected Multi-PIE dataset [11]).

(A) Face-Expert Model (FEM): FEM, denoted by $\Phi_{fem}$, enables DotFAN to extract and to transplant the face identity from an input source to synthesized face images. Though face identity extraction is conventionally treated as a classification problem optimized with a cross-entropy loss, recent methods, e.g., CosFace [32] and ArcFace [9], adopt angular information instead. ArcFace maps face features onto a unit hypersphere and adjusts between-class distances by using a pre-defined margin value so that a more discriminative feature representation can be obtained. Using ArcFace, FEM ensures not merely a fast training speed for learning face identity but also efficiency in optimizing the whole DotFAN network.

(B) Face Shape Regressor (FSR): FSR, denoted by $\Phi_{fsr}$, aims to extract face attributes including face shape, pose, and expression. A fully data-driven approach would require a CNN model of high complexity to completely characterize the face attributes without a prior model, which implies the need for a large variety of labeled face samples for training and thereby runs a high risk of overfitting. Instead, we use a face model-assisted CNN based on the widely adopted 3D Morphable Model (3DMM [1]) to significantly reduce the model size (to a light CNN), as 3DMM can fairly accurately characterize face attributes using only hundreds of parameters. We follow HPEN's strategy [40] to prepare ground-truth 3DMM parameters $\Theta_x$ of a given face $x$ from the CASIA dataset [34]. Then, we train FSR via the Weighted Parameter Distance Cost (WPDC) [39] defined in Eq. (1), with a modified importance matrix, as shown in Eq. (2):

$$\mathcal{L}_{wpdc} = \big(\Phi_{fsr}(x) - \Theta_x\big)^{T} W \big(\Phi_{fsr}(x) - \Theta_x\big), \quad (1)$$
$$W = (w_R, w_T, w_{shape}, w_{exp}), \quad (2)$$

where $w_R$, $w_T$, $w_{shape}$, and $w_{exp}$ are the distance-based weighting coefficients for the components of $\Theta_x$ (a vectorized rotation matrix $R$, a translation vector $T$, a shape-weight vector $\alpha_{shape}$, and an expression-weight vector $\alpha_{exp}$) derived by 3DMM. Note that 3DMM expresses a face as $S = \bar{S} + A_{shape}\alpha_{shape} + A_{exp}\alpha_{exp}$, where $\bar{S}$ is the mean face, $A_{shape}$ denotes the PCA basis spanning shape information, $A_{exp}$ is the basis for facial expressions, and $\alpha_{shape}$ and $\alpha_{exp}$ are weighting vectors. While training DotFAN, $\alpha_{shape}$, which represents the facial shape, is kept unchanged, whereas the translation, rotation, and expression components can be replaced by arbitrary values.
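To make Eq. (1) concrete, the following minimal PyTorch sketch evaluates the WPDC, treating $W$ as a diagonal importance matrix built from the concatenated block weights (this diagonal reading and the tensor layout are our assumptions, not the authors' released code):

```python
import torch

def wpdc_loss(pred_params: torch.Tensor,
              gt_params: torch.Tensor,
              weights: torch.Tensor) -> torch.Tensor:
    """Weighted Parameter Distance Cost, Eq. (1), with diagonal W:
    (p - t)^T W (p - t) averaged over the batch.

    pred_params: (B, D) 3DMM parameters predicted by the FSR.
    gt_params:   (B, D) ground-truth 3DMM parameters Theta_x.
    weights:     (D,) per-parameter importance, the concatenation of
                 w_R, w_T, w_shape, and w_exp from Eq. (2).
    """
    diff = pred_params - gt_params
    return (weights * diff * diff).sum(dim=1).mean()
```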
(C) General facial encoder $E$ and illumination code $f_l$: $E$ is used to capture other facial features that cannot be represented by the shape and identity codes. $f_l$ is a one-hot vector specifying the lighting condition, based on which our model synthesizes a face. Note that because CASIA has no shadow labels, for the $f_l$ of a face from CASIA, the entries for the Multi-PIE illumination conditions are set to 0 and the label-free entry $f_l^{casia}$ is set to 1; this instructs the model to skip shading and to generate a face with the same illumination setting and the same shadow as the input.
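For illustration, the illumination code could be built as follows (a sketch under our assumptions: the text specifies illumination codes 0 to 12 plus one label-free slot, but the exact code length is not confirmed by the extracted paper):

```python
import torch

def make_illum_code(condition: int | None, n_conditions: int = 13) -> torch.Tensor:
    """Build the one-hot illumination code f_l.

    condition: Multi-PIE illumination index (0..12 in the paper), or
               None for CASIA faces, which carry no shadow label.
    The vector has one slot per illumination condition plus a final
    label-free slot; setting the label-free slot to 1 tells the model
    to keep the input's illumination and shadow unchanged.
    """
    f_l = torch.zeros(n_conditions + 1)
    f_l[n_conditions if condition is None else condition] = 1.0
    return f_l
```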
B. Generator

The generator $G$ takes an attribute code $f = [l_x, f_{id}, f_p, f_l]$ as its input to synthesize a face $G(f)$. Described below are the loss terms composing the loss function of our generator.

(A) Cycle-consistency loss: We adopt the cycle-consistency loss to retain face contents after performing two transformations dual to each other. That is,

$$\mathcal{L}_{cycle} = \big\| G\big(G(\tilde{f}), f\big) - x \big\| / N, \quad (3)$$

where $N$ is the number of pixels and $G(\tilde{f})$ is a synthetic face derived according to an input attribute code $\tilde{f}$. This loss guarantees that our generator can learn the transformation relationship between any two dual attribute codes.

(B) Pose-symmetric loss: Based on the common assumption that a human face is symmetric, a face with a $+x^{\circ}$ pose angle and a face with a $-x^{\circ}$ angle should be symmetric about the $0^{\circ}$ axis. Consequently, we design a pose-symmetric loss with which DotFAN can learn to generate $\pm x^{\circ}$ faces from either training sample. This pose-symmetric loss is evaluated with the aid of a face mask $M(\cdot)$, which is defined as a function of the 3DMM parameters predicted by FSR and makes this loss term focus on the face region by filtering out the background:

$$\mathcal{L}_{sym} = \big\| M(\hat{f}^{-}) \cdot \big( G(\hat{f}^{-}) - \hat{x}^{-} \big) \big\| / N. \quad (4)$$

Here, $\hat{f}^{-} = [l_x, f_{id}, \hat{f}_p^{-}, f_l]$, in which $\hat{f}_p^{-} = \Phi_{fsr}(\hat{x}^{-})$, and the other three attribute codes are extracted from $x$; $\hat{x}^{-}$ is the horizontally-flipped version of $x$. In sum, this term measures the norm of the difference between a synthetic face and the horizontally-flipped version of $x$ within a region of interest defined by the mask $M$.

(C) Identity-preserving loss: We adopt the following identity-preserving loss to ensure that the identity code of a synthesized face $G(\tilde{f})$ is identical to that of the input face $x$:

$$\mathcal{L}_{id} = \big\| \Phi_{fem}(x) - \Phi_{fem}\big(G(\tilde{f})\big) \big\| / N. \quad (5)$$

(D) Pose-consistency loss: This term guarantees that the pose and expression features extracted from a synthetic face are consistent with the $\tilde{f}_p$ used to generate that face:

$$\mathcal{L}_{pose} = \big\| \tilde{f}_p - \Phi_{fsr}\big(G(\tilde{f})\big) \big\| / N. \quad (6)$$
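A compact PyTorch sketch of Eqs. (3)-(6) follows. The norm type (taken as L1 here) and the interfaces of G, fem, fsr, and face_mask are our assumptions, not the authors' implementation:

```python
import torch

def l1_mean(t: torch.Tensor) -> torch.Tensor:
    # ||.|| / N, with N the number of entries; the norm type is not
    # fully specified in the extracted text, so L1 is assumed.
    return t.abs().mean()

def generator_losses(x, f, f_tilde, f_p_tilde, x_flip, f_hat_minus,
                     G, fem, fsr, face_mask):
    """Sketch of Eqs. (3)-(6). G(code) renders a face; G(image, code)
    mirrors the paper's G(G(f~), f) notation for the cycle branch.
    fem/fsr are the pretrained Phi_fem/Phi_fsr subnetworks, and
    face_mask(code) rasterizes the 3DMM-based mask M."""
    fake = G(f_tilde)                                   # G(f~)
    l_cycle = l1_mean(G(fake, f) - x)                   # Eq. (3)
    l_sym = l1_mean(face_mask(f_hat_minus)
                    * (G(f_hat_minus) - x_flip))        # Eq. (4)
    l_id = l1_mean(fem(x) - fem(fake))                  # Eq. (5)
    l_pose = l1_mean(f_p_tilde - fsr(fake))             # Eq. (6)
    return l_cycle, l_sym, l_id, l_pose
```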
C. Discriminator

By regarding faces of the same identity as one sub-domain of faces, the task of augmenting faces of different identities becomes a multi-domain image-to-image translation problem, as addressed in StarGAN [5]. Hence, we exploit an adversarial loss to make augmented faces photo-realistic. To this end, we use the domain classification loss to verify whether $G(\tilde{f})$ is properly classified to the target domain label $f_l$ that we used to specify the illumination condition of $G(\tilde{f})$. In addition, to stabilize the training process, we adopt the loss design used in WGAN-GP [12]. Consequently, these loss terms can be expressed as follows:

$$\mathcal{L}_{adv}^{D} = D_{src}\big(G(\tilde{f})\big) - D_{src}(x) + \lambda_{gp} \cdot \big( \|\nabla_{\hat{x}} D_{src}(\hat{x})\| - 1 \big)^{2},$$
$$\mathcal{L}_{adv}^{G} = -D_{src}\big(G(\tilde{f})\big), \quad (7)$$

where $\lambda_{gp}$ is a trade-off factor for the gradient penalty, $\hat{x}$ is uniformly sampled from the linear interpolation between $x$ and the synthesized $G(\tilde{f})$, and $D_{src}$ reflects a distribution over sources given by the discriminator; and

$$\mathcal{L}_{cls}^{D} = -\log D_{cls}(f_l^{x} \mid x),$$
$$\mathcal{L}_{cls}^{G} = -\log D_{cls}\big(f_l \mid G(\tilde{f})\big), \quad (8)$$

where $f_l^{x}$ is the ground-truth illumination code of $x$. In sum, the discriminator aims to produce probability distributions over both sources and domain labels, i.e., $D: x \rightarrow \{D_{src}(x), D_{cls}(x)\}$. Empirically, $\lambda_{gp} = 10$.
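The following PyTorch sketch restates Eqs. (7)-(8), assuming a discriminator that returns a source score and illumination-class logits (this interface is our assumption; the gradient-penalty form is the standard WGAN-GP one):

```python
import torch
import torch.nn.functional as F

def wgan_gp_losses(x, fake, illum_idx, fake_illum_idx, disc, lambda_gp=10.0):
    """disc(img) -> (src_score, cls_logits). illum_idx is the ground-truth
    illumination index of x; fake_illum_idx is the target index used to
    render the fake, i.e., the labels behind Eq. (8)."""
    src_real, cls_real = disc(x)
    src_fake, _ = disc(fake.detach())

    # Gradient penalty at x^ sampled on the line between x and G(f~).
    eps = torch.rand(x.size(0), 1, 1, 1, device=x.device)
    x_hat = (eps * x + (1 - eps) * fake.detach()).requires_grad_(True)
    src_hat, _ = disc(x_hat)
    grad = torch.autograd.grad(src_hat.sum(), x_hat, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    d_adv = src_fake.mean() - src_real.mean() + lambda_gp * gp  # Eq. (7)
    d_cls = F.cross_entropy(cls_real, illum_idx)                # Eq. (8)

    src_fake_g, cls_fake_g = disc(fake)   # no detach for the G update
    g_adv = -src_fake_g.mean()                                  # Eq. (7)
    g_cls = F.cross_entropy(cls_fake_g, fake_illum_idx)         # Eq. (8)
    return d_adv + d_cls, g_adv + g_cls
```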
D. Full objective function
In order to optimize the generator and alleviate the training difficulty, we pretrain FSR and FEM with their corresponding labels. Therefore, while training the generator and the discriminator, no additional label is needed. The full objective functions of DotFAN can be expressed as:

$$\mathcal{L}_G = \mathcal{L}_{adv}^{G} + \mathcal{L}_{cls}^{G} + \mathcal{L}_{id} + \mathcal{L}_{pose} + \mathcal{L}_{sym} + \mathcal{L}_{cycle},$$
$$\mathcal{L}_D = \mathcal{L}_{adv}^{D} + \mathcal{L}_{cls}^{D}. \quad (9)$$

The two loss terms in $\mathcal{L}_D$ are equally weighted, and the terms in $\mathcal{L}_G$ are combined with individually tuned weighting factors.
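For illustration, Eq. (9) can be assembled as below; the lambda values are placeholders of our own, since the paper's weighting factors were lost in extraction:

```python
def full_objectives(g_adv, g_cls, l_id, l_pose, l_sym, l_cyc,
                    d_adv, d_cls,
                    lam_cls=1.0, lam_id=1.0, lam_pose=1.0,
                    lam_sym=1.0, lam_cyc=1.0):
    # Eq. (9); all lam_* weights are hypothetical placeholders.
    loss_g = (g_adv + lam_cls * g_cls + lam_id * l_id
              + lam_pose * l_pose + lam_sym * l_sym + lam_cyc * l_cyc)
    loss_d = d_adv + d_cls   # the two discriminator terms are equally weighted
    return loss_g, loss_d
```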
IV. EXPERIMENTAL RESULTS

A. Dataset
DotFAN is trained jointly on CMU Multi-PIE [11] and CASIA [34]. Multi-PIE contains more than 750,000 images of 337 identities, each with 20 different sorts of illumination and 15 different poses. We select images within a bounded range of pose angles and with illumination codes from 0 to 12 to form our first training set. From this training set, DotFAN learns representative features for a wide range of pose angles, illumination conditions, and the resulting shadows. Our second dataset is the whole CASIA set, which contains 494,414 images of 10,575 identities, each having about 50 images of different poses and expressions. Since CASIA contains a rich collection of face identities, it helps DotFAN learn features for representing identities.

To evaluate the performance of DotFAN on face synthesis, four additional datasets are used: LFW [17], IJB-A [21], SurveilFace-1, and SurveilFace-2. LFW has 13,233 images of 5,749 identities; IJB-A contains images of 500 identities; and SurveilFace-1 and SurveilFace-2 are two smaller private sets. We evaluate the performance of DotFAN's face frontalization on LFW and IJB-A. Besides, because the faces in the two SurveilFace datasets are taken in uncontrolled real working environments, they are contaminated by strong backlight, motion blur, extreme shadow conditions, or influences from various viewpoints. Hence, they mimic real-world conditions and are thus suitable for evaluating face augmentation performance. The two SurveilFace sets are private data provided by a video surveillance provider; we will make them publicly available after removing personal labels.

We exploit CelebA to simulate the data augmentation process. CelebA contains 202,599 images of 10,177 identities with 40 kinds of diverse binary facial attributes. We randomly select a fixed number of images of each face identity from CelebA to form our simulation set, called "sub-CelebA", and conduct data augmentation experiments on both CelebA and sub-CelebA by using DotFAN.
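As a concrete illustration of the sub-CelebA construction described above (a sketch of the sampling step only; the data layout and identity labels are assumptions):

```python
import random
from collections import defaultdict

def make_sub_celeba(samples, k, seed=0):
    """samples: iterable of (image_path, identity) pairs from CelebA.
    Returns a subset with at most k randomly chosen images per identity,
    i.e., Sub-CelebA(k) in the paper's notation."""
    by_id = defaultdict(list)
    for path, ident in samples:
        by_id[ident].append(path)
    rng = random.Random(seed)
    subset = []
    for ident, paths in by_id.items():
        rng.shuffle(paths)
        subset += [(p, ident) for p in paths[:k]]
    return subset
```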
B. Implementation Details

Before training, we align the face images in Multi-PIE and CASIA by MTCNN [8]. Structurally, our FEM is a ResNet-50 pretrained on MS-Celeb-1M [13], and FSR is implemented by a MobileNet [15] pretrained on CASIA. To train DotFAN, each input face is resized to 112 x 112. Both the generator and the discriminator exploit the Adam optimizer [20]; the learning rate is fixed for the initial training epochs and begins to decay thereafter.
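A possible preprocessing step matching the description above (a sketch; we use the facenet-pytorch implementation of MTCNN for illustration, which is not necessarily the one the authors used):

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Detect, align, and crop faces to the 112x112 input size used by DotFAN.
mtcnn = MTCNN(image_size=112, margin=0)

def preprocess(path: str):
    img = Image.open(path).convert("RGB")
    face = mtcnn(img)   # aligned face tensor (3, 112, 112), or None
    return face
```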
C. Face Synthesis

We verify the efficacy of DotFAN through the visual quality of i) face frontalization and ii) face rotation results.

(A) Face frontalization:
First, we verify that the identity information extracted from a frontalized face produced by DotFAN is of the same class as the identity of the given source face. Following [30], we measure the performance by using a face recognition model trained on MS-Celeb-1M. Next, we conduct frontalization experiments on LFW. Fig. 3 illustrates the frontalization results derived by different methods; meanwhile, Tables I and II compare the face verification results of frontalized faces. This set of experiments validates that i) compared with other methods, DotFAN achieves comparable visual quality in face frontalization, ii) shadows can be effectively removed by DotFAN, and iii) DotFAN outperforms the other methods in terms of verification accuracy, especially in the experiment on IJB-A shown in Table II, where DotFAN reports a much better TAR at both FAR = 0.001 and FAR = 0.01 than existing approaches.

Fig. 3. Face frontalization results derived by different methods.

(B) Face rotation: Fig. 4 demonstrates DotFAN's capability of synthesizing faces with given attributes, including pose angles, facial expressions, and shadows, while retaining the associated identities. The source faces presented in the left-most column of Fig. 4 come from four datasets, i.e., CelebA, LFW, CFP [28], and SurveilFace. CelebA and LFW are two widely adopted face datasets; CFP contains images with extreme pose angles (up to full profile); and SurveilFace contains faces under varying illumination conditions as well as faces affected by motion blur. This experiment shows that DotFAN can stably synthesize visually pleasing face images based on 3DMM parameters describing 3D templates. Finally, Fig. 5 shows some synthesized faces with shadows assigned by four different illumination codes. Note that all synthesized faces presented in this paper are produced by the same DotFAN model; no further data-oriented fine-tuning is required.
Fig. 4. Synthesized faces for face samples from different datasets generated by DotFAN. The left-most column shows the inputs with random attributes (e.g., poses, expressions, and motion blurs). The top-most row illustrates 3D templates with specific poses and expressions. To keep the identity information of each synthetic face observable, columns 3-11 show shadow-free results, and the remaining columns show faces with shadows. (a) 3D templates, (b) CelebA, (c) LFW, (d) CFP, and (e) SurveilFace.

TABLE I
VERIFICATION ACCURACY ON LFW.

TABLE II
TRUE-ACCEPT-RATE (TAR) OF VERIFICATIONS ON IJB-A.
D. Face Augmentation
To evaluate the comprehensiveness of domain-transferred augmentation by DotFAN, we first perform data augmentation on the same dataset by using DotFAN, FaceID-GAN, and StarGAN; then, we compare the recognition accuracy of different MobileFaceNet models [4], each trained on one augmented dataset, by testing them on LFW and SurveilFace. The StarGAN used in this experiment is trained on Multi-PIE, which is rich in illumination conditions; meanwhile, FaceID-GAN is trained on CASIA to learn pose and expression representations.

Fig. 5. Face augmentation examples (CelebA) containing augmented faces with various illumination conditions and poses.

Table III summarizes the results of this experiment set. We interpret the results focusing on Sub-experiment (a).
TABLE III
PERFORMANCE COMPARISON AMONG FACE RECOGNITION MODELS TRAINED ON DIFFERENT DATASETS. HERE, SUB-CELEBA(x) DENOTES A SUBSET FORMED BY RANDOMLY SELECTING x IMAGES OF EACH FACE SUBJECT FROM CELEBA.

                    LFW             SurveilFace-1                   SurveilFace-2
Method              ACC     AUC     @FAR=0.001  @FAR=0.01   AUC     @FAR=0.001  @FAR=0.01   AUC
(a) Sub-CelebA(3)
RAW                 83.1    90.2    20.5        34.4        83.2    18.0        33.3        84.8
StarGAN             85.9    92.5    25.1        39.6        87.5    27.4        46.7        91.4
FaceID-GAN          92.5    97.6    34.6        53.5        92.8    32.3        54.0        94.3
Proposed 1x         93.6    98.1    35.7        56.2        93.6    34.7        57.8        95.0
Proposed 3x         94.7    98.7    36.8        58.3        94.6    36.5        60.8        95.6
(b) Sub-CelebA(8)
RAW                 94.0    98.5    37.8        58.7        94.4    38.3        61.0        95.2
StarGAN             94.3    98.5    42.6        60.7        94.9    42.8        65.6        95.8
FaceID-GAN          96.5    99.3    48.1        65.6        96.0    45.7        67.9        96.8
Proposed 1x         97.3    99.5    53.2        71.2        97.0    49.1        72.2        97.2
Proposed 3x         97.2    99.5    53.2        68.9        96.9    47.3        70.0        97.1
(c) Sub-CelebA(13)
RAW                 96.3    99.1    47.4        67.8        96.2    43.5        67.0        96.5
StarGAN             96.7    99.3    48.3        68.1        96.7    46.3        70.0        96.7
FaceID-GAN          97.2    99.5    53.3        71.3        97.0    50.2        72.3        97.4
Proposed 1x         97.6    99.6    56.2        75.1        97.7    50.4        73.9        97.7
Proposed 3x         97.5    99.7    56.7        75.5        97.7    53.9        72.2        97.8
(d) CelebA (full dataset, 202,599 images)
RAW                 97.6    99.6    53.5        73.8        97.7    48.7        73.0        97.5
StarGAN             97.7    99.6    55.0        74.2        97.7    53.0        73.8        97.6
FaceID-GAN          98.0    99.7    57.6        76.4        98.1    54.1        76.5        98.0
Proposed 1x         98.3    99.8    62.4        80.9        98.4    57.1        76.7        98.1
Proposed 3x         98.4    99.7    61.4        78.9        98.2    54.7        77.8        98.0

Fig. 6. Comparison of face verification accuracy on LFW for models trained on different augmented datasets. The horizontal spacing highlights the size of the raw training set sampled from CelebA.
In Sub-experiment (a), we randomly select three faces of each identity from CelebA to form the RAW training set, namely Sub-CelebA(3). The MobileFaceNet trained on raw Sub-CelebA(3) achieves a verification accuracy of 83.1% on LFW, a true accept rate (TAR) of 20.5% at FAR = 0.001 on SurveilFace-1, and a TAR of 18.0% at FAR = 0.001 on SurveilFace-2. After generating additional face images via DotFAN to double the size of the training set, the verification accuracy on LFW rises to 93.6%, and the TAR values on the SurveilFace datasets nearly double, as shown in the row named Proposed 1x. This experiment shows that DotFAN is effective in face data augmentation and outperforms StarGAN and FaceID-GAN significantly. Furthermore, when we augment additional faces to quadruple the size of the training set, i.e., Proposed 3x, we obtain only a minor improvement in verification accuracy compared to
Proposed 1x. This fact reflects that the marginal benefit a model can extract from the data diminishes as the number of samples increases when there is information overlap among the data, as reported in [7].

Fig. 7. Ablation study on loss terms. (a) Full loss, (b) w/o $\mathcal{L}_{id}$, (c) w/o $\mathcal{L}_{cls}$, (d) w/o $\mathcal{L}_{cycle}$, (e) w/o $\mathcal{L}_{pose}$, and (f) w/o $\mathcal{L}_{sym}$.

Consequently, Table III and Fig. 6 reveal three remarkable points. First, although the improvement in verification accuracy decreases as the size of the raw training set increases, DotFAN achieves a significant performance gain when augmenting a small-size face training set, as demonstrated by all (RAW, Proposed 1x) data pairs. Second, the results obey the law of diminishing marginal utility in economics, as demonstrated by all (Proposed 1x, Proposed 3x) data pairs. That is, a one-fold augmentation is adequate to enrich a small dataset, and the experiments suggest that the Proposed 3x procedure approaches the upper bound of data richness. Third, by integrating attribute controls on pose angle, illumination condition, and facial expression with an identity-preserving design, DotFAN outperforms StarGAN and FaceID-GAN in domain-transferred face augmentation tasks.
E. Ablation Study
In this section, we verify the effect brought by each loss term. Fig. 7 depicts the faces generated by using different combinations of loss terms. The top-most row shows faces generated with the full generator loss $\mathcal{L}_G$ described in Eq. (9), whereas the remaining rows respectively show synthetic results derived without one certain loss term.

As illustrated in Fig. 7(b), without $\mathcal{L}_{id}$, DotFAN fails to preserve the identity information although other facial attributes can be successfully retained. By contrast, without $\mathcal{L}_{cls}$, DotFAN cannot control the illumination condition, and the resulting faces all share the same shadow (see Fig. 7(c)). These two rows evidence that $\mathcal{L}_{cls}$ and $\mathcal{L}_{id}$ are indispensable in the DotFAN design. Moreover, Fig. 7(d) shows some unrealistic faces, e.g., a rectangular-shaped ear in the frontalized face; accordingly, $\mathcal{L}_{cycle}$ is important for photo-realistic synthesis. Finally, Fig. 7(e)-(f) show that $\mathcal{L}_{pose}$ and $\mathcal{L}_{sym}$ are complementary to each other. As long as either of them functions, DotFAN can generate faces of different pose angles. However, because $\mathcal{L}_{sym}$ is designed to learn only the mapping relationship between a $+x^{\circ}$ face and a $-x^{\circ}$ face while ignoring the background outside the face region, artifacts may occur in the background region if $\mathcal{L}_{sym}$ works alone, as shown in Fig. 7(e).

V. CONCLUSION
We proposed a Domain-transferred Face Augmentation Network (DotFAN) for generating a series of variants of an input face image based on the knowledge of disentangled facial representation distilled from large datasets. DotFAN is designed as a conditional CycleGAN with two extra subnetworks to learn the disentangled facial representation and produce a normalized face, so that it can effectively generate face images of various facial attributes while preserving the identity of synthetic images. Moreover, we proposed a pose-symmetric loss through which DotFAN can directly synthesize a pair of pose-symmetric face images at once. Extensive experiments demonstrate the effectiveness of DotFAN in augmenting small-size face datasets and improving their within-subject diversity. As a result, a better face recognition model can be learned from an enriched training set derived by DotFAN.

REFERENCES
[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. ACM SIGGRAPH, 1999.
[2] K. Cao, Y. Rong, C. Li, X. Tang, and C. C. Loy. Pose-robust face recognition via deep residual equivariant mapping. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5187-5196, 2018.
[3] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In Proc. IEEE Winter Conf. Appl. Comput. Vis., pages 1-9, 2016.
[4] S. Chen, Y. Liu, X. Gao, and Z. Han. MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices. In Proc. Chinese Conf. Biometric Recognit., pages 428-438, 2018.
[5] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 8789-8797, 2018.
[6] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Synthesizing normalized faces from facial identity features. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3703-3712, 2017.
[7] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie. Class-balanced loss based on effective number of samples. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9268-9277, 2019.
[8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 3150-3158, 2016.
[9] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4690-4699, 2019.
[10] G. D. Finlayson, S. D. Hordley, and M. S. Drew. Removing shadows from images. In Proc. European Conf. Comput. Vis., pages 823-836, 2002.
[11] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image Vis. Comput., 28(5):807-813, 2010.
[12] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Proc. Adv. Neural Inf. Process. Syst., pages 5767-5777, 2017.
[13] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In Proc. European Conf. Comput. Vis., pages 87-102, 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 770-778, 2016.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[16] Y. Hu, X. Wu, B. Yu, R. He, and Z. Sun. Pose-guided photorealistic face rotation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 8398-8406, 2018.
[17] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. 2008.
[18] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis. In Proc. IEEE Int. Conf. Comput. Vis., pages 2439-2448, 2017.
[19] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1125-1134, 2017.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learn. Represent., 2015.
[21] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1931-1939, 2015.
[22] E. H. Land and J. J. McCann. Lightness and retinex theory. J. Opt. Soc. Am., 61(1):1-11, 1971.
[23] T. Li, R. Qian, C. Dong, S. Liu, Q. Yan, W. Zhu, and L. Lin. BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network. In Proc. ACM Multimedia, pages 645-653, 2018.
[24] Y. Lu, Y.-W. Tai, and C.-K. Tang. Attribute-guided face generation using conditional CycleGAN. In Proc. European Conf. Comput. Vis., pages 282-297, 2018.
[25] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4838-4846, 2016.
[26] Y. Qian, W. Deng, and J. Hu. Unsupervised face normalization with extreme pose and expression in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 9851-9858, 2019.
[27] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[28] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In Proc. IEEE Winter Conf. Appl. Comput. Vis., pages 1-9, 2016.
[29] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 4030-4038, 2017.
[30] Y. Shen, P. Luo, J. Yan, X. Wang, and X. Tang. FaceID-GAN: Learning a symmetry three-player GAN for identity-preserving face synthesis. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 821-830, 2018.
[31] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 1415-1424, 2017.
[32] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 5265-5274, 2018.
[33] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):1968-1984, 2008.
[34] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[35] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In Proc. IEEE Int. Conf. Comput. Vis., pages 3990-3999, 2017.
[36] Z. Zhang, X. Chen, B. Wang, G. Hu, W. Zuo, and E. R. Hancock. Face frontalization using an appearance-flow-based convolutional neural network. IEEE Trans. Image Process., 28(5):2187-2199, 2018.
[37] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. Towards pose invariant face recognition in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 2207-2216, 2018.
[38] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE Int. Conf. Comput. Vis., pages 2223-2232, 2017.
[39] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 146-155, 2016.
[40] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 787-796, 2015.

APPENDIX
In this appendix, we show i) the architectures of DotFAN's generator, general facial encoder, and discriminator, ii) face examples in the SurveilFace dataset, iii) DotFAN's capability for disentangled face representation, and iv) face images generated by DotFAN's data augmentation process.
A. Model Architecture
Listed in Table S4, Table S5, and Table S6 are the network structures of DotFAN's encoder ($E$), generator ($G$), and discriminator ($D$), respectively. Specified below are the notations used in Tables S4-S6.
• H: height of the input image.
• W: width of the input image.
• N: number of output channels.
• K: kernel size.
• S: stride size.
• P: padding size.
• Batch: batch normalization layer.
• CONV: 2D convolution layer.
• FC: fully-connected layer.
• TRANSPOSECONV: 2D transpose convolution layer (for upsampling).
• AvgPool: average pooling layer.
Note that we set H = W = 112 in all our experiments.
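To make the notation concrete, a table row reading CONV with N=64, K=7, S=1, P=3 followed by Batch would correspond to the PyTorch module below; the input/output channel counts here are illustrative examples, not the paper's actual architecture (the table bodies were lost in extraction):

```python
import torch.nn as nn

# CONV (N=64, K=7, S=1, P=3) followed by a Batch normalization layer.
layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64,
              kernel_size=7, stride=1, padding=3),
    nn.BatchNorm2d(64),
)
```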
B. SurveilFace Dataset

Demonstrated in Figure S8 are face examples of the SurveilFace datasets. The two SurveilFace datasets were collected from a working-place surveillance system. Hence, the uncontrolled real working environments may yield face photos affected by various extreme conditions, such as strong backlight, motion blur, extreme shadows, or unconstrained viewpoints. These two datasets mimic real-world conditions and are thus suitable for evaluating face augmentation performance.
Fig. S8. Image samples of the SurveilFace dataset. Here we show four extreme conditions: (a) strong backlight, (b) motion blur, (c) extreme shadow, and (d) unconstrained viewpoint.
C. Disentangled Facial Representation
Figure S9 exhibits synthesized faces to show DotFAN's capability for disentangled face representation. By exploiting our attribute code $f = [l_x, f_{id}, f_p, f_l]$, this experiment aims to show that we can control the face synthesis result by manipulating $f$. In Figure S9, each row shows a sequence of faces. Each sequence was derived according to the convex combination, controlled by a scalar parameter $\alpha$, of two input attribute codes, i.e., $f_R$ of the right-most face and $f_L$ of the left-most. The sequence shown in the first row is derived from $f_L = [l_x^L, f_{id}^L, f_p^L, f_l]$ and $f_R = [l_x^R, f_{id}^R, f_p^R, f_l]$. With the illumination condition fixed, we show that both the hairdo and the identity information vary smoothly with $\alpha$. The second and third rows of Figure S9 show the face interpolation results of controlled pose codes $f_p$. Because both pose information and expression information are encoded into $f_p$, these two sequences evidence that we can control the face synthesis by editing even only a segment of $f$. Finally, the fourth row shows synthetic faces derived according to linearly interpolated illumination codes. Note that although the illumination code $f_l$ is a one-hot vector, DotFAN can still approximate a shadow that varies almost linearly.
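The convex combination used for these morphing sequences is straightforward to state in code (a sketch; representing the attribute codes as flat tensors is our simplification):

```python
import torch

def interpolate_codes(f_left: torch.Tensor, f_right: torch.Tensor,
                      steps: int = 8):
    """Yield f(alpha) = alpha * f_R + (1 - alpha) * f_L for alpha in [0, 1].
    Feeding each code to the generator produces one frame of the sequence."""
    for alpha in torch.linspace(0.0, 1.0, steps):
        yield alpha * f_right + (1.0 - alpha) * f_left
```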
D. Data Augmentation

Finally, demonstrated in Figure S10 are face augmentation examples derived by different methods.
Fig. S9. Disentangled facial representation. For each row, the morphing sequence is generated by using an attribute code $f(\alpha) = \alpha f_R + (1-\alpha) f_L$ with $0 \le \alpha \le 1$.

TABLE S4
ARCHITECTURE OF GENERAL FACIAL ENCODER E.
TABLE S5
ARCHITECTURE OF GENERATOR G.

TABLE S6
ARCHITECTURE OF DISCRIMINATOR D.

Fig. S10. Face augmentation examples derived by different methods.