ByeGlassesGAN: Identity Preserving Eyeglasses Removal for Face Images
Yu-Hui Lee and Shang-Hong Lai
Department of Computer Science, National Tsing Hua University, Taiwan
Microsoft AI R&D Center, Taiwan
[email protected], [email protected]

Abstract.
In this paper, we propose a novel image-to-image GAN framework for eyeglasses removal, called ByeGlassesGAN, which is used to automatically detect the position of eyeglasses and then remove them from face images. Our ByeGlassesGAN consists of an encoder, a face decoder, and a segmentation decoder. The encoder is responsible for extracting information from the source face image, and the face decoder utilizes this information to generate glasses-removed images. The segmentation decoder is included to predict the segmentation mask of the eyeglasses and the complete face region. The feature vectors generated by the segmentation decoder are shared with the face decoder, which facilitates better reconstruction results. Our experiments show that ByeGlassesGAN can provide visually appealing results in the eyeglasses-removed face images, even for semi-transparent color eyeglasses or glasses with glare. Furthermore, we demonstrate significant improvement in face recognition accuracy for face images with glasses by applying our method as a pre-processing step in our face recognition experiment.
Keywords:
Generative Adversarial Networks, Face Attributes Manipulation, Face Recognition
Introduction

Face recognition has been researched extensively and is widely used in our daily lives. Although state-of-the-art face recognition systems are capable of recognizing faces for practical applications, their accuracy is degraded when the face images are partially occluded, such as when wearing eyeglasses. An obvious reason for this problem is that the eyeglasses may occlude important information on faces, leading to discrepancies in facial feature values. For example, a thick glasses frame may block the eyes. Hence, in the past, researchers proposed to apply PCA-based methods [24,20] to remove eyeglasses from face images. However, PCA-based methods can only provide approximate glasses-removed images via face subspace projection. In addition, they were not really evaluated on diverse face recognition tasks.

Another reason for the degradation of face recognition accuracy with eyeglasses is that face images with eyeglasses are considerably fewer than glasses-free images. It is hard to make the recognition model learn the features of various kinds of eyeglasses.
Fig. 1. Examples of glasses removal by ByeGlassesGAN.

Recently, alongside the popularity of face attribute manipulation, some GAN-based methods, such as [6] and [30], improved the capability of recognizing faces with eyeglasses by synthesizing a large amount of images of faces with eyeglasses for training a face recognition model.

Different from the previous works, we aim at improving face recognition accuracy by removing eyeglasses with the proposed GAN model before face recognition. With the proposed GAN-based method, we can not only improve face recognition accuracy; the visually appealing glasses-removed images can also be used for some interesting applications, like applying virtual makeup.

The main contributions of this work are listed as follows:
1. We propose a novel glasses removal framework, which can automatically detect and remove eyeglasses from a face image.
2. Our proposed framework combines the mechanism of feature sharing between the two decoders to acquire better visual results, and an identity classifier to make sure the identity in the glasses-removed face image is well preserved.
3. We come up with a new data synthesis method to train a glasses removal network, which effectively simulates color lenses, glare from reflection, as well as refraction on eyeglasses.
4. In the experiments, we demonstrate that face recognition accuracy is significantly improved for faces with eyeglasses after applying the proposed eyeglasses removal method as a pre-processing step.
Related Work

Face attribute manipulation is a research topic that attracts a lot of attention. Along with the popularity of GANs, many impressive GAN-based methods have been proposed for editing face attributes. [16] and [30] edit face attributes through an attribute transformation network and a mask network. Both of them preserve the identity of the source images by using the predicted mask to constrain the editing area. AttGAN [7] edits face images through an attribute classification constraint and reconstruction learning. ELEGANT [25] can not only manipulate face attributes but also edit images according to the attributes of reference images. ERGAN [9] removes eyeglasses by switching features extracted from a face appearance encoder and an eye region encoder. Besides, there are several face attribute editing methods which are not GAN-based. For example,
DFI [23] manipulates face images through linear interpolation of the feature vectors of different attributes. [2] achieved identity-preserving face attribute editing by disentangling the identity and attribute vectors of face images with the mechanisms of Variational Autoencoders and GANs. However, these face attribute manipulation methods suffer from instability problems. For example, the identity may not be preserved or some artifacts may be generated.
Eyeglasses removal can also be seen as a face image completion problem. Recently, many deep learning works [21,29,11,26,17,13] have focused on image completion. Context Encoder [21] is the first deep-learning- and GAN-based inpainting method. After that, [11] significantly improved the quality of inpainting results by using both a global and a local discriminator, with one focusing on the whole image and the other focusing on the edited region. PartialConv [17] masks the convolution to reduce the discrepancy (e.g., in color) between the inpainted part and the non-corrupted part. Recently, there have also been some interesting interactive completion methods [28,13] which support free-form input. Users can easily control the inpainted result by adding desired sketches on the corrupted regions.

The main difference between the proposed ByeGlassesGAN and existing image completion methods is that our method does not require a predefined mask for the completion. In fact, the eyeglasses removal problem is not the same as image completion, because the glasses region could be either transparent or semi-transparent. Our method can exploit the original image in the glasses region to provide a better glasses-removed result. Besides, compared with the face attribute manipulation methods described above, our method can automatically remove the glasses and better preserve the face identity in the glasses-removed images.
Proposed Method

In this paper, we propose a multi-task learning method which aims at predicting the position of eyeglasses and removing them from the source image. Since we expect the eyeglasses-removed images to improve the performance of face recognition, the generated results of ByeGlassesGAN must look realistic and well preserve the identities of the source images.
Figure 2 illustrates the overall framework of our ByeGlassesGAN, which contains a generator, an identity extractor, and two discriminators. The generator (G) can be separated into three deep neural networks: an encoder (E), a face decoder (FD), and a segmentation decoder (SD). Here we assume the training data contains a set of face images (x) associated with the corresponding glasses-removed images (y) and the corresponding masks (m) of the eyeglasses region and the completed face shape.
Fig. 2. Framework of ByeGlassesGAN.
Each input image (x) is first fed into the Encoder to encode its feature vectors. The Face Decoder and the Segmentation Decoder then generate the glasses-removed image (ŷ) and the segmentation mask (m̂) of eyeglasses and face shape from the extracted vectors. Two discriminators are included to make sure both the whole generated image (ŷ) and the edited part (ŷ_Local) look realistic. An Identity Extractor is also applied to minimize the distance between the identity feature vectors computed from the output image (ŷ) and the ground truth image.

Given a source face image x, it first goes through the encoder, which encodes the feature vectors of image x. After that, we synthesize the glasses-removed image ŷ with the face decoder using the feature vectors mentioned above. Meanwhile, a segmentation decoder generates the binary mask m̂ of the glasses region. However, after testing this baseline model, we found that although it produces many good removal results, the removal quality may degrade when the eyeglasses are unusual or the face is not frontal. Hence, we wondered whether there exists a good representation of the face that can help remove eyeglasses. Since eyeglasses removal can be regarded as a kind of inpainting task on the face region, we can include the semantic segmentation mask of the face and eyeglasses regions in the framework. The segmentation of the face shape is an excellent hint to let FD know the characteristics of each pixel in the face region, which should maintain consistency with its neighboring pixels. In our experiments, we found that making SD predict the binary mask of the face shape as well greatly improves the glasses-removed results. Hence, we let SD predict not only the binary mask of eyeglasses, but also the mask of the face shape. Besides, the information obtained from SD is shared with FD via skip connections to guide FD in synthesizing images. Thus, we have

ŷ = FD(E(x))    (1)

and

m̂ = SD(E(x))    (2)

where m̂ is a 2-channel mask; one channel indicates the position of the glasses region, and the other is for the face shape. Furthermore, in order to ensure the quality of the synthetic output ŷ, we adopt a global and a local discriminator [11] to make sure both the synthetic image ŷ and the inpainted frame area ŷ_Local look realistic. Besides, we also include an Identity Extractor to minimize the distance between the identity feature vectors computed from the output image (ŷ) and the ground truth image (y).

The proposed ByeGlassesGAN is trained with an objective function consisting of four different types of loss functions, i.e., the adversarial loss, per-pixel loss, segmentation loss, and identity preserving loss. They are described in detail subsequently.
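Before detailing the losses, the data flow of Equations (1) and (2) can be summarized in a short PyTorch sketch. This is a minimal illustration, not the authors' released code; the module names E, FD, and SD are placeholders for the architecture described later.

```python
import torch

def generator_forward(E, FD, SD, x, m_g=None):
    """One generator forward pass, following Eq. (1) and Eq. (2).

    E, FD, SD : encoder, face decoder, and segmentation decoder modules.
    x         : batch of source face images, shape (B, 3, 256, 256).
    m_g       : optional ground-truth binary glasses mask, used to crop
                the local region seen by the local discriminator.
    """
    feats = E(x)                  # shared encoder features (with skips)
    y_hat = FD(feats)             # glasses-removed image, Eq. (1)
    m_hat = SD(feats)             # 2-channel mask (glasses, face), Eq. (2)
    y_local = y_hat * m_g if m_g is not None else None  # y_hat ⊙ m_g
    return y_hat, m_hat, y_local
```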
Adversarial Loss
In order to make the generated images as realistic as possible, we adopt the strategy of adversarial learning. Here we apply the objective function of LSGAN [19,31], since it makes the training process of GAN more stable than the standard adversarial loss. We adopt two kinds of GAN loss, L_GAN^Global and L_GAN^Local, for training the discriminators. Equation 3 shows the global adversarial loss L_D^Global:

L_D^Global = E_{y∼P_y}[(D_Global(y) − 1)²] + E_{x∼P_x}[(D_Global(ŷ))²]    (3)

When computing L_D^Local, we replace y, ŷ, and D_Global by y_Local, ŷ_Local, and D_Local in Equation 3, where y_Local = y ⊙ m_g and ŷ_Local = ŷ ⊙ m_g; ⊙ denotes the element-wise product operator, and m_g is the ground truth binary mask of the eyeglasses region. For training the generator, the GAN loss is shown below (Equation 4); when computing L_G^Local, we replace ŷ and D_Global by ŷ_Local and D_Local in Equation 4.

L_G^Global = E_{x∼P_x}[(D_Global(ŷ) − 1)²]    (4)
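The least-squares objectives in Equations (3) and (4) translate directly into code. The sketch below is a simplification under stated assumptions: D stands for either D_Global or D_Local, and the local terms are obtained by calling the same functions on the masked crops y ⊙ m_g and ŷ ⊙ m_g.

```python
import torch

def lsgan_d_loss(D, real, fake):
    # Eq. (3): (D(y) - 1)^2 + D(y_hat)^2, averaged over the batch.
    # fake is detached so generator gradients do not flow into D's update.
    return ((D(real) - 1) ** 2).mean() + (D(fake.detach()) ** 2).mean()

def lsgan_g_loss(D, fake):
    # Eq. (4): (D(y_hat) - 1)^2
    return ((D(fake) - 1) ** 2).mean()
```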
Per-pixel Loss

We compute the L1 distance between the generated image ŷ and the ground truth image y. The per-pixel loss enforces the output of the generator to be similar to the ground truth. We adopt two kinds of L1 loss, L_L1^Global and L_L1^Local. L_L1^Local is used for enhancing the removal ability of the generator in the edited region. The global L1 loss is given by

L_L1^Global = L1(y, ŷ) = E_{x∼P_x}[‖y − ŷ‖₁]    (5)

When computing L_L1^Local, we replace ŷ and y by ŷ_Local and y_Local in Equation 5.
Segmentation Loss
Since we expect ByeGlassesGAN to predict a segmentation mask which facilitates eyeglasses removal, we adopt the binary cross entropy loss for generating the segmentation mask of the eyeglasses region and the face shape. It is given by

L_Seg = E_{x∼P_x}[−(m · log(m̂) + (1 − m) · log(1 − m̂))]    (6)

where m̂ is the generated mask and m denotes the ground truth segmentation mask.
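A hedged PyTorch rendering of Equations (5) and (6) might look as follows; the tensor layout (2-channel masks with values in [0, 1], matching the Sigmoid output of SD) is our assumption from the text.

```python
import torch
import torch.nn.functional as F

def pixel_and_seg_losses(y_hat, y, m_hat, m, m_g):
    """Per-pixel terms of Eq. (5) and the segmentation term of Eq. (6).

    m_hat, m : predicted / ground-truth 2-channel masks (glasses, face).
    m_g      : ground-truth binary glasses mask for the local L1 term.
    """
    l1_global = F.l1_loss(y_hat, y)              # Eq. (5)
    l1_local = F.l1_loss(y_hat * m_g, y * m_g)   # local variant of Eq. (5)
    seg = F.binary_cross_entropy(m_hat, m)       # Eq. (6)
    return l1_global, l1_local, seg
```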
Identity Preserving

In order to preserve the identity information in the glasses-removed images, we employ an Identity Extractor (IE), which is in fact a face classifier. An identity distance loss is introduced to our generator, which is used to minimize the distance between IE(y) and IE(ŷ). Similar to the concept of perceptual loss, after extracting the features of y and ŷ through the identity extractor, we compute the mean square error between these two feature vectors, given by

L_ID = E_{x∼P_x, y∼P_y}[‖IE(ŷ) − IE(y)‖²]    (7)

This loss encourages the eyeglasses-removed image ŷ to share the same identity information with the ground truth image y in the feature space of the identity extractor model. Note that IE is a ResNet34 classifier pretrained on the UMDFaces dataset. When training the identity classifier, we transpose the output feature vector of layer 4 in the ResNet into a 512-dimensional vector and adopt the ArcFace [5] loss.

Finally, the overall loss function of the generator is given as follows:

L_G = λ1 L_G^Global + λ2 L_G^Local + λ3 L_L1^Global + λ4 L_L1^Local + λ5 L_Seg + λ6 L_ID    (8)

Network Architecture

Our GAN-based eyeglasses removal framework contains a generator, two discriminators, and an identity extractor. There are one encoder (E) and two decoders (a face decoder, FD, and a segmentation decoder, SD) in our generator. Following ELEGANT [25], the encoder (E) consists of 5 convolutional blocks, and each block contains a convolutional layer followed by an instance normalization layer and LeakyReLU activation. Both the face decoder (FD) and the segmentation decoder (SD) consist of 5 deconvolutional blocks and an output block. Each deconvolutional block contains a deconvolutional layer followed by an instance normalization layer and ReLU activation. The output block of the FD is a deconvolutional layer followed by Tanh activation, while the output block of the SD is a deconvolutional layer followed by Sigmoid activation. Since the only area expected to be modified in the source image is the region of eyeglasses, the other parts of the image should be kept unchanged. Here we adopt the U-Net [22] architecture for the generator of our ByeGlassesGAN. Skip connections are added between the corresponding layers of E-FD and E-SD. U-Net can considerably reduce the information loss compared with a common encoder-decoder. Besides, skip connections are also added between the corresponding layers of SD and FD, which make the information acquired from SD guide FD in reconstructing images. The network architecture used for the two discriminators is the PatchGAN proposed in pix2pix [12].
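The building blocks described above can be sketched in PyTorch as follows. This is our reading of the description (strided convolutional blocks with instance normalization and LeakyReLU, deconvolutional blocks with instance normalization and ReLU, Tanh/Sigmoid output blocks), not the authors' implementation; channel widths and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    # Encoder block: strided conv + instance norm + LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )

def dec_block(c_in, c_out, out_act=None):
    # Decoder block: deconv + instance norm + ReLU. Passing an output
    # activation (nn.Tanh() for FD, nn.Sigmoid() for SD) turns this into
    # the output block described above.
    layers = [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    if out_act is None:
        layers += [nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True)]
    else:
        layers += [out_act]
    return nn.Sequential(*layers)

# In the face decoder, the U-Net skip from E and the feature map of the
# corresponding SD layer are concatenated channel-wise before each block,
# e.g.: fd_in = torch.cat([fd_prev, e_skip, sd_feat], dim=1)
```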
Training Data Synthesis

Since we expect the proposed method to not only remove glasses but also help improve face recognition performance for images with eyeglasses, we need to make sure the detailed attributes (eye shape, eye color, skin color, etc.) of the glasses-removed face remain the same as those of the source image. The best way to achieve this would be to collect a large set of well-aligned images of subjects with and without glasses. Since this kind of paired data is difficult to collect, we instead generate well-aligned paired image data by synthetically adding eyeglasses onto real face images.

We use the CelebA [18] dataset to train the proposed ByeGlassesGAN in our experiments. CelebA is a dataset containing 202,599 images of 10,177 celebrities, annotated with 40 face attribute labels for each image. First, we align all images according to 5 facial landmarks (left/right eye, nose, left/right mouth corner) to the size of 256x256, and then roughly classify all images into 3 kinds of head pose (frontal, left-front, and right-front) using dlib and OpenCV. We manually label the binary masks of 1,000 images with eyeglasses in CelebA as our glasses pool (S_G), and use the rest of the images with glasses as our testing set. These binary masks precisely locate the position of the eyeglasses in each image, so we can make use of them to easily extract 1,000 different styles of glasses. After that, we randomly put glasses from the glasses pool onto each glasses-free image according to the head pose (Figure 3, top). In order to make the synthetic glasses images look more realistic, we randomly apply different levels of deformation around the outer side of the lenses to simulate the refraction of eyeglasses. After that, we dye the glasses lenses with various colors. In addition, we generate many semi-transparent light spot images of different shapes; these spots are applied to the glasses lenses to simulate the reflected glare on real eyeglasses. This step greatly improves the ability of our ByeGlassesGAN to handle realistic glasses images (Figure 3, bottom). Besides, to generate the segmentation mask for the face shape, we use BiSeNet [27] pre-trained on the CelebAMask-HQ [15] dataset to obtain the face shape masks of the glasses-free images. Finally, we obtain 184,862 pairs of data in total as the training dataset. We will release the glasses pool described above as a new dataset for future research related to eyeglasses detection, synthesis, or removal tasks.
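To make the pipeline concrete, the sketch below outlines the overlay step under simplifying assumptions: the glasses template, lens mask, glare spot, and tint are hypothetical inputs assumed to be pre-warped to the target head pose, the 0.3 lens opacity is an arbitrary illustrative value, and the refraction-style deformation around the outer lens is omitted.

```python
import numpy as np

def synthesize_glasses(face, glasses_rgba, lens_mask, spot, tint):
    """Paste a glasses template onto an aligned, glasses-free face.

    face         : 256x256x3 uint8 aligned face image.
    glasses_rgba : 256x256x4 template whose alpha comes from a labeled
                   mask in the glasses pool S_G.
    lens_mask    : 256x256 binary mask of the lens area of this template.
    spot         : 256x256x3 float image of a semi-transparent light spot
                   simulating glare.
    tint         : RGB triple used to dye the lenses.
    """
    out = face.astype(np.float32)
    alpha = glasses_rgba[..., 3:4] / 255.0
    out = out * (1 - alpha) + glasses_rgba[..., :3] * alpha       # frame
    lens = lens_mask[..., None].astype(np.float32)
    out = out * (1 - 0.3 * lens) + np.asarray(tint) * 0.3 * lens  # color lens
    out = np.clip(out + spot * lens, 0, 255)                      # glare
    return out.astype(np.uint8)
```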
Fig. 3. How to put on glasses?
We first label the head pose of all the images in our training set, and label 1,000 eyeglasses segmentation masks of the glasses images to form the glasses pool (S_G). Each glasses-free image in the training set can then have glasses from S_G randomly put on it according to the head pose label and the binary mask. After that, we also apply several photorealism steps to our synthetic images with eyeglasses.

Implementation Details

We implement our ByeGlassesGAN with PyTorch. We use Adam as the optimizer. For the hyperparameters in the loss function, we set λ1 = 1, λ2 = 1, λ3 = 100, λ4 = 200, λ5 = 3, and λ6 = 5. We train ByeGlassesGAN on a GTX 1080 with the batch size set to 16.

Qualitative Results

Figure 4 (top) shows the visual results of our method on the CelebA [18] dataset. All samples are real glasses images in the testing set. The identity of each sample visually remains the same. The generated segmentation masks are also able to point out the accurate region of the face shape. We also show some visual results on face images outside the CelebA dataset: Figure 4 (bottom) shows the results of our method on wild data. Our method can not only remove eyeglasses from delicate portraits of celebrities, but also from images taken by ordinary cameras on mobile phones or laptops. Besides, when we synthesize the training data, we take the head pose of each face image into consideration and generate training image pairs for faces of different poses. Hence, our method is able to deal with non-frontal face images as well. As shown in the bottom row, our method can remove not only the glasses frame but also the tinted lenses.

We also perform experiments on two other variants of the model: (A) a baseline model, where the segmentation decoder only predicts the binary mask of eyeglasses and there is no skip connection between FD and SD; and (B) a model where the segmentation decoder predicts the binary masks of both the eyeglasses and the face shape, but there is still no skip connection between FD and SD. As shown in Figure 5, when we predict the face shape mask in Experiment B, the removal results improve a lot, since the face shape mask shares some similar features with the removal result, compared to only predicting the glasses mask in Experiment A. Moreover, after we add skip connections between the two decoders, the segmentation decoder can better guide the face decoder. Sharing the features of the segmentation mask with the face decoder helps the edited region keep consistency with the neighboring skin region, especially when the face is not frontal or the glasses are not located in a typical way.
Fig. 4.
Eyeglasses-removed results. Top: results on CelebA. Bottom: results on wild data. Image credits: the MeGlass [6] dataset and photos taken by ourselves.
Fig. 5.
Some glasses removal results under different configurations: (A) the baseline model, (B) the baseline model with face shape prediction, and (C) the complete model of the proposed network.
Comparison with State-of-the-Art Methods

We compare the visual results of the proposed method with those of other state-of-the-art methods, including Pix2pix [12], CycleGAN [31], StarGAN [4], ELEGANT [25], ERGAN [9], and SaGAN [30]. For comparison, we simply utilize the source code released for the first five of these works without any change. For SaGAN, since there is no source code, we carefully implement it ourselves. Pix2pix is a method that needs paired training data, so we train the pix2pix model using the same data as mentioned in Section 4. CycleGAN, StarGAN, ELEGANT, and SaGAN are methods that adopt unsupervised learning, so we directly use the original CelebA dataset for training their eyeglasses removal networks. ERGAN is an unsupervised method developed for removing eyeglasses; we directly apply the model released by the authors to obtain its results. Figure 6 shows the removal results of the different methods. As shown in the figure, many artifacts appear in the pix2pix results. For the other five unsupervised methods, even though some results are visually appealing, it is still difficult for them to directly remove the glasses stably without generating any artifacts.
Fig. 6.
Eyeglasses removal results compared with the other six methods.
Fig. 7.
Extracting eyeglasses from the glasses-removed images. These extracted eyeglasses can be used for synthesizing training pairs.

For ELEGANT and ERGAN, both methods need an additional glasses-free image to guide the removal, so the removal results depend on the similarity between the input face image with glasses and the reference image without glasses. Besides, it is worth mentioning that, as the last 3 rows in Figure 6 show, our data synthesis method can effectively strengthen the ability of ByeGlassesGAN to remove reflected glare on the lenses.

Since our method is able to produce high-quality glasses-removed results in which the only edited part is the glasses area, we can easily extract the eyeglasses in the input image by applying a thresholding operation to the edited region, i.e., the difference between the input and output images. These extracted eyeglasses can also be used for synthesizing training pairs for glasses removal, synthesis, or detection tasks in the future (Figure 7).
Quantitative Evaluation

Following [25], we utilize the Fréchet Inception Distance (FID) [8] to measure how well our glasses removal method performs. FID represents the distance between the Inception embeddings of the real and the generated images, which reveals how closely the embeddings of images in the two domains are distributed in feature space. Table 1 shows the FID distances of the different methods. We perform experiments on 2 different datasets, CelebA [18] and MeGlass [6]. The real image set contains real glasses-free images. The generated image set consists of the glasses-removed images obtained by applying each of the different models to real face images with glasses. As shown in Table 1, our ByeGlassesGAN outperforms the others on both datasets. The FIDs of CelebA with the CycleGAN, StarGAN, and ELEGANT methods were reported in [25]. Besides, we also conduct an ablation study by removing the segmentation decoder branch; the perceptual quality of the glasses removal results is not as good as that with the segmentation decoder, as shown in Table 1.
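Given the means and covariances of Inception embeddings for the real and generated sets, FID is the Fréchet distance between the two fitted Gaussians; a minimal NumPy/SciPy sketch of the standard formula is shown below (the embedding extraction itself is omitted).

```python
import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1.dot(sigma2))
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```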
Table 1. FID distances of different methods applied on the MeGlass and CelebA datasets (lower is better).

Dataset | Pix2pix | CycleGAN | StarGAN | ELEGANT | SaGAN | ERGAN | ours w/o SD | ours
MeGlass | 39.93 | 29.40 | NULL | 41.09 | 44.94 | 38.25 | - | 28.26
CelebA | 50.38 | 48.82 | 142.35 | 60.71 | 50.06 | NULL | - | 44.76
User Study

To further compare our method with the others, we conduct a user study. We randomly select 7 portraits with glasses from the testing set and apply the different glasses removal methods to all of them, yielding 42 glasses-removed results in total. We invite 49 subjects to evaluate these images and compute the mean opinion score (MOS). As shown in Table 2, the glasses-removal results of our method are the most preferred, as they receive the highest MOS.
Table 2.
Mean opinion scores of the glasses-removed results of different methods. Our ByeGlassesGAN has the highest score.

Methods | Pix2pix | CycleGAN | StarGAN | ELEGANT | SaGAN | Ours
MOS | 2.43 | 3.23 | 2.06 | 1.82 | 2.65 | -
Face Recognition Experiments

In this section, we demonstrate the effect of using our glasses removal GAN as a pre-processing step for the face recognition task. First, we train a face recognition model on the whole UMDFaces dataset. UMDFaces [1] is a well-annotated dataset containing 367,888 images of 8,277 subjects, including faces both with and without eyeglasses. The face recognition model we use here is the PyTorch implementation of MobileFaceNets [3]. All the training images are resized to 112x112, and the embedding features extracted by the face recognition module are 128-dimensional, following the original setting of MobileFaceNets. We train the face recognition model on a GTX 1080 for
40 epochs with a batch size of 128. The recognition model achieves an accuracy of 98.3% on LFW [10].

For testing, we use the MeGlass [6] dataset, which is a subset of MegaFace [14]. The MeGlass testing set contains images of 1,710 subjects. For each subject, there are 2 images with eyeglasses, one for the gallery and the other for the probe. There are also 2 images without glasses for each subject, again one for the gallery and the other for the probe. We use 7 kinds of experimental protocols, listed below; in all of them, the images in the gallery are glasses-free faces.
– All images in the probe are glasses-free images.
– All images in the probe are images with glasses.
– All images in the probe are images with glasses, but we remove the glasses with different methods, including CycleGAN, SaGAN, ELEGANT, pix2pix, and our ByeGlassesGAN, before face recognition.
As shown in Table 3, when the images in the gallery and probe are all glasses-free, the face recognition model (Experiment M) described above achieves high accuracy on both the verification and identification tasks. However, if we change the probe to images with glasses, the accuracy degrades a lot.
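The TAR@FAR numbers reported below can be computed from verification scores in the usual way; this is a generic sketch of the metric, not the authors' evaluation code.

```python
import numpy as np

def tar_at_far(genuine, impostor, far=1e-3):
    """TAR at a given FAR from similarity scores (higher = more similar).

    genuine  : scores of same-identity pairs.
    impostor : scores of different-identity pairs; the decision threshold
               is chosen so that the fraction of accepted impostor pairs
               equals `far`.
    """
    impostor = np.sort(np.asarray(impostor))[::-1]   # descending
    k = max(int(far * len(impostor)), 1)
    threshold = impostor[k - 1]                      # accepts ~far impostors
    return float(np.mean(np.asarray(genuine) >= threshold))
```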
Table 3.
The effect of eyeglasses on face recognition: all the images in the gallery are glasses-free. The first column denotes which kind of images are in the probe.
Experiment M: The face recognition model used is MobileFaceNet with 112x112 input image size.
Experiment M | TAR@FAR=10^-3 | TAR@FAR=10^-4 | TAR@FAR=10^-5 | Rank-1
no glasses | 0.9129 | 0.8567 | 0.7673 | 0.9018
with glasses | 0.8509 | 0.7374 | 0.5708 | 0.8275
Due to the accuracy degradation for face images with eyeglasses shown in Table 3, we then apply glasses removal methods to remove the eyeglasses in the probe before face recognition. The quantitative results are shown in Table 4. As shown in rows 2 to 5 of Table 4, removing glasses with CycleGAN, SaGAN, ELEGANT, and pix2pix degrades the accuracy of face recognition, whereas removing eyeglasses with our ByeGlassesGAN improves the accuracy. Especially when the FAR is small, the improvement in TAR is more evident. Comparing the unpaired-training methods with our work may not be entirely fair, but for utilizing glasses removal in a face recognition task, paired training is the better strategy. Besides, we also train the proposed GAN model without L_ID; as shown in the "ours without IE" row, without the Identity Extractor the improvement in face recognition decreases, since there is no mechanism to constrain the generator from producing artifacts, which may act as noise for face recognition.

We also repeat the face recognition experiment using the Identity Extractor employed for training our ByeGlassesGAN as the recognition model. As shown in Table 5, when the FAR is 10^-5, we improve the TAR even more evidently, by about 6%.

Table 4.
Accuracy of face recognition: all the images in the gallery are glasses-free. The first column denotes the type of probe images, with or without applying a specific glasses removal pre-processing method.
Experiment M | TAR@FAR=10^-3 | TAR@FAR=10^-4 | TAR@FAR=10^-5 | Rank-1
no removal | 0.8509 | 0.7374 | 0.5708 | 0.8275
CycleGAN | 0.8298 | 0.7205 | 0.5329 | 0.7994
SaGAN | 0.8386 | 0.7257 | 0.5684 | 0.8088
ELEGANT | 0.7497 | 0.5977 | 0.3719 | 0.6994
pix2pix | 0.8444 | 0.7327 | 0.5251 | 0.8216
ours without IE | 0.8573 | 0.7626 | 0.5813 | 0.8358
ours | - | - | - | -
Table 5.
Accuracy of face recognition: all the images in the gallery are glasses-free. The first column denotes whether our glasses removal pre-processing method is applied to the probe images.
Experiment R: The face recognition model used here is the Identity Extractor used for training ByeGlassesGAN, with 256x256 input image size.
Experiment R | TAR@FAR=10^-3 | TAR@FAR=10^-4 | TAR@FAR=10^-5 | Rank-1
no removal | 0.8801 | 0.7830 | 0.6292 | 0.8538
ours | - | - | - | -
However, in practical applications, the probe may not contain only portraits with glasses, nor the gallery only glasses-free portraits. Hence, we conduct another face recognition experiment, described as follows:
– In the gallery: 1 glasses-free image and 1 image with glasses for each person.
– In the probe: 1 glasses-free image and 1 image with glasses for each person.
– For the no-removal experiment, we use the original images whether or not they contain eyeglasses.
– For the with-removal experiment, we apply eyeglasses removal with ByeGlassesGAN to all the images in both the probe and the gallery, whether or not they contain eyeglasses, before face recognition.
As shown in Table 6, applying glasses removal as a pre-processing step can still benefit face recognition even when glasses-free images are involved. When the FAR is 10^-5, we evidently improve the TAR by about 7%. This experiment not only demonstrates the effectiveness of our glasses removal method, but also reveals that when our method is applied to glasses-free images, the images remain almost the same, and the features and identity embeddings of the pre-processed face images are still well preserved.

Besides, to make sure that applying image synthesis before recognition does not harm facial features, we compute, in the feature space of the recognition model, the cosine distance between the features of the with-glasses portrait and the real glasses-free portrait of the same person, and the cosine distance between the features of the glasses-removed image and the real glasses-free image.

Table 6.
Face recognition accuracy when both the gallery and probe sets contain face images with and without glasses.
Experiment M | TAR@FAR=10^-3 | TAR@FAR=10^-4 | TAR@FAR=10^-5 | Rank-1
no removal | 0.8507 | 0.7516 | 0.5927 | 0.9175
with removal | - | - | - | -
We found that applying our glasses removal method effectively shortens this cosine distance for 1,335 out of the 1,710 image pairs in the MeGlass dataset. Given the improvement in face recognition and the reduction in cosine distance for almost 80% of the image pairs, we are confident that our method can not only produce visually appealing glasses-removed results, but that removing eyeglasses with our method is also a worthwhile preprocessing step for face recognition.
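The pairwise check above is straightforward to reproduce; a minimal sketch, assuming embedding vectors from the recognition model:

```python
import torch
import torch.nn.functional as F

def cosine_distance(f1, f2):
    # 1 - cosine similarity; the text compares d(f_glasses, f_clean)
    # against d(f_removed, f_clean) for each subject.
    return 1.0 - F.cosine_similarity(f1, f2, dim=-1)
```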
Conclusion

In this paper, we proposed a novel multi-task framework to automatically detect the eyeglasses area and remove it from a face image. We adopt an identity extractor to make sure the output of the proposed ByeGlassesGAN model preserves the same identity as that of the source image. As our GAN-based glasses removal framework can also predict the binary mask of the face shape, this spatial information is exploited to remove the eyeglasses from face images and achieve very realistic results. In the face recognition experiments, we showed that our method can significantly enhance the accuracy of face recognition, by about 7% in TAR@FAR=10^-5. However, there are still some limitations to our work; for example, we cannot generate convincing glasses removal results for some special glasses or when the lighting condition is very extreme.

With the advancement of face parsing methods, we believe that incorporating face parsing can effectively extend this work to other attribute removal tasks, such as removing a beard or a hat. In the future, we aim to make our ByeGlassesGAN more robust under special or extreme conditions, and to extend the proposed framework to other face attribute removal tasks.

References
1. Bansal, A., Nanduri, A., Castillo, C.D., Ranjan, R., Chellappa, R.: UMDFaces: An annotated face dataset for training deep networks. arXiv preprint arXiv:1611.01484v2 (2016)
2. Bao, J., Chen, D., Wen, F., Li, H., Hua, G.: Towards open-set identity preserving face synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6713–6722 (2018)
3. Chen, S., Liu, Y., Gao, X., Han, Z.: MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices. In: Chinese Conference on Biometric Recognition. pp. 428–438. Springer (2018)
4. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8789–8797 (2018)
5. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698 (2018)
6. Guo, J., Zhu, X., Lei, Z., Li, S.Z.: Face synthesis for eyeglass-robust face recognition. In: Chinese Conference on Biometric Recognition. pp. 275–284. Springer (2018)
7. He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: Arbitrary facial attribute editing: Only change what you want. arXiv preprint arXiv:1711.10678 (2017)
8. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems. pp. 6626–6637 (2017)
9. Hu, B., Yang, W., Ren, M.: Unsupervised eyeglasses removal in the wild. arXiv preprint arXiv:1909.06989 (2019)
10. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Tech. Rep. 07-49, University of Massachusetts, Amherst (October 2007)
11. Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) (4), 107 (2017)
12. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1125–1134 (2017)
13. Jo, Y., Park, J.: SC-FEGAN: Face editing generative adversarial network with user's sketch and color. arXiv preprint arXiv:1902.06838 (2019)
14. Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4873–4882 (2016)
15. Lee, C.H., Liu, Z., Wu, L., Luo, P.: MaskGAN: Towards diverse and interactive facial image manipulation. arXiv preprint arXiv:1907.11922 (2019)
16. Li, M., Zuo, W., Zhang, D.: Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586 (2016)
17. Liu, G., Reda, F.A., Shih, K.J., Wang, T.C., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 85–100 (2018)
18. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3730–3738 (2015)
19. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802 (2017)
20. Park, J.S., Oh, Y.H., Ahn, S.C., Lee, S.W.: Glasses removal from facial image using recursive error compensation. IEEE Transactions on Pattern Analysis and Machine Intelligence (5), 805–811 (2005)
21. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2536–2544 (2016)
22. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
23. Upchurch, P., Gardner, J., Pleiss, G., Pless, R., Snavely, N., Bala, K., Weinberger, K.: Deep feature interpolation for image content changes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7064–7073 (2017)
24. Wu, C., Liu, C., Shum, H.Y., Xu, Y.Q., Zhang, Z.: Automatic eyeglasses removal from face images. IEEE Transactions on Pattern Analysis and Machine Intelligence 26