Supervised COSMOS Autoencoder: Learning Beyond the Euclidean Loss!
Maneet Singh, Student Member, IEEE, Shruti Nagpal, Student Member, IEEE, Mayank Vatsa, Senior Member, IEEE, Richa Singh, Senior Member, IEEE, and Afzel Noore, Senior Member, IEEE
Abstract—Autoencoders are unsupervised deep learning models used for learning representations. In the literature, autoencoders have been shown to perform well on a variety of tasks spread across multiple domains, thereby establishing widespread applicability. Typically, an autoencoder is trained to generate a model that minimizes the reconstruction error between the input and the reconstructed output, computed in terms of the Euclidean distance. While this can be useful for applications related to unsupervised reconstruction, it may not be optimal for classification. In this paper, we propose a novel Supervised COSMOS Autoencoder which utilizes a multi-objective loss function to learn representations that simultaneously encode (i) the "similarity" between the input and reconstructed vectors in terms of their direction, and (ii) the "distribution" of pixel values of the reconstruction with respect to the input sample, while also incorporating (iii) "discriminability" in the feature learning pipeline. The proposed autoencoder model incorporates a Cosine similarity and Mahalanobis distance based loss function, along with supervision via a Mutual Information based loss. Detailed analysis of each component of the proposed model motivates its applicability for feature learning in different classification tasks. The efficacy of the Supervised COSMOS autoencoder is demonstrated via extensive experimental evaluations on different image datasets. The proposed model outperforms existing algorithms on the MNIST, CIFAR-10, and SVHN databases. It also yields state-of-the-art results on the CelebA, LFWA, Adience, and IJB-A databases for attribute prediction and face recognition.
Index Terms—Supervised autoencoder, Cosine similarity, Mahalanobis distance, Mutual information
1 INTRODUCTION
Traditionally, most classification tasks suffer from the inherent challenge of extracting representative features from the given data, followed by performing effective classification. In order to learn a robust classification model, the extracted features should be invariant to modifications in the input space, capture the distinct properties of the input samples, and be representative of the data. Research in this area has been progressing over the last few decades, with developments across different spectra of hand-crafted and learning based algorithms. The past decade has specifically witnessed several advancements in this area, with a large focus on deep learning techniques [1]. The majority of the research in deep learning focuses on the Convolutional Neural Network (CNN), and several advancements have been achieved with it. Additionally, other deep learning algorithms such as the Autoencoder, Deep Belief Network, and Deep Boltzmann Machine have also shown promise, and we believe that increased research focus may enable future growth of these algorithms. This research builds upon this philosophy and focuses on extending the capabilities of autoencoder based representation learning.

• M. Singh, S. Nagpal, M. Vatsa, and R. Singh are with IIIT-Delhi, India. A. Noore is with Texas A&M University, Kingsville, USA. E-mail: {maneets, shrutin, mayank, rsingh}@iiitd.ac.in, [email protected]

Autoencoders are unsupervised deep learning models, utilized for learning representations of the given input data [2]. For input data X, the loss function of a traditional single layer autoencoder is formulated as:

arg min_{W, W'} ||X − W'φ(WX)||²_F    (1)

where W, W' correspond to the encoding and decoding weights of the autoencoder model, respectively. X contains vectorized samples stacked column-wise. For example, if there are n samples, each with dimension [64 × 64 × 3], X corresponds to a matrix of dimension [12288 × n]. φ represents the activation function, which can correspond to a linear (unit) or non-linear activation such as sigmoid, tanh, or ReLU. The model learns the representation (φ(WX)) such that the Euclidean distance between the reconstruction (X̂ = W'φ(WX)) and the input sample (X) is minimized. Using the above equation, if the model learns a representation of dimension 3072, the encoding weights (W) have a dimension of [3072 × 12288], while the decoding weights (W') have dimension [12288 × 3072].

In the literature, the autoencoder and its variants have been shown to perform well on a variety of tasks such as face detection and recognition, object and speech recognition, as well as bio-medical applications [3], [4], [5], [6], [7]. Improvements have been proposed to the autoencoder model by introducing different regularization techniques, such as the ℓ1-norm and ℓ2-norm [8]. These techniques are often applied on the weight matrix and result in the following loss function:

arg min_{W, W'} ||X − W'φ(WX)||²_F + λR    (2)

where R corresponds to the additional regularization term, and λ refers to the regularization parameter. One of the most popular variants of the traditional autoencoder model is the Denoising Autoencoder, which learns features that are robust to noise in the input space [9]. Models such as the Contractive and Higher Order Contractive Autoencoder have also been proposed, which learn representations robust to different variations by localizing the input space [10], [11].

In an attempt to encode class specific information, researchers have also proposed different autoencoder architectures by incorporating class labels during the feature learning process. Regularization techniques such as the ℓ2,1-norm or group sparse regularizer introduce supervision in autoencoders [26]. Zheng et al. [27] proposed the Contrastive Autoencoder, which learns representations while reducing the inter-class variations.
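The traditional Euclidean-loss autoencoder of Equations (1)-(2) can be sketched as below. The dimensions, the tanh activation, and the random weights are illustrative assumptions, and training (gradient updates of W, W') is omitted; this is a forward-pass sketch, not the paper's implementation.

```python
import numpy as np

def forward(X, W, Wp, phi=np.tanh):
    """Single-layer autoencoder forward pass.

    X  : [m x n] matrix of n vectorized samples stacked column-wise.
    W  : [h x m] encoding weights; Wp : [m x h] decoding weights.
    Returns the representation phi(WX) and the reconstruction W' phi(WX).
    """
    H = phi(W @ X)      # learned representation
    X_hat = Wp @ H      # reconstruction
    return H, X_hat

def euclidean_loss(X, X_hat):
    """Squared Frobenius-norm reconstruction error ||X - X_hat||_F^2 (Eq. 1)."""
    return float(np.sum((X - X_hat) ** 2))

# Toy usage with hypothetical sizes: m=16 input dims, h=4 hidden units, n=8 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))
W = 0.1 * rng.standard_normal((4, 16))
Wp = 0.1 * rng.standard_normal((16, 4))
H, X_hat = forward(X, W, Wp)
loss = euclidean_loss(X, X_hat)
```

Minimizing this loss over W and W' (Eq. 1) drives the reconstruction toward a pixel-wise copy of the input, which is the behavior the proposed model revisits.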
TABLE 1: Brief literature review of autoencoder based formulations.
Authors | Approach | Supervised
Vincent et al. (2010) [9] | Stacked Denoising Autoencoder (SDAE): Noise is introduced at the input layer | No
Ng (2011) [8] | ℓ1-norm is introduced in the loss function of an AE to learn sparse representations | No
Rifai et al. (2011) [11] | Contractive AE (CAE): Added penalty term - Jacobian of input w.r.t. hidden layer | No
Hinton et al. (2011) [12] | Transforming Auto-encoder: combination of proposed capsules | Yes
Wang et al. (2014) [13] | Generalized AE: Incorporates the structure of the dataspace in the representation | No
Kingma et al. (2014) [14] | Variational AE: Generate synthetic data by learning the training data distribution | No
Zhang et al. (2015) [15] | Stacked Multichannel AE: Learns mapping to reduce gap b/w real & synthetic data | No
Gao et al. (2015) [16] | Mimics SDAE - probe is noisy input, and gallery images are expected output | Yes
Zhuang et al. (2015) [17] | Class labels are encoded in the final layer to incorporate supervision | Yes
Ghifary et al. (2015) [18] | Multi-task AE: Single representation has multiple outputs for different domains | No
Majumdar et al. (2017) [3] | L-CSSE: Incorporated a group sparse regularizer to learn class-specific features | Yes
Meng et al. (2017) [19] | Introduced relational term to model the relationship b/w the input data | No
Wang et al. (2017) [20] | FSAE: Incorporated feature selection in AE | Yes
Zhang et al. (2017) [21] | Conditional Adversarial AE: Generate identity specific data for age variations | Yes
Tran et al. (2017) [22] | Cascaded Residual AE: learns difference between input data and completed data | No
Sethi et al. (2018) [23] | R-Codean: Residual autoencoder with Cosine and Euclidean distance based loss function | No
Zeng et al. (2018) [24] | Coupled Deep AE: Learns features of LR and HR image patches, along with a mapping | Yes
Kodirov et al. (2018) [25] | Semantic AE: Additional constraint on decoder to reconstruct original visual feature | No

Gao et al. [16] proposed the Supervised Autoencoder, which utilized the class label information of the gallery and probe (same or different) to reduce the difference in representations. Singh et al. [28] presented a Class Representative Autoencoder which learns features while reducing the intra-class variations and increasing the inter-class variations. To the best of our knowledge, all existing models work with a Euclidean distance based autoencoder. Only recently, Sethi et al. [23] proposed a residual autoencoder which incorporates Cosine and Euclidean distance in the loss function of an autoencoder.

A separate area of research focuses on the use of the Variational Autoencoder (VAE) [14], [29] for the task of synthetic data generation. VAEs attempt to learn the training data distribution in order to generate synthetic data from it. While VAEs have gained significant attention over the past few years, it is important to note that they are primarily used for data generation, as opposed to learning representations for classification. Further, Hinton et al. [12] built upon the traditional autoencoder network and proposed Capsules for learning effective representations. Table 1 summarizes some of the recently proposed autoencoders and their variants. It is important to note that while researchers have focused on learning discriminative features useful for classification using autoencoders, the majority of the techniques focus on adding a penalty term along with the Euclidean distance based reconstruction error.
In this research, keeping the goal of classification in mind, we examine the philosophy of a Euclidean distance based autoencoder for the task of feature learning. We believe that while encoding feature vectors using the Euclidean loss generates representative features for reconstruction, it might not result in optimal features for classification. We propose a multi-objective loss function based formulation, termed as the Supervised COSine MahalanObiS (COSMOS) Autoencoder. The loss function of the proposed model aims at learning representations by encoding (i) the similarity between the input and reconstructed vectors in terms of their direction, and (ii) the distribution of pixel values of the reconstructed output with respect to the input sample, while incorporating (iii) discriminability in the feature learning process. This is achieved by building a model which incorporates Cosine similarity and Mahalanobis distance, along with an additional Mutual Information based penalty term for supervision. Cosine similarity is able to encode the direction variations between image vectors, while Mahalanobis distance attempts to model the pixel distributions, and Mutual Information introduces supervision. Detailed analysis of the proposed model, along with experimental results on benchmark datasets and challenging face analysis tasks, further highlights the usability of the proposed Supervised COSMOS autoencoder. In the following section, each component of the proposed model is explained in detail, along with the final proposed autoencoder model.

Fig. 1: Illustrating the limitations of using a Euclidean distance based autoencoder, along with the advantage of the proposed Supervised COSMOS autoencoder. Images (b)-(g) are sorted by increasing distance from image (a). (i) Euclidean Autoencoder; (ii) Proposed Supervised COSMOS Autoencoder.
2 PROPOSED SUPERVISED COSMOS AUTOENCODER
The loss function of a traditional autoencoder minimizes the mean squared error (Euclidean distance) between pixel values of the input and the reconstructed image. As shown in Fig. 1 (i), this may not necessarily yield the best weight vectors to classify an image with large variations compared to the training data. A pre-trained autoencoder (trained on faces) is used for extracting features from these images. The Euclidean distance between the representations of each image (b)-(g) and the first image (a) is calculated, and the images are then sorted by increasing distance. Based on the distances calculated, it is observed that the distance between representations (of the same individual) with large illumination variation is higher, as compared to features of different individuals under similar pose or illumination settings. A similar trend is observed in the image space as well. This implies that for an input image (a), a Euclidean distance based autoencoder (that is used for classification) may prefer having (b) or (c) at the reconstruction layer, i.e., images of different subjects with similar illumination, as opposed to (g), which is the same subject's image with minor illumination variation. This suggests that while Euclidean distance works well with images of similar distribution, different covariates of face recognition may affect the classification performance.

Fig. 2: Proposed COSMOS autoencoder with 3 hidden layers.

Inspired by these observations, in this research, we propose a multi-objective loss function for an autoencoder, which is able to learn representations while encoding (i) the "direction" variations between image vectors and (ii) the "distribution" of pixel values, while incorporating (iii) "supervision". The formulation of a traditional autoencoder is modified to incorporate two different distance metrics, Cosine and Mahalanobis. Both these metrics are more resilient to non-identically and non-independently distributed feature vectors. This enables the feature learning model to incorporate the direction and the magnitude of the loss between the input and its reconstruction. Since the aim of a classification pipeline is to obtain improved classification performance, we also incorporate supervision in the formulation of the proposed autoencoder. This is accomplished by using the Mutual Information (MI) between the original class labels and predicted labels as a penalty term in the loss function. If the mutual information is high, the dependence between the two vectors is high, thus resulting in good classification accuracy. This introduces discriminability during the feature learning process. As a toy example, Fig. 1 (ii) also presents the rank-list obtained by using a trained Supervised COSMOS autoencoder for feature extraction. It is motivating to observe that the proposed model encodes features invariant to changes in the input space. In contrast to the Euclidean distance based autoencoder (Fig. 1 (i)), features extracted by the proposed model are able to correctly match the probe against the given gallery set. We next describe the formulation of the proposed Supervised COSMOS Autoencoder, along with the optimization.
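The rank-list experiment of Fig. 1 amounts to nearest-neighbour ranking in the learned feature space. A minimal sketch follows; representing features as plain vectors (rather than autoencoder outputs) is an assumption for illustration.

```python
import numpy as np

def rank_gallery(probe_feat, gallery_feats):
    """Return gallery indices sorted by increasing Euclidean distance
    from the probe feature, i.e., a rank-list as in Fig. 1."""
    dists = np.linalg.norm(gallery_feats - probe_feat, axis=1)
    return np.argsort(dists).tolist()
```

The same ranking procedure applies regardless of which autoencoder produced the features; only the feature space changes between Fig. 1 (i) and (ii).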
Cosine similarity models the similarity between two vectors in terms of their direction variations. It calculates the similarity based on the relationship of the vector values, in contrast to the absolute magnitude difference between the two. Therefore, it has extensively been used in subspace learning algorithms that attempt to find vectors that best represent the given set of classes. In order to incorporate the first objective of encoding "direction information", we propose to utilize the Cosine similarity between the input and the output of an autoencoder, i.e.:

L_Cos(X, X̂) = ||X ⊙ (W'φ(WX))|| = (X · (W'φ(WX))) / (||X||_F × ||W'φ(WX)||_F)    (3)

L_Cos(X, X̂) represents the Cosine similarity between the input X and the reconstruction X̂, and ⊙ represents the Cosine similarity operator. An autoencoder model with Cosine similarity and regularizer R can be represented as:

arg min_{W, W'} (−||X ⊙ W'φ(WX)|| + λR)    (4)

where R is the regularizer and λ is the regularization constant. As opposed to a Euclidean distance based autoencoder, the above model does not attempt to replicate the pixel values of the input data at the reconstruction layer. Instead, it learns representations such that the relationship between the pixels at the reconstruction layer is similar to that of the input layer. For instance, the Cosine autoencoder would be invariant to illumination variations between the input and reconstruction.

The second objective of modeling the distribution of pixel values of the reconstruction with respect to the input sample is achieved by utilizing the Mahalanobis distance. The Mahalanobis distance accounts for the variability in the data distribution and is a unit-less, scale-invariant distance metric used to measure the distance between two given points.
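Equation (3) can be sketched per-sample as follows; the column-wise formulation and the small ε guard are assumptions added for numerical convenience, not part of the paper's notation.

```python
import numpy as np

def cosine_loss(X, X_hat, eps=1e-12):
    """Negative sum of column-wise Cosine similarities between the input X
    and the reconstruction X_hat (cf. Eqs. 3-4, without the regularizer).
    Minimizing this aligns the *direction* of each reconstructed sample
    with its input, irrespective of magnitude."""
    num = np.sum(X * X_hat, axis=0)                                  # per-sample dot products
    den = np.linalg.norm(X, axis=0) * np.linalg.norm(X_hat, axis=0) + eps
    return float(-np.sum(num / den))
```

Because the similarity is magnitude-invariant, a reconstruction that differs from the input only by a global intensity scaling (e.g., a uniform illumination change) incurs no extra loss.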
For input X and its reconstruction X̂, it is mathematically expressed as:

L_Mah(X, X̂) = ||X ⊕ (W'φ(WX))|| = (X − (W'φ(WX)))^T M (X − (W'φ(WX)))    (5)

where L_Mah represents the squared Mahalanobis (pseudo) distance between X and X̂ = W'φ(WX), and ⊕ represents the Mahalanobis distance operator. M represents a pseudo-distance matrix having the dimensions [m × m], where m corresponds to the vectorized dimension of the input sample. Traditionally, in Mahalanobis distance calculations, M is a symmetric positive semi-definite matrix; however, for minimizing the reconstruction error of the autoencoder model, these constraints are relaxed. From Equation 5, it can be seen that Euclidean distance is a special case of Mahalanobis distance, where M is an identity matrix. Thus, Mahalanobis distance is less constrained than Euclidean distance, encoding the distribution of the data as well. The autoencoder formulation with the Mahalanobis (pseudo) distance can be represented as:

arg min_{W, W', M} (||X ⊕ (W'φ(WX))|| + λR)    (6)

Minimizing the Mahalanobis distance ensures that the weight vectors are selected such that the distance between the input and its reconstruction is minimized when both are projected onto M. This implies that the learned representation encodes information invariant to minor manipulations of pixels such as rotation or illumination. We next combine the two objective functions and propose the COSMOS autoencoder, i.e.:

arg min_{W, W', M} (−||X ⊙ X̂|| + ||X ⊕ X̂|| + λR)    (7)

Fig. 2 presents a pictorial representation of the COSMOS autoencoder. The Cosine similarity and Mahalanobis distance based loss function facilitates learning of robust representations; however, it does not introduce discriminability with respect to the class labels. This is incorporated with the help of a Mutual Information based penalty term. The final objective of learning discriminative features is achieved via Mutual Information.
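A sketch of Equation (5) follows, with the special case M = I recovering the squared Euclidean distance; summing the per-column distances is an assumption about how the matrix form aggregates samples.

```python
import numpy as np

def mahalanobis_loss(X, X_hat, M):
    """Squared Mahalanobis (pseudo) distance (x - x_hat)^T M (x - x_hat),
    summed over the columns (samples) of X (cf. Eq. 5). M is [m x m];
    the symmetric positive semi-definite constraint on M is relaxed,
    as in the proposed formulation."""
    D = X - X_hat
    return float(np.sum(D * (M @ D)))   # sum_i d_i^T M d_i
```

With M = I the expression reduces term-by-term to the Euclidean reconstruction error, which is the sense in which Equation (6) is less constrained than Equation (1).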
Mutual Information has successfully been used in several image processing tasks, including image registration. Recently, it has been incorporated to encode supervision in the feature extraction process [30]. This leads to learning discriminative features which enhance the classification performance. Mutual Information (MI) is defined as:

MI(Y_P, Y_L) = Σ p(Y_P, Y_L) log( p(Y_P, Y_L) / (p(Y_P) p(Y_L)) )    (8)

where Y_P represents the predicted label, Y_L is the ground truth label of the input data, and p(x) is the probability of x. We propose to incorporate mutual information as a penalty term to introduce supervision in the autoencoder model. A traditional autoencoder with a mutual information based loss function can be represented as:

arg min_{W, W', ω} ( ||X − W'φ(WX)||²_F − λ1 MI(Y_P, Y_L; ω) + λ2 R )    (9)

where ω is the weight of the mutual information based classifier (ω^T φ(WX)). The mutual information between the ground truth label and the predicted label is encoded as a supervised regularizer. Since mutual information is a similarity term, it is added to the loss function with a negative sign. λ1 and λ2 are the regularization constants. Similarly, we incorporate supervision in the proposed COSMOS autoencoder, and the loss function of a single layer Supervised COSMOS Autoencoder can be written as:

arg min_{W, W', M, ω} ( −||X ⊙ W'φ(WX)|| + ||X ⊕ W'φ(WX)|| − λ1 MI(Y_P, Y_L; ω) + λ2 R )    (10)

where λ1 and λ2 are the regularization constants. Thus, the proposed Supervised COSMOS autoencoder builds over a traditional autoencoder by using a multi-objective loss function. It combines the Cosine similarity and Mahalanobis distance along with a Mutual Information based supervision loss for learning robust features for classification.
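The empirical estimate of Equation (8) from discrete label vectors can be sketched as below; the histogram-based plug-in estimator is an assumption, since the paper does not specify how the probabilities are estimated.

```python
import numpy as np

def mutual_information(y_pred, y_true):
    """Empirical mutual information between predicted and ground-truth
    labels (cf. Eq. 8), estimated from the joint label histogram."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    mi = 0.0
    for a in np.unique(y_pred):
        for b in np.unique(y_true):
            p_ab = np.mean((y_pred == a) & (y_true == b))   # joint probability
            if p_ab > 0:
                p_a = np.mean(y_pred == a)                   # marginals
                p_b = np.mean(y_true == b)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```

A perfect predictor attains MI equal to the label entropy, while predictions independent of the ground truth give MI ≈ 0, which is why the term enters Equations (9)-(10) with a negative sign.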
In the above mentioned formulation, the encoding and decoding weights are assumed to be tied, i.e., W' = W^T. The supervised layers of COSMOS are optimized using the alternating minimization approach [31]. It is a well established approach for the minimization of a function over multiple parameters. For the k-th iteration, the optimizations are as follows:

Step 1: Optimizing the weights of COSMOS (W):
W_k ← arg min_W L_Mah(X, X̂) − L_Cos(X, X̂) − λ1 MI(Y_P, Y_L; ω) + λ2 R    (11)

Step 2: Optimizing the pseudo-covariance matrix (M):
M_k ← arg min_M L_Mah(X, X̂)    (12)

Step 3: Optimizing the Mutual Information based classifier (ω):
ω_k ← arg min_ω −λ1 MI(Y_P, Y_L; ω)    (13)

The above three steps are repeated iteratively until the maximum number of iterations is reached or the model converges. ReLU activation is applied on each layer and dropout is used as a regularizer. The values of the regularization constants are computed experimentally by performing a grid search. In order to prevent the problem of vanishing gradients, skip connections [32] are added in the proposed Supervised COSMOS Autoencoder. A connection is added between each alternate encoding layer, which facilitates gradient flow at the time of feature learning.
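Alternating minimization [31] cycles over the parameter blocks, holding all but one fixed, as Steps 1-3 do for (W, M, ω). As a stand-in illustration of the same strategy (not the COSMOS objective itself, whose block updates are more involved), the sketch below alternates exact least-squares updates of two factors to minimize ||A − UV||²_F:

```python
import numpy as np

def alternating_minimization(A, r, iters=200, seed=0):
    """Minimize ||A - U V||_F^2 over U and V by alternating exact
    least-squares solves, analogous to cycling through parameter
    blocks in Eqs. 11-13."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((A.shape[0], r))
    V = None
    for _ in range(iters):
        V = np.linalg.lstsq(U, A, rcond=None)[0]         # fix U, solve for V
        U = np.linalg.lstsq(V.T, A.T, rcond=None)[0].T   # fix V, solve for U
    return U, V
```

Each block update can only decrease the objective, so the iterates converge monotonically; for an exactly rank-r matrix the reconstruction error generically shrinks toward zero.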
The proposed formulation of the Supervised COSMOS autoencoder is applied to object and face classification applications. Fig. 3 illustrates the pipeline adopted for the same. As shown in the image, the input image is tessellated into nine overlapping patches, which are provided as input to the Supervised COSMOS autoencoder to learn discriminative features. This is done in order to encode local features. The learned features are then classified using a 2-layer Neural Network of dimension [n, n], where n is the input feature size. Results from each local level are then combined using sum rule fusion. The regularization constants are updated adaptively. The proposed model is implemented in Theano using Adam optimization. The model is trained on a workstation with an Intel Xeon 2.6 GHz processor, 64 GB RAM, and an NVIDIA K40 GPU.

3 DATASETS AND EXPERIMENTAL PROTOCOLS
The performance of the proposed Supervised COSMOS autoencoder based framework is demonstrated on seven benchmark datasets. Details regarding each are provided below:
MNIST Dataset [33] has images of handwritten digits, 0 to 9, with dimensions 28 × 28. The training data contains 60,000 images pertaining to all 10 classes, whereas the test set comprises 10,000 images. Both the training and testing sets contain an equal number of samples from all classes.

CIFAR-10 Dataset [34] is a large image dataset of different object categories having dimensions 32 × 32 × 3. It consists of 60,000 RGB images corresponding to 10 different classes. The dataset is divided into training and testing partitions having 50,000 and 10,000 images, respectively. An equal number of samples from each class is ensured in the training and testing sets.

The Street View House Numbers (SVHN) Dataset [35] is a real world dataset of RGB images of dimension 32 × 32 × 3. It contains over 600,000 images of house numbers obtained from Google Street View images. It contains images of 10 classes, 0 to 9, which are centered around a single character. The database contains 73,257 digits for training, 26,032 digits for testing, and an additional 531,131 digits as an unsupervised training set.

CelebA Dataset [36] is a large scale face attribute dataset containing 20 images per subject for 10,000 subjects. Each image is annotated with 40 attributes and five landmark points. The images have large pose variations and background clutter, making the data challenging. The results are reported on the pre-defined protocol for attribute prediction.
Labeled Faces in the Wild Attributes (LFWA) Dataset [36] consists of 13,233 images of 5,749 subjects. The dataset is created by labeling attributes in images of the LFW dataset. Similar to the CelebA dataset, this dataset is also used for the task of attribute prediction for the 40 attributes annotated in each image.
Adience Dataset [37] contains 26,580 face images pertaining to 2,284 individuals. The images contain several variations across appearance, noise, pose, lighting, and capture devices. This dataset has primarily been used for predicting age and gender from face images. It contains labels pertaining to male and female, and eight different age groups. A pre-defined protocol for five fold cross-validation specifying the training and testing partitions has also been provided.
Fig. 3: Illustrating the steps involved for utilizing Supervised COSMOS for classification.

TABLE 2: Details of datasets used in this research along with the architectural details of the proposed model.
Database | Classes | Image Size | Patch Size | Total Images | Architecture of the proposed model
MNIST | 10 | 28 × 28 | 14 × 14 | |

TABLE 3: Comparison with state-of-the-art results on benchmark MNIST, CIFAR-10, and SVHN datasets.
Algorithm | MNIST | CIFAR-10 | SVHN (Classification Error, %)
AE: WTA AE [39] | 0.48 | 19.90 | 6.90
AE: Adversarial AE [40] | 0.85 | - | -
AE: Self-Paced AE [41] | 3.32 | - | -
AE: GSAE [26] | 1.10 | 22.6 | 7.6
CNN: DropConnect [42] | 0.21 | |
CNN: RCNN [45] | 0.31 | 8.69 |
CNN: MIM [46] | 0.31 | 8.52 | 1.97
CNN: FitNet [47] | 0.38 | 6.06 | -
CNN: Tuned CNN [48] | - | 6.37 | -
CNN: ResNet [32] | - | 6.43 | -
CNN: Wide-ResNet [49] | - | |
Proposed Framework | 0.21 | 5.35 | 1.08
IJB-A Dataset [38] contains 5,712 face images and 2,085 videosof 500 individuals. The images are captured with differentdevices in varied environment and pose variations. The pre-defined face identification protocol is used in the experiments.Table 2 summarizes the dataset details as well as the ar-chitecture of the proposed model. Experimental evaluation isperformed using the pre-defined protocols pertaining to eachdataset. All protocols ensure disjoint training and testing splits.
4 RESULTS AND ANALYSIS
The proposed Supervised COSMOS Autoencoder framework has been evaluated on three tasks: image classification, attribute prediction, and face recognition. Experiments have been performed on the benchmark MNIST, CIFAR-10, SVHN, CelebA, LFWA, Adience, and IJB-A datasets. Comparisons have been performed with state-of-the-art algorithms and other existing deep learning models. This is followed by an ablation study on the proposed framework, in order to understand the effect of each component. The following subsections discuss the results and observations across the experiments.

TABLE 4: Comparison with existing algorithms on CelebA and LFWA datasets. The reported accuracy is the mean classification accuracy obtained over all the attributes.
Architecture | CelebA | LFWA
AE: Sethi et al. [23] | 90.14 | 84.80
AE: Hou et al. [52] | 88.73 | -
CNN: Wang et al. [53] | 88.00 | 87.00
CNN: Zhong et al. [54] | 89.80 | 85.90
CNN: Rozsa et al. [55] | 90.80 | -
CNN: Rudd et al. [56] | 90.94 | -
CNN: Hand and Chellappa [57] | 91.26 | 86.30
CNN: Kalayeh et al. [58] | 91.80 | 87.13
CNN: He et al. [59] | 91.81 | 85.28
CNN: Wang et al. [60] | 92.00 | -
CNN: Han et al. [61] | 93.00 | 86.00
Proposed Framework | 94.14 | 88.26
Tables 3-7 present the classification performance of the proposed Supervised COSMOS framework for the three tasks of image classification, attribute prediction, and face recognition. Comparisons have also been performed with the current state-of-the-art techniques and other deep learning algorithms.
Image Classification:
Table 3 presents a comparison of the proposed Supervised COSMOS model with state-of-the-art (peer-reviewed) results reported on the MNIST, CIFAR-10, and SVHN datasets. In case an algorithm does not report results on a particular dataset, it is represented as '-'. It can be observed that the proposed model achieves improved or comparable performance on all three datasets as compared to state-of-the-art and other existing CNN based algorithms. For MNIST, Supervised COSMOS achieves an error of 0.21%, which is equivalent to the best reported result [42]. On the SVHN dataset, the proposed model achieves a classification error of 1.08%, thus reporting an improvement over the current state-of-the-art result. On the other hand, it achieves an error of 5.35% on the CIFAR-10 dataset, and is among the top-3 performing models on this dataset.
Attribute Prediction:
Classification accuracies of the proposed model for the CelebA and LFWA datasets are reported in Table 4. Pre-defined protocols are used to perform attribute prediction for the 40 annotated attributes in the CelebA and LFWA datasets, and gender and age classification on the Adience dataset. In the literature, Han et al. [61] obtain the best performance of 93.00% on the CelebA dataset. The proposed approach yields an improvement of around 1.1% over it, resulting in 94.14%. Similarly, on the LFWA dataset, the proposed approach obtains a mean accuracy of 88.26%, demonstrating an improvement over the existing state-of-the-art results presented by Kalayeh et al. [58]. These results illustrate the efficacy of the proposed model on large datasets, thereby encouraging the use of the proposed Supervised COSMOS framework.

Table 5 presents the gender and age classification accuracies on the Adience dataset. FDAR-NET [66] yields the best accuracy of 92.5% and 80.5% for gender and age classification, respectively. It can be observed that the proposed Supervised COSMOS autoencoder achieves a gender classification accuracy of 95.07%, and an age classification accuracy of 77.98%. The model improves the current state-of-the-art by 2.57% for gender classification, while achieving the second best results for age classification. The proposed Supervised COSMOS autoencoder also presents a reduced standard deviation across the five folds for both tasks. This shows that the model learns robust features for classification. Table 6 presents the confusion matrix obtained for gender classification. It can be observed that the proposed model performs well on both classes, without being biased towards either one.

Fig. 4: Score distribution of CelebA test samples for the best and worst performing attributes, Eye Glasses and Oval Face. Comparison can be performed across the traditional Euclidean distance based autoencoder, COSMOS, and Supervised COSMOS autoencoder. (a) Eye Glasses; (b) Oval Face.

TABLE 5: Classification accuracies (%) of existing algorithms and the proposed model on the Adience dataset.
Algorithm | Gender | Age
CNN: Levi and Hassner [62] | 86.8 |
CNN: VGG-Face + Attention [67] | 93.0 |
Proposed Framework | 95.07 | 77.98

TABLE 6: Confusion matrix of the Supervised COSMOS Autoencoder on the Adience dataset for gender classification.
Actual \ Predicted | Male | Female
Male | 94.68% | 5.32%
Female | 4.56% | 95.44%

TABLE 7: Identification results (%) on the IJB-A face dataset.
Algorithm | Rank-1 | Rank-10
CNN: DCNN Manual + Metric [68] | 0.852 |
CNN: VGGFace2 [72] | |
Face Recognition:
The proposed Supervised COSMOS framework is also evaluated on the IJB-A dataset for the task of face recognition. The IJB-A dataset is a part of IARPA's JANUS project, and is one of the most challenging face databases. As per standard practice, Table 7 presents the rank-1 and rank-10 identification accuracies on the IJB-A dataset. It can be observed that the proposed Supervised COSMOS autoencoder achieves among the best performing results on the IJB-A dataset. Compared to existing architectures at both ranks, the superior performance, along with the low standard deviation obtained across different folds, further promotes the usage of the proposed model for face identification tasks.
Comparison with Other Deep Learning Algorithms:
Tables3 - 7 can be analyzed to compare the performance of theproposed Supervised COSMOS Autoencoder framework with
TABLE 8: Ablation study on the proposed Supervised COSMOS autoencoder for the SVHN and CelebA datasets.
[Accuracy values were lost in extraction. The configurations evaluated on both CelebA and SVHN are:
- Effect of distance metric: Euclidean, Cosine, Mahalanobis
- Effect of supervision via MI: Euclidean+MI, Cosine+MI, Mahalanobis+MI
- Effect of combination: Euc.+Cos., Euc.+Maha., Cos.+Maha.]
TABLE 9: Accuracies (%) obtained with varying number of layers of the COSMOS model.

    No. of Layers    3       4       5       6       7       8
    CelebA         85.71   89.33   92.46   93.98   [remaining values lost in extraction]

other existing deep learning techniques, specifically autoencoder and convolutional neural network (CNN) based models. It is interesting to note that most of the top performing algorithms incorporate CNNs in their classification pipeline. The proposed technique is among the few autoencoder based frameworks that achieve improved or comparable performance relative to existing CNN models. It is our belief that the incorporation of supervision during the training of the proposed Supervised COSMOS facilitates the learning of discriminative yet representative features. The class specific characteristics encoded at the feature level are further accentuated while learning a classifier, thereby resulting in improved performance.
The proposed Supervised COSMOS Autoencoder is formulated using a multi-objective loss function combining Cosine similarity, Mahalanobis distance, and Mutual Information based supervision. In order to understand the effect of each objective function and their various combinations, an ablation study has been performed on the CelebA and SVHN datasets. Table 8 presents the performance of the different components of the proposed framework.
Effect of Distance Metric:
The first set of experiments is performed to evaluate autoencoder models built using different distance metrics, i.e., Euclidean distance, Cosine similarity, and Mahalanobis distance (Equations 1, 4, and 6). It is observed that the Cosine similarity and Mahalanobis distance based autoencoders yield improved classification performance compared to the traditional Euclidean distance based autoencoder. This can be attributed to the fact that while an autoencoder with a Euclidean distance loss function attempts to replicate the input at the reconstruction layer, Cosine similarity and Mahalanobis distance based loss functions focus on features that are invariant to minute rotation and illumination variations. This affirms our hypothesis that Euclidean distance based autoencoders might not be best suited for classification tasks.
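Equations 1, 4, and 6 are not reproduced in this excerpt; the following is a minimal numpy sketch of the three per-sample reconstruction penalties under common definitions (the exact normalization and the covariance used for the Mahalanobis term are assumptions, not the paper's formulation):

```python
import numpy as np

def euclidean_loss(x, x_hat):
    """Squared Euclidean reconstruction error: penalizes any
    pixel-wise deviation of the reconstruction from the input."""
    return float(np.sum((x - x_hat) ** 2))

def cosine_loss(x, x_hat, eps=1e-8):
    """1 - cosine similarity: penalizes only a mismatch in
    'direction', and is invariant to the scale of the vectors
    (e.g. a global illumination change)."""
    cos = np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat) + eps)
    return 1.0 - cos

def mahalanobis_loss(x, x_hat, cov_inv):
    """Mahalanobis distance between input and reconstruction,
    weighting the residual by an inverse covariance estimated
    from the training data ('distribution' information)."""
    d = x - x_hat
    return float(d @ cov_inv @ d)
```

Note that `cosine_loss(x, 2 * x)` is (numerically) zero while `euclidean_loss(x, 2 * x)` is not, which is the scale-invariance property the paragraph above appeals to; with `cov_inv` set to the identity, the Mahalanobis term reduces to the squared Euclidean one.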
Effect of Supervision via MI:
The next set of experiments analyzes the effect of Mutual Information (MI) based supervision (Eq. 8) in the autoencoder model. An MI based penalty term is added to the autoencoder formulations built using Euclidean distance, Cosine similarity, and Mahalanobis distance, independently (Table 8). The addition of MI based supervision leads to an improvement in accuracy [exact margin lost in extraction], except with Cosine similarity, where the accuracy on SVHN reduces slightly. This strengthens our claim that incorporating MI based supervision during feature learning helps improve classification performance by facilitating the learning of discriminative representations.

Effect of Distance Metric Combination:
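Eq. 8 is not shown in this excerpt; purely as an illustration, a generic histogram-based estimate of the mutual information between a discretized feature and the class labels can be written as follows (the paper's actual MI formulation may differ):

```python
import numpy as np

def mutual_information(f_bins, labels):
    """Plug-in estimate of I(F; Y) between a discretized feature
    f_bins and class labels, in nats, from empirical joint and
    marginal frequencies."""
    mi = 0.0
    for f in np.unique(f_bins):
        for y in np.unique(labels):
            p_fy = np.mean((f_bins == f) & (labels == y))  # joint
            if p_fy > 0:
                p_f = np.mean(f_bins == f)   # marginal of feature
                p_y = np.mean(labels == y)   # marginal of label
                mi += p_fy * np.log(p_fy / (p_f * p_y))
    return mi
```

A supervision term of this kind would enter the objective with a negative weight, so that minimizing the total loss maximizes the dependence between the learned representation and the class labels: a feature identical to the labels attains I(F; Y) = H(Y), while a constant (uninformative) feature attains zero.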
Fig. 5: Classification performance variation before and after incorporating tessellation in the proposed framework.

The third set of experiments enables us to understand the effect of combining distance metrics in the loss function of an autoencoder. It can be observed that the autoencoder utilizing a Cosine similarity and Mahalanobis distance (COSMOS) based loss function outperforms the other combinations, as well as the individual loss functions, on both datasets. The COSMOS autoencoder yields higher classification accuracy than the other combinations on CelebA and SVHN [exact margins lost in extraction]. This also affirms our hypothesis that "direction" and "distribution" information can jointly help extract better features. Fig. 4 presents the score distributions of the best and worst performing attributes, Eye Glasses and Oval Face, of the CelebA dataset. The plots for a specific attribute can be analyzed to observe the progressive improvement of the distributions. For the Eye Glasses attribute, classification via the Euclidean distance based autoencoder results in a minor overlap of scores between the two classes, which is almost eliminated with the proposed Supervised COSMOS autoencoder. A similar trend is observed for the Oval Face attribute, where the traditional autoencoder suffers from a large overlap, which is significantly reduced with the proposed Supervised COSMOS model.
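As a sketch of how the two distance terms might be combined into a single COSMOS reconstruction penalty (the weights `alpha` and `beta` and the covariance are hypothetical; the paper's equations define the exact form):

```python
import numpy as np

def cosmos_loss(x, x_hat, cov_inv, alpha=1.0, beta=1.0, eps=1e-8):
    """Illustrative weighted sum of the 'direction' (Cosine) and
    'distribution' (Mahalanobis) penalties for one sample."""
    # direction term: 1 - cosine similarity
    cos_term = 1.0 - np.dot(x, x_hat) / (
        np.linalg.norm(x) * np.linalg.norm(x_hat) + eps)
    # distribution term: covariance-weighted residual
    d = x - x_hat
    maha_term = float(d @ cov_inv @ d)
    return alpha * cos_term + beta * maha_term
```

A perfect reconstruction drives both terms to (numerically) zero, while a reconstruction pointing in the opposite direction is penalized by both.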
Effect of Number of Layers, Initialization, and Tessellation:
Table 9 demonstrates the effect of varying the number of layers of the Supervised COSMOS model. The best results for the CelebA dataset are obtained with seven hidden layers, and a reduction in accuracy is observed as we go further. Similarly, the best results for the SVHN dataset are obtained using six hidden layers. Fig. 5 presents the accuracy obtained by Supervised COSMOS with and without tessellation on the SVHN and CelebA datasets. It can be observed that incorporating tessellation improves the performance of the proposed framework by around 2%. Models learned on the image patches are able to encode local information about the image, while the full-image network focuses more on learning global features. Combining information from both components results in a holistic feature representation, which in turn enhances the classification performance.

Fig. 6: Sample images of CelebA correctly classified only by the proposed Supervised COSMOS model.

Fig. 6 presents some images of the CelebA dataset that were correctly classified by the proposed Supervised COSMOS model only. Upon observing these samples closely, we see that the proposed model handles pose as well as illumination variations, and learns features robust to such variations. These samples further demonstrate the efficacy of the proposed model and motivate its use for robust feature extraction in classification.
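The patch-plus-global scheme described above can be sketched as follows (the grid size and the way features are combined are illustrative assumptions; the paper's tessellation parameters are not given in this excerpt):

```python
import numpy as np

def tessellate(img, rows=2, cols=2):
    """Split an image into a grid of non-overlapping patches.
    Each patch is later encoded separately to capture local detail."""
    h, w = img.shape[:2]
    ph, pw = h // rows, w // cols
    return [img[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            for r in range(rows) for c in range(cols)]

def holistic_feature(global_feat, patch_feats):
    """Concatenate the full-image feature with the per-patch
    features, combining global and local information into one
    representation for the classifier."""
    return np.concatenate([global_feat] + list(patch_feats))
```

Here the patch networks supply the local information and the full-image network the global structure; the concatenated vector is what would be fed to the classification stage.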
CONCLUSION
Over the past decade, researchers have actively pursued deep learning in order to learn robust features and effective classifiers. Deep learning has been shown to perform well on several tasks; however, the majority of the research has focused on the specific paradigm of Convolutional Neural Networks. While CNNs have been well studied, it is our belief that other paradigms should also be pursued in order to develop competitive algorithms. Another promising deep learning paradigm is the autoencoder, which learns representative features of the input. In this research, we have developed a novel autoencoder formulation, termed the Supervised COSMOS autoencoder, which learns features specifically for the task of classification. The proposed autoencoder has a multi-objective loss function that incorporates (i) Cosine similarity to encode "direction" information, (ii) Mahalanobis distance to encode "distribution" information of the input with respect to the reconstruction, and (iii) Mutual Information based supervision in order to learn discriminative features. This enables the model to learn supervised features invariant to minor variations in illumination and rotation. Experimental evaluations on image classification, attribute prediction, and face recognition showcase the versatility of the proposed approach. State-of-the-art results are obtained on standard benchmark datasets, including MNIST, CIFAR-10, SVHN, CelebA, LFWA, Adience, and IJB-A, demonstrating the effectiveness of the proposed Supervised COSMOS autoencoder.
ACKNOWLEDGMENTS
This research is partly supported by MEITY (Govt. of India). M. Singh, M. Vatsa, and R. Singh are partly supported by Infosys CAI at IIIT-Delhi, and S. Nagpal is partly supported via the TCS PhD Fellowship.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, 2015.
[2] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[3] A. Majumdar, R. Singh, and M. Vatsa, "Face recognition via class sparsity based supervised encoding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1273-1280, 2017.
[4] J. Xu, L. Xiang, Q. Liu, H. Gilmore, J. Wu, J. Tang, and A. Madabhushi, "Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images," IEEE Transactions on Medical Imaging, vol. 35, no. 1, pp. 119-130, 2016.
[5] W. Xu, H. Sun, C. Deng, and Y. Tan, "Variational autoencoder for semi-supervised text classification," in AAAI Conference on Artificial Intelligence, 2017, pp. 3358-3364.
[6] S. Zhai and Z. M. Zhang, "Semisupervised autoencoder for sentiment analysis," in AAAI Conference on Artificial Intelligence, 2016, pp. 1394-1400.
[7] J. Zhang, S. Shan, M. Kan, and X. Chen, "Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment," in European Conference on Computer Vision, 2014, pp. 1-16.
[8] A. Ng, "Sparse autoencoder," CS294A Lecture notes, vol. 72, pp. 1-19, 2011.
[9] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[10] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, "Higher order contractive auto-encoder," in European Conference on Machine Learning and Knowledge Discovery in Databases, 2011, pp. 645-660.
[11] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, "Contractive auto-encoders: Explicit invariance during feature extraction," in International Conference on Machine Learning, 2011, pp. 833-840.
[12] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in International Conference on Artificial Neural Networks, 2011, pp. 44-51.
[13] W. Wang, Y. Huang, Y. Wang, and L. Wang, "Generalized autoencoder: A neural network framework for dimensionality reduction," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 496-503.
[14] D. P. Kingma and M. Welling, "Stochastic gradient VB and the variational auto-encoder," in International Conference on Learning Representations, 2014.
[15] X. Zhang, Y. Fu, S. Jiang, L. Sigal, and G. Agam, "Learning from synthetic data using a stacked multichannel autoencoder," in International Conference on Machine Learning and Applications, 2015, pp. 461-464.
[16] S. Gao, Y. Zhang, K. Jia, J. Lu, and Y. Zhang, "Single sample face recognition via learning deep supervised autoencoders," IEEE Transactions on Information Forensics and Security, vol. 10, pp. 2108-2118, 2015.
[17] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, "Supervised representation learning: Transfer learning with deep autoencoders," in International Joint Conference on Artificial Intelligence, 2015, pp. 4119-4125.
[18] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, "Domain generalization for object recognition with multi-task autoencoders," in IEEE International Conference on Computer Vision, 2015, pp. 2551-2559.
[19] Q. Meng, D. Catchpoole, D. Skillicom, and P. J. Kennedy, "Relational autoencoder for feature extraction," in International Joint Conference on Neural Networks, 2017, pp. 364-371.
[20] S. Wang, Z. Ding, and Y. Fu, "Feature selection guided auto-encoder," in AAAI Conference on Artificial Intelligence, 2017.
[21] Z. Zhang, Y. Song, and H. Qi, "Age progression/regression by conditional adversarial autoencoder," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[22] L. Tran, X. Liu, J. Zhou, and R. Jin, "Missing modalities imputation via cascaded residual autoencoder," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[23] A. Sethi, M. Singh, R. Singh, and M. Vatsa, "Residual codean autoencoder for facial attribute analysis," Pattern Recognition Letters, 2018, doi: 10.1016/j.patrec.2018.03.010.
[24] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao, "Coupled deep autoencoder for single image super-resolution," IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 27-37, 2017.
[25] E. Kodirov, T. Xiang, and S. Gong, "Semantic autoencoder for zero-shot learning," in IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
[26] A. Sankaran, M. Vatsa, R. Singh, and A. Majumdar, "Group sparse autoencoder," Image and Vision Computing, vol. 60, pp. 64-74, 2017.
[27] X. Zheng, Z. Wu, H. Meng, and L. Cai, "Contrastive auto-encoder for phoneme recognition," in International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 2529-2533.
[28] M. Singh, S. Nagpal, R. Singh, and M. Vatsa, "Class representative autoencoder for low resolution multi-spectral gender classification," in International Joint Conference on Neural Networks, 2017, pp. 1026-1033.
[29] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in International Conference on Machine Learning, 2016.
[30] J. J.-Y. Wang, Y. Wang, S. Zhao, and X. Gao, "Maximum mutual information regularized classification," Engineering Applications of Artificial Intelligence, vol. 37, pp. 1-8, 2015.
[31] W. Byrne, "Alternating minimization and Boltzmann machine learning," IEEE Transactions on Neural Networks, vol. 3, no. 4, pp. 612-620, 1992.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[33] Y. LeCun and C. Cortes, "MNIST handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
[34] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1958-1970, 2008.
[35] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[36] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE International Conference on Computer Vision, 2015, pp. 3730-3738.
[37] E. Eidinger, R. Enbar, and T. Hassner, "Age and gender estimation of unfiltered faces," IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170-2179, 2014.
[38] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1931-1939.
[39] A. Makhzani and B. J. Frey, "Winner-take-all autoencoders," in Advances in Neural Information Processing Systems, 2015, pp. 2791-2799.
[40] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, "Adversarial autoencoders," in International Conference on Learning Representations, 2016.
[41] T. Yu, C. Guo, L. Wang, S. Xiang, and C. Pan, "Self-paced autoencoder," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 1054-1058, 2018.
[42] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using dropconnect," in International Conference on Machine Learning, vol. 28, no. 3, 2013, pp. 1058-1066.
[43] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3642-3649.
[44] C. Lee, P. W. Gallagher, and Z. Tu, "Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree," in International Conference on Artificial Intelligence and Statistics, 2016, pp. 464-472.
[45] M. Liang and X. Hu, "Recurrent convolutional neural network for object recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367-3375.
[46] Z. Liao and G. Carneiro, "On the importance of normalisation layers in deep learning with piecewise linear activation units," in IEEE Winter Conference on Applications of Computer Vision, 2016.
[47] D. Mishkin and J. Matas, "All you need is a good init," CoRR, vol. abs/1511.06422, 2015. [Online]. Available: http://arxiv.org/abs/1511.06422
[48] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, "Scalable Bayesian optimization using deep neural networks," in International Conference on Machine Learning, 2015, pp. 2171-2180.
[49] S. Zagoruyko and N. Komodakis, "Wide residual networks," in British Machine Vision Conference, 2016, pp. 87.1-87.12.
[50] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[51] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, 2017, pp. 3856-3866.
[52] X. Hou, L. Shen, K. Sun, and G. Qiu, "Deep feature consistent variational autoencoder," in IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 1133-1141.
[53] J. Wang, Y. Cheng, and R. S. Feris, "Walk and learn: Facial attribute representation learning from egocentric video and contextual data," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2295-2304.
[54] Y. Zhong, J. Sullivan, and H. Li, "Leveraging mid-level deep representations for predicting face attributes in the wild," in IEEE International Conference on Image Processing, 2016, pp. 3239-3243.
[55] A. Rozsa, E. M. Rudd, and T. E. Boult, "Adversarial diversity and hard positive generation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 25-32.
[56] E. M. Rudd, M. Günther, and T. E. Boult, "MOON: A mixed objective optimization network for the recognition of facial attributes," in European Conference on Computer Vision, 2016, pp. 19-35.
[57] E. Hand and R. Chellappa, "Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification," in AAAI Conference on Artificial Intelligence, 2017.
[58] M. M. Kalayeh, B. Gong, and M. Shah, "Improving facial attribute prediction using semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[59] K. He, Y. Fu, W. Zhang, C. Wang, Y.-G. Jiang, F. Huang, and X. Xue, "Harnessing synthesized abstraction images to improve facial attribute recognition," in International Joint Conference on Artificial Intelligence, 2018, pp. 733-740.
[60] F. Wang, H. Han, S. Shan, and X. Chen, "Deep multi-task learning for joint prediction of heterogeneous face attributes," in IEEE International Conference on Automatic Face and Gesture Recognition, 2017, pp. 173-179.
[61] H. Han, A. K. Jain, S. Shan, and X. Chen, "Heterogeneous face attribute estimation: A deep multi-task learning approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, doi: 10.1109/TPAMI.2017.2738004.
[62] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 34-42.
[63] R. Rothe, R. Timofte, and L. J. V. Gool, "DEX: Deep expectation of apparent age from a single image," in IEEE International Conference on Computer Vision Workshop, 2015, pp. 252-257.
[64] J. v. d. Wolfshaar, M. F. Karaaba, and M. A. Wiering, "Deep convolutional neural networks and support vector machines for gender recognition," in IEEE Symposium Series on Computational Intelligence, 2015, pp. 188-195.
[65] G. Ozbulak, Y. Aytar, and H. K. Ekenel, "How transferable are CNN-based features for age and gender classification?" in International Conference of the Biometrics Special Interest Group, 2016.
[66] H. Liu, X. Shen, and H. Ren, "FDAR-Net: Joint convolutional neural networks for face detection and attribute recognition," in International Symposium on Computational Intelligence and Design, vol. 2, 2016, pp. 184-187.
[67] P. Rodríguez, G. Cucurull, J. M. Gonfaus, F. X. Roca, and J. González, "Age and gender recognition in the wild with deep attention," Pattern Recognition, vol. 72, pp. 563-571, 2017.
[68] J. C. Chen, R. Ranjan, A. Kumar, C. H. Chen, V. M. Patel, and R. Chellappa, "An end-to-end system for unconstrained face verification with deep convolutional neural networks," in IEEE International Conference on Computer Vision Workshop, 2015, pp. 360-368.
[69] J. Yang, P. Ren, D. Chen, F. Wen, H. Li, and G. Hua, "Neural aggregation network for video face recognition," CoRR, vol. abs/1603.05474, 2016.
[70] L. Xiong, J. Karlekar, J. Zhao, J. Feng, S. Pranata, and S. Shen, "A good practice towards top performance of face recognition: Transferred deep feature fusion," CoRR, vol. abs/1704.00438, 2017. [Online]. Available: http://arxiv.org/abs/1704.00438
[71] R. Ranjan, C. D. Castillo, and R. Chellappa, "L2-constrained softmax loss for discriminative face verification," CoRR, vol. abs/1703.09507, 2017. [Online]. Available: http://arxiv.org/abs/1703.09507
[72] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in