[PDF] Exponential Discriminative Metric Embedding in Deep Learning

Abstract

With the remarkable success achieved by the Convolutional Neural Networks (CNNs) in object recognition recently, deep learning is being widely used in the computer vision community. Deep Metric Learning (DML), integrating deep learning with conventional metric learning, has set new records in many fields, especially in classification task. In this paper, we propose a replicable DML method, called Include and Exclude (IE) loss, to force the distance between a sample and its designated class center away from the mean distance of this sample to other class centers with a large margin in the exponential feature projection space. With the supervision of IE loss, we can train CNNs to enhance the intra-class compactness and inter-class separability, leading to great improvements on several public datasets ranging from object recognition to face verification. We conduct a comparative study of our algorithm with several typical DML methods on three kinds of networks with different capacity. Extensive experiments on three object recognition datasets and two face recognition datasets demonstrate that IE loss is always superior to other mainstream DML methods and approach the state-of-the-art results.

Full PDF

EExponential Discriminative Metric Embeddingin Deep Learning

Bowen Wu a, ∗ , Zhangling Chen b , Jun Wang c , Huaming Wu b a Center for Combinatorics, Nankai University, Tianjin 300071, China b Center for Applied Mathematics, Tianjin University, Tianjin 300072, China c School of Mathematics, Tianjin University, Tianjin 300072, China

Abstract

With the remarkable success achieved by the Convolutional Neural Net-works (CNNs) in object recognition recently, deep learning is being widelyused in the computer vision community. Deep Metric Learning (DML), inte-grating deep learning with conventional metric learning, has set new recordsin many ﬁelds, especially in classiﬁcation task. In this paper, we propose areplicable DML method, called Include and Exclude (IE) loss, to force thedistance between a sample and its designated class center away from themean distance of this sample to other class centers with a large margin inthe exponential feature projection space. With the supervision of IE loss,we can train CNNs to enhance the intra-class compactness and inter-classseparability, leading to great improvements on several public datasets rang-ing from object recognition to face veriﬁcation. We conduct a comparativestudy of our algorithm with several typical DML methods on three kindsof networks with diﬀerent capacity. Extensive experiments on three objectrecognition datasets and two face recognition datasets demonstrate that IEloss is always superior to other mainstream DML methods and approach thestate-of-the-art results.

Keywords:

Deep metric learning, Object recognition, Face veriﬁcation,Intra-class compactness, Inter-class separability ∗ Corresponding author.E-mail address: [email protected] (B. Wu).

Preprint submitted to Neurocomputing March 8, 2018 a r X i v : . [ c s . C V ] M a r . Introduction Recently, Convolutional Neural Networks (CNNs) are continuously set-ting new records in classiﬁcation aspect, such as object recognition [1, 2, 3, 4],scene recognition [5, 6], face recognition [7, 8, 9, 10, 11, 12], age estimation[13, 14] and so on. Facing the more and more complex data, the deeperand wider CNNs tend to obtain better accuracies. Meanwhile, many trou-bles will show up, such as gradient saturating, model overﬁtting, parameteraugmentation, etc. To solve the ﬁrst problem, some non-linear activations[15, 16, 17] have been proposed. Considerable eﬀorts have been made toreduce model overﬁtting, such as data augmentation [1, 18], dropout [19, 1],regularization [15, 20]. Besides, some model compressing methods [21, 22]have largely reduced the computing complexity of original models, with theperformance improved simultaneously.In general object recognition, scene recognition and age estimation, theidentities of the possible testing samples are within the training set. Sothe training and testing sets have the same object classes but not the sameimages. In this case, softmax classiﬁer is often used to designate a label tothe input.For face recognition, the deeply learned features need to be not onlyseparable but also discriminative. It can be roughly divided into two aspects,namely face identiﬁcation and face veriﬁcation. The former is the same asobject recognition, the training and testing sets have the same face identities,aims at classifying an input image into a large number of identity classes.Face veriﬁcation is to classify a pair of images as belonging to the sameidentity or not (i.e. binary classiﬁcation). Since it is impractical to pre-collect enough number of all the possible testing identities for training, faceveriﬁcation is becoming the mainstream in this ﬁeld. As clariﬁed by DeepIDseries [9, 23, 10]: classifying all the identities simultaneously instead of binaryclassiﬁers for training can make the learned features more discriminativebetween diﬀerent classes. So we decide to use the joint supervision of softmaxclassiﬁer and metric loss function to train and the veriﬁcation signal of featuresimilarity discriminant to test as shown in Section 4.3. Fig. 1 illustratesthe general face recognition pipeline, which maps the input images to thediscriminative deep features progressively, then to the predicted labels.A recent trend towards deep learning with more discriminative featuresis to reinforce CNNs with better metric loss functions, namely Deep MetricLearning (DML), such that the intra-class compactness and inter-class sep-2 igure 1: The typical framework of face recognition. The process of deep feature learningand metric learning is shown in the second row. arability are simultaneously maximized. Inspired by this idea, many metriclearning methods have been proposed. It can be traced back to early sub-space face recognition methods such as Linear Discriminant Analysis (LDA)[24], Bayesian face [25], and uniﬁed subspace [26]. For example, LDA aims atmaximizing the ratio between inter-class and intra-class variations by ﬁndingthe optimal projection direction. Some metric learning methods [27, 28, 29]have been proposed to project the original feature space into another metricspace, such that the features of the same identity are close and those of dif-ferent identities stay apart. Subsequent contrastive loss [23] and triplet loss[11] have witnessed their success in face recognition.Interestingly, closely related to DML is the Learning to Hash, which is oneof the major solutions to nearest neighbor search problem. Given the highdimensionality and high complexity of multimedia data, the cost of ﬁndingthe exact nearest neighbor is prohibitively high. Learning to Hash, a data-dependent hashing approach, aims to learn hash functions from a speciﬁcdataset so that the nearest neighbor search result in the hash coding spaceis as close as possible to the search result in the original space, signiﬁcantlyimproving the search eﬃciency and space cost. The main methodology ofLearning to Hash is similarity preserving, i.e., minimizing the gap betweenthe similarities computed in the original space and the similarities in thehash coding space in various forms. [30] utilizes linear LDA with trace ratio3riterion to learn hash functions, where the pseudo labels and the hash codesare jointly learned. [31] proposes a semi-supervised deep learning hashingmethod for fast multimedia retrieval, to simultaneously learn a good multi-media representation and hash function. More comprehensive survey aboutdimension reduction and using diﬀerent similarity preserving algorithms tohashing can be found in [32, 33]. Surprisingly, most of the similarity metricloss functions could be used for Learning to Hash.Because of the large scale of training set, it is unreasonable to addressall of them in each iteration. Mini-batch based Stochastic Gradient Descent(SGD) algorithm [34] doesn’t reﬂect the real distribution of the total trainingset, so a superior sampling strategy becomes very important to the trainingprocess. Besides, selecting appropriate pairs or triplets like previous maydramatically increase the number of training samples. As a result, it is in-evitably hard to converge to an optimum steadily. In this paper, we propose anovel well-generalized metric loss function, named Include and Exclude (IE)loss, to make the deeply learned features more discriminative between diﬀer-ent classes and closer to each other between images of the same class. Thisidea is veriﬁed by Fig. 2 in Section 3.1. Obviously, the inter-class distanceis away from the intra-class distance with a large margin. When training,we learn a center for each class like center loss [12] does. Subsequently, weshow that center loss is a variant of the special case of our method. Thereis another parameter σ to regularize the distance between the features andtheir corresponding class centers. Furthermore, we use a hyperparameter Q to control the number of valuable inter-class distances to accelerate theconvergence of our model. We simultaneously use the supervision signals ofsoftmax loss and IE loss to train the network. Extensive experiments onobject recognition and face veriﬁcation validate the eﬀectiveness of IE loss.Our method signiﬁcantly improves the performance compared to the originalsoftmax method, and competitive with other nowadays mainstream DMLalgorithms. The main contributions are summarized as follows: • To the best of our knowledge, we are the ﬁrst to practice the ideaof enforcing the mean inter-class distance larger than the intra-classdistance with a margin in the exponential feature projection space, asopposed to the distance between a sample and its nearest cluster centersin magnet loss [35], avoiding the large intra-class distances. • Instead of some oﬀ-line complicated sampling strategies, our DMLmethod can achieve a satisfactory result only using the mini-batch4ased SGD, greatly simplifying the training process. • To achieve a better performance rapidly, we introduce a hyperparam-eter Q to restrict the number of nearest inter-class distances in eachmini-batch to accelerate the convergence of our model. • We do extensive experiments on several common datasets, includingMNIST, CIFAR10, CIFAR100, Labeled Faces in the Wild (LFW) andYouTube Faces (YTF), to verify the eﬀectiveness, robustness and gen-eralization of IE loss.

2. Related work

In recent years, deep learning has been successfully applied in computervision and other AI domains, such as object recognition [3], face recognition[11], image retrieval [36, 37], speech recognition [38] and natural languageprocessing [39]. Most of the time, deep learning models are prone to bedeeper and wider. But more complicated deep networks are accompaniedby larger training set, model overﬁtting and costly computational overhead.Considering these, there produce some new DML methods, which concate-nate the conventional metric learning losses to the end of the deeply learnedfeatures. In classiﬁcation aspect, DML generally aims at mapping the origi-nally learned features into a more discriminative feature space by maximizingthe inter-class variations and minimizing the intra-class variations. To somedegree, a properly chosen metric loss function would make the training easyto converge to an optimal model without too much training data. We willbrieﬂy discuss some typical DML methods below.Sun et al. [23] encourage all faces of one identity to be projected onto asingle point in the embedding space. They use an ensemble of 25 networkson diﬀerent face patches to get the ﬁnal concatenated features. Both PCAand Joint Bayesian classiﬁer [27] are used to achieve the ﬁnal performance of99 .

47% on LFW. The loss function is mainly based on the idea of contrastiveloss, which minimizes the intra-class distance and enforces the inter-classdistance larger than a ﬁxed margin.Schroﬀ et al. [11] employ the triplet loss, which stems from LMNN [28],to encourage a distance constraint similar to the contrastive loss. Diﬀerently,the triplet loss requires a triple of training samples as input at a time, not apair. The triplet loss minimizes the distance between an anchor sample and apositive sample, and maximizes the distance between the anchor sample and5 negative sample, in order to make the inter-class distance larger than theintra-class distance by a margin relatively. They also use the so far largesttraining database about 200M face images, and set an insurmountable recordon LFW of 99 .

3. The proposed approaches

We ﬁrst clarify the notations which will be used in subsequential sections.Let us assume the training set consists of M input-label pairs D = { x n , y n } Mn =1 belonging to C classes. We consider a parameterized map f ( x n , Θ) , n =1 , · · · , M , and Θ are the model parameters. In this work, the transformationis selected as some complex CNN architectures. We further deﬁne C ( f n ) asthe class label of feature f n , and µ C ( f n ) as the corresponding class center. In this section, some existing superior DML methods are ﬁrst presented.

Triplet Loss

Schroﬀ et al. [11] have veriﬁed the eﬀectiveness of tripletloss with a large training set. But the exponentially increased computationalcomplexity of training examples and the diﬃculty of convergence impede itsgeneral application. The formula is as follows: L (Θ) = M (cid:88) i =1 (cid:8) (cid:107) f ( x ai ) − f ( x pi ) (cid:107) − (cid:107) f ( x ai ) − f ( x ni ) (cid:107) + α (cid:9) + . (1)Here, x ai , x pi and x ni refer to the anchor, positive and negative images in atriplet, respectively. α is the predeﬁned margin. L - Softmax Loss

Liu et al. [40] achieve a ﬂexible learning objectivewith adjustable diﬃculty, by altering the classiﬁcation angle margin betweenclasses. Although the relatively rigorous learning objective with adjustable6ngle margin can avoid overﬁtting, the diﬃcult convergence hinders its gen-eralization to many other deep networks. It is crucial to continuously adjustthe component weight between softmax and L-Softmax to guarantee the pro-gressing of training. L (Θ) = − M M (cid:88) i =1 log (cid:32) exp ( (cid:107) W y i (cid:107)(cid:107) x i (cid:107) ψ ( θ y i )) exp ( (cid:107) W y i (cid:107)(cid:107) x i (cid:107) ψ ( θ y i )) + (cid:80) j (cid:54) = y i exp ( (cid:107) W j (cid:107)(cid:107) x i (cid:107) cos ( θ j )) (cid:33) . (2)It generally requires that ψ ( θ ) = (cid:40) cos( mθ ) , ≤ θ ≤ πm D ( θ ) , πm < θ ≤ π (3)where W is the weight matrix of the fully connected layer before softmaxlayer, and W y i is the y i -th column of W . θ y i is the angle between x i and itscorresponding weight vector W y i , and m is an integer to control the learningobjective. Meanwhile, D ( θ ) must be monotonically decreased to satisfy therequirement for any θ . Center Loss

Wen et al. [12] propose a new loss function, which regardsthe distance of a sample away from its corresponding class center as theobjective penalization. The joint supervision of center loss and softmax lossmakes this approach outperform most existing best results on some facerecognition benchmark databases. L (Θ) = 12 M M (cid:88) i =1 (cid:107) f ( x i ) − µ ( f ( x i )) (cid:107) , (4)where µ ( f ( x i )) is the class center of f ( x i ). As clariﬁed in [35], magnet loss liberates us from the unreasonable priortarget neighbourhood assignments, and divides each class into several clus-ters, aims at maintaining the distributions of diﬀerent classes in the repre-sentation space. As a result, the similar samples in diﬀerent classes may becloser than that in the same classes. Speciﬁcally, intra-class variations maybe larger than inter-class variations in object recognition and face recogni-tion. Thus some local distribution maintaining loss functions like magnet loss7ill not bring so many beneﬁts to the practical classiﬁcation tasks. Despitethe great performance on LFW by triplet loss on GoogLeNet [3], its trainingineﬀectiveness and the exponentially increased training samples hinder thewidespread application to generic classiﬁcation tasks.Considering the diﬃculty of magnet loss to reproduce and the disadvan-tages mentioned above, we propose a replicable DML method, called IE loss,to learn the discriminative features. We calculate all the distances betweena sample and other class centers in a mini-batch to take of advantage ofbatch information, as compared to the pair/triplet samples like previous.The objective is initially deﬁned as follows: L (Θ) = 1 M M (cid:88) n =1 (cid:40) − log exp ( − σ (cid:107) f n − µ C ( f n ) (cid:107) − α ) (cid:80) c (cid:54) = C ( f n ) exp ( − σ (cid:107) f n − µ c (cid:107) ) (cid:41) + , (5)where {·} + is the hinge loss function, α is a predeﬁned margin hyperparam-eter, σ = M − (cid:80) n ∈D (cid:107) f n − µ C ( f n ) (cid:107) is the variance of examples away fromtheir respective class centers in the feature space. When training, the classcenter µ C ( f n ) and variance σ should update together with the deep feature f n . This means we should use the entire training set in each iteration. Ob-viously, it is impractical. So we decide to employ the mini-batch based SGDalgorithm to update the parameters. The denominator in log part is com-puted by summing all the inter-class distances between a sample and otherclass centers appear in the mini-batch. This approach seems to be a naturalchoice with the probability interpretation, the same to softmax loss.Some existing similar DML methods express that a sample quite far awayfrom the corresponding class center should vanish from its term in our objec-tive, approximating the denominator of Equation 5 with a small number ofnearest classes. Variance standardization also renders the objective invariantto the characteristic length scale of the problem. Whereas, all these beneﬁtsare based on a superb neighborhood sampling strategy for each class to keepthe local distribution. Diﬀerent from the strategy exploited in [35] whichsampling the nearest K clusters in each class, we decide to use the Q nearestclass centers to obtain the objective. The improved objective loss function isformulated as follows: L (Θ) = 1 M M (cid:88) n =1 (cid:40) − log exp ( − σ (cid:107) f n − µ C ( f n ) (cid:107) − α ) (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:41) + , (6)8here Q is an eﬀectively selected number of diﬀerent inter-class distancesbetween a sample and other class centers in a mini-batch, and these dis-tances are sorted in ascending order. We can choose a proper Q according todiﬀerent training datasets to acquire the best performance. One can noticethat the sophisticated oﬀ-line nearest clusters sampling strategy is avoided,and the mini-batch based SGD works well for our training. Besides, the toolarge inter-class distances are removed to accelerate the convergence, whichis especially valid for the datasets with many classes. Subsequent results willshow that the proposed method can greatly improve the training eﬃciencywithout sacriﬁcing speed, since these auxiliary loss layers are removed in theclassiﬁcation step.When we set Q = 1 and σ = 0 .

5, Eq.(6) immediately reduces to Eq.(7). L (Θ) = 1 M M (cid:88) n =1 (cid:26) (cid:107) f n − µ C ( f n ) (cid:107) + α − min c (cid:54) = C ( f n ) (cid:107) f n − µ c (cid:107) (cid:27) + . (7)It is clear that this formula is a variant of the eﬃcient center loss and tripletloss. This loss function seems more appropriate to reﬂect the characteris-tics of our proposed method. It apparently forces the minimum inter-classdistance larger than the intra-class distance with a margin α . Figure 2: Visualization of the deeply learned 2D features on training and testing sets ofMNIST, regarding softmax loss, L-Softmax loss, center loss and IE loss, respectively. Thepoints with diﬀerent colors correspond to the features from diﬀerent classes. λ is theweighting parameter between softmax loss and IE loss in our ﬁnal objective,to keep the balance between these two supervision symbols. Algorithm 1

The parameter updating algorithm of IE loss.

Input: training set D = { x n , y n } Mn =1 , initialized parameters θ c in convolu-tional layers, W , σ and µ q ( q = 0 , , . . . , Q ) in loss layer where q = 0corresponds to the case of µ C ( f n ) , hyperparameters α and λ , learningrate η t and total iterative steps T . Output: model parameters θ c . for t = 1 , , · · · , T do compute the loss function L t = L tsoftmax + λ L tIE compute the gradients ∂ L t ∂f tn = ∂ L tsoftmax ∂f tn + λ ∂ L tIE ∂f tn ∂ L t ∂W t = ∂ L tsoftmax ∂W t + λ ∂ L tIE ∂W t = λ ∂ L tIE ∂W t ∂ L t ∂µ tq = ∂ L tsoftmax ∂µ tq + λ ∂ L tIE ∂µ tq = λ ∂ L tIE ∂µ tq ∂ L t ∂σ t = ∂ L tsoftmax ∂σ t + λ ∂ L tIE ∂σ t = λ ∂ L tIE ∂σ t update parameters W t +1 = W t − η t · ∂ L t ∂W t = W t − η t · λ · ∂ L tIE ∂W t µ t +1 q = µ tq − η t · ∂ L t ∂µ tq = µ tq − η t · λ · ∂ L tIE ∂µ tq σ t +1 = σ t − η t · ∂ L t ∂σ t = σ t − η t · λ · ∂ L tIE ∂σ t θ t +1 c = θ tc − η t (cid:80) Mn =1 ∂ L t ∂f tn · ∂f tn ∂θ tc end for To alleviate the computational complexity of real gradients, we assume10 n , µ c , σ are three independent variables. One can refer to Appendix A forthe complete derivation process. The gradients of L IE (Θ) with respect to f n , µ c , σ are estimated as follows: ∂ L IE (Θ) ∂f n = 1 M M (cid:88) n =1  f n − µ C ( f n ) σ − f n σ Q + (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) · µ c σ Q (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) )  , (8) ∂ L IE (Θ) ∂µ q =  M (cid:80) Mn =1 (cid:32) exp ( − σ Q (cid:107) f n − µ q (cid:107) ) · fn − µqσ Q (cid:80) Qc =1 ,c (cid:54) = C ( fn ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:33) , q (cid:54) = C ( f n ) − M (cid:80) Mn =1 f n − µ q σ , q = C ( f n ) , (9) ∂ L IE (Θ) ∂σ = 1 M M (cid:88) n =1  (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) · (cid:107) f n − µ c (cid:107) σ Q (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) − (cid:107) f n − µ C ( fn ) (cid:107) σ  . (10)

4. Experiments

The concrete implementation details are given in Section 4.1. In Section4.2, three kinds of CNNs with diﬀerent capacity are given to validate theeﬀectiveness of our algorithm on object recognition databases (MNIST [41],CIFAR10 [42] and CIFAR100 [42]). Some experiments on face recognitiondatabases (LFW [43] and YTF [44]) are also performed in Section 4.3.

We use the Caﬀe library [45] to implement our experiments, and a speed-up parallel computing technique by two Tesla K80 GPUs is exploited. Allthe networks in this part are based on some existing CNNs. We partitionthem into three classes: the lighter, the normal and the powerful. We willrefer to [L], [N] and [P] as their respective notations in the following ex-periments. The normal networks are shown in Table 1 and Table 5 whichare inspired by [40, 12]. Also, the powerful ones are similar to [46, 4]. Weadopt ReLU [1] as the default activation function except in Table 1 wherethe PReLU [16] is used. The weight decay and momentum is set to 0.0005and 0.9. Note that the mean subtraction image preprocessing is performedif not mentioned. The normally used SGD works well for the training. Thelighter networks are some known structures built in Caﬀe library, and wecomply with their original setings. In all these cases, we set α as 0 . as the entire inter-class distances in the mini-batch, if not speciﬁed. Thejoint supervision of softmax loss and IE loss is necessary to accelerate theconvergence of training process. When testing, the softmax classiﬁer is usedfor object recognition, and cosine similarity metric is computed to obtainthe face veriﬁcation accuracies. For a fair comparison, we train four kindsof models in each experiment, namely under the supervision of softmax loss,softmax loss and L-Softmax loss, softmax loss and center loss, softmax lossand IE loss. For simplicity, we refer to the four original loss names as theircorresponding methods. The details of every experiment about the trainingsetups will be presented in their respective subsections subsequently. In allthe experiments, only a single model is used to achieve the ﬁnal performance. Table 1: Some normal CNN architectures for diﬀerent benchmark datasets. Conv1.x,Conv2.x and Conv3.x denote structures that may contain multiple successive convolutionallayers. Batch normalization is used in these networks.

MNIST (for Fig .

2) Conv0 . x Conv1 . x Pool1 Conv2 . x Pool2 Conv3 . x Pool3 Fully ConnectedNum Layer - 2 1 2 1 2 1 1Filt Dim - 5 2 5 2 5 2 1Num Filt - 32 - 64 - 128 - 2Stride - 1 2 1 2 1 2 1Pad - 2 - 2 - 2 - - MNIST

Conv0 . x Conv1 . x Pool1 Conv2 . x Pool2 Conv3 . x Pool3 Fully ConnectedNum Layer 1 3 1 3 1 3 1 1Filt Dim 3 3 2 3 2 3 2 1Num Filt 64 64 - 64 - 64 - 256Stride 1 1 2 1 2 1 2 1Pad 1 1 - 1 - 1 - - CIFAR10

Conv0 . x Conv1 . x Pool1 Conv2 . x Pool2 Conv3 . x Pool3 Fully ConnectedNum Layer 1 4 1 4 1 4 1 1Filt Dim 3 3 2 3 2 3 2 1Num Filt 64 64 - 96 - 128 - 256Stride 1 1 2 1 2 1 2 1Pad 1 1 - 1 - 1 - - CIFAR100

Conv0 . x Conv1 . x Pool1 Conv2 . x Pool2 Conv3 . x Pool3 Fully ConnectedNum Layer 1 4 1 4 1 4 1 1Filt Dim 3 3 2 3 2 3 2 1Num Filt 96 96 - 192 - 384 - 512Stride 1 1 2 1 2 1 2 1Pad 1 1 - 1 - 1 - - MNIST

This handwritten dataset has 60,000 training images and10,000 testing images. In this section, we use two CNNs to validate the12eneralization of our algorithm. One is the lighter LeNet included in Caﬀelibrary. We train it according to the default updating strategy of learningrate and parameter initialization, eventually terminate it at 12k. The nor-mal one is depicted in Table 1. This model is trained with the batch sizeof 256, and the learning rate is started from 0.01, divided by 10 at 12k and15k iterations, eventually terminated at 20k iterations. In all these experi-ments, we only preprocess the images by dividing by 256 to provide them inrange [0,1] as inputs. Some existing best results and the compared methodsare shown in Table 2. It is obvious that IE loss not only outperforms otherDML methods under the same setings, but also among the top performancecompared to other state-of-the-art methods.

Table 2: Recognition error rate (%) on MNIST dataset.

Method Error Rate (%)DropConnect [20] 0.57CNN [47] 0.53Maxout [15] 0.45DSN [48] 0.39R-CNN [49] . GenPool [50] . Softmax [L] 0.83L-Softmax [L] 0.74Center [L] 0.76IE [L] . Softmax [N] 0.61L-Softmax [N] 0.47Center [N] 0.58IE [N] . CIFAR10

This dataset has 10 classes of objects with 50k for trainingand 10k for testing. The experiments on three CNNs are carried out here.The lighter one is the Cifar10 network built in Caﬀe library. The updatingstrategy and initialization of parameters follow the original settings. Thenormal one is depicted in Table 1. We start with a learning rate of 0.01,divide it by 10 at 10k and 17k iterations, and eventually terminate it at22k iterations. Simple mean/std normalization and horizontal ﬂips are usedto preprocess the dataset. The powerful one is WRN-28-10 as illustratedin [46], with some diﬀerences. The WRN-28-10 network is said to achieve a13omparable accuracy with more than 1000 layers raw ResNet [4] on CIFAR10.To speed up the training process, we ﬁne-tune the other three compared DMLmethods from the softmax baseline model. In this experiment, the dataset ispreprocessed by global contrast normalization and mean/std normalization.We follow the standard data augmentation [40] for training, and the batchsize is 128. The results are listed in Table 3. We can observe that ourmethod always achieves the best performance among the four compared DMLmethods regardless of the size of CNNs.

Table 3: Recognition error rate (%) on CIFAR10 dataset.

Method Error Rate (%)Maxout [15] 11.68DSN [48] 9.69DropConnect [20] 9.41All-CNN [51] 9.08R-CNN [49] 8.69GenPool [50] . Softmax [L] 21.88L-Softmax [L] -Center [L] 19.40IE [L] . Softmax [N] 11.56L-Softmax [N] 9.59Center [N] 10.25IE [N] . Softmax [P] 6.59L-Softmax [P] 6.46Center [P] 6.17IE [P] . CIFAR100

The ﬁnal part of this section, we will verify the eﬀectivenessof IE loss on CIFAR100 dataset. This dataset is just like the CIFAR10,except it has 100 classes containing 600 images per class, where 500 fortraining and 100 for testing. The 100 classes in CIFAR100 are grouped into20 superclasses. Each image comes with a “ﬁne” label (the class to which itbelongs) and a “coarse” label (the superclass to which it belongs). We use theformer protocol here. By convention, the normal network is shown in Table 1,and the powerful one is WRN-28-10. Also, the training strategy is the sameas which described in CIFAR10. For the powerful WRN-28-10, we ﬁne-tune14he other three compared DML methods from the softmax baseline model.Diﬀerently, to better inspect the eﬀectiveness of the compared methods withthe capacity of networks growing, we preprocess the dataset in the same wayon the normal and powerful networks, only by simple mean/std normalizationand horizontal ﬂips to augment data. In Table 4, we can clearly ﬁnd thatour method consistently performs better than other compared approaches.

Table 4: Recognition error rate (%) on CIFAR100 dataset.

Method

Error Rate (%)Maxout [15] 38.57DSN [48] 34.57All-CNN [51] 33.71R-CNN [49] . Softmax [N] 33.31L-Softmax [N] 30.79Center [N] 29.39IE [N] . Softmax [P] 27.06L-Softmax [P] 26.21Center [P] 26.15IE [P] . From the results presented above, one can ﬁnd that our IE loss alwaysachieves the best results among the four compared DML methods on threeobject recognition datasets. Speciﬁcally, the performance of center loss andL-Softmax loss ﬂuctuates signiﬁcantly with diﬀerent network structures. InFig. 3, the training and testing process on CIFAR10 and CIFAR100 withthe normal CNNs are displayed. It can be seen that the convergence rate ofour IE loss is comparable with other compared loss functions, avoiding thenotoriously slow convergence of triplet loss. Considering the performance gapbetween training and testing, one can observe that IE loss can mitigate theserious overﬁtting of softmax loss and the diﬃcult convergence of L-Softmaxloss. The testing accuracies of our method about diﬀerent λ and α , and thebest settings of them on the normal networks are shown in Appendix B15 a)(b)Figure 3: Accuracy vs. iteration curves using the normal networks on (a) CIFAR10 datasetand (b) CIFAR100 dataset. Diﬀerent from object recognition, face veriﬁcation is to compute the fea-ture similarity of two images, and threshold comparison is exploited to decidewhether the same person or not. Speciﬁcally, we use softmax classiﬁer andmetric loss functions to jointly supervise the training process, and the cosinesimilarity of two features is used to obtain the testing accuracy (Fig. 4). Inthis section, we evaluate our approach for face veriﬁcation on LFW and YTF16 igure 4: The general pipeline for face veriﬁcation in this paper, where classiﬁer lossfunction is used to train and similarity discriminant is used to obtain the ﬁnal veriﬁcationaccuracy. datasets. These two face datasets are the recognized benchmarks for face im-age and video, respectively. We use the publicly available CASIA-WebFace[52] as the training set, which originally has 494,414 labeled face images from10,575 individuals. After removing the images failing to detect and misla-beled, the resulting dataset for our training is just over 430K images. Thecropped faces of all images are detected by [53], and 5 facial landmarks arelabeled to globally align the face images by similarity transformation [54].The normal network is depicted in Table 5, which is a reduced version ofResNet [4] with 27 convolutional layers. The input faces are cropped to112 ×

96 RGB images, and the batch size is 256. Besides, the images arenormalized by subtracting the mean image and dividing by 128. We startthe training with a learning rate of 0.1, and divide it by 10 at 16K, 24Kiterations, then terminate it at 28K iterations. For face images, we ﬁnd thatusing wider ResNet with fewer layers like WRN-28-10 does not bring so manybeneﬁts, and accompanied by rapidly growing memory space. So we decideto widen the network listed in Table 5 to obtain the powerful one. Speciﬁ-cally, we widen all the convolutional layers between Conv1 and Conv4 witha widening factor 2. When testing, we extract the features from both thefrontal face and its mirror image, and merge the two features by element-wisesummation. All the evaluations are based on the similarity scores of image17 able 5: The normal ResNet architecture used for face veriﬁcation. Resblock is the classicalResidual unit which consists of two consecutive convolutional layers and a unit mapping.

Layer Type Filter Size / Stride Output Size Depth ParamsConv0 convolution 3 × / × ×

32 1 0.86KConv1 convolution 3 × / × ×

64 1 18KPool1 max pooling 2 × / × ×

64 0 -Resblock1 convolution 3 × / × ×

64 2 73KConv2 convolution 3 × / × ×

128 1 73KPool2 max pooling 2 × / × ×

128 0 -Resblock2 convolution 3 × / × ×

128 2 294KResblock3 convolution 3 × / × ×

128 2 294KConv3 convolution 3 × / × ×

256 1 294KPool3 max pooling 2 × / × ×

256 0 -Resblock4 convolution 3 × / × ×

256 2 1179KResblock5 convolution 3 × / × ×

256 2 1179KResblock6 convolution 3 × / × ×

256 2 1179KResblock7 convolution 3 × / × ×

256 2 1179KResblock8 convolution 3 × / × ×

256 2 1179KConv4 convolution 3 × / × ×

512 1 1179KPool4 max pooling 2 × / × ×

512 0 -Resblock9 convolution 3 × / × ×

512 2 4718KResblock10 convolution 3 × / × ×

512 2 4718KResblock11 convolution 3 × / × ×

512 2 4718KFc5 fully connection - 1 × ×

512 1 5242K pairs, which are computed by the cosine similarity of two representationsafter PCA.Considering the diﬀerence from previous experiments, we select Q as theﬁrst 20% inter-class distances in every mini-batch to calculate the objectivehere. The reason is that some datasets like CASIA-WebFace have too manysubjects, most of the inter-class distances tend to be very large in our method,thus leading to the diﬃcult convergence of training process. Fig. 5a showsthe veriﬁcation accuracies on LFW with Q ranging from 0 to 100% of thenumber of inter-class distances. The importance of choosing a proper Q is displayed clearly. Here, we regard the case when Q = 0 as the originalsoftmax method. LFW

This dataset contains 13,233 face images of 5,749 diﬀerent iden-tities from the Internet, with large variations in pose, expression and illumi-nation. For comparison purpose, algorithms typically report the mean faceveriﬁcation accuracies and the ROC curves on 6000 given face pairs, followingthe standard protocol of unrestricted with labeled outside data [43]. Accord-ing to previous experience, we ﬁnd that a properly chosen λ which balances18 a) (b)Figure 5: (a) Veriﬁcation accuracies of IE loss with diﬀerent Q/N on LFW using thenormal network, where N is the number of inter-class distances regarding a sample in amini-batch. (b) Face veriﬁcation accuracies of IE Loss on LFW with diﬀerent λ using thenormal network. the weight between softmax loss and IE loss can improve the performance.So we experiment our method across a wide range of λ from 0 to 0.1 to selectthe best setting. The results on LFW are shown in Fig. 5b. It can be seenthat IE loss is stable with diﬀerent λ , and the best setting is 0.05.Fig. 6a illustrates the veriﬁcation accuracies of ﬁve loss functions withtwo diﬀerent similarity metrics for testing. The results show that cosinesimilarity is more suitable than L2 similarity for our feature representations.Obviously, our method is robust to both cases, and always achieves the bestperformance. YTF

This dataset consists of 3,425 videos from 1,595 diﬀerent people,with an average of 2.15 videos for everyone. Besides, the average length of avideo clip is 181.3 frames, with each clip duration varying from 48 frames to6,070 frames. Just as the experiments on LFW, we report the results on 5,000video pairs in Table 6, according to the unrestricted protocol with labeledoutside data in [44]. Also, Fig. 7 shows the accuracy of IE loss in regard todiﬀerent λ ranging from 0 to 0.1 and the ROC curves of ﬁve compared lossfunctions.From the veriﬁcation results in Table 6 and ROC curves on these twodatasets, we can ﬁnd that the performance on the powerful network is con-sistently superior to which on the normal one except the L-Softmax loss. IE19 a) (b)Figure 6: (a)Veriﬁcation accuracies of compared loss functions with two diﬀerent similaritymetrics on LFW using the normal network. (b) ROC curves of ﬁve compared loss functionson LFW. (a) (b)Figure 7: (a) Face veriﬁcation accuracies of IE Loss on YTF with diﬀerent λ using thenormal ResNet. (b) ROC curves of ﬁve compared loss functions on YTF. loss is always outstanding in the ﬁve loss functions under a small trainingdataset of CASIA-WebFace, and competitive with the state-of-the-art meth-ods using larger training datasets or model ensemble. Noticeably, the resultsof triplet loss and L-Softmax loss are not satisfactory, and there exhibits alarge margin of triplet loss compared to the results in [11]. This convincingly20emonstrates the diﬃcult convergence and big data dependence of tripletloss. We conjecture that maybe the performance of our method can be im-proved considerably if a larger training set or a more powerful network isused. Anyway, the excellent performance undoubtedly verify the great gen-eralization of IE loss. The visualization of some datasets is shown in Fig. 8. Table 6: Face veriﬁcation performance (%) on LFW and YTF datasets.

Method Points for Alig . Outside Data Networks Acc . on LFW (%) Acc . on YTF (%)High-dim LBP [55] 27 100K - 95 .

17 -DeepFace[7] 73 4M 3 97.35 91.40Gaussian Face [8] - 20K 1 98.52 -DeepID [9] 5 200K 1 97.45 -DeepID-2+ [10] 18 300K 25 99.47 93.20FaceNet [11] - 200M 1 .

63 95 . DCNN [56] 7 490K 1 97.45 -CASIA-WebFace [52] 2 490K 1 97.73 90.60Softmax [N] 5 430K 1 97.42 91.52Triplet Loss [N] 5 430K 1 98.20 92.16L-Softmax [N] 5 430K 1 98.86 . Center [N] 5 430K 1 98.91 93.80IE [N] 5 430K 1 . . IE [P] 5 430K 1 .

15 94 .

5. Conclusion and future work

In this paper, we propose a powerful and replicable DML method, whichenforces the mean inter-class distance larger than the intra-class distancewith a margin, to enhance the discriminability of the deeply learned fea-tures in object recognition and face veriﬁcation. Extensive experiments onseveral public datasets have convincingly demonstrated the eﬀectiveness ofour method. The results also exhibit the excellent generalization of IE lossin various size of CNNs. Instead of requiring a superior neighborhood sam-pling strategy, our approach only uses mini-batch based SGD to conduct theexperiments, avoiding the exponentially increased computational complex-ity of image pairs or triplets. Maybe a better hard sample mining strategycould improve the performance further. Inspired by the outstanding perfor-mance of IE loss in object recognition and face recognition, we will exploreits extension in the case where the swarm intelligent methods are exploited21 a) Samples of CIFAR100(b) Face images in LFWFigure 8: Some examples of the datasets in our experiments. The image pairs in red arethose positive pairs that our method succeeds to recognize, while the softmax methodfails. Likewise, the green ones are some negative pairs. to optimize the clustering algorithm [57, 58] in the following work. In thefuture, we will delve into DML to explore its extensive applications to othertasks.

Acknowledgements

The authors would like to thank Kun Shang, Mengya Zhang, RuipengShen and Wenjuan Li for their helpful advices. This research was supportedby the National Science Foundation of China.

Appendix A

In this section, we concretely describe the deduction of gradient formulas(9) ∼ (11) listed in Section 3.2. First, we rewrite Eq.(6) as follows: L = 1 M M (cid:88) n =1 (cid:40) − log exp ( − σ (cid:107) f n − µ C ( f n ) (cid:107) − α ) (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:41) + . (A.1)We need to compute the gradient formulas of L with respect to f n , µ c and σ . Note that directly computing the real gradients of them leads to costly22omputational complexity in training. So we will consider f n , µ c and σ asthree independent variables. If the value in {·} is positive, then ∂ L ∂f n = − M · ∂∂f n (cid:32) M (cid:88) n =1 log exp ( − σ (cid:107) f n − µ C ( f n ) (cid:107) − α ) (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:33) = 1 M · ∂∂f n  (cid:107) f n − µ C ( f n ) (cid:107) σ + α + log Q (cid:88) c =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) )  = 1 M M (cid:88) n =1 (cid:32) f n − µ C ( f n ) σ − f n σ Q + (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) · µ c σ Q (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:33) . (A.2) ∂ L ∂µ q = 1 M · ∂∂µ q  (cid:107) f n − µ C ( f n ) (cid:107) σ + α + log Q (cid:88) c =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) )  . (A.3)When q (cid:54) = C ( f n ), we have ∂ L ∂µ q = 1 M M (cid:88) n =1 (cid:32) exp ( − σ Q (cid:107) f n − µ q (cid:107) ) · f n − µ q σ Q (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) (cid:33) . (A.4)When q = C ( f n ), we have ∂ L ∂µ q = − M M (cid:88) n =1 f n − µ q σ . (A.5) ∂ L ∂σ = 1 M · ∂∂σ  (cid:107) f n − µ C ( f n ) (cid:107) σ + α + log Q (cid:88) c =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) )  = 1 M M (cid:88) n =1  (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) · (cid:107) f n − µ c (cid:107) σ Q (cid:80) Qc =1 ,c (cid:54) = C ( f n ) exp ( − σ Q (cid:107) f n − µ c (cid:107) ) − (cid:107) f n − µ C ( f n ) (cid:107) σ  . (A.6)23 ppendix B Table B.1: The recognition accuracy of IE loss on MNIST regarding diﬀerent value of λ and α respectively with (a) LeNet built in Caﬀe library and (b) MNIST network depictedin Tab.1. (a) λ accuracy α accuracy0.110 0.9939 0.01 0.99450.115 0.9936 0.03 0.99390.120 0.9940 0.05 0.99380.125 0.9949 0.07 0.99430.130 0.9944 (b) λ accuracy α accuracy0.001 0.9964 0.01 0.99610.004 0.9958 0.03 0.99650.007 0.9952 0.05 0.99620.010 0.9963 0.07 0.99670.030 0.9961 0.09 0.99620.050 0.9962 able B.2: The recognition accuracy of IE loss on CIFAR10 regarding diﬀerent value of λ and α respectively with (a) CIFAR10 built in Caﬀe library and (b) CIFAR10 networkdepicted in Tab.1. (a) λ accuracy α accuracy0.001 0.8028 0.001 0.80570.004 0.8054 0.005 0.80180.008 0.8064 0.010 0.80680.010 0.8063 0.050 0.80290.040 0.8011 0.100 0.80930.080 0.7950 0.150 0.80320.100 0.8033 0.200 0.79720.130 0.8012 0.250 0.79890.160 0.8064 0.300 0.79960.190 0.7959 0.350 0.80590.210 0.7998 (b) λ accuracy α accuracy0.001 0.9086 0.001 0.90930.005 0.9102 0.005 0.90870.008 0.9109 0.010 0.90750.011 0.9108 0.050 0.90660.015 0.9088 Table B.3: The recognition accuracy of IE loss on CIFAR100 with the CIFAR100 networkdepicted in Tab.1, in regard to diﬀerent value of λ and α respectively. λ α Here we describe the accuracy results about diﬀerent hyperparametersand the optimal settings on object recognition using the little and normal25etworks in details. All the experiments in this part obey the followingsteps. First, we ﬁx α to 0.1 and vary λ according to its corresponding rangein diﬀerent databases. Then, we ﬁx λ to the best setting from the previousresults and vary α to ﬁnd the ﬁnal optimal setting. Both the optimal valuesof λ and α are displayed in bold. ReferencesReferences [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁcationwith deep convolutional neural networks,” in

Advances in neural infor-mation processing systems , pp. 1097–1105, 2012.[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” in

International Conference on LearningRepresentations , 2014.[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Er-han, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolu-tions,” in

Proceedings of the IEEE Conference on Computer Vision andPattern Recognition , pp. 1–9, 2015.[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for im-age recognition,” in

IEEE Conference on Computer Vision and PatternRecognition , 2015.[5] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Objectdetectors emerge in deep scene cnns,” in

International Conference onLearning Representations , 2014.[6] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learningdeep features for scene recognition using places database,” in

Advancesin neural information processing systems , pp. 487–495, 2014.[7] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closingthe gap to human-level performance in face veriﬁcation,” in

Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition ,pp. 1701–1708, 2014. 268] C. Lu and X. Tang, “Surpassing human-level face veriﬁcation perfor-mance on lfw with gaussianface,” in

Proceedings of the 29th AAAI Con-ference on Artiﬁcial Intelligence , 2014.[9] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation frompredicting 10,000 classes,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pp. 1891–1898, 2014.[10] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations aresparse, selective, and robust,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pp. 2892–2900, 2015.[11] F. Schroﬀ, D. Kalenichenko, and J. Philbin, “Facenet: A uniﬁed em-bedding for face recognition and clustering,” in

Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pp. 815–823,2015.[12] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learningapproach for deep face recognition,” in

European Conference on Com-puter Vision , pp. 499–515, Springer, 2016.[13] G. Levi and T. Hassner, “Age and gender classiﬁcation using convo-lutional neural networks,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition Workshops , pp. 34–42, 2015.[14] H. Liu, J. Lu, J. Feng, and J. Zhou, “Group-aware deep feature learningfor facial age estimation,”

Pattern Recognition , 2016.[15] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, andY. Bengio, “Maxout networks.,”

Proceedings of the 30th InternationalConference on Machine Learning , vol. 28, pp. 1319–1327, 2013.[16] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectiﬁers: Sur-passing human-level performance on imagenet classiﬁcation,” in

Pro-ceedings of the IEEE International Conference on Computer Vision ,pp. 1026–1034, 2015.[17] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and im-proving convolutional neural networks via concatenated rectiﬁed linearunits,” in

International Conference on Machine Learning , 2016.2718] D. C. Cire¸san, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhu-ber, “High-performance neural networks for visual object classiﬁcation,” arXiv preprint arXiv:1102.0183 , 2011.[19] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.Salakhutdinov, “Improving neural networks by preventing co-adaptationof feature detectors,” arXiv preprint arXiv:1207.0580 , 2012.[20] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regulariza-tion of neural networks using dropconnect,” in

Proceedings of the 30thInternational Conference on Machine Learning , pp. 1058–1066, 2013.[21] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressingdeep neural networks with pruning, trained quantization and huﬀmancoding,” in

International Conference on Learning Representations , 2015.[22] Y. Sun, X. Wang, and X. Tang, “Sparsifying neural network connec-tions for face recognition,” in

Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pp. 4856–4864, 2016.[23] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face rep-resentation by joint identiﬁcation-veriﬁcation,” in

Advances in NeuralInformation Processing Systems , pp. 1988–1996, 2014.[24] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfacesvs. ﬁsherfaces: Recognition using class speciﬁc linear projection,”

IEEETransactions on pattern analysis and machine intelligence , vol. 19, no. 7,pp. 711–720, 1997.[25] B. Moghaddam, T. Jebara, and A. Pentland, “Bayesian face recogni-tion,”

Pattern Recognition , vol. 33, no. 11, pp. 1771–1782, 2000.[26] X. Wang and X. Tang, “A uniﬁed framework for subspace face recogni-tion,”

IEEE Transactions on pattern analysis and machine intelligence ,vol. 26, no. 9, pp. 1222–1228, 2004.[27] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revis-ited: A joint formulation,” in

European Conference on Computer Vision ,pp. 566–579, Springer, 2012. 2828] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learningfor large margin nearest neighbor classiﬁcation,” in

Advances in neuralinformation processing systems , pp. 1473–1480, 2005.[29] M. Kan, S. Shan, Y. Su, D. Xu, and X. Chen, “Adaptive discrimi-nant learning for face recognition,”

Pattern Recognition , vol. 46, no. 9,pp. 2497–2509, 2013.[30] J. Song, L. Gao, Y. Yan, D. Zhang, and N. Sebe, “Supervised hashingwith pseudo labels for scalable multimedia retrieval,” in

Proceedings ofthe 23rd ACM international conference on Multimedia , pp. 827–830,ACM, 2015.[31] L. Gao, J. Song, F. Zou, D. Zhang, and J. Shao, “Scalable multimediaretrieval by deep learning hashing with relative similarity learning,” in

Proceedings of the 23rd ACM international conference on Multimedia ,pp. 903–906, ACM, 2015.[32] L. Gao, J. Song, X. Liu, J. Shao, J. Liu, and J. Shao, “Learning inhigh-dimensional multimedia data: the state of the art,”

MultimediaSystems , vol. 23, no. 3, pp. 303–313, 2017.[33] J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. , “A survey on learningto hash,”

IEEE Transactions on Pattern Analysis and Machine Intelli-gence , vol. PP.[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haﬀner, “Gradient-based learn-ing applied to document recognition,”

Proceedings of the IEEE , vol. 86,no. 11, pp. 2278–2324, 1998.[35] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev, “Metric learning withadaptive density discrimination,” in

International Conference on Learn-ing Representations , 2016.[36] J. Song, “Binary generative adversarial networks for image retrieval,”in

Proceedings of the 32th AAAI Conference on Artiﬁcial Intelligence ,2018.[37] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, “Quantization-basedhashing: a general framework for scalable image and video retrieval,”

Pattern Recognition , vol. 75, pp. 175–187, 2018.2938] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition withdeep recurrent neural networks,” in

Acoustics, speech and signal pro-cessing (icassp), 2013 ieee international conference on , pp. 6645–6649,IEEE, 2013.[39] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, andC. Potts, “Recursive deep models for semantic compositionality over asentiment treebank,” in

Proceedings of the 2013 conference on empiricalmethods in natural language processing , pp. 1631–1642, 2013.[40] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss forconvolutional neural networks,” in

Proceedings of The 33rd InternationalConference on Machine Learning , pp. 507–516, 2016.[41] Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of hand-written digits,” 1998.[42] A. Krizhevsky and G. Hinton, “Learning multiple layers of features fromtiny images,” tech. rep., 2009.[43] G. B. Huang and E. Learned-Miller, “Labeled faces in the wild: Up-dates and new reporting procedures,”

Dept. Comput. Sci., Univ. Mas-sachusetts Amherst, Amherst, MA, USA, Tech. Rep , pp. 14–003, 2014.[44] L. Wolf, T. Hassner, and I. Maoz, “Face recognition in unconstrainedvideos with matched background similarity,” in , pp. 529–534, IEEE, 2011.[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell, “Caﬀe: Convolutional architecture forfast feature embedding,” in

Proceedings of the 22nd ACM internationalconference on Multimedia , pp. 675–678, ACM, 2014.[46] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in

BritishMachine Vision Conference , 2016.[47] K. Jarrett, K. Kavukcuoglu, Y. Lecun, et al. , “What is the best multi-stage architecture for object recognition,” in , pp. 2146–2153, IEEE, 2009.3048] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervisednets.,” in

AISTATS , vol. 2, p. 5, 2015.[49] M. Liang and X. Hu, “Recurrent convolutional neural network for ob-ject recognition,” in

Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pp. 3367–3375, 2015.[50] C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functionsin convolutional neural networks: Mixed, gated, and tree,” in

Interna-tional Conference on Artiﬁcial Intelligence and Statistics , 2016.[51] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striv-ing for simplicity: The all convolutional net,” in

International Confer-ence on Learning Representations , 2015.[52] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation fromscratch,” arXiv preprint arXiv:1411.7923 , 2014.[53] S. Wu, M. Kan, Z. He, S. Shan, and X. Chen, “Funnel-structured cascadefor multi-view face detection with alignment-awareness,”

Neurocomput-ing , vol. 221, pp. 138–145, 2017.[54] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-ﬁne auto-encodernetworks (cfan) for real-time face alignment,” in

European Conferenceon Computer Vision , Springer, 2014.[55] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its eﬃcient compression for face veriﬁcation,”in

Proceedings of the IEEE Conference on Computer Vision and PatternRecognition , pp. 3025–3032, 2013.[56] J.-C. Chen, V. M. Patel, and R. Chellappa, “Unconstrained face veriﬁ-cation using deep cnn features,” in

Winter Conference on Applicationsof Computer Vision , pp. 1–9, 2016.[57] N. Zeng, Z. Wang, H. Zhang, W. Liu, and F. E. Alsaadi, “Deep beliefnetworks for quantitative analysis of a gold immunochromatographicstrip,”

Cognitive Computation , vol. 8, no. 4, pp. 684–692, 2016.[58] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, “Facialexpression recognition via learning deep sparse autoencoders,”