Quality Aware Network for Set to Set Recognition
Yu Liu, SenseTime Group Limited, [email protected]
Junjie Yan, SenseTime Group Limited, [email protected]
Wanli Ouyang, University of Sydney, [email protected]
Abstract
This paper targets the problem of set-to-set recognition, which learns the metric between two image sets, where the images in each set belong to the same identity. Since images in a set can be complementary, they are expected to yield higher accuracy in practical applications. However, the quality of each sample cannot be guaranteed, and samples of poor quality will hurt the metric. In this paper, the quality aware network (QAN) is proposed to confront this problem: the quality of each sample is learned automatically, even though such information is not explicitly provided in the training stage. The network has two branches. The first branch extracts an appearance feature embedding for each sample, and the other branch predicts a quality score for each sample. The features and quality scores of all samples in a set are then aggregated to generate the final feature embedding. We show that the two branches can be trained in an end-to-end manner given only the set-level identity annotation. An analysis of the gradient flow of this mechanism indicates that the quality learned by the network is beneficial to set-to-set recognition and simplifies the distribution that the network needs to fit. Experiments on both face verification and person re-identification show the advantages of the proposed QAN. The source code and network structure can be downloaded at GitHub.
1. Introduction
Face verification [12, 26, 27, 28, 30] and person re-identification [5, 6, 20, 42] have been well studied and widely used in computer vision applications such as financial identity authentication and video surveillance. Both tasks need to measure the distance between two face or person images. Such tasks can be naturally formalized as a metric learning problem, where the distance between images from the same identity should be smaller than that between images from different identities.

Footnote: The source code and network structure are available at https://github.com/sciencefans/Quality-Aware-Network. Note that we are developing P-QAN (a fine-grained version of QAN, see Sec. 5) in this repository, so the performance may be higher than what we report in this paper.
Figure 1. Illustration of our motivation, best viewed in color. Left column: a classical puzzle in set-to-set recognition. Both set A (upper) and set B (lower) contain noisy image samples caused by camera shake and blur. Their features (shown by histograms in the middle row) are more similar to samples of the other class than to those of their own class. Right column: distributions and samples of two identities in the hyperspace. Top: due to the noise, the variances of the two identities are large and both have hard negative samples. Bottom: the quality aware network (QAN) weakens the noisy samples and narrows down the identities' variances, which makes them more discriminative.

Built on large-scale training data, convolutional neural networks and carefully designed optimization criteria, current methods can achieve promising performance on standard benchmarks, but may still fail due to appearance variations caused by large pose or illumination changes.

In practical applications, a set of images for each identity can usually be collected instead of one single image. For example, the image set of one identity can be sampled from the trajectory of the face or person in a video. Images in a set can be complementary to each other, so that they provide more information than a single image, such as images from different poses. The direct way to aggregate identity information from all images in a set is simply max/average pooling of the appearance features of all images. However, one problem with this pooling is that some images in the set may be unsuitable for recognition. As shown in Figure 1, both the left-top and left-bottom sets contain noisy images caused by shake or blur. If the noisy images are treated equally and max/average pooling is used to aggregate all images' features, the noisy images will mislead the final representation.

In this paper, in order to be robust to images of poor quality as described above while simultaneously using the rich information provided by the other images, our basic idea is that each image can have a quality score in aggregation. To this end, we propose a quality aware network (QAN) with two branches whose outputs are then aggregated together. The first branch, named the feature generation part, extracts the feature embedding for each image, and the other branch, named the quality generation part, predicts a quality score for each image. Features of images in the whole set are then aggregated by the final set pooling unit according to their quality.

A good property of our approach is that we do not supervise the model with any explicit annotation of quality. The network automatically assigns low quality scores to images of poor quality in order to keep the final feature embedding useful for set-to-set recognition. To implement this, we design a model in which the embedding branch and the score generation branch can be jointly trained by optimizing the final embedding. Specifically, in this paper we use a joint triplet and softmax loss on top of image sets. The designed gradient of the image set pooling unit ensures the correctness of this automatic process.

Experiments indicate that the predicted quality score is correlated with the quality annotated by humans, and that the predicted quality score performs better than human annotation in recognition. In this paper, we show applications of the proposed method to both person re-identification and face verification. For the person re-identification task, the proposed quality aware network improves top-1 matching rates over the baseline by 14.6% on iLIDS-VID and 9.0% on PRID2011. For face verification, the proposed method reduces the miss ratio by 15.6% and 29.32% at a false positive rate of 0.001 on the YouTube Face and IJB-A benchmarks.

The main contributions of the paper are summarized as follows.
• The proposed quality aware network automatically generates quality scores for each image in a set and leads to a better representation for set-to-set recognition.
• We design an end-to-end training strategy and demonstrate that the quality generation part and the feature generation part benefit from each other during back propagation.
• The quality learned by QAN is better than the quality estimated by humans, and we achieve new state-of-the-art performance on four benchmarks for person re-identification and face verification.
2. Related work
Our work is built upon recent advances in deep learning based person re-identification and unconstrained face recognition. In person re-identification, [20, 37, 41] use features generated by deep convolutional networks and obtain state-of-the-art performance. To learn face representations in unconstrained face recognition, Huang et al. [11] use a convolutional Restricted Boltzmann Machine, while deep convolutional neural networks are used in [28, 30]. Furthermore, [26, 29] use deeper convolutional networks and achieve accuracy that even surpasses human performance. The accuracy achieved by deep learning on the image-based face verification benchmark LFW [12] has been pushed to 99.78%. Although deep neural networks have achieved such great performance on these two problems, unconstrained set-to-set recognition remains more challenging and more useful in the real world.

Looking backward, there are two different approaches to set-to-set recognition. The first approach takes an image set as a convex hull [2], affine hull [10] or subspace [1, 13]. Under these settings, samples in a set are distributed in a Hilbert space or on a Grassmann manifold, so that the issue can be formulated as a metric learning problem [23, 39].

Other works reduce set-to-set recognition to point-to-point recognition by aggregating the images in a set into a single representation in the hyperspace. The most famous approach of this kind is the bag of features [17], which uses a histogram over the whole set for feature aggregation. Another classical work is the vector of locally aggregated descriptors (VLAD) [14], which aggregates all local descriptors from all samples. Temporal max/average pooling is used in [36] to integrate all frames' features generated by a recurrent convolutional network; this method uses 1st-order statistics to aggregate the set. 2nd-order statistics are used in [32, 43] under the assumption that samples follow a Gaussian distribution. In [8], the original faces in a set are classified into 20 bins based on their pose and quality; faces in each bin are then pooled to generate features, and finally the feature vectors of all bins are merged into the final representation. [38] uses an attention mechanism to summarize several sample points into a single aggregated point.

The proposed QAN belongs to the second approach. It discards the dross and selects the essential information in all images. Different from recent works which learn aggregation based on fixed features [38] or images [8], QAN learns the feature representation and the aggregation simultaneously. [7] proposed a similar quality aware module named "memorability based frame selection", which takes "visual entropy" as the score of a frame. However, that score is defined by humans and is independent of the feature generation unit. In QAN, the score is learned automatically and the quality generation unit is trained jointly with the feature generation unit. Owing to the mutual benefit between the two parts during training, performance is improved significantly by jointly optimizing the image aggregation parameters and the image feature generator.
Figure 2. The end-to-end learning structure of the quality aware network. The input is three image sets S_anchor, S_pos and S_neg, belonging to classes A, A and B. Each of them passes through the fully convolutional network (FCN) to generate middle representations, which are fed to the quality generation part and the feature generation part. The former generates a quality score for each image, and the latter generates the representation of each image. The scores and representations of all images are then aggregated by the set pooling unit to produce the final representation of the image set. We use softmax loss and triplet loss as the supervising ID signals.
3. Quality aware network (QAN)
In our work we focus on improving the image set embedding model, which maps an image set S = {I_1, I_2, \cdots, I_N} to a representation of fixed dimension, so that image sets with different numbers of images are comparable with each other. Let R_a(S) and R_{I_i} denote the representations of S and I_i. R_a(S) is determined by all elements in S, therefore it can be denoted as

R_a(S) = F(R_{I_1}, R_{I_2}, \cdots, R_{I_N}).   (1)

R_{I_i} is produced by a feature extraction process, which may be a traditional hand-crafted feature extractor or a convolutional neural network. F(\cdot) is an aggregation function, which maps a variable-length input set to a representation of fixed dimension. The challenge is to find an optimal F(\cdot) that aggregates features from the whole image set into the most discriminative representation. Based on the notion that images of higher quality are easier to recognize, while images of lower quality, containing occlusion or large pose, should have less effect on the set representation, we define F(\cdot) as

F(R_{I_1}, R_{I_2}, \cdots, R_{I_N}) = \frac{\sum_{i=1}^{N} \mu_i R_{I_i}}{\sum_{i=1}^{N} \mu_i},   (2)

\mu_i = Q(I_i),   (3)

where Q(I_i) predicts a quality score \mu_i for image I_i. The representation of a set is thus a fusion of each image's features, weighted by their quality scores.

In this paper, the feature generation and aggregation modules are implemented by an end-to-end convolutional neural network named QAN, as shown in Fig. 2. Two branches split from the middle of the network. In the first branch, the quality generation part followed by a set pooling unit composes the aggregation module; in the second branch, the feature generation part generates the images' representations. We now describe how an image set flows through QAN. At the beginning of the process, all images are sent into a fully convolutional network to generate middle representations. After that, QAN divides into two branches. The first one (upper), named the quality generation part, is a tiny convolutional neural network (see Sec. 3.4 for details) that predicts a quality score \mu. The second one (lower), called the feature generation part, generates the image representation R_I for each image. \mu and R_I are aggregated at the set pooling unit F, and then pass through a fully connected layer to produce the final representation R_a(S). To sum up, this structure generates quality scores for images, uses these scores to weight the images' representations, and sums them up to produce the final set representation.

We train QAN in an end-to-end manner. The data flow is shown in Fig. 2. QAN is supposed to generate discriminative representations for images and sets belonging to different identities. For image-level training, a fully connected layer is established after the feature generation part, supervised by the softmax loss L_{class}. For set-level training, a set's representation R_a(S) is supervised by L_{veri}, which is formulated as

L_{veri} = \|R_a(S_a) - R_a(S_p)\| - \|R_a(S_a) - R_a(S_n)\| + \delta.   (4)

The loss function above is referred to as the Triplet Loss in previous works [26].
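As a concrete illustration of Eqs. (2)-(4), the following PyTorch sketch (ours, not the authors' released code) implements the quality-weighted set pooling and the set-level triplet loss; the clamp at zero in the loss is a standard assumption that the paper's formula leaves implicit, and all function names are ours:

```python
import torch
import torch.nn.functional as F

def set_pooling(features, qualities):
    """Eqs. (2)-(3): aggregate per-image features R_{I_i} (N x D) with
    quality scores mu_i (N,) into one set representation R_a(S)."""
    weights = qualities / qualities.sum()         # divide by sum of mu_i as in Eq. (2)
    return (weights.unsqueeze(1) * features).sum(dim=0)

def set_triplet_loss(r_anchor, r_pos, r_neg, delta=1.0):
    """Eq. (4): set-level verification loss L_veri with margin delta.
    The clamp at zero is assumed; the paper states the loss without it."""
    d_pos = (r_anchor - r_pos).norm()
    d_neg = (r_anchor - r_neg).norm()
    return F.relu(d_pos - d_neg + delta)
```

Given per-image features and quality scores for the anchor, positive and negative sets, set_pooling produces the three R_a(S) vectors that set_triplet_loss then compares.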
We define S_a as the anchor set, S_p as the positive set, and S_n as the negative set. This function minimizes the variance of intra-class samples, which the softmax loss cannot guarantee, because the softmax loss directly optimizes the probability of each class rather than the discriminativeness of the representation.

Keeping this in mind, we consider the set pooling operation F. The gradients back-propagated through the set pooling unit can be formulated as follows:

\frac{\partial F}{\partial R_{I_i}} = \frac{\partial R_a(S)}{\partial R_{I_i}} = \mu_i   (5)

\frac{\partial F}{\partial \mu_i} = \frac{\partial R_a(S)}{\partial \mu_i} = R_{I_i} - R_a(S)   (6)

So we can formulate the propagation of the final loss as

\frac{\partial L_{veri}}{\partial R_{I_i}} = \frac{\partial R_a(S)}{\partial R_{I_i}} \cdot \frac{\partial L_{veri}}{\partial R_a(S)} = \frac{\partial L_{veri}}{\partial R_a(S)} \cdot \mu_i   (7)

\frac{\partial L_{veri}}{\partial \mu_i} = \frac{\partial R_a(S)}{\partial \mu_i} \cdot \left(\frac{\partial L_{veri}}{\partial R_a(S)}\right)^T = \sum_{j=1}^{D} \frac{\partial L_{veri}}{\partial R_a(S)_j} \cdot (x_{ij} - R_a(S)_j)   (8)

where D is the dimension of the image representation and x_{ij} denotes the j-th dimension of R_{I_i}. We now discuss how the quality score \mu is automatically learned through this back-propagation process.
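The gradients of Eqs. (5) and (6) can be checked numerically with automatic differentiation. The sketch below (ours, reusing set_pooling from above) assumes the quality scores are L1-normalized over the set so that their sum is 1, as the group L1-normalization layer described in the quality generation unit ensures:

```python
import torch

N, D = 5, 8
feats = torch.randn(N, D, requires_grad=True)     # per-image features R_{I_i}
raw = torch.rand(N)
mu = (raw / raw.sum()).detach().requires_grad_()  # L1-normalized scores, sum = 1

r_a = set_pooling(feats, mu)                      # set representation R_a(S)
g = torch.randn(D)                                # stands in for dL_veri/dR_a(S)
r_a.backward(g)

# Eq. (5)/(7): gradient w.r.t. R_{I_i} is mu_i * dL/dR_a(S)
assert torch.allclose(feats.grad, mu.detach().unsqueeze(1) * g, atol=1e-5)
# Eq. (6)/(8): gradient w.r.t. mu_i is <dL/dR_a(S), R_{I_i} - R_a(S)>
assert torch.allclose(mu.grad, (feats.detach() - r_a.detach()) @ g, atol=1e-5)
```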
Figure 3. Two different identities during training, best viewed in color. Red translucent dots and green translucent dots indicate the images in the sets of the two identities. The two solid dots denote the weighted centers of the two sets, which are also the representations of the sets S_anchor and S_neg. The gradients of S_anchor and S_neg are shown with red arrows. x_{ni} and x_{ai} are two image representations in the two sets.

Automatic gradient of \mu. After back-propagation through the set pooling unit, the gradient of \mu_i with respect to L_{veri} can be calculated according to Eq. (8): it is the dot product of the gradient of R_a(S) and R_{I_i} - R_a(S). So if the angle between \nabla R_a(S) and R_{I_i} - R_a(S) lies in (-90°, 90°), the gradient of \mu_i is positive. For example, as shown in Fig. 3, the angle between \nabla R_a(S_neg) and x_{ni} - R_a(S_neg) is less than 90°, so the quality score \mu_{ni} of x_{ni} becomes larger after this back-propagation step. In contrast, the relative direction of x_{ai} is on the opposite side of the gradient of R_a(S_anchor), making it obviously a hard sample, so its quality score \mu_{ai} tends to become smaller. Samples in the "correct" directions along the set gradient always score higher in quality, while those in the "wrong" directions receive lower weight. For example, in Fig. 3, the green samples in the upper area and the red samples in the lower area keep improving their quality consistently, while in the middle area a sample's quality is reduced. In this way, \mu_i indicates whether the i-th image is a good sample or a hard sample. This conclusion will be further demonstrated by experiments.

\mu regulates the attention on R_{I_i}. The gradient of R_{I_i} is given in Eq. (7) with a factor \mu_i, together with the gradient propagated from the softmax loss. Since most hard samples with low \mu_i are poor images or even full of background noise, the factor \mu_i in the gradient of R_{I_i} weakens their harmful effect on the whole model. That is, their impact on the parameters of the feature generation part is negligible during back propagation. This mechanism helps the feature generation part to focus on good samples and neglect poor ones, which benefits set-to-set recognition.
Figure 4. Structure of the quality generation unit. The input of this unit is the middle representations of a set containing N images, and it produces the normalized quality scores of all N images.
In the quality aware network (QAN), the quality generation part is a convolutional neural network. We design different score generation parts starting from different feature maps, and use the QAN split at Pool4 as an instance. As shown in Fig. 4, the output of the Pool4 layer has spatial size 14 × 14 with 512 channels. In order to generate a 1 × 1 quality score, the convolutional part contains a 2-stride pooling layer and a final pooling layer with kernel size 7 × 7. A fully connected layer follows the final pooling layer to generate the original quality score. After that, the original scores of all images in a set are sent to a sigmoid layer and a group L1-normalization layer to generate the final scores \mu. For the QAN split at Pool3, we add a block containing three 1-stride convolution layers and a 2-stride pooling layer at the beginning of the quality generation unit.
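The unit can be sketched as a small module like the following (our illustration, not the released implementation; the paper does not state which pooling operators are used, so the max/average choice below, and the class name, are assumptions):

```python
import torch
import torch.nn as nn

class QualityUnit(nn.Module):
    """Sketch of the Pool4 variant of the quality generation unit (Fig. 4)."""
    def __init__(self, channels=512):
        super().__init__()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # 14x14 -> 7x7 (2-stride pooling)
        self.pool2 = nn.AvgPool2d(kernel_size=7)            # 7x7 -> 1x1 (final pooling)
        self.fc = nn.Linear(channels, 1)                    # original (raw) quality score

    def forward(self, mid_repr):                            # (N, C, 14, 14) middle representations
        x = self.pool2(self.pool1(mid_repr)).flatten(1)     # (N, C)
        raw = torch.sigmoid(self.fc(x)).squeeze(1)          # (N,) scores in (0, 1)
        return raw / raw.sum()                              # group L1 normalization over the set
```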
Figure 5. Samples with their qualities predicted by QAN, best viewed in color. Top: comparison between two images of the same person; each column shows two frames of the same person, and the quality of the top frame is better than that of the bottom one. Bottom: randomly selected images from the test set, sorted by quality score from left (fine) to right (inferior).
4. Experiments
In this section, we first explore the meaning of the quality score learned by QAN. Then QAN's sensitivity to the level of feature is analysed. Based on this knowledge, we evaluate QAN on two person re-identification benchmarks and two unconstrained face verification benchmarks. Finally, we analyse the concept learned by QAN and compare it with scores labelled by humans.
Qualitative analysis. We visualize images with the quality scores \mu generated by QAN to explore the meaning of \mu. Instances of the same person with different qualities are shown in the first two rows of Fig. 5. All images are selected from the test set, and the two images in the same column belong to the same person. The upper images are randomly selected from images with quality scores higher than 0.8, and the lower images are selected from images with quality scores lower than their upper counterparts. It is easy to see that images with deformity, superposition, blur or extreme lighting conditions tend to obtain lower quality scores than normal images.

The last two rows of Fig. 5 give examples of other images randomly selected from the test set, sorted by their quality scores from left to right. We observe that instances with quality scores larger than 0.70 are easy for humans to recognize, while the others are hard. In particular, many of the hard images contain two or more bodies in the center, and we can hardly discriminate which one is the right target.
Quantitative analysis. In order to measure the relationship between the quality labelled by humans and the \mu predicted by QAN, 1000 images in YouTube Face are selected randomly and their quality is rated subjectively by 6 volunteers, where each volunteer estimates a quality score for each image, ranging from 0 to 1. All the ratings of each volunteer are aligned by logistic regression. The 6 aligned scores of each image are then averaged and finally normalized to [0, 1] to obtain the final quality score from humans.

We divide the images into ten partitions based on the human score, as shown in Fig. 6, where we plot the corresponding quality statistics generated by QAN. It is obvious that the scores given by QAN are strongly correlated with the human-defined quality. We further analyse the 499,500 image pairs formed from these 1000 images and ask both the humans and QAN to select the better image in each pair. The result shows that the decision made by QAN agrees with the human decision on 78.1% of the pairs (see the sketch after Fig. 6).

Figure 6.
Comparison of qualities estimated by humans and predicted by QAN (horizontal axis: score by QAN; vertical axis: score by humans).
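The pair-level agreement above can be computed as in this sketch (ours, with hypothetical array names), counting over all C(1000, 2) = 499,500 pairs how often QAN and the averaged human rating prefer the same image:

```python
import itertools

def pairwise_agreement(qan_scores, human_scores):
    """Fraction of image pairs where QAN and the human raters pick the
    same higher-quality image (78.1% in our experiment)."""
    agree = total = 0
    for i, j in itertools.combinations(range(len(qan_scores)), 2):
        agree += (qan_scores[i] > qan_scores[j]) == (human_scores[i] > human_scores[j])
        total += 1
    return agree / total
```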
Datasets. For person re-identification, we collect 134,942 frames with 16,133 people and 212,726 bounding boxes as the training data. Experiments are conducted on the PRID2011 [9] and iLIDS-VID [33] datasets. PRID2011 contains frames in two views captured at different positions of a street: CameraA has 385 identities while CameraB has 749 identities, and the two views have an overlap of 200 people. Each person has 5 to 675 images, with an average of 100. The iLIDS-VID dataset has 300 people, and each person has two sets, also captured from different positions; each person has 23 to 192 images.
Evaluation procedure. The results are reported in terms of the Cumulative Matching Characteristics (CMC) table, in which each column represents the matching rate at a certain top-N. Two settings are used for comprehensive evaluation. In the first setting, we follow the state-of-the-art methods described in [40] and [34]: the sets with more than 21 frames are used in PRID2011, and all the sets in iLIDS-VID are used. Each dataset is divided into two parts, for fine-tuning and testing respectively. In the testing set, sets from CameraA are taken as the probe set while sets from CameraB are taken as the gallery. The final number is reported as the average of 10-fold cross validation. In the second setting, we conduct cross-dataset testing. Different from the first setting, we skip the fine-tuning process and use all data to test our model. That is, in PRID2011, the first 200 people from CameraA serve as probes, and all sets from CameraB are used as the gallery set; in iLIDS-VID, CameraA is used as the probe set, and CameraB serves as the gallery set.
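For reference, the CMC numbers reported below can be computed with a routine like this sketch (ours; names are illustrative, and it assumes every probe identity appears in the gallery):

```python
import numpy as np

def cmc(probe_feats, probe_ids, gallery_feats, gallery_ids, top_n=20):
    """probe_feats: (P, D); gallery_feats: (G, D); returns CMC[0..top_n-1]."""
    hits = np.zeros(top_n)
    gallery_ids = np.asarray(gallery_ids)
    for f, pid in zip(probe_feats, probe_ids):
        dists = np.linalg.norm(gallery_feats - f, axis=1)   # distance to every gallery set
        ranked = gallery_ids[np.argsort(dists)]             # gallery identities, best first
        first_hit = np.flatnonzero(ranked == pid)[0]        # rank of the true match
        if first_hit < top_n:
            hits[first_hit:] += 1                           # a hit at rank r counts for all top-N >= r
    return hits / len(probe_ids)
```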
Baseline. We implement two baseline approaches. In the first baseline, we use average pooling to aggregate all images' representations. In the second baseline, the minimal cosine distance between images of the two sets is used as their similarity.
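A minimal sketch of the two baselines (ours; it assumes per-image features are already extracted, and interprets "Min(cos)" as the smallest cosine distance over all cross-set image pairs):

```python
import numpy as np

def avepool_similarity(set_a, set_b):
    """Baseline 1: cosine similarity of average-pooled set features.
    set_a, set_b: (N_a, D) and (N_b, D) per-image features."""
    a, b = set_a.mean(0), set_b.mean(0)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def min_cos_distance(set_a, set_b):
    """Baseline 2: minimal cosine distance over all cross-set pairs
    (a distance, so smaller means more similar)."""
    a = set_a / np.linalg.norm(set_a, axis=1, keepdims=True)
    b = set_b / np.linalg.norm(set_b, axis=1, keepdims=True)
    return (1.0 - a @ b.T).min()
```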
Results of the evaluation following 10-fold cross validation on PRID2011 and iLIDS-VID are shown in Table 1 and Table 2. Benefiting from the large-scale training dataset, our CNN+AvePool and CNN+Min(cos) baselines are close to or even better than the state of the art. Notice that most of the leading methods listed in the tables consider both appearance and spatio-temporal information, while our method only considers appearance information. On the PRID2011 dataset, QAN increases the top-1 matching rate by 11.1% and 29.4% compared with CNN+AvePool and CNN+Min(cos). On the iLIDS-VID dataset, the inherent noise is much stronger than in PRID2011, which significantly influences the accuracy of CNN+Min(cos), since the operator "Min(cos)" is more sensitive than "AvePool" to noisy samples. However, QAN achieves even more gain on this noisy dataset: it increases the top-1 matching rate by 12.21% and 37.9%.

PRID2011
Methods         CMC1   CMC5   CMC10  CMC20
QAN             90.3   -      -      -
CNN+AvePool     81.3   96.6   98.5   99.6
CNN+Min(cos)    69.8   91.3   97.1   99.8
CNN+RNN [36]    70     90     95     97
STFV3D [22]     42.1   71.9   84.4   91.6
TDL [40]        56.7   80.0   87.6   93.6
eSDC [34]       48.3   74.9   87.3   94.4
DVR [34]        40.0   71.7   84.5   92.2
LFDA [25]       43.7   72.8   81.7   90.9
KISSME [16]     34.4   61.7   72.1   81.0
LADF [21]       47.3   75.5   82.7   91.1
TopRank [19]    31.7   62.2   75.3   89.4
Table 1. Comparison of QAN, AvePool, Min(cos) and other state-of-the-art methods on PRID2011, where each number is the cumulative matching rate on the CMC curve.
Based on these two experiments, QAN significantly outperforms the two baselines on both datasets. It also performs better than many state-of-the-art approaches, pushing the top-1 matching rate 20.3% above the previous best, CNN+RNN [36], on PRID2011 and 10% above it on iLIDS-VID. The performance gain is more significant on the noisy iLIDS-VID dataset, which meets our expectation and proves QAN's ability to deal with images of poor quality.
iLIDS-VID
Methods         CMC1   CMC5   CMC10  CMC20
QAN             -      -      -      -
eSDC [34]       41.3   63.5   72.7   83.1
DVR [34]        39.5   61.1   71.7   81.0
LFDA [25]       32.9   68.5   82.2   92.6
KISSME [16]     36.5   67.8   78.8   87.1
LADF [21]       39.0   76.8   89.0   96.8
TopRank [19]    22.5   56.1   72.7   85.9
Table 2. Comparison of QAN, AvePool, Min(cos) and other person re-identification methods on iLIDS-VID, where each number is the cumulative matching rate on the CMC curve.
PRID2011
Methods         CMC1   CMC5   CMC10  CMC20
QAN             34.0   -      -      -
CNN+AvePool     29.4   57.5   68.8   80.2
CNN+Min(L2)     28.5   57.1   67.1   78.6
CNN+RNN [36]    28     57     69     81
Table 3. Cross-dataset performance of QAN on PRID2011, where each number is the cumulative accuracy on the CMC curve.

iLIDS-VID
Methods         CMC1   CMC5   CMC10  CMC20
QAN             47.7   -      -      -
CNN+AvePool     44.1   65.8   78.5   88.9
CNN+Min(L2)     41.9   61.7   75.5   79.5
Table 4. Cross-dataset performance of QAN on iLIDS-VID, where each number is the cumulative accuracy on the CMC curve.

To prevent our model from over-fitting the quality distribution of the test set, we conduct cross-dataset evaluation: we extract the set representations of iLIDS-VID and PRID2011 directly using the trained QAN, without fine-tuning. The QAN representation is then evaluated for CMC scores. Tables 3 and 4 show the results of QAN and the two baselines. It can be seen that QAN is robust even in the cross-dataset setting: it improves top-1 matching by 15.6% and 8.2% compared to the baselines. This result shows that the quality distribution learned by QAN from different datasets is able to generalize to other datasets.
Datasets. For face verification, we train our base model on an extended version of the VGG Face dataset [24], in which we extend the number of identities from 2.6K to 90K and the number of images from 2.6M to 5M. The model is evaluated on the YouTube Face database [35] and the IARPA Janus Benchmark A (IJB-A) dataset. YouTube Face contains 3425 videos of 1595 identities; it is challenging in that most faces are blurred or of low resolution. The IJB-A dataset contains 2042 videos of 500 people, and faces in IJB-A have large pose variance.
Evaluation procedure. We follow the 1:1 protocol on both benchmarks and evaluate the results using receiver operating characteristic (ROC) curves. The area under the curve (AUC) and the accuracy are two important indicators of the ROC. The datasets are evaluated using 10-fold cross-validation.
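The TPR-at-fixed-FPR numbers reported for IJB-A (Table 6) can be computed from pair scores as in this sketch (ours; function and variable names are illustrative):

```python
import numpy as np

def tpr_at_fpr(scores, labels, fpr_target):
    """scores: (M,) pair similarities; labels: (M,) 1 = same identity,
    0 = different. Returns the TPR at the threshold whose FPR is
    approximately fpr_target."""
    neg = np.sort(scores[labels == 0])[::-1]    # negative scores, high to low
    k = max(int(fpr_target * len(neg)), 1)      # number of admitted negatives
    threshold = neg[k - 1]                      # admits ~fpr_target of negatives
    return (scores[labels == 1] >= threshold).mean()
```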
Training details. All faces in the training and testing sets are detected and aligned by a multi-task region proposal network as described in [3]. We then crop the face regions, resize them to a fixed resolution and feed them to a convolutional neural network for face verification. The network begins with a 2-stride convolution layer, followed by 4 basic blocks, where each block has three 1-stride convolution layers and one 2-stride pooling layer. After that, a fully connected layer is used to obtain the final feature. The quality generation branch is built on top of the third pooling layer. We pre-train the network with the classification signal and then train the whole QAN.

Method          Accuracy (%)   AUC
QAN             -              -
CNN+AvePool     95.46          -
Table 5. Average accuracy and AUC of QAN on the YouTube Face dataset, compared with baselines and other state-of-the-art methods.
TPR@FPR         1e-3    1e-2    1e-1
QAN             89.6    -       -
CNN+AvePool     85.30   -       -

Table 6. TPRs of QAN at specific FPRs on the IJB-A dataset, compared with baselines and other state-of-the-art methods.
TPRs of QAN at specific FPRs on IJB-A dataset, com-pared with baselines and other state-of-the-arts. −3 −2 −1 T r u e P o s iti v e R a t e QAN_pool2Baseline(AvePool)Baseline(MinCos)DeepFaceEigenPEPDDML(combine)
Figure 7.
Average ROC curves of differentmethods on YouTube Face Dataset −3 −2 −1 T r u e P o s iti v e R a t e Baseline(MinCos)Baseline(AvePool)QAN@FC&FixQAN@FCQAN@InputQAN@Pool1QAN@Pool2QAN@Pool3QAN@Pool4
Figure 8.
ROC results for score generation partlearned by different level of feature. −3 −2 −1 F a l s e P o s iti v e R a t e QAN_pool2Baseline(AvePool)Baseline(MinCos)HumanScore
Figure 9.
QAN with human score performsbetter than the two baselines but worse thanthat scored by network.
On the YouTube Face dataset, it can be observed in Fig. 7 and Table 5 that the accuracy and AUC of our baselines are similar to state-of-the-art methods such as FaceNet and NAN. On top of this baseline, QAN further reduces the error ratio by 15.6%. Under the ROC evaluation metric, at 0.001 FPR (false positive rate) QAN surpasses NAN by 8% and DeepFace, which ensembles 25 models, by 80%.

On the IJB-A dataset, QAN significantly outperforms the state-of-the-art algorithm NAN by 10.81% at 0.001 FPR, 4.5% at 0.01 FPR and 2.12% at 0.1 FPR, as shown in Table 6. Compared with the average pooling baseline, QAN reduces the false negative rate at the above three FPRs by 29.32%, 6.45% and 7.91%.

Our experiments on the two tasks show that QAN is robust for set-to-set recognition. Especially at low FPR, QAN recalls more matched samples with fewer errors.
There is no explicit supervision signal for the quality score generation unit during training, which raises a question: is it better to use human-defined scores instead of letting the network learn by itself? In the YouTube Face experiment, we replace the quality score Q(I) with the volunteer-rated score and obtain the result in Fig. 9, which is better than the two baselines but inferior to the result of the original QAN. This shows that Q is similar to human judgement but more suitable for recognition: quality scores from humans also enhance the accuracy, but remain worse than QAN's.

The level of the middle representation may affect the performance of QAN. We use YouTube Face to analyse this factor by comparing different configurations. In the first configuration, the quality generation part is connected to the input image. In the second to fifth configurations, the quality generation part is set after the pooling layer of each of the four blocks, respectively. In the sixth configuration, we connect the quality generation part to a fully connected layer. In the final configuration, we fix all parameters before the final fully connected layer of the sixth configuration and only update the parameters of the quality generation part, which is taken as the seventh structure. To minimize the influence of the number of parameters, the total size of the different models is kept the same by changing the channel numbers.

Results are shown in Fig. 8. The performance of QAN improves at the beginning and reaches its top accuracy at Pool3. The end-to-end trained version of the feature generation part with the quality generation part performs better than the fixed one. So we can conclude that 1) middle-level features are better for QAN to learn from, and 2) significant improvement can be achieved by jointly training the feature generation part and the quality generation part.
5. Conclusion and future work
In this paper we propose a Quality Aware Network (QAN) for set-to-set recognition. It automatically learns the concept of quality for each sample in a set without a supervision signal, and aggregates the most discriminative samples to generate the set representation. We demonstrate theoretically and experimentally that the quality predicted by the network is beneficial to the set representation and better than human-labelled quality.

QAN can be seen as an attention model that pays attention to the high quality elements of an image set. However, an image of poor quality may still have some discriminative regions. Considering this, our future work will explore a fine-grained quality aware network that pays attention to high quality regions instead of high quality images in an image set.

References

[1] Ronen Basri, Tal Hassner, and Lihi Zelnik-Manor. Approximate nearest subspace search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):266–278, 2011.
[2] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In CVPR'10, pages 2567–2573. IEEE, 2010.
[3] Dong Chen, Gang Hua, Fang Wen, and Jian Sun. Supervised transformer network for efficient face detection. In European Conference on Computer Vision, pages 122–138. Springer, 2016.
[4] Jun-Cheng Chen, Rajeev Ranjan, Amit Kumar, Ching-Hui Chen, Vishal Patel, and Rama Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In ICCV Workshops, pages 118–126, 2015.
[5] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2010 IEEE Conference on, pages 2360–2367. IEEE, 2010.
[6] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person re-identification, volume 1. Springer, 2014.
[7] Gaurav Goswami, Romil Bhardwaj, Richa Singh, and Mayank Vatsa. MDLFace: Memorability augmented deep learning for video face recognition. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–7. IEEE, 2014.
[8] Tal Hassner, Iacopo Masi, Jungyeon Kim, Jongmoo Choi, Shai Harel, Prem Natarajan, and Gerard Medioni. Pooling faces: Template based face recognition with pooled face images. In CVPR'16 Workshops, pages 59–67, 2016.
[9] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Proc. Scandinavian Conference on Image Analysis (SCIA), 2011.
[10] Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image set classification. In CVPR'11, pages 121–128. IEEE, 2011.
[11] Gary B. Huang. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR'12, pages 2518–2525, Washington, DC, USA, 2012. IEEE Computer Society.
[12] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[13] Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on Grassmann manifold with application to video based face recognition. In CVPR'15, pages 140–149, 2015.
[14] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In CVPR'10, pages 3304–3311. IEEE, 2010.
[15] Joshua C Klontz, Brendan F Klare, Scott Klum, Anubhav K Jain, and Mark J Burge. Open source biometric recognition. In Biometrics: Theory, Applications and Systems (BTAS), 2013, pages 1–8. IEEE, 2013.
[16] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR'12, pages 2288–2295. IEEE, 2012.
[17] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR'06, volume 2, pages 2169–2178. IEEE, 2006.
[18] Haoxiang Li, Gang Hua, Xiaohui Shen, Zhe Lin, and Jonathan Brandt. Eigen-PEP for video face recognition. In Computer Vision–ACCV 2014, pages 17–33. Springer, 2014.
[19] Nan Li, Rong Jin, and Zhi-Hua Zhou. Top rank optimization in linear time. In Advances in Neural Information Processing Systems, pages 1502–1510, 2014.
[20] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In ICCV, pages 152–159, 2014.
[21] Zhen Li, Shiyu Chang, Feng Liang, Thomas Huang, Liangliang Cao, and John Smith. Learning locally-adaptive decision functions for person verification. In CVPR'13, pages 3610–3617, 2013.
[22] Kan Liu, Bingpeng Ma, Wei Zhang, and Rui Huang. A spatio-temporal appearance representation for video-based pedestrian re-identification. In ICCV'15, 2015.
[23] Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin, and Jie Zhou. Multi-manifold deep metric learning for image set classification. In CVPR'15, pages 1137–1145, 2015.
[24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
[25] Sateesh Pedagadi, James Orwell, Sergio Velastin, and Boghos Boghossian. Local Fisher discriminant analysis for pedestrian re-identification. In ICCV'13, pages 3318–3325, 2013.
[26] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[27] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In NIPS, pages 1988–1996, 2014.
[28] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, pages 1891–1898, 2014.
[29] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2892–2900, 2015.
[30] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In ICCV, pages 1701–1708, 2014.
[31] Dayong Wang, Charles Otto, and Anil K Jain. Face search at scale: 80 million gallery. arXiv preprint arXiv:1507.07242, 2015.
[32] Ruiping Wang, Huimin Guo, Larry S Davis, and Qionghai Dai. Covariance discriminative learning: A natural and efficient approach to image set classification. In CVPR'12, pages 2496–2503. IEEE, 2012.
[33] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking. In ECCV 2014, pages 688–703. Springer, 2014.
[34] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by discriminative selection in video ranking. 2016.
[35] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR'11, pages 529–534. IEEE, 2011.
[36] Lin Wu, Chunhua Shen, and Anton van den Hengel. Deep recurrent convolutional networks for video-based person re-identification: An end-to-end approach. arXiv preprint arXiv:1606.01609, 2016.
[37] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. arXiv preprint arXiv:1604.07528, 2016.
[38] Jiaolong Yang, Peiran Ren, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. arXiv preprint arXiv:1603.05474, 2016.
[39] Meng Yang, Pengfei Zhu, Luc Van Gool, and Lei Zhang. Face recognition based on regularized nearest points between image sets. In Automatic Face and Gesture Recognition (FG), 2013 Workshops on, pages 1–7. IEEE, 2013.
[40] Jinjie You, Ancong Wu, Xiang Li, and Wei-Shi Zheng. Top-push video-based person re-identification. arXiv preprint arXiv:1604.08683, 2016.
[41] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pages 868–884. Springer, 2016.
[42] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Person re-identification by probabilistic relative distance comparison. In CVPR, 2011 IEEE Conference on, pages 649–656. IEEE, 2011.
[43] Pengfei Zhu, Lei Zhang, Wangmeng Zuo, and David Zhang. From point to set: Extend the learning of distance metrics. In ICCV'13, 2013.