Person Recognition in Personal Photo Collections
Seong Joon Oh, Rodrigo Benenson, Mario Fritz, and Bernt Schiele, Fellow, IEEE
Abstract—People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For the human identification task, however, the traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state of the art result on the PIPA [1] benchmark, arguably the largest social media based benchmark for person recognition to date, with diverse poses, viewpoints, social groups, and events.
Compared to the conference version of the paper [2], this paper additionally presents (1) an analysis of a face recogniser (DeepID2+ [3]), (2) a new method naeil2 that combines the conference version method naeil and DeepID2+ to achieve state of the art results even compared to post-conference works, (3) a discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.
Index Terms—Computer vision, Person recognition, Social media.
1 INTRODUCTION

WITH the advent of social media and the shift of image capturing from digital cameras to smartphones and life-logging devices, users share massive amounts of personal photos online these days. Being able to recognise people in such photos would benefit the users by easing photo album organisation. Recognising people in natural environments poses interesting challenges; people may be focused on their activities with the face not visible, or may change clothing or hairstyle. These challenges are largely new – the traditional focus of computer vision research for human identification has been face recognition (frontal, fully visible faces) or pedestrian re-identification (no clothing changes, standing pose).
Intuitively, the ability to recognise faces in the wild [3], [4] is still an important ingredient. However, when people are engaged in an activity (i.e. not posing), their faces become only partially visible (non-frontal, occluded) or simply fully invisible (back view). Therefore, additional information is required to reliably recognise people. We explore other cues, including (1) the body of a person, which contains information about shape and appearance; (2) human attributes such as gender and age; and (3) scene context. See figure 1 for examples that require an increasing number of contextual cues for successful recognition.
This paper presents an in-depth analysis of the person recognition task in the social media type of photos: given a few annotated training images per person, who is this person in the test image?

• S. Oh, R. Benenson, and M. Fritz were with the Computer Vision and Multimodal Computing Group, Max Planck Institute for Informatics, Saarbrücken, Germany when this work was done; they are currently with LINE Plus (South Korea), Google (Switzerland), and CISPA Helmholtz Center i.G. (Germany), respectively. E-mail: [email protected], [email protected], and [email protected].
• B. Schiele is with the Computer Vision and Multimodal Computing Group, Max Planck Institute for Informatics, Saarbrücken, Germany. E-mail: [email protected].
Manuscript received ??; revised ??.
Head        ✔  ✘  ✘  ✘
Body        ✔  ✔  ✘  ✘
Attributes  ✔  ✔  ✔  ✘
All cues    ✔  ✔  ✔  ✔

Fig. 1: In social media photos, depending on face occlusion or pose, different cues may be effective; the grid above indicates which cue sets succeed for each of the four example columns. For example, the surfer in the third column is not recognised using only head and body cues due to the unusual pose. However, she is successfully recognised when additional attribute cues are considered.

The main contributions of the paper are summarised as follows:
• Propose realistic and challenging person recognition scenarios on the PIPA benchmark (§3).
• Provide a detailed analysis of the informativeness of different cues, in particular of a face recognition module, DeepID2+ [3] (§4).
• Verify that our journal version final model naeil2 achieves the new state of the art performance on PIPA (§5).
• Analyse the contribution of cues according to the amount of appearance and viewpoint changes (§6).
• Discuss the performance of our methods under the open-world recognition setup (§7).
• Code and data are open source: available at https://goo.gl/DKuhlY.
2 RELATED WORK
We review work on human identification based on various visual cues. Faces are the most obvious and widely studied cue, while other biometric cues have also been considered. We discuss how our personal photo setup differs from these settings.
Face
The bulk of previous work on person recognition focuses on faces. Labeled Faces in the Wild (LFW) [4] has been a great testbed for a host of works on face identification and verification outside the lab setting. The benchmark has saturated, owing to deep features [3], [5], [6], [7], [8], [9], [10], [11], [12], [13] trained on large scale face databases, which outperform the traditional methods involving sophisticated classifiers based on hand-crafted features and metric learning approaches [14], [15], [16], [17]. While faces are clearly the most discriminative cue for recognising people in personal photos as well, they are often occluded in natural footage, e.g. when the person is engaged in other activities. Since LFW contains largely frontal views of subjects, it does not fully represent the setup we are interested in. LFW is also biased towards public figures.
Some recent face recognition benchmarks have introduced more face occlusions. The IARPA Janus Benchmark A (IJB-A) [18] and Celebrities in Frontal Profile (CFP) [19] datasets include faces with profile viewpoints; however, neither IJB-A nor CFP considers subjects fully turning away from the camera (back view), and the subjects are limited to public figures. The Age Database (AgeDB) [20] evaluates recognition across long time spans (years), but is again biased towards public figures; recognition across an age gap is part of our task, and we focus on personal photos without celebrity bias.
MegaFace [21], [22] is perhaps the largest known open source face database over personal photos on Flickr. However, MegaFace still does not contain any back-view subjects, and it is not designed to evaluate the ability to combine cues from multiple body regions. Face recognition datasets are thus not suitable for training and evaluating systems that identify a human from the face and other body regions.
Pedestrian Re-Identification from RGB Images
Not only the face, but the entire body and clothing patterns have also been explored as cues for human identification. For example, pedestrian re-identification (re-id) tackles the problem of matching pedestrian detections across different camera views. Early benchmarks include VIPeR [23], CAVIAR [24], and CUHK [25], while nowadays most re-id papers report results on Market-1501 [26], MARS [27], CUHK03 [28], and DukeMTMC-reID [29]. There is an active line of research on pedestrian re-id that started with hand-crafted features [30], [31], [32] and has moved towards deep feature based schemes [28], [33], [34], [35], [36], [37], [38], [39], [40].
However, the re-id datasets and benchmarks do not fully cover the social media setup. (1) Subjects are pedestrians and mostly appear in a standing pose; in personal photos people may be engaged in a diverse array of activities and poses, e.g. skiing, performing arts, or giving a presentation. (2) The resolution is typically low; person recognition in personal photos includes the problem of matching identities across a huge resolution range, from selfies to group photos.
Pedestrian Re-Identification from Depth Images
In order to identify humans based on body shape, potentially enabling recognition independent of clothing changes, researchers have proposed depth-based re-identification setups. Datasets include RGBD-ID [41], IAS-Lab RGBD-ID [42], and the recent SOMAset [43]. SOMAset in particular has clothing changes enforced in the dataset. A line of work [41], [42], [43], [44] has improved recognition technology under this setup. While recognition across clothing changes is related to our task of identifying humans in personal photos, RGBD based re-identification typically requires depth information for good performance; for personal photos, depth information is unavailable. Moreover, the relevant datasets are collected in controlled lab setups, while personal photos are completely unconstrained.
Other biometric cues
Traditionally, fingerprints and iris patterns have been considered strong visual biometric cues [45], [46]. Gait [47] is also known to be an identity-correlated cue. We do not use them explicitly in this work, as such information is not readily available in personal photos.
Personal Photos
Personal photos have distinct characteristics that set new challenges not fully addressed before. For example, people may be engaged in certain activities, not cooperating with the photographer, and people may change clothing over time. Some pre-convnet work has addressed this problem at a small scale [48], [49], [50], [51], combining cues from the face as well as clothing regions. Among these, Gallagher et al. [51] have published the dataset "Gallagher collection person" for benchmarking (32 identities). It was not until the appearance of the PIPA dataset [1] that a large-scale dataset of personal photos became available. The dataset consists of Flickr personal account images (Creative Commons) and is fairly large in scale (∼40k images, ∼2k identities), with diverse appearances and subjects with all viewpoints and occlusion levels. Heads are annotated with bounding boxes, each with an identity tag. We describe PIPA in greater detail in §3.
There exist multiple tasks related to person recognition [52], differing mainly in the amount of training and testing data. Face and surveillance re-identification are most commonly done via verification: given one reference image (gallery) and one test image (probe), do they show the same person? [4], [53]. In this paper, we consider two recognition tasks.
• Closed world identification: given a single test image (probe), who is this person among the training identities (gallery set)?
• Open world recognition [21] (§7): given a single test image (probe), is this person among the training identities (gallery set)? If so, who?
Other related tasks are face clustering [7], [54], [55], finding important people [56], and associating names in text to faces in images [57], [58].
Since the introduction of the PIPA dataset [1], multiple works have proposed different methods for solving the person recognition problem in social media photos. Zhang et al. proposed Pose Invariant Person Recognition (PIPER) [1], obtaining promising results by combining three ingredients: DeepFace [5] (a face recognition module trained on a large private dataset), poselets [59] (a pose estimation module trained with 2k images and 19 keypoint annotations), and convnet features trained on detected poselets [60], [61].
Oh et al. [2], the conference version of this paper, proposed a much simpler model, naeil, that extracts AlexNet cues from multiple fixed image regions. In particular, unlike PIPER it does not require the data-heavy DeepFace or time-costly poselets; it uses only 17 cues (PIPER uses over 100 cues); and it still outperforms PIPER.
There have been many follow-up works since then. Kumar et al. [62] improved the performance by normalising the body pose using pose estimation. Li et al. [63] considered exploiting people co-occurrence statistics. Liu et al. [64] proposed to train a person embedding in a metric space instead of training a classifier on a fixed set of identities, thereby making the model more adaptable to unseen identities. We discuss and compare against these works in greater detail in §5. Some works have exploited photo-album metadata, allowing the model to reason over different photos [65], [66].
In this journal version, we build naeil2 from naeil and DeepID2+ [3] to achieve the state of the art result among published work on PIPA. We provide additional analysis of cues according to time and viewpoint changes.
3 DATASET AND EXPERIMENTAL SETUP
Dataset
The PIPA dataset ("People In Photo Albums") [1] is, to the best of our knowledge, the first dataset to annotate people's identities even when they are pictured from the back. The annotators labelled instances that can be considered hard even for humans (see qualitative examples in figures 19 and 20). PIPA features 37 107 Flickr personal photo album images (Creative Commons license), with 63 188 head bounding boxes, each tagged with an identity. The head bounding boxes are tight around the skull, including the face and hair; occluded heads are hallucinated by the annotators. The dataset is partitioned into train, val, test, and leftover sets, with a rough ratio of 45:15:20:20 percent of the annotated heads. The leftover set is not used in this paper. Up to annotation errors, neither identities nor photo albums by the same uploader are shared among these sets.
Task
At test time, the system is given a photo and a ground truth head bounding box corresponding to the test instance (probe). The task is to choose the identity of the test instance among a given set of identities (gallery set, 200∼500 identities), each with ∼10 training samples.
In §7, we evaluate the methods when the test instance may be a background person (e.g. bystanders, for whom no training image is given). The system is then also required to determine whether the given instance is among the seen identities (gallery set).
Protocol
We follow the PIPA protocol in [1] for data utilisation and model evaluation. The train set is used for convnet feature training. The test set contains the examples for the test identities. For each identity, the samples are divided into test_0 and test_1. For evaluation, we perform a two-fold cross validation by training on one of the splits and testing on the other. The val set is likewise split into val_0 and val_1, and is used for exploring different models and tuning hyperparameters.
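A minimal sketch of this two-fold evaluation, assuming per-half feature matrices and identity labels are already extracted; train_classifier and accuracy are placeholder callables, not functions from the paper's code:

```python
import numpy as np

def two_fold_recognition_rate(half0, half1, train_classifier, accuracy):
    """Average the recognition rate over the two (train, test) orderings,
    e.g. (val_0, val_1) and (val_1, val_0).

    half0, half1: (features, identity_labels) tuples for the two halves;
    train_classifier and accuracy are user-supplied callables.
    """
    rates = []
    for train, test in [(half0, half1), (half1, half0)]:
        clf = train_classifier(*train)      # e.g. one-vs-rest linear SVMs
        rates.append(accuracy(clf, *test))  # fraction of correctly named probes
    return float(np.mean(rates))
```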
Evaluation

We use the recognition rate (or accuracy), the rate of correct identity predictions among the test instances. For every experiment, we average the two recognition rates obtained from the (training, testing) pairs (val_0, val_1) and (val_1, val_0), and analogously for test.

We consider four different ways of splitting the training and testing samples (val_0/1 and test_0/1) for each identity, aiming to evaluate different levels of generalisation ability. The first one is from prior work, and we introduce three new ones. Refer to table 1 for data statistics and figure 3 for a visualisation.

Original split O [1]

The Original split shares many similar examples per identity across the split, e.g. photos taken in a row. The Original split is thus easy: even nearest neighbour on raw RGB pixels works (§5.1). In order to evaluate the ability to generalise across long-term appearance changes, we introduce three new splits below.
Album split A [2]

The Album split divides training and test samples for each identity according to the photo album metadata. Each split takes whole albums while trying to match the number of samples per identity as well as the total number of samples across the splits. A few albums are shared between the splits in order to match the number of samples. Since the Flickr albums are user-defined and do not always strictly cluster events and occasions, the split may not be perfect.
Time split T [2]

The Time split divides the samples according to the time the photo was taken. For each identity, the samples are sorted according to their "photo-taken-date" metadata, and then divided on a newest versus oldest basis. The instances without time metadata are distributed evenly. This split evaluates the temporal generalisation of the recogniser. However, the "photo-taken-date" metadata is very noisy, with much of it missing.
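As an illustration of the rule just described, a small sketch (not the authors' code) that splits one identity's instances by photo-taken-date, distributing undated instances evenly; the instance records and the 'taken' key are assumed placeholders:

```python
def time_split_identity(instances):
    """Split one identity's instances into (oldest, newest) halves.

    `instances` is a list of dicts with an optional 'taken' timestamp;
    instances lacking the metadata are spread evenly over the two halves.
    """
    dated = sorted((i for i in instances if i.get("taken") is not None),
                   key=lambda i: i["taken"])
    undated = [i for i in instances if i.get("taken") is None]
    half = len(dated) // 2
    oldest, newest = list(dated[:half]), list(dated[half:])
    for k, inst in enumerate(undated):   # distribute undated instances evenly
        (oldest if k % 2 == 0 else newest).append(inst)
    return oldest, newest
```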
Day split D [2]

The Day split divides the instances via visual inspection to ensure a firm appearance change across the splits. We define two criteria for division: (1) firm evidence of a date change, such as a change of season, continent, event, or co-occurring people, and/or (2) visible changes in hairstyle, make-up, or head or body wear. We discard identities for whom such a division is not possible. After division, for each identity we randomly discard samples from the larger split until the sizes match. If the smaller split has too few instances, we discard the identity altogether. The Day split enables clean experiments for evaluating the generalisation performance across strong appearance and event changes.
Fig. 2: Face detections and head annotations in PIPA. The matches are determined by overlap (intersection over union). For matched faces (heads), the detector DPM component gives the orientation information (frontal versus non-frontal).

                          val                       test
                     O     A     T     D      O     A     T     D
split 0  instances   4820  4859  4818  1076   6443  6497  6441  2484
         identities   366   366   366    65    581   581   581   199
split 1  instances   4820  4783  4824  1076   6443  6389  6445  2485
         identities   366   366   366    65    581   581   581   199

TABLE 1: Split statistics for the val and test sets. The total number of instances and identities for each split is shown.
Instances in PIPA are annotated by humans around their heads (tight around the skull). We additionally compute face detections over PIPA for three purposes: (1) to compare the amount of identity information in the head versus the face (§4), (2) to obtain head orientation information for further analysis (§6), and (3) to simulate the scenario without a ground truth head box at test time (§7). We use the open source DPM face detector [67].
Given a set of detected faces (above a certain detection score threshold) and the ground truth heads, the match is made according to the overlap (intersection over union). For matched heads, the corresponding face detections tell us which DPM component fired, thereby allowing us to infer the head orientation (frontal or side view). See Appendix §A for further details.
Using the DPM component, we partition instances in PIPA as follows: (1) detected and frontal (FR, 41.29%), (2) detected and non-frontal (NFR, 27.10%), and (3) no face detected (NFD, 31.60%). We denote detections without a matching ground truth head as Background. See figure 2 for a visualisation.
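A minimal sketch of the face-to-head matching by intersection over union; the 0.5 threshold and the greedy assignment are assumptions for illustration, not values from the paper:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_faces_to_heads(faces, heads, thresh=0.5):
    """Greedily assign each detected face to its best-overlapping head box.

    Returns {head_index: face_index}. Unmatched detections are treated as
    Background; heads left unmatched correspond to 'no face detected' (NFD).
    """
    matches = {}
    for fi, face in enumerate(faces):
        overlaps = [iou(face, h) for h in heads]
        hi = int(np.argmax(overlaps)) if overlaps else -1
        if hi >= 0 and overlaps[hi] >= thresh and hi not in matches:
            matches[hi] = fi
    return matches
```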
4 CUES FOR RECOGNITION
In this section, we investigate the cues for recognising people in social media photos. We begin with an overview of our model. Then, we experimentally answer the following questions: how informative are fixed body regions (no pose estimation) (§4.4)? How much does scene context help (§4.6)? Is it the head or the face (head minus hair and background) that is more informative (§4.7)? How much do we gain by using extended data (§4.8 & §4.9)? And how effective is a specialised face recogniser (§4.11)? Studies in this section are based exclusively on the val set.
Features

Fig. 4: Regions considered for feature extraction: face f, head h, upper body u, full body b, and scene s. More than one cue can be extracted per region.

At test time, given a ground truth head bounding box, we estimate five different regions depicted in figure 4. Each region is fed into one or more convnets to obtain a set of cues. The cues are concatenated to form a feature vector describing the instance. Throughout the paper we write + to denote vector concatenation. Linear SVM classifiers are trained over this feature vector (one versus the rest). In our final system, except for DeepID2+ [3], all features are computed using the seventh layer (fc7) of AlexNet [60] pre-trained for ImageNet classification. The cues only differ amongst each other in the image area and in the fine-tuning (type of data or surrogate task) used to alter the AlexNet, except for the DeepID2+ [3] feature.

We choose five different image regions based on the ground truth head annotation (given at test time, see the protocol in §3). The head rectangle h corresponds to the ground truth annotation. The full body rectangle b is 3 head widths wide and several head heights tall, with the head at the top centre of the full body. The upper body rectangle u is the upper half of b. The scene region s is the whole image containing the head. The face region f is obtained using the DPM face detector discussed in §3.2. For head boxes with no matching detection (e.g. back views and occluded faces), we regress the face area from the head using the face-head displacement statistics on the train set. The five respective image regions are illustrated in figure 4.

Note that the regions overlap with each other, and that depending on the person's pose they might be completely off. For example, b for a lying person is likely to contain more background than the actual body. While precise body parts obtained via pose estimation [68], [69] may contribute to even better performance [64], we choose not to use them for the sake of efficiency. Our simple region selection scheme still leads to state of the art performance, even compared to methods that do rely on pose estimation (§5).

Unless specified otherwise, AlexNet is fine-tuned using the PIPA train set, cropped at the five different image regions, for a fixed number of mini-batch iterations. We refer to the base cue thus obtained as f, h, u, b, or s, depending on the cropped region. On the val set we found the fine-tuning to provide a systematic gain of a few percent points (pp) over the non-fine-tuned AlexNet (figure 5). We use the seventh layer (fc7) of AlexNet for each cue (4096 dimensions).

We train for each identity a one-versus-all SVM classifier with the regularisation parameter C = 1; it turned out to be an insensitive parameter in our preliminary experiments. As an alternative, the naive nearest neighbour classifier has also been considered. However, on the val set the SVMs consistently outperform the NNs by a clear margin.
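A minimal sketch of the region geometry derived from the ground truth head box, under the assumption that boxes are (x, y, w, h) in pixels; the body height multiplier is an assumed parameter, since the exact value is given in the text but not reproduced here, and the face region (DPM detection or regression) is omitted:

```python
def regions_from_head(head, img_w, img_h, body_w_factor=3.0, body_h_factor=6.0):
    """Derive the fixed image regions (h, b, u, s) from a head box (x, y, w, h).

    body_w_factor follows the '3 x head width' rule from the text;
    body_h_factor is an assumed value. The head sits at the top centre of
    the body box, u is the upper half of b, and s is the whole image.
    Clipping to image bounds is omitted for brevity.
    """
    x, y, w, h = head
    bw, bh = body_w_factor * w, body_h_factor * h
    body = (x + w / 2.0 - bw / 2.0, y, bw, bh)   # head at top centre of body
    upper = (body[0], body[1], bw, bh / 2.0)     # upper half of the body box
    scene = (0.0, 0.0, float(img_w), float(img_h))
    return {"h": head, "b": body, "u": upper, "s": scene}
```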
Fig. 3: Visualisation of the Original, Album, Time and Day splits for three identities (rows 1-3). A greater appearance gap is observed from the Original to the Day split.
Fig. 5: Effect of fine-tuning on recognition accuracy: PIPA val set performance of the body, upper body, and head cues versus the number of SGD mini-batch iterations.

Table 2 shows the val set results of each region individually and in combination. Head h and upper body u are the strongest individual cues. Upper body is more reliable than the full body b because the lower body is commonly occluded or cut out of the frame, and thus is usually a distractor. Scene s is, unsurprisingly, the weakest individual cue, but it still provides useful information for person recognition (far above chance level). Importantly, we see that all cues complement each other, despite overlapping pixels. Overall, our features and combination strategy are effective.

TABLE 2: PIPA val set accuracy of cues based on different image regions and their concatenations (+ means concatenation), ranging from the chance level (1.04) and single regions (s, b, u, h, f) to combinations up to P_s = P + s, where P = f + h + u + b.

Fig. 6: Regions considered for analysis. We consider cues from head (h) as well as upper body (u) sized patches that are either chosen randomly (r-patch, column 1) or in a sliding window manner (sw, column 2). Recognition results for sw are visualised in the Original (column 3) and Day (column 4) splits.

In order to further justify the choice of the five image regions (fhubs), we compare their informativeness against three baseline types of cues: (1) random patch (r-patch), (2) sliding window (sw), and (3) random initialisation (r-init). For each type, we consider head sized (h, ground truth head size) and upper body sized (u, a multiple of the head size) regions. See figure 6, columns 1 and 2, for r-patch and sw.

Specifically, (1) for the head sized r-patch we sample regions from within the original upper body region (u); for the upper body sized r-patch, we sample from within a fixed distance of the original upper body region. (2) For the head sized sw, we set the stride to half of the h width/height, while for upper body sized ones, we set the stride to the h width/height themselves. The r-patch and sw regions are fixed across person instances with respect to the respective head locations. (3) The r-init cues are always based on the original head and upper body regions, but the features are trained with different random initialisations.

The results for the sliding window regions are shown in figure 6, columns 3 and 4, under the Original and Day splits, respectively. In all sizes and domain gaps, the original head region is the most informative one. The informativeness of the head region is amplified under the Day split (larger domain gap), with a larger performance gap between the head and context regions – clothing and event changes in the context regions hamper identification. §6 contains a more in-depth analysis regarding this point.

We compare the three types of regions against our choice of regions fhubs quantitatively. See figure 7 for the plot showing the trade-off between the accuracy and the number of cues used. For fhubs, we progressively combine from f to s (5 cues maximum).

TABLE 3: PIPA val set accuracy of different scene cues (s_gist, s_places205, s_places and its fine-tuned variant, and the ImageNet-pretrained scene cue s with and without fine-tuning). See descriptions in §4.6.
For random patch (r-patch) and random initialisation (r-init), the combination order is randomised. For sliding window (sw), we combine the regions nearest to the head first, and expand the combination diameter. Note that for every baseline region type (r-patch/init and sw), we consider combining head and upper body sized boxes together (hu).

From figure 7 we observe, most importantly, that our choice of regions fhubs gives the best performance-complexity (measured in number of cues) trade-off among the regions and combinations considered, under both small and large domain gaps (Original and Day splits). Amongst the baseline types, sw and r-init beat r-patch in general; it is important to focus on the head region, the most identity-relevant part. In the Original split, it helps to combine h and u for all baseline types; it is important to combine multi-scale cues. However, the best trade-off is given by fhubs, which samples from diverse regions and scales.
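A sketch of the accuracy versus number-of-cues analysis behind figure 7, assuming pre-extracted per-cue feature matrices; the combination order is passed in (f to s for fhubs, randomised for the baselines). This is an illustration, not the authors' code:

```python
import numpy as np
from sklearn.svm import LinearSVC

def accuracy_vs_num_cues(train_feats, test_feats, y_train, y_test, order):
    """Accuracy as a function of the number of concatenated cues.

    train_feats / test_feats map cue names (e.g. 'f', 'h', 'u', 'b', 's')
    to feature matrices; `order` fixes the combination order.
    """
    results = []
    for k in range(1, len(order) + 1):
        used = order[:k]
        x_tr = np.concatenate([train_feats[c] for c in used], axis=1)
        x_te = np.concatenate([test_feats[c] for c in used], axis=1)
        clf = LinearSVC(C=1.0).fit(x_tr, y_train)   # one-vs-rest linear SVMs
        results.append((k, clf.score(x_te, y_test)))
    return results
```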
Conclusion

Our choice of the regions fhubs efficiently captures diverse identity information at diverse scales (only five cues). fhubs beats the baseline region selection methods, including random patches, sliding windows, and ensembles of randomly initialised cues, in terms of the performance-complexity (number of cues) trade-off.

Scene cue (s)

The scene region is the whole image containing the person of interest. Other than a fine-tuned AlexNet, we considered multiple feature types to encode the scene information. s_gist: the Gist descriptor [70]. s_places: instead of using AlexNet pre-trained on ImageNet, we consider an AlexNet (PlacesNet) pre-trained on the scene categories of the "Places Database" [71] (∼2.5 million images). s_places205: instead of the PlacesNet feature vector, we also consider using the score vector over the 205 scene categories. Finally, we consider using AlexNet in the same way as for the body or head, with and without fine-tuning on the PIPA person recognition training set, as well as s_places fine-tuned for person recognition.

Results
Table 3 compares the different alternatives on the val set. The Gist descriptor s_gist performs only slightly below the convnet options (we also tried a Gist variant with a different dimensionality, obtaining worse results). Using the raw (and longer) feature vector of s_places is better than the class scores of s_places205. Interestingly, in this context pre-training for places classification is better than pre-training for object classification (s_places versus s). After fine-tuning, s reaches a similar performance as s_places.

Fig. 7: Comparison of our region choice fhubs against three baseline region types, random patch (r-patch), sliding window (sw), and random initialisation (r-init), based on head (h) and upper body (u) sized regions, on the (a) Original and (b) Day splits. We report the val set accuracy against the number of cues used. The combination orders for r-patch and r-init are random; for sw, we combine regions close to the head first.

Experiments trying different combinations indicate that there is little complementarity between these features. Since there is not a large difference between s_places and s, for the sake of simplicity we use s as our scene cue in all other experiments.

Conclusion
Scene s by itself, albeit weak, obtains results far above the chance level. After fine-tuning, scene recognition as a pre-training surrogate task [71] does not provide a clear gain over (ImageNet) object recognition.

Head (h) or face (f)?

A large portion of work on face recognition focuses on the face region specifically. In the context of photo albums, we aim to quantify how much information is available in the head versus the face region. As discussed in §3.2, we obtain the face regions f from the DPM face detector [67].

Results
There is a large performance gap between f and h in table 2, highlighting the importance of including the hair and background around the face.

Conclusion
Using h is more effective than f, but f still shows fair performance on its own. As with other body cues, there is complementarity between h and f; we suggest using them together.

More data (h_cacd, h_casia)

It is well known that deep learning architectures benefit from additional data.
DeepFace [5], used by PIPER [1], is trained on millions of faces of thousands of persons (the private SFC dataset [5]). In comparison, our cues are trained over ImageNet and PIPA's much smaller set of annotated faces and persons. To measure the effect of training on larger data, we consider fine-tuning using two open source face recognition datasets: CASIA-WebFace (CASIA) [72] and the "Cross-Age Reference Coding Dataset" (CACD) [73].

TABLE 4: PIPA val set accuracy of different cues based on extended data: more data (§4.8: h, h + h_cacd, h + h_casia, h + h_casia + h_cacd), attributes (§4.9: h_pipa11m, h_pipa11, h + h_pipa11, u_peta5, u + u_peta5, A = h_pipa11 + u_peta5, h + u, h + u + A), and naeil (§4.10), which reaches 91.70.

CASIA contains about half a million images of roughly ten thousand persons (mainly actors and public figures). When fine-tuning AlexNet over these identities (using the head area h), we obtain the h_casia cue. CACD contains faces of about two thousand persons with varying ages. Although smaller in total number of images than CASIA, CACD features a greater number of samples per identity. The h_cacd cue is built via the same procedure as h_casia.

Results
See the top part of table 4 for the results. h + h_cacd and h + h_casia improve over h (by 1.0 and 2.2 pp, respectively). Extra convnet training data seems to help. However, due to the mismatch in data distribution, h_cacd and h_casia on their own are somewhat worse than h.
Conclusion

Extra convnet training data helps, even if it comes from a different type of photos.

Attributes (h_pipa11, u_peta5)

Albeit overall appearance might change from day to day, one could expect that stable, long-term attributes provide a means for recognition. We build attribute cues by fine-tuning AlexNet features not for the person recognition task (as for all other cues), but rather for an attribute prediction surrogate task. We consider two sets of attributes, one on the head region and the other on the upper body region.
Attribute     Classes        Criteria
Age           Infant         Not walking (due to young age)
              Child          Not fully grown body size
              Young Adult    Fully grown, younger than middle age
              Middle Age     Within the middle-age range
              Senior         Older than the middle-age range
Gender        Female         Female looking
              Male           Male looking
Glasses       None           No eyewear
              Glasses        Transparent glasses
              Sunglasses     Glasses with eye occlusion
Haircolour    Black          Black
              White          Any hint of whiteness
              Others         Neither of the above
Hairlength    No hair        Absolutely no hair on the scalp
              Less hair      Hairless over a large part of the upper scalp
              Short hair     Short when straightened
              Med hair       Below chin level when straightened
              Long hair      Beyond chin level when straightened

TABLE 5: PIPA attributes details.

We have annotated identities in the PIPA train and val sets with five long-term attributes: age, gender, glasses, hair colour, and hair length (see table 5 for details). We build h_pipa11 by fine-tuning AlexNet features for the task of head attribute prediction.
For fine-tuning the attribute cue h_pipa11, we consider two approaches: training a single network for all attributes as a multi-label classification problem with the sigmoid cross entropy loss, or tuning one network per attribute separately and concatenating the feature vectors. The results on the val set indicate that the latter (h_pipa11) performs better than the former (h_pipa11m).
For the upper body attribute features, we use the "PETA pedestrian attribute dataset" [74]. The dataset originally has attribute annotations for thousands of full-body pedestrian images. We chose five long-term attributes for our study: gender, age (young adult, adult), black hair, and short hair (details in table 5). We choose to use the upper body u rather than the full body b for attribute prediction – the crops are much less noisy. We train the AlexNet feature on the upper body of PETA images with the attribute prediction task to obtain the cue u_peta5.
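For the single-network variant (h_pipa11m) described above, the multi-label objective is a sigmoid cross entropy over one binary output per attribute class. A minimal numpy sketch of that loss, assuming logits and binary targets of shape (batch, num_attribute_classes); this is an illustration of the loss formulation, not the authors' training code:

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Multi-label sigmoid cross-entropy, one output per binary attribute.

    Numerically stable form: max(z, 0) - z * t + log(1 + exp(-|z|)).
    The alternative variant (h_pipa11) instead trains one network per
    attribute and concatenates the fc7 feature vectors.
    """
    z, t = np.asarray(logits, float), np.asarray(targets, float)
    loss = np.maximum(z, 0.0) - z * t + np.log1p(np.exp(-np.abs(z)))
    return loss.mean()
```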
Results

See the results in table 4. Both the PIPA (h_pipa11) and PETA (u_peta5) attribute cues behave similarly (small gains over h and u, respectively), and show complementarity (a further gain over h + u). Amongst the attributes considered, gender contributes the most to improving recognition accuracy (for both attribute datasets).
Conclusion

Adding attribute information improves the performance.

Combining cues (naeil) [2]
The final model in the conference version of this paper combines five vanilla regional cues (P_s = P + s), two head cues trained with extra data (h_cacd, h_casia), and ten attribute cues (h_pipa11, u_peta5), resulting in 17 cues in total. We name this method naeil [2].
1. “naeil”, 내일 , means “tomorrow” and sounds like “nail”. Results
See table 4 for the results. naeil, which naively combines all the cues considered, achieves the best result of 91.70% on the val set.
Conclusion
Cues considered thus far are complementary, and the combined model naeil is effective.

Face recogniser (h_deepid) [3]

Face recognition performance has improved significantly in recent years with better architectures and larger open source datasets [3], [4], [5], [7], [8], [9], [10], [75], [76]. In this section, we study how much face recognition helps in person recognition. While DeepFace [5], used by PIPER [1], would have enabled a more direct comparison against PIPER, it is not publicly available. We thus choose the DeepID2+ face recogniser [3]. Face recognition technology is still improving quickly, and larger and larger face datasets are being released – the analysis in this section is thus an underestimate of current and future face recognisers.
The DeepID2+ network is a siamese neural network that takes 25 different crops of the head as input, trained with a joint verification-identification loss. The training is based on large databases consisting of CelebFaces+ [77], WDRef [78], and LFW [4] – totalling several hundred thousand faces of several thousand persons. At test time, it ensembles the predictions from the 25 crop regions obtained by facial landmark detection. The resulting output is a head feature that we denote as h_deepid.
Since the DeepID2+ pipeline begins with facial landmark detection, the DeepID2+ features are not available for instances with e.g. occluded or backward oriented heads. As a result, only 52 709 out of 63 188 instances (83.4%) have DeepID2+ features available, and we use vectors of zeros as features for the rest.

Results - Original split
See table 6 for the val set results for h_deepid and related combinations. h_deepid by itself is weaker than the vanilla head feature h, due to the missing features for the back views. However, when combined with h, the performance improves considerably by exploiting both the strong DeepID2+ face features and the viewpoint robust h features.
Since the feature dimensions are not homogeneous, we try L2 normalisation of h and h_deepid before concatenation (h ⊕ h_deepid). This gives a further boost – better than h + h_cacd + h_casia, the previous best model on the head region.
Table 6 also shows results for the Album, Time, and Day splits on the val set. While the general head cue h degrades significantly on the Day split, h_deepid is a reliable cue with roughly the same level of recognition in all four splits. This is not surprising, since the face is largely invariant over time, compared to hair, clothing, and events.
On the other splits as well, the complementarity of h and h_deepid is guaranteed only when they are L2 normalised before concatenation. The L2 normalised concatenation h ⊕ h_deepid envelops the performance of the individual cues on all splits.
Method             Original   Album   Time    Day
h
h_deepid
h + h_deepid
h ⊕ h_deepid
naeil [2]          91.70      86.37   80.66   49.21
naeil + h_deepid
naeil2

TABLE 6: PIPA val set accuracy of methods involving h_deepid. The optimal combination weights are λ* = 0.60, 1.05, and 1.00 for the Original, Album, and Time splits, respectively, and a value above 1 for the Day split. ⊕ means L2 normalisation before concatenation.
DeepID2+, with face-specific architecture/loss and massiveamount of training data, contributes highly useful information forthe person recognition task. However, being only able to recogniseface-visible instances, it needs to be combined with orientation-robust h to ensure the best performance. Unsurprisingly, having aspecialised face recogniser helps more in the setup with largerappearance gap between training and testing samples (Album,Time, and Day splits). Better face recognisers will further improvethe results in the future. naeil with h deepid ( naeil2 ) We build the final model of the journal version, namely the naeil2 by combining naeil and h deepid . As seen in §4.11,naive concatenation is likely to fail due to even larger difference indimensionality ( ×
17 = 69 632 versus ). We consider L normalisation of naeil and h deepid , and then performing aweighted concatenation. naeil ⊕ λ h deepid = naeil || naeil || + λ · h deepid || h deepid || , (1)where, λ > is a parameter and + denotes a concatenation. Optimisation of λ on val set λ determines how much relative weight to be given to h deepid . Aswe have seen in §4.11, the amount of additional contribution from h deepid is different for each split. In this section, we find λ (cid:63) , theoptimal values for λ , for each split over the val set. The resultingcombination of naeil and h deepid is our final method, naeil2 . λ (cid:63) is searched on the equi-distanced points { , . , . , · · · , } .See figure 8 for the val set performance of naeil ⊕ λ h deepid with varying values of λ . The optimal weights are found at λ (cid:63) = [0 .
60 1 .
05 1 .
00 1 . for Original, Album, Time, and Daysplits, respectively. The relative importance of h deepid is greateron splits with larger appearance changes. For each split, we denote naeil2 as the combination naeil and h deepid based on theoptimal weights.Note that the performance curve is rather stable for λ ≥ . in all splits. In practice, when the expected amount of appearancechanges of subjects are unknown, our advice would be to choose λ ≈ . . Finally, we remark that the weighted sum can also bedone for the cues in naeil ; finding the optimal cue weightsis left as a future work. Fig. 8: PIPA val set accuracy of naeil ⊕ λ h deepid for varyingvalues of λ . Round dots denote the maximal val accuracy. Results
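A minimal sketch of the weighted, L2-normalised concatenation of Eq. (1) for one instance; the guard against a zero norm handles the zero vectors used when DeepID2+ features are missing:

```python
import numpy as np

def combine_naeil2(naeil_feat, deepid_feat, lam):
    """Weighted concatenation of L2-normalised feature blocks (Eq. 1).

    naeil_feat and deepid_feat are 1-D feature vectors for one instance;
    lam is the split-dependent weight lambda, e.g. found by a grid search
    on the val set.
    """
    def l2(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v          # missing DeepID2+ -> zero vector
    return np.concatenate([l2(naeil_feat), lam * l2(deepid_feat)])
```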
Results

See table 6 for the results of combining naeil and h_deepid. Naively concatenated, naeil + h_deepid performs worse than h_deepid on the Day split. However, the weighted combination naeil2 achieves the best performance on all four splits.

Conclusion
When combining naeil and h_deepid, a weighted combination is desirable, and the resulting final model naeil2 beats all the previously considered models on all four splits.

5 TEST SET RESULTS AND COMPARISON
In this section, we measure the performance of our final model and key intermediate results on the PIPA test set, and compare against prior art. See table 7 for a summary.
We consider two baselines for measuring the inherent difficulty of the task. The first baseline is the "chance level" classifier, which does not see the image content and simply picks the most commonly occurring class. It provides the lower bound for any recognition method, and gives a sense of how large the gallery set is.
Our second baseline is the raw RGB nearest neighbour classifier h_rgb. It uses the raw downsized and blurred RGB head crop as the feature. The identity of the Euclidean distance nearest neighbour training image is predicted at test time. By design, h_rgb is only able to recognise near-identical head crops across the test_0/1 splits.
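A minimal sketch of this h_rgb baseline; the crop size and blur radius are assumed values for illustration (the exact numbers from the paper are not reproduced here), and PIL is used only for convenience:

```python
import numpy as np
from PIL import Image, ImageFilter

def h_rgb_feature(head_crop, size=32, blur_radius=1.0):
    """Downsized, blurred raw-RGB head crop as a baseline feature vector.

    head_crop is an HxWx3 uint8 array; size and blur_radius are assumed.
    """
    img = Image.fromarray(head_crop).resize((size, size))
    img = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    return np.asarray(img, dtype=np.float32).reshape(-1)

def nearest_neighbour_identity(train_feats, train_ids, test_feat):
    """Predict the identity of the Euclidean-nearest training crop."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    return train_ids[int(np.argmin(dists))]
```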
Results

See the results for "chance level" and h_rgb in table 7. While the "chance level" performance is low in all splits, we observe that h_rgb performs unreasonably well on the Original split (33.77%). This shows that the Original split shares many nearly identical person instances across the split, and the task is very easy. On the harder splits, we see that the h_rgb performance diminishes, reaching only 6.78% on the Day split. Recognition on the Day split is thus far less trivial – simply taking advantage of pixel value similarity does not work.
                        Special modules                 General features
Method                  Face rec.      Pose est.        Data     Arch.     Original  Album  Time   Day
Chance level            ✗              ✗                −        −
Head
  h_rgb                 ✗              ✗                −        −         33.77                   6.78
  h                     ✗              ✗                I+P      Alex      76.42     67.48  57.05  36.48
  h + h_casia + h_cacd  ✗              ✗                I+P+CC   Alex      80.32     72.82  63.18  45.45
  h_deepid              DeepID2+ [3]   ✗                −        −
  h ⊕ h_deepid          DeepID2+ [3]   ✗                I+P      Alex      85.94     81.95  75.85  66.00
  DeepFace [1]          DeepFace [5]   ✗                −        −         46.66     −      −      −
Body
  b                     ✗              ✗                I+P      Alex      69.63     59.29  44.92  20.38
  h + b                 ✗              ✗                I+P      Alex      83.36     73.97  63.03  38.15
  P = f + h + u + b     ✗              ✗                I+P      Alex      85.33     76.49  66.55  42.17
  GlobalModel [1]       ✗              ✗                I+P      Alex      67.60     −      −      −
  PIPER [1]             DeepFace [5]   Poselets [59]    I+P      Alex      83.05     −      −      −
  Pose [62]             ✗              Pose group       I+P+V    Alex      89.05     82.37  74.84  56.73
  COCO [64]             ✗              Part det. [79]   I+P      Goog,Res
Image
  P_s = P + s           ✗              ✗                I+P      Alex      85.71     76.68  66.55  42.31
  naeil = P_s + E [2]   ✗              ✗                I+P+E    Alex      86.78     78.72  69.29  46.54
  Contextual [66]       DeepID [77]    ✗                I+P      Alex      88.75     83.33  77.00  59.35
  naeil2 (this paper)   DeepID2+ [3]   ✗                I+P+E    Alex      90.42

TABLE 7: PIPA test set accuracy (%) of the proposed method and prior art on the four splits. For each method, we indicate any face recognition or pose estimation module included, and the data and convnet architecture for the other features. Cues on extended data: E = h_casia + h_cacd + h_pipa11 + u_peta5. ⊕ means concatenation after L2 normalisation. In the data column, I indicates ImageNet [61] and P the PIPA train set; CC means CACD [73] + CASIA [72], E means CC + PETA [74], and V indicates the VGGFace dataset [8]. In the architecture column, (Alex, Goog, Res) refers to (AlexNet [60], GoogleNetv3 [80], ResNet50 [81]).
Conclusion
Although the gallery set is large enough, the task can be made arbitrarily easy by sharing many similar instances across the splits (Original split). We have remedied the issue by introducing three more challenging splits (Album, Time, and Day) on which the naive RGB baseline (h_rgb) no longer works (§3.1).

We consider our four intermediate models (h, h + h_casia + h_cacd, h_deepid, h ⊕ h_deepid) and a prior work, DeepFace [1], [5]. We observe the same trend as described in the previous sections on the val set (§4.7, §4.11). Here, we focus on the comparison against
DeepFace [5]. Even without a specialised face module, h already performs better than DeepFace (76.42% versus 46.66%, Original split). We believe this is for two reasons: (1) DeepFace only takes face regions as input, leaving out valuable hair and background information (§4.7); (2) DeepFace only makes predictions on 52% of the instances, where the face can be registered. Note that h_deepid also does not always make a prediction, due to failures in estimating the pose (17% failure on PIPA), but it performs better than DeepFace in the considered scenario (68.06% versus 46.66%, Original split).
We consider three of our intermediate models (b, h + b, P = f + h + u + b) and four prior works (GlobalModel [1], PIPER [1], Pose [62], COCO [64]). The Pose [62] and COCO [64] methods appeared after the publication of the conference version of this paper [2]. See table 7 for the results.

Our body cue b and Zhang et al.'s GlobalModel [1] are the same method implemented independently. Unsurprisingly, they perform similarly (69.63% versus 67.60%, Original split).

Our h + b method is the minimal system matching Zhang et al.'s PIPER [1] (83.36% versus 83.05%, Original split). The feature vector of h + b is many times smaller than that of PIPER, and it does not make use of a face recogniser or pose estimator. In fact, PIPER captures the head region via one of its poselets. Thus, h + b extracts cues from a subset of PIPER's "GlobalModel+Poselets" [1], but performs better (83.36% on the Original split).
GlobalModel+Poselets ” [1], but performs better (83.36%versus . , Original split). Methods since the conference version [2]
Pose by Kumar et al. [62] uses extra keypoint annotations on the PIPA train set to generate pose clusters, and trains separate models for each pose cluster (PSM, pose-specific models). By performing a form of pose normalisation, they improved the results significantly: 2.27 pp and 10.19 pp over naeil on the Original and Day splits, respectively.
COCO by Liu et al. [64] proposes a novel metric learning loss for the person recognition task. Metric learning gives an edge over classifier-based methods by enabling recognition of unseen identities without re-training. They further use Faster R-CNN detectors [79] to localise the face and body more accurately. The final performance is arguably good in all four splits, compared to Pose [62] or naeil [2]. However, one should note that the face, body, upper body, and full body features in COCO are based on GoogleNetv3 [80] and ResNet50 [81] – the numbers are not fully comparable to all the other methods, which are largely based on AlexNet.
We consider our two intermediate models (P_s = P + s, naeil = P_s + E) and Contextual [66], a method which appeared after the conference version of this paper [2]. Our naeil performs better than PIPER [1] (86.78% versus 83.05%, Original split), while having a 6 times smaller feature vector and not relying on a face recogniser or pose estimator.
Fig. 9: PIPA test set relative accuracy of various methods in the four splits, against the final system naeil2.

Methods since the conference version [2]
Contextual by Li et al. [66] makes use of person co-occurrence statistics to improve the results. It performs 1.97 pp and 12.81 pp better than naeil on the Original and Day splits, respectively. However, one should note that Contextual employs a face recogniser, DeepID [77]. We have found that a specialised face recogniser improves the recognition quality greatly on the Day split (§4.11).

naeil2

naeil2 is a weighted combination of naeil and h_deepid (see §4.12 for details). Observe that by attaching a face recogniser module to naeil, we achieve the best performance on the Album, Time, and Day splits. In particular, on the Day split, naeil2 gives an 8.85 pp boost over the second best method, COCO [64] (table 7). On the Original split, COCO performs better (2.36 pp gap), but note that COCO uses more advanced feature representations (GoogleNet and ResNet). Since naeil2 and COCO focus on orthogonal techniques, they could be combined to yield even better performance.
We report computational times for some pipelines in our method. The feature training takes 2-3 days on a single GPU machine. The SVM training takes 42 seconds for h (4096 dimensions) and longer for the full naeil feature on the Original split (581 classes). Note that this corresponds to a realistic user scenario in a photo sharing service, where a few hundred identities are known to the user and each identity has on the order of ten photos.

6 ANALYSIS
In this section, we provide a deeper analysis of the individual cues' contributions towards the final performance. In particular, we measure how the contributions from individual cues (e.g. face and scene) change when the system has to generalise across time or across head viewpoint. We study the performance as a function of the number of training samples per identity, and examine the distribution of identities according to their recognisability.
We measure the contribution of individual cues towards the final system naeil2 (§4.12) by dividing the accuracy of each intermediate method by the performance of naeil2. We report results on the four splits in order to determine which cues contribute more when there is a larger time gap between training and testing samples, and vice versa.
Results
See figure 9 for the relative performances in the four splits. The cues based more on context (e.g. b and s) see a greater drop from the Original to the Day split, whereas cues focused on the face f and head h regions tend to drop less. Intuitively, this is due to the greater changes in clothing and events in the Day split.
On the other hand, h_deepid increases in its relative contribution from the Original to the Day split, explaining nearly 90% of naeil2 in the Day split. h_deepid provides a valuable invariant face feature, especially when the time gap is large. However, on the Original split h_deepid only reaches about 75% of naeil2. The head orientation robust naeil should be added to attain the best performance.

Conclusion
Cues involving context are stronger in the Original split; cues around the face, especially h_deepid, are robust in the Day split. Combining both types of cues yields the best performance over all considered time/appearance changes.

We study the impact of the test instance viewpoint on the proposed systems. Cues relying on the face are less likely to be robust to occluded faces, while body or context cues will be more robust against viewpoint changes. We measure the performance of the models on the head orientation partitions defined by the DPM face detector (see §3.2): frontal FR, non-frontal NFR, and no face detected NFD. The NFD subset is a proxy for back-view and occluded-face instances.
Results
Figure 10 shows the accuracy of the methods on the three head orientation subsets for the Original and Day splits. All the considered methods show decreasing performance from frontal FR to non-frontal NFR and no face detected NFD subsets. However, in the Original split, naeil2 still robustly predicts the identities even for the NFD subset. On the Day split, naeil2 does struggle on the NFD subset. Recognition of NFD instances under the Day split constitutes the main remaining challenge of person recognition.
In order to measure the contributions from individual cues in different head orientation subsets, we report the relative performance against the final model naeil2 in figure 11. The results are reported on the Original and Day splits. Generally, cues based on more context (e.g. b and s) are more robust when the face is not visible than the face specific cues (e.g. f and h). Note that the h_deepid performance drops significantly on the NFD subset, while naeil generally improves its relative performance on harder viewpoints. naeil2 envelops the performance of the individual cues in all orientation subsets.
Fig. 10: PIPA test set accuracy of methods on the frontal (FR), non-frontal (NFR), and no face detected (NFD) subsets. Left: Original split, right: Day split.

Fig. 11: PIPA test set accuracy on the frontal (FR), non-frontal (NFR), and no face detected (NFD) head orientations, relative to the final model naeil2. Left: Original split, right: Day split.
Conclusion

naeil is more viewpoint robust than h_deepid, in contrast to the time-robustness analysis (§6.1). The combined model naeil2 takes the best of both worlds. The remaining challenge for person recognition lies in the no face detected NFD instances under the Day split. Perhaps image or social media metadata could be utilised (e.g. camera statistics, time and GPS location, the social media friendship graph).
Here, we investigate the viewpoint generalisability of our models. For example, we challenge the system to identify a person from the back, having only seen frontal face samples during training.
Results
Figure 12 shows the accuracies of the methods when they are trained either only on the frontal subset FR (left plot) or only on the no face detected subset NFD (right plot). When trained on FR, naeil2 has difficulties generalising to the NFD subset, both in the Original and in the Day split. However, the absolute performance is still far above random chance (see §5.1), indicating that the learned identity representations are to a certain degree generalisable. The naeil features are more robust in this case than h_deepid, with a less dramatic drop from FR to NFD. When no face is given during training (training on the NFD subset), identities are much harder to learn in general. The recognition performance is low even in the no-generalisation case, i.e. when training and testing on NFD, for both the Original and Day splits.

Conclusion

naeil2 does generalise marginally across viewpoints, largely owing to the naeil features. It seems quite hard to learn identity specific features (either generalisable or not) from back views or occluded faces (NFD).

We examine the effect of the ratio of head orientations in the feature training set on the quality of the head feature h. We fix the number of training examples, which consist only of frontal FR and non-frontal NFR faces, while varying their ratio. One would hypothesise that the maximal viewpoint robustness of the feature is achieved with a balanced mixture of FR and NFR for each person, and also that h trained on the FR (NFR) subset is relatively strong at predicting the FR (NFR) subset, respectively.
Results
Figure 13 shows the performance of h trained with various FR to NFR ratios, evaluated on the FR, NFR, and NFD subsets. Contrary to the hypothesis, changing the distribution of head orientations in the feature training has only a marginal effect on the performance across all viewpoint subsets, in both the Original and Day splits.

Conclusion
No extra care is needed to control the distribution of head orientations in the feature training set to improve the head feature h. Features on larger image regions (e.g. u and b) are expected to be even less affected by the viewpoint distribution.

This section provides an analysis of the impact of input resolution. We aim to identify methods that are robust over different ranges of resolution.
Results
Figure 14 shows the performance with respect to the input resolution (head height in pixels). The final model naeil2 is robust against low input resolutions on the Original split, retaining good accuracy even for instances with very small head heights. On the Day split, naeil2 is less robust on low resolution examples. Component-wise, note that the naeil performance is nearly invariant to the resolution level. naeil tends to be more robust for low resolution input than h_deepid, as it is based on body and context features and does not need high resolution faces.

Conclusion
For low-resolution input naeil should be exploited, while for high-resolution input h_deepid should be exploited. If unsure, naeil2 is a good choice – it envelops the performance of both at all resolution levels.
Fig. 12: PIPA test set performance when the identity classifier (SVM) is trained only on either the frontal (FR, left) or the no-face-detected (NFD, right) subset. Related scenario: a robot has only seen frontal views of people; who is this person shown from the back view?
Fig. 13: Train the feature h with different mixtures of frontal (FR) and non-frontal (NFR) heads. The viewpoint-wise performance is shown for the Original (left) and Day (right) splits.
Fig. 14: PIPA test set accuracy of systems at different levels of input resolution. Resolution is measured in terms of the head height (pixels).
We are interested in two questions: (1) if we had more samples per identity, would person recognition be solved with the current method? (2) how many examples per identity are enough to gather a substantial amount of information about a person? To investigate these questions, we measure the performance of the methods at different numbers of training samples per identity. We perform 10 independent runs per data point with a fixed number of training examples per identity (the subset is uniformly sampled at each run).
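The following is a minimal sketch of this protocol under stated assumptions: per-identity uniform subsampling repeated over several runs, with a linear SVM standing in for the identity classifier; the feature and label arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def accuracy_vs_samples_per_id(X_tr, y_tr, X_te, y_te, n_per_id, runs=10, seed=0):
    # Mean and std of test accuracy when only `n_per_id` training samples per
    # identity are kept; the subset is re-sampled uniformly in each run.
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y_tr == ident),
                       size=min(n_per_id, int(np.sum(y_tr == ident))),
                       replace=False)
            for ident in np.unique(y_tr)])
        clf = LinearSVC(C=1.0).fit(X_tr[keep], y_tr[keep])
        accs.append(clf.score(X_te, y_te))
    return float(np.mean(accs)), float(np.std(accs))
```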
Fig. 15: Recognition accuracy at different numbers of training samples per identity. Error bars indicate ± standard deviation from the mean.
Results
Figure 15 shows the recognition performance of the methods with respect to the training sample size. naeil2 saturates after ∼ training examples per person on the Original and Day splits, reaching ∼ and ∼, respectively, at examples per identity. At the lower end, we observe that 1 example per identity is already enough to recognise a person far above the chance level (∼ and ∼ on Original and Day, respectively).
Conclusion
Adding a few times more examples per person will not push the performance to 100%. Methodological advances are required to fully solve the problem. On the other hand, the methods already collect a substantial amount of identity information from a single sample per person (far above chance level).
Finally, we study what proportion of the identities is easy to recognise and how many are hopeless. We do this by computing the distribution of identities according to their per-identity recognition accuracies.
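A minimal sketch of this computation is given below: accuracy is measured separately for each ground-truth identity and the values are sorted in descending order, as in Figure 16. The label arrays are hypothetical placeholders.

```python
import numpy as np

def per_identity_accuracy(y_true, y_pred):
    # Accuracy per gallery identity, sorted in descending order.
    accs = np.array([np.mean(y_pred[y_true == ident] == ident)
                     for ident in np.unique(y_true)])
    return np.sort(accs)[::-1]
```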
Results
Figure 16 shows the per-identity accuracy for each identity, in descending order, for each considered method.
Fig. 16: Per-identity accuracy on the Original and Day splits. The identities are sorted according to per-identity accuracy for each method separately.
On the Original split, naeil2 gives accuracy for out of the test identities, whereas there is only one identity on which the method totally fails. On the other hand, on the Day split there are out of the test identities for whom naeil2 achieves accuracy, and identities with zero accuracy. In particular, naeil2 greatly improves the per-identity accuracy distribution over naeil, which gives zero correct predictions for identities.
Conclusion
In the Original split, naeil2 already does well on many of the identities. In the Day split, the h_deepid feature has greatly improved the per-identity performance, but naeil2 still misses some identities. Focusing on these hard identities is left as future work.
OPEN-WORLD RECOGNITION
So far, we have focused on the scenario where the test instances always come from a closed world of gallery identities. However, when person detectors are used to localise instances, as opposed to head box annotations, the detected person may not belong to the gallery set. One may wonder how our person recognisers would perform when the test instance could be an unseen identity.
In this section, we study the task of "open-world person recognition". A test identity may be either from the gallery set (training identities) or from a background set (unseen identities). We consider the scenario where test instances are given by a face detector [67], while the training instance locations have been annotated by humans.
The key challenge for our recognition system is to tell gallery identities apart from background faces, while simultaneously classifying the gallery identities. Since they are obtained from a detector, the background faces may contain any person in the crowd, or even non-faces. We introduce a simple modification of our recognition systems' test-time algorithm to let them additionally make the gallery-versus-background prediction. We then discuss the relevant metrics for open-world performance.
At test time, body part crops are inferred from the detected face region (f). First, h is regressed from f, using PIPA train set statistics on the scaling and displacement transformation from f to h. All the other regions (u, b, s) are computed from h in the same way as in §3.2 of the main paper.
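A minimal sketch of such a face-to-head regression is shown below, assuming the train-set statistics are summarised as mean centre offsets (in face-size units) and mean head/face size ratios; the field names and the exact parametrisation are assumptions for illustration, not the paper's stored statistics.

```python
import numpy as np

def head_from_face(face_box, stats):
    # `face_box` is (x, y, w, h); `stats` holds assumed mean offsets of the
    # head centre relative to the face centre ("dx", "dy", in face-size
    # units) and mean head/face size ratios ("sw", "sh").
    x, y, w, h = face_box
    cx, cy = x + 0.5 * w, y + 0.5 * h
    # Displace the centre and rescale width/height by the learned ratios.
    hcx, hcy = cx + stats["dx"] * w, cy + stats["dy"] * h
    hw, hh = stats["sw"] * w, stats["sh"] * h
    return (hcx - 0.5 * hw, hcy - 0.5 * hh, hw, hh)
```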
Fig. 17: Diagram of the various subsets generated by a person recognition system in an open-world setting (cf. Figure 2 of the main paper). TP_s: sound true positive, TP_u: unsound true positive, FP: false positive, FN: false negative. See text for the definitions.
To measure whether the inferred head region h is sound and compatible with the models trained on h (as well as u and b), we train the head model h on head annotations and test on the heads inferred from face detections. The recognition performance is only slightly lower than when training and testing on the head annotations; the drop is small and insignificant – the inferred regions are largely compatible.
The gallery-background identity detection is done by thresholding the final SVM score. Given a recognition system and a test instance x, let S_k(x) be the SVM score for identity k. We then apply a threshold τ and predict background if max_k S_k(x) < τ, and the arg-max gallery identity otherwise.
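The decision rule just described amounts to the following minimal sketch; the score matrix is a hypothetical input and -1 is an arbitrary code for the background decision.

```python
import numpy as np

def open_world_predict(scores, tau):
    # `scores`: (num_instances, num_gallery) array of SVM scores S_k(x).
    # Predict background (-1) if the best score falls below tau, otherwise
    # the arg-max gallery identity.
    best = scores.max(axis=1)
    pred = scores.argmax(axis=1)
    pred[best < tau] = -1
    return pred
```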
The evaluation metric should measure two aspects simultaneously: (1) the ability to tell apart background identities, and (2) the ability to classify gallery identities. We first introduce a few terms to help define the metrics; see Figure 17 for a visualisation. We say a detected test instance x is a "foreground prediction" if max_k S_k(x) ≥ τ. A foreground prediction is either a true positive (TP) or a false positive (FP), depending on whether x is a gallery identity or not. If x is a TP, it is either a sound true positive (TP_s) or an unsound true positive (TP_u), depending on the classification result arg max_k S_k(x). A false negative (FN) is incurred if a gallery identity is predicted as background.
We first measure the system's ability to screen out background identities while at the same time classifying the gallery identities. The recognition recall (RR) at threshold τ is defined as
RR(τ) = |TP_s| / |face det. ∩ head anno.| = |TP_s| / |TP ∪ FN| .   (2)
To factor out the performance of face detection, we constrain the evaluation to the intersection between face detections and head annotations (the denominator TP ∪ FN). Note that the metric is a decreasing function of τ, and that when τ → −∞ the system operates under the closed-world assumption.
The system enjoys a high RR when τ is decreased, but it then predicts many background instances as foreground (FP). To quantify this trade-off we introduce a second metric, false positives per image (FPPI). Given a threshold τ, FPPI is defined as
FPPI(τ) = |FP| / |images| ,   (3)
measuring how many wrong foreground predictions the system makes per image. It is also a decreasing function of τ; as τ → ∞, FPPI attains zero.
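A minimal sketch of how an RR-versus-FPPI curve like Figure 18 could be traced from these two definitions is given below; all inputs (scores, labels, gallery mask, image count, threshold grid) are hypothetical placeholders restricted to detections that also have a head annotation.

```python
import numpy as np

def rr_fppi_curve(scores, y_true, is_gallery, num_images, taus):
    # `scores`: (num_instances, num_gallery) SVM scores; `is_gallery` marks
    # which instances belong to a gallery identity (others are background).
    best, pred = scores.max(axis=1), scores.argmax(axis=1)
    curve = []
    for tau in taus:
        fg = best >= tau                               # foreground predictions
        tp_sound = fg & is_gallery & (pred == y_true)  # TP_s: correct gallery id
        fp = fg & ~is_gallery                          # FP: background kept
        rr = tp_sound.sum() / max(int(is_gallery.sum()), 1)  # Eq. (2)
        fppi = fp.sum() / num_images                          # Eq. (3)
        curve.append((float(fppi), float(rr)))
    return curve
```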
Figure 18 shows the recognition recall (RR) versus false positives per image (FPPI) curves parametrised by τ. As τ → −∞, RR(τ) approaches the closed-world performance on the face-detected subset (FR ∪ NFR): (Original) and (Day) for naeil. In the open-world case, for example at one FPPI, the recognition recall for naeil is (Original) and (Day). Compared to the closed world, we see quite some drop in the open world, but one should note that the set of background face detections is more than × greater than the set of foreground faces.
Note that DeepID2+ [3] is not a public method, so we cannot compute the h_deepid features ourselves; we have therefore not included h_deepid or naeil2 results in this section. Although the performance is not ideal, a simple SVM score thresholding scheme makes our systems work in the open-world recognition scenario.
CONCLUSION
We have analysed the problem of person recognition in social media photos, where people may appear with occluded faces, in diverse poses, and at various social events. We have investigated the efficacy of various cues, including the face recogniser DeepID2+ [3], and their generalisability across time and head viewpoint. For better analysis, we have contributed additional splits on PIPA [1] that simulate different amounts of time gap between training and testing samples.
We draw four major conclusions. (1) Cues based on the face and head are robust across time (§6.1). (2) Cues based on context are robust across head viewpoints (§6.2). (3) The final model naeil2, a combination of face and context cues, is robust across both time and viewpoint and achieves a ∼ pp improvement over a recent state of the art on the challenging Day split (§5.5). (4) Better convnet architectures and face recognisers will further improve the performance of the naeil and naeil2 frameworks (§5.5).
The remaining challenges are mainly the large time gap and occluded face scenarios (§6.2). One possible direction is to exploit non-visual cues such as GPS and time metadata, camera parameters, or social media album/friendship graphs. Code and data are publicly available at https://goo.gl/DKuhlY.
ACKNOWLEDGMENTS
This research was supported by the German Research Foundation (DFG CRC 1223).
REFERENCES
[1] N. Zhang, M. Paluri, Y. Taigman, R. Fergus, and L. Bourdev, "Beyond frontal faces: Improving person recognition using multiple cues," in CVPR, 2015.
[2] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele, "Person recognition in personal photo collections," in ICCV, 2015.
[3] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," arXiv, 2014.
[4] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," UMass, Tech. Rep., 2007.
[5] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in CVPR, 2014.
[6] E. Zhou, Z. Cao, and Q. Yin, "Naive-deep face recognition: Touching the limit of lfw benchmark or not?" arXiv, 2015.
[7] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," arXiv, 2015.
[8] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in BMVC, 2015.
[9] J.-C. Chen, V. M. Patel, and R. Chellappa, "Unconstrained face verification using deep cnn features," in WACV. IEEE, 2016, pp. 1–9.
[10] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499–515.
[11] R. Ranjan, C. D. Castillo, and R. Chellappa, "L2-constrained softmax loss for discriminative face verification," arXiv preprint arXiv:1703.09507, 2017.
[12] F. Wang, W. Liu, H. Liu, and J. Cheng, "Additive margin softmax for face verification," arXiv preprint arXiv:1801.05599, 2018.
[13] J. Deng, J. Guo, and S. Zafeiriou, "Arcface: Additive angular margin loss for deep face recognition," arXiv:1801.07698, 2018.
[14] M. Guillaumin, J. Verbeek, and C. Schmid, "Is that you? metric learning approaches for face identification," in ICCV, 2009.
[15] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification," in CVPR, 2013.
[16] X. Cao, D. Wipf, F. Wen, and G. Duan, "A practical transfer learning algorithm for face verification," in ICCV, 2013.
[17] C. Lu and X. Tang, "Surpassing human-level face verification performance on lfw with gaussianface," arXiv, 2014.
[18] B. F. Klare, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a," algorithms, vol. 13, p. 4, 2015.
[19] S. Sankaranarayanan, A. Alavi, C. D. Castillo, and R. Chellappa, "Triplet probabilistic embedding for face verification and clustering," in Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 8th International Conference on. IEEE, 2016, pp. 1–8.
[20] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou, "Agedb: The first manually collected in-the-wild age database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, vol. 2, no. 3, 2017, p. 5.
[21] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, "The megaface benchmark: 1 million faces for recognition at scale," in CVPR, 2016, pp. 4873–4882.
[22] A. Nech and I. Kemelmacher-Shlizerman, "Level playing field for million scale face recognition," in CVPR, 2017.
[23] D. Gray, S. Brennan, and H. Tao, "Evaluating appearance models for recognition, reacquisition, and tracking," in Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3, no. 5. Citeseer, 2007.
[24] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, "Custom pictorial structures for re-identification," in BMVC, vol. 1, no. 2. Citeseer, 2011, p. 6.
[25] W. Li, R. Zhao, and X. Wang, "Human reidentification with transferred metric learning," in ACCV (1), 2012, pp. 31–44.
[26] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[27] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, "Mars: A video benchmark for large-scale person re-identification," in European Conference on Computer Vision. Springer, 2016, pp. 868–884.
[28] W. Li, R. Zhao, T. Xiao, and X. Wang, "Deepreid: Deep filter pairing neural network for person re-identification," in CVPR, 2014.
[29] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision. Springer, 2016, pp. 17–35.
Fig. 18: Recognition recall (RR) versus false positives per image (FPPI) of our recognition systems in the open-world setting. Curves are parametrised by τ – see text for details.
[30] W. Li and X. Wang, "Locally aligned feature transforms across views," in CVPR, 2013.
[31] R. Zhao, W. Ouyang, and X. Wang, "Person re-identification by salience matching," in ICCV, 2013.
[32] S. Bak, R. Kumar, and F. Bremond, "Brownian descriptor: A rich meta-feature for appearance matching," in WACV, 2014.
[33] D. Yi, Z. Lei, and S. Z. Li, "Deep metric learning for practical person re-identification," arXiv, 2014.
[34] Y. Hu, D. Yi, S. Liao, Z. Lei, and S. Li, "Cross dataset person re-identification," in ACCV, workshop, 2014.
[35] E. Ahmed, M. Jones, and T. K. Marks, "An improved deep learning architecture for person re-identification," Differences, vol. 5, p. 25, 2015.
[36] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based cnn with improved triplet loss function," in CVPR, June 2016.
[37] T. Xiao, H. Li, W. Ouyang, and X. Wang, "Learning deep feature representations with domain guided dropout for person re-identification," in CVPR, June 2016.
[38] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, A Siamese Long Short-Term Memory Architecture for Human Re-identification. Cham: Springer International Publishing, 2016, pp. 135–153. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46478-7_9
[39] W. Chen, X. Chen, J. Zhang, and K. Huang, "Beyond triplet loss: a deep quadruplet network for person re-identification," in CVPR, 2017.
[40] Z. Zheng, L. Zheng, and Y. Yang, "Unlabeled samples generated by gan improve the person re-identification baseline in vitro," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3754–3762.
[41] I. B. Barbosa, M. Cristani, A. Del Bue, L. Bazzani, and V. Murino, "Re-identification with rgb-d sensors," in European Conference on Computer Vision. Springer, 2012, pp. 433–442.
[42] M. Munaro, A. Fossati, A. Basso, E. Menegatti, and L. Van Gool, "One-shot person re-identification with a consumer depth camera," in Person Re-Identification. Springer, 2014, pp. 161–181.
[43] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis, "Looking beyond appearances: Synthetic training data for deep cnns in re-identification," Computer Vision and Image Understanding, 2017.
[44] A. Wu, W.-S. Zheng, and J.-H. Lai, "Robust depth-based person re-identification," IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2588–2603, 2017.
[45] D. Maltoni, D. Maio, A. K. Jain, and S. Prabhakar, Handbook of fingerprint recognition. Springer Science & Business Media, 2009.
[46] J. Daugman, "How iris recognition works," in The essential guide to image processing. Elsevier, 2009, pp. 715–739.
[47] P. Connor and A. Ross, "Biometric recognition by gait: A survey of modalities and features," Computer Vision and Image Understanding.
[48] …, in Proceedings of the eleventh ACM international conference on Multimedia. ACM, 2003, pp. 355–358.
[49] Y. Song and T. Leung, "Context-aided human recognition–clustering," in European Conference on Computer Vision. Springer, 2006, pp. 382–395.
[50] D. Anguelov, K.-c. Lee, S. B. Gokturk, and B. Sumengen, "Contextual identity recognition in personal photo albums," in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007, pp. 1–7.
[51] A. Gallagher and T. Chen, "Clothing cosegmentation for recognizing people," in CVPR, 2008.
[52] S. Gong, M. Cristani, S. Yan, and C. C. Loy, Person re-identification. Springer, 2014.
[53] A. Bedagkar-Gala and S. K. Shah, "A survey of approaches and trends in person re-identification," IVC, 2014.
[54] J. Cui, F. Wen, R. Xiao, Y. Tian, and X. Tang, "Easyalbum: an interactive photo annotation system based on face clustering and re-ranking," in SIGCHI, 2007.
[55] C. Otto, D. Wang, and A. K. Jain, "Clustering millions of faces by identity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 2, pp. 289–303, Feb 2018.
[56] C. S. Mathialagan, A. C. Gallagher, and D. Batra, "Vip: Finding important people in images," arXiv, 2015.
[57] M. Everingham, J. Sivic, and A. Zisserman, "Hello! my name is... buffy – automatic naming of characters in tv video," in BMVC, 2006.
[58] ——, "Taking the bite out of automated naming of characters in tv video," IVC, 2009.
[59] L. Bourdev and J. Malik, "Poselets: Body part detectors trained using 3d human pose annotations," in ICCV, 2009.
[60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[61] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[62] V. Kumar, A. Namboodiri, M. Paluri, and C. Jawahar, "Pose-aware person recognition," in CVPR, 2017.
[63] Y. Li, G. Lin, B. Zhuang, L. Liu, C. Shen, and A. v. d. Hengel, "Sequential person recognition in photo albums with a recurrent network," in CVPR, 2017.
[64] Y. Liu, H. Li, and X. Wang, "Rethinking feature discrimination and polymerization for large-scale recognition," in NIPS Workshop, 2017.
[65] S. J. Oh, R. Benenson, M. Fritz, and B. Schiele, "Faceless person recognition: privacy implications in social media," in ECCV, 2016.
[66] H. Li, J. Brandt, Z. Lin, X. Shen, and G. Hua, "A multi-level contextual model for person recognition in photo albums," in CVPR, June 2016.
[67] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, "Face detection without bells and whistles," in ECCV, 2014.
[68] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model," in Computer Vision – ECCV 2016, ser. Lecture Notes in Computer Science, B. Leibe, Ed., vol. 9910. Amsterdam, The Netherlands: Springer, 2016, pp. 34–50.
[69] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2d pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
[70] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," IJCV, 2001.
[71] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in NIPS, 2014.
[72] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv, 2014.
[73] B.-C. Chen, C.-S. Chen, and W. H. Hsu, "Cross-age reference coding for age-invariant face recognition and retrieval," in ECCV, 2014.
[74] Y. Deng, P. Luo, C. C. Loy, and X. Tang, "Pedestrian attribute recognition at far distance," in ACMMM, 2014.
[75] Z. Zhu, P. Luo, X. Wang, and X. Tang, "Deep learning identity-preserving face space," in ICCV, 2013.
[76] C. Ding and D. Tao, "A comprehensive survey on pose-invariant face recognition," arXiv, 2015.
[77] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in CVPR, 2014, pp. 1891–1898.
[78] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, Bayesian Face Revisited: A Joint Formulation. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 566–579. [Online]. Available: https://doi.org/10.1007/978-3-642-33712-3_41
[79] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv, 2015.
[80] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016, pp. 2818–2826.
[81] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, June 2016.
Seong Joon Oh received the master's degree in mathematics from the University of Cambridge in 2014 and his PhD in computer vision from the Max Planck Institute for Informatics in 2018. He currently works as a research scientist at LINE Plus Corporation, South Korea. His research interests are computer vision, machine learning, security, and privacy.
Rodrigo Benenson received his electronics engineering diploma from UTFSM, Valparaiso in 2004 and his PhD in robotics from INRIA/Mines Paristech in 2008. He was a postdoctoral associate with the KU Leuven VISICS team from 2008 to 2012 and with the Max Planck Institute for Informatics, Saarbrücken, until 2017. He currently works as a research scientist at Google, Zürich. His main interests in computer vision are weakly supervised learning and mobile robotics.
Mario Fritz is a faculty member at the CISPA Helmholtz Center i.G. He works at the intersection of machine learning and computer vision with privacy and security. He did his postdoc at the International Computer Science Institute and UC Berkeley on a Feodor Lynen Research Fellowship of the Alexander von Humboldt Foundation. He has received funding from Intel, Google, and a collaborative research center on "Methods and Tools for Understanding and Controlling Privacy". He has served as an area chair for major vision conferences (ICCV, ECCV, ACCV, BMVC) and is an associate editor for TPAMI.
Bernt Schiele received the masters degree in computer science from the University of Karlsruhe and INP Grenoble in 1994 and the PhD degree in computer vision from INP Grenoble in 1997. He was a postdoctoral associate and visiting assistant professor with MIT between 1997 and 2000. From 1999 until 2004, he was an assistant professor with ETH Zurich and, from 2004 to 2010, he was a full professor of computer science with TU Darmstadt. In 2010, he was appointed a scientific member of the Max Planck Society and director at the Max Planck Institute for Informatics. Since 2010, he has also been a professor at Saarland University. His main interests are computer vision, perceptual computing, statistical learning methods, wearable computers, and the integration of multimodal sensor data. He is particularly interested in developing methods which work under real-world conditions.
Fig. 19: Success and failure cases on the Original split. Single images: test examples. Arrows point to the training samples for the predicted identities. Green and red crosses indicate correct and wrong predictions.
Fig. 20: Failure cases of naeil2 and