An Egocentric Look at Video Photographer Identity
Yedid Hoshen and Shmuel Peleg
The Hebrew University of Jerusalem, Jerusalem, Israel
Abstract
Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras have already penetrated the mass market, reporting a substantial increase in sales every year. As head-worn cameras do not capture the photographer, it may seem that the anonymity of the photographer is preserved even when the video is publicly distributed.
We show that camera motion, as can be computed from the egocentric video, provides unique identity information. The photographer can be reliably recognized from a few seconds of video captured when walking. The proposed method achieves more than 90% recognition accuracy in cases where the random success rate is only 3%.
Applications can include theft prevention by locking the camera when not worn by its rightful owner. Searching video sharing services (e.g. YouTube) for egocentric videos shot by a specific photographer may also become possible. An important message in this paper is that photographers should be aware that sharing egocentric video will compromise their anonymity, even when their face is not visible.
1. Introduction
The popularity of head-worn egocentric cameras is increasing. GoPro reports an increase in sales of 66% every year, and cameras are widely used by extreme sports enthusiasts and by law enforcement and military personnel. Special features of egocentric video include:
• The camera is worn by the photographer, and is recording while the photographer performs normal activities.
• The camera moves with the photographer's head.
• The camera does not record images of the photographer. In spite of this, we show that photographers can often be identified.
Photographers feel secure that sharing their egocentric videos on social media does not compromise their identity (Fig. 1). Police forces routinely release footage of officer activity, and operations of special forces recorded by wearable cameras are widely published on YouTube. Some have even recorded and published their own crimes. A consequence of our work is that the photographer identity of such videos can sometimes be found from camera motion.

Figure 1. a) A GoPro video uploaded to YouTube, allegedly capturing a crime from the POV of the robber. Can the robber be recognized? b) A GoPro video uploaded by US soldiers in combat. Are their identities safe?

Body motion is an accurate and replicable feature for identifying people over time. It is often recorded by accelerometers [16] or by an overlooking camera. Egocentric video can effectively serve as a head-mounted visual gyroscope and can accurately capture body motion information. It follows that any egocentric video which includes walking contains body motion information that can accurately reveal the photographer.
Specifically, we use sparse optical flow vectors (50 flow vectors per frame) taken over a few steps (4 seconds). This results in a set of time series, one for each component of each optical flow vector. In Fig. 2 we show the temporal Fourier transform of one flow vector for three different sequences, showing visible differences between different photographers.

Figure 2. Comparison of the temporal frequency spectra for three videos. Two videos were recorded using camera D1 by users A and B; the third video was recorded by user A using camera D2. It is readily seen that the spectra of the two videos recorded by photographer A are very similar to each other despite being recorded by different cameras and at different times. This suggests that a photographer's physique is expressed in the motion observed in his video.

As a first approach for determining photographer identity, we computed LPC (Linear Predictive Coding) [3] coefficients for each of the optical flow time series. (The LPC coefficients of a time series are the k values that, when scalar-multiplied with the last k measurements of the time series, optimally predict the next measurement.) All LPC coefficients of all optical flow sequences were used as a descriptor. Photographer recognition using a non-linear SVM trained on the LPC descriptor gave 81% identification accuracy (vs. accuracy of 3% in random) and verification EER (Equal Error Rate) of 10%.
Our second approach learns the descriptor and classifiers using a Convolutional Neural Network (CNN) which includes layers corresponding to body motion descriptor extraction and to recognition. The CNN is trained on the optical-flow features described above. Using a CNN improves the results over the LPC coefficients, yielding a 90% identification rate (vs. accuracy of 3% in random) and verification EER (Equal Error Rate) of 8%.
The above experiments were performed on both a small (6 person) public dataset [6] (originally collected for egocentric activity analysis) and on a new, larger (32 person) dataset collected by us especially for Egocentric Video Photographer Recognition (EVPR).
The ability to recognize the photographer quickly and accurately can be important for camera theft prevention and for forensic analysis (e.g. who committed the crime). Other applications are web search by egocentric video photographer and organization of video collections. Wearing a mask does not reduce recognition rate, of course.
2. Previous Work
Determination of the painter of an artwork, for preventing forgery and fake artists, has attracted attention for centuries. Computer vision researchers have presented several approaches for automatic artist and style classification, mainly utilizing low-level and object cues [10, 1].
Recognizing the unseen photographer of a picture is an interesting related problem. In this setting the photographic style [24] and the location of the photograph [8, 13] can be used as cues for photographer recognition. Both methods are unable to distinguish between photographers using cameras on default settings (such as most wearable cameras) and at the same locations. Another approach is automatic recognition of the photographer's reflection (e.g. in the subject's eyes [18]), but this relies on having reflective surfaces in the pictures.
Photographer recognition from wearable camera video is a novel problem. Such video is jittery due to the motion of the photographer's head and body. Although typically a nuisance, we show that frame jitter can accurately determine photographer identity.
Human body motion has already been used for recognition. Gait recognition is typically done by a video camera observing a person's shape and dynamic walking style. These features are able to recognize a person accurately [17]. In our scenario, however, the photographer is not seen by the camera, which is worn on his head. Recognition from accelerometers carried on the user's body [16] has also been reported. Shiraga et al. [23] studied recognition of people wearing a backpack with stereo cameras. Rotation and period of motion were computed using 3D geometry, and users were accurately recognized. Unlike all prior art, we are interested in recognizing photographers of videos taken by standard wearable cameras (e.g. as exist on video sharing websites), nearly all of which are monocular, head or chest mounted.
Using optical flow for activity recognition from head-mounted cameras has been done by [11, 19, 22, 14] and others. Papers [20, 25] used head motion to retrieve head-mounted camera users observed in other videos recorded at the same time. We, on the other hand, use camera motion to recognize the users of wearable cameras across time.
Feature design for time series data has been extensively studied, particularly for speech recognition systems [21]. Speaker verification is a long-standing problem which is related to this work. Linear Predictive Coding (LPC) descriptors are very popular for speaker recognition [7]. We show that an LPC-based descriptor is highly effective also for user recognition from egocentric camera video.
In this paper we also take an end-to-end approach of learning features along with the classifier, instead of hand-designing the features. We perform this using convolutional neural networks (CNN). For an overview of deep networks see [2]. Learned features are sometimes better than hand-designed features [12].
3. Photographer Recognition from Optical Flow
Egocentric video suffers from bouncy and unsteady motion caused by the photographer's head and body motion. Although usually a nuisance, we show that this motion forms the basis for accurate photographer recognition methods. We present our basic features in Sec. 3.1. Two alternative descriptors and classifiers are described in Sec. 3.2 and Sec. 3.3.

Figure 3. a) 50 optical flow vectors are calculated for each frame (only 12 shown here), and represented as two columns (each of 50 values), for the x and y optical flow components. b) The feature vector consists of optical flow columns for 60 frames, stacked into two 50 × 60 arrays, for the x and y components of the flow.

Figure 4. Two examples of the flow feature vectors. Each feature vector consists of 50 optical flow vectors per frame, computed for each of 60 frames. Here only the central row, having 10 flow vectors, is shown. The left and right images show the horizontal and vertical components of the optical flow. Note the rich temporal structure along the time axis.
3.1. Optical Flow Features

In the following sections we assume that the video frames were pre-processed in the following way (see Fig. 3):
1. Frames are partitioned into a small number (m_x × m_y) of non-overlapping blocks.
2. m_x × m_y optical flow vectors are computed for each frame using the Lucas-Kanade algorithm [15]. We use 10 × 5 blocks, giving 50 flow vectors per frame.
3. A sequence of T seconds of such optical flow vectors is taken. We used T = 4 seconds, which is long enough to include a few steps. At 15 fps this results in 60 frames.
4. Each feature vector covers a period of 4 seconds, and we computed feature vectors every 2 seconds. There is an overlap of 2 seconds between two successive feature vectors.
We used optical flow features for photographer recognition, rather than pixel intensities, as the body motion is eventually expressed by the pixel motion. On the other hand, recognition should be invariant to the specific objects seen in the environment, objects that are represented by pixel intensities. CNNs may be able to learn optical flow from pixel intensities, but learning this would require much more data than we can collect.
If dense optical flow were used as a feature, the high feature dimensionality would have led to overfitting on small datasets. Using a smaller number of flow vectors gave reduced accuracy. In looking for the optimal feature size we found that a grid of 10 × 5 flow vectors gives a good trade-off.
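For concreteness, the pre-processing above could be implemented roughly as follows. This is a minimal sketch assuming OpenCV and grayscale input frames; the function names are illustrative and this is not the authors' code.

```python
import numpy as np
import cv2

def grid_points(width, height, mx=10, my=5):
    """Centers of an mx-by-my grid of non-overlapping blocks, shape (mx*my, 1, 2)."""
    xs = (np.arange(mx) + 0.5) * width / mx
    ys = (np.arange(my) + 0.5) * height / my
    pts = np.array([[x, y] for y in ys for x in xs], dtype=np.float32)
    return pts.reshape(-1, 1, 2)

def flow_feature_vector(frames, mx=10, my=5):
    """Stack sparse Lucas-Kanade flow between consecutive frames into one feature array.

    `frames` is a list of grayscale uint8 images; 61 frames at 15 fps (4 seconds)
    yield an array of shape (50, 60, 2): 50 grid blocks, 60 flow fields, x/y components.
    """
    h, w = frames[0].shape
    pts = grid_points(w, h, mx, my)
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, nxt, pts, None)
        flows.append((new_pts - pts).reshape(-1, 2))  # per-block (dx, dy)
    return np.stack(flows, axis=1)
```

Feature vectors over a longer video would then be extracted from overlapping 4-second windows with a 2-second step, as described in item 4 above.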
3.2. LPC Descriptor and SVM Classifier

LPC [3] is a popular time-series descriptor (e.g. for speaker verification). LPC assumes the data is generated by a physical system, here the photographer's head and body, and attempts to learn a linear regression model for its equations of motion, predicting for each optical flow series the flow value in the next frame given the flow values of the previous k frames. Given a feature vector, we calculate an LPC model for each component of each 4 s flow time series (100 models in total). Using too few coefficients yields less accurate predictions, while too many coefficients causes overfitting. We found k = 9 to work well for our case. The final LPC descriptor consisted of all coefficients of all time-series models (100 × 9 values).
An SVM classifier was trained on the LPC descriptors, both for identification (classify an LPC descriptor into one of M known photographers) and for verification (classify an LPC descriptor into target photographer or rest-of-the-world). The non-linear (RBF) classifier outperformed linear SVM in almost all cases. As mentioned before, photographer recognition using a non-linear SVM trained on the LPC descriptor gave an 81% identification rate (vs. random 3%), and the verification EER (Equal Error Rate) was 10%.
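As an illustration of the descriptor just described (a sketch assuming numpy; the least-squares fit below is one standard way to obtain LPC coefficients, not necessarily the exact estimator used by the authors):

```python
import numpy as np

def lpc_coefficients(series, k=9):
    """Estimate k LPC coefficients of a 1-D time series by least squares.

    The coefficients a_1..a_k minimize the error of predicting x[t] from the
    previous k samples: x[t] ~ sum_j a_j * x[t-j].
    """
    series = np.asarray(series, dtype=np.float64)
    # Each row of X holds the k past samples used to predict one value of y.
    X = np.stack([series[k - j - 1: len(series) - j - 1] for j in range(k)], axis=1)
    y = series[k:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def lpc_descriptor(feature_vector, k=9):
    """Concatenate the LPC coefficients of all 100 flow time series.

    `feature_vector` is assumed to have shape (50, 60, 2): 50 flow vectors per
    frame, 60 frames, and the x/y components (as in Fig. 3). Returns 100*k values.
    """
    n_frames = feature_vector.shape[1]
    series = np.transpose(feature_vector, (0, 2, 1)).reshape(-1, n_frames)
    return np.concatenate([lpc_coefficients(s, k) for s in series])
```

The resulting descriptors would then be fed to an RBF-SVM as described in the text.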
In Sec. 3.2 we described a hand-designed descriptor for identity recognition. The LPC descriptor suffers from several drawbacks:
• The LPC regression model is learned for each time series separately, and ignores the dependence between optical flow vectors.
• The LPC descriptor and SVM classifier are learned independently; the labels cannot directly influence the design of the descriptor.

3.3. CNN Descriptor and Classifier

To overcome the above drawbacks, we propose to learn a CNN model for photographer recognition. The CNN learns the descriptor and classifier end to end, and is able to take advantage both of the dataset labels and of the dependence between features when calculating filter coefficients. The CNN is a more general architecture; the LPC descriptor is a subset of the descriptors learnable by the network.
Due to the limited number of data points available in our datasets, we limit our CNN to only 2 hidden layers. Using more layers increases model capacity but also increases over-fitting, and this architecture yielded the best performance. The architecture is illustrated in Fig. 5.

Figure 5. A diagram of our CNN architecture for photographer recognition from a given flow feature vector: input (50 × 60, depth 2) → convolution (128 kernels of size 50 × 20 × 2) → ReLU → (1 × 51, depth 128) → max pooling (size 1 × 20, stride 15) → (1 × 4, depth 128) → fully connected layer (128 units) with sigmoid → output layer (verification: 2 units; identification: 20/32 units) with sigmoid. The network learns the descriptor jointly with the classifier, therefore automatically creating a descriptor optimal for this task.

Our architecture is tailored especially for egocentric video. As we use sparse optical flow, we do not assume much spatial invariance in the features (differently from most image recognition tasks). On the other hand, the precise temporal offset of the photographer's actions is usually not important; e.g., the precise time of the beginning of a photographer's step is less important than the time between strides. Our architecture should therefore be temporally invariant. The first layer was thus designed to be convolutional in time but not in space.
The kernel size spans all the blocks across the x and y components over K_T frames (we use K_T = 20, which is a little longer than the typical step duration). The convolutional layer consists of M kernels (we use M = 128). The outputs of the kernels, z_m = W_m ∗ x, are passed through a ReLU non-linearity, max(z_m, 0). We pool the outputs substantially in time, as the feature vector is of high dimension compared to the amount of training data available. To correspond to the typical time interval between steps we use a pooling length of 20 and stride of 15.
The data is then passed through two fully connected (affine) layers, each followed by a sigmoid non-linearity, σ(z) = 1/(1 + e^(−z)). The first fully connected hidden layer has N hidden nodes (we used N = 128). The output of this layer is the learned CNN descriptor.
The second fully connected layer is a linear soft-max classifier and has the same number of nodes as the number of output classes: 2 classes for verification, and 20 or 32 classes for identification.
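For readers who prefer code, the architecture could be sketched roughly as follows. This is an illustrative PyTorch sketch (the authors trained with Caffe); the layer sizes follow the text and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class PhotographerCNN(nn.Module):
    """Rough sketch of the temporal-convolution network described above.

    Input: (batch, 2, 50, 60), i.e. x/y flow components, 50 grid blocks, 60 frames.
    """
    def __init__(self, num_classes=32):
        super().__init__()
        self.features = nn.Sequential(
            # Convolutional in time only: each kernel spans all 50 blocks over 20 frames.
            nn.Conv2d(2, 128, kernel_size=(50, 20)),
            nn.ReLU(),
            # Strong temporal pooling (length 20, stride 15).
            nn.MaxPool2d(kernel_size=(1, 20), stride=(1, 15)),
            nn.Flatten(),
        )
        # First fully connected layer: the learned 128-dimensional descriptor.
        self.descriptor = nn.Sequential(nn.LazyLinear(128), nn.Sigmoid())
        # Second fully connected layer: linear classifier (soft-max applied in the loss).
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.descriptor(self.features(x)))
```

For verification, `num_classes` would be 2; for identification, 20 or 32 as in the text.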
3.4. Recognition from Longer Videos

Sec. 3.2 and Sec. 3.3 described methods to train a photographer classifier on a short (4 second) video sequence. The video used for recognition is usually significantly longer than 4 seconds. We split the video into 4-second subsequences (overlapping by 2 seconds) and extract their feature vectors V_t (t is the subsequence number). We compute the identity label (L_t) probability distribution for each feature vector V_t using LPC or CNN classifiers trained as described above. We then classify the entire video into the globally most likely label, argmax_i Π_t P(L_t = i | V_t) = argmax_i Σ_t log P(L_t = i | V_t). While this classifier assumes that the feature vectors are IID, we have found that this requirement is not necessary for the success of the method. See Fig. 6 for an example on the FPSI dataset. MAP classification has helped boost the recognition performance on the EVPR dataset to around 90% (an increase of 13%) over the 4 s rate.

Figure 6. The MAP rule operated on the FPSI dataset: a) ground truth labels; b) raw CNN probabilities; c) MAP rule probabilities (for T = 12 seconds). The MAP classifier visibly 'cleaned up' the prediction.

Figure 7. Classification accuracy vs. video length when one feature vector covers T = 4 seconds (using a CNN on the FPSI dataset). Longer video allows extraction of more feature vectors. MAP classification consistently beats mode classification. Both methods can exploit longer sequences and thus improve on 4 s sequence recognition. All methods perform far better than random.
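The MAP rule above amounts to summing log-probabilities over the overlapping windows; a small numpy sketch (the per-window class probabilities are assumed to be given by the classifier):

```python
import numpy as np

def map_classify(window_probs):
    """Combine per-window class probabilities into one video-level label.

    `window_probs` has shape (num_windows, num_classes); each row is the
    distribution P(L_t = i | V_t) for one 4 s window. Returns the label i
    maximizing sum_t log P(L_t = i | V_t).
    """
    log_probs = np.log(np.clip(window_probs, 1e-12, 1.0))  # guard against log(0)
    return int(np.argmax(log_probs.sum(axis=0)))

def mode_classify(window_probs):
    """Baseline: most frequent per-window prediction (the 'Mode' rule in Fig. 7)."""
    votes = np.argmax(window_probs, axis=1)
    return int(np.bincount(votes).argmax())
```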
4. Results
Several experiments were performed to evaluate the effectiveness of our method. As there is no standard dataset for Egocentric Video Photographer Recognition, we use a small (6 person) public dataset, FPSI [6], that was originally collected for egocentric activity analysis. For each photographer, morning sequences were used for training and afternoon sequences for testing.
In order to evaluate our method under more principled settings, we collected a new, larger (32 person) dataset, EVPR, specifically designed for egocentric video photographer recognition. In the EVPR dataset all photographers recorded two 7 minute sequences (from which we extracted around 200 four-second sequences each) on the same day with different head-mounted cameras (D1, D2) for training and testing. 20 of the photographers also recorded another 7 minute sequence with yet another camera (D3) a week later. Both datasets are described in detail in Sec. 6.1. The detailed experimental protocol is described in Sec. 6.2.
Fig. 7 presents the photographer recognition test performance of our network on the FPSI database (6 people). The average correct recognition rate on a single feature vector (describing only 4 seconds of video) is 76%, against the random performance of 16.6%.
Figure 8. CMC rates for same day recognition (for 12s sequences).LPC accuracy: 81% (Top-1) and 88% (Top-2). The CNN furtherimproves the performance with 90% (Top-1) and 93% (Top-2).Both methods far outperform the random rate of 3% (Top-1) and6% (Top-2). Both descriptors also beat the raw features by a largemargin.Figure 9. CMC rates for recognition 1 week later (for 12s se-quences). LPC accuracy: 76% (Top-1) and 86% (Top-2). TheCNN further improves the performance with 91% (Top-1) and96% (Top-2). Both methods far outperform the random rate of5% (Top-1) and 10% (Top-2). Both descriptors also beat the rawfeatures by a large margin.
Test videos are usually longer than 4 seconds, and wehave multiple feature vectors for each person. We com-bine predictions over a longer video using the MAP rule inSec. 3.4. In Fig. 7 we compare the MAP strategy vs. takingthe most frequent 4s prediction in the test video (Mode). Weobserve that using longer sequences further improves recog-nition performance, reaching around 91% accuracy for 50seconds of video. We also observe that MAP classifiersconsistently beats the Mode classifier and use it in all otherexperiments.To evaluate the recognition performance on a largerdataset, we show the performance of our method on our newdataset - EVPR. In this experiment the network was trainedon video sequences for each photographer using Camera D1and is evaluated on video sequences recorded on the sameday using Camera D2 and a week later recorded using Cam-era D3. In Fig. 8 and Fig. 9 we present the cumulative matchcurve (CMC) for the same day and week later recognitionresults respectively. We use the Top- k notion, indicatingthat the correct result appeared within the top k predictions5o Stab StabDescriptor 4s 12s 4s 12sLPC 65% 81% 59% 72%CNN 77% 90% 71% 86% Table 1. Same-day CSMC recognition accuracy with and withoutstabilization. of the classifier. In addition to LPC and CNN, an RBF-SVMtrained on the raw optical flow features is used as baseline toevaluate the quality of our descriptors. High accuracy wasachieved in both scenarios, same day CNN recognition ac-curacy is 90% (top 1) and 93% (top 2). The recognition per-formance a week later is better with 91% (top 1) and 96%(top 2). The improved performance numbers a week laterare expected due to the smaller dataset size (20 vs 32), butare nonetheless encouraging as many photographers woredifferent shoes from the D1 training sequence recorded aweek before. This result shows that our method can obtaingood recognition performance on meaningful numbers ofphotographers and across at least a week.To test the possibility that stabilization would take awaysome or all the body motion information in the frame mo-tions, the identification experiments were redone with thefollowing pre-processing stage: for each frame (50 flowvectors) the mean framewise vector was calculated and thensubtracted from each of the vectors in the frame. As motionbetween frames is small and some lens distortion correc-tion was performed, this is similar to 2D stabilization. Ta-ble. 1 shows that such ”stabilization” degrades performancesomewhat (4-9%), but accuracy still remains fairly high. Wenote however that more complex stabilization might removemore body motion information.
We also test the verification performance obtained by our method. In order to evaluate verification performance by a single number it is common to use the Equal Error Rate (EER), the error rate at which the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal.
The EER for both the CNN and LPC descriptors, for videos of length 4 s (one feature vector) and 12 s (five feature vectors), is presented in Table 2, while the ROC curves are shown in Fig. 10. A detailed description of our protocol can be found in Sec. 6. It can be seen from our results that high accuracy (low EER) can be obtained by both descriptors: LPC 14% (4 s), 10% (12 s) and CNN 11% (4 s), 8% (12 s). The CNN obtains better performance for both durations, with a larger improvement for 4 s.
It should be noted that all test probe photographers apart from the target photographer had never been used in training. By focusing on modeling the target photographer we can separate him from the rest of the world, and are thus able to generalize to unseen test photographers.
Table 2. Verification equal error rates for LPC and CNN descriptors with 4 s and 12 s sequence duration.

Descriptor   4s       12s
LPC          13.6%    9.6%
CNN          11.3%    8.1%

Figure 10. ROC curves for the verification performance of our method for LPC and CNN descriptors of 4 s and 12 s sequences. For both methods we show the mean ROC curve. The EER of each method is given by the point of intersection between the linear line and its ROC curve.

Figure 11. Examples of a temporal filter for the horizontal (left) and vertical (right) flow components. The horizontal axis is time, and the vertical axis is location along the central line. The horizontal component filter appears to be sensitive to certain left-right frequencies, while the vertical component filter is sensitive to oscillating rotations: when the right side is moving up, the left side is moving down, etc.
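For reference, EER values like those above can be computed from verification scores roughly as follows (a sketch assuming scikit-learn; not the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false acceptance and false rejection rates meet.

    `labels` are 1 for the target photographer and 0 otherwise; `scores` are the
    classifier's target-photographer scores for each probe sequence.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))      # threshold closest to FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2.0
```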
5. Discussion
Analysis of CNN features:
In order to analyze the features learned by the CNN we visualize the filters learned by the first layer. Fig. 11 shows the horizontal and vertical components of a first-layer temporal filter learned by the network. For illustration purposes, only the weights of the central line of pixels are shown. Looking at the weights, we see that the horizontal component filter is tuned to respond to some specific frequencies, while the vertical component looks for sharp rotations. This behavior appears in several other filters, suggesting that the network might be using both spectral and transient cues.
Transfer Learning for verification:
In some scenarios it may not be possible to train a verification classifier for each photographer. In such cases Nearest Neighbors may be a good alternative. The following approach is taken: an identification CNN is trained on half the photographers in the training dataset. We choose a video by a target photographer (that was not used for training the CNN) and extract its CNN descriptors (as in Sec. 3.3); this set of descriptors forms our gallery. Similarly, we extract CNN descriptors from all video sequences of photographers not used for training the CNN; this forms our probe set (excluding the sequence used as gallery). For each probe descriptor we check if the Euclidean distance from its nearest neighbor in the gallery is smaller than some threshold, and if so we classify it as the target photographer. We used Camera D1 sequences for training and D2 sequences for testing. 16 randomly selected photographers were used for training the CNN, and the rest for verification. The same procedure was carried out for LPC (without training a CNN). Multiple 4 s sequence predictions are aggregated using simple voting. The average EER for 12 s sequences was 15.5% (CNN) and 22% (LPC). Although less accurate than trained classifiers, this shows the network learns identity features that are general and can be transferred to identify unseen photographers. Nearest Neighbors classification on the raw optical flow features yielded very low performance, in accordance with the findings of [20, 25]. A minimal sketch of this nearest-neighbor verification is given below.
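The sketch below assumes numpy; descriptor extraction and the acceptance threshold are left abstract, and the function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def nn_verify(probe_descriptors, gallery_descriptors, threshold):
    """Accept each probe descriptor if its nearest gallery neighbor is close enough.

    Both inputs are arrays of CNN (or LPC) descriptors with shape (n, d).
    Returns one boolean decision per 4 s probe descriptor.
    """
    diffs = probe_descriptors[:, None, :] - gallery_descriptors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # Euclidean distances, (n_probe, n_gallery)
    return dists.min(axis=1) < threshold

def verify_sequence(probe_descriptors, gallery_descriptors, threshold):
    """Aggregate per-descriptor decisions for a longer probe video by simple voting."""
    decisions = nn_verify(probe_descriptors, gallery_descriptors, threshold)
    return decisions.mean() > 0.5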
Verification on FPSI:
We tried learning verification classifiers by choosing one photographer from the FPSI dataset as target, and 4 other photographers as negative training data. The morning sequences of the target photographer were used for training and the afternoon sequences for testing. We tested the classification performance between the afternoon sequences of the target photographer and the remaining sixth, non-target, photographer from the FPSI dataset. The network, however, fit to the training non-target photographers and was not able to generalize to the unseen probe photographer. We therefore conclude that a significant number of photographers (such as present in the EVPR dataset) is required for training a verification classifier.
Failure cases:
In Fig. 12 several cases are shown where the 4 second descriptor failed to give correct recognition. Failure can be caused by sharp head movements (sometimes causing significant blur), by large moving objects, or by lack of features for optical flow computation. It is likely that by identifying such cases and removing their descriptors, higher recognition performance may be achieved.

Figure 12. Common failure cases for the 4-second descriptor: a-b) Sharp turns of the head result in atypical fast motions, sometimes causing motion blur. c) Large moving objects can also cause atypical optical flow patterns.
6. Experimental Procedure
In this section we give a detailed description of the experimental procedure used in Sec. 4.
6.1. Datasets

Two datasets were used for evaluation: a public general purpose dataset (FPSI) and a larger dataset (EVPR) collected by us to overcome some of the weaknesses of FPSI.
The First-Person Social Interactions (FPSI) dataset was collected by Fathi et al. [6] for the purpose of activity analysis. 6 individuals (5 males, 1 female) recorded a day's worth of egocentric video each, using head-worn GoPro cameras. Due to battery and memory limitations of the camera, the photographers occasionally took the cameras off and put them on again, ensuring that camera extrinsic parameters were not kept constant.
In this work we learn to recognize video photographers while walking, rather than sitting or standing. We therefore extracted the walking portions of each video using manual labels. It is possible to use a classifier such as described in [19] to find the walking intervals.
The FPSI dataset suffers from several drawbacks: it contains video only for a small number of photographers (6), and each photographer wears the same hat and camera all the time. It is therefore conceivable that learning camera parameters can help recognition. To overcome these issues we collected a larger dataset, Egocentric Video Photographer Recognition (EVPR).
EVPR consists of head-mounted video sequences collected from 32 photographers. Each video sequence was recorded with a GoPro camera attached to a baseball cap worn on the photographer's head (as in Fig. 13). Each photographer was asked to walk normally for around 7 minutes along the same road. All photographers recorded two 7 minute video sequences on a single day using two different cameras (and caps). 20 photographers also recorded another sequence a week later. The use of different cameras for different sequences was meant to ensure that motion, rather than camera calibration, is learned. No effort was made to ensure that the same shoes would be used on both days (and in fact several persons had changed shoes between sequences).

Figure 13. The apparatus used to record the EVPR dataset.
6.2. Experimental Protocol

Photographer identification aims to recognize a photographer from a closed set of M candidates. For this task it is assumed that we have training data from all subjects. We tested our method on both the FPSI and the EVPR datasets. In the FPSI dataset we used for each individual the first 80% of the sequences (taken in the morning) for training, and the last 20% of the sequences, recorded in the afternoon, for testing. This is done to reduce overfitting to a particular time or camera setup. Data were randomly sub-sampled to ensure an equal number of examples for each photographer in both the training and testing sets. The results are described in Sec. 4.
For the EVPR dataset we used sequences from Camera D1 for training. For testing we use both sequences from Camera D2 (taken on the same day) and Camera D3 (taken a week later, when available). The results on each camera are compared to analyze whether recognition performance degrades within a week.
For photographer verification, given a target photographer with a few minutes of training data, and negative training examples by other non-target photographers, we verify whether a probe test video sequence was recorded by the target photographer. Recognition on longer sequences is done by combining the predictions from subsequent short sequences. As the FPSI dataset contains only 6 photographers it was not suitable for this task (as elaborated upon in Sec. 5); therefore only the EVPR dataset was used for evaluating performance on this task. For each of the 32 photographers: i) the photographer is designated as target; ii) we selected sequences of the target photographer and 15 non-target photographers (randomly selected) for training a binary classifier; all training sequences were 7 minutes (200 descriptors) long and were recorded by camera D1; iii) another sequence recorded by the target photographer and the remaining 16 participants that were not used for training were used to test the classifier; test sequences were recorded by camera D2; iv) the ROC curve and EER were computed. The average EER and ROC over all photographers are finally obtained. As each sequence contained about 200 descriptors, this formed a significant test set. Care was taken to ensure that all photographers (apart from the target) would appear in the training or test datasets but not in both. This was done to ensure we did not overfit to specific non-target photographers. We replicated positive training examples to ensure equal numbers of negative and positive training and test data.
Features:
In all experiments the optical flow grid size used was 10 × 5. In the CNN experiments, all optical flow values were divided by the square root of their absolute value; this was found to help performance by decreasing the significance of extreme values. Feature vectors of length 60 frames at 15 fps (4 s) were used. Feature vectors were extracted every 2 s (with a 2 s overlap).
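The square-root scaling mentioned above is a signed square root that damps extreme flow values; a one-function numpy sketch:

```python
import numpy as np

def signed_sqrt(flow):
    """Divide each flow value by the square root of its absolute value.

    Equivalent to sign(x) * sqrt(|x|) (and well defined at zero); large flow
    magnitudes are compressed while sign and ordering are preserved.
    """
    return np.sign(flow) * np.sqrt(np.abs(flow))
```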
Normalization:
We followed standard practice: for the LPC descriptor, all feature vectors were mean and variance normalized across the training set before being used by the SVM. For the CNN, feature vectors were mean-subtracted before being input to the network.
Training:
The SVM was trained using LIBSVM [4]. We used σ = 1e− and C = 1 for LPC, and C = 10 for the raw features. The CNN was trained by AdaGrad [5] with learning rate 0.01 on a GPU using the Caffe [9] package. The mini-batch size was 200.
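As a rough scikit-learn equivalent of the SVM setup (the authors used LIBSVM directly; the RBF width below is a placeholder because the exact σ value is truncated in the text, and the training arrays are hypothetical):

```python
from sklearn.svm import SVC

GAMMA = 1e-3  # placeholder RBF width; the exact sigma value is not recoverable from the text

lpc_svm = SVC(kernel="rbf", gamma=GAMMA, C=1.0)   # C = 1 for LPC descriptors
raw_svm = SVC(kernel="rbf", gamma=GAMMA, C=10.0)  # C = 10 for raw flow features

# lpc_svm.fit(train_descriptors, train_labels)    # hypothetical training arrays
```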
7. Conclusion
A method to recognize the photographer of head-worn egocentric camera video has been presented. We show that photographer identity can be found from body motion information as expressed in camera motion when walking. Recognition was done with both physically motivated hand-designed descriptors and with a Convolutional Neural Network. Both methods gave good recognition performance. The CNN classifier was shown to generalize and improve on the LPC hand-designed descriptor.
The time-invariant CNN architecture presented here is quite general and can be used for other video classification tasks relying on coarse optical flow.
We have tested the effects of simple 2D video stabilization on classification accuracy, and found only slight degradation in performance. It is possible that more elaborate stabilization would have a greater effect.
The implication of our work is that photographers' head-worn egocentric videos give much information away. This information can be used benevolently (e.g. camera theft prevention, user analytics on video sharing websites) or maliciously. Care should therefore be taken when sharing such video.
Acknowledgment.
This research was supported by Intel-ICRC and by the Israel Science Foundation.
References

[1] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network. In ECCV 2014 Workshops, pages 71-84, 2014.
[2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[3] J. P. Campbell Jr. Speaker recognition: a tutorial. Proceedings of the IEEE, 85(9):1437-1462, 1997.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. on Intelligent Systems and Technology, 2011.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
[6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In CVPR, 2012.
[7] S. Furui. Cepstral analysis technique for automatic speaker verification. ICASSP, 1981.
[8] J. Hays, A. Efros, et al. Im2gps: estimating geographic information from a single image. In CVPR'08, pages 1-8, 2008.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[10] C. R. Johnson Jr, E. Hendriks, I. J. Berezhnoy, E. Brevdo, S. M. Hughes, I. Daubechies, J. Li, E. Postma, and J. Z. Wang. Image processing for artist identification. IEEE Signal Processing Magazine, 25(4):37-48, 2008.
[11] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[13] S. Lee, H. Zhang, and D. J. Crandall. Predicting geo-informative attributes in large-scale image collections using convolutional neural networks. In WACV'15, pages 550-557, 2015.
[14] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
[15] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, volume 81, pages 674-679, 1981.
[16] J. Mantyjarvi, M. Lindholm, E. Vildjiounaite, S.-M. Makela, and H. Ailisto. Identifying users of portable devices from gait pattern with accelerometers. In ICASSP, 2005.
[17] H. Murase and R. Sakai. Moving object recognition in eigenspace representation: gait analysis and lip reading. Pattern Recognition Letters, 17(2):155-162, 1996.
[18] K. Nishino and S. K. Nayar. Corneal imaging system: Environment from eyes. IJCV, 70(1):23-40, 2006.
[19] Y. Poleg, C. Arora, and S. Peleg. Temporal segmentation of egocentric videos. In CVPR, 2014.
[20] Y. Poleg, C. Arora, and S. Peleg. Head motion signatures from egocentric videos. In ACCV, 2014.
[21] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Trans. on Speech and Audio Processing, 3(1):72-83, 1995.
[22] M. S. Ryoo and L. Matthies. First-person activity recognition: What are they doing to me? In CVPR, 2013.
[23] K. Shiraga, N. T. Trung, I. Mitsugami, Y. Mukaigawa, and Y. Yagi. Gait-based person authentication by wearable cameras. In INSS, 2012.
[24] C. Thomas and A. Kovashka. Who's behind the camera? identifying the authorship of a photograph. arXiv:1508.05038, 2015.
[25] R. Yonetani, K. M. Kitani, and Y. Sato. Ego-surfing first person videos. In CVPR, 2015.