An Egocentric Look at Video Photographer Identity
Yedid Hoshen and Shmuel Peleg
The Hebrew University of Jerusalem, Jerusalem, Israel
Abstract
Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras have already penetrated the mass market, reporting a substantial increase in sales every year. As head-worn cameras do not capture the photographer, it may seem that the anonymity of the photographer is preserved even when the video is publicly distributed.
We show that camera motion, as can be computed from the egocentric video, provides unique identity information. The photographer can be reliably recognized from a few seconds of video captured when walking. The proposed method achieves more than 90% recognition accuracy in cases where the random success rate is only 3%.
Applications can include theft prevention by locking the camera when not worn by its rightful owner. Searching video sharing services (e.g. YouTube) for egocentric videos shot by a specific photographer may also become possible. An important message in this paper is that photographers should be aware that sharing egocentric video will compromise their anonymity, even when their face is not visible.
1. Introduction
The popularity of head-worn egocentric cameras is increasing. GoPro reports an increase in sales of 66% every year, and cameras are widely used by extreme sports enthusiasts and by law enforcement and military personnel. Special features of egocentric video include:
• The camera is worn by the photographer, and is recording while the photographer performs normal activities.
• The camera moves with the photographer's head.
• The camera does not record images of the photographer. In spite of this, we show that photographers can often be identified.
Photographers feel secure that sharing their egocentric videos on social media does not compromise their identity (Fig. 1). Police forces routinely release footage of officer activity, and operations of special forces recorded by wearable cameras are widely published on YouTube. Some have even recorded and published their own crimes. A consequence of our work is that the photographer identity of such videos can sometimes be found from camera motion.

Figure 1. a) A GoPro video uploaded to YouTube, allegedly capturing a crime from the POV of the robber. Can the robber be recognized? b) A GoPro video uploaded by US soldiers in combat. Are their identities safe?

Body motion is an accurate and replicable feature for identifying people over time. It is often recorded by accelerometers [16] or by an overlooking camera. Egocentric video can effectively serve as a head-mounted visual gyroscope and can accurately capture body motion information. It follows that any egocentric video which includes walking contains body motion information that can accurately reveal the photographer.
Specifically, we use sparse optical flow vectors (50 flow vectors per frame) taken over a few steps (4 seconds). This results in a set of time series, one for each component of each optical flow vector. In Fig. 2 we show the temporal Fourier transform of one flow vector for three different sequences, showing visible differences between different photographers.

Figure 2. Comparison of the temporal frequency spectra for three videos. Two videos were recorded using camera D1 by users A and B; the third video was recorded by user A using camera D2. It is readily seen that the spectra of the two videos recorded by photographer A are very similar to each other despite being recorded by different cameras and at different times. This suggests that a photographer's physique is expressed in the motion observed in his video.

As a first approach for determining photographer identity, we computed LPC (Linear Predictive Coding) [3] coefficients for each of the optical flow time series. (The LPC coefficients of a time series are the k values that, when scalar-multiplied with the last k measurements of the time series, optimally predict the next measurement.) All LPC coefficients of all optical flow sequences were used as a descriptor. Photographer recognition using a non-linear SVM trained on the LPC descriptor gave 81% identification accuracy (vs. accuracy of 3% in random) and verification EER (Equal Error Rate) of 10%.
Our second approach learns the descriptor and classifiers using a Convolutional Neural Network (CNN) which includes layers corresponding to body motion descriptor extraction and to recognition. The CNN is trained on the optical-flow features described above. Using a CNN improves the results over the LPC coefficients, yielding a 90% identification rate (vs. accuracy of 3% in random) and verification EER (Equal Error Rate) of 8%.
The above experiments were performed on both a small (6 person) public dataset [6] (originally collected for egocentric activity analysis) and on a new, larger (32 person) dataset collected by us especially for Egocentric Video Photographer Recognition (EVPR).
The ability to recognize the photographer quickly and accurately can be important for camera theft prevention and for forensic analysis (e.g. who committed the crime). Other applications are web search by egocentric video photographer and organization of video collections. Wearing a mask does not reduce recognition rate, of course.
2. Previous Work
Determination of the painter of an artwork, for preventing forgery and fake artists, has attracted attention for centuries. Computer vision researchers have presented several approaches for automatic artist and style classification, mainly utilizing low-level and object cues [10, 1].
Recognizing the unseen photographer of a picture is an interesting related problem. In this setting the photographic style [24] and the location of the photograph [8, 13] can be used as cues for photographer recognition. Both methods are unable to distinguish between photographers using cameras on default settings (such as most wearable cameras) and at the same locations. Another approach is automatic recognition of the photographer's reflection (e.g. in the subject's eyes [18]), but this relies on having reflective surfaces in the pictures.
Photographer recognition from wearable camera video is a novel problem. Such video is jittery due to the motion of the photographer's head and body. Although typically a nuisance, we show that frame jitter can accurately determine photographer identity.
Human body motion has already been used for recognition. Gait recognition is typically done by a video camera observing a person's shape and dynamic walking style. These features are able to recognize a person accurately [17]. In our scenario, however, the photographer is not seen by the camera, which is worn on his head. Recognition from accelerometers carried on the user's body [16] has also been reported. Shiraga et al. [23] studied recognition of people wearing a backpack with stereo cameras. Rotation and period of motion were computed using 3D geometry, and users were accurately recognized. Unlike all prior art, we are interested in recognizing photographers of videos taken by standard wearable cameras (e.g. as exist on video sharing websites), nearly all of which are monocular, head or chest mounted.
Using optical flow for activity recognition from head-mounted cameras has been done by [11, 19, 22, 14] and others. Papers [20, 25] used head motion to retrieve head-mounted camera users observed in other videos recorded at the same time. We, on the other hand, use camera motion to recognize the users of wearable cameras across time.
Feature design for time series data has been extensively studied, particularly for speech recognition systems [21]. Speaker verification is a long-standing problem which is related to this work. Linear Predictive Coding (LPC) descriptors are very popular for speaker recognition [7]. We show that an LPC-based descriptor is highly effective also for user recognition from egocentric camera video.
In this paper we also take an end-to-end approach of learning features along with the classifier, instead of hand-designing the features. We perform this using convolutional neural networks (CNN). For an overview of deep networks see [2]. Learned features are sometimes better than hand-designed features [12].
3. Photographer Recognition from Optical Flow
Egocentric video suffers from bouncy and unsteady motion caused by the photographer's head and body motion. Although usually a nuisance, we show that this motion forms the basis for accurate photographer recognition methods. We present our basic features in Sec. 3.1. Two alternative descriptors and classifiers are described in Sec. 3.2 and Sec. 3.3.

Figure 3. a) 50 optical flow vectors are calculated for each frame (only 12 shown here), and represented as two columns (each of 50 values), for the x and y optical flow components. b) The feature vector consists of optical flow columns for 60 frames, stacked into two 50 × 60 arrays, for the x and y components of the flow.

Figure 4. Two examples of the flow feature vectors. Each feature vector consists of 50 optical flow vectors per frame, computed for each of 60 frames. Here only the central row, having 10 flow vectors, is shown. The left and right images show the horizontal and vertical components of the optical flow. Note the rich temporal structure along the time axis.
3.1. Optical Flow Features

In the following sections we assume that the video frames were pre-processed in the following way (see Fig. 3):
1. Frames are partitioned into a small number (m_x × m_y) of non-overlapping blocks.
2. m_x × m_y optical flow vectors are computed for each frame using the Lucas-Kanade algorithm [15]. We use 10 × 5 blocks, giving 50 flow vectors per frame.
3. A sequence of T seconds of such optical flow vectors is taken. We used T = 4 seconds, which is long enough to include a few steps. At 15 fps this results in 60 frames.
4. Each feature vector covers a period of 4 seconds, and we computed feature vectors every 2 seconds. There is an overlap of 2 seconds between two successive feature vectors.
We used optical flow features for photographer recognition, rather than pixel intensities, as the body motion is eventually expressed by the pixel motion. On the other hand, recognition should be invariant to the specific objects seen in the environment, objects that are represented by pixel intensities. CNNs may be able to learn optical flow from pixel intensities, but learning this would require much more data than we can collect.
If dense optical flow were used as a feature, the high feature dimensionality would have led to overfitting on small datasets. Using a smaller number of flow vectors gave reduced accuracy. In looking for the optimal feature size we found that a grid of 10 × 5 flow vectors gives a good trade-off.
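For concreteness, the pre-processing above could be implemented roughly as follows. This is a minimal sketch assuming OpenCV and grayscale input frames; the function names are illustrative and this is not the authors' code.

```python
import numpy as np
import cv2

def grid_points(width, height, mx=10, my=5):
    """Centers of an mx-by-my grid of non-overlapping blocks, shape (mx*my, 1, 2)."""
    xs = (np.arange(mx) + 0.5) * width / mx
    ys = (np.arange(my) + 0.5) * height / my
    pts = np.array([[x, y] for y in ys for x in xs], dtype=np.float32)
    return pts.reshape(-1, 1, 2)

def flow_feature_vector(frames, mx=10, my=5):
    """Stack sparse Lucas-Kanade flow between consecutive frames into one feature array.

    `frames` is a list of grayscale uint8 images; 61 frames at 15 fps (4 seconds)
    yield an array of shape (50, 60, 2): 50 grid blocks, 60 flow fields, x/y components.
    """
    h, w = frames[0].shape
    pts = grid_points(w, h, mx, my)
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        new_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, nxt, pts, None)
        flows.append((new_pts - pts).reshape(-1, 2))  # per-block (dx, dy)
    return np.stack(flows, axis=1)
```

Feature vectors over a longer video would then be extracted from overlapping 4-second windows with a 2-second step, as described in item 4 above.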
3.2. LPC Descriptor and SVM Classifier

LPC [3] is a popular time-series descriptor (e.g. for speaker verification). LPC assumes the data is generated by a physical system, here the photographer's head and body, and attempts to learn a linear regression model for its equations of motion, predicting for each optical flow series the flow value in the next frame given the flow values of the previous k frames. Given a feature vector, we calculate an LPC model for each component of each 4 s flow time series (100 models in total). Using too few coefficients yields less accurate predictions, while too many coefficients causes overfitting. We found k = 9 to work well for our case. The final LPC descriptor consisted of all coefficients of all time-series models (100 × 9 values).
An SVM classifier was trained on the LPC descriptors, both for identification (classify an LPC descriptor into one of M known photographers) and for verification (classify an LPC descriptor into target photographer or rest-of-the-world). The non-linear (RBF) classifier outperformed linear SVM in almost all cases. As mentioned before, photographer recognition using a non-linear SVM trained on the LPC descriptor gave an 81% identification rate (vs. random 3%), and the verification EER (Equal Error Rate) was 10%.
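As an illustration of the descriptor just described (a sketch assuming numpy; the least-squares fit below is one standard way to obtain LPC coefficients, not necessarily the exact estimator used by the authors):

```python
import numpy as np

def lpc_coefficients(series, k=9):
    """Estimate k LPC coefficients of a 1-D time series by least squares.

    The coefficients a_1..a_k minimize the error of predicting x[t] from the
    previous k samples: x[t] ~ sum_j a_j * x[t-j].
    """
    series = np.asarray(series, dtype=np.float64)
    # Each row of X holds the k past samples used to predict one value of y.
    X = np.stack([series[k - j - 1: len(series) - j - 1] for j in range(k)], axis=1)
    y = series[k:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def lpc_descriptor(feature_vector, k=9):
    """Concatenate the LPC coefficients of all 100 flow time series.

    `feature_vector` is assumed to have shape (50, 60, 2): 50 flow vectors per
    frame, 60 frames, and the x/y components (as in Fig. 3). Returns 100*k values.
    """
    n_frames = feature_vector.shape[1]
    series = np.transpose(feature_vector, (0, 2, 1)).reshape(-1, n_frames)
    return np.concatenate([lpc_coefficients(s, k) for s in series])
```

The resulting descriptors would then be fed to an RBF-SVM as described in the text.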
In Sec. 3.2 we described a hand-designed descriptor for identity recognition. The LPC descriptor suffers from several drawbacks:
• The LPC regression model is learned for each time series separately, and ignores the dependence between optical flow vectors.
• The LPC descriptor and SVM classifier are learned independently; the labels cannot directly influence the design of the descriptor.

3.3. CNN Descriptor and Classifier

To overcome the above drawbacks, we propose to learn a CNN model for photographer recognition. The CNN learns the descriptor and classifier end to end, and is able to take advantage both of the dataset labels and of the dependence between features when calculating filter coefficients. The CNN is a more general architecture; the LPC descriptor is a subset of the descriptors learnable by the network.
Due to the limited number of data points available in our datasets, we limit our CNN to only 2 hidden layers. Using more layers increases model capacity but also increases over-fitting, and this architecture yielded the best performance. The architecture is illustrated in Fig. 5.

Figure 5. A diagram of our CNN architecture for photographer recognition from a given flow feature vector: input (50 × 60, depth 2) → convolution (128 kernels of size 50 × 20 × 2) → ReLU → (1 × 51, depth 128) → max pooling (size 1 × 20, stride 15) → (1 × 4, depth 128) → fully connected layer (128 units) with sigmoid → output layer (verification: 2 units; identification: 20/32 units) with sigmoid. The network learns the descriptor jointly with the classifier, therefore automatically creating a descriptor optimal for this task.

Our architecture is tailored especially for egocentric video. As we use sparse optical flow, we do not assume much spatial invariance in the features (differently from most image recognition tasks). On the other hand, the precise temporal offset of the photographer's actions is usually not important; e.g., the precise time of the beginning of a photographer's step is less important than the time between strides. Our architecture should therefore be temporally invariant. The first layer was thus designed to be convolutional in time but not in space.
The kernel size spans all the blocks across the x and y components over K_T frames (we use K_T = 20, which is a little longer than the typical step duration). The convolutional layer consists of M kernels (we use M = 128). The outputs of the kernels, z_m = W_m ∗ x, are passed through a ReLU non-linearity, max(z_m, 0). We pool the outputs substantially in time, as the feature vector is of high dimension compared to the amount of training data available. To correspond to the typical time interval between steps we use a pooling length of 20 and stride of 15.
The data is then passed through two fully connected (affine) layers, each followed by a sigmoid non-linearity, σ(z) = 1/(1 + e^(−z)). The first fully connected hidden layer has N hidden nodes (we used N = 128). The output of this layer is the learned CNN descriptor.
The second fully connected layer is a linear soft-max classifier and has the same number of nodes as the number of output classes: 2 classes for verification, and 20 or 32 classes for identification.
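For readers who prefer code, the architecture could be sketched roughly as follows. This is an illustrative PyTorch sketch (the authors trained with Caffe); the layer sizes follow the text and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class PhotographerCNN(nn.Module):
    """Rough sketch of the temporal-convolution network described above.

    Input: (batch, 2, 50, 60), i.e. x/y flow components, 50 grid blocks, 60 frames.
    """
    def __init__(self, num_classes=32):
        super().__init__()
        self.features = nn.Sequential(
            # Convolutional in time only: each kernel spans all 50 blocks over 20 frames.
            nn.Conv2d(2, 128, kernel_size=(50, 20)),
            nn.ReLU(),
            # Strong temporal pooling (length 20, stride 15).
            nn.MaxPool2d(kernel_size=(1, 20), stride=(1, 15)),
            nn.Flatten(),
        )
        # First fully connected layer: the learned 128-dimensional descriptor.
        self.descriptor = nn.Sequential(nn.LazyLinear(128), nn.Sigmoid())
        # Second fully connected layer: linear classifier (soft-max applied in the loss).
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.classifier(self.descriptor(self.features(x)))
```

For verification, `num_classes` would be 2; for identification, 20 or 32 as in the text.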
3.4. Recognition from Longer Videos

Sec. 3.2 and Sec. 3.3 described methods to train a photographer classifier on a short (4 second) video sequence. The video used for recognition is usually significantly longer than 4 seconds. We split the video into 4-second subsequences (overlapping by 2 seconds) and extract their feature vectors V_t (t is the subsequence number). We compute the identity label (L_t) probability distribution for each feature vector V_t using LPC or CNN classifiers trained as described above. We then classify the entire video into the globally most likely label, argmax_i Π_t P(L_t = i | V_t) = argmax_i Σ_t log P(L_t = i | V_t). While this classifier assumes that the feature vectors are IID, we have found that this requirement is not necessary for the success of the method. See Fig. 6 for an example on the FPSI dataset. MAP classification has helped boost the recognition performance on the EVPR dataset to around 90% (an increase of 13%) over the 4 s rate.

Figure 6. The MAP rule operated on the FPSI dataset: a) ground truth labels; b) raw CNN probabilities; c) MAP rule probabilities (for T = 12 seconds). The MAP classifier visibly 'cleaned up' the prediction.

Figure 7. Classification accuracy vs. video length when one feature vector covers T = 4 seconds (using a CNN on the FPSI dataset). Longer video allows extraction of more feature vectors. MAP classification consistently beats mode classification. Both methods can exploit longer sequences and thus improve on 4 s sequence recognition. All methods perform far better than random.
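The MAP rule above amounts to summing log-probabilities over the overlapping windows; a small numpy sketch (the per-window class probabilities are assumed to be given by the classifier):

```python
import numpy as np

def map_classify(window_probs):
    """Combine per-window class probabilities into one video-level label.

    `window_probs` has shape (num_windows, num_classes); each row is the
    distribution P(L_t = i | V_t) for one 4 s window. Returns the label i
    maximizing sum_t log P(L_t = i | V_t).
    """
    log_probs = np.log(np.clip(window_probs, 1e-12, 1.0))  # guard against log(0)
    return int(np.argmax(log_probs.sum(axis=0)))

def mode_classify(window_probs):
    """Baseline: most frequent per-window prediction (the 'Mode' rule in Fig. 7)."""
    votes = np.argmax(window_probs, axis=1)
    return int(np.bincount(votes).argmax())
```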
4. Results
Several experiments were performed to evaluate the effectiveness of our method. As there is no standard dataset for Egocentric Video Photographer Recognition, we use a small (6 person) public dataset, FPSI [6], that was originally collected for egocentric activity analysis. For each photographer, morning sequences were used for training and afternoon sequences for testing.
In order to evaluate our method under more principled settings, we collected a new, larger (32 person) dataset, EVPR, specifically designed for egocentric video photographer recognition. In the EVPR dataset all photographers recorded two 7 minute sequences (from which we extracted around 200 four-second sequences each) on the same day with different head-mounted cameras (D1, D2) for training and testing. 20 of the photographers also recorded another 7 minute sequence with yet another camera (D3) a week later. Both datasets are described in detail in Sec. 6.1. The detailed experimental protocol is described in Sec. 6.2.
Fig. 7 presents the photographer recognition test performance of our network on the FPSI database (6 people). The average correct recognition rate on a single feature vector (describing only 4 seconds of video) is 76%, against the random performance of 16.6%.
Figure 8. CMC rates for same day recognition (for 12s sequences).LPC accuracy: 81% (Top-1) and 88% (Top-2). The CNN furtherimproves the performance with 90% (Top-1) and 93% (Top-2).Both methods far outperform the random rate of 3% (Top-1) and6% (Top-2). Both descriptors also beat the raw features by a largemargin.Figure 9. CMC rates for recognition 1 week later (for 12s se-quences). LPC accuracy: 76% (Top-1) and 86% (Top-2). TheCNN further improves the performance with 91% (Top-1) and96% (Top-2). Both methods far outperform the random rate of5% (Top-1) and 10% (Top-2). Both descriptors also beat the rawfeatures by a large margin.
Test videos are usually longer than 4 seconds, and wehave multiple feature vectors for each person. We com-bine predictions over a longer video using the MAP rule inSec. 3.4. In Fig. 7 we compare the MAP strategy vs. takingthe most frequent 4s prediction in the test video (Mode). Weobserve that using longer sequences further improves recog-nition performance, reaching around 91% accuracy for 50seconds of video. We also observe that MAP classifiersconsistently beats the Mode classifier and use it in all otherexperiments.To evaluate the recognition performance on a largerdataset, we show the performance of our method on our newdataset - EVPR. In this experiment the network was trainedon video sequences for each photographer using Camera D1and is evaluated on video sequences recorded on the sameday using Camera D2 and a week later recorded using Cam-era D3. In Fig. 8 and Fig. 9 we present the cumulative matchcurve (CMC) for the same day and week later recognitionresults respectively. We use the Top- k notion, indicatingthat the correct result appeared within the top k predictions5o Stab StabDescriptor 4s 12s 4s 12sLPC 65% 81% 59% 72%CNN 77% 90% 71% 86% Table 1. Same-day CSMC recognition accuracy with and withoutstabilization. of the classifier. In addition to LPC and CNN, an RBF-SVMtrained on the raw optical flow features is used as baseline toevaluate the quality of our descriptors. High accuracy wasachieved in both scenarios, same day CNN recognition ac-curacy is 90% (top 1) and 93% (top 2). The recognition per-formance a week later is better with 91% (top 1) and 96%(top 2). The improved performance numbers a week laterare expected due to the smaller dataset size (20 vs 32), butare nonetheless encouraging as many photographers woredifferent shoes from the D1 training sequence recorded aweek before. This result shows that our method can obtaingood recognition performance on meaningful numbers ofphotographers and across at least a week.To test the possibility that stabilization would take awaysome or all the body motion information in the frame mo-tions, the identification experiments were redone with thefollowing pre-processing stage: for each frame (50 flowvectors) the mean framewise vector was calculated and thensubtracted from each of the vectors in the frame. As motionbetween frames is small and some lens distortion correc-tion was performed, this is similar to 2D stabilization. Ta-ble. 1 shows that such ”stabilization” degrades performancesomewhat (4-9%), but accuracy still remains fairly high. Wenote however that more complex stabilization might removemore body motion information.
We also test the verification performance obtained by our method. In order to evaluate verification performance by a single number it is common to use the Equal Error Rate (EER), the error rate at which the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal.
The EER for both the CNN and LPC descriptors, for videos of length 4 s (one feature vector) and 12 s (five feature vectors), is presented in Table 2, while the ROC curves are shown in Fig. 10. A detailed description of our protocol can be found in Sec. 6. It can be seen from our results that high accuracy (low EER) can be obtained by both descriptors: LPC 14% (4 s), 10% (12 s) and CNN 11% (4 s), 8% (12 s). The CNN obtains better performance for both durations, with a larger improvement for 4 s.
It should be noted that all test probe photographers apart from the target photographer had never been used in training. By focusing on modeling the target photographer we can separate him from the rest of the world, and are thus able to generalize to unseen test photographers.
Table 2. Verification equal error rates for LPC and CNN descriptors with 4 s and 12 s sequence duration.

Descriptor   4s       12s
LPC          13.6%    9.6%
CNN          11.3%    8.1%

Figure 10. ROC curves for the verification performance of our method for LPC and CNN descriptors of 4 s and 12 s sequences. For both methods we show the mean ROC curve. The EER of each method is given by the point of intersection between the linear line and its ROC curve.

Figure 11. Examples of a temporal filter for the horizontal (left) and vertical (right) flow components. The horizontal axis is time, and the vertical axis is location along the central line. The horizontal component filter appears to be sensitive to certain left-right frequencies, while the vertical component filter is sensitive to oscillating rotations: when the right side is moving up, the left side is moving down, etc.
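For reference, EER values like those above can be computed from verification scores roughly as follows (a sketch assuming scikit-learn; not the authors' evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false acceptance and false rejection rates meet.

    `labels` are 1 for the target photographer and 0 otherwise; `scores` are the
    classifier's target-photographer scores for each probe sequence.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - fnr))      # threshold closest to FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2.0
```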
5. Discussion
Analysis of CNN features:
In order to analyze the features learned by the CNN we visualize the filters learned by the first layer. Fig. 11 shows the horizontal and vertical components of a first-layer temporal filter learned by the network. For illustration purposes, only the weights of the central line of pixels are shown. Looking at the weights, we see that the horizontal component filter is tuned to respond to some specific frequencies, while the vertical component looks for sharp rotations. This behavior appears in several other filters, suggesting that the network might be using both spectral and transient cues.
Transfer Learning for verification:
In some scenarios it may not be possible to train a verification classifier for each photographer. In such cases Nearest Neighbors may be a good alternative. The following approach is taken: an identification CNN is trained on half the photographers in the training dataset. We choose a video by a target photographer (that was not used for training the CNN) and extract its CNN descriptors (as in Sec. 3.3); this set of descriptors forms our gallery. Similarly, we extract CNN descriptors from all video sequences of photographers not used for training the CNN; this forms our probe set (excluding the sequence used as gallery). For each probe descriptor we check if the Euclidean distance from its nearest neighbor in the gallery is smaller than some threshold, and if so we classify it as the target photographer. We used Camera D1 sequences for training and D2 sequences for testing. 16 randomly selected photographers were used for training the CNN, and the rest for verification. The same procedure was carried out for LPC (without training a CNN). Multiple 4 s sequence predictions are aggregated using simple voting. The average EER for 12 s sequences was 15.5% (CNN) and 22% (LPC). Although less accurate than trained classifiers, this shows the network learns identity features that are general and can be transferred to identify unseen photographers. Nearest Neighbors classification on the raw optical flow features yielded very low performance, in accordance with the findings of [20, 25]. A minimal sketch of this nearest-neighbor verification is given below.
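The sketch below assumes numpy; descriptor extraction and the acceptance threshold are left abstract, and the function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def nn_verify(probe_descriptors, gallery_descriptors, threshold):
    """Accept each probe descriptor if its nearest gallery neighbor is close enough.

    Both inputs are arrays of CNN (or LPC) descriptors with shape (n, d).
    Returns one boolean decision per 4 s probe descriptor.
    """
    diffs = probe_descriptors[:, None, :] - gallery_descriptors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)          # Euclidean distances, (n_probe, n_gallery)
    return dists.min(axis=1) < threshold

def verify_sequence(probe_descriptors, gallery_descriptors, threshold):
    """Aggregate per-descriptor decisions for a longer probe video by simple voting."""
    decisions = nn_verify(probe_descriptors, gallery_descriptors, threshold)
    return decisions.mean() > 0.5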
Verification on FPSI:
We tried learning verification classifiers by choosing one photographer from the FPSI dataset as target, and 4 other photographers as negative training data. The morning sequences of the target photographer were used for training and the afternoon sequences for testing. We tested the classification performance between the afternoon sequences of the target photographer and the remaining sixth, non-target, photographer from the FPSI dataset. The network, however, fit to the training non-target photographers and was not able to generalize to the unseen probe photographer. We therefore conclude that a significant number of photographers (such as present in the EVPR dataset) is required for training a verification classifier.
Failure cases:
In Fig. 12 several cases are shown where the 4 second descriptor failed to give correct recognition. Failure can be caused by sharp head movements (sometimes causing significant blur), by large moving objects, or by lack of features for optical flow computation. It is likely that by identifying such cases and removing their descriptors, higher recognition performance may be achieved.

Figure 12. Common failure cases for the 4-second descriptor: a-b) Sharp turns of the head result in atypical fast motions, sometimes causing motion blur. c) Large moving objects can also cause atypical optical flow patterns.
6. Experimental Procedure
In this section we give a detailed description of the experimental procedure used in Sec. 4.
6.1. Datasets

Two datasets were used for evaluation: a public general purpose dataset (FPSI) and a larger dataset (EVPR) collected by us to overcome some of the weaknesses of FPSI.
The First-Person Social Interactions (FPSI) dataset was collected by Fathi et al. [6] for the purpose of activity analysis. 6 individuals (5 males, 1 female) recorded a day's worth of egocentric video each, using head-worn GoPro cameras. Due to battery and memory limitations of the camera, the photographers occasionally took the cameras off and put them on again, ensuring that camera extrinsic parameters were not kept constant.
In this work we learn to recognize video photographers while walking, rather than sitting or standing. We therefore extracted the walking portions of each video using manual labels. It is possible to use a classifier such as described in [19] to find the walking intervals.
The FPSI dataset suffers from several drawbacks: it contains video only for a small number of photographers (6), and each photographer wears the same hat and camera all the time. It is therefore conceivable that learning camera parameters can help recognition. To overcome these issues we collected a larger dataset, Egocentric Video Photographer Recognition (EVPR).
EVPR consists of head-mounted video sequences collected from 32 photographers. Each video sequence was recorded with a GoPro camera attached to a baseball cap worn on the photographer's head (as in Fig. 13). Each photographer was asked to walk normally for around 7 minutes along the same road. All photographers recorded two 7 minute video sequences on a single day using two different cameras (and caps). 20 photographers also recorded another sequence a week later. The use of different cameras for different sequences was meant to ensure that motion, rather than camera calibration, is learned. No effort was made to ensure that the same shoes would be used on both days (and in fact several persons had changed shoes between sequences).

Figure 13. The apparatus used to record the EVPR dataset.
6.2. Experimental Protocol

Photographer identification aims to recognize a photographer from a closed set of M candidates. For this task it is assumed that we have training data from all subjects. We tested our method on both the FPSI and the EVPR datasets. In the FPSI dataset we used for each individual the first 80% of the sequences (taken in the morning) for training, and the last 20% of the sequences, recorded in the afternoon, for testing. This is done to reduce overfitting to a particular time or camera setup. Data were randomly sub-sampled to ensure an equal number of examples for each photographer in both the training and testing sets. The results are described in Sec. 4.
For the EVPR dataset we used sequences from Camera D1 for training. For testing we use both sequences from Camera D2 (taken on the same day) and Camera D3 (taken a week later, when available). The results on each camera are compared to analyze whether recognition performance degrades within a week.
For photographer verification, given a target photographer with a few minutes of training data, and negative training examples by other non-target photographers, we verify whether a probe test video sequence was recorded by the target photographer. Recognition on longer sequences is done by combining the predictions from subsequent short sequences. As the FPSI dataset contains only 6 photographers it was not suitable for this task (as elaborated upon in Sec. 5); therefore only the EVPR dataset was used for evaluating performance on this task. For each of the 32 photographers: i) the photographer is designated as target; ii) we selected sequences of the target photographer and 15 non-target photographers (randomly selected) for training a binary classifier; all training sequences were 7 minutes (200 descriptors) long and were recorded by camera D1; iii) another sequence recorded by the target photographer and the remaining 16 participants that were not used for training were used to test the classifier; test sequences were recorded by camera D2; iv) the ROC curve and EER were computed. The average EER and ROC over all photographers are finally obtained. As each sequence contained about 200 descriptors, this formed a significant test set. Care was taken to ensure that all photographers (apart from the target) would appear in the training or test datasets but not in both. This was done to ensure we did not overfit to specific non-target photographers. We replicated positive training examples to ensure equal numbers of negative and positive training and test data.
Features:
In all experiments the optical flow grid size used was 10 × 5. In the CNN experiments, all optical flow values were divided by the square root of their absolute value; this was found to help performance by decreasing the significance of extreme values. Feature vectors of length 60 frames at 15 fps (4 s) were used. Feature vectors were extracted every 2 s (with a 2 s overlap).
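The square-root scaling mentioned above is a signed square root that damps extreme flow values; a one-function numpy sketch:

```python
import numpy as np

def signed_sqrt(flow):
    """Divide each flow value by the square root of its absolute value.

    Equivalent to sign(x) * sqrt(|x|) (and well defined at zero); large flow
    magnitudes are compressed while sign and ordering are preserved.
    """
    return np.sign(flow) * np.sqrt(np.abs(flow))
```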
Normalization:
We followed standard practice: for the LPC descriptor, all feature vectors were mean and variance normalized across the training set before being used by the SVM. For the CNN, feature vectors were mean-subtracted before being input to the network.
Training:
The SVM was trained using LIBSVM [4]. We used σ = 1e− and C = 1 for LPC, and C = 10 for the raw features. The CNN was trained by AdaGrad [5] with learning rate 0.01 on a GPU using the Caffe [9] package. The mini-batch size was 200.
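As a rough scikit-learn equivalent of the SVM setup (the authors used LIBSVM directly; the RBF width below is a placeholder because the exact σ value is truncated in the text, and the training arrays are hypothetical):

```python
from sklearn.svm import SVC

GAMMA = 1e-3  # placeholder RBF width; the exact sigma value is not recoverable from the text

lpc_svm = SVC(kernel="rbf", gamma=GAMMA, C=1.0)   # C = 1 for LPC descriptors
raw_svm = SVC(kernel="rbf", gamma=GAMMA, C=10.0)  # C = 10 for raw flow features

# lpc_svm.fit(train_descriptors, train_labels)    # hypothetical training arrays
```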
7. Conclusion
A method to recognize the photographer of head-worn egocentric camera video has been presented. We show that photographer identity can be found from body motion information as expressed in camera motion when walking. Recognition was done with both physically motivated hand-designed descriptors and with a Convolutional Neural Network. Both methods gave good recognition performance. The CNN classifier was shown to generalize and improve on the LPC hand-designed descriptor.
The time-invariant CNN architecture presented here is quite general and can be used for other video classification tasks relying on coarse optical flow.
We have tested the effects of simple 2D video stabilization on classification accuracy, and found only slight degradation in performance. It is possible that more elaborate stabilization would have a greater effect.
The implication of our work is that photographers' head-worn egocentric videos give much information away. This information can be used benevolently (e.g. camera theft prevention, user analytics on video sharing websites) or maliciously. Care should therefore be taken when sharing such video.
Acknowledgment.
This research was supported by Intel-ICRC and by the Israel Science Foundation.
References

[1] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network. In ECCV 2014 Workshops, pages 71-84, 2014.
[2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[3] J. P. Campbell Jr. Speaker recognition: a tutorial. Proceedings of the IEEE, 85(9):1437-1462, 1997.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. on Intelligent Systems and Technology, 2011.
[5] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
[6] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In CVPR, 2012.
[7] S. Furui. Cepstral analysis technique for automatic speaker verification. ICASSP, 1981.
[8] J. Hays, A. Efros, et al. Im2gps: estimating geographic information from a single image. In CVPR'08, pages 1-8, 2008.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[10] C. R. Johnson Jr, E. Hendriks, I. J. Berezhnoy, E. Brevdo, S. M. Hughes, I. Daubechies, J. Li, E. Postma, and J. Z. Wang. Image processing for artist identification. IEEE Signal Processing Magazine, 25(4):37-48, 2008.
[11] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto. Fast unsupervised ego-action learning for first-person sports videos. In CVPR, 2011.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[13] S. Lee, H. Zhang, and D. J. Crandall. Predicting geo-informative attributes in large-scale image collections using convolutional neural networks. In WACV'15, pages 550-557, 2015.
[14] Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
[15] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, volume 81, pages 674-679, 1981.
[16] J. Mantyjarvi, M. Lindholm, E. Vildjiounaite, S.-M. Makela, and H. Ailisto. Identifying users of portable devices from gait pattern with accelerometers. In ICASSP, 2005.
[17] H. Murase and R. Sakai. Moving object recognition in eigenspace representation: gait analysis and lip reading. Pattern Recognition Letters, 17(2):155-162, 1996.
[18] K. Nishino and S. K. Nayar. Corneal imaging system: Environment from eyes. IJCV, 70(1):23-40, 2006.
[19] Y. Poleg, C. Arora, and S. Peleg. Temporal segmentation of egocentric videos. In CVPR, 2014.
[20] Y. Poleg, C. Arora, and S. Peleg. Head motion signatures from egocentric videos. In ACCV, 2014.
[21] D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Trans. on Speech and Audio Processing, 3(1):72-83, 1995.
[22] M. S. Ryoo and L. Matthies. First-person activity recognition: What are they doing to me? In CVPR, 2013.
[23] K. Shiraga, N. T. Trung, I. Mitsugami, Y. Mukaigawa, and Y. Yagi. Gait-based person authentication by wearable cameras. In INSS, 2012.
[24] C. Thomas and A. Kovashka. Who's behind the camera? identifying the authorship of a photograph. arXiv:1508.05038, 2015.
[25] R. Yonetani, K. M. Kitani, and Y. Sato. Ego-surfing first person videos. In CVPR, 2015.