On the Role of Event Boundaries in Egocentric Activity Recognition from Photostreams
Alejandro Cartas, Estefania Talavera, Petia Radeva, Mariella Dimiccoli
University of Barcelona, Mathematics and Computer Science Department, 08007 Barcelona, Spain
{alejandro.cartas, etalavera, petia.ivanova}@ub.edu
Computer Vision Center, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Spain
[email protected]
Abstract
Event boundaries play a crucial role as a pre-processing step for the detection, localization, and recognition of human activities in videos. Typically, despite their intrinsic subjectivity, temporal bounds are provided manually as input for training action recognition algorithms. However, their role in activity recognition in the domain of egocentric photostreams has so far been neglected. In this paper, we provide insights into how automatically computed boundaries can impact activity recognition results in the emerging domain of egocentric photostreams. Furthermore, we collected a new annotated dataset acquired by 15 people with a wearable photo-camera and used it to show the generalization capabilities of several deep learning based architectures to unseen users.
1. Introduction
Wearable cameras offer a hands-free way to capture the world from a first-person perspective, hence providing rich contextual information about the activities being performed by the user [16]. Similarly to other wearable sensors, wearable cameras are ubiquitous and allow capturing daily activities in natural settings.

Currently, recognizing daily activities from first-person (egocentric) images and videos is a very active area of research in computer vision [15, 18, 5, 2, 3]. In this paper, we focus on streams of images captured at regular intervals by a wearable photo-camera, also called photostreams, which have received comparatively little attention in the literature. With respect to egocentric videos, photostreams usually cover the full day of a person (see Fig. 1). However, since the photo-camera typically takes a picture only every 30 seconds, temporally adjacent images present abrupt changes. Consequently, optical flow cannot be reliably estimated, and several fine-grained actions are completely missed or too sparsely sampled to be identifiable. Since motion is an important feature to disambiguate activities, recognizing them becomes particularly challenging in the photostream domain.

Figure 1: Sample images captured by a wearable photo-camera user during a day, together with their timestamp and activity label (e.g., Using cellphone 12:10, Walking inside 12:14, Eating alone 12:18, Using computer 14:49, Informal meeting 18:52, Train/metro 19:08, Eating not alone 21:29).

Recently, several papers have proposed different deep learning architectures to recognize activities from egocentric photostreams. The earliest works [5, 3] followed an image-based approach, aimed at classifying each image independently of its neighboring frames. With the goal of taking advantage of the temporal coherence of objects that characterizes photostreams [1], instead of working at the image level, Cartas et al. [2, 4] proposed to train end-to-end a Long Short-Term Memory (LSTM) recurrent neural network on top of a CNN, feeding the LSTM with a sliding window approach. This strategy allows coping with both the considerable length of photostreams and the lack of knowledge of event boundaries. This approach showed that considering overlapping segments of fixed size is effective to better capture long-term temporal dependencies in photostreams. In this paper, we argue that knowing the exact event boundaries would allow to further improve activity recognition performance, since it would make it possible to capture temporal dependencies both within an event and across events.

Figure 2: Example of events obtained by applying SR-Clustering on a visual lifelog. The color above the images indicates the event to which consecutive images belong.
Figure 3: Pipeline of our proposed approach (input image → Xception → LSTM unit → dense layer → action softmax layer).
2. Event boundaries for activity recognition
In this work, we investigate whether the use of event boundaries as additional input can improve the recognition of activities in egocentric photo-sequences. To this goal, we used the temporal segmentation method introduced in [9], which allows extracting events from long unstructured photostreams. Events obtained with this approach correspond to temporally adjacent images that share both contextual and semantic features, as shown in Fig. 2. As can be observed, these events constitute a good basis for activity recognition since, typically, when the user is engaged in an activity, contextual and semantic features show little variation.
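As a rough illustration of this intuition (and not of the actual SR-Clustering algorithm of [9]), the sketch below marks an event boundary wherever the similarity between consecutive frame embeddings drops below a threshold; the embeddings, threshold value, and function names are assumptions made only for the example.

```python
# Simplified illustration of boundary detection from frame features.
# NOT SR-Clustering: just a cosine-similarity threshold on consecutive
# per-frame embeddings (hypothetical inputs).
import numpy as np

def naive_event_boundaries(features, threshold=0.85):
    """features: (num_frames, dim) array of per-frame embeddings."""
    boundaries = []
    for i in range(1, len(features)):
        a, b = features[i - 1], features[i]
        cos = float(np.dot(a, b) /
                    (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if cos < threshold:
            boundaries.append(i)  # event change between frame i-1 and i
    return boundaries
```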
3. Experimental setup
The objective of our experiments was to determine whether the temporal coherence of segmented events from egocentric photostreams improves activity recognition at the frame level. Therefore, we trained three many-to-many LSTM models using the full-day sequences and the automatically extracted event segments, i.e. CNN+RF+LSTM, CNN+LSTM, and CNN+Bidirectional LSTM (see Fig. 3). For comparative purposes, we used the Xception network [6] as the baseline and as the base CNN of all models. Additionally, we implemented the best model presented in [4], namely the combination CNN+RF+LSTM. We measured the activity recognition performance using the classification accuracy and the associated macro metrics.
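For concreteness, the following is a minimal sketch, assuming a Keras/TensorFlow implementation, of how such a many-to-many recurrent classifier can be assembled on top of Xception features; the hidden size, the number of classes, and all identifiers are illustrative rather than the authors' actual configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 21     # assumption: set to the dataset's number of activity categories
FEATURE_DIM = 2048   # size of Xception's global-average-pooled descriptor

def build_feature_extractor():
    """Xception backbone used to embed every photostream frame."""
    return tf.keras.applications.Xception(
        include_top=False, weights="imagenet", pooling="avg")

def build_sequence_classifier(bidirectional=False):
    """Many-to-many LSTM that labels every frame of a feature sequence."""
    inputs = layers.Input(shape=(None, FEATURE_DIM))   # (time, features)
    rnn = layers.LSTM(256, return_sequences=True)      # hidden size is a guess
    if bidirectional:
        rnn = layers.Bidirectional(rnn)
    x = rnn(inputs)
    outputs = layers.TimeDistributed(
        layers.Dense(NUM_CLASSES, activation="softmax"))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The bidirectional flag switches between the CNN+LSTM and CNN+Bidirectional LSTM variants; the random-forest stage of CNN+RF+LSTM would sit between the feature extractor and the recurrent layer and is omitted here.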
Dataset. We collected over 102,227 pictures from 15 college students who were asked to wear an egocentric camera (http://getnarrative.com/) on their chest. The camera automatically captured an image approximately every 30 seconds with a 5MP resolution. The annotation process took into account the continuous context of activity sequences. In order to split the data into training and test sets, all the possible combinations of users for both sets were calculated. Only the combinations whose test set contained all the categories and 20-21% of all images were kept. A histogram of the number of photos per category and split is shown in Fig. 4.
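The split search described above can be sketched as a brute-force enumeration over user subsets; the function and variable names below are hypothetical and only illustrate the procedure.

```python
# Hypothetical sketch of the user-based train/test split search:
# enumerate test-user subsets and keep those covering every activity
# class while holding roughly 20-21% of all images.
from itertools import combinations

def find_valid_splits(images_per_user, all_classes, low=0.20, high=0.21):
    """images_per_user: dict user_id -> list of (image_path, activity_label)."""
    total = sum(len(v) for v in images_per_user.values())
    users = sorted(images_per_user)
    valid = []
    for k in range(1, len(users)):
        for test_users in combinations(users, k):
            test_imgs = [img for u in test_users for img in images_per_user[u]]
            classes = {label for _, label in test_imgs}
            ratio = len(test_imgs) / total
            if classes == set(all_classes) and low <= ratio <= high:
                valid.append(test_users)
    return valid
```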
Temporal sequences. The following temporal sequences were used in the experiments (a simplified sketch of how such sequences can be formed is given after the list):

1. Fixed size segments. The stateful sliding-window training procedure for fixed-size segments from [2] was also implemented for the LSTMs.

2. Full sequence. The whole-day photostream sequence of each user was used as a single input.

3. Event segmentation. Groups of sequential images were obtained by applying the method introduced by Dimiccoli et al. [9], which temporally segments the given photostream, as illustrated in Fig. 5.
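The sketch below, written under the assumption that each day is available as a list of per-frame feature vectors and that event boundaries are given as frame indices, illustrates how the three kinds of sequences can be formed; the window size and stride are placeholder values, not the values used in the experiments.

```python
# Illustrative helpers (not the paper's code) for the three sequence types.
def fixed_size_segments(day, window=10, stride=5):
    """Overlapping fixed-size windows, as in the sliding-window training."""
    return [day[i:i + window] for i in range(0, len(day) - window + 1, stride)]

def full_sequence(day):
    """The whole-day photostream as a single input sequence."""
    return [day]

def event_segments(day, boundaries):
    """Split the day at the automatically detected event boundaries."""
    cuts = [0] + sorted(boundaries) + [len(day)]
    return [day[a:b] for a, b in zip(cuts[:-1], cuts[1:]) if b > a]
```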
4. Experimental results
In Table 1 we present the performance of all the models using the full sequence, SR-Clustering (event segmentation), and the sliding-window training procedure (fixed-size segments) proposed in [2]. The performance was evaluated using the accuracy and the macro metrics for precision, recall, and F1-score.

The results indicate that the CNN+Bidirectional LSTM model achieves the best performance over all the models and for each temporal segmentation. On the other hand, the CNN+RF+LSTM model did not improve the performance as much as the other models and was even worse than its baseline when using the sliding-window training. This is a consequence of the overfitting of its base model (CNN+RF) on the training set, as shown by the per-category recall in Table 1. This contrasts with the results previously obtained in [4] on another dataset and is likely due to the fact that here we are using unseen users in our test set. Furthermore, the results suggest that the temporal segmentation increased the classification performance of the tested LSTM-based models.
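As a reference for how such frame-level scores can be obtained, the following is a small sketch using scikit-learn; the variable names are placeholders and this is not the authors' evaluation code.

```python
# Frame-level accuracy and macro-averaged metrics over flat label arrays.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def frame_level_metrics(y_true, y_pred):
    """y_true, y_pred: per-frame activity labels for all test sequences."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred, average="macro"),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```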
Figure 4: Dataset summary: number of photos per category (Attending a seminar, Bus, Car, Cooking, Cycling, Dishwashing, Drinking, Driving, Eating, Formal meeting, Going to a bar, Informal meeting, Reading, Relaxing, Shopping, Stair climbing, Train/Metro, Using a computer, Using mobile device, Walking inside, Walking outside, Writing, and Personal hygiene) for the training set, the test set, and all images. Notice that the distributions are normalized and the vertical axis has a logarithmic scale.

Table 1: Activity classification performance. The upper part shows the recall for each category and the lower part shows the performance metrics for all models. The best result per measure is shown in bold, but does not take into account the temporal models trained using the ground-truth segmentation, which we consider an upper bound.
(Table 1 columns: Xception; Xception+RF; and Xception+RF+LSTM, Xception+LSTM, and Xception+Bidirectional LSTM, each trained with fixed-size segments, the full sequence, and the event segmentation. The summary rows report the accuracy and the macro F1-score.)
Table 2: Comparison with different egocentric datasets. Information based on [7].
(Table 2 columns: Dataset, Photo-streams, Non-scripted?, Native env?, Frames, Sequences, Action segments, Action/Activity classes, Participants.)
For instance, Fig. 6 shows some qualitative results. In particular, the automatic event segmentation (SR-Clustering) was better than the full-day segmentation, as it improved the accuracy, macro precision, and macro F1-score for two of the three LSTM-based models. Since most of the test users had short day sequences, the full-day temporal segmentation was the best for the CNN+LSTM model. Finally, the best macro recall was obtained using the sliding-window training [4] for the CNN+Bidirectional LSTM model. This can be understood as a smoothing effect over the test sequences.

Figure 5: Example of automatically extracted events used in the experiments.

Figure 6: Examples of qualitative results obtained from three of the evaluated methods (Xception, Xception+RF+LSTM, and Xception+Bidirectional LSTM) for different activity classes. False and true activity labels for a given image are marked in red and green, respectively.
5. Conclusions
This paper has shed light on two poorly investigated issues in the context of activity recognition from egocentric photostreams. The first issue is the role of event boundaries as input for activity recognition in photostreams. By relying on manually annotated and automatically extracted event boundaries, in addition to overlapping batches of images of fixed size, this paper pointed out that activity recognition performance benefits from the knowledge of event boundaries. The second issue is the generalization capability of existing methods for activity recognition. By using a large egocentric dataset acquired from 15 users, this paper elucidated for the first time how activity recognition performance generalizes at test time to unseen users. The best results were achieved by using a CNN+Bidirectional LSTM architecture on a temporal event segmentation.
Acknowledgments
A.C. was supported by a doctoral fellowship from the Mexican Council of Science and Technology (CONACYT) (grant no. 366596). This work was partially funded by TIN2015-66951-C2, SGR 1219, CERCA, ICREA Academia 2014, and 20141510 (Marató TV3). The funders had no role in the study design, data collection, analysis, and preparation of the manuscript. M.D. is grateful to the NVIDIA donation program for its support with a GPU card.
References

[1] D. Byrne, A. R. Doherty, C. G. Snoek, G. J. Jones, and A. F. Smeaton. Everyday concept detection in visual lifelogs: validation, relationships and trends. Multimedia Tools and Applications, 49(1):119–144, 2010.

[2] A. Cartas, M. Dimiccoli, and P. Radeva. Batch-based activity recognition from egocentric photo-streams. In Proceedings of the International Conference on Computer Vision (ICCV), 2nd International Workshop on Egocentric Perception, Interaction and Computing, Venice, Italy, 2017.

[3] A. Cartas, J. Marín, P. Radeva, and M. Dimiccoli. Recognizing activities of daily living from egocentric images. In L. A. Alexandre, J. Salvador Sánchez, and J. M. F. Rodrigues, editors, Pattern Recognition and Image Analysis, pages 87–95, Cham, 2017. Springer International Publishing.

[4] A. Cartas, J. Marín, P. Radeva, and M. Dimiccoli. Batch-based activity recognition from egocentric photo-streams revisited. Pattern Analysis and Applications, May 2018.

[5] D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, and I. Essa. Predicting daily activities from egocentric images using deep learning. In Proceedings of the 2015 ACM International Symposium on Wearable Computers, pages 75–82. ACM, 2015.

[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.

[7] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-Kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.

[8] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas. You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

[9] M. Dimiccoli, M. Bolaños, E. Talavera, M. Aghaei, S. G. Nikolov, and P. Radeva. SR-Clustering: Semantic regularized clustering for egocentric photo streams segmentation. Computer Vision and Image Understanding, 2017.

[10] K. Ehsani, H. Bagherinezhad, J. Redmon, R. Mottaghi, and A. Farhadi. Who let the dogs out? Modeling dog behavior from visual data. In CVPR, 2018.

[11] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision – ECCV 2012, pages 314–327, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[12] F. de la Torre, J. Hodgins, A. Bargteil, X. Martin, J. Macey, A. Collado, and P. Beltran. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. Tech. report CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, April 2008.

[13] K. K. Singh, K. Fatahalian, and A. A. Efros. KrishnaCam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.

[14] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, pages 1346–1353, June 2012.

[15] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1903, 2016.

[16] T.-H.-C. Nguyen, J.-C. Nebel, and F. Florez-Revuelta. Recognition of activities of daily living with egocentric vision: A review. Sensors (Basel), 16(1):72, Jan 2016.

[17] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, pages 2847–2854, June 2012.

[18] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora. Compact CNN for indexing egocentric videos. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.