Detecting Engagement in Egocentric Video
Yu-Chuan Su and Kristen Grauman
Department of Computer Science, The University of Texas at Austin
In a wearable camera video, we see what the camera wearer sees. While this makes it easy to know roughly what he chose to look at, it does not immediately reveal when he was engaged with the environment. Specifically, at what moments did his focus linger, as he paused to gather more information about something he saw? Knowing this answer would benefit various applications in video summarization and augmented reality, yet prior work focuses solely on the "what" question (estimating saliency, gaze) without considering the "when" (engagement). We propose a learning-based approach that uses long-term egomotion cues to detect engagement, specifically in browsing scenarios where one frequently takes in new visual information (e.g., shopping, touring). We introduce a large, richly annotated dataset for ego-engagement that is the first of its kind. Our approach outperforms a wide array of existing methods. We show engagement can be detected well independent of both scene appearance and the camera wearer's identity.
I. INTRODUCTION
Imagine you are walking through a grocery store. You may be mindlessly plowing through the aisles grabbing your usual food staples, when a new product display—or an interesting fellow shopper—captures your interest for a few moments. Similarly, in the museum, as you wander the exhibits, occasionally your attention is heightened and you draw near to examine something more closely.

These examples illustrate the notion of engagement in egocentric activity, where one pauses to inspect something more closely. While engagement happens throughout daily life activity, it occurs frequently and markedly during "browsing" scenarios in which one traverses an area with the intent of taking in new information and/or locating certain objects—for example, in a shop, museum, library, city sightseeing, or touring a campus or historic site.
a) Problem definition: We explore engagement from the first-person vision perspective. In particular, we ask: Given a video stream captured from a head-mounted camera during a browsing scenario, can we automatically detect those time intervals where the recorder experienced a heightened level of engagement? What cues are indicative of first-person engagement, and how do they differ from traditional saliency metrics? To what extent are engagement cues independent of the particular person wearing the camera (the "recorder"), or the particular environment they are navigating? See Fig. 1.

While engagement is interesting in a variety of daily life settings, for now we restrict our focus to browsing scenarios. This allows us to concentrate on cases where 1) engagement naturally ebbs and flows repeatedly, 2) the environment offers discrete entities (products in the shop, museum paintings, etc.) that may be attention-provoking, which aids objectivity in evaluation, and 3) there is high potential impact for emerging applications.
Fig. 1: The goal is to identify intervals where the camera wearer's engagement is heightened, meaning he interrupts his ongoing activity to gather more information about some object in the environment. Note that this is different than detecting what the camera wearer sees or gazes upon, which comes for "free" with a head-mounted camera and/or eye tracking devices.
b) Applications: A system that can successfully address the above questions would open up several applications. For example, it could facilitate camera control, allowing the user's attention to trigger automatic recording/zooming. Similarly, it would help construct video summaries. Knowing when a user's engagement is waning would let a system display info on a heads-up display when it is least intrusive. Beyond such "user-centric" applications, third parties would relish the chance to gather data about user attention at scale—for instance, a vendor would like to know when shoppers linger by its new display. Such applications are gaining urgency as wearable cameras become increasingly attractive tools in the law enforcement, healthcare, education, and consumer domains.
c) Novelty of the problem: The rich literature on visual saliency—including video saliency [1]–[8]—does not address this problem. First and foremost, as discussed above, detecting moments of engagement is different than estimating saliency. Nearly all prior work studies visual saliency from the third-person perspective and equates saliency with gaze: salient points are those upon which a viewer would fixate his gaze, when observing a previously recorded image/video on a static screen. In contrast, our problem entails detecting temporal intervals of engagement as perceived by the person capturing the video as he moves about his environment. Thus, recorder engagement is distinct from viewer attention. To predict it from video requires identifying time intervals of engagement as opposed to spatial regions that are salient (gaze worthy) per frame. As such, estimating egocentric gaze [9]–[11] is also insufficient to predict first-person engagement.

d) Challenges: Predicting first-person engagement presents a number of challenges. First of all, the motion cues that are significant in third-person video taken with an actively controlled camera (e.g., zoom [4], [12]–[14]) are absent in passive wearable camera data. Instead, first-person data contains both scene motion and unstable body motions, which are difficult to stabilize with traditional methods [15]. Secondly, whereas third-person data is inherently already focused on moments of interest that led the recorder to turn the camera on, a first-person camera is "always on". Thirdly, whereas traditional visual attention metrics operate with instantaneous motion cues [1], [2], [16], [17] and fixed sliding temporal window search strategies, detecting engagement intervals requires long-term descriptors and handling intervals of variable length. Finally, it is unclear whether there are sufficient visual cues that transcend user- or scene-specific properties, or if engagement is strongly linked to the specific content a user observes (in which case, an exorbitant amount of data might be necessary to learn a general-purpose detector).
e) Our approach: We propose a learning approach to detect time intervals where first-person engagement occurs. In an effort to maintain independence of the camera wearer as well as the details of his environment, we employ motion-based features that span long temporal neighborhoods and integrate out local head motion effects. We develop a search strategy that integrates instantaneous frame-level estimates with temporal interval hypotheses to detect intervals of varying lengths, thereby avoiding a naive sliding window search. To train and evaluate our model, we undertake a large-scale data collection effort.
f) Contributions: Our main contributions are as follows. First, we precisely define egocentric engagement and systematically evaluate under that definition. Second, we collect a large annotated dataset spanning 14 hours of activity explicitly designed for ego-engagement in browsing situations. Third, we propose a learned motion-based model for detecting first-person engagement. Our model shows better accuracy than an array of existing methods. It also generalizes to unseen browsing scenarios, suggesting that some properties of ego-engagement are independent of appearance content.

II. RELATED WORK
A. Third-person image and video saliency
Researchers often equate human gaze fixations as the gold standard with which a saliency metric ought to correlate [18], [19]. There is increasing interest in estimating saliency from video. Initial efforts examine simple motion cues, such as frame-based motion and flicker [8], [18], [20]. One common approach to extend spatial (image) saliency to the video domain is to sum image saliency scores within a temporal segment, e.g., [21]. Most methods are unsupervised and entail no learning [4]–[8], [18], [20]. However, some recent work develops learned measures, using ground truth gaze data as the target output [1]–[3], [16], [22].

Our problem setting is quite different than saliency. Saliency aims to predict viewer attention in terms of where in the frame a third party is likely to fixate his gaze; it is an image property analyzed independent of the behavior of the person recording the image. In contrast, we aim to detect recorder engagement in terms of when (which time intervals) the recorder has paused to examine something in his environment. Accounting for this distinction is crucial, as we will see in results. Furthermore, prior work in video saliency is evaluated on short video clips (e.g., on the order of 10 seconds [23]), which is sufficient to study gaze movements. In contrast, we evaluate on long sequences—30 minutes on average per clip, and a total of 14 hours—in order to capture the broad context of ego-behavior that affects engagement in browsing scenarios.
B. Third-person video summarization
In video summarization, the goal is to form a concise representation for a long input video. Motion cues can help detect "important" moments in third-person video [12]–[14], [17], [21], including temporal differences [17] and cues from active camera control [12]–[14]. Whereas prior methods try to extract what will be interesting to a third-party viewer, we aim to capture recorder engagement.
C. First-person video saliency and gaze
Researchers have long expected that ego-attention detection requires methods distinct from bottom-up saliency [24]. In fact, traditional motion saliency can actually degrade gaze prediction for first-person video [11]. Instead, it is valuable to separate out camera motion [10] or use head motion and hand locations to predict gaze [9]. Whereas these methods aim to predict spatial coordinates of a recorder's gaze at every frame, we aim to predict time intervals where his engagement is heightened. Furthermore, whereas they study short sequences in a lab [10] or kitchen [9], we analyze long data in natural environments with substantial scene changes per sequence. We agree that first-person attention, construed in the most general sense, will inevitably require first-person "user-in-the-loop" feedback to detect [24]; accordingly, our work does not aim to detect arbitrary subjective attention events, but instead to detect moments of engagement to examine an object more closely.

Outside of gaze, there is limited work on attention in terms of head fixation detection [15] and "physical analytics" [25]. In [15], a novel "cumulative displacement curve" motion cue is used to categorize the recorder's activity (walking, sitting, on bus, etc.) and is also shown to reveal periods with fixed head position. They use a limited definition of attention: a period of more than 5 seconds where the head is still but the recorder is walking. In [25], inertial sensors are used in concert with optical flow magnitude to decide when the recorder is examining a product in a store. Compared to both [15], [25], engagement has a broader definition, and we discover its scope from data from the crowd (vs. hand-crafting a definition on visual features). Crucially, the true positives reflect that a person can have heightened engagement yet still be in motion. (Throughout, we use the term "recorder" to refer to the photographer or the first-person camera wearer, and the term "viewer" to refer to a third party who is observing the data captured by some other recorder.)
D. First-person activity and summarization
Early methods for egocentric video summarization extract the camera motion and define rules for important moments (e.g., intervals when camera rotation is below a threshold) [26], [27], and test qualitatively on short videos. Rather than inject hand-crafted rules, we propose to learn what constitutes an engagement interval. Recent methods explore ways to predict the "importance" of spatial regions (objects, people) using cues like hand detection and frame centrality [28], [29], detect novelty [30], and infer "social saliency" when multiple cameras capture the same event [31]–[33]. We tackle engagement, not summarization, though likely our predictions could be another useful input to a summarization system.

In a sense, detecting engagement could be seen as detecting a particular ego-activity. An array of methods for classifying activity in egocentric video exist, e.g., [34]–[41]. However, they do not address our scenario: 1) they learn models specific to the objects [34], [36]–[38], [40], [41] or scenes [39] with which the activity takes place (e.g., making tea, snowboarding), whereas engagement is by definition object- and scene-independent, since arbitrary things may capture one's interest; and 2) they typically focus on recognition of trimmed video clips, versus temporal detection in ongoing video.

III. FIRST-PERSON ENGAGEMENT: DEFINITION AND DATA
Next we define first-person engagement. Then we describe our data collection procedure, and quantitatively analyze the consistency of the resulting annotations. We introduce our approach for predicting engagement intervals in Sec. IV.
A. Definition of first-person engagement
This research direction depends crucially on having (1) a precise definition of engagement, (2) realistic video data captured in natural environments, and (3) a systematic way to annotate the data for both learning and evaluation.

Accordingly, we first formalize our meaning of first-person engagement. There are two major requirements. First, the engagement must be related to external factors, either induced by or causing the change in visual signals the recorder perceives. This ensures predictability from video, excluding high-attention events that are imperceptible (by humans) from visual cues. Second, an engagement interval must reflect the recorder's intention, as opposed to the reaction of a third-person viewer of the same video.

Based on these requirements, we define heightened ego-engagement in a browsing scenario as follows. A time interval is considered to have a high engagement level if the recorder is attracted by some object(s), and he interrupts his ongoing flow of activity to purposefully gather more information about the object(s).
We stress that this definition is scoped specifically for browsing scenarios; while the particular objects attracting the recorder will vary widely, we assume the person is traversing some area with the intent of taking in new information and/or locating certain objects. The definition captures situations where the recorder reaches out to touch or grasp an object of interest (e.g., when closely inspecting a product at the store), as well as scenarios where he examines something from afar (e.g., when he reads a sign beside a painting at the museum). Having an explicit definition allows annotators to consistently identify video clips with high engagement, and it lets us directly evaluate the prediction result of different models.

We stress that ego-engagement differs from gaze and traditional saliency. While a recorder always has a gaze point per frame (and it is correlated with the frame center), periods of engagement are more sparsely distributed across time, occupy variable-length intervals, and are a function of his activity and changing environment. Furthermore, as we will see below, moments where a person is approximately still are not equivalent to moments of engagement, making observer motion magnitude [25] an inadequate signal.
B. Data collection
To collect a dataset, we ask multiple recorders to take videos during "browsing" behavior under a set of scenarios, or scene and event types. We aim to gather scenarios with clear distinctions between high and low engagement intervals that will be apparent to a third-party annotator. Based on that criterion, we collect videos under three scenarios: (1) shopping in a market, (2) window shopping in a shopping mall, and (3) touring in a museum. All three entail spontaneous stimuli, which ensures that variable levels of engagement will naturally occur.

The videos are recorded using a Looxcie LX2 at a 15 fps frame rate, which we chose for its long battery life and low profile. We recruited 9 recorders—5 females and 4 males—all students between 20-30 years old. Other than asking them to capture instances of the scenarios above, we did not otherwise instruct the recorders to behave in any way. Among the 9 recorders, 5 of them record videos in all 3 scenarios. The other 4 record videos in 2 scenarios. Altogether, we obtained 27 videos, each averaging 31 minutes, for a total dataset of 14 hours. To keep the recorder behavior as natural as possible, we asked the recorders to capture the video when they planned to go to such scenarios anyway; as such, it took about 1.5 months to collect the video.

After collecting the videos, we crowdsource the ground truth annotations on Amazon Mechanical Turk. (For a portion of the video, we also ask the original recorders to label all frames for their own video; this requires substantial tedious effort, hence to get the full labeled set in a scalable manner we apply crowdsourcing.) Importantly, we ask annotators to put themselves in the camera wearer's shoes. They must precisely mark the start and end points of each engagement interval from the recorder's perspective, and record their confidence. We break the source videos into 3-minute overlapping chunks to make each annotation task manageable yet still reveal temporal context for the clip. We estimate the annotations took about 450 worker-hours and cost $3,000. Our collection strategy is congruous with the goals stated above in Sec. III-A, in that annotators are shown only the visual signal (without audio) and are asked to consider engagement from the point of view of the recorder. See the appendix for details of the annotation process.
                    Mall     Market   Museum   All
Attention Ratio     0.305    0.451    0.580    0.438

TABLE I: Basic statistics for ground truth intervals.
Despite our care in the instructions, there remains room for annotator subjectivity, and the exact interval boundaries can be ambiguous. Thus, we ask 10 Turkers to annotate each video. Positive intervals are those where a majority agree engagement is heightened. To avoid over-segmentation, we ignore intervals shorter than 1 second. For each positive interval, we select the tightest annotation that covers more than half of the interval as the final ground truth (sketched below).

The resulting dataset contains examples that are diverse in content and duration. The recorders are attracted by a variety of objects: groceries, household items, clothes, paintings, sculptures, other people. In some cases, the attended object is out of the field of view, e.g., a recorder grabs an item without directly looking at it, in which case Turkers infer the engagement from context.

Table I summarizes some statistics of the labeled data. On average, the recorder is engaged about 44% of the time (see "Attention Ratio"), and engagement is heightened one to two times per minute. This density reflects the "browsing" scenarios on which we focus the data. The length of a positive interval varies substantially: the interquartile range (IQR) is 17.6 seconds, about 50% longer than the median. Some intervals last as long as 5 minutes. Also, different scenarios have different statistics, e.g., the Museum scenario prompts more frequent engagement. All this variability indicates the difficulty of the task.

The new dataset is the first of its kind to explicitly define and thoroughly annotate ego-engagement. It is also substantially larger than datasets used in related areas—nearly 14 hours of video, with test videos over 30 minutes each. By contrast, clips in popular third-person saliency datasets are typically 20 seconds [23] to 2 minutes [42], since the interest is in gauging instantaneous gaze reactions.
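To illustrate the consolidation procedure above, here is a minimal sketch that majority-votes the Turker annotations per frame, drops positive runs shorter than one second, and keeps the tightest single annotation covering more than half of each run. The frame rate, the strict-majority threshold, and the (start, end) interval representation are our assumptions for the sake of the example.

```python
import numpy as np

def consolidate(annotations, n_frames, fps=15, min_len_sec=1.0):
    """annotations: list over Turkers, each a list of (start, end) frame intervals.
    Returns the consolidated ground truth intervals."""
    votes = np.zeros(n_frames, dtype=int)
    for turker in annotations:
        for s, e in turker:
            votes[s:e] += 1
    positive = votes > len(annotations) / 2          # frame-wise majority vote
    padded = np.concatenate(([False], positive, [False]))
    edges = np.flatnonzero(np.diff(padded.astype(np.int8)))
    runs = list(zip(edges[::2], edges[1::2]))        # contiguous positive runs
    ground_truth = []
    for s, e in runs:
        if e - s < min_len_sec * fps:                # ignore intervals < 1 second
            continue
        # single annotations covering more than half of the run, tightest first
        cands = [(a_e - a_s, (a_s, a_e)) for turker in annotations
                 for a_s, a_e in turker
                 if min(e, a_e) - max(s, a_s) > 0.5 * (e - s)]
        if cands:
            ground_truth.append(min(cands)[1])
    return ground_truth
```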
C. Evaluating data consistency
How consistently do third-party annotators label engagement intervals? We analyze their consistency to verify the predictability and soundness of our definition.

Table II shows the analysis. We quantify label agreement in terms of the average F score, whether at the frame or interval level (see the evaluation metric in Sec. V-A). We consider two aspects of agreement: boundary (how well do annotators agree on the start and end points of a positive interval?) and presence (how well do they agree on the existence of a positive interval?).

                         Frame F   Interval F (Boundary)   Interval F (Presence)
Turker vs. Consensus      0.818           0.837                   0.914
Turker vs. Recorder       0.589           0.626                   0.813
Random vs. Consensus      0.426           0.339                   0.481
Random vs. Recorder       0.399           0.344                   0.478

TABLE II: Analysis of inter-annotator consistency.

First we compare how consistent each of the 10 annotators' labels is with the consensus ground truth (see "Turker vs. Consensus"). They have reasonable agreement on the rough interval locations, which verifies the soundness of our definition.
Still, the F score is not perfect, which indicates that the task is non-trivial even for humans. Some discrepancies are due to the fact that even when two annotators agree on the presence of an interval, their annotations will not match exactly in terms of the start and end frame. For example, one annotator might mark the start when the recorder searches for items on the shelf, while another might consider it to be when the recorder grabs the item. Indeed, agreement on the presence criterion (right column) is even higher, 0.914. The "Random vs. Consensus" entry compares a prior-informed random guess to the ground truth. (We randomly generate interval predictions 10 times based on the prior of interval length and temporal distribution and report the average.) These two extremes give useful bounds of what we can expect from our computational model: a predictor should perform better than random, but will not exceed the inter-human agreement.

Next, we check how well the third-party labels match the experience of the first-person recorder (see "Turker vs. Recorder"). We collect 3 hours of self-annotation from 4 of the recorders, and compare them to the Turker annotations. Similar to above, we see the Turkers are considerably more consistent with the recorder labels compared to the prior-informed random guess, though not perfect. As one might expect, Turker annotations have higher recall, but lower (yet reasonable) precision against the first-person labels. Overall, the 0.813 F score for Turker-Recorder presence agreement indicates our labels are fairly faithful to individuals' subjective interpretation.

IV. APPROACH
We propose to learn the motion patterns in first-person video that indicate engagement. Two key factors motivate our decision to focus on motion. First, camera motion often contains useful information about the recorder's intention [10], [12], [13]. This is especially true in egocentric video, where the recorder's head and body motion heavily influence the observed motion. Second, motion patterns stand to generalize better across different scenarios, as they are mostly independent of the appearance of the surrounding objects and scene.

Our approach has three main stages. First we compute frame-wise predictions (Sec. IV-A). Then we leverage those frame predictions to generate interval hypotheses (Sec. IV-B). Finally, we describe each interval as a whole and classify it with an interval-trained model (Sec. IV-C). By departing from traditional frame-based decisions [17], [26], [27], we capture long-term temporal dependencies. As we will see below, doing so is beneficial for detecting subtle periods of engagement and accounting for their variable length. Fig. 2 shows the workflow.
Fig. 2: Workflow for our approach: (1) compute frame-wise motion descriptors (Sec. IV-A); (2) estimate frame-wise engagement levels (Sec. IV-A); (3) generate interval hypotheses (Sec. IV-B); (4) compute interval motion with a temporal pyramid (Sec. IV-C); (5) estimate interval engagement and select candidates (Sec. IV-C).
A. Initial frame-wise estimates
To first compute frame-wise predictions, we construct one motion descriptor per frame. We divide the frame into a grid of uniform cells and compute the optical flow vector in each cell. Then we temporally smooth the grid motion with a Gaussian kernel. Since at this stage we want to capture attention within a granularity of a second, we set the width of the kernel to two seconds. As shown in [15], smoothing the flow is valuable to integrate out the regular unstable head bobbles by the recorder; it helps the descriptor focus on prominent scene and camera motion. The frame descriptor consists of the smoothed flow vectors concatenated across cells, together with the mean and standard deviation of all cells in the frame. It captures dominant egomotion and dynamic scene motion—both of which are relevant to first-person engagement.

We use these descriptors, together with the frame-level ground truth (cf. Sec. III-B), to train an i.i.d. classifier. We use random forest classifiers due to their test-time efficiency and relative insensitivity to hyper-parameters, though of course other classifiers are possible. Given a test video, the confidence (posterior) output by the random forest is used as the initial frame-wise engagement estimate.
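To make the frame-wise descriptor concrete, here is a minimal sketch, assuming dense optical flow is already available as one (H, W, 2) array per frame. The grid size, the smoothing parameterization, and the function name are illustrative placeholders; the exact values are not recoverable from this copy of the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def frame_descriptors(flows, grid=8, fps=15, smooth_sec=2.0):
    """Per-frame motion descriptors: smoothed grid flow + frame mean/std.

    flows: sequence of per-frame optical flow fields, each an (H, W, 2) array.
    Returns an array of shape (T, grid*grid*2 + 4).
    """
    T = len(flows)
    H, W, _ = flows[0].shape
    cells = np.zeros((T, grid, grid, 2))
    for t, f in enumerate(flows):
        for i in range(grid):
            for j in range(grid):
                patch = f[i * H // grid:(i + 1) * H // grid,
                          j * W // grid:(j + 1) * W // grid]
                cells[t, i, j] = patch.reshape(-1, 2).mean(axis=0)
    # Temporal Gaussian smoothing (~2 s support) integrates out head bobble.
    cells = gaussian_filter1d(cells, sigma=smooth_sec * fps / 2.0, axis=0)
    grid_feat = cells.reshape(T, -1)
    cell_flows = cells.reshape(T, -1, 2)
    mean, std = cell_flows.mean(axis=1), cell_flows.std(axis=1)   # (T, 2) each
    return np.hstack([grid_feat, mean, std])
```

A random forest trained on these descriptors (see the configuration sketch in Sec. V-A) then supplies the posterior that serves as the initial frame-wise engagement estimate.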
B. Generating interval proposals

After obtaining the preliminary estimate for each frame, we generate multiple hypotheses for engagement intervals using a level set method as follows. For a given threshold on the frame-based confidence, we obtain a set of positive intervals, where each positive interval consists of contiguous frames whose confidence exceeds the threshold. By sweeping through all possible thresholds (we use the deciles), we generate multiple such sets of candidates. Candidates from all thresholds are pooled together to form a final set of interval proposals.

We apply this candidate generation process on both training data and test data. During training, it yields both positive and negative example intervals that we use to train an interval-level classifier (described next). During testing, it yields the hypotheses to which the classifier should be applied. This detection paradigm not only lets us avoid sliding temporal window search, but it also allows us to detect engagement intervals of variable length.
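As an illustration of the level-set proposal generation, the sketch below thresholds the frame-wise confidences at each decile and collects every contiguous above-threshold run as an interval hypothesis. Function and variable names are our own.

```python
import numpy as np

def interval_proposals(frame_conf):
    """Level-set interval hypotheses from frame-wise confidences.

    frame_conf: 1-D array of per-frame engagement confidences.
    Returns a set of (start, end) frame index pairs, end exclusive.
    """
    proposals = set()
    thresholds = np.percentile(frame_conf, np.arange(10, 100, 10))  # deciles
    for thr in thresholds:
        above = np.concatenate(([False], frame_conf > thr, [False]))
        edges = np.flatnonzero(np.diff(above.astype(np.int8)))
        starts, ends = edges[::2], edges[1::2]       # rising / falling edges
        proposals.update(zip(starts.tolist(), ends.tolist()))
    return proposals
```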
C. Describing and classifying intervals
For each interval proposal, we generate a motion descriptor that captures both the motion distribution and evolution over time. Motion evolution is important because a recorder usually performs multiple actions within an interval of engagement. For example, the recorder may stop, turn his head to stare at an object, reach out to touch it, then turn back to resume walking. Each action leads to a different motion pattern. Thus, unlike the temporally local frame-based descriptor above, here we aim to capture the statistics of the entire interval. We'd also like the representation to be robust to time-scale variations (i.e., yielding similar descriptors for long and short instances of the same activity).

To this end, we use a temporal pyramid representation. For each level of the pyramid, we divide the interval from the previous level into two equal-length sub-intervals. For each sub-interval, we aggregate the frame motion computed in Sec. IV-A by taking the dimension-wise mean and variance. So, the top level aggregates the motion of the entire interval, and its descendants aggregate increasingly finer time-scale intervals. The aggregated motion descriptors from all sub-intervals are concatenated to form a temporal pyramid descriptor. We use 3-level pyramids. To provide further context, we augment this descriptor with those of its temporal neighbor intervals (i.e., before and after). This captures the motion change from low engagement to high engagement and back.

We train a random forest classifier using this descriptor and the interval proposals from the training data, this time referring to the interval-level ground truth from Sec. III-B. At test time, we apply this classifier to a test video's interval proposals to score each one. If a frame is covered by multiple interval proposals, we take the highest confidence score as the final prediction per frame.
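A minimal sketch of the temporal pyramid aggregation follows. It assumes the per-frame descriptors of Sec. IV-A are available for the frames of one proposal; how the temporal neighbor intervals are delimited is not specified in this excerpt, so that step is only indicated in a comment.

```python
import numpy as np

def temporal_pyramid(frame_feats, levels=3):
    """Temporal pyramid over an interval: each level splits the interval into
    2**level equal sub-intervals and takes the dimension-wise mean and variance
    of the frame descriptors inside.

    frame_feats: (T, D) array of frame descriptors for one interval proposal.
    """
    T = len(frame_feats)
    parts = []
    for level in range(levels):
        n = 2 ** level
        for k in range(n):
            seg = frame_feats[k * T // n:(k + 1) * T // n]
            if len(seg) == 0:                 # guard against very short intervals
                seg = frame_feats
            parts.append(seg.mean(axis=0))
            parts.append(seg.var(axis=0))
    return np.concatenate(parts)

# Interval descriptor = pyramid of the proposal plus its temporal neighbors
# (before and after), capturing the transition into and out of engagement:
# desc = np.concatenate([temporal_pyramid(before), temporal_pyramid(inside),
#                        temporal_pyramid(after)])
```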
D. Discussion
Our method design is distinct from previous work in video attention, which typically operates per frame and uses temporally local measurements of motion [1], [2], [16], [17], [26], [27]. In contrast, we estimate engagement from interval hypotheses bootstrapped from initial frame estimates, and our representation captures motion changes over time at multiple scales. People often perform multiple actions during an engagement interval, which is well captured by considering an interval together. For example, it is hard to tell whether the recorder is attracted by an object when we only know he glances at it, but it becomes clear if we know his following action is to turn to the object or to turn away quickly.

Simply flagging periods of low motion [15], [25], [27] is insufficient to detect all cases of heightened attention, since behaviors during the interval of engagement are often non-static and also exhibit learnable patterns. For example, shoppers move and handle objects they might buy; people sway while inspecting a painting; they look up and sweep their gaze downward when inspecting a skyscraper.

External sensors beyond the video stream could potentially provide cues useful to our task, such as inertial sensors to detect recorder motion and head orientation. However, such sensors are not always available, and they are quite noisy in practice. In fact, recent attempts to detect gazing behavior with inertial sensors alone yield false positive rates of 33% [25]. This argues for the need for visual features for the challenging engagement detection task.

V. EXPERIMENTS
A. Experiment Setting
We validate on two datasets and compare to many existing methods.
a) Baselines: We compare with 9 existing methods, organized into four types:
• Saliency Map: Following [17], [21], we compute the saliency map for each frame and take the average saliency value. We apply the state-of-the-art learned video saliency model [1] and five others that were previously used for video summarization: [6], [7], [17], [19], [20]. We use the original authors' code for [1], [6], [7], [19], [20] and implement [17]. Except [6], all these models use motion.
• Motion Magnitude: Following [25], [27], this baseline uses the inverse motion magnitude. Intuitively, the recorder becomes more still during his moments of high engagement as he inspects the object(s). We apply the same flow smoothing as in Sec. IV-A and take the average.
• Learned Appearance (CNN): This baseline predicts engagement based on the video content. We use state-of-the-art convolutional neural net (CNN) image descriptors, and train a random forest with the same frame-based ground truth our method uses. We use Caffe [43] and the provided pre-trained model (BVLC Reference CaffeNet).
• Egocentric Important Region: This is the method of [28]. It is a learned metric designed for egocentric video that exploits hand detection, centrality in frame, etc. to predict the importance of regions for summarization. While the objective of "importance" is different than "engagement", it is related and valuable as a comparison, particularly since it also targets egocentric data. We take the max importance per frame using the predictions shared by the authors.

Some of the baselines do not target our task specifically, a likely disadvantage. Nonetheless, their inclusion is useful to see if ego-engagement requires methods beyond existing saliency metrics. Besides, our baselines also include methods specialized for egocentric video [25], [28], and one that targets exactly our task [25].

For the learned methods (ours, CNN, and Important Regions), we use the classifier confidences to rate frames by their engagement level. Note that the CNN method has the benefit of training on the exact same data as our method. For the non-learned methods (saliency, motion), we use their magnitude. We evaluate two versions of our method: one with the interval proposals (Ours-interval), and one without (Ours-frame). The boundary agreement is used for interval prediction evaluation to favor methods with better localization of engagement.
                                        Frame F   Interval F
GBVS (Harel 2006 [19])                   0.462      0.286
Self Resemblance (Seo 2009 [20])         0.471      0.398
Bayesian Surprise (Itti 2009 [7])        0.420      0.373
Salient Object (Rahtu 2010 [6])          0.504      0.389
Video Attention (Ejaz 2013 [17])         0.413      0.298
Video Saliency (Rudoy 2013 [1])          0.435      0.396
Motion Mag. (Rallapalli 2014 [25])       0.553      0.403
Cross Recorder
    CNN Appearance                       0.685      0.486
    Ours – frame
    Ours – GT interval                   0.822      0.868
Cross Scenario
    CNN Appearance                       0.656      0.463
    Ours – frame
    Ours – GT interval                   0.830      0.860
Cross Recorder AND Scenario
    CNN Appearance                       0.655      0.463
    Ours – frame
    Ours – GT interval                   0.823      0.856
TABLE III: F-score accuracy of all methods on UT EE. (The cross-recorder/scenario distinctions are not relevant to the top block of methods, all of which do no learning.)

b) Datasets: We evaluate on two datasets: our new UT Egocentric Engagement (UT EE) dataset and the public UT Egocentric dataset (UT Ego). We select all clips from UT Ego that contain browsing scenarios (mall, market), yielding 3 clips with a total length of 58 minutes, and get them annotated with the same procedure as in Sec. III-B.
c) Implementation details: We use the code of [44] for optical flow computation. Flow dominates our run-time, about 1.2 s per frame on 48 cores. The default settings are used for this and all the public saliency map codes. Using the scikit-learn package [45] for random forests, we train 2,400 trees in all results and leave all other parameters at default. The sample rate of video frames is 15 fps for optical flow and 1 fps for all other computation, including evaluation.
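For reference, the stated classifier configuration maps directly onto scikit-learn. The snippet below is a sketch of that setup with illustrative variable names, not the authors' released code.

```python
from sklearn.ensemble import RandomForestClassifier

# Both the frame-level and interval-level classifiers use 2,400 trees and
# otherwise keep scikit-learn's default parameters.
def make_classifier():
    return RandomForestClassifier(n_estimators=2400)

# Descriptors and evaluation run at 1 fps (flow itself is computed at 15 fps),
# so X_train holds one row per second of video (hypothetical variable names):
# clf = make_classifier().fit(X_train, y_train)
# engagement_conf = clf.predict_proba(X_test)[:, 1]
```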
d) Evaluation metric: We evaluate the performance of different methods using the metric defined below. Let G denote a set of ground truth intervals for engagement. The set of intervals is consistent if none of the intervals within the set overlap with the others, i.e., |g_i ∩ g_j| = 0 for all g_i, g_j ∈ G with i ≠ j, where g_i ∩ g_j denotes the frames that are in both interval g_i and interval g_j. Also, let P denote a set of predicted intervals that is consistent. We consider a predicted interval p to be covered by a ground truth interval g if |p ∩ g| > ½|p|, denoted p ⊂ g. Given the ground truth intervals G and predictions P, we define the interval precision as

    Precision = |{p ∈ P : ∃ g ∈ G s.t. p ⊂ g}| / |P|.

Similarly, we consider a ground truth interval g to be covered by a predicted interval p if |p ∩ g| > ½|g|, and we compute the interval recall as

    Recall = |{g ∈ G : ∃ p ∈ P s.t. g ⊂ p}| / |G|.

Note the recall monotonically increases as we prolong the length of each prediction p in P. Roughly speaking, a predicted interval p is considered correct if more than half of the prediction overlaps with some ground truth interval, and a ground truth interval is considered predicted if more than half of the interval is covered by some prediction.
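The sketch below implements the interval precision and recall as defined above, with each interval represented as a (start, end) frame pair. The one-half overlap fraction is our reading of the definition, since the fraction itself is garbled in this copy of the text.

```python
def overlap(a, b):
    """Number of frames shared by two intervals, each given as (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def interval_precision_recall(pred, gt):
    """pred, gt: consistent (non-overlapping) lists of (start, end) intervals.

    A prediction p is correct if more than half of p lies inside some g;
    a ground truth g is detected if more than half of g is covered by some p.
    """
    correct = sum(any(overlap(p, g) > 0.5 * (p[1] - p[0]) for g in gt) for p in pred)
    detected = sum(any(overlap(p, g) > 0.5 * (g[1] - g[0]) for p in pred) for g in gt)
    precision = correct / len(pred) if pred else 0.0
    recall = detected / len(gt) if gt else 0.0
    return precision, recall
```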
Fig. 3: Example engagement intervals detected by our method. Note the intra-interval variation: the recorder either performs multiple actions (Market), looks at an item from multiple views (Mall), or looks at multiple items (Museum).

B. UT Egocentric Engagement (UT EE) dataset
We consider three strategies to form train-test data splits. The first is leave-one-recorder-out, denoted cross-recorder, in which we train a predictor for each recorder using exclusively video from other recorders. This setting tests the ability to generalize to new recorders (e.g., can we learn from John's video to predict engagement in Mary's video?). The second is leave-one-scenario-out, denoted cross-scenario, in which we train a predictor for each scenario using exclusively video from other scenarios. This setting examines to what extent visual cues of engagement are independent of the specific activity or location of the recorder (e.g., can we learn from a museum trip to predict engagement during a shopping trip?). The third strategy is the most stringent, disallowing any overlap in either the recorder or the scenario (cross recorder AND scenario).

Fig. 4(A)–(C) show the precision-recall curves for all methods and settings on the 14-hour UT EE dataset, and we summarize them in Table III using the F scores; here we set the confidence threshold for each video such that 43.8% of its frames are positive, which is the ratio of positives in the entire dataset. Our method significantly outperforms all the existing methods. We also see our interval proposal idea has a clear positive impact on interval detection results. However, when evaluated with the frame classification metric (first column in Table III), our interval method does not improve over our frame method. This is due to some inaccurate (too coarse) proposals, which may be helped by sampling the level sets more densely. We also show an upper bound for the accuracy with perfect interval hypotheses (see Ours-GT interval), which emphasizes the need to go beyond frame-wise predictions as we propose.

Fig. 4 and Table III show our method performs similarly in all three train-test settings, meaning it generalizes to both new recorders and new scenarios. This is an interesting finding, since it is not obvious a priori that different people exhibit similar motion behavior when they become engaged, or that those behaviors translate between scenes and activities. This is important for applications, as it would be impractical to collect data for all recorders and scenarios.
Fig. 5: Precision-recall accuracy on the UT Ego dataset.

The CNN baseline, which learns which video content corresponds to engagement, does the best of all the baselines. However, it is noticeably weaker than our motion-based approach. This result surprised us, as we did not expect the appearance of objects in the field of view during engagement intervals to be consistent enough to learn at all. However, there are some intra-scenario visual similarities in a subset of clips: four of the Museum videos are at the same museum (though the recorders focus on different parts), and five in the Mall contain long segments in clothing stores (albeit different ones). Overall we find the CNN baseline often fails to generate coherent predictions, and it predicts intervals much shorter than the ground truth. This suggests that appearance alone is a weaker signal than motion for the task.

Motion Magnitude (representative of [25], [27]) is the next best baseline. While better than the saliency metrics, its short-term motion and lack of learning lead to substantially worse results than our approach. This also reveals that people often move while they engage with objects they want to learn more about.

Finally, despite their popularity in video summarization, Saliency Map methods [1], [6], [7], [17], [19], [20] do not predict temporal ego-engagement well. In fact, they are weaker than the simpler motion magnitude baseline. This result accentuates the distinction between predicting gaze (the common saliency objective) and predicting first-person engagement. Clearly, spatial attention does not directly translate to the task. While all the Saliency Map methods (except [6]) incorporate motion cues, their reliance on temporally local motion, like flickers, makes them perform no better than the purely static image methods.

Fig. 3 shows example high engagement frames.
Fig. 4: Precision-recall accuracy on the UT EE dataset under the (A) cross-recorder, (B) cross-scenario, and (C) cross recorder AND scenario settings. Our approach detects engagement with much greater accuracy than an array of saliency and content-based methods, and our interval proposal idea improves the initial frame-wise predictions.
C. UT Egocentric dataset
Fig. 5 shows the results on the UT Egocentric dataset. The outcomes are consistent with those on UT EE above, and again our method performs the best. Whereas [28] is both trained and tested on UT Ego, our method does not do any training on the UT Ego data; rather, we use our model trained on UT EE. This ensures fairness to the baseline (and some disadvantage to our method).

Our method outperforms the Important Regions [28] method, which is specifically designed for first-person data. This result gives further evidence of our method's cross-scenario generalizability. Important Regions [28] does outperform the Saliency Map methods on the whole, indicating that high-level semantic concepts are useful for detecting engagement, more so than low-level saliency. The CNN baseline does poorly, which reflects that its content-specific nature hinders generalization to a new data domain.
D. Start point correctness
Finally, Fig. 6 evaluates start point accuracy on UT EE. This setting is of interest to applications where it is essential to know the onset of engagement, but not necessarily its temporal extent. Here we run our method in a streaming fashion by using its frame-based predictions, without the benefit of hindsight on the entire intervals. To compare the start point accuracy of different methods, we plot the F score as a function of the error tolerance window (in seconds) allowed between the predicted and the nearest ground truth start point. Our method outperforms all other methods under all error tolerances. This is evidence that our method has promise for both the online and offline setting, though we think there remains interesting future work to best account for streaming data.

The Motion Magnitude baseline is our nearest competitor for this setting. This indicates that an abrupt decline in motion is predictive for the transition between engagement and non-engagement (e.g., as a person slows to examine something). However, it remains weaker than our method, and, as we see in the other results in the main paper, it cannot predict the continuation and subsequent drop of engagement level.
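One plausible way to score start-point accuracy under a tolerance window is sketched below; the exact matching protocol used for Fig. 6 is not spelled out in this excerpt, so treat this only as an illustration with hypothetical names.

```python
import numpy as np

def start_point_f(pred_starts, gt_starts, tolerance_sec):
    """F score for engagement-onset detection: a predicted start counts as correct
    if a ground truth start lies within +/- tolerance_sec of it, and vice versa."""
    pred_starts, gt_starts = np.asarray(pred_starts), np.asarray(gt_starts)
    if len(pred_starts) == 0 or len(gt_starts) == 0:
        return 0.0
    hit_p = np.array([np.min(np.abs(gt_starts - p)) <= tolerance_sec for p in pred_starts])
    hit_g = np.array([np.min(np.abs(pred_starts - g)) <= tolerance_sec for g in gt_starts])
    precision, recall = hit_p.mean(), hit_g.mean()
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Sweep tolerance windows to trace a curve like the one reported in Fig. 6:
# curve = [start_point_f(pred_starts, gt_starts, tol) for tol in range(1, 11)]
```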
Fig. 6: Start-point accuracy on UT EE, measuring how well the onset of an engagement interval is detected in a streaming manner.

VI. CONCLUSION
We explore engagement detection in first-person video. By precisely defining the task and collecting a sizeable dataset, we offer the first systematic study of this problem. We introduced a learning-based approach that discovers the connection between first-person motion and engagement, together with an interval proposal approach to capture a recorder's long-term motion. Results on two datasets show our method consistently outperforms a wide array of existing methods for visual attention. Our work provides the foundation for a new aspect of visual attention research. In future work, we will examine the role of external sensors (e.g., audio, gaze trackers, depth) that could assist in ego-engagement detection when they are available.

ACKNOWLEDGEMENTS
We thank the friends and labmates who assisted us with data collection. This research is supported in part by ONR YIP N00014-12-1-0754 and a gift from Intel.

APPENDIX A: ANNOTATION INTERFACE
In this section, we show the interface and instructions for engagement annotation on Amazon Mechanical Turk. We include the full instructions and annotation interface details in order to help reviewers evaluate the care with which we collected the ground truth annotations.
Fig. 7: Screenshot of the annotation interface (playback control and interval annotation panels).
A. Task Description
George is wearing a camera on his head. The camera captures video constantly as George goes about his daily life. Because the camera is on his head, when George moves his head to look around, the camera moves too. Basically, it captures the world just as George sees it.

Your job is to watch a video excerpt from George's camera that lasts 1-2 minutes, and determine when something in the environment has captured George's attention. You will first watch the entire video. Then you will go back and use a slider to navigate through the video frames and mark the intervals (start and end points) where he is paying close attention to something.
Note, the video may have more than one interval where George is paying close attention to something.

a) Definition of Attention: The following instructions will describe what we mean by "capturing George's attention" in more detail: Humans' cognitive process has different levels of attention to the surrounding environment. For example, people pay very little attention to their surroundings when they are walking on a route they are familiar with, but the attention level will rise significantly if there are unusual events (such as a car accident) or something attracts their curiosity (such as a new advertisement on the wall), or if they want to inspect something more closely (such as a product on the shelf when shopping). You are asked to identify these "high attention intervals" in the video.
In particular, we ask you to identify intervals where George's attention is focused on an object or a specific location in the scene.
During these intervals, George is attracted by an object and tries to have a better view/understanding about it intentionally. In general, George may:
• Have a closer look at the object
• Inspect the object from different views
• Stare at the object

In some situations, George may even interact physically with the object capturing his attention to gather more information. For example, he may grab the object to have a closer view of it, or he may turn the object to inspect it from different views. To identify these situations, we also ask you to annotate whether George touched the object capturing his attention during the interval.

The following video shows examples of attention intervals: please refer to the video on our project webpage.

b) Important Notes:
• You should watch the entire video (3 minutes) first before doing any annotation. This will give you the context of the activity to know when George is paying close attention.
• A video may contain multiple or no intervals where George's attention is captured. You should label each one separately. The intervals are mutually exclusive and should not overlap.
• Each interval where George's attention is captured may vary in length. Some could be a couple seconds long, others could be closer to a minute long. The minimum length of each interval is 15 frames (1 second).
• You may need to scroll back and forth in the video using our slider interface to determine exactly when the attention starts and stops. Mark the interval as tightly as possible.
• After labeling where an attention interval starts and ends, you will mark whether George has physical contact (grab, touch, etc.) with the object during the interval or is just looking at it.
• You will also mark your confidence in terms of how strongly George's attention was captured in that interval (Obvious, Fairly clear, Subtle).
B. Interface Introduction
The following introduction will give you tips on how to best use the tool. Please watch the below video (and/or read the below section) for instructions: please refer to the video on our project webpage.

c) Getting Started:
• Press the Play button to play the video.
• After the video finishes, press the Rewind button and start annotation.
• Play the video, and Pause the video when you reach the frame at the beginning of a high attention interval.
• Click the Start button to mark the "Start" of the interval.
• On the right, directly below the Start button, you will find a colorful box showing the frame number corresponding to the "Start" of the interval.
• Similarly, click the End button to mark the "End" of the interval.
• After you mark the end of the interval, you will be asked whether George contacts (grabbing, touching, etc.) the object that captures his attention.
• Next, you will be asked about how obvious the attention interval is. Specify whether the interval is Obvious, Fairly clear, or Subtle.
• Finally, you will be asked to describe what attracts George's attention. Type in what attracts George's attention (object, scene, event, etc.) and Submit the interval.
• When you are ready to submit your work, rewind the video and watch it through one more time. Do the "Start" and "End" you specified cover the complete high attention interval? After you have checked your work, press the Submit HIT button. We will pay you as soon as possible.
• Do not reload or close the page before being redirected to the next HIT. This may cause submission failure.

d) How We Accept Your Work: We will hand review your work and we will only accept high quality work. Your annotations are not compared against other workers.
e) Keyboard Shortcuts: These keyboard shortcuts are available for your convenience:
• t toggles play/pause on the video
• r rewinds the video to the start
• d jumps the video forward a bit
• f jumps the video backward a bit
• v steps the video forward a tiny bit
• c steps the video backward a tiny bit

REFERENCES

[1] D. Rudoy, D. Goldman, E. Shechtman, and L. Zelnik-Manor, "Learning video saliency from human gaze using candidate selection," in CVPR, 2013.
[2] J. Han, L. Sun, X. Hu, J. Han, and L. Shao, "Spatial and temporal visual attention prediction in videos using eye movement data," Neurocomputing, vol. 145, pp. 140–153, Dec 2014.
[3] W. Lee, T. Huang, S. Yeh, and H. Chen, "Learning-based prediction of visual attention for video signals," IEEE TIP, vol. 20, no. 11, Nov 2011.
[4] G. Abdollahian, C. Taskiran, Z. Pizlo, and E. Delp, "Camera motion-based analysis of user generated video," TMM, vol. 12, no. 1, pp. 28–41, Jan 2010.
[5] V. Mahadevan and N. Vasconcelos, "Spatiotemporal saliency in dynamic scenes," TPAMI, vol. 32, no. 1, pp. 171–177, Jan 2010.
[6] E. Rahtu, J. Kannala, M. Salo, and J. Heikkila, "Segmenting salient objects from images and videos," in ECCV, 2010.
[7] L. Itti and P. Baldi, "Bayesian surprise attracts human attention," Vision Research, vol. 49, no. 10, pp. 1295–1306, 2009.
[8] H. Liu, S. Jiang, Q. Huang, and C. Xu, "A generic virtual content insertion system based on visual attention analysis," in ACM MM, 2008.
[9] Y. Li, A. Fathi, and J. M. Rehg, "Learning to predict gaze in egocentric video," in ICCV, 2013.
[10] K. Yamada, Y. Sugano, T. Okabe, Y. Sato, A. Sugimoto, and K. Hiraki, "Attention prediction in egocentric video using motion and visual saliency," in Advances in Image and Video Technology, 2012.
[11] ——, "Can saliency map models predict human egocentric visual attention?" in ACCV Workshop, 2011.
[12] J. Kender and B.-L. Yeo, "On the structure and analysis of home videos," in ACCV, 2000.
[13] K. Li, S. Oh, A. Perera, and Y. Fu, "A videography analysis framework for video retrieval and summarization," in BMVC, 2012.
[14] M. Gygli, H. Grabner, H. Riemenschneider, and L. V. Gool, "Creating summaries from user videos," in ECCV, 2014.
[15] Y. Poleg, C. Arora, and S. Peleg, "Temporal segmentation of egocentric videos," in CVPR, 2014.
[16] T. V. Nguyen, M. Xu, G. Gao, M. Kankanhalli, Q. Tian, and S. Yan, "Static saliency vs. dynamic saliency: a comparative study," in ACM MM, 2013.
[17] N. Ejaz, I. Mehmood, and S. Baik, "Efficient visual attention based framework for extracting key frames from videos," Image Communication, vol. 28, pp. 34–44, 2013.
[18] L. Itti, N. Dhavale, and F. Pighin, "Realistic avatar eye and head animation using a neurobiological model of visual attention," in Proc. SPIE 48th Annual International Symposium on Optical Science and Technology, vol. 5200, Aug 2003, pp. 64–78.
[19] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in NIPS, 2007.
[20] H. Seo and P. Milanfar, "Static and space-time visual saliency detection by self-resemblance," J. of Vision, vol. 9, no. 7, pp. 1–27, 2009.
[21] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," in ACM MM, 2002.
[22] W. Kienzle, B. Schölkopf, F. Wichmann, and M. Franz, "How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements," in DAGM, 2007.
[23] M. Dorr, T. Martinetz, K. R. Gegenfurtner, and E. Barth, "Variability of eye movements when viewing dynamic natural scenes," J. of Vision, vol. 10, no. 10, pp. 1–17, 2010.
[24] M. Pilu, "On the use of attention clues for an autonomous wearable camera," HP Laboratories Bristol, Tech. Rep. HPL-2002-195, 2003.
[25] S. Rallapalli, A. Ganesan, K. Chintalapudi, V. Padmanabhan, and L. Qiu, "Enabling physical analytics in retail stores using smart glasses," in MobiCom, 2014.
[26] Y. Nakamura, J. Ohde, and Y. Ohta, "Structuring personal activity records based on attention-analyzing videos from head mounted camera," in ICPR, 2000.
[27] P. Cheatle, "Media content and type selection from always-on wearable video," in ICPR, 2004.
[28] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in CVPR, 2012.
[29] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in CVPR, 2013.
[30] O. Aghazadeh, J. Sullivan, and S. Carlsson, "Novelty detection from an egocentric perspective," in CVPR, 2011.
[31] Y. Hoshen, G. Ben-Artzi, and S. Peleg, "Wisdom of the crowd in egocentric video curation," in CVPR Workshop, 2014.
[32] H. S. Park, E. Jain, and Y. Sheikh, "3d gaze concurrences from head-mounted cameras," in NIPS, 2012.
[33] A. Fathi, J. Hodgins, and J. Rehg, "Social interactions: A first-person perspective," in CVPR, 2012.
[34] A. Fathi, A. Farhadi, and J. Rehg, "Understanding egocentric activities," in ICCV, 2011.
[35] Y. Poleg, C. Arora, and S. Peleg, "Temporal segmentation of egocentric videos," in CVPR, 2014.
[36] H. Pirsiavash and D. Ramanan, "Detecting activities of daily living in first-person camera views," in CVPR, 2012.
[37] D. Damen, T. Leelasawassuk, O. Haines, A. Calway, and W. Mayol-Cuevas, "You-do, i-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video," in BMVC, 2014.
[38] B. Soran, A. Farhadi, and L. Shapiro, "Action recognition in the presence of one egocentric and multiple static cameras," in ACCV, 2014.
[39] K. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, "Fast unsupervised ego-action learning for first-person sports video," in CVPR, 2011.
[40] E. Spriggs, F. De la Torre, and M. Hebert, "Temporal segmentation and activity classification from first-person sensing," in CVPR Workshop on Egocentric Vision, 2009.
[41] Y. Li, Z. Ye, and J. Rehg, "Delving into egocentric actions," in CVPR, 2015.
[42] P. K. Mital, T. J. Smith, R. L. Hill, and J. M. Henderson, "Clustering of gaze during dynamic scene viewing is predicted by motion," Cognitive Computation, vol. 3, no. 1, pp. 5–24, Mar 2011.
[43] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[44] C. Liu, "Beyond pixels: Exploring new representations and applications for motion analysis," Ph.D. dissertation, Massachusetts Institute of Technology, May 2009.
[45] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.