What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions
Kiana Ehsani, Daniel Gordon, Thomas Nguyen, Roozbeh Mottaghi, Ali Farhadi
University of Washington, Allen Institute for AI
https://github.com/ehsanik/muscleTorch

Abstract
Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our self-supervised representation that encodes interaction and attention cues outperforms a visual-only state-of-the-art method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics) and walkable surface estimation (affordance).

Figure 1: We propose to use humans' interactions with their visual surroundings as a training signal for self-supervised representation learning. We record first-person observations as well as the movements and gaze of people living their daily routines and use these cues to learn a visual embedding. We use the learned representation on a variety of diverse tasks and show consistent improvements compared to state-of-the-art self-supervised vision-only techniques.
1 Introduction
Encoding visual information from pixel space to a lower-dimensional vector is the core element of most modern deep learning-based solutions to computer vision. A rich set of algorithms and architectures have been developed to enable learning these encodings. A common practice in computer vision is to explicitly train the networks to map visual inputs to a curated label space. For example, a neural network is pre-trained using a large-scale annotated classification dataset (Deng et al., 2009; Krasin et al., 2017) and the entire network or part of it is fine-tuned to a new target task (Goyal et al., 2019; Zamir et al., 2018).

In recent years, weakly supervised and unsupervised representation learning approaches (e.g., Mahajan et al. (2018); He et al. (2020); Chen et al. (2020a)) have been proposed to mitigate the need for supervision. The most successful ones are contrastive learning-based approaches such as (Chen et al., 2020c;b), and they have shown remarkable results on target tasks such as image classification and object detection. Despite their success, there are two primary caveats: (1) These self-supervised methods are still trained on ImageNet or similar datasets, which are fairly cleaned up and/or include a pre-specified set of object categories. (2) This method of training is a passive approach in that it does not encode interactions. On the contrary, for humans, a vast majority of our visual understanding is shaped by our interactions and our observations of others interacting with their environments. We are not limited to learning from visual cues alone, and there are various other supervisory signals such as body movements and attention cues available to us. It has been shown that by learning how to move the joints to walk and crawl, infants can significantly enhance their perception and cognition (Adolph & Robinson, 2015). Moreover, by observing another person interact with the environment, humans obtain a visual and physical perception of the world (Bandura, 1977).

The question we investigate in this paper is, "can we learn a rich generalizable visual representation by encoding human interactions into our visual features?". In this work, we consider the movement of human body parts and the center of attention (gaze) as an indicator of their interactions with the environment and propose an approach for incorporating interaction information into the representation learning process.

To study what we can learn from interaction, we attach sensors to humans' limbs and see how they react to visual events in their daily lives. More specifically, we record the movements of the body parts by Inertial Measurement Units (IMUs) and also the gaze to monitor the center of attention. We introduce a new dataset of more than 4,500 minutes of interaction by 35 participants engaging in everyday scenarios with their corresponding body part movements and center of attention. There are no constraints on the actions, and no manual annotations or labels are provided.

Our experiments show that the representation we learn by predicting gaze and body movements in addition to the visual cues outperforms the visual-only baseline on a diverse set of target tasks (Figure 1): semantic (scene classification), temporal (action recognition), geometric (depth estimation), physics (dynamics prediction) and affordance-based (walkable surface estimation). This shows that movement and gaze information can help to learn a more informative representation compared to a visual-only model.
2 Related Work

Visual representations can be learned using many different techniques, from full supervision to no supervision at all. We outline the most common paradigms of representation learning, namely supervised, self-supervised, and interaction-based representation learning.
Supervised Representation Learning.
Supervised representation learning in computer vision is typically performed by pre-training neural networks on large-scale datasets with full supervision (e.g., ImageNet (Deng et al., 2009)) or weak supervision (e.g., Instagram-1B (Mahajan et al., 2018)). These models are fine-tuned for a variety of tasks including object detection (Girshick et al., 2014; Ren et al., 2015), semantic segmentation (Shelhamer et al., 2015; Chen et al., 2017), and visual question answering (Agrawal et al., 2015a; Hudson & Manning, 2019). However, collecting a manually annotated large-scale dataset such as ImageNet requires extensive resources in terms of cost and time. In contrast, in this paper, we only use human interaction data, which does not require any manual annotation.
Self-supervised Representation Learning.
There has been a wide range of research on self-supervised learning of visual representations in which properties of the images themselves act as supervision. The objectives for these methods cover a variety of tasks such as solving jigsaw puzzles (Noroozi & Favaro, 2016), colorizing grayscale images (Zhang et al., 2016), learning to count (Noroozi et al., 2017), predicting context (Doersch et al., 2015), inpainting (Pathak et al., 2016), adversarial training (Donahue et al., 2017) and predicting image rotations (Gidaris et al., 2018). This type of representation learning is not limited to learning from single frames. Agrawal et al. (2015b) and Jayaraman & Grauman (2015) both use egomotion, Wang & Gupta (2015) cyclically track patches in videos, Pathak et al. (2017) use low-level non-semantic motion-based cues, and Vondrick et al. (2016) predict the representation of future frames.

Inspired by contrastive learning (Hadsell et al., 2006), recent methods have used "instance discrimination" in which the network uniquely identifies each image. A network is trained to produce a non-linear mapping that projects multiple variations of an image closer to each other than to all other images. Using Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010), networks are trained to differentiate between similar images under complex noise models (such as non-overlapping crops and heavy color jittering) and dissimilar images. Oord et al. (2018) and Hénaff et al. (2019) introduce and investigate the Contrastive Predictive Coding (CPC) method, which encodes the shared information between different crops of an image to predict the features from masked regions of the image. Wu et al. (2018); Misra & van der Maaten (2020) use a memory bank, which enables contrasting features of the current image against a large set of negative samples, increasing the likelihood of finding a nearby negative. The MoCo technique (He et al., 2020; Chen et al., 2020c) encodes the positive samples with a momentum encoder to avoid rapid changes in the original feature extractor. They achieve comparable results with supervised learning representations. Chen et al. (2020a;b) show that by using a trainable non-linear transformation between the representation and the contrastive latent space, together with larger batch sizes, they can omit memory banks entirely, allowing for full backpropagation through both positive and negative samples, and achieve better results. Bachman et al. (2019); Tian et al. (2019) maximize the mutual information between different extracted features of the same image from multiple views. Zhuang et al. (2019) enforce the extracted features of similar images to move towards the same part of the embedding space. Gordon et al. (2020), Yao et al. (2020), and Devon Hjelm & Bachman (2020) apply contrastive methods to videos and leverage spatio-temporal cues to learn visual representations. In contrast to all of these approaches, we utilize human interactions along with their visual observation for representation learning.

Figure 2: Dataset examples. Two sequences from our dataset are shown on the left. The first row shows the sequence of images and the second row shows the movements of the body parts according to the IMU readings. We visualize the gaze using the red circle; this is only for visualization purposes and does not exist in the image. On the right, we show the data collection setup.
Interaction-Based Representation Learning.
The third class of learning representations relies on cues obtained by interacting with a dynamic environment. Pinto et al. (2016) learn a representation from interactions of a robotic arm (e.g., grasping and pushing) with different objects. Chen et al. (2019) and Weihs et al. (2019) both tackle the representation learning problem by training an agent to play a game in an interactive environment. Ehsani et al. (2018) learn a representation by modeling the non-semantic movements of a dog. Our work falls in this category since we use human interactions for learning the representation. We differ from these approaches in that we use low-level observations of human interaction, such as body part movements and gaze, to show significant improvement over a state-of-the-art baseline across multiple low-level and high-level target tasks.
3 Human Interaction Dataset
We introduce a new dataset of human interactions for our representation learning framework. In this section, we describe the data collection. Our goal is to capture how humans react to the visual world by recording their movements and focus of attention.
Figure 3: Model Overview. We learn a representation by jointly optimizing visual, movement and center of focus (gaze) objectives. The portion outlined with a rectangle is the backbone that is used to evaluate the representation for target tasks. All parts of the network are initialized randomly and trained from scratch.

Previous datasets of human actions and gaze include only gaze information (Fathi et al., 2012; Xu et al., 2018), part movements from a third-person view (Ionescu et al., 2014; Hassan et al., 2019), or only action or hand labels in an ego-centric setting (Damen et al., 2018; Sigurdsson et al., 2018). In contrast, our new dataset includes ego-centric observations along with the corresponding gaze and body movement information during the subjects' daily activities, ranging from walking and cycling to driving and shopping.

To collect the dataset, we record egocentric videos from a GoPro camera attached to the subjects' foreheads. We simultaneously capture body movements, as well as the gaze. We use Tobii Pro2 eye-tracking to track the center of the gaze in the camera frame. We record the body part movements using BNO055 Inertial Measurement Units (IMUs) in 10 different locations (torso, neck, 2 triceps, 2 forearms, 2 thighs, and 2 legs). Figure 2 shows the data collection setup along with two clips of the captured sequences. In total, we collected 4,260 minutes of videos with their corresponding body part movements and gaze from 35 people. Unlike common large-scale datasets used for representation learning such as ImageNet, there is no restriction on the categories observed in the images, and no manual annotation is provided. Statistical analysis of the dataset is provided in Appendix A.3. Moreover, we provide details of aligning the video with the motion sensors and synchronization of the sensors in Appendix A.2. The supplementary video shows a few examples of the video clips.
4 Interaction-Based Representation Learning
Visual representation learning is typically performed using visual cues from single images or videos (He et al., 2020; Gordon et al., 2020). Our goal in this paper is to incorporate human interactions into our representations to move beyond a purely visually-trained feature representation. Below, we describe our approach for integrating movement and gaze information in the representation learning pipeline. Intuitively, body part movements should encode the temporal changes in the image based on the underlying cause of those changes (e.g., moving legs results in walking, which makes distant objects move closer). Additionally, gaze grounds the visual features with the location in the image where the person pays the most attention. This should correlate well with semantic concepts such as objects, or affordances such as walkable surfaces.
4.1 Learning Features
Our goal is to learn visual representations by simultaneously learning a visual encoding for each frame and predicting body part movements and gaze attention from the sequence of observations. Formally, given an ego-centric video as a sequence of images $V = (I_t, \ldots, I_{t+k})$, the goal is to 1) estimate the gaze $G = (G_t, \ldots, G_{t+k})$, where $G_t$ is the person's center of focus in 2D camera coordinates, and 2) predict the body part movements $P = (P_t, \ldots, P_{t+k})$, where $P_t$ is a binary vector of length equal to the number of body parts, indicating whether a part is moved at time $t$.

We optimize three objectives: (1) gaze prediction, (2) body part movement prediction, and (3) auxiliary visual prediction. The visual features obtained from a CNN backbone are combined with a sequence-to-sequence model in order to predict gaze and movement. Note that the weights of the backbone are randomly initialized, i.e., we train the model from scratch. Figure 3 shows an overview of the architecture. The objectives are jointly optimized. In the following, we explain each of them in more detail.
Gaze: We predict the person's focus of attention by modeling their gaze in the camera reference frame. We use the Huber loss $\mathcal{L}_{attention}(\hat{G}, G)$ to train the center of attention,

$$\mathcal{L}_{attention}(\hat{G}, G) = \begin{cases} \frac{1}{2}\,\|\hat{G} - G\|^2 & |\hat{G} - G| < \delta \\ \delta\,|\hat{G} - G| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases} \quad (1)$$
Movement: We find the task of predicting body part movement direction and magnitude to be highly ambiguous. For example, when walking, the visual information may not show the legs, so we cannot know how high the legs were lifted. Instead, for each body part, we predict whether it is moving at all, which is less ambiguous and reduces the problem to a binary classification task. Rather than predicting the movements for the lower and upper parts of a joint separately (leg and thigh, forearm and tricep), we combine the movements into 6 categories: torso, neck, right arm, left arm, right leg, and left leg. To estimate this movement, we use the binary cross-entropy loss and denote it as $\mathcal{L}_{movement}(\hat{P}, P)$.
Auxiliary Visual Prediction: We also use a visual objective, $\mathcal{L}_{visual}(\hat{I}_t, I_t)$. For this objective, similar to (He et al., 2020), we use instance discrimination. Any alternative visual encoding objective can be used instead; in Section 5.3.1, we provide results for another type of visual encoding as well. Instance discrimination's objective is to force the visual features of different augmentations of the same image to be as close as possible in the latent space (Wu et al., 2018; Chen et al., 2020a; Zhuang et al., 2019) while pushing apart all other image embeddings. By learning to extract what makes each image unique, the network focuses on semantically meaningful features of the image. This enables the feature extractor to embed a more detailed representation of the image, which is especially important when transferring to different tasks and domains. To contrast the positive samples (the augmentations of the image) with a large set of negative samples, we maintain a memory bank of embedded features from different images in the data. The final objective can be formalized as $\mathcal{L}_{visual}$ (which is also known as the InfoNCE (Oord et al., 2018) loss),

$$\mathcal{L}_{visual}(\hat{I}_t, I_t) = -\log \frac{\exp(f(I_t) \cdot f(\hat{I}_t)/\tau)}{\sum_{i=0}^{N} \exp(f(I_t) \cdot M_i/\tau)}, \quad (2)$$

where $I_t, \hat{I}_t$ are two different random augmentations of the first image of the sequence $V$, $f$ is the image feature extractor (ResNet backbone), $M = (M_0, \ldots, M_N)$ is the bank of negative samples, and $\tau$ is a parameter that controls the concentration level of the distribution (Hinton et al., 2015). We also use a momentum-updated encoder as in (He et al., 2020). We apply the visual loss only to the first image of the sequence, as images within a sequence tend to be visually similar to each other.

The overall objective is a weighted sum of the described loss functions. More details on the architecture are provided in Appendix A.4.1.

$$\mathcal{L}_{interaction} = \alpha\,\mathcal{L}_{attention}(\hat{G}, G) + \beta\,\mathcal{L}_{movement}(\hat{P}, P) + \gamma\,\mathcal{L}_{visual}(\hat{I}_t, I_t) \quad (3)$$
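To make the combined objective more concrete, the following is a minimal PyTorch-style sketch of how the three terms of Eq. 3 could be computed. The tensor shapes, the queue of negatives, and the loss weights (set to 1.0 here only as placeholders) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def interaction_loss(gaze_pred, gaze_gt, move_logits, move_gt,
                     q, k_pos, queue, tau=0.07,
                     alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Sketch of Eq. 3: weighted sum of gaze, movement, and visual losses.

    gaze_pred, gaze_gt:   (B, T, 2) predicted / ground-truth gaze coordinates.
    move_logits, move_gt: (B, T, 6) per-part movement logits / binary labels.
    q, k_pos:             (B, D) embeddings of two augmentations of the first frame.
    queue:                (N, D) memory bank of negative embeddings.
    """
    # (1) Gaze: Huber loss on the 2D center of attention (Eq. 1).
    l_attention = F.huber_loss(gaze_pred, gaze_gt, delta=delta)

    # (2) Movement: binary cross-entropy over the 6 body-part groups.
    l_movement = F.binary_cross_entropy_with_logits(move_logits, move_gt.float())

    # (3) Visual: InfoNCE against the memory bank of negatives (Eq. 2).
    l_pos = torch.einsum("bd,bd->b", q, k_pos).unsqueeze(1)   # (B, 1) positive logits
    l_neg = torch.einsum("bd,nd->bn", q, queue)               # (B, N) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    l_visual = F.cross_entropy(logits, labels)                # positive is class 0

    return alpha * l_attention + beta * l_movement + gamma * l_visual
```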
4.2 Adapting the Representation to New Tasks

After training the model using the $\mathcal{L}_{interaction}$ objective, we use the trained weights of our feature extraction network (i.e., only the ResNet part) as the initialization for our target tasks. Our goal in this paper is to evaluate the visual representation on its own rather than using it as initialization for end-to-end training. Hence, during training for the target tasks, the weights of the feature extraction backbone are frozen. We have a diverse set of target tasks, where each requires a specific network architecture (for example, depth estimation requires up-convolutional layers, while action recognition requires a temporal architecture). Below, we describe the results of the transfer to the target tasks. We explain the details of the architectures for each target task in Appendix A.4.2.
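As a minimal illustration of this transfer protocol (frozen backbone, trainable task head), consider the sketch below. The linear scene-classification head and the learning rate are assumed examples for illustration, not the actual target-task decoders described in Appendix A.4.2; the use of Adam follows the shared implementation details given there.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=None)   # weights learned with L_interaction are loaded separately
backbone.fc = nn.Identity()                # keep only the convolutional feature extractor
for p in backbone.parameters():
    p.requires_grad = False                # the representation stays frozen during transfer

head = nn.Linear(512, 397)                 # e.g., a classifier over the 397 SUN397 scene categories
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)   # only the task-specific layers are trained
```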
Datasets: (a) Scene: SUN397 (Xiao et al., 2010); (b) Action: Epic Kitchen (Damen et al., 2018); (c) Dynamics: VIND (Mottaghi et al., 2016a); (d) Walkable and (e) Depth: NYUv2 (Nathan Silberman & Fergus, 2012).

Method | Training Objective | (a) Scene (Top-1 ↑) | (b) Action (Top-1 ↑) | (c) Dynamics (Top-1 ↑) | (d) Walkable (IOU ↑) | (e) Depth (RMSE log ↓)
MoCo (He et al., 2020) | vis | 15.80 | 24.45 | 13.18 | 58.97 | 0.148
Ours | vis/attn | 21.27 | 26.80 | 13.71 | |
Ours | vis/move/attn | | | | |

Table 1: Target task results. We compare the performance of our learned representation from movement and gaze cues with a recent self-supervised baseline, MoCo (He et al., 2020) (which is trained on our data). We evaluate the performance on a variety of different target tasks.
5 Experiments
To evaluate our representation learning approach, we consider five different types of target tasks. The tasks are chosen such that they cover a wide range of domains: semantic (scene classification), temporal (action recognition), geometric (depth estimation), physical (dynamics prediction), and affordance (walkable surface estimation). We show that our learned representation, which encodes body part movement and gaze and does not rely on any manual annotation, outperforms a strong self-supervised baseline which relies on purely visual cues. Furthermore, we provide ablations of our model by using an alternative visual loss and using a subset of body parts for representation learning. For implementation details, refer to Appendix A.4.
5.1 Self-Supervised Baseline
We compare our method with the recently introduced self-supervised representation learning technique, the Momentum Contrast network (MoCo) (He et al., 2020), which is a state-of-the-art representation learning approach and achieves strong performance on a variety of target tasks such as image classification and object detection. The original work was trained on images from the ImageNet dataset. To ensure the comparison between our method and the baseline is fair, we train MoCo on the images from our dataset. Note that this baseline relies on visual cues only. Our goal is to show whether we can learn better representations when we use movement and gaze information in addition to the visual information.
5.2 Evaluation of the Learned Representation
We evaluate the learned representation on five different target tasks. The weights for the feature extraction backbone are frozen, and only the task-specific layers are trained. We show that the representation trained using the movement and attention (gaze) supervision in addition to the visual cues outperforms the MoCo (He et al., 2020) baseline (trained on our data) across the board. For each target task, we report the results in four settings, each using a different combination of visual, movement, and gaze (attention) cues for representation learning.
Scene Classification.
For the task of scene classification, a network receives a single image as input and predicts the scene category of the image. We use the SUN397 (Xiao et al., 2010) dataset for this task, as it provides a large-scale collection of 130k images from 397 different scene categories (e.g., park, restaurant, kitchen). The results are shown in Table 1, column (a). The representation that encodes both movement and attention cues performs best on the semantic task of scene classification. We achieve nearly a 7% improvement compared to fine-tuning the MoCo (He et al., 2020) baseline.
Action Recognition.
The task is to predict the category of an action from ego-centric videos. We use the EPIC-KITCHENS dataset (Damen et al., 2018) for this task, which is a large-scale dataset of 11M images from different action categories that are performed in various kitchens.

As shown in Table 1, column (b), our method outperforms the strong baseline representation learning method by 3.5%. This again shows that incorporating additional cues such as part movements and gaze in the representation learning is beneficial for downstream tasks. It seems that both movement and attention cues are helpful for action recognition. This is aligned with our intuition that predicting the gaze of a person and how they move their body parts may be beneficial to recognizing the actions they perform.

Training Objective | Scene Classification (Top-1 ↑) | Action Recognition (Top-1 ↑) | Dynamics Prediction (Top-1 ↑) | Walkable Estimation (IoU ↑) | Depth Estimation (RMSE log ↓)
L_ae | | | | |
L_ae + L_att + L_move | | | | |
L_nce + L_att + L_move | | | | |

Table 2: Ablation of the visual loss. The result of using an autoencoder for the visual loss. We re-train the models for the five target tasks. L_att, L_move and L_nce are the ones used in Eq. 3.
Future Prediction of Dynamics. The goal of this task is to predict the future dynamics of an object in an image. We use the VIND (Mottaghi et al., 2016a) dataset for this task. It includes 150K images with corresponding object bounding boxes. The dataset categorizes physical dynamics into Newtonian scenarios such as sliding, projectile motion, and bouncing. The goal is to predict these Newtonian scenarios and the camera viewpoint for a query object that is specified by a bounding box and physical motion labels. There are 66 classes in total. The input to the network is a single RGB image and the bounding box for the query object. Table 1, column (c) includes the results for this task. We outperform the baseline by 1.3%. The representation that is learned by using both attention and movement provides the best performance for this task, which involves predicting the future trajectory of objects.
Walkable Surface Estimation.
The goal of this task is to segment the pixels in an image that a person can walk on. We use the data from (Mottaghi et al., 2016b), which provides annotations for 1449 images of the NYU DepthV2 (Nathan Silberman & Fergus, 2012) dataset. The results are shown in Table 1, column (d). The variation of our method that uses only the gaze information achieves the highest accuracy. This might be due to the fact that, during walking, human attention is focused on the places that they can walk on. Therefore, the gaze provides sufficient information to perform this task.
Depth Estimation.
For depth estimation, the task is to regress the values of the depth for a single monocular RGB image. We use the NYU DepthV2 (Nathan Silberman & Fergus, 2012) dataset for this task, which provides 1449 densely labeled pairs of RGB and depth images. The results are shown in Table 1, column (e). Our learned representation outperforms the baseline for this task as well. Movement cues seem more aligned with the task of depth estimation, and the representation embedded with this information performs better. Note that the metrics for depth and walkable surface estimation are global metrics, i.e., they are computed over the entire image. Therefore, typically a small improvement in those metrics has a significant effect on the qualitative results.
5.3 Ablative Analyses
We ablate our results by replacing the InfoNCE visual loss with an autoencoder loss. Additionally, we show how gaze information affects the prediction of the body part movements. Finally, we evaluate which movements serve as important supervision by masking out subsets of the body parts and retraining the representation.
5.3.1 Visual Loss
As discussed in Section 4.1, we use the InfoNCE loss (Oord et al., 2018) while learning the representation. In order to investigate the impact of using this objective, we learn the representation using an autoencoder loss for our visual objective and evaluate the learned representation on all five target tasks after re-training using the new backbone. The autoencoder loss $\mathcal{L}_{ae}$ is defined as $\mathcal{L}_{ae}(d(f(I_t)), I_t) = \| d(f(I_t)) - I_t \|$, where $f$ is the feature extractor backbone and $d$ is a decoder network of five up-convolution layers, which receives the backbone feature map as input and reconstructs the input image.

Table 2 shows that gaze and movement information still provide a strong signal compared to the visual-only case. However, the results are worse than the case where we use the InfoNCE loss for learning the representation.
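A minimal sketch of this autoencoder objective is shown below; the decoder channel widths and the use of a mean-squared reconstruction error are assumptions, since the exact feature and image sizes are not reproduced in this text.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Hypothetical five-layer up-convolutional decoder d(.) for the L_ae ablation."""
    def __init__(self, in_ch=512):
        super().__init__()
        chans = [in_ch, 256, 128, 64, 32]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each up-convolution doubles the spatial resolution.
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU()]
        layers += [nn.ConvTranspose2d(chans[-1], 3, 4, stride=2, padding=1)]  # back to RGB
        self.net = nn.Sequential(*layers)

    def forward(self, feat):
        return self.net(feat)

def l_ae(decoder, feat, image):
    # L_ae(d(f(I_t)), I_t): reconstruction error between the decoded feature and the image.
    return F.mse_loss(decoder(feat), image)
```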
Prediction | Avg. Accuracy
Visual → Part Movement | 79.19
Visual + Gaze → Part Movement |

Table 3: Body part movement prediction. We investigate the correlation of the movement and attention by using the human gaze as an additional input to predict the body part movements.

Masked Parts | Scene Classification (Top-1 ↑) | Action Recognition (Top-1 ↑) | Dynamics Prediction (Top-1 ↑) | Walkable Estimation (IoU ↑) | Depth Estimation (RMSE log ↓)
w/o Torso | 21.56 | 25.42 | 13.47 | 57.50 |
w/o Neck | 21.50 | 26.25 | 13.54 | 56.76 | 0.148
w/o Arms | 20.72 | 24.97 | 13.79 | 58.08 | 0.148
w/o Legs | 21.38 | 25.65 | 12.62 | 57.16 | 0.147
w/ all | | | | |

Table 4: Ablation of body parts. We show how the performance on the target tasks changes when we ignore a body part during representation learning.
5.3.2 Movement Estimation
To better understand the effect of gaze, we predict body part movements with and without gaze information. This experiment is not part of the representation learning experiments; it simply evaluates whether using gaze provides any additional cue for the prediction of the movements.

In this experiment, we predict which subset of the six groups of body parts (neck, torso, left arm, right arm, left leg, right leg) have moved. The overall architecture for this experiment is the same as our representation learning model, except for the input to the LSTM modules, which is instead the concatenation of the image features from the ResNet and the embedded input gaze. The input gaze embedding is a two-layer network encoding the gaze into a feature vector of size 512.

Table 3 shows the results for this experiment. The network achieves an improvement in predicting the body part movements by having the additional information of the person's center of attention, which can intuitively serve as a proper indicator of their "intentions".
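A sketch of how the gaze could be injected into the movement predictor is shown below; apart from the stated 512-dimensional gaze embedding and the six part groups, the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GazeConditionedMovementPredictor(nn.Module):
    def __init__(self, img_feat_dim=512, gaze_dim=2, hidden=512, num_parts=6):
        super().__init__()
        # Two-layer network that embeds the 2D gaze into a 512-d feature vector.
        self.gaze_embed = nn.Sequential(nn.Linear(gaze_dim, 256), nn.ReLU(),
                                        nn.Linear(256, 512))
        self.lstm = nn.LSTM(img_feat_dim + 512, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_parts)  # which of the 6 part groups moved

    def forward(self, img_feats, gaze):
        # img_feats: (B, T, 512) per-frame ResNet features; gaze: (B, T, 2)
        x = torch.cat([img_feats, self.gaze_embed(gaze)], dim=-1)
        out, _ = self.lstm(x)
        return self.classifier(out)                     # (B, T, 6) movement logits
```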
5.3.3 Effects of the Body Parts
To evaluate how each body part affects the learned representation, we perform an experiment where we ignore a subset of body parts, re-train the representation learning (from scratch) and evaluate the features on the target tasks. Table 4 summarizes the results. The performance on the target tasks (except depth estimation) drops when we ignore a body part during representation learning. For depth estimation, removing the torso results in a slightly lower error, which might indicate that the torso movement is not as helpful for estimating the depth.
6 Conclusion
Representations that encode movements and actions become a necessity as we move deeper towards embodied visual understanding. In this paper, we investigate the idea of using human interactions to learn visual representations. To enable this research, we introduce a new dataset of human interactions which includes hours of synchronized streams of image frames, body part movements, and gaze information across different subjects and activities. We show that representations trained to predict body movements and gaze encode additional information compared to their purely visual counterparts. More specifically, we show our representation outperforms a state-of-the-art self-supervised representation learning baseline for a variety of target tasks.
Acknowledgments
This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, 67102239, and gifts from the Allen Institute for Artificial Intelligence.
References

Karen Adolph and Scott Robinson. Motor Development, volume 2, pp. 113–157. John Wiley & Sons, 2015.
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. VQA: Visual question answering. IJCV, 2015a.
Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In ICCV, 2015b.
Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019.
Albert Bandura. Social learning theory. Prentice-Hall, 1977.
Boyuan Chen, Shuran Song, Hod Lipson, and Carl Vondrick. Visual hide and seek. arXiv, 2019.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv, 2020a.
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv, 2020c.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
Dima Damen, Will Price, Evangelos Kazakos, Antonino Furnari, and Giovanni Maria Farinella. EPIC-Kitchens - 2019 challenges report. Technical report, 2019.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
R Devon Hjelm and Philip Bachman. Representation learning with video deep infomax. arXiv, 2020.
Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
Kiana Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, and Ali Farhadi. Who let the dogs out? Modeling dog behavior from visual data. In CVPR, 2018.
Alireza Fathi, Yin Li, and James M Rehg. Learning to recognize daily actions using gaze. In ECCV, 2012.
Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi. Watching the world go by: Representation learning from unlabeled videos. arXiv, 2020.
Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In ICCV, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
Olivier J Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv, 2015.
Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
Catalin Ionescu, Dragos Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI, 2014.
Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2017.
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
Roozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Newtonian scene understanding: Unfolding the dynamics of objects in static images. In CVPR, 2016a.
Roozbeh Mottaghi, Hannaneh Hajishirzi, and Ali Farhadi. A task-oriented approach for cost-sensitive recognition. In CVPR, 2016b.
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. In ICCV, 2017.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv, 2018.
Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning visual representations via physical interactions. In ECCV, 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
Gunnar A Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-Ego: A large-scale dataset of paired third and first person videos. arXiv, 2018.
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv, 2019.
Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.
Luca Weihs, Aniruddha Kembhavi, Winson Han, Alvaro Herrasti, Eric Kolve, Dustin Schwenk, Roozbeh Mottaghi, and Ali Farhadi. Artificial agents learn flexible visual representations by playing a hiding game. arXiv, 2019.
Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
Yanyu Xu, Yanbing Dong, Junru Wu, Zhengzhong Sun, Zhiru Shi, Jingyi Yu, and Shenghua Gao. Gaze prediction in dynamic 360° immersive videos. In CVPR, 2018.
Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, and Tao Mei. SeCo: Exploring sequence supervision for unsupervised representation learning. arXiv, 2020.
Amir R. Zamir, Alexander Sax, William Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.
Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.
A Appendix
A.1 Dataset Examples
The supplementary video provides examples of our dataset and also qualitative results of our representation learning model.
A.2 Data Collection Details
We describe the details of our hardware setup and the alignment method used to synchronize the recordings between our camera, gaze tracker, and movement sensors.
A.2.1 Alignment and Synchronization of Devices
There are three different devices in our setup: 1) Tobii Pro eye-tracking glasses to record gaze, 2) BNO055 IMU sensors to record movements, and 3) a GoPro camera attached to the forehead to capture ego-centric videos. These devices record data independently. Therefore, it is necessary to synchronize all recordings.

Gaze Tracker and GoPro Alignment. Tobii Pro2 eye-tracking captures a video and the center of the gaze in the camera frame. Due to the low quality of the video captured by the eye-tracking glasses, we use an additional high-quality GoPro Hero 6 camera (with a higher resolution, recording at 60 frames per second). To synchronize the videos from the gaze tracker and the GoPro, we extract SIFT (Lowe, 2004) features, use a brute-force algorithm for feature matching, and use the RANSAC method to find a homography that maps the gaze from Tobii's camera frame to the GoPro's camera coordinates. Note that the gaze might be missing for some frames due to device noise.

IMU Sensors and GoPro Synchronization. The outputs of the IMU sensors are recorded on a Raspberry Pi board. There is also a microphone on the Raspberry Pi that records audio. We synchronize the IMU and video recordings using two methods: 1) synchronizing the audio from the GoPro video and the voice recording on the Raspberry Pi board, and 2) repeating a specific pattern of body movements in front of a mirror, so it can be uniquely identified both in the movements depicted by the IMU sensors and in the GoPro camera (which records the participant's body pose in the mirror).
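The gaze alignment step could be implemented roughly as follows with OpenCV; the matcher configuration and the RANSAC reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def to_gray(img):
    # SIFT expects an 8-bit single-channel image.
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img

def map_gaze_to_gopro(tobii_frame, gopro_frame, gaze_xy):
    """Map a gaze point from the Tobii camera frame into the GoPro frame
    via SIFT features, brute-force matching, and a RANSAC homography."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(to_gray(tobii_frame), None)
    kp2, des2 = sift.detectAndCompute(to_gray(gopro_frame), None)

    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(des1, des2)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None                                   # gaze can be missing for noisy frames

    pt = np.float32([[gaze_xy]])                      # shape (1, 1, 2)
    return cv2.perspectiveTransform(pt, H)[0, 0]      # gaze in GoPro coordinates
```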
A.2.2 Movement Calculations

We record the body part movements using BNO055 Inertial Measurement Units (IMUs) in 10 different locations (torso, neck, 2 triceps, 2 forearms, 2 thighs, and 2 legs). The body parts may not appear in the ego-centric video frames; therefore, the task of predicting the exact orientation and location of a body part (e.g., an arm) using ego-centric videos can be very challenging. We train the model using the simpler task of predicting whether a part has moved or not. This still contains rich information about the action that is happening in the video; for example, walking can be defined as periodic movements of the left and right legs.

To compute the loss function, we need to distinguish between movement and no movement in the dataset. One way is to find a threshold in the domain of the angles of the part movements, and label all the moves smaller than the threshold as no movement and the rest as movement. However, this might result in ambiguities for the movements close to the threshold, and the network might over-penalize wrong predictions in the neighborhood close to the threshold. Therefore, we add a third gray area label, for which the network is not penalized for wrong predictions. We divide the range of movements for each sensor into three equal ranges: the first 33% is labeled as no movement, the last 33% is labeled as movement, and the remaining interval is the gray area, for which there is no penalty for misprediction.
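A simple sketch of this three-way labeling is shown below; it assumes per-sensor rotation magnitudes as input and interprets "three equal ranges" as equal intervals of each sensor's angle range, which is one possible reading of the description above.

```python
import numpy as np

NO_MOVEMENT, GRAY_AREA, MOVEMENT = 0, -1, 1

def movement_labels(angle_deltas):
    """Label per-sensor rotation magnitudes as no-movement / gray-area / movement.

    angle_deltas: (T, num_sensors) magnitude of orientation change per frame.
    The per-sensor range is split into three equal intervals; the middle
    interval is a gray area that is not penalized during training.
    """
    lo = angle_deltas.min(axis=0)
    hi = angle_deltas.max(axis=0)
    t1 = lo + (hi - lo) / 3.0        # upper bound of the "no movement" interval
    t2 = lo + 2 * (hi - lo) / 3.0    # lower bound of the "movement" interval

    labels = np.full(angle_deltas.shape, GRAY_AREA, dtype=np.int64)
    labels[angle_deltas <= t1] = NO_MOVEMENT
    labels[angle_deltas >= t2] = MOVEMENT
    return labels
```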
A.3 Dataset Analysis
To ensure that the videos in the dataset consist of a wide range of activities, we do not provide any specific instructions to the subjects, and we ask them to perform their daily routine activities. Hence, the dataset includes a variety of different situations including but not limited to driving, cycling, playing pool, cooking, cleaning, walking in the streets, and shoveling snow. Purely for dataset analysis purposes, we gather proxy scene labels for each image. We use an off-the-shelf scene classification model trained on the Places dataset (Zhou et al., 2017) and record the top-1 prediction for each frame of our dataset. Many scene categories in our data are not present in Places, so we see a moderate amount of misclassification. However, the classifier confidently (more than 70%) predicts that 101 of the 365 classes exist somewhere within our dataset, showing the diversity of our data. Figure 4 Left shows the 20 most frequent predictions. Even though there are some mispredictions among them (such as jail cell, which is frequently mistaken for dark rooms), we observe that our data is fairly diverse.

Figure 4 Middle shows the distribution of the gaze in the images. As expected, the focus is mostly in the center of the image. In Figure 4 Right, we show the change in the orientation of the body parts between two consecutive frames. We observe more movement in the limbs compared to the torso and neck. We additionally notice more right arm movement than left, which is likely caused by more of our participants being right-handed.

Figure 4: Dataset Statistics. Left: The approximate distribution of the top 20 scene classes according to a scene classifier trained on the Places (Zhou et al., 2017) dataset. We show how often each scene category is predicted as top-1. Middle: The distribution of gaze across the dataset. Right: The average magnitude of the change in the orientation of body parts between consecutive frames.
A.4 Architecture & Training Details
In this section, we describe the details of our network architectures as well as the hyperparameters and optimization methods that we used, for reproducibility purposes. The code and data will also be made publicly available for further research.
A.4.1 Backbone Network
We train the feature extractor network by using a sequence of images of length k = 5, sampled a fixed fraction of a second apart, as input. We use the ResNet18 (He et al., 2016) convolutional layers as the feature extraction backbone. To preserve spatial information, which is essential for gaze prediction, we use the spatial feature map before average pooling. We then add a 1x1 convolutional layer on top to reduce the feature size. The flattened feature is then input to a 3-layer LSTM with hidden size 512, which encodes the input video into a hidden feature vector. Next, the embedded video feature vector is decoded using a 3-layer LSTM to predict the binary movement vector and the gaze. For $\mathcal{L}_{visual}$, we use a feature obtained by a fully connected layer on top of the ResNet18 features together with a memory bank of negative samples. We choose δ = 1 in Equation 1, and fixed values for α, β, γ in Equation 3 and τ in Equation 2. For data augmentation, we only use color jitter and random flip. We flip the entire sequence of images, swap the part movements for right and left arms and legs, and calculate the updated gaze in the flipped images.
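The sketch below condenses this backbone and sequence model. The 64-channel 1x1 reduction, the 7x7 spatial map (which assumes 224x224 inputs), and the simplified encoder-decoder wiring are assumptions where the exact values are not reproduced in this text.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class InteractionModel(nn.Module):
    def __init__(self, reduced_ch=64, hidden=512, num_parts=6):
        super().__init__()
        resnet = models.resnet18(weights=None)                         # trained from scratch
        self.features = nn.Sequential(*list(resnet.children())[:-2])   # keep spatial map, drop pool/fc
        self.reduce = nn.Conv2d(512, reduced_ch, kernel_size=1)        # 1x1 conv to shrink the feature
        self.encoder = nn.LSTM(reduced_ch * 7 * 7, hidden, num_layers=3, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.gaze_head = nn.Linear(hidden, 2)                          # 2D center of attention
        self.move_head = nn.Linear(hidden, num_parts)                  # binary movement per part group

    def forward(self, frames):
        # frames: (B, T, 3, H, W) ego-centric clip
        b, t = frames.shape[:2]
        x = self.reduce(self.features(frames.flatten(0, 1)))           # (B*T, C, 7, 7) for 224x224 input
        x = x.flatten(1).view(b, t, -1)
        enc, state = self.encoder(x)
        dec, _ = self.decoder(enc, state)                              # simplified seq2seq wiring
        return self.gaze_head(dec), self.move_head(dec)
```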
A.4.2 Target Task Networks
Shared Implementation Details. During the target task training, the weights for the backbone are frozen. For all of our experiments, we use the Adam optimizer (Kingma & Ba, 2015), and images are reshaped to a fixed resolution. The size of the hidden layer in our LSTMs in all temporal experiments is 512. We use leaky-ReLU non-linearities between all network layers except for the LSTMs.
Self-supervised Baseline Details. When training the MoCo encoder network, we use the SGD optimizer with the learning rate that performs best in this setting for the MoCo baseline. For data augmentation, we use random cropping, horizontal flipping, gray scaling, and color jittering with the same parameters used in that work. We train the baseline with batch size 256 on 8 GPUs for 200 epochs with the same training regime as the original work.
Scene Classification. We use a decoder network of a single convolution layer that reduces the feature size, followed by two fully connected layers that convert these features to a vector of size 512 and then 397. We use the cross-entropy loss for training, and evaluate using mean per-class top-1 accuracy.
Action Recognition. For this task, we train a single convolution layer that reduces the feature size, followed by an LSTM to embed the video in one hidden vector of size 512, and two fully connected layers that convert the features to a vector of size 200 and then to the number of actions. As before, we use the cross-entropy loss as the objective and evaluate with mean per-class top-1 accuracy. Since some of the action classes in this dataset appear in a limited set of videos, following one of the EPIC challenge finalists (Damen et al., 2019), we choose 9 verbs that result in a state transition, namely take, put, open, close, wash, cut, mix, pour and peel, and ignore verbs that do not cause a state transition (e.g., check).
Dynamic Prediction. We create a binary rectangular mask using the object bounding box. We use this mask image as the input to a two-layer convolutional network and, as the result, obtain a mask feature. These two convolutional layers are trained for both our method and the baseline. We concatenate the mask feature with the feature vector obtained from the image and add two fully connected layers on top to obtain the class labels. Again, we optimize the network using the cross-entropy loss and use mean per-class top-1 accuracy as the evaluation metric.
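A possible sketch of this mask-conditioned classifier follows; the channel counts, the pooling of the mask feature to a vector, and the sizes of the classification layers are assumptions, since the exact dimensions are not reproduced in this text.

```python
import torch
import torch.nn as nn

class DynamicsHead(nn.Module):
    """Predicts one of 66 Newtonian scenario/viewpoint classes from a frozen
    image feature and a binary mask of the query object's bounding box."""
    def __init__(self, img_feat_dim=512, num_classes=66):
        super().__init__()
        # Two trainable convolutional layers that embed the binary mask image.
        self.mask_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Linear(img_feat_dim + 32, 256), nn.LeakyReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, img_feat, mask):
        # img_feat: (B, 512) pooled feature from the frozen backbone; mask: (B, 1, H, W)
        m = self.mask_encoder(mask).flatten(1)                      # (B, 32) mask feature
        return self.classifier(torch.cat([img_feat, m], dim=-1))   # (B, 66) class logits
```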
Depth Estimation. The ResNet backbone is connected to a Feature Pyramid Network (FPN) (Lin et al., 2017). We use Pixel Shuffle layers (Shi et al., 2016) for up-scaling the lower-level features. The learned ResNet backbone is frozen; the 5 up-convolution layers are the only layers trained for the target task. We use the Huber loss as the objective.
Walkable Surface Estimation.
The architecture of this network is the same as the depth estimation network. The ResNet backbone is frozen, and five up-convolution layers are the only layers that are trained for the target task. We use the binary cross-entropy loss as the objective, where the goal is segmenting walkable and non-walkable pixels. For evaluation, we use the standard Intersection over Union (IOU) metric for segmentation tasks.
A.5 Result of Full Supervision
As a point of reference, we also provide the results using a fully supervised backbone that is trained on ImageNet. Neither our method nor the purely visual baselines use any supervision for representation learning; therefore, a direct comparison is not fair. The results are shown in Table 5. The corresponding self-supervised results are shown in Table 1.
Datasets: (a) Scene: SUN397 (Xiao et al., 2010); (b) Action: Epic Kitchen (Damen et al., 2018); (c) Dynamics: VIND (Mottaghi et al., 2016a); (d) Walkable and (e) Depth: NYUv2 (Nathan Silberman & Fergus, 2012).

Method | Training Objective | (a) Scene (Top-1 ↑) | (b) Action (Top-1 ↑) | (c) Dynamics (Top-1 ↑) | (d) Walkable (IOU ↑) | (e) Depth (RMSE log ↓)
Supervised (IN) | Classification | 47.27 | 32.09 | 20.01 | 65.8 | 0.132

Table 5: Results of the fully supervised ImageNet-trained backbone on the target tasks.