STAIR Actions: A Video Dataset of Everyday Home Actions
Yuya Yoshikawa, Jiaqing Lin, Akikazu Takeuchi
STAIR Lab, Chiba Institute of Technology
Abstract.
A new large-scale video dataset for human action recognition, called STAIR Actions, is introduced. STAIR Actions contains 100 categories of action labels representing fine-grained everyday home actions so that it can be applied to research in various home tasks such as nursing, caring, and security. In STAIR Actions, each video has a single action label. Moreover, for each action category, there are around 1,000 videos that were obtained from YouTube or produced by crowdsource workers. The duration of each video is mostly five to six seconds. The total number of videos is 102,462. We explain how we constructed STAIR Actions and show the characteristics of STAIR Actions compared to existing datasets for human action recognition. Experiments with three major models for action recognition show that STAIR Actions can train large models and achieve good performance. STAIR Actions can be downloaded from http://actions.stair.center.

Keywords: Action recognition, Video dataset, Human actions, Deep neural networks, Deep learning
1 Introduction

In recent years, human action recognition, the task of classifying what actions the people appearing in a given video are performing, has attracted attention as one of the main themes in video analysis [1, 2]. Nowadays, because cameras are installed in many devices such as smartphones, robots, cars, and home appliances, it is expected that technologies for human action recognition will be used in various situations in place of recognition by human eyes.

In most recent studies on human action recognition, deep neural networks (DNNs) are used. When using DNNs, the following two points are important. The first is the size of the dataset. One reason image recognition has become successful is the massive amount of labeled image data, which is now sufficient for training DNNs, along with the technical revolution of deep learning [3]. Hara et al. suggested this successful scheme could also be valid for human action recognition in videos [4]. The second is the selection of action labels. If one would like to apply the trained model to some task, action labels should be chosen from that task domain. Although many datasets for human action recognition have been constructed so far, their action labels were not designed with target tasks in mind, but instead were designed with the ease of data collection in mind.
In this paper, we introduce a new large-scale video dataset for human action recognition called STAIR Actions. STAIR Actions contains 100 categories of action labels representing fine-grained everyday home actions so that it can be used for the recognition of various home tasks such as nursing, caring, and security. For each action category, there are 900 to 1,200 videos obtained from YouTube or produced by crowdsource workers. Moreover, each video in STAIR Actions has a single action label. The duration of each video is mostly 5–6 s; the shortest and longest are 3 and 10 s, respectively. The total number of videos is 102,462. In this paper, we explain how we constructed STAIR Actions and compare its characteristics with those of existing datasets for human action recognition. Experiments with three popular architectures show that STAIR Actions can be used to train large models that achieve competitive performance. STAIR Actions can be downloaded from http://actions.stair.center.

Our main contributions can be summarized as follows:
– We developed a large video dataset of everyday home actions consisting of a balanced distribution of videos over 100 action categories.
– Through an analysis of the dataset, we characterized it against existing datasets and identified challenging action categories in the dataset.
– We showed how human action recognition can be achieved with this dataset through experiments using three DNN models for video.

The rest of this paper is organized as follows. In Section 2, we review the existing action recognition datasets widely used in computer vision research. In Section 3, we explain in detail how we constructed STAIR Actions. Next, the unique characteristics of STAIR Actions are shown in Section 4. Benchmark results of training popular architectures with STAIR Actions are described in Section 5. Finally, we conclude this paper in Section 6.
2 Related Work

Over the last two decades, several action video datasets have been developed. In the 2000s, relatively small datasets consisting of tens to hundreds of videos were constructed. For example, the KTH dataset consists of 600 monochrome videos with six action categories [5], the Hollywood dataset includes 430 short video clips extracted from 32 movies with eight action categories [6], and the UCF11 dataset contains videos obtained from YouTube with 11 sports-related action categories [7].

Subsequently, datasets have become larger because a large number of videos can easily be gathered from social media such as YouTube. The HMDB51 dataset includes 6,849 videos, and its action labels comprise both indoor and outdoor actions [8]. The UCF101 dataset is an extension of UCF50, which mostly consists of sports actions [9]. UCF101 includes 13,320 videos with 101 action labels, each of which can be classified into one of five large classes: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.

Since 2015, several datasets that have a large number of categories and/or videos have been constructed. Version 1.3 of ActivityNet contains 23,064 video clips extracted from YouTube videos, each of which is annotated with one of 200 action labels [10]. Kinetics contains 400 human action categories with at least 400 video clips for each action [11]. Each clip lasts around 10 s and is taken from a different YouTube video. The AVA 2.0 dataset provides 80 atomic visual actions densely annotated in 192 15-min video clips, resulting in 740,000 action labels in total [12].

STAIR Actions is a large video dataset for human action recognition that is competitive in size, with respect to both videos and action labels, with ActivityNet, Kinetics, and AVA. Moreover, the action labels of STAIR Actions are focused and fine-grained; that is, all the videos in STAIR Actions are related only to everyday home actions. In Section 4, we compare the characteristics of STAIR Actions with those of ActivityNet, Kinetics, and AVA, qualitatively and quantitatively.
3 Construction of STAIR Actions

This section explains in detail how we constructed the STAIR Actions dataset.
3.1 Action labels

This subsection explains how we selected the 100 action labels in STAIR Actions. First, we obtained the Japanese basic verb list from Wiktionary (https://en.wiktionary.org/wiki/Appendix:1000_Japanese_basic_words), which contains 170 basic verbs. From the list, we extracted the verbs associated with actions commonly seen in the home and office. Moreover, we added verbs characteristically associated with particular rooms such as the bathroom, kitchen, and living room.

Some verbs change their meanings depending on the object words with which they are used. For example, the verb "open" is related to various actions such as "opening a door," "opening a refrigerator door," and "opening a bottle." Unlike the Moments in Time dataset, which intentionally adopts verbs as action labels [13], we believe it is important to distinguish such actions, even if they share the same verb. Therefore, we defined action labels using the form "verb + object" for such verbs. The selected 100 action labels are listed in Table 1.

Note that the action labels of Kinetics and ActivityNet seem to be selected from keywords that return a large number of videos on YouTube. Unlike these datasets, the labels of STAIR Actions are selected in a top-down manner from actions that need to be recognized in the home and office.
Table 1. 100 actions of STAIR Actions

Kitchen related: drinking, eating meal, eating snack, washing dish, throwing trash, washing hands, opening refrig door, pouring tea or coffee, cutting food, cooking

Washroom related: setting hair, drying hair with blower, making up, manicuring, gargling, brushing teeth, washing face, shaving

Object manipulation: wearing glass, playing with toy, playing board game, using computer, listening to music with headphones, playing computer game, taking photo, using smartphone, using tablet, operating remote control, watching TV, telephoning, gardening, playing guitar, playing piano, blowing flute, standing on chair or table or stepladder, throwing, opening or closing container, smoking, ironing, knitting or stitching, polishing shoe, wearing shoes, sewing, hanging out or capture laundry, folding laundry, wearing tie, putting off cloth, putting on cloth, housecleaning, wiping window, drawing picture, doing origami, reading newspaper, studying, reading book, writing

Multiplayer action: changing baby diaper, bottle-feeding baby, piggybacking someone, holding someone, feeding baby, assisting in getting up, assisting in walking, teaching, nodding, shaking head, speaking, hearing, pointing with finger, caressing head, kissing, doing high five, hugging, stroking animal, shaking hands, bowing, giving massage, passing something, doing paper-rock-scissors, fighting

Solo action: walking with stick, walking, going up or down stairs, jumping on sofa or bed, baby crying, baby crawling, exercising, dancing, running around, clapping hands, sitting down, standing up, sleeping on bed, lying on floor, leaving room, entering room, being angry, being surprised, crying, smiling
Fig. 1. Screenshot of the web annotation system we developed. The system was developed for Japanese people.
3.2 Videos from YouTube

About half of the videos included in STAIR Actions were obtained from YouTube. We annotated action labels for these videos in the following four steps:
Step 1. Gathering videos from YouTube.
Step 2. Extracting 5 s videos from the obtained videos.
Step 3. Annotating the 5 s videos with action labels.
Step 4. Checking the quality of the annotated labels.

First, we gathered videos from YouTube. When searching for videos, the search keywords were set to words and phrases related to the action labels of STAIR Actions. Additionally, we restricted the YouTube search to videos less than 4 min in duration.

Second, animations and slide-shows were removed from the gathered videos. Scenes with no humans in them were identified and removed from the videos. The resulting videos were chopped into short video clips 5 s in duration. We believe that 5 s is long enough to recognize an action, while still keeping the input data size manageable.

The third step is to annotate the 5 s video clips with action labels. Annotation was performed by crowdsource workers. The workers were first shown the annotation guidelines, and their comprehension of the guidelines was tested. Only the workers who passed the test were asked to annotate videos.

To make the annotation work performed by crowdsource workers efficient, we developed the original web annotation system shown in Figure 1. In the system, a worker is shown a video and asked to select one label from 10 labels plus a "not applicable" label. Workers were given only 10 labels because they cannot memorize the guidelines for all 100 actions. A worker is given 10 labels at the beginning of the task and is treated as a specialist for those 10 labels.

The final step is to check the quality of the annotated labels, that is, whether they are correct. This quality checking was also done by crowdsource workers for all the annotated videos, although these workers were different from those who annotated the videos. In the checking process, each video was checked by three workers, and the final label was added to the dataset by majority vote.
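The following is a minimal sketch of the majority-vote aggregation used in this quality-checking step, assuming each video has been reviewed by three workers. The data layout, function name, and the handling of ties and "not applicable" votes are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter

def aggregate_by_majority(checks):
    """Return the final label for a video, or None if no action label wins a majority.

    `checks` is the list of labels chosen by the three checking workers
    (the proposed action label or "not applicable").
    """
    label, votes = Counter(checks).most_common(1)[0]
    if votes >= 2 and label != "not applicable":
        return label   # at least two of the three workers agree on an action label
    return None        # rejected: no majority, or the majority says "not applicable"

# Toy example: three workers review one clip proposed as "washing dish".
print(aggregate_by_majority(["washing dish", "washing dish", "not applicable"]))   # washing dish
print(aggregate_by_majority(["washing dish", "not applicable", "not applicable"]))  # None
```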
3.3 Videos produced by crowdsource workers

For some of the 100 actions in STAIR Actions, it is easy to find corresponding movies on YouTube just by searching with the action labels as keywords. For example, movies of actions such as "cooking," "playing guitar," and "dancing" can be collected rather easily. This is because these action names are popularly used as tags characterizing those movies.

However, there are many actions in STAIR Actions that are rarely tagged in YouTube videos. They are ordinary and common actions. Actions such as "entering a room" and "leaving a room" frequently appear in many videos, but nobody tags those scenes with those action names. Evidently, these actions are not considered worthy of tagging.

To collect videos of those actions, we asked crowdsource workers to shoot them. Given an action title, a worker was asked to shoot that action at an arbitrary time of day, with any number of actors, and in any place (but preferably in the home). The duration of each video is 5–6 s. There are several risks associated with this method. The biggest one is that someone could send a video extracted from a copyrighted movie. To avoid this, workers were requested to place a paper sign reading "Stair Lab" anywhere in the scene. Figure 2 shows how this sign is placed in a scene. Almost half of the videos in STAIR Actions were produced in this way.

Fig. 2. Example of a video for the action "leaving the room."

Comparing these videos with the YouTube videos, there are several differences:
– There are many videos in which the same person performs the same action. To humans, they all look the same, but pixel-wise they are different because they were shot under different settings; that is, at least one of the following conditions differs: camera angle, camera distance, and person's clothing.
– The actions in the videos are more or less staged performances. Actions are always in the center of the scene, and the beginning and end of an action are apparent. On the contrary, the actions in YouTube videos are wild. Sometimes it is difficult to see the action.
– The same person performs various actions in the same place. Hence, a place is not correlated with a specific action.
4 Characteristics of STAIR Actions

In this section, we present the characteristics of STAIR Actions obtained by performing quantitative and qualitative analyses and comparisons with existing datasets.
4.1 Statistics and comparison with existing datasets

First, we present the basic statistics of STAIR Actions in Table 2. Note that because we split the video clips of STAIR Actions into training and validation sets uniformly at random, the distributions of the number of videos per category for the two sets are nearly identical.

Table 2. STAIR Actions: statistics (numbers of videos in the training set, the validation set, and in total).

STAIR Actions focuses on everyday home actions, so all of its action categories are everyday home actions. Existing action datasets have a more diverse interest in various (indoor and outdoor) actions. Table 3 compares four action datasets, STAIR Actions, ActivityNet [10], Kinetics [11], and AVA [12], with respect to the number of everyday home action categories.

Table 3. Comparison with existing datasets.

We tried to include the Moments in Time dataset [13] in this comparison; however, the thinking behind Moments in Time is so different that we could not compare it with the others. There are several radical features in Moments in Time:
– Category = verb: Each category simply corresponds to a verb. Hence, a verb such as "playing" can contain many action categories such as "playing guitar," "playing a video game," or "playing in the garden," which are regarded as distinct categories in other datasets.
– Non-human actors: Actors in Moments in Time are not limited to humans. The actor could be an animal or a physical object. For example, an actor in the "dropping" category could be a boy, liquid, the jaw of an animated dog, or a boat.

Although Moments in Time seems to be a valuable resource for exploring the relationships between language (verbs) and vision (motion), we exclude it from this dataset comparison.

As for action vocabulary size, the numbers of categories of the four datasets range between 80 and 400. Because the number of everyday home actions is not given for ActivityNet, Kinetics, and AVA, we counted them ourselves. Based on our analysis, the number of such categories ranges between 80 and 215.

STAIR Actions, ActivityNet, and AVA each have approximately 100 everyday home action categories, while Kinetics has twice as many such categories. However, there is not a large overlap among them. In the case of STAIR Actions, almost half of its categories are unique and not shared with other datasets. The largest overlap is with Kinetics, which shares 54 categories with STAIR Actions. Indeed, there are many kinds of home actions in our everyday life. Hence, the probability of two datasets sharing the same action category is not very high as long as the size of the action vocabulary is in the several hundreds.

There are many choices that must be made when one defines action categories. For example, one can define "dancing" as one category, but others could decide to define "tango dancing," "tap dancing," and "salsa dancing" as separate categories. Hence, the granularity of categories varies from dataset to dataset. Compared with STAIR Actions, ActivityNet and Kinetics have finer-grained categories. Indeed, Kinetics has 18 categories for dancing and ActivityNet has five, whereas STAIR Actions has only one category: "dancing."

As shown in Table 3, 49 categories of ActivityNet match 28 STAIR Actions categories, and 122 categories of Kinetics match 54 STAIR Actions categories. These numbers imply that ActivityNet and Kinetics have categories that are, on average, twice as fine as those of STAIR Actions.

Fine-grained categories are easy to define because the meaning of each individual category becomes narrower and clearer; hence a classifier may be easier to develop. However, collecting samples of fine-grained categories becomes more difficult because there are more constraints on each sample.

It is now well known that to obtain better classification accuracy through deep learning, it is critical to have many examples for each category. In the case of ImageNet, that number is 1,000. STAIR Actions and Kinetics are better in this respect: both provide nearly 1,000 videos for each category. The number of videos per category in AVA varies drastically from category to category. The aim of AVA is the multiple annotation/labeling of videos, and hence its dataset does not seem to be constructed for classification.
4.2 Human bodies and body parts in the videos

To train a DNN model to recognize a human action, the video data should contain at least one human body performing that action. This is practically not so easy from the viewpoint of dataset construction. A typical example is "origami." Most videos with the "origami" tag on YouTube contain no body and just show the hands. As a compromise, we accepted videos that include only parts of a body.

It is interesting to see how many videos in STAIR Actions contain a whole body and how many contain just body parts. Hence, we examined each video in the dataset using OpenPose as a body/parts detector [14]. The detection procedure was as follows: every tenth frame was extracted from a video and resized to 256 × 256, OpenPose was applied to each extracted frame, and each video was then categorized according to the priority multi-body > single body > face > hands. Figure 3 shows example frames detected as "single body," "multi-body," and other cases (i.e., "face" and "hands").

Fig. 3. Human parts in different staged scene actions: "blowing flute," "bowing," and "playing piano."

Figure 4 shows the result. Although the check is based on tenth-frame sampling, it shows that 39.4% of the videos include at least one frame containing multiple bodies, 51.6% of the videos include a single body, and so on. No body parts are detected in only 4.4% of the videos. Note that more than 91% of the videos in the dataset contain a body image (in the OpenPose sense).

We performed the same body/parts detection check for Kinetics. Figure 5 compares STAIR Actions and Kinetics with respect to body and body-part appearances. Because Kinetics videos are "wild" videos collected from YouTube, OpenPose body/parts detection fails in many cases. This indicates that the STAIR Actions videos contain more human body images than those of Kinetics, which may help in training DNNs to recognize actions.

Fig. 4. Distribution of human actors/body parts in STAIR Actions videos.
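The following is a minimal sketch of the per-video categorization described above (tenth-frame sampling and the priority multi-body > single body > face > hands). The keypoint detector is abstracted behind a hypothetical detect_people function because the exact OpenPose invocation is not specified in the text.

```python
import cv2  # OpenCV, assumed available for frame extraction

def detect_people(frame):
    """Hypothetical wrapper around OpenPose: returns a list of detections,
    each a dict with booleans for 'body', 'face', and 'hands'."""
    raise NotImplementedError

def categorize_video(path):
    """Categorize one video with the priority multi-body > single body > face > hands."""
    priority = ["none", "hands", "face", "single body", "multi-body"]
    capture = cv2.VideoCapture(path)
    best = "none"
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % 10 == 0:                                  # examine every tenth frame
            people = detect_people(cv2.resize(frame, (256, 256)))
            bodies = sum(1 for p in people if p["body"])
            if bodies >= 2:
                label = "multi-body"
            elif bodies == 1:
                label = "single body"
            elif any(p["face"] for p in people):
                label = "face"
            elif any(p["hands"] for p in people):
                label = "hands"
            else:
                label = "none"
            if priority.index(label) > priority.index(best):  # keep the highest-priority label
                best = label
        index += 1
    capture.release()
    return best
```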
4.3 Notable features of STAIR Actions

STAIR Actions is a large video dataset of everyday human actions. Each video included in the dataset is a short video clip 3–10 s in length, mostly 5–6 s in length. The annotation for each video is just one label out of 100 action labels. There are no bounding boxes, no label taxonomy, and no start and end times. Moreover, STAIR Actions has the following notable features.
Paired actions where direction matters. STAIR Actions contains paired actions for which the direction of the state change matters. Examples are pairs such as "sitting down" ↔ "standing up," "entering a room" ↔ "leaving a room," and "putting on clothing" ↔ "taking off clothing." A static image taken from such paired actions never helps to discriminate which one of the pair is correct. To discriminate the actions in a pair, the model needs to recognize the nature of the temporal change. This is the core of action recognition.

Emotional actions. STAIR Actions contains emotional actions such as "being angry," "being surprised," "crying," and "smiling." Those actions are sometimes very subtle and may be difficult to recognize.

Similar gadgets. To recognize actions related to gadget manipulation, it is important to recognize the gadget in the scene. In STAIR Actions, there are several gadget manipulation categories such as "using smartphone," "taking photo," "using tablet," and "operating remote control." Because these devices look similar in many cases, distinguishing such actions may be difficult.

Fig. 5. Comparison of the distribution of human actors/body parts in STAIR Actions and Kinetics videos. OpenPose was used as the human body/parts detector.
5 Benchmark Experiments

In most recent studies on human action recognition, DNNs are used. Such DNN-based models are best trained using a massive number of videos with correct action labels. DNN-based models can be roughly classified into three types of architectures. The first is a 2DCNN+LSTM, which first extracts features from each frame in a video using a two-dimensional (2D) convolutional neural network (CNN), feeds the sequence of features into an LSTM (a recurrent neural network), and finally recognizes an action [15]. The second is a 3DCNN, which extends a 2DCNN along both the space and time axes [4, 17]. The third is called a two-stream CNN, which combines two different 2DCNNs that capture the spatial and temporal structures in videos [16].
5.1 LRCN

A video can be seen as a sequence of images. Donahue et al. proposed a model called the LRCN for action recognition, which (1) extracts image features from each frame of an input video using a 2DCNN and then (2) extracts temporal features from the sequence of image features using an LSTM [15].

We implemented the LRCN from scratch, referring to their paper. For the 2DCNN in our implementation, we used AlexNet [18] pretrained on the ImageNet dataset (ILSVRC2012). As preprocessing, each frame of an input video was resized to 256 × 256 and then randomly cropped to a 224 × 224 frame. Image features were extracted by AlexNet from each of 30 randomly selected consecutive frames. Here, the image features are the output of the sixth fully-connected layer (i.e., fc6) in AlexNet. The 30 image features were then fed into a single-layer LSTM with 256 hidden units. After all the image features were fed into the LSTM, the output of the LSTM was transformed into the probabilities of the action labels by applying a fully-connected layer.
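Below is a minimal sketch of this pipeline, written in PyTorch for illustration (the paper does not state which framework was used for the LRCN). It follows the description above: fc6 features of a pretrained AlexNet for each of 30 frames, a single-layer LSTM with 256 hidden units, and a final fully-connected layer over the 100 action labels.

```python
import torch
import torch.nn as nn
from torchvision import models

class LRCN(nn.Module):
    def __init__(self, num_classes=100, hidden_size=256):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.features = alexnet.features          # convolutional layers of AlexNet
        self.avgpool = alexnet.avgpool
        self.fc6 = alexnet.classifier[:2]          # dropout + the first fully-connected layer (fc6)
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                      # clips: (batch, 30, 3, 224, 224)
        b, t = clips.shape[:2]
        x = clips.flatten(0, 1)                    # fold time into the batch for the 2D CNN
        x = self.avgpool(self.features(x)).flatten(1)
        x = self.fc6(x).view(b, t, -1)             # per-frame 4096-d fc6 features
        _, (h, _) = self.lstm(x)                   # final hidden state after all 30 frames
        return self.classifier(h[-1])              # logits over the action labels

logits = LRCN()(torch.randn(2, 30, 3, 224, 224))   # two clips of 30 cropped frames
print(logits.shape)                                # torch.Size([2, 100])
```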
5.2 Two-stream CNN

Simonyan et al. proposed the two-stream convolutional neural network (CNN) model for action recognition [16]. The model consists of two ConvNet architectures, one for spatial information and the other for temporal information. The softmax scores of the two streams are combined either by averaging or by using them as input for training a multi-class linear SVM [19].

We implemented the two-stream model from scratch with reference to the paper [16]. For the spatial stream, the layer configuration is the same as that of [16], and we pretrained the spatial model on ILSVRC2012. In every iteration, we randomly selected a frame from an input video, resized it to 256 × 256, and randomly cropped it to 224 × 224; the temporal stream takes as input a stack of optical flow fields [20] computed over 10 consecutive frames. More details are described in [16].

To train the two-stream model, we set the mini-batch size to 32 and the initial learning rate to 0.001 for both streams. Because the spatial model was pretrained, only its last layer was trained. The training result of the spatial model is shown in Figure 6(a); the learning rate was multiplied by 0.1 at epochs 140 and 190. We trained the temporal model from scratch; the learning rate was multiplied by 0.1 at epoch 330, and the training result is shown in Figure 6(b). We used Chainer v3 [21] on a server with eight GeForce GTX 1080Ti GPUs.

Fig. 6. Performance of the two-stream CNN on STAIR Actions: (a) training result for the spatial CNN; (b) training result for the temporal CNN.

Table 4 shows the results of four configurations (the spatial model, the temporal model, the two-stream model that averages the scores of the two streams, and the two-stream model that trains an SVM on the two stream scores) on three datasets: HMDB51, UCF101, and STAIR Actions.

Table 4. Two-stream model accuracies on HMDB51, UCF101, and STAIR Actions.
                Spatial   Temporal   Avg. fusion   SVM fusion
HMDB51           40.5%     54.6%       58.0%         59.4%
UCF101           73.0%     83.7%       86.9%         88.0%
STAIR Actions    70.4%     54.1%       73.1%         73.7%
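A minimal sketch of the two fusion schemes reported in Table 4 is given below, assuming the per-video softmax scores of the two streams are already available as NumPy arrays; the variable names and the SVM hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def average_fusion(spatial_scores, temporal_scores):
    """Average the two streams' softmax scores and take the arg-max label per video."""
    return np.argmax((spatial_scores + temporal_scores) / 2.0, axis=1)

def svm_fusion(train_spatial, train_temporal, train_labels, test_spatial, test_temporal):
    """Train a multi-class linear SVM on the concatenated stream scores, then predict."""
    svm = LinearSVC(C=1.0)
    svm.fit(np.hstack([train_spatial, train_temporal]), train_labels)
    return svm.predict(np.hstack([test_spatial, test_temporal]))

# Toy usage with random scores over the 100 STAIR Actions categories.
rng = np.random.default_rng(0)
spatial, temporal = rng.random((8, 100)), rng.random((8, 100))
print(average_fusion(spatial, temporal))
```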
5.3 3DCNN

3D CNNs with spatio-temporal 3D kernels (3DCNNs) have been expected to be suitable for action recognition from video. However, because of their large number of parameters, it has been difficult to train them without overfitting.
Kay et al. [11] and Hara et al. [4, 17] demonstrated that, using a large video dataset such as Kinetics, a 3DCNN can be trained without falling into overfitting and can achieve performance competitive with other modern models such as the two-stream models.

We experimented with a 3DCNN model called ResNet-34, which was developed by Hara et al. (https://github.com/kenshohara/3D-ResNets-PyTorch). Although ResNet-34 is neither the deepest nor the best-performing of their 3DCNN models, its performance is promising. Most of the model parameters are the same as those described in [4, 17], except for the number of categories and the sample duration, which is the number of consecutive frames to be processed. We experimented with sample durations of 16, 30, and 60 frames. The results are shown in Table 5. There are advantages and disadvantages with respect to the length of the sample duration: shorter durations risk missing critical moments of an action, while longer durations may include irrelevant actions. In addition, longer durations require more parameters to be trained. From the results in Table 5, a duration of 30 frames seems a good compromise. Figure 7 shows the validation loss and accuracy over 200 epochs.

Table 5. Accuracy of the 3D ResNet-34 on STAIR Actions for sample durations of 16, 30, and 60 frames.

Fig. 7. Performance of the 3D ResNet-34 (sample duration = 30 frames) on STAIR Actions: validation loss and accuracy over 200 epochs.

Table 6 summarizes the accuracies of the three models on the three datasets.

Table 6. Accuracy of various models on HMDB51, UCF101, and STAIR Actions. The 3DCNN scores for HMDB51 and UCF101 are taken from Hara et al. [4], which used a Kinetics-pretrained 3D ResNet-34. The 3DCNN score for STAIR Actions is our result for 30 sample frames. The accuracy of the LRCN on STAIR Actions will be provided in a later version of the paper.
                LRCN    Two-stream CNN   3DCNN
HMDB51          53.3%       59.4%        59.1%
UCF101          87.2%       88.0%        87.7%
STAIR Actions    N/A        73.7%        76.5%

It is interesting to compare the performance of the same 3DCNN (i.e., ResNet-34) on STAIR Actions and Kinetics. Hara et al. [4] report that ResNet-34 achieved 60.1% top-1 accuracy on Kinetics. The better performance (76.5%) of the same model on STAIR Actions may be because STAIR Actions has a smaller number of categories than Kinetics and its videos contain more body/parts images. Moreover, using deeper models such as ResNeXt-101 or Wide ResNet-50 may make it possible to achieve better accuracy on STAIR Actions.
6 Conclusion

A new video dataset of everyday human actions, STAIR Actions, was introduced. It contains 100 everyday human action categories with an average of about 1,000 trimmed video clips per category. The clips were taken from YouTube or created by crowdsource workers.

STAIR Actions is the first large video dataset of everyday human actions with a balanced distribution of videos over categories. Through experiments with well-known action recognition models, STAIR Actions was shown to be able to train large models such as 3DCNNs.

As for future work, we plan to continuously publish subsequent versions of STAIR Actions with further cleaning and fine tuning. In addition, we are developing a Japanese caption dataset for STAIR Actions that will be published in the future. Further experiments with sophisticated models, such as I3D [22] and 3D ResNeXt-101 [4], should also be performed. It would also be interesting to study how models pretrained on STAIR Actions perform on UCF101 and HMDB51.
Acknowledgement
This project is supported by NEDO (New Energy and Industrial Technology Development Organization), Japan. We would like to thank Yutaro Shigeto for his helpful suggestions and comments.
References
1. Cheng, G., Wan, Y., Saudagar, A.N., Namuduri, K., Buckles, B.P.: Advances in Human Action Recognition: A Survey. (2015) 1–30
2. Wu, D., Sharma, N., Blumenstein, M.: Recent Advances in Video-Based Human Action Recognition Using Deep Learning: A Review. In: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE (May 2017) 2865–2872
3. Lecun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature (7553) (2015) 436–444
4. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? arXiv preprint arXiv:1711.09577 (2017)
5. Schuldt, C., Laptev, I., Caputo, B.: Recognizing Human Actions: A Local SVM Approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 3, IEEE Computer Society (2004) 32–36
6. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning Realistic Human Actions from Movies. In: 26th IEEE Conference on Computer Vision and Pattern Recognition (2008)
7. Jingen, L., Jiebo, L., Mubarak, S.: Recognizing Realistic Actions from Videos "in the Wild". In: IEEE International Conference on Computer Vision and Pattern Recognition (2009)
8. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A Large Video Database for Human Motion Recognition. In: International Conference on Computer Vision, IEEE (Nov 2011) 2556–2563
9. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. Technical Report (November 2012)
10. Fabian Caba Heilbron, Victor Escorcia, B.G., Niebles, J.C.: ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 961–970
11. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The Kinetics human action video dataset. CoRR abs/1705.06950 (2017)
12. Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D.A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., Malik, J.: AVA: A video dataset of spatio-temporally localized atomic visual actions. CoRR abs/1705.08421 (2017)
13. Monfort, M., Zhou, B., Bargal, S.A., Yan, T., Andonian, A., Ramakrishnan, K., Brown, L., Fan, Q., Gutfruend, D., Vondrick, C., et al.: Moments in Time dataset: one million videos for event understanding
14. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR (2017)
15. Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
16. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. CoRR abs/1406.2199 (2014)
17. Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3D residual networks for action recognition. In: Proceedings of the ICCV Workshop on Action, Gesture, and Emotion Recognition (2017)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (2012) 1–9
19. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (Dec 2001) 265–292
20. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: European Conference on Computer Vision, Springer (2004) 25–36
21. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015)
22. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. CoRR abs/1705.07750