All About Knowledge Graphs for Actions
Pallabi Ghosh, Nirat Saini, Larry S. Davis, Abhinav Shrivastava
Abstract
Current action recognition systems require large amounts of training data to recognize an action. Recent works have explored the paradigm of zero-shot and few-shot learning to learn classifiers for unseen categories or categories with few labels. Following similar paradigms in object recognition, these approaches utilize external sources of knowledge (e.g., knowledge graphs from language domains). However, unlike objects, it is unclear what the best knowledge representation for actions is. In this paper, we intend to gain a better understanding of knowledge graphs (KGs) that can be utilized for zero-shot and few-shot action recognition. In particular, we study three different construction mechanisms for KGs: action embeddings, action-object embeddings, and visual embeddings. We present an extensive analysis of the impact of different KGs in different experimental setups. Finally, to enable a systematic study of zero-shot and few-shot approaches, we propose an improved evaluation paradigm based on the UCF101, HMDB51, and Charades datasets for knowledge transfer from models trained on Kinetics.
Keywords
Zero-shot/Few-shot action recognition · Knowledge graphs · Graph Convolution Networks
Department of Computer Science, University of Maryland, College Park, MD, USA
E-mail: {pallabig, nirat, lsd, abhinav}@cs.umd.edu

1 Introduction

Action recognition has seen rapid progress in the past few years, including better datasets [Gu et al., 2018, Kay et al., 2017] and stronger models [Carreira and Zisserman, 2017, Diba et al., 2017, Qiu et al., 2017, Tran et al., 2018, Wang et al., 2016, Xiang et al., 2018, Zhang et al., 2017a]. Despite this progress, it is not easy to train an action classifier for a new category. A potential solution is to leverage the knowledge from seen or familiar categories to recognize unseen or unfamiliar categories. This is the zero-shot learning paradigm, where we transfer or adapt classifiers of related, known, or seen categories to classify unseen ones. Similarly, for few-shot action recognition, instead of testing on completely unseen classes, we have a few labeled samples from the test classes, which help in learning about the rest of the test samples.

Both zero-shot and few-shot learning methods have been studied widely for image classification. One recent technique involves building a knowledge graph (KG) representing relationships between seen and unseen classes and then training a graph convolutional network (GCN) on this KG to transfer classifier knowledge from seen to unseen classes [Wang et al., 2018]. Using the same technique for action recognition is hard since, unlike objects, it is unclear what the best knowledge representation for actions is. One of the reasons, as observed in [Gentner, 1981], is that verbs have broader definitions and conflicting meanings.

In this work, we study the performance improvements obtained by using different types of KGs for zero-shot and few-shot action recognition (Fig. 1). The primary step in building a KG is generating a good implicit representation for action classes. In image classification, standard word embeddings (word2vec, GloVe, ConceptNet, etc.) capture the semantic knowledge associated with well-defined class names. However, for action classification, class names vary from single words (“sit”, “stand”, etc.) to phrases (“shooting ball”, not “playing baseball”), and there are multiple definitions of the same (or similar) action class(es), such as “apply eye makeup” or “put on eyeliner”. Such diversity is less pronounced in image classification tasks due to the simplicity of labels. Our first contribution is studying different implicit representations for action classes and showing the advantages of a sentence2vec model in capturing the semantics of word sequences for zero-shot and few-shot action recognition.
Fig. 1 We experiment with different knowledge graphs (KGs), using word- and visual-feature-based embeddings, for zero-shot and few-shot learning of actions. For zero-shot learning of actions, we construct a KG using action class names (i.e., KG1, panel (a)) and a KG using the associated verbs and nouns (i.e., KG2, panel (b)). For few-shot learning, we additionally use a KG with visual features (i.e., KG3, panel (c)) built from a few examples of the test classes. The panels show (a) an action-names KG, (b) a verb-noun KG, and (c) a visual KG.
Our second contribution is building an explicit relationship map from these implicit representations of action classes. In image classification, the explicit representations for transferring knowledge from seen to unseen categories use attributes or external KGs. Several datasets provide labeled class-attribute pairs (e.g., AwA [Lampert et al., 2009], aYahoo [Farhadi et al., 2009], COCO-Attributes [Patterson and Hays, 2016], MITstates [Isola et al., 2015], etc.). Similarly, many KGs have nodes that correspond to image classification classes (e.g., WordNet [Miller, 1995], NELL, and NEIL [Carlson et al., 2010, Chen et al., 2013, Wang et al., 2018]). In contrast, such sources are scarce for action classes. WordNet contains verbs and can therefore be used to construct a KG for verbs, but we cannot have a KG with nodes representing an entire phrase (e.g., “playing (verb) guitar (noun)”) for an action class; instead, there would be separate nodes for verbs and objects with defined inter-relationships. ConceptNet [Speer et al., 2017] has some phrases, but the list is not exhaustive and many label names in our datasets are not present in ConceptNet. We instead build a KG with explicit relationships between the multi-word action phrases of any dataset. We append each dataset with action classes from other datasets and construct two KGs, one for nouns and one for verbs, either by splitting the action phrase in cases like “playing (verb) guitar (noun)” or by using WordNet to get the nearest noun in cases like “cake” (noun) for the action class “baking” (verb). Further, we build a KG for few-shot learning using the mean features of training data points per class. We combine this KG with the two KGs defined previously and observe a performance improvement.

Finally, most previous work on zero-shot action recognition uses image-based models to estimate actions in videos. Recent advances in action recognition have led to networks trained on video datasets being used as feature extractors. This requires an improved evaluation paradigm, since the action classes in the training set cannot appear in the test set. We manually check for commonalities between the training dataset (Kinetics) and the testing datasets (UCF101, HMDB51, Charades); we could not resolve overlaps within Kinetics itself, which is a huge dataset and can have videos common across multiple classes, so we keep all Kinetics classes in the training set and remove the classes Kinetics has in common with UCF101, HMDB51, and Charades from the test sets. Hence, our third contribution is the creation of this evaluation paradigm using the UCF101, HMDB51, Charades, and Kinetics datasets.

In summary, our three main contributions are:
– a better implicit representation of action phrases (which are word sequences) using sentence2vec;
– a comparative study of different KGs for zero-shot/few-shot action learning;
– an improved evaluation paradigm for zero-shot/few-shot action recognition using networks trained on video datasets as feature extractors.
Together, these three contributions build an integrated approach for both zero-shot and few-shot learning.
2 Related Work

Action Recognition: A significant performance boost in state-of-the-art action recognition was observed with improved dense trajectories [Wang and Schmid, 2013] and 3D ConvNets [Ji et al., 2013], which capture deep spatio-temporal features instead of handcrafted ones. Thereafter, multiple ideas evolved, such as single-stream networks [Karpathy et al., 2014], two-stream networks [Simonyan and Zisserman, 2014], end-to-end encoder-decoder based architectures [Donahue et al., 2015, Tran et al., 2015, Yao et al., 2015], and combining different streams with convolutional networks [Feichtenhofer et al., 2016, Wang et al., 2016]. Recent studies include [Carreira and Zisserman, 2017, Diba et al., 2017, Qiu et al., 2017, Tran et al., 2018, Wang et al., 2016, Xiang et al., 2018, Zhang et al., 2017a]. We use the I3D model pre-trained on Kinetics, described in [Carreira and Zisserman, 2017], to extract and learn features of the input videos.
Zero-Shot Action Recognition:
Zero-shot learning (ZSL) refers to the task of learning to predict on classes that are excluded from the training set [Palatucci et al., 2009]. Various studies address ZSL for image classification and object detection [Changpinyo et al., 2016, Kodirov et al., 2015, Lampert et al., 2014, Sung et al., 2018a], as well as for action recognition [Alexiou et al., 2016, Gan et al., 2016, Hahn et al., 2019, Jain et al., 2015a, Jain et al., 2015b, Mettes and Snoek, 2017, Xu et al., 2015, Xu et al., 2016, Xu et al., 2017, Zhu et al., 2018]. To the best of our knowledge, most other zero-shot action recognition papers are not GCN-based, even though GCNs have been shown to outperform traditional zero-shot techniques for image classification [Wang et al., 2018]. While [Gao et al., 2019] is GCN-based, their KG is very different from ours. They construct a single KG with actions and objects using ConceptNet [Speer et al., 2017], where nodes are connected based on word embeddings, and use visual object features as a second channel, interconnected with the same edge weights, to improve zero-shot learning. The number of objects in their graph does not depend on the number of action classes; they report their best results when selecting the 2000 most common visible objects in their dataset as object nodes, which requires access to the unlabelled test data (transductive). We use separate KGs for actions, verbs, and nouns and fuse them at the end with a fusion layer. Our verbs and nouns depend only on the action label and use no visual information (inductive). We compare our results with [Gao et al., 2019], [Romera-Paredes and Torr, 2015], and [Zhang et al., 2017b], where [Romera-Paredes and Torr, 2015] uses a network of two linear layers to learn relationships between features, attributes, and classes, while [Zhang et al., 2017b] maps the language embedding directly into the image feature space instead of an intermediate space.
Few-Shot Action Recognition:
Few-shot learning for image classification has been explored using meta-learning, which learns sample distances and decision boundaries in the embedding space [Ren et al., 2018, Snell et al., 2017, Sung et al., 2018b], or by learning an optimization algorithm that generalizes across datasets [Mishra et al., 2018b, Ravi and Larochelle, 2017]. A benchmark for few-shot image classification is created in [Hariharan and Girshick, 2017]. For action recognition, studies propose embedding a video as a matrix [Yang et al., 2018, Zhu and Yang, 2018], using deep networks [Mishra et al., 2018a] or generative models [Kumar Dwivedi et al., 2019, Mishra et al., 2018a], and using human-object interaction [Kato et al., 2018]. We explore GCN-based few-shot learning for action recognition, but our approach cannot be compared to many of these approaches for two reasons: 1) each paper uses a different dataset split, and our splits differ as well because we use a network pre-trained on Kinetics in our pipeline; and 2) we do not evaluate the episodic learning formulation used by several other papers. Our aim is to improve few-shot learning using the KG constructed for the zero-shot setting (relationships of class names, etc.), thereby building a unified zero-shot and few-shot learning framework, which, to the best of our knowledge, has not been explored before.
Knowledge Graphs and Graph Convolution Networks:
KGs have been used to improve performance in different visual applications [Marino et al., 2016, Fang et al., 2017]. Automatic construction of large KGs and relationship learning have received a lot of attention in the past [Bordes and Gabrilovich, 2014, Choudhury et al., 2017, Gao et al., 2019, Lin et al., 2015]. We focus on constructing a KG that depicts the inter-relationships of action categories. [Gentner, 1981] shows how verbs and nouns have different levels of complexity, and an action phrase usually comprises both or just the verb. We explore different KGs, including one with verbs and nouns only, to understand how these knowledge graphs improve performance for action recognition in the zero-shot and few-shot learning setups. To process graphs using deep learning algorithms, graph convolution networks (GCNs) have been used for a number of different applications, including action recognition [Yan et al., 2018, Ghosh et al., 2020]. Some of the initial works on GCNs include [Atwood and Towsley, 2016, Defferrard et al., 2016, Duvenaud et al., 2015, Henaff et al., 2015].
Fig. 2 System overview: We use knowledge graphs based on word embeddings (action class names and associated verbs and nouns) and visual features for action recognition. With the word-embedding-based knowledge graph we propose a zero-shot learning approach, and with the visual-feature-based knowledge graph we propose a few-shot learning approach. In both setups, the pipeline feeds word embeddings or visual features through a knowledge graph and a graph convolution network, and the output graph is compared against I3D classifier weights for seen- and unseen-class classifiers.
3 Approach

The Graph Convolutional Network implementation of [Kipf and Welling, 2017] is used to train on our KG and transfer classifier-layer weights from trained classes to unseen test classes. The GCN operation can be described by the equation

$$H^{l+1} = g(H^l, A) = \sigma\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^l W^l\right)$$

where $\hat{A} = I + A$ and $A$ is the adjacency matrix consisting of edge weights between nodes, $\hat{D}$ is the node-degree matrix of $\hat{A}$, and $H^l$ and $W^l$ are the $N \times d_l$ input matrix of the $l$-th layer and the $d_l \times d_{l+1}$ weight matrix, respectively. $N$ is the number of nodes in the graph, $d_l$ is the dimension of the $l$-th layer, and $\sigma$ is a non-linear activation function (e.g., ReLU).
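To make the propagation rule concrete, the following is a minimal PyTorch sketch of a single GCN layer implementing the equation above; the class name and choice of ReLU are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN propagation step: H^{l+1} = sigma(D^-1/2 (I+A) D^-1/2 H^l W^l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # the layer weights W^l

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # A_hat = I + A
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)             # diagonal of D_hat^-1/2
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        return torch.relu(A_norm @ self.W(H))               # sigma = ReLU
```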
Zero-shot/few-shot action recognition using a GCN (Fig. 2) follows a technique similar to [Wang et al., 2018] and consists of the training and testing phases described next.

Training: Initially, a model pre-trained on Kinetics is fine-tuned using the training classes of UCF101, HMDB51, or Charades, and the final classifier-layer weights are extracted for training the GCN. The constructed KG, along with its adjacency matrix, is the input to the GCN. The output of each GCN node has the same dimensions as the trained classifier-layer filters (1024 in our case). The GCN is trained such that its output for the training classes matches the classifier-layer weights of the trained I3D model, using a mean squared error (MSE) loss.

If there are $C_{train}$ training classes and $C_{test}$ test classes, and the output feature dimension of each class is $d$, then the output of the GCN, $W_{GCN}$, has size $(C_{train} + C_{test}) \times d$. From $W_{GCN}$, the rows corresponding to the training nodes are selected, denoted $W_{GCN_{train}}$, of size $C_{train} \times d$. This matrix has the same dimensions as the weights $W_{cls}$ of the I3D classifier layer trained or fine-tuned on the training classes of the dataset. The MSE loss that is back-propagated is $\lVert W_{GCN_{train}} - W_{cls} \rVert$.
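A hedged sketch of this training loop follows; the optimizer, epoch count, and variable names are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def train_gcn(gcn, H0, A, W_cls, train_idx, epochs=300, lr=1e-3):
    """Regress GCN outputs at the training nodes onto the fine-tuned I3D
    classifier weights W_cls (C_train x d) with an MSE loss."""
    opt = torch.optim.Adam(gcn.parameters(), lr=lr)
    for _ in range(epochs):
        W_gcn = gcn(H0, A)                        # (C_train + C_test) x d
        loss = F.mse_loss(W_gcn[train_idx], W_cls)
        opt.zero_grad(); loss.backward(); opt.step()
    return gcn

# At test time (next paragraph): P_test = f_test @ W_gcn[test_idx].T
```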
Testing: At test time, the penultimate layer of the I3D model is used to extract the features of the test videos, $f_{test}$, with dimensions $N \times d$. The output of the test nodes of the GCN, with dimensions $C_{test} \times d$, is extracted from $W_{GCN}$ and denoted $W_{GCN_{test}}$. The output class probabilities for the test videos are obtained as $P_{test} = f_{test} W_{GCN_{test}}^{T}$.

4 Constructing Knowledge Graphs for Actions

In this section, we describe the construction of different KGs for actions. We follow a pipeline similar to [Wang et al., 2018] (also described in Section 3), which requires a KG as input. [Wang et al., 2018] use WordNet embeddings to construct the KG for ZSL on image classification. Compared to [Wang et al., 2018], our action label classes are sentences or phrases instead of single words, which is why WordNet or word2vec does not provide distributive and coherent embeddings for action labels. Moreover, obtaining a semantically correlated embedding space across words and visual features for a good KG is another challenge. We describe these challenges, and how we tackle them, while constructing three different versions of KGs for actions.
KG1:
The first KG is based on word descriptors of the action class names. Since our action classes are composed of multiple words, like a sentence or phrase, averaging the word2vec embeddings of all words in the sentence does not provide a cohesive embedding space; we discuss the experimental results for word2vec embeddings in Section 7. To overcome this challenge, we use the sentence2vec model described in [Pagliardini et al., 2018], an unsupervised learning method for learning embeddings of whole sentences. We use the unigram model trained on Wikipedia to generate our sentence embeddings.
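As an illustration, embedding action phrases with the open-source sent2vec package might look like the sketch below; the model filename is an assumption based on the released Wikipedia unigram model.

```python
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model('wiki_unigrams.bin')   # pretrained Wikipedia unigram model (assumed path)

# Multi-word action labels each get a single coherent embedding.
embeddings = model.embed_sentences(['playing guitar', 'apply eye makeup', 'archery'])
print(embeddings.shape)  # (3, embedding_dim); 600 for the wiki unigram model
```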
The node features in KG1 are these sentence embeddings. Nodes for the Kinetics action classes are added to the KG1 corresponding to each dataset (UCF101, HMDB51, and Charades). This is inspired by [Xu et al., 2016, Xu et al., 2015], which show distinct advantages of adding classes and images from other datasets in zero-shot learning. Although we cannot directly add images, due to the way our model is constructed, we add new activity classes from the Kinetics dataset to increase the size of our KGs. Appending the 400 Kinetics classes to UCF101 results in a total of 501 nodes in the KG1 for UCF101; similarly, appending the nodes to HMDB51 and Charades results in totals of 451 and 557 nodes, respectively. We show further performance comparisons with and without the Kinetics nodes in Section 7.

With the sentence2vec node features, we construct KG1 by connecting node $i$ to node $j$ in the combined dataset with edge weight $A_{ij}$ given by the cosine similarity of their node features, where $A$ is the adjacency matrix of KG1. We sort the edge weights in descending order to keep the top $N$ closest neighbors per node. $N$ is a hyperparameter determined experimentally and depends on the dataset: it is 5 for HMDB51 and UCF101 and 20 for Charades. Note that $j$ being among the top $N$ neighbors of $i$ does not imply the reverse. To make the adjacency matrix symmetric, we fill $A_{ji}$ with the same value as $A_{ij}$, so the number of connections per node is at least $N$.
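A minimal NumPy sketch of this adjacency construction, under the assumption that node features are stacked row-wise:

```python
import numpy as np

def build_adjacency(node_feats, top_n=5):
    """Cosine-similarity adjacency keeping each node's top-N neighbors,
    then symmetrized (A_ji = A_ij), so every node has >= N connections.
    top_n is 5 for UCF101/HMDB51 and 20 for Charades."""
    feats = node_feats / np.linalg.norm(node_feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    A = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(-sim[i])[:top_n + 1]   # +1: the node itself ranks first
        A[i, nbrs] = sim[i, nbrs]
    return np.maximum(A, A.T)                    # symmetrize
```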
KG2: The second graph, KG2, is constructed from the verbs and nouns associated with each action class. This graph is inspired by multiple works on zero-shot action recognition using human-object interaction, where the objects detected in the scene are used to draw relationships between seen and unseen action classes [Gao et al., 2019, Jain et al., 2015a]. In [Gao et al., 2019], object detection is carried out in the visual domain and then mapped to the word domain for zero-shot learning. We do not map object features from the visual to the word domain; instead, we take the outputs of the verb and noun graphs (KG2) and pass them through a fusion layer to get the visual action (noun+verb) classifier weights.

To construct KG2, we use a standard language lemmatizer [Bird and Klein, 2009] to break up the phrase describing an action and convert each word to its root form. Then, we use a part-of-speech (POS) tagger [Toutanova et al., 2003] to label each word as a noun or a verb. Still, many action class names do not contain a noun, for example “beatboxing”. For such classes, the POS tagger gives a noun label of “unknown”, and if WordNet can return a noun related to that word, we replace the “unknown” with that noun. For action classes like “archery”, which have no specific verb associated with them, we replace the verb with “doing”. For node features, we compute sentence2vec embeddings, as above, for the verbs and nouns. Hence, we get a set of graphs with only verbs and only nouns, which have the same number of nodes as KG1. These graphs are used and categorized together as KG2, since each provides partial information about the action class (either the verb or the noun). KG1 and KG2 can be used to define the ZSL setup.
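A hedged NLTK sketch of this verb-noun extraction follows; the function name and the exact WordNet fallback are our assumptions, chosen to mirror the behavior described above.

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

# One-time downloads: punkt, averaged_perceptron_tagger, wordnet.
lemmatizer = WordNetLemmatizer()

def split_action_phrase(phrase):
    """Split an action label such as 'playing guitar' into (verb, noun),
    with the paper's fallbacks: 'doing' for a missing verb, and a
    WordNet-related noun for verb-only labels like 'beatboxing'."""
    verbs, nouns = [], []
    for word, tag in pos_tag(word_tokenize(phrase.lower())):
        if tag.startswith('VB'):
            verbs.append(lemmatizer.lemmatize(word, 'v'))
        elif tag.startswith('NN'):
            nouns.append(lemmatizer.lemmatize(word, 'n'))
    if not verbs:
        verbs = ['doing']                       # e.g., "archery" has no verb
    if not nouns:                               # e.g., "beatboxing" has no noun
        for syn in wn.synsets(verbs[0], pos=wn.VERB):
            related = [l for lem in syn.lemmas()
                       for l in lem.derivationally_related_forms()
                       if l.synset().pos() == 'n']
            if related:
                nouns = [related[0].name().replace('_', ' ')]
                break
    return ' '.join(verbs), ' '.join(nouns) if nouns else 'unknown'
```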
KG3:
The third graph is developed to measure the relative performance improvement from incorporating only a few labelled examples per test class. We use averaged visual features as the nodes of KG3. In the visual feature space, we see implicit clustering of similar actions that is sometimes not captured in the word embedding space. For example, “pommel horse” and “horse walking” are considered similar in word embedding space, but they are very different activities, which is captured in the visual embedding space, shown for UCF101 in Fig. 3. We randomly pick 5 videos from each test class and use I3D to generate video features as described in Section 5.2. Taking the mean of these features gives the graph node descriptors, and we take their cosine similarity to generate the adjacency matrix, as we do for KG1 and KG2. This generates a graph based on visual features. KG3 replicates the few-shot learning setup using KGs, since we use 5 visual samples from each test class to construct the nodes. In the few-shot setting, we can combine KG3 with KG1 and KG2 to improve results.
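As a small illustration, the KG3 node descriptors can be computed as below; the dictionary-based interface is our assumption, not the paper's code.

```python
import numpy as np

def kg3_node_features(feats_by_class, k=5, seed=0):
    """KG3 node descriptors: the mean I3D feature of k randomly chosen
    videos per test class (k = 5 in the paper)."""
    rng = np.random.default_rng(seed)
    nodes = {}
    for cls, feats in feats_by_class.items():        # feats: (num_videos, d)
        picks = rng.choice(len(feats), size=k, replace=False)
        nodes[cls] = feats[picks].mean(axis=0)
    return nodes
# The adjacency then uses the same cosine top-N rule as KG1 and KG2.
```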
5 Experiments

5.1 Datasets

Kinetics [Kay et al., 2017]:
Kinetics is a large dataset with 400 classes and about $3 \times 10^5$ videos. We do not actually need access to the Kinetics videos, only the class names and an I3D model pre-trained on Kinetics, available from [Carreira and Zisserman, 2017]. Since we use Kinetics for pre-training I3D and for data augmentation while training the GCN, we cannot keep classes common between Kinetics and UCF101, HMDB51, or Charades in the test set when doing zero-shot learning. So, we use the classes in UCF101, HMDB51, and Charades that are also present in Kinetics as the training set.

UCF101 [Soomro et al., 2012]:
UCF101 has 13320 videos from 101 classes. After removing classes in common with Kinetics, we get 23 classes with 3004 videos in the test set for UCF101, and the remaining 78 classes are used for training. Some test class labels do not have semantically correlated neighbors, so we append these class names with extra words; for example, “front crawl” in UCF101 becomes “front crawl swimming”. We discuss the class-wise accuracy of the test classes in Fig. 5.
HMDB51 [Kuehne et al., 2013]:
HMDB51 has 6849 videos from 51 classes. Similar to UCF101, we remove the classes in common with Kinetics, giving 12 classes with 1541 videos for HMDB51's test set and the remaining 39 classes for training. Additionally, to encourage correlation with the action classes in Kinetics, we convert the class labels to the continuous tense; for example, a class like “eat” uses the sentence2vec embedding corresponding to “eating”.
Fig. 3 t-SNE visualization showing the feature distribution of the UCF101 video dataset. Sample images are added for our test classes. (Best viewed in digital format)
Charades [Sigurdsson et al., 2016]:
Charades has 9848 videos from 157 classes and is a multi-label dataset, meaning each video can have multiple action labels. Charades provides noun and verb labels associated with each action class, which we use directly without labelling ourselves. After removing all videos that have at least one label in common with Kinetics, we are left with 111 possible test classes. Each video can have both training and test labels in Charades; we cannot separate the training and test videos, only the classes. We split the classes into a 50-50 train-test split, giving 79 training and 78 test classes, where the 78 test classes come from the 111 classes not in common with Kinetics. All videos with at least one training class are kept in the training set, with test-class labels removed from them; the remaining videos are test videos, with training-class labels removed from them.

5.2 Feature Extraction

To extract video features, we use the I3D model trained on Kinetics and fine-tune the last layer on the training classes of either UCF101 or HMDB51. For Charades, fine-tuning just the last layer did not yield good classification performance, so we fine-tune the whole network. This means that, while training, we cannot compute the loss on the Kinetics nodes in the KG for Charades. Even after fine-tuning the complete network for Charades, we did not achieve significant zero-shot performance, so we use the inverse cross-correlation of the training features multiplied with themselves as the last-layer weights, inspired by [Romera-Paredes and Torr, 2015], to train the GCN.
Table 1 Zero-shot learning results for all 3 datasets, comparing the performance of KG1, KG2, and a combination of the two. KG1 + KG2 always does the best. For UCF101 and HMDB51 the results are mean accuracy, whereas for Charades we report mean average precision (mAP).

Dataset   KG1    KG2    KG1 + KG2
UCF101    49.14  45.47  50.13
HMDB51    38.01  31.57
Charades  15.81  12.48
We visualize the video feature space distribution of the UCF101 classes in Fig. 3, with some example images for the test classes. As Fig. 3 shows, similar classes are grouped together, forming clusters.

5.3 Our Pipeline

Our GCN consists of 6 layers, with intermediate layer filter dimensions starting at 512. The outputs of the verb and noun graphs are combined through a fusion layer. For zero-shot learning, the GCN uses the adjacency matrix of KG1, and for few-shot learning it uses the adjacency matrix of KG3. For KG1 + KG2 on UCF101, the above fusion technique did not give good performance, so we instead use a weighted sum of the outputs of KG1 and KG2, with a weight of 0.9 for KG1 and 0.05 each for the verb and noun outputs from KG2.
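As a sketch, this late fusion is a fixed weighted combination of the three GCN output weight matrices (the variable names are ours):

```python
def fuse_ucf101(W_kg1, W_verb, W_noun):
    """Weighted-sum fusion used for UCF101: 0.9 KG1 + 0.05 verb + 0.05 noun."""
    return 0.9 * W_kg1 + 0.05 * W_verb + 0.05 * W_noun
```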
6 Results

The results for zero-shot learning on all 23 test classes for UCF101, 12 test classes for HMDB51, and 78 test classes for Charades are in Table 1. These results are based on KG1, KG2, and the combination of both. KG1 and KG2 are combined by passing them through the fusion layer (for HMDB51 and Charades) or by the weighted summation of their outputs (for UCF101). Since all datasets have many action classes without any nouns, KG2 alone did not give good performance, but the combination KG1 + KG2 works well.

We also provide a comparison with the state of the art in Table 2. For our data split, we compare our results with three previous works carried out under similar zero-shot learning settings: ESZSL [Romera-Paredes and Torr, 2015], DEM [Zhang et al., 2017b], and TS-GCN [Gao et al., 2019]. We could not apply the DEM baseline to Charades, since it is a multi-label dataset. Also, TS-GCN only released code for the transductive setup on UCF101; we implemented the inductive version and compare to it. We have also added some recent zero-shot results whose splits differ, whose code is unavailable, or where an essential part of the framework is missing. Note, however, that the recent work of [Gao et al., 2019] outperforms these other approaches on their splits, and we outperform [Gao et al., 2019] on our splits.

We report results for combining KG3 with KG1 and KG2 in Table 3. Since these experiments use KG3, they can be considered a few-shot learning setup. As a baseline, we use a nearest-neighbor search to assign class labels to the test videos: based on the 5 labelled videos provided, we calculate the mean (center) feature for each class and then use the cosine distances between the remaining test videos and these class centers to sort them into the corresponding classes. Our results, along with the baselines, are in Table 3, using the same train-test splits for UCF101 and HMDB51. For both datasets, we get the best results using all 3 KGs. We do not conduct this experiment for Charades: since each video has multiple labels, each video data point would update multiple class centers, resulting in overlapping class distributions.
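A hedged sketch of this nearest-class-center baseline (the function and argument names are our assumptions):

```python
import numpy as np

def nn_baseline(support_feats, query_feats):
    """Assign each query video to the class whose mean support feature
    (center of 5 labelled videos) is closest in cosine distance."""
    centers = {c: f.mean(axis=0) for c, f in support_feats.items()}  # f: (5, d)
    names = list(centers)
    C = np.stack([centers[c] / np.linalg.norm(centers[c]) for c in names])
    Q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    return [names[i] for i in (Q @ C.T).argmax(axis=1)]  # max cosine similarity
```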
7 Ablation Studies

Word2Vec vs. Sentence2Vec: For constructing node features from action labels, we first used the word2vec embeddings trained on Google News [Mikolov et al., 2013a, Mikolov et al., 2013b, Mikolov et al., 2013c]. For each class name, the word2vec embeddings of all its words were averaged to give a resultant embedding for the whole phrase, which serves as the node feature in the KG. In Fig. 4(b), we show the word2vec embedding space around the node “Pommel Horse” and its nearest-neighbor class nodes.
Table 2
Zero-shot learning results for all 3 datasets. The baselines are ESZSL, DEM, Objects2Action, UR, CEWGAN, and TS-GCN. For UCF101 and HMDB51 the results are mean accuracy, whereas for Charades we report mean average precision (mAP), since it is a multi-label dataset.

Method                                 UCF101 (23-78)  UCF101 (50-51)  HMDB51 (12-39)  HMDB51 (25-26)  Charades (78-79)
ESZSL [Romera-Paredes and Torr, 2015]  35.27           15.0            34.16           18.5            17.21
DEM [Zhang et al., 2017b]              34.26           -               35.26           -               -
Objects2Action [Jain et al., 2015a]    -               30.3            -               15.6            -
UR [Zhu et al., 2018]                  -               17.5            -               24.4            -
CEWGAN [Mandal et al., 2019]           -               26.9            -               30.2            -
TS-GCN [Gao et al., 2019]              44.5            34.2            -               23.2            -
Ours                                   50.13           -                               -

Table 3 Few-shot learning results for the UCF101 and HMDB51 datasets. The baseline is nearest neighbor, given 5 videos for each test class. The combination of KG1, KG2, and KG3 does the best in both cases.

Dataset   Baseline  KG3    KG3 + KG1  KG3 + KG2  KG3 + KG1 + KG2
UCF101    52.7      57.04  62.10      59.92
HMDB51    30.2      45.07  45.67      47.61
Fig. 4 (a) Sentence2vec embedding space for the Kinetics and UCF101 classes; the class “uneven bars” and its neighbors are highlighted. (b) The class “pommel horse” and its neighboring classes in the Kinetics dataset using word2vec embeddings; the embeddings of the individual words forming the phrase are also displayed. (Best viewed in digital format)
Averaging the word2vec embeddings of all words in an action class label phrase works in some cases, but it cannot always capture the meaning or the correct relationships between action classes. For a class like “riding or walking with horse” in the Kinetics dataset, the embeddings of the individual words are located far apart from each other, as displayed in Fig. 4(b). The mean of these individual word embeddings does not lie close to related words in the embedding space and hence does not capture meaningful information. To solve this problem, we use the sentence2vec model from [Pagliardini et al., 2018], which captures the semantic meaning of sequences of words. Using this embedding space, the closest match to a class like “uneven bars” is “gymnastics tumbling”.
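For reference, the word2vec-averaging baseline might look like the following gensim sketch; the vectors file is the standard Google News release and the function name is ours.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumes the GoogleNews-vectors-negative300.bin file is available locally.
kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

def phrase_embedding(phrase):
    """Average the word2vec vectors of the in-vocabulary words of a phrase."""
    vecs = [kv[w] for w in phrase.lower().split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else None

# e.g., phrase_embedding('riding or walking with horse')
```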
Table 4
Performance comparison between the word2vec-embedding- and sentence2vec-embedding-based models. Both models are trained on graphs consisting of class nodes from Kinetics and UCF101, with losses on both. The performance metric is mean accuracy.
Method        Mean Accuracy
Word2Vec      38.02
Sentence2Vec  49.14

The embedding space for all the classes in UCF101 and Kinetics is displayed in Fig. 4(a), with “uneven bars” and its neighbors emphasized. We run experiments with both the word2vec embeddings trained on Google News [Mikolov et al., 2013a, Mikolov et al., 2013b, Mikolov et al., 2013c] and the sentence2vec embeddings based on the unigram model trained on Wikipedia [Pagliardini et al., 2018]. The results on UCF101, shown in Table 4, demonstrate a significant improvement from using sentence2vec over word2vec for KG1.
Appending Knowledge Graphs with more action classes:
We augment the UCF101 and HMDB51 KGs with Kinetics class labels in three different configurations. In the first configuration, only the UCF101 or HMDB51 nodes are used in the KG (101/51 nodes), of which 78 and 39 are training nodes, respectively. The loss is computed by comparing the GCN output on these classes to the weights of the final classifier layer of the fine-tuned I3D network. The second configuration uses the same KG as KG1, explained in Section 4; the loss is computed by comparing the output of only the UCF101 or HMDB51 training nodes (78/39 nodes) to the final classifier layer of the fine-tuned I3D network.
Table 5 Experiments with 3 different knowledge graph constructions. The variations use either only UCF101/HMDB51 classes in the knowledge graph or append Kinetics classes, with the training loss computed on UCF101/HMDB51 nodes only or on both UCF101/HMDB51 and Kinetics nodes. The performance metric is mean accuracy.

Knowledge Graph  Nodes for Loss Computation  Mean Accuracy
UCF only         UCF                         27.72
UCF+Kinetics     UCF                         32.85
UCF+Kinetics     UCF+Kinetics                49.14
HMDB only        HMDB                        31.09
HMDB+Kinetics    HMDB                        29.22
HMDB+Kinetics    HMDB+Kinetics               38.01
Table 6 Performance comparison of fully connected (FC) and bipartite graphs constructed from UCF101 or HMDB51 together with Kinetics dataset nodes. Both models are trained on graphs consisting of class nodes from two datasets (UCF101 and Kinetics, or HMDB51 and Kinetics) with losses on both. The performance metric is mean accuracy.

Method     Mean accuracy (UCF)  Mean accuracy (HMDB)
FC         49.14                38.01
Bipartite  33.11                28.49
In the third configuration, KG1 is again used, but now the loss is computed by summing two MSE losses: (a) Loss 1, comparing the output of only the UCF101 or HMDB51 training nodes (78/39 nodes) to the final classifier layer of the fine-tuned I3D network; and (b) Loss 2, comparing the output of the Kinetics nodes (400 nodes) to the classifier-layer weights of the I3D pre-trained on Kinetics. The results of these three experiments are shown in Table 5. For both UCF101 and HMDB51, the third configuration works best.
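A hedged sketch of the third configuration's summed loss (the index and weight names are our assumptions):

```python
import torch.nn.functional as F

def combined_loss(W_gcn, W_cls_finetuned, W_cls_kinetics, ds_idx, kin_idx):
    """Sum of the two MSE losses from the third configuration:
    (a) dataset training nodes vs. fine-tuned I3D classifier weights,
    (b) Kinetics nodes vs. the Kinetics-pretrained I3D classifier weights."""
    return (F.mse_loss(W_gcn[ds_idx], W_cls_finetuned) +
            F.mse_loss(W_gcn[kin_idx], W_cls_kinetics))
```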
Types of connections in Knowledge Graphs:
While constructing the KG from UCF101 or HMDB51 together with Kinetics dataset nodes, we used two types of graph connections. In the fully connected (FC) graph, all nodes can be connected to all other nodes, out of which we select the top 5 connections. In the bipartite graph, for every node from UCF101 or HMDB51, we find the top 5 connections to Kinetics dataset nodes, and vice versa. The fully connected graph works better than the bipartite graph (Table 6).
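A minimal sketch of the bipartite variant, assuming `sim` is the cosine-similarity matrix and the two index arrays are NumPy arrays (names are ours):

```python
import numpy as np

def bipartite_adjacency(sim, dataset_idx, kinetics_idx, top_n=5):
    """Bipartite variant: each UCF101/HMDB51 node keeps only its top-N
    Kinetics neighbors and vice versa; no within-dataset edges."""
    A = np.zeros_like(sim)
    for i in dataset_idx:
        nbrs = kinetics_idx[np.argsort(-sim[i, kinetics_idx])[:top_n]]
        A[i, nbrs] = sim[i, nbrs]
    for j in kinetics_idx:
        nbrs = dataset_idx[np.argsort(-sim[j, dataset_idx])[:top_n]]
        A[j, nbrs] = sim[j, nbrs]
    return np.maximum(A, A.T)   # symmetrized, as for the FC graph
```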
Analysis of Class-wise Accuracy using different Knowledge Graphs:
To understand the impact of using KG1, KG2, and KG3 on learning each test class, we plot the class-wise accuracy for UCF101 and HMDB51 in Fig. 5. Each bar color represents a different KG: blue is the word-based KG1, orange the visual-feature-based KG3, and grey the combination of KG1, KG2, and KG3.
Table 7
Performance comparison of the GCN vs. a linear combination (using the adjacency-matrix edge weights) of the top 4 closest training-class weights for each test class. The performance metric is mean accuracy.

Method              Mean accuracy
GCN                 49.14
Linear Combination  42.57
Table 8
Performance comparison of using a 2-layer encoder-decoder before the GCN layers on the UCF101 dataset vs. not using one. The performance metric is mean accuracy.

Method                   Mean accuracy
Without encoder-decoder  49.14
With encoder-decoder     47.72

As observed in Fig. 5, for a few classes, such as “billiards”, “talk”, and “playing tabla”, KG1 performs the best. These classes innately have many neighbors in the word embedding space, which helps in learning them from the given training classes. A few other classes, such as “front crawl swimming”, “pommel horse gymnastics”, “chew food”, and “pour liquid”, also perform well with just KG1, since we add the extra words “swimming”, “gymnastics”, “food”, and “liquid”, respectively, to enforce good neighbors in the language domain. Intuitively, KG3 does well for “uneven bars”, “fall floor”, “smile”, and “shoot gun”, since these have distinct visual features. The combination KG works well for “still rings”, “parallel bars”, “jumping jack”, “playing dhol”, “climb stairs”, “talk”, and “wave”.
Ablation for Network Architecture:
We experiment with different numbers of GCN layers (2, 4, 6, 8, and 10) to explore the influence of GCN depth on performance for both UCF101 and HMDB51. Increasing the number of layers increases smoothing, while decreasing the number of layers causes less information propagation. We found that 6 layers gives the best performance.
Usefulness of GCN vs. a linear combination of training class weights:
To show the performance improvement due to the GCN compared to just linear combinations, we perform an ablation study. For each test class, we find the top 4 neighbors in the training set. Then, using the adjacency edge-connection weights, the classifier-layer weight for the test class is computed as the weighted average of the classifier-layer weights of its neighbors. The performance is in Table 7.
Fig. 5
This figure shows class-wise accuracy for different KGs and combinations of KGs for UCF101 and HMDB51. We added a few words to some labels for better word embeddings (e.g., “pommel horse” becomes “pommel horse gymnastics”), which improves performance for the word-based KG (i.e., KG1), as shown here. Each bar color represents a KG: blue is the word-based KG, orange is the visual-feature-based KG, and grey is the combination of all three KGs (KG1, KG2, and KG3).
Table 9 Results on UCF101 with 10 randomly selected test classes, leaving 91 classes to be used for training the I3D and the GCN. Mean accuracy is used for evaluation. The experiments are carried out 5 times, and the final column provides the mean accuracy over the splits. We compare our results to two previous works with similar settings.
Method  Nodes for Loss Computation  Split 1  Split 2  Split 3  Split 4  Split 5  Mean
ESZSL   -                           61.25    60.30    53.68    64.81    60.56    60.12
DEM     -                           60.87    65.88    41.89    61.90    52.11    56.53
Ours    UCF101                      59.68    48.51    42.18    49.86    43.12    48.67
Ours    UCF101+Kinetics
Using an encoder-decoder before the GCN:
We run another set of experiments in which a 2-layer encoder-decoder network is added before the GCN to improve the encoding of the sentence embedding features. The results show no benefit, as seen in Table 8.
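For concreteness, such a bolt-on encoder-decoder might look like the following; the 600-d input matches the wiki-unigram sent2vec embeddings, while the bottleneck width is our assumption, since the paper does not specify it.

```python
import torch.nn as nn

# A plausible 2-layer encoder-decoder placed before the GCN; the 256-d
# bottleneck is hypothetical.
encoder_decoder = nn.Sequential(
    nn.Linear(600, 256), nn.ReLU(),   # encoder
    nn.Linear(256, 600), nn.ReLU(),   # decoder, back to the embedding size
)
```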
Random train-test splits:
Some of the experiments are done on a random sub-sample of the test-set classes. For UCF101, we choose 10 out of the 23 test classes 5 times, so that for each random sample of 10 test classes, the remaining 91 classes form the training set. The mean accuracy is calculated after each run, and the results of all 5 runs are averaged to get the final mean accuracy. The results for each of these splits are in Table 9.
Learning classifiers for unknown classes from related classes in the Knowledge Graph:
The heatmaps in Figure 6 depict the test nodes learning from their interconnections to the train nodes in the KG. They are based on CAM [Zhou et al., 2016]. Consider the test class “playing sitar” in UCF101: one of its top 5 nearest train classes in UCF101 is “playing guitar”, and one random class that has no relation is “biking”. Among the five sub-figures in Figure 6, (a) shows the activation from the “playing sitar” class on a “playing sitar” video, (b) the activation from the “playing guitar” class on a “playing guitar” video, (c) the activation from the “playing sitar” class on a “playing guitar” video, (d) the activation from the “biking” class on a “biking” video, and (e) the activation from the “playing sitar” class on a “biking” video. What we show here is that the “playing sitar” classifier is similar to the “playing guitar” classifier, and hence the heatmaps from both are similar; this is not the case between “playing sitar” and “biking”.
Fig. 6
Heatmaps showing activations of various classes' classifier layers on various class videos: (a) the activation from the “playing sitar” class on a “playing sitar” video, (b) from the “playing guitar” class on a “playing guitar” video, (c) from the “playing sitar” class on a “playing guitar” video, (d) from the “biking” class on a “biking” video, and (e) from the “playing sitar” class on a “biking” video. These heatmaps show that the test class “playing sitar” correctly learns from the training class “playing guitar” instead of from the training class “biking”.
8 Conclusion

In this work, we investigate different combinations of knowledge graphs (KGs) for actions that give better performance for zero-shot and few-shot action recognition. We show a significant improvement in zero-shot learning by using a model of word sequences instead of traditional single-word embedding models. Moreover, extending the KG with action classes from other datasets leads to better results. We observe that combining word-based knowledge graphs with visual knowledge graphs helps in few-shot learning, and that combining verb- and noun-based KGs improves both zero-shot and few-shot learning. Dynamically learning the graph weights can be explored in the future.
Acknowledgements
This work was supported by the Air Force, via Small Business Technology Transfer (STTR) Phase I (FA8650-19-P-6014) and Phase II (FA864920C0010), and the Defense Advanced Research Projects Agency (DARPA) via ARO contract number W911NF2020009.
References
Alexiou et al., 2016. Alexiou, I., Xiang, T., and Gong, S. (2016). Exploring synonyms as context in zero-shot action recognition. In IEEE International Conference on Image Processing (ICIP), pages 4190–4194.
Atwood and Towsley, 2016. Atwood, J. and Towsley, D. (2016). Diffusion-convolutional neural networks. In
Advances in Neural Information Processing Systems, pages 1993–2001.
Bird and Klein, 2009. Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media.
Bordes and Gabrilovich, 2014. Bordes, A. and Gabrilovich, E. (2014). Constructing and mining web-scale knowledge graphs: KDD 2014 tutorial. In
Proceedings of the 20th ACMSIGKDD international conference on Knowledge discoveryand data mining , pages 1967–1967. ACM.Carlson et al., 2010. Carlson, A., Betteridge, J., Kisiel, B.,Settles, B., Hruschka, E. R., and Mitchell, T. M. (2010).Toward an architecture for never-ending language learning.In
Twenty-Fourth AAAI Conference on Artificial Intelli-gence .Carreira and Zisserman, 2017. Carreira, J. and Zisserman,A. (2017). Quo vadis, action recognition? a new model andthe kinetics dataset. In proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition , pages6299–6308.Changpinyo et al., 2016. Changpinyo, S., Chao, W.-L.,Gong, B., and Sha, F. (2016). Synthesized classifiers forzero-shot learning. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages5327–5336.Chen et al., 2013. Chen, X., Shrivastava, A., and Gupta, A.(2013). Neil: Extracting visual knowledge from web data.In
Proceedings of the IEEE International Conference onComputer Vision , pages 1409–1416.Choudhury et al., 2017. Choudhury, S., Agarwal, K., Puro-hit, S., Zhang, B., Pirrung, M., Smith, W., and Thomas, M.(2017). Nous: Construction and querying of dynamic knowl-edge graphs. In , pages 1563–1565. IEEE.Defferrard et al., 2016. Defferrard, M., Bresson, X., and Van-dergheynst, P. (2016). Convolutional neural networks ongraphs with fast localized spectral filtering. In
Advances inneural information processing systems , pages 3844–3852.Diba et al., 2017. Diba, A., Fayyaz, M., Sharma, V., Karami,A. H., Arzani, M. M., Yousefzadeh, R., and Van Gool,L. (2017). Temporal 3d convnets: New architecture andtransfer learning for video classification. arXiv preprintarXiv:1711.08200 .Donahue et al., 2015. Donahue, J., Anne Hendricks, L.,Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,K., and Darrell, T. (2015). Long-term recurrent convolu-tional networks for visual recognition and description. In
Proceedings of the IEEE conference on computer vision andpattern recognition , pages 2625–2634.Duvenaud et al., 2015. Duvenaud, D. K., Maclaurin, D.,Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik,A., and Adams, R. P. (2015). Convolutional networks ongraphs for learning molecular fingerprints. In
Advances inneural information processing systems , pages 2224–2232.Fang et al., 2017. Fang, Y., Kuan, K., Lin, J., Tan, C., andChandrasekhar, V. (2017). Object detection meets knowl-edge graphs. In
Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1661–1667.
Farhadi et al., 2009. Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785.
Feichtenhofer et al., 2016. Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 1933–1941.Gan et al., 2016. Gan, C., Yang, Y., Zhu, L., Zhao, D., andZhuang, Y. (2016). Recognizing an action using its name: Aknowledge-based approach.
International Journal of Com-puter Vision , (1):61–77.Gao et al., 2019. Gao, J., Zhang, T., and Xu, C. (2019). Iknow the relationships: Zero-shot action recognition viatwo-stream graph convolutional networks and knowledgegraphs. In
Proceedings of the AAAI Conference on Ar-tificial Intelligence , volume 33, pages 8303–8311.Gentner, 1981. Gentner, D. (1981). Some interesting differ-ences between verbs and nouns. In
Cognition and braintheory , volume 4, pages 161–178.Ghosh et al., 2020. Ghosh, P., Yao, Y., Davis, L., and Di-vakaran, A. (2020). Stacked spatio-temporal graph convo-lutional networks for action segmentation. In
The IEEEWinter Conference on Applications of Computer Vision ,pages 576–585.Gu et al., 2018. Gu, C., Sun, C., Ross, D. A., Vondrick, C.,Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G.,Ricco, S., Sukthankar, R., et al. (2018). Ava: A videodataset of spatio-temporally localized atomic visual actions.In
Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition , pages 6047–6056.Hahn et al., 2019. Hahn, M., Silva, A., and Rehg, J. M.(2019). Action2vec: A crossmodal embedding approach toaction learning. arXiv preprint arXiv:1901.00484 .Hammond et al., 2011. Hammond, D. K., Vandergheynst,P., and Gribonval, R. (2011). Wavelets on graphs via spec-tral graph theory.
Applied and Computational HarmonicAnalysis , 30(2):129–150.Hariharan and Girshick, 2017. Hariharan, B. and Girshick,R. (2017). Low-shot visual recognition by shrinking andhallucinating features. In
Proceedings of the IEEE Interna-tional Conference on Computer Vision , pages 3018–3027.Henaff et al., 2015. Henaff, M., Bruna, J., and LeCun, Y.(2015). Deep convolutional networks on graph-structureddata. arXiv preprint arXiv:1506.05163 .Isola et al., 2015. Isola, P., Lim, J. J., and Adelson, E. H.(2015). Discovering states and transformations in imagecollections. In
Proceedings of the IEEE conference on com-puter vision and pattern recognition , pages 1383–1391.Jain et al., 2015a. Jain, M., van Gemert, J. C., Mensink, T.,and Snoek, C. G. (2015a). Objects2action: Classifying andlocalizing actions without any video example. In
Proceed-ings of the IEEE international conference on computer vi-sion , pages 4588–4596.Jain et al., 2015b. Jain, M., van Gemert, J. C., and Snoek,C. G. M. (2015b). What do 15, 000 object categories tell usabout classifying and localizing actions? In
IEEE Confer-ence on Computer Vision and Pattern Recognition, CVPR2015, Boston, MA, USA, June 7-12, 2015 , pages 46–55.Ji et al., 2020. Ji, F., Yang, J., Zhang, Q., and Tay, W. P.(2020). Gfcn: A new graph convolutional network based onparallel flows. pages 3332–3336.Ji et al., 2013. Ji, S., Xu, W., Yang, M., and Yu, K. (2013).3d convolutional neural networks for human action recogni-tion.
IEEE transactions on pattern analysis and machineintelligence , (1):221–231. Karpathy et al., 2014. Karpathy, A., Toderici, G., Shetty, S.,Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale video classification with convolutional neural net-works. In
Proceedings of the IEEE conference on ComputerVision and Pattern Recognition , pages 1725–1732.Kato et al., 2018. Kato, K., Li, Y., and Gupta, A. (2018).Compositional learning for human object interaction. In
The European Conference on Computer Vision (ECCV) .Kay et al., 2017. Kay, W., Carreira, J., Simonyan, K.,Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F.,Green, T., Back, T., Natsev, P., et al. (2017). Thekinetics human action video dataset. arXiv preprintarXiv:1705.06950 .Kipf and Welling, 2017. Kipf, T. N. and Welling, M. (2017).Semi-supervised classification with graph convolutional net-works. In
International Conference on Learning Represen-tations .Kodirov et al., 2015. Kodirov, E., Xiang, T., Fu, Z., andGong, S. (2015). Unsupervised domain adaptation for zero-shot learning. In
Proceedings of the IEEE internationalconference on computer vision , pages 2452–2460.Kuehne et al., 2013. Kuehne, H., Jhuang, H., Stiefelhagen,R., and Serre, T. (2013). Hmdb51: A large video databasefor human motion recognition. In
High Performance Com-puting in Science and Engineering 12 , pages 571–582.Springer.Kumar Dwivedi et al., 2019. Kumar Dwivedi, S., Gupta, V.,Mitra, R., Ahmed, S., and Jain, A. (2019). Protogan: To-wards few shot learning for action recognition. In
Proceedings of the IEEE International Conference on Computer Vision Workshops.
Lampert et al., 2009. Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958.
Lampert et al., 2014. Lampert, C. H., Nickisch, H., and Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization.
IEEE Transactionson Pattern Analysis and Machine Intelligence , (3):453–465.Lin et al., 2015. Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu,X. (2015). Learning entity and relation embeddings forknowledge graph completion. In
Twenty-ninth AAAI con-ference on artificial intelligence .Mandal et al., 2019. Mandal, D., Narayan, S., Dwivedi,S. K., Gupta, V., Ahmed, S., Khan, F. S., and Shao, L.(2019). Out-of-distribution detection for generalized zero-shot action recognition. In
Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition , pages9985–9993.Marino et al., 2016. Marino, K., Salakhutdinov, R., andGupta, A. (2016). The more you know: Using knowl-edge graphs for image classification. arXiv preprintarXiv:1612.04844 .Mettes and Snoek, 2017. Mettes, P. and Snoek, C. G. M.(2017). Spatial-aware object embeddings for zero-shot lo-calization and classification of actions. In
IEEE Inter-national Conference on Computer Vision, ICCV 2017,Venice, Italy, October 22-29, 2017 , pages 4453–4462.Mikolov et al., 2013a. Mikolov, T., Chen, K., Corrado, G.,and Dean, J. (2013a). Efficient estimation of word repre-sentations in vector space. arXiv preprint arXiv:1301.3781 .Mikolov et al., 2013b. Mikolov, T., Sutskever, I., Chen, K.,Corrado, G. S., and Dean, J. (2013b). Distributed represen-tations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119.
Mikolov et al., 2013c. Mikolov, T., Yih, W.-t., and Zweig, G. (2013c). Linguistic regularities in continuous space word representations. In
Proceedings of the 2013 conference ofthe north american chapter of the association for computa-tional linguistics: Human language technologies , pages 746–751.Miller, 1995. Miller, G. A. (1995). Wordnet: a lexicaldatabase for english.
Communications of the ACM, 38(11):39–41.
Mishra et al., 2018a. Mishra, A., Verma, V. K., Reddy, M. S. K., Arulkumar, S., Rai, P., and Mittal, A. (2018a). A generative approach to zero-shot and few-shot action recognition. In IEEE Winter Conference on Applications of Computer Vision (WACV).
Mishra et al., 2018b. Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2018b). A simple neural attentive meta-learner. In
International Conference on Learning Repre-sentations .Pagliardini et al., 2018. Pagliardini, M., Gupta, P., andJaggi, M. (2018). Unsupervised learning of sentence em-beddings using compositional n-gram features. In
NAACL2018-Conference of the North American Chapter of the As-sociation for Computational Linguistics .Palatucci et al., 2009. Palatucci, M., Pomerleau, D., Hinton,G. E., and Mitchell, T. M. (2009). Zero-shot learning withsemantic output codes. In
Advances in neural informationprocessing systems , pages 1410–1418.Patterson and Hays, 2016. Patterson, G. and Hays, J.(2016). Coco attributes: Attributes for people, animals,and objects. In
European Conference on Computer Vision ,pages 85–100. Springer.Qiu et al., 2017. Qiu, Z., Yao, T., and Mei, T. (2017). Learn-ing spatio-temporal representation with pseudo-3d residualnetworks. In proceedings of the IEEE International Con-ference on Computer Vision , pages 5533–5541.Ravi and Larochelle, 2017. Ravi, S. and Larochelle, H.(2017). Optimization as a model for few-shot learning. In
International Conference on Learning Representations .Ren et al., 2018. Ren, M., Triantafillou, E., Ravi, S., Snell,J., Swersky, K., Tenenbaum, J. B., Larochelle, H., andZemel, R. S. (2018). Meta-learning for semi-supervised few-shot classification. In
Proceedings of 6th International Con-ference on Learning Representations ICLR .Romera-Paredes and Torr, 2015. Romera-Paredes, B. andTorr, P. (2015). An embarrassingly simple approach tozero-shot learning. In
International Conference on Ma-chine Learning , pages 2152–2161.Sigurdsson et al., 2016. Sigurdsson, G. A., Varol, G., Wang,X., Farhadi, A., Laptev, I., and Gupta, A. (2016). Holly-wood in homes: Crowdsourcing data collection for activityunderstanding.
Lecture Notes in Computer Science, pages 510–526.
Simonyan and Zisserman, 2014. Simonyan, K. and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In
Advances in neural infor-mation processing systems , pages 568–576.Snell et al., 2017. Snell, J., Swersky, K., and Zemel, R.(2017). Prototypical networks for few-shot learning. InGuyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus,R., Vishwanathan, S., and Garnett, R., editors,
Advancesin Neural Information Processing Systems 30 , pages 4077–4087. Curran Associates, Inc.Soomro et al., 2012. Soomro, K., Zamir, A. R., and Shah, M.(2012). A dataset of 101 human action classes from videosin the wild.
Center for Research in Computer Vision , 2(11).Speer et al., 2017. Speer, R., Chin, J., and Havasi, C. (2017).Conceptnet 5.5: An open multilingual graph of general knowledge. In
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 4444–4451. AAAI Press.
Sung et al., 2018a. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018a). Learning to compare: Relation network for few-shot learning. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208.
Sung et al., 2018b. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and Hospedales, T. M. (2018b). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208.
Toutanova et al., 2003. Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In
Proceedingsof the 2003 Conference of the North American Chapter ofthe Association for Computational Linguistics on HumanLanguage Technology - Volume 1 , NAACL ’03, pages 173–180, Stroudsburg, PA, USA. Association for ComputationalLinguistics.Tran et al., 2015. Tran, D., Bourdev, L., Fergus, R., Torre-sani, L., and Paluri, M. (2015). Learning spatiotemporalfeatures with 3d convolutional networks. In
Proceedingsof the IEEE international conference on computer vision ,pages 4489–4497.Tran et al., 2018. Tran, D., Wang, H., Torresani, L., Ray, J.,LeCun, Y., and Paluri, M. (2018). A closer look at spa-tiotemporal convolutions for action recognition. In
Pro-ceedings of the IEEE conference on Computer Vision andPattern Recognition , pages 6450–6459.Veliˇckovi´c et al., 2018. Veliˇckovi´c, P., Cucurull, G.,Casanova, A., Romero, A., Li`o, P., and Bengio, Y.(2018). Graph attention networks. In
InternationalConference on Learning Representations .Wang and Schmid, 2013. Wang, H. and Schmid, C. (2013).Action recognition with improved trajectories. In
Proceed-ings of the IEEE international conference on computer vi-sion , pages 3551–3558.Wang et al., 2016. Wang, L., Xiong, Y., Wang, Z., Qiao, Y.,Lin, D., Tang, X., and Van Gool, L. (2016). Temporalsegment networks: Towards good practices for deep actionrecognition. In
European conference on computer vision ,pages 20–36. Springer.Wang et al., 2018. Wang, X., Ye, Y., and Gupta, A. (2018).Zero-shot recognition via semantic embeddings and knowl-edge graphs. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857–6866.
Xiang et al., 2018. Xiang, X., Tian, Y., Reiter, A., Hager, G. D., and Tran, T. D. (2018). S3D: Stacking segmental P3D for action quality assessment. In IEEE International Conference on Image Processing (ICIP), pages 928–932.
Xu et al., 2015. Xu, X., Hospedales, T., and Gong, S. (2015). Semantic embedding space for zero-shot action recognition. In IEEE International Conference on Image Processing (ICIP), pages 63–67.
Xu et al., 2017. Xu, X., Hospedales, T., and Gong, S. (2017). Transductive zero-shot action recognition by word-vector embedding.
International Journal of Computer Vision ,(3):309–333.Xu et al., 2016. Xu, X., Hospedales, T. M., and Gong, S.(2016). Multi-task zero-shot action recognition with pri-oritised data augmentation. In
European Conference on Computer Vision, pages 343–359. Springer.
Yan et al., 2018. Yan, S., Xiong, Y., and Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In
Thirty-Second AAAI Confer-ence on Artificial Intelligence .Yang et al., 2018. Yang, H., He, X., and Porikli, F. (2018).One-shot action localization by learning sequence matchingnetwork. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) .Yao et al., 2015. Yao, L., Torabi, A., Cho, K., Ballas, N.,Pal, C., Larochelle, H., and Courville, A. (2015). Describ-ing videos by exploiting temporal structure. In
Proceedingsof the IEEE international conference on computer vision ,pages 4507–4515.Zhang et al., 2017a. Zhang, J., Zheng, Y., and Qi, D.(2017a). Deep spatio-temporal residual networks for city-wide crowd flows prediction. In
Thirty-First AAAI Con-ference on Artificial Intelligence .Zhang et al., 2017b. Zhang, L., Xiang, T., and Gong, S.(2017b). Learning a deep embedding model for zero-shotlearning. In
Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition , pages 2021–2030. Zhang et al., 2019. Zhang, N., Deng, S., Sun, Z., Wang, G.,Chen, X., Zhang, W., and Chen, H. (2019). Long-tailrelation extraction via knowledge graph embeddings andgraph convolution networks. In
Proceedings of the 2019Conference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) , pages3016–3025.Zhou et al., 2016. Zhou, B., Khosla, A., Lapedriza, A., Oliva,A., and Torralba, A. (2016). Learning deep features for dis-criminative localization. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages2921–2929.Zhu and Yang, 2018. Zhu, L. and Yang, Y. (2018). Com-pound memory networks for few-shot video classification. In
The European Conference on Computer Vision (ECCV).
Zhu et al., 2018. Zhu, Y., Long, Y., Guan, Y., Newsam, S., and Shao, L. (2018). Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.