Generating Adjacency Matrix for Video-Query based Video Moment Retrieval
Yuan Zhou, Mingfei Wang, Ruolin Wang, Shuwei Huo
August 21, 2020
In this paper, we continue our work on the Video-Query based Video Moment Retrieval task. Building on the use of graph convolution to extract intra-video and inter-video frame features, we improve the method with a similarity-metric based graph convolution, whose weighted adjacency matrix is obtained by computing a similarity metric between the features of any two different timesteps in the graph. Experiments on the ActivityNet v1.2 and Thumos14 datasets show the effectiveness of this improvement, and our method outperforms the state-of-the-art methods.
Video retrieval has many variants, owing to the diversity of query modalities. The most traditional video retrieval methods are text-query based methods [1, 2, 3], which use keywords or descriptive sentences as the query to search for videos. Later on, image-query based and video-query based video retrieval methods were proposed to further enrich the expressive power of the query modality, using an image and a video as the query respectively [4, 5, 6, 7, 8, 9, 10, 11]. What's more, using an image or a video as the query modality can somewhat mitigate the cross-modality issue.

However, in practical applications, long and untrimmed videos are common, which implies that the videos contain many complex actions, while only a few of these actions directly meet the need expressed in the user's query. As a result, a new kind of video query task called Video Moment Retrieval (VMR) has been proposed, which allows the user to search for certain clips inside a video instead of the whole video.

Like the development of video retrieval described above, video moment retrieval methods were also text-based at first, aiming to search for the video clip that is relevant to a given text query [12, 13, 14]. However, using text as the query modality still limits the richness and complexity of the information contained in the query. To compensate for this disadvantage of text queries, video-query based video moment retrieval (VQ-VMR) was proposed, which is also known as the Video Relocalization task [15, 16]. Its aim is to temporally localize a segment in a long and untrimmed reference video such that the segment has semantic correspondence with a given query video clip. An example of the VQ-VMR task is shown in Figure 1. In this example, the user first picks out the clip with the action of a basketball dunk from the query video, which is the input of the VQ-VMR task. The task then aims at retrieving a clip with the same semantic meaning in another untrimmed reference video.

For the VQ-VMR task, the most direct way is to leverage the semantic similarity between the query video clip and the reference video. [15] made the first attempt to address this problem by proposing a cross-gated bilinear matching module. In their method, every timestep in the reference video is matched with all the timesteps in the query video clip, turning the prediction of the starting and ending time into a sequence labeling problem. [17] modified the feature extraction by leveraging an Attention-based Fusion Module to compute frame-level semantic similarity between the query video clip and pre-extracted proposal clips in the reference video. The generated Attention-based Fusion Tensor then passes through a Semantic Relevance Measurement Module to obtain the video-level semantic relevance between them, but the prediction of the starting and ending time is not treated as sequence labeling. [18] proposes a Multi-Graph Feature Fusion Module, which makes the first attempt at using graph convolution in the VQ-VMR task and improves the evaluation metric of this task. In that article, the concatenation of the query video feature and the proposal clip feature is first treated as a graph. Then, with multiple pre-designed adjacency matrices, the Multi-Graph Feature Fusion block further fuses the features of the two videos.

However, those adjacency matrices are fixed, which implies that they cannot be adjusted to adapt to each query-proposal pair in the training and testing stages. What's more, the adjacency matrix should be trained in the training stage to obtain better node connections.
As a result, a video-pair dependent adjacency matrix should be built to further improve the result.

In [19], the adjacency matrix is built by measuring the similarity between the features of different timesteps to improve the result in the Video Temporal Action Localization task. Since each video pair has a different feature matrix, different adjacency matrices can be built. In addition, they use a fully connected layer to learn a better node feature representation, so that a better adjacency matrix can be obtained. We borrow their idea and put it into our work.

Our contribution is the use of a video-pair dependent adjacency matrix in the VQ-VMR task. To generate this adjacency matrix, a Weighted Graph Adjacency Matrix Generation Module is proposed. Experiments on the ActivityNet v1.2 and Thumos14 datasets have proved the effectiveness of this module.

Figure 1: An illustration of the Video-Query based Video Moment Retrieval (VQ-VMR) task: given a query video clip from a query video and an untrimmed reference video, the task is to detect the starting point and the ending point of a segment in the reference video which semantically corresponds to the query video. In this example, given a query video clip corresponding to a basketball dunk (in the blue box), the VQ-VMR task aims to find a clip which is also relevant to a basketball dunk in another long untrimmed reference video (in the red box).

Video retrieval aims at selecting the video which is most relevant to the given query from a set of candidate videos. According to the type of query modality, video retrieval can be divided into the following categories: text-query based video retrieval, image-query based video retrieval and video-query based video retrieval.

Text-query based video retrieval has long been tackled via joint visual-language embedding models [20, 21, 22, 23, 24, 25, 26], and much progress has recently been made in this direction. Although text and video are different modalities, which brings difficulties in learning joint feature representations, some earlier works still manage to achieve good results. Several deep video-text embedding methods [20, 27, 28] have been developed by extending image-text embeddings [29, 30]. Other recent methods improve the results by utilizing concept words as semantic priors [31], or by relying on strong video representations such as the Recurrent Neural Network-Fisher Vector (RNN-FV) method [32]. Also, a dominant line of work leverages RNNs or their variants to encode the whole multimodal sequences (e.g., [20, 31, 32]), which faces the challenge of processing cross-modality data. However, text and videos remain different modalities, which means there exists inconsistency between features from the two modalities.

Image-query based video retrieval techniques use an image as the query. Yan et al. [4] propose Hashing across Euclidean space and Riemannian manifold (HER) to deal with the problem of retrieving videos of a specific person given his/her face image as the query. Araujo et al. [5] introduce a new retrieval architecture in which the image query can be compared directly with database videos.
Although image features are much closer to video features than text features are, an image only provides appearance information at one instant and thus lacks dynamic cues.

As the expressive power of text and of a single image is always limited, video-query based retrieval techniques have been proposed to break this limitation. Some video-based methods still borrow the idea of hashing from image-query based video retrieval, mapping high-dimensional video features to compact binary codes so as to address video-to-video retrieval. Video retrieval also has many specific applications, such as fine-grained incident video retrieval and near-duplicate video retrieval, which mainly focus on retrieving videos of the same incident and duplicated videos respectively [10, 11, 33].

However, in practice, videos are often long and untrimmed, while only the clip containing a certain action directly meets the user's need. To this end, the video moment retrieval task was proposed, which retrieves only the video clip with a certain action given a query. Our paper focuses on this task.
Temporal Proposal Generation is used in the Temporal Action Localization task to generate action proposals. An early proposal generation approach is the sliding window method, which slides a temporal window along the time dimension to pick out candidates. Based on sliding windows, [34, 35, 36, 37, 38] use a proposal network to predict whether the current sliding window contains an action or not, so that some sliding windows can be discarded. However, the sliding window method is not always satisfactory, because the window length is fixed while different actions last for different durations. To solve this problem, Heilbron et al. [39] propose the Fast Temporal Activity Proposals method. Escorcia et al. [40] propose the Deep Action Proposals (DAP) method, in which a visual encoder, a sequence encoder, a localization module and a prediction module are composed into a pipeline to extract K proposals with confidence scores over a T-timestep video. Zhao et al. [41] propose a method called Temporal Actionness Grouping (TAG), which uses an actionness classifier to evaluate the binary actionness probabilities of individual snippets and a repurposed watershed algorithm to combine the snippets into proposals. In our article, a temporal proposal generation method is needed to generate raw proposal clips in the query videos and their reference videos, and we use the TAG method of [41] to generate our proposals.
Derived from video retrieval, Video Moment Retrieval needs to find semantically relevant clips in a video given a query. It can also be divided into two main research directions: text-query based video moment retrieval and video-query based video moment retrieval, the latter also being called "video relocalization" [15].

Text-query based video moment retrieval focuses on locating the temporal segment that is most relevant to the given text. Hendricks et al. [12] propose the Moment Context Network, which leverages both local and global video features over the video's timesteps and effectively realizes localization in videos based on natural language queries. Gao et al. [13] propose a Cross-Modal Temporal Regression Localizer to jointly model the textual query and candidate video moments; its localizer outputs alignment scores between them as well as regression boundaries for the action. With the development of attention mechanisms in the field of vision and language interaction, attention is gradually being used in Video Moment Retrieval models to help capture interactions between the text and video modalities. Both the matching score and the boundary regression are also considered in our work, and we incorporate these ideas to make our method reasonable.

Different from text-query based video moment retrieval, video-query based video moment retrieval does not suffer from the cross-modality problem, since both the query and the reference come from the video modality. The methods in this area are very few. [15] make the first attempt by using a cross-gated bilinear matching module, in which the feature of the reference video at every timestep is matched with every timestep of the query video clip via an attention mechanism; based on the matching results, the prediction of the starting and ending time is treated as a sequence labeling problem. Later on, [17] improved the result by using an Attention-based Fusion module and a Semantic Relevance Measurement module to capture frame-level relationships; however, this method still treats the VQ-VMR task as a traditional regression problem. [16] extends this task to the spatio-temporal level, which requires finding both the temporal segment and the spatial location in the proposal video given a query video clip. In our article, the task is temporal video moment retrieval, the same as in [15] and [17].
Graph Neural Networks were first proposed in [42], and were initially used for node classification, graph classification and link prediction tasks.

With the success of Graph Neural Networks on many of these graph tasks, they have shown their strong power in extracting features from graph data, and many non-graph tasks have also begun to use them: the input data of the task is first modeled as a graph, and a graph neural network is then used to extract and fuse features. For example, [43] uses graph neural networks in the image denoising task, where each pixel is treated as a node of the graph. [44] uses a graph neural network in the video semantic segmentation task, where each timestep is treated as a node of the graph.

After [42], many new kinds of Graph Neural Networks were proposed, and they can be divided into two groups: spectral-based methods and spatial-based methods.

Spectral-based methods focus on interpreting Graph Neural Networks through graph spectra and the Graph Fourier Transform, and the Laplacian matrix (which represents the graph spectrum) is computed in this kind of method. The Graph Convolutional Network (GCN) [45] and the Graph Attention Network (GAT) [46] are two examples.

Spatial-based methods, in contrast, focus on message passing from a node's neighbors to the node itself, and node features are updated directly via their neighbors (no graph spectrum information is used). GraphSAGE [47] is a typical example.

For our task, we treat the correlations among different timesteps as a graph, which better represents the frame-level relationships, and we use a graph neural network to fuse the features of all timesteps of the two videos. This graph modeling scheme is the same as that in [44].

As for the type of graph neural network, our method is closer to a spatial-based method than a spectral one, since we only use the original connections among the nodes in our defined graphs and do not utilize the spectral information of those graphs.
In this section, we introduce our proposed method for Video-Query based Video Moment Retrieval. The overall architecture of our model is shown in Figure 2. This section is organized as follows: Section 3.1 gives the problem formulation, 3.2 is an overview of our methodology, and 3.3 presents our proposed Weighted Graph Adjacency Matrix Generation Module for generating the video-pair specific adjacency matrix. Section 3.4 describes the other modules in our framework, namely the graph convolution layer, the score module and the regression module. Section 3.5 describes the losses used in our method.
Given a query video clip Q and a long reference video P, our target is to obtain the starting and ending points [s_pred, e_pred] of the matching video clip inside P.

To achieve this goal, in the training stage we use a triplet (q, p, n) as input, where (q, p, n) denote the query video clip, a positive video (same semantic label as the query) and a negative video (different semantic label from the query) respectively; the overall architecture of our proposed method is shown in Figure 2. This differs from the method in [15], where the video-query based task is treated as a sequence labeling problem. We use the Temporal Actionness Grouping (TAG) method to obtain action video clips, and for each query we pick out one clip of the same class as the query as the positive proposal, and one clip of a different class as the negative proposal. The training stage aims at optimizing the feature extraction module and the regression module.

In the testing stage, we do not use the negative sample of the triplet; only (q, p) pairs are used. For one query video clip, we also use the TAG method to pick out all the proposals in the positive video. Different from picking out one proposal in the training stage, we use all the proposals and predict each proposal's [s_pred^test, e_pred^test] as output.

Figure 2: The overall architecture of our model. The key component is the Weighted Adjacency Matrix Generation Module, which takes the input feature matrix H^(0) as input and outputs a weighted adjacency matrix representing the similarity between nodes. Different from a fixed adjacency matrix, it considers the feature similarity of the nodes inside the graph.

We follow the method in [19], and the only change is the generation of the adjacency matrix. An overview of our method is as follows. First, for each video in the input query-positive (or negative) proposal video pair, we use an LSTM module to extract temporal features. Then, we concatenate the output features of the two videos at different timesteps to obtain a feature matrix, which is regarded as the node features of a graph. To obtain the adjacency matrix of the graph, we pass the feature matrix through a fully connected layer to get a latent feature representation, and a feature similarity metric is used to compute the node-wise feature similarity; this is how the adjacency matrix is built. Then, with the generated adjacency matrix and the feature matrix of the graph as input, a graph convolution layer further extracts and aggregates features. Finally, the features are sent to the Score Module and the Regression Module to compute the triplet loss and the regression loss.

Since the training and testing procedures are almost the same as in [18], our focus is on describing the procedure of building the video-dependent adjacency graph based on the similarity metric.

3.3 Weighted Graph Adjacency Matrix Generation Module
In [18], we used pre-defined weighted adjacency matrices to run the graph convolution. In this paper, by contrast, we use a similarity metric between the features of two timesteps to reflect the relationship between them, which differs from the pre-defined adjacency matrices of [18] that encode connections between timesteps. As a result, a video-pair specific weighted graph adjacency matrix can be built. We believe that the relationships between different timesteps should be defined by the timesteps themselves, and that a similarity metric between the nodes is more reasonable than manually designed weights in the adjacency matrix. In this work, we call this component the Weighted Graph Adjacency Matrix Generation Module. The detailed construction is given below and is also illustrated in Figure 3 and Figure 4.

Following the method in [18], after obtaining the node features H^(0) ∈ R^{T×d} via the LSTM module and feature concatenation, where T denotes the number of timesteps in a video clip and d denotes the feature dimension, our goal is to obtain an input-dependent graph adjacency matrix Â ∈ R^{T×T}, where each element Â[i, j] reflects the relationship between node i and node j.

First, we use a fully connected layer to learn a simple linear function φ on the input feature h_i^(0) ∈ R^d:

φ(h_i^(0)) = W_φ h_i^(0) + b_φ

where W_φ ∈ R^{d^(1)×d} and b_φ ∈ R^{d^(1)} are learnable weights and biases. This layer aims at weighting the graph edges such that nodes with more similar φ values have higher edge weights between them.

Then, Â[i, j] (the edge weight between φ(h_i^(0)) and φ(h_j^(0))) is computed as:

Â[i, j] = f(φ(h_i^(0)), φ(h_j^(0)))

where we use the similarity metric defined as:

f(h_i, h_j) = h_i^T h_j / (‖h_i‖ ‖h_j‖)

Also, to ensure the sparsity of our adjacency matrix, we add an L1-sparsity loss as a constraint, which will be introduced in a later subsection.
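To make the construction concrete, below is a minimal PyTorch sketch of such a module; the class name, tensor shapes and latent dimension are our own illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyGenerator(nn.Module):
    """Sketch of the Weighted Graph Adjacency Matrix Generation Module:
    project the node features with a fully connected layer phi, then use the
    normalized dot product (cosine similarity) between any two projected node
    features as the corresponding edge weight."""

    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.phi = nn.Linear(d_in, d_latent)  # learnable W_phi, b_phi

    def forward(self, h0: torch.Tensor) -> torch.Tensor:
        # h0: (num_nodes, d_in) concatenated per-timestep features of the video pair
        z = self.phi(h0)                # (num_nodes, d_latent)
        z = F.normalize(z, p=2, dim=1)  # divide each row by its L2 norm
        adj = z @ z.t()                 # adj[i, j] = cosine similarity of nodes i and j
        return adj                      # (num_nodes, num_nodes)

# usage sketch (assumed sizes): 2T nodes after concatenating two T-timestep videos
# h0 = torch.randn(2 * 64, 512)
# A_hat = AdjacencyGenerator(512, 256)(h0)
```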
In this part, we introduce the modules that follow the Weighted Adjacency Matrix Generation Module: the graph convolution layer, the score module and the regression module.
After generating the weighted adjacency matrix Â ∈ R^{T×T}, we take it together with the feature matrix H^(0) ∈ R^{T×d} as the input of the graph convolution layer.

Figure 3: The architecture of our proposed Weighted Adjacency Matrix Generation Module. First, the concatenated feature matrix is projected into a new latent space via a fully connected layer. Then, we calculate the feature similarity between the two videos at different timesteps and arrange the values into a weighted adjacency matrix.

Figure 4: An example of our proposed graph. Different from the graph in [18], the graph in our article is fully connected, and every edge is weighted by the similarity metric between its two nodes.

Our graph convolution layer is formulated as:

H^(i+1) = σ(Â H^(i) W)

where Â is the adjacency matrix generated in the previous section, H^(i) is the input feature matrix, W is a trainable weight matrix, and σ is a non-linear activation function. In the implementation of this article, different from the multi-graph feature fusion method in [18], we use only one graph convolution layer, and the layer contains only one graph.

After passing the features of the two videos through the graph convolution layer, global average pooling is used to gather the features of the 2T nodes in H^(1) into one global feature:

h_global = AvgPool(H^(1)) ∈ R^{d^(1)}

where h_global denotes the output global feature, AvgPool denotes average pooling, and d^(1) denotes the output feature dimension of the graph convolution layer.
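A hedged sketch of this single graph convolution layer followed by global average pooling is given below; the ReLU activation and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """Single graph convolution layer H1 = sigma(A_hat @ H0 @ W),
    followed by average pooling over all nodes into one global feature."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Linear(d_in, d_out, bias=False)  # trainable W
        self.act = nn.ReLU()                              # assumed non-linearity sigma

    def forward(self, adj: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # adj: (num_nodes, num_nodes), h0: (num_nodes, d_in)
        h1 = self.act(adj @ self.weight(h0))  # aggregate node features weighted by A_hat
        h_global = h1.mean(dim=0)             # AvgPool over the 2T nodes -> (d_out,)
        return h_global
```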
Score Module: h_global is fed into this module, which consists of a Multi-Layer Perceptron (MLP) and outputs a score s ∈ (−1, 1). The output s is used for the triplet loss.

Regression Module: h_global is fed into this module, which also consists of an MLP and outputs the regression offsets (T_c, T_l) ∈ R^2, where T_c and T_l stand for the central point offset and the length offset respectively. Since a proposal clip can be either too tight or too loose, this regression step tends to find a better position.

Our loss function has three parts: a triplet loss L_tri used for extracting and fusing features between the query and the proposals, a regression loss L_reg used for refining the starting and ending points, and an L1-sparsity loss L_{L1-sparsity} used for sparsifying the generated weighted adjacency matrix.

The triplet loss L_tri is defined as:

L_tri = Σ_{i=1}^{N} max(0, γ − S(q_i, p_i) + S(q_i, n_i)) + λ‖θ‖

where N is the batch size and q, p, n denote the query, positive and negative video clips respectively. γ is a hyperparameter that enforces a sufficiently large difference between the positive-query score and the negative-query score, and λ is a hyperparameter on the regularization term. In our experiments, we set γ = 0.5 and use the same setting for λ as in [18].
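The hinge-style triplet loss above could be computed as in the following sketch, where the batch score vectors are assumed to be precomputed by the Score Module and the exact form of the regularization term is our assumption.

```python
import torch

def triplet_loss(s_pos: torch.Tensor, s_neg: torch.Tensor,
                 params, gamma: float, lam: float) -> torch.Tensor:
    """s_pos[i] = S(q_i, p_i) and s_neg[i] = S(q_i, n_i) for the i-th triplet in the batch.
    Hinge loss with margin gamma, plus lam * ||theta|| regularization on the parameters."""
    hinge = torch.clamp(gamma - s_pos + s_neg, min=0.0).sum()
    reg = sum(p.norm() for p in params)  # ||theta||; the exact norm used is an assumption
    return hinge + lam * reg
```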
The regression loss L_reg takes the following form:

L_reg = (1/N) Σ_{i=1}^{N} ( |T_{c,i} − T*_{c,i}| + |T_{l,i} − T*_{l,i}| )

where T_{c,i} and T_{l,i} are the predicted relative center and length offsets of the i-th positive proposal, and N is the batch size. T*_{c,i} and T*_{l,i} are the ground-truth center and length offsets, which are computed as:

T*_{l,i} = log(len_i / len*_i)
T*_{c,i} = (loc_i − loc*_i) / len*_i

where loc_i and len_i denote the center coordinate and length of the i-th proposal respectively, and loc*_i and len*_i denote those of the corresponding ground-truth segment.

Based on the losses in [19], we add an L1-sparsity loss related to the generated Â. This loss is formulated as:

L_{L1-sparsity} = (1 / 4T^2) Σ_{i=1}^{2T} Σ_{j=1}^{2T} |Â[i, j]|

This loss trains the fully connected layer φ to create tighter clusters from the input features H^(0).

The detailed training procedure will be given in the experiment section.
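A small sketch of the regression loss, the ground-truth offset computation and the L1-sparsity term follows; the function names and shapes are illustrative assumptions.

```python
import torch

def regression_loss(t_pred: torch.Tensor, t_gt: torch.Tensor) -> torch.Tensor:
    """t_pred, t_gt: (N, 2) tensors of (T_c, T_l) offsets; absolute error averaged over the batch."""
    return (t_pred - t_gt).abs().sum(dim=1).mean()

def offset_targets(loc, length, loc_gt, length_gt):
    """Ground-truth offsets: T_l* = log(len / len*), T_c* = (loc - loc*) / len*."""
    t_l = torch.log(length / length_gt)
    t_c = (loc - loc_gt) / length_gt
    return t_c, t_l

def l1_sparsity_loss(adj: torch.Tensor) -> torch.Tensor:
    """L1 penalty on the generated (2T x 2T) adjacency matrix, normalized by (2T)^2 = 4T^2."""
    return adj.abs().sum() / adj.numel()
```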
After training our proposed framework, we perform the task on the test set. The testing stage aims at retrieving the matching clip in an untrimmed video given a query clip. As a result, we only use the reference video which is known to have the same semantic label as the query video clip, and no negative video is used. The query video clip and the reference video mentioned above are paired as our input in the test stage. There are two procedures in the testing stage: proposal selection and proposal refinement.

Given a query video clip q, we first obtain M proposals of the reference video using the TAG method. Then, we calculate the score between the query video clip and each of the M proposals, and the proposal with the highest score is selected:

m = argmax_m S(q, p_m)

After selecting the proposal with the highest score, whose index is m, the boundary of the m-th proposal is refined using the Regression Module in the overall architecture (Figure 2). In the Regression Module we have:

T*_{l,i} = log(len_i / len*_i)
T*_{c,i} = (loc_i − loc*_i) / len*_i

where loc_i and len_i are the center coordinate and length of the i-th proposal respectively, and loc*_i and len*_i denote those of the corresponding ground-truth segment. In the testing stage, we need the predicted starting and ending points, which is equivalent to solving for loc*_i and len*_i with all the other terms known (the predicted offsets now play the role of T*_{c,i} and T*_{l,i}). Then, with the refined central point and total length known, it is easy to obtain the refined starting and ending points.
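Solving the two offset equations for loc* and len* gives len* = len · exp(−T_l) and loc* = loc − T_c · len*. The sketch below, with assumed tensor layouts, illustrates the full proposal selection and refinement step.

```python
import torch

def select_and_refine(scores: torch.Tensor, proposals: torch.Tensor,
                      offsets: torch.Tensor):
    """Pick the proposal with the highest matching score, then invert the predicted
    offsets (T_c, T_l) to recover the refined center and length:
        len_refined = len_m * exp(-T_l),  loc_refined = loc_m - T_c * len_refined.
    scores: (M,), proposals: (M, 2) as (center, length), offsets: (M, 2) as (T_c, T_l)."""
    m = torch.argmax(scores)
    loc_m, len_m = proposals[m]
    t_c, t_l = offsets[m]
    len_ref = len_m * torch.exp(-t_l)
    loc_ref = loc_m - t_c * len_ref
    start, end = loc_ref - len_ref / 2, loc_ref + len_ref / 2
    return start, end
```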
In this section, to demonstrate the effectiveness of the similarity metric between different nodes, we conduct experiments on the ActivityNet v1.2 dataset and the Thumos14 dataset. As the results below show, our method outperforms the other VQ-VMR methods. This section is organized as follows: first, we introduce the datasets and implementation details; then we compare our method's results with those of other methods on the ActivityNet v1.2 and Thumos14 datasets; finally, we show a visualization of the generated adjacency matrix.

For the VQ-VMR task, [15] first exploited and reorganized the videos in ActivityNet to form a new dataset for research. We additionally run experiments on the Thumos14 dataset, following the same setting as [17], to prove the effectiveness of our proposed method. In both datasets, the original videos are annotated with the starting and ending points of each action, which are referred to as the "ground truth" in the following paragraphs.
ActivityNet v1.2 [50] contains 9682 videos, divided into 100 action classes. We reorganized ActivityNet v1.2 for our study: following the split method in [15], we use 80 classes for training, 10 classes for validation and 10 classes for testing. In the experiments, we use the pre-extracted 500-dimensional PCA features with a temporal resolution of 16.
The Thumos14 dataset [51] contains many videos, but only the untrimmed long videos with temporal annotations directly meet our needs. We picked out 412 of them (200 from the validation data and 212 from the test data of the original Thumos14 dataset) for training and testing, covering 20 classes. We randomly select 14 classes for training and the remaining 6 classes for testing. Note that two falsely annotated videos ("270" and "1496") in the testing set were excluded from the present study.

Implementation Details
For both datasets, we use pretrained C3D features as input, the same as in [17] and [18]. PyTorch 1.4.0 is used to implement our model. The batch size is 32, and we train the model for 64 epochs. As mentioned above, we have three losses: the triplet loss L_tri, the regression loss L_reg, and the L1 graph sparsity loss L_{L1-sparsity}. For the three losses, three separate Adam optimizers are used to minimize them. For L_tri, the optimizer updates all parameters except those of the Regression Module and the Weighted Adjacency Matrix Generation Module, with a learning rate of 1e-4. For L_reg, the optimizer updates only the parameters of the Regression Module, with a learning rate of 1e-1. For L_{L1-sparsity}, the optimizer updates the parameters of the Weighted Adjacency Matrix Generation Module, with a learning rate of 1e-2. The three optimizers share the same β settings. The first Adam optimizer (learning rate 1e-2) minimizes L_{L1-sparsity}, which only involves the parameters of the fully connected layer φ. Then, the second Adam optimizer (learning rate 1e-4) optimizes the parameters except φ and the Regression Module, aiming at minimizing L_tri. Finally, the third Adam optimizer (learning rate 0.1) optimizes the parameters of the Regression Module to minimize L_reg.
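A sketch of this three-optimizer setup is shown below; the module attribute names (adj_gen for the Weighted Adjacency Matrix Generation Module, regressor for the Regression Module) and the use of PyTorch's default β values are assumptions for illustration.

```python
import torch
import torch.nn as nn

def build_optimizers(model: nn.Module):
    """Construct the three Adam optimizers over the parameter groups described above.
    Attribute names are placeholders; betas are left at the PyTorch defaults."""
    adj_params = list(model.adj_gen.parameters())     # fully connected layer phi
    reg_params = list(model.regressor.parameters())   # Regression Module
    excluded = {id(p) for p in adj_params + reg_params}
    main_params = [p for p in model.parameters() if id(p) not in excluded]

    opt_sparsity = torch.optim.Adam(adj_params, lr=1e-2)   # minimizes the L1 sparsity loss
    opt_triplet = torch.optim.Adam(main_params, lr=1e-4)   # minimizes the triplet loss
    opt_regress = torch.optim.Adam(reg_params, lr=1e-1)    # minimizes the regression loss
    return opt_sparsity, opt_triplet, opt_regress
```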
The comparison results on ActivityNet are listed in Table 1. From the table, it is clear that our proposed method has an advantage over the existing VQ-VMR methods. Compared with [15], our method's result is much higher; this is because we use pre-extracted video moment clips and pick out the most suitable one, rather than determining whether each timestep lies in the retrieved clip or not. Compared with the AFT+SRM method, our method's performance is also better. This conclusion was first drawn in [18] to demonstrate the effectiveness of using graph convolution, and our comparison confirms it once again. What's more, we compare with the n=1, k=1 case in [18]: the only difference between our method and that case is that our graph is not the same across all batches. From the results we can see that adaptively adjusting the weighted adjacency matrix according to the input video pair and the feature similarity metric improves the result. We also conduct experiments with different kinds of graphs, for example the n=1, k=1&2 and n=1, k=1&2&3 settings in [18]. These results also show the effectiveness of using a similarity metric between timesteps to build the graph adjacency matrix.

Table 1: Results on ActivityNet Dataset
Methods \ tIoU                 0.5     0.6     0.7     0.75    0.8     0.85    0.9     0.95
Chance                         0.161   0.113   0.056   -       0.031   -       0.012   -
Frame-Level Baseline           0.202   0.146   0.104   -       0.054   -       0.025   -
Video-Level Baseline           0.254   0.181   0.127   -       0.063   -       0.026   -
SST                            0.347   0.258   0.183   -       0.081   -       0.03    -
Cross-Gated Bilinear Matching  0.458   0.377   0.282   -       0.171   -       0.073   -
AFT+SRM                        0.6355  0.6346  0.6216  0.5639  0.5276  0.4314  0.2902  0.1233
n=1, k=1                       0.6333  0.6325  0.6313  0.5637  0.4747  0.3753  0.265   0.1162
n=1, k=1&2                     0.6596  0.6494  0.6452  0.5805  0.4823  0.3832  0.2711  0.1201
n=1, k=1&2&3                   0.6766  0.6758  0.6714  0.5933  0.5043  0.4016  0.2854  0.1241
n=2, k=1                       0.681   0.6802  0.6787  0.6023  0.5163  0.4006  0.2835  0.1245
n=2, k=1&2                     0.6914  0.6907  0.6877  0.6138  0.5224  0.4016  0.2846  0.126
n=2, k=1&2&3                   0.7012  0.7014  0.6981  0.6309  0.5305  0.4138  0.2886  0.1291
n=3, k=1                       0.6016  0.6008  0.597   0.5283  0.4585  0.3608  0.2489  0.1125
n=3, k=1&2                     0.6242  0.6238  0.6207  0.5413  0.4613  0.3483  0.2502  0.1125
n=3, k=1&2&3                   0.6311  0.6302  0.6268  0.5448  0.4677  0.3703  0.2624  0.1136
n=1, k=1, CNN                  0.5676  0.5675  0.5641  0.4936  0.4104  0.3305  0.2323  0.0998
n=2, k=1&2&3, CNN              0.6275  0.6266  0.6247  0.5539  0.4676  0.3646  0.2642  0.1192
Our Method                     0.651   0.6501  0.6465  0.5702  0.4831  0.3798  0.266   0.1247

Like [17] and [18], to further prove the effectiveness of our proposed method, we also conduct experiments on the Thumos14 dataset. The experimental settings for Thumos14 are described above and are the same as in [17] and [18]. The results are listed in Table 2. From the table, it is easy to see that the pattern observed on the ActivityNet dataset also appears on Thumos14 (as the tIoU threshold gets larger, the mAP decreases). We also find that all the graph-convolution based methods are better than the AFT+SRM method, which differs from the situation on ActivityNet; we attribute this to the disparity between the two datasets. Of all the methods, our proposed method is the best, and its mAP is higher than that of the other graph-convolution based methods, which again shows the strength of our method. Our method's result is also better than the AFT+SRM method of [17].
Table 2: Results on Thumos14 Dataset
Methods \ tIoU                 0.5     0.6     0.7     0.8     0.9
Chance
Frame-Level Baseline
Video-Level Baseline
SST
Cross-Gated Bilinear Matching
AFT+SRM                        0.5063  0.5015  0.4797  0.3133  0.1206
n=1, k=1                       0.5683  0.5645  0.539   0.3527  0.142
n=1, k=1&2                     0.5704  0.5662  0.5447  0.3525  0.1431
n=1, k=1&2&3                   0.5783  0.5741  0.5472  0.3564  0.1444
n=2, k=1                       0.5709  0.565   0.5263  0.359   0.1396
n=2, k=1&2                     0.5715  0.5662  0.5343  0.3613  0.142
n=2, k=1&2&3                   0.5804  0.5728  0.5473  0.3724  0.145
n=3, k=1                       0.5581  0.5493  0.5328  0.3454  0.1326
n=3, k=1&2                     0.5663  0.5543  0.533   0.3558  0.1376
n=3, k=1&2&3                   0.5731  0.5688  0.5533  0.3606  0.1395
n=1, k=1, CNN                  0.4967  0.4915  0.4668  0.303   0.126
n=2, k=1&2&3, CNN              0.4418  0.4366  0.4188  0.2568  0.1068
Our Method                     0.6056  0.601   0.5763  0.3713  0.145

We also show qualitative results of our proposed method to demonstrate its effectiveness intuitively. We picked two classes from the ActivityNet dataset (Hand Washing Clothes and Pole Vault) and two classes from the Thumos14 dataset (Javelin Throw and Soccer Penalty). The results are shown in the qualitative results figure. It can be seen that the ground truth and the prediction of our proposed method overlap considerably. (The lower overlap in one case is due to its relatively long segment length.) Although the test classes and test pairs have not been seen before, our method can effectively measure the semantic similarity between the query and reference clips.

We also visualize our generated adjacency matrix. From this 2T × 2T matrix, we can see that the intra-video node connections are strengthened. The weights of the inter-video node connections are not as large as those of the intra-video connections, but some values are relatively higher than the others, which implies that the corresponding nodes from the two video clips are more similar and that these connections are relatively important among all the inter-video connections. We also show the pre-designed adjacency matrix of [18] for comparison. Comparing it with the adjacency matrix of our method, we find that the edge connection strategies of the two methods are different: the connections in [18] focus on the inter-video part, while our method focuses on the intra-video part.

In this article, we further improve Video-Query based Video Moment Retrieval. Based on the graph convolution methods proposed before, we first concatenate the features of the query and the proposal to build a graph. Then we use a similarity metric between the nodes of the graph, which is different from the pre-defined adjacency weights used before, and a fully connected layer is trained to obtain a better node feature representation. Experiments on the ActivityNet v1.2 and Thumos14 datasets have shown the effectiveness of our new method.
References

[1] R. Xu, C. Xiong, W. Chen, and J. J. Corso, "Jointly modeling deep video and compositional text to bridge vision and language in a unified framework," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2346-2352, 2015.
[2] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury, "Learning joint embedding with multimodal cues for cross-modal video-text retrieval," in Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 19-27, 2018.
[3] Y. Youngjae, K. Jongseok, and K. Gunhee, in European Conference on Computer Vision, pp. 471-487, 2018.
[4] L. Yan, R. Wang, Z. Huang, S. Shan, and X. Chen, "Face video retrieval with image query via hashing across euclidean space and riemannian manifold," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] A. Araujo and B. Girod, "Large-scale video retrieval using image queries," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1406-1420, 2018.
[6] N. Garcia and G. Vogiatzis, "Asymmetric spatio-temporal embeddings for large-scale image-to-video retrieval," in British Machine Vision Conference, 2018.
[7] G. Ye, D. Liu, J. Wang, and S. Chang, "Large-scale video hashing via structure learning," pp. 2272-2279, 2013.
[8] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, and R. Hong, "Self-supervised video hashing with hierarchical binary auto-encoder," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3210-3221, 2018.
[9] Z. Chen, J. Lu, J. Feng, and J. Zhou, "Nonlinear structural hashing for scalable video search," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1421-1433, 2018.
[10] G. Kordopatis-Zilos, S. Papadopoulos, I. Patras, and I. Kompatsiaris, 2017.
[11] G. Kordopatis-Zilos, S. Papadopoulos, I. Patras, and I. Kompatsiaris, "FIVR: Fine-grained incident video retrieval," IEEE Transactions on Multimedia, vol. 21, no. 10, pp. 2638-2652, 2019.
[12] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, "Localizing moments in video with natural language," pp. 5804-5813, 2017.
[13] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal activity localization via language query," pp. 5277-5285, 2017.
[14] N. C. Mithun, S. Paul, and A. K. Roy-Chowdhury, "Weakly supervised video moment retrieval from text queries," pp. 11584-11593, 2019.
[15] Y. Feng, L. Ma, W. Liu, T. Zhang, and J. Luo, "Video re-localization," in The European Conference on Computer Vision, 2018.
[16] Y. Feng, L. Ma, W. Liu, and J. Luo, "Spatio-temporal video re-localization by warp LSTM," pp. 1288-1297, 2019.
[17] Y. Zhou and R. Wang, "Semantic relevance learning for video-query based video moment retrieval," 2020.
[18] Y. Zhou, M. Wang, R. Wang, and S. Huo, "Graph neural network for video-query based video moment retrieval," 2020.
[19] M. Rashid, H. Kjellström, and Y. J. Lee, "Action graphs: Weakly-supervised action localization with graph convolution networks," pp. 604-613, 2020.
[20] A. Torabi, N. Tandon, and L. Sigal, "Learning language-visual embedding for movie understanding with natural-language," CoRR, vol. abs/1609.08124, 2016.
[21] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," CoRR, vol. abs/1411.2539, 2014.
[22] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," Journal of Artificial Intelligence Research.
[23] D. Lin, S. Fidler, C. Kong, and R. Urtasun, "Visual semantic search: Retrieving videos via complex textual queries," pp. 2657-2664, 2014.
[24] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun, "Order-embeddings of images and language," CoRR, vol. abs/1511.06361, 2015.
[25] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell, "Natural language object retrieval," pp. 4555-4564, 2016.
[26] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, "Generation and comprehension of unambiguous object descriptions," pp. 11-20, 2016.
[27] R. Xu, C. Xiong, W. Chen, and J. J. Corso, "Jointly modeling deep video and compositional text to bridge vision and language in a unified framework," AAAI Conference on Artificial Intelligence, 2015.
[28] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Learning joint representations of videos and sentences with web image search," European Conference on Computer Vision, 2016.
[29] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, "DeViSE: A deep visual-semantic embedding model," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Red Hook, NY, USA, pp. 2121-2129, Curran Associates Inc., 2013.
[30] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," Transactions of the Association for Computational Linguistics, vol. 2, pp. 207-218, 2014.
[31] Y. Yu, H. Ko, J. Choi, and G. Kim, "End-to-end concept word detection for video captioning, retrieval, and question answering," pp. 3261-3269, 2017.
[32] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, "Temporal tessellation for video annotation and summarization," CoRR, vol. abs/1612.06950, 2016.
[33] G. Kordopatis-Zilos, S. Papadopoulos, I. Patras, and I. Kompatsiaris, "ViSiL: Fine-grained spatio-temporal video similarity learning," 2019.
[34] A. Gaidon, Z. Harchaoui, and C. Schmid, "Temporal localization of actions with actoms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2782-2795, 2013.
[35] J. Yuan, B. Ni, X. Yang, and A. A. Kassim, "Temporal action localization with pyramid of score distribution features," pp. 3093-3102, 2016.
[36] M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. G. M. Snoek, "Action localization with tubelets from motion," pp. 740-747, 2014.
[37] S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, "Action recognition and localization by hierarchical space-time segments," pp. 2744-2751, 2013.
[38] F. C. Heilbron, J. C. Niebles, and B. Ghanem, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos," pp. 1914-1923, 2016.
[39] F. C. Heilbron, J. C. Niebles, and B. Ghanem, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[40] V. Escorcia, F. Heilbron, J. C. Niebles, and B. Ghanem, "DAPs: Deep action proposals for action understanding," vol. 9907, pp. 768-784, 2016.
[41] Z. Yue, X. Yuanjun, W. Limin, W. Zhirong, T. Xiaoou, and L. Dahua, "Temporal action detection with structured segment networks," International Journal of Computer Vision, 2017.
[42] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," 2014.
[43] D. Valsesia, G. Fracastoro, and E. Magli, "Deep graph-convolutional image denoising," 2019.
[44] F. Mao, X. Wu, H. Xue, and R. Zhang, "Hierarchical video frame sequence representation with deep convolutional graph network," 2019.
[45] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
[46] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in International Conference on Learning Representations, 2018.
[47] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems 30, pp. 1024-1034, 2017.
[48] S. Hochreiter and J. Schmidhuber, "Long short-term memory," vol. 9, pp. 1735-1780, 1997.
[49] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: Generic features for video analysis," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
[50] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[51] Y. G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, "THUMOS challenge: Action recognition with a large number of classes," 2014.
[52] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," vol. 15, pp. 1929-1958, 2014.
[54] C. L. Chou, H. T. Chen, and S. Y. Lee, "Pattern-based near-duplicate video retrieval and localization on web-scale videos," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 382-395, 2015.
[55] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. Carlos Niebles, "SST: Single-stream temporal action proposals," in