Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh
Abstract—Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy. VSLNet-L first splits the untrimmed video into short clip segments; then, it predicts which clip segment contains the target moment and suppresses the importance of other segments. Finally, the clip segments are concatenated, with different confidences, to locate the target moment accurately. Extensive experiments on three benchmark datasets show that the proposed VSLNet and VSLNet-L outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem.
Index Terms—Natural Language Video Localization, Single Video Moment Retrieval, Temporal Sentence Grounding, Cross-modal Retrieval, Multimodal Learning, Span-based Question Answering, Multi-Paragraph Question Answering, Cross-modal Interaction.
1 INTRODUCTION

Natural language video localization (NLVL) is a prominent yet challenging problem in vision-language understanding. Given an untrimmed video, NLVL is to retrieve a temporal moment that semantically corresponds to a given language query. As illustrated in Figure 1, NLVL involves both computer vision and natural language processing techniques [1]–[6]. Cross-modal reasoning is essential for NLVL to correctly locate the target moment in a video. Prior studies primarily treat NLVL as a ranking task, applying a multimodal matching architecture to find the best matching video segment for a query [7]–[11]. Some works [11]–[14] assign multi-scale temporal anchors to frames and select the anchor with the highest confidence as the result. Recently, several methods explore modeling cross-modal interactions between video and query, and regressing the temporal locations of the target moment directly [15]–[17]. There are also studies that formulate NLVL as a sequential decision-making problem and solve it with reinforcement learning [18]–[20].

Different from the aforementioned works, we address the NLVL task from a new perspective, i.e., span-based question answering (QA). Specifically, the essence of NLVL is to search for a video moment as the answer to a given language query from an untrimmed video.

• This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project
• H. Zhang is with the Institute of High Performance Computing, A*STAR, Singapore, 138632 and the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798.
• A. Sun is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798.
• W. Jing is with the Institute of Infocomm Research, A*STAR, Singapore, 138632.
• L. Zhen, J.T. Zhou and R.S.M. Goh are with the Institute of High Performance Computing, A*STAR, Singapore, 138632.
• Corresponding author: J.T. Zhou (Email: [email protected]).
Fig. 1. An illustration of localizing a temporal moment in an untrimmed video by a given language query, and the general procedures of natural language video localization.
By treating the video as a text passage, and the target moment as the answer span, NLVL shares significant similarities with span-based QA conceptually. With this intuition, NLVL can be revisited in the span-based QA framework. However, existing span-based QA methods cannot be directly applied to solve the NLVL problem, due to the following two technical gaps. First, the data nature is different. Video is continuous, which leads to continuous causal relations between consecutive video events; natural language, on the other hand, is discrete, and words in a sentence exhibit syntactic structure. Therefore, changes between consecutive video frames are usually smooth, whereas two adjacent word tokens may carry entirely different meanings, e.g., oxymoron or negation. As a result, events in a video are temporally correlated, and one event can be linked with another along the video sequence, while the relationships between words or sentences are usually indirect and can be far apart. Second, small shifts in video frames are less perceptible to humans than changes of words in a text. That is, small drifts between frames usually do not affect the understanding of video content; in contrast, the change of a few words or even one word could change the meaning of a sentence entirely.

By considering the above differences in data nature, we propose a video span localizing network (VSLNet), where a query-guided highlighting (QGH) strategy is introduced on top of the traditional span-based QA framework [21]. In QGH, we consider a region that covers the target moment by extending its starting and ending frames a bit further. In this way, the selected region is regarded as foreground, while the rest is treated as background.

One challenge in NLVL is that the performance of many existing methods degrades significantly along with the increase of video length (see the detailed discussion in Section 4.4). Since NLVL models generally perform well on short videos, one straightforward solution to this issue is to split a long video into multiple short clip segments and regard each clip segment as a short video. By treating a long video as a document and a clip segment as a paragraph, NLVL can be viewed as a multi-paragraph question answering (MPQA) task [22]. The target moment in a long video can be considered as the answer span in a document for a given query.

However, how to properly split a long video into clip segments remains challenging. Paragraphs in a document are semantically coherent units with boundaries defined by humans. Videos are continuous, and splitting a video into semantically coherent clip segments is difficult, if feasible at all. In addition, the answer span in MPQA can be found in one of the paragraphs, but we cannot expect the target moment to be found within a single clip segment, regardless of how the video is split. We propose a multi-scale split-and-concat strategy to partition a long video into clips of different lengths. Compared with fixed-length splitting, the multi-scale splitting strategy increases the chance of locating a target moment in one segment.
In this way, even if a target moment is truncated at one or several scales, segments at other scales may still fully contain it. Thus, we can locate the moment in the clips that are more likely to contain it. This network is termed VSLNet-L in the following.¹

This paper is a substantial extension of our conference publication [23] with the following improvements. First, we investigate the localization performance degradation issue of existing NLVL models on long videos, and propose VSLNet-L to tackle this problem by introducing the concept of MPQA. Second, we carry out more experimental analyses involving VSLNet-L and demonstrate its effectiveness on long videos. The contributions and novelty of this work are summarized as follows:
1) We provide a new perspective to solve NLVL by formulating it as span-based QA, and analyze the natural differences between them in detail. To the best of our knowledge, this is one of the first works to adopt a span-based QA framework for NLVL.
2) We propose VSLNet to explicitly address the differences between NLVL and span-based QA, by introducing a novel query-guided highlighting strategy.
3) In addition to VSLNet, VSLNet-L is proposed by incorporating the concept of multi-paragraph QA to address the performance degradation on long videos.
1. “L” represents the multi-scale split-and-concat strategy for Long videos.

2 RELATED WORK
As introduced by [24], [25], NLVL requires modeling the cross-modal interactions between natural language texts and untrimmed videos, to retrieve video segments with language queries. Existing work on NLVL can be roughly divided into five categories: ranking, anchor, reinforcement learning, regression, and span based methods.

Ranking-based methods [7]–[10], [24], [26]–[28] treat NLVL as a ranking task and rely on multimodal matching architectures to find the best matching video moment for a language query. For instance, Gao et al. [25] proposed a cross-modal temporal regression localizer that jointly models queries and video clips, predicting alignment scores and action boundary regressions for pre-segmented candidate clips. Chen et al. [29] proposed a semantic activity proposal that integrates the semantic information of sentence queries into the proposal generation process to obtain discriminative activity proposals. Although intuitive, these models are sensitive to negative samples. Specifically, they need densely sampled candidates to achieve good performance, leading to low efficiency and low flexibility.

Various approaches have been developed to overcome the above-mentioned drawbacks. Anchor-based methods [11]–[14], [30], [31] sequentially process the video frames and assign each frame multi-scale temporal anchors; the anchor with the highest confidence is then selected as the result. For instance, Yuan et al. [32] proposed a semantic conditioned dynamic modulation for better correlating sentence-related video contents over time and establishing a precise matching relationship between sentence and video. Zeng et al. [33] designed a dense regression network to regress the distances from each frame to the start/end frame of the video segment described by the query. Liu et al. [30] presented a cross- and self-modal graph attention network that recasts NLVL as a process of iterative message passing over a joint graph to capture high-order interactions between the two modalities.

Reinforcement learning-based methods formulate NLVL as a sequential decision-making problem [18], [20]. This process imitates humans' coarse-to-fine decision-making scheme to observe candidate moments conditioned on queries. For example, Wang et al. [18] proposed a recurrent neural network-based reinforcement learning model to selectively observe a sequence of frames and associate the given sentence with the video content in a matching-based manner. He et al. [19] presented a multi-task reinforcement learning framework to learn an agent that regulates the temporal grounding boundaries progressively based on its policy. Wu et al. [20] presented a tree-structured policy-based progressive reinforcement learning framework to simulate humans' coarse-to-fine decision-making paradigm and to sequentially regulate the temporal boundary in an iterative refinement manner.

Regression-based methods [15], [17] tackle NLVL by regressing the temporal locations of target moments directly. Yuan et al. [15] built a proposal-free model with a BiLSTM and a three-step attention module to regress the temporal locations of the target moment. Lu et al. [16] proposed a dense bottom-up framework, which regresses the distances to the start and end boundaries for each frame in the target moment and then selects the one with the highest confidence as the result. Chen et al. [17] further extended the bottom-up framework in [16] with a graph-based feature pyramid network to boost the performance. Mun et al. [34] proposed to extract a collection of semantic phrases from the query and reflect bi-modal interactions between the linguistic and visual features at multiple levels for moment boundary regression.
Fig. 2. An overview of the proposed architectures for NLVL. The feature extractors, i.e., GloVe and 3D ConvNet, are fixed during training. (a) depicts the structure of VSLNet. (b) shows the architecture of VSLNet-L. Note that the standard span-based QA framework (VSLBase) is VSLNet with the Self-Attention and Query-Guided Highlighting modules removed.

Recently, a few attempts have been made to address NLVL with the concept of question answering [35], [36]. Specifically, Chen et al. [35] introduced a cross-gated attended recurrent network and a self-interactor to exploit the fine-grained interactions between the query and video; a segment localizer is adopted to predict the starting and ending boundaries of the moment. Ghosh et al. [37] presented an extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video. However, they did not state and explain the similarities and differences between NLVL and traditional span-based QA, and they did not adopt the standard span-based QA framework. In this work, we adopt a standard span-based QA framework to address NLVL, and propose VSLNet to explicitly address the issues caused by the differences between NLVL and traditional span-based QA tasks.

In addition, several other methods have been applied to NLVL. Escorcia et al. [38] extended NLVL to a general case where the model is required to retrieve the target moments from a video corpus instead of from a single video. Shao et al. [39] presented a unified framework to learn video-level matching and moment-level localization jointly. Mithun et al. [40] explored building a joint visual-semantic embedding based framework to learn latent alignment between video frames and sentence descriptions under a weakly supervised setting. Lin et al. [41] proposed a weakly-supervised moment retrieval framework requiring only coarse video-level annotations. Zhang et al. [42] modeled the temporal relations between video moments by a two-dimensional map and proposed a temporal adjacent network to encode discriminative features for matching video moments with referring expressions. Chen et al. [43] presented a method for learning pairwise modality interactions to better exploit complementary information and improve performance on both temporal sentence localization and event captioning.
Span-based Question Answering (QA) has been widely studied in the past years. Wang et al. [44] combined match-LSTM [45] and Pointer-Net [46] to estimate the boundaries of the answer span. BiDAF [47] introduced bi-directional attention to obtain query-aware context representations. Xiong et al. [48] proposed a coattention network to capture the interactions between context and query. R-Net [49] integrated mutual and self attentions into the RNN encoder for feature refinement. QANet [21] leveraged a similar attention mechanism in a stacked convolutional encoder to improve performance. FusionNet [50] presented a fully-aware multi-level attention to capture complete query information. By treating the input video as a text passage, the above frameworks are all applicable to NLVL in principle. However, these frameworks are not designed to consider the differences between a video and a text passage. Their modeling complexity arises from the interactions between query and text passage, where the two inputs are in a homogeneous space. In our proposed method, VSLBase adopts a simple and standard span-based QA framework, making it easy to model the differences between video and text by adding additional modules. Our VSLNet addresses the issues caused by these differences with the QGH module.

Moreover, VSLNet-L is conceptually similar to the multi-paragraph question answering (MPQA) framework [22], [51]–[53], where both explore locating an answer span from multiple paragraphs/clips. Clark et al. [22] adapted standard span-based QA models to the entire-document scenario by introducing confidence modules to find an answer span from multiple paragraphs. Wang et al. [51] proposed to solve MPQA by jointly training three modules that predict the result based on answer boundary, answer content, and answer verification factors. Wang et al. [52] presented a multi-passage BERT model to globally normalize answer scores across all passages of the same question. Lin et al. [53] proposed a learning-to-rank model with an attention module to select the best-matching paragraph for a question. Pang et al. [54] presented a three-level hierarchical answer spans model to extract the answer from multiple paragraphs.
3 METHODOLOGY
In this section, we first describe how to address the NLVL task by adopting a span-based QA framework. Then, we present VSLBase (Sections 3.2 to 3.4) and VSLNet (Section 3.5) in detail. Lastly, we detail VSLNet-L (Section 3.6) with the multi-scale split-and-concat strategy. The model architectures are shown in Figure 2.
We denote the untrimmed video as V = [f_1, f_2, . . . , f_T] and the language query as Q = [q_1, q_2, . . . , q_m], where T and m are the number of frames and words, respectively. τ^s and τ^e represent the start and end times of the temporal moment, i.e., the answer span. To address NLVL with the span-based QA framework, we transform the data into a set of SQuAD-style triples (Context, Question, Answer) [55]. For each video V, we extract its visual feature sequence V = [v_1, v_2, . . . , v_n] with a pre-trained 3D ConvNet [56], where n is the number of extracted feature vectors. Here, V can be regarded as the sequence of word embeddings for a text passage with n tokens, as in the traditional span-based QA framework; similar to word embeddings, each v_i is a visual feature vector.

Since span-based QA aims to predict the start and end boundaries of an answer span, the start/end time of a video sequence needs to be mapped to the corresponding boundaries in the visual feature sequence V. Suppose the video duration is T; the start (end) span index is calculated by a_{s(e)} = Round(τ^{s(e)}/T × n), where Round(·) denotes the rounding operator. During inference, the predicted span boundaries can easily be converted to the corresponding times via τ^{s(e)} = a_{s(e)}/n × T. After transforming the moment annotations in the NLVL dataset, we obtain a set of (V, Q, A) triples: the visual feature sequence V = [v_1, v_2, . . . , v_n] acts as the passage with n tokens, Q = [q_1, q_2, . . . , q_m] is the query with m tokens, and the answer A = [v_{a_s}, v_{a_s+1}, . . . , v_{a_e}] corresponds to a span of the passage. The NLVL task then becomes finding the correct start and end boundaries of the answer span, a_s and a_e.

Fig. 3. The structure of the feature encoder.

After obtaining the visual features V = [v_1, v_2, . . . , v_n]^⊤ ∈ R^{n×d_v}, for a text query Q we compute its word embeddings Q = [q_1, q_2, . . . , q_m]^⊤ ∈ R^{m×d_q} with an existing text embedding approach, e.g., GloVe. We project the video and text feature vectors into the same dimension d, giving V′ ∈ R^{n×d} and Q′ ∈ R^{m×d}, by two linear layers. We then build the feature encoder with a simplified version of the embedding encoder layer in QANet [21]. Instead of applying a stack of multiple encoder blocks, we use only one encoder block, as shown in Figure 3. This encoder block consists of four convolution layers, followed by a multi-head attention layer [57]; a feed-forward layer is used to produce the output. Layer normalization [58] and residual connections [59] are applied to each layer. The encoded visual feature sequence and word embeddings are:

Ṽ = FeatureEncoder(V′),  Q̃ = FeatureEncoder(Q′)    (1)

The parameters of the feature encoder are shared by the visual features and word embeddings.
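For illustration, the following is a minimal PyTorch sketch of such a single encoder block. It is not the released implementation: the plain Conv1d layers stand in for the (possibly depthwise-separable) convolutions of the QANet-style block, and the dimension, kernel size, and head count are placeholder values rather than the paper's settings.

import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the shared encoder block in Figure 3: four convolution
    layers, one multi-head attention layer, and a feed-forward layer, each
    wrapped with layer normalization and a residual connection."""

    def __init__(self, dim=128, kernel_size=7, num_heads=8, num_convs=4):
        super().__init__()
        self.conv_norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_convs)])
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            for _ in range(num_convs)
        ])
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (batch, seq_len, dim)
        for norm, conv in zip(self.conv_norms, self.convs):
            y = norm(x).transpose(1, 2)        # Conv1d expects (batch, dim, seq_len)
            x = x + torch.relu(conv(y)).transpose(1, 2)
        x = x + self.attn(self.attn_norm(x), self.attn_norm(x), self.attn_norm(x))[0]
        return x + self.ffn(self.ffn_norm(x))

# The same instance encodes both projected video features V' and projected
# query embeddings Q', so its parameters are shared across the two modalities.
encoder = FeatureEncoder(dim=128)
v_tilde = encoder(torch.randn(2, 64, 128))     # encoded visual sequence
q_tilde = encoder(torch.randn(2, 12, 128))     # encoded query sequence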
Fig. 4. An illustration of foreground and background of visual features. α is the ratio of foreground extension.

Fig. 5. The structure of Query-Guided Highlighting (QGH).

After feature encoding, we utilize context-query attention (CQA) [21], [47], [48] to capture the cross-modal interactions between visual and textual features. CQA first calculates the similarity scores, S ∈ R^{n×m}, between each visual feature and each query feature. Then context-to-query (A) and query-to-context (B) attention weights are computed as:

A = S_r · Q̃ ∈ R^{n×d},  B = S_r · S_c^⊤ · Ṽ ∈ R^{n×d}    (2)

where S_r and S_c are the row- and column-wise normalizations of S by SoftMax, respectively. Finally, the output of context-query attention is written as:

V^q = FFN([Ṽ; A; Ṽ ⊙ A; Ṽ ⊙ B])    (3)

where V^q ∈ R^{n×d}; FFN is a single feed-forward layer; ⊙ denotes element-wise multiplication.
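A minimal PyTorch sketch of this context-query attention step (Eqs. 2-3) is given below. The bilinear similarity layer is only a stand-in for the similarity function (QANet-style implementations typically use a trilinear similarity), and the dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQueryAttention(nn.Module):
    """Sketch of CQA: fuse encoded video features (n x d) and query
    features (m x d) into query-aware video features V^q."""

    def __init__(self, dim=128):
        super().__init__()
        self.sim = nn.Linear(dim, dim, bias=False)   # stand-in similarity function
        self.ffn = nn.Linear(4 * dim, dim)

    def forward(self, v, q):                          # v: (B, n, d), q: (B, m, d)
        s = torch.matmul(self.sim(v), q.transpose(1, 2))       # S: (B, n, m)
        s_r = F.softmax(s, dim=2)     # row-wise normalization (over query tokens)
        s_c = F.softmax(s, dim=1)     # column-wise normalization (over video positions)
        a = torch.matmul(s_r, q)                                        # Eq. (2)
        b = torch.matmul(torch.matmul(s_r, s_c.transpose(1, 2)), v)     # Eq. (2)
        return self.ffn(torch.cat([v, a, v * a, v * b], dim=-1))        # Eq. (3)

cqa = ContextQueryAttention(dim=128)
v_q = cqa(torch.randn(2, 64, 128), torch.randn(2, 12, 128))   # (2, 64, 128)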
We construct a conditioned span predictor using two unidirectional LSTMs and two feed-forward layers, inspired by Ghosh et al. [37]. The main difference between our method and that in [37] is that we use unidirectional LSTMs instead of bidirectional LSTMs. We observe that the unidirectional LSTM shows similar performance but with fewer parameters and higher efficiency. We believe that the unidirectional LSTM works as well as the bidirectional LSTM in this setting for two possible reasons. First, video is a temporal sequence in which earlier events lead to later events, but not the other way round. Second, the end boundary of a target moment always appears after the start boundary in the sequence. Therefore, we stack the two LSTMs so that the LSTM for the end boundary is conditioned on the LSTM for the start boundary, to maintain the sequential nature of video. The hidden states of the two LSTMs are then fed into the corresponding feed-forward layers to compute the start and end scores:

h^s_t = UniLSTM_start(v^q_t, h^s_{t−1})
h^e_t = UniLSTM_end(h^s_t, h^e_{t−1})
S^s_t = W_s([h^s_t; v^q_t]) + b_s
S^e_t = W_e([h^e_t; v^q_t]) + b_e    (4)

where S^{s/e}_t denotes the score of the start/end boundary at position t; v^q_t represents the t-th element of V^q; W_{s/e} and b_{s/e}
denote the weight matrix and bias of the start/end feed-forward layer, respectively. The probability distributions of the start/end boundaries are then computed by P^{s/e} = SoftMax(S^{s/e}) ∈ R^n, and the training objective is defined as:

L_span = 1/2 [f_CE(P^s, Y^s) + f_CE(P^e, Y^e)]    (5)

where f_CE is the cross-entropy loss function; Y^s and Y^e are the labels for the start (a_s) and end (a_e) boundaries, respectively. During inference, the predicted answer span (â_s, â_e) of a query is generated by maximizing the joint probability of the start and end boundaries:

span(â_s, â_e) = argmax_{â_s, â_e} P^s(â_s) P^e(â_e),  s.t. 0 ≤ â_s ≤ â_e ≤ n    (6)

This completes the description of the VSLBase architecture. VSLNet is built on top of VSLBase with QGH, to be detailed in Section 3.5.
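For concreteness, a PyTorch sketch of the conditioned span predictor (Eq. 4) and the constrained decoding of Eq. (6) follows; the hidden size is a placeholder, and stacking two nn.LSTM modules is one straightforward way to realize the start-conditioned end LSTM.

import torch
import torch.nn as nn

class ConditionedSpanPredictor(nn.Module):
    """Sketch of the stacked unidirectional LSTM span predictor (Eq. 4):
    the end LSTM reads the hidden states of the start LSTM."""

    def __init__(self, dim=128):
        super().__init__()
        self.start_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.end_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.start_ffn = nn.Linear(2 * dim, 1)
        self.end_ffn = nn.Linear(2 * dim, 1)

    def forward(self, v_q):                              # v_q: (B, n, d)
        h_s, _ = self.start_lstm(v_q)
        h_e, _ = self.end_lstm(h_s)                      # conditioned on start states
        s_start = self.start_ffn(torch.cat([h_s, v_q], dim=-1)).squeeze(-1)
        s_end = self.end_ffn(torch.cat([h_e, v_q], dim=-1)).squeeze(-1)
        return s_start, s_end                            # (B, n) boundary scores

def decode_span(p_start, p_end):
    """Eq. (6): maximize P_s(a_s) * P_e(a_e) subject to a_s <= a_e."""
    joint = torch.triu(p_start.unsqueeze(1) * p_end.unsqueeze(0))   # (n, n)
    a_s, a_e = divmod(int(torch.argmax(joint)), joint.size(1))
    return a_s, a_e

scores_s, scores_e = ConditionedSpanPredictor()(torch.randn(1, 64, 128))
span = decode_span(scores_s[0].softmax(-1), scores_e[0].softmax(-1))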
The Query-Guided Highlighting (QGH) strategy is introduced in VSLNet to address the major differences between text span-based QA and NLVL, as shown in Figure 2(a). With the QGH strategy, we consider the target moment as the foreground and the rest as background, as illustrated in Figure 4. The target moment, which is aligned with the language query, starts at a_s and ends at a_e with length L = a_e − a_s. QGH extends the boundaries of the foreground to cover its antecedent and consequent video contents, where the extension ratio is controlled by a hyperparameter α. As mentioned in the Introduction, the extended boundary could potentially cover additional contexts and also help the network focus on subtle differences among video frames.

By assigning 1 to the foreground and 0 to the background, we obtain a binary sequence, denoted as Y_h. QGH is a binary classification module that predicts the confidence that a visual feature vector belongs to the foreground or background. The structure of QGH is shown in Figure 5. We first encode the word feature representations Q̃ into a sentence representation (denoted by h^Q) with a self-attention mechanism [60]. Then h^Q is concatenated with each feature element of V^q as V̄^q = [v̄^q_1, . . . , v̄^q_n], where v̄^q_i = [v^q_i; h^Q]. The highlighting score and highlighted feature representations are computed as:

S_h = σ(Conv1D(V̄^q)),  Ṽ^q = S_h · V̄^q    (7)

where S_h ∈ R^n; σ denotes the Sigmoid activation; · represents the multiplication operator, which multiplies each feature of V̄^q by the corresponding score in S_h. Accordingly, the feature V^q in Equation 4 is replaced by Ṽ^q in VSLNet to compute L_span. The loss function of query-guided highlighting is formulated as:

L_QGH = f_CE(S_h, Y_h)    (8)

VSLNet is trained in an end-to-end manner by minimizing the following overall loss:

L = L_span + L_QGH    (9)
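A compact PyTorch sketch of QGH (Figure 5, Eq. 7) is shown below; the additive attention pooling used to build the sentence representation h^Q is one plausible reading of the self-attention mechanism referenced above, and the layer sizes are placeholders.

import torch
import torch.nn as nn

class QueryGuidedHighlighting(nn.Module):
    """Sketch of QGH: pool the query into a sentence vector, concatenate it
    with every query-attended visual feature, predict a per-position
    foreground score with Conv1D + Sigmoid, and re-weight the features."""

    def __init__(self, dim=128):
        super().__init__()
        self.pool = nn.Linear(dim, 1)                  # additive attention pooling
        self.conv = nn.Conv1d(2 * dim, 1, kernel_size=1)

    def forward(self, v_q, q_tilde):                   # v_q: (B, n, d), q_tilde: (B, m, d)
        attn = torch.softmax(self.pool(q_tilde), dim=1)                  # (B, m, 1)
        h_q = (attn * q_tilde).sum(dim=1, keepdim=True)                  # h^Q: (B, 1, d)
        v_bar = torch.cat([v_q, h_q.expand(-1, v_q.size(1), -1)], dim=-1)
        s_h = torch.sigmoid(self.conv(v_bar.transpose(1, 2))).transpose(1, 2)
        return s_h.squeeze(-1), s_h * v_bar            # scores S_h, highlighted features

qgh = QueryGuidedHighlighting(dim=128)
scores, v_highlighted = qgh(torch.randn(2, 64, 128), torch.randn(2, 12, 128))
# scores feed the binary cross-entropy term L_QGH against the extended
# foreground labels Y_h; v_highlighted replaces V^q in the span predictor.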
Fig. 6. Illustration of splitting a video into clip segments: (a) an ideal case of splitting a video into clip segments; (b) a non-ideal case of splitting a video into clip segments; (c) an example of splitting a video with two different scales (Scale 1: K = 5; Scale 2: K = 4).

Fig. 7. The structure of the Nil Prediction Module (NPM).

To address the localization performance degradation issue on long videos, we introduce VSLNet-L with a multi-scale split-and-concatenation strategy. The architecture of VSLNet-L is shown in Figure 2(b).
As illustrated in Figure 6(a), given a long video, VSLNet-L splits it into K clip segments:

V = [C_k]_{k=1}^{K},  C_k = [v_i], i = (k−1)×l+1, . . . , k×l    (10)

where l is the length of each clip segment C_k, i.e., K × l = n. Note that, in our implementation, we perform the video split at the visual feature level, instead of on the untrimmed video itself, for computational efficiency. Specifically, we split the features V = [v_i]_{i=1}^{n} obtained from the pre-trained 3D ConvNet (see Section 3.1), and use the feature vectors V in Eq. 10 accordingly.

Each clip segment C_k is then processed by the feature encoder and CQA, to learn query-attended representations C^q_k = [c^q_i], i = (k−1)×l+1, . . . , k×l, with C^q_k ∈ R^{l×d}, as shown in Figure 2(b). Thus, C^q_k encodes the cross-modal features between clip segment k and the query. Then, a Nil Prediction Module (NPM) is introduced in VSLNet-L to predict whether a clip segment contains or partially contains the temporal moment that corresponds to the text query, as shown in Figure 2(b). Next, we detail the structure of the NPM following the illustration in Figure 7.
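The feature-level split of Eq. (10) amounts to reshaping the feature sequence, as in the sketch below; it assumes n is an exact multiple of the clip length l (in practice the sequence can be padded or resampled so that this holds), and the feature dimension in the example is arbitrary.

import torch

def split_into_clips(features, clip_len):
    """Split an (n, d) visual feature sequence into K = n // clip_len clip
    segments of shape (clip_len, d), operating purely at the feature level."""
    n, d = features.shape
    assert n % clip_len == 0, "pad or resample features so that K * l = n"
    return features.view(n // clip_len, clip_len, d)

clips = split_into_clips(torch.randn(600, 1024), clip_len=150)
print(clips.shape)   # torch.Size([4, 150, 1024]), i.e., K = 4 segments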
TABLE 1
Statistics of the Natural Language Video Localization (NLVL) datasets.
Charades-STA (Indoors), ANetCap (Open), TACoS_org (Cooking), and TACoS_tan (Cooking) are compared in terms of domain, N_vocab, L̄_video, L̄_query, L̄_moment, and Δ_moment.

Note that N_vocab is the vocabulary size of lowercase words, L̄_video denotes the average length of videos in seconds, L̄_query denotes the average number of words in a sentence query, L̄_moment is the average length of temporal moments in seconds, and Δ_moment is the standard deviation of temporal moment length in seconds.

For each clip segment, its query-attended features C^q_k are first encoded by a feature encoder as:

Ĉ^q_k = FeatureEncoder_NPM(C^q_k)    (11)
The self-attention mechanism [60] is adopted to encode Ĉ^q_k into a clip representation h^q_{C_k}, and the nil-score is computed as:

α_k = SoftMax(Conv1D(Ĉ^q_k))
h^q_{C_k} = Σ_{i=1}^{l} α_{k,i} · ĉ^q_{(k−1)×l+i}
S^k_nil = σ(FFN(h^q_{C_k}))    (12)

where α_k ∈ R^l and h^q_{C_k} ∈ R^d. S^k_nil ∈ R is a scalar, which indicates the confidence that clip segment k contains the ground truth moment. The loss of NPM is formulated as:

L_NPM = f_CE(S_nil, Y_nil)    (13)

Y_nil is a 0-1 sequence provided during training. A clip segment is positive (label 1) iff it overlaps with the ground truth moment, as illustrated in Figure 6; clip segments that do not contain the ground truth moment are negative (label 0). After computing the nil-scores of all clip segments, i.e., S_nil = [S^1_nil, . . . , S^K_nil] ∈ R^K, we normalize the scores. The output for clip segment k is:

S̄_nil = S_nil / max(S_nil),  C̄^q_k = S̄^k_nil × Ĉ^q_k    (14)

where S̄_nil is the normalized nil-score and C̄^q_k is the re-weighted representation of clip segment k. The NPM highlights the clip segments that contain the target moment and suppresses the other segments. The subsequent modules can then localize the result by focusing more on the highlighted segments, which is equivalent to narrowing down the search scope from a long video to a short segment of it.

With the clip segments processed separately in the previous step, we now concatenate the representations of all clip segments, for two reasons. First, a single clip segment may not fully cover the target moment. Second, even if a segment is predicted to be negative (or with low confidence), it might provide useful contextual information for localizing the target moment.

C̄^q_V = [C̄^q_1 ∥ C̄^q_2 ∥ · · · ∥ C̄^q_K]    (15)

where ∥ denotes the concatenation operator and C̄^q_V ∈ R^{n×d}. Accordingly, the input feature V^q of QGH is replaced by C̄^q_V in VSLNet-L to compute L_QGH and L_span. The combined training loss for VSLNet-L is:

L = L_span + L_QGH + L_NPM    (16)
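A compact PyTorch sketch of NPM and the subsequent re-weighting and concatenation (Eqs. 12-15) is given below; the layer sizes are placeholders, and the batch dimension is dropped for brevity (each clip is treated as one item of a size-K batch).

import torch
import torch.nn as nn

class NilPredictionModule(nn.Module):
    """Sketch of NPM: attention-pool each clip (Eq. 12), predict a nil-score
    per clip, normalize the scores, and re-weight the clips (Eq. 14)."""

    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.Conv1d(dim, 1, kernel_size=1)   # produces alpha_k
        self.ffn = nn.Linear(dim, 1)

    def forward(self, clips):                  # clips: (K, l, d) encoded clip features
        alpha = torch.softmax(self.attn(clips.transpose(1, 2)), dim=-1)   # (K, 1, l)
        h = torch.bmm(alpha, clips).squeeze(1)                            # (K, d)
        s_nil = torch.sigmoid(self.ffn(h)).squeeze(-1)                    # (K,) nil-scores
        s_bar = s_nil / s_nil.max()                                       # Eq. (14)
        return s_nil, s_bar.view(-1, 1, 1) * clips                        # re-weighted clips

npm = NilPredictionModule(dim=128)
nil_scores, clips_rw = npm(torch.randn(4, 150, 128))
concat = clips_rw.reshape(-1, 128)   # Eq. (15): concatenate clips back to an (n, d) sequence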
Unlike a text document, there are no paragraphs as semantic units in a video. Any split of the video may break important contextual information across different clip segments. Although all clip segments are concatenated again for video-level localization, each clip segment is processed separately, and the contextual information between two segments may not be well captured.

To address this issue, we propose a multi-scale split mechanism, which splits the video at different segment lengths, i.e., different K (see Figure 6(c)). Suppose we use N_s different scales; for each scale we have:

K_i × l_i = n, ∀ i = 1, 2, . . . , N_s    (17)

where K_i and l_i denote the number of clip segments and the clip segment length for the i-th scale, respectively. Through the multi-scale split, contextual information is better captured. Meanwhile, the multi-scale split also provides variation in training samples for the same video and query pair: the target moment may be located in multiple different clip segments, and its contexts also differ. Note that the same training process as the single-scale split-and-concat applies, except that the clip segments of each scale are fed separately into VSLNet-L to optimize the objective.

Consequently, we derive N_s predicted moments for a given video and language query pair, because the clip segments at each scale lead to a pair of start/end boundaries for a predicted moment. During inference, we adopt two simple candidate selection strategies to derive the final target moment. The VSLNet-L-P_m strategy selects the candidate with the highest joint boundary probability, P_m = max{P^i_span}_{i=1}^{N_s}, where P^i_span = P^i_s(â_s) P^i_e(â_e) is the maximal joint boundary probability of the moment generated by the i-th scale using Eq. 6. The VSLNet-L-U strategy selects the two moments with the largest overlap from the N_s candidates and computes their union as the final result.
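The two selection strategies can be sketched as follows; each candidate is a ((start, end), probability) pair, one per scale, with the probability obtained from the Eq.-6 style decoding of that scale, and the helper names are illustrative only.

def select_pm(candidates):
    """VSLNet-L-P_m: keep the moment whose scale has the highest joint
    boundary probability."""
    return max(candidates, key=lambda c: c[1])[0]

def select_union(candidates):
    """VSLNet-L-U: pick the two candidate moments with the largest mutual
    overlap and return their union."""
    def overlap(m1, m2):
        return max(0, min(m1[1], m2[1]) - max(m1[0], m2[0]))
    best = max(
        ((m1, m2) for i, (m1, _) in enumerate(candidates)
         for (m2, _) in candidates[i + 1:]),
        key=lambda pair: overlap(*pair),
    )
    return min(best[0][0], best[1][0]), max(best[0][1], best[1][1])

cands = [((12, 30), 0.21), ((10, 28), 0.18), ((40, 55), 0.05)]
print(select_pm(cands))      # (12, 30)
print(select_union(cands))   # (10, 30)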
4 EXPERIMENTS

We conduct experiments on three benchmark datasets: Charades-STA [25], ActivityNet Captions [61], and TACoS [62], as summarized in Table 1.
Charades-STA is prepared by Gao et al. [25] based on the Charades dataset [63], with videos of daily indoor activities. There are 12,408 and 3,720 moment annotations for training and test, respectively. ActivityNet Captions (ANetCap) contains about 20k videos taken from ActivityNet [64]. We follow the setup in Yuan et al. [15], with 37,421 moment annotations for training and 17,505 annotations for test. TACoS is selected from the MPII Cooking Composite Activities dataset [65]. There are two versions of TACoS available. TACoS_org used in [25] has 10,146, 4,589, and 4,083 annotations for training, validation, and test. TACoS_tan is from 2D-TAN [42], which contains 9,790, 4,436, and 4,001 annotations in the train, validation, and test sets.
TABLE 2
Results (%) of "Rank@1, IoU = µ" and "mIoU" compared with the state-of-the-art on Charades-STA.

Model: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU

3D ConvNet without fine-tuning as visual feature extractor
CTRL [25]: - / 23.63 / 8.89 / -
ACL [9]: - / 30.48 / 12.20 / -
QSPN [10]: 54.70 / 35.60 / 15.80 / -
SAP [29]: - / 27.42 / 13.36 / -
SM-RL [18]: - / 24.36 / 11.17 / -
RWM-RL [19]: - / 36.70 / - / -
MAN [11]: - / 46.53 / 22.72 / -
DEBUG [16]: 54.95 / 37.39 / 17.69 / 36.34
TSP-PRL [20]: - / 37.39 / 17.69 / 37.22
2D-TAN [42]: - / 39.81 / 23.31 / -
CBP [14]: - / 36.80 / 18.87 / -
GDP [17]: 54.54 / 39.47 / 18.49 / -
VSLBase: 61.72 / 40.97 / 24.14 / 42.11
VSLNet

3D ConvNet with fine-tuning on Charades dataset
ExCL [37]: 65.10 / 44.10 / 23.30 / -
SCDM [32]
VSLBase: 68.06 / 50.23 / 30.16 / 47.15
VSLNet
We conduct experiments on all the benchmark datasets for VSLBase and VSLNet. For VSLNet-L, we evaluate it on ActivityNet Captions and both versions of TACoS; Charades-STA is not used for VSLNet-L due to its short video length.

We adopt "Rank@n, IoU = µ" and "mIoU" as the evaluation metrics, following [8], [15], [25]. "Rank@n, IoU = µ" denotes the percentage of language queries having at least one result whose Intersection over Union (IoU) with the ground truth is larger than µ among the top-n retrieved moments. "mIoU" is the average IoU over all testing samples. In our experiments, we use n = 1 and µ ∈ {0.3, 0.5, 0.7}.

For a text query Q, we lowercase all its words and initialize them with 300d GloVe [66] embeddings. The word embeddings are fixed during training. For the untrimmed video V, we extract its visual features using a 3D ConvNet pre-trained on the Kinetics dataset [56]. We set the maximal feature length n to 128 for Charades-STA, i.e., the extracted visual feature sequence of a video is uniformly downsampled if its length exceeds n, and zero-padded otherwise. The n of ANetCap is set to 300, while two maximal feature lengths, n ∈ {300, 600}, are used for evaluation on TACoS. When evaluating VSLNet-L, the visual features are split into multiple clip segments using different scales; we use l = {100, 120, 150, 200} (i.e., K = {6, 5, 4, 3}) for n = 600, and four proportionally smaller clip lengths for n = 300.
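For reference, a small sketch of the evaluation metrics described above (temporal IoU, "Rank@1, IoU = µ", and mIoU); it assumes a single predicted (start, end) time pair per query, i.e., n = 1, and uses the common >= convention at the threshold.

def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, thresholds=(0.3, 0.5, 0.7)):
    """Compute "Rank@1, IoU = mu" for each threshold and the mean IoU (%)."""
    ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    recall = {mu: 100.0 * sum(iou >= mu for iou in ious) / len(ious) for mu in thresholds}
    return recall, 100.0 * sum(ious) / len(ious)

preds = [(53.4, 79.7), (10.0, 20.0)]
gts = [(53.2, 80.1), (15.0, 30.0)]
print(evaluate(preds, gts))   # ({0.3: 50.0, 0.5: 50.0, 0.7: 50.0}, ~61.4)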
TABLE 3
Results (%) of "Rank@1, IoU = µ" and "mIoU" compared with the state-of-the-art on TACoS.

TACoS_org (Model: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU)
CTRL [25]: 18.32 / 13.30 / - / -
TGN [12]: 21.77 / 18.90 / - / -
ACL [9]: 24.17 / 20.01 / - / -
ACRN [8]: 19.52 / 14.62 / - / -
ABLR [15]: 19.50 / 9.40 / - / 13.40
SM-RL [18]: 20.25 / 15.95 / - / -
DEBUG [16]: 23.45 / 11.72 / - / 16.03
SCDM [32]: 26.11 / 21.17 / - / -
GDP [17]: 24.14 / 13.90 / - / 16.18
CBP [14]: 27.31 / 24.79 / 19.10 / 21.59
DRN [33]: - / 23.17 / - / -
VSLNet (S)
VSLNet (L)
VSLNet-L-P_m (S)
VSLNet-L-U (S)
VSLNet-L-P_m (L)
VSLNet-L-U (L)

TACoS_tan (Model: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU)
VSLNet (S)
VSLNet (L)
VSLNet-L-P_m (S)
VSLNet-L-U (S)
VSLNet-L-P_m (L)
VSLNet-L-U (L)

(S) denotes n = 300 and (L) represents n = 600.
Fig. 8.
The Mean IoU (%) performance of CTRL [25], SCDM [32], 2D-TAN Pool [42], and VSLNet on the TACoS dataset.
We set the dimension of all hidden layers in the model to 128; the kernel size of the convolution layer is 7; the head size of multi-head attention is 8. All models are trained for 100 epochs with a batch size of 16 and an early stopping strategy for all datasets. Adam [67] is used as the optimizer, with a learning rate of 0.0001, linear decay of the learning rate, and gradient clipping of 1.0. Dropout [68] of 0.2 is applied to prevent overfitting. All experiments are conducted on a workstation with dual NVIDIA GeForce RTX 2080Ti GPUs.
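As a concrete reference for the feature-length handling mentioned in the implementation details, the sketch below uniformly downsamples or zero-pads a visual feature sequence to the maximal length n; the boolean mask is an implementation convenience not described in the text, and the feature dimension in the example is arbitrary.

import torch

def adjust_feature_length(features, max_len):
    """Uniformly downsample an (m, d) feature sequence to max_len if it is
    longer, or zero-pad it to max_len otherwise."""
    m, d = features.shape
    if m > max_len:
        idx = torch.linspace(0, m - 1, steps=max_len).round().long()
        return features[idx], torch.ones(max_len, dtype=torch.bool)
    mask = torch.cat([torch.ones(m, dtype=torch.bool),
                      torch.zeros(max_len - m, dtype=torch.bool)])
    return torch.cat([features, torch.zeros(max_len - m, d)]), mask

feats, mask = adjust_feature_length(torch.randn(437, 1024), max_len=128)
print(feats.shape, int(mask.sum()))   # torch.Size([128, 1024]) 128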
TABLE 4
Results (%) of "Rank@1, IoU = µ" and "mIoU" compared with the state-of-the-art on ActivityNet Captions.

Model: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU
TGN [12]: 45.51 / 28.47 / - / -
ACRN [8]: 49.70 / 31.67 / 11.25 / -
ABLR [15]: 55.67 / 36.79 / - / 36.99
RWM-RL [19]: - / 36.90 / - / -
QSPN [10]: 45.30 / 27.70 / 13.60 / -
DEBUG [16]: 55.91 / 39.72 / - / 39.51
SCDM [32]: 54.80 / 36.75 / 19.86 / -
TSP-PRL [20]: 56.08 / 38.76 / - / 39.21
GDP [17]: 56.17 / 39.27 / - / 39.80
CBP [14]: 54.30 / 35.76 / 17.80 / -
DRN [33]
VSLBase*
VSLNet*
VSLNet
VSLNet-L-P_m
VSLNet-L-U

* denotes that the maximal visual sequence length n of VSLBase and VSLNet is set as 128, which is consistent with [23].

Fig. 9.
Mean IoU (%) results of VSLNet on the TACoS_tan dataset under different maximal visual representation lengths n.

We compare the proposed methods with the following state-of-the-art methods: CTRL [25], ACRN [8], TGN [12], ACL [9], DEBUG [16], SAP [29], QSPN [10], SM-RL [18], RWM-RL [19], ABLR [15], MAN [11], SCDM [32], TSP-PRL [20], 2D-TAN [42], GDP [17], ExCL [37], CBP [14], LGI [34], and DRN [33]. Scores of the benchmark methods in all result tables are those reported in the corresponding papers. The best results are in bold and the second best are underlined.

Among the proposed methods, VSLBase is a direct implementation of the span-based QA framework on the NLVL task; VSLNet is the extension of VSLBase with QGH; VSLNet-L is a further extension of VSLNet with the multi-scale split-and-concat strategy, designed to handle long videos more effectively. In the following, we show that VSLBase is comparable to existing baselines on NLVL tasks, while VSLNet surpasses VSLBase and all existing baselines and achieves state-of-the-art results. We then show that VSLNet-L well addresses the issue of performance degradation on long videos, compared to VSLNet.
Fig. 10. Statistics of normalized moment lengths in videos for both TACoS_org and TACoS_tan.

Fig. 11. Visualization of the performance improvement of VSLNet-L over VSLNet on TACoS for different video lengths.
The results on Charades-STA are reported in Table 2. VSLNet significantly outperforms most of the baselines by a large margin over all metrics. In addition, it is worth noting that the performance improvements of VSLNet are more significant under larger IoU thresholds; for instance, compared to MAN, VSLNet achieves a larger improvement at IoU = 0.7 than at IoU = 0.5. Without query-guided highlighting, VSLBase outperforms all compared baselines at IoU = 0.7, which shows that adopting the span-based QA framework is promising for NLVL. Moreover, VSLNet benefits from visual feature fine-tuning and achieves state-of-the-art results on this dataset.

Table 3 reports the results on both versions of the TACoS dataset. In general, VSLNet outperforms previous methods over all evaluation metrics. In addition, with the split-and-concat mechanism, VSLNet-L further improves the performance on top of VSLNet. On TACoS_org, the results of VSLNet (S) are comparable to those of VSLNet (L), while VSLNet-L (L) surpasses VSLNet-L (S) for both candidate selection strategies. Here, L and S denote the maximal video feature lengths 600 and 300, respectively. Similar observations hold on TACoS_tan. These results demonstrate that VSLNet-L is more adept than the others at localizing temporal moments in longer videos. Moreover, VSLNet-L-P_m is generally superior to VSLNet-L-U under different n for both versions of the TACoS dataset.

The results on the ActivityNet Captions dataset are summarized in Table 4. We observe that VSLBase performs similarly to or slightly better than most of the baselines, while VSLNet further boosts the performance of VSLBase significantly.
Fig. 12. Similarity scores, S, between visual and language features in the context-query attention. a_s/a_e denote the start/end boundaries of the ground truth video moment; â_s/â_e denote the start/end boundaries of the predicted target moment.

Fig. 13. Analysis of the impact of the extension ratio α in Query-Guided Highlighting on Charades-STA: (a) Rank@1, IoU = 0.3; (b) Rank@1, IoU = 0.5; (c) Rank@1, IoU = 0.7; (d) mIoU.

Comparing VSLNet with n = 128 and that with n = 300, we find that a small n leads to better performance on loose metrics while a large n is beneficial for strict metrics. Meanwhile, the performance of VSLNet-L is comparable to state-of-the-art methods. It is worth noting that almost all annotations in ActivityNet Captions belong to videos only a few minutes long. As VSLNet-L is designed to address performance degradation on long videos, it is reasonable to observe that VSLNet-L achieves a less significant performance improvement on ActivityNet Captions than on TACoS, w.r.t. the state-of-the-art. Moreover, VSLNet-L-U performs better than VSLNet-L-P_m on ActivityNet Captions, different from the observation on TACoS. This could be due to the different ratios of L̄_moment to L̄_video in the two datasets (see Table 1): the strategy that selects longer spans works better on the ActivityNet Captions dataset.

As discussed in Section 1 and illustrated in Figure 8, existing methods, including VSLNet, still underperform on NLVL with long videos. That is, the localization performance decreases dramatically along with the increase of video length. As summarized in Table 5, there are fewer videos/annotations along the increase of video length in the datasets.
Fig. 14. Histograms of the number of predicted results on the test set under different IoUs, on two datasets: (a) Charades-STA; (b) ActivityNet Captions.

The relatively small number of training samples may lead to instability of the evaluated models and to performance degradation on long videos, to some extent. However, we believe that the following are the two main reasons for the performance degradation:
1) Downsampling the visual features of long videos, as done in most existing methods, adversely affects localization accuracy due to information loss. As shown in Figure 9, sparsely downsampling video feature representations below a certain length leads to dramatic performance degradation.
2) As plotted in Figure 10, the average normalized length of ground truth moments gradually decreases along with the increase of video length. The sparsity of moments also contributes to poor performance on long videos.
To address this issue, VSLNet-L splits the video into multiple clip segments with different scales, to simulate the multiple paragraphs in a document and maximize the chance of locating the target moment within one clip segment.

Table 6 reports the mIoU gains of VSLNet-L on the TACoS dataset for videos of different lengths. Compared to the results of VSLNet (the best performing method without considering video length), larger improvements are observed on longer videos, which demonstrates the superiority of VSLNet-L for localizing temporal moments in long videos. For instance, on TACoS_org, VSLNet-L (L) achieves substantially larger absolute mIoU improvements on long videos than on short ones. Despite a slight performance reduction on the shortest videos, consistent improvements are observed on TACoS_tan as the video length increases, for both n = 300 (S) and
600 (L), compared to VSLNet. Figure 11 plots the performance improvements along video lengths for better visualization.

Results on ActivityNet Captions are reported in Table 7. Although the videos are relatively short, VSLNet-L manages to improve the localization performance, with larger improvements observed on longer videos. These results show the consistent superiority of VSLNet-L over VSLNet for both candidate selection strategies, on videos of different lengths.

In this section, we conduct ablative experiments to analyze the importance of different modules, including the feature encoder and context-query attention in VSLBase. We also investigate the impact of the extension ratio α (see Figure 4) in the query-guided highlighting (QGH) of VSLNet, and study the impact of the multi-scale split strategy in VSLNet-L. Finally, we visually show the effectiveness of the proposed methods and discuss their limitations. We conduct the ablation studies in this order because VSLNet is built on top of VSLBase, and VSLNet-L is an extension of VSLNet.
TABLE 5
Statistics of videos and annotations w.r.t. different video lengths over NLVL datasets.
For each dataset split (TACoS_org train/val/test, TACoS_tan train/val/test, and ANetCap train/test), the table lists the number of videos and annotations falling into each video-length bucket, from the shortest bucket up to videos longer than the last threshold; the counts drop sharply for the longer buckets.

TABLE 6
Comparison of mIoU (%) between VSLNet and VSLNet-L on the TACoS dataset w.r.t. different video lengths. For each video-length bucket, the table lists the mIoU of VSLNet (S)/(L) and the gains of VSLNet-L-P_m/-U over them; (S) denotes n = 300, (L) denotes n = 600, and performance gains and losses are indicated in different colors.

TABLE 7
Comparison of mIoU (%) between VSLNet and VSLNet-L on ActivityNet Captions w.r.t. different video lengths. Across the three video-length buckets (shortest to longest), VSLNet obtains mIoU of 46.21, 40.60, and 35.24; the VSLNet-L-P_m and VSLNet-L-U rows report gains over VSLNet, with the largest gains (up to +4.39) on the longest bucket. Performance gain and loss are indicated in different colors.
TABLE 8
Comparison between models with alternative modules in VSLBase on Charades-STA.
Module: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU
BiLSTM + CAT: 61.18 / 43.04 / 26.42 / 42.83
CMF + CAT: 63.49 / 44.87 / 27.07 / 44.01
BiLSTM + CQA: 65.08 / 46.94 / 28.55 / 45.18
CMF + CQA: 68.06 / 50.23 / 30.16 / 47.15
TABLE 9
Performance gains (%) of different modules over "Rank@1, IoU = 0.7" on Charades-STA.

Module: CAT / CQA / ∆
BiLSTM: 26.42 / 28.55 / +2.13
CMF: 27.07 / 30.16 / +3.09
∆: +0.65 / +1.61 / -

We first study the effectiveness of the feature encoder and context-query attention (CQA) in VSLBase by replacing them with other modules. Specifically, we use a bidirectional LSTM (BiLSTM) as an alternative feature encoder. For context-query attention, we replace it with a simple method (named CAT), which concatenates each visual feature with the max-pooled query feature. Our feature encoder consists of Convolution + Multi-head attention + Feed-forward layers (see Section 3.2), named CMF here. With these alternatives, we have four combinations, listed in Table 8. As observed from the results, CMF shows stable superiority over CAT on all metrics regardless of the other module, and CQA surpasses CAT whichever feature encoder is used. This study indicates that CMF and CQA are more effective.
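For clarity, a sketch of the CAT alternative used in this ablation is shown below: each visual feature is concatenated with the max-pooled query feature, with no cross attention. The projection layer that maps the concatenation back to the model dimension is an assumption about how the fused feature is consumed downstream.

import torch
import torch.nn as nn

class CATFusion(nn.Module):
    """CAT baseline: concatenate every visual feature with the max-pooled
    query representation instead of applying context-query attention."""

    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, v, q):                    # v: (B, n, d), q: (B, m, d)
        q_pooled, _ = q.max(dim=1, keepdim=True)             # (B, 1, d)
        fused = torch.cat([v, q_pooled.expand_as(v)], dim=-1)
        return self.proj(fused)                               # (B, n, d)

out = CATFusion(dim=128)(torch.randn(2, 64, 128), torch.randn(2, 12, 128))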
Fig. 15. Visualization of predictions by VSLBase and VSLNet on (a) two example cases from the Charades-STA dataset and (b) two example cases from the ActivityNet Captions dataset. Figures on the left depict the localized results of the two models; figures on the right show the probability distributions of start/end boundaries and the highlighting scores.
TABLE 10
Results (%) of VSLNet-L on TACoS using different split scales with n = 600.

TACoS_org (Model / Scale l: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU)
VSLNet (-): 29.78 / 24.71 / 19.64 / 23.96
VSLNet-L (l = 100): 30.18 / 25.81 / 20.77 / 24.46
VSLNet-L (l = 120): 31.13 / 26.87 / 21.19 / 25.12
VSLNet-L (l = 150): 30.42 / 26.38 / 20.89 / 24.73
VSLNet-L (l = 200): 30.59 / 26.07 / 21.01 / 24.61
VSLNet-L-P_m (multi-scale)
VSLNet-L-U (multi-scale)

TACoS_tan (Model / Scale l: IoU=0.3 / IoU=0.5 / IoU=0.7 / mIoU)
VSLNet (-): 41.42 / 30.67 / 22.32 / 31.92
VSLNet-L (l = 100): 42.24 / 32.69 / 23.67 / 32.41
VSLNet-L (l = 120): 43.39 / 33.37 / 24.19 / 33.45
VSLNet-L (l = 150): 44.61 / 33.99 / 24.27 / 33.71
VSLNet-L (l = 200): 43.74 / 33.67 / 23.74 / 33.50
VSLNet-L-P_m (multi-scale)
VSLNet-L-U (multi-scale)

Fig. 16. Visualizations of two predicted examples by VSLNet and VSLNet-L on the TACoS dataset.

Table 9 reports the performance gains of different modules on the "Rank@1, IoU = 0.7" metric. The results show that replacing CAT with CQA leads to larger improvements than replacing BiLSTM with CMF. This observation suggests that CQA plays a more important role in our model. Specifically, keeping CQA, the absolute gain from replacing the encoder module is 1.61; keeping CMF, the gain from replacing the attention module is 3.09.
Fig. 17. Plots of moment length errors in seconds between the ground truths and the results predicted by VSLBase and VSLNet, respectively, on (a) Charades-STA and (b) ActivityNet Captions.

Figure 12 displays the matrix of similarity scores between visual and language features in the context-query attention (CQA) module (S ∈ R^{n×m} in Section 3.3). The figure shows that visual features are more relevant to the verbs and their objects in the query sentence. For example, the similarity scores between visual features and "eating" (or "sandwich") are higher than those of other words. We believe that verbs and their objects are more likely to be used to describe video activities. Our observation is consistent with Ge et al. [9], where verb-object pairs are extracted as semantic activity concepts; in contrast, these concepts are automatically captured by the CQA module in our method.
VSLNet introduces the query-guided highlighting (QGH) module on top of VSLBase to address the technical gaps between video and text. QGH guides VSLNet to search for the target moment within a longer highlighted region controlled by the extension ratio α. We now investigate the impact of the extension ratio α in the QGH module on the Charades-STA dataset. We evaluate different values of α from 0 to ∞, where 0 represents no answer span extension and ∞ means that the entire video is regarded as foreground. The results for various α's are plotted in Figure 13. Query-guided highlighting consistently contributes to performance improvements, regardless of the α value, i.e., from 0 to ∞. As α rises, the performance of VSLNet first increases and then gradually decreases, with the optimal performance appearing at a small, non-zero α over all metrics. Note that, even when α = ∞, which is equivalent to highlighting no region as a coarse region to locate the target moment, VSLNet remains better than VSLBase. As shown in Figure 5, when α = ∞, QGH effectively becomes a straightforward concatenation of the sentence representation with each of the visual features. The resulting feature remains helpful for capturing semantic correlations between vision and language. In this sense, this function can be regarded as an approximation of the traditional multimodal matching strategy [8], [24], [25].

VSLNet-L further introduces a multi-scale split-and-concat strategy to address performance degradation on long videos. Here, we study the impact of the multi-scale split on the TACoS dataset with n = 600, against the single-scale split. We evaluate different single-scale values, i.e., l ∈ {100, 120, 150, 200}, and the multi-scale mechanism is jointly trained with the four scales. The results are summarized in Table 10. Compared to VSLNet, the split-and-concat strategy in VSLNet-L consistently contributes to performance improvements, regardless of the l value. The best single-scale l is 120 for TACoS_org and 150 for TACoS_tan. Compared to VSLNet-L with a single scale, VSLNet-L-P_m (and -U) further improves all metrics significantly. The multi-scale split-and-concat mechanism not only alleviates the issue of target moment truncation but also captures contextual information in the video; both improve the generalization ability of VSLNet-L.

Figure 14 shows the histograms of the predicted results of VSLBase and VSLNet on the test sets of the Charades-STA and ActivityNet Captions datasets. The results indicate that VSLNet beats VSLBase by having more samples in the high IoU ranges on the Charades-STA dataset.
Figure 14 shows the histograms of the predicted results of VSLBase and VSLNet on the test sets of the Charades-STA and ActivityNet Captions datasets. The results indicate that VSLNet beats VSLBase by having more samples in the high IoU ranges on the Charades-STA dataset; more predicted results of VSLNet are also distributed in the high IoU ranges on the ActivityNet Captions dataset. This result demonstrates the effectiveness of the query-guided highlighting (QGH) strategy.

We show two examples of VSLBase and VSLNet in Figures 15(a) and 15(b), from the Charades-STA and ActivityNet Captions datasets, respectively. In both figures, the moments localized by VSLNet are closer to the ground truth than those by VSLBase. Meanwhile, the start and end boundaries predicted by VSLNet are roughly constrained within the highlighted regions S_h computed by QGH.

Figure 16 depicts two predicted examples from the TACoS dataset as case studies. The moments localized by VSLNet-L are more accurate and closer to the ground truth than those of VSLNet. Both examples show that the results of VSLNet-L are constrained within the positive clip segment, i.e., the clip segment that contains the ground truth. In the second example, VSLNet does not capture the concept “the cutting board in the sink” and focuses on retrieving the “washes” action only, which leads to an erroneous prediction. In contrast, VSLNet-L correctly understands both “washes” and the position of the “cutting board”, leading to the correct prediction.

We further study the error patterns of the predicted moment lengths of VSLBase and VSLNet, as shown in Figure 17. We measure the differences between the moment lengths of the ground truths and those of the predicted results; a positive length difference means the predicted moment is longer than the corresponding ground truth, while a negative one means it is shorter. Figure 17 shows that VSLBase tends to predict longer moments, e.g., it produces more samples with large positive length errors on Charades-STA and more samples with errors above 30 seconds on ActivityNet Captions. On the contrary, constrained by QGH, VSLNet tends to predict shorter moments, with more samples falling in the negative error ranges on both datasets. This observation is helpful for future research on adopting the span-based QA framework for NLVL.

In addition, we examine failure cases, i.e., samples for which the IoU between the prediction of VSLNet and the ground truth is low, as shown in Figure 18. In the first case, illustrated in Figure 18(a), a person turns towards the lamp and places an item there; QGH falsely predicts this action as the beginning of the moment “turns off the light”. The second failure case involves multiple actions in a single query, as shown in Figure 18(b). QGH successfully highlights the correct region by capturing the temporal information of the two different action descriptions in the query. However, it assigns “pushes” a higher confidence score than “grabs”; thus, VSLNet only captures the region corresponding to the “pushes” action.
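For completeness, the temporal IoU and the signed moment length error used throughout the above analyses can be computed as in this small sketch (the function names are ours):

```python
def temporal_iou(pred_start, pred_end, gt_start, gt_end):
    """Temporal IoU between a predicted moment and the ground truth, in seconds."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

def length_error(pred_start, pred_end, gt_start, gt_end):
    """Signed length error: positive if the prediction is longer than the ground truth."""
    return (pred_end - pred_start) - (gt_end - gt_start)

# Example: prediction [12.0, 30.0] vs. ground truth [15.0, 28.0].
print(temporal_iou(12.0, 30.0, 15.0, 28.0))   # 13 / 18 ≈ 0.72
print(length_error(12.0, 30.0, 15.0, 28.0))   # 18 - 13 = 5 seconds longer
```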
(a) A failure case on the Charades-STA dataset. Language query: “The person turns off the light.”
(b) A failure case on the ActivityNet Captions dataset. Language query: “After, the man grabs the girl’s arm, then the girl pushes the man over the wall.”
Fig. 18. Two failure examples predicted by VSLNet; a_s/a_e denote the start/end boundaries of the ground-truth video moment, and ˆa_s/ˆa_e denote the start/end boundaries of the predicted target moment.

CONCLUSION
In this paper, we revisit the NLVL task and propose to solve it with a multimodal span-based QA framework by considering a video as a text passage. We show that adopting a standard span-based QA framework, VSLBase, can achieve promising results on the NLVL task. However, two major differences between video and text limit the performance of VSLBase under the standard span-based QA framework. To address these differences, we propose VSLNet, which introduces a simple and effective strategy named query-guided highlighting (QGH) on top of VSLBase. With QGH, VSLNet is guided to search for answers within a predicted coarse region. The effectiveness of VSLNet (and VSLBase) is demonstrated with experiments on three datasets. The results indicate that it is promising to explore the span-based QA framework to address NLVL problems. Moreover, we have observed that existing methods, including VSLNet, suffer from performance degradation on long videos. To address this issue, we further extend VSLNet by regarding a long video as a document with multiple paragraphs. We adopt the concept of MPQA and propose a multi-scale split-and-concat network that captures contextual information in a long video by partitioning the video multiple times with different clip lengths. Extensive experiments demonstrate that VSLNet-L effectively prevents performance degradation on long videos and advances the state of the art for NLVL on three benchmark datasets.

REFERENCES

[1] J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co-memory networks for video question answering,” in
Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition , pp. 6576–6585,2018.[2] Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao, “Activitynet-qa: A dataset for understanding complex web videos via question answer-ing,” in
Proceedings of the AAAI Conference on Artificial Intelligence ,vol. 33, 2019, pp. 9127–9134. [3] R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, and K. Saenko,“Are you looking? grounding to multiple modalities in vision-and-language navigation,” in
Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics . Florence, Italy: Associationfor Computational Linguistics, Jul. 2019, pp. 6551–6557.[4] H. Le, D. Sahoo, N. Chen, and S. Hoi, “Multimodal transformer networksfor end-to-end video-grounded dialogue systems,” in
Proceedings of the57th Annual Meeting of the Association for Computational Linguistics .Association for Computational Linguistics, 2019, pp. 5612–5623.[5] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. J. Corso, and J. Gao,“Unified vision-language pre-training for image captioning and vqa.” in
Proceedings of the AAAI Conference on Artificial Intelligence , 2020, pp.13 041–13 049.[6] J. Lei, L. Yu, T. Berg, and M. Bansal, “TVQA+: Spatio-temporal ground-ing for video question answering,” in
Proceedings of the 58th AnnualMeeting of the Association for Computational Linguistics . Online:Association for Computational Linguistics, Jul. 2020, pp. 8211–8225.[7] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, andB. Russell, “Localizing moments in video with temporal language,” in
Proceedings of the 2018 Conference on Empirical Methods in NaturalLanguage Processing . Association for Computational Linguistics, 2018,pp. 1380–1390.[8] M. Liu, X. Wang, L. Nie, X. He, B. Chen, and T.-S. Chua, “Attentivemoment retrieval in videos,” in
The 41st International ACM SIGIRConference on Research & Development in Information Retrieval , 2018,pp. 15–24.[9] R. Ge, J. Gao, K. Chen, and R. Nevatia, “Mac: Mining activity conceptsfor language-based temporal localization,” in
IEEE Winter Conferenceon Applications of Computer Vision , 2019, pp. 245–253.[10] H. Xu, K. He, B. A. Plummer, L. Sigal, S. Sclaroff, and K. Saenko,“Multilevel language and vision integration for text-to-clip retrieval,” in
Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33,2019, pp. 9062–9069.[11] D. Zhang, X. Dai, X. Wang, Y.-F. Wang, and L. S. Davis, “Man: Momentalignment network for natural language moment retrieval via iterativegraph adjustment,” in
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , 2019, pp. 1247–1257.[12] J. Chen, X. Chen, L. Ma, Z. Jie, and T.-S. Chua, “Temporally groundingnatural sentence in video,” in
Proceedings of the 2018 Conference onEmpirical Methods in Natural Language Processing , 2018, pp. 162–171.[13] Z. Zhang, Z. Lin, Z. Zhao, and Z. Xiao, “Cross-modal interactionnetworks for query-based moment retrieval in videos,” in
Proceedingsof the 42nd International ACM SIGIR Conference on Research andDevelopment in Information Retrieval , ser. SIGIR’19. New York, NY,USA: Association for Computing Machinery, 2019, p. 655–664.[14] J. Wang, L. Ma, and W. Jiang, “Temporally grounding language queries
in videos by contextual boundary-aware prediction,” in
Proceedings ofthe AAAI Conference on Artificial Intelligence , 2020, pp. 12 168–12 175.[15] Y. Yuan, T. Mei, and W. Zhu, “To find where you talk: Temporalsentence localization in video with attention based location regression,”in
Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33,2019, pp. 9159–9166.[16] C. Lu, L. Chen, C. Tan, X. Li, and J. Xiao, “DEBUG: A dense bottom-up grounding approach for natural language video localization,” in
Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference onNatural Language Processing , 2019, pp. 5147–5156.[17] L. Chen, C. Lu, S. Tang, J. Xiao, D. Zhang, C. Tan, and X. Li, “Re-thinking the bottom-up framework for query-based video localization,”in
Proceedings of the AAAI Conference on Artificial Intelligence , 2020.[18] W. Wang, Y. Huang, and L. Wang, “Language-driven temporal activitylocalization: A semantic matching reinforcement learning model,” in
Proceedings of the IEEE Conference on Computer Vision and PatternRecognition , 2019, pp. 334–343.[19] D. He, X. Zhao, J. Huang, F. Li, X. Liu, and S. Wen, “Read, watch,and move: Reinforcement learning for temporally grounding naturallanguage descriptions in videos,” in
Proceedings of the AAAI Conferenceon Artificial Intelligence , vol. 33, 2019, pp. 8393–8400.[20] J. Wu, G. Li, S. Liu, and L. Lin, “Tree-structured policy based progressivereinforcement learning for temporally language grounding in video,” in
Proceedings of the AAAI Conference on Artificial Intelligence , 2020, pp.12 386–12 393.[21] A. W. Yu, D. Dohan, Q. Le, T. Luong, R. Zhao, and K. Chen, “Fastand accurate reading comprehension by combining self-attention andconvolution,” in
International Conference on Learning Representations ,2018.[22] C. Clark and M. Gardner, “Simple and effective multi-paragraph readingcomprehension,” in
Proceedings of the 56th Annual Meeting of theAssociation for Computational Linguistics (Volume 1: Long Papers) .Melbourne, Australia: Association for Computational Linguistics, Jul.2018, pp. 845–855.[23] H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizingnetwork for natural language video localization,” in
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 6543–6554. [24] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. C. Russell, “Localizing moments in video with natural language,” in IEEE International Conference on Computer Vision, 2017, pp. 5804–5813. [25] J. Gao, C. Sun, Z. Yang, and R. Nevatia, “Tall: Temporal activity localization via language query,” in
IEEE International Conference onComputer Vision , 2017, pp. 5277–5285.[26] B. Liu, S. Yeung, E. Chou, D.-A. Huang, L. Fei-Fei, and J. C. Niebles,“Temporal modular networks for retrieving complex compositional ac-tivities in videos,” in
The European Conference on Computer Vision(ECCV) , 2018.[27] M. Liu, X. Wang, L. Nie, Q. Tian, B. Chen, and T.-S. Chua, “Cross-modal moment localization in videos,” in
Proceedings of the 26th ACMInternational Conference on Multimedia . Association for ComputingMachinery, 2018, pp. 843–851.[28] S. Zhang, J. Su, and J. Luo, “Exploiting temporal relationships in videomoment localization with natural language,” in
Proceedings of the 27thACM International Conference on Multimedia , ser. MM ’19. New York,NY, USA: Association for Computing Machinery, 2019, p. 1230–1238.[29] S. Chen and Y.-G. Jiang, “Semantic proposal for activity localization invideos via sentence query,” in
Proceedings of the AAAI Conference onArtificial Intelligence , vol. 33, 2019, pp. 8199–8206.[30] D. Liu, X. Qu, X. Liu, J. Dong, P. Zhou, and Z. Xu, “Jointly cross-and self-modal graph attention network for query-based moment local-ization,” in
Proceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. Association for Computing Machinery, 2020. [31] X. Qu, P. Tang, Z. Zhou, Y. Cheng, J. Dong, and P. Zhou, “Fine-grained iterative attention network for temporal language localization in videos,” in
Proceedings of the 28th ACM International Conference on Multimedia ,ser. MM ’20. Association for Computing Machinery, 2020.[32] Y. Yuan, L. Ma, J. Wang, W. Liu, and W. Zhu, “Semantic conditioneddynamic modulation for temporal sentence grounding in videos,” in
Advances in Neural Information Processing Systems 32 , 2019, pp. 536–546.[33] R. Zeng, H. Xu, W. Huang, P. Chen, M. Tan, and C. Gan, “Dense regres-sion network for video grounding,” in
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2020, pp.10 287–10 296.[34] J. Mun, M. Cho, and B. Han, “Local-global video-text interactions fortemporal grounding,” in
Proceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition , 2020, pp. 10 810–10 819.[35] J. Chen, L. Ma, X. Chen, Z. Jie, and J. Luo, “Localizing naturallanguage in videos,” in
Proceedings of the AAAI Conference on ArtificialIntelligence , vol. 33, 2019, pp. 8175–8182.[36] C. Rodriguez, E. Marrese-Taylor, F. S. Saleh, H. Li, and S. Gould,“Proposal-free temporal moment localization of a natural-language queryin video using guided attention,” in
The IEEE Winter Conference onApplications of Computer Vision (WACV) , March 2020.[37] S. Ghosh, A. Agarwal, Z. Parekh, and A. Hauptmann, “ExCL: ExtractiveClip Localization Using Natural Language Descriptions,” in
Proceedingsof the 2019 Conference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Language Technologies .Association for Computational Linguistics, 2019, pp. 1984–1990.[38] V. Escorcia, M. Soldan, J. Sivic, B. Ghanem, and B. Russell, “Temporallocalization of moments in video collections with natural language,”
ArXiv , vol. abs/1907.12763, 2019.[39] D. Shao, Y. Xiong, Y. Zhao, Q. Huang, Y. Qiao, and D. Lin, “Find andfocus: Retrieve and localize video events with natural language queries,”in
The European Conference on Computer Vision (ECCV) , September2018.[40] N. C. Mithun, S. Paul, and A. K. Roy-Chowdhury, “Weakly supervisedvideo moment retrieval from text queries,” in
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , June 2019, pp.11 592–11 601.[41] Z. Lin, Z. Zhao, Z. Zhang, Q. Wang, and H. Liu, “Weakly-supervisedvideo moment retrieval via semantic completion network,” in
Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 11 539–11 546. [42] S. Zhang, H. Peng, J. Fu, and J. Luo, “Learning 2D temporal adjacent networks for moment localization with natural language,” in
Proceedingsof the AAAI Conference on Artificial Intelligence , 2020.[43] S. Chen, W. Jiang, W. Liu, and Y.-G. Jiang, “Learning modality interac-tion for temporal sentence localization and event captioning in videos,”in
The European Conference on Computer Vision (ECCV) , 2020.[44] S. Wang and J. Jiang, “Machine comprehension using match-lstm and an-swer pointer,” in
International Conference on Learning Representations ,2017.[45] S. H. Wang and J. Jiang, “Learning natural language inference withLSTM,” in
Proceedings of the 2016 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human Lan-guage Technologies . Association for Computational Linguistics, 2016,pp. 1442–1451.[46] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in
Advancesin Neural Information Processing Systems . Curran Associates, Inc.,2015, pp. 2692–2700.[47] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectionalattention flow for machine comprehension,” in
International Conferenceon Learning Representations , 2017.[48] C. Xiong, V. Zhong, and R. Socher, “Dynamic coattention networks forquestion answering,” in
International Conference on Learning Represen-tations , 2017.[49] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching networks for reading comprehension and question answering,”in
Proceedings of the 55th Annual Meeting of the Association forComputational Linguistics . Association for Computational Linguistics,2017, pp. 189–198.[50] H.-Y. Huang, C. Zhu, Y. Shen, and W. Chen, “Fusionnet: Fusing viafully-aware attention with application to machine comprehension,” in
International Conference on Learning Representations , 2018.[51] Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang,“Multi-passage machine reading comprehension with cross-passage an-swer verification,” in
Proceedings of the 56th Annual Meeting of theAssociation for Computational Linguistics (Volume 1: Long Papers) .Melbourne, Australia: Association for Computational Linguistics, Jul.2018, pp. 1918–1927.[52] Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang, “Multi-passageBERT: A globally normalized BERT model for open-domain questionanswering,” in
Proceedings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the 9th International JointConference on Natural Language Processing (EMNLP-IJCNLP) . HongKong, China: Association for Computational Linguistics, Nov. 2019, pp.5878–5882.
[53] D. Lin, J. Tang, K. Pang, S. Li, and T. Wang, “Selecting paragraphs to answer questions for multi-passage machine reading comprehension,” in
Information Retrieval . Cham: Springer International Publishing, 2019,pp. 121–132.[54] L. Pang, Y. Lan, J. Guo, J. Xu, L. Su, and X. Cheng, “Has-qa:Hierarchical answer spans model for open-domain question answering,”in
Proceedings of the AAAI Conference on Artificial Intelligence , 2019,pp. 6875–6882.[55] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+questions for machine comprehension of text,” in
Proceedings of the2016 Conference on Empirical Methods in Natural Language Processing .Association for Computational Linguistics, 2016, pp. 2383–2392.[56] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a newmodel and the kinetics dataset,” in
IEEE Conference on Computer Visionand Pattern Recognition , 2017, pp. 4724–4733.[57] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,L. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Advances inNeural Information Processing Systems . Curran Associates, Inc., 2017,pp. 5998–6008.[58] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,”
ArXiv ,vol. abs/1607.06450, 2016.[59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in
Proceedings of the IEEE conference on computer visionand pattern recognition , 2016, pp. 770–778.[60] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation byjointly learning to align and translate,” in
International Conference on Learning Representations, 2015. [61] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-captioning events in videos,” in IEEE International Conference on Computer Vision, 2017, pp. 706–715. [62] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal, “Grounding action descriptions in videos,”
Transactions ofthe Association for Computational Linguistics , vol. 1, pp. 25–36, 2013.[63] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, andA. Gupta, “Hollywood in homes: Crowdsourcing data collection foractivity understanding,” in
European Conference on Computer Vision(ECCV) . Springer International Publishing, 2016, pp. 510–526.[64] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “Activitynet:A large-scale video benchmark for human activity understanding,” in
IEEE Conference on Computer Vision and Pattern Recognition , 2015,pp. 961–970.[65] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, andB. Schiele, “Script data for attribute-based recognition of compositeactivities,” in
The European Conference on Computer Vision (ECCV) ,2012.[66] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectorsfor word representation,” in
Proceedings of the 2014 Conference onEmpirical Methods in Natural Language Processing (EMNLP) , 2014,pp. 1532–1543.[67] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”in
International Conference on Learning Representations , 2015.[68] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-dinov, “Dropout: A simple way to prevent neural networks from overfit-ting,”