A Straightforward Framework For Video Retrieval Using CLIP
Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín
School of Engineering and Sciences, Tecnologico de Monterrey
Av. Eugenio Garza Sada 2501, Monterrey, NL 64849, Mexico
[email protected], {jcobayliss, terashima}@tec.mx

Abstract.
Video Retrieval is a challenging task in which a text query is matched to a video or vice versa. Most of the existing approaches for addressing this problem rely on annotations made by users. Although simple, this approach is not always feasible in practice. In this work, we explore the application of the language-image model CLIP to obtain video representations without the need for such annotations. This model was explicitly trained to learn a common space where images and text can be compared. Using various techniques described in this document, we extend its application to videos, obtaining state-of-the-art results on the MSR-VTT and MSVD benchmarks.

Code available at: https://github.com/Deferf/CLIP_Video_Representation

1 Introduction

Video is one of the most consumed forms of media available on the internet. The high consumption of this type of media requires suitable methods for finding videos that contain one or more features desired by the users. Most video browsers rely on annotations made by users to identify video contents. Although this solution is simple to implement, it comes at a high price: relying on annotations to perform a query on videos requires an extensive description of the videos' content and context. Unfortunately, this information may not be made available. Thus, a video retrieval system that can handle users' queries without the need for such annotations represents a relevant topic of study.

This document describes a video retrieval model which, as its name implies, can retrieve the videos from a collection that are best described by a particular text query; this is the text-to-video retrieval (TVR) task. For example, "A woman is running" should return videos that contain women running. Given that the architecture estimates the similarity between video and text, it can also be used to perform the video-to-text retrieval (VTR) task, which consists of returning the captions that best describe the query video from a set of description candidates. In either task, the goal is, given a query and a set of video-text pairs, to return the rank at which the corresponding opposite modality is positioned.

The TVR and VTR tasks can be seen as a method by which video and text contents are funnelled into fixed-length representations using embedding functions. Since both projections fall in the same dimensional space, a similarity score can be computed and used to rank elements from a set of candidates. Given that the similarity metrics between text-video and video-text are equal, TVR and VTR are considered inverse operations; they only differ in the modality of the input prompt.

Some works focus extensively on the video representation by adding pre-trained models considered "experts". Each "expert" attends to specific video contents such as sound, faces, or motion, and the information from all the experts is multiplexed by a complex gating mechanism [5,7]. Instead of starting from an elaborate video representation to train a common visual-text space, we propose to use a learned visual-text space to build a video representation. Similarly to Mithun et al. [12], our approach consists of using pre-trained models that measure the similarity between image and text, and we extend this idea to handle videos, experimenting with several aggregation methods to deal with the extra temporal dimension.

In this work, we chose CLIP as the base image-text model.
CLIP is a state-of-the-art Neural Network pre-trained on image-text pairs [14]. CLIP has proved that similarity learning can be used to train a visual encoder for downstream tasks such as classification, captioning, and clustering, to name a few. We harness the power of its visual representations to create a video representation that can be used directly with its original text encoder to bootstrap a Neural Network model for Video Retrieval. Since our work focuses on aggregation strategies over image features, our method is evaluated in a Zero-Shot fashion on the evaluation datasets; hence, no parameter finetuning is performed to improve retrieval results.

The remainder of this document is organized as follows. In Section 2 we provide the foundations of this investigation and an overview of the most relevant related works. Section 3 describes the experiments conducted, their main results and their discussion. Finally, in Section 4 we present the conclusion and some ideas that may be worth exploring as part of the future work.
2 Related Work

The work presented in this document is related to strategies used to construct a video encoder for video retrieval. It is straightforward to think that image features can serve as a proxy for video representations. In fact, Karpathy et al. [6] observed that a Convolutional Neural Network (CNN) feature from a single frame could be discriminative enough for video classification, achieving just 1.3 fewer percentage points than the top-accuracy model from the same work, which in turn included more visual and temporal information.
Mithun et al. [12] proved that it was possible to supersede the state-of-the-art video retrieval model by averaging the visual features obtained from an image-text model. This practice has been implemented in novel models, along with more elaborate video representations. For instance, the state-of-the-art in video retrieval has been pushed by models that implement a Mixture-of-Experts (MoE) paradigm [5,7,10,13]. The MoE approach proposes a complex video representation by multiplexing the outputs of several pre-trained models (known as "experts") that attend to particular aspects of video such as motion, face detection, and character recognition, among others. In this regard, we are aware that up to seven experts have been included in a Video Retrieval model [5]. Nonetheless, the current state-of-the-art implements a mixture of only two experts, indicating that video-text representations may dispense with the added complexity that multiple experts convey [13]. Patrick et al. [13] propose that the contrastive training used by most video retrieval systems encourages repulsive forces between independent, but similar, examples. To alleviate this, they use a support set containing positive examples for each data point in a training batch, so the common video-text space must learn concept sharing. Nonetheless, contrastive training has proved successful in image and video representation learning [2,9].

Contrastive training is a regime in which a model is induced to pull similar data points together and push dissimilar ones apart in a latent space. It is the foundational mechanism of Contrastive Language-Image Pretraining (CLIP), the model used in this work. As its name states, the model is pre-trained on 400 million image-text pairs collected from the Internet. As a siamese neural network, it is composed of an image encoder (ViT-B/32) and a text encoder (a transformer) that funnel information into a common space where objects can be compared using cosine similarity [14].
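As an illustration of this mechanism, the following minimal sketch (assuming the openai CLIP Python package and PyTorch, with a hypothetical image file frame.jpg and toy captions that are not from the paper's datasets) compares one image against a set of candidate captions with the ViT-B/32 encoders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the pre-trained ViT-B/32 image encoder and transformer text encoder.
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical inputs for illustration only.
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)
captions = ["a woman is running", "a man is cooking", "a dog plays fetch"]
tokens = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # shape: (1, 512)
    text_features = model.encode_text(tokens)    # shape: (3, 512)

# Cosine similarity: normalize both embeddings, then take the dot product.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T    # shape: (1, 3)

best = similarity.argmax(dim=-1).item()
print(f"Best caption: {captions[best]}")
```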
3 Experiments

This section provides a mathematical description of CLIP and of how we can use it for TVR or VTR. We then describe the datasets and metrics considered for this work. Finally, we detail the experiments and their main results, followed by a brief discussion of the most relevant findings.
By using CLIP, we obtain the pre-trained functions ω(u) = w_u and φ(t) = c_t, which encode an image u and a text t into w_u, c_t ∈ R^d, where d = 512. Assume a video v is composed of s sampled frames, such that v = {u_1, u_2, ..., u_s}. Consequently, we can stack the embedding of each frame into a matrix W ∈ R^(d×s), so that W = [ω(u_1), ω(u_2), ..., ω(u_s)] = [w_1, w_2, ..., w_s]. Therefore, the problem we try to solve is to find an aggregation function Λ that maps the input W ∈ R^(d×s) into a video representation c_v ∈ R^d. Then, with the video and text representations c_v and c_t, we can compute the cosine similarity (Equation 1), which is useful for ranking the video-text pairs inside a dataset given a query of a specific modality.

sim(a, b) = (a^T b) / (‖a‖ ‖b‖)    (1)

The proposed framework assumes a set C of video-caption pairs of the form {{(v_i, t_ij)}_{i=1}^{n}}_{j=1}^{m(v_i)}, where the number of captions per video may be non-uniform; hence m is a function of v. By design, some datasets are split into sections used for training and validation of results. For the preliminary experiments, we use the training splits to test our hypothesis, but final results are reported on the test splits of the respective datasets. The datasets involved in this work are listed below; a code sketch of the resulting retrieval pipeline follows the dataset descriptions.

MSR-VTT is a dataset composed of 10,000 videos, each with a length that ranges from 10 to 32 seconds, and 200,000 captions. The training, validation and test splits are composed of 6,513, 497 and 2,990 videos, respectively, with 20 corresponding descriptions each [18]. The test set has been used in different ways in the literature; we will refer to two common variations as Full [7] (containing all 2,990 videos in the MSR-VTT test set) and 1k-A [19] (containing only 1,000 of the 2,990 videos in the MSR-VTT test set).
MSVD contains 1,970 videos, each with a length that ranges from 1 to 62 seconds. The training, validation and test splits contain 1,200, 100 and 670 videos, respectively [1]. Each video has approximately 40 associated sentences in English.
LSMDC comprises 118,081 videos, each with a length that ranges from 2 to 30 seconds. The videos were extracted from 202 movies. The validation set contains 7,408 videos, and the test set contains 1,000 videos from movies independent of the training and validation splits [15].

All the frames were sampled from each video in the previously mentioned datasets to extract the frame features. Other datasets related to this work, but which cannot be used here, include WIT (WebImageText) [14] and HT100M [11]. WIT is composed of the 400 million image-text pairs on which CLIP was trained; since it is an image-text dataset, it cannot be used as a benchmark for video retrieval. HT100M is a dataset of 100 million video-text pairs, used only as a pre-training set by other Video Retrieval works [5,11,13,16].
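The pipeline described above, in which per-frame CLIP features are aggregated into a single video embedding and compared against caption embeddings with cosine similarity, can be sketched as follows. This is only a minimal illustration, not the authors' released implementation; it assumes the openai CLIP package, OpenCV for frame decoding, and hypothetical file names and captions.

```python
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_embedding(path, stride=1):
    """Encode every `stride`-th frame with CLIP and average them (the aggregation Λ_avg)."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(preprocess(Image.fromarray(rgb)))
        idx += 1
    cap.release()
    batch = torch.stack(frames).to(device)      # (s, 3, 224, 224)
    with torch.no_grad():
        w = model.encode_image(batch)           # frame features, one row per frame (the paper's W transposed)
    c_v = w.mean(dim=0)                         # Λ_avg: average over frames
    return c_v / c_v.norm()                     # unit-normalize for cosine similarity

def caption_embeddings(captions):
    with torch.no_grad():
        c_t = model.encode_text(clip.tokenize(captions).to(device))
    return c_t / c_t.norm(dim=-1, keepdim=True)

# Hypothetical toy collection (stand-ins, not MSR-VTT files).
video_paths = ["video0.mp4", "video1.mp4"]
captions = ["a woman is running", "a man plays the guitar"]

V = torch.stack([video_embedding(p) for p in video_paths])   # (n_videos, d)
T = caption_embeddings(captions)                              # (n_captions, d)

# Text-to-Video Retrieval: for each caption, rank videos by cosine similarity.
sims = T @ V.T                                                # (n_captions, n_videos)
ranking = sims.argsort(dim=-1, descending=True)
print(ranking)
```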
To conduct our experiments, we follow the testing methodologies used in previous works [5,7] and report standard retrieval metrics. For median rank (MdR), mean rank (MnR) and standard deviation of rank (StdR), the lower the value, the better the performance. In the case of recall at rank k (R@k, where k = {1, 5, 10}), the higher the value, the better the performance. For datasets that involve multiple sentences per video (such as the Full split of MSR-VTT and the MSVD test set), we follow the protocol used by Liu et al. [7] and use the minimum rank among all sentences associated with a given video query.

In the exploratory experiments, we empirically define two candidates for the frame-level aggregation function Λ. We conduct this set of preliminary experiments on a validation sample comprised of 1,000 video-text pairs from MSR-VTT. The first frame-level aggregation function is based on the idea that it is feasible to obtain reasonable video representations by considering only one frame sample [6]. Given the feature matrix W ∈ R^(d×s), we define Λ_s(W) = w_30 ∈ R^d as a function that returns the features of the 30th frame. Since these videos contain approximately 30 frames per second, this is equivalent to sampling a frame from the first second of the video.

A second candidate for an aggregation function is proposed by Mithun et al. [12], who suggest that the average of frame-level features can be used as an approximation of the video representation. This method has been used extensively in other retrieval works [5,7,9,11,13]. Consequently, we define Λ_avg(W) = W̄ ∈ R^d, where W̄ is the average of the columns of W.

Given that videos present dynamic events in which different sequences of frames can represent completely different things, we also used k-means as an aggregation method [17]. With this implementation, the aggregation function takes the form Λ_k(W) ∈ R^(d×k), returning k video embeddings (the cluster centroids). For evaluation purposes, we repeat the ranking procedure with the k independent video representations and register each query's minimum rank before calculating the retrieval metrics. A sketch of this aggregation is shown below.

Based on the results in Table 1, the average-based method obtains the best results in terms of the metrics used. It is noticeable that, among the k-means methods, there is no significant difference between the results. This may be because MSR-VTT videos do not exceed 32 seconds in length, which may not be enough to differentiate the centroids when creating the clusters. Regarding the aggregation method, we appeal to Occam's Razor and select Λ_avg for further experiments, since it achieves a performance similar to that of the k-means-based aggregation methods but with a lower computational complexity.

Table 1. Text-to-Video Retrieval results on the MSR-VTT validation set using different aggregation functions (rows: Λ_s, Λ_avg, and Λ_k for several values of k; columns: R@1, R@5, R@10, MdR, MnR, StdR).
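The following sketch illustrates the Λ_k alternative. It is only an illustration under two assumptions not stated explicitly in the paper: scikit-learn's KMeans is used for clustering, and the minimum-rank protocol is read as one ranking pass per centroid.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_video_embeddings(frame_features: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Λ_k: cluster the s frame features (shape (s, d)) into k centroids,
    yielding k alternative video embeddings for the same clip."""
    X = frame_features.float().cpu().numpy()                              # scikit-learn expects NumPy on CPU
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_     # (k, d)
    C = torch.from_numpy(centers).float()
    return C / C.norm(dim=-1, keepdim=True)                               # unit-normalize, as for Λ_avg

# Evaluation with Λ_k (one possible reading of the protocol): repeat the ranking
# procedure once per centroid and keep, for every query, the minimum rank obtained.
def min_rank_over_centroids(per_centroid_ranks: torch.Tensor) -> torch.Tensor:
    """per_centroid_ranks: (k, n_queries) ranks from k independent ranking passes."""
    return per_centroid_ranks.min(dim=0).values
```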
This section compares our video retrieval model against the state-of-the-art on the MSR-VTT, MSVD and LSMDC datasets. In all cases, we evaluate both the TVR and VTR tasks.

In MSR-VTT, we are able to supersede the R@1 score of the previous best model, SSB [13], on the 1k-A split for the TVR task; however, we are positioned behind previous works on the other recall metrics (Table 2). Besides, we consistently achieve state-of-the-art results on all the recall metrics in the Full split of MSR-VTT. In the MSVD dataset, we obtain state-of-the-art results on most of the retrieval metrics (Table 3). We suppose that MoE-based models such as SSB [13] and CE [7] cannot use all of their implemented experts because the videos in MSVD lack audio information, so they have to rely only on visual features. In LSMDC, we do not obtain state-of-the-art results, but we are positioned second-best (Table 4). Given that video descriptions in this dataset do not follow the form of a typical sentence, as they are designed to teach a model to recognize characters and interactions between movie scenes, we highlight the robustness of CLIP's text encoder, which could adapt to a new sentence schema.
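For completeness, the retrieval metrics reported in the tables below can be derived from the rank assigned to each query's ground-truth item. The following sketch (an illustration assuming a square query-candidate similarity matrix, not the authors' evaluation script) computes R@k, MdR, MnR and StdR, together with the minimum-rank reduction used for multi-caption videos:

```python
import torch

def ranks_from_similarity(sims: torch.Tensor) -> torch.Tensor:
    """sims[i, j] = similarity of query i to candidate j; ground truth is candidate i.
    Returns the 1-based rank of the ground-truth candidate for every query."""
    order = sims.argsort(dim=-1, descending=True)
    target = torch.arange(sims.shape[0]).unsqueeze(-1)
    return (order == target).float().argmax(dim=-1) + 1

def retrieval_metrics(ranks: torch.Tensor) -> dict:
    r = ranks.float()
    return {
        "R@1": (r <= 1).float().mean().item() * 100,
        "R@5": (r <= 5).float().mean().item() * 100,
        "R@10": (r <= 10).float().mean().item() * 100,
        "MdR": r.median().item(),
        "MnR": r.mean().item(),
        "StdR": r.std().item(),
    }

# For videos with several associated captions (e.g., the Full split), keep the
# minimum rank over all captions describing the same video before aggregating.
def min_rank_per_video(ranks: torch.Tensor, video_ids: torch.Tensor) -> torch.Tensor:
    return torch.stack([ranks[video_ids == v].min() for v in video_ids.unique()])
```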
Although we obtain outstanding results on different metrics and datasets, there are some things worth discussing. For example, our original supposition was that the ranking worsens as the video gets longer. To confirm or reject this idea, we produced Figure 1, which depicts the video length in seconds (x-axis) and the rank assigned to it (y-axis). As a video gets longer, we expected it to become more difficult for the video representation to capture the temporal elements, and hence to be ranked worse. However, the experiment conducted on the 1k-A set from MSR-VTT shows that the ranking varies wildly, independently of the video length (at least for the lengths present in the dataset).

When we looked at the worst-ranked video-text pairs, we noticed that several sentences incorporated phrases like "a family is having a conversation" or "a man talking about a woman", hinting that sentences that mainly describe audio content are ranked worse. This conclusion is reinforced by the fact that our model scored best on MSVD, a dataset that by design does not contain any audio track and whose text descriptions are based on what can be visualized.
Table 2. TVR and VTR results in the MSR-VTT dataset. M, H and W denote training on MSR-VTT, HT100M and WIT, respectively.

                                          TVR                       VTR
Method         Training  Test Set   R@1   R@5   R@10  MdR     R@1   R@5   R@10  MdR
JSFusion [19]  M         1k-A       10.2  31.2  43.2  13      -     -     -     -
HT100M [11]    H+M       1k-A       14.9  40.2  52.8  9       16.8  41.7  55.1  8
CE [7]         M         1k-A       20.9  48.8  62.4  6       20.6  50.3  64.0  5.3
AVLnet [16]    H+M       1k-A       27.1  55.6  66.6  4       28.5  54.6  65.2  4
MMT [5]        H+M       1k-A       26.6  57.1
CLIP           W         1k-A
CE [7]         M         Full       10.0  29.0  42.2  16      15.6  40.9  55.2  8.3
CLIP           W         Full
Table 3. TVR and VTR results in the MSVD dataset. D, H and W denote training on MSVD, HT100M and WIT, respectively.

                                              TVR                       VTR
Method                        Training   R@1   R@5   R@10  MdR     R@1   R@5   R@10  MdR
VSE [12]                      D          12.3  30.1  42.3  14      34.7  59.9  70.0  3
VSE++ [12]                    D          15.4  39.6  53.0  9       -     -     -     -
Multi Cues [12]               D          20.3  47.8  61.1  6       -     -     -     -
CE [7]                        D          19.8  49.0  63.8  6       -     -     -     -
Support-set Bottleneck [13]   H+D        28.4  60.0  72.9  4       -     -     -     -
CLIP                          W          37.0  64.1  73.8  3       59.9  85.2  90.7  1
Table 4. TVR and VTR results in the LSMDC dataset. L, H and W denote training on LSMDC, HT100M and WIT, respectively.

                                    TVR                        VTR
Method         Training   R@1   R@5   R@10  MdR      R@1   R@5   R@10  MdR
JSFusion [19]  L          9.1   21.2  34.1  36
CE [7]         L          11.2  26.9  34.8  25.3     -     -     -     -
MMT [5]        H+L                                   -     -     -     -
CLIP           W          11.3  22.7  29.2  56.5     6.8   16.4  22.1  73
Fig. 1. Scatter plot of video length and assigned rank on the TVR task on the 1k-A test split. The red line represents the median rank.
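A plot of this kind can be reproduced from per-query ranks and video durations with a short script such as the following (a sketch using matplotlib and randomly generated stand-in data, not the paper's actual results):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical inputs: duration in seconds and the rank assigned to each query video.
durations = np.random.uniform(10, 32, size=1000)   # stand-in for 1k-A video lengths
ranks = np.random.randint(1, 1001, size=1000)       # stand-in for TVR ranks

plt.scatter(durations, ranks, s=8, alpha=0.5)
plt.axhline(np.median(ranks), color="red", label="median rank")
plt.xlabel("video length (s)")
plt.ylabel("assigned rank")
plt.legend()
plt.savefig("rank_vs_length.png", dpi=150)
```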
4 Conclusion and Future Work

This work presents the first implementation of CLIP for obtaining video features. Our method works by leveraging its learned common image-text space without the need for parameter finetuning (Zero-Shot). We apply an aggregation function to frame-level features, as is common in other video retrieval works. Our work focuses only on the visual and text modalities, yet it supersedes methods that implement a complex mixture of pre-trained models, obtaining state-of-the-art results on the MSVD and MSR-VTT datasets.

One potential application of this CLIP-derived implementation is to retrieve specific moments inside videos. Also, it remains to be seen how our video representation behaves when tested as a video classifier. This methodology might also be extended to create a CLIP-based representation for longer videos. For example, other works have used frame features to construct a graph that can change through time [8]; such a representation could keep the strong text alignment suitable for video retrieval. Finally, our work could be used as an expert in a future MoE video retrieval system.
Acknowledgments
This research was partially supported by the ITESM Research Group with Strategic Focus on Intelligent Systems.
References
1. Chen, D., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 190–200 (2011)
2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
3. Dong, J., Li, X., Snoek, C.G.: Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia (12), 3377–3388 (2018)
4. Dong, J., Li, X., Xu, C., Ji, S., He, Y., Yang, G., Wang, X.: Dual encoding for zero-example video retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9346–9355 (2019)
5. Gabeur, V., Sun, C., Alahari, K., Schmid, C.: Multi-modal transformer for video retrieval. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 214–229. Springer International Publishing, Cham (2020)
6. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1725–1732 (2014)
7. Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: Video retrieval using representations from collaborative experts (2020)
8. Mao, F., Wu, X., Xue, H., Zhang, R.: Hierarchical video frame sequence representation with deep convolutional graph network. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)
9. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9879–9889 (2020)
10. Miech, A., Laptev, I., Sivic, J.: Learning a text-video embedding from incomplete and heterogeneous data. arXiv:1804.02516 (2018)
11. Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2630–2640 (2019)
12. Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. pp. 19–27 (2018)
13. Patrick, M., Huang, P.Y., Asano, Y., Metze, F., Hauptmann, A.G., Henriques, J.F., Vedaldi, A.: Support-set bottlenecks for video-text representation learning. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=EqoXe2zmhrh