Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng∗, Yuejie Zhang∗
Academy for Engineering and Technology, Fudan University, China
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China
{chengy18, rzwang18, zhpan18, fengrui, yjzhang}@fudan.edu.cn
ABSTRACT
When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice of lip motion, the music of playing instruments. There is an underlying correlation between audio and visual events, which can be utilized as free supervised information to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. Specifically, we explore three different co-attention modules to focus on discriminative visual regions correlated to the sounds and introduce the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model to two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model provides competitive results with other self-supervised methods, and also indicate that our approach can tackle the challenging scenes which contain multiple sound sources.
CCS CONCEPTS
• Information systems → Multimedia information systems; • Computing methodologies → Computer vision.
KEYWORDS
Self-Supervised Learning; Representation Learning; Co-Attention Network; Audio-Visual Synchronization
ACM Reference Format:
Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, Yuejie Zhang. 2020. Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning. In
Proceedings of the 28th ACM International Conference on Multimedia (MM '20), October 12–16, 2020, Seattle, WA, USA.
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413869
∗ indicates corresponding authors.
Figure 1: An illustration of different audio-visual pretext tasks, i.e., Audio-Visual Correspondence (AVC) and Audio-Visual Synchronization (AVS). The examples in the green box are positive pairs in the pretext tasks (both AVC and AVS), which are extracted from the same time of one video. The examples in the red box are negative pairs in the AVC task, which are extracted from different videos. The examples in the yellow box are negative pairs in the AVS task, which are extracted from different slices of one video.
In many tasks of video analysis, applying supervised learning to train a neural network is very expensive. Annotators need to watch entire videos and manually add labels to each video (even each frame in videos). Only a small amount of video data can be used for training via this learning method. However, vast numbers of images/videos are uploaded to social websites every day. It is such a waste if we cannot fully exploit them.

Self-supervised learning is a framework which aims to learn generic representations without human labeling. The supervised signals come from the data itself, e.g., audio-visual co-occurrences. Audio and visual events tend to occur together in videos. For example, when a person is chopping wood, we can hear the clash between the axe and the wood. The audio-visual co-occurrences provide free supervised signals, which can be utilized for co-training a joint neural network and learning cross-modal representations. Compared with fully-supervised methods, self-supervised learning provides the opportunity for exploiting large-scale unlabelled data.

Recently, some efforts have been made towards learning audio and visual semantic representations in a self-supervised way. These methods (also called pretext tasks) can be divided into two categories, i.e., Audio-Visual Correspondence (AVC) and Audio-Visual Synchronization (AVS). The VGG group at Oxford University [3, 4] proposed the AVC task, i.e., predicting whether visual frames and audio clips are sampled from the same video. As illustrated in Figure 1, the main difference between the AVC and AVS tasks lies in the negative pairs. The objective of the AVS task is to detect the misalignments between audio and visual streams. Most previous researchers [10, 11, 15, 30] focused on speech tasks, and were thus limited to the domain of face scenes. Korbar et al. [26] and Owens and Efros [31] proposed to use more general data to learn semantic representations, which could be leveraged in practical applications. However, previous approaches overlook the information exchange between modalities, and they are pervasively limited by the heterogeneous complexity of audio-visual scenes, i.e., multiple sound sources.

In this paper, we focus on how to learn generic cross-modal representations in complex scenes efficiently. We note an interesting phenomenon: some nearsighted people find it difficult to hear clearly after taking off their glasses. This can be explained by the McGurk effect [29], which demonstrates an interaction between hearing and vision in humans. Inspired by this phenomenon, we propose a self-supervised framework with co-attention mechanism to provide information exchange between audio and visual streams. To jointly attend to distinct sound sources and their visual regions, we also apply the multi-head based structure [41] to cross-modal attention. After training on the pretext task, we employ the pre-trained model as a base for two audio-visual downstream tasks, i.e., sound source localization and action recognition. Extensive experiments indicate that our model achieves superior performance and can tackle complicated scenes, while having fewer parameters compared with previous approaches.

The contributions of this paper can be summarized as follows:
• We propose to introduce interactions between audio and visual streams for self-supervised learning, learn cross-modal representations, and eventually benefit downstream tasks.
• We propose a co-attention framework to exploit the co-occurrences between audio and visual events, in which the multi-head based structure can effectively associate the discriminative visual regions with the sound sources.
• Extensive experiments indicate that our method has good generalizability and can tackle the challenging scenes which contain multiple sound sources.
We first concisely review the works of representation learning for audio and visual modalities, and then discuss the relevant progress in the area of audio-visual applications.
In recent years, there have been some works that focus on audio-visual representation learning. These approaches can be divided into three classes according to the source of supervision signals, i.e., vision, audio, and both of them. The early works [5, 6, 32] argued that the audio and visual representations of the same instance had similar semantic information. Thus they investigated how to transfer supervision between different modalities using unlabeled videos as a bridge. Such a "teacher-student" training procedure leverages the discriminative knowledge from a well-trained model of one modality to supervise another modality. Aytar et al. [5] used visual information as supervision for acoustic scene/object classification, while Owens et al. [32] exploited ambient sound to supervise the learning of visual representations. Although the aforementioned approaches have shown promising cross-modal transfer learning ability, they still rely on the knowledge from established models with a large amount of training data.

Recently, some researchers have become interested in whether the audio and visual modalities can supervise each other. Arandjelovic and Zisserman [3] proposed to learn audio and visual semantic representations by predicting whether a static image and an audio clip correspond to each other. They further developed the Audio-Visual Embedding Network (AVE-Net) [4] to facilitate cross-modal retrieval and sound source localization. Hu et al. [18] and Alwassel et al. [2] introduced multimodal clustering to disentangle each modality into distinct components and performed efficient correspondence learning between the components. Owens and Efros [31] transformed the pretext task from audio-visual correspondence to temporal synchronization and enforced the models to learn spatio-temporal features. Korbar et al. [26] proposed to combine these two pretext tasks by using the strategy of curriculum learning. However, these approaches neglect the information exchange between modalities. In contrast, our framework allows the communication of audio and visual information through co-attention modules, and finds the underlying correlations between them.
Sound Source Localization.
The task of sound source localization is to localize the visual regions correlated to the sounds. Among computational approaches, multivariate Gaussian process models [17], non-parametric methods [13], Canonical Correlation Analysis (CCA) [20, 24], and keypoints/probabilistic formalisms [8] have been used to associate the visual regions with the sounds. Besides, some acoustic hardware based approaches [40, 45] use specific devices (e.g., microphone arrays) to capture phase differences of sound arrival, utilizing the physical relationships to localize sound sources. With the development of deep learning, more and more researchers have studied this task. Some methods [4, 43] focused on the domain of musical instrument localization. On the other hand, several methods [3, 31] explored sound source localization in more generic scenes. However, such methods have difficulty recognizing distinct sources when the video contains multiple mixed sounds. Our network with multi-head based structure can localize each sound source in complicated scenes without any supervision.
Action Recognition.
Action recognition is a very attractive topic due to its various real-world applications, including visual surveillance [19, 21, 36], human-computer interaction [25, 34], video retrieval [12, 33], etc. Researchers have proposed various approaches to address this problem, such as C3D [38], I3D [9], R(2+1)D [39], and S3D [42]. However, such models are trained in a fully-supervised way, and normally require a great deal of annotated training data. Although the Sports-1M [22] and Youtube-8M [1] datasets provide millions of human activity videos, their annotations are generated automatically from the involved main topics, and thus might not be accurate. Recently, some self-supervised methods [2, 26, 31] have been proposed to exploit the abundance of unlabeled videos and boost the generalizability of models. Compared with these works, our method introduces the interactions between modalities and provides competitive results with fewer model parameters.
Figure 2: An overview of our co-attention model for the Audio-Visual Synchronization (AVS) task. The co-attention module consists of the CMA (cross-modal attention) block followed by the SA (self-attention) block, in which the CMA block is composed of AGA (audio-guided attention) and VGA (visual-guided attention). The ellipsis denotes that the co-attention modules can be cascaded in depth.
Audio and visual events tend to occur together at the same time, and videos provide natural alignments between them. Therefore, it is achievable to determine whether the extracted semantic contents correspond by detecting the synchronization of the streams in videos, and thereby obtain generic audio and visual representations without any manual annotations.

In this paper, the pretext task of cross-modal self-supervised learning is Audio-Visual Synchronization (AVS), which can be set up as a binary classification problem. Taking a set of video clips as inputs, audio and visual streams are synchronized in positive samples and unsynchronized in negative samples, and our models are trained to distinguish between these samples. Specifically, we are given a training dataset consisting of $N$ video clips $\mathcal{X} = \{(a_1, v_1, y_1), \dots, (a_N, v_N, y_N)\}$, where $a_n$ and $v_n$ denote the $n$-th audio sample and the $n$-th visual clip, respectively. The label $y_n \in \{0, 1\}$ indicates whether the audio and visual inputs are synchronized, and is obtained from the video clips directly without human annotations. If $y_n = 1$, $a_n$ and $v_n$ are sampled from the same time of the video. If $y_n = 0$, $a_n$ and $v_n$ are sampled from different slices of the video.

We consider that this task requires the model to associate discriminative visual regions with the sound sources. To do this, we propose a neural network with co-attention mechanism, as illustrated in Figure 2. Let $E_a$ and $E_v$ be the audio and visual encoders, respectively. Thus, we can obtain the set of audio features $F_a = \{f_a^{(n)} = E_a(a_n)\} \in \mathbb{R}^{d_a \times N}$ and visual features $F_v = \{f_v^{(n)} = E_v(v_n)\} \in \mathbb{R}^{d_v \times N}$, where $d_a$ and $d_v$ denote the dimensions of the audio and visual features, respectively. To introduce information interactions between audio and visual streams, we explore three different co-attention modules consisting of transformer [41] blocks, which can be cascaded in depth. We describe the details of the co-attention modules in Section 3.1. Given a set of audio features $F_a$ and visual features $F_v$, the final output representations of the co-attention modules are $H_a$ and $H_v$, which are flattened, concatenated, and passed to the fully-connected fusion layers to predict the synchronization probability.

The architecture of our network is composed of three main parts: audio and visual encoders, which are used for extracting audio and visual features, respectively, and co-attention modules that provide interactions between modalities.
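To make the overall pipeline concrete, the following is a minimal PyTorch sketch of the forward pass just described; the encoders, co-attention modules, and fusion layers are detailed in the subsections below, and all class names, layer sizes, and the assumption of pooled 512-D per-modality features are illustrative placeholders rather than the authors' released code.

```python
import torch
import torch.nn as nn

class AVSyncNet(nn.Module):
    """Sketch of the AVS model: encode, co-attend, fuse, classify sync vs. shifted."""
    def __init__(self, audio_encoder, visual_encoder, co_attention, d=512):
        super().__init__()
        self.E_a, self.E_v, self.co_att = audio_encoder, visual_encoder, co_attention
        self.fusion = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, audio, frames):
        F_a, F_v = self.E_a(audio), self.E_v(frames)   # per-modality features
        H_a, H_v = self.co_att(F_a, F_v)               # attended cross-modal features
        fused = torch.cat([H_a.flatten(1), H_v.flatten(1)], dim=1)
        return self.fusion(fused)                      # logits: synchronized vs. shifted
```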
Audio encoder.
For the audio stream, we follow the approach in [31]. The input waveforms are ingested by a sequence of 1D convolutional filters, where $g_o$, $g_d$, and $g_s$ denote the filter output dimension, filter size, and filter stride, respectively. The waveform is 1D ($N$ samples by 2 stereo channels) but reshaped to [$N$, 1, 1, 2], and thus the convolution is implemented as 3D convolution. Specifically, the audio encoder consists of one 3D convolutional filter ($g_o$=64, $g_d$=[65*1*1], $g_s$=4), one 3D average pooling ($g_o$=64, $g_d$=[4*1*1], $g_s$=[4*1*1]), two 3D convolutional filters ($g_o$=128, $g_d$=[15*1*1], $g_s$=4), two 3D convolutional filters ($g_o$=128, $g_d$=[15*1*1], $g_s$=4), two 3D convolutional filters ($g_o$=256, $g_d$=[15*1*1], $g_s$=4), one 3D average pooling ($g_o$=128, $g_d$=[3*1*1], $g_s$=[3*1*1]), and one 3D convolutional filter ($g_o$=128, $g_d$=[3*1*1], $g_s$=1). A fully connected layer is applied to obtain the 512D audio features $F_a$.
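As a rough illustration of the "1D convolution implemented as 3D convolution" trick above, the PyTorch sketch below reshapes a stereo waveform so the two channels become the convolution channels and applies temporal-only kernels; the layer count and channel sizes are abbreviated and do not reproduce the exact filter configuration listed in the text.

```python
import torch
import torch.nn as nn

class AudioEncoderSketch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        # Temporal-only kernels [k, 1, 1] convolve along the waveform axis only.
        self.conv1 = nn.Conv3d(2, 64, kernel_size=(65, 1, 1), stride=(4, 1, 1), padding=(32, 0, 0))
        self.pool1 = nn.AvgPool3d(kernel_size=(4, 1, 1), stride=(4, 1, 1))
        self.conv2 = nn.Conv3d(64, 128, kernel_size=(15, 1, 1), stride=(4, 1, 1), padding=(7, 0, 0))
        self.fc = nn.Linear(128, out_dim)

    def forward(self, wav):                                    # wav: [B, N_samples, 2] stereo waveform
        x = wav.permute(0, 2, 1).unsqueeze(-1).unsqueeze(-1)   # -> [B, 2, N, 1, 1]
        x = torch.relu(self.conv1(x))
        x = self.pool1(x)
        x = torch.relu(self.conv2(x))
        x = x.mean(dim=(2, 3, 4))                              # pool over the temporal axis
        return self.fc(x)                                      # 512-D audio feature F_a
```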
Visual encoder.
In order to capture the motion information, the visual encoder consists of a small number of 3D convolutional filters, where $g_o$, $g_d$, and $g_s$ denote the filter output dimension, filter size, and filter stride, respectively. First, we reshape the inputs into a size of $t \times h \times w \times 3$, where $t$ denotes the number of frames. We then feed the input features into one 3D convolutional filter ($g_o$=64, $g_d$=[5*7*7], $g_s$=1), perform 3D average pooling ($g_o$=64, $g_d$=[1*3*3], $g_s$=[1*2*2]), and go through four 3D convolutional filters ($g_o$=64, $g_d$=[3*3*3], $g_s$=2). At last, we average all frame features and apply a fully connected layer to obtain the 512D video features $F_v$.
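A similarly abbreviated sketch of the 3D-convolutional visual encoder follows; the kernel and pooling shapes track the text, but the number of layers is reduced for brevity.

```python
import torch
import torch.nn as nn

class VisualEncoderSketch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=1, padding=(2, 3, 3))
        self.pool1 = nn.AvgPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2))
        self.conv2 = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), stride=2, padding=1)
        self.fc = nn.Linear(64, out_dim)

    def forward(self, frames):                 # frames: [B, 3, t, H, W] video clip
        x = torch.relu(self.conv1(frames))
        x = self.pool1(x)
        x = torch.relu(self.conv2(x))
        x = x.mean(dim=(2, 3, 4))              # average over frames and spatial positions
        return self.fc(x)                      # 512-D visual feature F_v
```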
Co-attention module.
Before presenting the architecture of the co-attention modules, we first introduce the Self-Attention (SA) encoder block in the transformer. The inputs of the SA block are the vectors of queries ($Q$), keys ($K$), and values ($V$). The output attended features are obtained by a weighted summation over the values $V$, as formulated in Eq. (1). The weights on the values are computed as the dot-product similarity between the queries $Q$ and keys $K$, divided by $\sqrt{d}$, where $d$ denotes the dimension of $Q$, $K$, and $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \qquad (1)$$

To further focus on discriminative concrete components, a multi-head attention structure is adopted in the blocks, which consists of $m$ paralleled "heads". Regarding each independent attention function as one head, the output features of the heads are concatenated and then projected, yielding the final output as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_m)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \qquad (2)$$

where $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_m}$ and $W^{O} \in \mathbb{R}^{m d_m \times d}$ are the learned projection matrices, and $d_m$ is the dimension of the output from each head. To prevent the dimension of the output feature from becoming too large, $d_m$ is usually set to $d/m$.

Figure 3: Three basic blocks of co-attention modules. (a) Self-Attention (SA), in which the vectors of queries (Q), keys (K), and values (V) are the same as the input features; (b) Visual-Guided Attention (VGA), in which the inputs of $Q_a$ are modified to the visual features $F_v$; (c) Cross-Modal Attention (CMA), in which the queries of each modality are modified to the features of the other modality. These blocks introduce the interactions between modalities, and the multi-head attention structure enables the model to focus on discriminative concrete components.

As shown in Figure 3(a), the SA block consists of a multi-head attention structure followed by a position-wise fully connected feed-forward network. Residual connections [16] and layer normalization [7] are also applied to each of the two sub-layers. Specifically, in the SA block, the vectors of $Q$, $K$, and $V$ are the same, equal to the intermediate representations $F$. Taking the set of representations $F = (f_1, \dots, f_n) \in \mathbb{R}^{d \times N}$ as inputs, we obtain the attended output features $H = (h_1, \dots, h_n) \in \mathbb{R}^{d \times N}$.

Inspired by the McGurk effect, we first propose a Visual-Guided Attention (VGA) transformer block, in which the intermediate visual features $F_v$ guide the attention learning for the audio stream. As depicted in Figure 3(b), the query vectors of the audio stream $Q_a$ passed to the multi-head attention are modified to be the visual features $F_v$ rather than the extracted audio features $F_a$. Thus, the attention maps of the VGA block tend to focus on the values in the audio stream related to visual information. Similar to the VGA block, we also design an Audio-Guided Attention (AGA) block, in which the audio features $F_a$ guide the attention learning for the visual stream and eventually obtain the attended visual features $H_v$.
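A minimal PyTorch sketch of one guided-attention block of the kind described above: multi-head attention whose queries come from the other modality, followed by the position-wise feed-forward network, with residual connections and layer normalization around each sub-layer. The exact residual placement inside the paper's blocks is not spelled out here, so routing the residual along the query stream is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class GuidedAttentionBlock(nn.Module):
    def __init__(self, d=512, heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, guide, target):
        # guide:  features of the guiding modality, used as queries      [B, L_guide, d]
        # target: features of the attended modality, used as keys/values [B, L_target, d]
        attended, _ = self.attn(query=guide, key=target, value=target)
        x = self.norm1(guide + attended)        # residual + layer norm (sub-layer 1)
        return self.norm2(x + self.ffn(x))      # position-wise FFN (sub-layer 2)

# VGA: visual features guide attention over the audio stream, e.g. H_a = block(F_v, F_a)
# AGA: audio features guide attention over the visual stream, e.g. H_v = block(F_a, F_v)
```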
In addition to the VGA and AGA blocks, we further explore a Cross-Modal Attention (CMA) block for the AVS task. As depicted in Figure 3(c), the queries of each modality are modified to be the intermediate features of the other modality, which introduces cross-modal interactions between audio and visual streams. These interactions are further exploited to obtain the final attended audio features $H_a$ and visual features $H_v$. It should be noted that all three attention blocks are followed by SA blocks to model the intra-modal interactions in the audio and visual streams, and these attention blocks can be cascaded in depth to gradually refine the attended cross-modal features.
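Building on the GuidedAttentionBlock sketch above, one co-attention module of the CMA form could be assembled as follows: cross-modal attention in both directions, each followed by a per-modality self-attention block. The composition and names are illustrative, and modules of this form can simply be stacked to increase depth.

```python
import torch.nn as nn

class SelfAttentionBlock(GuidedAttentionBlock):
    """SA block: queries, keys, and values all come from the same modality."""
    def forward(self, x):
        return super().forward(guide=x, target=x)

class CoAttentionModule(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.vga = GuidedAttentionBlock(d, heads)   # visual-guided attention over audio
        self.aga = GuidedAttentionBlock(d, heads)   # audio-guided attention over visual
        self.sa_a, self.sa_v = SelfAttentionBlock(d, heads), SelfAttentionBlock(d, heads)

    def forward(self, F_a, F_v):
        H_a = self.sa_a(self.vga(guide=F_v, target=F_a))   # attended audio features
        H_v = self.sa_v(self.aga(guide=F_a, target=F_v))   # attended visual features
        return H_a, H_v

# Cascading in depth: modules = nn.ModuleList([CoAttentionModule() for _ in range(L)])
```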
Datasets.
Audioset [14] contains 2,084,320 manually-annotated 10-sec. segments of 632 audio event classes from YouTube. For efficiency purposes, we train our neural network on a subset of approximately 240K video clips randomly sampled from the training split of Audioset, which is denoted as Audioset-240K†. The validation and test sets are sampled from the evaluation split of Audioset, and contain 10K and 8K video clips, respectively. Audioset is typically used for audio event recognition, but we do not use any annotations other than the unconstrained videos themselves in our task. We only use the YouTube identifiers to download videos from the website, split each entire video into non-overlapping 10-sec. clips, and discard the clips shorter than ten seconds. In contrast to other works, we do not even use the annotations of timestamps. The video clips segmented by the timestamps usually contain one specific event, while videos in real-life environments are complicated. Hence, the dataset obtained by our approach is more difficult and realistic. Any unconstrained video can be used for training, and our model can learn more generic and complex representations.

Figure 4: The transfer experimental results on the Audio-Visual Synchronization (AVS) task for action categories in the Kinetics dataset. The categories are sorted from high to low according to the accuracy.
Training data sampling.
For consistency and fair evaluation, we follow the same settings as the previous work [31]. We sample 4.2-sec. clips at the full frame rate (29.97 Hz) from the longer 10-sec. video clips in Audioset-240K†, and the audio streams are shifted by 2.0 to 5.8 seconds in negative samples. The video frames are first uniformly resized to 256 × 256.
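A sketch of the pair construction just described, assuming each 10-sec. training clip has already been decoded into a frame array and a waveform; the 4.2 s window and the 2.0–5.8 s shift range follow the text, while the audio sample rate and the helper name are hypothetical.

```python
import random

FPS = 29.97          # full frame rate used in the text
SR = 21000           # audio sample rate (assumed; not specified above)
CLIP_SEC = 4.2       # length of the sampled window

def sample_avs_pair(frames, waveform, make_negative):
    """Return (visual clip, audio clip, label) for the AVS pretext task."""
    total_sec = len(frames) / FPS
    max_start = total_sec - CLIP_SEC
    v_start = random.uniform(0.0, max_start)
    if not make_negative:
        a_start, label = v_start, 1                          # synchronized pair
    else:
        shift = random.uniform(2.0, 5.8)                     # misalign audio by 2.0-5.8 s
        candidates = [s for s in (v_start - shift, v_start + shift) if 0.0 <= s <= max_start]
        if candidates:
            a_start = random.choice(candidates)
        else:                                                # shift does not fit either way:
            a_start = 0.0 if v_start > max_start / 2 else max_start   # use farthest boundary
        label = 0
    v = frames[int(v_start * FPS): int((v_start + CLIP_SEC) * FPS)]
    a = waveform[int(a_start * SR): int((a_start + CLIP_SEC) * SR)]
    return v, a, label
```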
Optimization details.
We train our models end-to-end on 4 GPUs (16GB P100) for 750,000 iterations with a batch size of 32. We use the Stochastic Gradient Descent (SGD) algorithm with a momentum of 0.9 to optimize our model, where the initial learning rate is 0.01 and weight decay is applied. To accelerate the training process, we pre-train our model at a lower video frame rate of 7.5 Hz and a larger batch size of 64 for the first 100,000 iterations as a warm-up process [16], which takes roughly 46 hours.

We evaluate the performance of our approach on the AVS task, which is directly compared with two baseline models [31], including
Multisensory and Multisensory*. Multisensory is the checkpoint provided by Owens and Efros [31]. This model is pre-trained on a subset of Audioset which includes approximately 750,000 videos, and we denote this subset as Audioset-750K.
Multisensory* is the model re-trained on Audioset-240K† (our dataset) to ensure a fair comparison.

Table 1: The experimental results on the AVS task. † denotes that the video clips in the dataset are segmented by our methods. * denotes the re-trained model. The notations Ours (CMA), Ours (AGA), and Ours (VGA) denote the co-attention module consisting of both AGA and VGA (i.e., CMA), only AGA, and only VGA, respectively.

Methods              Parameters   Training Set     Evaluation (Acc.) on Audioset-750K / Audioset-240K†
Multisensory [31]    36M          Audioset-750K    59.9% / 59.3%
Multisensory* [31]   36M          Audioset-240K†

The related experimental results are reported in Table 1, which indicate that our proposed co-attention models achieve superior performance over the baseline models. We observe that our CMA-based model makes obvious improvements over the two ablated models, i.e., the AGA-based and VGA-based models, so we only discuss the CMA-based model in the rest of this paper. Our CMA-based model outperforms Multisensory* by 5.1% and 3.1% on Audioset-750K and Audioset-240K†, respectively. This verifies the effectiveness of introducing cross-modal interactions between audio and visual streams. It should be noted that such performance is achieved with 58.3% fewer parameters, i.e., our model is significantly more parameter-efficient than the baseline models. Besides, we note that compared to Multisensory, the re-trained
Multisensory* improves the performance by 0.3% and 0.2% on Audioset-750K and Audioset-240K†, respectively. This demonstrates that the unconstrained videos in our dataset enforce the model to learn more effective and generic representations with less training data.

To study the transferability of our model and further understand on which categories the model achieves good performance, we also evaluate the prediction accuracy of the AVS task on the Kinetics dataset [23] (without re-training). We find that our CMA-based model still outperforms Multisensory* by a significant margin (60.8% vs. 57.5% accuracy). The results of each category are shown in Figure 4. It can be observed that the most successful predictions are the categories containing human speech (e.g., news anchoring, answering questions), which indicates that our co-attention model is sensitive to lip motions and can be applied to some speech applications (e.g., active speaker detection and speech separation). The worst predictions are some actions that make almost no sound, e.g., slacklining and climbing ladder. In these cases, even human beings can hardly make a correct judgment.

Figure 5: The qualitative results of sound source localization. The first three speech examples are sampled from Audioset and the rest are sampled from Kinetics-Sounds (e.g., chopping wood, singing, etc.). The top three rows are cases where each video contains only one sound source, and the bottom three rows are cases where each video contains multiple sound sources. More visualized videos can be found at https://youtu.be/UPrZ5Kr-DwA .

It is worth noting that the AVS task is quite difficult. First, the audio streams may be silent, or the video frames may be unchanged. Second, the unconstrained videos may contain mixed multiple components, making it hard to associate the real audio-visual pairs and achieve acceptable performance. Moreover, the sound-maker may not even appear on the screen (e.g., the voiceover of the photographer, the person narrating the video). Owens and Efros [31] gave 30 Amazon Mechanical Turk participants 60 aligned/shifted audio-visual pairs from Audioset, which were 15-sec. in length and whose audio tracks were shifted by a large offset of 5 seconds, so that the participants had more temporal semantic context to make predictions. However, the human classification accuracy is only about 66%. This suggests that our co-attention model significantly narrows the gap to human-level performance.

Audio-visual synchronization has extensive applications in our daily life, e.g., determining lip-sync errors [10] and detecting signal processing delays between video camera and microphone. However, as a pretext task in self-supervised learning, performance on the AVS task is not our final objective. We are also interested in the learned cross-modal representations, with the expectation that these representations carry good semantic or structural meaning and eventually facilitate a variety of downstream tasks. In the following sections, we present qualitative and quantitative evaluations on some practical downstream tasks.
It is essentially impossible for a neural network to effectively perform the Audio-Visual Synchronization (AVS) task unless it has first learned to find the discriminative visual regions which make the sound. Here, we show how our co-attention model associates the visual regions with the sound components. To this end, we visualize the results by using the Class Activation Map (CAM) [44], which exploits Global Average Pooling (GAP) [28] to build effective localizable representations that recognize the visual regions.

The qualitative results are shown in Figure 5. We compare the model that performs best on the pretext task with the baseline models, i.e.,
Multisensory and
Multisensory*. As can be seen, the models can recognize the objects that make the sounds or whose motions are highly correlated to the sounds (e.g., lip motion), which are used for detecting the misalignments between audio and visual streams. The top three rows show examples with only one sound source in each video. From the results in Figure 4, we find that the most successful predictions of the AVS task are the categories involving human speech. Hence, we perform sound source localization on the held-out speech subset of Audioset and display the results of three randomly sampled speech instances in Figure 5. It can be observed that both multisensory models and our model are sensitive to face and mouth movements. Specifically, the difference between Multisensory and Multisensory* is not large, whereas the co-attention module in our model benefits the fine-grained representations of audio and visual streams, leading to localizing the sound sources more precisely and with greater concentration.

To further demonstrate the generalizability of our model to other categories of videos, we apply the models to the Kinetics-Sounds dataset [3] and show some examples in Figure 5, including chopping wood, playing organ, singing, dribbling basketball, and playing guitar. We can see that the models can still localize the action regions.

The bottom three rows in Figure 5 are more complex and challenging cases in which each video contains multiple sound sources. Compared with the baseline models, our model can localize almost every sound source in various categories of videos. For example, as shown in the fourth column, one coach and some athletes are dribbling basketballs in a gym. Even though they have similar actions that make the same thuds of bouncing balls, our co-attention model can still localize each athlete and the coach precisely, whereas the baseline models only localize the visual region of the coach and ignore the actions of the athletes. These results indicate that the multi-head attention mechanism facilitates the network to capture diverse information, attend to discriminative visual regions correlated to the sound sources, and localize these regions in the scenes.

To quantitatively evaluate the effectiveness of the cross-modal representations that emerge from the pretext task, we fine-tune our model on two standard action recognition datasets, UCF101 [37] and HMDB51 [27]. UCF101 consists of 13K videos from 101 human action classes, and HMDB51 contains about 7K clips from 51 different human motion classes. UCF101 and HMDB51 each have three official training/testing splits, and the mean accuracy over the three splits is computed in our experiments. We compare our model with fully-supervised 3D CNN methods and other self-supervised methods, and discuss the results in this section.

For action recognition, the task requires the models to classify the action labels of given videos. We concatenate the two final output audio and visual features of the co-attention modules and add two fully-connected layers on top of the pre-trained model for classification. The dimension of the last fully-connected layer is the number of labels in the dataset. We fine-tune the entire model with cross-entropy loss, and the weights are initialized by the model pre-trained on the AVS task.
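A sketch of the fine-tuning head just described: the final attended audio and visual features are flattened, concatenated, and fed through two fully-connected layers whose last dimension equals the number of action classes, with the whole network trained under cross-entropy loss. Here `backbone` stands for the pre-trained encoders plus co-attention modules, and `feat_dim` for the flattened per-modality feature size; both are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionRecognitionHead(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.backbone = backbone                      # initialized from the AVS pretext task
        self.fc1 = nn.Linear(2 * feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)     # 101 for UCF101, 51 for HMDB51

    def forward(self, audio, frames):
        H_a, H_v = self.backbone(audio, frames)       # final attended audio/visual features
        x = torch.cat([H_a.flatten(1), H_v.flatten(1)], dim=1)
        return self.fc2(torch.relu(self.fc1(x)))

# Fine-tuning step: loss = F.cross_entropy(model(audio, frames), labels); loss.backward()
```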
Data sampling.
During the fine-tuning procedure, the inputs to the visual encoder are video frames, which are resized to 256 × 256.
Optimization details.
We fine-tune our models on a single 16GB P100 for about 30,000 iterations. The batch size is set to 16. We use Adam to optimize our models with an initial learning rate of 3e-5, and halve the rate every 10,000 iterations. Dropout is imposed with a probability of 0.5. Data augmentation strategies, i.e., random cropping and shifting audio clips, are applied to reduce overfitting.

Table 2: The action recognition accuracy on UCF101 [37] and HMDB51 [27]. We compare our model against the fully-supervised 3D CNN methods and other self-supervised audio-visual methods.

Methods              Pre-training Dataset   Size    UCF101   HMDB51
I3D-RGB [9]          None                   0K      57.1%    40.0%
I3D-RGB [9]          Kinetics               240K    95.1%    74.3%
I3D-RGB [9]          Kinetics + Imagenet    -       95.6%    74.8%
Multisensory* [31]   Audioset-240K†
Using larger datasets:
L3-Net [3]           Audioset               2M      72.3%    40.2%
Multisensory [31]    Audioset-750K          0.75M   82.1%    54.0%
AVTS [26]            Audioset               2M      89.0%    61.6%
XDC [2]              Audioset               2M      91.2%    61.0%
XDC [2]              IG65M                  65M     94.2%    67.4%
The comparison results are reported in Table 2. We observe that our co-attention model provides a remarkable boost over other self-supervised methods when pre-trained on a dataset of the same size. Our model improves by 2.0% on UCF101 and 1.3% on HMDB51 compared with AVTS [26] pre-trained on Kinetics (240K examples). It is worth noting that the visual subnetwork adopted in AVTS is based on the mixed-convolution (MCx) family of architectures [39], which is designed for action recognition and has more parameters than our model. We also observe that the accuracy grows monotonically as the size of the pre-training dataset increases, which suggests that our co-attention model may reduce the remaining gap to the fully-supervised methods by using more pre-training data.

To evaluate the effectiveness of the self-supervised pre-training mechanism, we also train a neural network with the same cross-modal architecture from scratch. We can see that our full model improves by 17.3% on UCF101 and 13.4% on HMDB51 when pre-trained on the pretext task, which verifies that our self-supervised pre-training is beneficial to the task of action recognition.

We further explore whether the pre-trained visual subnetwork (vision-only) can work alone. For that purpose, we set the activation of the audio streams to zero and change the co-attention to self-attention, that is, we modify the queries $Q_v$ in the video streams to the intermediate video features $F_v$. We observe a 4.5% drop on UCF101 and a 6.3% drop on HMDB51, respectively, which shows that the audio subnetwork is important for action recognition while the visual subnetwork can still work independently.

Figure 6: The ablation study results on the task of sound source localization. The first example contains only one sound source and the rest contain mixed multiple sources.

Table 3: The ablation study results on the tasks of Audio-Visual Synchronization (AVS) and action recognition.
In this work, we denote the hidden size of the transformer as H, the number of attention heads as A, and the depth of the co-attention module as L. In this section, we explore the effects of six model sizes, i.e., L = [1, 2], H = 512, A = [4, 8, 16].

Table 3 shows the results of the ablation study on the Audio-Visual Synchronization (AVS) and action recognition tasks with respect to different depths and numbers of heads. We find that both tasks benefit from shallower models, and that properly increasing the number of heads can also improve the performance. Figure 6 presents the effects on localizing the sound sources. We observe that all models perform similarly when the video contains only one sound source. For better comparison, we show more examples that contain mixed multiple sounds. It can be seen that a greater depth of the co-attention module allows the model to localize the sound source with greater concentration. We also observe that the model can localize more sound sources as the number of heads increases. For example, in the second column, there are nine older people singing on the screen. As the number of heads increases, the model can identify the older people in the bottom left corner and middle right. This result verifies that the multi-head mechanism is helpful for localizing multiple visual regions correlated to the sounds.

In this paper, we propose a self-supervised framework with co-attention mechanism to learn generic cross-modal representations by solving the pretext task of Audio-Visual Synchronization (AVS). Our co-attention model introduces information interactions between audio and visual streams, and achieves state-of-the-art performance on the AVS task. We also demonstrate that it successfully learns cross-modal semantic information that can be utilized for a variety of downstream tasks, such as sound source localization and action recognition. Since we train our networks with a subset of Audioset for efficiency purposes, we will exploit more pre-training data to achieve continual improvement, and apply our model to other practical audio-visual tasks in future work.
ACKNOWLEDGMENTS
This work was partially supported by National Natural Science Foundation of China (No. 61976057, No. 61572140), Military Key Research Foundation Project (No. AWS15J005), Shanghai Municipal Science and Technology Major Project (2018SHZDZX01) and ZJLab.
REFERENCES
[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016).
[2] Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2019. Self-Supervised Learning by Cross-Modal Audio-Video Clustering. arXiv preprint arXiv:1911.12667 (2019).
[3] Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision. 609–617.
[4] Relja Arandjelovic and Andrew Zisserman. 2018. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV). 435–451.
[5] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems (NIPS). 892–900.
[6] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2017. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932 (2017).
[7] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[8] Zohar Barzelay and Yoav Y Schechner. 2007. Harmony in motion. IEEE, 1–8.
[9] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[10] Joon Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Asian Conference on Computer Vision. Springer, 251–263.
[11] Soo-Whan Chung, Joon Son Chung, and Hong-Goo Kang. 2019. Perfect match: Improved cross-modal embeddings for audio-visual synchronisation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3965–3969.
[12] Arridhana Ciptadi, Matthew S Goodwin, and James M Rehg. 2014. Movement pattern histogram for action recognition and retrieval. In European Conference on Computer Vision. Springer, 695–710.
[13] John W Fisher III, Trevor Darrell, William T Freeman, and Paul A Viola. 2001. Learning joint statistical models for audio-visual fusion and segregation. In Advances in Neural Information Processing Systems (NIPS). 772–778.
[14] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA.
[15] Tavi Halperin, Ariel Ephrat, and Shmuel Peleg. 2019. Dynamic temporal alignment of speech to lips. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3980–3984.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[17] John R Hershey and Javier R Movellan. 2000. Audio vision: Using audio-visual synchrony to locate sounds. In Advances in Neural Information Processing Systems (NIPS). 813–819.
[18] Di Hu, Feiping Nie, and Xuelong Li. 2019. Deep Multimodal Clustering for Unsupervised Audiovisual Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9248–9257.
[19] Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, and Steve Maybank. 2007. Semantic-based surveillance video retrieval. IEEE Transactions on Image Processing 16, 4 (2007), 1168–1181.
[20] Hamid Izadinia, Imran Saleemi, and Mubarak Shah. 2012. Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia 15, 2 (2012), 378–390.
[21] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2012), 221–231.
[22] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale Video Classification with Convolutional Neural Networks. In CVPR.
[23] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
[24] Einat Kidron, Yoav Y Schechner, and Michael Elad. 2005. Pixels that sound. Vol. 1. IEEE, 88–95.
[25] Hema S Koppula and Ashutosh Saxena. 2015. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1 (2015), 14–29.
[26] Bruno Korbar, Du Tran, and Lorenzo Torresani. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems (NIPS). 7763–7774.
[27] H. Kuhne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: A Large Video Database for Human Motion Recognition. In IEEE International Conference on Computer Vision (ICCV).
[28] Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network in network. In International Conference on Learning Representations (ICLR).
[29] Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature.
[30] In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6829–6833.
[31] Andrew Owens and Alexei A Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV). 631–648.
[32] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. 2016. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision. Springer, 801–816.
[33] Mohsen Ramezani and Farzin Yaghmaee. 2016. A review on human action analysis in videos for retrieval applications. Artificial Intelligence Review 46, 4 (2016), 485–514.
[34] MS Ryoo, Thomas J Fuchs, Lu Xia, Jake K Aggarwal, and Larry Matthies. 2015. Robot-centric activity prediction from first-person videos: What will they do to me? IEEE, 295–302.
[35] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS). 568–576.
[36] Sanchit Singh, Sergio A Velastin, and Hossein Ragheb. 2010. Muhavi: A multicamera human action video dataset for the evaluation of action recognition methods. IEEE, 48–55.
[37] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 4489–4497.
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450–6459.
[40] Harry L Van Trees. 2004. Optimum array processing: Part IV of detection, estimation, and modulation theory. John Wiley & Sons.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS). 5998–6008.
[42] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV). 305–321.
[43] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV). 570–586.
[44] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2921–2929.
[45] Andrea Zunino, Marco Crocco, Samuele Martelli, Andrea Trucco, Alessio Del Bue, and Vittorio Murino. 2015. Seeing the sound: A new multimodal imaging device for computer vision.