Learning Audio-Visual Correlations from Variational Cross-Modal Generation
Ye Zhu, Yu Wu, Hugo Latapie, Yi Yang, Yan Yan
Illinois Institute of Technology, USA; ReLER, University of Technology Sydney, Australia; Cisco, USA
ABSTRACT
People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied to multiple downstream tasks such as audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE), with an additional Wasserstein distance constraint, to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE effectively learns audio-visual correlations and can be readily applied to multiple audio-visual downstream tasks, achieving competitive performance even without any label information during training.
Index Terms — Audio-visual correlations, variational autoencoder, cross-modal generation.
1. INTRODUCTION
As humans, we can naturally imagine the possible visual frames while hearing the corresponding sound, or imagine the potential sound while seeing an event happening. The correlations between the audio and visual information accompanying an event can therefore be modeled from the perspective of generation, i.e., the corresponding audio signals and visual frames can generate each other. Moreover, the natural correspondence between audio and visual information in videos makes it possible to accomplish this objective in a self-supervised manner, without additional annotations.

Audio and visual perceptions are both essential sources of information for humans to explore the world. Audio-visual cross-modal learning has thus become a research focus in recent years [1, 2, 3, 4, 5, 6, 7, 8, 9], and the correlations between audio and visual signals are the key in this field. Recent studies in the audio-visual cross-modal field largely focus on representation learning that incorporates the information from both modalities in a discriminative way, and then applies the learned feature embedding to relevant audio-visual tasks such as sound source localization [5, 10, 11], sound source separation [2, 3, 11], cross-modal retrieval [12, 3, 5, 13] and cross-modal localization [13, 14]. In contrast, another branch of work learns the correlations in a generative manner [15, 6, 16]. Hu et al. [7] introduce Deep Multimodal Clustering for capturing the audio-visual correspondence. Korbar et al. [1] propose a cooperative learning schema to obtain multi-sensory representations from self-supervised synchronization. Arandjelovic and Zisserman [3] propose to learn a mutual feature embedding through the audio-visual correspondence task. As for more concrete audio-visual downstream tasks, Gao et al. [10] look into the problem of separating different sound sources based on a deep multi-instance multi-label network. Among the above studies with concrete downstream tasks, most learn the audio-visual correlations in a discriminative way to obtain better performance, which usually requires label information. Our work tackles the problem from a different perspective, based on generative models trained via self-supervised learning.

Overall, we pursue two goals in this work: leveraging the label-free advantage to learn the intrinsic audio-visual correlations from cross-modal generation via the proposed MS-VAE framework, and at the same time achieving competitive performance on multiple audio-visual downstream tasks. Specifically, the proposed MS-VAE is a Variational AutoEncoder framework with Multiple encoders and a Shared decoder. VAE [17] is a popular class of generative models that synthesizes data with latent variables sampled from a variational distribution, generally modeled by Gaussian distributions. Based on the properties of VAE, we propose to use the latent variables to represent modality-specific data. These latent variables then need to be aligned in order to complementarily describe the ongoing event from two perspectives. Finally, the aligned latent variable should be able to generate both audio and visual information to construct the corresponding data pair. The optimized latent variable hence automatically learns the intrinsic audio-visual correlations during the process of cross-modal generation, and is ready to be directly applied in audio-visual tasks.

Fig. 1. Schematic overview of our proposed MS-VAE model.

One practical challenge in applying the VAE framework to multi-modal learning is that VAE often suffers from degeneration when balancing simple and complex distributions [18], due to the large dimensionality difference between audio and video data. We adopt a shared-decoder architecture to help avoid this degeneration and to enforce the mutuality between audio and visual information. To further obtain a better alignment of the latent space, we derive a new evidence lower bound (ELBO) with a Wasserstein distance [19, 20, 21] constraint, which formulates our objective function.

The main contributions of our work can be summarized as follows: 1) We model the audio-visual correlations from the perspective of cross-modal generation. We make a shared-latent-space assumption to apply a unified representation for the two modalities, deriving the objective function from a new lower bound with a Wasserstein distance constraint. 2) We propose the MS-VAE network, a novel self-supervised learning framework for audio-visual cross-modal generation. MS-VAE generates a corresponding audio-video pair from either single-modality input and alleviates the degeneration problem in VAE training. 3) The learned latent representations from MS-VAE can be readily applied in multiple audio-visual tasks. Experiments on the AVE dataset [13] show that our unsupervised method achieves performance comparable or superior to supervised methods on the challenging localization and retrieval tasks, even though it is trained without any labels.
2. METHODOLOGY

2.1. MS-VAE for Cross-Modal Learning
Our MS-VAE network is composed of separate encoders and one shared decoder. $q_{a,\phi_a}(z_a|x_a)$ is the encoder that maps audio input data $x_a$ into a latent space $z_a$, and $q_{v,\phi_v}(z_v|x_v)$ is the encoder for visual input data $x_v$ that constructs another latent space $z_v$. Ideally, we wish to obtain an aligned mutual latent space $z$ where $z = z_a = z_v$. The shared decoder $p_\theta(x|z)$ aims to reconstruct the original data pair $x$ from this mutual latent space $z$ during training, in which case the expected reconstructed data should consist of $x_a$ and $x_v$; we denote the pair of audio and video data as $x = (x_a, x_v)$. $\phi_a$, $\phi_v$ and $\theta$ are the model parameters, which we omit in the following formulations to reduce redundancy.

The goal of our model resembles the original VAE [17]: we aim to maximize the log-probability $\log p(x)$ of the reconstructed data pair $x$ from the desired mutual latent space $z$, where $i$ denotes either the modality $a$ (audio) or $v$ (visual). This model design leads to a variational lower bound similar to the original VAE [17]:

$$\log p(x) \geq \mathbb{E}_{z_i \sim q_i(z_i|x_i)}[\log p(x|z_i)] - \mathrm{KL}(q_i(z_i|x_i)\,\|\,p(z_i)), \qquad (1)$$

where KL denotes the Kullback-Leibler divergence, defined as $\mathrm{KL}(p(x)\,\|\,q(x)) = \int_x p(x)\log\frac{p(x)}{q(x)}$, which measures the discrepancy between two distributions and is always non-negative. To build the relation between the audio and video components, we rewrite Equation 1 as a mixture of log-likelihoods conditioned on the latent variables from the different modalities. A Wasserstein distance loss [22], which we refer to as $W_{latent}$, is further added to better encourage the alignment between the two latent spaces. Since the Wasserstein distance is always non-negative, the inequality remains valid. In this case, we obtain a new lower bound, whose equality holds only when the modeled distribution matches the data distribution and $z_a$ and $z_v$ are perfectly aligned:

$$\log p(x) \geq \mathbb{E}_{z}[\log p(x|z)] - \frac{1}{2}\big[\mathrm{KL}(q_a(z|x_a)\,\|\,p(z)) + \mathrm{KL}(q_v(z|x_v)\,\|\,p(z))\big] - W_{latent}(q_a(z_a|x_a)\,\|\,q_v(z_v|x_v)). \qquad (2)$$

The schematic overview of the proposed MS-VAE architecture is illustrated in Figure 1. We have separate encoders $q_a$ and $q_v$ for the audio and visual inputs, and a shared decoder $p$ is used to generate the corresponding audio and visual data pair. The Wasserstein distance is computed between the two latent variables $z_a$ and $z_v$ to encourage the alignment between the latent spaces, using a similar approach as in WAE [23], where we sample from the latent variables $z_a$ and $z_v$ to compute $\mathbb{E}_{q_a, q_v}[\|z_a - z_v\|]$.

The ultimate goal of the proposed MS-VAE is to obtain an aligned latent representation that automatically learns the audio-visual correlations. During training, in each epoch, we reconstruct the audio-visual pair from either the audio or the visual input. The encoder returns $\mu_i$ and $\sigma_i$ for the Gaussian distribution, and $z_i$ is sampled from $\mathcal{N}(\mu_i, \sigma_i)$. The reconstruction loss, a Mean Square Error (MSE), is computed between the reconstructed pair $\hat{x}_i$ from modality $i$ and the ground-truth input pair $x$. The total loss contains the reconstruction loss, the KL divergence and the Wasserstein latent loss. No label information is given at any point during training. Overall, we have three loss terms to optimize:

$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{MSE} + \lambda_2 \mathcal{L}_{KL} + \lambda_3 W_{latent}. \qquad (3)$$

Empirically, we choose $\lambda_1$ and $\lambda_2$ to be 1, while $\lambda_3$ is set to 0.1 in the first 10 training epochs and then reduced to 0.01 to encourage better reconstruction.
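To make the training objective in Equation 3 concrete, the following is a minimal PyTorch-style sketch of one MS-VAE training step. The MLP encoders/decoder, feature dimensions, and the sampled approximation of the Wasserstein term are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSVAE(nn.Module):
    """Two modality-specific encoders and one shared decoder (illustrative sizes)."""
    def __init__(self, dim_a=128, dim_v=512, dim_z=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, 256), nn.ReLU(), nn.Linear(256, 2 * dim_z))
        self.enc_v = nn.Sequential(nn.Linear(dim_v, 256), nn.ReLU(), nn.Linear(256, 2 * dim_z))
        # The shared decoder reconstructs the concatenated audio-visual pair from z.
        self.dec = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_a + dim_v))

    def encode(self, x, modality):
        h = self.enc_a(x) if modality == "a" else self.enc_v(x)
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar

def msvae_loss(model, x_a, x_v, modality, lam=(1.0, 1.0, 0.1)):
    """One training step: reconstruct the full (x_a, x_v) pair from a single-modality input."""
    x_pair = torch.cat([x_a, x_v], dim=-1)
    z, mu, logvar = model.encode(x_a if modality == "a" else x_v, modality)
    rec = F.mse_loss(model.dec(z), x_pair)                         # L_MSE
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # L_KL against N(0, I)
    # W_latent: sampled distance between the two modality-specific latent codes.
    z_a, _, _ = model.encode(x_a, "a")
    z_v, _, _ = model.encode(x_v, "v")
    w_latent = (z_a - z_v).norm(dim=-1).mean()                     # approximates E[||z_a - z_v||]
    return lam[0] * rec + lam[1] * kl + lam[2] * w_latent

# Usage (hypothetical features): loss = msvae_loss(MSVAE(), audio_feat, visual_feat, modality="a")
```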
3. EXPERIMENTS

3.1. Dataset and Evaluation Metrics
We expect the data for our experiments to have intrinsic correlations in describing an audio-visual event; we therefore adopt the AVE dataset. The AVE dataset [13] is a subset of AudioSet [24] that contains 4,143 videos labeled with 28 event categories. Each video has a duration of 10 s, and at least 2 s are labeled with audio-visual events, such as baby crying and dog barking. Apart from the event category label for the entire video, each video is also temporally divided into 10 segments with audio-visual event boundaries, indicating whether each one-second segment is event-relevant or background. The labels are used purely for evaluation purposes in our experiments; no labels are used when training MS-VAE.

We mainly apply our proposed model in two downstream tasks: cross-modal localization (CML) [13] and audio-visual retrieval [3]. The CML task contains two subtasks, localizing the visual event boundary from audio signals (A2V) and vice versa (V2A). The evaluation accuracy is computed based on strict exact matching, meaning that a prediction is counted as correct only when the matched location is exactly the same as its ground truth. For the retrieval task, the mean reciprocal rank (MRR) is used to evaluate the performance. MRR calculates the average over queries of the reciprocal of the rank at which the first relevant item is retrieved. In addition to the concrete audio-visual tasks, we also include further qualitative ablation studies on the learned latent space to provide a more comprehensive analysis of the proposed MS-VAE model.
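As a concrete illustration of how the MRR metric behaves, the small self-contained function below computes it from ranked category labels; the function and example data are ours, for illustration only.

```python
def mean_reciprocal_rank(ranked_labels_per_query, query_labels):
    """MRR: average over queries of 1 / (rank of the first item whose label matches the query)."""
    total = 0.0
    for ranked, q_label in zip(ranked_labels_per_query, query_labels):
        rr = 0.0
        for rank, label in enumerate(ranked, start=1):
            if label == q_label:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(query_labels)

# Two queries: first relevant item at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([["dog", "cat", "cat"], ["car", "dog"]], ["cat", "car"]))
```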
Setting   Method            A2V ↑     V2A ↑   Average ↑
Spv.      DCCA [27]         34.1      34.8    34.5
          AVDLN [13]        35.6      44.8    40.2
          AVSDN [25]        37.1      45.1    40.9
          AVFB [26]         37.3      46.0    41.6
Unspv.    Ours              25.0 ± –  –       –
          Ours + W_latent   –         –       –

Table 1. Quantitative evaluations on the CML task in terms of the exact matching accuracy. Spv. means supervised and Unspv. means unsupervised.
Fig. 2. Example of qualitative results for the CML task (segment-level Bark vs. background predictions; A2V: mismatch for 1 segment, V2A: exact match).
Cross-modal localization, proposed in [13], aims to temporally locate a given segment of one modality (audio/visual) within the entire sequence of the other modality. The localization is performed in two directions, i.e., visual localization from given audio segments (A2V) and audio localization from given visual segments (V2A). This task especially emphasizes the correlations between visual and audio signals, since not only the correlations between the modalities but also the temporal correlations are needed to successfully fulfill the task [13, 14, 25, 26].

At inference time, for the A2V sub-task, we adopt the sliding-window strategy as in [13] and optimize the following objective:

$$t^* = \arg\min_t \sum_{s=1}^{l} D_{cml}(V_{s+t-1}, \hat{A}_s),$$

where $t^* \in \{1, \ldots, T - l + 1\}$ is the start time at which the audio and visual content synchronize, $T$ is the total length of a testing video sequence, and $l$ is the length of the audio query $\hat{A}_s$. The time position that minimizes the cumulative distance between the audio segments and the visual segments is chosen as the matching start position. For $D_{cml}$ in our experiments, we compute two terms, i.e., the Wasserstein latent distance $W_{latent}$ between the two latent variables encoded from the audio and visual segments, and the Euclidean distance $D_{gen}$ between the pairs generated from the audio and visual segments: $D_{cml} = W_{latent} + D_{gen}$. The V2A sub-task is handled analogously.

The quantitative results are shown in Table 1. We compare our method with other network models that learn the audio-visual correlations in a supervised manner. It is worth noting that our proposed self-supervised model achieves performance comparable to the supervised state-of-the-art methods. Figure 2 shows an example of a mismatch and an exact match for the cross-modal localization task. In the A2V sub-task, we are given an audio query accompanying an event, e.g., dog barking, and we want to localize its corresponding positions in the entire 10 s visual frame sequence; the V2A sub-task is analogous. Only the exact match is considered correct, which makes the task very challenging.
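The sliding-window search itself reduces to a cumulative-distance argmin. Below is a small self-contained sketch (0-indexed, with the pairwise distances $D_{cml}$ assumed to be precomputed); the array shapes are illustrative.

```python
import numpy as np

def a2v_localize(d_cml, l):
    """Sliding-window A2V localization.
    d_cml[i, j] = W_latent + D_gen between visual segment i and audio query segment j.
    Returns t* = argmin_t sum_{s=0}^{l-1} d_cml[t+s, s], with t in {0, ..., T-l}."""
    T = d_cml.shape[0]
    costs = [sum(d_cml[t + s, s] for s in range(l)) for t in range(T - l + 1)]
    return int(np.argmin(costs))

# Toy example: 10 one-second visual segments, a 2-second audio query.
d_cml = np.random.rand(10, 2)
print(a2v_localize(d_cml, l=2))  # matching start position (0-indexed)
```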
The cross-modal retrieval task, proposed in [3], seeks to find the relevant audio/visual segment given a single audio or visual query. As in [3], we randomly sample a single event-relevant pair of visual frame and audio signal from each video in the testing set of the AVE dataset to form the retrieval database. The retrieval task contains intra-modal (e.g., audio-to-audio) and cross-modal (e.g., audio-to-visual) categories. Each item in the test set is used as a query. In intra-modal retrieval, the item identical to the query is removed from the database to avoid bias. All labels are used only for evaluation, to calculate the MRR metric, which computes the average of the reciprocal of the rank of the first correctly retrieved item (i.e., an item with the same category label as the query).

We fine-tune L3-Net [28] and AVE-Net [3] on the training set of the AVE dataset. AVE-Net incorporates a Euclidean distance between two separate feature embeddings to enforce the alignment. Both models learn audio-visual embeddings in a self-supervised manner. We use the same inference procedures as presented in [28] and [3] for L3-Net and AVE-Net, respectively. In addition to these two unsupervised baselines, we also adopt supervised models to perform the retrieval tasks. As in the previous CML task, these models are trained to minimize the similarity distance between the two embeddings of a corresponding audio-visual pair. At inference time, we use a similar technique as in the cross-modal localization task, which is to retrieve the sample with the minimum distance score. For MS-VAE, we compare the sum of the Wasserstein latent distance and the reconstruction error between the given query and all the test samples from the retrieval database.

Setting   Method            A-A ↑   A-V ↑   V-A ↑   V-V ↑
Spv.      AVDLN [13]        0.34    0.17    0.20    0.26
          AVSDN [25]        0.38    0.19    0.21    0.25
          AVFB [26]         0.37    0.20    0.20    0.27
Unspv.    L3-Net [28]       0.13    0.14    0.13    0.14
          AVE-Net [3]       0.16    0.14    0.14    0.16
          Ours              0.24    0.14    0.15    0.27
          Ours + W_latent   –       –       –       –

Table 2. Cross-modal and intra-modal retrieval results in terms of MRR. The column headers denote the modalities of the query and the database; for example, A-V means retrieving visual frames from an audio query. Spv. and Unspv. denote supervised and unsupervised.
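For concreteness, the ranking step of this retrieval inference can be sketched as follows: each database item is scored by the sum of a latent-distance term and a precomputed reconstruction-error term, and items are ranked in ascending order. This is a minimal sketch with illustrative tensor names and shapes, not the exact evaluation code.

```python
import torch

def retrieve(z_query, z_db, rec_err):
    """Rank database items for one query (best match first).

    z_query: (d,) latent code of the query.
    z_db:    (N, d) latent codes of the database items.
    rec_err: (N,) reconstruction error between the pair generated from the query
             and the pair generated from each database item (assumed precomputed).
    Score_i = ||z_query - z_db[i]|| + rec_err[i]; lower is better.
    """
    scores = (z_db - z_query).norm(dim=-1) + rec_err
    return torch.argsort(scores)

# Toy usage with 4 database items and 8-dimensional latents.
z_q, z_db, rec_err = torch.randn(8), torch.randn(4, 8), torch.rand(4)
print(retrieve(z_q, z_db, rec_err))  # indices of database items, ranked
```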
Table 2 presents the quantitative comparison between the proposed MS-VAE and other methods. We achieve better performance in all four sub-tasks compared to the unsupervised baselines, and achieve competitive performance close to the supervised models. This experiment further demonstrates the effectiveness of leveraging audio-visual correlations in MS-VAE.

We perform additional ablation studies on the shared decoder and the Wasserstein constraint to show that both are significant factors contributing to the learned audio-visual correlations. Figure 3 shows the qualitative results of the latent space comparison among CM-VAE [29], our MS-VAE, and MS-VAE + W_latent in the form of latent space visualizations. Specifically, CM-VAE is proposed in [29] for cross-modal generation of hand poses (e.g., generating 3D hand poses from RGB images). CM-VAE adopts separate encoders and decoders for each modality, and can be considered a version of MS-VAE without the shared decoder. It is interesting to observe that in the latent space obtained by CM-VAE, the visual embedding degenerates into a purely non-informative Gaussian distribution, as in [18]. MS-VAE alleviates the degeneration problem and learns similar distributions for the audio and visual inputs, but the distance between the two latent spaces is still evident. The Wasserstein distance further bridges the two latent spaces and regularizes the learned data distributions. Note that perfect alignment is very difficult to achieve and remains an open question in the research community.

Fig. 3. Latent space comparison: t-SNE visualization of testing segments of corresponding audio-visual pairs.
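The qualitative comparison of Figure 3 can in principle be reproduced with standard tooling; the sketch below uses scikit-learn's t-SNE and assumes z_a_test and z_v_test are the latent codes of test segments encoded by a trained model (placeholder names).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latents(z_audio, z_visual):
    """Project audio and visual latent codes to 2D with t-SNE and overlay them
    to inspect how well the two latent spaces align."""
    z_all = np.concatenate([z_audio, z_visual], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(z_all)
    n = len(z_audio)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="audio")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="visual")
    plt.legend()
    plt.show()

# plot_latents(z_a_test, z_v_test)  # latents encoded by the trained model
```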
4. CONCLUSION
In this paper, we propose the MS-VAE framework to learn intrinsic audio-visual correlations for multiple downstream audio-visual tasks. MS-VAE is a self-supervised learning framework that generates a corresponding audio-visual data pair given either modality as input. It leverages the label-free advantage of self-supervised learning with generative models and achieves very competitive performance on multiple audio-visual tasks.
5. ACKNOWLEDGEMENTS
This research was partially supported by NSF NeTS-1909185, CSR-1908658 and Cisco. This article solely reflects the opinions and conclusions of its authors and not the funding agents. Yu Wu is supported by the Google PhD Fellowship.

6. REFERENCES

[1] Bruno Korbar, Du Tran, and Lorenzo Torresani, "Cooperative learning of audio and video models from self-supervised synchronization," in NeurIPS, 2018.
[2] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon, "Learning to localize sound source in visual scenes," in CVPR, 2018.
[3] Relja Arandjelovic and Andrew Zisserman, "Objects that sound," in ECCV, 2018.
[4] Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, and Chenliang Xu, "Audio-visual interpretable and controllable video captioning," in CVPR Workshop, 2019.
[5] Andrew Owens and Alexei A. Efros, "Audio-visual scene analysis with self-supervised multisensory features," in ECCV, 2018.
[6] Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, and Antonio Torralba, "Self-supervised audio-visual co-segmentation," in ICASSP. IEEE, 2019.
[7] Di Hu, Feiping Nie, and Xuelong Li, "Deep multimodal clustering for unsupervised audiovisual learning," in CVPR, 2019.
[8] Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba, "Music gesture for visual sound separation," in CVPR, 2020.
[9] Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, and Antonio Torralba, "Foley music: Learning to generate music from videos," in ECCV, 2020.
[10] Ruohan Gao, Rogerio Feris, and Kristen Grauman, "Learning to separate object sounds by watching unlabeled video," in ECCV, 2018.
[11] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba, "The sound of pixels," in ECCV, 2018.
[12] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, "SoundNet: Learning sound representations from unlabeled video," in NeurIPS, 2016.
[13] Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu, "Audio-visual event localization in unconstrained videos," in ECCV, 2018.
[14] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang, "Dual attention matching for audio-visual event localization," in ICCV, 2019.
[15] Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu, "Deep cross-modal audio-visual generation," in ACM Multimedia Workshop, 2017.
[16] Aakanksha Rana, Cagri Ozcinar, and Aljosa Smolic, "Towards generating ambisonics using audio-visual cue for virtual reality," in ICASSP. IEEE, 2019.
[17] Diederik P. Kingma and Max Welling, "Auto-encoding variational bayes," in ICLR, 2014.
[18] Huangjie Zheng, Jiangchao Yao, Ya Zhang, and Ivor W. Tsang, "Degeneration in VAE: in the light of Fisher information loss," arXiv preprint arXiv:1802.06677, 2018.
[19] Martin Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein generative adversarial networks," in ICML, 2017.
[20] Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush, "Latent alignment and variational attention," in NeurIPS, 2018.
[21] Yingtao Tian and Jesse Engel, "Latent translation: Crossing modalities by bridging generative models," arXiv preprint arXiv:1902.08261, 2019.
[22] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister, "Sliced and Radon Wasserstein barycenters of measures," Journal of Mathematical Imaging and Vision, vol. 51, no. 1, 2015.
[23] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf, "Wasserstein auto-encoders," in ICLR, 2017.
[24] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP, 2017.
[25] Yan-Bo Lin, Yu-Jhe Li, and Yu-Chiang Frank Wang, "Dual-modality seq2seq network for audio-visual event localization," in ICASSP, 2019.
[26] Janani Ramaswamy and Sukhendu Das, "See the sound, hear the pixels," in The IEEE Winter Conference on Applications of Computer Vision, 2020.
[27] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu, "Deep canonical correlation analysis," in ICML, 2013.
[28] Relja Arandjelovic and Andrew Zisserman, "Look, listen and learn," in ICCV, 2017.
[29] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges, "Cross-modal deep variational hand pose estimation," in CVPR, 2018.