A Multi-View Approach To Audio-Visual Speaker Verification
Leda Sarı, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf
Facebook AI, USA; Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA

This work was done when L. Sarı was an intern at Facebook.
ABSTRACT
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time. Specifically, we investigate unimodal and concatenation-based AV fusion and report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset using our best system. As these methods lack the ability to do cross-modal verification, we introduce a multi-view model which uses a shared classifier to map audio and video into the same space. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
Index Terms — Speaker verification, multi-view model, multi-modal systems, convolutional neural networks
1. INTRODUCTION
Speaker recognition and verification systems are conventionally based on the speech component, as speech is a medium that partially represents the identity of the speaker. However, in a noisy acoustic environment, it can become harder to distinguish different speakers based only on speech signals. In such cases, humans often rely on other signals for identity that are not affected by acoustic noise, such as facial features. Because of the complementary nature of audio and video, several audio-visual (AV) systems have been proposed for speaker identification [1, 2, 3].

Such AV identification systems vary depending on their fusion strategies and modeling approaches. As described in other multimodal studies, e.g. [4], fusion methods include early, mid-level and late fusion. Early fusion concatenates inputs and learns joint features of both modalities, mid-level fusion combines information after some independent processing of the two modalities, and late fusion mainly consists of score fusion from unimodal systems.
As for modeling approaches, earlier systems make use of probabilistic models such as dynamic Bayesian networks [1], whereas recent studies focus on neural network based modeling [3].

A common approach to speaker verification is to extract a speaker-representative embedding from the given utterance and compare a pair of embeddings using a distance measure to determine whether the given utterances belong to the same person. Earlier studies used i-vectors [5] as the speaker representation and probabilistic linear discriminant analysis (PLDA) scoring for verification. In recent studies, neural network based speaker embeddings, such as x-vectors [6], are used. These systems usually process the input speech with a network that generates a sequence of features for the utterance, which are then aggregated into a single vector representing the speaker embedding [7, 8]. These aggregation or summarization methods range from temporal pooling to cluster based approaches, which are also used in computer vision studies such as NetVLAD [9] and GhostVLAD [10, 7].

In this study, we first investigate AV speaker verification by learning AV speaker embeddings, which assumes that audio and video data are simultaneously available at test time. To our knowledge, we achieve the best reported AV speaker verification performance on the VoxCeleb1 [11] and VoxCeleb2 [12] datasets. These AV systems assume the availability of both modalities at test time. However, in many practical settings, one of the modalities is degraded or missing on one side of the verification pair. For instance, the speaker may be off-screen or have their camera disabled while they are actively speaking, in which case only the audio stream is usable for verification. The audio may be missing or corrupted in other scenarios. There may also be verification pairs where audio is the usable modality on one side and video on the other. Here we need to do cross-modal matching, i.e. verify whether a video and an audio signal represent the same person. Late fusion or mid-level fusion cannot handle such cases. Therefore, we propose a multi-view approach that allows us to perform verification when a pair does not have matching modalities available, either audio or video. The proposed multi-view approach achieves this by mapping audio and video into the same space using a shared classifier on top of the unimodal encoders.

2. RELATED WORK

There have been several studies on the VoxCeleb1 and VoxCeleb2 datasets, and several benchmarks have been published. The typical setup is to train the system on the VoxCeleb2-dev set and test on the VoxCeleb1-test verification pairs. These setups differ in network architecture and in the embedding aggregation steps. For example, in [12], a ResNet-50 architecture with time average pooling is used, which achieves 4.2% EER. In [7], a thin ResNet-34 architecture is used along with GhostVLAD pooling, resulting in 3.2% EER. In [8], a convolutional attention model is proposed for the time and frequency dimensions and GhostVLAD based aggregation is applied; the model achieves 2.0% EER. The lowest EER on the VoxCeleb1 test set is reported in [13], where the best system makes use of data augmentation and system combination. Although the VoxCeleb2 dataset comes with videos, there are only a few studies on audio-visual approaches for speaker verification.
In [3], pretrained face and voice embeddings are fused using a cross-modal attention mechanism on short speech segments (0.1s or 1s). In their tests on VoxCeleb2, they obtain an EER of 5.3%. They also analyze the performance in the case of a noisy or missing modality, and their performance degrades to 7.7% and 12.2% when voice and face embeddings are omitted, respectively. In [14], an audio-visual self-supervised approach is used to train a system that learns identity and context embeddings separately. As a comparison, they also report audio-only fully-supervised training results on the VoxCeleb1 test set, which achieve 7.3% EER. Since they use only 20% of the VoxCeleb2 dataset for training and there is not a standard set of verification pairs for VoxCeleb2, their AV results are not directly comparable to the previous study.

Cross-modal processing has recently been used in different combinations such as audio-video [15, 14, 16, 17] and speech-text [18]. The common approach in these studies is to map inputs from different modalities into a shared space to achieve cross-modal retrieval. For example, in [15], a contrastive loss is used to learn to map matching face and voice embeddings to the same space. In [16], same-different classification is performed on the cosine scores between face and voice embeddings to train the system. In [17], a novel loss function is proposed to learn the embeddings in a shared space; their loss function tries to preserve neighborhood constraints within and across modalities.
3. UNIMODAL AND MULTIMODAL MODELS
In the training stage, our verification models are optimized to learn speaker discriminative embeddings. The cosine similarity between embeddings coming from two videos is then used for verification at test time. In this section, we describe the unimodal and multimodal systems that allow us to generate these embeddings.

Our unimodal systems consist of an encoder (F) followed by a nonlinear classifier (C), as shown in Fig. 1a and 1b. Therefore, the final network output is represented by

y_i = C_i(F_i(x_i)), \quad i \in \{A, V\}    (1)

where the subscript i in x_i denotes the modality of the input. In order to achieve AV speaker verification using unimodal systems, we use late score fusion. In this fusion, we separately compute the cosine similarity using each unimodal system and then average the similarities to get the final verification scores.

For joint AV training, we investigate a naive mid-level fusion approach, shown in Fig. 1c. In this model, we have separate encoders for audio and video, whose outputs are concatenated along the feature dimension before being fed into a nonlinear classifier. The network output is represented as

y_{AV} = C_{AV}([F_A(x_A), F_V(x_V)]).    (2)

In this case, a single loss function is applied to the joint output y_{AV} during training. During verification, AV embeddings from two AV inputs are compared using cosine similarity.
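For concreteness, the sketch below gives a minimal PyTorch-style implementation of the mid-level fusion model in Eq. (2). The encoder modules, the 512-unit hidden layer and the plain classifier head are illustrative placeholders rather than the exact configuration used in our experiments (see Section 5).

```python
import torch
import torch.nn as nn


class MidLevelAVFusion(nn.Module):
    """Sketch of the mid-level fusion model in Eq. (2): concatenate the two
    encoder outputs along the feature dimension, then apply a joint classifier."""

    def __init__(self, f_a: nn.Module, f_v: nn.Module,
                 dim_a: int, dim_v: int, num_speakers: int):
        super().__init__()
        self.f_a = f_a                      # audio encoder F_A
        self.f_v = f_v                      # video encoder F_V
        self.c_av = nn.Sequential(          # nonlinear classifier C_AV
            nn.Linear(dim_a + dim_v, 512),
            nn.ReLU(),
            nn.Linear(512, num_speakers),
        )

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor):
        e_av = torch.cat([self.f_a(x_a), self.f_v(x_v)], dim=-1)  # joint AV embedding
        return self.c_av(e_av), e_av        # logits for training, embedding for scoring
```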
4. THE MULTI-VIEW MODEL
We propose a model that is trained to generate high-level representations for the audio and video modalities in a space shared across the two modalities. Such a system enables us to use the learned embeddings in a cross-modal testing scheme. We achieve this by using a shared classifier for the audio and video encoder outputs; hence, when optimized jointly, the encoder outputs are mapped to a shared space. We call this system a multi-view system since the classifier sees different views of the same input, i.e. the audio component and the visual component of the video input.

As shown in Fig. 1d, in the multi-view model we still have two separate encoders for audio and video (F_A and F_V). If we denote the multi-view classifier by C_M, then the network has two outputs, one for each modality:

y_{M,A} = C_M(F_A(x_A))    (3)
y_{M,V} = C_M(F_V(x_V)).   (4)

In this study, we jointly train the whole network with a multi-task objective. The total loss L_{M,AV} is calculated as

L_{M,AV} = \lambda_{M,A} L_{M,A} + \lambda_{M,V} L_{M,V}    (5)

where the unimodal losses L_{M,A} and L_{M,V} are computed based on y_{M,A} and y_{M,V}, respectively. Note that here we jointly optimize two encoders and a single shared classifier.
Fig. 1: Encoder and classifier structures of various models: (a) A-only unimodal system; (b) V-only unimodal system; (c) Mid-level AV fusion model; (d) The multi-view model.
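A minimal sketch of the multi-view model and the multi-task objective of Eqs. (3)-(5) is shown below. The shared classifier is reduced to a single linear layer and cross-entropy stands in for the arc-margin loss used in our experiments, so the code illustrates the structure rather than the exact training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewModel(nn.Module):
    """Sketch of Eqs. (3)-(4): two modality-specific encoders followed by a
    single classifier C_M that is shared between the audio and video branches."""

    def __init__(self, f_a: nn.Module, f_v: nn.Module,
                 shared_dim: int, num_speakers: int):
        super().__init__()
        self.f_a = f_a                                   # audio encoder F_A
        self.f_v = f_v                                   # video encoder F_V
        self.c_m = nn.Linear(shared_dim, num_speakers)   # shared classifier C_M

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor):
        y_m_a = self.c_m(self.f_a(x_a))   # Eq. (3)
        y_m_v = self.c_m(self.f_v(x_v))   # Eq. (4)
        return y_m_a, y_m_v


def multiview_loss(y_m_a, y_m_v, labels, lam_a=1.0, lam_v=1.0):
    """Eq. (5): weighted sum of the per-modality classification losses.
    Cross-entropy is a stand-in here; the actual systems are trained with the
    arc-margin loss (Section 5)."""
    return lam_a * F.cross_entropy(y_m_a, labels) + lam_v * F.cross_entropy(y_m_v, labels)
```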
5. EXPERIMENTS AND RESULTS
We perform our training on the VoxCeleb2 (VC2) dev set and test on both the VoxCeleb1 (VC1) and VC2 test sets. For the VC1 case, we use the verification pairs provided as part of the dataset. For VC2, we sample one positive and one negative video for each test set video, as there is not an official set of pairs provided with the dataset. The positive video is chosen uniformly at random from the utterances of the same speaker, and the negative video is sampled from a different speaker by first choosing a random speaker and then selecting a random utterance of that sampled negative speaker. To get the training and validation splits, we set aside one video (including several utterances) of each speaker for validation and use the rest for training. This gives us roughly 995k training utterances and 97k validation utterances with their corresponding visuals.
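The sketch below illustrates how such trial pairs can be generated; the (speaker_id, video_id) list format is an assumption made for illustration, not the actual layout of the VC2 metadata.

```python
import random
from collections import defaultdict


def sample_vc2_trials(test_videos, seed=0):
    """Build one positive and one negative trial per VC2 test video.
    `test_videos` is assumed to be a list of (speaker_id, video_id) tuples."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for spk, vid in test_videos:
        by_speaker[spk].append(vid)
    speakers = list(by_speaker)

    trials = []  # (video_a, video_b, label), label 1 = same speaker
    for spk, vid in test_videos:
        # positive: a uniformly random other video of the same speaker
        candidates = [v for v in by_speaker[spk] if v != vid]
        pos = rng.choice(candidates) if candidates else vid
        # negative: first pick a random other speaker, then one of their videos
        neg_spk = rng.choice([s for s in speakers if s != spk])
        neg = rng.choice(by_speaker[neg_spk])
        trials.append((vid, pos, 1))
        trials.append((vid, neg, 0))
    return trials
```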
We use 64-dimensional log-mel features to represent the audio. For the video component, we downsample the data to 2 frames per second and apply face detection to each frame using Detectron2 [19], then resize the face crops to 112x112 pixels. We skip the frames for which we do not have a face detection output. If we cannot detect any face in a video, we use a single zero frame to represent it.

In our unimodal and mid-level AV fusion systems, we make use of variants of convolutional neural networks (CNNs) for modeling the encoders. For the audio-only network and for the audio branch of the AV network, we use MobileNetV2 [20] with the following inverted residual block dimensions: [3,32,1,1], [4,32,1,1], [6,64,1,2], [4,64,1,1], [4,64,1,1], [6,128,1,2], [6,128,1,1], [4,128,1,1], [6,256,1,2], [4,256,1,1], [5,256,1,1], [4,256,1,1], based on the notation used in [20]. For the video-only network and for the video branch of the AV model, we use a ResNet architecture [21]. At the end of the encoders, we have a sequence of features on both branches. In order to summarize the utterance into a single vector, we pool using a self-attention mechanism on the audio branch and temporal pooling on the video branch. These result in 356-dimensional audio encodings and 2048-dimensional video encodings. In the multi-view system, in order to bring them to the same dimension, in our case 256, we apply a linear projection layer on both branches before feeding them into the classifier. The classifiers consist of a fully connected layer, followed by a ReLU nonlinearity, batch normalization [22], dropout [23], and a linear fully connected layer.

We train all the networks with the arc-margin loss [24]. We use a learning rate of 0.001, which is reduced by a factor of 0.95 when the loss plateaus; the batch size is 128.
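A sketch of the classifier head and the multi-view projection layers described above is given below; the hidden size and dropout probability are illustrative values, not the settings used in our experiments.

```python
import torch.nn as nn


def make_classifier(in_dim: int, hidden_dim: int, num_speakers: int,
                    p_drop: float = 0.5) -> nn.Sequential:
    """Classifier head as described above: FC -> ReLU -> BatchNorm -> Dropout -> linear FC.
    hidden_dim and p_drop are illustrative choices, not values reported in the paper."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.BatchNorm1d(hidden_dim),
        nn.Dropout(p_drop),
        nn.Linear(hidden_dim, num_speakers),
    )


# In the multi-view system, linear projections first map the encoder outputs
# (356-d audio, 2048-d video) into a common 256-d space before the shared classifier.
proj_a = nn.Linear(356, 256)
proj_v = nn.Linear(2048, 256)
```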
In Table 1, we present the EERs on the VC1 and VC2 datasets. The upper part of the table includes the audio-only (A-only) EERs of [8, 13] and the attention based fusion proposed in [3]. In the lower part of the table, the first two rows show the unimodal performance of our systems. In the A-only case, we achieve results comparable to the current best performance on VC1. Although [13] reports a lower EER, they make use of either heavy data augmentation or system fusion. The third row reports the EER for our mid-level fusion approach. Since it makes use of both modalities, its EER is lower than that of either unimodal system. We also experimented with score fusion by averaging the cosine similarity scores from various systems before making the verification decision. The late fusion of the unimodal systems achieves an even lower EER than the naive AV fusion, possibly because the separately optimized A-only and V-only systems learn to capture the best representation of their respective inputs, and their late fusion allows combining the best decisions from each modality. Furthermore, if we combine all three systems, we achieve the lowest EERs on both VC1 and VC2. Note that our VC2 EER is not directly comparable to the one reported in [3] as we are not using the same verification pairs. Still, we achieve the lowest EER that has been reported on the VC2 test set so far.

Model description            VC2 EER   VC1 EER
A-only of [13]               NA        1.0
A-only of [8]                NA        2.0
AV of [3]                    5.3       NA
Our unimodal A-only          3.5       2.2
Our unimodal V-only          3.4       3.9
Mid-level AV fusion          2.0       1.4
Score fusion unimodal A+V    1.7       0.9
Score fusion A+V+AV          –         0.7

Table 1: EER (%) of various models on the VC1 and VC2 test sets

On manually cross-checking the results of our system with the labels in the VC2 test set, we observe some interesting error cases. First, a small percentage (we estimate 1%) of the dataset is labelled incorrectly. Second, there are cases where a person consciously obfuscates their voice or face (by wearing makeup or deliberately changing their voice), leading to hard cases where there is a deliberate mismatch between this atypical sample and videos where the person is acting naturally.
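As an illustration, the following sketch computes the score fusion used for the fused rows of Table 1, i.e. the average of the cosine similarities produced by the individual systems for one trial; the function and argument names are chosen here for clarity and are not part of our released code.

```python
import torch
import torch.nn.functional as F


def cosine_score(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    """Verification score between two utterance-level embeddings."""
    return F.cosine_similarity(e1, e2, dim=-1)


def fused_score(system_embeddings):
    """Late (score) fusion: average the cosine similarities produced by the
    individual systems (e.g. A-only, V-only and mid-level AV) for one trial.
    `system_embeddings` is a list of (e1, e2) pairs, one per system."""
    scores = [cosine_score(e1, e2) for e1, e2 in system_embeddings]
    return torch.stack(scores).mean(dim=0)
```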
The multi-view model allows for greater flexibility during testing compared to the other models. With a multi-view model, in addition to unimodal testing, we can apply late fusion to the audio and video similarity scores, as well as average the audio and video embeddings to compute the similarity scores. In Table 2, we show the unimodal performance of the multi-view model as well as their score fusion. When we compare the A-only performance of the multi-view model with that of the unimodal model, we see some degradation in performance. However, the V-only performance of the multi-view model is comparable to the unimodal system; the differences are less than 0.2%. We think there are a couple of reasons for the reduction in the A-only performance: (a) the multi-view model requires the intermediate dimensions of the audio and video embeddings to be the same, which differ from those of our unimodal systems, and this reduces the total number of trainable parameters; (b) especially at the beginning of training, the statistics of the audio and video embeddings differ, so we had to remove the shared batch normalization layer from the classifier part of the network, which may also have affected the performance. On the other hand, when we look at the score fusion of the multi-view system, given in the last row of Table 2, we see that its EER is lower than that of either of the unimodal systems reported in Table 1 (A-only, V-only). This also shows that score fusion is a simple but effective mechanism to reduce the EER. Another observation is that it is harder to optimize both embeddings simultaneously than in the separate training case, which causes the EER difference between the A-only and V-only test cases compared to Table 1.

Test condition                 VC2 EER   VC1 EER
Multi-view A-only              7.2       6.1
Multi-view V-only              3.5       3.7
Multi-view Score fusion A+V    2.4       1.8

Table 2: Audio-only, video-only and score fusion results from the multi-view system on both VC1 and VC2

As described in Section 4, the main goal of the multi-view model is to enable cross-modal testing. We simulate this A vs. V testing condition by dropping the audio modality from one side and the video modality from the other side of the verification pair. Another critical point is that both the VC2 and VC1 test speakers are unheard and unseen during training. This makes the problem challenging as we try to match the face of a previously unseen person to the voice of a previously unheard person. It has been shown that even human performance on this task is low (more than 20% error [25]). Since the A vs. V cross-modal verification setting is the most difficult situation, it has a higher EER compared to Table 2, but it is still better than the 50% chance level. Table 3 shows the cross-modal EER of our system and other published systems on the VC1 and VC2 test sets. Here we observe that our system's performance is comparable to that of previously published systems. However, we cannot claim that our system is better or worse, as the other works do not use the VoxCeleb dev/test splits in the same manner.

Test pairs          VC2 EER   VC1 EER
A vs V of [15]      NA        29.5
A vs V of [17]      NA        29.6
A vs V of [16]      22.5      NA
A vs V (ours)       29.5      28.0

Table 3: Cross-modal verification EER on the VC1 and VC2 test sets from previous studies and the proposed multi-view model

We also performed A vs. AV and V vs. AV verification tests, which lack one modality on only one side of the pair. Our experiments show that in such scenarios, it is better to use the matched data (A vs. A, or V vs. V) rather than fusing the audio and video embeddings linearly, i.e. taking the average of the audio and video embeddings. This is probably because the shared space is not a linear space and does not necessarily cover the linear combination of two embeddings.
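For clarity, the sketch below shows how an A vs. V trial can be scored with the multi-view model: the audio of one side and the video of the other are projected into the shared space and compared with cosine similarity. The encoder and projection arguments are placeholders for the modules described in Section 5.

```python
import torch
import torch.nn.functional as F


def cross_modal_score(f_a, proj_a, f_v, proj_v,
                      x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
    """A vs. V trial with the multi-view model: embed the audio from one side and
    the video from the other into the shared 256-d space and compare them with
    cosine similarity. proj_a / proj_v denote the linear projection layers."""
    e_a = proj_a(f_a(x_a))   # audio-side embedding in the shared space
    e_v = proj_v(f_v(x_v))   # video-side embedding in the shared space
    return F.cosine_similarity(e_a, e_v, dim=-1)
```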
6. CONCLUSIONS
In this work, we first investigated AV speaker verification on the VoxCeleb datasets. We learned AV embeddings from the VC2 dataset and then applied cosine similarity based verification on both the VC2 and VC1 test sets. We showed that with score fusion of the unimodal and mid-level AV fusion models, we achieve the lowest EER reported on the VC1 test set in the AV testing condition. We also proposed a multi-view system that maps audio and video to a shared space and enables the cross-modal verification scenario of real verification systems.

7. REFERENCES

[1] Zhiyong Wu, Lianhong Cai, and Helen Meng, "Multi-level fusion of audio and visual features for speaker identification," in International Conference on Biometrics. Springer, 2006, pp. 493–499.

[2] Andrew Senior, Chalapathy V. Neti, and Benoit Maison, "On the use of visual information for improving audio-based speaker recognition," in AVSP - International Conference on Auditory-Visual Speech Processing, 1999.

[3] Suwon Shon, Tae-Hyun Oh, and James Glass, "Noise-tolerant audio-visual online person verification using an attention-based neural network fusion," in Proc. IEEE ICASSP. IEEE, 2019, pp. 3995–3999.

[4] Aggelos K. Katsaggelos, Sara Bahaadini, and Rafael Molina, "Audiovisual fusion: Challenges and new approaches," Proceedings of the IEEE, vol. 103, no. 9, pp. 1635–1653, 2015.

[5] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.

[6] David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999–1003.

[7] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in Proc. IEEE ICASSP. IEEE, 2019, pp. 5791–5795.

[8] Sarthak Yadav and Atul Rai, "Frequency and temporal convolutional attention for text-independent speaker recognition," in Proc. IEEE ICASSP. IEEE, 2020, pp. 6794–6798.

[9] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. IEEE CVPR, 2016, pp. 5297–5307.

[10] Yujie Zhong, Relja Arandjelović, and Andrew Zisserman, "GhostVLAD for set-based face recognition," in Asian Conference on Computer Vision. Springer, 2018, pp. 35–50.

[11] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.

[12] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. ISCA Interspeech, 2018, pp. 1086–1090.

[13] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.

[14] A. Nagrani, J. S. Chung, S. Albanie, and A. Zisserman, "Disentangled speech embeddings using cross-modal self-supervision," in Proc. IEEE ICASSP, 2020, pp. 6829–6833.

[15] Arsha Nagrani, Samuel Albanie, and Andrew Zisserman, "Learnable PINs: Cross-modal embeddings for person identity," in Proc. ECCV, 2018, pp. 71–88.

[16] Ruijie Tao, Rohan Kumar Das, and Haizhou Li, "Audio-visual speaker recognition with a cross-modal discriminative network," arXiv preprint arXiv:2008.03894, 2020.

[17] Shah Nawaz, Muhammad Kamran Janjua, Ignazio Gallo, Arif Mahmood, and Alessandro Calefati, "Deep latent space learning for cross-modal mapping of audio and visual signals," in Digital Image Computing: Techniques and Applications (DICTA). IEEE, 2019, pp. 1–7.

[18] Leda Sarı, Samuel Thomas, and Mark Hasegawa-Johnson, "Training spoken language understanding systems with non-parallel speech and text," in Proc. IEEE ICASSP. IEEE, 2020, pp. 8109–8113.

[19] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.

[20] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE CVPR, 2018, pp. 4510–4520.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, 2016, pp. 770–778.

[22] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[24] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. IEEE CVPR, 2019, pp. 4690–4699.

[25] Harriet M. J. Smith, Andrew K. Dunn, Thom Baguley, and Paula C. Stacey, "Matching novel face and voice identity using static and dynamic facial images,"