An Attention-Based Speaker Naming Method for Online Adaptation in Non-Fixed Scenarios

Jungwoo Pyo, Joohyun Lee, Youngjune Park, Tien-Cuong Bui, Sang Kyun Cha
Seoul National University, Seoul, Korea
{wjddn1801, wngusdlekd, dudwns930, cuongbt91, chask}@snu.ac.kr

Abstract
The speaker naming task, which finds and identifies the active speaker in a given movie or drama scene, is crucial for high-level video analysis applications such as automatic subtitle labeling and video summarization. Modern approaches usually exploit biometric features with a gradient-based method instead of rule-based algorithms. In certain situations, however, a naive gradient-based method does not work efficiently. For example, when new characters are added to the target identification list, the neural network must be retrained frequently to identify the new people, which delays model preparation. In this paper, we present an attention-based method that reduces the model setup time by incorporating newly added data via online adaptation, without a gradient update process. We comparatively analyzed the attention-based method and existing gradient-based methods with three evaluation metrics (accuracy, memory usage, setup time) under various controlled speaker naming settings. We also applied existing speaker naming models and the attention-based model to real video, showing that our approach achieves accuracy comparable to existing state-of-the-art models, and even higher accuracy in some cases.
Introduction
Biometric recognition plays an important role in advanced authentication systems. It identifies individuals based on physical or behavioral characteristics. The speaker naming task, which is to identify visible speaking characters in multimedia videos, consists of multiple types of biometric recognition. Most speaker naming methods distinguish an active speaker based on biometric features such as face images or voice. This task has proven essential for high-level video analysis problems such as summarization (Takenaka et al. 2012), interaction analysis (Liu, Jiang, and Huang 2008), and semantic indexing (Zhang et al. 2013). In particular, identifying active speaking characters for automatic subtitling can help deaf audiences enjoy videos without difficulty in understanding the context.
Most existing speaker naming models mainly focus on boosting the accuracy of finding the active speaker within a fixed list of characters. Gradient-based methods are considered one of the right solutions for higher accuracy, and such methods have been proposed in various ways using multiple modalities for speaker naming. (Hu et al. 2015) proposes a deep multimodal model based on a CNN architecture that extracts facial and acoustic features from videos, then combines them through a fusion function. Correspondingly, a multimodal Long Short-Term Memory (LSTM) architecture (Ren et al. 2016) merges visual and auditory modalities from the beginning of each input sequence. (Bredin and Gelly 2016) improves the performance of talking-face detection by capturing lip motion.

However, expected or fixed situations do not always occur in a video. In the real world, several uncertain situations make it difficult to identify the active speaker, such as the appearance of new characters or misinterpretation due to a lack of labeled training data. Most gradient-based identification approaches cannot immediately adapt to a change in the predicted character list or incorporate newly added data into the model. Traditionally, these methods have to predefine the set of targeted characters before the training period. In particular, an existing model has to be retrained from scratch with a new set of targeted characters consisting of both the original classes and the new ones, which makes rebuilding the model time-consuming. Transfer learning (Yosinski et al. 2014) and domain adaptation (Ganin and Lempitsky 2015) have proven efficient for faster adaptation of neural networks through initialization based on the original data. Nonetheless, these methods still require considerable time in the training phase to adapt the newly added data to the original model. Besides, real-world datasets rarely contain a sufficient amount of labeled data, since labeling is costly. The availability of labeled data poses a major practical issue for many gradient-based models.

To overcome these problems, we apply an attention module with few-shot learning (Fei-Fei, Fergus, and Perona 2006) to make our identification model flexible enough to accommodate changes at run-time. The attention module, which is based on the scaled dot-product attention structure (Vaswani et al. 2017), represents the similarities between prior knowledge embeddings and features extracted from the target video. The prior knowledge embeddings are the given data, consisting of facial and vocal embeddings of the predicted classes from the training dataset. Few-shot learning is then used to deal with the scarcity of labeled data and imbalanced class distributions. The attention mechanism and few-shot learning combine effectively in our model since they are both linear and straightforward.

Figure 1: (a) The speaker naming task covers two situations: i) finding the face that matches the corresponding voice if the speaker appears in the scene, and ii) picking out all distractors if the speaker is out of the scene. (b) Visualization of predicting the ID of a face-voice pair with the few-shot learning based attention module. The predicted ID of the target can be inferred from a linear combination of the cosine similarities between every prior knowledge embedding and the target embedding. It also illustrates how simply data of a newly added ID can be inserted into the attention module.
The essential component of the few-shot learning method derives feature embeddings based on a distance function. The attention mechanism consists of a linear combination with scaling and a softmax operation among these feature embeddings, as shown in Figure 1(b). This combination makes the model consider every single prior knowledge embedding carefully. Therefore, our method works well even with a small amount of data or a highly imbalanced class distribution. More importantly, our model only utilizes pretrained neural networks to extract embeddings. It does not involve a backpropagation process, unlike other gradient-based models. Consequently, the setup time is significantly decreased by updating the new information on the attention module at run-time.

However, our proposed method is not always the optimal solution for every situation. When character changes are infrequent or there are many IDs to identify, a deep-learning approach that guarantees robust performance may be more suitable even if the model setup takes a long time. Consequently, we compared the attention-based method with gradient-based methods under various speaker naming conditions by adjusting two variables: the number of target IDs to be identified, and the number of shots per character. Furthermore, we compared our proposed model with existing speaker naming models on real video.

Our contributions are summarized as follows:
• We propose a non-gradient-based method using an attention module with few-shot learning, which can efficiently deal with the scarcity of labeled data as well as imbalanced class distributions.
• Our model significantly reduces the model setup time by removing the gradient descent process and updating new data online.
• Under various environments, adjusting both the number of target IDs to be identified and the number of shots per character, we conducted comparative analyses on a real-world dataset between our proposed method and existing gradient-based methods using three metrics: accuracy, memory usage, and setup time.
• Our model shows accuracy comparable to state-of-the-art speaker naming models on real video.
Related Work
Speaker Naming
Speaker naming is the task of identifying the speaker in a video source. Recent studies on automatic speaker naming used deep neural networks to obtain each speaker's name from multimodal face-voice sources. (Hu et al. 2015) proposed a convolutional neural network (CNN) based multimodal framework that automatically learns from combined face and acoustic features. They trained an SVM classifier to reject all non-matching face-voice pairs and obtain identification results. Likewise, (Ren et al. 2016) improved the accuracy by changing the CNN-based model to a Long Short-Term Memory (LSTM) based model, which gave identification results robust to face distortion. (Liu et al. 2019) used an attention architecture to accommodate face variations.
Feature Extractors for Face and Audio Cues
The primary purpose of feature extractors is to express a particular type of data as distilled numerical embeddings with a lower dimension than the original data. Several feature extractors have been studied in each field according to the various types of data. Most feature extractors operate by setting an appropriate loss function and distance metric, then optimizing them.

Figure 2: Overall architecture of the attention-based model for speaker naming.
Face
Various types of loss functions have been tried for facial feature extraction. (Sun et al. 2014; Wen et al. 2016) used cross-entropy loss to minimize Euclidean distance. FaceNet (Schroff, Kalenichenko, and Philbin 2015) introduced a triplet loss based on Euclidean distance to train the face feature extractor; it also utilized MTCNN (Zhang et al. 2016) to extract aligned, cropped face images from raw image datasets. SphereFace (Liu et al. 2017), CosFace (Wang et al. 2018), and ArcFace (Deng et al. 2019) used angular losses based on cosine similarity.
Audio
Feature extraction for audio data has also been studied in various directions. Useful methods include MFCC features (Muda, Begam, and Elamvazuthi 2010) and CNNs (Hershey et al. 2017). Recently, (Xie et al. 2019) suggested a new model using a "thinResNet" trunk architecture and a dictionary-based NetVLAD layer. This method successfully performed speaker identification on audio data with varying lengths and mixed unrelated signals.
Attention Mechanism
The attention mechanism was first proposed in (Bahdanau, Cho, and Bengio 2014) for neural machine translation (NMT). The attention mechanism looks over all input elements (e.g., sequential inputs such as frames in a video or words in a sentence) at every decoding step and calculates an attention map, a matrix that reflects the relevance between the present input and previous input elements. The attention map is a probability matrix describing which source word each target word is aligned to, or translated from; each of its elements is computed as a softmax value representing the similarity of a source word and a target word.

Some papers have brought the attention mechanism to the speaker naming task. (Liu et al. 2019) proposed an attention-guided deep audio-face fusion approach to detect the active speaker. They also used individual network models to convert face and voice sources to embeddings, as we do. Before fusing the face and voice embeddings, they applied the attention module only to face embeddings to consider the relationships among face embeddings. In contrast, our work applies the attention mechanism to the fusion of face-voice pair embeddings and focuses on the relevance between target embeddings and prior knowledge embeddings.
Methodology
Speaker naming contains all processes from detecting faces and recognizing voices to matching these embeddings and identifying the current speaker. As shown in Figure 1(a), we regard the speaker naming problem as two cases. The first case is to find the pair embeddings whose face and voice embeddings are identified as the same ID (a "matched-pair"). The second is to pick out the pair embeddings whose face and voice IDs do not match (a "non-matched-pair"). We propose a non-gradient-based method using attention networks with few-shot learning to solve this problem. In this section, we formulate our problem precisely and elaborate on our proposed model.
Problem Formulation
We formulate our problem as follows. Let $t$ be the index of a time window and $I = \{i_1, i_2, ..., i_N\}$ denote the set of character IDs. $J_t$ is the number of faces captured in $t$, and $f_j^t$ is the $j$-th face embedding cropped in time window $t$. Likewise, $v^t$ represents the voice embedding in time window $t$. The maximum probability that a facial embedding in time window $t$ has ID $i_k$ is:

$F_{prob}(i_k, t) = \max_{1 \leq j \leq J_t} p(i_k \mid f_j^t)$   (1)

By multiplying $F_{prob}(i_k, t)$ with the probability that the predicted ID of the voice embedding in $t$ is $i_k$, we can infer the ID of the speaker in $t$:

$Spk\_ID(t) = \mathrm{argmax}_{i_k \in I} \left( F_{prob}(i_k, t) \cdot p(i_k \mid v^t) \right)$   (2)

Based on Equation (2), the speaker naming model is counted as correct for time window $t$ if it correctly estimates the ID of a matched-pair or picks out a non-matched-pair. Finally, we aggregate $Spk\_ID(t)$ over all time windows to obtain the total accuracy on the target video.
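As a concrete reading of Equations (1) and (2), the following minimal NumPy sketch computes $Spk\_ID(t)$ for one time window; the probability arrays are illustrative stand-ins, not values from the paper:

    import numpy as np

    def speaker_id(face_probs, voice_probs):
        # face_probs[j, k] = p(i_k | f_j^t) for the J_t faces in window t,
        # voice_probs[k]  = p(i_k | v^t) for the same window.
        F = face_probs.max(axis=0)      # Eq. (1): best face evidence per ID
        scores = F * voice_probs        # Eq. (2): fuse face and voice evidence
        return scores.argmax(), scores.max()

    # Toy usage with 3 faces and 4 candidate IDs:
    face_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.2, 0.5, 0.2, 0.1],
                           [0.1, 0.1, 0.2, 0.6]])
    voice_probs = np.array([0.6, 0.2, 0.1, 0.1])
    print(speaker_id(face_probs, voice_probs))  # -> (0, ~0.42)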
Attention-Based Method for Speaker Naming

The speaker naming problem consists of two parts: finding matched face-voice pairs to predict the current speaker, and picking out the non-matched-pairs. Our approach is as follows. First, we capture face images and voice chunks in every fixed-size time window. Then, we convert the face images and voice chunks into embeddings with pre-trained face and audio feature extractors. We concatenate the face and voice embeddings to make candidate pair embeddings for each frame. We then calculate the attention map with these concatenated embeddings: the attention map applies scaling and a softmax function to the cosine similarity matrix between all of the characters' prior knowledge embeddings and the extracted target embeddings. We predict the IDs of the target embeddings based on the attention mechanism. The proposed method then aggregates the prediction results for each time window and determines the active speaker in the scene. Finally, we measure the prediction accuracy of the model by aggregating the results of all time windows. We describe our method's overall architecture and flow in Figure 2.
Feature Extraction.
To generate embeddings that contain the features of facial appearance and voice, we use pre-trained feature extractors, which convert raw input sources to numerical vectors with reduced dimensions. Our network uses FaceNet as the facial feature extractor and NetVLAD as the voice feature extractor. The weights of these extractors are fixed while updating the attention module and during the end-to-end inference phase.
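As a minimal sketch of this stage: the snippet below wires up such fixed extractors, assuming the facenet-pytorch package for the face side (the paper does not name an implementation), while load_netvlad_extractor is a hypothetical stand-in for a pretrained thinResNet/NetVLAD speaker model:

    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN, InceptionResnetV1

    mtcnn = MTCNN(image_size=160)                              # face detector/aligner
    facenet = InceptionResnetV1(pretrained='vggface2').eval()  # frozen face extractor

    def face_embedding(image_path):
        face = mtcnn(Image.open(image_path))      # aligned crop, or None if no face
        if face is None:
            return None
        with torch.no_grad():                     # weights stay fixed, no gradients
            emb = facenet(face.unsqueeze(0))[0]   # 512-dimensional embedding
        return emb / emb.norm()                   # unit-normalize for cosine similarity

    # Voice side (hypothetical loader; any fixed-length utterance embedder fits):
    # netvlad = load_netvlad_extractor('netvlad_weights.pt')
    # voice_emb = netvlad(audio_chunk)            # 512-dimensional voice embedding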
Attention Module with Few-Shot Learning.
Our attention module with few-shot learning consists of multiple components. Let $Q$ denote the query matrix, the extracted face-voice pair embeddings from the target video. $Q$ contains several matched-pairs and non-matched-pairs, which we predict within a certain time window; thus, $Q$ varies with the time window $t$. $K$ and $V$ belong to the prior knowledge of our network: $K$ denotes the matrix of face-voice pair embeddings extracted from the training data, and $V$ is a one-hot vector matrix of the IDs corresponding to $K$. This $KV$ set serves as evidence both for deciding whether pair embeddings are matched-pairs and for classifying each pair's ID.

Figure 3: Mechanism of the attention module with few-shot learning.

The detailed process of the attention mechanism is shown in Figure 3. The intuitive role of the attention module is to consider the correlations between every pair of $Q$ and $K$. In our case, computing the attention map and the context vectors in the attention module corresponds to computing similarities and the matrix of predicted IDs, respectively. As a distance metric, we use cosine distance; because the embeddings in $Q$ and $K$ are unit vectors, we can obtain cosine similarity with an inner product. Before performing the matrix multiplication of $Q$ and $K$, we transpose $Q$ to $Q^T$ to match the dimensions. Our method performs a few additional operations after the matrix multiplication of $Q^T$ and $K$. First, we multiply $Q^T K$ by the scale factor $sf$; then we apply the softmax function to all elements. Our network sets $sf$ to $\sqrt{d_K}$, where $d_K$, the dimension of $K$, is 1024. The reason for multiplying by $sf$ is that the inner products of unit vectors are so small that they interfere with the subsequent softmax operation: if the scale of the softmax input is too small or too large, the softmax cannot express an appropriate probability distribution. Our setting keeps the values at the proper scale for the softmax. Based on the above, the attention map is written as:

$A = \mathrm{softmax}(sf \cdot Q^T K)$   (3)

The context vectors representing the ID predictions for $Q$ are:

$C = V A^T$   (4)

$C$ represents the probability that each face-voice pair in $Q$ is regarded as a particular ID; the face and voice probabilities are separated in $C$. Our method uses the confidence score vector $c_p$ as the criterion for deciding whether the $p$-th embedding in $Q$ is the active speaker. We apply the Hadamard product (Horn 1990), which multiplies the face part and voice part of $C$ element-wise, to consider both face and voice features. This operation yields a $1 \times N$ vector of confidence scores, where $N$ is the number of IDs. The maximum value of $c_p$ and its index are regarded as the confidence score and $Spk\_ID(t)$, respectively. We elaborate on the overall flow in Algorithm 1.
Algorithm 1 End-to-End Speaker Naming Prediction

Let Q: query, K: key, V: value, A: attention map
I ← {i_1, i_2, ..., i_N}: a set of characters' IDs
Cut video into 0.5 s time windows
for each time window t ← 1 to T do
    Rep_t: representative frame in t
    {f_1, ..., f_{J_t}}: J_t faces cropped from Rep_t
    {q_{f_1}, ..., q_{f_{J_t}}}: facial embeddings from {f_1, ..., f_{J_t}}
    q_v: voice embedding extracted from the audio in t
    Q ← ( q_{f_1}  q_{f_2}  ...  q_{f_{J_t}} ; q_v  q_v  ...  q_v )
    A ← softmax(sf (Q^T K))
    C ← V A^T
    max_conf ← 0
    for p ← 1 to J_t do
        c_p ← c_{f_p} ⊙ c_{v_p}    ▷ Hadamard product
        max_conf ← max(max_conf, max(c_p))
        if max_conf == max(c_p) then
            Spk_ID(t) ← argmax_{i ∈ I}(c_p)
        end if
    end for
end for
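The following is a minimal NumPy sketch of Equations (3) and (4) and the fusion step of Algorithm 1 on random toy embeddings. One point the text leaves implicit is how the face and voice probabilities come out "separated" in C; the sketch assumes, as one plausible reading, that the face and voice halves of each pair are attended over the prior keys separately and then fused with the Hadamard product:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attend(Q, K, V, sf):
        # Q: d x J unit-norm queries, K: d x M unit-norm prior keys,
        # V: N x M one-hot ID labels of the keys.
        A = softmax(sf * (Q.T @ K), axis=-1)  # J x M attention map, Eq. (3)
        return V @ A.T                        # N x J ID probabilities, Eq. (4)

    rng = np.random.default_rng(0)
    d, N, shots, J = 512, 5, 3, 4             # dim, IDs, shots per ID, faces in window
    M = N * shots

    # Prior knowledge: face/voice keys and their one-hot IDs (random stand-ins).
    Kf = rng.normal(size=(d, M)); Kf /= np.linalg.norm(Kf, axis=0)
    Kv = rng.normal(size=(d, M)); Kv /= np.linalg.norm(Kv, axis=0)
    V = np.eye(N)[:, np.repeat(np.arange(N), shots)]

    # Queries: J face embeddings plus the window's voice embedding repeated J times.
    Qf = rng.normal(size=(d, J)); Qf /= np.linalg.norm(Qf, axis=0)
    qv = rng.normal(size=d); qv /= np.linalg.norm(qv)
    Qv = np.tile(qv[:, None], (1, J))

    sf = np.sqrt(2 * d)                       # scale factor sqrt(d_K), d_K = 1024
    C = attend(Qf, Kf, V, sf) * attend(Qv, Kv, V, sf)  # Hadamard-fused confidences

    p = C.max(axis=0).argmax()                # face index with the best confidence
    print("Spk_ID:", C[:, p].argmax(), "confidence:", C[:, p].max())

Under this formulation, adding a new character amounts to appending its embeddings as new key columns and extending V, with no gradient step.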
Experiments

Dataset Overview
In the experiments, we used two public datasets: utterance videos of celebrities (VoxCeleb2 (Chung, Nagrani, and Zisserman 2018)) and a TV show (The Big Bang Theory (BBT)). For the experiments in incremental settings, we randomly chose 500 people from VoxCeleb2 who each have more than 10 videos, and split the train and validation sets 5:2 per ID. For BBT, we selected 5 episodes (S01E02, S01E03, S01E04, S01E05, S01E06). Each episode consists of the whole video, face images with various poses and illumination, and an aggregated voice file without silence.
Data Preprocessing
We used FaceNet and NetVLAD, the same extractors used in our model, to extract train and test embeddings from the raw datasets. The BBT dataset consists of multiple cropped face images and merged voice files per ID for each episode. The cropped face images were resized to fit the input size of the FaceNet model, and we used pre-trained FaceNet to convert the resized face images into 512-dimensional embeddings. Similarly, we converted the audio files into 512-dimensional voice embeddings; the window size of each audio chunk is 2 s, cut with a 0.1 s stride. We did additional preprocessing to obtain cropped face images and voice chunks for VoxCeleb2, because it consists only of video files. First, we extracted frames from the videos at 30 frames per second. Then, MTCNN (Zhang et al. 2016) cropped face images from all frames, and captured images that were not actual faces were removed. For the voice files, we applied the same settings used to preprocess BBT.
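For instance, the 2 s / 0.1 s audio chunking described above can be sketched as follows (the sample rate and the silent waveform are illustrative assumptions):

    import numpy as np

    def chunk_audio(waveform, sr, win_s=2.0, stride_s=0.1):
        # Slice a mono waveform into overlapping 2 s windows with a 0.1 s stride.
        win, stride = int(win_s * sr), int(stride_s * sr)
        return [waveform[s:s + win] for s in range(0, len(waveform) - win + 1, stride)]

    sr = 16000                                   # assumed sample rate
    chunks = chunk_audio(np.zeros(10 * sr), sr)  # 10 s of audio
    print(len(chunks), len(chunks[0]) / sr)      # 81 chunks of 2.0 s each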
Comparative Analysis among Speaker Naming Methods under Various Settings
Previous studies (Hu et al. 2015; Ren et al. 2016; Liu et al. 2019) have evaluated the accuracy of their methods in a refined setting, with purified voice and a small, fixed number of characters (5-6 IDs) in the scene, and with sufficient pair embeddings per character for training. In this experiment, we compare our speaker naming model with existing gradient-based methods in detail under more varied environments. By considering the advent of new characters in the story, we can precisely evaluate the performance of speaker naming methods in a more realistic situation with the VoxCeleb2 dataset.
Evaluation Metric
Speaker naming is to find the matched-pairs of face and voice embeddings and predict their identity. To compare how well a speaker naming method can identify the ID of a matched-pair, we define matching pair accuracy (mpA) as:

$mpA = \frac{N_{id_{pred} = id_{gt}}}{N_{total}} \times 100$   (5)

The second metric is the number of parameters of the speaker naming model loaded in memory. If the model is a neural network, the weights and biases belong to these parameters; in the case of the attention-based model, the pair embeddings are counted as parameters. We convert these parameters into kilobytes (KB) for comparison.

The third metric is the setup time of the model. For a neural network, setup time covers loading the data into memory, calculating the gradients, and updating the weights. For the attention-based model, setup time covers loading the prior knowledge embeddings and computing the attention module to derive prediction results.
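For clarity, Equation (5) amounts to the following (the labels are toy values):

    def matching_pair_accuracy(pred_ids, gt_ids):
        # mpA, Eq. (5): percentage of time windows whose matched face-voice
        # pair is assigned the correct identity.
        correct = sum(p == g for p, g in zip(pred_ids, gt_ids))
        return 100.0 * correct / len(gt_ids)

    print(matching_pair_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 75.0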
Experimental Setup

We conducted the experiment by adjusting two main variables: the number of target IDs for prediction, and the number of shots (face-voice pair embeddings) of prior knowledge per target ID. For the number of target IDs, we separated the situations into two parts: a small or a large number of target IDs, set from 5 to 50 in increments of five and from 50 to 500 in increments of fifty, respectively. We also set the number of shots per character to 5 (small) or 50 (large) in both situations to consider the effect of the amount of labeled training data on the performance of the speaker naming methods.

As baselines, we selected two representative gradient-based methods to compare with our Attention-based (Att-based) method. The first is Training from Scratch (TfS), which trains the neural network with both the original and the new data; most deep neural networks normally use TfS in the training phase. The second is Learning without Forgetting (LwF) (Li and Hoiem 2017), which generates a new branch on top of the network and trains it with only the new data. We followed the same neural network structure as previous work (Hu et al. 2015) for both methods for a fair comparison. The maximum training epoch is 500, which is sufficient for the loss function to converge; if a network reached the optimal cost before the maximum epoch, we took the accuracy and the setup time at the moment the optimal cost was reached. Transfer learning was applied at every stage of all gradient-based methods whenever the number of IDs increased.

Figure 4: Comparative analyses between speaker naming methods under various settings. Three metrics (mpA, number of parameters, setup time) are measured for each situation where the number of target IDs and the number of shots per character are varied; the y-axes for parameters and setup time are in logarithmic scale.
Results
As shown in Figure 4 and Table 1, we conducted both quantitative and qualitative analyses based on the experimental results. Most notably, our method (Att-based) reduced the model setup time by roughly tens to hundreds of times compared to the gradient-based methods, regardless of conditions. The mpA was highest for TfS, followed by Att-based and LwF. However, when the number of target IDs reached 450 with large shots, LwF gradually surpassed Att-based, as shown in the "Large IDs - mpA" graph in Figure 4. In general, gradient-based methods showed a big difference in mpA depending on the number of shots; in contrast, Att-based worked well in both situations and was less affected by the number of shots.

Att-based utilized a small number of parameters when the number of target IDs or shots was small. However, as the number of target IDs increased with large (50) shots, Att-based showed memory inefficiency, because its number of parameters grows with the product of the number of IDs and the number of shots per ID. In contrast, TfS occupies a constant number of parameters, related only to the structure of the neural network. LwF's parameter count is proportional to the number of times IDs are added: because LwF has a multi-branch structure, a new branch is generated whenever a new target character comes in.

To sum up, Att-based is the most appropriate method when new people appear frequently and the shots per character are not sufficient. Att-based also works effectively where an immediate update for hard-to-recognize data, such as various facial poses, is needed. Overall, TfS is best suited to situations where new people are not frequently added and high accuracy is required. LwF sits in the middle of the other two methods: it sets up faster than TfS but compromises mpA and memory usage.
Speaker Naming Accuracy for Real Video
In this experiment, we applied our model to real video to compare its accuracy with previous gradient-based speaker naming models.
Evaluation Metric
Speaker naming accuracy (snA) has been used broadly in previous speaker naming papers and was formulated in (Liu et al. 2019), one of our baselines. We use this metric to measure real-video inference performance when comparing our model with well-known speaker naming baselines. snA is defined as:

$snA = \frac{N_{[p_{sn} = s_{tr}]}}{N_{s_{tr}}} \times 100$   (6)

where $p_{sn}$ and $s_{tr}$ denote the labels of the predicted samples and the ground truth, respectively; $N_{[p_{sn} = s_{tr}]}$ is the number of correctly predicted time windows, and $N_{s_{tr}}$ is the total number of time windows.

Table 1: Summary of the performance comparison between the gradient-based methods (TfS, LwF) and the attention-based (Att-based) method under various settings. The numbers in the table are the averages of the measurements over each range.

    # of target IDs                                   TfS      LwF      Att-based (Ours)
    Small IDs        Small shots    mpA (%)           -        -        -
    (5-50 people)    (5 shots)      Setup time (s)    60.02    7.06     -
                     Large shots    mpA (%)           -        -        -
                     (50 shots)     Setup time (s)    381.21   86.28    -
    Large IDs        Small shots    mpA (%)           -        -        -
    (50-500 people)  (5 shots)      Setup time (s)    437.20   85.14    -
                     Large shots    mpA (%)           -        -        -
                     (50 shots)     Setup time (s)    -        -        -

Table 2: Speaker naming accuracy (snA (%)) comparison between the attention-based model and existing speaker naming models on the real video of BBT S01E03.

    Time window(s)  Bäuml et al. 2013  Tapaswi et al. 2012  Hu et al. 2015  Ren et al. 2016  Liu et al. 2019  Att-based (Ours)
    0.5s            -                  -                    74.93           86.59            -                88.89
    2s              -                  -                    82.12           90.84            -                89.89
    3s              -                  -                    83.42           91.38            -                -

Experimental Setup
For evaluation, we followed the same settings as previous work (Tapaswi, Bäuml, and Stiefelhagen 2012; Bäuml, Tapaswi, and Stiefelhagen 2013; Hu et al. 2015; Ren et al. 2016) on the speaker naming experiment. A four-minute-long BBT S01E03 video clip was used as the evaluation dataset. In real situations, many non-matched face-voice pairs occur per period; unlike the previous controlled settings, we therefore put 30 shots of matched and non-matched-pairs at a ratio of 1 to 4 into the prior knowledge embeddings, because end-to-end inference detects not only the active speaker but also distractors. We tested our model on video with time windows in multiples of 0.5 s. Time windows longer than 0.5 s were also tested to clarify the comparison with existing methods; for such windows, the model's prediction is determined by a majority vote over the 0.5 s windows, as in previous work (Hu et al. 2015).
Results
As shown in Table 2, Att-based showed snA comparable to the other gradient-based speaker naming models in most cases. In certain circumstances, such as time window sizes of 1 s, 2 s, and 3 s, our model even outperformed the other state-of-the-art models.
Conclusion and Future Work
In this paper, we presented an attention-based speaker naming method for online adaptation in non-fixed scenarios. The key idea is to predict the ID of a matched-pair with an attention mechanism that considers the correlations between all pairs of prior knowledge embeddings and extracted target embeddings. Our proposed approach significantly reduces the model setup time while keeping accuracy comparable to existing state-of-the-art models, as demonstrated in our experiments. The model can also be updated online by only changing the information in the attention module.

Our further research aims to address the current limitations and make the method applicable to more generalized situations. Our current method uses only two modalities and shows lower accuracy when the number of target IDs is large. It can also become memory-inefficient as the number of IDs and the number of shots per ID increase. If we properly combine the advantages of gradient-based methods with our method, the integrated approach could adequately cover more varied situations in the future.
Acknowledgments
This work was supported by the New Industry Promotion Program (1415158216, Development of Front/Side Camera Sensor for Autonomous Vehicle) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bäuml, M.; Tapaswi, M.; and Stiefelhagen, R. 2013. Semi-supervised learning with constraints for person identification in multimedia data. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, 3602-3609. Washington, DC, USA: IEEE Computer Society.

Bredin, H., and Gelly, G. 2016. Improving speaker diarization of TV series using talking-face detection and clustering. In Proceedings of the 24th ACM International Conference on Multimedia, MM '16, 157-161. New York, NY, USA: ACM.

Chung, J. S.; Nagrani, A.; and Zisserman, A. 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.

Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4690-4699.

Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, 1180-1189. JMLR.org.

Hershey, S.; Chaudhuri, S.; Ellis, D. P.; Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt, D.; Saurous, R. A.; Seybold, B.; et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131-135. IEEE.

Horn, R. A. 1990. The Hadamard product. In Proc. Symp. Appl. Math, volume 40, 87-169.

Hu, Y.; Ren, J. S.; Dai, J.; Yuan, C.; Xu, L.; and Wang, W. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd Annual ACM International Conference on Multimedia, 1107-1110. ACM.

Li, Z., and Hoiem, D. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 212-220.

Liu, X.; Geng, J.; Ling, H.; and Cheung, Y.-M. 2019. Attention guided deep audio-face fusion for efficient speaker naming. Pattern Recognition.

Liu, Jiang, and Huang. 2008. In Proceedings of the 16th ACM International Conference on Multimedia, 717-720. ACM.

Muda, L.; Begam, M.; and Elamvazuthi, I. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083.

Ren, J. S.; Hu, Y.; Tai, Y.-W.; Wang, C.; Xu, L.; Sun, W.; and Yan, Q. 2016. Look, listen and learn - a multimodal LSTM for speaker identification. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 3581-3587.

Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815-823.

Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, 1988-1996.

Takenaka, K.; Bando, T.; Nagasaka, S.; and Taniguchi, T. 2012. Drive video summarization based on double articulation structure of driving behavior. In Proceedings of the 20th ACM International Conference on Multimedia, 1169-1172. ACM.

Tapaswi, M.; Bäuml, M.; and Stiefelhagen, R. 2012. "Knock! Knock! Who is it?" Probabilistic person identification in TV-series. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2658-2665.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5265-5274.

Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, 499-515. Springer.

Xie, W.; Nagrani, A.; Chung, J. S.; and Zisserman, A. 2019. Utterance-level aggregation for speaker recognition in the wild. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5791-5795. IEEE.

Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320-3328.

Zhang, H.; Zha, Z.-J.; Yang, Y.; Yan, S.; Gao, Y.; and Chua, T.-S. 2013. Attribute-augmented semantic hierarchy: towards bridging semantic gap and intention gap in image retrieval. In Proceedings of the 21st ACM International Conference on Multimedia, 33-42. ACM.

Zhang, K.; Zhang, Z.; Li, Z.; and Qiao, Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters.