Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim
School of Electrical Engineering, KAIST, Daejeon, South Korea
{dudans, kye9165, wkadldppdy, kss2517, hoirkim}@kaist.ac.kr

Abstract
Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, we obtain a speaker embedding vector by pooling single-scale features that are extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase robustness to utterances of arbitrary duration, this paper improves the MSA by using a feature pyramid module. The module enhances speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves previous MSA methods with a smaller number of parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.
Index Terms: Speaker verification, deep speaker embedding, multi-scale aggregation, feature pyramid module, short duration
1. Introduction
Speaker verification (SV) is the task of verifying a speaker's claimed identity based on his or her speech. Depending on the constraints on the transcripts used for enrollment and verification, SV systems fall into two categories: text-dependent SV (TD-SV) and text-independent SV (TI-SV). TD-SV requires the content of the input speech to be fixed, while TI-SV operates on unconstrained speech. Before the deep learning era, the combination of i-vector [1] and probabilistic linear discriminant analysis (PLDA) [2] was the dominant approach for SV [3, 4]. Although this approach performs well with long utterances, it suffers from performance degradation with short utterances.

Recently, the most widely used SV approach has been deep speaker embedding learning, which extracts speaker embeddings directly from a deep-learning-based speaker-discriminative network [5-18]. This approach outperforms the i-vector/PLDA approach, especially on short utterances. In deep speaker embedding learning, convolutional neural networks (CNNs) such as the time-delay neural network (TDNN) [10-12] or ResNet [5, 8, 9, 13-18] are mostly used as the speaker-discriminative network. Specifically, the network is trained to classify training speakers [9-18] or to separate same-speaker and different-speaker utterance pairs [5, 6]. After training, an utterance-level speaker embedding, called a deep speaker embedding, is obtained by aggregating speaker features extracted from the network. Most of these approaches use a pooling layer for feature aggregation, mapping variable-length speaker features to a fixed-dimensional embedding. There are several pooling methods, such as global average pooling [5, 7, 14], statistics pooling [10, 13], attentive statistics pooling [11], learnable dictionary encoding [9], and spatial pyramid encoding [16].

Meanwhile, all these pooling layers use only single-scale features from the last layer of the feature extractor. To aggregate speaker information from different time scales, multi-scale aggregation (MSA) methods have been proposed recently [12-15]. MSA aggregates multi-scale features from different layers of a feature extractor to generate a speaker embedding. In [12-15], the authors show the effectiveness of MSA in dealing with short or long utterances.

In this work, we propose a new MSA method using a feature pyramid module (FPM). A top-down architecture with lateral connections is used to generate feature maps with rich speaker information at all selected layers. Then, we exploit the rich multi-scale features of a ResNet-based feature extractor to extract speaker embeddings. In addition, we present a novel interpretation of MSA using the theory of [19]. We evaluate our method using various pooling layers for TI-SV on the VoxCeleb dataset [7, 8]. Experimental results show that the performance of MSA is further improved by the FPM with a smaller number of parameters. Besides, the effectiveness of our method is verified on variable-duration test utterances.
2. Relation to prior works
Gao et al. [13] proposed multi-stage aggregation, where ResNet is used as a feature extractor. The output feature maps of stages 2, 3, and 4 (see Table 1) are concatenated along the channel axis. To make the feature maps match in time-frequency resolution, the feature map from stage 2 is downsampled by a convolution with stride 2, and the feature map from stage 4 is upsampled by bilinear interpolation or transposed convolution. After concatenation, speaker embeddings are generated by statistics pooling.

Seo et al. [14] also utilize feature maps from different stages of ResNet to fuse information at different resolutions. Unlike the method of Gao et al., global average pooling (GAP) is applied to the feature maps, and the pooled feature vectors are concatenated into a long vector. The concatenated vector is fed into fully-connected layers to generate the speaker embedding. Hajavi et al. [15] proposed a similar approach to the study of Seo et al.
Their proposed model, UtterIdNet, shows significant improvement in speaker recognition with short utterances.

To sum up, the method of Gao et al. fuses feature maps from different stages to form a single feature map and applies pooling to the fused feature map to generate the speaker embedding. On the other hand, the method of Seo et al. applies pooling to the feature maps respectively and aggregates the resulting vectors to obtain the speaker embedding. In this paper, we take these two methods as baselines. For clarity, we denote the first method as multi-scale feature aggregation (MSFA) and the second one as multi-scale embedding aggregation (MSEA). These approaches are illustrated in Fig. 1. We will show that the proposed feature pyramid module improves both baselines.
Figure 1:
Two types of multi-scale aggregation (MSA). (a) Multi-scale feature aggregation (MSFA). (b) Multi-scale embedding aggregation (MSEA). In this paper, acoustic features of consecutive frames are indicated by grey rectangles and feature maps are indicated by blue outlines. "2× up (or down)" is an upsampling (or downsampling) by a factor of 2.
Table 1: The architecture of the feature extractor based on ResNet-34 [21]. Inside the brackets is the shape of a residual block and outside the brackets is the number of stacked blocks in a stage. The input size is 64 × T.

Layer name   Output size       ResNet-34                        Stage
conv1        64 × T × 32       7 × 7, 32, stride 1              -
conv2_x      64 × T × 32       [3 × 3, 32; 3 × 3, 32] × 3       1
conv3_x      32 × T/2 × 64     [3 × 3, 64; 3 × 3, 64] × 4       2
conv4_x      16 × T/4 × 128    [3 × 3, 128; 3 × 3, 128] × 6     3
conv5_x      8 × T/8 × 256     [3 × 3, 256; 3 × 3, 256] × 3     4

Meanwhile, these studies show the effectiveness of MSA with only one type of pooling operation, i.e., statistics pooling and GAP, respectively. Unlike these studies, we evaluate our approach using three popular pooling methods: GAP, self-attentive pooling (SAP) [9], and learnable dictionary encoding (LDE) [9]. Experimental results show that the proposed method achieves good performance for all three pooling layers.
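To make the MSFA/MSEA distinction concrete, the following is a minimal PyTorch sketch of the two baselines, assuming the outputs of ResNet stages 2-4 (conv3_x-conv5_x in Table 1) are already available. The channel widths, the 3 × 3 stride-2 downsampling kernel, the embedding dimension, and the use of GAP for MSFA (the original MSFA in [13] uses statistics pooling, and MSEA in practice adds 1 × 1 convolutions before pooling) are illustrative simplifications, not the exact configurations of [13] and [14].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gap(x):
    """Global average pooling over the frequency and time axes."""
    return x.mean(dim=(2, 3))


class MSFA(nn.Module):
    """Multi-scale feature aggregation: fuse the maps first, then pool once."""

    def __init__(self, ch=(64, 128, 256), emb_dim=128):
        super().__init__()
        # Stride-2 convolution downsamples the lowest map to the middle resolution.
        self.down = nn.Conv2d(ch[0], ch[0], kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(sum(ch), emb_dim)

    def forward(self, c3, c4, c5):
        c3 = self.down(c3)
        # Bilinear upsampling brings the highest map to the middle resolution.
        c5 = F.interpolate(c5, size=c4.shape[2:], mode="bilinear", align_corners=False)
        fused = torch.cat([c3, c4, c5], dim=1)   # concatenate along channels
        return self.fc(gap(fused))               # pool the fused map once


class MSEA(nn.Module):
    """Multi-scale embedding aggregation: pool each map, then fuse the vectors."""

    def __init__(self, ch=(64, 128, 256), emb_dim=128):
        super().__init__()
        self.fc = nn.Linear(sum(ch), emb_dim)

    def forward(self, c3, c4, c5):
        pooled = torch.cat([gap(c3), gap(c4), gap(c5)], dim=1)
        return self.fc(pooled)
```

Both variants return a fixed-dimensional embedding; the only structural difference is whether fusion happens before or after pooling.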
3. Proposed approach
In this section, we introduce the proposed multi-scale aggregation using the feature pyramid module motivated by [20]. In this work, we use ResNet as our feature extractor, just as in the baselines [13, 14]. The architecture is described in Table 1.
3.1. Multi-scale aggregation

Deep CNNs such as ResNet are usually bottom-up, feed-forward architectures, which use repeated convolutional and subsampling layers to learn sophisticated feature representations. Deep CNNs compute a feature hierarchy layer by layer, and with subsampling layers, the feature hierarchy is inherently multi-scale and pyramidal in shape. This in-network feature hierarchy produces feature maps of different time-frequency scales and resolutions, but introduces large semantic gaps caused by different depths. In SV, as the network is trained to discriminate speakers, features of higher layers contain more speaker-discriminative information (higher-level speaker information) but have lower resolutions due to the repeated subsampling.

Deep CNNs are robust to variance in scale and thus facilitate extraction of speaker embeddings from feature maps computed on a single input scale (Fig. 2(a)). But even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)) improves the performance, as discussed in Section 2.
Figure 2: How to use feature maps for deep speaker embedding. (a) Using only single-scale feature maps. (b) Using multi-scale feature maps without the feature pyramid module (FPM). (c) Using multi-scale feature maps with the FPM. In this paper, thicker outlines denote feature maps with more speaker-discriminative information. ⊗: concatenation, ⊕: element-wise addition.

According to the previous works, the advantage of MSA is that it extracts speaker embeddings from multiple temporal scales, improving speaker recognition performance [12, 13]. It is also useful for short-segment speaker recognition through an efficiently increased use of the information in short utterances [15]. Besides, it passes error signals back to earlier layers, which helps alleviate the vanishing gradient problem [12, 14].

Two types of MSA are described in Fig. 1. For MSFA (Fig. 1(a)), we use the same method as in [13]. For MSEA (Fig. 1(b)), 1 × 1 convolutions are added before pooling. The embeddings of different stages are concatenated, and the output of the following fully-connected (FC) layer is used as the speaker embedding.

3.2. Feature pyramid module

In deep CNNs, feature maps of lower layers have less speaker-discriminative information compared to those of higher layers. Intuitively, if we can enhance the speaker discriminability of the low-layer feature maps, the performance of MSA will be improved. Motivated by this intuition, we aim to create multi-scale features that have high-level speaker information at all layers. The feature pyramid module (FPM) is used to achieve this goal. The MSA with FPM is illustrated in Fig. 2(c). The dotted box indicates the building block of the FPM, which consists of a lateral connection and a top-down pathway merged by addition.

The detailed architecture is shown in Fig. 3, involving a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway is the feed-forward computation of the backbone ResNet. It computes a feature hierarchy consisting of feature maps at multiple scales with a scaling step of 2. In each ResNet stage, there are many layers producing feature maps of the same spatial size (see Table 1). We choose the output of the last layer of each stage as the output of that stage, since the deepest layer contains the strongest features. We denote the output of conv_i as Ci for i = 2, 3, 4, 5.

The black dotted box in Fig. 3 indicates the FPM. The procedure of the proposed approach is as follows: (1) Using bilinear interpolation or transposed convolution, we upsample lower-resolution feature maps from higher stages by a factor of 2. That is, the top-down pathway hallucinates higher-resolution features by upsampling low-resolution feature maps, which carry more speaker-discriminative information, from higher stages. (2) The upsampled feature maps are then enhanced with features from the bottom-up pathway via lateral connections.

Figure 3:
Proposed MSA with FPM. The black dotted box indicates the FPM. Only MSEA is illustrated here, but the FPM can be applied to both MSFA and MSEA.
Concretely, the top-down feature map is merged with the corresponding bottom-up feature map by element-wise addition. Before merging, a 1 × 1 convolution in the lateral connection reduces the channel dimension of the bottom-up feature map to 32, which is the channel dimension of the lowest stage. These lateral connections play the same role as the skip connections in U-Net [22]. They directly transfer the high-resolution information from the bottom-up pathway to the top-down pathway. (3) This process is repeated from the top stage to the bottom stage. At the beginning, a 1 × 1 convolutional layer reduces the channel dimension of C5 to 32. (4) Finally, an additional convolutional layer is appended to each merged feature map to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} of the same time-frequency resolutions, respectively.

The FPM combines lower-resolution features carrying higher-level speaker information with higher-resolution features carrying lower-level speaker information. The result is a feature pyramid that has rich speaker information at all stages. In other words, the FPM plays the role of feature enhancement for MSA. Furthermore, the FPM reduces the total number of parameters in the network, because the numbers of channels at stages 2, 3, and 4 are reduced by the 1 × 1 convolutions in the lateral connections.

A recent study shows that the collection of paths of different lengths in ResNet exhibits ensemble-like behavior [19]. Similarly, we can interpret the MSA method as using an ensemble of multi-scale features from different paths. As the variable-length feature maps are used to extract speaker embeddings, we expect that speaker verification performance will be improved for variable-duration test utterances, especially with the FPM. In our experiments, we show that the MSA with FPM provides improved performance for both short and long utterances.
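As a concrete illustration of steps (1)-(4), below is a minimal PyTorch sketch of the FPM under the channel configuration of Table 1 (C2-C5 with 32, 64, 128, and 256 channels). Bilinear upsampling is shown, corresponding to the FPM-B variant; FPM-TC would replace it with transposed convolutions. The 3 × 3 kernel of the smoothing convolution and the handling of P5 are assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramidModule(nn.Module):
    """Top-down pathway with lateral connections over ResNet stage outputs."""

    def __init__(self, in_channels=(32, 64, 128, 256), out_channels=32):
        super().__init__()
        # Lateral 1x1 convolutions: bring every stage to the channel
        # dimension of the lowest stage (32).
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # Smoothing convolutions appended to each merged map to reduce
        # the aliasing effect of upsampling.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels[:-1]
        )

    def forward(self, c2, c3, c4, c5):
        feats = [c2, c3, c4]
        # A 1x1 convolution first reduces the channel dimension of C5 to 32.
        p = self.lateral[-1](c5)
        outputs = [p]  # P5
        # Repeat the merge from the top stage down to the bottom stage.
        for i in range(len(feats) - 1, -1, -1):
            # (1) Upsample the higher-stage map by a factor of 2 (bilinear here).
            up = F.interpolate(p, size=feats[i].shape[2:], mode="bilinear",
                               align_corners=False)
            # (2) Merge with the lateral connection by element-wise addition.
            p = self.lateral[i](feats[i]) + up
            # (4) Smooth the merged map to obtain the output of this stage.
            outputs.insert(0, self.smooth[i](p))
        return outputs  # [P2, P3, P4, P5], matching C2-C5 in resolution


# Dummy stage outputs with the shapes of Table 1 for a 3 s input (T = 300).
if __name__ == "__main__":
    c2 = torch.randn(2, 32, 64, 300)
    c3 = torch.randn(2, 64, 32, 150)
    c4 = torch.randn(2, 128, 16, 75)
    c5 = torch.randn(2, 256, 8, 38)
    p2, p3, p4, p5 = FeaturePyramidModule()(c2, c3, c4, c5)
    print([tuple(p.shape) for p in (p2, p3, p4, p5)])
```

Note that only the un-smoothed merged map is propagated further down the top-down pathway, while the smoothed maps form the final {P2, P3, P4, P5}; every Pi has 32 channels, which is what reduces the parameter count of the subsequent pooling and FC layers.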
4. Experimental setup
We use the VoxCeleb1 [7] and VoxCeleb2 [8] datasets. Both are for large-scale text-independent speaker recognition, containing 1,250 and 5,994 speakers, respectively. The utterances are extracted from YouTube videos, where the speech segments are corrupted with real-world noise. Both datasets are split into development (dev) and test sets. The dev set of VoxCeleb2 is used for training and the test set of VoxCeleb1 is used for testing. There are no overlapping speakers between them.

Table 2:
Comparison of Single, MSFA, and MSEA. The softmax loss and GAP are used for all systems. In this paper, Single denotes using only single-scale features (Fig. 2(a)); FPM-B and FPM-TC are the FPMs with bilinear upsampling and transposed convolution upsampling, respectively. The results of the proposed approaches are shown in bold.
Systems            C_det^min   EER (%)   # Params
MSFA  w/ FPM-B     0.398       4.22      5.82M
      w/ FPM-TC    0.408       4.01      5.85M
MSEA  w/o FPM      0.416       4.22      5.90M
      w/ FPM-B     0.403       4.20      5.83M
      w/ FPM-TC    0.411       4.01      5.85M
When evaluating the performance on short utterances, our test recordings are cut into four different durations: 1 s, 2 s, 3 s, and 5 s, where the duration is determined after energy-based voice activity detection (VAD). If the length of an utterance is less than the given duration, the entire utterance is used.
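A small helper sketching this protocol is shown below, assuming the waveform has already been passed through the energy-based VAD; the function name, the 16 kHz sampling rate, and cutting from the start of the recording are illustrative assumptions.

```python
import numpy as np


def cut_to_duration(waveform: np.ndarray, duration_s: float,
                    sample_rate: int = 16000) -> np.ndarray:
    """Cut a VAD-processed waveform to `duration_s` seconds; keep it whole if shorter."""
    target_len = int(duration_s * sample_rate)
    if len(waveform) <= target_len:
        return waveform  # shorter than the target duration: use the entire utterance
    return waveform[:target_len]
```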
The input features are 64-dimensional log Mel-filterbank features with a frame length of 25 ms, which are mean-normalized over a sliding window of up to 3 s. Neither VAD nor data augmentation is used for training. In training, the input size of the ResNet is 64 × 300 for 3 s of speech (i.e., d = 64 and T = 300 in Fig. 3). In testing, the entire utterance is evaluated at once. The 128-dimensional speaker embeddings are extracted from the network. We report the equal error rate (EER) in % and the minimum detection cost function (C_det^min) with the same settings as in [8]. Verification trials are scored using the cosine distance.

The models are implemented using PyTorch [23] and optimized by stochastic gradient descent with momentum 0.9. The mini-batch size is 64, and the weight decay parameter is 0.0001. We use the same learning rate (LR) schedule as in [9], with an initial LR of 0.1. For LDE, we use the same method as in [16]. Before the LDE layer, a 1 × 1 convolution is applied to change the number of channels to 64. After the LDE layer, L2-normalization and an FC layer are added to reduce the dimension to 128. The number of codewords is 64. When applying MSA, both the LDE and FC layers are shared by all stages. On the other hand, the parameters of the SAP layers are not shared.
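For reference, here is a hedged sketch of two of the pooling layers compared below: global average pooling and self-attentive pooling in the spirit of [9]. It assumes each stage output is first reshaped into a sequence of frame-level vectors by averaging over the frequency axis; that reshaping, the attention dimension, and the single learnable context vector are assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn as nn


def to_frames(feature_map: torch.Tensor) -> torch.Tensor:
    """(batch, channels, freq, time) -> (batch, time, channels) by averaging over frequency."""
    return feature_map.mean(dim=2).transpose(1, 2)


def gap(frames: torch.Tensor) -> torch.Tensor:
    """Global average pooling over the time axis."""
    return frames.mean(dim=1)


class SelfAttentivePooling(nn.Module):
    """Attention-weighted mean over frames, scored against a learnable context vector."""

    def __init__(self, channels: int, att_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(channels, att_dim)
        self.context = nn.Parameter(torch.randn(att_dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        hidden = torch.tanh(self.proj(frames))                  # (B, T, att_dim)
        weights = torch.softmax(hidden @ self.context, dim=1)   # (B, T)
        return (frames * weights.unsqueeze(-1)).sum(dim=1)      # (B, channels)
```

Under MSEA, one such pooling layer is applied per selected stage; as noted above, the SAP parameters are not shared across stages, whereas the LDE and FC layers are.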
5. Results
In Tables 2 and 3, the models are trained on the VoxCeleb1 dev set. Table 2 compares the two MSA methods with (w/) and without (w/o) the FPM. MSFA w/o FPM and MSEA w/o FPM correspond to the approaches of Gao et al. [13] and Seo et al. [14], respectively. In both MSA approaches, we aggregate feature maps from 3 different stages: C3, C4, and C5 for w/o FPM, and P3, P4, and P5 for w/ FPM. Note that the output of stage i is C_{i+1} for i = 1, 2, 3, 4, because stage 1 corresponds to conv2, as shown in Table 1. We can see that both MSAs yield better performance than Single, but with more parameters.

As discussed in Section 2, the MSFA uses upsampling so that the three feature maps have the same spatial size. In this work, bilinear upsampling is applied to C5, since using transposed convolution does not improve the performance of MSFA despite the additional parameters. On the other hand, for the FPM, both bilinear and transposed convolution upsampling are used in the top-down pathway, corresponding to FPM-B and FPM-TC, respectively. Among the three systems, w/o FPM has the worst performance with the largest number of parameters.

Table 3: Performance comparison with and without the FPM for three pooling strategies: GAP, SAP [9], and LDE [9]. MSEA with the softmax loss is applied for all systems.
Systems       GAP                SAP                LDE
              C_det^min  EER     C_det^min  EER     C_det^min  EER
Single        0.423      4.55    0.410      4.38    0.421      4.44
w/o FPM       0.416      4.22    0.416      4.24    0.435      4.09
w/ FPM-B      0.403      4.20    0.393      4.13    0.402      3.84
w/ FPM-TC     0.411      4.01    0.408      4.09    0.368      3.63
Table 4:
Comparison with state-of-the-art systems. In the parentheses, the first and second terms are the loss function and the pooling layer used. For the loss function, C, ASM, SM, and EAMS denote contrastive, A-softmax [24], softmax, and EAM softmax [25] loss, respectively. For the pooling layer, SPE, ASP, and SP denote spatial pyramid encoding [16], attentive statistics pooling [11], and statistics pooling [10], respectively. * means that data augmentation is used and NR is "not reported".
Systems                                       Train set   C_det^min   EER
i-vectors+PLDA [7]                            Vox1        0.73        8.8
VGG-M (C+GAP) [7]                             Vox1        0.71        7.8
ResNet-34 (ASM+SAP) [9]                       Vox1        0.622       4.40
ResNet-34 (ASM+SPE) [16]                      Vox1        0.402       4.03
TDNN (SM+ASP) [11]                            Vox1*       0.406       3.85
Proposed 1: ResNet-34 (ASM+GAP) w/ FPM-TC     Vox1
Proposed 2: ResNet-34 (ASM+LDE) w/ FPM-TC     Vox1        0.350       3.22
Thin ResNet-34 (SM+GhostVLAD) [26]            Vox2        NR          3.22
ResNet-50 (EAMS+GAP) [25]                     Vox2        0.278       2.94
ResNet-34 (ASM+SPE) [16]                      Vox2        0.245       2.61
DDB+Gate (SM+SP) [27]                         Vox1&2      0.268       2.31
Proposed 1: ResNet-34 (ASM+GAP) w/ FPM-TC     Vox2
Proposed 2: ResNet-34 (ASM+LDE) w/ FPM-TC     Vox2        0.205       1.98
By adding the FPM, we obtain better performance with fewer parameters. The number of parameters is reduced because the channel dimension of the feature maps at the selected stages is reduced, as discussed in Section 3.2. FPM-TC has slightly more parameters than FPM-B because the transposed convolutional layer has learnable parameters. FPM-B achieves a relative improvement of 8.92% in C_det^min over w/o FPM, and FPM-TC shows a relative improvement of 6.74% in EER over w/o FPM.

For MSEA, FPM-B provides the best C_det^min (0.403), and FPM-TC obtains the best EER (4.01%). The proposed FPM improves the performance of both MSFA and MSEA, which achieve similar performance with a similar number of parameters. Therefore, we only use MSEA in the following experiments. Moreover, MSEA is more flexible in the number of stages it can use, unlike MSFA, which is designed to use only 3 stages.

In Table 3, we compare the performance of the three pooling strategies. For the SAP layer, FPM-B provides the best C_det^min (0.393), and FPM-TC obtains the best EER (4.09%). For the LDE layer, we only aggregate features from stages 3 and 4, because using features from stage 2 does not improve the performance. FPM-TC achieves the best performance in both C_det^min (0.368) and EER (3.63%). For all three pooling strategies, the FPM improves the performance of the MSA method. We can see that the FPM is most effective for the LDE layer.

In Table 4, we compare the two proposed systems with recently reported SV systems. Both proposed systems apply MSEA with FPM-TC and the combination of A-softmax and ring loss with the same settings as in [16]. When we use the VoxCeleb2 dataset for training, we extract 256-dimensional speaker embeddings. In the first system (Proposed 1), we aggregate features from all the stages using GAP. In the second system (Proposed 2), we aggregate features from stages 2, 3, and 4 using LDE. The proposed systems outperform the other baseline systems in terms of both C_det^min and EER. Proposed 2 achieves the best performance among all the systems. Using the VoxCeleb1 dataset for training, Proposed 2 obtains a C_det^min of 0.350 and an EER of 3.22%. Using the VoxCeleb2 dataset for training, Proposed 2 achieves a C_det^min of 0.205 and an EER of 1.98%.

In Tables 5 and 6, we evaluate the performance of several systems under the 5 s and full-length enrollment conditions, respectively. For each condition, we evaluate the performance on five different test durations: 1 s, 2 s, 3 s, 5 s, and the original full length (full). The average duration of 'full' is 6.3 s. All the results are obtained with our own implementations, and the VoxCeleb2 dataset is used for training. For the TDNN, we follow the same architecture as in [11]. For all the baseline systems, we use the same acoustic features and optimization as for our proposed systems.

Among the baselines, the ResNet-based system using SPE (the advanced version of LDE) outperforms the TDNN-based systems except in the 1 s test condition. Similarly, Proposed 2 achieves better results than Proposed 1 for utterances longer than 1 s. From these results, we can see that LDE-based pooling shows a greater performance degradation on very short utterances than the other pooling strategies. MSEA w/o FPM is the MSEA-without-FPM approach of Table 2. We find that applying the FPM improves the performance of MSA for variable-duration test utterances. When we compare the proposed systems with the other state-of-the-art baseline systems, including the TDNN (x-vector) and ResNet-based systems, we also observe that the MSA with FPM achieves higher performance for both short and long utterances.

Table 5:
EER (%) of systems on the 5 s enrollment set.
Systems                        1 s      2 s     3 s     5 s     full
TDNN (ASM+SP)                  10.93    6.12    4.50    3.83    3.35
TDNN (ASM+ASP)                 10.21    5.62    4.47    3.62    3.29
ResNet-34 (ASM+SPE) [16]       12.13    5.60    4.10    3.45    3.07
MSEA w/o FPM (ASM+GAP)          6.65    4.05    3.23    2.74    2.56
Proposed 1                      6.25    3.95    2.99    2.54    2.35
Proposed 2                      6.54    3.76    2.80    2.47    2.23
Table 6:
EER (%) of systems on the full-length enrollment set.
Systems                        1 s      2 s     3 s     5 s     full
TDNN (ASM+SP)                   9.92    5.50    4.06    3.47    3.04
TDNN (ASM+ASP)                  9.63    5.02    3.87    3.36    2.92
ResNet-34 (ASM+SPE) [16]       11.12    4.93    3.55    2.98    2.61
MSEA w/o FPM (ASM+GAP)          6.13    3.68    2.84    2.50    2.31
Proposed 1                      5.85    3.64    2.83    2.41    2.17
Proposed 2                      5.92    3.38    2.54    2.17    1.98
6. Conclusions
In this study, we proposed a novel MSA method for TI-SV using an FPM. We applied the FPM to two types of MSA methods, MSFA and MSEA. The FPM enhances speaker-discriminative information in multi-scale features at multiple layers of a speaker feature extractor. On the VoxCeleb dataset, experimental results showed that the FPM improves both MSA methods with fewer parameters and works well with three popular pooling layers. The proposed systems obtained better results for both short and long utterances than the state-of-the-art baseline systems.
7. Acknowledgements
This material is based upon work supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10063424, Development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots).

8. References

[1] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[2] S. Ioffe, "Probabilistic linear discriminant analysis," in Proc. of European Conference on Computer Vision (ECCV), 2006, pp. 531-542.
[3] P. Kenny, "Bayesian speaker verification with heavy tailed priors," in Proc. of Odyssey Speaker and Language Recognition Workshop, 2010, p. 14.
[4] D. Garcia-Romero and C. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. of Interspeech, 2011, pp. 249-252.
[5] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: An end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[6] C. Zhang, K. Koishida, and J. Hansen, "Text-independent speaker verification based on triplet convolutional neural network embeddings," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 26, no. 9, pp. 1633-1644, 2018.
[7] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. of Interspeech, 2017, pp. 2616-2620.
[8] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. of Interspeech, 2018, pp. 1086-1090.
[9] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. of Odyssey Speaker and Language Recognition Workshop, 2018, pp. 74-81.
[10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329-5333.
[11] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. of Interspeech, 2018, pp. 2252-2256.
[12] Y. Tang, G. Ding, J. Huang, X. He, and B. Zhou, "Deep speaker embedding learning with multi-level pooling for text-independent speaker verification," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6116-6120.
[13] Z. Gao, Y. Song, I. McLoughlin, P. Li, Y. Jiang, and L. Dai, "Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system," in Proc. of Interspeech, 2019, pp. 361-365.
[14] S. Seo, D. J. Rim, M. Lim, D. Lee, H. Park, J. Oh, C. Kim, and J. Kim, "Shortcut connections based deep speaker embeddings for end-to-end speaker verification system," in Proc. of Interspeech, 2019, pp. 2928-2932.
[15] A. Hajavi and A. Etemad, "A deep neural network for short-segment speaker recognition," in Proc. of Interspeech, 2019, pp. 2878-2882.
[16] Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, "Spatial pyramid encoding with convex length normalization for text-independent speaker verification," in Proc. of Interspeech, 2019, pp. 4030-4034.
[17] Y. Jung, Y. Choi, and H. Kim, "Self-adaptive soft voice activity detection using deep neural networks for robust speaker verification," in Proc. of 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 365-372.
[18] S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, and H. Kim, "Meta-learning for short utterance speaker recognition with imbalance length pairs," arXiv preprint arXiv:2004.02863, 2020.
[19] A. Veit, M. Wilber, and S. Belongie, "Residual networks behave like ensembles of relatively shallow networks," in Proc. of Neural Information Processing Systems (NIPS), 2016.
[20] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. of Computer Vision and Pattern Recognition (CVPR), 2017, pp. 936-944.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. of Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[22] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," arXiv preprint arXiv:1505.04597, 2015.
[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Advances in Neural Information Processing Systems (NIPS) Autodiff Workshop, 2017.
[24] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep hypersphere embedding for face recognition," in Proc. of Computer Vision and Pattern Recognition (CVPR), 2017, pp. 212-220.
[25] Y. Yu, L. Fan, and W. Li, "Ensemble additive margin softmax for speaker verification," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6046-6050.
[26] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5791-5795.
[27] Y. Jiang, Y. Song, I. McLoughlin, Z. Gao, and L. Dai, "An effective deep embedding learning architecture for speaker verification," in