Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization for End-to-End Speaker Verification System
Soonshin Seo, Ji-Hwan Kim*
Dept. of Computer Science and Engineering, Sogang University, Seoul, South Korea
{ssseo,kimjihwan}@sogang.ac.kr
* Corresponding author
Abstract
One of the most important parts of an end-to-end speaker verification system is speaker embedding generation. In our previous paper, we reported that shortcut connections-based multi-layer aggregation improves the representational power of the speaker embedding. However, the number of model parameters is relatively large, and unspecified variations increase during multi-layer aggregation. Therefore, we propose self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system. To reduce the number of model parameters, a ResNet with scaled channel width and layer depth is used as the baseline. To control the variability in training, a self-attention mechanism is applied to perform multi-layer aggregation with dropout regularization and batch normalization. Then, a feature recalibration layer is applied to the aggregated feature using fully-connected layers and nonlinear activation functions. Deep length normalization is also applied to the recalibrated feature in the end-to-end training process. Experimental results on the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
Index Terms: end-to-end speaker verification system, self-attentive pooling, multi-layer aggregation, feature recalibration, deep length normalization, convolutional neural networks
Introduction
In the speaker verification field, deep neural networks (DNNs) have been used as speaker embedding extractors. Generally, a speaker embedding-based speaker verification system executes the following process [1-4]: First, a classification-based speaker model is trained. Second, the speaker embedding is extracted using the output value of an inner layer of the speaker model. Third, the similarity between the embeddings of the enrolled speaker and the test speaker is computed. Fourth, acceptance or rejection is determined by a pre-set decision threshold. In addition, back-end methods, e.g., probabilistic linear discriminant analysis or length normalization, can be used [5-7].
With advances in computational power and deep learning techniques, end-to-end training can demonstrate competitive performance [8-11]. Here, 'end-to-end' does not refer to a complete end-to-end system, e.g., [12-14], in which a verification result is output when a speech input is given; it refers only to the speaker model training process. Specifically, it is single-pass training with no additional strategies or back-end methods after extracting the speaker embedding [8, 10].
The most important part of the end-to-end speaker verification system is speaker embedding generation [10]. A speaker embedding is a high-dimensional feature vector that contains speaker information. An ideal speaker embedding maximizes inter-class variations and minimizes intra-class variations [4, 11, 15]. The component that directly affects speaker embedding generation is the encoding layer. The encoding layer takes a frame-level feature and converts it into a compact utterance-level feature; it also converts variable-length features to fixed-length features. Most encoding layers are based on a pooling method, e.g., temporal average pooling (TAP) [7, 14, 16], global average pooling (GAP) [10, 15], and statistical pooling (SP) [3, 11, 13, 17, 18].
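As a hedged illustration (not the paper's implementation), the pooling methods above reduce a variable-length frame-level feature matrix to a fixed-length utterance-level vector. A minimal numpy sketch of TAP/GAP and SP might look like:

```python
import numpy as np

def temporal_average_pooling(frames):
    # frames: (T, d) frame-level features -> (d,) utterance-level feature
    return frames.mean(axis=0)

def statistical_pooling(frames):
    # SP keeps both the mean and the standard deviation over time,
    # doubling the output dimension to (2d,)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

Whatever the utterance length T, the output dimension is fixed, which is what allows a fixed-size speaker embedding downstream.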
In particular, self-attentive pooling (SAP) improved performance by focusing on the frames that yield a more discriminative utterance-level feature [7, 19, 20]. These pooling layers provide compressed speaker information by rescaling the input size, and they are mainly used with convolutional neural networks (CNNs) [7, 10, 11, 14-17, 20]. Therefore, the speaker embedding is extracted using the output value of the last pooling layer in a CNN-based speaker model.
Furthermore, to improve the representational power of the speaker embedding, residual learning derived from ResNet [21] and squeeze-and-excitation (SE) blocks [22] were adapted for speaker models [7, 10, 11, 15, 16, 20, 23]. Residual learning maintains input information through mappings between layers called 'shortcut connections'; a large-scale CNN using shortcut connections can avoid gradient degradation. The SE block consists of a squeeze operation, which condenses all of the information on the features, and an excitation operation, which scales the importance of each feature. Therefore, the channel-wise feature response can be adjusted without significantly increasing the model complexity in training.
The main limitation of previous encoding layers is that the model uses only the output feature of the last pooling layer as input. In other words, it uses only one frame-level feature when constructing a speaker embedding. Therefore, similar to [11, 24], our previous study presented a shortcut connections-based multi-layer aggregation to improve the speaker representations when calculating the weight at the encoding layer [10]. Specifically, the frame-level features are extracted from between each residual layer in ResNet. Then, these frame-level features are fed into the input of the encoding layer using shortcut connections. As a result, a high-dimensional speaker embedding is generated. However, our previous study has limitations.
First, the model parameter size is relatively large, and the model generates a high-dimensional speaker embedding (1,024 dimensions, about 15 million model parameters). This leads to inefficient training and requires a sufficiently large amount of training data. Second, the multi-layer aggregation approach increases not only the speaker information but also the intrinsic and extrinsic variation factors, e.g., emotion, noise, and reverberation. Some of these unspecified factors increase the variability when generating a speaker embedding.
Given that, we propose self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system, as shown in Figure 1. We present an improved version of our previous work in the following steps: First, a ResNet with scaled channel width and layer depth is used as the baseline; the scaled ResNet has fewer parameters than the standard ResNet [21]. Second, a self-attention mechanism is applied to perform the multi-layer aggregation with dropout regularization and batch normalization [25]. It helps to construct a more discriminative utterance-level feature while considering the frame-level features of each layer. Third, a feature recalibration layer is applied to the aggregated feature; the channel-wise dependencies are trained using fully-connected layers and nonlinear activation functions. Fourth, deep length normalization [8] is applied to the recalibrated feature in the end-to-end training process.
The paper is organized as follows. Section 2 describes the baseline system using shortcut connections-based multi-layer aggregation. Section 3 introduces the proposed self-attentive multi-layer aggregation with feature recalibration and normalization. Section 4 discusses our experiments, and conclusions are drawn in Section 5.
Baseline System: Shortcut Connections-based Multi-Layer Aggregation
Prior system
In our previous study [10], we proposed shortcut connections-based multi-layer aggregation with ResNet-18. The main difference from the standard ResNet-18 [21] is how the speaker embedding is aggregated. The multi-layer aggregation uses not only the output feature of the last residual layer but also the output features of all previous residual layers. These features are concatenated into one feature through shortcut connections. The concatenated feature is fed into several fully-connected layers to construct a high-dimensional speaker embedding. Our prior system improved performance with a simple method, but the number of model parameters was too large.
Table 1: Architecture of the scaled ResNet-34 using multi-layer aggregation as a baseline (D: input dimension, L: input length, N: number of speakers, SE: speaker embedding)
Layer | Output size | Channels | Blocks | Encoding
conv1 | D × L | 32 | - | -
pool1 | 1 × 32 | - | - | GAP
res1 | D × L | 32 | 3 | -
pool2 | 1 × 32 | - | - | GAP
res2 | D/2 × L/2 | 64 | 4 | -
pool3 | 1 × 64 | - | - | GAP
res3 | D/4 × L/4 | 128 | 6 | -
pool4 | 1 × 128 | - | - | GAP
res4 | D/8 × L/8 | 256 | 3 | -
pool5 | 1 × 256 | - | - | GAP
concat | 1 × 512 | - | - | SE
output | 512 × N | - | - | -
Modifications
The prior system is modified considering scaling factors, e.g., layer depth, channel width, and input resolution, for efficient learning in the CNN [26]. First, we use high-dimensional log-Mel filterbanks with data augmentation for the input resolution. Second, the channel width is reduced and the layer depth is expanded, because ResNet can improve performance without significantly increasing the parameters when the layer depth is increased. Consequently, the scaled ResNet-34 is constructed as shown in Table 1. The scaled ResNet-34 is composed of three, four, six, and three residual blocks, and the number of channels is reduced by half compared to the standard ResNet-34 [21]. Also, shortcut connections-based multi-layer aggregation is added to the model using the GAP encoding method. The output features of each GAP are concatenated and fed into the output layer. Then, a high-dimensional speaker embedding is generated from the penultimate layer of the network. As a result, the scaled ResNet-34 has only about 6 million parameters, compared to the 12 million of the standard ResNet-18 and the 22 million of the standard ResNet-34.
Figure 1: Network architecture overview: self-attentive multi-layer aggregation with a feature recalibration layer and a deep length normalization layer. (We extract a speaker embedding after the normalization layer for each utterance.)
Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Normalization
Model architecture
As shown in Figure 1 and Table 2, the proposed network mainly consists of a scaled ResNet and an encoding layer. Frame-level features are trained in the scaled ResNet, and utterance-level features are trained in the encoding layer.
Table 2: Architecture of the proposed scaled ResNet-34 model using self-attentive multi-layer aggregation with feature recalibration (FR) and deep length normalization (DLN) layers (D: input dimension, L: input length, N: number of speakers, Pi: output features of the pooling layers, V: output feature of the concatenation layer, V': output feature of the FR layer, SE: speaker embedding)
Layer | Output size | Channels | Blocks | Encoding
conv1 | D × L | 32 | - | -
pool1 | 1 × 32 | - | - | SAP (P0)
res1 | D × L | 32 | 3 | -
pool2 | 1 × 32 | - | - | SAP (P1)
res2 | D/2 × L/2 | 64 | 4 | -
pool3 | 1 × 64 | - | - | SAP (P2)
res3 | D/4 × L/4 | 128 | 6 | -
pool4 | 1 × 128 | - | - | SAP (P3)
res4 | D/8 × L/8 | 256 | 3 | -
pool5 | 1 × 256 | - | - | SAP (P4)
concat | 1 × 512 | - | - | V
FR | 1 × 512 | - | - | V'
DLN | 1 × 512 | - | - | SE
output | 512 × N | - | - | -
Scaled ResNet.
In the scaled ResNet, given an input feature X = [x_1, x_2, ..., x_l, ..., x_L] of length L (x_l ∈ R^d), output features P_i = [p_1, p_2, ..., p_c, ..., p_{C_i}] (p_c ∈ R) are generated from each residual layer using SAP. Here, the length C_i is determined by the number of channels in the last layer of the i-th residual layer. Then, the generated output features are concatenated in order into one feature ([+] denotes concatenation):

V = P_0 [+] P_1 [+] P_2 [+] P_3 [+] P_4. (1)

The concatenated feature V = [v_1, v_2, ..., v_c, ..., v_C] (of length C = C_0 + C_1 + C_2 + C_3 + C_4, v_c ∈ R) is a set of frame-level features and is used as the input of the encoding layer.
Encoding layer.
The encoding layer consists of a feature recalibration layer and a deep length normalization layer. In the feature recalibration layer, the concatenated feature V is recalibrated by fully-connected layers and nonlinear activations, producing a recalibrated feature V' = [v'_1, v'_2, ..., v'_c, ..., v'_C] (v'_c ∈ R). Then, the recalibrated feature is normalized along its length in the deep length normalization layer. The normalized feature is used as the speaker embedding and is fed into the output layer for discriminating speaker classes.
Self-attentive multi-layer aggregation
As shown in Figure 1, SAP is applied to each residual layer using shortcut connections. Given the output feature of the first convolutional layer or of the i-th residual layer after average pooling, Y_i = [y_1, y_2, ..., y_n, ..., y_N] of length N (y_n ∈ R^c) is obtained, where the dimension c is determined by the number of channels. The average feature is fed into a fully-connected hidden layer with a hyperbolic tangent activation to obtain H_i = [h_1, h_2, ..., h_n, ..., h_N]. Given h_n ∈ R^c and a learnable context vector u ∈ R^c, the attention weight w_n is measured by training the similarity between h_n and u with a softmax normalization:

w_n = exp(h_n^T · u) / Σ_{n=1}^{N} exp(h_n^T · u). (2)

Then, the embedding e ∈ R^d is generated as the weighted sum of h_n with the normalized attention weights w_n:

e = Σ_{n=1}^{N} w_n h_n. (3)

In this manner, the SAP output feature P_i = [p_1, p_2, ..., p_c, ..., p_{C_i}] (p_c ∈ R) is generated. This process helps to generate a more discriminative feature while paying attention to the frame-level features of each layer. Moreover, dropout regularization and batch normalization are applied to each feature P_i. Then, the generated features are concatenated into one feature as in equation (1).
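The attention computation in equations (2) and (3) can be sketched in numpy as follows; the weight matrix W, bias b, and context vector u are hypothetical stand-ins for the learned parameters of the fully-connected hidden layer:

```python
import numpy as np

def self_attentive_pooling(y, W, b, u):
    """y: (N, c) frame-level features from one layer;
    W: (c, c), b: (c,), u: (c,) learnable parameters."""
    h = np.tanh(y @ W + b)                 # hidden representation h_n
    logits = h @ u                         # similarity between h_n and context u
    w = np.exp(logits - logits.max())
    w = w / w.sum()                        # softmax-normalized attention, eq. (2)
    return (w[:, None] * h).sum(axis=0)    # weighted sum, eq. (3)
```

If all frames are identical, the attention weights collapse to the uniform distribution and the output reduces to the hidden vector of any single frame, which is a quick sanity check on the implementation.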
Feature recalibration
After the self-attentive multi-layer aggregation, the concatenated feature V is fed into the feature recalibration layer as input. The feature recalibration layer aims to train the correlations between the channels of the concatenated feature; this is inspired by [22]. Given an input feature V = [v_1, v_2, ..., v_c, ..., v_C] (v_c ∈ R, where C is the sum of all channels), the feature channels are recalibrated using two fully-connected layers and nonlinear activations:

s = f_FR(V, W) = σ(W_2 δ(W_1 V)), (4)

where δ denotes the leaky ReLU activation and σ the sigmoid activation. W_1 ∈ R^{(C/r)×C} is the front fully-connected layer and W_2 ∈ R^{C×(C/r)} is the back fully-connected layer. According to the reduction ratio r, a dimensional transformation is performed between the two fully-connected layers, as in a bottleneck structure. Then, channel-wise multiplication is performed: the rescaled channels s are multiplied by the input feature V. As a result, an output feature V' = [v'_1, v'_2, ..., v'_c, ..., v'_C] (v'_c ∈ R) is generated, recalibrated according to the importance of each channel.
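A minimal sketch of the recalibration in equation (4), with hypothetical weight matrices W1 and W2 standing in for the two fully-connected layers (the bottleneck width is C/r for reduction ratio r):

```python
import numpy as np

def feature_recalibration(v, W1, W2, slope=0.01):
    """v: (C,) concatenated feature; W1: (C//r, C); W2: (C, C//r)."""
    z = W1 @ v                            # bottleneck projection
    z = np.where(z > 0, z, slope * z)     # leaky ReLU (delta)
    s = 1.0 / (1.0 + np.exp(-(W2 @ z)))   # sigmoid (sigma), eq. (4)
    return v * s                          # channel-wise rescaling of v
```

Because the sigmoid bounds each scale factor to (0, 1), every channel is attenuated in proportion to its learned importance rather than amplified.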
Deep length normalization
As in [8], deep length normalization is applied to the proposed model. The L2 constraint is applied along the length axis of the recalibrated feature V' with a scale parameter α:

f_DLN(V') = α V' / ||V'||. (5)
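Equation (5) fixes the L2 norm of the recalibrated feature to the scale parameter α; a one-line numpy sketch:

```python
import numpy as np

def deep_length_normalization(v, alpha=10.0):
    # Scale v so that its L2 norm equals alpha, as in eq. (5)
    return alpha * v / np.linalg.norm(v)
```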
Then, the normalized feature f_DLN(V') is fed into the output layer for speaker classification. This feature is used as the speaker embedding.
Experiments
Datasets
Training.
In our experiments, we used the VoxCeleb1 [27] and VoxCeleb2 [16] training datasets, which were collected in real environments. These are large-scale text-independent speaker verification datasets, consisting of more than 100 thousand and 1 million utterances with 1211 and 5994 speakers, respectively.
Evaluation.
We used the VoxCeleb1 evaluation dataset, which includes 40 speakers and 37,220 trial pairs from the official test protocol. This is an open-set test, in which all evaluated speaker pairs are unseen in the training dataset.
Experimental setup
Input features.
We used 64-dimensional log Mel-filterbanks with a 25 ms frame length and a 10 ms frame shift, mean-variance normalized over a sliding window of up to 3 s. For each training step, a 12 s interval was extracted from each utterance using cropping or padding.
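A hedged sketch of the fixed-interval extraction; the paper does not specify its padding scheme, so zero-padding is an assumption here:

```python
import numpy as np

def crop_or_pad(feat, target_frames):
    """feat: (D, T) log-Mel features; returns (D, target_frames)."""
    D, T = feat.shape
    if T >= target_frames:
        start = np.random.randint(0, T - target_frames + 1)  # random crop
        return feat[:, start:start + target_frames]
    out = np.zeros((D, target_frames), dtype=feat.dtype)     # zero-pad the tail
    out[:, :T] = feat
    return out
```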
Preprocessing.
In the training, the SpecAugment method was used to perform time and frequency masking on the input features [28].
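A simplified sketch of time and frequency masking in the spirit of SpecAugment [28]; the mask widths are hypothetical, since the paper does not report its masking parameters:

```python
import numpy as np

def spec_mask(feat, max_freq_width=8, max_time_width=20, rng=None):
    """Zero out one random frequency band and one random time band.
    feat: (D, T) log-Mel feature; returns a masked copy."""
    rng = rng or np.random.default_rng()
    out = feat.copy()
    D, T = out.shape
    f = rng.integers(0, max_freq_width + 1)   # frequency mask width
    f0 = rng.integers(0, D - f + 1)
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)   # time mask width
    t0 = rng.integers(0, T - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out
```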
Hyper-parameters.
We used a cross-entropy-based softmax loss function. We also used the stochastic gradient descent optimizer with a momentum of 0.9, a weight decay of 0.0001, and an initial learning rate of 0.1, reduced by a factor of 0.1 on plateau. All experiments were trained for 200 epochs with a mini-batch size of 96. The scaling parameter α was initialized to 10, and the reduction ratio r was initialized to 8.
Evaluation metrics.
From the trained model, we generated a 512-dimensional speaker embedding for each utterance. We used the standard cosine similarity to score each speaker pair, and the equal error rate (EER, %) and minimum detection cost function (minDCF) as evaluation metrics.
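Scoring and EER computation can be sketched as follows (a simplified sweep over score thresholds; production toolkits typically use an ROC-based routine instead):

```python
import numpy as np

def cosine_score(a, b):
    # cosine similarity between two speaker embeddings
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """scores: similarity per trial; labels: 1 = same speaker, 0 = different.
    Returns the operating point where false acceptance and rejection rates meet."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```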
Experimental results
To evaluate the proposed methods, we first tested the baseline using different encoding methods. We then compared our proposed method with state-of-the-art encoding methods.
Table 3: Experimental results for modifying the baseline construction, using the VoxCeleb1 training dataset (Dim: speaker embedding dimension)
Model | Encoding method | Dim | EER | minDCF
Scaled ResNet-34 | GAP | 256 | 6.85 | 0.3389
Scaled ResNet-34 | SAP | 256 | 6.68 | 0.3402
Scaled ResNet-34 | MLA-SAP | 512 | 5.42 | 0.3025
Scaled ResNet-34 | MLA-SAP-FR | 512 | 5.07 | 0.2932
Scaled ResNet-34 | MLA-SAP-FR-DLN | 512 | 4.95 | 0.2902
First comparison.
Table 3 presents the results of modifying the baseline described in Table 1. We experimented with basic encoding layers, e.g., GAP and SAP. Then, we added the proposed methods to the baseline one by one, i.e., self-attentive multi-layer aggregation (MLA-SAP), feature recalibration (FR), and deep length normalization (DLN).
Second comparison.
Table 4 shows a comparison of our proposed methods with state-of-the-art encoding methods. Here, we focused on encoding methods using a ResNet-based model and the softmax loss function in the end-to-end training process. Various encoding methods were compared, e.g., TAP [7, 16], learnable dictionary encoding (LDE) [7], SAP [7], GAP [15], NetVLAD [4], and GhostVLAD [4]. The experimental results showed that our proposed method was superior to the state-of-the-art methods (EER of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
Table 4: Experimental results comparing our proposed methods with state-of-the-art encoding methods using the VoxCeleb1 and VoxCeleb2 training datasets
Model | Encoding method | Dim | EER
ResNet-34 [7]* | TAP | 128 | 5.48
ResNet-34 [7]* | LDE | 128 | 5.21
ResNet-34 [7]* | SAP | 128 | 5.51
ResNet-34 [15]* | GAP | 256 | 5.39
Scaled ResNet-34 (proposed)* | MLA-SAP-FR-DLN | 512 | 4.95
ResNet-34 [16] | TAP | 512 | 5.04
ResNet-50 [16] | TAP | 512 | 4.19
Thin-ResNet-34 [4] | NetVLAD | 512 | 3.57
Thin-ResNet-34 [4] | GhostVLAD | 512 | 3.22
Scaled ResNet-34 (proposed) | MLA-SAP-FR-DLN | 512 | 2.86
* These models used the VoxCeleb1 training dataset.
Conclusions
In previous multi-layer aggregation methods for end-to-end speaker verification, the number of model parameters was relatively large and unspecified variations increased during training. Therefore, we proposed self-attentive multi-layer aggregation with feature recalibration and normalization for an end-to-end speaker verification system. First, a ResNet with scaled channel width and layer depth was used as the baseline. Second, self-attentive multi-layer aggregation was applied when training the frame-level features of each residual layer in the scaled ResNet. Third and fourth, the feature recalibration layer and deep length normalization were applied to train the utterance-level feature in the encoding layer. The experimental results using the VoxCeleb1 evaluation dataset showed that the proposed method achieved performance comparable to state-of-the-art models.
Acknowledgements
This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test).
References
[1] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052-4056.
[2] Y. Chen, I. Lopez-Moreno, T. N. Sainath, M. Visontai, R. Alvarez, and C. Parada, "Locally-connected and convolutional neural networks for small footprint speaker recognition," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 1136-1140.
[3] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 999-1003.
[4] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5791-5795.
[5] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2011, pp. 249-252.
[6] G. Bhattacharya, J. Alam, and P. Kenny, "Deep speaker embeddings for short-duration speaker verification," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 1517-1521.
[7] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," Proc. of Speaker and Language Recognition Workshop (Odyssey), 2018, pp. 74-81.
[8] W. Cai, J. Chen, and M. Li, "Analysis of length normalization in end-to-end speaker verification system," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3618-3622.
[9] J. Jung, H. Heo, J. Kim, H. Shim, and H. Yu, "RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 1268-1272.
[10] S. Seo, D. J. Rim, M. Lim, D. Lee, H. Park, J. Oh, C. Kim, and J. Kim, "Shortcut connections based deep speaker embeddings for end-to-end speaker verification system," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 2928-2932.
[11] Z. Gao, Y. Song, I. McLoughlin, P. Li, Y. Jiang, and L. Dai, "Improving aggregation and loss function for better embedding learning in end-to-end speaker verification system," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 361-365.
[12] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115-5119.
[13] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," Proc. of the IEEE Workshop on Spoken Language Technology (SLT), 2016, pp. 165-170.
[14] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: An end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[15] I. Kim, K. Kim, J. Kim, and C. Choi, "Deep speaker representation using orthogonal decomposition and recombination for speaker verification," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6126-6130.
[16] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 1086-1090.
[17] Z. Gao, Y. Song, I. McLoughlin, W. Guo, and L. Dai, "An improved deep embedding learning method for short duration speaker," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3578-3582.
[18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329-5333.
[19] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3573-3577.
[20] M. India, P. Safari, and J. Hernando, "Self multi-head attention for speaker recognition," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 4305-4309.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[22] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
[23] J. Zhou, T. Jiang, Z. Li, L. Li, and Q. Hong, "Deep speaker embedding extraction with channel-wise feature responses and additive supervision softmax loss function," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 2883-2887.
[24] Y. Tang, G. Ding, J. Huang, X. He, and B. Zhou, "Deep speaker embedding learning with multi-level pooling for text-independent speaker verification," Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6116-6120.
[25] Z. Yan, H. Zhang, Y. Jia, T. Breuel, and Y. Yu, "Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation," arXiv preprint arXiv:1603.04871, 2016.
[26] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616-2620.
[28] D. Park, W. Chan, C. Chiu, B. Zoph, E. Cubuk, and Q. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Proc. of Conference of the International Speech Communication Association (INTERSPEECH), 2019.