Intra-class variation reduction of speaker representation in disentanglement framework
Yoohwan Kwon, Soo-Whan Chung and Hong-Goo Kang
Department of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea [email protected]
1. Abstract
In this paper, we propose an effective training strategy to extract robust speaker representations from a speech signal. One of the key challenges in speaker recognition tasks is to learn latent representations or embeddings that contain solely speaker characteristic information, so that they are robust to intra-speaker variations. By modifying the network architecture to generate both speaker-related and speaker-unrelated representations, we exploit a learning criterion which minimizes the mutual information between these disentangled embeddings. We also introduce an identity change loss criterion which applies a reconstruction error to different utterances spoken by the same speaker. Since the proposed criteria reduce the variation of speaker characteristics caused by changes in background environment or spoken content, the resulting embeddings of each speaker become more consistent. The effectiveness of the proposed method is demonstrated through two evaluations: disentanglement performance, and the improvement in speaker recognition accuracy over the baseline model on a benchmark dataset, VoxCeleb1. Ablation studies also show the impact of each criterion on overall performance.
Index Terms: speaker verification, disentanglement, mutual information
2. Introduction
Speaker recognition systems have been studied for many years due to their usefulness in various applications. Recently, the accuracy of speaker recognition has dramatically improved thanks to advances in deep learning and the availability of large-scale training datasets. The main objective of deep learning-based speaker recognition is to extract a high dimensional embedding vector that uniquely represents the characteristics of each speaker. The d-vector [1, 2] and x-vector [3] are typical examples; both are estimated via an identity classification task with an encoder-style network. The detailed extraction process differs with respect to the type of network structure and the objective function, such as softmax, triplet, and angular softmax [4]. However, given that the extracted embeddings also include speaker-unrelated information, there remains room for further improvement.

To overcome the aforementioned limitation inherent to the encoder-style framework, a method for disentangling the embeddings into relevant and irrelevant speaker information was proposed [5]. The method consists of two encoders, a speaker purifying encoder and a dispersing encoder, as well as a decoder for reconstruction. While the speaker purifying encoder is trained with the original speaker classification scheme, the dispersing encoder is trained with an adversarial training scheme designed to fool the classifier so that it cannot correctly identify the speaker. The two encoded features are then concatenated and fed to the decoder, which applies a reconstruction loss against the original input so that all information is embedded within the representative features. In other words, the method decomposes the entire speech information into speaker identity-related and identity-unrelated parts. Although the speaker and non-speaker embeddings are learned effectively using the adversarial classifier, the method does not directly address the task of dispersing both embeddings simultaneously during disentanglement. There is an opportunity to improve disentanglement performance by adopting a method which considers the relation between the embeddings simultaneously.

In this paper, we propose a method to effectively disentangle speaker identity-related and identity-unrelated information using several types of criteria. We first introduce a criterion that minimizes the mutual information between speaker-related and -unrelated representations, which is beneficial because it directly considers the relation between those features. We also propose a novel identity change criterion which measures the difference between the input and generated mel-spectra. The reconstructed mel-spectrum used for the identity change loss is generated from a speaker embedding extracted from one utterance and a residual embedding extracted from another utterance of the same speaker. Since the criterion enforces speaker embeddings to be similar across different utterances, it reduces the intra-variation within each speaker's cluster. The main contributions of this paper are as follows: (1) we propose an effective method for disentangling identity-related and identity-unrelated information using a mutual information criterion within an auto-encoder framework; (2) we introduce a speaker identity change loss criterion to further enhance the performance of speaker embeddings; (3) we use this framework to improve speaker verification performance on benchmark datasets. The remainder of the paper is organized as follows.
Section 3 presents a brief overview of related works on speaker embedding and disentanglement. In Section 4, we present the details of the proposed method, such as the network architectures and loss functions. Experimental results are presented in Section 5, and the conclusion follows in Section 6.
3. Related works
Speaker embedding vectors are high-level representations (typically obtained via deep neural networks) that aim to compactly represent a speaker's identity. They are very important for many applications such as speaker recognition and diarization. Various speaker embedding methods exist that differ in the type of network architecture, feature aggregation, and training criteria. Deep learning architectures such as DNN- [1, 6, 7], CNN- [2, 8-11], or LSTM-based ones [12] first extract frame-level features from a variable-length utterance. Then, a pooling method [13-16] aggregates the frame-level features into a fixed-length utterance-level representation. In terms of the objective function, the models are trained either on a classification task with a softmax or angular softmax criterion, or on a metric learning task using a contrastive loss [2, 8], a triplet loss [9], and others [17, 18]. Nevertheless, there is still room for improvement if we introduce the concept of target-unrelated information to the extracted embedding features.

Disentanglement is a learning technique that represents the input signal's characteristics through multiple separated dimensions or embeddings. It is therefore beneficial for obtaining representations that contain certain attributes or for extracting discriminative features. Adversarial training [19-23] and reconstruction-based training [24-28] are widely used to obtain disentangled representations. Tai et al. [5] proposed a disentanglement method for speaker recognition that serves as the baseline for our work. By constructing an identity-related and an identity-unrelated encoder, they trained each encoder to represent only speaker-related and -unrelated information using a speaker identification loss and an adversarial training loss. They also adopted an auto-encoder framework to maintain all input speech information within the output embeddings, preserving it through spectral reconstruction.
Mutual information (MI) based feature learning methods have been popular for a long time, but they are often difficult to apply to deep learning-based approaches because it is not easy to calculate the MI of high dimensional continuous variables. Recently, a mutual information neural estimator (MINE) [29] was proposed to estimate mutual information with a neural network architecture. By definition, the MI is equivalent to the Kullback-Leibler (KL) divergence between the joint distribution, $P_{X,Y}$, and the product of marginals, $P_X \otimes P_Y$. According to the Donsker-Varadhan representation [30], the lower bound of mutual information can be represented by:

$$ I(X, Y) \geq \sup_{\theta} \mathbb{E}_{P_{X,Y}}[T_\theta] - \log\big(\mathbb{E}_{P_X \otimes P_Y}[e^{T_\theta}]\big). \quad (1) $$

The function $T$ is implemented as a neural network with parameters $\theta$, and its output can be considered an approximation of the mutual information between $X$ and $Y$. It has been widely used in recent works on feature learning [31-33].
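To make the bound in Equation 1 concrete, the sketch below shows how a MINE-style statistics network could estimate the MI lower bound from one batch of paired embeddings. This is a minimal PyTorch illustration under our own assumptions (the network width and the shuffling trick used to approximate the product of marginals), not the implementation used in this paper or in [29].

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """T_theta in Eq. (1): maps a pair of vectors to a scalar score."""
    def __init__(self, dim_x, dim_y, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        # x: (batch, dim_x), y: (batch, dim_y) -> (batch, 1) scores
        return self.net(torch.cat([x, y], dim=-1))

def mine_lower_bound(T, x, y):
    """Donsker-Varadhan estimate of I(X; Y) from a batch of paired samples.

    Joint samples come from the paired batch; the product of marginals is
    approximated by shuffling y within the batch (an assumption of this
    sketch, common in MINE implementations).
    """
    joint = T(x, y).mean()                          # E_{P_XY}[T_theta]
    y_shuffled = y[torch.randperm(y.size(0))]       # break the pairing
    log_mean_exp = (torch.logsumexp(T(x, y_shuffled), dim=0)
                    - torch.log(torch.tensor(float(y.size(0)))))
    return joint - log_mean_exp                     # lower bound on MI
```

Maximizing this estimate with respect to the parameters of $T$ tightens the bound; an encoder can then be trained against the resulting estimate to increase or reduce the MI between its outputs.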
4. Proposed Method
The main goal of the proposed algorithm is to extract a high-level latent embedding that contains only speaker-related information. To achieve this goal, we propose a disentanglement method that decouples speaker information from the input signal, so that the embedding represents the speaker's identity while remaining robust to variations in linguistic information.
Figure 1 illustrates the proposed training strategies in our disentanglement method. Our network consists of three modules: a speaker encoder $E_{spk}$, a residual encoder $E_{res}$, and a decoder $D_r$; $f_{spk}$ and $f_{res}$ denote the output features of $E_{spk}$ and $E_{res}$, respectively. Our method reconstructs the mel-scaled spectrum instead of the magnitude spectrum so that it efficiently disentangles the embeddings without losing speaker information.

The network is trained with the learning criteria used in the baseline model, depicted in Figure 1a (speaker loss, disentanglement loss, and reconstruction loss), together with an auxiliary loss that minimizes the intra-variance of clusters: our novel criterion, the identity change loss. We also modify the disentanglement loss, which in the baseline method uses an adversarial classifier on the residual embedding, into the mutual information between $f_{spk}$ and $f_{res}$.

In the remainder of this section, we describe the objective functions used for training: the speaker loss $\mathcal{L}_S$, the disentanglement loss $\mathcal{L}_{MI}$, the reconstruction loss $\mathcal{L}_R$, and the identity change loss $\mathcal{L}_{IC}$. The total objective function consists of the four loss functions:

$$ \mathcal{L}_{total} = \lambda_1 \mathcal{L}_S + \lambda_2 \mathcal{L}_{MI} + \lambda_3 \mathcal{L}_R + \lambda_4 \mathcal{L}_{IC}. \quad (2) $$

The hyper-parameters $[\lambda_1, \ldots, \lambda_4]$ are set based on experimental results, with $\lambda_1 = 1$ and the remaining weights set to values smaller than 1.
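As a rough sketch of how the modules and the weighted objective of Equation 2 could be wired together, the snippet below combines the two encoders and the decoder into a single training loss. The module and argument names are placeholders of this sketch (the paper does not publish code), the individual loss terms are delegated to callables implementing Equations 3, 5, 6, and 7, and, for readability, the same same-speaker pair feeds both Equation 5 and Equation 7, whereas the paper draws the Equation 5 pair from two offsets of one signal.

```python
import torch
import torch.nn.functional as F

def total_loss(mel_a, mel_b, labels, spk_enc, res_enc, decoder,
               classifier, mi_loss_fn, ic_loss_fn, lambdas):
    """Weighted objective of Eq. (2) for a batch of same-speaker pairs
    (mel_a, mel_b). `classifier` maps f_spk to speaker logits (Eq. 3);
    `mi_loss_fn` implements Eq. (5) and `ic_loss_fn` implements Eq. (7)."""
    f_spk_a, f_res_a = spk_enc(mel_a), res_enc(mel_a)
    f_spk_b, f_res_b = spk_enc(mel_b), res_enc(mel_b)

    l_s = F.cross_entropy(classifier(f_spk_a), labels)         # Eq. (3)
    l_mi = mi_loss_fn(f_spk_a, f_res_a, f_spk_b, f_res_b)      # Eq. (5)
    recon = decoder(torch.cat([f_spk_a, f_res_a], dim=-1))
    l_r = F.mse_loss(recon, mel_a)                             # Eq. (6)
    l_ic = ic_loss_fn(decoder, f_spk_a, f_res_a, mel_a,
                      f_spk_b, f_res_b, mel_b)                 # Eq. (7)

    return (lambdas[0] * l_s + lambdas[1] * l_mi
            + lambdas[2] * l_r + lambdas[3] * l_ic)
```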
Speaker loss. The objective of the speaker loss is to embed the speaker representation $f_{spk}$ into the latent space using the encoder $E_{spk}$, as done in [4, 8, 9, 12]. Following the baseline model, the speaker encoder is trained on a speaker label classification task using a cross-entropy criterion. The loss function is defined as:

$$ \mathcal{L}_S = -\sum_{i=1}^{C} t_i \log(\mathrm{softmax}(f_{spk})_i), \quad (3) $$

where $C$ is the number of speakers and $t$ is the label index.
Figure 1: Overview of proposed training criteria. (a) Training criteria based on [5]: speaker loss, disentanglement loss and reconstruction loss. (b) Identity change loss: switch the speaker embedding to the mean of the two. (c) Mutual information loss: estimate the mutual information from speaker and residual embeddings by MINE.

Disentanglement loss. In the disentanglement mechanism, the residual embedding $f_{res}$ contains the information which is not included in the speaker vector $f_{spk}$. The baseline method adopts adversarial classification to embed the residual of the speaker characteristics. The adversarial classifier shares the network parameters used in the speaker loss, whereas its objective is to eliminate speaker information by fooling the classifier. The residual encoder $E_{res}$ is trained not to predict any speaker label by targeting a uniform distribution, defined as follows:

$$ \mathcal{L}_{adv} = \frac{1}{C} \sum_{j=1}^{C} \log(\mathrm{softmax}(f_{res})_j), \quad (4) $$

where $C$ is the number of classes.

In our strategy, we attempt disentanglement using the mutual information between $f_{spk}$ and $f_{res}$ instead of adversarial learning. Since genuine disentanglement is achieved by dispersing residual information rather than by embedding the features separately, we consider both $f_{spk}$ and $f_{res}$ in the disentanglement criterion. Here, we adopt the MINE method, which handles the correspondence between the embeddings using deep learning approaches. In [32], MINE controls the information differences between speakers: minimizing them for the same speaker and maximizing them for different speakers. In our paper, MINE maximizes the discrepancy between the disentangled features ($f_{spk}$, $f_{res}$) and minimizes the discrepancy between speaker representations extracted from different segments of the same speech signal, as shown in Figure 1c. The criterion is designed as Equation 5:

$$ \mathcal{L}_{MI} = \mathbb{E}[T_\theta(f^A_{spk}, f^{A'}_{spk})] - \log\big(\mathbb{E}[e^{T_\theta(f^A_{spk}, f^A_{res})}]\big) + \mathbb{E}[T_\theta(f^{A'}_{spk}, f^A_{spk})] - \log\big(\mathbb{E}[e^{T_\theta(f^{A'}_{spk}, f^{A'}_{res})}]\big), \quad (5) $$

where $f^A_{spk}$ and $f^{A'}_{spk}$ represent the same speaker's embeddings extracted from the same speech signal with different offsets, and $f^A_{res}$ and $f^{A'}_{res}$ are their residual embeddings. The criterion retains the information common to the speaker embeddings while dispersing the residual information away from them.
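A compact sketch of how Equation 5 could be computed with a MINE statistics network like the one sketched in Section 3 is shown below. The batching of segments A and A', and the sign convention when this term enters Equation 2, are our assumptions rather than details taken from the paper.

```python
import torch

def log_mean_exp(scores):
    """Numerically stable log of the batch mean of exp(scores)."""
    return torch.logsumexp(scores, dim=0) - torch.log(
        torch.tensor(float(scores.size(0))))

def mi_criterion(T, f_spk_a, f_res_a, f_spk_a2, f_res_a2):
    """Eq. (5): treat (f_spk^A, f_spk^A') as joint ("keep together") pairs
    and each segment's (f_spk, f_res) as the pairs to disperse.
    T is a MINE statistics network over pairs of embeddings."""
    term_a = (T(f_spk_a, f_spk_a2).mean()
              - log_mean_exp(T(f_spk_a, f_res_a)))
    term_a2 = (T(f_spk_a2, f_spk_a).mean()
               - log_mean_exp(T(f_spk_a2, f_res_a2)))
    return term_a + term_a2
```

In training, the statistics network is updated to tighten this estimate, while the encoders are updated to increase the speaker-speaker term and decrease the speaker-residual terms.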
Reconstruction loss. The disentangled embeddings $f_{spk}$ and $f_{res}$ should preserve the spectral information of the input spectrum when they are combined. The decoder $D_r(f_{spk}, f_{res})$ is trained to generate a reconstructed spectrum from the concatenated embeddings. The reconstruction loss $\mathcal{L}_R$ is defined by measuring the distance between the input and the reconstructed spectrum using an MSE criterion as follows:

$$ \mathcal{L}_R = \| D_r(f_{spk}, f_{res}) - S_{mel} \|^2, \quad (6) $$

where $S_{mel}$ is the mel-spectrum of the input speech signal $S$. Reconstructing the mel-spectrum instead of a magnitude spectrum reduces the burden on the decoder during spectrum generation, while still encouraging the embeddings to contain all of the input information.
Table 1: Verification results on the VoxCeleb1 test set. S, C and AM are Softmax, Contrastive and Angular margin loss, respectively.

    Model              Architecture   Criterion   EER
    Chung et al. [2]   Encoder        S + C       5.04%
    Xie et al. [16]    Encoder        S           5.02%
    Tai et al. [5]     Enc(2)+Dec     S           3.83%
    Proposed           Enc(2)+Dec     S           3.18%
    Proposed           Enc(2)+Dec     AM          -

Identity change (IC) loss. The intra-class variance inevitable in each speaker cluster is caused by variations in linguistic information, recording environment, and the speaker's emotional or health state. To further improve speaker recognition performance by minimizing the intra-class variance of speaker clusters, we propose the identity change loss. Instead of minimizing the intra-class variance directly, we use a reconstruction loss criterion that measures the spectral distance between the reference and the reconstruction. Since the reconstructed mel-spectrum is generated by substituting the identity embedding with one extracted from a different utterance spoken by the same speaker, perfect reconstruction is only obtained when the substitute embedding has the same distribution as the original identity. The identity change loss is described in Equation 7:

$$ \mathcal{L}_{IC} = \| \hat{S}^A - S^A \| + \| \hat{S}^B - S^B \|, \quad \hat{S}^A = D_r\Big(\frac{f^A_{spk} + f^B_{spk}}{2}, f^A_{res}\Big), \quad \hat{S}^B = D_r\Big(\frac{f^A_{spk} + f^B_{spk}}{2}, f^B_{res}\Big), \quad (7) $$

where $S^A$ and $S^B$ are the mel-spectra of speech signals $A$ and $B$ spoken by the same speaker, and $\hat{S}^A$ and $\hat{S}^B$ are the mel-spectra reconstructed with the substituted identities. In the proposed method, $f^A_{spk}$ and $f^B_{spk}$ are substituted with the mean of the two identities, as depicted in Figure 1b; this guides the speaker embeddings in the direction that minimizes the intra-class variance.
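The identity change loss of Equation 7 is straightforward to express in code. Below is a minimal sketch under our own naming assumptions; feeding the decoder the concatenation of the two embeddings follows the reconstruction path described above.

```python
import torch
import torch.nn.functional as F

def identity_change_loss(decoder, f_spk_a, f_res_a, mel_a,
                         f_spk_b, f_res_b, mel_b):
    """Eq. (7): A and B are two utterances of the same speaker. Both
    reconstructions use the mean of the two speaker embeddings, so the
    error is small only when the two embeddings already agree."""
    f_mean = 0.5 * (f_spk_a + f_spk_b)                      # mean identity
    s_a_hat = decoder(torch.cat([f_mean, f_res_a], dim=-1))
    s_b_hat = decoder(torch.cat([f_mean, f_res_b], dim=-1))
    return F.mse_loss(s_a_hat, mel_a) + F.mse_loss(s_b_hat, mel_b)
```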
5. Experiments
We train our model on VoxCeleb2 [2], a large-scale audio-visual dataset containing over 1 million utterances from 5,994 celebrities extracted from YouTube videos. We evaluate our model on the VoxCeleb1 [8] test set, which consists of 677 clips spoken by 40 speakers. For training, clips are segmented into 3 seconds with a random offset within each utterance. They are sliced every 10 ms with a 25 ms window length and transformed into a log-magnitude spectrum with an FFT size of 512; thus, the dimension of the input speech features is 257 x 300. For reconstruction, we prepare log-scale mel-spectrograms computed with 64 mel-filterbanks as the output targets.
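For reference, the feature configuration above could be reproduced roughly as follows with torchaudio; the 16 kHz sample rate and the log offset are assumptions of this sketch, not stated in the paper.

```python
import torch
import torchaudio

# 25 ms window (400 samples at 16 kHz), 10 ms hop (160 samples), 512-pt FFT.
spec = torchaudio.transforms.Spectrogram(
    n_fft=512, win_length=400, hop_length=160, power=1.0)
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=400, hop_length=160, n_mels=64)

waveform = torch.randn(1, 16000 * 3)            # a random 3-second segment
x_in = torch.log(spec(waveform) + 1e-6)         # (1, 257, ~300) log-magnitude
x_target = torch.log(melspec(waveform) + 1e-6)  # (1, 64, ~300) log-mel target
# Exact frame count depends on the padding convention (centered frames give 301).
```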
The structures of the speaker encoder and the residual encoder are based on ResNet34, with small changes to the pooling strategy. Both encoders use a time average pooling (TAP) method to embed variable-length input features into a fixed dimension at the utterance level. The decoder consists of 3 fully-connected layers and 9 transposed convolutional layers, following [34]. In the training phase, the batch size is set to 32 and the model is trained with the Adam optimizer [35]. The learning rate is set to 1e-3 and halved every 10 epochs until convergence. In phase I, the network is pre-trained using the speaker loss, disentanglement loss and reconstruction loss, similar to the baseline strategy. Depending on the experimental setup, either the adversarial loss or the mutual information loss is used.

Table 2: Ablation study of the proposed method.

    Model      L_S   L_R   L_adv   L_MI   L_IC   EER (%)
    Baseline   yes   yes   yes     -      -      3.83
    Proposed   yes   yes   yes     yes    -      3.71
    Proposed   yes   yes   -       yes    -      3.81
    Proposed   yes   yes   yes     -      yes    -
    Proposed   yes   yes   -       yes    yes    3.18
Phase II. Identity change training.
During phase II, we consider an efficient training strategy for the identity change loss. Its motivation is to disperse information by setting one embedding as an anchor while the other is adapted stably. The detailed process is shown below, and the two stages are applied recursively (a sketch of one such iteration is given at the end of this section):

1. Intra-class minimization: The identity is replaced by the mean of the two identities to generate the mel-spectrogram, and the reconstruction error $\mathcal{L}_{IC}$ is minimized through back-propagation on the decoder and the residual encoder.

2. Adaptation: The original identity is fed to the decoder, and the parameters of the decoder and the speaker encoder are updated to minimize the reconstruction error $\mathcal{L}_R$.

Results. We compare the performance of our models with that of conventional models and analyze the impact of each loss function on overall performance through an ablation study under the same settings. All models used for comparison were re-implemented by us. Table 1 shows the equal error rate (EER) obtained on the VoxCeleb1 [8] test set, where we compare our models with the encoder model [16] and the disentanglement model [5]. With the standard softmax loss and TAP aggregation, our model outperforms previous models based on the ResNet encoder by 36.6% and the disentanglement model using an adversarial method [5] by 16.9%. These results demonstrate that the embeddings produced by the proposed disentanglement approach are more informative than those of the baseline. The proposed method trained with the angular margin softmax provided the best results among our experiments.
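As referenced above, here is a minimal sketch of one recursive phase-II iteration. The optimizer grouping (one optimizer over the decoder and residual encoder, another over the decoder and speaker encoder), the detaching of anchor embeddings, and the loss callables are assumptions of this sketch, not published implementation details.

```python
import torch
import torch.nn.functional as F

def phase2_step(mel_a, mel_b, spk_enc, res_enc, decoder,
                ic_loss_fn, opt_ic, opt_adapt):
    """One recursive phase-II iteration over a same-speaker pair (A, B).
    `opt_ic` holds the decoder and residual encoder parameters;
    `opt_adapt` holds the decoder and speaker encoder parameters."""
    # Stage 1: intra-class minimization. The speaker embeddings act as
    # anchors, so they are detached before the identity change loss.
    f_spk_a = spk_enc(mel_a).detach()
    f_spk_b = spk_enc(mel_b).detach()
    loss_ic = ic_loss_fn(decoder, f_spk_a, res_enc(mel_a), mel_a,
                         f_spk_b, res_enc(mel_b), mel_b)
    opt_ic.zero_grad()
    loss_ic.backward()
    opt_ic.step()

    # Stage 2: adaptation. The original identity is fed back, and the
    # decoder and speaker encoder minimize the plain reconstruction error.
    f_res_a = res_enc(mel_a).detach()
    recon_a = decoder(torch.cat([spk_enc(mel_a), f_res_a], dim=-1))
    loss_r = F.mse_loss(recon_a, mel_a)
    opt_adapt.zero_grad()
    loss_r.backward()
    opt_adapt.step()
```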
Ablation study.
Table 2 shows the equal error rates (EERs) obtained in the ablation studies, which indicate the effectiveness of the loss functions used in the proposed model. First, we trained the model using the mutual information criterion with and without the adversarial criterion. The results confirm that minimizing the mutual information between speaker and residual embeddings is effective for disentangling speaker information. Unlike adversarial training, which is applied to the encoders independently, the mutual information is calculated between the speaker and residual embeddings simultaneously, resulting in stronger disentanglement performance. Among these experiments, even the configuration without the adversarial criterion outperforms the baseline, with an EER of 3.81%. We then conducted further experiments to investigate the effect of the identity change loss. The results show that the identity change loss improves the performance of the speaker embeddings, with the best result obtained when the model is trained with the mutual information and identity change loss criteria together, giving an EER of 3.18%.

Figure 2: t-SNE plots of extracted embeddings, from 10 speakers with 20 utterances each; each color corresponds to a different speaker. (a) and (b) are extracted from the baseline model; (c) and (d) are from our proposed model.

Figure 2 illustrates t-SNE plots [36] that visualize the effectiveness of the proposed method more concretely. As shown in Figures 2c and 2d, the proposed model also effectively disentangles speaker-related and speaker-unrelated information. Moreover, comparing the baseline with the proposed model in Figures 2a and 2c, our method shows more densely clustered identities with smaller variance. These experimental results show that the mutual information loss and the identity change loss are helpful for learning clearly disentangled features for speaker recognition.
6. Conclusion
In this paper, we present a novel disentanglement training scheme to estimate more informative speaker embedding vectors for robust speaker recognition. Our method is built upon an auto-encoder framework with two encoders and is trained with mutual information and identity change losses, which extract more discriminative representations by reducing the intra-cluster variance. Experimental results demonstrated that our algorithm achieves an improved EER compared to the baseline method. Through ablation experiments, we demonstrated the impact of each criterion on the overall performance.
Acknowledgements.
This research is sponsored by Naver Corporation.

7. References

[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052-4056.
[2] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329-5333.
[4] Z. Huang, S. Wang, and K. Yu, "Angular softmax for short-duration text-independent speaker verification," in Interspeech, 2018, pp. 3623-3627.
[5] J. Tai, X. Jia, Q. Huang, W. Zhang, and S. Zhang, "SEF-ALDR: A speaker embedding framework via adversarial learning based disentangled representation," arXiv: Audio and Speech Processing, 2020.
[6] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115-5119.
[7] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in SLT. IEEE, 2016, pp. 165-170.
[8] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[9] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep Speaker: An end-to-end neural speaker embedding system," arXiv preprint arXiv:1705.02304, 2017.
[10] M. Hajibabaei and D. Dai, "Unified hypersphere embedding for speaker recognition," arXiv preprint arXiv:1807.08312, 2018.
[11] Y. Jung, S. M. Kye, Y. Choi, M. Jung, and H. Kim, "Improving multi-scale aggregation using feature pyramid module for robust speaker verification of variable-duration utterances," arXiv preprint arXiv:2004.03194, 2020.
[12] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879-4883.
[13] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999-1003.
[14] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," arXiv preprint arXiv:1804.05160, 2018.
[15] W. Cai, J. Chen, and M. Li, "Analysis of length normalization in end-to-end speaker verification system," arXiv preprint arXiv:1806.03209, 2018.
[16] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," in ICASSP. IEEE, 2019, pp. 5791-5795.
[17] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, "In defence of metric learning for speaker recognition," arXiv preprint arXiv:2003.11982, 2020.
[18] S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, and H. Kim, "Meta-learning for short utterance speaker recognition with imbalance length pairs," arXiv preprint arXiv:2004.02863, 2020.
[19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096-2030, 2016.
[20] J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, "Training multi-task adversarial network for extracting noise-robust speaker embedding," in ICASSP. IEEE, 2019, pp. 6196-6200.
[21] Z. Meng, Y. Zhao, J. Li, and Y. Gong, "Adversarial speaker verification," in ICASSP. IEEE, 2019, pp. 6216-6220.
[22] X. Peng, Z. Huang, X. Sun, and K. Saenko, "Domain agnostic learning with disentangled representations," arXiv preprint arXiv:1904.12347, 2019.
[23] G. Bhattacharya, J. Monteiro, J. Alam, and P. Kenny, "Generative adversarial speaker embedding networks for domain robust end-to-end speaker verification," in ICASSP. IEEE, 2019, pp. 6226-6230.
[24] J. Zhang, Z. Ling, and L.-R. Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[25] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv preprint arXiv:1804.02812, 2018.
[26] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang, "Exploring disentangled feature representation beyond face identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2080-2089.
[27] C. Eom and B. Ham, "Learning disentangled representation for robust person re-identification," in Advances in Neural Information Processing Systems, 2019, pp. 5298-5309.
[28] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio, "Image-to-image translation for cross-domain disentanglement," in Advances in Neural Information Processing Systems, 2018, pp. 1287-1298.
[29] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, "MINE: Mutual information neural estimation," arXiv preprint arXiv:1801.04062, 2018.
[30] M. D. Donsker and S. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time. IV," Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183-212, 1983.
[31] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv preprint arXiv:1808.06670, 2018.
[32] M. Ravanelli and Y. Bengio, "Learning speaker representations with mutual information," arXiv preprint arXiv:1812.00271, 2018.
[33] E. H. Sanchez, M. Serrurier, and M. Ortner, "Learning disentangled representations via mutual information estimation," arXiv preprint arXiv:1912.03915, 2019.
[34] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[36] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.