Neural MOS Prediction for Synthesized Speech Using Multi-Task Learning With Spoofing Detection and Spoofing Type Classification
Yeunju Choi, Youngmoon Jung, Hoirin Kim
School of Electrical Engineering, KAIST, Daejeon, Republic of Korea
{wkadldppdy, dudans, hoirkim}@kaist.ac.kr

Abstract
Several papers have proposed deep-learning-based models to predict the mean opinion score (MOS) of synthesized speech, showing the possibility of replacing human raters. However, inter- and intra-rater variability in MOSs makes it hard to ensure the generalization ability of such models. In this paper, we propose a method using multi-task learning (MTL) with spoofing detection (SD) and spoofing type classification (STC) to improve the generalization ability of a MOS prediction model. In addition, we use the focal loss to maximize the synergy between SD and STC for MOS prediction. Experiments using the results of the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction.
Index Terms: speech synthesis, MOS prediction, multi-task learning, spoofing detection, spoofing type classification
1. Introduction
Speech generation tasks such as text-to-speech and voice conversion have achieved great success in recent years with advances in deep learning [1–5]. In terms of the quality of synthesized speech, state-of-the-art systems have reached human-level performance. However, researchers still rely on a subjective mean opinion score (MOS) test to evaluate the quality of synthesized speech. Researchers need to employ enough human raters and give them a guideline to evaluate synthesized utterances, which is expensive and time-consuming [6]. Additionally, different studies may produce inconsistent results when evaluating the same speech generation system.

Therefore, researchers have recently proposed deep-learning-based models to predict the subjective MOS of synthesized speech [7–9]. Patton et al. [8] proposed AutoMOS, based on long short-term memory (LSTM), to predict the MOS. Lo et al. [9] proposed MOSNet, based on a convolutional neural network-bidirectional LSTM (CNN-BLSTM), which produces an utterance-level MOS from frame-level scores. Furthermore, to evaluate and compare the performance of systems, a system-level MOS is calculated by averaging utterance-level MOSs.

However, inter- and intra-rater variability of MOSs fundamentally limits the generalization ability of the predictor. We expect that multi-task learning (MTL) [10, 11], a well-known regularization technique, can alleviate this problem. MTL helps a model generalize better by using the information in related tasks, which serve as auxiliary tasks for the main task. It has been successfully applied to various research areas [12–17].

As suggested by [18], identifying beneficial auxiliary tasks is important in MTL. For example, for slot filling in language understanding, [19] proposed to use named entity recognition as an auxiliary task. In this work, we propose to apply MTL with spoofing detection (SD) and spoofing type classification (STC) to help MOS prediction. To the best of our knowledge, this is the first work that uses MTL for MOS prediction. We also present a detailed analysis of the effect of our approach. In addition, we use the focal loss [20] for SD to improve our MTL approach. Experiments using the evaluation results of the Voice Conversion Challenge (VCC) 2018 show that both auxiliary tasks help MOS prediction. They also demonstrate that SD can create more synergy with STC by using the focal loss.
2. Proposed Methodology
2.1. MOSNet

For MOS prediction, we adopt the recently proposed MOSNet [9], which predicts the MOS of synthesized speech with a deep neural network trained on a large open dataset. In [9], three different architectures were proposed: CNN, BLSTM, and CNN-BLSTM. Among them, we use the CNN-BLSTM-based MOSNet as our baseline model since it achieved the best results. The architecture of the model is shown in gray in Fig. 1.

[Figure 1: Overview of the proposed model. The MOS prediction model is shown in gray. N is the number of channels of the three convolution layers, corresponding to 16, 16, 32, and 32 for the four stacks. M: number of spoofing types, s: stride of convolution.]

The input and output of MOSNet are a 257-dimensional magnitude spectrogram and the MOS of an utterance, respectively. First, the CNN-BLSTM network extracts frame-level features. The following two fully-connected (FC) layers predict frame-level scores from the frame-level features. Finally, we obtain the utterance-level MOS by averaging the frame-level scores. The overall loss function is as follows:

$$L = \frac{1}{U}\sum_{u=1}^{U}\left[(\hat{Q}_u - Q_u)^2 + \frac{\alpha_f}{T_u}\sum_{t=1}^{T_u}(\hat{Q}_u - q_{u,t})^2\right], \qquad (1)$$

where the first and second terms are the mean squared errors (MSEs) for the utterance-level and frame-level MOS, respectively. $u$ is an utterance index, and $U$ is the number of training utterances. $\hat{Q}_u$ and $Q_u$ are the ground-truth MOS and predicted MOS for the $u$-th utterance, respectively. $t$ is a frame index, and $T_u$ is the length of the $u$-th utterance. $q_{u,t}$ is the predicted frame-level MOS at the $t$-th frame of the $u$-th utterance. $\alpha_f$ is a loss weight for frame-level MOS prediction.
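To make the objective concrete, below is a minimal PyTorch sketch of Eq. (1). The function name, tensor shapes, and the assumption of equal-length (or padded) utterances within a batch are ours, not code from [9].

```python
import torch

def mosnet_loss(pred_frame, true_mos, alpha_f=0.8):
    """Eq. (1): utterance-level MSE plus a weighted frame-level MSE.

    pred_frame: (B, T) predicted frame-level MOSs q_{u,t}
                (assumes equal-length or padded utterances for simplicity)
    true_mos:   (B,)   ground-truth utterance-level MOSs Q_hat_u
    alpha_f:    frame-level loss weight (0.8 in Section 3.2)
    """
    # Predicted utterance-level MOS Q_u: average pooling over frame scores.
    pred_mos = pred_frame.mean(dim=1)
    utt_mse = ((true_mos - pred_mos) ** 2).mean()
    # Frame-level term: every frame score is regressed toward Q_hat_u.
    frame_mse = ((true_mos.unsqueeze(1) - pred_frame) ** 2).mean()
    return utt_mse + alpha_f * frame_mse
```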
2.2. Multi-task learning with SD and STC

To improve MOS prediction, we propose MTL with two auxiliary tasks: spoofing detection (SD) and spoofing type classification (STC). In this work, SD refers to the task of classifying human speech as "human" and synthesized speech as "spoofing" [21]. STC is the task of identifying the source of input speech, which can be a speech generation system or a human speaker. Since synthesized speech is used for spoofing, we call a speech generation system a "spoofing system," and all the spoofing systems and human speakers are collectively called "spoofing types."

Fig. 1 shows the architecture of our MTL model, in which 14 layers, consisting of the CNN-BLSTM network and an FC layer with 128 nodes (FC-128), are shared by all the tasks. For MOS prediction, the FC-1 layer is used to predict frame-level MOSs, and the following average pooling layer is used to predict the utterance-level MOS. For each auxiliary task, we assign an additional task-specific layer, which consists of FC, average pooling, and softmax layers. The number of nodes of the FC layer equals the number of classes for each task: 2 for SD, and the number of spoofing systems plus 2 (corresponding to the source and target speakers) for STC.

We use the cross-entropy loss for both auxiliary tasks. We then define the final loss function as follows:

$$L = \frac{1}{U}\sum_{u=1}^{U}\Big[\alpha_m(\hat{Q}_u - Q_u)^2 + \frac{\alpha_f}{T_u}\sum_{t=1}^{T_u}(\hat{Q}_u - q_{u,t})^2 - \alpha_d\sum_{i=1}^{2}\hat{D}_{u,i}\log D_{u,i} - \alpha_c\sum_{j=1}^{M}\hat{C}_{u,j}\log C_{u,j}\Big], \qquad (2)$$

where $\hat{D}_{u,i}$ and $D_{u,i}$ are the $i$-th dimension of the ground-truth and predicted probability of spoofing for the $u$-th utterance, respectively. $\hat{C}_{u,j}$ and $C_{u,j}$ are the ground-truth and predicted probability of the $j$-th spoofing type for the $u$-th utterance, respectively. $M$ is the number of spoofing types. $\alpha_m$, $\alpha_d$, and $\alpha_c$ are the loss weights for utterance-level MOS prediction, SD, and STC, respectively. In this setup, the number of parameters increases by only 1.438% (from 358,833 to 363,993).
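A sketch of how Eq. (2) can be assembled in PyTorch follows. Since `F.cross_entropy` already computes the term $-\sum_i \hat{D}_{u,i}\log D_{u,i}$ from logits and integer labels, the softmax layers of Fig. 1 are implicit here; all names and shapes are our assumptions.

```python
import torch
import torch.nn.functional as F

def mtl_loss(pred_frame, sd_logits, stc_logits,
             true_mos, sd_label, stc_label,
             alpha_m=1.0, alpha_f=0.8, alpha_d=1.0, alpha_c=1.0):
    """Eq. (2): MOS regression plus cross-entropy losses for SD and STC.

    sd_logits:  (B, 2) utterance-level logits ("human" vs. "spoofing")
    stc_logits: (B, M) utterance-level logits over the M spoofing types
    sd_label, stc_label: (B,) integer class indices
    Default loss weights follow Section 3.2.
    """
    pred_mos = pred_frame.mean(dim=1)
    utt_mse = ((true_mos - pred_mos) ** 2).mean()
    frame_mse = ((true_mos.unsqueeze(1) - pred_frame) ** 2).mean()
    sd_ce = F.cross_entropy(sd_logits, sd_label)    # -sum_i D_hat * log D
    stc_ce = F.cross_entropy(stc_logits, stc_label) # -sum_j C_hat * log C
    return (alpha_m * utt_mse + alpha_f * frame_mse
            + alpha_d * sd_ce + alpha_c * stc_ce)
```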
2.3. Effects of the auxiliary tasks

Fig. 2 is a conceptual illustration of the decision boundary in the SD task that separates the shared features of human speech and spoofed (or synthesized) speech, which are extracted from the shared layers. Based on this figure, we now give a detailed analysis of the proposed MTL approach.

[Figure 2: A conceptual illustration of the decision boundary in SD. Labels: human speech, spoofed speech, concentrating effect zone (near the boundary), ignoring effect zone (far from it).]

We first discuss the difficulty of MOS prediction, especially for high MOSs. Human speech usually has a higher MOS than synthesized speech, as can also be seen in Fig. 3. Among the synthesized utterances, there are far fewer utterances with high MOSs (i.e., high-quality speech) than with low MOSs (i.e., low-quality speech), which makes it difficult for a MOS prediction model to predict high MOSs. We show that the SD task can alleviate this problem.

In SD, the model is trained to discriminate between human speech and synthesized speech. It therefore learns a decision boundary in the zone marked "Concentrating effect zone" in Fig. 2, where both human speech and high-quality synthesized speech exist. Thus, a model trained on the SD task learns to distinguish between human speech and high-quality synthesized speech by giving more attention to high-quality speech than a model trained only on MOS prediction. As a result, in MOS prediction, the model predicts high MOSs better by concentrating on the utterances with high MOSs. Accordingly, we call this the concentrating effect.

As discussed in Section 2.2, in STC, the model is trained to identify the source of speech, called the spoofing type, and thus learns low-level features that are useful for distinguishing between various types of speech. The spoofing type fundamentally determines the manner of vocalization and intonation, which has a large effect on the MOS of speech. In addition, we believe that the independence of these two tasks from the raters may help with inter- and intra-rater variability. Experimental results in Section 4 support the effectiveness of using these auxiliary tasks.
2.4. Focal loss for spoofing detection

In this section, we describe our motivation for using the focal loss for SD. The SD model is trained to classify synthesized utterances into the same class, "spoofing," even though the utterances are generated by various spoofing systems. MTL with SD would therefore prevent the model from distinguishing between spoofing systems. We call this effect of SD the ignoring effect, as it ignores the differences between distinct spoofing systems.

However, when we train a MOS predictor with STC, the model learns to distinguish between the spoofing types in the training data, which we call the distinguishing effect. Then, when testing the model on the same spoofing types as in the training data, using SD as another auxiliary task causes a conflict between the ignoring effect and the distinguishing effect. To maximize the synergy between SD and STC, we want to prevent SD from having the ignoring effect. Note that there are many more synthesized utterances far from the decision boundary, having lower MOSs and SD losses, than near the decision boundary, having higher MOSs and SD losses (see Fig. 2). Those low-quality synthesized utterances with low SD losses contribute only to the ignoring effect and not to the concentrating effect; they are marked as the "Ignoring effect zone." To reduce the contribution of the utterances with low SD losses, we adopt the focal loss [20] for SD.

The cross-entropy (CE) loss of the $u$-th utterance for the SD task is as follows:

$$\mathrm{CE}(D_u, \hat{D}_u) = \begin{cases} -\log D_{u,1}, & \text{if } \hat{D}_u = (1, 0) \\ -\log D_{u,2}, & \text{if } \hat{D}_u = (0, 1). \end{cases} \qquad (3)$$

The first and second dimensions of $\hat{D}_u$ correspond to "human" and "spoofing," respectively. Note that $D_{u,2} = 1 - D_{u,1}$ since we use the softmax layer.

For convenience, we now define $p_u \in [0, 1]$ as follows:

$$p_u = \begin{cases} D_{u,1}, & \text{if the } u\text{-th utterance} \in \text{“human”} \\ 1 - D_{u,1}, & \text{otherwise.} \end{cases} \qquad (4)$$

Then the CE loss can be rewritten as $\mathrm{CE}(p_u) = -\log(p_u)$. To down-weight the contribution of easy samples (those with lower loss) and focus on hard samples (those with higher loss), the focal loss (FL) is defined as $\mathrm{FL}(p_u) = -(1 - p_u)^{\gamma}\log(p_u)$, with a parameter $\gamma \geq 0$ that adjusts the rate at which easy samples are down-weighted.

[Figure 3: Distribution of ground-truth utterance-level MOSs for (a) synthesized and (b) human speech from the VCC'18 data.]
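Below is a small PyTorch sketch of this focal loss for the binary SD head, under the same assumptions as the earlier snippets (utterance-level logits, integer labels).

```python
import torch
import torch.nn.functional as F

def focal_loss_sd(sd_logits, sd_label, gamma=0.8):
    """FL(p_u) = -(1 - p_u)^gamma * log(p_u) for the binary SD task.

    p_u is the softmax probability assigned to the true class;
    gamma = 0 recovers the plain cross-entropy CE(p_u) = -log(p_u).
    """
    log_probs = F.log_softmax(sd_logits, dim=-1)                   # (B, 2)
    log_p = log_probs.gather(1, sd_label.unsqueeze(1)).squeeze(1)  # log p_u
    p = log_p.exp()
    return (-((1.0 - p) ** gamma) * log_p).mean()
```

With $\gamma = 0.8$ as in Section 3.2, an easy sample with $p_u = 0.95$ is scaled by $(0.05)^{0.8} \approx 0.09$, while a hard sample with $p_u = 0.5$ keeps a factor of $(0.5)^{0.8} \approx 0.57$, so utterances in the "ignoring effect zone" contribute much less to the gradient.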
3. Experiments
3.1. Datasets

We use the MOS evaluation results of the VCC 2018 (VCC'18) [22], which is a large, open, and intrinsically predictable dataset. A total of 38 systems participated in VCC'18, including two human speakers (the target and source speakers). The ground-truth MOS of a system was obtained by averaging the ground-truth MOSs of all the utterances from the system.

A total of 267 people rated a total of 20,580 utterances on a scale of 1 (completely unnatural) to 5 (completely natural). An average of 4 people per utterance participated in the evaluation, leading to a total of 82,304 evaluation results. The ground-truth utterance-level MOS was obtained as the average of all the MOSs of the utterance. Fig. 3 shows the distribution of ground-truth utterance-level MOSs. From the 20,580 <audio, ground-truth MOS> pairs, we randomly select 15,580, 3,000, and 2,000 pairs for training, validation, and testing, respectively.

We also use the MOS evaluation data from the VCC 2016 (VCC'16) [23] to evaluate the robustness of the models to unseen spoofing types and raters. Including a target speaker, a source speaker, and a baseline system, a total of 20 systems exist. For each system, 1,600 utterance-level evaluation results exist, with no specification of the utterances or raters. Therefore, we cannot use the utterance-level MOSs and only use the system-level MOSs, computed over a total of 26,028 utterances.

3.2. Experimental setup

All the models are implemented using PyTorch and trained on a single GTX 1080 Ti GPU. We use a batch size of 32 and the Adam optimizer with a learning rate of $10^{-4}$ for all the models. We set the weights for utterance- and frame-level MOS prediction to 1 and 0.8, respectively. For MTL, we set the weight of the loss for each auxiliary task to 1. When we adopt the focal loss (FL) for the SD task, we set $\gamma$ to 0.8.

For testing, we use the model that has the lowest MSE on the VCC'18 validation set during 200 epochs of training. Note that we train all the models using only the VCC'18 training set and test them on both the VCC'18 test set and the VCC'16 data. We conduct the experiments for each model with four different random seeds and report the average of the four results as the performance of each model. The performance is evaluated in terms of the MSE, the linear correlation coefficient (LCC) [24], and Spearman's rank correlation coefficient (SRCC) [25]. We report both utterance- and system-level performance for the VCC'18 test set. For the VCC'16 data, we report only system-level performance because the utterance-level MOSs for the VCC'16 are not available, as mentioned in Section 3.1.
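For reference, the three metrics can be computed with NumPy and SciPy as sketched below; the function and variable names are ours.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, true):
    """Return the MSE, LCC [24], and SRCC [25] between two score lists.

    For system-level results, first average the utterance-level MOSs
    (predicted and ground-truth) within each system, then call this
    function on the per-system averages.
    """
    pred, true = np.asarray(pred), np.asarray(true)
    mse = np.mean((true - pred) ** 2)
    lcc, _ = pearsonr(true, pred)    # linear correlation coefficient
    srcc, _ = spearmanr(true, pred)  # Spearman's rank correlation
    return mse, lcc, srcc
```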
4. Results and Discussion
4.1. Effectiveness of the proposed MTL approach

Table 1 shows the effectiveness of the proposed MTL approach. The results of MTL with SD (+SD) and with STC (+STC) are in the second and fourth rows of Table 1, respectively. +SD improved the baseline in terms of all the metrics on the VCC'18 data. Moreover, +STC achieved better performance than the baseline, not only on the VCC'18 data but also on the VCC'16 data. +STC yielded better results than +SD, which matches the intuition that the model learns more information from a multi-class classification task (i.e., STC) than from a simple binary classification task (i.e., SD).

Table 1: Results of the ablation study. MOS and F-MOS are utterance- and frame-level MOS, respectively. +SD and +STC denote using the SD and STC tasks, respectively. RI is the average relative improvement over the nine metrics. The best results are shown in bold.

Model    | FL | α_m | α_f | α_d | α_c | VCC'18 utt. MSE/LCC/SRCC | VCC'18 sys. MSE/LCC/SRCC | VCC'16 sys. MSE/LCC/SRCC | RI (%)
Baseline | -  | 1   | 0.8 | 0   | 0   | 0.448 / 0.651 / 0.619    | 0.039 / 0.966 / 0.924    | 0.316 / 0.896 / 0.858    | -
+SD      | -  | 1   | 0.8 | 1   | 0   | 0.439 / 0.661 / 0.623    | 0.029 / 0.972 / 0.925    | 0.333 / 0.886 / 0.829    | 2.34
+SD      | ✓  | …

The MTL model with both auxiliary tasks (+STC +SD) improved the performance of the baseline in terms of all the metrics except the utterance-level SRCC for the VCC'18, which is almost the same as the baseline. However, the performance on the VCC'18 data was not better than that of +STC due to the conflict between the ignoring effect and the distinguishing effect. In Section 4.3, we will see that the focal loss reduces the ignoring effect and thus improves the performance of +STC +SD.

Fig. 4 shows the scatter plots of system-level MOSs for the four models. Each dot corresponds to an individual system. Comparing (a) and (b), we can see that the systems with high MOSs are better aligned using MTL with SD. This indicates that SD is useful for MOS prediction in the high-MOS region, as discussed in Section 2.3. Comparing (a) and (c), we can see that the dots get closer to the y = x line using MTL with STC, which means that the predicted MOSs are closer to the ground-truth MOSs. The scatter plot of system-level MOSs for +STC +SD, shown in (d), looks like the combination of (b) and (c).

[Figure 4: Scatter plots of system-level MOSs for (a) Baseline, (b) MTL model with SD (+SD), (c) MTL model with STC (+STC), and (d) MTL model with both tasks (+STC +SD).]

4.2. Generalization to unseen spoofing types and raters

As mentioned in the introduction, inter- and intra-rater variability exists in MOSs. Different raters evaluate speech quality based on their own subjectivity, without objective criteria, leading to high variance within the evaluated scores of the same speech [9]. This variability caused by the raters fundamentally limits the generalization ability of a MOS predictor.

Since the spoofing types and raters do not overlap between the VCC'16 and VCC'18, the test results on the VCC'16 in Table 1 show the generalization ability of the models to unseen spoofing types and raters. +SD degrades the performance of the baseline on the VCC'16 since it tends to prevent the model from distinguishing between spoofing systems through the ignoring effect, while there is more variation among the spoofing systems in the VCC'16: the ground-truth system-level MOSs of the spoofing systems have a higher standard deviation in the VCC'16 (0.59) than in the VCC'18 (0.49).

Meanwhile, +STC shows better generalization ability than the baseline on the VCC'16. We argue that this is because, in the process of distinguishing between various systems, the shared layers learn useful low-level features that generalize to unseen spoofing types and raters. Moreover, by adding the SD task to +STC, +STC +SD achieves a 10% relative improvement over +STC on the VCC'16 data. We provide a detailed analysis of this result in Section 4.4.
4.3. Effect of the focal loss

As discussed in Section 2.4, if we use the FL for SD in training, the model tends to keep learning from the utterances with high SD losses but to stop learning from those with relatively low SD losses. That is, the FL increases the concentrating effect and decreases the ignoring effect of SD. On the VCC'18 test set, +SD w/ FL plays a similar role to +SD in that it helps MOS prediction through the concentrating effect. On the VCC'16 data, however, it improved over +SD by relieving the ignoring effect that acted as a disadvantage for +SD, as discussed in Section 4.2. Applying the FL to +STC +SD improved the performance on the VCC'18 test set by relieving the ignoring effect that limited the performance of +STC +SD. On the VCC'16 data, +STC +SD w/ FL achieved slightly better results than +STC +SD. In summary, +STC +SD w/ FL achieved the best relative improvement, 11.6% over the baseline.
4.4. Visualization of shared features

To further support our analysis, we visualize the shared features using t-distributed stochastic neighbor embedding (t-SNE) [26] in Fig. 5. We display each frame-level shared feature of an utterance as a single dot and colorize it according to the system to which the utterance belongs.

[Figure 5: t-SNE plots of the shared features of six models: (a) Baseline, (b) +SD, (c) +SD w/ FL, (d) +STC, (e) +STC +SD, and (f) +STC +SD w/ FL, on the VCC 2018 and VCC 2016 data. The legends are sorted in descending order of system-level MOS.]

For the VCC'18, we consider two systems made by one team as one system since they have similar characteristics, resulting in 26 systems. Then, for both test sets, we sort the systems in descending order of system-level MOS and evenly select half of them. Red indicates the source speech ('S00' or 'SRC'), and orange indicates the spoofing system with the highest MOS ('N10' or 'N'). We randomly select 390 and 300 utterances from the VCC'18 and VCC'16 data, respectively, so that each system has an average of 15 utterances.

The concentrating effect can easily be observed from the red and orange dots in (a) and (b): those points lie closer together in +SD. In (e), the colors other than red tend to be more mingled with other colors and less cohesive than in (d), which indicates the ignoring effect. Plots (d), (e), and (f) clearly show the ability of STC to distinguish between the systems: compared to (a), (b), and (c), the points are more clustered according to their classes. The effect of the FL (i.e., keeping the concentrating effect while alleviating the ignoring effect) can be observed by comparing (b) and (c), or (e) and (f).
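A visualization along these lines can be produced with scikit-learn's t-SNE. The sketch below assumes the frame-level FC-128 activations have been collected into an array with one row per frame and a system label per row; these names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_shared_features(features, system_ids):
    """Project frame-level shared features to 2-D and color them by system.

    features:   (N, 128) array of FC-128 outputs (one dot per frame)
    system_ids: length-N labels giving each frame's spoofing type
    """
    emb = TSNE(n_components=2, random_state=0).fit_transform(
        np.asarray(features))
    system_ids = np.asarray(system_ids)
    for sys_id in np.unique(system_ids):
        mask = system_ids == sys_id
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=str(sys_id))
    plt.legend(markerscale=3)
    plt.show()
```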
5. Conclusion
We proposed MTL with SD and STC to improve a deep MOS predictor. With experimental results on the VCC'18 and VCC'16 evaluation data, we showed that both the SD and STC tasks improve MOS prediction. Furthermore, we adopted the FL to maximize the synergy between the two tasks. The proposed MTL model can be used to automatically evaluate and compare speech generation systems. For future work, we will consider the ranks of the utterance-level MOSs to increase the SRCC. By providing the MOS predicted by our proposed model, we will also directly guide a speech generation model to synthesize speech with a high MOS as well as a low MSE.
6. Acknowledgements
This material is based upon work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (No. 10080667, Development of conversational speech synthesis technology to express emotion and personality of robots through sound source diversification).

7. References

[1] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.
[3] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," in Proc. of the International Conference on Learning Representations (ICLR), 2018.
[4] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, "Neural speech synthesis with Transformer network," in Proc. of the AAAI Conference on Artificial Intelligence, 2019, pp. 6706–6713.
[5] T. Kaneko and H. Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," in Proc. of the 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 2100–2104.
[6] M. Chu and H. Peng, "An objective measure for estimating MOS of synthesized speech," in Proc. of Eurospeech, 2001.
[7] T. Yoshimura, G. E. Henter, O. Watts, M. Wester, J. Yamagishi, and K. Tokuda, "A hierarchical predictor of synthetic speech naturalness using neural networks," in Proc. of Interspeech, 2016, pp. 342–346.
[8] B. Patton, Y. Agiomyrgiannakis, M. Terry, K. W. Wilson, R. A. Saurous, and D. Sculley, "AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech," in Proc. of NIPS End-to-end Learning for Speech and Audio Processing Workshop, 2016.
[9] C. Lo, S. Fu, W. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H. Wang, "MOSNet: Deep learning based objective assessment for voice conversion," in Proc. of Interspeech, 2019, pp. 1541–1545.
[10] R. A. Caruana, "Multitask learning: A knowledge-based source of inductive bias," in Proc. of the International Conference on Machine Learning (ICML), 1993.
[11] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[12] S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal, "Learning general purpose distributed sentence representations via large scale multi-task learning," in Proc. of the International Conference on Learning Representations (ICLR), 2018.
[13] R. Girshick, "Fast R-CNN," in Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[14] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, "Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition," in Proc. of Interspeech, 2017.
[15] T. Maekaku, Y. Kida, and A. Sugiyama, "Simultaneous detection and localization of a wake-up word using multi-task learning of the duration and endpoint," in Proc. of Interspeech, 2019, pp. 4240–4244.
[16] L. You, W. Guo, L. Dai, and J. Du, "Multi-task learning with high-order statistics for x-vector based text-independent speaker verification," in Proc. of Interspeech, 2019, pp. 1158–1162.
[17] A. Jati, R. Peri, M. Pal, T. J. Park, N. Kumar, R. Travadi, P. Georgiou, and S. Narayanan, "Multi-task training of hybrid DNN-TVM model for speaker verification with noisy and far-field speech," in Proc. of Interspeech, 2019, pp. 2463–2467.
[18] J. Bingel and A. Søgaard, "Identifying beneficial task relations for multi-task learning in deep neural networks," arXiv preprint arXiv:1702.08303, 2017.
[19] S. Louvan and B. Magnini, "Exploring named entity recognition as an auxiliary task for slot filling in conversational language understanding," in Proc. of the 2018 EMNLP Workshop on Search-Oriented Conversational AI (SCAI), 2018, pp. 74–80.
[20] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
[21] P. L. De Leon, I. Hernaez, I. Saratxaga, M. Pucher, and J. Yamagishi, "Detection of synthetic speech for the problem of imposture," in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 4844–4847.
[22] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods," in Proc. of Odyssey: The Speaker and Language Recognition Workshop, 2018, pp. 195–202.
[23] T. Toda, L. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, "The Voice Conversion Challenge 2016," in Proc. of Interspeech, 2016, pp. 1632–1636.
[24] K. Pearson, "Notes on the history of correlation," Biometrika, vol. 13, no. 1, pp. 25–45, 1920.
[25] C. Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
[26] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.