Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, Tomoki Toda
National University of Singapore, Singapore; University of Eastern Finland, Finland; Nagoya University, Japan; University of Science and Technology of China, China; National Institute of Informatics, Japan
[email protected]
Abstract
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker verification (ASV), neural speaker embeddings, spoofing countermeasures, predicted mean opinion scores (MOS), and automatic speech recognition (ASR). Each of these objective measures assesses the VC output along different aspects. We observed that the correlations of these objective assessments with the subjective results were high for ASV, neural speaker embedding, and ASR, which makes them more influential for predicting subjective test results. In addition, we performed spoofing assessments on the submitted systems and identified some of the VC methods showing a potentially high security risk.
Index Terms: Voice Conversion Challenge 2020, objective evaluation, subjective rating prediction, spoofing assessment
1. Introduction
Voice conversion (VC), which refers to the digital cloning of a person's voice, can be used to modify an audio waveform so that it appears as if spoken by someone else (target) than the original speaker (source). VC is useful in many applications such as customizing audiobook and avatar voices, dubbing, the movie industry, teleconferencing, singing voice modification, voice restoration after surgery, and the cloning of voices of historical persons. Since VC technology involves identity conversion, it can also be used to protect the privacy of individuals on social media and in sensitive interviews. For the same reason, VC also enables spoofing (fooling) voice biometric systems and therefore has potential security implications.

VCC 2020 is the 3rd edition of the Voice Conversion Challenge (VCC). While the general background and subjective results are provided in [1], this study focuses on complementary objective evaluation results.

Conventionally, the target of VC technology is human listeners, so subjective assessment has been the primary method of assessment in all the VCC challenges. On the other hand, progress has been made recently in research fields relevant to objective evaluation, and human perception prediction and spoofing performance assessment against automatic speaker verification (ASV) systems are increasingly being used.

The former is utilized to model and predict human perception automatically. Until recently, predicting human perception of synthetic speech has been challenging, but new research [2-9] has demonstrated that advanced deep learning models and large amounts of paired data of synthetic speech and associated human judgement scores can lead to data-driven objective models that can predict human perception of synthetic speech to a certain extent. There are also several studies on using automatic speech recognition (ASR) as a proxy for subjective intelligibility estimation [10].

The latter pertains to spoofing and anti-spoofing research for ASV. One goal of VC is to produce convincing mimicry of specific target speaker voices, and it is widely known that VC can fool (spoof) unprotected ASV systems [11, 12]. Therefore, for increasing security and robustness, ASV systems normally adopt spoofing countermeasures, which are designed to learn the distinguishing artifacts that separate spoofed audio produced by VC from human speech. We assume that spoofing performance against ASV may be correlated with the speaker similarity of VC systems and that the countermeasure (CM) performance reflects the amount of artifacts produced by VC systems, which may or may not be audible to humans. Note that these systems are designed and optimized for discrimination by machines, and as such, the performances of ASV and CM may differ from human perceptions.

With the above as our motivation, we provide an array of complementary objective results representative of recent objective evaluation techniques.
Our tools include:
• text-independent ASV [13] for speaker similarity,
• text-independent CM [11] for real-vs.-fake assessment,
• automatic MOS prediction [6] for quality, and
• ASR for intelligibility.

To the best of our knowledge, these four metrics have never before been examined as a group within any single VC study. Using these metrics, this paper investigates the two questions below:
• Can the metrics predict human judgements on naturalness and speaker similarity?
• Which VC technology has the highest spoofing risk for ASV and CM?

In Section 2 of this paper, we give an overview of the motivation behind each of the objective evaluation metrics. Their implementation details are described in Section 3. Correlations with human judgements are analyzed in Section 4, and spoofing performance against ASV and CM is discussed in Section 5. We conclude in Section 6 with a brief summary and mention of future work.

Table 1: Summary of evaluation metrics used in assessing VCC 2020 submissions. ASV: automatic speaker verification, EER: equal error rate, MOS: mean opinion score, LCNN: light convolutional neural network, ASR: automatic speech recognition, WER: word error rate, CM: countermeasure.

Metric | Type of measure | Measurement tool | Implementation | Metric interpretation
ASV EER | Conv. src ↔ tgt similarity | ASV | Kaldi x-vector [13] | Not similar ... Similar
P_fa^tar | Conv. src ↔ tgt similarity | ASV | Kaldi x-vector [13] | Not similar ... Similar
P_miss^src | Conv. src ↔ src similarity | ASV | Kaldi x-vector [13] | Similar ... Not similar
Cosine | Conv. src ↔ tgt similarity | Speaker embedding | Kaldi x-vector [13] | Not similar ... Similar
CM EER | Artifact assessment | Spoofing CM | LCNN [14] | Fake ... Real
MOSNet | Quality | Objective MOS | MOSNet [6] | Lowest ... Highest
ASR WER | Intelligibility | ASR | Seq2seq with attention [15] | Perfect ... Unintelligible
2. Methodology
Objective metrics for speech signals can be categorized into intrusive and non-intrusive assessment methods. The former uses, as a reference, ground-truth natural clean audio that has the same linguistic content as the input speech. The latter does not use any reference. A summary of the objective metrics used in this paper is provided in Table 1. They are all non-intrusive metrics. The following subsections provide the motivation and general description of each method.
2.1. Speaker similarity assessment using ASV

We assess speaker similarity using ASV. An ASV system compares a test utterance of an unknown speaker with a hypothesized speaker's training utterance(s) and then outputs a speaker similarity score s ∈ ℝ. Higher scores indicate support for the same-speaker hypothesis and lower scores for the different-speaker hypothesis. A hard decision is obtained by comparing s with a pre-set verification threshold, τ_asv. The speakers are declared to be the same if s > τ_asv (otherwise, different).

When a VC method achieves good mimicry of a specific target speaker's voice, the above score becomes higher and hence the converted audio will be judged as the same speaker. Therefore, we use the impact of VC on the ASV error rates as speaker similarity metrics here. For each VC system, we report three different kinds of ASV error:

1. Equal error rate (EER): error rate at the τ_asv at which the spoof false acceptance rate and target speaker miss rate equal each other;
2. False acceptance rate of target (P_fa^tar): proportion of converted utterances declared as the targeted speaker;
3. Miss rate of source (P_miss^src): proportion of converted utterances not declared as the original source speaker.

The first gauges the ASV system's general accuracy in differentiating converted source utterances from real target speaker utterances (alternatively, the effectiveness of a VC system in fooling ASV). The second metric also gauges the VC system's ability to fool the ASV system, with the difference that our false acceptance rate computation uses a threshold fixed prior to observing any VC samples. The same holds for the third metric, the source miss rate, which gauges the VC system's ability to de-identify the source speaker in terms of ASV (alternatively, the ASV system's inability to re-identify the source speaker). The reference values are (EER, P_fa^tar, P_miss^src) = (50%, 100%, 100%) for an ideal VC system and (0%, 0%, 0%) for a useless one. Theoretically, the EER is constrained between 0% and 50% for any binary classification task, including ASV; values larger than 50% indicate decisions worse than random guessing (label flip), although in practice the values may exceed 50% due to data anomalies.

The second and third error rates defined above are obtained by fixing τ_asv before observing the converted utterances. The ASV system is optimized to differentiate real human same-speaker and different-speaker trials, but no VC samples are used in optimizing it. Thus, τ_asv is fixed using non-converted (natural) data only and remains fixed for the VC test. In practice, we set τ_asv at the EER operating point at which the (natural speech) miss and false alarm rates equal each other.

Strictly speaking, the above ASV scoring procedures differ from how our subjects evaluated speaker similarity in the main listening test [1]. In the listening test, they were asked to listen to a reference audio file (which had different linguistic content from the converted audio) and judge the speaker similarity of the converted audio to that reference. Therefore, we also computed the cosine similarity of speaker embedding vectors in addition to the ASV scores. We extracted speaker embedding vectors from both the converted audio and one of the reference audio files using the same tool as the above ASV system and measured their cosine similarity, cos_sim(A, B) = A · B / (‖A‖‖B‖), where A and B are the speaker embedding vectors obtained from the converted audio and the reference audio, respectively. Note that this is still technically a non-intrusive assessment because the reference audio for this measurement is different from ground-truth natural clean audio with the same linguistic content as the input speech.
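To make the three error rates and the cosine measure concrete, below is a minimal sketch that computes them from hypothetical score arrays. The score distributions, variable names, and the simple threshold sweep are our own illustration, not the Kaldi/PLDA pipeline described in Section 3.

```python
import numpy as np

def eer(pos_scores, neg_scores):
    """Equal error rate: sweep candidate thresholds and return the
    operating point where the miss and false acceptance rates cross."""
    best_gap, best_eer, best_tau = np.inf, None, None
    for tau in np.sort(np.concatenate([pos_scores, neg_scores])):
        p_miss = np.mean(pos_scores < tau)  # positives rejected
        p_fa = np.mean(neg_scores >= tau)   # negatives accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap = abs(p_miss - p_fa)
            best_eer, best_tau = (p_miss + p_fa) / 2, tau
    return best_eer, best_tau

def cos_sim(a, b):
    """Cosine similarity of two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ASV scores (higher supports the same-speaker hypothesis).
rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 400)           # genuine target trials
non = rng.normal(-2.0, 1.0, 400)          # nontarget trials (natural speech)
conv_vs_tgt = rng.normal(1.5, 1.0, 400)   # converted utterances vs. target model
conv_vs_src = rng.normal(-1.0, 1.0, 400)  # converted utterances vs. source model

# 1) EER between real target trials and converted ("spoof") trials.
asv_eer, _ = eer(tar, conv_vs_tgt)

# 2) and 3): fix tau_asv at the EER point of natural data only (tar vs. non)
# and apply it, unchanged, to the VC trials.
_, tau_asv = eer(tar, non)
p_fa_tar = np.mean(conv_vs_tgt >= tau_asv)   # converted accepted as target
p_miss_src = np.mean(conv_vs_src < tau_asv)  # converted rejected as source

print(f"EER={asv_eer:.1%}  P_fa^tar={p_fa_tar:.1%}  P_miss^src={p_miss_src:.1%}")
```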
2.2. Artifact assessment using a spoofing countermeasure

Spoofing countermeasures play an imperative role in defending ASV systems against various attacks. In general, such countermeasures are designed to learn the artifacts that distinguish natural human speech from different kinds of generated/spoofed speech in order to identify spoofing attacks. Therefore, artifact assessment using a spoofing countermeasure on the output speech of the various VC systems can indicate the amount of artifacts in the converted speech.

The performance of spoofing countermeasures is normally evaluated in terms of EER, similar to that of ASV systems. However, unlike ASV, speech from any human speaker is considered a target trial, and the converted/spoofed speech serves as the non-target trials. In the context of VC outputs, a high EER indicates the generation of more human-like speech, whereas a low EER indicates that the converted speech is inclined towards the characteristics of artificially generated speech.

2.3. Objective MOS prediction

The mean opinion score (MOS) used for subjective quality assessment is a numerical measure of the human-judged overall quality of audio, typically in the range of 1-5, where 1 is the lowest perceived quality and 5 is the highest. An objective MOS is used to assess speech quality by predicting and approximating the human assessment from an input audio signal. This technique has a long history, and several metrics have been proposed for speech coding and telephony. Famous metrics include PESQ [16] and POLQA [17], both of which are intrusive assessments. There is also P.563 [18], a metric for non-intrusive assessment, but it is not designed for evaluating the quality of synthetic or converted speech.

When a large amount of paired data of synthetic speech and associated human judgement scores is available, we can view objective MOS prediction as a machine learning-based regression problem. Various deep learning models have been proposed to predict MOS values using listening test data collected from the Blizzard Challenge [19], our previous VCCs [20, 21], or the ASVspoof Challenge [22] for supervision. More specifically, [2] predicted the Blizzard Challenge's results, [6, 8, 9] reported predictions of VCC results, and [7] reported predictions of ASVspoof's listening test results. While all of them exhibited moderate correlations with human judgements, it is still unknown whether these models can generalize to new speech synthesis methods that are not already included in the databases.

Among these models, we have chosen to examine MOSNet [6], a deep learning-based non-intrusive assessor, since its predictions are reported to have a moderate correlation with the subjective evaluation done for VCC 2018.
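As a rough sketch of this style of model, assuming PyTorch: a convolution stack over the input spectrogram, a BLSTM, fully connected layers producing frame-level scores, and global average pooling to an utterance-level MOS, trained with a mean-square loss. The layer count and sizes here are shortened for brevity and are not the exact MOSNet [6] configuration.

```python
import torch
import torch.nn as nn

class MOSNetLike(nn.Module):
    """Simplified MOSNet-style regressor: conv stack -> BLSTM -> FC ->
    per-frame MOS scores, averaged over time to one utterance-level score."""
    def __init__(self, n_bins=257):
        super().__init__()
        # MOSNet [6] uses 12 convolution layers; we keep 4 for brevity.
        chans = [1, 16, 32, 64, 128]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=(1, 3), padding=1),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        with torch.no_grad():  # infer the flattened feature size
            f = self.conv(torch.zeros(1, 1, 8, n_bins))
        feat_dim = f.shape[1] * f.shape[3]
        self.blstm = nn.LSTM(feat_dim, 128, batch_first=True,
                             bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                nn.Linear(128, 1))

    def forward(self, spec):              # spec: (batch, frames, n_bins)
        x = self.conv(spec.unsqueeze(1))  # -> (batch, chan, frames, bins')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)
        frame_mos = self.fc(x).squeeze(-1)  # one score per frame
        return frame_mos.mean(dim=1)        # global average pooling

model = MOSNetLike()
pred = model(torch.randn(2, 300, 257))      # two utterances, 300 frames each
loss = nn.MSELoss()(pred, torch.tensor([3.2, 4.1]))  # regress listening-test MOS
```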
2.4. Intelligibility assessment using ASR

Although some recent VC methods can achieve high naturalness and similarity of converted speech, they may degrade its intelligibility. For example, in the recognition-synthesis approach [23-26] to VC, an ASR model is usually adopted to extract linguistic-related features, e.g., phonetic posteriorgrams (PPGs) [24] or bottleneck features [26], from source speech. In this case, recognition errors are inevitable, which may degrade the intelligibility of the converted speech. Moreover, in VC methods using sequence-to-sequence acoustic models [27], failed alignment may lead to repetition and deletion of speech segments, especially when the amount of training data is limited [28].

Considering the cost of conducting subjective intelligibility evaluations for all conversion pairs, we adopted the word error rate (WER) of ASR as an objective metric for the intelligibility of converted speech in VCC 2020. A lower WER indicates higher intelligibility.
3. Implementation Details
3.1. ASV and speaker embeddings

Our ASV system utilizes x-vector [29]-based deep speaker embeddings. We use Kaldi's [30] recipe [13] trained on VoxCeleb data [31]. The system uses a time-delay neural network (TDNN) model trained with cross-entropy loss (treating training speakers as classes) to extract one 512-dimensional deep speaker embedding per utterance. The speaker similarity score is computed using probabilistic linear discriminant analysis (PLDA).

The system is used as a scoring tool without specific modifications (e.g., domain adaptation) for the VCC 2020 data. The source and target speaker reference models are obtained from the respective training utterances provided to the challenge participants. The training x-vectors of each utterance are used to form one averaged x-vector per speaker. Test utterance x-vectors are then scored against these averaged models. For pre-VC ASV tests (required when setting the ASV threshold), we use the original source test data (provided to challenge participants) and the target speaker reference data (not provided to challenge participants). For the VC tests, we replace the original source utterances with their VC-processed versions.

For calculating the cosine distance of the speaker embeddings, we use the same Kaldi-based x-vector extractor as the ASV system. The x-vector dimensions were reduced to 200 using LDA before we computed the cosine similarity between converted speech and natural speech. We then calculated the averaged value per system.
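A minimal sketch of the embedding-side scoring flow, assuming the x-vectors have already been extracted; scikit-learn's LDA stands in for Kaldi's transform, and all data here are random placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Placeholder 512-dim x-vectors with speaker labels, standing in for the
# embeddings produced by the Kaldi recipe [13].
train_xvecs = rng.normal(size=(5000, 512))
train_spkrs = rng.integers(0, 1000, size=5000)

# Reduce to 200 dimensions with LDA (speakers as classes) before cosine
# scoring, mirroring the dimensionality reduction described above.
lda = LinearDiscriminantAnalysis(n_components=200).fit(train_xvecs, train_spkrs)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One averaged reference vector for the target speaker, scored against
# each converted utterance; the mean gives the per-system value.
target_model = lda.transform(rng.normal(size=(10, 512))).mean(axis=0)
conv_xvecs = lda.transform(rng.normal(size=(25, 512)))
system_score = np.mean([cos_sim(v, target_model) for v in conv_xvecs])
print(f"averaged cosine similarity for this system: {system_score:.3f}")
```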
3.2. Spoofing countermeasure

We used a light convolutional neural network (LCNN)-based system as the spoofing countermeasure in our studies [14]. The system takes 60-dimensional (20-static + 20-Δ + 20-ΔΔ) linear frequency cepstral coefficient (LFCC) features as input [32]. The training set of the ASVspoof 2019 logical access corpus is used to build the model [22]. The detailed architecture and implementation of the LCNN system are available in [33]. We considered the utterances from the training set of the VCC 2020 database as the bona fide trials, whereas the submissions from the various teams constitute the spoof trials for evaluating the performance of every system submitted to the challenge.

3.3. MOSNet

Our MOSNet model architecture followed the original setting in [6] (https://github.com/lochenchou/MOSNet, https://github.com/rhoposit/MOS_Estimation). Specifically, the raw magnitude spectrogram was first extracted from the converted speech and used as the input feature. The main model consisted of 12 convolution layers, one bidirectional long short-term memory layer, and two fully connected layers followed by a global averaging layer that pooled the frame-level scores to generate the final utterance-level predicted MOS. The whole network was trained to regress the listening test scores by minimizing the mean square loss.

For training the MOSNet, we used two datasets: one composed of listening test data collected for VCC 2018 [21] and the other of listening test data collected for ASVspoof 2019 [22]. The former contains many of the VC systems available in 2018, and the latter contains more recent speech synthesis and VC methods available from 2019. They are referred to as MOSNet (vcc18) and MOSNet (asvspoof19), respectively.

3.4. ASR

The ASR engine was a prototype system developed by iFlytek. It features a state-of-the-art end-to-end neural network-based ASR architecture and was trained using 10,000 hours of recordings and GB-level texts for language modeling. The vocabulary size was around 200,000. WERs were calculated by the HResults tool in HTK using manual transcriptions as ground truth and considering substitutions, deletions, and insertions.
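The WER computation itself is a standard Levenshtein alignment over word sequences; a self-contained sketch follows (the sentences are toy examples, not VCC transcripts).

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    via dynamic-programming edit distance over word sequences."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(m + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref_words[i-1] != hyp_words[j-1])
            d[i][j] = min(sub,           # substitution (or match)
                          d[i-1][j] + 1, # deletion
                          d[i][j-1] + 1) # insertion
    return d[n][m] / n

ref = "the voice conversion challenge has three tasks".split()
hyp = "the voice conversion challenge had tree task".split()
print(f"WER = {wer(ref, hyp):.1%}")   # 3 errors / 7 reference words
```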
4. Objective Evaluation Results for Each Submitted VC System
We used the converted audio produced by each of the submitted VC systems for Tasks 1 and 2 and computed the objective evaluation metrics described earlier. Note that Tasks 1 and 2 are intra-lingual semi-parallel and cross-lingual VC tasks, respectively. The results for each VC system for Tasks 1 and 2 are summarized in Tables 2 and 3, respectively.
ASV:
The ASV results are shown in the first to fourth columns of the tables. Concerning false acceptance rates on Task 1, 20 (out of 31) systems achieved error rates higher than 90%. Further, the top-5 systems obtained perfect results (100% false acceptance), and many other teams obtained near-perfect results. Concerning the miss rate of converted source speakers, it was higher than 90% for all teams except three (06, 08, 12). To sum up, the top systems all achieved similar results, and the majority of the systems achieved a high target speaker similarity. Nearly all systems managed to successfully move the converted voice 'away' from the original source. The results were more varied for Task 2, however: only ten systems yielded false acceptance rates above 90%, and there was substantial variation across the systems. The miss rates were generally worse than those in Task 1 for most systems although, as in Task 1, they were reasonably high for most systems.
Spoofing Countermeasures:
The fifth column of the tables shows the performance of the LCNN-based spoofing countermeasure for all the submitted systems on both tasks of VCC 2020. We can see that most of the teams achieved a high EER, which indicates that the VC systems were able to generate natural, human-like speech that was not easily detectable by the spoofing countermeasure. In addition, the performance trends of the spoofing countermeasure for most teams were similar in both tasks, excluding four teams (08, 18, 22, and 23). They showed a relatively higher EER for Task 2 than Task 1, which might be a result of the very different settings used for the two tasks by those teams.
MOSNet:
The MOSNet predictions of all systems are shown in the sixth column of the tables. We can see that the MOSNet predictions fell between 2.5 and 4.5, while the ground truth MOS typically ranged from 1.0 to 4.5, indicating that the overall variance of the MOSNet predictions was rather small. We can also see that the scores of each team for Tasks 1 and 2 were similar. This is consistent with the fact that most teams used the same system for both tasks.
WER:
The ASR WERs of all systems are shown in the final column of the tables. We can observe a large variance in WERs among teams in both tasks. For example, in Task 1, seven teams had WERs lower than 5%, while seven other teams had WERs higher than 50%. This indicates the diversity of the conversion methods adopted by different teams. After subjectively examining a few samples from the teams with high WERs, we could clearly perceive their intelligibility degradation. Comparing the WERs in Tasks 1 and 2, as expected, most teams had a higher WER on Task 2, which indicates the difficulty of cross-lingual VC.
5. Can the Metrics Predict Human Judgements on VC Speech?
In this section, we investigate our first question, "Can the metrics predict human judgements on the naturalness and speaker similarity of converted audio submitted for VCC 2020?"

In the VCC 2020, we conducted two large-scale crowd-sourced listening tests on the naturalness and speaker similarity of converted speech. The first test was done by 68 native English listeners (32 female, 33 male, and 3 unknown) and the second by 206 native Japanese listeners (96 male and 110 female). More details are described in [1]. Using these listening test results, we measured the correlations of each objective evaluation metric with the subjective evaluation results from the English and Japanese listeners.

More specifically, we created scatter plots matching each of the objective metrics at the system level against each of the subjective evaluation results, then calculated the Pearson correlation coefficients. The scatter plots are shown in Appendix B. Table 4 shows the Pearson correlation coefficients with the subjective evaluation results for each metric along with their p-values. The top and bottom tables show the correlations with the subjective scores obtained from the English and Japanese listeners, respectively. We first summarize the correlation analysis using the English listeners and then discuss the differences between this case and the Japanese one.
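The system-level correlation computation itself is a one-liner with scipy; the sketch below pairs the ASV EERs of the first few systems in Table 2 with made-up mean subjective similarity ratings purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Objective metric per system (ASV EERs of the first systems in Table 2)
# paired with made-up mean subjective similarity ratings.
asv_eer = np.array([33.00, 14.00, 23.00, 45.13, 0.00, 48.50, 0.50, 19.00])
subj_sim = np.array([3.2, 2.8, 3.0, 3.6, 1.4, 3.7, 1.9, 2.9])

r, p_value = pearsonr(asv_eer, subj_sim)
print(f"system-level Pearson r = {r:.2f} (p = {p_value:.3g})")
```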
Subjective quality rating:
From Table 4, we can see that the ASV-related metrics (EER, Pfa), MOSNet (vcc18, asvspoof19), and ASR WER had moderately positive or negative correlations with the subjective quality ratings in Task 1, while cosine distance, MOSNet (asvspoof19), and ASR WER had moderate correlations in Task 2. These findings are statistically significant. While it is slightly surprising that the ASV-related metrics and cosine distance were correlated with the subjective quality ratings, we assume this stems from the fact that human judgements on quality and speaker similarity are not independent. The fact that the ASV-related metrics had higher correlations with the speaker similarity ratings also supports this. We can also see that MOSNet (asvspoof19) had higher correlations than MOSNet (vcc18). This is because the asvspoof19 dataset contains more diverse and newer speech generation methods than the vcc18 dataset, which demonstrates the importance of choosing an appropriate training dataset for MOSNet.
Subjective speaker similarity rating:
We can see that all of the ASV-related metrics (EER, Pfa, cosine distance) had strong correlations with the subjective speaker similarity ratings in both tasks. Among these, the EER had the highest correlations in both tasks. MOSNet (vcc18, asvspoof19) had a moderate correlation with the speaker similarity ratings in Task 1, but its correlation in Task 2 was not statistically significant.

English listeners vs. Japanese listeners:
Next, we analyzed the differences between the English listeners' case and the Japanese one. We can see that the general tendencies were the same in both cases; that is, the ASV-related metrics had strong correlations with the subjective speaker similarity ratings, and MOSNet and ASR WER had moderate (positive or negative) correlations with the subjective quality ratings. Two minor differences were that the cosine distance had a slightly higher correlation than ASV EER for Task 2 and that MOSNet (asvspoof19) had a higher correlation than ASR WER for Task 2. These differences were marginal.

Table 2: Performance of objective measures for Task 1 (intra-lingual semi-parallel VC). Red cells indicate top-5 systems (including ties) for each metric.
Team ID ASV EER (%) ASV Pfa (%) ASV Pmiss (%) Cosine CM EER (%) MOSNet (vcc18) MOSNet (asvspoof19) ASR WER (%)
T01 33.00 98.25 100.00 0.93 22.47 3.57 3.55 22.78
T02 14.00 87.50 100.00 0.86 26.74 3.32 3.22 12.32
T03 23.00 82.00 99.75 0.90 0.78 3.37 3.64 80.26
T04 45.13 99.00 100.00 0.97 38.30 3.92 3.25 22.84
T06 0.00 0.00 21.75 0.72 14.77 2.65 2.99 3.65
T07 48.50 99.75 100.00 0.96 43.48 3.73 3.61 18.08
T08 0.50 0.50 78.25 0.76 37.97 2.89 2.86 6.95
T09 19.00 86.25 100.00 0.91 7.97 3.71 3.17 62.76
T10 51.00 100.00 100.00 0.98 43.98 3.90 3.70 4.12
T11 38.50 99.00 99.50 0.94 42.75 4.27 4.17 5.43
T12 0.00 0.00 8.00 0.45 31.46 3.02 3.10 3.50
T13 37.00 97.25 99.75 0.94 28.25 3.44 3.30 9.70
T14 1.00 6.00 99.50 0.76 61.96 2.85 2.47 19.77
T16 33.00 97.00 100.00 0.93 36.51 3.33 3.33 21.52
T17 7.50 34.25 100.00 0.80 47.24 2.95 3.11 52.07
T18 14.00 75.25 100.00 0.91 20.50 2.65 2.61 55.58
T19 33.63 98.50 100.00 0.94 42.75 3.08 3.39 65.80
T20 24.00 95.00 100.00 0.94 32.24 3.75 3.37 22.58
T21 2.00 13.75 98.50 0.75 47.52 3.86 3.45 30.84
T22 52.00 100.00 100.00 0.96 33.76 3.63 3.69 6.69
T23 45.00 99.75 100.00 0.94 32.02 3.51 3.71 19.28
T24 25.50 98.50 100.00 0.91 20.50 3.54 3.51 23.83
T25 33.50 98.50 98.75 0.96 27.52 3.58 3.67 4.82
T26 3.63 53.00 99.25 0.86 18.98 3.76 3.38 28.04
T27 37.50 100.00 100.00 0.95 19.49 3.41 3.31 3.42
T28 34.50 96.00 99.75 0.95 32.70 3.58 3.25 96.17
T29 45.50 100.00 100.00 0.96 34.44 3.94 3.71 8.47
T30 46.00 99.75 100.00 0.97 2.02 3.72 3.61 2.77
T31 31.63 99.50 100.00 0.92 25.45 3.27 2.66 77.80
T32 18.00 95.00 100.00 0.94 30.55 3.48 3.55 4.21
T33 43.13 100.00 100.00 0.96 33.25 3.55 3.72 9.64
Table 3: Performance of objective measures for Task 2 (cross-lingual VC). Red cells indicate top-5 systems (including ties) for each metric.
Team ID ASV EER (%) ASV Pfa (%) ASV Pmiss (%) Cosine CM EER (%) MOSNet (vcc18) MOSNet (asvspoof19) ASR WER (%)
T02 19.18 60.50 98.50 0.82 22.15 3.39 2.96 12.97
T03 16.00 43.50 99.83 0.84 0.82 3.31 3.67 81.25
T05 25.63 79.50 99.67 0.90 13.48 2.78 2.09 6.48
T06 1.18 1.33 21.33 0.73 16.01 2.80 2.98 5.18
T07 60.37 100.00 99.00 0.91 44.49 3.68 3.55 24.82
T08 0.08 0.17 72.83 0.74 46.64 3.00 3.07 3.80
T09 25.92 85.00 99.83 0.86 7.15 3.71 3.14 65.85
T10 45.55 97.50 96.00 0.95 49.81 3.96 3.72 4.11
T11 41.55 98.83 93.67 0.91 42.97 4.26 4.17 5.96
T12 26.00 71.33 100.00 0.84 29.81 2.81 2.31 29.40
T13 36.37 90.50 97.33 0.90 21.51 3.55 3.47 6.46
T15 4.82 17.00 98.00 0.86 50.50 4.33 3.30 13.10
T16 41.18 95.17 99.67 0.88 34.36 3.29 3.02 25.43
T18 20.37 66.00 99.67 0.84 32.02 2.75 2.27 74.01
T19 44.00 98.67 100.00 0.87 38.35 3.24 3.31 76.77
T20 5.63 18.67 91.00 0.85 34.68 4.06 3.61 23.15
T22 30.82 89.50 100.00 0.85 42.97 3.55 3.64 30.96
T23 32.82 88.83 97.50 0.91 53.67 3.31 2.87 18.32
T24 48.82 99.33 99.33 0.88 17.97 3.83 3.53 45.11
T25 30.82 89.83 90.33 0.93 29.30 3.60 3.70 4.58
T26 4.37 15.67 97.50 0.80 22.97 4.14 3.36 34.58
T27 33.63 75.33 93.17 0.89 26.64 3.37 3.47 3.93
T28 18.82 48.17 88.83 0.87 34.17 3.49 3.35 72.41
T29 47.63 98.83 98.83 0.93 33.85 3.98 3.74 8.86
T30 40.00 92.17 96.33 0.94 2.02 3.47 3.70 3.21
T31 29.63 90.83 99.17 0.86 19.81 3.21 2.90 70.02
T32 15.63 64.83 98.50 0.92 28.98 3.54 3.44 5.14
T33 23.63 80.33 80.67 0.89 34.49 3.92 3.53 19.55
We have demonstrated that the ASV metrics, in particular the EER, had a strong correlation with the subjective ratings on speaker similarity. Therefore, it should prove fruitful to analyze the top-ranked VC submissions further, as their subjective speaker similarity in Task 1 was as good as that of the target speakers according to the listening test results [1], and hence we cannot obtain meaningful differences between the top submissions from the listening test alone.

As reported in [1], eight VC systems (T10, T22, T27, T13, T33, T23, T29, and T07) had the fewest statistically significant differences from human speech. As expected, some of these differences were related to the EERs (shown in Table 2), although all of these systems had very high EERs, above 35%. In particular, T10 and T22 had EERs slightly higher than 50%, the chance level, and hence we expect that T10 and T22 have slightly 'emphasized' speaker characteristics compared to the real target speakers. T23, T29, and T07 had EERs between 45% and 48%, higher than those of T27, T13, and T33.

Table 4: Pearson correlation coefficients (with p-values) between each objective metric and the subjective evaluation results. Top: correlation with subjective scores from English listeners. Bottom: correlation with subjective scores from Japanese listeners. Bold font indicates the highest correlation among the objective metrics.

Subjective score | ASV EER (%) | ASV Pfa (%) | Cosine distance | Countermeasure EER (%) | MOSNet (vcc18) | MOSNet (asvspoof19) | ASR WER (%)
Table 5: Breakdown per language of the Pearson correlation coefficients with the subjective evaluation results for each metric, along with p-values. Top: correlation with subjective scores from English listeners. Bottom: correlation with subjective scores from Japanese listeners. Bold font indicates the highest correlation among the objective metrics. F, G, and M represent Finnish, German, and Mandarin target speakers, respectively.

Subjective score | ASV EER (%) | ASV Pfa (%) | Cosine distance | Countermeasure EER (%) | MOSNet (vcc18) | MOSNet (asvspoof19) | ASR WER (%)
Task 2 of VCC 2020 is a cross-lingual VC task, and the target speakers' speech contains utterances in German, Finnish, or Mandarin. As reported in [1], this factor affected the performance of the VC systems built by the challenge participants. The converted audio files for German target speakers had the highest naturalness and speaker similarity, while those for Mandarin had the lowest. Therefore, we want to confirm whether the objective metrics can capture such score variations and predict the listening test scores for each language.

A breakdown of the Pearson correlation coefficients per language is provided in Table 5. The top and bottom tables show correlations with the subjective scores from the English and Japanese listeners, respectively. The results for individual submissions are shown in Appendix C.

As we can see, the correlations of the ASV metrics and ASR WER were similar and stable across the three languages. Again, the MOS ratings were correlated with ASR WER, and the speaker similarity ratings were well correlated with the ASV metrics. We can also see that the MOS ratings by Japanese listeners had weaker correlations with ASR WER than those by English listeners.
Next, we carried out a multiple linear regression analysis to determine whether the prediction accuracy for the subjective scores could be improved by combining several objective metrics. Since the ASV metrics reported in the previous subsections exhibit multicollinearity, we selected ASV EER only for this analysis. Also, we chose MOSNet (asvspoof19) over MOSNet (vcc18) since the former had higher correlations with the subjective scores (as shown in Table 4). The estimated coefficients and statistics are listed in Table 6, where the top and bottom parts correspond to subjective scores from the English and Japanese listeners, respectively.

Table 6: Coefficients (with p-values) of multiple linear regression models that use ASR WER, MOSNet predictions, ASV EER, and countermeasure EER as inputs, together with the multiple and adjusted R-squared values and significance F. Top: subjective scores from English listeners. Bottom: subjective scores from Japanese listeners.

Subjective score | Intercept | MOSNet (asvspoof19) | ASR WER (%) | ASV EER (%) | Countermeasure EER (%) | Multiple R-squared | Adjusted R-squared | Significance F

From the table, by comparing the adjusted R-squared values with the Pearson correlation coefficients of Table 4, we can see that the prediction accuracy for the subjective quality rating scores can be improved by combining multiple objective metrics for all of Task 1 and for the quality rating of Task 2. The significant explanatory variables for MOS were ASV EER and ASR WER for Task 1, and MOSNet (asvspoof19) and ASR WER for Task 2. This was the case for both the English and Japanese listeners' scores. However, this was not the case for the subjective speaker similarity rating, for which only the coefficient for ASV EER was statistically significant. This is presumably because the ASV EER itself had sufficiently high correlations. This finding is consistent with the correlation analysis discussed in the previous section.

We can also see that the MOS score regression models in Task 2 had lower adjusted R-squared values (0.74 and 0.68) compared with the other regression results. This clearly suggests that predicting the naturalness score in the cross-lingual setting is harder and needs to be explained by more factors.

The results of our analysis demonstrated that the MOS ratings were correlated with ASR WER or MOSNet (asvspoof19), that the speaker similarity ratings were well correlated with the ASV metrics, and that these metrics were complementary for quality rating prediction.

Contrary to our expectations, only the ASV and ASR models trained using human speech data became important explanatory variables for predicting subjective ratings; the CM model trained using synthesized or converted speech did not. This might be due to the many new types of waveform generation methods adopted by the challenge participants. As reported in [1], ten types of vocoders were used for VCC 2020, many of which were not included in the training databases used for the CM model.

We should point out that the correlations reported in this paper are at the system level and as such represent neither the quality nor the speaker similarity of individual sentences. Investigating sentence-level score predictions will be the focus of future work.
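A sketch of this regression setup, assuming statsmodels; the per-system metric values below are random placeholders standing in for Tables 2 and 3, and the adjusted R-squared and coefficient p-values read off the fit correspond to the quantities reported in Table 6.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per submitted system; values are placeholders standing in for
# the real objective metrics and listening-test means.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mosnet_asvspoof19": rng.uniform(2.5, 4.5, 30),
    "asr_wer": rng.uniform(2, 95, 30),
    "asv_eer": rng.uniform(0, 52, 30),
    "cm_eer": rng.uniform(1, 62, 30),
})
# Synthetic "subjective MOS" so the fit has something to recover.
df["subjective_mos"] = (2.0 + 0.4 * df["mosnet_asvspoof19"]
                        - 0.02 * df["asr_wer"] + rng.normal(0, 0.2, 30))

X = sm.add_constant(df[["mosnet_asvspoof19", "asr_wer", "asv_eer", "cm_eer"]])
fit = sm.OLS(df["subjective_mos"], X).fit()
print(fit.rsquared_adj)      # adjusted R-squared, as in Table 6
print(fit.pvalues)           # per-coefficient significance
```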
6. Spoofing Performance Assessment
In this section, we address our second question, "Which VC technology has the highest spoofing risk for ASV and CM?" If we can answer this question and identify which VC methods the current ASV and CM systems are most vulnerable to, the speaker recognition community can use our results as a guideline to efficiently prepare additional training data to enhance the robustness of ASV and CM systems. Once spoofed data produced by the missing VC technologies is added to training databases for the CM systems and the VC technologies become "known" from the CM perspective, their discrimination normally becomes much easier. Therefore, we focus on the ASV and CM metrics in this section.
The ASV and CM results in Tables 2 and 3 were reported in isolation from each other. For the spoofing performance assessment, we consider the ASV and CM results jointly. Specifically, we envision a cascaded (tandem) system [34] where a CM is placed before the ASV, with the aim of preventing spoofing attacks from reaching the ASV system. VC audio files may be rejected by the CM if the audio contains detectable artifacts. Even if VC audio files are passed on to the ASV system, they may still be rejected if their speaker similarity is not close enough to the target speakers. Further, while our primary interest is the performance degradation due to falsely accepted VC spoofing attacks, the tandem system is also prone to two other types of errors. First, it may accept another similar-sounding human user (a non-target) as a target. Second, either the CM or the ASV may reject the actual target user. Using an overly sloppy CM or ASV system leads to compromised security, while conversely, an overly aggressive CM (or ASV) leads to reduced user convenience. To assess the joint performance of CM and ASV, we adopt a new metric called the (minimum) tandem detection cost function (t-DCF) [34]. It combines all the ASV and CM system errors into a single cost value for each submitted VC system.

Unlike the parameter-free metrics considered above, the t-DCF is a parameterized cost that makes the modeling assumptions of an envisioned operating environment (application) explicit. A desired security-convenience trade-off is specified through detection costs assigned to erroneous system decisions and prior probabilities assigned to the commonality of targets, non-targets, and spoofing attacks. The ASV threshold is set to the EER point on bona fide samples. Following the notations and parameter constraints in [34], we assign costs C_miss = 1 and C_fa = C_fa,spoof = 10, and priors π_spoof = 0.05, π_tar = (1 − π_spoof) × 0.99 ≈ 0.94, and π_non = (1 − π_spoof) × 0.01 ≈ 0.01. This is representative of a high-security user authentication application (e.g., access control) where target users and spoofing attacks are almost equally likely to occur, while nontarget users are rare. False acceptances (whether of nontargets or VC attacks) incur a ten-fold cost relative to false rejections. The higher the t-DCF value, the more detrimental the VC attack. The maximum value of 1.0 indicates an attack that renders the tandem system useless.

Table 7: Minimum t-DCF for each system of VCC 2020. Red cells indicate the top-5 systems for each task.

System | Task 1 | Task 2
T01 | 0.73542 | –
T02 | 0.85274 | 0.70888
T03 | 0.01467 | 0.01467
T04 | 0.88342 | –
T05 | – | 0.60904
T06 | 1.0000 | 0.72722
T07 | 0.87227 | 0.9033
T08 | 1.00000 | 1.00000
T09 | 0.25987 | 0.29213
T10 | 0.87126 | 0.91282
T11 | 0.87531 | 0.88646
T12 | 1.00000 | 0.84693
T13 | 0.88646 | 0.79685
T14 | 0.91708 | –
T15 | – | 0.8805
T16 | 0.87633 | 0.88818
T17 | 0.87734 | –
T18 | 0.70372 | 0.81145
T19 | 0.8743 | 0.90471
T20 | 0.85301 | 0.77249
T21 | 0.86755 | –
T22 | 0.86204 | 0.93512
T23 | 0.8297 | 0.9037
T24 | 0.76482 | 0.79092
T25 | 0.85402 | 0.85048
T26 | 0.71041 | 0.53263
T27 | 0.80151 | 0.84287
T28 | 0.91214 | 0.82598
T29 | 0.83375 | 0.87311
T30 | 0.04508 | 0.09695
T31 | 0.84069 | 0.70379
T32 | 0.80942 | 0.76208
T33 | 0.78095 | 0.83375

Table 7 shows the minimum t-DCF performance for each team of VCC 2020, with the top-5 systems showing the highest minimum t-DCF values highlighted for both tasks. Among these top-5 systems, team T08 reached the maximum possible cost (1.00) in both tasks. Table 2 reveals that T08 yielded a nearly zero ASV EER (0.50%); i.e., it did not succeed in fooling the ASV. At the same time, however, the corresponding CM EER was very high (37.97%), indicating difficulties in detecting this spoofing attack. This might be due to a lack of spoofing artifacts in T08, issues with CM generalization, or both.

The t-DCF was high in this case because the inaccurate CM falsely rejected many target utterances. Since the ASV would have rejected this unsuccessful attack with high probability, it might have been better to not use any CM system at all, certainly not a low-performing one. (The mathematical properties of the t-DCF [34] imply that if the spoof false acceptance rate (SFAR) of the ASV system is 0, the optimal countermeasure is no countermeasure. Interested readers are directed to [34, Eq. (10)], where the coefficient weighting the CM false acceptance rate is 0 whenever P_fa,spoof^asv = 0; a CM that minimizes Eq. (10) in this case must use a detection threshold τ_cm = −∞, i.e., 'accept everything' or 'no countermeasure'.) This general pattern (low ASV EER, relatively high CM EER) also held for T06, T12, and T14 in Task 1. For T28, however, the high t-DCF value was due to relatively high EERs for both the CM and the ASV.

In Task 2, apart from T08, we find a similar explanation for the high t-DCF values of the top-5 VC systems. Taking T10 as an example, the ASV EER was 45.55% and the CM EER was 49.81%; in other words, T10 fooled the ASV nearly perfectly. At the same time, the CM could do no better than random guessing in detecting this attack. Thus, T10 is a highly effective attack that is difficult to detect (by our CM). Careful examination of Table 3 reveals this pattern (high EERs on both ASV and CM) for T19, T22, and T23 as well.
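To make the cascade explicit, here is a simplified tandem-cost simulation under the costs and priors stated above: it applies CM-then-ASV decisions to hypothetical paired score arrays and sweeps the CM threshold. It is a stand-in illustrating the security-convenience trade-off, not a re-implementation of the exact minimum t-DCF of [34] (in particular, it omits the normalization).

```python
import numpy as np

C_MISS, C_FA, C_FA_SPOOF = 1.0, 10.0, 10.0
PI_SPOOF = 0.05
PI_TAR = (1 - PI_SPOOF) * 0.99   # targets are common (about 0.94)
PI_NON = (1 - PI_SPOOF) * 0.01   # nontargets are rare (about 0.01)

def tandem_cost(cm_tar, cm_non, cm_spf, asv_tar, asv_non, asv_spf, tau_asv):
    """Cost of the CM->ASV cascade, minimized over the CM threshold.
    A trial is accepted only if it passes BOTH the CM and the ASV;
    tau_cm = -inf corresponds to 'no countermeasure'."""
    best = np.inf
    candidates = np.concatenate([[-np.inf],
                                 np.sort(np.concatenate([cm_tar, cm_non, cm_spf]))])
    for tau_cm in candidates:
        # target missed if rejected by either stage
        p_miss = np.mean((cm_tar < tau_cm) | (asv_tar < tau_asv))
        # nontarget / spoof falsely accepted only if both stages accept
        p_fa_non = np.mean((cm_non >= tau_cm) & (asv_non >= tau_asv))
        p_fa_spf = np.mean((cm_spf >= tau_cm) & (asv_spf >= tau_asv))
        cost = (PI_TAR * C_MISS * p_miss
                + PI_NON * C_FA * p_fa_non
                + PI_SPOOF * C_FA_SPOOF * p_fa_spf)
        best = min(best, cost)
    return best

# Hypothetical paired scores; in the paper these come from the LCNN CM and
# the x-vector/PLDA ASV, with tau_asv fixed at the bona fide EER point.
rng = np.random.default_rng(0)
cost = tandem_cost(rng.normal(2, 1, 500), rng.normal(-2, 1, 500),
                   rng.normal(-1, 1, 500), rng.normal(2, 1, 500),
                   rng.normal(-2, 1, 500), rng.normal(1.5, 1, 500),
                   tau_asv=0.0)
print(f"tandem cost (unnormalized): {cost:.3f}")
```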
Next, we take a closer look at the top-5 systems to determine which VC approaches pose a potential spoofing threat in terms of the t-DCF. Table 8 shows the details of the VC approaches used in the top-5 systems for the spoofing assessment under the t-DCF measure; these details are taken from our VCC 2020 summary paper [1]. With this analysis, we try to highlight some of the VC models and vocoders that pose a potentially high spoofing threat. As expected, unseen VC methods that are not included in the training database (e.g., GAN variants (StarGAN, AdaGAN, Parallel WaveGAN) and VTLN) had the highest t-DCF values. These VC methods should be prioritized as attack methods to be added to countermeasure databases in the future.

Table 8: Details of the top-performing VC systems in terms of minimum t-DCF as a spoofing threat.

Task 1
Team ID | VC model | Vocoder
T06 | StarGAN | WORLD
T08 | VTLN + Spectral differential | WORLD
T12 | ADAGAN | AHOcoder
T14 | One-shot VC | NSF
T28 | Tacotron | WaveRNN

Task 2
Team ID | VC model | Vocoder
T08 | VTLN + Spectral differential | WORLD
T22 | ASR-TTS (Transformer) | Parallel WaveGAN
T10 | PPG-VC (LSTM) | WaveNet
T19 | VQVAE | Parallel WaveGAN
T23 | CycleVAE | WaveNet
Do the VC methods featured in VCC 2020 impose a spoofing risk? Yes, they certainly do. One useful reference is the ASV performance on natural human samples, more specifically, the EER on target-nontarget discrimination. These EERs were 0.50% and 0.80% for Tasks 1 and 2, respectively. Most VC systems increased these numbers, most of them substantially. This is not news to the ASV community as such. Our spoofing performance assessment through the t-DCF clearly highlights the importance of improving both ASV and CM technology. VCC 2020 not only featured a battery of new VC techniques but also facilitated an initial CM performance benchmark on a new type of multilingual data. Further research is needed to analyze the combined effect of VC methods and language, and substantial future work remains in improving the generalizability of CM techniques across diverse data conditions.
7. Conclusion
This work summarizes the predictions of subjective ratings and the spoofing assessments performed with objective measures at the latest VCC 2020. We considered five different objective assessments based on ASV, neural speaker embeddings, a spoofing countermeasure, predicted MOS, and ASR. The correlations of the objective assessments with the subjective ratings indicate that ASV, neural speaker embedding, and ASR had high correlations, which suggests the possibility of predicting the subjective ratings. Further, we found that the ASV and ASR results were more effective than the predicted MOS and the spoofing countermeasure for predicting the subjective test results using multiple linear regression models. This indicates a potential shift toward relying on objective assessments over tedious listening tests for large-scale evaluations in the future. We also performed a spoofing assessment on the submissions and identified the VC methods with a potentially high threat. However, this topic deserves future exploration, as the performance highly depends on the coverage of the various VC methods included in the training data.
8. Acknowledgements
The authors thank Ms. Jennifer Williams of the University of Edinburgh for kindly providing a MOSNet model fine-tuned using the ASVspoof 2019 data. This work was partially supported by JST CREST Grants (JPMJCR18A6, VoicePersonae project, and JPMJCR19A3, CoAugmentation project), Japan, MEXT KAKENHI Grants (16H06302, 17H04687, 17H06101, 18H04120, 18H04112, 18KT0051, 19K24373), Japan, the National Natural Science Foundation of China (Grant No. 61871358), Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), and the Academy of Finland (project no. 309629).
9. References

[1] Y. Zhao, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Toda, T. Kinnunen, and Z. Ling, "Voice conversion challenge 2020: intra-lingual semiparallel and cross-lingual voice conversion," in ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. XXX–XXX.
[2] T. Yoshimura, G. Eje Henter, O. Watts, M. Wester, J. Yamagishi, and K. Tokuda, "A hierarchical predictor of synthetic speech naturalness using neural networks," in Interspeech 2016, 2016, pp. 342–346.
[3] B. Patton, Y. Agiomyrgiannakis, M. Terry, K. W. Wilson, R. A. Saurous, and D. Sculley, "AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech," in NIPS End-to-end Learning for Speech and Audio Processing Workshop, 2016.
[4] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, "Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM," in Interspeech 2018, 2018, pp. 1873–1877.
[5] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, "Non-intrusive speech quality assessment using neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, 2019, pp. 631–635.
[6] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "MOSNet: Deep learning-based objective assessment for voice conversion," in Interspeech 2019, 2019, pp. 1541–1545.
[7] J. Williams, J. Rownicka, P. Oplustil, and S. King, "Comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis," in Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 222–229.
[8] Y. Choi, Y. Jung, and H. Kim, "Deep MOS predictor for synthetic speech using cluster-based modeling," arXiv preprint arXiv:2008.03710, 2020.
[9] Y. Choi, Y. Jung, and H. Kim, "Neural MOS prediction for synthesized speech using multi-task learning with spoofing detection and spoofing type classification," arXiv preprint arXiv:2007.08267, 2020.
[10] B. T. Meyer, B. Kollmeier, and J. Ooster, "Autonomous measurement of speech intelligibility utilizing automatic speech recognition," in Interspeech 2015, 2015, pp. 2982–2986.
[11] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[12] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," in Interspeech 2020, 2020.
[13] https://kaldi-asr.org/models/m7
[14] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," in Interspeech 2019, 2019, pp. 1033–1037.
[15] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016, 2016, pp. 4945–4949.
[16] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2001, vol. 2, 2001, pp. 749–752.
[17] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement, part I: temporal alignment," Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366–384, 2013.
[18] L. Malfait, J. Berger, and M. Kastner, "P.563, the ITU-T standard for single-ended speech quality assessment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1924–1934, 2006.
[19] S. King, "Measuring a decade of progress in text-to-speech," Loquens, vol. 1, no. 1, p. 006, 2014.
[20] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, "The voice conversion challenge 2016," in Interspeech 2016, 2016, pp. 1632–1636.
[21] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 195–202.
[22] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, L. Juvela, P. Alku, Y.-H. Peng, H.-T. Hwang, Y. Tsao, H.-M. Wang, S. L. Maguer, M. Becker, F. Henderson, R. Clark, Y. Zhang, Q. Wang, Y. Jia, K. Onuma, K. Mushika, T. Kaneda, Y. Jiang, L.-J. Liu, Y.-C. Wu, W.-C. Huang, T. Toda, K. Tanaka, H. Kameoka, I. Steiner, D. Matrouf, J.-F. Bonastre, A. Govender, S. Ronanki, J.-X. Zhang, and Z.-H. Ling, "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech and Language, vol. 64, p. 101114, 2020.
[23] H. Zheng, W. Cai, T. Zhou, S. Zhang, and M. Li, "Text-independent voice conversion using deep neural network based phonetic level features," in International Conference on Pattern Recognition (ICPR), Dec. 2016, pp. 2872–2877.
[24] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in IEEE International Conference on Multimedia and Expo (ICME) 2016, 2016, pp. 1–6.
[25] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, "Voice conversion using sequence-to-sequence learning of context posterior probabilities," in Interspeech 2017, 2017.
[26] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Interspeech 2018, 2018, pp. 1983–1987.
[27] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[28] J.-X. Zhang, Z.-H. Ling, Y. Jiang, L.-J. Liu, C. Liang, and L.-R. Dai, "Improving sequence-to-sequence acoustic modeling by adding text-supervision," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, 2019, pp. 6785–6789.
[29] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech 2017, 2017, pp. 999–1003.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding 2011, IEEE Signal Processing Society, Dec. 2011.
[31] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, p. 101027, 2020.
[32] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Interspeech 2015, 2015, pp. 2087–2091.
[33] Z. Wu, R. K. Das, J. Yang, and H. Li, "Light convolutional neural network with feature genuinization for detection of synthetic speech attacks," in Interspeech 2020, 2020.
[34] T. Kinnunen, H. Delgado, N. Evans, K. A. Lee, V. Vestman, A. Nautsch, M. Todisco, X. Wang, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2195–2210, 2020.

A. Results of Individual Metrics
Here, we show the rankings and comparisons of every team based on the various objective measures for both tasks discussed in Section 4.

Figure 1: Summary of ASV-based speaker similarity assessment for Task 1, plotting EER (tar-spoof), Pfa (tar), and Pmiss (src) per team; the natural-speech reference EER (tar-non) is 0.50%.
Figure 2: Summary of ASV-based speaker similarity assessment for Task 2; the natural-speech reference EER (tar-non) is 0.80%.
Figure 3: Summary of cosine distance-based neural speaker embedding similarity for Task 1.
Figure 4: Summary of cosine distance-based neural speaker embedding similarity for Task 2.
Figure 5: Summary of spoofing countermeasure EER (%) for Task 1.
Figure 6: Summary of spoofing countermeasure EER (%) for Task 2.
Figure 7: MOSNet predictions of all systems in Task 1 (vcc18 and asvspoof19 models).
Figure 8: MOSNet predictions of all systems in Task 2 (vcc18 and asvspoof19 models).
Figure 9: Summary of ASR WER (%) for Task 1.
Figure 10: Summary of ASR WER (%) for Task 2.

B. Objective Evaluation Results and Scatter Plots
The analysis presented here shows the correlation scatter plots of various objective measures against the subjective MOS and speakersimilarity for both English and Japanese listeners. The Pearson correlation coefficients along with the p -values corresponding toFigures 11 24 have been presented in Table 4. MOS EE R ( % ) (b) Task 1: Japanese Listeners T01T02 T03 T04T06 T07T08T09 T10T11T12 T13T14 T16T17T18 T19 T20T21 T22T23T24 T25T26 T27T28 T29T30T31 T32T331 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (a) Task 1: English Listeners T01T02T03 T04T06 T07T08T09 T10T11T12 T13T14 T16T17T18 T19 T20T21 T22T23T24 T25T26 T27T28 T29T30T31 T32T33 1 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (d) Task 2: Japanese Listeners T02 T03T05T06 T07T08T09 T10T11T12 T13T15T16T18 T19 T20T22 T23T24 T25T26 T27T28 T29T30T31 T32T331 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (c) Task 2: English Listeners T02T03 T05T06 T07T08T09 T10T11T12 T13T15T16T18 T19 T20T22 T23T24 T25T26 T27T28 T29T30T31 T32T33
Figure 11:
Scatter plots for ASV EER (%) with subjective MOS.
Figure 12: Scatter plots for ASV EER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: EER (%).]
Figure 13: Scatter plots for ASV Pfa (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: Pfa (%).]
Figure 14: Scatter plots for ASV Pfa (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: Pfa (%).]
Figure 15: Scatter plots for speaker embedding cosine distance with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: cosine distance; points labeled by team ID, plus the SOU (source speech) reference point.]
Figure 16: Scatter plots for speaker embedding cosine distance with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: cosine distance; includes the SOU reference point.]
Figure 17: Scatter plots for spoofing countermeasure EER (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: CM EER (%).]
Figure 18: Scatter plots for spoofing countermeasure EER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: CM EER (%).]
Figure 19: Scatter plots for MOSNet (vcc18) predictions with subjective MOS. [Same four-panel layout; x-axis: true MOS (1–5); y-axis: predicted MOS.]
Figure 20: Scatter plots for MOSNet (vcc18) predictions with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: predicted MOS.]
Figure 21: Scatter plots for MOSNet (asvspoof19) predictions with subjective MOS. [Same four-panel layout; x-axis: true MOS (1–5); y-axis: predicted MOS.]
Figure 22: Scatter plots for MOSNet (asvspoof19) predictions with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: predicted MOS.]
Figure 23: Scatter plots for ASR WER (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: WER (%).]
Figure 24: Scatter plots for ASR WER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: WER (%).]

Objective Evaluation Results by Target Speaker Language
Here, we analyze the effect of the target speaker's language on the different objective measures for all systems in Task 2 of VCC 2020. The correlation of each objective measure with the subjective tests for the individual language pairs is presented in Table 5. In the figures, “Fin”, “Ger”, and “Man” stand for Finnish, German, and Mandarin, respectively.
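The per-language breakdown in Table 5 amounts to grouping the per-system scores by target speaker language before computing each correlation. A minimal sketch with hypothetical records (language, subjective MOS, objective measure):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-system records for Task 2: target language,
# subjective MOS, and an objective measure (e.g., CM EER in %).
records = [
    ("Fin", 3.2, 30.1), ("Fin", 2.5, 44.0), ("Fin", 3.8, 22.5), ("Fin", 2.9, 36.4),
    ("Ger", 3.0, 35.2), ("Ger", 2.2, 49.3), ("Ger", 3.5, 27.8), ("Ger", 2.7, 40.9),
    ("Man", 2.9, 38.6), ("Man", 2.0, 55.1), ("Man", 3.4, 30.0), ("Man", 2.6, 43.7),
]

# One correlation per target language, mirroring the Table 5 layout.
for lang in ("Fin", "Ger", "Man"):
    mos = np.array([m for l, m, _ in records if l == lang])
    obj = np.array([o for l, _, o in records if l == lang])
    r, p = pearsonr(mos, obj)
    print(f"{lang}: r = {r:.2f}, p = {p:.3g}")
```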
Figure 25: ASV performance in EER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: ASV EER (%); series: Fin, Ger, Man.]
Figure 26: ASV performance in Pfa (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: Pfa (%); series: Fin, Ger, Man.]
Figure 27: Neural speaker embedding cosine distance of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: cosine distance; series: Fin, Ger, Man.]
Figure 28: Spoofing countermeasure performance in EER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: CM EER (%); series: Fin, Ger, Man.]
Figure 29: MOSNet (vcc18) predictions of all systems for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: predicted MOS; series: Fin, Ger, Man.]
Figure 30: MOSNet (asvspoof19) predictions of all systems for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: predicted MOS; series: Fin, Ger, Man.]
Figure 31: ASR performance in WER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: WER (%); series: Fin, Ger, Man.]