Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, Tomoki Toda
National University of Singapore, Singapore; University of Eastern Finland, Finland; Nagoya University, Japan; University of Science and Technology of China, China; National Institute of Informatics, Japan
[email protected]
Abstract
The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker verification (ASV), neural speaker embeddings, spoofing countermeasures, predicted mean opinion scores (MOS), and automatic speech recognition (ASR). Each of these objective measures assesses the VC output along different aspects. We observed that the correlations of these objective assessments with the subjective results were high for ASV, neural speaker embedding, and ASR, which makes them more influential for predicting subjective test results. In addition, we performed spoofing assessments on the submitted systems and identified some of the VC methods showing a potentially high security risk.
Index Terms: Voice Conversion Challenge 2020, objective evaluation, subjective rating prediction, spoofing assessment
1. Introduction
Voice conversion (VC), which refers to the digital cloning of a person's voice, can be used to modify an audio waveform so that it appears as if spoken by someone else (target) than the original speaker (source). VC is useful in many applications such as customizing audiobook and avatar voices, dubbing, the movie industry, teleconferencing, singing voice modification, voice restoration after surgery, and the cloning of voices of historical persons. Since VC technology involves identity conversion, it can also be used to protect the privacy of individuals on social media and in sensitive interviews. For the same reason, VC also enables spoofing (fooling) voice biometric systems and therefore has potential security implications.

VCC 2020 is the 3rd edition of the Voice Conversion Challenge (VCC). While the general background and subjective results are provided in [1], this study focuses on complementary objective evaluation results.

Conventionally, the target of VC technology is human listeners, so subjective assessment has been the primary method of assessment in all the VCC challenges. On the other hand, progress has been made recently in research fields relevant to objective evaluation, and human perception prediction and spoofing performance assessment against automatic speaker verification (ASV) systems are increasingly being used.

The former is utilized to model and predict human perception automatically. Until recently, predicting human perception of synthetic speech has been challenging, but new research [2-9] has demonstrated that advanced deep learning models and large amounts of paired data of synthetic speech and associated human judgement scores can lead to data-driven objective models that can predict human perception of synthetic speech to a certain extent. There are also several studies on using automatic speech recognition (ASR) as a proxy for subjective intelligibility estimation [10].

The latter pertains to spoofing and anti-spoofing research for ASV. One goal of VC is to produce convincing mimicry of specific target speaker voices, and it is widely known that VC can fool (spoof) unprotected ASV systems [11, 12]. Therefore, for increasing security and robustness, ASV systems normally adopt spoofing countermeasures, which are designed to learn the distinguishing artifacts that separate spoofed audio produced by VC from human speech. We assume that spoofing performance against ASV may be correlated with the speaker similarity of VC systems and that the countermeasure (CM) performance reflects the amount of artifacts produced by VC systems, which may or may not be audible to humans. Note that these systems are designed and optimized for discrimination by machines, and as such, the performances of ASV and CM may differ from human perceptions.

With the above as our motivation, we provide an array of complementary objective results representative of recent objective evaluation techniques.
Our tools include:
• text-independent ASV [13] for speaker similarity,
• text-independent CM [11] for real-vs.-fake assessment,
• automatic MOS prediction [6] for quality, and
• ASR for intelligibility.

To the best of our knowledge, these four metrics have never before been examined as a group within any single VC study. Using these metrics, this paper investigates the two questions below:
• Can the metrics predict human judgements on naturalness and speaker similarity?
• Which VC technology has the highest spoofing risk for ASV and CM?

In Section 2 of this paper, we give an overview of the motivation behind each of the objective evaluation metrics. Their implementation details are described in Section 3. Correlations with human judgements are analyzed in Section 4, and spoofing performance against ASV and CM is discussed in Section 5. We conclude in Section 6 with a brief summary and mention of future work.

Table 1: Summary of evaluation metrics used in assessing VCC 2020 submissions. ASV: automatic speaker verification, EER: equal error rate, MOS: mean opinion score, LCNN: light convolutional neural network, ASR: automatic speech recognition, WER: word error rate, CM: countermeasure.

Metric | Type of measure | Measurement tool | Implementation | Metric interpretation
ASV EER | Conv. src ↔ tgt similarity | ASV | Kaldi x-vector [13] | Not similar ... Similar
P_fa^tar | Conv. src ↔ tgt similarity | ASV | Kaldi x-vector [13] | Not similar ... Similar
P_miss^src | Conv. src ↔ src similarity | ASV | Kaldi x-vector [13] | Similar ... Not similar
Cosine | Conv. src ↔ tgt similarity | Speaker embedding | Kaldi x-vector [13] | Not similar ... Similar
CM EER | Artifact assessment | Spoofing CM | LCNN [14] | Fake ... Real
MOSNet | Quality | Objective MOS | MOSNet [6] | Lowest ... Highest
ASR WER | Intelligibility | ASR | Seq2seq with attention [15] | Perfect ... Unintelligible
2. Methodology
Objective metrics for speech signals can be categorized into intrusive and non-intrusive assessment methods. The former uses, as a reference, ground-truth natural clean audio that has the same linguistic content as the input speech. The latter does not use any reference. A summary of the objective metrics used in this paper is provided in Table 1. They are all non-intrusive metrics. The following subsections provide the motivation and general description of each method.
2.1. Speaker similarity assessment using ASV

We assess speaker similarity using ASV. An ASV system compares a test utterance of an unknown speaker with a hypothesized speaker's training utterance(s) and then outputs a speaker similarity score s ∈ ℝ. Higher scores indicate support for the same-speaker hypothesis and lower scores for the different-speaker hypothesis. A hard decision is obtained by comparing s with a pre-set verification threshold, τ_asv. The speakers are declared to be the same if s > τ_asv (otherwise, different).

When a VC method achieves good mimicry of a specific target speaker's voice, the above score becomes higher and hence the converted audio will be judged as the same speaker. Therefore, we use the impact of VC on the ASV error rates as speaker similarity metrics here. For each VC system, we report three different kinds of ASV error:

1. Equal error rate (EER): error rate at the τ_asv at which the spoof false acceptance rate and target speaker miss rate equal each other;
2. False acceptance rate of target (P_fa^tar): proportion of converted utterances declared as the targeted speaker;
3. Miss rate of source (P_miss^src): proportion of converted utterances not declared as the original source speaker.

The first gauges the ASV system's general accuracy in differentiating converted source utterances from real target speaker utterances (alternatively, the effectiveness of a VC system in fooling ASV). The second metric also gauges the VC system's ability to fool the ASV system, with the difference that our false acceptance rate computation uses a threshold fixed prior to observing any VC samples. The same holds for the third metric, the source miss rate, which gauges the VC system's ability to de-identify the source speaker in terms of ASV (alternatively, the ASV system's inability to re-identify the source speaker). The reference values are (EER, P_fa^tar, P_miss^src) = (50%, 100%, 100%) for an ideal VC system and (0%, 0%, 0%) for a useless one. Theoretically, the EER is constrained between 0% and 50% for any binary classification task, including ASV; values larger than 50% indicate decisions worse than random guessing (label flip), although in practice the values may exceed 50% due to data anomalies.

The second and third error rates defined above are obtained by fixing τ_asv before observing the converted utterances. The ASV system is optimized to differentiate real human same-speaker and different-speaker trials, but no VC samples are used in optimizing it. Thus, τ_asv is fixed using non-converted (natural) data only and remains fixed for the VC test. In practice, we set τ_asv at the EER operating point at which the (natural speech) miss and false alarm rates equal each other.

Strictly speaking, the above ASV scoring procedures differ from how our subjects evaluated speaker similarity in the main listening test [1]. In the listening test, they were asked to listen to a reference audio file (which had different linguistic content from the converted audio) and judge the speaker similarity of the converted audio to that reference. Therefore, we also computed the cosine similarity of speaker embedding vectors in addition to the ASV scores. We extracted speaker embedding vectors from both the converted audio and one of the reference audio files using the same tool as the above ASV system and measured their cosine similarity, cos_sim(A, B) = A · B / (‖A‖‖B‖), where A and B are the speaker embedding vectors obtained from the converted audio and the reference audio, respectively. Note that this is still technically a non-intrusive assessment because the reference audio for this measurement is different from ground-truth natural clean audio with the same linguistic content as the input speech.
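To make the three error rates and the cosine measure concrete, below is a minimal sketch that computes them from hypothetical score arrays. The score distributions, variable names, and the simple threshold sweep are our own illustration, not the Kaldi/PLDA pipeline described in Section 3.

```python
import numpy as np

def eer(pos_scores, neg_scores):
    """Equal error rate: sweep candidate thresholds and return the
    operating point where the miss and false acceptance rates cross."""
    best_gap, best_eer, best_tau = np.inf, None, None
    for tau in np.sort(np.concatenate([pos_scores, neg_scores])):
        p_miss = np.mean(pos_scores < tau)  # positives rejected
        p_fa = np.mean(neg_scores >= tau)   # negatives accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap = abs(p_miss - p_fa)
            best_eer, best_tau = (p_miss + p_fa) / 2, tau
    return best_eer, best_tau

def cos_sim(a, b):
    """Cosine similarity of two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ASV scores (higher supports the same-speaker hypothesis).
rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 400)           # genuine target trials
non = rng.normal(-2.0, 1.0, 400)          # nontarget trials (natural speech)
conv_vs_tgt = rng.normal(1.5, 1.0, 400)   # converted utterances vs. target model
conv_vs_src = rng.normal(-1.0, 1.0, 400)  # converted utterances vs. source model

# 1) EER between real target trials and converted ("spoof") trials.
asv_eer, _ = eer(tar, conv_vs_tgt)

# 2) and 3): fix tau_asv at the EER point of natural data only (tar vs. non)
# and apply it, unchanged, to the VC trials.
_, tau_asv = eer(tar, non)
p_fa_tar = np.mean(conv_vs_tgt >= tau_asv)   # converted accepted as target
p_miss_src = np.mean(conv_vs_src < tau_asv)  # converted rejected as source

print(f"EER={asv_eer:.1%}  P_fa^tar={p_fa_tar:.1%}  P_miss^src={p_miss_src:.1%}")
```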
2.2. Artifact assessment using a spoofing countermeasure

Spoofing countermeasures play an imperative role in defending ASV systems against various attacks. In general, such countermeasures are designed to learn the artifacts that distinguish natural human speech from different kinds of generated/spoofed speech in order to identify spoofing attacks. Therefore, artifact assessment using a spoofing countermeasure on the output speech of the various VC systems can indicate the amount of artifacts in the converted speech.

The performance of spoofing countermeasures is normally evaluated in terms of EER, similar to that of ASV systems. However, unlike ASV, speech from any human speaker is considered a target trial, and the converted/spoofed speech serves as the non-target trials. In the context of VC outputs, a high EER indicates the generation of more human-like speech, whereas a low EER indicates that the converted speech is inclined towards the characteristics of artificially generated speech.

2.3. Objective MOS prediction

The mean opinion score (MOS) used for subjective quality assessment is a numerical measure of the human-judged overall quality of audio, typically in the range of 1-5, where 1 is the lowest perceived quality and 5 is the highest. An objective MOS is used to assess speech quality by predicting and approximating the human assessment from an input audio signal. This technique has a long history, and several metrics have been proposed for speech coding and telephony. Famous metrics include PESQ [16] and POLQA [17], both of which are intrusive assessments. There is also P.563 [18], a metric for non-intrusive assessment, but it is not designed for evaluating the quality of synthetic or converted speech.

When a large amount of paired data of synthetic speech and associated human judgement scores is available, we can view objective MOS prediction as a machine learning-based regression problem. Various deep learning models have been proposed to predict MOS values using listening test data collected from the Blizzard Challenge [19], our previous VCCs [20, 21], or the ASVspoof Challenge [22] for supervision. More specifically, [2] predicted the Blizzard Challenge's results, [6, 8, 9] reported predictions of VCC results, and [7] reported predictions of ASVspoof's listening test results. While all of them exhibited moderate correlations with human judgements, it is still unknown whether these models can generalize to new speech synthesis methods that are not already included in the databases.

Among these models, we have chosen to examine MOSNet [6], a deep learning-based non-intrusive assessor, since its predictions are reported to have a moderate correlation with the subjective evaluation done for VCC 2018.
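As a rough sketch of this style of model, assuming PyTorch: a convolution stack over the input spectrogram, a BLSTM, fully connected layers producing frame-level scores, and global average pooling to an utterance-level MOS, trained with a mean-square loss. The layer count and sizes here are shortened for brevity and are not the exact MOSNet [6] configuration.

```python
import torch
import torch.nn as nn

class MOSNetLike(nn.Module):
    """Simplified MOSNet-style regressor: conv stack -> BLSTM -> FC ->
    per-frame MOS scores, averaged over time to one utterance-level score."""
    def __init__(self, n_bins=257):
        super().__init__()
        # MOSNet [6] uses 12 convolution layers; we keep 4 for brevity.
        chans = [1, 16, 32, 64, 128]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=(1, 3), padding=1),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        with torch.no_grad():  # infer the flattened feature size
            f = self.conv(torch.zeros(1, 1, 8, n_bins))
        feat_dim = f.shape[1] * f.shape[3]
        self.blstm = nn.LSTM(feat_dim, 128, batch_first=True,
                             bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                nn.Linear(128, 1))

    def forward(self, spec):              # spec: (batch, frames, n_bins)
        x = self.conv(spec.unsqueeze(1))  # -> (batch, chan, frames, bins')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)
        frame_mos = self.fc(x).squeeze(-1)  # one score per frame
        return frame_mos.mean(dim=1)        # global average pooling

model = MOSNetLike()
pred = model(torch.randn(2, 300, 257))      # two utterances, 300 frames each
loss = nn.MSELoss()(pred, torch.tensor([3.2, 4.1]))  # regress listening-test MOS
```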
2.4. Intelligibility assessment using ASR

Although some recent VC methods can achieve high naturalness and similarity of converted speech, they may degrade its intelligibility. For example, in the recognition-synthesis approach [23-26] to VC, an ASR model is usually adopted to extract linguistic-related features, e.g., phonetic posteriorgrams (PPGs) [24] or bottleneck features [26], from source speech. In this case, recognition errors are inevitable, which may degrade the intelligibility of the converted speech. Moreover, in VC methods using sequence-to-sequence acoustic models [27], failed alignment may lead to repetition and deletion of speech segments, especially when the amount of training data is limited [28].

Considering the cost of conducting subjective intelligibility evaluations for all conversion pairs, we adopted the word error rate (WER) of ASR as an objective metric for the intelligibility of converted speech in VCC 2020. A lower WER indicates higher intelligibility.
3. Implementation Details
3.1. ASV and speaker embeddings

Our ASV system utilizes x-vector [29]-based deep speaker embeddings. We use Kaldi's [30] recipe [13] trained on VoxCeleb data [31]. The system uses a time-delay neural network (TDNN) model trained with cross-entropy loss (treating training speakers as classes) to extract one 512-dimensional deep speaker embedding per utterance. The speaker similarity score is computed using probabilistic linear discriminant analysis (PLDA).

The system is used as a scoring tool without specific modifications (e.g., domain adaptation) for the VCC 2020 data. The source and target speaker reference models are obtained from the respective training utterances provided to the challenge participants. The training x-vectors of each utterance are used to form one averaged x-vector per speaker. Test utterance x-vectors are then scored against these averaged models. For pre-VC ASV tests (required when setting the ASV threshold), we use the original source test data (provided to challenge participants) and the target speaker reference data (not provided to challenge participants). For the VC tests, we replace the original source utterances with their VC-processed versions.

For calculating the cosine distance of the speaker embeddings, we use the same Kaldi-based x-vector extractor as the ASV system. The x-vector dimensions were reduced to 200 using LDA before we computed the cosine similarity between converted speech and natural speech. We then calculated the averaged value per system.
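A minimal sketch of the embedding-side scoring flow, assuming the x-vectors have already been extracted; scikit-learn's LDA stands in for Kaldi's transform, and all data here are random placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Placeholder 512-dim x-vectors with speaker labels, standing in for the
# embeddings produced by the Kaldi recipe [13].
train_xvecs = rng.normal(size=(5000, 512))
train_spkrs = rng.integers(0, 1000, size=5000)

# Reduce to 200 dimensions with LDA (speakers as classes) before cosine
# scoring, mirroring the dimensionality reduction described above.
lda = LinearDiscriminantAnalysis(n_components=200).fit(train_xvecs, train_spkrs)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One averaged reference vector for the target speaker, scored against
# each converted utterance; the mean gives the per-system value.
target_model = lda.transform(rng.normal(size=(10, 512))).mean(axis=0)
conv_xvecs = lda.transform(rng.normal(size=(25, 512)))
system_score = np.mean([cos_sim(v, target_model) for v in conv_xvecs])
print(f"averaged cosine similarity for this system: {system_score:.3f}")
```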
3.2. Spoofing countermeasure

We used a light convolutional neural network (LCNN)-based system as the spoofing countermeasure in our studies [14]. The system takes 60-dimensional (20-static + 20-Δ + 20-ΔΔ) linear frequency cepstral coefficient (LFCC) features as input [32]. The training set of the ASVspoof 2019 logical access corpus is used to build the model [22]. The detailed architecture and implementation of the LCNN system are available in [33]. We considered the utterances from the training set of the VCC 2020 database as the bona fide trials, whereas the submissions from the various teams constitute the spoof trials for evaluating the performance of every system submitted to the challenge.

3.3. MOSNet

Our MOSNet model architecture followed the original setting in [6] (https://github.com/lochenchou/MOSNet, https://github.com/rhoposit/MOS_Estimation). Specifically, the raw magnitude spectrogram was first extracted from the converted speech and used as the input feature. The main model consisted of 12 convolution layers, one bidirectional long short-term memory layer, and two fully connected layers followed by a global averaging layer that pooled the frame-level scores to generate the final utterance-level predicted MOS. The whole network was trained to regress the listening test scores by minimizing the mean square loss.

For training the MOSNet, we used two datasets: one composed of listening test data collected for VCC 2018 [21] and the other of listening test data collected for ASVspoof 2019 [22]. The former contains many of the VC systems available in 2018, and the latter contains more recent speech synthesis and VC methods available from 2019. They are referred to as MOSNet (vcc18) and MOSNet (asvspoof19), respectively.

3.4. ASR

The ASR engine was a prototype system developed by iFlytek. It features a state-of-the-art end-to-end neural network-based ASR architecture and was trained using 10,000 hours of recordings and GB-level texts for language modeling. The vocabulary size was around 200,000. WERs were calculated by the HResults tool in HTK using manual transcriptions as ground truth and considering substitutions, deletions, and insertions.
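The WER computation itself is a standard Levenshtein alignment over word sequences; a self-contained sketch follows (the sentences are toy examples, not VCC transcripts).

```python
def wer(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    via dynamic-programming edit distance over word sequences."""
    n, m = len(ref_words), len(hyp_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(m + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref_words[i-1] != hyp_words[j-1])
            d[i][j] = min(sub,           # substitution (or match)
                          d[i-1][j] + 1, # deletion
                          d[i][j-1] + 1) # insertion
    return d[n][m] / n

ref = "the voice conversion challenge has three tasks".split()
hyp = "the voice conversion challenge had tree task".split()
print(f"WER = {wer(ref, hyp):.1%}")   # 3 errors / 7 reference words
```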
4. Objective Evaluation Results for Each Submitted VC System
We used the converted audio produced by each of the submitted VC systems for Tasks 1 and 2 and computed the objective evaluation metrics described earlier. Note that Tasks 1 and 2 are intra-lingual semi-parallel and cross-lingual VC tasks, respectively. The results for each VC system for Tasks 1 and 2 are summarized in Tables 2 and 3, respectively.
ASV:
The ASV results are shown in the first to fourth columns of the tables. Concerning false acceptance rates on Task 1, 20 (out of 31) systems achieved error rates higher than 90%. Further, the top-5 systems obtained perfect results (100% false acceptance), and many other teams obtained near-perfect results. Concerning the miss rate of converted source speakers, it was higher than 90% for all teams except three (06, 08, 12). To sum up, the top systems all achieved similar results, and the majority of the systems achieved a high target speaker similarity. Nearly all systems managed to successfully move the converted voice 'away' from the original source. The results were more varied for Task 2, however: only ten systems yielded false acceptance rates above 90%, and there was substantial variation across the systems. The miss rates were generally worse than those in Task 1 for most systems although, as in Task 1, they were reasonably high for most systems.
Spoofing Countermeasures:
The fifth column of the tables shows the performance of the LCNN-based spoofing countermeasure for all the submitted systems on both tasks of VCC 2020. We can see that most of the teams achieved a high EER, which indicates that the VC systems were able to generate natural, human-like speech that was not easily detectable by the spoofing countermeasure. In addition, the performance trends of the spoofing countermeasure for most teams were similar in both tasks, excluding four teams (08, 18, 22, and 23). They showed a relatively higher EER for Task 2 than Task 1, which might be a result of the very different settings used for the two tasks by those teams.
MOSNet:
The MOSNet predictions of all systems are shown in the sixth column of the tables. We can see that the MOSNet predictions fell between 2.5 and 4.5, while the ground truth MOS typically ranged from 1.0 to 4.5, indicating that the overall variance of the MOSNet predictions was rather small. We can also see that the scores of each team for Tasks 1 and 2 were similar. This is consistent with the fact that most teams used the same system for both tasks.
WER:
The ASR WERs of all systems are shown in the final column of the tables. We can observe a large variance in WERs among teams in both tasks. For example, in Task 1, seven teams had WERs lower than 5%, while seven other teams had WERs higher than 50%. This indicates the diversity of the conversion methods adopted by different teams. After subjectively examining a few samples from the teams with high WERs, we could clearly perceive their intelligibility degradation. Comparing the WERs in Tasks 1 and 2, as expected, most teams had a higher WER on Task 2, which indicates the difficulty of cross-lingual VC.
5. Can the Metrics Predict Human Judgements on VC Speech?
In this section, we investigate our first question, "Can the metrics predict human judgements on the naturalness and speaker similarity of converted audio submitted for VCC 2020?"

In the VCC 2020, we conducted two large-scale crowd-sourced listening tests on the naturalness and speaker similarity of converted speech. The first test was done by 68 native English listeners (32 female, 33 male, and 3 unknown) and the second by 206 native Japanese listeners (96 male and 110 female). More details are described in [1]. Using these listening test results, we measured the correlations of each objective evaluation metric with the subjective evaluation results from the English and Japanese listeners.

More specifically, we created scatter plots matching each of the objective metrics at the system level against each of the subjective evaluation results, then calculated the Pearson correlation coefficients. The scatter plots are shown in Appendix B. Table 4 shows the Pearson correlation coefficients with the subjective evaluation results for each metric along with their p-values. The top and bottom tables show the correlations with the subjective scores obtained from the English and Japanese listeners, respectively. We first summarize the correlation analysis using the English listeners and then discuss the differences between this case and the Japanese one.
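The system-level correlation computation itself is a one-liner with scipy; the sketch below pairs the ASV EERs of the first few systems in Table 2 with made-up mean subjective similarity ratings purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Objective metric per system (ASV EERs of the first systems in Table 2)
# paired with made-up mean subjective similarity ratings.
asv_eer = np.array([33.00, 14.00, 23.00, 45.13, 0.00, 48.50, 0.50, 19.00])
subj_sim = np.array([3.2, 2.8, 3.0, 3.6, 1.4, 3.7, 1.9, 2.9])

r, p_value = pearsonr(asv_eer, subj_sim)
print(f"system-level Pearson r = {r:.2f} (p = {p_value:.3g})")
```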
Subjective quality rating:
From Table 4, we can see that the ASV-related metrics (EER, Pfa), MOSNet (vcc18, asvspoof19), and ASR WER had moderately positive or negative correlations with the subjective quality ratings in Task 1, while cosine distance, MOSNet (asvspoof19), and ASR WER had moderate correlations in Task 2. These findings are statistically significant. While it is slightly surprising that the ASV-related metrics and cosine distance were correlated with the subjective quality ratings, we assume this stems from the fact that human judgements on quality and speaker similarity are not independent. The fact that the ASV-related metrics had higher correlations with the speaker similarity ratings also supports this. We can also see that MOSNet (asvspoof19) had higher correlations than MOSNet (vcc18). This is because the asvspoof19 dataset contains more diverse and newer speech generation methods than the vcc18 dataset, which demonstrates the importance of choosing an appropriate training dataset for MOSNet.
Subjective speaker similarity rating:
We can see that all of the ASV-related metrics (EER, Pfa, cosine distance) had strong correlations with the subjective speaker similarity ratings in both tasks. Among these, the EER had the highest correlations in both tasks. MOSNet (vcc18, asvspoof19) had a moderate correlation with the speaker similarity ratings in Task 1, but its correlation in Task 2 was not statistically significant.

English listeners vs. Japanese listeners:
Next, we analyzed the differences between the English listeners' case and the Japanese one. We can see that the general tendencies were the same in both cases; that is, the ASV-related metrics had strong correlations with the subjective speaker similarity ratings, and MOSNet and ASR WER had moderate (positive or negative) correlations with the subjective quality ratings. Two minor differences were that the cosine distance had a slightly higher correlation than ASV EER for Task 2 and that MOSNet (asvspoof19) had a higher correlation than ASR WER for Task 2. These differences were marginal.

Table 2: Performance of objective measures for Task 1 (intra-lingual semi-parallel VC). Red cells indicate top-5 systems (including ties) for each metric.
Team ID ASV EER (%) ASV Pfa (%) ASV Pmiss (%) Cosine CM EER (%) MOSNet (vcc18) MOSNet (asvspoof19) ASR WER (%)
T01 33.00 98.25 100.00 0.93 22.47 3.57 3.55 22.78
T02 14.00 87.50 100.00 0.86 26.74 3.32 3.22 12.32
T03 23.00 82.00 99.75 0.90 0.78 3.37 3.64 80.26
T04 45.13 99.00 100.00 0.97 38.30 3.92 3.25 22.84
T06 0.00 0.00 21.75 0.72 14.77 2.65 2.99 3.65
T07 48.50 99.75 100.00 0.96 43.48 3.73 3.61 18.08
T08 0.50 0.50 78.25 0.76 37.97 2.89 2.86 6.95
T09 19.00 86.25 100.00 0.91 7.97 3.71 3.17 62.76
T10 51.00 100.00 100.00 0.98 43.98 3.90 3.70 4.12
T11 38.50 99.00 99.50 0.94 42.75 4.27 4.17 5.43
T12 0.00 0.00 8.00 0.45 31.46 3.02 3.10 3.50
T13 37.00 97.25 99.75 0.94 28.25 3.44 3.30 9.70
T14 1.00 6.00 99.50 0.76 61.96 2.85 2.47 19.77
T16 33.00 97.00 100.00 0.93 36.51 3.33 3.33 21.52
T17 7.50 34.25 100.00 0.80 47.24 2.95 3.11 52.07
T18 14.00 75.25 100.00 0.91 20.50 2.65 2.61 55.58
T19 33.63 98.50 100.00 0.94 42.75 3.08 3.39 65.80
T20 24.00 95.00 100.00 0.94 32.24 3.75 3.37 22.58
T21 2.00 13.75 98.50 0.75 47.52 3.86 3.45 30.84
T22 52.00 100.00 100.00 0.96 33.76 3.63 3.69 6.69
T23 45.00 99.75 100.00 0.94 32.02 3.51 3.71 19.28
T24 25.50 98.50 100.00 0.91 20.50 3.54 3.51 23.83
T25 33.50 98.50 98.75 0.96 27.52 3.58 3.67 4.82
T26 3.63 53.00 99.25 0.86 18.98 3.76 3.38 28.04
T27 37.50 100.00 100.00 0.95 19.49 3.41 3.31 3.42
T28 34.50 96.00 99.75 0.95 32.70 3.58 3.25 96.17
T29 45.50 100.00 100.00 0.96 34.44 3.94 3.71 8.47
T30 46.00 99.75 100.00 0.97 2.02 3.72 3.61 2.77
T31 31.63 99.50 100.00 0.92 25.45 3.27 2.66 77.80
T32 18.00 95.00 100.00 0.94 30.55 3.48 3.55 4.21
T33 43.13 100.00 100.00 0.96 33.25 3.55 3.72 9.64
Table 3: Performance of objective measures for Task 2 (cross-lingual VC). Red cells indicate top-5 systems (including ties) for each metric.
Team ID ASV EER (%) ASV Pfa (%) ASV Pmiss (%) Cosine CM EER (%) MOSNet (vcc18) MOSNet (asvspoof19) ASR WER (%)
T02 19.18 60.50 98.50 0.82 22.15 3.39 2.96 12.97
T03 16.00 43.50 99.83 0.84 0.82 3.31 3.67 81.25
T05 25.63 79.50 99.67 0.90 13.48 2.78 2.09 6.48
T06 1.18 1.33 21.33 0.73 16.01 2.80 2.98 5.18
T07 60.37 100.00 99.00 0.91 44.49 3.68 3.55 24.82
T08 0.08 0.17 72.83 0.74 46.64 3.00 3.07 3.80
T09 25.92 85.00 99.83 0.86 7.15 3.71 3.14 65.85
T10 45.55 97.50 96.00 0.95 49.81 3.96 3.72 4.11
T11 41.55 98.83 93.67 0.91 42.97 4.26 4.17 5.96
T12 26.00 71.33 100.00 0.84 29.81 2.81 2.31 29.40
T13 36.37 90.50 97.33 0.90 21.51 3.55 3.47 6.46
T15 4.82 17.00 98.00 0.86 50.50 4.33 3.30 13.10
T16 41.18 95.17 99.67 0.88 34.36 3.29 3.02 25.43
T18 20.37 66.00 99.67 0.84 32.02 2.75 2.27 74.01
T19 44.00 98.67 100.00 0.87 38.35 3.24 3.31 76.77
T20 5.63 18.67 91.00 0.85 34.68 4.06 3.61 23.15
T22 30.82 89.50 100.00 0.85 42.97 3.55 3.64 30.96
T23 32.82 88.83 97.50 0.91 53.67 3.31 2.87 18.32
T24 48.82 99.33 99.33 0.88 17.97 3.83 3.53 45.11
T25 30.82 89.83 90.33 0.93 29.30 3.60 3.70 4.58
T26 4.37 15.67 97.50 0.80 22.97 4.14 3.36 34.58
T27 33.63 75.33 93.17 0.89 26.64 3.37 3.47 3.93
T28 18.82 48.17 88.83 0.87 34.17 3.49 3.35 72.41
T29 47.63 98.83 98.83 0.93 33.85 3.98 3.74 8.86
T30 40.00 92.17 96.33 0.94 2.02 3.47 3.70 3.21
T31 29.63 90.83 99.17 0.86 19.81 3.21 2.90 70.02
T32 15.63 64.83 98.50 0.92 28.98 3.54 3.44 5.14
T33 23.63 80.33 80.67 0.89 34.49 3.92 3.53 19.55
We have demonstrated that the ASV metrics, in particular the EER, had a strong correlation with the subjective ratings on speaker similarity. Therefore, it should prove fruitful to analyze the top-ranked VC submissions further, as their subjective speaker similarity in Task 1 was as good as that of the target speakers according to the listening test results [1], and hence we cannot obtain meaningful differences between the top submissions from the listening test alone.

As reported in [1], eight VC systems (T10, T22, T27, T13, T33, T23, T29, and T07) had the fewest statistically significant differences from human speech. As expected, some of these differences were related to the EERs (shown in Table 2), although all of these systems had very high EERs, above 35%. In particular, T10 and T22 had EERs slightly higher than 50%, the chance level, and hence we expect that T10 and T22 have slightly 'emphasized' speaker characteristics compared to the real target speakers. T23, T29, and T07 had EERs between 45% and 48%, higher than those of T27, T13, and T33.

Table 4: Pearson correlation coefficients (with p-values) between each objective metric and the subjective evaluation results. Top: correlation with subjective scores from English listeners. Bottom: correlation with subjective scores from Japanese listeners. Bold font indicates the highest correlation among the objective metrics.

Subjective score | ASV EER (%) | ASV Pfa (%) | Cosine distance | Countermeasure EER (%) | MOSNet (vcc18) | MOSNet (asvspoof19) | ASR WER (%)
Table 5: Breakdown per language of the Pearson correlation coefficients with the subjective evaluation results for each metric, along with p-values. Top: correlation with subjective scores from English listeners. Bottom: correlation with subjective scores from Japanese listeners. Bold font indicates the highest correlation among the objective metrics. F, G, and M represent Finnish, German, and Mandarin target speakers, respectively.

Subjective score | ASV EER (%) | ASV Pfa (%) | Cosine distance | Countermeasure EER (%) | MOSNet (vcc18) | MOSNet (asvspoof19) | ASR WER (%)
Task 2 of VCC 2020 is a cross-lingual VC task, and the target speakers' speech contains utterances in German, Finnish, or Mandarin. As reported in [1], this factor affected the performance of the VC systems built by the challenge participants. The converted audio files for German target speakers had the highest naturalness and speaker similarity, while those for Mandarin had the lowest. Therefore, we want to confirm whether the objective metrics can capture such score variations and predict the listening test scores for each language.

A breakdown of the Pearson correlation coefficients per language is provided in Table 5. The top and bottom tables show correlations with the subjective scores from the English and Japanese listeners, respectively. The results for individual submissions are shown in Appendix C.

As we can see, the correlations of the ASV metrics and ASR WER were similar and stable across the three languages. Again, the MOS ratings were correlated with ASR WER, and the speaker similarity ratings were well correlated with the ASV metrics. We can also see that the MOS ratings by Japanese listeners had weaker correlations with ASR WER than those by English listeners.
Next, we carried out a multiple linear regression analysis to determine whether the prediction accuracy for the subjective scores could be improved by combining several objective metrics. Since the ASV metrics reported in the previous subsections exhibit multicollinearity, we selected ASV EER only for this analysis. Also, we chose MOSNet (asvspoof19) over MOSNet (vcc18) since the former had higher correlations with the subjective scores (as shown in Table 4). The estimated coefficients and statistics are listed in Table 6, where the top and bottom parts correspond to subjective scores from the English and Japanese listeners, respectively.

Table 6: Coefficients (with p-values) of multiple linear regression models that use ASR WER, MOSNet predictions, ASV EER, and countermeasure EER as inputs, together with the multiple and adjusted R-squared values and significance F. Top: subjective scores from English listeners. Bottom: subjective scores from Japanese listeners.

Subjective score | Intercept | MOSNet (asvspoof19) | ASR WER (%) | ASV EER (%) | Countermeasure EER (%) | Multiple R-squared | Adjusted R-squared | Significance F

From the table, by comparing the adjusted R-squared values with the Pearson correlation coefficients of Table 4, we can see that the prediction accuracy for the subjective quality rating scores can be improved by combining multiple objective metrics for all of Task 1 and for the quality rating of Task 2. The significant explanatory variables for MOS were ASV EER and ASR WER for Task 1, and MOSNet (asvspoof19) and ASR WER for Task 2. This was the case for both the English and Japanese listeners' scores. However, this was not the case for the subjective speaker similarity rating, for which only the coefficient for ASV EER was statistically significant. This is presumably because the ASV EER itself had sufficiently high correlations. This finding is consistent with the correlation analysis discussed in the previous section.

We can also see that the MOS score regression models in Task 2 had lower adjusted R-squared values (0.74 and 0.68) compared with the other regression results. This clearly suggests that predicting the naturalness score in the cross-lingual setting is harder and needs to be explained by more factors.

The results of our analysis demonstrated that the MOS ratings were correlated with ASR WER or MOSNet (asvspoof19), that the speaker similarity ratings were well correlated with the ASV metrics, and that these metrics were complementary for quality rating prediction.

Contrary to our expectations, only the ASV and ASR models trained using human speech data became important explanatory variables for predicting subjective ratings; the CM model trained using synthesized or converted speech did not. This might be due to the many new types of waveform generation methods adopted by the challenge participants. As reported in [1], ten types of vocoders were used for VCC 2020, many of which were not included in the training databases used for the CM model.

We should point out that the correlations reported in this paper are at the system level and as such represent neither the quality nor the speaker similarity of individual sentences. Investigating sentence-level score predictions will be the focus of future work.
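A sketch of this regression setup, assuming statsmodels; the per-system metric values below are random placeholders standing in for Tables 2 and 3, and the adjusted R-squared and coefficient p-values read off the fit correspond to the quantities reported in Table 6.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per submitted system; values are placeholders standing in for
# the real objective metrics and listening-test means.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mosnet_asvspoof19": rng.uniform(2.5, 4.5, 30),
    "asr_wer": rng.uniform(2, 95, 30),
    "asv_eer": rng.uniform(0, 52, 30),
    "cm_eer": rng.uniform(1, 62, 30),
})
# Synthetic "subjective MOS" so the fit has something to recover.
df["subjective_mos"] = (2.0 + 0.4 * df["mosnet_asvspoof19"]
                        - 0.02 * df["asr_wer"] + rng.normal(0, 0.2, 30))

X = sm.add_constant(df[["mosnet_asvspoof19", "asr_wer", "asv_eer", "cm_eer"]])
fit = sm.OLS(df["subjective_mos"], X).fit()
print(fit.rsquared_adj)      # adjusted R-squared, as in Table 6
print(fit.pvalues)           # per-coefficient significance
```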
6. Spoofing Performance Assessment
In this section, we address our second question, "Which VC technology has the highest spoofing risk for ASV and CM?" If we can answer this question and identify which VC methods the current ASV and CM systems are most vulnerable to, the speaker recognition community can use our results as a guideline to efficiently prepare additional training data to enhance the robustness of ASV and CM systems. Once spoofed data produced by the missing VC technologies is added to training databases for the CM systems and the VC technologies become "known" from the CM perspective, their discrimination normally becomes much easier. Therefore, we focus on the ASV and CM metrics in this section.
The ASV and CM results in Tables 2 and 3 were reported in isolation from each other. For the spoofing performance assessment, we consider the ASV and CM results jointly. Specifically, we envision a cascaded (tandem) system [34] where a CM is placed before the ASV, with the aim of preventing spoofing attacks from reaching the ASV system. VC audio files may be rejected by the CM if the audio contains detectable artifacts. Even if VC audio files are passed on to the ASV system, they may still be rejected if their speaker similarity is not close enough to the target speakers. Further, while our primary interest is the performance degradation due to falsely accepted VC spoofing attacks, the tandem system is also prone to two other types of errors. First, it may accept another similar-sounding human user (a non-target) as a target. Second, either the CM or the ASV may reject the actual target user. Using an overly sloppy CM or ASV system leads to compromised security, while conversely, an overly aggressive CM (or ASV) leads to reduced user convenience. To assess the joint performance of CM and ASV, we adopt a new metric called the (minimum) tandem detection cost function (t-DCF) [34]. It combines all the ASV and CM system errors into a single cost value for each submitted VC system.

Unlike the parameter-free metrics considered above, the t-DCF is a parameterized cost that makes the modeling assumptions of an envisioned operating environment (application) explicit. A desired security-convenience trade-off is specified through detection costs assigned to erroneous system decisions and prior probabilities assigned to the commonality of targets, non-targets, and spoofing attacks. The ASV threshold is set to the EER point on bona fide samples. Following the notations and parameter constraints in [34], we assign costs C_miss = 1 and C_fa = C_fa,spoof = 10, and priors π_spoof = 0.05, π_tar = (1 − π_spoof) × 0.99 ≈ 0.94, and π_non = (1 − π_spoof) × 0.01 ≈ 0.01. This is representative of a high-security user authentication application (e.g., access control) where target users and spoofing attacks are almost equally likely to occur, while nontarget users are rare. False acceptances (whether of nontargets or VC attacks) incur a ten-fold cost relative to false rejections. The higher the t-DCF value, the more detrimental the VC attack. The maximum value of 1.0 indicates an attack that renders the tandem system useless.

Table 7: Minimum t-DCF for each system of VCC 2020. Red cells indicate the top-5 systems for each task.

System | Task 1 | Task 2
T01 | 0.73542 | –
T02 | 0.85274 | 0.70888
T03 | 0.01467 | 0.01467
T04 | 0.88342 | –
T05 | – | 0.60904
T06 | 1.0000 | 0.72722
T07 | 0.87227 | 0.9033
T08 | 1.00000 | 1.00000
T09 | 0.25987 | 0.29213
T10 | 0.87126 | 0.91282
T11 | 0.87531 | 0.88646
T12 | 1.00000 | 0.84693
T13 | 0.88646 | 0.79685
T14 | 0.91708 | –
T15 | – | 0.8805
T16 | 0.87633 | 0.88818
T17 | 0.87734 | –
T18 | 0.70372 | 0.81145
T19 | 0.8743 | 0.90471
T20 | 0.85301 | 0.77249
T21 | 0.86755 | –
T22 | 0.86204 | 0.93512
T23 | 0.8297 | 0.9037
T24 | 0.76482 | 0.79092
T25 | 0.85402 | 0.85048
T26 | 0.71041 | 0.53263
T27 | 0.80151 | 0.84287
T28 | 0.91214 | 0.82598
T29 | 0.83375 | 0.87311
T30 | 0.04508 | 0.09695
T31 | 0.84069 | 0.70379
T32 | 0.80942 | 0.76208
T33 | 0.78095 | 0.83375

Table 7 shows the minimum t-DCF performance for each team of VCC 2020, with the top-5 systems showing the highest minimum t-DCF values highlighted for both tasks. Among these top-5 systems, team T08 reached the maximum possible cost (1.00) in both tasks. Table 2 reveals that T08 yielded a nearly zero ASV EER (0.50%); i.e., it did not succeed in fooling the ASV. At the same time, however, the corresponding CM EER was very high (37.97%), indicating difficulties in detecting this spoofing attack. This might be due to a lack of spoofing artifacts in T08, issues with CM generalization, or both.

The t-DCF was high in this case because the inaccurate CM falsely rejected many target utterances. Since the ASV would have rejected this unsuccessful attack with high probability, it might have been better to not use any CM system at all, certainly not a low-performing one. (The mathematical properties of the t-DCF [34] imply that if the spoof false acceptance rate (SFAR) of the ASV system is 0, the optimal countermeasure is no countermeasure. Interested readers are directed to [34, Eq. (10)], where the coefficient weighting the CM false acceptance rate is 0 whenever P_fa,spoof^asv = 0; a CM that minimizes Eq. (10) in this case must use a detection threshold τ_cm = −∞, i.e., 'accept everything' or 'no countermeasure'.) This general pattern (low ASV EER, relatively high CM EER) also held for T06, T12, and T14 in Task 1. For T28, however, the high t-DCF value was due to relatively high EERs for both the CM and the ASV.

In Task 2, apart from T08, we find a similar explanation for the high t-DCF values of the top-5 VC systems. Taking T10 as an example, the ASV EER was 45.55% and the CM EER was 49.81%; in other words, T10 fooled the ASV nearly perfectly. At the same time, the CM could do no better than random guessing in detecting this attack. Thus, T10 is a highly effective attack that is difficult to detect (by our CM). Careful examination of Table 3 reveals this pattern (high EERs on both ASV and CM) for T19, T22, and T23 as well.
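To make the cascade explicit, here is a simplified tandem-cost simulation under the costs and priors stated above: it applies CM-then-ASV decisions to hypothetical paired score arrays and sweeps the CM threshold. It is a stand-in illustrating the security-convenience trade-off, not a re-implementation of the exact minimum t-DCF of [34] (in particular, it omits the normalization).

```python
import numpy as np

C_MISS, C_FA, C_FA_SPOOF = 1.0, 10.0, 10.0
PI_SPOOF = 0.05
PI_TAR = (1 - PI_SPOOF) * 0.99   # targets are common (about 0.94)
PI_NON = (1 - PI_SPOOF) * 0.01   # nontargets are rare (about 0.01)

def tandem_cost(cm_tar, cm_non, cm_spf, asv_tar, asv_non, asv_spf, tau_asv):
    """Cost of the CM->ASV cascade, minimized over the CM threshold.
    A trial is accepted only if it passes BOTH the CM and the ASV;
    tau_cm = -inf corresponds to 'no countermeasure'."""
    best = np.inf
    candidates = np.concatenate([[-np.inf],
                                 np.sort(np.concatenate([cm_tar, cm_non, cm_spf]))])
    for tau_cm in candidates:
        # target missed if rejected by either stage
        p_miss = np.mean((cm_tar < tau_cm) | (asv_tar < tau_asv))
        # nontarget / spoof falsely accepted only if both stages accept
        p_fa_non = np.mean((cm_non >= tau_cm) & (asv_non >= tau_asv))
        p_fa_spf = np.mean((cm_spf >= tau_cm) & (asv_spf >= tau_asv))
        cost = (PI_TAR * C_MISS * p_miss
                + PI_NON * C_FA * p_fa_non
                + PI_SPOOF * C_FA_SPOOF * p_fa_spf)
        best = min(best, cost)
    return best

# Hypothetical paired scores; in the paper these come from the LCNN CM and
# the x-vector/PLDA ASV, with tau_asv fixed at the bona fide EER point.
rng = np.random.default_rng(0)
cost = tandem_cost(rng.normal(2, 1, 500), rng.normal(-2, 1, 500),
                   rng.normal(-1, 1, 500), rng.normal(2, 1, 500),
                   rng.normal(-2, 1, 500), rng.normal(1.5, 1, 500),
                   tau_asv=0.0)
print(f"tandem cost (unnormalized): {cost:.3f}")
```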
Next, we take a closer look at the top-5 systems to determine which VC approaches pose a potential spoofing threat in terms of the t-DCF. Table 8 shows the details of the VC approaches used in the top-5 systems for the spoofing assessment under the t-DCF measure; these details are taken from our VCC 2020 summary paper [1]. With this analysis, we try to highlight some of the VC models and vocoders that pose a potentially high spoofing threat. As expected, unseen VC methods that are not included in the training database (e.g., GAN variants (StarGAN, AdaGAN, Parallel WaveGAN) and VTLN) had the highest t-DCF values. These VC methods should be prioritized as attack methods to be added to countermeasure databases in the future.

Table 8: Details of the top-performing VC systems in terms of minimum t-DCF as a spoofing threat.

Task 1
Team ID | VC model | Vocoder
T06 | StarGAN | WORLD
T08 | VTLN + Spectral differential | WORLD
T12 | ADAGAN | AHOcoder
T14 | One-shot VC | NSF
T28 | Tacotron | WaveRNN

Task 2
Team ID | VC model | Vocoder
T08 | VTLN + Spectral differential | WORLD
T22 | ASR-TTS (Transformer) | Parallel WaveGAN
T10 | PPG-VC (LSTM) | WaveNet
T19 | VQVAE | Parallel WaveGAN
T23 | CycleVAE | WaveNet
Do the VC methods featured in VCC 2020 impose a spoofing risk? Yes, they certainly do. One useful reference is the ASV performance on natural human samples, more specifically, the EER on target-nontarget discrimination. These EERs were 0.50% and 0.80% for Tasks 1 and 2, respectively. Most VC systems increased these numbers, most of them substantially. This is not news to the ASV community as such. Our spoofing performance assessment through the t-DCF clearly highlights the importance of improving both ASV and CM technology. VCC 2020 not only featured a battery of new VC techniques but also facilitated an initial CM performance benchmark on a new type of multilingual data. Further research is needed to analyze the combined effect of VC methods and language, and substantial future work remains in improving the generalizability of CM techniques across diverse data conditions.
7. Conclusion
This work summarizes the predictions of subjective ratings and the spoofing assessments performed with objective measures at the latest VCC 2020. We considered five different objective assessments based on ASV, neural speaker embeddings, a spoofing countermeasure, predicted MOS, and ASR. The correlations of the objective assessments with the subjective ratings indicate that ASV, neural speaker embedding, and ASR had high correlations, which suggests the possibility of predicting the subjective ratings. Further, we found that the ASV and ASR results were more effective than the predicted MOS and the spoofing countermeasure for predicting the subjective test results using multiple linear regression models. This indicates a potential shift toward relying on objective assessments over tedious listening tests for large-scale evaluations in the future. We also performed a spoofing assessment on the submissions and identified the VC methods with a potentially high threat. However, this topic deserves future exploration, as the performance highly depends on the coverage of the various VC methods included in the training data.
8. Acknowledgements
The authors thank Ms. Jennifer Williams of the University of Edinburgh for kindly providing a MOSNet model fine-tuned using the ASVspoof 2019 data. This work was partially supported by JST CREST Grants (JPMJCR18A6, VoicePersonae project, and JPMJCR19A3, CoAugmentation project), Japan, MEXT KAKENHI Grants (16H06302, 17H04687, 17H06101, 18H04120, 18H04112, 18KT0051, 19K24373), Japan, the National Natural Science Foundation of China (Grant No. 61871358), Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain), and the Academy of Finland (project no. 309629).
9. References

[1] Y. Zhao, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Toda, T. Kinnunen, and Z. Ling, "Voice conversion challenge 2020: intra-lingual semiparallel and cross-lingual voice conversion," in ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. XXX–XXX.
[2] T. Yoshimura, G. Eje Henter, O. Watts, M. Wester, J. Yamagishi, and K. Tokuda, "A hierarchical predictor of synthetic speech naturalness using neural networks," in Interspeech 2016, 2016, pp. 342–346.
[3] B. Patton, Y. Agiomyrgiannakis, M. Terry, K. W. Wilson, R. A. Saurous, and D. Sculley, "AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech," in NIPS End-to-end Learning for Speech and Audio Processing Workshop, 2016.
[4] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, "Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM," in Interspeech 2018, 2018, pp. 1873–1877.
[5] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, "Non-intrusive speech quality assessment using neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, 2019, pp. 631–635.
[6] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, "MOSNet: Deep learning-based objective assessment for voice conversion," in Interspeech 2019, 2019, pp. 1541–1545.
[7] J. Williams, J. Rownicka, P. Oplustil, and S. King, "Comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis," in Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 222–229.
[8] Y. Choi, Y. Jung, and H. Kim, "Deep MOS predictor for synthetic speech using cluster-based modeling," arXiv preprint arXiv:2008.03710, 2020.
[9] Y. Choi, Y. Jung, and H. Kim, "Neural MOS prediction for synthesized speech using multi-task learning with spoofing detection and spoofing type classification," arXiv preprint arXiv:2007.08267, 2020.
[10] B. T. Meyer, B. Kollmeier, and J. Ooster, "Autonomous measurement of speech intelligibility utilizing automatic speech recognition," in Interspeech 2015, 2015, pp. 2982–2986.
[11] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[12] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," in Interspeech 2020, 2020.
[13] https://kaldi-asr.org/models/m7
[14] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," in Interspeech 2019, 2019, pp. 1033–1037.
[15] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016, 2016, pp. 4945–4949.
[16] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2001, vol. 2, 2001, pp. 749–752.
[17] J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement, part I: temporal alignment," Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366–384, 2013.
[18] L. Malfait, J. Berger, and M. Kastner, "P.563, the ITU-T standard for single-ended speech quality assessment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1924–1934, 2006.
[19] S. King, "Measuring a decade of progress in text-to-speech," Loquens, vol. 1, no. 1, p. 006, 2014.
[20] T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, "The voice conversion challenge 2016," in Interspeech 2016, 2016, pp. 1632–1636.
[21] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 195–202.
[22] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, L. Juvela, P. Alku, Y.-H. Peng, H.-T. Hwang, Y. Tsao, H.-M. Wang, S. L. Maguer, M. Becker, F. Henderson, R. Clark, Y. Zhang, Q. Wang, Y. Jia, K. Onuma, K. Mushika, T. Kaneda, Y. Jiang, L.-J. Liu, Y.-C. Wu, W.-C. Huang, T. Toda, K. Tanaka, H. Kameoka, I. Steiner, D. Matrouf, J.-F. Bonastre, A. Govender, S. Ronanki, J.-X. Zhang, and Z.-H. Ling, "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech and Language, vol. 64, p. 101114, 2020.
[23] H. Zheng, W. Cai, T. Zhou, S. Zhang, and M. Li, "Text-independent voice conversion using deep neural network based phonetic level features," in International Conference on Pattern Recognition (ICPR), Dec. 2016, pp. 2872–2877.
[24] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in IEEE International Conference on Multimedia and Expo (ICME) 2016, 2016, pp. 1–6.
[25] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, "Voice conversion using sequence-to-sequence learning of context posterior probabilities," in Interspeech 2017, 2017.
[26] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet vocoder with limited training data for voice conversion," in Interspeech 2018, 2018, pp. 1983–1987.
[27] J.-X. Zhang, Z.-H. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, "Sequence-to-sequence acoustic modeling for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644, 2019.
[28] J.-X. Zhang, Z.-H. Ling, Y. Jiang, L.-J. Liu, C. Liang, and L.-R. Dai, "Improving sequence-to-sequence acoustic modeling by adding text-supervision," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, 2019, pp. 6785–6789.
[29] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech 2017, 2017, pp. 999–1003.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding 2011, IEEE Signal Processing Society, Dec. 2011.
[31] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech and Language, vol. 60, p. 101027, 2020.
[32] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Interspeech 2015, 2015, pp. 2087–2091.
[33] Z. Wu, R. K. Das, J. Yang, and H. Li, "Light convolutional neural network with feature genuinization for detection of synthetic speech attacks," in Interspeech 2020, 2020.
[34] T. Kinnunen, H. Delgado, N. Evans, K. A. Lee, V. Vestman, A. Nautsch, M. Todisco, X. Wang, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2195–2210, 2020.

A. Results of Individual Metrics
Here, we show the rankings and comparisons of every team based on the various objective measures for both tasks discussed in Section 4.

Figure 1: Summary of ASV-based speaker similarity assessment for Task 1, plotting EER (tar-spoof), Pfa (tar), and Pmiss (src) per team; the natural-speech reference EER (tar-non) is 0.50%.
Figure 2: Summary of ASV-based speaker similarity assessment for Task 2; the natural-speech reference EER (tar-non) is 0.80%.
Figure 3: Summary of cosine distance-based neural speaker embedding similarity for Task 1.
Figure 4: Summary of cosine distance-based neural speaker embedding similarity for Task 2.
Figure 5: Summary of spoofing countermeasure EER (%) for Task 1.
Figure 6: Summary of spoofing countermeasure EER (%) for Task 2.
Figure 7: MOSNet predictions of all systems in Task 1 (vcc18 and asvspoof19 models).
Figure 8: MOSNet predictions of all systems in Task 2 (vcc18 and asvspoof19 models).
Figure 9: Summary of ASR WER (%) for Task 1.
Figure 10: Summary of ASR WER (%) for Task 2.

B. Objective Evaluation Results and Scatter Plots
The analysis presented here shows the correlation scatter plots of various objective measures against the subjective MOS and speakersimilarity for both English and Japanese listeners. The Pearson correlation coefficients along with the p -values corresponding toFigures 11 24 have been presented in Table 4. MOS EE R ( % ) (b) Task 1: Japanese Listeners T01T02 T03 T04T06 T07T08T09 T10T11T12 T13T14 T16T17T18 T19 T20T21 T22T23T24 T25T26 T27T28 T29T30T31 T32T331 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (a) Task 1: English Listeners T01T02T03 T04T06 T07T08T09 T10T11T12 T13T14 T16T17T18 T19 T20T21 T22T23T24 T25T26 T27T28 T29T30T31 T32T33 1 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (d) Task 2: Japanese Listeners T02 T03T05T06 T07T08T09 T10T11T12 T13T15T16T18 T19 T20T22 T23T24 T25T26 T27T28 T29T30T31 T32T331 1.5 2 2.5 3 3.5 4 4.5 5
MOS EE R ( % ) (c) Task 2: English Listeners T02T03 T05T06 T07T08T09 T10T11T12 T13T15T16T18 T19 T20T22 T23T24 T25T26 T27T28 T29T30T31 T32T33
Figure 11:
Scatter plots for ASV EER (%) with subjective MOS.
Figure 12: Scatter plots for ASV EER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: EER (%).]
Figure 13: Scatter plots for ASV Pfa (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: Pfa (%).]
Figure 14: Scatter plots for ASV Pfa (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: Pfa (%).]
Figure 15: Scatter plots for speaker embedding cosine distance with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: cosine distance; points labeled by team ID, plus the SOU (source speech) reference point.]
Figure 16: Scatter plots for speaker embedding cosine distance with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: cosine distance; includes the SOU reference point.]
Figure 17: Scatter plots for spoofing countermeasure EER (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: CM EER (%).]
Figure 18: Scatter plots for spoofing countermeasure EER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: CM EER (%).]
Figure 19: Scatter plots for MOSNet (vcc18) predictions with subjective MOS. [Same four-panel layout; x-axis: true MOS (1–5); y-axis: predicted MOS.]
Figure 20: Scatter plots for MOSNet (vcc18) predictions with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: predicted MOS.]
Figure 21: Scatter plots for MOSNet (asvspoof19) predictions with subjective MOS. [Same four-panel layout; x-axis: true MOS (1–5); y-axis: predicted MOS.]
Figure 22: Scatter plots for MOSNet (asvspoof19) predictions with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: predicted MOS.]
Figure 23: Scatter plots for ASR WER (%) with subjective MOS. [Same four-panel layout; x-axis: MOS (1–5); y-axis: WER (%).]
Figure 24: Scatter plots for ASR WER (%) with subjective speaker similarity. [Same four-panel layout; x-axis: similarity score (1–4); y-axis: WER (%).]

Objective Evaluation Results by Target Speaker Language
Here, we analyze the effect of the target speaker's language on the different objective measures for all systems in Task 2 of VCC 2020. The correlation of each objective measure with the subjective tests for the individual language pairs is presented in Table 5. In the figures, “Fin”, “Ger”, and “Man” stand for Finnish, German, and Mandarin, respectively.
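The per-language breakdown in Table 5 amounts to grouping the per-system scores by target speaker language before computing each correlation. A minimal sketch with hypothetical records (language, subjective MOS, objective measure):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-system records for Task 2: target language,
# subjective MOS, and an objective measure (e.g., CM EER in %).
records = [
    ("Fin", 3.2, 30.1), ("Fin", 2.5, 44.0), ("Fin", 3.8, 22.5), ("Fin", 2.9, 36.4),
    ("Ger", 3.0, 35.2), ("Ger", 2.2, 49.3), ("Ger", 3.5, 27.8), ("Ger", 2.7, 40.9),
    ("Man", 2.9, 38.6), ("Man", 2.0, 55.1), ("Man", 3.4, 30.0), ("Man", 2.6, 43.7),
]

# One correlation per target language, mirroring the Table 5 layout.
for lang in ("Fin", "Ger", "Man"):
    mos = np.array([m for l, m, _ in records if l == lang])
    obj = np.array([o for l, _, o in records if l == lang])
    r, p = pearsonr(mos, obj)
    print(f"{lang}: r = {r:.2f}, p = {p:.3g}")
```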
Figure 25: ASV performance in EER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: ASV EER (%); series: Fin, Ger, Man.]
Figure 26: ASV performance in Pfa (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: Pfa (%); series: Fin, Ger, Man.]
Figure 27: Neural speaker embedding cosine distance of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: cosine distance; series: Fin, Ger, Man.]
Figure 28: Spoofing countermeasure performance in EER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: CM EER (%); series: Fin, Ger, Man.]
Figure 29: MOSNet (vcc18) predictions of all systems for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: predicted MOS; series: Fin, Ger, Man.]
Figure 30: MOSNet (asvspoof19) predictions of all systems for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: predicted MOS; series: Fin, Ger, Man.]
Figure 31: ASR performance in WER (%) of the various teams for different target speaker languages in Task 2 of VCC 2020. [Bar chart per team; y-axis: WER (%); series: Fin, Ger, Man.]