ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech
Andreas Nautsch, Member, IEEE, Xin Wang, Member, IEEE, Nicholas Evans, Member, IEEE, Tomi Kinnunen, Member, IEEE, Ville Vestman, Massimiliano Todisco, Member, IEEE, Héctor Delgado, Md Sahidullah, Member, IEEE, Junichi Yamagishi, Senior Member, IEEE, and Kong Aik Lee, Senior Member, IEEE

This work was supported by a number of projects and funding sources: VoicePersonae, supported by the French Agence Nationale de la Recherche (ANR) and the Japan Science and Technology Agency (JST) with grant No. JPMJCR18A6; RESPECT, supported by the ANR; the NOTCH project (no. 309629), supported by the Academy of Finland; and Region Grand Est, France. (Corresponding author: Andreas Nautsch.)

A. Nautsch, N. Evans and M. Todisco are with EURECOM, Campus SophiaTech, 450 Route des Chappes, 06410 Biot, France. E-mail: {nautsch,evans,todisco}@eurecom.fr. X. Wang and J. Yamagishi are with the National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan. E-mail: {wangxin,jyamagis}@nii.ac.jp. T. Kinnunen and V. Vestman are with the University of Eastern Finland, Joensuu campus, Länsikatu 15, FI-80110 Joensuu, Finland. E-mail: {tkinnu,ville.vestman}@uef.fi. H. Delgado is with Nuance Communications, C/ Gran Vía 39, 28013 Madrid, Spain. E-mail: [email protected]. Md Sahidullah is with Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France. E-mail: [email protected]. K. A. Lee is with the Institute for Infocomm Research, A*STAR, 1 Fusionopolis Way, Singapore 138632. E-mail: lee_kong_[email protected].
Abstract—The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of bi-annual challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which outperform the two baseline systems, often by a substantial margin. Deeper analyses show that performance is dominated by specific conditions involving either specific spoofing attacks or specific acoustic environments. While fusion is shown to be particularly effective for the logical access scenario involving speech synthesis and voice conversion attacks, participants largely struggled to apply fusion successfully for the physical access scenario involving simulated replay attacks. This is likely the result of a lack of system complementarity, while oracle fusion experiments show clear potential to improve performance. Furthermore, while results for simulated data are promising, experiments with real replay data show a substantial gap, most likely due to the presence of additive noise in the latter. This finding, among others, leads to a number of ideas for further research and directions for future editions of the ASVspoof challenge.
Index Terms—Spoofing, countermeasures, presentation attack detection, speaker recognition, automatic speaker verification.
1 INTRODUCTION

It is well known that automatic speaker verification (ASV) systems are vulnerable to being manipulated by spoofing, also known as presentation attacks [1]. Spoofing attacks can enable a fraudster to gain illegitimate access to resources, services or devices protected by ASV technology. The threat from spoofing can be substantial and unacceptable. Following the first special session on anti-spoofing held in 2013 [2], the effort to develop spoofing countermeasures, auxiliary systems which aim to protect ASV technology by automatically detecting and deflecting spoofing attacks, has been spearheaded by the ASVspoof initiative.

ASVspoof 2019 [3], the most recent of three editions and the focus of this paper, was the first to include all three major forms of spoofing attack, involving speech synthesis, voice conversion and replay, in separate logical and physical access scenarios. It also brought several advances with respect to previous editions. First, ASVspoof 2019 aimed to explore whether advances in speech synthesis and voice conversion technologies pose a greater threat to ASV reliability; the latest of these techniques, e.g. neural network-based waveform modelling techniques, can produce synthetic and converted speech that is perceptually indistinguishable from bona fide speech. Second, the 2019 edition explored replay attacks using a far more controlled evaluation setup in the form of simulated replay attacks and carefully controlled acoustic conditions. Third, the database is substantially larger than the previous ASVspoof databases and considerably more diverse in terms of attack algorithms. With a comprehensive description of the database available in a published companion paper [4], only a brief description is provided in the current article. The focus here is instead upon challenge results and findings.

Whereas previous editions of ASVspoof utilised the equal error rate (EER) metric to judge performance, the 2019 edition shifted to the ASV-centric tandem detection cost function (t-DCF) metric [5], [6].
While the latter reflects the impact of both spoofing and countermeasures upon ASV performance, participation still calls only for the development and optimisation of countermeasures. Strong performance depends upon generalisation, namely countermeasures that perform well in the face of spoofing attacks not seen in training or development data.

Fig. 1. The ASVspoof 2019 challenge featured four different types of spoofed audio data. The LA scenario contains text-to-speech and voice conversion attacks. In the PA scenario attackers acquire a recording of the target speaker which is then replayed to the ASV system. Both simulated and real replay attacks are considered. The former refers to simulated acoustic environments/rooms with specified dimensions and controllable reverberation, whereas the latter contains actual replay recordings collected at three different sites. Real replay attacks were included in the test set (but excluded from challenge ranking). This paper describes challenge results for all four setups illustrated.

Two different baseline systems were provided for the 2019 edition. With data, protocols and metrics being different to those of previous editions, progress is judged in terms of performance relative to the two baseline systems, which provide some level of continuity or consistency with previous challenge editions. The article describes the top five single and fused systems for both challenge scenarios, provides insights into the most successful countermeasure (CM) techniques and assesses their impact upon ASV reliability. Finally, the article outlines priorities for the ASVspoof initiative looking to the future, including ideas for the next edition, ASVspoof 2021.
2 CHALLENGE OUTLINE
This section describes the logical and physical access ASVspoof 2019 challenge scenarios (see Fig. 1), the challenge rules, the new t-DCF metric, and the baseline CM and ASV systems. Since it is described elsewhere [4], the ASVspoof 2019 database is not described here. Briefly, it is sourced from the Voice Cloning Toolkit (VCTK) corpus [7], a multi-speaker, native-English speech database of read sentences recorded in a hemi-anechoic chamber.
Logical access (LA) control implies a scenario in which a remote user seeks access to a system or service protected by ASV. An example is a telephone banking service to which attackers may connect and then send synthetic or converted voice signals directly to the ASV system while bypassing the microphone, i.e. by injecting audio into the communication channel post sensor.

The LA subset of the ASVspoof 2019 database was created using a diverse array of 17 text-to-speech (TTS), voice conversion (VC) and hybrid systems. Waveform generation methods vary from waveform concatenation to neural network-based waveform modelling techniques including WaveNet [8]. Acoustic models also vary from Gaussian mixture models to advanced sequence-to-sequence neural networks. Some are constructed using popular open-source toolkits while others are selected based on their superior evaluation results reported in the Voice Conversion Challenge [9] or other literature. Six of these systems are designated as known spoofing algorithms/attacks, with the other 11 being designated as unknown spoofing attacks. Among the 6 known attacks there are 2 VC systems and 4 TTS systems. The 11 unknown attacks comprise 2 VC, 6 TTS and 3 hybrid TTS-VC systems for which VC systems are fed with synthetic speech. Known attacks are used to generate training and development data. Unknown attacks and two of the known attacks are used to generate evaluation data. Attacks are referred to by attack identifiers (AIDs): A1–A19. Full details of the LA setup, attack groups and analysis are provided in [4].
In the physical access (PA) scenario, spoofing attacks are presented to a fixed microphone which is placed in an environment in which sounds propagate and are reflected from obstacles such as floors and walls. Spoofing attacks in this scenario are referred to as replay attacks and match the ISO definition of presentation attacks [10]. The PA scenario is based upon simulated and carefully controlled acoustic and replay configurations [11], [12], [13]. The approach used to simulate room acoustics under varying source/receiver positions is inspired from the approach reported in [14] and based upon an image-source method [15].
TABLE 1
Submission categories for the ASVspoof 2019 challenge for both LA and PA scenarios and primary, single and contrastive submissions. Only results for single and primary systems are discussed in this paper.

                    LOGICAL ACCESS (LA) sub-challenge
  Submission       ASV scores    CM scores (Dev)    CM scores (Eval)
  Single system    —             Required           Required
  Primary          —             Required           Required
  Contrastive 1    —             Optional           Optional
  Contrastive 2    —             Optional           Optional

                    PHYSICAL ACCESS (PA) sub-challenge
  Submission       ASV scores    CM scores (Dev)    CM scores (Eval)
  Single system    —             Required           Required
  Primary          —             Required           Required
  Contrastive 1    —             Optional           Optional
  Contrastive 2    —             Optional           Optional

Acoustic simulations are performed using Roomsimove (http://homepages.loria.fr/evincent/software/Roomsimove_1.4.zip), while the replay device effects are simulated using the generalised polynomial Hammerstein model and the Synchronized Swept Sine tool (https://ant-novak.com/pages/sss/).

The ASV system is used within a noise-free acoustic environment defined by: the room size S; the reverberation time T60; the talker-to-ASV distance D_s (we refer to a talker instead of a speaker in order to avoid confusion with the loudspeaker). Each parameter is categorised into three different intervals. The room size S is categorised into: (a) small rooms of size 2-5 m²; (b) medium rooms of size 5-10 m²; (c) large rooms of size 10-20 m². The reverberation time T60 is categorised into: (a) low, 50-200 ms; (b) medium, 200-600 ms; (c) high, 600-1000 ms. The talker-to-ASV distance D_s is categorised into: (a) low, 10-50 cm; (b) medium, 50-100 cm; (c) large, 100-150 cm. This results in 27 acoustic configurations denoted by environment identifiers (EIDs) (aaa, aab, ..., ccc).

A replay spoofing attack is mounted through the making of a surreptitious recording of a bona fide access attempt and then the presentation of this recording to the ASV microphone. Attackers acquire recordings of bona fide access attempts when positioned at an attacker-to-talker distance D_a from the talker, whereas the presentation of the recording is made at the talker-to-ASV distance D_s using a playback device of quality Q. D_a is categorised into three different intervals: (A) low, 10-50 cm; (B) medium, 50-100 cm; (C) large, 100-150 cm. Q is categorised into three quality groups: (A) perfect quality, i.e. a Dirac impulse response; (B) high quality; (C) low quality. Their combination results in 9 attack configurations denoted by attack identifiers (AIDs) (AA, AB, ..., CC). Full details of the PA setup are also provided in [4].
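The EID/AID label scheme is compact enough to generate programmatically. The following minimal sketch (variable names are illustrative, not from the official protocols) enumerates the 27 environment and 9 attack labels:

```python
from itertools import product

# Environment: room size S, reverberation time T60, talker-to-ASV distance D_s,
# each categorised as (a) low/small, (b) medium, (c) high/large.
eids = ["".join(p) for p in product("abc", repeat=3)]   # 'aaa', 'aab', ..., 'ccc'

# Attack: attacker-to-talker distance D_a and replay device quality Q,
# each categorised as A, B or C.
aids = ["".join(p) for p in product("ABC", repeat=2)]   # 'AA', 'AB', ..., 'CC'

assert len(eids) == 27 and len(aids) == 9
```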
The submission categories for ASVspoof 2019 are illustrated in Table 1. Participants were permitted to submit up to 4 different score sets (or 8, counting sets for development and evaluation separately) for the LA scenario and an additional 4 for the PA scenario, including primary and single system scores. Score submissions were required for both the development and evaluation subsets defined in the ASVspoof 2019 protocols. Scores for corresponding development and evaluation subsets were required to be derived using identical CM systems without any adaptation. Ensemble classifiers consisting of multiple sub-systems whose output scores are combined were permitted for primary systems only. Single system scores were required to be one of the sub-systems in the ensemble (normally the single, best performing). While participants were permitted to submit scores for an additional two contrastive systems, only results for single and primary systems are presented in this paper.

ASV scores used for scoring and ranking were computed by the organisers using separate ASV protocols. The use of external data resources was forbidden: all systems designed by the participants were required to be trained and optimised using only the relevant ASVspoof 2019 data and protocols. The only exception to this rule is the use of data augmentation, but only then using ASVspoof 2019 training and development data with external, non-speech data, e.g. impulse responses. Use of LA data for PA experiments and vice versa was also forbidden.

Finally, CM scores produced for any one trial must be obtained using only the data in that trial segment. The use of data from any other trial segments was strictly prohibited. Therefore, the use of techniques such as normalization over multiple trial segments and the use of trial data for model adaptation was forbidden. Systems must therefore process trial lists segment-by-segment, independently and without access to past or future trial segments.
While the parameter-free equal error rate (EER) metric is retained as a secondary metric, the primary metric is the tandem detection cost function (t-DCF) [5], and specifically the ASV-constrained variant detailed in [6]. The detection threshold (set to the EER operating point) of the ASV system (designed by the organisers) is fixed, whereas the detection threshold of the CM system (designed by participants) is allowed to vary. Results are reported in the form of minimum normalized t-DCF values, defined as

\[
\text{min t-DCF} = \min_{\tau_{\mathrm{cm}}} \left\{ \frac{C_0 + C_1\, P_{\mathrm{miss}}^{\mathrm{cm}}(\tau_{\mathrm{cm}}) + C_2\, P_{\mathrm{fa}}^{\mathrm{cm}}(\tau_{\mathrm{cm}})}{\text{t-DCF}_{\mathrm{default}}} \right\}, \tag{1}
\]

where P_miss^cm(τ_cm) and P_fa^cm(τ_cm) are the miss and false alarm rates of the CM at threshold τ_cm. Coefficients C_0, C_1 and C_2 [6, Eq. (11)] depend not only on pre-defined target, nontarget and spoofing attack priors and detection costs, but also on the miss, false alarm and spoof false alarm rates (the ratio of spoofed trials accepted by the ASV to the total number of spoofed trials) of the ASV system.

The denominator t-DCF_default = C_0 + min{C_1, C_2} is the cost of an uninformative default CM that either accepts or rejects every test utterance. Its inclusion ensures that min t-DCF values are in the range between 0 and 1. A value of 0 means that both ASV and CM systems are error-free, whereas a value of 1 means that the CM cannot improve upon the default system. Another useful reference value in between these two extremes is the case of an error-free CM (but an imperfect ASV), given by C_0 / t-DCF_default. This lower bound is referred to as the ASV floor.
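To make Eq. (1) concrete, here is a minimal sketch of the minimum normalized t-DCF computation, assuming the coefficients C0, C1 and C2 have already been derived from the ASV error rates and priors as in [6, Eq. (11)], and that higher CM scores indicate bona fide speech (this is not the official scoring toolkit):

```python
import numpy as np

def min_tdcf(bona_cm, spoof_cm, C0, C1, C2):
    """Minimum (oracle-calibrated) normalized t-DCF, sweeping the CM threshold."""
    # Include -inf/+inf so the accept-all and reject-all operating points are covered.
    thresholds = np.concatenate(
        ([-np.inf], np.sort(np.concatenate([bona_cm, spoof_cm])), [np.inf]))
    p_miss = np.array([(bona_cm < t).mean() for t in thresholds])   # bona fide rejected
    p_fa = np.array([(spoof_cm >= t).mean() for t in thresholds])   # spoof accepted
    tdcf_default = C0 + min(C1, C2)      # cost of an uninformative default CM
    return ((C0 + C1 * p_miss + C2 * p_fa) / tdcf_default).min()

# e.g. min_tdcf(np.random.randn(1000) + 2.0, np.random.randn(1000), 0.05, 1.0, 2.0)
```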
The above formulation differs slightly from that in the ASVspoof 2019 evaluation plan. Differences include the absence of sub-system-level detection costs and the inclusion of the ASV floor. The numerical scale of the t-DCF values between the formulations differs, but the impact upon system rankings is negligible. The scoring toolkits have been updated to reflect these changes.

One last, relevant detail concerning the t-DCF is how performance across different attack conditions is aggregated. The straightforward way (used for the ranking of ASVspoof 2019 challenge entries as reported in [3]) is to report performance by pooling CM scores across all attack conditions. As an alternative, we further report the max min t-DCF across attack conditions in selected cases. Here, 'min' refers again to oracle CM calibration while 'max' refers to the highest per-condition t-DCF. The 'max min' t-DCF, therefore, serves as a reference point for worst-case attacks (see Section 5.2).

Two CM systems were provided to ASVspoof 2019 participants. Baseline B01 uses constant Q cepstral coefficients (CQCCs) [16], [17] and a bandwidth of 15 Hz to 8 kHz. The number of bins per octave is set to 96 and the re-sampling period is set to 16. Static features of 29 coefficients and the zeroth coefficient are augmented with delta and delta-delta coefficients, resulting in 90-dimensional features.

Baseline B02 uses linear frequency cepstral coefficients (LFCCs) [18] and a bandwidth of 30 Hz to 8 kHz. LFCCs are extracted using a 512-point discrete Fourier transform applied to windows of 20 ms with 50% overlap. Static features of 19 coefficients and the zeroth coefficient are augmented with delta and delta-delta coefficients, resulting in 60-dimensional features.

Both baselines use a Gaussian mixture model (GMM) back-end binary classifier. Randomly initialised, 512-component models are trained separately using an expectation-maximisation (EM) algorithm and bona fide and spoofed utterances from the ASVspoof 2019 training data. Scores are log-likelihood ratios given bona fide and spoofed models. A Matlab package including both baselines is available for download from the ASVspoof website.

The ASV system was used by the organisers to derive the ASV scores used in computing the t-DCF metric. It utilizes an x-vector [19] embedding extractor network that was pre-trained for the VoxCeleb recipe of the Kaldi toolkit [20]. Training was performed using the speech data collected from 7325 speakers contained within the entire VoxCeleb2 corpus [21] and the development portion of the VoxCeleb1 corpus [22]. The network extracts 512-dimensional x-vectors which are fed to a probabilistic linear discriminant analysis (PLDA) [23], [24] back-end (trained separately for LA and PA scenarios) for ASV scoring. PLDA back-ends were adapted to LA and PA scenarios using bona fide recordings of CM training data. ASV scores for the development set were provided to participants so that they could calculate t-DCF values and use these for CM optimisation. They were not provided for the evaluation set.
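The official baseline package is implemented in MATLAB; the following is a rough Python sketch of the B02-style recipe (linearly spaced filterbank cepstra plus GMM log-likelihood-ratio scoring). It uses scikit-learn's GMM in place of the organisers' EM implementation, static features only, and 64 rather than 512 components for brevity; it is a sketch of the technique, not a reimplementation of the released baseline:

```python
import numpy as np
from scipy.fft import dct
from sklearn.mixture import GaussianMixture

def lfcc(signal, sr=16000, n_filters=20, n_ceps=20, win=0.020, hop=0.010, nfft=512):
    """Static LFCCs: linearly spaced triangular filterbank followed by a DCT."""
    frame, step = int(win * sr), int(hop * sr)
    window = np.hamming(frame)
    n_frames = 1 + (len(signal) - frame) // step
    frames = np.stack([signal[i*step:i*step+frame] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2
    edges = np.linspace(30.0, sr / 2, n_filters + 2)          # 30 Hz to 8 kHz
    bins = np.floor((nfft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fbank[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return dct(np.log(spec @ fbank.T + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]

# Stand-ins for stacked training frames; in practice these come from the
# bona fide and spoofed partitions of the ASVspoof 2019 training set.
rng = np.random.default_rng(0)
X_bona, X_spoof = rng.standard_normal((4000, 20)), rng.standard_normal((4000, 20))

gmm_bona = GaussianMixture(64, covariance_type='diag', random_state=0).fit(X_bona)
gmm_spoof = GaussianMixture(64, covariance_type='diag', random_state=0).fit(X_spoof)

feats = lfcc(rng.standard_normal(32000))               # one 2 s test utterance at 16 kHz
llr = gmm_bona.score(feats) - gmm_spoof.score(feats)   # score() = mean log-likelihood
```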
3 LOGICAL ACCESS SCENARIO
This section describes submissions to ASVspoof 2019 for the LA scenario and results. Single system submissions are described first, followed by primary system submissions, presenting only the top-5 performing of the 48 LA system submissions in each case.

The architectures of the top-5 single systems are illustrated in Fig. 2 (grey blocks). Systems are labelled (left) by the anonymised team identifier (TID) [3]. A short description of each follows:
T45 [25]:
A light CNN (LCNN) which operates upon LFCC features extracted from the first 600 frames and with the same frontend configuration as the B02 baseline CM [18]. The LCNN uses an angular-margin-based softmax loss (A-softmax) [26], batch normalization [27] after max pooling, and normal Kaiming initialization [28].
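The defining building block of light CNNs is the max-feature-map (MFM) activation; the exact T45 topology is detailed in [25]. A minimal PyTorch sketch of the MFM operation alone:

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """Max-Feature-Map (MFM) activation: split the channel dimension in two
    and keep the elementwise maximum, halving the channel count."""
    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return torch.maximum(a, b)

x = torch.randn(8, 64, 100, 60)      # (batch, channels, frequency, time)
y = MaxFeatureMap()(x)               # -> torch.Size([8, 32, 100, 60])
```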
T24 [29]:
A ResNet classifier which operates upon linear filterbank (LFB) coefficients (no cepstral analysis). The system extracts embeddings using a modified ResNet-18 [30] architecture in which the kernel size of the input layer is 3 × 3 × 2. Global mean and standard deviation pooling [31] are applied after the last convolutional layer, and pooled features go through two fully-connected layers with batch normalization. The output is length-normalized and classified using a single-layer neural network.
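Global mean and standard deviation pooling collapses a variable-length frame sequence into a fixed-size vector; it also appears in the statistics-pooling systems of Section 4. A minimal PyTorch sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Global mean and standard deviation pooling over the time axis:
    (batch, channels, time) -> (batch, 2 * channels)."""
    def forward(self, x):
        return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)

feats = torch.randn(8, 256, 300)   # e.g. last conv-layer output, flattened over frequency
emb = StatsPooling()(feats)        # -> torch.Size([8, 512])
```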
T39:
A CNN classifier with Mel-spectrogram features. The CNN uses multiple blocks of 1D convolution with ReLU activation, dropout, and subsampling with strided convolution. The CNN output is used as the score without binary classification.
T01:
A GMM-UBM classifier with 60-dimensional STFT cepstral coefficients, including static, delta, and delta-delta components. The GMM-UBM uses the same configuration as the two baseline CMs.
T04:
Cepstral features with a GMM-UBM classifier. Due to hardware limitations, models are trained using the full set of bona fide utterances but a random selection of only 9,420 spoofed utterances.
The architectures of the top-5 primary systems are also illustrated in Fig. 2. A short description of each follows:
T05:
A fusion of seven sub-systems, six of which derive spectral representations using the DFT, while the seventh uses the discrete cosine transform (DCT). Features are extracted using frame lengths of 256, 160, or 128 samples,
frame overlaps of 100, 60, or 50 samples, and 256- or 160-point DFTs. The input features are sliced into 2D matrices with 256, 160, or 128 columns (frames). All sub-systems are based upon different neural network architectures: four upon MobileNetV2 [32]; two upon ResNet-50 [30]; one upon DenseNet-121 [33].

Fig. 2. Illustration of top single (grey blocks) and primary system submissions for the LA scenario. ☆ and ★ denote top-5 single and top-5 primary systems, respectively.

T45 [25]:
A fusion of five sub-systems, including the LFCC-GMM baseline system (B02) and the T45 single system. The remaining three sub-systems use LCNNs, each of which uses different features: LFCCs with CMVN; a log power spectrogram derived from the CQT; a log power spectrogram derived from the DFT. All LCNN-based sub-systems use features extracted from the first 600 frames of each file. Sub-system scores are normalized according to the standard deviation of bona fide scores from the same sub-system before being fused using equal weights.
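T45's fusion rule is simple enough to sketch directly. The following assumes per-sub-system evaluation scores and bona fide development scores are available as arrays; any additional details (e.g. mean normalisation) are omitted:

```python
import numpy as np

def fuse_equal_weight(eval_scores, dev_bona_scores):
    """Scale each sub-system by the std of its bona fide dev scores,
    then average with equal weights (the rule described for T45)."""
    fused = np.zeros_like(eval_scores[0], dtype=float)
    for scores, bona in zip(eval_scores, dev_bona_scores):
        fused += scores / np.std(bona)
    return fused / len(eval_scores)

# toy usage: five sub-systems, 1000 evaluation trials each
rng = np.random.default_rng(0)
fused = fuse_equal_weight([rng.standard_normal(1000) for _ in range(5)],
                          [rng.standard_normal(200) + 2.0 for _ in range(5)])
```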
T60 [34]:
A fusion of seven sub-systems. Two sub-systems are based upon 128-component GMMs trained with either MFCC or inverted MFCC (IMFCC) [35] features appended with delta and double-delta coefficients. The third sub-system is based upon the concatenation of 100-dimension i-vectors extracted from MFCC, IMFCC, CQCC [16] and sub-band centroid magnitude coefficient (SCMC) [36] features, and a support vector machine (SVM) classifier. The fourth sub-system is a CNN classifier operating on mean-variance normalized log DFT grams. The remaining three sub-systems are based upon either mean-variance normalized Mel-scaled spectrograms or raw audio, and either convolutional recurrent neural network (CRNN), Wave-U-Net [37] or raw audio CNN classifiers. The NN-based sub-systems process a fixed number of feature frames or audio samples for each trial. Data for two attack conditions were excluded from training and used instead for validation and to stop learning. Scores are combined according to logistic regression fusion.

TABLE 2
A comparison of top-5 (a) single and (b) primary systems for the LA scenario. Single systems B01 and B02 are the two baselines, whereas Perfect refers to the perfect CM (ASV floor for min t-DCF and EER of 0%). Systems are labelled by participating team identifiers (TIDs). Results are presented in terms of the minimum t-DCF (see Section 2.4) and EER metrics. Also illustrated are max min t-DCF results and the corresponding attack identifier (AID) for each system (see Section 5).

(a) Single systems
  TID      min t-DCF   EER [%]   Max min t-DCF (AID)
  T45      0.1562      5.06      0.9905 (A17)
  T24      0.1655      4.04      0.8499 (A17)
  T39      0.1894      7.01      1.000 (A17)
  T01      0.1937      5.97      0.7667 (A17)
  T04      0.1939      5.74      0.7837 (A17)
  B01      0.2839      9.57      0.9901 (A17)
  B02      0.2605      8.09      0.6571 (A17)
  Perfect  0.0627      0.0       0.4218 (A17)

(b) Primary systems
  TID      min t-DCF   EER [%]   Max min t-DCF (AID)
  T05      0.0692      0.22      0.4418 (A17)
  T45      0.1104      1.86      0.7778 (A17)
  T60      0.1331      2.64      0.8803 (A17)
  T24      0.1518      3.45      0.8546 (A17)
  T50      0.1671      3.56      0.8471 (A17)
T24:
A fusion of two sub-systems: the T24 single system, and a second sub-system using the same ResNet classifier but with CQCC-based features. Scores are derived using single-layer neural networks before fusion (details unspecified).
T50 [38]:
A fusion of six sub-systems, all based on log-CQT gram features. Features are concatenated with a phase gram or a compressed log-CQT gram obtained from a variational autoencoder (VAE) trained on bona fide recordings only. Three of the six classifiers use ResNet-18 [30]; for one of these, a standard i-vector is concatenated with the embedding layer of the network to improve generalizability. Two other classifiers use CGCNNs [39]. The last classifier (CGCRNN) incorporates bidirectional gated recurrent units [40]. Scores are combined by equal-weight averaging.
A summary of results for the top-5 single and primary submissions is presented in Tables 2(a) and 2(b) respectively. Results for the two baseline systems appear in the penultimate two rows of Table 2(a), whereas the last row shows performance for a perfect CM, i.e. the ASV floor.

In terms of the t-DCF, all of the top-5 single systems outperform both baselines by a substantial margin, with the best T45 system outperforming the B02 baseline by 40% relative. Both systems use LFCC features, whereas T45 uses an LCNN instead of a GMM-UBM classifier. Even so, the T01 and T04 systems, both also based upon standard cepstral features and GMM-UBM classifiers, are only slightly behind the better performing, though more complex, systems.

Four of the top-5 primary systems perform even better, with the best T05 primary system outperforming the best T45 single system by 56% relative (73% relative to B02). The lowest min t-DCF of 0.0692 (T05 primary) is also only marginally above the ASV floor of 0.0627, showing that the best performing CM gives an expected detection cost that is close to that of a perfect CM. The top-3 primary systems all combine at least 5 sub-systems. All use diverse features, including both cepstral and spectral representations, with at least one DNN-type classifier. Of note also are differences in performance between the same teams' primary and single systems. Whereas the T05 primary system is first placed, the corresponding single system does not feature among the top-5 single systems, implying a substantial improvement through system combination. The first-placed T45 single system, however, is the second-placed primary system and, here, the improvement from combining systems is more modest. The same is observed for the T24 primary and single systems.
4 PHYSICAL ACCESS SCENARIO
This section describes submissions to ASVspoof 2019 for the PA scenario and results. It is organised in the same way as for the logical access scenario in Section 3.
The architectures of the top-5 single systems are illustrated in Fig. 3 (grey blocks), in which systems are again labelled with the corresponding TID. A short description of each follows:
T28 [41]:
Spectral features based upon the concatenation of Mel-grams and CQT-grams and a ResNeWt18 classifier based upon a modified ResNet18 architecture [42] in which the second 3x3 convolution layer is split into 32 groups. A dropout layer is used after pooling and the fully connected layer has a binary output.
T10 [43]:
Group delay (GD) grams [44] with cepstral mean and variance normalisation (CMVN) and data augmentation via speed perturbation. The classifier, referred to as ResNetGAP, is based upon a ResNet34 architecture [42] with a global average pooling (GAP) layer that transforms local features into 128-dimensional utterance-level representations, which are then fed to a fully connected layer with softmax-based cross-entropy loss.
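A group delay gram stacks per-frame group delay functions over time. A minimal sketch of the plain (unmodified) group delay computation via the standard identity is given below; T10's exact front-end, including CMVN, is detailed in [43]:

```python
import numpy as np

def group_delay_gram(signal, frame=512, hop=160):
    """Group delay spectrogram via the identity
    tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, with Y the DFT of n*x[n]."""
    window = np.hamming(frame)
    ramp = np.arange(frame)
    cols = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * window
        X, Y = np.fft.rfft(x), np.fft.rfft(ramp * x)
        cols.append((X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10))
    return np.stack(cols, axis=1)          # (frequency bins, frames)

gram = group_delay_gram(np.random.randn(16000))   # 1 s of noise at 16 kHz
```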
Fig. 3. Illustration of top single (grey blocks) and primary system submissions for the PA scenario. ☆ and ★ denote top-5 single and top-5 primary systems, respectively.

T45 [25]:
Log power CQT-grams with an LCNN classifier which uses Kaiming initialization, additional batch normalizations, and an angular softmax loss. In identical fashion to the T45 LA system, the PA system operates only upon the first 600 frames of each utterance and fuses scores by averaging.
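A log power CQT-gram truncated to a fixed frame count can be sketched with librosa; the settings below (hop length, bin count) are illustrative assumptions, not T45's exact configuration:

```python
import numpy as np
import librosa

def log_power_cqt_gram(y, sr=16000, n_frames=600):
    """Log power CQT-gram, repeat-padded or truncated to a fixed frame count."""
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=128, n_bins=84, bins_per_octave=12))
    log_c = np.log(C ** 2 + 1e-10)
    if log_c.shape[1] < n_frames:          # repeat-pad short utterances
        log_c = np.tile(log_c, (1, int(np.ceil(n_frames / log_c.shape[1]))))
    return log_c[:, :n_frames]             # keep only the first n_frames frames

y = librosa.tone(440, sr=16000, duration=2.0)   # stand-in for a trial utterance
gram = log_power_cqt_gram(y)                    # -> shape (84, 600)
```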
T44 [45]:
Log-DFT grams with a unified feature map and a squeeze-and-excitation network (SEnet34) [46] with a ResNet34 backbone, in which each block aggregates channel-wise statistics (squeeze operation) to capture channel-wise dependencies for an adaptive feature recalibration (excitation operation), with a binary training objective.
T53:
Fixed-length log Mel grams (2048 frequency bins) extracted from concatenated or truncated utterances and a variational Bayesian neural network (VBNN) using flipout [47] to decorrelate gradients within mini-batches. Bona fide data is oversampled to balance the bona fide and spoofed data [48].
The architectures of the top-5 primary systems are also illustrated in Fig. 3. A short description of each follows:
T28 [41]:
A fusion of three sub-systems, all ResNet variants referred to as ResNeWt18. The first sub-system is the T28 single system and operates upon concatenated Mel and CQT grams. The second operates upon a CQT modified group delay (CQTMGD) gram, whereas the third operates directly upon the MGD gram (no CQT). Scores are combined by equal-weight averaging.

TABLE 3
As for Table 2, except for the PA scenario. In contrast to the LA scenario, the worst case PA scenario is denoted by both the attack identifier (AID) and the environment identifier (EID). Also illustrated here are min t-DCF results for hidden, real replay data (see Section 5.2); min t-DCF results for real replay data are computed using the C_0, C_1 and C_2 terms derived for simulated replay data.

(a) Single systems
                    Performance                                Hidden track
  TID      min t-DCF   EER (%)   Max min t-DCF (AID/EID)    min t-DCF   EER (%)
  T28      0.1470      0.52      0.2838 (AA/acc)            0.5039      19.68
  T10      0.1598      1.08      0.3768 (AA/caa)            0.8826      37.04
  T45      0.1610      1.23      0.2809 (AA/acc)            0.7139      25.03
  T44      0.1666      1.29      0.2781 (AA,AC/acc)         0.7134      41.11
  T53      0.1729      1.66      0.2852 (BA/acc)            0.6379      32.64
  B01      0.3476      11.04     1.0 (BA/baa,caa,cac)       0.3855      12.73
  B02      0.3481      13.54     1.0 (BA/caa)               0.6681      29.44
  Perfect  0.1354      0.0       0.2781 (AA/acc)            -           -

(b) Primary systems
                    Performance                                Hidden track
  TID      min t-DCF   EER (%)   Max min t-DCF (AID/EID)    min t-DCF   EER (%)
  T28      0.1437      0.39      0.2781 (AA,AC/acc)         0.7160      30.74
  T45      0.1460      0.54      0.2803 (AA/acc)            0.6136      20.02
  T44      0.1494      0.59      0.2781 (AA,AC/acc)         0.6798      33.66
  T10      0.1500      0.66      0.2781 (AA,AC/acc)         0.7987      32.04
  T24      0.1540      0.77      0.2781 (AA,AC/acc)         0.9236      31.67

T45 [25]:
A fusion of three sub-systems with different frontends and a common LCNN backend. The first sub-system is the T45 single system operating on CQT grams, while the other two use either LFCC or DCT grams.
T44 [45]:
A fusion of five sub-systems with either log-DFT gram or CQCC frontends and either squeeze-and-excitation network (SEnet) or ResNet based backends. One sub-system is the T44 single system. Two are mean and standard deviation ResNets (Mean-std ResNets) for which the input feature sequences are transformed into a single feature vector through statistics pooling. The other classifiers receive fixed-size 2D feature matrices, referred to as unified feature maps [49]. All are either binary classifiers (i.e. bona fide vs. spoof) or multi-class classifiers trained to predict the type of spoofing attack. Scores are combined via logistic regression fusion.
T10 [43]:
A fusion of six sub-systems, all ResNet-based architectures with global average pooling (GAP) for utterance-level aggregation. Two sub-systems, including the T10 single system, use data augmentation in the form of speed perturbation [50] applied to the raw signal. Front-ends include group-delay (GD) grams, DFT grams, LFCCs and IMFCCs. Networks are configured as binary classifiers and trained with cross-entropy loss. Scores coming from the bona fide unit of each sub-system are fused using equal-weight score averaging.
T24:
A fusion of five sub-systems using either LFB coefficient, CQCC or GD-gram frontends and either CNN or ResNet backends. Embeddings produced by the ResNet system are length-normalised and classified using a weighted, two-class SVM. Three of the CNN systems and the ResNet system are configured with 10 classes (the combination of 9 AIDs and the bona fide class), whereas the other CNN system has 270 output classes (the full combination of all EIDs, AIDs and the bona fide class). All use statistics pooling to obtain utterance-level representations from frame-level representations. Utterance-level embeddings are computed from the second-to-last fully connected layer in a similar manner to x-vector extraction. Except for the first sub-system, embeddings are processed with a single-layer neural network. Scores are combined with logistic regression.
A summary of results for the top-5 single and primary submissions is presented in Tables 3(a) and 3(b) respectively, with those for the two baseline systems and the ASV floor appearing in the last three rows of Table 3(a).

Just as is the case for the LA scenario, for the PA scenario all of the top-5 single systems outperform both baselines, again by a substantial margin. In terms of the t-DCF, the best T28 system outperforms baseline B01 by 58% relative. In contrast to the LA scenario, however, all of the top-5 systems use spectral features rather than cepstral features, and all also use DNN-type classifiers. Of note also is the small gap in performance between the top-5 systems, and the use of data augmentation by only one of the top-5 systems, and not the top system. The latter is, however, the only single system that uses concatenated Mel-gram and CQT-gram features.

In contrast to single systems, primary systems utilise both spectral and cepstral features, but again with exclusively DNN-type classifiers. It seems, though, that system combination is less beneficial than for the LA scenario; primary system results for the PA scenario are not substantially better than those for single systems. Perhaps unsurprisingly, then, teams with the best single systems are generally those with the best primary systems. The best T28 primary system outperforms the best single system, also from T28, by only 2% relative (59% relative to B01). Lastly, the lowest min t-DCF of 0.1437 (T28 primary) is only marginally above the ASV floor of 0.1354, showing once again that the best performing CM gives an expected detection cost that is close to that of a perfect CM.
5 ANALYSIS
This section provides a more in-depth analysis of the results presented in Sections 3 and 4. We report an analysis of generalisation performance which shows that results can be dominated by detection performance for some so-called worst-case spoofing attacks. Further analysis shows potential to improve upon fusion strategies through the use of more complementary sub-systems.
Since its inception, ASVspoof has prioritised strategies to promote the design of generalised CMs that perform reliably in the face of spoofing attacks not seen in training data. For ASVspoof 2019, the LA evaluation set features TTS and VC spoofing attacks generated with algorithms for which some component (e.g. the acoustic model or the waveform generator) is different to those used in generating spoofing attacks in the training and development sets. The situation is different for the PA scenario. While the full set of attack identifier (AID) and environment identifier (EID) categories (see the last two paragraphs of Section 2.2) are seen in all three data sets, the specific AIDs and EIDs in each are different (while the categories are the same, no specific attack or room configuration appears in more than one set).

Fig. 4. Illustrations of generalisation performance for the top-5 primary systems for the LA scenario (a) and the PA scenario (b), estimated using evaluation set data. For the LA scenario, box plots illustrate performance decomposed across known, varied and unknown attacks. For the PA scenario, they illustrate performance decomposed across low, medium and high T60 reverberation categories. For all plots, the green profiles signify corresponding ASV floors (performance for a perfect CM). The two right-most box plots in each case indicate performance for varied attacks without the worst case AID (LA) and high T60 reverberation without the worst case AID/EID (PA), and then for the worst case scenarios on their own (see Section 5.2).

A view of generalisation performance for the top-5 LA and PA primary system submissions is illustrated in Figures 4(a) and 4(b) respectively. For the LA scenario, the three left-most box plots depict performance in terms of the min t-DCF for: known attacks (attacks that are identical to those seen in training and development data); varied attacks (attacks for which either the acoustic model or waveform generator is identical to those of attacks in the training and development data); wholly unknown attacks (attacks for which both components are unseen). Interestingly, while performance for unknown attacks is not dissimilar to, or even better than, that for known attacks, there is substantial variability in performance for varied attacks. This observation is somewhat surprising since, while systems appear to generalise well to unknown attacks, they can fail to detect others that are generated with only variations to known attack algorithms. This can mean either that the unknown attack algorithms produce artefacts that are not dissimilar to those produced with known attacks, or that there is some peculiarity to the varied attacks. The latter implies that knowledge of even some aspects of an attack is of little use in terms of CM design; CMs are over-fitting and there is potential for them to be overcome with perhaps even only slight adjustments to an attack algorithm. Reassuringly, however, as already seen from the results in Table 2 and by the green profiles at the base of each box plot in Fig. 4(a) which illustrate the ASV floor, some systems produce min t-DCFs close to that of a perfect CM.

A similar decomposition of results for the PA scenario is illustrated in Fig. 4(b), where the three left-most box plots show performance for low, medium and high T60 reverberation categories, the component of the EID which was observed to have the greatest influence on performance. In each case results are pooled across the other AID and EID components, namely the room size S and the talker-to-ASV distance D_s. Results show that, as the level of reverberation increases, the min t-DCF increases. However, comparison of each box plot to the corresponding ASV floor shows that the degradation is not caused by the CM, the performance of which improves with increasing reverberation; replay attacks propagate twice in the same environment and hence reverberation serves as a cue for replay detection. The degradation is instead caused by the performance of the ASV system; the gap between the min t-DCF and the ASV floor decreases with increasing T60 and, for the highest level of reverberation, the min t-DCF is close to the ASV floor. This observation also shows that the effect of high reverberation dominates the influence of the room size and the talker-to-ASV distance.

From the above, it is evident that min t-DCF results are dominated by performance for some worst case attack algorithms (LA) or some worst case environmental influence (PA). Since an adversary could exploit knowledge of such worst case conditions in order to improve their chances of manipulating an ASV system, it is of interest to examine not just the pooled min t-DCF, but also performance in such worst case scenarios.

The worst case or maximum of the minimum (max min) t-DCFs (see Section 2.4) for the top-5 single and primary systems, in addition to the baseline systems, are shown in Tables 2 and 3. For the LA scenario, the worst case attack identifier (AID) is A17, a VC system that combines a VAE acoustic model with direct waveform modification [51]. While the best individual result for A17 is obtained by the best performing primary system, the max min t-DCF is over 6 times higher than the min t-DCF. The lowest max min t-DCF for single systems is that of baseline B02. While this result (0.6571) is not substantially worse than the lowest max min t-DCF for primary systems (0.4418), it suggests that the fusion of different CMs may help to reduce the threat in the worst case scenario. The two right-most box plots in Fig. 4(a) show performance for varied attacks without attack A17, and then for attack A17 on its own, both for the top-5 performing primary LA systems.
A17 is a varied attack and is the single attack that accounts in large part for the differences between the box plots for varied and known/unknown attacks described in Section 5.1.

Fig. 5. min t-DCF results for oracle fusion performed with evaluation set scores for the top-20 performing systems for the LA scenario (left) and PA scenario (right). System T08, which returned problematic score distributions, was excluded in the computation of results for the LA scenario.

Performance for the PA scenario is influenced by both the attack (AID) and the environment (EID). Excluding baselines, all but two systems struggle most for acc
EIDs, i.e. small rooms (a), high T60 reverberation times (c) and large talker-to-ASV distances D_s (c), and either the AA or AC AIDs, where recordings are captured in close proximity to the talker. The two right-most box plots in Fig. 4(b) show performance for high T60 reverberation time without the acc EID, and then for the acc EID on its own, both for the top-5 performing primary PA systems. Worst case max min t-DCFs are substantially higher than pooled min t-DCFs. Even so, it appears that the greatest influence upon tandem performance in the case of PA is within the system designer's control; the environment in which CM and ASV systems are installed should have low reverberation. Individual system results shown in Table 3 show that a single, rather than a primary, system gives almost the best pooled min t-DCF (0.1470 cf. 0.1437). This observation suggests that fusion techniques were not especially successful for the PA scenario. We expand on this finding next.
From the treatment of results presented in Sections 3.3 and 4.3, we have seen already that fusion seems more beneficial for the LA scenario than for the PA scenario; the best performing single and primary LA systems give min t-DCFs of 0.1562 and 0.0692 respectively, whereas the best performing single and primary PA systems give similar min t-DCFs of 0.1470 and 0.1437 respectively.

For the LA scenario, we note that the best performing T45 single system still outperforms the fifth-placed T50 primary system. The architectures of the top-4 primary systems might then suggest that the benefit from fusion requires substantial investment in front-end feature engineering in addition to the careful selection and optimisation of the classifier ensemble. By way of example, the top-ranked T05 primary LA system uses DFT grams with different time-frequency resolutions, and three different classifiers in the shape of MobileNet, DenseNet, and a large ResNet-50. In addition, two out of the seven sub-systems take into consideration the ratio of bona fide and spoofed samples observed in training.

Even if different front-end combinations are among the top-performing primary PA systems, we do not see the same diversity in the classifier ensembles. We hence sought to investigate whether this lack of diversity could explain why fusion appears to have been less beneficial for the PA scenario. Using logistic regression [52], we conducted oracle fusion experiments for the LA and PA evaluation datasets using the scores generated by the top-20 primary and single systems. In each case the number of systems in the fusion was varied between 2 and 20.

Results are illustrated in Fig. 5. Also illustrated in each plot is the min t-DCF for the best performing primary and single systems, in addition to the ASV floor, i.e. a perfect CM that makes no errors such that the only remaining errors are made by the ASV system. There are stark differences between the two challenge scenarios. For the LA scenario, the best-performing T05 primary system (left-most, blue point) obtains nearly perfect results (performance equivalent to the ASV floor) and the fusion of multiple primary systems improves only marginally upon performance. The fusion of multiple single systems (black profile) leads to considerably better performance, though even the fusion of 20 single systems fails to improve upon the best T05 primary system.

As we have seen already, there is little difference between primary and single systems for the PA scenario. In addition, the performance of the best individual primary and single systems is far from that of the ASV floor; there is substantial room for improvement. Furthermore, the fusion of both primary and single systems gives substantially improved performance, to within striking distance of the ASV floor. Fusion of only the best two single systems results in performance that is superior to the best T28 primary system. There was significant scope for participants to improve performance for the PA condition using the fusion of even only a small number of diverse sub-systems; it seems that those used by participants lack complementarity.
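The oracle fusion experiments use logistic regression over sub-system scores. A minimal scikit-learn sketch follows; function and variable names are illustrative, and 'oracle' here means the fusion is trained on the same evaluation scores it is applied to, as in the experiments above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(scores, labels):
    """Fit logistic-regression fusion weights over per-system CM scores.
    scores: (n_trials, n_systems); labels: 1 = bona fide, 0 = spoof."""
    lr = LogisticRegression().fit(scores, labels)
    return scores @ lr.coef_.ravel() + lr.intercept_[0]   # fused log-odds scores

# toy usage with five weakly discriminative sub-systems
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 2000)
scores = rng.standard_normal((2000, 5)) + 1.5 * labels[:, None]
fused = fuse_scores(scores, labels)
```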
Fig. 6 shows box plots of performance for the top-10 systems for the three challenge editions: ASVspoof 2015 (LA), ASVspoof 2017 (PA), and ASVspoof 2019 (LA+PA). Comparisons should be made with caution; each database has different partitioning schemes or protocols and was created with different spoofing attacks. Furthermore, while the ASVspoof 2017 database was created from the re-recording of a source database, the ASVspoof 2019 PA database was created using simulation and, while systems developed for the 2015 and 2017 editions were optimised for the EER metric, those developed for 2019 may have been optimised for the new t-DCF metric. Accordingly, Fig. 6 shows results in terms of both EER and min t-DCF.

Fig. 6. An illustration of performance for top-10 submissions to the three ASVspoof challenge editions: 2015, 2017 and 2019. Results are shown in terms of both min t-DCF and EER.

Results for the 2015 and 2019 LA databases show that progress in anti-spoofing has kept apace with progress in TTS and VC research, including neural network-based waveform modelling techniques such as WaveNet [8];
EERs and min t-DCFs are similar, despite the use of state-of-the-art neural acoustic and waveform models to generate spoofing attacks in the ASVspoof 2019 database. Results for the 2017 and 2019 PA databases seemingly show significant progress, with both EERs and min t-DCFs dropping by substantial margins, though improvements are likely caused by database differences. The 2017 database contains both additive background noise and convolutional channel noise, artefacts stemming from the source database rather than being caused by replay spoofing, whereas the 2019 database contains neither. EER results for the ASVspoof 2019 database are substantially lower than those for any of the other three databases, indicating that results reflect acoustic environment effects upon the ASV system, rather than upon CM systems. While this finding is encouraging, differences between 2017 and 2019 PA results show that additive noise might have a considerable impact on performance. These issues are expanded upon next.
6 RESULTS FOR REAL REPLAY RECORDINGS
Results for simulated replay data were compared to results for real replay data that were concealed within the PA database. This data, results for which were excluded from challenge scoring and ranking, is not described in [4]. Accordingly, a brief description is provided here. Real replay data was recorded in three different rooms, with two different talker-to-ASV distance categories D_s, in conditions equivalent to six different EID categories: a small meeting room (equivalent EIDs of aaa and aac); a large office (equivalent EIDs of bba and bbc); a small/medium office (equivalent EIDs of cca and ccc). Recordings were captured using high or low quality capture devices, whereas replay data were recorded using various acquisition devices before presentation to the ASV microphone using various presentation devices. Both recording and presentation devices were of quality equivalent to either B or C categories. Data were collected from 26 speakers, each of whom provided 15 utterances selected at random from the same set of phonetically-balanced TIMIT phrases as the VCTK source data. This setup gave 540 bona fide utterances and 2160 replay utterances.

In contrast to simulated data, real replay data contains additive, ambient noise. Differences between simulation and the collection of real replay data also imply that consistent trends between results for the two datasets cannot be expected. The objective of this analysis is to expose consistencies or discrepancies between results derived from simulated and real data in terms of the t-DCF, or to determine whether the use of simulated data leads to the design of CMs that perform well when tested with real data. The two right-most columns of Table 3 show min t-DCF and EER results for the baselines and the top-5 single and primary systems. In general, there are substantial differences. Among the top-5 systems considered, the best t-DCF result for real data, 0.3855, is obtained by the B01 baseline. This observation suggests that CMs are over-fitting to simulated data, or that CMs lack robustness to background noise. This possibility seems likely; we observed greater consistency between results for simulated replay data and real replay data recorded in quieter rooms. One other explanation lies in the relation between additive noise and the ASV floor. Results for synthetic data are dominated by the ASV floor, whereas those for real data are dominated by the impact of additive noise. Whatever the reason for these differences, their scale is cause for concern. Some plans to address this issue are outlined in our thoughts for future directions.
7 FUTURE DIRECTIONS
Each ASVspoof challenge raises new research questions and exposes ways in which the challenge can be developed and strengthened. A selection of these is presented here.
Additive noise and channel variability
It is likely that ambient and channel noise will degrade CM performance; thus it will be imperative to study the impact of such nuisance variation in future editions, e.g. as in [53]. Even if such practice is generally frowned upon, the artificial addition of nuisance variation in a controlled fashion may be appropriate at this stage. LA scenarios generally involve some form of telephony, e.g. VoIP. Coding and compression effects are readily simulated to some extent. In contrast, the consideration of additive noise is potentially more complex, for it influences speech production, e.g. the Lombard reflex [54]. The simulation of additive noise is then generally undesirable. An appropriate strategy to address these issues in future editions of ASVspoof demands careful reflection.

Quality of TTS/VC training data
For ASVspoof 2019, all TTS and VC systems were trained with data recorded in benign acoustic conditions. This setup is obviously not representative of in-the-wild scenarios where an adversary could acquire only relatively noisy training or adaptation data. Future editions of ASVspoof should hence consider TTS and VC attacks generated with more realistic data. Such attacks may be less effective in fooling ASV systems, and may also be more easily detectable.
Diversified spoofing attacks
ASVspoof presents an arguably naive view of potential spoofing attacks. Future editions should consider more diversified attacks, e.g. impersonation [55], attacks by twins or siblings [56], non-speech [57] or adversarial attacks [58], [59], [60], and attacks that are injected into specific regions of the speech signal rather than the entire utterance. One can also imagine blended attacks whereby, for instance, replay attacks are launched in an LA scenario, or replay attacks in a PA scenario are performed with speech data generated using TTS or VC systems.
Joint CM+ASV score calibration
With the 2019 edition having transitioned to an ASV-centric form of assessment with the min t-DCF metric, there are now not two, but three decision outcomes: target, non-target (both bona fide) and spoof. The existing approaches to calibrating binary classification scores are then no longer suitable. Future work could hence investigate approaches to joint CM+ASV system optimisation and calibration.
Reproducibility
Anecdotal evidence shows that some ASVspoof results are not reproducible. While it is not our intention to enforce reproducibility (doing so may deter participation), it is nonetheless something that we wish to promote. One strategy is to adopt the reviewing of system descriptions, either by the organisers or by fellow ASVspoof participants, or the reporting of system descriptions according to a harmonised format. Such a harmonised reporting format should include details of the fusion scheme and weights. This policy, together with a requirement for the submission of scores for each system in an ensemble, would also allow a more fine-grained study of fusion strategies and system complementarity.
Explainability
Explainability is a topic of growing importance in almost any machine learning task and is certainly lacking sufficient attention in the field of anti-spoofing. While the results reported in this paper show promising potential to detect spoofing attacks, we have learned surprisingly little about the artefacts or the cues that distinguish bona fide from spoofed speech. Future work which reveals these cues may be of use to the community in helping to design better CMs.
Standards
The t-DCF metric adopted for ASVspoof 2019 does notmeet security assessment standards, in particular the so-called
Common Criteria for Information Technology SecurityEvaluation [61], [62], [63]. Rather than quantifying a prob-abilistic meaning of some attack likeliness, common criteriaare based upon a category-based points scheme in order todetermine a so-called attack potential . This reflects e.g. theequipment, expertise and time required to mount the attackand the knowledge required of the system under attack.The rigour in common criteria is that each category of at-tack potential then demands hierarchically higher assurance components in order to meet protection profiles that express security assurance requirements . In the future, it may provebeneficial to explore the gap between the common criteriaand the t-DCF. To bridge this gap pragmatically, we needto determine the attack potentials for ASVspoof, askingourselves: 1)
How long does it take to mount a given attack? ;2)
What level of expertise is necessary? ; 3)
What resources (dataor computation) are necessary to execute it? ; 4)
What familiaritywith the ASV system is needed?
Clearly, providing the answersto these questions is far from being straightforward.
Challenge model
The organisation of ASVspoof has developed into a de-manding, major organisational challenge involving the coor-dination of 6 different organising institutes and 19 differentdata contributors for the most recent edition. While theorganisers enjoy the support of various different nationalresearch funding agencies, it is likely that we will need toattract additional industrial, institutional or public fundingto support the initiative in the future. To this end, we areliaising with the Security and Privacy in Speech Commu-nications (SPSC), the Speaker and Language Characterisa-tion (SpLC) and Speech Synthesis (SynSig) Special InterestGroups (SIGs) of the International Speech CommunicationAssociation (ISCA) regarding a framework with which tosupport the initiative in the longer term.With regards the format, the organising team commit-ted to making ASVspoof 2019 the last edition to be runas a special session at INTERSPEECH. With anti-spoofingnow featuring among the Editor’s Information Classifica-tion Scheme (EDICS) and topics of major conferences andleading journals/transactions, it is time for ASVspoof tomake way for more genuinely ‘special’ topics. Accordingly,we will likely transition in the future to a satellite workshopformat associated with an existing major event, such asINTERSPEECH.
ONCLUSIONS
ASVspoof 2019 is the third in the series of anti-spoofingchallenges for automatic speaker verification. It was the firstto consider both logical and physical access scenarios in asingle evaluation and the first to adopt the tandem detectioncost function as the default metric and also brought a seriesof additional advances with respect to the 2015 and 2017predecessors. With the database and experimental protocolsdescribed elsewhere, the current paper describes the chal-lenge results, findings and trends, with a focus on the top-performing systems for each scenario.Results reported in the paper are encouraging and pointto advances in countermeasure performance. For the log-ical access scenario, reliability seems to have kept apacewith the recent, impressive developments in speech syn-thesis and voice conversion technology, including the lat-est neural network-based waveform modelling techniques.For the physical access scenario, countermeasure perfor-mance is stable across diverse acoustic environments. Likemost related fields in recent years, the 2019 edition wasmarked with a shift towards deep architectures and en-semble systems that brought substantial improvements in
EEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE (T-BIOM), VOL. VV, NO. NN, MMMM YYYY 12 performance, though more so for the logical access scenariothan the physical access counterpart. There seems to havebeen greater diversity among each teams’ ensemble systemsfor the former, while there is evidence that those for thelatter suffer from over-fitting. In both cases, however, tandem systems exhibit high detection costs under specific condi-tions: either specific attack algorithms or specific acousticenvironments with costs stemming from either countermea-sures or automatic speaker verification systems.Many challenges remain. Particularly for the physicalaccess scenario, though likely also for the logical accessscenario, countermeasure performance may degrade in real-world conditions characterised by nuisance variation suchas additive noise. Results for real replay data with additivenoise show substantial gaps between results for simulated,noise-free data. It is furthermore reasonable to assume thatperformance will also be degraded by channel variability aswell as any mismatch in the training data used to generatespoofing attacks.Future editions of ASVspoof will also consider greaterdiversification in spoofing attacks and blended attackswhereby speech synthesis, voice conversion and replay at-tack strategies are combined. Countermeasure optimisationstrategies also demand further attention now that theyare assessed in tandem with automatic speaker verificationsystems. Future editions demand greater efforts to promotereproducibility and explainability, as well as some reflectionon the gap between ASVspoof and standards such as thecommon criteria. Lastly, the paper outlines our plans toadopt a satellite event format, rather than a special sessionformat for the next edition, tentatively, ASVspoof 2021. A CKNOWLEDGEMENTS
The ASVspoof 2019 organisers thank the following for theirinvaluable contribution to the LA data collection effort –Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang,Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, MarkusBecker, Fergus Henderson, Rob Clark, Yu Zhang, QuanWang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda,Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang,Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, IngmarSteiner, Driss Matrouf, Jean-Francois Bonastre, AvashnaGovender, Srikanth Ronanki, Jing-Xuan Zhang and Zhen-Hua Ling. We also extend our thanks to the many re-searchers and teams who submitted scores to the ASVspoof2019 challenge. Since participants are assured of anonymity,we regret that we cannot acknowledge them here by name. R EFERENCES [1] M. Sahidullah, H. Delgado, M. Todisco, T. Kinnunen, N. Evans,J. Yamagishi, and K.-A. Lee,
Introduction to Voice Presentation AttackDetection and Recent Advances . Springer International Publishing,2019.[2] N. Evans, J. Yamagishi, and T. Kinnunen, “Spoofing and coun-termeasures for speaker verification: a need for standard corpora,protocols and metrics,”
IEEE Signal Processing Society Speech andLanguage Technical Committee Newsletter , 2013.[3] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado,A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A.Lee, “ASVspoof 2019: future horizons in spoofed and fake audiodetection,” in
Proc. Interspeech , 2019, pp. 1008–1012. [4] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch,N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al. , “ASVspoof 2019: a large-scale public database of synthetic,converted and replayed speech,”
Elsevier Computer Speech andLanguage , vol. 64, November 2020.[5] T. Kinnunen, K. Lee, H. Delgado, N. Evans, M. Todisco,M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: adetection cost function for the tandem assessment of spoofingcountermeasures and automatic speaker verification,” in
Proc.Odyssey , 2018, pp. 312–319.[6] T. Kinnunen, H. Delgado, N. Evans, K. A. Lee, V. Vestman,A. Nautsch, M. Todisco, X. Wang, M. Sahidullah, J. Yamagishi, andD. A. Reynolds, “Tandem assessment of spoofing countermeasuresand automatic speaker verification: Fundamentals,”
IEEE/ACMTransactions on Audio, Speech, and Language Processing , vol. 28, pp.2195–2210, 2020.[7] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus:English multi-speaker corpus for CSTR voice cloning toolkit,”2017, http://dx.doi.org/10.7488/ds/1994.[8] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu,“WaveNet: A generative model for raw audio,” arXiv preprintarXiv:1609.03499 , 2016.[9] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio,T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018:Promoting development of parallel and nonparallel methods,” in
Proc. Odyssey , 2018, pp. 195–202.[10] ISO/IEC JTC1 SC37 Biometrics,
ISO/IEC 30107-1. Information Tech-nology - Biometric presentation attack detection - Part 1: Framework ,International Organization for Standardization, 2016.[11] D. R. Campbell, K. J. Palom¨aki, and G. Brown, “A MATLABsimulation of “shoebox” room acoustics for use in research andteaching.”
Computing and Information Systems Journal , vol. 9, no. 3,2005.[12] E. Vincent, “Roomsimove,” 2008, [Online] http://homepages.loria.fr/evincent/software/Roomsimove 1.4.zip.[13] A. Novak, P. Lotton, and L. Simon, “Synchronized swept-sine:Theory, application, and implementation,”
Journal of the AudioEngineering Society
Security and Communication Networks , vol. 9, no. 15, pp. 3030–3044,2016.[15] J. B. Allen and D. A. Berkley, “Image method for efficientlysimulating small-room acoustics,”
Journal of the Acoustical Societyof America , vol. 65, no. 4, pp. 943–950, 1979.[16] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstralcoefficients: A spoofing countermeasure for automatic speakerverification,”
Computer Speech & Language , vol. 45, pp. 516–535,2017.[17] ——, “Articulation rate filtering of CQCC features for automaticspeaker verification,” in
Proc. Interspeech , 2016, pp. 3628–3632.[18] M. Sahidullah, T. Kinnunen, and C. Hanilc¸i, “A comparison offeatures for synthetic speech detection,” in
Proc. Interspeech , 2015,pp. 2087–2091.[19] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur,“X-vectors: Robust DNN embeddings for speaker recognition,”in
Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing(ICASSP) , 2018, pp. 5329–5333.[20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al. ,“The Kaldi speech recognition toolkit,” in
Proc. IEEE Workshop onAutomatic Speech Recognition and Understanding (ASRU) , 2011.[21] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deepspeaker recognition,” in
Proc. Interspeech , 2018, pp. 1086–1090.[22] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in
Proc. Interspeech , 2017, pp.2616–2620.[23] S. Ioffe, “Probabilistic linear discriminant analysis,” in
Proc.European Conference on Computer Vision (ECCV) , A. Leonardis,H. Bischof, and A. Pinz, Eds., 2006, pp. 531–542.[24] S. J. Prince and J. H. Elder, “Probabilistic linear discriminantanalysis for inferences about identity,” in
Proc. IEEE Intl. Conf. onComputer Vision (ICCV) , 2007, pp. 1–8.
EEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE (T-BIOM), VOL. VV, NO. NN, MMMM YYYY 13 [25] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov,and A. Kozlov, “STC antispoofing systems for the ASVspoof2019challenge,” in
Proc. Interspeech , 2019, pp. 1033–1037.[26] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace:Deep hypersphere embedding for face recognition,” in
Proc. IEEEComputer Vision and Pattern Recognition (CVPR) , 2017, pp. 212–220.[27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” in
Proc. Intl.Conf. on Machine Learning (ICML) , 2015, pp. 448–456.[28] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:Surpassing human-level performance on ImageNet classification,”in
Proc. IEEE Intl. Conf. on Computer Vision (ICCV) , 2015, pp. 1026–1034.[29] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, andE. Khoury, “Generalization of Audio Deepfake Detection,”in
Proc. Odyssey , 2020, pp. 132–137. [Online]. Available:http://dx.doi.org/10.21437/Odyssey.2020-19[30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learningfor image recognition,” in
Proc. IEEE Computer Vision and PatternRecognition (CVPR) , 2016, pp. 770–778.[31] M. Lin, Q. Chen, and S. Yan, “Network in network,”
Proc. Intl.Conf. on Learning Representations (ICLR) , 2014.[32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,“Mobilenetv2: Inverted residuals and linear bottlenecks,” in
Proc.IEEE Computer Vision and Pattern Recognition (CVPR) , 2018, pp.4510–4520.[33] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger,“Densely connected convolutional networks,” in
Proc. IEEE Com-puter Vision and Pattern Recognition (CVPR) , 2017, pp. 4700–4708.[34] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ram´ırez, E. Benetos, andB. L. Sturm, “Ensemble models for spoofing detection in automaticspeaker verification,” in
Proc. Interspeech , 2019, pp. 1018–1022.[35] S. Chakroborty, A. Roy, and G. Saha, “Improved closed set text-independent speaker identification by combining MFCC withevidence from flipped filter banks,”
International Journal of SignalProcessing Systems (IJSPS) , vol. 4, no. 2, pp. 114–122, 2007.[36] J. M. K. Kua, T. Thiruvaran, M. Nosratighods, E. Ambikairajah,and J. Epps, “Investigation of spectral centroid magnitude andfrequency for speaker recognition,” in
Proc. Odyssey , 2010, pp. 34–39.[37] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scaleneural network for end-to-end audio source separation,” in
Proc.Intl. Society for Music Information Retrieval Conference (ISMIR) ,E. G´omez, X. Hu, E. Humphrey, and E. Benetos, Eds., 2018, pp.334–340.[38] Y. Yang, H. Wang, H. Dinkel, Z. Chen, S. Wang, Y. Qian, andK. Yu, “The SJTU robust anti-spoofing system for the ASVspoof2019 challenge,” in
Proc. Interspeech , 2019, pp. 1038–1042.[39] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Languagemodeling with gated convolutional networks,” in
Proc. Intl. Conf.on Machine Learning (ICML) , 2017, pp. 933–941.[40] K. Cho, B. van Merri¨enboer, C. Gulcehre, D. Bahdanau,F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase rep-resentations using RNN encoder–decoder for statistical machinetranslation,” in
Proc. Empirical Methods in Natural Language Process-ing (EMNLP) , 2014, pp. 1724–1734.[41] X. Cheng, M. Xu, and T. F. Zheng, “Replay detection using CQT-based modified group delay feature and ResNeWt network inASVspoof 2019,” in
Proc. Asia-Pacific Signal and Information Pro-cessing Association Annual Summit and Conference (APSIPA ASC) ,2019, pp. 540–545.[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learningfor image recognition,” in
Proc. IEEE Computer Vision and PatternRecognition (CVPR) , 2016, pp. 770–778.[43] W. Cai, H. Wu, D. Cai, and M. Li, “The DKU replay detectionsystem for the ASVspoof 2019 challenge: On data augmentation,feature representation, classification, and fusion,” in
Proc. Inter-speech , 2019, pp. 1023–1027.[44] F. Tom, M. Jain, and P. Dey, “End-toend audio replay attackdetection using deep convolutional networks with attention,” in
Proc. Interspeech , 2018, pp. 681–685.[45] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “ASSERT:Anti-Spoofing with Squeeze-Excitation and Residual Networks,”in
Proc. Interspeech , 2019, pp. 1013–1017. [Online]. Available:http://dx.doi.org/10.21437/Interspeech.2019-1794 [46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
Proc. IEEE Computer Vision and Pattern Recognition (CVPR) , 2018,pp. 7132–7141.[47] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse, “Flipout: Efficientpseudo-independent weight perturbations on mini-batches,” in
Proc. Intl. Conf. on Learning Representations (ICLR) , 2018.[48] R. Białobrzeski, M. Ko´smider, M. Matuszewski, M. Plata, andA. Rakowski, “Robust Bayesian and Light Neural Networks forVoice Spoofing Detection,” in
Proc. Interspeech 2019 , 2019, pp. 1028–1032.[49] C.-I. Lai, A. Abad, K. Richmond, J. Yamagishi, N. Dehak, andS. King, “Attentive filtering networks for audio replay attackdetection,” in
Proc. IEEE Intl. Conf. on Acoustics, Speech, and SignalProcessing (ICASSP) , 2019, pp. 6316–6320.[50] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmenta-tion for speech recognition,” in
Proc. Interspeech , 2015, pp. 171–175.[51] K. Kobayashi, T. Toda, and S. Nakamura, “Intra-gender statisticalsinging voice conversion with direct waveform modification usinglog-spectral differential,”
Speech Communication , vol. 99, pp. 211–220, 2018.[52] N. Br ¨ummer and E. de Villiers, “The BOSARIS toolkit user guide:Theory, algorithms and code for binary classifier score process-ing,” [Online] https://sites.google.com/site/bosaristoolkit, AG-NITIO Research, South Africa, Tech. Rep., 12 2011.[53] Y. Gong, J. Yang, J. Huber, M. MacKnight, and C. Poellabauer,“ReMASC: Realistic Replay Attack Corpus for Voice ControlledSystems,” in
Proc. Interspeech 2019 , 2019, pp. 2355–2359. [Online].Available: http://dx.doi.org/10.21437/Interspeech.2019-1541[54] E. Lombard, “Le signe de l’elevation de la voix,”
Ann. Malad.l’Oreille Larynx , no. 37, pp. 101–119, 1911.[55] V. Vestman, T. Kinnunen, R. Gonz´alez Hautam¨aki, and M. Sahidul-lah, “Voice mimicry attacks assisted by automatic speaker verifi-cation,”
Computer Speech & Language , vol. 59, pp. 36–54, 2020.[56] M. R. Kamble, H. B. Sailor, H. A. Patil, and H. Li, “Advancesin anti-spoofing: from the perspective of ASVspoof challenges,”
APSIPA Transactions on Signal and Information Processing , vol. 9,2020.[57] F. Alegre, R. Vipperla, N. Evans, and B. Fauve, “On the vulnera-bility of automatic speaker recognition to spoofing attacks withartificial signals,” in
Proc. European Signal Processing Conference(EUSIPCO) , 2012, pp. 36–40.[58] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, “Fooling end-to-endspeaker verification with adversarial examples,” in
Proc. IEEE Intl.Conf. on Acoustics, Speech and Signal Processing (ICASSP) , 2018, pp.1962–1966.[59] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, “Adversarialattacks on GMM i-vector based speaker verification systems,” in
Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing(ICASSP) , 2020, pp. 6579–6583.[60] R. K. Das, X. Tian, T. Kinnunen, and H. Li, “The attacker’sperspective on automatic speaker verification: An overview,” in
Proc. Interspeech (to appear) , 2020.[61] Common Criteria Recognition Arrangement (CCRA),
CommonCriteria for Information Technology Security Evaluation — Part 3:Security assurance components , Common Criteria for InformationTechnology Security Evaluation (CC) and Common Methodologyfor Information Technology Security Evaluation (CEM), 2017.[62] ISO/IEC JTC1 SC27 Security Techniques,
ISO/IEC DIS 19989-1:2019. Information security — Criteria and methodology for securityevaluation of biometric systems — Part 1: Framework , InternationalOrganization for Standardization, 2019.[63] ——,
ISO/IEC DIS 19989-3:2019. Information security — Criteria andmethodology for security evaluation of biometric systems — Part 3:Presentation attack detection , International Organization for Stan-dardization, 2019.
EEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE (T-BIOM), VOL. VV, NO. NN, MMMM YYYY 14
Andreas Nautsch is with the Audio Securityand Privacy research group (EURECOM). Hereceived the doctorate from Technische Univer-sit¨at Darmstadt in 2019. From 2014 to 2018,he was with the da/sec Biometrics and Internet-Security research group (Hochschule Darm-stadt) within the German National ResearchCenter for Applied Cybersecurity. He receivedB.Sc. and M.Sc. degrees from HochschuleDarmstadt (dual studies with atip GmbH) re-spectively in 2012 and 2014. He served as anexpert delegate to ISO/IEC and as project editor of the ISO/IEC 19794-13:2018 standard. Andreas serves currently as associate editor of theEURASIP Journal on Audio, Speech, and Music Processing, and is a co-initiator and secretary of the ISCA Special Interest Group on Security &Privacy in Speech Communication.
Xin Wang (S’16 - M’18) is a project researcherat National Institute of Informatics, Japan. Hereceived the Ph.D. degree from SOKENDAI,Japan, in 2018. Before that, he received M.S.and B.E degrees from University of Science andTechnology of China and University of Elec-tronic Science and Technology of China in 2015and 2012, respectively. His research interestsinclude statistical speech synthesis and machinelearning.
Nicholas Evans is a Professor at EURECOM,France, where he heads research in Audio Se-curity and Privacy. He is a co-founder of thecommunity-led, ASVspoof Challenge series andhas lead or co-lead a number of special issuesand sessions with an anti-spooing theme. Heparticipated in the EU FP7 Tabula Rasa andEU H2020 OCTAVE projects, both involving anti-spoofing. Today, his team is leading the EUH2020 TReSPAsS-ETN project, a training initia-tive in security and privacy for multiple biometrictraits. He co-edited the second edition of the Handbook of BiometricAnti-Spoofing, served previously on the IEEE Speech and LanguageTechnical Committee and serves currently as an asscociate editor forthe IEEE Trans. on Biometrics, Behavior, and Identity Science.
Tomi H. Kinnunen is an Associate Professor atthe University of Eastern Finland. He receivedhis Ph.D. degree in computer science from theUniversity of Joensuu in 2005. From 2005 to2007, he was an Associate Scientist at the In-stitute for Infocomm Research (I2R), Singapore.Since 2007, he has been with UEF. From 2010-2012, he was funded by a postdoctoral grantfrom the Academy of Finland. He has been a PIor co-PI in three other large Academy of Finland-funded projects and a partner in the H2020-funded OCTAVE project. He chaired the
Odyssey workshop in 2014.From 2015 to 2018, he served as an Associate Editor for IEEE/ACMTrans. on Audio, Speech and Language Processing and from 2016 to2018 as a Subject Editor in
Speech Communication . In 2015 and 2016,he visited the National Institute of Informatics, Japan, for 6 months undera mobility grant from the Academy of Finland, with a focus on voiceconversion and spoofing. Since 2017, he has been Associate Professorat UEF, where he leads the Computational Speech Group. He is oneof the cofounders of the ASVspoof challenge, a nonprofit initiative thatseeks to evaluate and improve the security of voice biometric solutionsunder spoofing attacks.
Ville Vestman is an Early Stage Researcherat the University of Eastern Finland (UEF). Hereceived his M.S. degree in mathematics fromUEF in 2013. Since 2015, his research work atUEF has been focused on speech technologyand, more specifically, on speaker recognition.He is one of the co-organizers of the ASVspoof2019 challenge.
Massimiliano Todisco is an Assistant Profes-sor within the Digital Security Department atEURECOM, France. He received his Ph.D. de-gree in Sensorial and Learning Systems Engi-neering from the University of Rome Tor Ver-gata in 2012. Currently, he is serving as princi-pal investigator and coordinator for TReSPAsS-ETN, a H2020 Marie Skłodowska-Curie Inno-vative Training Network (ITN) and RESPECT, aPRCI project funded by the French ANR and theGerman DFG. He co-organises the ASVspoofchallenge series, which is community-led challenges which promote thedevelopment of countermeasures to protect automatic speaker verifica-tion (ASV) from the threat of spoofing. He is the inventor of constantQ cepstral coefficients (CQCC), the most commonly used anti-spoofingfeatures for speaker verification and first author of the highest-citedtechnical contribution in the field in the last three years. He has morethan 90 publications. His current interests are in developing end-to-endarchitectures for speech processing and speaker recognition, fake audiodetection and anti-spoofing, and the development of privacy preser-vation algorithms for speech signals based on encryption solutionsthat support computation upon signals, templates and models in theencrypted domain.
H´ector Delgado received his Ph.D. degreein Telecommunication and System Engineeringfrom the Autonomous University of Barcelona(UAB), Spain, in 2015. From 2015 to 2019 hewas with the Speech and Audio Processing Re-search Group at EURECOM (France). Since2019 he is a Senior Research Scientist at Nu-ance Communications Inc. He serves as an as-sociate editor for the EURASIP Journal on Au-dio, Speech, and Music Processing. He is a co-organiser of the ASVspoof challenge since its2017 edition. His research interests include signal processing and ma-chine learning applied to speaker recognition and diarization, speakerrecognition anti-spoofing and audio segmentation.
EEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE (T-BIOM), VOL. VV, NO. NN, MMMM YYYY 15
Md Sahidullah (S’09, M’15) received his Ph.D.degree in the area of speech processing fromthe Department of Electronics & Electrical Com-munication Engineering, Indian Institute of Tech-nology Kharagpur in 2015. Prior to that heobtained the Bachelors of Engineering degreein Electronics and Communication Engineeringfrom Vidyasagar University in 2004 and the Mas-ters of Engineering degree in Computer Scienceand Engineering from West Bengal Universityof Technology in 2006. In 2014-2017, he wasa postdoctoral researcher with the School of Computing, University ofEastern Finland. In January 2018, he joined MULTISPEECH team, Inria,France as a post-doctoral researcher where he currently holds a startingresearch position. His research interest includes robust speaker recog-nition and spoofing countermeasures. He is also part of the organizingteam of two Automatic Speaker Verification Spoofing and Countermea-sures Challenges: ASVspoof 2017 and ASVspoof 2019. Presently, heis also serving as Associate Editor for the IET Signal Processing andCircuits, Systems, and Signal Processing.
Junichi Yamagishi (SM’13) is a professor at Na-tional Institute of Informatics in Japan. He is alsoa senior research fellow in the Centre for SpeechTechnology Research (CSTR) at the Universityof Edinburgh, UK. He was awarded a Ph.D.by Tokyo Institute of Technology in 2006 for athesis that pioneered speaker-adaptive speechsynthesis and was awarded the Tejima Prize asthe best Ph.D. thesis of Tokyo Institute of Tech-nology in 2007. Since 2006, he has authoredand co-authored over 250 refereed papers ininternational journals and conferences. He was awarded the ItakuraPrize from the Acoustic Society of Japan, the Kiyasu Special IndustrialAchievement Award from the Information Processing Society of Japan,and the Young Scientists’ Prize from the Minister of Education, Scienceand Technology, the JSPS prize, the Docomo mobile science award in2010, 2013, 2014, 2016, and 2018, respectively. He served previouslyas co-organizer for the bi-annual ASVspoof special sessions at INTER-SPEECH 2013-9, the bi-annual Voice conversion challenge at INTER-SPEECH 2016 and Odyssey 2018, an organizing committee memberfor the 10th ISCA Speech Synthesis Workshop 2019 and a technicalprogram committee member for IEEE ASRU 2019. He also served asa member of the IEEE Speech and Language Technical Committee,as an Associate Editor of the IEEE/ACM TASLP and a Lead GuestEditor for the IEEE JSTSP SI on Spoofing and Countermeasures forAutomatic Speaker Verification. He is currently a PI of JST-CREST andANR supported VoicePersonae project. He also serves as a chairpersonof ISCA SynSIG and as a Senior Area Editor of the IEEE/ACM TASLP.
Kong Aik Lee (M’05-SM’16) is currently a Se-nior Scientist at Institute for Infocomm Research,A*STAR, Singapore. From 2018 to 2020, he wasa Senior Principal Researcher at the Biomet-rics Research Laboratories, NEC Corp., Japan.He received his Ph.D. degree from NanyangTechnological University, Singapore, in 2006.From 2006 to 2018, he was a Scientist at theHuman Language Technology department, I2