A Study of F0 Modification for X-Vector Based Speech Pseudonymization Across Gender
Pierre Champion, Denis Jouvet, Anthony Larcher
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France. Le Mans Université, LIUM, France
{pierre.champion, denis.jouvet}@inria.fr, [email protected]

Abstract
Speech pseudonymization aims at altering a speech signal to map the identifiable personal characteristics of a given speaker to another identity. In other words, it aims to hide the source speaker identity while preserving the intelligibility of the spoken content. This study takes place in the VoicePrivacy 2020 challenge framework, where the baseline system performs pseudonymization by modifying x-vector information to match a target speaker while keeping the fundamental frequency (F0) unchanged. We propose to alter other paralinguistic features, here F0, and analyze the impact of this modification across gender. We found that the proposed F0 modification always improves pseudonymization. We observed that both source and target speaker genders affect the performance gain when modifying the F0.
Introduction
In many applications, such as virtual assistants, the speech signal is sent from the device to centralized servers where data is collected, processed, and stored. Recent regulations, e.g., the General Data Protection Regulation (GDPR) (Parliament and Council 2016) in the EU, emphasize privacy preservation and the protection of personal data. As speech data can reflect both biological and behavioral characteristics of the speaker, it qualifies as personal data (Nautsch et al. 2019). The research reported in this paper has been done in the context of the VoicePrivacy challenge framework (Tomashenko et al. 2020), one of the first attempts of the speech community to encourage research on this topic, define the task, and introduce metrics, datasets, and protocols. Anonymization is performed to suppress the personally identifiable paralinguistic information from a speech utterance while maintaining the linguistic content. The task of the VoicePrivacy challenge is to degrade automatic speaker verification performance by removing speaker identity as much as possible, while keeping the linguistic content intelligible. This task is also referred to as speaker anonymization (Fang et al. 2019) or de-identification (Magariños et al. 2017).
Anonymization systems in the VoicePrivacy challenge should satisfy the following requirements:
• output a speech waveform;
• conceal the speaker's identity;
• keep the linguistic content intelligible;
• modify the speech signal of a given speaker to always sound like a unique target pseudo-speaker, while different speakers' speech must not be similar.
The fourth requirement constrains the system to have a one-to-one mapping between the real speaker identities and a pseudo-speaker. Such a system can be considered a voice conversion system where the output speaker identity resides in a pseudonymized space. The GDPR defines pseudonymization as: "processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person" (Art. 4.5 of the GDPR (Parliament and Council 2016)). Pseudonymization techniques differ from anonymization techniques. With anonymization, data is modified so that any information that may serve as an identifier of a subject is deleted. Pseudonymization enhances privacy by replacing most identifying information within the data with artificial identifiers. Per the requirements imposed by the VoicePrivacy challenge, and the above definition from the GDPR, the challenge requires contestants to build pseudonymization systems. The VoicePrivacy challenge focuses on modifying the speech characteristics while keeping the linguistic content unchanged; hence, removing personal information from the linguistic content is not part of the challenge. Recently, Fang et al. (Fang et al. 2019) proposed a speech synthesis pipeline where only the continuous speaker representation (the x-vector (Snyder et al. 2018)) is modified.
Linguistic-related information necessary to generate anonymized speech is left untouched. The corresponding toolchain alters neither the fundamental frequency (F0) input values nor the features describing the articulation of speech sounds (the Phoneme Posterior-Grams (PPGs) (Sun et al. 2016)). The F0 values of speech determine the perceived relative highness or lowness of the sound; F0 plays an indispensable role for the listener as it helps to perceive a variety of paralinguistic and prosodic information (Gussenhoven 2004). Analysis of the F0, which is typically higher in female voices than in male voices, can be used to characterize speaker-related attributes. In this paper, we use the pipeline proposed by Fang et al. (Fang et al. 2019) in the VoicePrivacy challenge 2020 (Tomashenko et al. 2020), and discuss what improvement may be obtained by modifying the F0 values. The remainder of the paper is structured as follows. First, we review the baseline framework and explain the conversion process. Second, we describe the experimental setup. Then we present and discuss the results. Finally, we conclude the paper.

Anonymization technique
The baseline system
The VoicePrivacy challenge provides two baseline systems: Baseline-1, which anonymizes speech utterances using x-vectors and neural waveform models (Fang et al. 2019), and Baseline-2, which performs anonymization using the McAdams coefficient (McAdams 1984). Our contributions are based on Baseline-1, which is referred to as the baseline system in this paper.

Figure 1: The speaker anonymization pipeline. Modules A, B and C are parts of the baseline model. We added module D to modify the F0 values, which are later used by module C.

The central concept of the baseline system introduced in (Fang et al. 2019) is to separate speaker identity and linguistic content from an input speech utterance. Assuming that this information is disentangled, an anonymized speech waveform can be obtained by altering only the features that encode the speaker's identity. The anonymization system illustrated in Figure 1 breaks down the anonymization process into three groups of modules:
A - Feature extraction comprises three modules that respectively extract the fundamental frequency, PPG-like bottleneck features, and the speaker's x-vector from the input signal. Then, B - Anonymization derives a new pseudo-speaker identity using knowledge gleaned from a pool of external speakers. Finally, C - Speech synthesis synthesizes a speech waveform from the pseudo-speaker x-vector together with the original PPG features and the original F0, using an acoustic model (Tomashenko et al. 2020) and a neural waveform model (Wang, Takaki, and Yamagishi 2020). For all utterances of a given speaker, a single target pseudo-speaker is used to modify the input speech. This strategy, described as perm in (Srivastava et al. 2020b), ensures that a one-to-one mapping exists between the source speaker identity and the target pseudo-speaker.

x-vector pseudonymization
Given the baseline system, where only the x-vector identity is changed, the selection algorithm used to derive a pseudo-identity plays an important role. Many criteria can be chosen to select the target pseudo-speaker identity. Recent research by (Srivastava et al. 2020a) has outlined multiple selection techniques for the VoicePrivacy challenge. The baseline's pseudo-speaker selection is performed by averaging a set of candidate x-vectors from the speaker pool. The candidate x-vectors are selected by retrieving the 200 speakers farthest from the original x-vector. From this subset of 200 x-vectors, a set of 100 x-vectors is randomly chosen to create the pseudo-speaker x-vector. Speaker distances are computed according to probabilistic linear discriminant analysis (PLDA). The speaker pool is composed of speakers from the LibriTTS-train-other-500 (Zen et al. 2015) dataset. This dataset is not used elsewhere in our experiments.
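This selection procedure can be sketched as follows. This is a minimal illustration, not the challenge code: the `select_pseudo_xvector` name, the dictionary interfaces, and the assumption that PLDA distances to the pool are precomputed are ours.

```python
import random

def select_pseudo_xvector(distances_to_pool, pool_xvectors,
                          n_farthest=200, n_average=100, seed=0):
    """Sketch of the baseline's pseudo-speaker selection.

    distances_to_pool: dict mapping pool speaker id -> PLDA distance
                       from the source x-vector (larger = farther).
    pool_xvectors:     dict mapping pool speaker id -> x-vector (list of floats).
    """
    # 1) Keep the n_farthest pool speakers according to PLDA distance.
    farthest = sorted(distances_to_pool, key=distances_to_pool.get,
                      reverse=True)[:n_farthest]
    # 2) Randomly pick n_average of them.
    rng = random.Random(seed)
    chosen = rng.sample(farthest, n_average)
    # 3) Average their x-vectors to form the pseudo-speaker x-vector.
    dim = len(next(iter(pool_xvectors.values())))
    pseudo = [sum(pool_xvectors[s][d] for s in chosen) / n_average
              for d in range(dim)]
    return pseudo, chosen
```

Because the second step is random, two runs with different seeds yield different pseudo-speakers, which matters for the a-a attack scenario discussed later.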
Gender selection
Information conveyed by the x-vector embeddings can be used for tasks other than speaker recognition/verification. Work by (Raj, Snyder, and Povey 2019) has shown that session and gender information, along with other characteristics, are also encoded in x-vectors. The aforementioned x-vector anonymization procedure is designed to select a pseudo-speaker identity of the same gender as the source speaker. Constraining the x-vector anonymization procedure to target x-vectors of the same gender as the source is referred to as Same, while constraining the selection to target the opposite gender is referred to as Opposite. Same and Opposite gender selection were experimentally studied by (Srivastava et al. 2020a). Work on gender-independent selection still needs to be done. In this paper, we focus our experiments on the Same and Opposite gender selections. We discuss the impact that F0 modification has on female and male speakers when using these two selection algorithms.
Speech synthesis
The speech synthesizer (cf. pipeline C in Figure 1) in the VoicePrivacy baseline system is composed of a speech synthesis acoustic model, used to generate mel-filterbank features, and a vocoder, used to generate a speech signal. The vocoder used in the baseline is a Neural Source-Filter (NSF) waveform model (Wang, Takaki, and Yamagishi 2020). NSF models use the F0 information to produce a sine-based excitation signal that is later transformed by filters into a waveform. Manipulating the F0 values therefore affects both the speech synthesis acoustic model and the vocoder when transforming the speech signal.
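To illustrate why F0 matters to the vocoder, the sine-based excitation of an NSF-style source can be sketched as follows. This is a simplified illustration under our own assumptions (frame shift, noise level, a single sinusoid without harmonics), not the baseline's NSF implementation:

```python
import math
import random

def sine_excitation(f0_frames, sample_rate=16000, frame_shift=0.005,
                    noise_std=0.003, seed=0):
    """Sketch of an NSF-style source signal: a sine whose instantaneous
    frequency follows the frame-level F0 for voiced frames (F0 > 0),
    and weak Gaussian noise for unvoiced frames (F0 == 0)."""
    rng = random.Random(seed)
    samples_per_frame = int(sample_rate * frame_shift)
    phase = 0.0
    excitation = []
    for f0 in f0_frames:
        for _ in range(samples_per_frame):
            if f0 > 0:  # voiced: accumulate the phase of a sine at F0
                phase += 2.0 * math.pi * f0 / sample_rate
                excitation.append(math.sin(phase))
            else:       # unvoiced: weak noise excitation
                excitation.append(rng.gauss(0.0, noise_std))
    return excitation
```

Any change to the F0 frames directly changes the periodicity of this excitation, which is why F0 manipulation propagates all the way to the synthesized waveform.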
In the VoicePrivacy baseline, the F0 values extracted from the source speech are directly used (unchanged) by the speech synthesizer pipeline (acoustic model and neural vocoder), even though a different target pseudo-speaker was selected. Multiple works have investigated F0-conditioned voice conversion (Bahmaninezhad, Zhang, and Hansen 2018; Huang et al. 2020; Qian et al. 2020; Ueda et al. 2015). In several of these works, modifying the F0 improves the quality of the converted voice. Motivated by those results, we propose to modify the F0 values of a source utterance from a given speaker (cf. module D in Figure 1) using the following linear transformation:

x̂_t = μ_y + (σ_y / σ_x) (x_t − μ_x)

where x_t represents the log-scaled F0 of the source speaker at frame t, and μ_x and σ_x represent the mean and standard deviation of the log-scaled F0 for the source speaker. μ_y and σ_y represent the mean and standard deviation of the log-scaled F0 for the pseudo-speaker. The linear transformation and statistics computation are performed on voiced frames only. The mean and standard deviation for the target pseudo-speaker are calculated by averaging the statistics of the same 100 speakers selected to derive the pseudo-speaker x-vector.

Experimental setup
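The log-F0 transformation of module D can be sketched in Python as follows. This is a minimal illustration under our own conventions (an F0 of 0.0 marks an unvoiced frame and passes through unchanged; the four statistics are assumed precomputed in the log domain), not the released pipeline code:

```python
import math

def transform_f0(source_f0, src_mean, src_std, tgt_mean, tgt_std):
    """Apply  x̂_t = μ_y + (σ_y / σ_x)(x_t − μ_x)  in the log-F0 domain.

    source_f0: per-frame F0 values in Hz; 0.0 marks unvoiced frames.
    src_mean/src_std: mean and std of the source speaker's log F0.
    tgt_mean/tgt_std: mean and std of the pseudo-speaker's log F0.
    """
    out = []
    for f0 in source_f0:
        if f0 <= 0.0:            # unvoiced frame: leave untouched
            out.append(f0)
            continue
        x = math.log(f0)
        x_hat = tgt_mean + (tgt_std / src_std) * (x - src_mean)
        out.append(math.exp(x_hat))
    return out
```

A frame at the source's mean log F0 is mapped exactly to the pseudo-speaker's mean, and deviations are rescaled by the ratio of standard deviations.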
Data
All experiments were based on the publicly available challenge baseline. The development and evaluation sets are built from LibriSpeech test-clean. The pool of external speakers on which x-vectors and F0 statistics are computed is LibriTTS train-other-500. Additional information on the number of speakers and the gender distributions can be found in the evaluation plan (Tomashenko et al. 2020).

Attack models
One of the requirements of the VoicePrivacy challenge is to conceal the speaker's identity. To assess the robustness of anonymization systems, two attack models were designed (cf. evaluation plan). The first scenario consists of a user who publishes anonymized speech and an attacker who uses one enrollment utterance of non-anonymized (original) speech to compute a linkability score. In this scenario (referred to as o-a in Figure 2), the goal is to ensure that the original speaker identity is not the same as the one in the generated anonymized speech. Performant systems are expected to show low linkability. The second scenario consists of a user who also publishes anonymized speech, but this time the attacker anonymizes an enrollment utterance using the exact same anonymization pipeline except for the random seed. This scenario (referred to as a-a in Figure 2) is defined as a Semi-Informed attacker in the work by (Srivastava et al. 2020b). Hence, the pseudo-speaker corresponding to a given speaker in the enrollment set is different from the pseudo-speaker corresponding to that same speaker in the trial set, as mentioned in Section 3.3 of the evaluation plan. Consequently, we also expect low linkability in this a-a scenario, even though the attacker has gained some knowledge about the anonymization system.

https://github.com/Voice-Privacy-Challenge

Utility and linkability metrics
To evaluate the performance of the system in terms of both linkability (the ability to conceal the speaker's identity) and utility (content intelligibility), two systems are used. To assess linkability, a pre-trained x-vector-PLDA based Automatic Speaker Verification (ASV) system provided by the challenge organizers is used. The privacy protection is measured in terms of C_llr^min, as this measure provides an application-independent (Brummer and Preez 2006) evaluation score. As the Equal Error Rate (EER) measure is more often used in speaker verification, we present our results in terms of both EER and C_llr^min. Those metrics are computed using the cllr toolkit of the challenge. For utility, a pre-trained Automatic Speech Recognition (ASR) system provided by the challenge organizers is used to decode the anonymized speech and compute the Word Error Rate (WER%). In this challenge, the WER% measure is used to evaluate how well the content is kept intelligible. Both the ASR and ASV systems are trained on LibriSpeech train-clean-360 using Kaldi (Povey et al. 2011). The higher the EER/C_llr^min, the better the systems are capable of "concealing a speaker identity". The lower the WER%, the more intelligible the anonymized speech.

Experimental results
All results are compared to the VoicePrivacy baseline system. The pseudonymization pipeline with the F0 modification contribution is publicly available. Figure 2 details the speaker linkability scores for original-to-anonymized (o-a) ASV tests, and for anonymized-to-anonymized (a-a) ASV tests, under different gender selection and F0 modification setups. The original-to-anonymized test case helps to assess how capable systems are at modifying the original speech to make it sound like another speaker's speech. As the system used to evaluate the linkability between original and anonymized speech is domain-dependent (Srivastava et al. 2020b), and only trained on original speech, it is of no surprise that the baseline provided in the challenge already shows good results. As for the anonymized-to-anonymized test, enrolling the ASV system with anonymized data brings some speaker information into the process, although the pseudo-speaker x-vector is not exactly the same between enrollment and trial utterances, because of the random part of the x-vector selection process (see the section on x-vector pseudonymization above). Given this evaluation framework, our goal is to further degrade the linkability under both attack models. For each anonymization pipeline setup, the corresponding WER% values are reported in Table 1.

https://gitlab.eurecom.fr/nautsch/cllr/
https://github.com/deep-privacy/Voice-Privacy-Challenge-2020

Figure 2: EER (%) score obtained by the ASV evaluation system on LibriSpeech test sets. The C_llr^min score is displayed on top of each bar. Multiple pipeline setups are reported for the gender selection and F0 modification. o – original, a – anonymized speech data for enrollment and trial parts. The entry "Same gender - Original F0" corresponds to the challenge baseline system.

Male linkability
In the original-to-anonymized attack scenario (o-a in Figure 2), we can observe that the proposed F0 modification does not affect the already good male un-linkability performance when compared to the challenge's baseline ("Same gender - Modified F0" compared to "Same gender - Original F0"). It appears that selecting an x-vector from the opposite gender without applying the F0 modification always degrades the pseudonymization un-linkability ("Opposite gender - Original F0" compared to "Same gender - Original F0"). Applying the F0 modification together with the opposite-gender x-vector selection does not improve performance. This limitation might come from the x-vector selection algorithm, where the farthest speakers are selected to derive the pseudo-identity. Regarding the anonymized-to-anonymized attack scenario (a-a in Figure 2), using the baseline anonymization setup, the attacker is able to re-identify the user to a much higher degree. On its own, the F0 modification always improves over the baseline performance. Jointly selecting the opposite gender and applying the F0 modification appears to be an excellent design choice against this attacker.

Female linkability
Contrary to the male results, the proposed F0 modification always improves the pseudonymization for female speakers in the original-to-anonymized attack scenario. This effect is observed regardless of the gender of the x-vector selection ("Same gender - Modified F0" compared to "Same gender - Original F0", and "Opposite gender - Modified F0" compared to "Opposite gender - Original F0"). Applying both the F0 modification and the opposite-gender x-vector selection beats the baseline system. The anonymized-to-anonymized attack scenario leads to conclusions similar to those for male speakers. Jointly modifying the gender for the x-vector selection and applying the F0 modification always improves pseudonymization. It is worth noting that female speakers are more sensitive to F0 modification than male speakers, meaning that the source speaker's gender plays a role in choosing the best anonymization procedure.

Speech intelligibility
Gender selection   F0         Test WER%
Same               Original   6.73
Same               Modified   6.92
Opposite           Original   7.24
Opposite           Modified   6.74

Table 1: Speech recognition results in terms of WER% for the LibriSpeech test set.

Across all experiments, the utility (Table 1) is not strongly affected by the gender x-vector selection, the F0 modification, or the two modifications applied together. The higher WER% score (7.24) obtained with the opposite x-vector gender selection and no F0 modification might come from the fact that the ASR model used for the evaluation was trained on audiobook data, and from the fact that selecting the opposite gender without modifying the F0 might lead to some inconsistencies in the speech signal.

Conclusions
In this work, we proposed to alter the F0 paralinguistic information in an x-vector based speech pseudonymization system. We evaluated this modification against the Opposite and Same gender x-vector target selections to obtain various anonymization setups. We objectively evaluated the F0 modification using the VoicePrivacy 2020 challenge tools. The performance was assessed in terms of EER/C_llr^min to measure privacy protection and WER% to measure utility. We observed that keeping the original F0 values retains some information about the original speaker. The experiments show that applying the F0 modification and selecting an x-vector from the Opposite gender allows for better privacy protection against attackers who have access to the anonymization pipeline. Our results also show that the performance of anonymization depends on the gender of the source. This raises the question of the importance of personalized modification in a privacy context. In future work, we plan to subjectively evaluate the naturalness of the generated speech. We think the F0 modification helps to produce more natural speech when an Opposite gender x-vector is selected, because the F0 features will be coherent with the selected gender.
Acknowledgments
This work was supported in part by the French National Research Agency under project DEEP-PRIVACY (ANR-18-CE23-0018) and Région Lorraine.

References
Bahmaninezhad, F.; Zhang, C.; and Hansen, J. H. L. 2018. Convolutional Neural Network Based Speaker De-Identification. In Odyssey.
Brummer, N.; and Preez, J. 2006. Application-independent evaluation of speaker detection. Computer Speech & Language.
Fang, F.; Wang, X.; Yamagishi, J.; Echizen, I.; Todisco, M.; Evans, N.; and Bonastre, J.-F. 2019. Speaker Anonymization Using X-vector and Neural Waveform Models. In Proc. 10th ISCA Speech Synthesis Workshop.
Gussenhoven, C. 2004. Pitch in Language I: Stress and Intonation. Research Surveys in Linguistics. Cambridge University Press.
Huang, W.-C.; Luo, H.; Hwang, H.-T.; Lo, C.-C.; Peng, Y.-H.; Tsao, Y.; and Wang, H.-M. 2020. Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion. IEEE Transactions on Emerging Topics in Computational Intelligence.
Magariños, C.; Lopez-Otero, P.; Docio-Fernandez, L.; Rodriguez-Banga, E.; Erro, D.; and Garcia-Mateo, C. 2017. Reversible speaker de-identification using pre-trained transformation functions. Computer Speech & Language.
McAdams, S. 1984. Spectral fusion, spectral parsing and the formation of the auditory image. Ph.D. Thesis, Stanford.
Nautsch, A.; Jasserand, C.; Kindt, E.; Todisco, M.; Trancoso, I.; and Evans, N. 2019. The GDPR & Speech Data: Reflections of Legal and Technology Communities, First Steps Towards a Common Understanding. In Proc. Interspeech.
Parliament, E.; and Council. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC. General Data Protection Regulation.
Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; Silovský, J.; Stemmer, G.; and Veselý, K. 2011. The Kaldi speech recognition toolkit. IEEE Workshop on Automatic Speech Recognition and Understanding.
Qian, K.; Jin, Z.; Hasegawa-Johnson, M.; and Mysore, G. J. 2020. F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder. IEEE ICASSP.
Raj, D.; Snyder, D.; and Povey, D. 2019. Probing the Information Encoded in X-Vectors. In IEEE ASRU.
Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; and Khudanpur, S. 2018. X-vectors: Robust DNN Embeddings for Speaker Recognition. In IEEE ICASSP.
Srivastava, B. M. L.; Tomashenko, N.; Wang, X.; Vincent, E.; Yamagishi, J.; Maouche, M.; Bellet, A.; and Tommasi, M. 2020a. Design Choices for X-vector Based Speaker Anonymization. Proc. Interspeech.
Srivastava, B. M. L.; Vauquier, N.; Sahidullah, M.; Bellet, A.; Tommasi, M.; and Vincent, E. 2020b. Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers. In IEEE ICASSP.
Sun, L.; Li, K.; Wang, H.; Kang, S.; and Meng, H. 2016. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In IEEE ICME.
Tomashenko, N.; Srivastava, B. M. L.; Wang, X.; Vincent, E.; Nautsch, A.; Yamagishi, J.; Evans, N.; Patino, J.; Bonastre, J.-F.; Noé, P.-G.; and Todisco, M. 2020. Introducing the VoicePrivacy Initiative. Proc. Interspeech.
Ueda, R.; Aihara, R.; Takiguchi, T.; and Ariki, Y. 2015. Individuality-Preserving Spectrum Modification for Articulation Disorders Using Phone Selective Synthesis. In Proc. Interspeech.
Wang, X.; Takaki, S.; and Yamagishi, J. 2020. Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis. IEEE TASLP.
Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R. J.; Jia, Y.; Chen, Z.; and Wu, Y. 2015. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech.