Understanding the Tradeoffs in Client-Side Privacy for Speech Recognition
Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Carnegie Mellon University, PA, USA
ABSTRACT
Existing approaches to ensuring privacy of user speech data primarily focus on the server side. While improving server-side privacy reduces certain security concerns, users still do not retain control over whether privacy is ensured on the client side. In this paper, we define, evaluate, and explore techniques for client-side privacy in speech recognition, where the goal is to preserve privacy on raw speech data before it leaves the client's device. We first formalize several tradeoffs in ensuring client-side privacy between performance, compute requirements, and privacy. Using our tradeoff analysis, we perform a large-scale empirical study of existing approaches and find that they fall short on at least one metric. Our results call for more research in this crucial area as a step towards safer real-world deployment of speech recognition systems at scale across mobile devices.
Index Terms — privacy, voice conversion, speech recognition, voice biometrics
1. INTRODUCTION
Modern users are increasingly relying on cloud-based machine learning models to process their personal data [1, 2, 3, 4]. One predominant example of this is speech recognition, where audio recorded with client-side mobile devices is uploaded to centralized servers for server-side processing [5]. While these technologies have enabled real-time speech recognition across ubiquitous mobile devices [6], sending raw speech data to the cloud leaves users vulnerable to giving away personal biometric information such as gender, race, and other social constructs [7].

Recent work has begun to realize the importance of privacy-preserving technologies for speech recognition and processing, with several proposed approaches to preserve privacy in the context of speaker identification [8, 9]. However, current solutions primarily focus on server-side techniques, either relying on complex server-side decryption algorithms or performing encryption through server communication protocols and other methods [10, 11]. While improving server-side privacy reduces certain security concerns, users still do not retain control over whether privacy is ensured on the client side. In this paper, we define, evaluate, and explore techniques for client-side privacy for speech recognition. We define client-side privacy as privacy that is ensured before information even leaves the user's device. Furthermore, client-side privacy requires that the encrypted data is compatible with downstream server tasks without any modification on the server side. Thus, in the case of speech tasks, we require that the encrypted data remains understandable for downstream tasks. In other words, ensuring client-side privacy is much more challenging than server-side methods, since the former requires directly manipulating the user's audio [12, 13]. Despite its challenges, ensuring client-side privacy is highly desirable since it gives the user full control over their privacy.

As a step towards achieving client-side privacy for speech recognition, we first formalize several tradeoffs in ensuring client-side privacy between performance, compute requirements, and privacy. Compute is particularly important given the space and memory constraints of client-side devices. Using our tradeoff analysis, we find that existing approaches tend to be deficient with respect to memory requirements, downstream performance, or privacy guarantees. Specifically, signal processing techniques perform well on the first two requirements but are lacking in the third, while neural models perform well in the third category but are lacking in either the first or second. Using these insights, we propose several extensions for future work and call for more research in ensuring client-side privacy as a step towards safer deployment of speech recognition systems at scale across real-world mobile devices.
2. RELATED WORKS

Existing privacy approaches have primarily been on the server side. For example, Ma et al. propose an approach that relies on modifying server communication, and Ahmed et al. propose a framework that relies on the server to rearrange audio segments [11, 10]. Several approaches also attempt client-side privacy, but they either target an overly simple downstream speech task or do not analyze privacy with respect to biometrics. For example, Ericsson et al. explore masking gender for a downstream 10-label classification task, and Aloufi et al. address hiding speaker identity while performing downstream ASR [14, 9]. We note that if the downstream task is too simple, then differential privacy or on-device approaches may be preferable. Additionally, hiding speaker identity is much easier than hiding biometrics like gender, as elaborated in the next section. Thus, we analyze performance on a non-trivial downstream speech task, namely ASR, and also analyze the ability of an adversary to classify gender.

Existing voice conversion approaches that anonymize speaker identity are a promising direction for our client-side approach [12, 13]. Current approaches are predominantly neural, utilizing adversarial, disentanglement, or other encoder-decoder-related architectures. Anonymizing speaker identity is much easier than anonymizing biometric information like gender, since the former requires only a slight shift in the user data distribution whereas the latter requires a significant one. Thus, in this work, we extend voice conversion approaches to the setting of anonymizing biometric information.

Speech recognition: ASR models are getting larger and more powerful, making the case for putting them on the server side stronger [15]. Since the best ASR models are currently on the server side and this will likely also be the case in the future, measures must be taken to ensure user data sent to the server remains secure.
3. PROPOSED APPROACH
We first propose a new framework for ensuring privacy of user speech data entirely from the client side. We then explore three different potential implementations and analyze their pros and cons.
Given a user with audio data x, gender g, and transcription y, we want to generate x′ such that P(G = g | x′) is as close to 0.5 as possible and P(y′ = y | x′) is as high as possible, where G is the gender random variable and y′ is the transcription of x′. We refer to the approach transforming x to x′ as the encryption approach, and we evaluate its performance using a neural gender classifier and an ASR model. In order to avoid confounding factors, both evaluation models are trained on data completely separate from any used to train the encryption approaches, as detailed in Section 4.1.

We use the ASR model itself as a baseline. Namely, in this approach, the user would perform ASR on-device. We assume that the gender classification accuracy is 0.5 here since the user does not need to send any data to the server.
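To make the objective concrete, the following is a minimal sketch of this evaluation protocol. The callables encrypt, gender_clf, and asr and the dataset iterable are hypothetical stand-ins, and jiwer is one off-the-shelf word error rate implementation; the paper does not specify its scoring code.

```python
import jiwer  # third-party WER implementation (pip install jiwer)

def evaluate_encryption(encrypt, gender_clf, asr, dataset):
    """Score an encryption approach by (i) how close the adversary's
    gender accuracy is to chance (0.5) and (ii) how low the WER stays."""
    correct, refs, hyps = 0, [], []
    for audio, gender, transcript in dataset:
        audio_enc = encrypt(audio)                  # x -> x'
        correct += int(gender_clf(audio_enc) == gender)
        refs.append(transcript)
        hyps.append(asr(audio_enc))                 # y' for x'
    return correct / len(dataset), jiwer.wer(refs, hyps)
```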
Signal processing: One way to encrypt the audio data is using a purely signal processing approach. Since we focus on privatizing gender in this work, we use an approach catered towards hiding gender information. Specifically, we shift the average pitch of each utterance to a predefined value while preserving formants. Details are described in Section 4.4.1.
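A minimal sketch of this approach is below, assuming the Rubber Band command-line tool is installed and that librosa's pyin tracker stands in for the paper's REAPER-based F0 estimate; the constant F0_REF is a hypothetical placeholder for the embedding-derived reference described in Section 4.4.1.

```python
import math
import subprocess
import librosa
import numpy as np

F0_REF = 170.0  # hypothetical reference F0 in Hz; the paper derives its
                # reference from speaker embeddings (Section 4.4.1)

def utterance_f0(path):
    """Mean voiced F0 of an utterance (librosa's pyin standing in for REAPER)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    return float(np.nanmean(f0[voiced]))

def pitch_shift_encrypt(in_wav, out_wav):
    """Shift the average pitch to F0_REF while preserving formants,
    via the Rubber Band CLI (assumed to be on PATH)."""
    semitones = 12.0 * math.log2(F0_REF / utterance_f0(in_wav))
    subprocess.run(["rubberband", "--pitch", str(semitones),
                    "--formant", in_wav, out_wav], check=True)
```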
Adversarial approach: We also investigate an adversarial approach to encrypting audio data [16]. Specifically, we use a model trained in two stages: first, an autoencoder is trained to learn speaker-independent representations, and second, a GAN is trained to generate realistic audio. The first stage alternates between the following two steps: 1. the encoder transforms the input into a latent representation that the decoder uses along with a speaker label to reconstruct the input, and 2. a classifier attempts to identify the speaker given the latent representation. The loss function for this adversarial training procedure is

\sum_{x_i, y_i} \| d(e(x_i), y_i) - x_i \| - \lambda \sum_{x_j, y_j} -\log P_c(y_j \mid e(x_j)),

where d is the decoder, e is the encoder, P_c is the probability outputted by the classifier, (x_i, y_i) is an audio-speaker pair in the first step, and (x_j, y_j) is an audio-speaker pair in the second step. The second stage trains a generator, conditioned on the encoder output and a speaker label, alongside a discriminator with two output layers, one that predicts whether or not the generator (c) created its input and another that classifies the speaker (c′). Thus, the loss function for the generator is

\sum_{x_k, y_k} \log c(x_k) + \log\big(1 - c(g(x_k, y_k))\big) - \sum_{x_l, y_l} \log P_{c'}(y_l \mid g(x_l, y_l)),

where g is the generator, and the loss function for the discriminator is

\sum_{x_k, y_k} -\log c(x_k) - \log\big(1 - c(g(x_k, y_k))\big) - \sum_{x_l, y_l} \log P_{c'}(y_l \mid x_l).

This model is suitable for our work because it explicitly separates speaker identity from content twice. While this architecture was previously only studied for speaker identity removal, we study its efficacy for gender masking and downstream speech recognition performance. Further details are described in Section 4.4.2.
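A minimal PyTorch sketch of one alternation of this first stage is shown below; enc, dec, and clf are placeholder modules rather than the exact architecture of [16], and the L1 reconstruction norm is an assumption.

```python
import torch
import torch.nn.functional as F

def stage1_step(enc, dec, clf, opt_ae, opt_clf, x, spk, lam):
    """One alternation of the stage-1 objective; spk holds speaker labels."""
    # Step 1: autoencoder minimizes reconstruction while maximizing the
    # speaker classifier's loss (the -lambda term in the equation above).
    z = enc(x)
    ae_loss = F.l1_loss(dec(z, spk), x) - lam * F.cross_entropy(clf(z), spk)
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()
    # Step 2: classifier tries to recover the speaker from a detached
    # latent code, which pushes the encoder to hide speaker information.
    clf_loss = F.cross_entropy(clf(enc(x).detach()), spk)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()
    return ae_loss.item(), clf_loss.item()
```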
Disentanglement approach: We also investigate a disentanglement approach using variational autoencoders (VAEs) [17]. Namely, we train a speaker encoder e_s, content encoder e_c, and decoder d using the following loss function:

\sum_{i=1}^{m} \lambda_r \| d(e_s(x_i), e_c(x_i)) - x_i \| + \sum_{j=1}^{m} \lambda_k \| e_c(x_j) \|,

where m is the batch size, x_i and x_j are input acoustic features, and λ_r and λ_k are hyperparameters. Instance normalization is added to the content encoder in order to remove speaker information, and an adaptive instance normalization layer is added to the decoder in order to add the desired speaker information [18, 19]. This model is suitable for our work because it learns to reconstruct content while transforming speaker information. While this architecture was previously only studied for speaker identity removal, we study its efficacy for gender masking and downstream speech recognition performance (see Section 4.4.3 for details).

4. EXPERIMENTS

4.1. Dataset

We train all of our encryption models on the VCTK corpus, and the ASR model and the gender classifier on LibriSpeech [20, 21]. The ASR model and the gender classifier are both evaluated on the test-clean subset and trained using independent speakers, as detailed in Sections 4.2 and 4.3 below.
4.2. ASR Model

We use a pretrained Wav2Vec 2.0 Large model to evaluate downstream ASR performance [15]. Specifically, this model was pretrained on 960 hours of unlabeled LibriSpeech training data and finetuned on a labeled version of the same data. During evaluation, we perform Viterbi decoding and calculate the word error rate on the LibriSpeech test-clean subset.
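For illustration, a sketch of this evaluation using the HuggingFace port of the same checkpoint is below; note it uses greedy CTC decoding rather than the Viterbi decoding used in the paper, so scores would differ slightly.

```python
import torch
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# HuggingFace port of the 960h-finetuned Wav2Vec 2.0 Large model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").eval()

def transcribe(waveform):
    """Transcribe a 16 kHz mono waveform (1-D float array)."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# e.g. jiwer.wer(reference_transcript, transcribe(encrypted_waveform))
```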
4.3. Gender Classifier

We use the VGGVox model, a modified version of the VGG-16 CNN, as our adversary gender classifier [22, 23], slightly modifying the network by adding a ReLU activation followed by a fully-connected layer with size-2 output. We approximate the data available to an adversary by training in two stages: 1. we pre-train the classifier on 100 hours of labeled, unmodified speech from the LibriSpeech train data, and 2. we fine-tune the classifier on the encrypted speech of a handful of speakers from the same 100-hour subset. We measure the gender masking ability of each encryption approach by calculating the VGGVox model's gender classification accuracy on an encrypted version of the LibriSpeech test-clean subset after being finetuned on data encrypted using the respective approach.
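A sketch of the modified classifier head is below; the backbone module and its 1024-dimensional output are placeholder assumptions standing in for the VGGVox architecture of [22].

```python
import torch.nn as nn

class GenderAdversary(nn.Module):
    """VGGVox-style backbone (placeholder argument) followed by the added
    ReLU activation and fully-connected layer with size-2 output."""
    def __init__(self, backbone, feat_dim=1024):
        super().__init__()
        self.backbone = backbone  # assumed to emit feat_dim features
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(feat_dim, 2))

    def forward(self, spectrogram):
        return self.head(self.backbone(spectrogram))
```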
4.4. Encryption Approaches

4.4.1. Signal Processing

In our signal processing approach, for each utterance, we calculate its F0 value and then perform a pitch shift from that value to a reference F0 value. We calculate the F0 sequence of an utterance using REAPER (https://github.com/google/REAPER), and define the utterance F0 value as the average of the non-negative F0 values. Table 1 summarizes the F0 value statistics of the raw data used in the encryption approaches.

Table 1. Mean and standard deviation of the utterance F0 values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean.

  Stage         Subset    µ ± σ
  train-clean   Male      … ± …
  train-clean   Female    … ± …
  train-clean   Both      … ± …
  test-clean    Male      … ± …
  test-clean    Female    … ± …
  test-clean    Both      … ± …

To calculate the reference F0 value, we embed each utterance in the LibriSpeech train-clean-100 subset using the second-to-last layer output of the VGGVox gender classifier trained only on raw train-clean-100 data. Then, we average the embeddings for each speaker to create speaker embeddings, which we use to identify the speaker whose embedding has the smallest mean Euclidean distance to the other speaker embeddings. Among that speaker's utterances, we choose the one with the smallest embedding distance to the other speaker embeddings, and set our reference equal to the corresponding F0 value.

We use the Rubber Band Library (github.com/breakfastquay/rubberband) to perform pitch shifting with formant preservation using a phase vocoder. For an utterance u with F0 value f_u, we shift its pitch by
12 log_2(f_r / f_u) semitones, where f_r is the reference F0 value. Tables 2 and 3 summarize the semitone value statistics of the data used in the encryption approaches.

Table 2. Mean and standard deviation of the calculated semitone values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean. Since males have lower F0 values on average than females, as shown in Table 1, semitone values are greater than zero for the former and less than zero for the latter.

  Stage         Subset     µ ± σ
  train-clean   Male        … ± …
  train-clean   Female     −… ± …
  train-clean   Both        … ± …
  test-clean    Male        … ± …
  test-clean    Female     −… ± …
  test-clean    Both       −… ± …

Table 3. Mean and standard deviation of semitone values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean, where all values are capped at a magnitude of 2 to avoid distortion effects from large pitch shifts.

  Stage         Subset     µ ± σ
  train-clean   Male        … ± …
  train-clean   Female     −… ± …
  train-clean   Both        … ± …
  test-clean    Male        … ± …
  test-clean    Female     −… ± …
  test-clean    Both       −… ± …

4.4.2. Adversarial

We train a convolutional autoencoder using the hyperparameters described in Chou et al. [16]. The model has log-magnitude spectrograms as its input and output acoustic features, and we use the Griffin-Lim algorithm to convert the model output into waveforms [24].

4.4.3. Disentanglement

We train a convolutional VAE using the hyperparameters described in Chou et al. [17]. The model has log-magnitude spectrograms as its input and output acoustic features, and we use the Griffin-Lim algorithm to convert the model output into waveforms [24]. During evaluation, we use the approach described in Section 4.4.1 to select the reference target utterance.
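To make the normalization mechanism of Section 4.4.3 concrete, below is a minimal sketch of an adaptive instance normalization layer of the kind added to the decoder; layer sizes are illustrative rather than the exact configuration of [17].

```python
import torch.nn as nn

class AdaIN1d(nn.Module):
    """Adaptive instance normalization over 1-D feature maps: instance
    normalization strips per-channel statistics (speaker information),
    and an affine map of the speaker embedding re-injects the desired
    statistics [18, 19]."""
    def __init__(self, channels, spk_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(spk_dim, 2 * channels)

    def forward(self, content, spk_emb):
        # content: (batch, channels, time); spk_emb: (batch, spk_dim)
        gamma, beta = self.affine(spk_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * self.norm(content) + beta.unsqueeze(-1)
```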
4.4.4. Hybrid

We explore a hybrid approach wherein we first apply a signal processing pitch shift and then pass the resulting waveform into the disentanglement model. This approach is motivated by the observation that neural approaches like those described above suffer greater output quality degradation the more the source speaker differs from the target speaker. In order to avoid distortion effects from large pitch shifts, we cap the magnitude of the semitone value at 2.
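A sketch of this pipeline is below, reusing F0_REF and utterance_f0 from the signal-processing sketch in Section 3; vae_convert is a placeholder for the trained disentanglement model's conversion step.

```python
import math
import subprocess

def hybrid_encrypt(in_wav, out_wav, vae_convert, cap=2.0):
    """Pitch-shift towards F0_REF with the shift capped at +/-2 semitones,
    then pass the shifted waveform through the disentanglement model."""
    semitones = 12.0 * math.log2(F0_REF / utterance_f0(in_wav))
    semitones = max(-cap, min(cap, semitones))     # |shift| <= 2
    shifted = out_wav + ".shifted.wav"
    subprocess.run(["rubberband", "--pitch", str(semitones),
                    "--formant", in_wav, shifted], check=True)
    vae_convert(shifted, out_wav)                  # placeholder model call
```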
5. RESULTS AND DISCUSSION

5.1. Gender Classification Accuracy
Table 4 lists the adversary gender classification accuracy for each encryption approach. Excluding the ASR baselines, we observe that the adversary has the lowest gender classification accuracy when the hybrid encryption approach is used.
Table 4. Classification accuracy on LibriSpeech test-clean when finetuned on data from n speakers in LibriSpeech train-clean-100. Since the non-neural approaches here are much more memory efficient than neural ones, we write the number of parameters for the former as zero.

  Approach               Params   n = 2 ↓   n = 4 ↓   n = 20 ↓
  0. Wav2Vec 2.0 Base    …        0.500     0.500     0.500
  1. Wav2Vec 2.0 Large   …        0.500     0.500     0.500
  2. No Encryption       0        0.969     0.969     …
  3. Pitch Shift         0        0.888     0.926     …
  4. |Pitch Shift| ≤ 2   0        0.951     0.951     …
  5. GAN                 …        0.780     0.816     …
  6. VAE                 …        0.575     …         …
  7. Hybrid              …        0.539     …         …

Table 5 lists the downstream ASR performance for each approach. We observe that the disentanglement approach (VAE) performs much better than the adversarial one (GAN).
Table 5. ASR performance on LibriSpeech test-clean using encrypted audio. Since the non-neural approaches here are much more memory efficient than neural ones, we write the number of parameters for the former as zero.

  Approach               Params   WER ↓
  Wav2Vec 2.0 Base       …        …
  Wav2Vec 2.0 Large      …        …
  No Encryption          0        …
  Pitch Shift            0        …
  |Pitch Shift| ≤ 2      0        …
  GAN                    …        …
  VAE                    …        …
  Hybrid                 …        …

We visualize the compute-privacy and utility-privacy tradeoffs on a three-dimensional plot, where the axes are gender classification accuracy, number of parameters (memory), and word error rate (Figure 1). We observe that the VAE, Hybrid, and Wav2Vec 2.0 Base models have the best tradeoff between performance, which combines both downstream WER and gender masking ability, and memory.

Fig. 1. Trade-off Analysis: Points closer to the bottom-left corner are the most optimal. Darker points have lower memory values. The indices correspond to the approaches listed in Table 4.
6. CONCLUSION AND FUTURE DIRECTIONS
In this work, we set up the problem of ensuring client-side privacy for speech representation learning that does not rely on any server-side privacy guarantees. We formalized several desirable properties regarding performance, privacy, and computation and performed a large-scale empirical study of existing approaches. We find that while disentanglement-based approaches currently have the best tradeoff between gender masking, downstream performance, and memory usage, all existing approaches still fall short of ideal performance. Our initial empirical analysis opens the door towards more reliable evaluations of the tradeoffs underlying privacy-preserving approaches on the client side, a property crucial for safe real-world deployment of speech systems on mobile devices.

7. REFERENCES

[1] Yiqiang Chen, Xin Qin, Jindong Wang, Chaohui Yu, and Wen Gao, "FedHealth: A federated transfer learning framework for wearable healthcare," IEEE Intelligent Systems, 2020.
[2] Robin C. Geyer, Tassilo Klein, and Moin Nabi, "Differentially private federated learning: A client level perspective," arXiv preprint arXiv:1712.07557, 2017.
[3] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency, "Think locally, act globally: Federated learning with local and global representations," 2020.
[4] Jie Xu and Fei Wang, "Federated learning for healthcare informatics," arXiv preprint arXiv:1911.06270, 2019.
[5] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau, "Federated learning for keyword spotting," 2019.
[6] Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, and Carolina Parada, "Personalized speech recognition on mobile devices," 2016.
[7] Rita Singh, Profiling Humans from Their Voice, Springer, 2019.
[8] Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, et al., "Introducing the VoicePrivacy initiative," arXiv preprint arXiv:2005.01387, 2020.
[9] Ranya Aloufi, Hamed Haddadi, and David Boyle, "Privacy-preserving voice analysis via disentangled representations," arXiv preprint arXiv:2007.15064, 2020.
[10] Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, and Parmesh Ramanathan, "Preech: A system for privacy-preserving speech transcription," in 29th USENIX Security Symposium, Aug. 2020, pp. 2703–2720, USENIX Association.
[11] Z. Ma, Y. Liu, X. Liu, J. Ma, and F. Li, "Privacy-preserving outsourced speech recognition for smart IoT devices," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8406–8420, 2019.
[12] Brij Mohan Lal Srivastava, Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, and Marc Tommasi, "Design choices for x-vector based speaker anonymization," arXiv preprint arXiv:2005.08601, 2020.
[13] Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, and Tomoki Toda, "The sequence-to-sequence baseline for the voice conversion challenge 2020: Cascading ASR and TTS," arXiv preprint arXiv:2010.02434, 2020.
[14] David Ericsson, Adam Östberg, Edvin Listo Zec, John Martinsson, and Olof Mogren, "Adversarial representation learning for private speech generation," arXiv preprint arXiv:2006.09114, 2020.
[15] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," arXiv preprint arXiv:2006.11477, 2020.
[16] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-Shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Interspeech, 2018, pp. 501–505.
[17] Ju-chieh Chou and Hung-yi Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," in Interspeech, 2019, pp. 664–668.
[18] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[19] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in ICCV, 2017, pp. 1510–1519.
[20] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[21] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[22] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," 2017.
[23] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[24] D. Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.