Understanding the Tradeoffs in Client-Side Privacy for Speech Recognition
Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Carnegie Mellon University, PA, USA
ABSTRACT
Existing approaches to ensuring privacy of user speech data primarily focus on the server side. While improving server-side privacy reduces certain security concerns, users still do not retain control over whether privacy is ensured on the client side. In this paper, we define, evaluate, and explore techniques for client-side privacy in speech recognition, where the goal is to preserve privacy on raw speech data before it leaves the client's device. We first formalize several tradeoffs in ensuring client-side privacy between performance, compute requirements, and privacy. Using our tradeoff analysis, we perform a large-scale empirical study of existing approaches and find that they fall short on at least one metric. Our results call for more research in this crucial area as a step towards safer real-world deployment of speech recognition systems at scale across mobile devices.
Index Terms — privacy, voice conversion, speech recognition, voice biometrics
1. INTRODUCTION
Modern users are increasingly relying on cloud-based machine learning models to process their personal data [1, 2, 3, 4]. One predominant example of this is speech recognition, where audio recorded with client-side mobile devices is uploaded to centralized servers for server-side processing [5]. While these technologies have enabled real-time speech recognition across ubiquitous mobile devices [6], sending raw speech data to the cloud leaves users vulnerable to giving away personal biometric information such as gender, race, and other social constructs [7].

Recent work has begun to realize the importance of privacy-preserving technologies for speech recognition and processing, with several proposed approaches to preserve privacy in the context of speaker identification [8, 9]. However, current solutions primarily focus on server-side techniques, either relying on complex server-side decryption algorithms or performing encryption through server communication protocols and other methods [10, 11]. While improving server-side privacy reduces certain security concerns, users still do not retain control over whether privacy is ensured on the client side. In this paper, we define, evaluate, and explore techniques for client-side privacy for speech recognition. We define client-side privacy as privacy that is ensured before information even leaves the user's device. Furthermore, client-side privacy requires that the encrypted data is compatible with downstream server tasks without any modification on the server side. Thus, in the case of speech tasks, we require that the encrypted data remains understandable for downstream tasks. In other words, ensuring client-side privacy is much more challenging than server-side methods, since the former requires directly manipulating the user's audio [12, 13]. Despite its challenges, ensuring client-side privacy is highly desirable since it gives the user full control over their privacy.

As a step towards achieving client-side privacy for speech recognition, we first formalize several tradeoffs in ensuring client-side privacy between performance, compute requirements, and privacy. Compute is particularly important given the space and memory constraints of client-side devices. Using our tradeoff analysis, we find that existing approaches tend to be deficient with respect to memory requirements, downstream performance, or privacy guarantees. Specifically, signal processing techniques perform well on the first two requirements but are lacking in the third, while neural models perform well in the third category but are lacking in either the first or second. Using these insights, we propose several extensions for future work and call for more research in ensuring client-side privacy as a step towards safer deployment of speech recognition systems at scale across real-world mobile devices.
2. RELATED WORKS

Existing privacy approaches have primarily been on the server side. For example, Ma et al. propose an approach that relies on modifying server communication, and Ahmed et al. propose a framework that relies on the server to rearrange audio segments [11, 10]. Several approaches also attempt client-side privacy, but they either target an overly simple downstream speech task or do not analyze privacy with respect to biometrics. For example, Ericsson et al. explore masking gender for a downstream 10-label classification task, and Aloufi et al. address hiding speaker identity while performing downstream ASR [14, 9]. We note that if the downstream task is too simple, then differential privacy or on-device approaches may be preferable. Additionally, hiding speaker identity is much easier than hiding biometrics like gender, as elaborated in the next section. Thus, we analyze performance on a non-trivial downstream speech task, namely ASR, and also analyze the ability of an adversary to classify gender.

Existing voice conversion approaches that anonymize speaker identity are a promising direction for our client-side approach [12, 13]. Current approaches are predominantly neural, utilizing adversarial, disentanglement, or other encoder-decoder-related architectures. Anonymizing speaker identity is much easier than anonymizing biometric information like gender, since the former requires only a slight shift in the user data distribution whereas the latter requires a significant one. Thus, in this work, we extend voice conversion approaches to the setting of anonymizing biometric information.

Speech recognition: ASR models are getting larger and more powerful, making the case for putting them on the server side stronger [15]. Since the best ASR models are currently on the server side and this will likely also be the case in the future, measures must be taken to ensure user data sent to the server remains secure.
3. PROPOSED APPROACH
We first propose a new framework for ensuring privacy of user speech data entirely from the client side. We then explore three different potential implementations and analyze their pros and cons.
Given a user with audio data x, gender g, and transcription y, we want to generate x′ such that P(G = g | x′) is as close to 0.5 as possible and P(y′ = y | x′) is as high as possible, where G is the gender random variable and y′ is the transcription of x′. We refer to the approach transforming x to x′ as the encryption approach, and we evaluate its performance using a neural gender classifier and an ASR model. In order to avoid confounding factors, both evaluation models are trained on data completely separate from any used to train the encryption approaches, as detailed in Section 4.1.

We use the ASR model itself as a baseline. Namely, in this approach, the user would perform ASR on-device. We assume that the gender classification accuracy is 0.5 here since the user does not need to send any data to the server.
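To make the objective concrete, the following is a minimal sketch of this evaluation protocol. The callables encrypt, gender_clf, and asr and the dataset iterable are hypothetical stand-ins, and jiwer is one off-the-shelf word error rate implementation; the paper does not specify its scoring code.

```python
import jiwer  # third-party WER implementation (pip install jiwer)

def evaluate_encryption(encrypt, gender_clf, asr, dataset):
    """Score an encryption approach by (i) how close the adversary's
    gender accuracy is to chance (0.5) and (ii) how low the WER stays."""
    correct, refs, hyps = 0, [], []
    for audio, gender, transcript in dataset:
        audio_enc = encrypt(audio)                  # x -> x'
        correct += int(gender_clf(audio_enc) == gender)
        refs.append(transcript)
        hyps.append(asr(audio_enc))                 # y' for x'
    return correct / len(dataset), jiwer.wer(refs, hyps)
```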
Signal processing: One way to encrypt the audio data is using a purely signal processing approach. Since we focus on privatizing gender in this work, we use an approach catered towards hiding gender information. Specifically, we shift the average pitch of each utterance to a predefined value while preserving formants. Details are described in Section 4.4.1.
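A minimal sketch of this approach is below, assuming the Rubber Band command-line tool is installed and that librosa's pyin tracker stands in for the paper's REAPER-based F0 estimate; the constant F0_REF is a hypothetical placeholder for the embedding-derived reference described in Section 4.4.1.

```python
import math
import subprocess
import librosa
import numpy as np

F0_REF = 170.0  # hypothetical reference F0 in Hz; the paper derives its
                # reference from speaker embeddings (Section 4.4.1)

def utterance_f0(path):
    """Mean voiced F0 of an utterance (librosa's pyin standing in for REAPER)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
    return float(np.nanmean(f0[voiced]))

def pitch_shift_encrypt(in_wav, out_wav):
    """Shift the average pitch to F0_REF while preserving formants,
    via the Rubber Band CLI (assumed to be on PATH)."""
    semitones = 12.0 * math.log2(F0_REF / utterance_f0(in_wav))
    subprocess.run(["rubberband", "--pitch", str(semitones),
                    "--formant", in_wav, out_wav], check=True)
```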
Adversarial approach: We also investigate an adversarial approach to encrypting audio data [16]. Specifically, we use a model trained in two stages: first, an autoencoder is trained to learn speaker-independent representations, and second, a GAN is trained to generate realistic audio. The first stage alternates between the following two steps: 1. the encoder transforms the input into a latent representation that the decoder uses along with a speaker label to reconstruct the input, and 2. a classifier attempts to identify the speaker given the latent representation. The loss function for this adversarial training procedure is

\sum_{x_i, y_i} \| d(e(x_i), y_i) - x_i \| - \lambda \sum_{x_j, y_j} -\log P_c(y_j \mid e(x_j)),

where d is the decoder, e is the encoder, P_c is the probability outputted by the classifier, (x_i, y_i) is an audio-speaker pair in the first step, and (x_j, y_j) is an audio-speaker pair in the second step. The second stage trains a generator, conditioned on the encoder output and a speaker label, alongside a discriminator with two output layers, one that predicts whether or not the generator (c) created its input and another that classifies the speaker (c′). Thus, the loss function for the generator is

\sum_{x_k, y_k} \log c(x_k) + \log\big(1 - c(g(x_k, y_k))\big) - \sum_{x_l, y_l} \log P_{c'}(y_l \mid g(x_l, y_l)),

where g is the generator, and the loss function for the discriminator is

\sum_{x_k, y_k} -\log c(x_k) - \log\big(1 - c(g(x_k, y_k))\big) - \sum_{x_l, y_l} \log P_{c'}(y_l \mid x_l).

This model is suitable for our work because it explicitly separates speaker identity from content twice. While this architecture was previously only studied for speaker identity removal, we study its efficacy for gender masking and downstream speech recognition performance. Further details are described in Section 4.4.2.
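A minimal PyTorch sketch of one alternation of this first stage is shown below; enc, dec, and clf are placeholder modules rather than the exact architecture of [16], and the L1 reconstruction norm is an assumption.

```python
import torch
import torch.nn.functional as F

def stage1_step(enc, dec, clf, opt_ae, opt_clf, x, spk, lam):
    """One alternation of the stage-1 objective; spk holds speaker labels."""
    # Step 1: autoencoder minimizes reconstruction while maximizing the
    # speaker classifier's loss (the -lambda term in the equation above).
    z = enc(x)
    ae_loss = F.l1_loss(dec(z, spk), x) - lam * F.cross_entropy(clf(z), spk)
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()
    # Step 2: classifier tries to recover the speaker from a detached
    # latent code, which pushes the encoder to hide speaker information.
    clf_loss = F.cross_entropy(clf(enc(x).detach()), spk)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()
    return ae_loss.item(), clf_loss.item()
```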
Disentanglement approach: We also investigate a disentanglement approach using variational autoencoders (VAEs) [17]. Namely, we train a speaker encoder e_s, content encoder e_c, and decoder d using the following loss function:

\sum_{i=1}^{m} \lambda_r \| d(e_s(x_i), e_c(x_i)) - x_i \| + \sum_{j=1}^{m} \lambda_k \| e_c(x_j) \|,

where m is the batch size, x_i and x_j are input acoustic features, and λ_r and λ_k are hyperparameters. Instance normalization is added to the content encoder in order to remove speaker information, and an adaptive instance normalization layer is added to the decoder in order to add the desired speaker information [18, 19]. This model is suitable for our work because it learns to reconstruct content while transforming speaker information. While this architecture was previously only studied for speaker identity removal, we study its efficacy for gender masking and downstream speech recognition performance (see Section 4.4.3 for details).

4. EXPERIMENTS

4.1. Dataset

We train all of our encryption models on the VCTK corpus, and the ASR model and the gender classifier on LibriSpeech [20, 21]. The ASR model and the gender classifier are both evaluated on the test-clean subset and trained using independent speakers, as detailed in Sections 4.2 and 4.3 below.
4.2. ASR Model

We use a pretrained Wav2Vec 2.0 Large model to evaluate downstream ASR performance [15]. Specifically, this model was pretrained on 960 hours of unlabeled LibriSpeech training data and finetuned on a labeled version of the same data. During evaluation, we perform Viterbi decoding and calculate the word error rate on the LibriSpeech test-clean subset.
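For illustration, a sketch of this evaluation using the HuggingFace port of the same checkpoint is below; note it uses greedy CTC decoding rather than the Viterbi decoding used in the paper, so scores would differ slightly.

```python
import torch
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# HuggingFace port of the 960h-finetuned Wav2Vec 2.0 Large model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h").eval()

def transcribe(waveform):
    """Transcribe a 16 kHz mono waveform (1-D float array)."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    return processor.batch_decode(torch.argmax(logits, dim=-1))[0]

# e.g. jiwer.wer(reference_transcript, transcribe(encrypted_waveform))
```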
4.3. Gender Classifier

We use the VGGVox model, a modified version of the VGG-16 CNN, as our adversary gender classifier [22, 23], slightly modifying the network by adding a ReLU activation followed by a fully-connected layer with size-2 output. We approximate the data available to an adversary by training in two stages: 1. we pre-train the classifier on 100 hours of labeled, unmodified speech from the LibriSpeech train data, and 2. we fine-tune the classifier on the encrypted speech of a handful of speakers from the same 100-hour subset. We measure the gender masking ability of each encryption approach by calculating the VGGVox model's gender classification accuracy on an encrypted version of the LibriSpeech test-clean subset after being finetuned on data encrypted using the respective approach.
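A sketch of the modified classifier head is below; the backbone module and its 1024-dimensional output are placeholder assumptions standing in for the VGGVox architecture of [22].

```python
import torch.nn as nn

class GenderAdversary(nn.Module):
    """VGGVox-style backbone (placeholder argument) followed by the added
    ReLU activation and fully-connected layer with size-2 output."""
    def __init__(self, backbone, feat_dim=1024):
        super().__init__()
        self.backbone = backbone  # assumed to emit feat_dim features
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(feat_dim, 2))

    def forward(self, spectrogram):
        return self.head(self.backbone(spectrogram))
```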
4.4. Encryption Approaches

4.4.1. Signal Processing

In our signal processing approach, for each utterance, we calculate its F0 value and then perform a pitch shift from that value to a reference F0 value. We calculate the F0 sequence of an utterance using REAPER (https://github.com/google/REAPER), and define the utterance F0 value as the average of the non-negative F0 values. Table 1 summarizes the F0 value statistics of the raw data used in the encryption approaches.

Table 1. Mean and standard deviation of the utterance F0 values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean.

  Stage         Subset    µ ± σ
  train-clean   Male      … ± …
  train-clean   Female    … ± …
  train-clean   Both      … ± …
  test-clean    Male      … ± …
  test-clean    Female    … ± …
  test-clean    Both      … ± …

To calculate the reference F0 value, we embed each utterance in the LibriSpeech train-clean-100 subset using the second-to-last layer output of the VGGVox gender classifier trained only on raw train-clean-100 data. Then, we average the embeddings for each speaker to create speaker embeddings, which we use to identify the speaker whose embedding has the smallest mean Euclidean distance to the other speaker embeddings. Among that speaker's utterances, we choose the one with the smallest embedding distance to the other speaker embeddings, and set our reference equal to the corresponding F0 value.

We use the Rubber Band Library (github.com/breakfastquay/rubberband) to perform pitch shifting with formant preservation using a phase vocoder. For an utterance u with F0 value f_u, we shift its pitch by
12 log_2(f_r / f_u) semitones, where f_r is the reference F0 value. Tables 2 and 3 summarize the semitone value statistics of the data used in the encryption approaches.

Table 2. Mean and standard deviation of the calculated semitone values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean. Since males have lower F0 values on average than females, as shown in Table 1, semitone values are greater than zero for the former and less than zero for the latter.

  Stage         Subset     µ ± σ
  train-clean   Male        … ± …
  train-clean   Female     −… ± …
  train-clean   Both        … ± …
  test-clean    Male        … ± …
  test-clean    Female     −… ± …
  test-clean    Both       −… ± …

Table 3. Mean and standard deviation of semitone values in a 20-speaker subset of LibriSpeech train-clean-100 as well as LibriSpeech test-clean, where all values are capped at a magnitude of 2 to avoid distortion effects from large pitch shifts.

  Stage         Subset     µ ± σ
  train-clean   Male        … ± …
  train-clean   Female     −… ± …
  train-clean   Both        … ± …
  test-clean    Male        … ± …
  test-clean    Female     −… ± …
  test-clean    Both       −… ± …

4.4.2. Adversarial

We train a convolutional autoencoder using the hyperparameters described in Chou et al. [16]. The model has log-magnitude spectrograms as its input and output acoustic features, and we use the Griffin-Lim algorithm to convert the model output into waveforms [24].

4.4.3. Disentanglement

We train a convolutional VAE using the hyperparameters described in Chou et al. [17]. The model has log-magnitude spectrograms as its input and output acoustic features, and we use the Griffin-Lim algorithm to convert the model output into waveforms [24]. During evaluation, we use the approach described in Section 4.4.1 to select the reference target utterance.
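To make the normalization mechanism of Section 4.4.3 concrete, below is a minimal sketch of an adaptive instance normalization layer of the kind added to the decoder; layer sizes are illustrative rather than the exact configuration of [17].

```python
import torch.nn as nn

class AdaIN1d(nn.Module):
    """Adaptive instance normalization over 1-D feature maps: instance
    normalization strips per-channel statistics (speaker information),
    and an affine map of the speaker embedding re-injects the desired
    statistics [18, 19]."""
    def __init__(self, channels, spk_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(spk_dim, 2 * channels)

    def forward(self, content, spk_emb):
        # content: (batch, channels, time); spk_emb: (batch, spk_dim)
        gamma, beta = self.affine(spk_emb).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * self.norm(content) + beta.unsqueeze(-1)
```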
4.4.4. Hybrid

We explore a hybrid approach wherein we first apply a signal processing pitch shift and then pass the resulting waveform into the disentanglement model. This approach is motivated by the observation that neural approaches like those described above suffer greater output quality degradation the more the source speaker differs from the target speaker. In order to avoid distortion effects from large pitch shifts, we cap the magnitude of the semitone value at 2.
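A sketch of this pipeline is below, reusing F0_REF and utterance_f0 from the signal-processing sketch in Section 3; vae_convert is a placeholder for the trained disentanglement model's conversion step.

```python
import math
import subprocess

def hybrid_encrypt(in_wav, out_wav, vae_convert, cap=2.0):
    """Pitch-shift towards F0_REF with the shift capped at +/-2 semitones,
    then pass the shifted waveform through the disentanglement model."""
    semitones = 12.0 * math.log2(F0_REF / utterance_f0(in_wav))
    semitones = max(-cap, min(cap, semitones))     # |shift| <= 2
    shifted = out_wav + ".shifted.wav"
    subprocess.run(["rubberband", "--pitch", str(semitones),
                    "--formant", in_wav, shifted], check=True)
    vae_convert(shifted, out_wav)                  # placeholder model call
```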
5. RESULTS AND DISCUSSION

5.1. Gender Classification Accuracy
Table 4 lists the adversary gender classification accuracy for each encryption approach. Excluding the ASR baselines, we observe that the adversary has the lowest gender classification accuracy when the hybrid encryption approach is used.
Table 4. Classification accuracy on LibriSpeech test-clean when finetuned on data from n speakers in LibriSpeech train-clean-100. Since the non-neural approaches here are much more memory efficient than neural ones, we write the number of parameters for the former as zero.

  Approach               Params   n = 2 ↓   n = 4 ↓   n = 20 ↓
  0. Wav2Vec 2.0 Base    …        0.500     0.500     0.500
  1. Wav2Vec 2.0 Large   …        0.500     0.500     0.500
  2. No Encryption       0        0.969     0.969     …
  3. Pitch Shift         0        0.888     0.926     …
  4. |Pitch Shift| ≤ 2   0        0.951     0.951     …
  5. GAN                 …        0.780     0.816     …
  6. VAE                 …        0.575     …         …
  7. Hybrid              …        0.539     …         …

Table 5 lists the downstream ASR performance for each approach. We observe that the disentanglement approach (VAE) performs much better than the adversarial one (GAN).
Table 5. ASR performance on LibriSpeech test-clean using encrypted audio. Since the non-neural approaches here are much more memory efficient than neural ones, we write the number of parameters for the former as zero.

  Approach               Params   WER ↓
  Wav2Vec 2.0 Base       …        …
  Wav2Vec 2.0 Large      …        …
  No Encryption          0        …
  Pitch Shift            0        …
  |Pitch Shift| ≤ 2      0        …
  GAN                    …        …
  VAE                    …        …
  Hybrid                 …        …

We visualize the compute-privacy and utility-privacy tradeoffs on a three-dimensional plot, where the axes are gender classification accuracy, number of parameters (memory), and word error rate (Figure 1). We observe that the VAE, Hybrid, and Wav2Vec 2.0 Base models have the best tradeoff between performance, which combines both downstream WER and gender masking ability, and memory.

Fig. 1. Trade-off Analysis: Points closer to the bottom-left corner are the most optimal. Darker points have lower memory values. The indices correspond to the approaches listed in Table 4.
6. CONCLUSION AND FUTURE DIRECTIONS
In this work, we set up the problem of ensuring client-side privacy for speech representation learning that does not rely on any server-side privacy guarantees. We formalized several desirable properties regarding performance, privacy, and computation and performed a large-scale empirical study of existing approaches. We find that while disentanglement-based approaches currently have the best tradeoff between gender masking, downstream performance, and memory usage, all existing approaches still fall short of ideal performance. Our initial empirical analysis opens the door towards more reliable evaluations of the tradeoffs underlying privacy-preserving approaches on the client side, a property crucial for safe real-world deployment of speech systems on mobile devices.

7. REFERENCES

[1] Yiqiang Chen, Xin Qin, Jindong Wang, Chaohui Yu, and Wen Gao, "FedHealth: A federated transfer learning framework for wearable healthcare," IEEE Intelligent Systems, 2020.
[2] Robin C. Geyer, Tassilo Klein, and Moin Nabi, "Differentially private federated learning: A client level perspective," arXiv preprint arXiv:1712.07557, 2017.
[3] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency, "Think locally, act globally: Federated learning with local and global representations," 2020.
[4] Jie Xu and Fei Wang, "Federated learning for healthcare informatics," arXiv preprint arXiv:1911.06270, 2019.
[5] David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau, "Federated learning for keyword spotting," 2019.
[6] Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Francoise Beaufays, and Carolina Parada, "Personalized speech recognition on mobile devices," 2016.
[7] Rita Singh, Profiling Humans from Their Voice, Springer, 2019.
[8] Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, et al., "Introducing the VoicePrivacy initiative," arXiv preprint arXiv:2005.01387, 2020.
[9] Ranya Aloufi, Hamed Haddadi, and David Boyle, "Privacy-preserving voice analysis via disentangled representations," arXiv preprint arXiv:2007.15064, 2020.
[10] Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, and Parmesh Ramanathan, "Preech: A system for privacy-preserving speech transcription," in 29th USENIX Security Symposium, Aug. 2020, pp. 2703–2720, USENIX Association.
[11] Z. Ma, Y. Liu, X. Liu, J. Ma, and F. Li, "Privacy-preserving outsourced speech recognition for smart IoT devices," IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8406–8420, 2019.
[12] Brij Mohan Lal Srivastava, Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, and Marc Tommasi, "Design choices for x-vector based speaker anonymization," arXiv preprint arXiv:2005.08601, 2020.
[13] Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, and Tomoki Toda, "The sequence-to-sequence baseline for the voice conversion challenge 2020: Cascading ASR and TTS," arXiv preprint arXiv:2010.02434, 2020.
[14] David Ericsson, Adam Östberg, Edvin Listo Zec, John Martinsson, and Olof Mogren, "Adversarial representation learning for private speech generation," arXiv preprint arXiv:2006.09114, 2020.
[15] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," arXiv preprint arXiv:2006.11477, 2020.
[16] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-Shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Interspeech, 2018, pp. 501–505.
[17] Ju-chieh Chou and Hung-yi Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," in Interspeech, 2019, pp. 664–668.
[18] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[19] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in ICCV, 2017, pp. 1510–1519.
[20] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[21] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.
[22] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," 2017.
[23] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[24] D. Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.