Improving on-device speaker verification using federated learning with privacy
Filip Granqvist, Matt Seigel, Rogier van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik
Apple
{fgranqvist, mseigel, rogier_vandalen, aine_cahill, stephen_shum, mpaulik}@apple.com

Abstract
Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy. However, such information is often private. This paper investigates how privacy-preserving learning can improve a speaker verification system, by enabling the use of privacy-sensitive speaker data to train an auxiliary classification model that predicts vocal characteristics of speakers. In particular, this paper explores the utility achieved by approaches which combine different federated learning and differential privacy mechanisms. These approaches make it possible to train a central model while protecting user privacy, with users' data remaining on their devices. Furthermore, they make learning on a large population of speakers possible, ensuring good coverage of speaker characteristics when training a model. The auxiliary model described here uses features extracted from phrases which trigger a speaker verification system. From these features, the model predicts speaker characteristic labels considered useful as side information. The knowledge of the auxiliary model is distilled into a speaker verification system using multi-task learning, with the side information labels predicted by this auxiliary model being the additional task. This approach results in a 6% relative improvement in equal error rate over a baseline system.
Index Terms: Speaker Verification, Multi-task Learning, Federated Learning, Differential Privacy
1. Introduction
Speaker verification is the problem of determining whether the person speaking is a specific individual or someone else. It is a vital feature for devices that use a "wake-up phrase" to provide access to information, as actions should only be triggered when this phrase is uttered by the device owner and not an impostor. Speaker verification systems usually consist of two components: a speaker embedding network, and a discriminative method for comparing pairs of embeddings to determine whether or not those embeddings originate from the same speaker [1, 2, 3, 4]. Additional side information can be useful for speaker verification [5, 6, 7]. This side information could be obtained through manual labelling. The setting that this paper considers instead is one where side information is available on many users' devices, but it is privacy-sensitive and should therefore not be uploaded to a central server. At a high level, this paper tests three hypotheses:

1. It is possible to train a classifier on the audio of trigger phrases to predict personal attributes of the speaker considered to be useful as side information.
2. Such a classifier can be improved with federated learning while preserving users' privacy.
3. The predictions of this classifier can be used to improve the performance of speaker verification.

Previous work has shown that neural networks can learn to predict speaker-dependent labels, such as gender [8, 9, 10] and emotion [11, 12, 8], from utterances. The desired outcome from testing the first hypothesis is a classifier that can predict similar speaker-dependent labels from the same input as the baseline speaker verification system used in this paper.

The second hypothesis is that it is possible to train a useful classifier on distributed user data while preserving user privacy. This is achieved through the combination of federated learning with differential privacy, which has been proposed and put into practice successfully in a large body of prior work [13, 14, 15, 16].
In federated learning, a batch of clients compute statistics on their local data using the latest version of a central model. The resulting statistics are combined on a server to improve the central model. This process is repeated with a different subset of users. Federated averaging [14] is commonly used for federated learning. In this algorithm, models are trained locally on devices and the changes in model parameter values are averaged on a central server and used to update the central model. However, local model updates, which are derived from the data, might leak sensitive information. To prevent this, differential privacy (DP) [17] is used in this paper. Prior work has provided few examples of high-utility applications on real-world models and datasets, and none on classifying speakers. This paper presents an analysis of the effect of different privacy regimes on training accuracy and convergence in this domain.

The third hypothesis states that the encoded knowledge of an auxiliary model trained on side information can be used to improve a speaker verification system. Manually labelled side information has been shown effective for improving speaker verification systems [5, 6, 7]. The baseline system [18] employs the common approach of using a speaker embedding network and scoring pairs of embeddings using cosine similarity. This paper shows that it can be improved by enriching the speaker embedding network with knowledge distilled from the auxiliary model.

The structure of this paper is as follows. Section 2 provides a high-level overview of how federated learning can be made private using differential privacy. Section 3 introduces the baseline speaker verification system used in production. Section 4 introduces the classifier trained on user data in a privacy-preserving manner to predict side information from trigger phrases, named the "vocal classification model".
Section 5 explains the approach for including the vocal classification model in the speaker verification training setup to ultimately improve performance. Section 6 presents experimental results.
2. Federated learning with privacy
Data that can be used to improve machine-learned models often belongs to individuals or users and is therefore distributed over their devices. Federated learning is an approach that makes learning in this scenario possible. In federated learning, a central model is trained over a distributed dataset, where a large number of nodes (e.g. user devices) hold variable-sized subsets of the data. A model update or gradient [19, 20, 14] is computed at the node on the local data, and communicated to a central server. A large number of these updates or gradients are combined at the central server during each iteration of training. A global update to the central model is computed as the average of local updates. This is called "federated averaging" [14].

Many organisations and policy makers are committed to upholding user privacy. This makes federated learning an important approach to consider when dealing with data that is private, as it goes some way to protecting privacy. However, even though raw user data is not communicated with the server, it has been shown that model updates can leak information about the raw data [21, 22]. As mentioned in Section 1, there is a large body of work investigating and putting into practice approaches which combine federated learning with some privacy protection. One common way to mitigate these threats to privacy is to apply differential privacy (DP) [17, 13, 15, 16]. Differential privacy makes it possible to add noise to the model updates to give a guaranteed upper bound on the amount of information that can be leaked. DP can be used to protect an individual's update by applying noise at the distributed node. DP can also be applied centrally to protect the privacy of individuals' updates after aggregation [23].

In this paper, a number of privacy regimes are explored in simulation. One of these regimes is to use a weaker form of local DP [24], combined with central DP.
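The federated averaging step described above can be sketched as follows. This is a minimal illustration of the server-side aggregation rule, not the paper's implementation; the function name and the toy parameter vectors are assumptions.

```python
import numpy as np

def federated_averaging(global_weights, local_updates):
    """One round of federated averaging: the server averages the
    parameter deltas reported by a cohort of devices and applies
    the mean delta to the central model."""
    mean_delta = np.mean(local_updates, axis=0)
    return global_weights + mean_delta

# Toy example: three devices each report a model-parameter delta.
w = np.zeros(4)
deltas = [np.array([0.4, 0.0, 0.0, 0.0]),
          np.array([0.2, 0.2, 0.0, 0.0]),
          np.array([0.0, 0.4, 0.0, 0.0])]
w = federated_averaging(w, deltas)
print(w)  # element-wise mean of the deltas: [0.2 0.2 0.  0. ]
```

In practice each "delta" is itself computed by several steps of local gradient descent on the device's private data before being sent for aggregation.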
This form of local DP is applied to individual updates which are sent for aggregation on a secure server. The algorithm is an optimal method for providing updates (i.e. high-dimensional vectors) with the highest possible signal-to-noise ratio (SNR). The algorithm is tuned to achieve an SNR that permits high accuracy, while still allowing strong DP guarantees in deployment scenarios where it is applicable, e.g. because of shuffling [25] and subsampling [26]. In addition, while doing federated averaging, the server adds enough additional noise to ensure strong central DP guarantees (as per the moments accountant [13]).
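As a minimal sketch of this two-level scheme, the snippet below clips each update and adds Gaussian noise on-device, then adds further Gaussian noise at the server before averaging. It uses the generic Gaussian mechanism rather than the specific mechanism of [24]; all names, clipping bounds and noise scales are illustrative assumptions.

```python
import numpy as np

def privatize_update(update, clip_norm, local_sigma, rng):
    """On-device step: clip the update's L2 norm to bound its
    sensitivity, then add Gaussian noise for a (weak) local DP
    guarantee before sending it for aggregation."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, local_sigma * clip_norm,
                                size=update.shape)

def aggregate_with_central_dp(updates, clip_norm, central_sigma, rng):
    """Server step: sum the privatized updates, add central Gaussian
    noise (whose privacy cost would be tracked with the moments
    accountant), and average over the cohort."""
    total = np.sum(updates, axis=0)
    total = total + rng.normal(0.0, central_sigma * clip_norm,
                               size=total.shape)
    return total / len(updates)

rng = np.random.default_rng(0)
cohort = [privatize_update(np.ones(8), clip_norm=1.0,
                           local_sigma=0.1, rng=rng)
          for _ in range(100)]
avg_update = aggregate_with_central_dp(cohort, clip_norm=1.0,
                                       central_sigma=0.1, rng=rng)
```

Note how a larger cohort raises the SNR of the average: the signal grows with the number of clipped updates while the central noise is added once.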
3. Speaker verification
The baseline speaker verification system improved in this work is the system described in [18], which builds on an underlying voice trigger model that recognizes the trigger phrase. The input of the speaker verification system is a fixed-length supervector with features generated from forward propagation of the trigger phrase audio through the voice trigger model. The voice trigger system uses 26 Mel Frequency Cepstral Coefficients to parameterize a hidden Markov model (HMM) which models the trigger phrase. The means of the HMM states (other than those modelling silence) are concatenated to form the supervector, resulting in 26 × 20 = 520 dimensions.

The supervector consists of features about the particular trigger phrase, and the baseline speaker verification system contains a neural network that transforms the supervector into "speaker space", only focusing on retaining characteristics of the speaker itself. The network is a fully connected neural network with five layers. The first four layers use batch normalization [27] and sigmoid activations. The fifth layer is designed to be the embedding layer, and is therefore only a linear transform of dimension 100 with batch normalization. In the training phase, a sixth layer of size K with softmax activation is added as a head to the architecture, where K is the number of speakers in the training dataset. Training is defined as a speaker identification task, where the labels are one-hot representations of the K unique speakers and training is performed by minimizing the cross-entropy with the output softmax distribution. The trained embedding is used for measuring the similarity between two utterances by measuring the distance in "speaker space".

The speaker verification system stores multiple speaker vectors generated from a set of enrolment utterances. At test time, the acoustic instance of a trigger phrase is transformed into a fixed-length supervector from the HMM states of the voice trigger model, transformed again into a speaker vector by the speaker embedding network, and compared to the enrolment speaker vectors using cosine similarity.
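The forward pass of this embedding network can be sketched as below. The hidden-layer width (256 here) is an assumption for illustration only, and batch normalization and the training-time softmax head are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def embed(supervector, layers):
    """Map a 520-dim supervector into 100-dim "speaker space":
    four sigmoid hidden layers, then a linear embedding layer."""
    h = supervector
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    return h @ W + b  # embedding layer is linear (no activation)

# Hypothetical layer sizes: 520 -> 256 (x4) -> 100; 256 is assumed.
rng = np.random.default_rng(0)
dims = [520, 256, 256, 256, 256, 100]
layers = [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
          for m, n in zip(dims[:-1], dims[1:])]
speaker_vector = embed(rng.normal(size=520), layers)
print(speaker_vector.shape)  # (100,)
```

At inference time only this stack up to the 100-dimensional layer is used; the softmax classification head exists only during training.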
The average cosine similarity between the test speaker vector and the enrolment vectors can be interpreted as a speaker verification score. Following the notation of [28], the score is defined as

SV_score = \frac{1}{N} \sum_{i=1}^{N} \frac{f_{nn}(u_a)^\top f_{nn}(u_{spk_i})}{\| f_{nn}(u_a) \| \, \| f_{nn}(u_{spk_i}) \|},    (1)

where u_{spk_i} is the i-th out of N supervectors from the enrolment phase, u_a is the test supervector, and f_{nn} is the function expressed by the speaker embedding network. The speaker verification score SV_score is then compared to the operating threshold λ to reject or accept the request following the trigger phrase.

The work presented in this paper mainly targets the Japanese language (ja_JP), where the training dataset originates from a speaker population of size K = 18700. The speaker embedding network is trained with a batch size of , a weight decay of · − , an initial learning rate of − and a momentum factor of . . The performance of this setup is presented as Baseline in Section 6. Improvements on the baseline model presented in the paper are also trained with the same hyperparameters unless otherwise stated.
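Equation (1) is straightforward to implement; the sketch below assumes the embeddings f_nn(u) have already been computed by the speaker embedding network.

```python
import numpy as np

def sv_score(test_embedding, enrolment_embeddings):
    """Equation (1): mean cosine similarity between the test speaker
    vector and the N enrolment speaker vectors. The result is compared
    against an operating threshold to accept or reject the request."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b)))
    return sum(cosine(test_embedding, e)
               for e in enrolment_embeddings) / len(enrolment_embeddings)

# A perfectly matching enrolment vector and an orthogonal one
# average to a score of 0.5.
score = sv_score(np.array([1.0, 0.0]),
                 [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(score)  # 0.5
```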
4. Vocal classification
The first hypothesis proposes that classification of side information can be learned from the voice trigger phrase only, if the side information is correlated with vocal characteristics. To test the hypothesis, a fully connected deep neural network was defined which uses the same input features as the baseline speaker verification system, i.e. the 520-dimensional supervector, and predicts the side information of the speaker. This model is referred to as the "vocal classification model". The target labels are sensitive information and are stored privately on devices, hence larger-scale experiments of training the vocal classification model were only possible using the framework explained in Section 2.

Experiments were carried out with limited central data to evaluate the effect of applying differential privacy on federated training of the vocal classification model. In addition to accuracy, the signal-to-noise ratio is measured throughout experiments to quantify the signal quality of model updates where DP has been applied. Here, SNR is computed as the ratio of the L2 norm of the un-noised model update to that of the added DP noise, i.e.
SNR = ||unnoised update|| / ||DP noise||.

Figure 1 shows the accuracy of the following experiments in simulation on an evaluation set:

No DP: No DP mechanism is applied. The resulting accuracy acts as an upper bound for the remaining experiments, because introducing any DP mechanism to improve privacy guarantees is expected to have a negative impact on accuracy. Note that even in the "No DP" scenario there is some privacy protection, as anonymity is assumed. The accuracy of this model proves the first hypothesis, namely that a predictor of vocal characteristics can be learned from only the voice trigger phrase.

Local DP: The strongest form of differential privacy, local DP, is applied using the Gaussian mechanism [29]. The privacy parameters used were ε = 2 and δ = 10−. This results in a significant negative impact on the performance of the model, with a low SNR observed for the first central model update. The result is still much better than random, which shows that useful knowledge can be learned even with such strict privacy guarantees and low SNR.

Central DP: The Gaussian moments accountant is applied on the aggregate model update to provide central privacy guarantees. The privacy parameters used were ε = 2 and δ = 10−. For the moments accountant, the population size is assumed to be in the millions, the cohort size is fixed, and the maximum number of central iterations is bounded. The accuracy of the trained vocal classification model is close to the "No DP" case, and the SNR observed for the first central model update is significantly higher than for "Local DP". However, this setup does not have any local privacy guarantees.

Central DP with weaker local DP: Falling between the "Local DP" and "Central DP" experiments, a weaker form of local DP [24] is used in combination with the Gaussian moments accountant for the central DP mechanism. This combination results in an accuracy that represents a good privacy-utility trade-off. The privacy parameters used in the weaker form of local DP translate to an ε of roughly 25, which is in the high-epsilon regime. However, with the assumed privacy amplification through shuffling and sampling for anonymity, and the application of central DP with ε = 2 and δ = 10−, this may be considered a reasonable operating point. The SNR observed here is significantly higher than for "Local DP", as expected considering the high ε value.

Figure 1: Accuracy of the vocal classification model on an evaluation dataset, trained with different DP mechanisms.
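The SNR figure of merit used in these experiments is simply a ratio of L2 norms; a small sketch (illustrative name) for a single model update:

```python
import numpy as np

def update_snr(unnoised_update, dp_noise):
    """Signal-to-noise ratio of a privatized model update: the L2 norm
    of the un-noised update over the L2 norm of the added DP noise."""
    return float(np.linalg.norm(unnoised_update) /
                 np.linalg.norm(dp_noise))

print(update_snr(np.array([3.0, 4.0]), np.array([0.0, 1.0])))  # 5.0
```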
Hyperparameter tuning was performed in simulations with limited central data. The best performing model was used as an initialization when training with real devices. Switching to training the vocal classification model distributed on-device yielded multiple benefits. Firstly, additional categories of side information not previously available in the limited central dataset were used. Secondly, orders of magnitude more data is available distributed on devices. The cohort of users for each central model update was increased, resulting in a higher SNR for the same local DP parameters and a smaller noise variance from the moments accountant used centrally. This larger effective corpus means that there is more speaker coverage. Thirdly, while labels of the central dataset may be erroneous due to errors in manual human annotation, on-device training uses ground truth labels, thereby increasing accuracy, while protecting privacy.
5. Multi-task learning of the speaker verification system
The third hypothesis proposed in this paper is that the encoded knowledge of a vocal classification model can act as complementary information for training a more accurate speaker verification system. Multiple approaches for utilizing the knowledge of the vocal classification model were experimented with: static rules for rejecting a request based on the model output, using the output of the model as input to intermediate layers when training the speaker embedding network, and multi-task learning with pseudo-labels. The latter approach was the most successful and is the focus of the rest of the paper.

Specifically, the network was trained to predict the speaker, like the baseline system, and, additionally, the side information. The loss was the sum of the original loss and a new term. Minimizing this loss distilled the knowledge [30] of the vocal classification model, encoded in the pseudo-labels it generates. Just as with the baseline system, the final classification layer was removed during inference to expose the embedding layer. Not only did distilling the knowledge of the vocal classification model outperform the two other approaches mentioned, but the final architecture of the speaker verification system also remains unchanged. Since the vocal classification model is a model trained with differential privacy, any knowledge that is distilled by the speaker embedding network is also protected by the post-processing theorem of differential privacy [17].

The vocal classification model is generally confident in its predictions, mostly generating an output probability of nearly 1 for the highest predicted class, even where it is incorrect. To better distill the knowledge for cases like this, the concept of temperature is used [30].
A temperature higher than 1 softens the output distribution, making the probabilities that were previously minuscule more representative.

Given a mini-batch of data X = {x^(t)} and corresponding labels Y = {y^(t)}, the objective function to minimize for the multi-task setup is

L_mtl(X, Y) = \sum_t \left( L_spk(x^(t), y^(t)) + γ L_vc(x^(t)) \right),    (2)

where L_spk is the cross-entropy loss function for speaker identification as in the baseline setup, γ is a weight for the vocal classification loss relative to speaker identification, and the vocal classification loss L_vc is defined as

L_vc(x) = \frac{T^2}{N} \sum_{i=1}^{N} \left( \frac{e^{V_i(x)/T}}{\sum_j e^{V_j(x)/T}} − \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \right)^2.    (3)

Here, z_i is the i-th logit of the predicted side information from the vocal classification layer in the speaker embedding training setup, V_i(x) is the i-th logit from forward propagating the supervector x through the vocal classification model, and T is the temperature. The above can be interpreted as the mean squared error between the "softened" softmax distributions. The mean squared error is multiplied by T^2 because otherwise ∂L_vc/∂z_i scales with a factor of 1/T^2 for large T (see section 2.1 of [30]).

Table 1: Performance of the two tasks in the multi-task learning setup on an evaluation set.

Model      | Speaker accuracy (%) | Side information accuracy (%)
Baseline   | 82.56                | -
VC offline | 83.37                | 98.22
VC FL      | 83.59                | 98.15
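A sketch of equations (2) and (3), assuming the per-example speaker losses and both sets of logits are given; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def softened_softmax(logits, T):
    """Temperature-softened softmax used in eq. (3)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def vc_loss(z, v, T):
    """Eq. (3): T^2-scaled mean squared error between the softened
    softmax of the student's side-information logits z and that of the
    vocal classification model's logits v."""
    p = softened_softmax(z, T)
    q = softened_softmax(v, T)
    return T ** 2 * float(np.mean((p - q) ** 2))

def mtl_loss(spk_losses, student_logits, teacher_logits, gamma, T):
    """Eq. (2): per-example speaker cross-entropy plus the weighted
    distillation term, summed over the mini-batch."""
    return sum(l + gamma * vc_loss(z, v, T)
               for l, z, v in zip(spk_losses, student_logits,
                                  teacher_logits))

# Identical student and teacher logits give zero distillation loss.
print(vc_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], T=4.0))  # 0.0
```

In a real training loop the gradient of this loss would flow only into the speaker embedding network; the vocal classification model acts as a frozen teacher.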
6. Results
In this section, results of three models are presented, all with the same final network architecture. The difference is only in how the speaker embedding is trained. The first model, Baseline, is the baseline production system described in Section 3. The second model, VC offline, follows the setup described in Section 5, where the knowledge to distil is from a vocal classification model trained on the limited offline data (blue line in Figure 1). The third model, VC FL, also follows the setup from Section 5, but with a vocal classification model trained with federated learning. The mechanism in [24] was used for local privacy guarantees, and the Gaussian mechanism with the moments accountant was used for central privacy guarantees.

Multi-task learning of the speaker embedding was conducted on millions of utterances of the trigger phrase, all preprocessed by the voice trigger system to extract 520-dimensional supervectors. The softmax layer classifying side information has six outputs and the softmax layer classifying speakers has one output per speaker. The temperature T was fixed for all experiments, and the weight γ of the side information classification loss was roughly tuned to balance the losses of the two tasks. Evaluation accuracy was measured on another set of utterances from the same population of speakers as the training dataset.

The accuracies of speaker identification and side information classification on the evaluation dataset are shown in Table 1. The accuracy on the speaker identification task in the multi-task setting increases relative to the baseline. It is possible that the side information has both a regularizing effect on speaker identification as well as helping propagate signals through the network. The accuracy on the classification of the side information is expected to be close to 100%, because the labels are generated from the vocal classification model, and this larger DNN should be able to capture the encoded knowledge of the vocal classification model.

As mentioned in Section 3, a speaker profile is defined by a set of supervectors from enrolment utterances.
When evaluating the speaker verification system with the newly trained speaker embedding network, each of the available speaker profiles was compared to supervectors of test utterances using Equation 1. A subset of the test utterances were unique utterances from the speakers which had profiles, and the rest originated from imposter speakers that did not match any profile. A large set of pairs of speaker profiles and test utterances was compared, and a test utterance was accepted or rejected by applying the threshold λ on the speaker verification score SV_score. Performance at the equal error rate (EER) is shown in Table 2 for the three experiments.

Table 2: Performance of speaker verification on a test set.

Model      | EER (%)
Baseline   | 10.10
VC offline | 9.95
VC FL      | 9.50

The multi-task setup with knowledge distillation of a vocal classification model trained on limited central data yields a 0.15% absolute improvement over the baseline. The same setup with a vocal classification model trained on-device with privacy-preserving federated learning yields a 6% relative improvement in EER. These results prove both our second and third hypotheses, namely that privacy-preserving federated learning can be used to improve the vocal classification-based system (due to the gain over the "VC offline" experiment), and that the vocal classification model can be used through multi-task learning to improve the speaker verification system (due to the gain over the "Baseline" system).
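The equal error rate reported in Table 2 can be computed by sweeping the decision threshold λ; the sketch below (illustrative names, toy scores) finds the operating point where the false-accept and false-reject rates cross.

```python
import numpy as np

def eer(genuine_scores, imposter_scores):
    """Equal error rate: sweep the threshold over all observed scores
    and return the rate at the point where the false-accept rate
    (imposter score >= threshold) is closest to the false-reject rate
    (genuine score < threshold)."""
    best_far, best_frr = 1.0, 0.0
    for t in sorted(set(genuine_scores) | set(imposter_scores)):
        far = np.mean([s >= t for s in imposter_scores])
        frr = np.mean([s < t for s in genuine_scores])
        if abs(far - frr) <= abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2.0

# Toy scores: perfectly separable, so the EER is 0.
print(eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0
```

Production evaluations typically interpolate between thresholds on a full DET curve; this coarse sweep is only meant to show the definition.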
7. Conclusions
This paper demonstrates how a centrally trained speaker verification system can be improved by distilling the knowledge of an auxiliary model that was trained with side information on a much broader population using federated learning, while protecting user privacy. Firstly, the auxiliary model, which classifies side information, was trained using data distributed over millions of real devices. Additional experiments simulated different combinations of federated learning and differential privacy when training this model, to highlight the utility/privacy trade-off expected when using such approaches. The accuracy, time to convergence and signal-to-noise ratio clearly show the relative ordering of these approaches in terms of utility. Secondly, the encoded knowledge of the auxiliary model was distilled into the speaker embedding network of an existing speaker verification baseline system using multi-task learning. Finally, a 6% relative improvement in equal error rate for speaker verification was achieved using this technique while maintaining the same network architecture as the baseline. This result shows that the speaker characteristic knowledge distilled into the speaker verification network resulted in speaker embeddings which are more discriminative.
8. Acknowledgements
This work was a collaborative effort between multiple teams. The authors would like to thank Chandra Dhir and Sachin Kajarekar for their help and involvement in relation to the speaker verification system. The authors would also like to thank everyone involved in the private federated learning effort, which made the experiments in this work possible, including: Abhishek Bhowmick, Simon Beaumont, Andrew Byde, Luke Carlson, Andrew Cherkashyn, Mansi Deshpande, Fei Dong, Julien Freudiger, Stanley Hung, Omid Javidbakht, Gaurav Kapoor, Joris Kluivers, Henry Mason, Tom Naughton, Deepa Nemmili Veeravalli, Rehan Rishi and Dominic Telaar.

9. References

[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999–1003.
[3] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Interspeech, 2018, pp. 3573–3577.
[4] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[5] L. Ferrer, M. Graciarena, A. Zymnis, and E. Shriberg, "System combination using auxiliary information for speaker verification," in ICASSP. IEEE, 2008, pp. 4853–4856.
[6] O. Plchot, S. Matsoukas, P. Matějka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Hermansky, S. Mallidi, N. Mesgarani et al., "Developing a speaker identification system for the DARPA RATS project," in ICASSP. IEEE, 2013, pp. 6768–6772.
[7] F. Kelly and J. H. Hansen, "Evaluation and calibration of short-term aging effects in speaker verification," in Interspeech, 2015.
[8] Z.-Q. Wang and I. Tashev, "Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks," in ICASSP. IEEE, 2017, pp. 5150–5154.
[9] H. Meinedo and I. Trancoso, "Age and gender classification using fusion of acoustic and prosodic features," in Interspeech, 2010.
[10] S. H. Kabil, H. Muckenhirn, and M. Magimai-Doss, "On learning to identify genders from raw speech signal using CNNs," in Interspeech, 2018, pp. 287–291.
[11] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Interspeech, 2014.
[12] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Interspeech, 2016, pp. 3603–3607.
[13] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016. [Online]. Available: http://dx.doi.org/10.1145/2976749.2978318
[14] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, "Communication-efficient learning of deep networks from decentralized data," 2016.
[15] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private language models without losing accuracy," in ICLR, 2018.
[16] T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, and J. Passerat-Palmbach, "A generic framework for privacy preserving deep learning," CoRR, 2018.
[17] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
[18] Siri Team, "Personalized Hey Siri," Apple Machine Learning Journal, vol. 1, no. 9, 2018. [Online]. Available: https://machinelearning.apple.com/2018/04/16/personalized-hey-siri.html
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Privacy aware learning," 2012.
[20] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," 2016.
[21] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.
[22] L. Melis, C. Song, E. D. Cristofaro, and V. Shmatikov, "Exploiting unintended feature leakage in collaborative learning," 2018.
[23] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
[24] A. Bhowmick, J. C. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, "Protection against reconstruction and its applications in private federated learning," CoRR, vol. abs/1812.00984, 2018.
[25] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta, "Amplification by shuffling: From local to central differential privacy via anonymity," in Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019, pp. 2468–2479.
[26] B. Balle, G. Barthe, and M. Gaboardi, "Privacy amplification by subsampling: Tight analyses via couplings and divergences," in Advances in Neural Information Processing Systems, 2018, pp. 6277–6287.
[27] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[28] E. Marchi, S. Shum, K. Hwang, S. Kajarekar, S. Sigtia, H. Richards, R. Haynes, Y. Kim, and J. Bridle, "Generalised discriminative transform via curriculum learning for speaker recognition," in ICASSP, April 2018, pp. 5324–5328.
[29] B. Balle and Y.-X. Wang, "Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising," arXiv preprint arXiv:1805.06530, 2018.
[30] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.