Improving on-device speaker verification using federated learning with privacy
Filip Granqvist, Matt Seigel, Rogier van Dalen, Áine Cahill, Stephen Shum, Matthias Paulik
Apple
{fgranqvist, mseigel, rogier_vandalen, aine_cahill, stephen_shum, mpaulik}@apple.com

Abstract
Information on speaker characteristics can be useful as side information in improving speaker recognition accuracy. However, such information is often private. This paper investigates how privacy-preserving learning can improve a speaker verification system, by enabling the use of privacy-sensitive speaker data to train an auxiliary classification model that predicts vocal characteristics of speakers. In particular, this paper explores the utility achieved by approaches which combine different federated learning and differential privacy mechanisms. These approaches make it possible to train a central model while protecting user privacy, with users' data remaining on their devices. Furthermore, they make learning on a large population of speakers possible, ensuring good coverage of speaker characteristics when training a model. The auxiliary model described here uses features extracted from phrases which trigger a speaker verification system. From these features, the model predicts speaker characteristic labels considered useful as side information. The knowledge of the auxiliary model is distilled into a speaker verification system using multi-task learning, with the side information labels predicted by this auxiliary model being the additional task. This approach results in a 6% relative improvement in equal error rate over a baseline system.
Index Terms: Speaker Verification, Multi-task Learning, Federated Learning, Differential Privacy
1. Introduction
Speaker verification is the problem of determining whether the person speaking is a specific individual or someone else. It is a vital feature for devices that use a "wake-up phrase" to provide access to information, as actions should only be triggered when this phrase is uttered by the device owner and not an impostor. Speaker verification systems usually consist of two components: a speaker embedding network, and a discriminative method for comparing pairs of embeddings to determine whether or not those embeddings originate from the same speaker [1, 2, 3, 4]. Additional side information can be useful for speaker verification [5, 6, 7]. This side information could be obtained through manual labelling. The setting that this paper considers instead is one where side information is available on many users' devices, but it is privacy-sensitive and should therefore not be uploaded to a central server. At a high level, this paper tests three hypotheses:

1. It is possible to train a classifier on the audio of trigger phrases to predict personal attributes of the speaker considered to be useful as side information.
2. Such a classifier can be improved with federated learning while preserving users' privacy.
3. The predictions of this classifier can be used to improve the performance of speaker verification.

Previous work has shown that neural networks can learn to predict speaker-dependent labels, such as gender [8, 9, 10] and emotion [11, 12, 8], from utterances. The desired outcome from testing the first hypothesis is a classifier that can predict similar speaker-dependent labels from the same input as the baseline speaker verification system used in this paper.

The second hypothesis is that it is possible to train a useful classifier on distributed user data while preserving user privacy. This is achieved through the combination of federated learning with differential privacy, which has been proposed and put into practice successfully in a large body of prior work [13, 14, 15, 16].
In federated learning, a batch of clients compute statistics on their local data using the latest version of a central model. The resulting statistics are combined on a server to improve the central model. This process is repeated with a different subset of users. Federated averaging [14] is commonly used for federated learning. In this algorithm, models are trained locally on devices and the changes in model parameter values are averaged on a central server and used to update the central model. However, local model updates, which are derived from the data, might leak sensitive information. To prevent this, differential privacy (DP) [17] is used in this paper. Prior work has provided few examples of high-utility applications on real-world models and datasets, and none on classifying speakers. This paper presents an analysis of the effect of different privacy regimes on training accuracy and convergence in this domain.

The third hypothesis states that the encoded knowledge of an auxiliary model trained on side information can be used to improve a speaker verification system. Manually labelled side information has been shown effective for improving speaker verification systems [5, 6, 7]. The baseline system [18] employs the common approach of using a speaker embedding network and scoring pairs of embeddings using cosine similarity. This paper shows that it can be improved by enriching the speaker embedding network with knowledge distilled from the auxiliary model.

The structure of this paper is as follows. Section 2 provides a high-level overview of how federated learning can be made private using differential privacy. Section 3 introduces the baseline speaker verification system used in production. Section 4 introduces the classifier trained on user data in a privacy-preserving manner to predict side information from trigger phrases, named the "vocal classification model".
Section 5 explains the approach for including the vocal classification model in the speaker verification training setup to ultimately improve performance. Section 6 presents experimental results.
2. Federated learning with privacy
Data that can be used to improve machine-learned models often belongs to individuals or users and is therefore distributed over their devices. Federated learning is an approach that makes learning in this scenario possible. In federated learning, a central model is trained over a distributed dataset, where a large number of nodes (e.g. user devices) hold variable-sized subsets of the data. A model update or gradient [19, 20, 14] is computed at the node on the local data, and communicated to a central server. A large number of these updates or gradients are combined at the central server during each iteration of training. A global update to the central model is computed as the average of local updates. This is called "federated averaging" [14].

Many organisations and policy makers are committed to upholding user privacy. This makes federated learning an important approach to consider when dealing with data that is private, as it goes some way to protecting privacy. However, even though raw user data is not communicated with the server, it has been shown that model updates can leak information about the raw data [21, 22]. As mentioned in Section 1, there is a large body of work investigating and putting into practice approaches which combine federated learning with some privacy protection. One common way to mitigate these threats to privacy is to apply differential privacy (DP) [17, 13, 15, 16]. Differential privacy makes it possible to add noise to the model updates to give a guaranteed upper bound on the amount of information that can be leaked. DP can be used to protect an individual's update by applying noise at the distributed node. DP can also be applied centrally to protect the privacy of individuals' updates after aggregation [23].

In this paper, a number of privacy regimes are explored in simulation. One of these regimes is to use a weaker form of local DP [24], combined with central DP.
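The federated averaging step described above can be sketched as follows. This is a minimal illustration of the server-side aggregation rule, not the paper's implementation; the function name and the toy parameter vectors are assumptions.

```python
import numpy as np

def federated_averaging(global_weights, local_updates):
    """One round of federated averaging: the server averages the
    parameter deltas reported by a cohort of devices and applies
    the mean delta to the central model."""
    mean_delta = np.mean(local_updates, axis=0)
    return global_weights + mean_delta

# Toy example: three devices each report a model-parameter delta.
w = np.zeros(4)
deltas = [np.array([0.4, 0.0, 0.0, 0.0]),
          np.array([0.2, 0.2, 0.0, 0.0]),
          np.array([0.0, 0.4, 0.0, 0.0])]
w = federated_averaging(w, deltas)
print(w)  # element-wise mean of the deltas: [0.2 0.2 0.  0. ]
```

In practice each "delta" is itself computed by several steps of local gradient descent on the device's private data before being sent for aggregation.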
This form of local DP is applied to individual updates which are sent for aggregation on a secure server. The algorithm is an optimal method for providing updates (i.e. high-dimensional vectors) with the highest possible signal-to-noise ratio (SNR). The algorithm is tuned to achieve an SNR that permits high accuracy, while still allowing strong DP guarantees in deployment scenarios where it is applicable, e.g. because of shuffling [25] and subsampling [26]. In addition, while doing federated averaging, the server adds enough additional noise to ensure strong central DP guarantees (as per the moments accountant [13]).
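As a minimal sketch of this two-level scheme, the snippet below clips each update and adds Gaussian noise on-device, then adds further Gaussian noise at the server before averaging. It uses the generic Gaussian mechanism rather than the specific mechanism of [24]; all names, clipping bounds and noise scales are illustrative assumptions.

```python
import numpy as np

def privatize_update(update, clip_norm, local_sigma, rng):
    """On-device step: clip the update's L2 norm to bound its
    sensitivity, then add Gaussian noise for a (weak) local DP
    guarantee before sending it for aggregation."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, local_sigma * clip_norm,
                                size=update.shape)

def aggregate_with_central_dp(updates, clip_norm, central_sigma, rng):
    """Server step: sum the privatized updates, add central Gaussian
    noise (whose privacy cost would be tracked with the moments
    accountant), and average over the cohort."""
    total = np.sum(updates, axis=0)
    total = total + rng.normal(0.0, central_sigma * clip_norm,
                               size=total.shape)
    return total / len(updates)

rng = np.random.default_rng(0)
cohort = [privatize_update(np.ones(8), clip_norm=1.0,
                           local_sigma=0.1, rng=rng)
          for _ in range(100)]
avg_update = aggregate_with_central_dp(cohort, clip_norm=1.0,
                                       central_sigma=0.1, rng=rng)
```

Note how a larger cohort raises the SNR of the average: the signal grows with the number of clipped updates while the central noise is added once.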
3. Speaker verification
The baseline speaker verification system improved in this work is the system described in [18], which builds on an underlying voice trigger model that recognizes the trigger phrase. The input of the speaker verification system is a fixed-length supervector with features generated from forward propagation of the trigger phrase audio through the voice trigger model. The voice trigger system uses 26 Mel Frequency Cepstral Coefficients to parameterize a hidden Markov model (HMM) which models the trigger phrase. The means of the HMM states (other than those modelling silence) are concatenated to form the supervector, resulting in 26 × 20 = 520 dimensions.

The supervector consists of features about the particular trigger phrase, and the baseline speaker verification system contains a neural network that transforms the supervector into "speaker space", only focusing on retaining characteristics of the speaker itself. The network is a fully connected neural network with five layers. The first four layers use batch normalization [27] and sigmoid activations. The fifth layer is designed to be the embedding layer, and is therefore only a linear transform of dimension 100 with batch normalization. In the training phase, a sixth layer of size K with softmax activation is added as a head to the architecture, where K is the number of speakers in the training dataset. Training is defined as a speaker identification task, where the labels are one-hot representations of the K unique speakers and training is performed by minimizing the cross-entropy with the output softmax distribution. The trained embedding is used for measuring the similarity between two utterances by measuring the distance in "speaker space".

The speaker verification system stores multiple speaker vectors generated from a set of enrolment utterances. At test time, the acoustic instance of a trigger phrase is transformed into a fixed-length supervector from the HMM states of the voice trigger model, transformed again into a speaker vector by the speaker embedding network, and compared to the enrolment speaker vectors using cosine similarity.
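The forward pass of this embedding network can be sketched as below. The hidden-layer width (256 here) is an assumption for illustration only, and batch normalization and the training-time softmax head are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def embed(supervector, layers):
    """Map a 520-dim supervector into 100-dim "speaker space":
    four sigmoid hidden layers, then a linear embedding layer."""
    h = supervector
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)
    W, b = layers[-1]
    return h @ W + b  # embedding layer is linear (no activation)

# Hypothetical layer sizes: 520 -> 256 (x4) -> 100; 256 is assumed.
rng = np.random.default_rng(0)
dims = [520, 256, 256, 256, 256, 100]
layers = [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
          for m, n in zip(dims[:-1], dims[1:])]
speaker_vector = embed(rng.normal(size=520), layers)
print(speaker_vector.shape)  # (100,)
```

At inference time only this stack up to the 100-dimensional layer is used; the softmax classification head exists only during training.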
The average cosine similarity between the test speaker vector and the enrolment vectors can be interpreted as a speaker verification score. Following the notation of [28], the score is defined as

SV_score = \frac{1}{N} \sum_{i=1}^{N} \frac{f_{nn}(u_a)^\top f_{nn}(u_{spk_i})}{\| f_{nn}(u_a) \| \, \| f_{nn}(u_{spk_i}) \|},    (1)

where u_{spk_i} is the i-th out of N supervectors from the enrolment phase, u_a is the test supervector, and f_{nn} is the function expressed by the speaker embedding network. The speaker verification score SV_score is then compared to the operating threshold λ to reject or accept the request following the trigger phrase.

The work presented in this paper mainly targets the Japanese language (ja_JP), where the training dataset originates from a speaker population of size K = 18700. The speaker embedding network is trained with a batch size of , a weight decay of · − , an initial learning rate of − and a momentum factor of . . The performance of this setup is presented as Baseline in Section 6. Improvements on the baseline model presented in the paper are also trained with the same hyperparameters unless otherwise stated.
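Equation (1) is straightforward to implement; the sketch below assumes the embeddings f_nn(u) have already been computed by the speaker embedding network.

```python
import numpy as np

def sv_score(test_embedding, enrolment_embeddings):
    """Equation (1): mean cosine similarity between the test speaker
    vector and the N enrolment speaker vectors. The result is compared
    against an operating threshold to accept or reject the request."""
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b)))
    return sum(cosine(test_embedding, e)
               for e in enrolment_embeddings) / len(enrolment_embeddings)

# A perfectly matching enrolment vector and an orthogonal one
# average to a score of 0.5.
score = sv_score(np.array([1.0, 0.0]),
                 [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(score)  # 0.5
```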
4. Vocal classification
The first hypothesis proposes that classification of side information can be learned from the voice trigger phrase only, if the side information is correlated with vocal characteristics. To test the hypothesis, a fully connected deep neural network was defined which uses the same input features as the baseline speaker verification system, i.e. the 520-dimensional supervector, and predicts the side information of the speaker. This model is referred to as the "vocal classification model". The target labels are sensitive information and are stored privately on devices, hence larger-scale experiments of training the vocal classification model were only possible using the framework explained in Section 2.

Experiments were carried out with limited central data to evaluate the effect of applying differential privacy on federated training of the vocal classification model. In addition to accuracy, the signal-to-noise ratio is measured throughout experiments to quantify the signal quality of model updates where DP has been applied. Here, SNR is computed as the ratio of the L2 norm of the un-noised model update to that of the added DP noise, i.e.
SNR = ||unnoised update|| / ||DP noise||.

Figure 1 shows the accuracy of the following experiments in simulation on an evaluation set:

No DP: No DP mechanism is applied. The resulting accuracy acts as an upper bound for the remaining experiments, because introducing any DP mechanism to improve privacy guarantees is expected to have a negative impact on accuracy. Note that even in the "No DP" scenario there is some privacy protection, as anonymity is assumed. The accuracy of this model proves the first hypothesis, namely that a predictor of vocal characteristics can be learned from only the voice trigger phrase.

Local DP: The strongest form of differential privacy, local DP, is applied using the Gaussian mechanism [29]. The privacy parameters used were ε = 2 and δ = 10−. This results in a significant negative impact on the performance of the model, with a low SNR observed for the first central model update. The result is still much better than random, which shows that useful knowledge can be learned even with such strict privacy guarantees and low SNR.

Central DP: The Gaussian moments accountant is applied on the aggregate model update to provide central privacy guarantees. The privacy parameters used were ε = 2 and δ = 10−. For the moments accountant, the population size is assumed to be in the millions, the cohort size is fixed, and the maximum number of central iterations is bounded. The accuracy of the trained vocal classification model is close to the "No DP" case, and the SNR observed for the first central model update is significantly higher than for "Local DP". However, this setup does not have any local privacy guarantees.

Central DP with weaker local DP: Falling between the "Local DP" and "Central DP" experiments, a weaker form of local DP [24] is used in combination with the Gaussian moments accountant for the central DP mechanism. This combination results in an accuracy that represents a good privacy-utility trade-off. The privacy parameters used in the weaker form of local DP translate to an ε of roughly 25, which is in the high-epsilon regime. However, with the assumed privacy amplification through shuffling and sampling for anonymity, and the application of central DP with ε = 2 and δ = 10−, this may be considered a reasonable operating point. The SNR observed here is significantly higher than for "Local DP", as expected considering the high ε value.

Figure 1: Accuracy of the vocal classification model on an evaluation dataset, trained with different DP mechanisms.
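The SNR figure of merit used in these experiments is simply a ratio of L2 norms; a small sketch (illustrative name) for a single model update:

```python
import numpy as np

def update_snr(unnoised_update, dp_noise):
    """Signal-to-noise ratio of a privatized model update: the L2 norm
    of the un-noised update over the L2 norm of the added DP noise."""
    return float(np.linalg.norm(unnoised_update) /
                 np.linalg.norm(dp_noise))

print(update_snr(np.array([3.0, 4.0]), np.array([0.0, 1.0])))  # 5.0
```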
Hyperparameter tuning was performed in simulations with limited central data. The best performing model was used as an initialization when training with real devices. Switching to training the vocal classification model distributed on-device yielded multiple benefits. Firstly, additional categories of side information not previously available in the limited central dataset were used. Secondly, orders of magnitude more data is available distributed on devices. The cohort of users for each central model update was increased, resulting in a higher SNR for the same local DP parameters and a smaller noise variance from the moments accountant used centrally. This larger effective corpus means that there is more speaker coverage. Thirdly, while labels of the central dataset may be erroneous due to errors in manual human annotation, on-device training uses ground truth labels, thereby increasing accuracy, while protecting privacy.
5. Multi-task learning of the speaker verification system
The third hypothesis proposed in this paper is that the encoded knowledge of a vocal classification model can act as complementary information for training a more accurate speaker verification system. Multiple approaches for utilizing the knowledge of the vocal classification model were experimented with: static rules for rejecting a request based on the model output, using the output of the model as input to intermediate layers when training the speaker embedding network, and multi-task learning with pseudo-labels. The latter approach was the most successful and is the focus of the rest of the paper.

Specifically, the network was trained to predict the speaker, like the baseline system, and, additionally, the side information. The loss was the sum of the original loss and a new term. Minimizing this loss distilled the knowledge [30] of the vocal classification model, encoded in the pseudo-labels it generates. Just as with the baseline system, the final classification layer was removed during inference to expose the embedding layer. Not only did distilling the knowledge of the vocal classification model outperform the two other approaches mentioned, but the final architecture of the speaker verification system also remains unchanged. Since the vocal classification model is a model trained with differential privacy, any knowledge that is distilled by the speaker embedding network is also protected by the post-processing theorem of differential privacy [17].

The vocal classification model is generally confident in its predictions, mostly generating an output probability of nearly 1 for the highest predicted class, even where it is incorrect. To better distill the knowledge for cases like this, the concept of temperature is used [30].
A temperature higher than 1 softens the output distribution, making the probabilities that were previously minuscule more representative.

Given a mini-batch of data X = {x^(t)} and corresponding labels Y = {y^(t)}, the objective function to minimize for the multi-task setup is

L_mtl(X, Y) = \sum_t \left( L_spk(x^(t), y^(t)) + γ L_vc(x^(t)) \right),    (2)

where L_spk is the cross-entropy loss function for speaker identification as in the baseline setup, γ is a weight for the vocal classification loss relative to speaker identification, and the vocal classification loss L_vc is defined as

L_vc(x) = \frac{T^2}{N} \sum_{i=1}^{N} \left( \frac{e^{V_i(x)/T}}{\sum_j e^{V_j(x)/T}} − \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \right)^2.    (3)

Here, z_i is the i-th logit of the predicted side information from the vocal classification layer in the speaker embedding training setup, V_i(x) is the i-th logit from forward propagating the supervector x through the vocal classification model, and T is the temperature. The above can be interpreted as the mean squared error between the "softened" softmax distributions. The mean squared error is multiplied by T^2 because otherwise ∂L_vc/∂z_i scales with a factor of 1/T^2 for large T (see section 2.1 of [30]).

Table 1: Performance of the two tasks in the multi-task learning setup on an evaluation set.

Model      | Speaker accuracy (%) | Side information accuracy (%)
Baseline   | 82.56                | -
VC offline | 83.37                | 98.22
VC FL      | 83.59                | 98.15
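A sketch of equations (2) and (3), assuming the per-example speaker losses and both sets of logits are given; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def softened_softmax(logits, T):
    """Temperature-softened softmax used in eq. (3)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def vc_loss(z, v, T):
    """Eq. (3): T^2-scaled mean squared error between the softened
    softmax of the student's side-information logits z and that of the
    vocal classification model's logits v."""
    p = softened_softmax(z, T)
    q = softened_softmax(v, T)
    return T ** 2 * float(np.mean((p - q) ** 2))

def mtl_loss(spk_losses, student_logits, teacher_logits, gamma, T):
    """Eq. (2): per-example speaker cross-entropy plus the weighted
    distillation term, summed over the mini-batch."""
    return sum(l + gamma * vc_loss(z, v, T)
               for l, z, v in zip(spk_losses, student_logits,
                                  teacher_logits))

# Identical student and teacher logits give zero distillation loss.
print(vc_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], T=4.0))  # 0.0
```

In a real training loop the gradient of this loss would flow only into the speaker embedding network; the vocal classification model acts as a frozen teacher.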
6. Results
In this section, results of three models are presented, all with the same final network architecture. The difference is only in how the speaker embedding is trained. The first model, Baseline, is the baseline production system described in Section 3. The second model, VC offline, follows the setup described in Section 5, where the knowledge to distil is from a vocal classification model trained on the limited offline data (blue line in Figure 1). The third model, VC FL, also follows the setup from Section 5, but with a vocal classification model trained with federated learning. The mechanism in [24] was used for local privacy guarantees, and the Gaussian mechanism with the moments accountant was used for central privacy guarantees.

Multi-task learning of the speaker embedding was conducted on millions of utterances of the trigger phrase, all preprocessed by the voice trigger system to extract 520-dimensional supervectors. The softmax layer classifying side information has six outputs and the softmax layer classifying speakers has one output per speaker. The temperature T was fixed for all experiments, and the weight γ of the side information classification loss was roughly tuned to balance the losses of the two tasks. Evaluation accuracy was measured on another set of utterances from the same population of speakers as the training dataset.

The accuracies of speaker identification and side information classification on the evaluation dataset are shown in Table 1. The accuracy on the speaker identification task in the multi-task setting increases relative to the baseline. It is possible that the side information has both a regularizing effect on speaker identification as well as helping propagate signals through the network. The accuracy on the classification of the side information is expected to be close to 100%, because the labels are generated from the vocal classification model, and this larger DNN should be able to capture the encoded knowledge of the vocal classification model.

As mentioned in Section 3, a speaker profile is defined by a set of supervectors from enrolment utterances.
When evaluating the speaker verification system with the newly trained speaker embedding network, each of the available speaker profiles was compared to supervectors of test utterances using Equation 1. A subset of the test utterances were unique utterances from the speakers which had profiles, and the rest originated from imposter speakers that did not match any profile. A large set of pairs of speaker profiles and test utterances was compared, and a test utterance was accepted or rejected by applying the threshold λ on the speaker verification score SV_score. Performance at the equal error rate (EER) is shown in Table 2 for the three experiments.

Table 2: Performance of speaker verification on a test set.

Model      | EER (%)
Baseline   | 10.10
VC offline | 9.95
VC FL      | 9.50

The multi-task setup with knowledge distillation of a vocal classification model trained on limited central data yields a 0.15% absolute improvement over the baseline. The same setup with a vocal classification model trained on-device with privacy-preserving federated learning yields a 6% relative improvement in EER. These results prove both our second and third hypotheses, namely that privacy-preserving federated learning can be used to improve the vocal classification-based system (due to the gain over the "VC offline" experiment), and that the vocal classification model can be used through multi-task learning to improve the speaker verification system (due to the gain over the "Baseline" system).
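The equal error rate reported in Table 2 can be computed by sweeping the decision threshold λ; the sketch below (illustrative names, toy scores) finds the operating point where the false-accept and false-reject rates cross.

```python
import numpy as np

def eer(genuine_scores, imposter_scores):
    """Equal error rate: sweep the threshold over all observed scores
    and return the rate at the point where the false-accept rate
    (imposter score >= threshold) is closest to the false-reject rate
    (genuine score < threshold)."""
    best_far, best_frr = 1.0, 0.0
    for t in sorted(set(genuine_scores) | set(imposter_scores)):
        far = np.mean([s >= t for s in imposter_scores])
        frr = np.mean([s < t for s in genuine_scores])
        if abs(far - frr) <= abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2.0

# Toy scores: perfectly separable, so the EER is 0.
print(eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0
```

Production evaluations typically interpolate between thresholds on a full DET curve; this coarse sweep is only meant to show the definition.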
7. Conclusions
This paper demonstrates how a centrally trained speaker verification system can be improved by distilling the knowledge of an auxiliary model that was trained with side information on a much broader population using federated learning, while protecting user privacy. Firstly, the auxiliary model, which classifies side information, was trained using data distributed over millions of real devices. Additional experiments simulated different combinations of federated learning and differential privacy when training this model, to highlight the utility/privacy trade-off expected when using such approaches. The accuracy, time to convergence and signal-to-noise ratio clearly show the relative ordering of these approaches in terms of utility. Secondly, the encoded knowledge of the auxiliary model was distilled into the speaker embedding network of an existing speaker verification baseline system using multi-task learning. Finally, a 6% relative improvement in equal error rate for speaker verification was achieved using this technique while maintaining the same network architecture as the baseline. This result shows that the speaker characteristic knowledge distilled into the speaker verification network resulted in speaker embeddings which are more discriminative.
8. Acknowledgements
This work was a collaborative effort between multiple teams. The authors would like to thank Chandra Dhir and Sachin Kajarekar for their help and involvement in relation to the speaker verification system. The authors would also like to thank everyone involved in the private federated learning effort, which made the experiments in this work possible, including: Abhishek Bhowmick, Simon Beaumont, Andrew Byde, Luke Carlson, Andrew Cherkashyn, Mansi Deshpande, Fei Dong, Julien Freudiger, Stanley Hung, Omid Javidbakht, Gaurav Kapoor, Joris Kluivers, Henry Mason, Tom Naughton, Deepa Nemmili Veeravalli, Rehan Rishi and Dominic Telaar.

9. References

[1] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017, pp. 999–1003.
[3] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," in Interspeech, 2018, pp. 3573–3577.
[4] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[5] L. Ferrer, M. Graciarena, A. Zymnis, and E. Shriberg, "System combination using auxiliary information for speaker verification," in ICASSP. IEEE, 2008, pp. 4853–4856.
[6] O. Plchot, S. Matsoukas, P. Matějka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Hermansky, S. Mallidi, N. Mesgarani et al., "Developing a speaker identification system for the DARPA RATS project," in ICASSP. IEEE, 2013, pp. 6768–6772.
[7] F. Kelly and J. H. Hansen, "Evaluation and calibration of short-term aging effects in speaker verification," in Interspeech, 2015.
[8] Z.-Q. Wang and I. Tashev, "Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks," in ICASSP. IEEE, 2017, pp. 5150–5154.
[9] H. Meinedo and I. Trancoso, "Age and gender classification using fusion of acoustic and prosodic features," in Interspeech, 2010.
[10] S. H. Kabil, H. Muckenhirn, and M. Magimai-Doss, "On learning to identify genders from raw speech signal using CNNs," in Interspeech, 2018, pp. 287–291.
[11] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Interspeech, 2014.
[12] S. Ghosh, E. Laksana, L.-P. Morency, and S. Scherer, "Representation learning for speech emotion recognition," in Interspeech, 2016, pp. 3603–3607.
[13] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016. [Online]. Available: http://dx.doi.org/10.1145/2976749.2978318
[14] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, "Communication-efficient learning of deep networks from decentralized data," 2016.
[15] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang, "Learning differentially private language models without losing accuracy," in ICLR, 2018.
[16] T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, and J. Passerat-Palmbach, "A generic framework for privacy preserving deep learning," CoRR, 2018.
[17] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
[18] Siri Team, "Personalized Hey Siri," Apple Machine Learning Journal, vol. 1, no. 9, 2018. [Online]. Available: https://machinelearning.apple.com/2018/04/16/personalized-hey-siri.html
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Privacy aware learning," 2012.
[20] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," 2016.
[21] M. Fredrikson, S. Jha, and T. Ristenpart, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.
[22] L. Melis, C. Song, E. D. Cristofaro, and V. Shmatikov, "Exploiting unintended feature leakage in collaborative learning," 2018.
[23] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in IEEE Symposium on Security and Privacy. IEEE, 2017, pp. 3–18.
[24] A. Bhowmick, J. C. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, "Protection against reconstruction and its applications in private federated learning," CoRR, vol. abs/1812.00984, 2018.
[25] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta, "Amplification by shuffling: From local to central differential privacy via anonymity," in Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019, pp. 2468–2479.
[26] B. Balle, G. Barthe, and M. Gaboardi, "Privacy amplification by subsampling: Tight analyses via couplings and divergences," in Advances in Neural Information Processing Systems, 2018, pp. 6277–6287.
[27] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[28] E. Marchi, S. Shum, K. Hwang, S. Kajarekar, S. Sigtia, H. Richards, R. Haynes, Y. Kim, and J. Bridle, "Generalised discriminative transform via curriculum learning for speaker recognition," in ICASSP, April 2018, pp. 5324–5328.
[29] B. Balle and Y.-X. Wang, "Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising," arXiv preprint arXiv:1805.06530, 2018.
[30] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.