Federated Acoustic Modeling For Automatic Speech Recognition
Xiaodong Cui, Songtao Lu and Brian Kingsbury
IBM Research AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
ABSTRACT
Data privacy and protection is a crucial issue for any automatic speech recognition (ASR) service provider when dealing with clients. In this paper, we investigate federated acoustic modeling using data from multiple clients. A client's data is stored on a local data server and the clients communicate only model parameters with a central server, and not their data. The communication happens infrequently to reduce the communication cost. To mitigate the non-iid issue, client adaptive federated training (CAFT) is proposed to canonicalize data across clients. The experiments are carried out on 1,150 hours of speech data from multiple domains. Hybrid LSTM acoustic models are trained via federated learning and their performance is compared to traditional centralized acoustic model training. The experimental results demonstrate the effectiveness of the proposed federated acoustic modeling strategy. We also show that CAFT can further improve the performance of the federated acoustic model.
Index Terms — federated learning, speech recognition, adaptive training, LVCSR, GDPR
1. INTRODUCTION
Large-scale acoustic modeling for automatic speech recognition (ASR) usually relies on speech data from multiple sources. For industrial ASR products, oftentimes the acoustic models are customized for one or more target clients. The traditional approach to training acoustic models aggregates the speech data from the clients, along with other public and/or internal speech data if necessary. The training is typically carried out in a centralized fashion on the servers of the ASR service provider. With data privacy and protection becoming a crucial issue in information technology [1], most clients will require that their data stay on-premises, precluding its release to the provider for training. A new approach to training the models under this circumstance is needed. In this paper, we investigate a federated acoustic modeling strategy where the acoustic model is trained collaboratively among the clients, with each client keeping its data stored locally. The clients only exchange their local model updates with the central server at the service provider. This model exchange takes place at a minimal frequency to reduce the communication cost.

Federated learning (FL) [2, 3, 4] has been widely used in applications such as healthcare and finance where data privacy is a constraint. In the setting of FL, multiple entities collaborate with each other to optimize a machine learning problem under the orchestration of a central server. A global model is learned with each client keeping its data private, in local storage. Even though FL was proposed initially in scenarios with a huge number of mobile or edge devices [5], it was later generalized to a much broader spectrum of applications. In [3], the initial FL setting with an emphasis on a large number of devices, each with a relatively small amount of data, is referred to as "cross-device" FL, while the setting that we use in this work for federated acoustic modeling is referred to as "cross-silo" FL. Cross-silo FL deals with a federation of a few data providers, each with a relatively large amount of siloed data. In the speech community, FL-related efforts have also been reported recently. In [6], federated learning was used to train an embedded wake word detector on crowdsourced speech. In [7], various federated averaging schemes and data augmentation techniques were studied to improve keyword spotting models with data that is not independent and identically distributed (iid) at the edge. An interactive system was built in [8] to demonstrate how FL can help transfer learning on acoustic models using edge device data. In [9], a federated transfer learning platform is introduced with improved performance using enhanced federated averaging via hierarchical optimization and gradient selection.

In this paper, we introduce a cross-silo FL framework for joint acoustic modeling with heterogeneous data from multiple clients. Its configuration is shown in Fig. 1. Each client has a local server for data storage and computation. The clients only communicate with the central server of the service provider.
Model parameters, not raw data, are exchanged between the clients and the central server in each round of communication. Local model parameters are uploaded to the central server and aggregated into a global model that is then transmitted back to each client. The communication is synchronous and takes place at a minimal frequency to reduce the communication cost. Since client data may come from distinct domains in unbalanced amounts, a fundamental issue with federated acoustic modeling in the real world is dealing with non-iid data. In this work, we propose a client adaptive federated training (CAFT) strategy to mitigate data heterogeneity. Experiments are conducted on 1,150 hours of speech data from multiple domains including public data, internal data, and real-world client data. We compare the performance of the federated strategy under various settings and also compare it with traditional centralized training.

The remainder of the paper is organized as follows. Section 2 gives the mathematical formulation of federated acoustic modeling. Section 3 and Section 4 present the algorithms and implementation of federated training and client adaptive federated training of acoustic models. Experimental results are reported in Section 5, followed by a discussion in Section 6. Finally, we conclude the paper with a summary in Section 7.

Fig. 1. Federated acoustic modeling with multiple clients.
2. PROBLEM FORMULATION
Suppose there are $L$ clients. The data of client $i$, $i = 1, \cdots, L$, consists of $n_i$ samples which can only be accessed locally. The data follows a client-dependent distribution $D_i$. Each client is associated with a local risk function

$$L_i(w) \triangleq \mathbb{E}_{x \sim D_i}[f(w, x)] \qquad (1)$$

where $w$ denotes the model parameters, $x \sim D_i$ the data samples from distribution $D_i$, and $f(w, x)$ the loss function.

Federated modeling optimizes the following risk function over the $L$ clients:

$$\min_w L(w) = \sum_{i=1}^{L} p_i L_i(w) = \sum_{i=1}^{L} p_i \, \mathbb{E}_{x \sim D_i}[f(w, x)] \qquad (2)$$

where $p_i > 0$ and $\sum_i p_i = 1$ are weights on the local risk functions. It is typical to choose $p_i = 1/L$ so that the clients contribute equally.

In conventional distributed training [10, 11, 12, 13], data from multiple sources are first mixed and then distributed to learners. Each learner has equal access to the mixed data and therefore the local distribution across learners is iid. In federated learning, however, the data from different sources cannot be mixed and hence the local data distribution is non-iid, which differs from conventional distributed training. This is a fundamental issue in federated learning. In addition, the amounts of data from the clients can be unbalanced. As a result, the weights in the global loss function in Eq. (2) are sometimes set to $p_i = n_i / n$, where $n = \sum_{i=1}^{L} n_i$, to make the loss function of each client proportional to its amount of data.
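As a concrete illustration of the two common weighting choices, the short Python snippet below computes $p_i = 1/L$ and $p_i = n_i/n$ for the five data sources used later in Section 5, using hours of speech as a stand-in for the sample counts $n_i$ (the dictionary keys and the use of hours as a proxy are illustrative assumptions).

```python
# Illustrative per-client weights for the global objective in Eq. (2).
# Hours of speech per client (from Section 5) are used as a proxy for n_i.
hours = {"broadcast_news": 420, "dictation": 450, "meeting": 100,
         "hospitality": 140, "accented": 40}

L = len(hours)
n = sum(hours.values())

p_equal = {c: 1.0 / L for c in hours}          # p_i = 1/L: clients contribute equally
p_prop = {c: h / n for c, h in hours.items()}  # p_i = n_i/n: proportional to data amount

print(p_equal)  # every client weighted 0.2
print(p_prop)   # e.g., dictation ~ 0.39, accented ~ 0.03
```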
3. FEDERATED ACOUSTIC MODELING
The recipe for cross-silo federated training of acoustic models is given in Algorithm 1.

The central server of the service provider coordinates the distributed training among clients. It starts the training by sending a globally initialized model to the clients and aggregates the locally updated models from all clients for a total of $T$ rounds of communication. In each communication round, clients receive the global model from the central server and update it locally using their own local data before sending it back to the central server. This process is carried out in parallel with synchronization: the central server waits until the models from all clients are in place before updating the global model. The local model update can be realized by any optimizer. In this work, stochastic gradient descent (SGD) with momentum is used. The local client data is evenly divided into $T$ chunks and, after randomization, one chunk of data is used for multi-step mini-batch SGD with a batch size $B$ in each communication round:

$$w_{k+1} = w_k - \alpha \sum_{j=1}^{B} \nabla f(w_k, x_j). \qquad (3)$$

The global model update is conducted by federated averaging (FedAvg) [2]:

$$w_{t+1} = w_t - \eta \sum_{i=1}^{L} p_i \left( w_t - w^{(i)}_{t+1} \right) \qquad (4)$$

Note that when the global learning rate $\eta = 1$, FedAvg is equivalent to a simple model averaging strategy

$$w_{t+1} = \sum_{i=1}^{L} p_i \, w^{(i)}_{t+1} \qquad (5)$$

since $\sum_i p_i = 1$ makes the $w_t$ terms in Eq. (4) cancel. This is analogous to the $K$-step averaging SGD [14] when $p_i = 1/L$. However, in this case the number of local steps employed by each client can differ and the data distribution is non-iid. A code sketch of this server/client update loop is given after Algorithm 1 below.
Algorithm 1 Federated acoustic modeling

$L \leftarrow$ total number of clients; $M \leftarrow$ total number of epochs; $T \leftarrow$ total rounds of communication; $B \leftarrow$ batch size; $\alpha \leftarrow$ local learning rate; $\eta \leftarrow$ global learning rate

function SERVER
    Initialize model $w$
    for $m = 1, \cdots, M$ do    ▷ copy the model from the last epoch
        for $t = 1, \cdots, T$ do
            Send model $w_t$ to the clients
            for $i = 1, \cdots, L$ do    ▷ running in parallel
                $w^{(i)}_{t+1} \leftarrow$ CLIENT($i$, $t$)
            $w_{t+1} \leftarrow w_t - \eta \sum_{i=1}^{L} p_i \,(w_t - w^{(i)}_{t+1})$

function CLIENT($i$, $t$)
    Receive model $w_t$ from the server
    Get the samples $S^{(i)}_t$ for this communication round, where $|S^{(i)}_t| = n_i / T$
    Get the number of SGD steps $K_i = |S^{(i)}_t| / B$
    for $k = 1, \cdots, K_i$ do
        $w^{(i)}_{t,k+1} \leftarrow w^{(i)}_{t,k} - \alpha \sum_{j=1}^{B} \nabla f(w^{(i)}_{t,k}, x_j)$, $x_j \sim S^{(i)}_t$
    Return model $w^{(i)}_{t+1} = w^{(i)}_{t,K_i}$ to the server
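To make Algorithm 1 concrete, here is a minimal, self-contained Python sketch of the FedAvg updates in Eqs. (3)-(4). It is a toy under stated assumptions: the model is a plain parameter vector, the per-client loss is a quadratic whose gradient is given in closed form, and momentum, data chunking, and parallelism are omitted; it is not the paper's actual implementation.

```python
import numpy as np

def local_update(w_global, batches, grad_fn, lr=0.2):
    """Client-side update of Eq. (3): K steps of mini-batch SGD starting
    from the received global model (momentum omitted for brevity)."""
    w = w_global.copy()
    for batch in batches:              # one chunk of the client's local data
        w -= lr * grad_fn(w, batch)    # grad_fn returns the mini-batch gradient
    return w

def fedavg_round(w_global, client_models, p, eta=0.95):
    """Server-side FedAvg update of Eq. (4):
    w_{t+1} = w_t - eta * sum_i p_i * (w_t - w^{(i)}_{t+1}).
    With eta = 1 this reduces to simple model averaging, Eq. (5)."""
    delta = sum(p_i * (w_global - w_i) for p_i, w_i in zip(p, client_models))
    return w_global - eta * delta

# Toy usage: 5 "clients", each with a quadratic loss 0.5 * ||w - t_i||^2,
# so the per-batch gradient is simply w - t_i.
rng = np.random.default_rng(0)
targets = [rng.normal(size=4) for _ in range(5)]
grad_fns = [lambda w, batch, t=t: w - t for t in targets]

w = np.zeros(4)
for _round in range(20):                                       # communication rounds
    locals_ = [local_update(w, batches=range(3), grad_fn=g) for g in grad_fns]
    w = fedavg_round(w, locals_, p=[0.2] * 5, eta=0.95)        # equal client weights
```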
4. CLIENT ADAPTIVE FEDERATED TRAINING
The data provided by clients may come from different domains with different distributions. Since the clients update their models locally and do not directly communicate with each other, this non-iid issue poses a major challenge in FL. In ASR, canonicalization has been widely used to deal with data heterogeneity, for example that caused by different speakers and environments [15, 16]. In this work, we introduce CAFT to mitigate data heterogeneity across clients, extending speaker or cluster adaptive training to federated learning.

We introduce a transform $F_i$ of the data $x$ for each client $i$. The transform is estimated to minimize the local risk function given the global model $w$:

$$\hat{F}_i = \operatorname*{argmin}_{F_i} L(w, F_i) \triangleq \operatorname*{argmin}_{F_i} \mathbb{E}_{x \sim D_i}[f(w, F_i(x))] \qquad (6)$$

After the estimation, the transform $F_i$ is fixed and a local model $w^{(i)}$ is updated on the transformed data $F_i(x)$. In each round, $F_i$ and $w^{(i)}$ are alternately optimized. Since $F_i$ is estimated against the global model $w$, the transformed data $F_i(x)$ is expected to be more homogeneous across clients. In this work, the transform $F_i$ is chosen to be an affine transform

$$F_i(x) = A_i x + b_i, \quad x \sim D_i \qquad (7)$$

The recipe for CAFT is given in Algorithm 2; a code sketch of the client-side transform estimation follows Algorithm 2 below.
Algorithm 2 Client adaptive federated training

$L \leftarrow$ total number of clients; $M \leftarrow$ total number of epochs; $T \leftarrow$ total rounds of communication; $B \leftarrow$ batch size; $\alpha \leftarrow$ local learning rate; $\eta \leftarrow$ global learning rate

function SERVER
    Initialize model $w$
    for $m = 1, \cdots, M$ do    ▷ copy the model from the last epoch
        for $t = 1, \cdots, T$ do
            Send model $w_t$ to the clients
            for $i = 1, \cdots, L$ do    ▷ running in parallel
                $w^{(i)}_{t+1} \leftarrow$ CATCLIENT($i$, $t$)
            $w_{t+1} \leftarrow w_t - \eta \sum_{i=1}^{L} p_i \,(w_t - w^{(i)}_{t+1})$

function CATCLIENT($i$, $t$)
    Receive model $w_t$ from the server
    Get the samples $S^{(i)}_t$ for this communication round, where $|S^{(i)}_t| = n_i / T$
    Get the number of SGD steps $K_i = |S^{(i)}_t| / B$
    Estimate the client-dependent transform $F_i$ given $w_t$
    for $k = 1, \cdots, K_i$ do
        $w^{(i)}_{t,k+1} \leftarrow w^{(i)}_{t,k} - \alpha \sum_{j=1}^{B} \nabla f(w^{(i)}_{t,k}, F_i(x_j))$, $x_j \sim S^{(i)}_t$
    Return model $w^{(i)}_{t+1} = w^{(i)}_{t,K_i}$ to the server
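The client-side steps of Algorithm 2 can be sketched in PyTorch as follows: the affine transform $A_i$, $b_i$ of Eq. (7) is first estimated by SGD against the frozen global model (Eq. (6)), then a local copy of the model is updated on the canonicalized features. This is a hedged sketch: the model is treated as a generic per-frame classifier and the data loader is assumed to yield (features, HMM-state targets), whereas the paper's model is an unrolled bidirectional LSTM; the learning rates and momentum follow Section 5, and all function names are ours.

```python
import copy
import torch
import torch.nn.functional as F

def estimate_transform(global_model, loader, feat_dim=120, lr=0.02, momentum=0.9):
    # Estimate F_i(x) = A_i x + b_i (Eq. (7)) by minimizing the local loss
    # with the global model held fixed (Eq. (6)).
    A = torch.eye(feat_dim, requires_grad=True)    # A_i initialized to the identity
    b = torch.zeros(feat_dim, requires_grad=True)  # b_i initialized to zero
    opt = torch.optim.SGD([A, b], lr=lr, momentum=momentum)
    global_model.eval()
    for p in global_model.parameters():            # freeze the global model
        p.requires_grad_(False)
    for feats, targets in loader:                  # feats: (batch, feat_dim)
        opt.zero_grad()
        canon = feats @ A.T + b                    # canonicalized features F_i(x)
        loss = F.cross_entropy(global_model(canon), targets)
        loss.backward()                            # gradients flow only to A_i, b_i
        opt.step()
    return A.detach(), b.detach()

def caft_local_update(global_model, A, b, loader, lr=0.05, momentum=0.9):
    # Local model update on the transformed data F_i(x); the transform stays fixed.
    model = copy.deepcopy(global_model)
    for p in model.parameters():
        p.requires_grad_(True)
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for feats, targets in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(feats @ A.T + b), targets)
        loss.backward()
        opt.step()
    return model                                   # sent back to the server for FedAvg
```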
5. EXPERIMENTAL RESULTS
Experiments are conducted on 1,150 hours of speech data from five sources: 420 hours of Broadcast News (BN) data, 450 hours of internal dictation data, 100 hours of internal meeting data, 140 hours of hospitality (travel and hotel reservation) data, and 40 hours of accented data. This gives good coverage of public data (BN), internal data (dictation and meeting), and real-world client data (hospitality and accented). Each data source is treated as a client, giving rise to five clients in total. The data is wideband speech with a 16 kHz sampling rate. The acoustic models are evaluated on four test sets. To make the evaluation extensive yet controlled, the four selected test sets are taken from public data and real-world client data, as described in Table 1.

    Test Set    Description                             Hours
    S1          dev04f test set from Broadcast News     2.21
    S2          hospitality speech                       0.34
    S3          Asian-accented speech                    2.41
    S4          Latin-accented speech                    3.12

Table 1. Test sets used for evaluation.

The decoding vocabulary comprises 260K words and the language model (LM) is a 4-gram LM with 200M n-grams and modified Kneser-Ney smoothing. The LM training data is selected from a broad variety of sources. The acoustic model is a hybrid Long Short-Term Memory (LSTM) network with 5 bidirectional layers. Each layer has 512 cells, with 256 cells in each direction. A linear projection layer with 256 hidden units is inserted between the topmost LSTM layer and the softmax layer, which consists of 9,300 output units corresponding to context-dependent hidden Markov model (HMM) states. The LSTM is unrolled over 21 frames and trained with non-overlapping feature subsequences of that length. The input features are 40-dimensional log-Mel features with delta and double-delta features, giving a total input dimensionality of 120.
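As a rough illustration of the acoustic model just described, the PyTorch sketch below stacks 5 bidirectional LSTM layers (256 cells per direction), a 256-unit linear projection, and an output layer over 9,300 context-dependent HMM states, operating on 21-frame subsequences of 120-dimensional features. The class and argument names are ours, and details such as initialization, dropout, and the training loop are omitted.

```python
import torch
import torch.nn as nn

class HybridLSTMAcousticModel(nn.Module):
    """Sketch of the hybrid bidirectional LSTM acoustic model described above."""

    def __init__(self, feat_dim=120, cells_per_dir=256, layers=5, n_states=9300):
        super().__init__()
        # 5 bidirectional layers, 256 cells per direction (512 cells per layer)
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=cells_per_dir,
                            num_layers=layers, bidirectional=True, batch_first=True)
        # linear projection between the topmost LSTM layer and the softmax layer
        self.proj = nn.Linear(2 * cells_per_dir, 256)
        # output layer over context-dependent HMM states
        self.out = nn.Linear(256, n_states)

    def forward(self, feats):
        # feats: (batch, 21, 120) -- 21-frame subsequences of 40-dim log-Mel
        # features plus delta and double-delta (3 x 40 = 120 dimensions)
        h, _ = self.lstm(feats)
        return self.out(self.proj(h))   # per-frame logits over the 9,300 states

model = HybridLSTMAcousticModel()
logits = model(torch.randn(8, 21, 120))  # -> shape (8, 21, 9300)
```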
Table 2 shows the word error rates (WERs) of acoustic models trained by traditional centralized training and by federated training under various hyper-parameters such as global learning rates and communication rounds. The baseline is considered an oracle model as it is trained by mixing all the training data such that training is carried out on iid data. It is trained using SGD with a learning rate of 0.1, a momentum of 0.9, and a batch size of 256. The learning rate is annealed by $\frac{1}{\sqrt{2}}$ after the 10th epoch and the training finishes in 20 epochs. In federated training, the optimization of the local models for each client follows a recipe similar to the baseline except that the training uses local data. The SGD optimizer employs a learning rate of 0.2, a momentum of 0.9, and a batch size of 128. All clients are equally weighted (i.e., $p_i = 0.2$). When $\eta = 1.0$, FedAvg amounts to a simple model average. Setting the number of communication rounds involves a tradeoff: to be communication efficient, one wants to minimize the number of communication rounds, but this may degrade the performance of the final model. From the table, 20 rounds of communication gives the best WERs, while more rounds of model averaging may slightly hurt the performance. The best WERs are achieved after 20 rounds of communication when the global learning rate is 0.95.

    η          T     S1     S2     S3     S4     avg
    baseline         13.2   23.0   13.8   13.3   15.83

Table 2. WERs of the baseline and of federated training under various global learning rates and communication rounds. The baseline is trained following the traditional centralized recipe, which serves as an iid-data oracle for FL with non-iid data. $\eta = 1.0$ indicates simple model averaging.

The experimental results of CAFT are given in Table 3. The baseline is trained by first estimating the client-dependent transform and then pooling the canonicalized client data together for model optimization, which is analogous to cluster adaptive training [15] when dealing with cluster heterogeneity. Compared to the baseline in Table 2, the client adaptive training baseline improves the WERs on all the test sets. On average it reduces the WER from 15.83% to 14.50%. It is also treated as an oracle model for CAFT. The second block of Table 3 shows the WERs under various rounds of communication with a global learning rate of 0.95. The affine transform is $120 \times 120$ in dimension and is optimized using SGD with a learning rate of 0.02, a momentum of 0.9, and a batch size of 1,024. We find that a large batch size is helpful when estimating the client-dependent transform. The model is optimized using SGD with a learning rate of 0.05, a momentum of 0.9, and a batch size of 128. For both transform and model optimization, the learning rates are annealed by $\frac{1}{\sqrt{2}}$ after the 10th epoch and the training finishes in 20 epochs. From the table, it can be seen that using 10 rounds of communication gives an averaged WER of 15.92%. CAFT can be further improved by pre-training. Instead of starting from a randomly initialized global model across the clients to estimate the client-dependent transform, the global model is initialized with the well-trained FL model without canonicalization. In this case, we use the model from the 8th row of Table 2 (with $T = 20$ and an averaged WER of 16.33%). The learning rate is annealed by $\frac{1}{\sqrt{2}}$ after the 3rd epoch and the training finishes in 10 epochs. The results are given in the third block of Table 3 under CAFT-PT. It can be observed that the pre-training consistently improves the performance on all test sets. With 15 rounds of communication, the averaged WER is 15.13%, which is about 0.8% absolute better than without pre-training.
                   η      T     S1     S2     S3     S4     avg
    CAT baseline                13.0   22.4   11.5   11.1   14.50
    CAFT           0.95   10    13.7   23.9   13.2   12.9   15.92
    CAFT-PT

Table 3. WERs of the CAT baseline and CAFT under various settings. CAFT-PT presents results of CAFT starting from a globally pre-trained federated model. The CAT baseline serves as an oracle for CAFT.

Table 4 shows the WERs on the four test sets under various weighting strategies using CAFT-PT with $T = 15$. The first row uses equal weights on all five clients. In the second row, the weights are proportional to the amount of data of each client. Rows 3-5 represent strategies with a preference for a particular client. It can be observed that the ASR performance for a particular client is improved in most cases if its local model update is favored in the weighting during training.

    weight                                                    S1     S2     S3     S4
    $p_i = 1/L$
    $p_i = n_i/n$
    larger $p_{\rm accented}$, smaller $p_i$ for the rest     13.5   22.6   11.8   11.4
    larger $p_{\rm BN}$, smaller $p_i$ for the rest           13.4   23.2   12.8   12.5
    larger $p_{\rm hospitality}$, smaller $p_i$ for the rest  13.5   22.0   12.4   11.9

Table 4. WERs of CAFT with pre-training under various weighting strategies.
6. DISCUSSION
Cross-silo federated training of acoustic models is distributed by design under the fundamental requirements of local data storage and communication permissible only between clients and the service provider. Other than those requirements, the local clients have the flexibility of choosing the training strategy used to update the local model. For instance, the local $K$-step optimization can itself be realized in a distributed fashion across multiple nodes, whether synchronously or asynchronously. In addition, in this work we assume the communication between clients and the service provider is secure. Otherwise, channel encryption techniques or differential privacy [17, 18] may be applied to ensure data safety during training.

The investigated federated training strategy includes a special case where there is only one client and the service provider wants to collaboratively train the model with some public data or its own internal data. This is a common customized acoustic modeling scenario in the real world. In this case, $L = 2$ and the second client is the service provider itself, while at the same time its local server is the central server.

Despite the use of an affine transform in this work, in general the client-dependent transform $F$ can be any nonlinear function. At test time, we use the transform estimated from the training data, as we assume the training and test data are matched for a given client. The transform could be further adapted on the test sets, but this estimation is typically unsupervised and needs two-pass decoding, which would increase inference complexity and introduce latency.

Lastly, it is fairly common that the data from multiple clients is unbalanced. Strategic weighting in training is one way to deal with unbalanced data from clients, or to accommodate performance priorities the service provider may have in mind.
7. SUMMARY
In this paper we investigated cross-silo federated acoustic modeling to protect data privacy, where the ASR service provider collaboratively trains a global acoustic model across multiple clients. Each client has its own local data storage which is not shared with either the service provider or the other clients. Models are updated locally by each client before being sent back to the service provider's central server for aggregation. Various federated averaging schemes have been compared and the impact of the communication frequency has also been studied. To deal with the non-iid issue, client adaptive federated training is introduced, where a client-dependent transform is estimated to canonicalize the heterogeneous data among the clients. Experiments show that client adaptive federated training can effectively mitigate the data heterogeneity and consistently give improved performance on all the test sets.

8. REFERENCES

[1] General Data Protection Regulation (GDPR), https://gdpr-info.eu/.
[2] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon, "Federated learning: strategies for improving communication efficiency," in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[3] P. Kairouz et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[5] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[6] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau, "Federated learning for keyword spotting," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6341–6345.
[7] A. Hard, K. Partridge, C. Nguyen, N. Subrahmanya, A. Shah, P. Zhu, I. L. Moreno, and R. Mathews, "Training keyword spotting models on non-IID data with federated learning," in Interspeech, 2020.
[8] C. Tan, D. Jiang, H. Mo, J. Peng, Y. Tong, W. Zhao, C. Chen, R. Lian, Y. Song, and Q. Xu, "Federated acoustic model optimization for automatic speech recognition," in International Conference on Database Systems for Advanced Applications, 2020, pp. 771–774.
[9] D. Dimitriadis, K. Kumatani, R. Gmyr, Y. Gaur, and S. E. Eskimez, "A federated approach in training acoustic models," in Interspeech, 2020.
[10] X. Cui, W. Zhang, U. Finkler, G. Saon, M. Picheny, and D. Kung, "Distributed training of deep neural network acoustic models for automatic speech recognition: a comparison of current training strategies," IEEE Signal Processing Magazine, pp. 39–49, May 2020.
[11] W. Zhang, X. Cui, U. Finkler, B. Kingsbury, G. Saon, D. Kung, and M. Picheny, "Distributed deep learning strategies for automatic speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[12] K. Chen and Q. Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5880–5884.
[13] Y. Huang, J. Tian, L. Han, G. Wang, X. Song, D. Su, and D. Yu, "A random gossip BMUF process for neural language modeling," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7959–7963.
[14] F. Zhou and G. Cong, "On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[15] M. J. F. Gales, "Cluster adaptive training of hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 417–428, 2000.
[16] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[17] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, 2014.
[18] M. Naseri, J. Hayes, and E. De Cristofaro, "Toward robustness and privacy in federated learning: experimenting with local and central differential privacy," arXiv preprint arXiv:2009.03561.