Effects of Layer Freezing when Transferring DeepSpeech to New Languages
Onno Eberhard and Torsten Zesch
Language Technology Lab, University of Duisburg-Essen
[email protected]

Abstract
In this paper, we train Mozilla’s DeepSpeech architecture on German and Swiss German speech datasets and compare the results of different training methods. We first train the models from scratch on both languages and then improve upon the results by using an English pretrained version of DeepSpeech for weight initialization and experiment with the effects of freezing different layers during training. We see that even freezing only one layer already improves the results dramatically.
The field of automatic speech recognition (ASR) is dominated by research specific to the English language. There exist plenty of available speech-to-text models pretrained on (and optimized for) English data. When it comes to a low-resource language like Swiss German, or even standard German, the range of available pretrained models becomes very sparse. In this paper, we train Mozilla’s implementation of Baidu’s DeepSpeech ASR architecture (Hannun et al. 2014; https://github.com/mozilla/DeepSpeech) on these two languages. We use transfer learning to leverage the availability of a pretrained English version of DeepSpeech and observe the difference made by freezing different numbers of layers during training.

For previous work on using DeepSpeech for the two languages German and Swiss German, see Agarwal and Zesch (2019) and Agarwal and Zesch (2020), respectively. Note, however, that our datasets and training methods are not identical to those used there. Our focus here lies on isolating the effect of layer freezing in the given context.

Deep neural networks can excel at many different tasks, but they often require very large amounts of training data and computational resources. To remedy this, it is often advantageous to employ transfer learning: instead of initializing the parameters of the network randomly, the optimized parameters of a network trained on a similar task are reused. Those parameters can then be fine-tuned to the specific task at hand, using less data and fewer computational resources. In the fine-tuning process many parameters of the original model may be “frozen”, i.e. held constant during training. This can speed up training, as well as decrease the computational resources used during training (Kunze et al. 2017). The idea of taking deep neural networks trained on large datasets and fine-tuning them on tasks with less available training data has been popular in computer vision for years (Huh, Agrawal, and Efros 2016). More recently, with the emergence of end-to-end deep neural networks for automatic speech recognition (like DeepSpeech), it has also been used in this area (Kunze et al. 2017; Li, Wang, and Beigi 2019).

The reason why the freezing of parameters for fine-tuning deep neural networks is so successful is that the networks learn representations of the input data in a hierarchical manner. The input is transformed into simple features in the first layers of a neural network and into more complex features in the layers closer to the output. With networks for image classification this can be nicely visualized (Zeiler and Fergus 2014). As for automatic speech recognition, the representations learned by the layers of a system similar to the one we used, one that is also based on Baidu’s DeepSpeech architecture, have been analyzed by Belinkov and Glass (2017). Their findings show that the hierarchical structure of features learned by DeepSpeech is not as clear as it is with networks for image processing. Nonetheless, some findings, for example that affricates are better represented at later layers in the network, seem to affirm the hypothesis that the later layers learn more abstract features and earlier layers learn more primitive features. This is important for fine-tuning, because it only makes sense to freeze parameters if they don’t need to be adjusted for the new task. If it is known that the first layers of a network learn to identify “lower-level” features, i.e.
simple shapes in the context of image processing or simple sounds in the context of ASR, these layers can be frozen completely during fine-tuning.

The DeepSpeech network takes features extracted from raw audio data as input and outputs character probabilities (the architecture is described in more detail in the next section). With the reasoning from above, the first few layers should mostly obtain simple features, such as phonemes, from the input, while the later layers should mostly infer the character corresponding to these lower-level features. The rationale for using transfer learning to transfer from one language to another is the assumption that these lower-level features are shared across different languages. Thus, only the parameters of the later layers need to be adjusted for successfully training the network on a new language. Whether this assumption works in practice, and how much use freezing the layers actually is, will be the focus of this paper. We train the English pretrained version of DeepSpeech on German and on Swiss German data and observe the impact of freezing fewer or more layers during training.

We use Mozilla’s DeepSpeech version 0.7 for our experiments. The implementation differs in many ways from the original model presented by Hannun et al. (2014). The architecture is described in detail in the official documentation (https://deepspeech.readthedocs.io/en/latest/DeepSpeech.html) and is depicted in Figure 1. From the raw speech data, Mel-Frequency Cepstral Coefficients (Imai 1983) are extracted and passed to a 6-layer deep recurrent neural network. The first three layers are fully connected with a ReLU activation function. The fourth layer is a Long Short-Term Memory unit (Hochreiter and Schmidhuber 1997); the fifth layer is again fully connected and ReLU activated. The last layer outputs probabilities for each character in the language’s alphabet. It is fully connected and uses a softmax activation for normalization. The character probabilities are used to calculate a Connectionist Temporal Classification (CTC) loss function (Graves et al. 2006). The weights of the model are optimized using the Adam method (Kingma and Ba 2014) with respect to the CTC loss.

To assess the effects of layer freezing, we train the network multiple times for each of the two languages. For weight initialization we use an English pretrained model, which is provided by Mozilla (https://github.com/mozilla/DeepSpeech/releases). We then freeze between 0 and 4 layers during training. For both languages we also train one model from scratch, where the weights are initialized randomly. In total, we train six different models for each language:
Reference: The whole model trained from scratch (random weight initialization)
0 Frozen Layers: The model with weights initialized to those of the English pretrained model; all weights are optimized during training
1 Frozen Layer: The English-initialized model with the first layer frozen
2 Frozen Layers: The English-initialized model with the first two layers frozen
3 Frozen Layers: The English-initialized model with the first three layers frozen
4 Frozen Layers: The English-initialized model with the first three and the fifth layer frozen

The complete training script, as well as the modified versions of DeepSpeech that utilize layer freezing, are available online (https://github.com/onnoeberhard/deepspeech-paper). The weights were frozen by adding trainable=False at the appropriate places in the TensorFlow code. For all models, we had to reinitialize the last layer, because the German alphabet (with characters such as ä, ö and ü) differs from the English one and thus changes the size of the output layer.

In training each model, we used a batch size of 24, a learning rate of 0.0005 and a dropout rate of 0.4. We did not perform any hyperparameter optimization. The training was done on a Linux machine with 96 Intel Xeon Platinum 8160 CPUs @ 2.10GHz, 256 GB of memory and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. Training the German language models for 30 epochs took approximately one hour per model. Training the Swiss German models took about 4 hours for 30 epochs on each model. We did not observe a correlation between training time and the number of frozen layers.
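To make the freezing concrete, the following is a minimal sketch (written with tf.keras, not Mozilla's actual training code) of a six-layer DeepSpeech-style network in which the first layers are held constant via trainable=False; the layer width, alphabet size and all names are illustrative assumptions rather than the values used in our experiments.

import tensorflow as tf

def build_model(n_features=26, n_hidden=2048, alphabet_size=29, n_frozen=3):
    # Six layers, mirroring the structure described above; widths and the
    # alphabet size are placeholder assumptions, not Mozilla's exact values.
    layers = [
        tf.keras.layers.Dense(n_hidden, activation="relu", name="layer_1"),
        tf.keras.layers.Dense(n_hidden, activation="relu", name="layer_2"),
        tf.keras.layers.Dense(n_hidden, activation="relu", name="layer_3"),
        tf.keras.layers.LSTM(n_hidden, return_sequences=True, name="layer_4_lstm"),
        tf.keras.layers.Dense(n_hidden, activation="relu", name="layer_5"),
        tf.keras.layers.Dense(alphabet_size + 1, activation="softmax",
                              name="layer_6_output"),  # +1 for the CTC blank
    ]
    # Freezing: the first n_frozen layers keep their (English-pretrained)
    # weights and receive no gradient updates during fine-tuning.
    for layer in layers[:n_frozen]:
        layer.trainable = False

    inputs = tf.keras.Input(shape=(None, n_features))  # one MFCC vector per frame
    outputs = inputs
    for layer in layers:
        outputs = layer(outputs)
    return tf.keras.Model(inputs, outputs)

model = build_model(n_frozen=3)
model.summary()  # frozen layers appear under "Non-trainable params"
# Training would minimize a CTC loss (e.g. tf.nn.ctc_loss) with the Adam
# optimizer, as in the actual DeepSpeech implementation.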
We trained the German models on the German-language Mozilla Common Voice speech dataset (Ardila et al. 2020). The utterances are typically between 3 and 5 seconds long and are collected from and reviewed by volunteers. Because of this, the dataset comprises a large number of different speakers, which makes it rather noisy. The Swiss German models were trained on the data provided by Plüss, Neukom, and Vogel (2020). This speech data was collected from speeches at the Bernese parliament. The English pretrained model was trained by Mozilla on a combination of English speech datasets, including LibriSpeech and Common Voice English (see https://github.com/mozilla/DeepSpeech/releases/tag/v0.7.0 for more detail). The datasets for all three languages are described in Table 1. For inference and testing we used the language model KenLM (Heafield 2011), trained on the corpus described by Radeck-Arneth et al. (2015, Section 3.2). This corpus consists of a mixture of text from the sources Wikipedia and Europarl as well as crawled sentences. The whole corpus was preprocessed with MaryTTS (Schröder and Trouvain 2003).

Table 1: Hours of data and number of speakers for the English, German and Swiss German datasets.

The test results for both languages from the six different models described in Section 3.2 are compiled in Tables 2 and 3. For testing, the epoch with the best validation loss during training was taken for each model. Figures 2 to 5 show the learning curves for all training procedures (Fig. 2 and 3 for German, Fig. 4 and 5 for Swiss German). The curve of the best model (3 frozen layers for German, 2 frozen layers for Swiss German) is shown in both plots for each language. The epochs used for testing are also marked in the figures.

For both languages, the best results were achieved by the models with the first two to three layers frozen during training. It is notable, however, that the other models that utilize layer freezing are not far off. The training curves look remarkably similar (see Figures 3 and 5). For both languages, all four models achieve much better results than the two models without layer freezing (“Reference” and “0 Frozen Layers”). The results seem to indicate that freezing the first layer brings the largest advantage in training, with additional frozen layers yielding only small further improvements.
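As a side note on how the language model mentioned above enters the picture, the following sketch uses the kenlm Python bindings to pick the candidate transcription with the highest language-model probability. It is purely illustrative: the model path is a placeholder, and DeepSpeech applies the language model inside its CTC beam-search decoder rather than in this post-hoc fashion.

# Illustrative only: re-rank candidate transcriptions with a KenLM model.
# "german_lm.binary" is a placeholder path for a trained KenLM model.
import kenlm

lm = kenlm.Model("german_lm.binary")
candidates = ["das ist ein test", "das ist ein text"]
# score() returns the log10 probability of the sentence (with sentence
# boundary markers when bos/eos are set).
best = max(candidates, key=lambda s: lm.score(s, bos=True, eos=True))
print(best)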
Figure 2: Learning curves (German dataset): With and without transfer learning and layer freezing
Figure 3: Learning curves (German dataset): Comparison of freezing a different number of layers
Figure 4: Learning curves (Swiss German dataset): With and without transfer learning and layer freezing
Figure 5: Learning curves (Swiss German dataset): Comparison of freezing a different number of layers

Table 2: Test results for the German models (word error rate and character error rate)

Method           WER   CER
Reference        .70   .42
0 Frozen Layers  .63   .37
1 Frozen Layer   .48   .26
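For clarity, word error rate (WER) and character error rate (CER) are edit distances between the recognized and the reference transcription, normalized by the length of the reference in words or characters respectively. The short sketch below computes both metrics; it is an illustration, not the evaluation code built into DeepSpeech.

def _edit_distance(ref, hyp):
    # Levenshtein distance between two sequences (single-row dynamic program).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return _edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return _edit_distance(reference, hypothesis) / len(reference)

print(wer("das ist ein test", "das ist kein test"))  # 0.25
print(cer("das ist ein test", "das ist kein test"))  # 0.0625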
A next step might be to train these models with more training data and see if layer freezing is still beneficial. The chosen German speech dataset is not very large; Agarwal and Zesch (2019) achieved a best result of 0.151 WER when training the model on a large dataset, in contrast to a result of 0.797 WER when training the same model on a very similar dataset to the one we used.

An interesting idea for further research is to use a different pretrained model than the English one. English seems to work reasonably well for transferring to German, but it is possible that the lower-level language features extracted by a model only trained for recognizing English speech are not sufficient for transferring to certain other languages. For example, when just transcribing speech there is no need for such a model to learn intonation features. This might be a problem when trying to transfer such a pretrained model to a tonal language like Mandarin or Thai. There might also be phonemes that don’t exist or are very rare in English but are abundant in other languages.
Transfer learning seems to be a powerful approach to train an automatic speech recognition system on a small dataset. The effects we saw when transferring DeepSpeech from English to German and from English to Swiss German were very similar: the results were not necessarily better than plain training when just initializing the parameters, but freezing only the first layer already improved the results dramatically. Freezing more layers improved the outcome even more, but with larger training datasets this might have adverse effects.
We want to thank Aashish Agarwal for valuable help in setting up DeepSpeech and for providing preprocessing scripts as well as the hyperparameters we used for training.
References
Agarwal, Aashish and Torsten Zesch (2019). “German End-to-end Speech Recognition based on DeepSpeech”. In: Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers. Erlangen, Germany: German Society for Computational Linguistics & Language Technology, pp. 111–119.

— (2020). LTL-UDE at Low-Resource Speech-to-Text Shared Task: Investigating Mozilla DeepSpeech in a low-resource setting.

Ardila, Rosana et al. (2020). “Common Voice: A Massively-Multilingual Speech Corpus”. In: Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11–16, 2020. European Language Resources Association, pp. 4218–4222.

Belinkov, Yonatan and James Glass (2017). “Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems”. In: Advances in Neural Information Processing Systems. Vol. 30, pp. 2441–2451.

Graves, Alex et al. (2006). “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks”. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.

Hannun, Awni et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint.

Heafield, Kenneth (2011). “KenLM: Faster and Smaller Language Model Queries”. In: Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, pp. 187–197.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural Computation 9.8, pp. 1735–1780.

Huh, Minyoung, Pulkit Agrawal, and Alexei A. Efros (2016). What makes ImageNet good for transfer learning? arXiv preprint.

Imai, Satoshi (1983). “Cepstral analysis synthesis on the mel frequency scale”. In: ICASSP ’83. IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 8. IEEE, pp. 93–96.

Kingma, Diederik P. and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv preprint.

Kunze, Julius et al. (2017). “Transfer Learning for Speech Recognition on a Budget”. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017. Association for Computational Linguistics, pp. 168–177.

Li, Bryan, Xinyue Wang, and Homayoon S. M. Beigi (2019). “Cantonese Automatic Speech Recognition Using Transfer Learning from Mandarin”. In: CoRR.

Plüss, Michel, Lukas Neukom, and Manfred Vogel (2020). Germeval 2020 task 4: Low-resource speech-to-text.

Radeck-Arneth, Stephan et al. (2015). “Open Source German Distant Speech Recognition: Corpus and Acoustic Model”. In: Proceedings Text, Speech and Dialogue (TSD). Pilsen, Czech Republic, pp. 480–488.

Schröder, Marc and Jürgen Trouvain (2003). “The German text-to-speech synthesis system MARY: A tool for research, development and teaching”. In: International Journal of Speech Technology.

Zeiler, Matthew D. and Rob Fergus (2014). “Visualizing and Understanding Convolutional Networks”. In: Computer Vision – ECCV 2014. Springer.