Improving Speaker-Independent Lipreading with Domain-Adversarial Training
Michael Wand, Jürgen Schmidhuber
The Swiss AI Lab IDSIA, USI & SUPSI, Manno-Lugano, Switzerland [email protected], [email protected]
Abstract
We present a lipreading system, i.e. a speech recognition system using only visual features, which uses domain-adversarial training for speaker independence. Domain-adversarial training is integrated into the optimization of a lipreader based on a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural networks, yielding an end-to-end trainable system which requires only a very small number of frames of untranscribed target data to substantially improve the recognition accuracy on the target speaker. On pairs of different source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech data. On multi-speaker training setups, the accuracy improvements are smaller but still substantial.
Index Terms: Lipreading, Deep Neural Networks, Long Short-Term Memory, Domain Adaptation
1. Introduction
Lipreading is the process of understanding speech by using solely visual features, i.e. images of the lips of a speaker. In communication between humans, lipreading has a twofold relevance [1]: First, visual cues play a role in spoken conversation [2]; second, hearing-impaired persons may use lipreading as a means to follow verbal speech.

With the success of computer-based speech recognition over the past decades, automatic lipreading has become an active field of research as well, with pioneering work by Petajan [3], who used lipreading to augment conventional acoustic speech recognition, and Chiou and Hwang [4], who were the first to perform lipreading without resorting to any acoustic signal at all. Since 2014, lipreading systems have systematically begun to use neural networks as part of the processing pipeline [5, 6] or for end-to-end training [7, 8, 9]. In our previous work [7], we proposed a fully neural network based system, using a stack of fully connected and recurrent (LSTM, Long Short-Term Memory) [10, 11] neural network layers.

The scope of this paper is the introduction of state-of-the-art methods for speaker-independent lipreading with neural networks. We evaluate our established system [7] in a cross-speaker setting, observing a drastic performance drop on unknown speakers. In order to alleviate the discrepancy between training speakers and unknown test speakers, we use domain-adversarial training as proposed by Ganin and Lempitsky [12]:
Untranscribed data from the target speaker is used as additional training input to the neural network, with the aim of pushing the network to learn an intermediate data representation which is domain-agnostic, i.e. which does not depend on whether the input data comes from a source speaker or a target speaker. We evaluate our system on a subset of the GRID corpus [13], which contains extensive data from 34 speakers and is therefore ideal for a systematic evaluation of the proposed method.
2. Related work
Lipreading can be used to complement or augment speech recognition, particularly in the presence of noise [3, 14], and for purely visual speech recognition [4, 15, 5]. In the latter case, ambiguities due to incomplete information (e.g. about voicing) can be mitigated by augmenting the video stream with ultrasound images of the vocal tract [16]. Visual speech processing is an instance of a Silent Speech interface [17]; further promising approaches include capturing the movement of the articulators by electric or permanent magnetic articulography [18, 19], and capturing muscle activity using electromyography [20, 21, 22, 23].

Versatile lipreading features have been proposed, such as Active Appearance Models [24], Local Binary Patterns [25], and the PCA-based Eigenlips [26] and Eigentongues [27]. For tackling speaker dependency, diverse scaling and normalization techniques have been employed [28, 29]. Classification is often done with Hidden Markov Models (HMMs), e.g. [30, 15, 31, 32]. Mouth tracking is done as a preprocessing step [32, 15, 5]. For a comprehensive review see [33].

Neural networks were applied to the lipreading task early on [34]; however, they have become widespread only in recent years, with the advent of state-of-the-art learning techniques (and the necessary hardware). The first deep neural network for lipreading was a seven-layer convolutional net used as a preprocessing stage for an HMM-based word recognizer [5]. Since then, several end-to-end trainable systems have been presented [7, 8, 9]. The current state-of-the-art accuracy on the GRID corpus is 3.3% error [9], obtained using a very large set of additional training data, so this result is not directly comparable to ours.

In domain adaptation, it is assumed that a learning task exhibits a domain shift between the training (or source) and test (or target) data. This can be mitigated in several ways [35]; we apply domain-adversarial training [12], where an intermediate layer in a multi-layer network is driven to learn a representation of the input data which is optimized to be domain-agnostic, i.e. to make it difficult to detect whether an input sample is from the source or the target domain. A great advantage of this approach is the end-to-end trainability of the entire system. For a summary of further approaches to domain adaptation with neural networks, we refer to the excellent overview in [12].
3. Data and preprocessing
We follow the data preprocessing protocol from [7]. We use the GRID corpus [13], which consists of video and audio recordings of 34 speakers (which we name s1 to s34) saying 1000 sentences each. All sentences have a fixed structure: command(4) + color(4) + preposition(4) + letter(25) + digit(10) + adverb(4), for example "Place red at J 2, please", where the number of alternative words is given in parentheses. There are 51 distinct words; alternatives are randomly distributed so that context cannot be used for classification. Each sentence has a length of 3 seconds at 25 frames per second, so the total data per speaker is 3000 seconds (50 minutes). Using the annotations contained in the corpus, we segmented all videos at word level, yielding 6000 word samples per speaker.

Figure 1: Two randomly chosen example frames from the GRID corpus with highlighted mouth area.

We experiment on speakers s1–s19: speakers s1–s9 form the development speakers, used to determine optimal parameters; speakers s10–s19 are the evaluation speakers, held back until the final evaluation of the systems. The data from each speaker was randomly subdivided into training, validation, and test sets, where the latter two contain five samples of each word, i.e. a total of 5 · 51 = 255 samples each. The training data is consequently highly unbalanced: For example, each letter from "a" to "z" appears 30 times, whereas each color appears 240 times.

We converted the "normal" quality videos (360 × 288 pixels) to greyscale and extracted 40 × 40 pixel windows containing the mouth area, as described in [7]. The frames were contrast-normalized and z-normalized over the training set, independently for each speaker. Unreadable videos were discarded.

All experiments have one dedicated target speaker on which the experiment is evaluated, and one, four, or eight source speakers on which supervised training is performed. Speakers are chosen consecutively; for example, the experiments with four training speakers on the development data are (s1 . . . s4) → s5, (s2 . . . s5) → s6, . . . , (s9, s1, s2, s3) → s4, where → separates source and target speakers. We also compute baseline results on single speakers. The data sets of each speaker are used as follows: Training data is used for supervised training (on the source speakers) and unsupervised adaptation (on the target speaker). Validation data is used for early stopping; the network is evaluated on the test data.
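To make the per-speaker normalization and the consecutive speaker rotation concrete, the following is a minimal Python sketch; the function names and the epsilon guard are our own illustrative choices, not taken from the paper's implementation.

```python
import numpy as np

def z_normalize(frames, mean=None, std=None):
    """Z-normalize grayscale mouth frames. Statistics are computed on the
    training set of each speaker and then reused for that speaker's
    validation and test data."""
    if mean is None:
        mean, std = frames.mean(), frames.std()
    return (frames - mean) / (std + 1e-8), mean, std

def speaker_rotations(speakers, n_sources):
    """Yield (source_speakers, target_speaker) pairs by consecutive
    rotation, e.g. (s1..s4) -> s5, (s2..s5) -> s6, ..., (s9,s1,s2,s3) -> s4."""
    n = len(speakers)
    for i in range(n):
        yield ([speakers[(i + k) % n] for k in range(n_sources)],
               speakers[(i + n_sources) % n])

# The nine four-source experiments on the development speakers:
for sources, target in speaker_rotations([f"s{i}" for i in range(1, 10)], 4):
    print(sources, "->", target)
```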
4. Methods and System Setup
The system is based on the lipreading setup from [7], reimplemented in Tensorflow [36]. Raw 40 × 40 lip images are used as input data, without any further preprocessing except normalization. We stack several fully connected feedforward layers, optionally followed by Dropout [37], and one LSTM recurrent layer to form a network which is capable of recognizing sequential video data. The final layer is a softmax with 51 word targets. All inner layers use a tanh nonlinearity. During testing, classification is performed on the last frame of an input word; the softmax output on all previous frames is discarded. Similarly, during training, an error signal is backpropagated (through time and through the stack of layers) only from the last frame of each training word sample.

Optimization is performed by minimizing the multi-class cross-entropy using stochastic gradient descent, applying Tensorflow's MomentumOptimizer with a momentum of 0.5, a learning rate of 0.001, and a batch size of 8 sequences. The network weights are initialized following a truncated normal distribution with a standard deviation of 0.1. In order to compensate for the unbalanced training set, each training sample is weighted with a factor inversely proportional to its frequency. Early stopping (with a patience of 30 epochs) is performed on the validation data of the source speakers.

Figure 2: Optimal network topology for adversarial training, with a common part (top), word classifier (bottom left, only on source speakers), and speaker classifier (bottom right). Note that the gradient of the speaker classifier is inverted, and that the contribution of the adversarial network is configurable.
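As an illustration of this setup, here is a hedged tf.keras sketch of the best topology from the baseline experiments (FC256-DP-FC256-DP-FC256-DP-LSTM256, per the caption of table 2); the original system was written against the TensorFlow API directly, so this is a functional approximation rather than the authors' code.

```python
import tensorflow as tf

def build_word_classifier(num_words=51, frame_dim=40 * 40):
    """FC256-DP-FC256-DP-FC256-DP-LSTM256 word classifier. The Dense
    layers are applied framewise; the LSTM returns only its last-frame
    output, so the loss (and gradient) comes from the last frame only."""
    init = tf.keras.initializers.TruncatedNormal(stddev=0.1)
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, frame_dim)),  # variable-length clips
        tf.keras.layers.Dense(256, activation="tanh", kernel_initializer=init),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(256, activation="tanh", kernel_initializer=init),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(256, activation="tanh", kernel_initializer=init),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(256),  # return_sequences=False: last frame only
        tf.keras.layers.Dense(num_words, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model
```

The inverse-frequency sample weighting described above could then be supplied, for instance, through the `class_weight` argument of `model.fit`.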
Adversarial training [12] is integrated as follows. At the second feedforward layer, we attach a further network which performs framewise speaker classification on source and target speakers. For this purpose, each training batch of 8 word sequences is augmented by eight additional word sequences from the target speaker, for which no word label is used, and no gradient is backpropagated from the word classifier. On the extended batch of 16 sequences, the "adversarial" network performs framewise speaker classification. This network follows a standard pattern (two feedforward layers with 100 neurons each plus a softmax layer with 2, 5, or 9 speaker outputs) and is trained jointly with the word classifier, with a configurable weight. If there are more word sequences from the source speaker(s) than from the target speaker, target sequences are repeated.

So far, this describes a joint classifier for two different tasks (speaker and word classification), resembling Caruana's multitask training [38]. The power of the adversarial network comes from a simple twist: The backpropagated gradient from the adversarial network is inverted where it is fed into the main branch of the network, causing the lower branch to perform gradient ascent instead of descent. Since the speaker classification part of the system learns to classify speakers, the inverted gradient fed into the "branching" layer causes the joint part of the network to learn to confuse speakers instead of separating them. The speaker classifier and the joint network work toward opposite objectives (hence, "adversarial"), an idea first presented in the context of factorial codes [39]. Figure 2 shows a graphical overview of the system: The joint part is at the top; at the bottom are the word classifier (left) and the speaker classifier (right).
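The gradient inversion itself can be expressed as a tiny "flip" function. Below is a minimal sketch using the stop-gradient identity trick, one standard way to realize a gradient-reversal layer in TensorFlow; the paper does not specify its exact implementation, and `lam` here denotes the configurable adversarial weight.

```python
import tensorflow as tf

def gradient_reversal(x, lam):
    """Identity in the forward pass; multiplies the backpropagated
    gradient by -lam in the backward pass."""
    # Forward value: (1 + lam) * x - lam * x == x.
    # Backward: only the -lam * x term is differentiable, because
    # tf.stop_gradient blocks gradients through the first term.
    return (1.0 + lam) * tf.stop_gradient(x) - lam * x

def speaker_branch(shared, num_speakers, lam):
    """Adversarial speaker classifier attached to the shared layer:
    two 100-unit tanh layers plus a softmax over 2, 5, or 9 speakers,
    applied framewise to the (batch, time, features) tensor."""
    h = gradient_reversal(shared, lam)
    h = tf.keras.layers.Dense(100, activation="tanh")(h)
    h = tf.keras.layers.Dense(100, activation="tanh")(h)
    return tf.keras.layers.Dense(num_speakers, activation="softmax")(h)
```

Training the speaker branch normally while this flip feeds a negated gradient into the shared layers is exactly what pushes the common representation to become speaker-agnostic.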
Table 1: Baseline word accuracies on single speakers, averaged over the development set, with standard deviation. Layer types are FC (fully connected feedforward), DP (Dropout), and LSTM, followed by the number of neurons/cells. ∗ marks the (reimplemented and recomputed) best system from [7]. (Columns: Network, Training acc., Test acc.; among the listed topologies is FC128-LSTM128-LSTM128 ∗; the numeric entries were not recovered.)
5. Experiments and Results
The first experiment deals with establishing a baseline for our experiments, building on prior work [7]. We run the lipreader as a single-speaker system with different topologies, optionally using Dropout (always with 50% dropout ratio) to avoid overfitting the training set. Adversarial training is not used (i.e. the weight in figure 2 is set to zero). Table 1 shows the resulting test set accuracies averaged over the development speakers. Without using Dropout, the accuracy on the test set is markedly lower; the difference is statistically significant.

The accuracies in a cross-speaker setting, again on the development speakers, are given in table 2. The accuracy decreases drastically, in particular when only one source speaker is used for training: On an unknown target speaker, the system achieves only an average 13.5% accuracy. The situation is clearly better when training data from multiple speakers is used, but even for eight training speakers, the average accuracy on an unknown speaker is only 37.8%. We also note that the test accuracy on the source speakers does not rise when data from multiple speakers is used, even though there is more training data. It appears that the additional data does not "help" the system to improve its performance. On an unknown speaker, however, training data from multiple speakers does improve performance; very probably the system learns to be more speaker-agnostic. A similar observation with a very different input signal was reported in [41].

Clearly, lipreading across different speakers is a challenging problem. In the remainder of this paper, we show how domain-adversarial training helps to tackle this challenge.

Table 2: Baseline word accuracies on training across speakers without adaptation by domain-adversarial training, averaged over the development set, with standard deviation. The best network from table 1 (FC256-DP-FC256-DP-FC256-DP-LSTM256) was used. (Columns: Number of training spk, Source spk Train acc., Source spk Test acc., Target spk Test acc.; the numeric entries were not recovered.)
We now augment the baseline word classification network with adversarial training as described in section 4, thus making full use of the system shown in figure 2. For now, we use all sequences from the training set of the target speaker. As suggested in [12], we found it beneficial to gradually activate adversarial training: the weight of the adversarial part is set to zero at the beginning; every 10 epochs, it is raised by 0.2 until the maximum value of 1.0 is reached at epoch 50. The results of this experiment are shown in the upper two blocks of table 3, where it can be seen that adversarial training causes a substantial accuracy improvement, particularly with only one source speaker: In this case, the accuracy rises by more than 40% relative, from 13.5% to 19.2%. In the case of four or eight source speakers, the accuracy improves by 13.1% and 12.2% relative, respectively. We tuned this system using various topologies for the adversarial part, as well as different weight schedules for adversarial training, finding rather consistent behavior. The only setting which is emphatically discouraged is starting with an adversarial weight greater than zero. See section 6 for further analysis.
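For reference, the gradual activation described above amounts to a simple step schedule; the following one-liner is our paraphrase of the text, not code from the paper.

```python
def adversarial_weight(epoch):
    """Adversarial weight schedule: 0 at the start, raised by 0.2 every
    10 epochs, reaching the maximum of 1.0 at epoch 50."""
    return min(1.0, 0.2 * (epoch // 10))

assert adversarial_weight(0) == 0.0   # adversarial part inactive at first
assert adversarial_weight(10) == 0.2  # first activation step
assert adversarial_weight(50) == 1.0  # full weight from epoch 50 on
```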
While the presented system does not require supervised training data from the target speaker, we still use the entire training set of the target speaker. In practical applications, even unsupervised training data may only be sparsely available, so this setup is somewhat undesirable. Since the content of the target training sequences is irrelevant for the adversarial training, we may hypothesize that we could also do with a much smaller set of target training data. So as a final experiment, we reduce the number of training sequences for the target speaker. The training protocol remains as before; in particular, training is always performed on the full set of source sequences, and target sequences are repeated as necessary.

Table 3: Word accuracies and standard deviations for systems with adversarial training on all target sequences or a subset of 50 target sequences, on the development speakers. (Columns: Adversarial training on, Number of training spk, Target test acc., Relative improvement. Only the first row survives extraction: no adversarial training, one training speaker, 13.5% test accuracy; the key entries are quoted in the text.)
Figure 3: Accuracies with and without adversarial training on pairs of one source and one target speaker, on the development set. The compared conditions are: no adversarial training, adversarial training on all target words, and adversarial training on 50 target words.
The original number of 5490 target training sequences can be reduced to 50 sequences without a substantial loss of accuracy; this amounts to only 15–20 seconds of untranscribed target data. Results are shown in the lower block of table 3: For example, in the case of a single source speaker, the target accuracy drops to 18.9% instead of 19.2%. The improvement is lower when more source speakers are used. We hypothesize that this stems from the growing ratio between the number of source sequences and the number of target sequences.

Finally, figure 3 shows an accuracy breakdown for speaker pairs, i.e. for single-speaker supervised training. In eight out of nine cases, domain-adversarial training clearly outperforms the baseline system, often by a substantial margin. We also observe that the accuracy gain depends very much on the speaker pair.
We evaluate our result on the evaluation speakers, i.e. speakers s10–s19 from the GRID corpus. The hypothesis to be tested states that adversarial training improves the accuracy of the cross-speaker lipreader trained on one, four, or eight source speakers, using either all target sequences or 50 target sequences. We use the one-tailed t-test with paired samples for evaluation.

Table 4 shows the resulting accuracies, relative improvements, and p-values. Improvements are significant in all cases in which the entire target speaker data is used. For 50 target sequences, significance can be ascertained only in the case of a single source speaker, but we always get some improvement. We finally note that when applying such a system in practice, untranscribed data is accrued continuously, so the quality of the system on the target speaker could be improved continuously as well, without requiring any extra data collection.

Table 4: Word accuracies, relative improvements, and p-values for systems with adversarial training, on the evaluation speakers. Significant results are marked with ∗.
Adversarial training on   Number of training spk   Target test acc.   Relative improvement   p-value
None                      1                        18.7%              -                      -
None                      4                        39.4%              -                      -
None                      8                        46.5%              -                      -
All target sequences      1                        25.4%              35.8%                  0.0030 ∗
50 target sequences       1                        24.1%              28.9%                  0.0045 ∗
(the remaining rows for four and eight source speakers were not recovered)
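A one-tailed paired t-test as used here can be computed, for instance, with SciPy; the accuracy vectors below are hypothetical placeholder values, not the paper's per-speaker numbers.

```python
from scipy import stats

# Per-target-speaker test accuracies (hypothetical placeholder values).
baseline = [0.17, 0.21, 0.15, 0.19, 0.22, 0.16, 0.18, 0.20, 0.19, 0.20]
adversarial = [0.24, 0.27, 0.20, 0.26, 0.28, 0.22, 0.25, 0.26, 0.24, 0.27]

# ttest_rel is two-sided; for the one-tailed hypothesis "adversarial is
# better", halve the p-value when the t statistic has the right sign.
t, p_two_sided = stats.ttest_rel(adversarial, baseline)
p_one_tailed = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(f"t = {t:.3f}, one-tailed p = {p_one_tailed:.4f}")
```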
Figure 4: Accuracy vs. epoch on different data sets (source speaker and target speaker, validation data) with adversarial training, for speaker pair s5 → s6. Note that the target accuracy shows a substantial rise at epoch 10, where adversarial training sets in.
6. Analysis
In this section we attempt to shed light on the effect of domain-adversarial training. Figure 4 shows the progress of training for speakers s5 → s6 versus the training epoch, with adversarial training activated. The source speaker accuracies on the validation and test sets are far higher than the target speaker accuracies, which reach 39.1% on the validation set and 39.5% on the test set, our greatest single increase with adversarial training: without adversarial training, the target accuracy is less than 22%.

From the steady rise of the first curve, we see that the training progresses smoothly. This is the expected behavior for a well-tuned system. On the validation sets, the accuracy varies much less smoothly, with jumps of several percentage points between epochs. We observed that this behavior is quite consistent for all systems, with or without adversarial training, and also for varying numbers of training speakers. Clearly the "error landscape" between training and validation data is very different, both within the same speaker and between different speakers.

The effect of adversarial training is clearly observable: At epoch 10, where adversarial training becomes active (with 0.2 weight), the target accuracy jumps visibly, even though the criterion for which the adversarial network is optimized is very different from the word accuracy which is plotted in the graph. This is a remarkable success, even though it should be noted (compare figure 3) that on other speaker pairs, we obtain a much lower improvement by adversarial training.
7. Conclusion
In this study we have described how to apply domain-adversarial training to a state-of-the-art lipreading system for improved speaker independence. When training and testing are performed on pairs of different speakers, the average improvement is around 40% relative, which is highly significant; this improvement even persists when the amount of untranscribed target data is drastically reduced to about 15–20 seconds. When supervised training data from several speakers is available, there is still some improvement, from a much higher baseline.
8. Acknowledgements
The first author was supported by the H2020 project INPUT.

9. References

[1] L. Woodhouse, L. Hickson, and B. Dodd, "Review of Visual Speech Perception by Hearing and Hearing-impaired People: Clinical Implications," International Journal of Language and Communication Disorders, vol. 44, no. 3, pp. 253–270, 2009.
[2] H. McGurk and J. MacDonald, "Hearing Lips and Seeing Voices," Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[3] E. D. Petajan, "Automatic Lipreading to Enhance Speech Recognition (Speech Reading)," Ph.D. dissertation, University of Illinois at Urbana-Champaign, 1984.
[4] G. I. Chiou and J.-N. Hwang, "Lipreading from Color Video," IEEE Transactions on Image Processing, vol. 6, no. 8, pp. 1192–1195, 1997.
[5] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Lipreading using Convolutional Neural Network," in Proc. Interspeech, 2014, pp. 1149–1153.
[6] S. Petridis and M. Pantic, "Deep Complementary Bottleneck Features for Visual Speech Recognition," in Proc. ICASSP, 2016, pp. 2304–2308.
[7] M. Wand, J. Koutník, and J. Schmidhuber, "Lipreading with Long Short-Term Memory," in Proc. ICASSP, 2016, pp. 6115–6119.
[8] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: End-to-End Sentence-level Lipreading," arXiv:1611.01599, 2016.
[9] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip Reading Sentences in the Wild," arXiv:1611.05358, 2016.
[10] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735–1780, 1997.
[11] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to Forget: Continual Prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[12] Y. Ganin and V. Lempitsky, "Unsupervised Domain Adaptation by Backpropagation," in Proc. ICML, 2015, pp. 1180–1189.
[13] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition," Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[14] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 5, pp. 863–876, 2015.
[15] R. Bowden, S. Cox, R. Harvey, Y. Lan, E.-J. Ong, G. Owen, and B.-J. Theobald, "Recent Developments in Automated Lip-reading," in Proc. SPIE, 2013.
[16] T. Hueber, E.-L. Benaroya, G. Chollet, B. Denby, G. Dreyfus, and M. Stone, "Development of a Silent Speech Interface Driven by Ultrasound and Optical Images of the Tongue and Lips," Speech Communication, vol. 52, pp. 288–300, 2010.
[17] B. Denby, T. Schultz, K. Honda, T. Hueber, and J. Gilbert, "Silent Speech Interfaces," Speech Communication, vol. 52, no. 4, pp. 270–287, 2010.
[18] J. Wang and S. Hahn, "Speaker-Independent Silent Speech Recognition with Across-Speaker Articulatory Normalization and Speaker Adaptive Training," in Proc. Interspeech, 2015, pp. 2415–2419.
[19] J. A. Gonzalez, L. A. Cheah, J. M. Gilbert, J. Bai, S. R. Ell, P. D. Green, and R. K. Moore, "A Silent Speech System based on Permanent Magnet Articulography and Direct Synthesis," Computer Speech and Language.
[20] IEEE Transactions on Biomedical Engineering, vol. 61, no. 10, pp. 2515–2526, 2014.
[21] M. Wand and T. Schultz, "Towards Real-life Application of EMG-based Speech Recognition by using Unsupervised Adaptation," in Proc. Interspeech, 2014, pp. 1189–1193.
[22] Y. Deng, J. T. Heaton, and G. S. Meltzner, "Towards a Practical Silent Speech Recognition System," in Proc. Interspeech, 2014, pp. 1164–1168.
[23] M. Wand and J. Schmidhuber, "Deep Neural Network Frontend for Continuous EMG-Based Speech Recognition," in Proc. Interspeech, 2016, pp. 3032–3036.
[24] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey, "Extraction of Visual Features for Lipreading," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002.
[25] G. Zhao, M. Barnard, and M. Pietikäinen, "Lipreading With Local Spatiotemporal Descriptors," IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1254–1265, 2009.
[26] C. Bregler and Y. Konig, "'Eigenlips' for Robust Speech Recognition," in Proc. ICASSP, 1994, pp. 669–672.
[27] T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, and M. Stone, "Eigentongue Feature Extraction for an Ultrasound-based Silent Speech Interface," in Proc. ICASSP, 2007, pp. I-1245–I-1248.
[28] S. Cox, R. Harvey, Y. Lan, J. Newman, and B. Theobald, "The Challenge of Multispeaker Lip-reading," in Proc. AVSP, 2008, pp. 179–184.
[29] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, "Improving Visual Features for Lip-reading," in Proc. AVSP, 2010.
[30] T. Hueber, G. Chollet, B. Denby, G. Dreyfus, and M. Stone, "Continuous-Speech Phone Recognition from Ultrasound and Optical Images of the Tongue and Lips," in Proc. Interspeech, 2007, pp. 658–661.
[31] F. Tao and C. Busso, "Lipreading Approach for Isolated Digits Recognition Under Whisper and Neutral Speech," in Proc. Interspeech, 2014, pp. 1154–1158.
[32] Y. Lan, R. Harvey, B.-J. Theobald, E.-J. Ong, and R. Bowden, "Comparing Visual Features for Lipreading," in Proc. of the International Conference on Auditory-Visual Speech Processing, 2009, pp. 102–106.
[33] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A Review of Recent Advances in Visual Speech Decoding," Image and Vision Computing, vol. 32, pp. 590–605, 2014.
[34] G. J. Wolff, K. V. Prasad, D. G. Stork, and M. E. Hennecke, "Lipreading by Neural Networks: Visual Preprocessing, Learning and Sensory Integration," in Proc. NIPS, 1993, pp. 1027–1034.
[35] S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[36] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015, software available from tensorflow.org.
[37] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving Neural Networks by Preventing Co-adaptation of Feature Detectors," arXiv:1207.0580, 2012.
[38] R. Caruana, "Multitask Learning," Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, 1997.
[39] J. Schmidhuber, "Learning Factorial Codes by Predictability Minimization," Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
[40] S. Gergen, S. Zeiler, A. H. Abdelaziz, R. Nickel, and D. Kolossa, "Dynamic Stream Weighting for Turbo-Decoding-Based Audio-visual ASR," in Proc. Interspeech, 2016, pp. 2135–2139.
[41] M. Wand and T. Schultz, "Session-independent EMG-based Speech Recognition," in