Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks
Seyed Ali Jalalifar, Hosein Hasani, and Hamid Aghajan

Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran
{seyedali.jalalifar, hasani.hosein}@ee.sharif.ir
imec, Ghent University, Ghent, Belgium
[email protected]

Abstract.
We present a novel approach to generating photo-realistic images of a face with accurate lip sync, given an audio input. Using a recurrent neural network, we predict mouth landmarks from audio features. We then exploit the power of conditional generative adversarial networks to produce highly realistic faces conditioned on a set of landmarks. Together, these two networks are capable of producing a sequence of natural faces in sync with an input audio track.
Keywords:
Speech to video mapping · Conditional generative adversarial networks · LSTM
1 Introduction

Creating talking heads from audio input is interesting from both scientific and practical viewpoints, e.g., constructing virtual computer-generated characters, aiding hearing-impaired people, and live dubbing of videos with translated audio. Due to its wide variety of applications, audio-to-video mapping has been the focus of intensive research in recent years [1,2,3,4]. Mapping audio to facial images with accurate lip sync is an extremely difficult task, both because it is a mapping from a one-dimensional to a three-dimensional space and because humans are experts at detecting any out-of-sync lip movements with respect to an audio track.

Facial reenactment has seen considerable progress recently [5,6,7,8]. Approaches to photo-realistic facial reenactment usually utilize computer graphics methods to produce high-quality results. Suwajanakorn et al. [1] generate photo-realistic mouth texture directly from audio using compositing techniques. In [5], the facial expressions of a target video are animated by a source actor via deformation transfer between source and target. Although these methods usually produce highly realistic reenactment, they suffer from occasional failures. A big challenge for these methods is synthesizing realistic teeth, because of the subtle details in the mouth region. Unlike these approaches, we propose using pure machine learning techniques for the task of facial reenactment, which we believe is more flexible and simpler to implement. By using generative adversarial networks, our model learns the manifold of human faces and lip movements, which is a great help in avoiding the uncanny valley (objects that closely resemble humans but differ in small details elicit a sense of unfamiliarity, while less similar objects look more familiar).

Generative adversarial networks (GANs), first introduced by Goodfellow et al. [9], are great tools for learning an image manifold. They have shown huge potential in mimicking the underlying distribution of data, and have produced visually impressive results by sampling random images drawn from the image manifold [10,11,12,13]. Despite their power, GANs are notorious for their uncontrollable output, because the space of the input data is entangled and there is no control over the modes of the data being generated. That was the impetus behind conditional generative adversarial networks [14], which offer some control over the output. Other approaches have also been proposed to learn disentangled, interpretable representations in an unsupervised [15,16] or supervised [17] manner.

We evaluated several extensions of generative adversarial networks and found that the conditional GAN suits our problem best. We exploit the power of conditional GANs to generate natural faces conditioned on a set of mouth landmarks. Another network is trained to produce mouth landmarks from an audio input using an LSTM structure. By combining these two networks, our model is capable of generating natural faces with accurate lip sync. To the best of our knowledge, this is the first time that conditional GANs have been applied to the problem of audio-to-video mapping. The closest work to ours is [1], but unlike our method, they composite mouth texture with proper 3D pose matching for accurate lip syncing. We conduct a case study on a specific person, President Barack Obama, because of the huge volume of data available from his weekly addresses and because the videos are online and in the public domain. Generating talking heads of other people is easily achievable using the same pipeline, given that enough data is available.
2 Related Work

The related work can be divided into two categories: creating accurate lip sync given an audio input, and manipulating faces using generative adversarial networks.
2.1 Lip Sync from Audio

Approaches to automatically generating natural-looking speech animation usually involve manipulating 3D computer-generated faces [18,19,20]. It was not until recently that highly realistic facial reenactment became achievable [1,5]. The typical procedure for generating lip sync from audio consists of extracting features from raw audio [1] or extracting phonemes [2]; a mapping from these audio features to a 3D face model for avatars is then learned.
Fig. 1. Artificial faces of Obama, created entirely from audio input.

For the case of facial reenactment, appropriate facial texture is created from audio features. Taylor et al. [2] proposed using a sliding-window predictor that learns an arbitrary non-linear mapping from a phoneme label input sequence to mouth movements. Anderson et al. [21] proposed a pipeline for generating text-driven 3D talking heads from a limited number of 3D scans, using an Active Appearance Model (AAM) to construct 2D talking heads first and then creating 3D models from them. One of the first highly realistic facial reenactment approaches, Face2Face, was introduced by Thies et al. [5]. They proposed a new approach for real-time facial reenactment using monocular video sequences from the source and target actors. Building on their work, Suwajanakorn et al. [1] introduced a new method for creating talking heads from an audio input using compositing techniques. While Face2Face transfers the mouth from another video sequence, they synthesize the mouth shape directly from audio.

Although our work is similar to [1] in application, there are fundamental differences between the methods used. Conventional approaches to facial reenactment rely heavily on computer graphics methods, which are prone to generating uncanny faces due to their lack of understanding of the human face manifold. These methods also need to overcome challenges related to synthesizing realistic teeth. Unlike these approaches, we propose a new pipeline for generating highly realistic videos with accurate lip sync from audio by learning the human face manifold. This greatly reduces the complications that typical methods usually have to deal with and also prevents occasional failures.
2.2 Face Manipulation Using GANs

Generative adversarial nets have recently received an increasing amount of attention and have produced promising results, especially in the tasks of image generation [10,11,12,13,22] and video generation [23]. The power of these networks is that they produce visually impressive outputs, because they learn the underlying distribution of the data. They have opened a new door to the field of image editing. Efforts to edit faces in latent space usually consist of supervised and unsupervised methods for disentangling the latent space. Chen et al. [15] proposed an information-theoretic extension of GANs, InfoGAN, which is able to learn a disentangled representation of the latent space in a completely unsupervised manner. This is done by maximizing the mutual information between some latent variables and the observation. They tested their approach on the CelebA dataset and managed to control the pose, presence or absence of glasses, hair style, and emotion of generated face images. Semi-Latent GAN, proposed by Yin et al. [17], learns to generate and modify images from attributes by decomposing the noise of a GAN into two parts, user-defined attributes and latent attributes, which are obtained from the data.

GAN-based conditional image generation has also been a focus of research in recent years. In conditional GANs, both the generator and the discriminator are provided with class information. Ma et al. [12] proposed a method for pose-guided person image generation conditioned on a specific pose. Kaneko et al. [24] presented a generative attribute controller by utilizing conditional filtered generative adversarial networks. In this paper, we use conditional GANs to generate facial images given a set of landmarks. Our model is capable of generating faces accurately aligned with the given landmarks. Another network, an LSTM, learns to predict facial landmark positions from audio features. Together, these two networks are able to generate realistic facial images with accurate lip sync, given an audio input.

Fig. 2. An overview of the proposed system. First, an LSTM network is trained with audio features as input and lip landmark positions as labels. A C-GAN is trained to produce highly realistic faces with respect to a given set of landmarks. Finally, these two networks together are able to produce convincing faces from an audio track.
3 System Overview

An overview of the system is shown in Fig. 2. At the heart of our system is a conditional GAN that is trained to produce highly realistic facial images conditioned on a given set of lip landmarks. An LSTM network is utilized to create lip landmarks from the audio input. Here we briefly introduce the networks implemented in our system.
3.1 Long Short-Term Memory Networks

Long short-term memory networks, first introduced by Hochreiter and Schmidhuber [25], are a special type of recurrent neural network. Unlike typical feed-forward networks, RNNs have the ability to connect previous information to the present task. While plain RNNs are not capable of handling long-term dependencies [26], LSTMs are explicitly designed to handle such situations. The computation within an LSTM cell can be described as:

\[
\begin{aligned}
f_{t+1} &= \sigma(W_f \cdot [h_t, \Phi_t] + b_f), && (1) \\
i_{t+1} &= \sigma(W_i \cdot [h_t, \Phi_t] + b_i), && (2) \\
o_{t+1} &= \sigma(W_o \cdot [h_t, \Phi_t] + b_o), && (3) \\
\tilde{C}_{t+1} &= \tanh(W_c \cdot [h_t, \Phi_t] + b_c), && (4)
\end{aligned}
\]

where C_t, h_t, and Φ_t are the inputs to the LSTM; W_f, W_i, W_o, b_i, b_o, b_f, and b_c are trainable parameters; and σ is the sigmoid activation function. f, i, and o are the forget, input, and output gates of a standard LSTM unit, which control the contribution of historical information to the current decision. The outputs of an LSTM cell are

\[
\begin{aligned}
C_{t+1} &= f_{t+1} \odot C_t + i_{t+1} \odot \tilde{C}_{t+1}, && (5) \\
h_{t+1} &= o_{t+1} \odot \tanh(C_{t+1}). && (6)
\end{aligned}
\]

3.2 Generative Adversarial Networks

Typically, a GAN consists of two networks, a generator and a discriminator. The generator network G tries to fool the discriminator D by creating samples as if they came from the real data distribution. It is the discriminator's task to distinguish between fake and real samples, while the generator tries to learn the true distribution of the data in order to fool the discriminator. As training goes on, the discriminator becomes better and better at separating real and fake samples, so the generator has to produce ever more realistic samples to deceive it. This leads to a two-player min-max game with the value function V(G, D):

\[
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (7)
\]

GANs find a mapping from a prior noise distribution to the data space. Since the values of the latent code are picked randomly from a distribution, there is no control over the output of the generator. Conditioning both the discriminator and the generator on some extra information y offers some control over the output [14]. Here, y can be any kind of auxiliary information, such as a class label or, as in our case, landmark positions. In the case of conditional GANs, the objective function of the two-player min-max game becomes:

\[
\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]. \quad (8)
\]

After training the GAN, the discriminator network is discarded and only the generator is used for creating realistic facial images.
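As a concrete illustration of Eq. (8), the following is a minimal sketch of how the conditional objective is typically implemented as separate generator and discriminator losses. The paper does not provide training code; TensorFlow is used here only because it is the framework mentioned in Section 5, and all function names are our own.

```python
import tensorflow as tf

# Binary cross-entropy on logits implements the log terms of Eq. (8).
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D is trained to output 1 for real pairs (x, y)
    # and 0 for fake pairs (G(z|y), y).
    real_loss = bce(tf.ones_like(real_logits), real_logits)
    fake_loss = bce(tf.zeros_like(fake_logits), fake_logits)
    return real_loss + fake_loss

def generator_loss(fake_logits):
    # G is trained to make D label its samples as real
    # (the usual non-saturating form of the min-max objective).
    return bce(tf.ones_like(fake_logits), fake_logits)

def generator_input(z, y):
    # Conditioning: the noise vector z is simply concatenated
    # with the condition y (here, landmark positions).
    return tf.concat([z, y], axis=-1)
```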
4 Proposed Approach

Mapping a sequence of audio to a sequence of images is inherently a difficult task due to the ambiguities of mapping from a low-dimensional to a high-dimensional space. Our ultimate goal is to estimate the distribution p_model(x_i | V_i), where x_i is the image at the i-th frame and V_i = [v_{i−n}, v_{i−n+1}, ..., v_{i+n}] is the audio feature vector with sequence size 2n + 1. Instead of directly computing p_model, we estimate the distributions p_θ(l_i | V_i) and p_φ(x_i | l_i), where l_i consists of 8 landmark positions. The problem then becomes finding θ and φ, the model parameters of the LSTM and generator networks, respectively. First, an LSTM network is trained to output facial landmark positions based on the mel-frequency cepstral coefficients (MFCCs) extracted from the audio input. A separate generative model is trained to create high-quality, realistic faces conditioned on a set of landmarks. These two networks are trained independently and together yield a mapping from MFCC audio features to a sequence of facial images in sync with a given audio track.

4.1 Dataset

For training, we used President Obama's weekly address videos because of their availability, high quality, and controlled environment. These videos total 14 hours, but we used a subset of the dataset, since we achieved the desired quality with about two hours of video. For each frame, we extracted the face region, along with the important lip landmarks, using the method proposed in [27]. We also extracted mel-frequency cepstral coefficients from the audio input.
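The factorization above composes in two stages at inference time. The following is a minimal sketch of that composition; `lstm_model` and `generator` are placeholders for the two trained networks, and the function name is our own.

```python
import numpy as np

def synthesize_frames(mfcc_windows, lstm_model, generator, noise_dim=50):
    # Two-stage inference: audio feature window V_i -> landmarks l_i -> frame x_i.
    # lstm_model approximates p_theta(l_i | V_i);
    # generator approximates p_phi(x_i | l_i).
    frames = []
    for V_i in mfcc_windows:                        # V_i: (2n+1, 28) feature window
        l_i = lstm_model.predict(V_i[None, ...])    # (1, 16) lip landmark vector
        z = np.random.normal(size=(1, noise_dim))   # 50-D noise vector (Sect. 4.3)
        frames.append(generator.predict(np.concatenate([z, l_i], axis=-1)))
    return frames
```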
4.2 From Audio to Landmarks

The shape of the mouth during speech depends not only on the current phoneme but also on the phonemes before and after it. This effect is called co-articulation, and it can extend up to 10 neighboring phonemes. Inspired by [3] and [1], we use an LSTM to preserve these long-term dependencies. Extensive work has been done on the problem of audio feature extraction [28,29,30]. We use the typical mel-frequency cepstral coefficients as the audio features: we take the discrete Fourier transform over a 33-millisecond sliding window and apply 40 triangular mel-scale filters to the Fourier power spectrum. In addition to these 13 MFCCs, we also use their first temporal derivative and the log mean energy as extra features, obtaining a 28-D feature vector. From the 68 landmark points detected by Dlib [31], we select the points most correlated with speech, namely the 8 points around the lips. These 8 points make up a 16-D vector. We use a single-layer LSTM followed by two hidden layers for the mapping from audio to the lip landmarks. More details about the implementation can be found in Section 5.
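For reference, a sketch of this feature extraction is given below, assuming the librosa library (not named in the paper). The sampling rate, hop length, and the exact composition of the 28 dimensions are our assumptions; we read the 28-D vector as 13 MFCCs, their deltas, the log energy, and its delta.

```python
import librosa
import numpy as np

def audio_features(wav_path):
    """28-D per-frame features: 13 MFCCs, their temporal deltas,
    log mean energy, and its delta (one plausible reading of the
    28-D vector described in the text)."""
    y, sr = librosa.load(wav_path, sr=16000)    # sampling rate is an assumption
    n_fft = int(0.033 * sr)                     # 33 ms analysis window, as in the paper
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=n_fft,
                                n_mels=40)                    # 40 mel filters
    d_mfcc = librosa.feature.delta(mfcc)                      # first temporal derivative
    # Log mean energy per frame and its delta.
    energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=n_fft)
    log_e = np.log(energy + 1e-8)
    d_log_e = librosa.feature.delta(log_e)
    return np.vstack([mfcc, d_mfcc, log_e, d_log_e]).T        # (frames, 28)
```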
4.3 From Landmarks to Face Images

We propose using conditional generative adversarial networks to create images from landmarks.
Fig. 3. An overview of our conditional GAN: the deconvolutional generator network (top) and the convolutional discriminator network (bottom).

We use the positions of distinctive lip landmarks as an extra condition on the generator network. The input of the generator consists of a 50-D noise vector and a 16-D vector of landmark positions. This 66-D input vector is ultimately transformed into a 128x128x3 image through deconvolutional layers. We concatenate the resulting image with the 16-D landmark positions in such a way that the input shape of the discriminator becomes 128x128x19. We follow the typical network structure proposed for DC-GANs, except that we concatenate the inputs of both the generator and the discriminator with the landmark positions. The structure of the generator and discriminator networks is shown in Fig. 3.
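A minimal Keras sketch of such a pair of networks is given below. The paper specifies only the input and output dimensions (50-D noise, 16-D landmarks, 128x128x3 images, 128x128x19 discriminator input) and the DC-GAN-style structure; the layer counts, filter sizes, and the way the landmarks are tiled into 16 spatial channels are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

NOISE_DIM, LM_DIM = 50, 16  # dimensions stated in the text

def make_generator():
    # 66-D input (noise + landmarks) -> 128x128x3 image via transposed convolutions.
    inp = layers.Input(shape=(NOISE_DIM + LM_DIM,))
    x = layers.Dense(8 * 8 * 256)(inp)
    x = layers.Reshape((8, 8, 256))(x)
    for filters in (256, 128, 64, 32):           # 8 -> 16 -> 32 -> 64 -> 128
        x = layers.Conv2DTranspose(filters, 5, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    out = layers.Conv2DTranspose(3, 5, strides=1, padding="same",
                                 activation="tanh")(x)
    return tf.keras.Model(inp, out)

def make_discriminator():
    # Image concatenated with landmarks tiled to 128x128x16 -> 128x128x19 input.
    img = layers.Input(shape=(128, 128, 3))
    lm = layers.Input(shape=(LM_DIM,))
    lm_maps = layers.Reshape((1, 1, LM_DIM))(lm)
    lm_maps = layers.UpSampling2D(size=(128, 128))(lm_maps)  # broadcast spatially
    x = layers.Concatenate(axis=-1)([img, lm_maps])          # 128x128x19
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, 5, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    logit = layers.Dense(1)(layers.Flatten()(x))
    return tf.keras.Model([img, lm], logit)
```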
5 Implementation and Results

In this section we discuss implementation details and results.
Fig. 4. Bidirectional LSTM structure. Frames after and before the current frame affect the output.
We tested several LSTM architectures and found that a bidirectional LSTM suits our problem best. As mentioned in Section 4.2, the phenomenon called co-articulation causes the mouth shape to depend on the phonemes before and after the current phoneme. The choice of a bidirectional LSTM is therefore natural, since it takes both previous and subsequent frames into account. We used a single-layer bidirectional LSTM, since it produces the desired quality and there is no need to introduce complexity to the network by adding extra layers. We trained with the Adam optimizer [32] using the TensorFlow framework [33]. Fig. 4 shows the bidirectional LSTM structure. In Tables 1 and 2, we compare the performance of different LSTM structures and parameters; a sketch of the chosen network follows the tables.
Table 1. Validation loss of different LSTM network structures.

Network structure                    100 epochs   200 epochs   300 epochs
Single-layer bidirectional LSTM      0.91         0.88         0.85
Single-layer unidirectional LSTM     0.93         0.91         0.93
Two-layer bidirectional LSTM         0.92         0.88         0.84
Table 2. Validation loss of different LSTM networks versus dropout probability.

Network structure                    Dropout 0    Dropout 0.3  Dropout 0.5
Single-layer bidirectional LSTM      0.91         0.88         0.93
Single-layer unidirectional LSTM     0.94         0.92         0.95
Two-layer bidirectional LSTM         0.91         0.89         0.92
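A minimal Keras sketch of the single-layer bidirectional network from Table 1 is shown below. The text fixes the overall architecture (one bidirectional LSTM layer followed by two hidden layers, 28-D input features, 16-D landmark output, Adam optimizer); the unit counts and the loss function are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_landmark_predictor(seq_len, n_features=28, n_landmarks=16,
                            lstm_units=64, hidden_units=128):
    # Single-layer bidirectional LSTM over the (2n+1)-frame MFCC window,
    # followed by two hidden layers; unit counts are our assumptions.
    model = tf.keras.Sequential([
        layers.Input(shape=(seq_len, n_features)),
        layers.Bidirectional(layers.LSTM(lstm_units)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(n_landmarks),   # 8 lip points -> 16-D output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
    return model
```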
Our conditional network is able to create realistic facial images from landmarks. While generating an image sequence from audio input, we need to keep the facial texture and background constant. To achieve this, we restrict the C-GAN training dataset in the last epochs to the target video. This keeps the facial texture constant during face generation while preserving the details of the reconstructed face. Some tricks proposed in [34] improved the quality of the output and reduced visual artifacts. Fig. 5 shows some of the results we achieved from a given set of landmarks.

Fig. 5. Images generated directly from landmarks. Original sequence (top); generated faces using the C-GAN (bottom).

The novelty of our approach is that the two modules we use, the LSTM and the C-GAN, are almost independent of each other. This means that our model is able to transfer the lip movements of other people, given their audio. Only a simple affine transformation needs to be applied to the source facial landmarks to align them with the target landmarks. Fig. 6 shows a transfer from Hillary Clinton's audio speech to President Barack Obama's lip movements.

Fig. 6. Creating artificial faces of President Obama, given an audio track from Hillary Clinton. From top to bottom: 1) original video, 2) audio features, 3) predicted landmarks, 4) generated images created from the landmarks.
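The alignment step can be as simple as a least-squares affine fit between the two landmark sets. A minimal NumPy sketch follows; the function name and the use of plain least squares are our assumptions, since the paper does not specify how the transformation is estimated.

```python
import numpy as np

def align_landmarks(src, dst):
    """Least-squares affine transform mapping source lip landmarks (8x2)
    onto the target's coordinate frame; a sketch of the alignment step
    described above."""
    ones = np.ones((src.shape[0], 1))
    A = np.hstack([src, ones])            # (8, 3) homogeneous source points
    # Solve A @ M ~= dst for the 3x2 affine matrix M.
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return A @ M                          # source landmarks in the target frame
```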
6 Conclusion

We propose using conditional generative adversarial networks for creating high-quality faces given their mouth landmarks, where the mouth landmarks are themselves obtained from audio using an LSTM network. This gives us an end-to-end system with much flexibility, e.g., the ability to manipulate faces without losing their naturalness. This is a huge advantage over computer graphics methods, since there is no need to get involved with the details of the face, e.g., synthesizing realistic teeth. The LSTM network and the C-GAN are almost independent of each other, so we can reanimate the target face with audio from sources other than the target person. This opens the door to many interesting new applications, such as face transformation, Dubsmash-like apps, etc.

We used the Dlib landmark detector for extracting facial landmarks. Newer approaches with more accurate results are available for facial landmark detection, especially in the mouth region [35]. Using these improved methods would increase the quality of the LSTM network's predictions of mouth shape from audio features. Sometimes our model fails to create natural faces; this is mainly due to lip landmarks that differ significantly from what the C-GAN saw during the training phase (Fig. 7). To address this problem, a more comprehensive dataset could be used to cover more head poses and lip landmark positions. Finally, the typical DC-GAN structure and training procedure were used during the training phase. New architectures and algorithms, such as [11], have been proposed to improve output image quality; using these new structures, images with higher quality and finer details are achievable.
Fig. 7. Some failure cases, mainly caused by irrelevant lip landmarks.
References
1. Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. (4) (July 2017) 95:1–95:13
2. Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A.G., Hodgins, J., Matthews, I.: A deep learning approach for generalized speech animation. ACM Trans. Graph. (4) (July 2017) 93:1–93:11
3. Shimba, T., Sakurai, R., Yamazoe, H., Lee, J.H.: Talking heads synthesis from audio with deep neural networks. In: 2015 IEEE/SICE International Symposium on System Integration (SII). (Dec 2015) 100–105
4. Llorach, G., Evans, A., Blat, J., Grimm, G., Hohmann, V.: Web-based live speech-driven lip-sync. In: 2016 8th International Conference on Games and Virtual Worlds for Serious Applications (VS-GAMES). (Sept 2016) 1–4
5. Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Niessner, M.: Demo of Face2Face: Real-time face capture and reenactment of RGB videos. In: ACM SIGGRAPH 2016 Emerging Technologies. SIGGRAPH '16, New York, NY, USA, ACM (2016) 5:1–5:2
6. Thies, J., Zollhöfer, M., Niessner, M., Valgaerts, L., Stamminger, M., Theobalt, C.: Real-time expression transfer for facial reenactment. ACM Trans. Graph. (6) (October 2015) 183:1–183:14
7. Garrido, P., Valgaerts, L., Sarmadi, H., Steiner, I., Varanasi, K., Perez, P., Theobalt, C.: VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. Computer Graphics Forum (2) (2015) 193–204
8. Shi, F., Wu, H.T., Tong, X., Chai, J.: Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph. (6) (November 2014) 222:1–222:13
9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 27. Curran Associates, Inc. (2014) 2672–2680
10. Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: Proceedings of the European Conference on Computer Vision (ECCV). (2016)
11. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. CoRR abs/1710.10196 (2017)
12. Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Gool, L.V.: Pose guided person image generation. CoRR abs/1705.09368 (2017)
13. Im, D.J., Kim, C.D., Jiang, H., Memisevic, R.: Generating images with recurrent adversarial networks. CoRR abs/1602.05110 (2016)
14. Mirza, M., Osindero, S.: Conditional generative adversarial nets. CoRR abs/1411.1784 (2014)
15. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR abs/1606.03657 (2016)
16. Larsen, A.B.L., Sønderby, S.K., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. CoRR abs/1512.09300 (2015)
17. Yin, W., Fu, Y., Sigal, L., Xue, X.: Semi-Latent GAN: Learning to generate and modify facial images from attributes. CoRR abs/1704.02166 (2017)
18. Liu, Y., Xu, F., Chai, J., Tong, X., Wang, L., Huo, Q.: Video-audio driven real-time facial animation. ACM Trans. Graph. (6) (October 2015) 182:1–182:10
19. Le, B.H., Ma, X., Deng, Z.: Live speech driven head-and-eye motion generators. IEEE Transactions on Visualization and Computer Graphics (11) (Nov 2012) 1902–1914
20. Cao, Y., Tien, W.C., Faloutsos, P., Pighin, F.: Expressive speech-driven facial animation. ACM Trans. Graph. (4) (October 2005) 1283–1302
21. Anderson, R., Stenger, B., Wan, V., Cipolla, R.: An expressive text-driven 3D talking head. In: ACM SIGGRAPH 2013 Posters. SIGGRAPH '13, New York, NY, USA, ACM (2013) 80:1–80:1
22. Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., Metaxas, D.N.: StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. CoRR abs/1612.03242 (2016)
23. Tulyakov, S., Liu, M., Yang, X., Kautz, J.: MoCoGAN: Decomposing motion and content for video generation. CoRR abs/1707.04993 (2017)
24. Kaneko, T., Hiramatsu, K., Kashino, K.: Generative attribute controller with conditional filtered generative adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (July 2017) 7006–7015
25. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (8) (November 1997) 1735–1780
26. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw. (2) (March 1994) 157–166
27. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. CVPR '14, Washington, DC, USA, IEEE Computer Society (2014) 1867–1874
28. Lee, H., Largman, Y., Pham, P., Ng, A.Y.: Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Proceedings of the 22nd International Conference on Neural Information Processing Systems. NIPS'09, USA, Curran Associates Inc. (2009) 1096–1104
29. Paleček, K.: Extraction of features for lip-reading using autoencoders. In Ronzhin, A., Potapova, R., Delic, V., eds.: Speech and Computer, Cham, Springer International Publishing (2014) 209–216
30. Takahashi, N., Gygli, M., Gool, L.V.: AENet: Learning deep audio features for video analysis. CoRR abs/1701.00599 (2017)
31. King, D.E.: Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. (December 2009) 1755–1758
32. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
33. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org
34. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill (2016)
35. Xiong, X., De la Torre Frade, F.: Supervised descent method and its applications to face alignment. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). (May 2013)
36. Umapathy, K., Krishnan, S., Rao, R.K.: Audio signal feature extraction and classification using local discriminant bases. IEEE Transactions on Audio, Speech, and Language Processing