Lahiru Samarakoon
National University of Singapore
Publications
Featured research published by Lahiru Samarakoon.
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Lahiru Samarakoon; Khe Chai Sim
In this paper, we propose the factorized hidden layer (FHL) approach to adapt deep neural network (DNN) acoustic models for automatic speech recognition (ASR). FHL aims at modeling speaker-dependent (SD) hidden layers by representing an SD affine transformation as a linear combination of bases. The combination weights are low-dimensional speaker parameters that can be initialized using speaker representations like i-vectors and then reliably refined in an unsupervised adaptation fashion. Therefore, our method provides an efficient way to perform both adaptive training and (test-time) adaptation. Experimental results have shown that FHL adaptation improves ASR performance significantly compared to standard DNN models, as well as other state-of-the-art DNN adaptation approaches, such as training on speaker-normalized CMLLR features, speaker-aware training using i-vectors, and learning hidden unit contributions (LHUC). For Aurora 4, FHL achieves 3.8% and 2.3% absolute improvements over the standard DNNs trained on the LDA + STC and CMLLR features, respectively. It also achieves a 1.7% absolute improvement over a system that combines i-vector adaptive training with LHUC adaptation. For the AMI dataset, FHL achieves 1.4% and 1.9% absolute improvements over the sequence-trained CMLLR baseline systems for the IHM and SDM tasks, respectively.
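A minimal sketch of the FHL idea in PyTorch, assuming rank-1 bases; the class name, shapes, and initialization are illustrative, and the SD bias combination is omitted for brevity:

import torch
import torch.nn as nn

class FactorizedHiddenLayer(nn.Module):
    # SD affine transform: W_s = W + sum_k d_k * (u_k v_k^T)
    def __init__(self, dim_in, dim_out, num_bases):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(dim_out, dim_in))  # speaker-independent W
        self.bias = nn.Parameter(torch.zeros(dim_out))
        # Rank-1 bases stored as factor matrices u_k and v_k.
        self.u = nn.Parameter(0.01 * torch.randn(num_bases, dim_out))
        self.v = nn.Parameter(0.01 * torch.randn(num_bases, dim_in))

    def forward(self, x, d):
        # d: (num_bases,) low-dimensional speaker parameters, e.g. initialized
        # from an i-vector and refined by unsupervised test-time adaptation.
        sd_weight = self.weight + torch.einsum('k,ko,ki->oi', d, self.u, self.v)
        return torch.tanh(x @ sd_weight.t() + self.bias)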
International Conference on Acoustics, Speech, and Signal Processing | 2016
Lahiru Samarakoon; Khe Chai Sim
In automatic speech recognition (ASR), adaptation and adaptive training techniques are used to perform speaker normalization. Previous methods mainly use these techniques in isolation. In contrast, this paper investigates two approaches that improve ASR performance by combining i-vector-based speaker adaptive training of deep neural network (DNN) acoustic models with discriminative adaptation techniques. First, we combine these techniques by interpolating the decoding lattices of i-vector-based systems with the decoding lattices of a discriminatively adapted model. Then, we combine these methods by discriminatively adapting the i-vector-based system in an unsupervised fashion. Our experiments on the TED-LIUM dataset show that, compared with a strong speaker-independent baseline, lattice interpolation and adaptation of the i-vector systems achieve 12.0% and 15.6% relative improvements, respectively. Moreover, in comparison to the i-vector-based systems, lattice interpolation yields a 4.5% relative improvement while discriminatively adapting the i-vector system yields an 8.3% relative improvement.
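A minimal sketch of the combination idea, illustrated on n-best hypothesis scores rather than full decoding lattices (the paper interpolates lattices, so this is a deliberate simplification; the function name and the missing-hypothesis floor are assumptions):

def interpolate_nbest(scores_a, scores_b, alpha=0.5):
    # scores_a, scores_b: dicts mapping hypothesis text -> log score,
    # one from the i-vector system, one from the discriminatively adapted system.
    floor = -1e9  # penalty for hypotheses missing from one system
    hyps = set(scores_a) | set(scores_b)
    combined = {h: alpha * scores_a.get(h, floor)
                   + (1.0 - alpha) * scores_b.get(h, floor)
                for h in hyps}
    return max(combined, key=combined.get)  # best rescored hypothesis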
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Lahiru Samarakoon; Khe Chai Sim
This paper proposes an approach to improve automatic speech recognition (ASR) by normalizing the speaker variability of a well-trained deep neural network (DNN) acoustic model using i-vectors. Our approach learns a speaker-dependent transformation of the acoustic features, combined with the standard speaker-dependent bias, to minimize the mismatch due to inter-speaker variability. Speaker normalization experiments on the Aurora 4 task show a 10.9% relative improvement over the baseline. Moreover, the proposed approach yields a 4.5% relative improvement over the standard i-vector-based method where only a speaker-dependent bias is used. Furthermore, we present an analysis comparing our approach with the constrained maximum likelihood linear regression (CMLLR) method.
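A minimal sketch of i-vector-driven feature normalization, assuming a diagonal SD transformation; the exact parameterization in the paper may differ, and all names here are illustrative:

import torch
import torch.nn as nn

class IVectorFeatureNormalizer(nn.Module):
    def __init__(self, feat_dim, ivector_dim):
        super().__init__()
        self.to_scale = nn.Linear(ivector_dim, feat_dim)  # SD transformation (diagonal here)
        self.to_bias = nn.Linear(ivector_dim, feat_dim)   # standard SD bias

    def forward(self, feats, ivector):
        # feats: (frames, feat_dim); ivector: (ivector_dim,) for the speaker.
        scale = 1.0 + self.to_scale(ivector)  # near-identity at initialization
        bias = self.to_bias(ivector)
        return feats * scale + bias  # normalized features fed to the DNN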
International Conference on Acoustics, Speech, and Signal Processing | 2017
Lahiru Samarakoon; Khe Chai Sim; Brian Mak
Subspace methods are used for deep neural network (DNN)-based acoustic model adaptation. These methods first construct a subspace and then perform speaker adaptation by estimating a point in that subspace. This paper investigates the effectiveness of subspace methods for robust unsupervised adaptation. For the analysis, we compare two state-of-the-art subspace methods, namely, singular value decomposition (SVD)-based bottleneck adaptation and factorized hidden layer (FHL) adaptation. Both of these methods perform speaker adaptation as a linear combination of rank-1 bases. The main difference in subspace construction is that FHL adaptation constructs a speaker subspace separate from the phoneme classification space, while SVD-based bottleneck adaptation shares the same subspace for both phoneme classification and speaker adaptation. So far, no direct comparison between these two methods has been reported. In this work, we compare the two methods for robustness to unsupervised adaptation on the Aurora 4, AMI IHM and AMI SDM tasks. Our findings show that FHL adaptation outperforms SVD-based bottleneck adaptation, especially in challenging conditions where the adaptation data is limited or the quality of the adaptation alignments is low.
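A minimal sketch of the SVD-based bottleneck side of the comparison: the SD transform lives inside the SVD factors of an existing layer, so speaker adaptation and phoneme classification share one subspace. Dimensions and names are illustrative:

import torch

def svd_bottleneck_adapt(weight, rank, d_speaker):
    # weight: (out, in) trained layer; d_speaker: (rank,) SD combination weights.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]  # keep the low-rank bottleneck
    # SD layer: W_s = U diag(d) diag(S) V^T, i.e. a linear combination of rank-1 bases.
    return (U * (d_speaker * S)) @ Vh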
Conference of the International Speech Communication Association | 2016
Lahiru Samarakoon; Khe Chai Sim
Recently, factorized hidden layer (FHL) adaptation was proposed for speaker adaptation of deep neural network (DNN)-based acoustic models. In addition to the standard affine transformation, an FHL contains a speaker-dependent (SD) transformation matrix formed as a linear combination of rank-1 matrices and an SD bias formed as a linear combination of vectors. In this work, we extend FHL-based adaptation to multiple variabilities of the speech signal. Experimental results on the Aurora 4 task show a 26.0% relative improvement over the baseline when standard FHL adaptation is used for speaker adaptation. Multi-attribute FHL adaptation shows further gains over standard FHL adaptation, with improvements reaching 29.0% relative to the baseline.
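A minimal sketch of the multi-attribute extension, assuming one set of rank-1 bases per attribute whose offsets are summed; everything beyond what the abstract states is an assumption:

import torch
import torch.nn as nn

class MultiAttributeFHL(nn.Module):
    def __init__(self, dim_in, dim_out, bases_per_attr):
        super().__init__()
        # bases_per_attr: e.g. {'speaker': 100, 'noise': 20} (illustrative).
        self.u = nn.ParameterDict({a: nn.Parameter(0.01 * torch.randn(k, dim_out))
                                   for a, k in bases_per_attr.items()})
        self.v = nn.ParameterDict({a: nn.Parameter(0.01 * torch.randn(k, dim_in))
                                   for a, k in bases_per_attr.items()})

    def sd_offset(self, weights):
        # weights: dict mapping attribute -> (k,) combination vector for the
        # current speaker/noise condition; offsets from all attributes are summed.
        return sum(torch.einsum('k,ko,ki->oi', weights[a], self.u[a], self.v[a])
                   for a in weights)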
New Era for Robust Speech Recognition: Exploiting Deep Learning | 2017
Khe Chai Sim; Yanmin Qian; Gautam Mantena; Lahiru Samarakoon; Souvik Kundu; Tian Tan
Deep neural networks (DNNs) have been successfully applied to many pattern classification problems, including acoustic modelling for automatic speech recognition (ASR). However, DNN adaptation remains a challenging task. Many approaches have been proposed in recent years to improve the adaptability of DNNs to achieve robust ASR. This chapter will review the recent adaptation methods for DNNs, broadly categorising them into constrained adaptation, feature normalisation, feature augmentation and structured DNN parameterisation. Specifically, we will describe various methods of estimating reliable representations for feature augmentation, focusing primarily on comparing i-vectors and other bottleneck features. Moreover, we will also present an adaptable DNN layer parameterisation scheme based on a linear interpolation structure. The interpolation weights can be reliably adjusted to adapt the DNN to different conditions. This generic scheme subsumes many existing DNN adaptation methods, including speaker-code adaptation, learning hidden unit contributions, factorised hidden layers and cluster adaptive training for DNNs.
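A minimal sketch of the linear-interpolation parameterisation the chapter describes: the layer weight is an interpolation of several basis matrices, and only the interpolation weights are adapted per condition. Class and variable names are illustrative:

import torch
import torch.nn as nn

class InterpolatedLinear(nn.Module):
    def __init__(self, dim_in, dim_out, num_clusters):
        super().__init__()
        self.bases = nn.Parameter(0.01 * torch.randn(num_clusters, dim_out, dim_in))
        self.bias = nn.Parameter(torch.zeros(dim_out))

    def forward(self, x, lam):
        # lam: (num_clusters,) interpolation weights, adjusted per speaker or
        # condition; the basis matrices stay fixed after training.
        weight = torch.einsum('c,coi->oi', lam, self.bases)
        return x @ weight.t() + self.bias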
Spoken Language Technology Workshop | 2016
Lahiru Samarakoon; Khe Chai Sim
Recently, the factorized hidden layer (FHL) adaptation method was proposed for speaker adaptation of deep neural network (DNN) acoustic models. In addition to the standard affine transformation, an FHL contains a speaker-dependent (SD) transformation matrix formed as a linear combination of rank-1 matrices and an SD bias formed as a linear combination of vectors. In contrast, a similar DNN adaptation method based on cluster adaptive training (CAT) uses full-rank bases. It is therefore interesting to investigate the effect of the rank of the bases used for adaptation. Increasing the rank of the bases improves the speaker subspace representation without increasing the number of learnable speaker parameters. In this work, we investigate the effect of using various ranks for the bases of the SD transformation of FHLs on the Aurora 4, AMI IHM and AMI SDM tasks. Experimental results show that when one FHL layer is used, it is optimal to use low-rank bases of rank 50 instead of full-rank bases. Furthermore, when multiple FHLs are used, rank-1 bases are sufficient.
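A minimal sketch of rank-r bases for the SD transform: raising the rank enriches the speaker subspace while the per-speaker parameters d stay num_bases-dimensional. Shapes and names are illustrative:

import torch
import torch.nn as nn

class RankRBases(nn.Module):
    def __init__(self, dim_in, dim_out, num_bases, rank):
        super().__init__()
        # Each basis B_k = U_k V_k^T has rank at most `rank`.
        self.U = nn.Parameter(0.01 * torch.randn(num_bases, dim_out, rank))
        self.V = nn.Parameter(0.01 * torch.randn(num_bases, dim_in, rank))

    def sd_transform(self, d):
        # d: (num_bases,) speaker parameters; same size regardless of `rank`.
        bases = torch.einsum('kor,kir->koi', self.U, self.V)  # (K, out, in)
        return torch.einsum('k,koi->oi', d, bases)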
Conference of the International Speech Communication Association | 2016
Lahiru Samarakoon; Khe Chai Sim
Recently, the learning hidden unit contributions (LHUC) method was proposed for the adaptation of deep neural network (DNN)-based acoustic models for automatic speech recognition (ASR). In LHUC, a set of speaker-dependent (SD) parameters is estimated to linearly recombine the hidden units in an unsupervised fashion. Although LHUC performs considerably well, the gains diminish as the amount of adaptation data decreases. Moreover, the per-speaker footprint of LHUC adaptation is in the thousands of parameters, which is undesirable. Therefore, in this work, we propose subspace LHUC, where the SD parameters are estimated in a subspace and connected to various layers through a new set of adaptively trained weights. We evaluate subspace LHUC on the Aurora 4 and AMI IHM tasks. Experimental results show that subspace LHUC outperforms standard LHUC adaptation. With utterance-level fast adaptation, subspace LHUC achieves 11.3% and 4.5% relative improvements over standard LHUC for the Aurora 4 and AMI IHM tasks, respectively. Furthermore, subspace LHUC reduces the per-speaker footprint by 94% over standard LHUC adaptation.
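A minimal sketch contrasting standard LHUC with subspace LHUC: instead of storing per-speaker amplitudes for every hidden unit, a small speaker vector d is projected to per-layer amplitudes by jointly trained weights. The 2*sigmoid re-parameterization is the common LHUC convention; other details are assumptions:

import torch
import torch.nn as nn

class SubspaceLHUC(nn.Module):
    def __init__(self, subspace_dim, hidden_dims):
        super().__init__()
        # One projection per adapted layer; only the subspace_dim values of d
        # are stored per speaker, which shrinks the per-speaker footprint.
        self.projections = nn.ModuleList(nn.Linear(subspace_dim, h)
                                         for h in hidden_dims)

    def apply_to(self, hidden, d, layer):
        # Standard LHUC would learn these amplitudes directly per speaker.
        amplitudes = 2.0 * torch.sigmoid(self.projections[layer](d))
        return hidden * amplitudes  # rescale the layer's hidden units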
International Conference on Acoustics, Speech, and Signal Processing | 2018
Lahiru Samarakoon; Brian Mak; Khe Chai Sim
Conference of the International Speech Communication Association | 2017
Lahiru Samarakoon; Brian Mak; Khe Chai Sim