Neethu Mariam Joy
Indian Institute of Technology Madras
Publication
Featured research published by Neethu Mariam Joy.
Spoken Language Technology Workshop | 2014
Basil Abraham; Neethu Mariam Joy; Navneeth K.; S. Umesh
One of the major problems in acoustic modeling for a low-resource language is data sparsity. In recent years, cross-lingual acoustic modeling techniques have been employed to overcome this problem. In this paper we propose multiple cross-lingual techniques to address the problem of data insufficiency. The first method, which we call the cross-lingual phone-CAT, uses the principles of phone-cluster adaptive training (phone-CAT), where the parameters of context-dependent states are obtained by linear interpolation of monophone cluster models. The second method uses the interpolation vectors of phone-CAT, which are known to capture phonetic context information, to map phonemes between two languages. Finally, the data-driven phoneme-mapping technique is incorporated into the cross-lingual phone-CAT to obtain what we call the phoneme-mapped cross-lingual phone-CAT. The proposed techniques are employed in acoustic modeling of three Indian languages, namely Bengali, Hindi and Tamil. The phoneme-mapped cross-lingual phone-CAT gave relative improvements of 15.14% for Bengali, 16.4% for Hindi and 11.3% for Tamil over the conventional cross-lingual subspace Gaussian mixture model (SGMM) in a low-resource scenario.
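The core idea, a context-dependent state modeled as a linear interpolation of monophone cluster models, together with the use of the interpolation vectors for data-driven phoneme mapping, can be illustrated with a minimal numpy sketch. All sizes, variable names and the cosine-similarity mapping criterion below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical sizes: P monophone clusters, D-dimensional features.
P, D = 40, 39
rng = np.random.default_rng(0)

cluster_means = rng.normal(size=(P, D))   # monophone cluster models (shared)
v_state = rng.dirichlet(np.ones(P))       # state-specific interpolation vector

# Phone-CAT-style state parameters: a linear interpolation of the monophone clusters.
state_mean = v_state @ cluster_means      # shape (D,)

def map_phonemes(v_src, v_tgt):
    """Data-driven phoneme mapping sketch: v_src (Ns, P) and v_tgt (Nt, P) are
    interpolation vectors of two languages; each source phoneme is mapped to the
    target phoneme with the most similar vector (cosine similarity assumed here)."""
    v_src = v_src / np.linalg.norm(v_src, axis=1, keepdims=True)
    v_tgt = v_tgt / np.linalg.norm(v_tgt, axis=1, keepdims=True)
    return np.argmax(v_src @ v_tgt.T, axis=1)
```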
National Conference on Communications | 2014
Neethu Mariam Joy; Basil Abraham; K Navneeth; S. Umesh
Cross-lingual acoustic modeling using the subspace Gaussian mixture model (SGMM) for low-resource languages of Indian origin is investigated. The focus is on building an acoustic model for a low-resource language with a limited vocabulary by leveraging resources from another language with comparatively larger resources. Experiments were done on the Bengali and Tamil corpora of the MANDI database, with Tamil having greater resources than Bengali. We observed that the word accuracy of the cross-lingual acoustic model for Bengali was approximately 2.5% above that of its CDHMM model and equivalent to that of its monolingual SGMM model.
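In an SGMM, most parameters (the subspace projections, weight projections and covariances) are globally shared, while each tied state is described by a low-dimensional state-specific vector; cross-lingual modeling reuses the shared parameters from the resource-rich language. A minimal numpy sketch of how state means and mixture weights are derived, with purely illustrative sizes:

```python
import numpy as np

# Hypothetical sizes: I shared Gaussians, subspace dimension S, feature dimension D.
I, S, D = 200, 40, 39
rng = np.random.default_rng(1)

# Globally shared parameters -- in the cross-lingual setting these would be
# borrowed from the high-resource language (Tamil in the experiments above).
M = rng.normal(size=(I, D, S))   # phonetic subspace projection matrices
w = rng.normal(size=(I, S))      # weight projection vectors

# State-specific vector, estimated from the low-resource language (Bengali).
v_j = rng.normal(size=(S,))

means_j = M @ v_j                          # per-Gaussian means for state j, shape (I, D)
logits = w @ v_j
weights_j = np.exp(logits - logits.max())
weights_j /= weights_j.sum()               # per-Gaussian mixture weights for state j
```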
Conference of the International Speech Communication Association | 2016
Basil Abraham; S. Umesh; Neethu Mariam Joy
In this paper, we propose two techniques to improve the acoustic model of a low-resource language: (i) pooling data from closely related languages using a phoneme mapping algorithm to build acoustic models such as the subspace Gaussian mixture model (SGMM), phone cluster adaptive training (Phone-CAT), deep neural network (DNN) and convolutional neural network (CNN), and then adapting the pooled models to the low-resource language using its data; (ii) borrowing subspace model parameters from SGMM/Phone-CAT, or hidden layers from DNN/CNN, built on high-resource languages, and then estimating the language-specific parameters using the low-resource language data. The experiments were performed on four Indian languages, namely Assamese, Bengali, Hindi and Tamil. Relative improvements of 10 to 30% were obtained over the corresponding monolingual models in each case.
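The second technique, borrowing hidden layers from a DNN trained on a high-resource language and estimating only the language-specific output layer from low-resource data, can be sketched as below. This is a minimal PyTorch illustration with made-up layer sizes and senone counts; it is not the authors' recipe.

```python
import torch.nn as nn

# Hypothetical dimensions: spliced input features, 5 hidden layers, senone outputs.
FEAT_DIM, HID, HI_RES_SENONES, LO_RES_SENONES = 440, 1024, 4000, 1500

def make_dnn(out_dim):
    layers, in_dim = [], FEAT_DIM
    for _ in range(5):
        layers += [nn.Linear(in_dim, HID), nn.Sigmoid()]
        in_dim = HID
    layers.append(nn.Linear(in_dim, out_dim))
    return nn.Sequential(*layers)

# 1. A DNN for the high-resource language (assumed to be trained already).
hi_res_dnn = make_dnn(HI_RES_SENONES)

# 2. Borrow its hidden layers: copy every layer except the final output layer.
lo_res_dnn = make_dnn(LO_RES_SENONES)
hi_layers = list(hi_res_dnn.children())[:-1]
lo_layers = list(lo_res_dnn.children())[:-1]
for src, dst in zip(hi_layers, lo_layers):
    dst.load_state_dict(src.state_dict())

# 3. Freeze the borrowed layers so only the new, language-specific output layer
#    is estimated from the limited low-resource data.
for layer in lo_layers:
    for p in layer.parameters():
        p.requires_grad = False
```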
Conference of the International Speech Communication Association | 2016
Neethu Mariam Joy; Murali Karthick Baskar; S. Umesh; Basil Abraham
In this paper, we propose the use of deep neural networks (DNN) as a regression model to estimate feature-space maximum likelihood linear regression (FMLLR) features from unnormalized features. During training, pairs of unnormalized input features and corresponding FMLLR target features are provided, and the network is optimized to reduce the mean-square error between the output and target FMLLR features. During test, the unnormalized features are passed through this DNN feature extractor to obtain FMLLR-like features without any supervision or first-pass decode. Further, the FMLLR-like features are generated frame by frame, requiring no explicit adaptation data, unlike FMLLR or i-vector approaches. Our proposed approach is therefore suitable for scenarios where there is little adaptation data. The proposed approach provides sizable improvements over basis-FMLLR and conventional FMLLR when normalization is done at the utterance level on the TIMIT and 33-hour Switchboard data sets.
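A minimal PyTorch sketch of the training and extraction procedure described above; the network size, feature dimension, optimizer and learning rate are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

FEAT_DIM = 40   # hypothetical dimensionality of the unnormalized and FMLLR features

# Regression DNN: unnormalized features in, FMLLR-like features out.
regressor = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, FEAT_DIM),
)
optimizer = torch.optim.Adam(regressor.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(unnorm_batch, fmllr_batch):
    """unnorm_batch, fmllr_batch: (num_frames, FEAT_DIM) tensors of paired features."""
    optimizer.zero_grad()
    loss = mse(regressor(unnorm_batch), fmllr_batch)   # minimize MSE to FMLLR targets
    loss.backward()
    optimizer.step()
    return loss.item()

def extract_fmllr_like(unnorm_frames):
    """Test time: no supervision, adaptation data or first-pass decode, just a forward pass."""
    with torch.no_grad():
        return regressor(unnorm_frames)
```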
Speech Communication | 2017
Neethu Mariam Joy; Murali Karthick Baskar; S. Umesh
In this paper, we propose using deep neural networks (DNN) as a regression model to estimate speaker-normalized features from un-normalized features. We consider three types of speaker-specific feature normalization techniques, viz., feature-space maximum likelihood linear regression (FMLLR), vocal tract length normalization (VTLN) and a combination of both. The un-normalized features considered are log filterbank features, Mel frequency cepstral coefficients (MFCC) and linear discriminant analysis (LDA) features. The DNN is trained using pairs of un-normalized features as input and corresponding speaker-normalized features as target, and the network is optimized to reduce the mean square error between the output and target speaker-normalized features. During test, un-normalized features are passed through this trained DNN to obtain pseudo speaker-normalized features without any supervision, adaptation data or first-pass decode. As the pseudo speaker-normalized features are generated frame by frame, the proposed method requires no explicit adaptation data, unlike FMLLR, VTLN or i-vector approaches. Our proposed approach is hence suitable for scenarios where there is very little adaptation data. It provides significant improvements over conventional speaker-normalization techniques when normalization is done at the utterance level; experiments on TIMIT and on a 33-h subset as well as the entire 300 h of the Switchboard corpus support this claim. With a large amount of training data, the proposed pseudo speaker-normalized features outperform conventional speaker-normalized features in the utterance-wise normalization scenario and give consistent marginal improvements over un-normalized features.
International Conference on Signal Processing | 2016
Neethu Mariam Joy; S. Umesh; Basil Abraham; K Navneeth
Phone-cluster adaptive training (Phone-CAT) is a subspace-based acoustic modeling technique inspired by cluster adaptive training (CAT) and the subspace Gaussian mixture model (SGMM). This paper explores three extensions to the basic Phone-CAT model to improve its recognition performance: increasing the phonetic subspace dimension, adding sub-states, and adding a speaker subspace. The latter two extensions are similar in implementation to those in SGMM, as both acoustic models share a similar subspace framework. However, since the phonetic subspace dimension of Phone-CAT is constrained to equal the number of monophones, the first extension is not straightforward to implement. We propose a Two-stage Phone-CAT model in which the phonetic subspace dimension is increased to the number of monophone states, while still retaining the center-phone-capturing property of the state-specific vectors in basic Phone-CAT. Experiments on the 33-hour training subset of the Switchboard database show improvements in the recognition performance of the basic Phone-CAT model with the inclusion of the proposed extensions.
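The dimension change at the heart of the Two-stage model can be shown with a toy numpy sketch: in basic Phone-CAT the state-specific interpolation vector has one entry per monophone, while in the Two-stage variant it has one entry per monophone state. The counts below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 39                    # feature dimension (hypothetical)
num_phones = 40           # basic Phone-CAT: subspace dimension = number of monophones
num_phone_states = 120    # Two-stage Phone-CAT: e.g. 3 HMM states per monophone

# Basic Phone-CAT: clusters indexed by monophone.
phone_clusters = rng.normal(size=(num_phones, D))
v_basic = rng.dirichlet(np.ones(num_phones))              # state vector, dim = num_phones
mean_basic = v_basic @ phone_clusters

# Two-stage Phone-CAT: clusters indexed by monophone state, enlarging the subspace.
state_clusters = rng.normal(size=(num_phone_states, D))
v_two_stage = rng.dirichlet(np.ones(num_phone_states))    # dim = num_phone_states
mean_two_stage = v_two_stage @ state_clusters
```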
Conference of the International Speech Communication Association | 2016
Basil Abraham; Srinivasan Umesh; Neethu Mariam Joy
Articulatory features provide robustness to speaker and environment variability by incorporating speech production knowledge. Pseudo articulatory features are articulatory features extracted using articulatory classifiers trained from speech data. One of the major problems in building articulatory classifiers is the requirement of speech data aligned at the frame level in terms of articulatory feature values. Manually aligning data at the frame level is tedious, and alignments obtained from phone alignments using a phone-to-articulatory feature mapping are prone to errors. In this paper, a technique is proposed for training an articulatory classifier using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) with the connectionist temporal classification (CTC) criterion, which eliminates the need for forced frame-level alignments. Articulatory classifiers were also built with frame-level alignments using different neural network architectures, namely deep neural networks (DNN), convolutional neural networks (CNN) and BLSTMs, and were compared to the proposed CTC-based approach. Among these architectures, articulatory features extracted using classifiers built with BLSTMs gave better recognition performance, and the proposed BLSTM with CTC gave the best overall performance on both the SVitchboard (6-hour) and 33-hour Switchboard data sets.
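A minimal PyTorch sketch of a CTC-trained BLSTM articulatory classifier for one articulatory feature group; the layer sizes, number of feature values and optimizer are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN, NUM_AF_VALUES = 40, 256, 10   # hypothetical; class 0 is the CTC blank

class ArticulatoryBLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * HIDDEN, NUM_AF_VALUES)

    def forward(self, x):                          # x: (batch, time, FEAT_DIM)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(dim=-1)

model = ArticulatoryBLSTM()
ctc = nn.CTCLoss(blank=0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, feat_lens, af_labels, label_lens):
    """af_labels holds per-utterance articulatory value sequences -- no frame-level
    alignment is needed, which is the point of using the CTC criterion."""
    optimizer.zero_grad()
    log_probs = model(feats).transpose(0, 1)       # CTCLoss expects (time, batch, classes)
    loss = ctc(log_probs, af_labels, feat_lens, label_lens)
    loss.backward()
    optimizer.step()
    return loss.item()
```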
2016 Twenty Second National Conference on Communication (NCC) | 2016
Neethu Mariam Joy; Basil Abraham; K Navneeth; S. Umesh
In this paper, we investigate methods to improve the recognition performance of low-resource languages with limited training data by borrowing subspace parameters from a high-resource language in the subspace Gaussian mixture model (SGMM) framework. As a first step, only the state-specific vectors are updated using the low-resource language, while all the globally shared parameters are retained from the high-resource language. This approach gave improvements only in some cases. However, when both the state-specific and weight projection vectors are re-estimated with the low-resource language, we get consistent improvement in performance over the conventional monolingual SGMM of the low-resource language. Further, we conducted experiments to investigate the effect of the different shared parameters on the acoustic model built using the proposed method. Experiments were done on the Tamil, Hindi and Bengali corpora of the MANDI database. Relative improvements of 16.17% for Tamil, 13.74% for Hindi and 12.5% for Bengali over the respective monolingual SGMMs were obtained.
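The two borrowing variants described above differ only in which SGMM parameters are re-estimated on the low-resource data. The short Python summary below uses the usual SGMM notation; the naming is illustrative and not tied to any particular toolkit.

```python
# Which SGMM parameters are borrowed from the high-resource language and which
# are re-estimated on the low-resource data, for the two variants discussed above.
borrowing_variants = {
    "state_vectors_only": {
        "borrowed": ["M_i (phonetic subspace projections)",
                     "w_i (weight projection vectors)",
                     "Sigma_i (covariances)"],
        "re_estimated": ["v_j (state-specific vectors)"],
    },
    "state_vectors_and_weight_projections": {   # the consistently better variant
        "borrowed": ["M_i (phonetic subspace projections)",
                     "Sigma_i (covariances)"],
        "re_estimated": ["v_j (state-specific vectors)",
                         "w_i (weight projection vectors)"],
    },
}
```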
IEEE Transactions on Neural Systems and Rehabilitation Engineering | 2018
Neethu Mariam Joy; S. Umesh
IEEE Transactions on Audio, Speech, and Language Processing | 2018
Neethu Mariam Joy; Sandeep Reddy Kothinti; Srinivasan Umesh