Ulpu Remes
Aalto University
Publication
Featured research published by Ulpu Remes.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Hannu Pulakka; Ulpu Remes; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku
The quality and intelligibility of narrowband telephone speech can be enhanced by artificial bandwidth extension. This study combines Gaussian mixture model (GMM)-based mel spectrum extension with a filter bank implementation for generating the missing spectral content in the highband at 4–8 kHz. The narrowband mel spectrum is calculated from input speech and the GMM is used to estimate the mel spectrum in the highband. An excitation signal for the highband is generated as a combination of upsampled linear prediction residual and modulated noise. The excitation is divided into sub-bands that are weighted and summed to realize the estimated mel spectrum. The bandwidth-extended output is obtained as the sum of the artificial highband signal and narrowband speech. Listening tests indicate that this method is preferred over narrowband speech and over a previously presented artificial bandwidth extension method which is implemented in some mobile phone models.
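The abstract does not say how the GMM produces the highband estimate. A common choice, assumed here, is the minimum mean-square error (conditional mean) estimate under a joint GMM over narrowband and highband features. The function name `gmm_conditional_mean` and its arguments are hypothetical, and this is a minimal sketch rather than the authors' implementation.

```python
import numpy as np

def gmm_conditional_mean(x, weights, means, covs, dx):
    """Return E[y | x] under a joint GMM over z = [x, y].

    weights: (K,) mixture weights
    means:   (K, D) joint means, first dx dimensions belong to x
    covs:    (K, D, D) joint full covariances
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    n_comp, dim = means.shape
    log_resp = np.empty(n_comp)
    cond_means = np.empty((n_comp, dim - dx))

    for k in range(n_comp):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]
        s_yx = covs[k, dx:, :dx]
        diff = x - mu_x
        solved = np.linalg.solve(s_xx, diff)  # S_xx^{-1} (x - mu_x)

        # Log of weight_k * N(x; mu_x, S_xx), used for the responsibilities.
        _, logdet = np.linalg.slogdet(s_xx)
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            dx * np.log(2.0 * np.pi) + logdet + diff @ solved
        )
        # Per-component conditional mean of y given x.
        cond_means[k] = mu_y + s_yx @ solved

    # Normalize responsibilities in a numerically stable way.
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()
    return resp @ cond_means
```

In this sketch `x` would hold narrowband mel-spectrum features and the returned vector would be the estimated highband mel spectrum; the model parameters would come from training on wideband speech.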
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Hannu Pulakka; Ulpu Remes; Santeri Yrttiaho; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku
The quality of narrowband telephone speech is degraded by the limited audio bandwidth. This paper describes a method that extends the bandwidth of telephone speech to the frequency range 0-300 Hz. The method generates the lowest harmonics of voiced speech using sinusoidal synthesis. The energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized sinusoidal components are adjusted based on the amplitudes and phases of the narrowband input speech, which provides adaptivity to varying input bandwidth characteristics. The proposed method was evaluated with listening tests in combination with another bandwidth extension method for the frequency range 4-8 kHz. While the low-frequency bandwidth extension was not found to improve perceived quality, the method reduced dissimilarity with wideband speech.
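The sinusoidal synthesis step can be illustrated with a minimal sketch. It assumes a constant fundamental frequency over the synthesized segment and takes the harmonic amplitudes and phases as given inputs; in the paper these are derived from the narrowband input and a GMM-based energy estimate, which is not reproduced here. The function name and parameters are hypothetical.

```python
import numpy as np

def synthesize_low_harmonics(f0, amplitudes, phases, duration, fs=16000, f_max=300.0):
    """Sum cosines at multiples of f0 that lie at or below f_max (in Hz)."""
    t = np.arange(int(round(duration * fs))) / fs
    signal = np.zeros_like(t)
    for k, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
        if k * f0 > f_max:
            break  # harmonics above the extension band are not synthesized
        signal += amp * np.cos(2.0 * np.pi * k * f0 * t + phase)
    return signal
```

For a voiced frame with `f0 = 100` Hz, this adds components at 100, 200, and 300 Hz and leaves everything above 300 Hz to the narrowband signal.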
International Conference on Acoustics, Speech, and Signal Processing | 2013
Reima Karhila; Ulpu Remes; Mikko Kurimo
This paper investigates the role of noise in speaker-adaptation of HMM-based text-to-speech (TTS) synthesis and presents a new evaluation procedure. Both a new listening test based on ITU-T recommendation 835 and a perceptually motivated objective measure, frequency-weighted segmental SNR, improve the evaluation of synthetic speech when noise is present. The evaluation of voices adapted with noisy data shows that the noise plays a relatively small but noticeable role in the quality of synthetic speech: naturalness and speaker similarity are not affected in a significant way by the noise, but listeners prefer the voices trained from cleaner data. Noise removal, even when it degrades natural speech quality, improves the synthetic voice.
IEEE Signal Processing Letters | 2011
Ulpu Remes; Kalle J. Palomäki; Tapani Raiko; Antti Honkela; Mikko Kurimo
Missing-feature reconstruction can improve speech recognition performance in unknown noisy environments. In this work, we examine using a nonlinear state-space model (NSSM) for missing-feature reconstruction and propose estimation with observed bounds to improve the NSSM performance. In a large-vocabulary continuous speech recognition task with babble and impulsive noise, using observed bounds in NSSM state estimation significantly improved the method's performance.
IEEE Transactions on Audio, Speech, and Language Processing | 2015
Ulpu Remes; Ana Ramírez López; Kalle J. Palomäki; Mikko Kurimo
Automatic speech recognition systems use noise compensation and acoustic model adaptation to increase robustness towards speaker and environmental variation. The current work focuses on noise compensation with bounded conditional mean imputation (BCMI). BCMI approaches are missing-data methods which operate on the assumption that noise-corrupted observations can be divided into reliable and unreliable components. BCMI methods substitute the unreliable components with a clean speech posterior distribution. The posterior means can be used as clean speech estimates and the posterior variances can be introduced in acoustic model likelihood calculation as observation uncertainties. In addition, we propose in the current work that similar uncertainties are introduced in acoustic model adaptation. Evaluation with speech data recorded in diverse public and car environments indicates that the proposed uncertainties improve adaptation performance. When uncertainties were used in acoustic model likelihood calculation and adaptation, the proposed imputation and adaptation system achieved 15%-84% relative error reductions over an uncompensated baseline system.
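The posterior mean and variance for a single unreliable component can be sketched under two simplifying assumptions that the abstract does not state: the clean-speech prior for that component is one univariate Gaussian, and the noisy observation acts as an upper bound on the clean value. The posterior is then a Gaussian truncated from above. The function name `bounded_posterior` is hypothetical, and the full method is more general than this one-dimensional case.

```python
import math

def bounded_posterior(mu, sigma, upper):
    """Mean and variance of N(mu, sigma^2) truncated to x <= upper."""
    beta = (upper - mu) / sigma
    pdf = math.exp(-0.5 * beta * beta) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(beta / math.sqrt(2.0)))
    # Assumes the bound is not so far below mu that cdf underflows to zero.
    ratio = pdf / cdf
    mean = mu - sigma * ratio                                # clean speech estimate
    var = sigma ** 2 * (1.0 - beta * ratio - ratio ** 2)     # observation uncertainty
    return mean, var
```

The returned mean would serve as the clean speech estimate and the variance as the observation uncertainty. When the bound lies far above the prior mean, both values fall back to the prior mean and variance.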
International Workshop on Acoustic Signal Enhancement | 2014
Emma Jokinen; Ulpu Remes; Marko Takanen; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku
Post-processing methods are used in mobile communications to improve the intelligibility of speech in adverse background noise conditions. In this study, post-processing based on the modification of the spectral tilt with Gaussian mixture models according to the Lombard effect is investigated. A spectral envelope estimation method is studied and optimized for this purpose. Furthermore, the extrapolation of the statistical mapping in a post-processing context is investigated. The proposed post-processing methods are compared to unprocessed speech and a reference method in subjective intelligibility and quality tests in different near-end noise conditions. The results indicate that one of the extrapolated methods achieved the same intelligibility as fixed high-pass filtering without degrading the quality of speech.
IEEE Journal of Selected Topics in Signal Processing | 2014
Reima Karhila; Ulpu Remes; Mikko Kurimo
This work describes experiments on using noisy adaptation data to create personalized voices with HMM-based speech synthesis. We investigate how environmental noise affects feature extraction and CSMAPLR and EMLLR adaptation. We investigate the effects of regression trees and data quantity and test noise-robust feature streams for alignment and NMF-based source separation as preprocessing. The adaptation performance is evaluated using a listening test developed for noisy synthesized speech. The evaluation shows that a speaker-adaptive HMM-TTS system is robust to moderate environmental noise.
Conference of the International Speech Communication Association | 2016
Emma Jokinen; Ulpu Remes; Paavo Alku
Intelligibility of speech in adverse near-end noise conditions can be enhanced with post-processing. Recently, a post-processing method based on statistical mapping of the spectral tilt of normal speech to that of Lombard speech was proposed. However, previous intelligibility improvement studies utilizing Lombard speech have mainly gathered data from read sentences, which might result in a less pronounced Lombard effect. Having a mild Lombard effect in the training data weakens the statistical normal-to-Lombard mapping of the spectral tilt, which in turn degrades the performance of intelligibility enhancement. Therefore, a database containing both conversational and read Lombard speech was recorded in several background noise conditions in this study. Statistical models for normal-to-Lombard mapping of the spectral tilt were then trained using the obtained conversational and read speech data and evaluated using an objective intelligibility metric. The results suggest that the conversational data contains a more pronounced Lombard effect and could be used to obtain better statistical models for intelligibility enhancement.
International Conference on Acoustics, Speech, and Signal Processing | 2015
A. Ramírez López; Nobutaka Ono; Ulpu Remes; Kalle J. Palomäki; Mikko Kurimo
In this paper, an extension of independent vector analysis (IVA), model-based IVA, is proposed for multichannel source separation. To obtain better source models, we introduce a single-channel source separation method and use its outputs as source variances in a time-frequency-variant Gaussian source model. The demixing matrices are estimated in the same way as in a state-of-the-art IVA method, auxiliary-function-based IVA (AuxIVA). Experimental evaluations show that the proposed approach is effective and improves the source separation performance of IVA. In addition, several post-filters aiming to realize a multichannel Wiener filter (MWF) are investigated. This setup proves to further increase the performance of IVA. The presented method shows potential as a general way to carry improvements in single-channel source separation over to multichannel source separation.
International Conference on Acoustics, Speech, and Signal Processing | 2017
Shreyas Seshadri; Ulpu Remes; Okko Räsänen
Non-parametric Bayesian methods have recently gained popularity in several research areas dealing with unsupervised learning. These models are capable of simultaneously learning the cluster models as well as their number based on properties of a dataset. The most commonly applied models use Dirichlet process priors and Gaussian models, and are called Dirichlet process Gaussian mixture models (DPGMMs). Recently, von Mises-Fisher mixture models (VMMs) have also been gaining popularity in modelling high-dimensional unit-normalized features such as text documents and gene expression data. VMMs are potentially more efficient in modeling certain speech representations such as i-vector data when compared to the GMM-based models, as they work with unit-normalized features based on cosine distance. The current work investigates the applicability of Dirichlet process VMMs (DPVMMs) for i-vector-based speaker clustering and verification, showing that they indeed achieve superior performance in comparison to DPGMMs in these tasks. In addition, we introduce an implementation of the DPVMMs with variational inference that is publicly available for use.
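The link between von Mises-Fisher models and cosine distance can be illustrated with a small sketch. For unit-normalized features and components that share one concentration parameter, the most likely component is the one whose mean direction has the highest cosine similarity with the feature. The code below shows only that hard-assignment step with fixed, given mean directions. It is not the Dirichlet process model or the variational inference from the paper, and the function name `assign_by_cosine` is hypothetical.

```python
import numpy as np

def assign_by_cosine(features, mean_directions):
    """Assign each feature vector to the mean direction with highest cosine similarity."""
    features = np.asarray(features, dtype=float)
    mean_directions = np.asarray(mean_directions, dtype=float)
    # Project both features and means onto the unit sphere.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    m = mean_directions / np.linalg.norm(mean_directions, axis=1, keepdims=True)
    cosine = x @ m.T  # (n_features, n_components) cosine similarities
    return cosine.argmax(axis=1), cosine
```

Because only the direction of each vector matters here, scaling a feature vector by a positive constant leaves its assignment unchanged.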