Ulpu Remes | Researchain

Publication

Featured research published by Ulpu Remes.

International Conference on Acoustics, Speech, and Signal Processing | 2011

Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum

Hannu Pulakka; Ulpu Remes; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku

The quality and intelligibility of narrowband telephone speech can be enhanced by artificial bandwidth extension. This study combines Gaussian mixture model (GMM)-based mel spectrum extension with a filter bank implementation for generating the missing spectral content in the highband at 4–8 kHz. The narrowband mel spectrum is calculated from input speech and the GMM is used to estimate the mel spectrum in the highband. An excitation signal for the highband is generated as a combination of upsampled linear prediction residual and modulated noise. The excitation is divided into sub-bands that are weighted and summed to realize the estimated mel spectrum. The bandwidth-extended output is obtained as the sum of the artificial highband signal and narrowband speech. Listening tests indicate that this method is preferred over narrowband speech and over a previously presented artificial bandwidth extension method that is implemented in some mobile phone models.
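
As a rough illustration of the GMM estimation step described in the abstract above, the following sketch computes a conditional-mean estimate of one highband feature from one narrowband feature under a joint GMM. The scalar features, the function name `gmm_conditional_mean`, and the toy parameters are assumptions made for illustration; the abstract does not state which estimator or feature dimensions the authors used.

```python
import math

def gmm_conditional_mean(x, components):
    """Estimate E[y | x] under a joint GMM over scalar (x, y).

    components: list of dicts with keys w (weight), mx and my (means),
    sxx (variance of x) and sxy (covariance of x and y).
    All parameter values used here are hypothetical.
    """
    # Weighted likelihood of x under each component's marginal N(x; mx, sxx).
    likes = []
    for c in components:
        norm = 1.0 / math.sqrt(2.0 * math.pi * c["sxx"])
        likes.append(c["w"] * norm * math.exp(-0.5 * (x - c["mx"]) ** 2 / c["sxx"]))
    total = sum(likes)
    # Posterior-weighted sum of the per-component linear regressions.
    estimate = 0.0
    for like, c in zip(likes, components):
        cond_mean = c["my"] + c["sxy"] / c["sxx"] * (x - c["mx"])
        estimate += (like / total) * cond_mean
    return estimate

# Toy two-component model (illustrative values only).
toy_gmm = [
    {"w": 0.5, "mx": -1.0, "my": 0.0, "sxx": 1.0, "sxy": 0.5},
    {"w": 0.5, "mx": 1.0, "my": 2.0, "sxx": 1.0, "sxy": 0.5},
]
highband_estimate = gmm_conditional_mean(0.0, toy_gmm)
```

In a real system the features would be vectors (for example, mel-band energies), and the scalar ratio `sxy / sxx` would become a matrix product with the inverse narrowband covariance.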

IEEE Transactions on Audio, Speech, and Language Processing | 2012

Bandwidth Extension of Telephone Speech to Low Frequencies Using Sinusoidal Synthesis and a Gaussian Mixture Model

Hannu Pulakka; Ulpu Remes; Santeri Yrttiaho; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku

The quality of narrowband telephone speech is degraded by the limited audio bandwidth. This paper describes a method that extends the bandwidth of telephone speech to the frequency range 0-300 Hz. The method generates the lowest harmonics of voiced speech using sinusoidal synthesis. The energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized sinusoidal components are adjusted based on the amplitudes and phases of the narrowband input speech, which provides adaptivity to varying input bandwidth characteristics. The proposed method was evaluated with listening tests in combination with another bandwidth extension method for the frequency range 4-8 kHz. While the low-frequency bandwidth extension was not found to improve perceived quality, the method reduced dissimilarity with wideband speech.
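
The sinusoidal synthesis idea in the abstract above can be sketched as follows: sum sinusoids at the harmonics of the fundamental frequency that fall below 300 Hz. The function name, the fixed amplitude, and the zero phases are simplifying assumptions; in the described method the amplitudes and phases are adapted to the narrowband input, and the extension-band energy is estimated with a GMM.

```python
import math

def synthesize_low_harmonics(f0, fs, num_samples, cutoff=300.0, amplitude=1.0):
    """Sum zero-phase sinusoids at the harmonics of f0 below `cutoff` Hz."""
    if f0 <= 0:
        raise ValueError("f0 must be positive")
    # Harmonic frequencies k * f0 strictly below the cutoff.
    harmonics = []
    k = 1
    while k * f0 < cutoff:
        harmonics.append(k * f0)
        k += 1
    # One sine per harmonic, summed sample by sample.
    signal = [
        sum(amplitude * math.sin(2.0 * math.pi * f * n / fs) for f in harmonics)
        for n in range(num_samples)
    ]
    return harmonics, signal

# Example: a 120 Hz voice sampled at 16 kHz, one 10 ms frame (illustrative values).
harmonics, lowband = synthesize_low_harmonics(f0=120.0, fs=16000, num_samples=160)
```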

International Conference on Acoustics, Speech, and Signal Processing | 2013

HMM-based speech synthesis adaptation using noisy data: Analysis and evaluation methods

Reima Karhila; Ulpu Remes; Mikko Kurimo

This paper investigates the role of noise in speaker adaptation of HMM-based text-to-speech (TTS) synthesis and presents a new evaluation procedure. Both a new listening test based on ITU-T Recommendation P.835 and a perceptually motivated objective measure, frequency-weighted segmental SNR, improve the evaluation of synthetic speech when noise is present. The evaluation of voices adapted with noisy data shows that the noise plays a relatively small but noticeable role in the quality of synthetic speech: naturalness and speaker similarity are not significantly affected by the noise, but listeners prefer the voices trained from cleaner data. Noise removal, even when it degrades natural speech quality, improves the synthetic voice.

IEEE Signal Processing Letters | 2011

Missing-Feature Reconstruction With a Bounded Nonlinear State-Space Model

Ulpu Remes; Kalle J. Palomäki; Tapani Raiko; Antti Honkela; Mikko Kurimo

Missing-feature reconstruction can improve speech recognition performance in unknown noisy environments. In this work, we examine using a nonlinear state-space model (NSSM) for missing-feature reconstruction and propose estimation with observed bounds to improve the NSSM performance. In a large-vocabulary continuous speech recognition task with babble and impulsive noise, using observed bounds in NSSM state estimation significantly improved the method's performance.

IEEE Transactions on Audio, Speech, and Language Processing | 2015

Bounded conditional mean imputation with observation uncertainties and acoustic model adaptation

Ulpu Remes; Ana Ramírez López; Kalle J. Palomäki; Mikko Kurimo

Automatic speech recognition systems use noise compensation and acoustic model adaptation to increase robustness towards speaker and environmental variation. The current work focuses on noise compensation with bounded conditional mean imputation (BCMI). BCMI approaches are missing-data methods which operate on the assumption that noise-corrupted observations can be divided into reliable and unreliable components. BCMI methods substitute the unreliable components with a clean speech posterior distribution. The posterior means can be used as clean speech estimates and the posterior variances can be introduced in acoustic model likelihood calculation as observation uncertainties. In addition, we propose in the current work that similar uncertainties are introduced in acoustic model adaptation. Evaluation with speech data recorded in diverse public and car environments indicates that the proposed uncertainties improve adaptation performance. When uncertainties were used in acoustic model likelihood calculation and adaptation, the proposed imputation and adaptation system achieved 15%–84% relative error reductions over an uncompensated baseline system.
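
To make the posterior mean and variance in the abstract above concrete, the sketch below computes both for a single unreliable component. It assumes a single Gaussian clean-speech prior and treats the noisy observation as an upper bound on the clean value; both are simplifying assumptions not spelled out in the abstract, and the function name `bounded_posterior` is hypothetical.

```python
import math

def bounded_posterior(mu, sigma, upper):
    """Mean and variance of N(mu, sigma^2) truncated to values <= upper."""
    beta = (upper - mu) / sigma
    pdf = math.exp(-0.5 * beta * beta) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(beta / math.sqrt(2.0)))
    if cdf < 1e-12:
        # Bound lies far below the prior mean: fall back to the bound itself.
        return upper, 0.0
    ratio = pdf / cdf
    mean = mu - sigma * ratio
    var = sigma * sigma * (1.0 - beta * ratio - ratio * ratio)
    # Clamp tiny negative values caused by floating-point cancellation.
    return mean, max(var, 0.0)

# Prior N(0, 1) with the observation bounding the clean value at 0 (toy values).
clean_mean, clean_var = bounded_posterior(mu=0.0, sigma=1.0, upper=0.0)
```

The mean would serve as the clean speech estimate and the variance as the observation uncertainty; with a very loose bound, both reduce to the prior mean and variance.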

International Workshop on Acoustic Signal Enhancement | 2014

Spectral tilt modelling with extrapolated GMMs for intelligibility enhancement of narrowband telephone speech

Emma Jokinen; Ulpu Remes; Marko Takanen; Kalle J. Palomäki; Mikko Kurimo; Paavo Alku

Post-processing methods are used in mobile communications to improve the intelligibility of speech in adverse background noise conditions. In this study, post-processing based on the modification of the spectral tilt with Gaussian mixture models according to the Lombard effect is investigated. A spectral envelope estimation method is studied and optimized for this purpose. Furthermore, the extrapolation of the statistical mapping in a post-processing context is investigated. The proposed post-processing methods are compared to unprocessed speech and a reference method in subjective intelligibility and quality tests in different near-end noise conditions. The results indicate that one of the extrapolated methods achieved the same intelligibility as fixed high-pass filtering without degrading the quality of speech.

IEEE Journal of Selected Topics in Signal Processing | 2014

Noise in HMM-Based Speech Synthesis Adaptation: Analysis, Evaluation Methods and Experiments

Reima Karhila; Ulpu Remes; Mikko Kurimo

This work describes experiments on using noisy adaptation data to create personalized voices with HMM-based speech synthesis. We investigate how environmental noise affects feature extraction and CSMAPLR and EMLLR adaptation. We investigate the effects of regression trees and data quantity, and test noise-robust feature streams for alignment as well as NMF-based source separation as preprocessing. The adaptation performance is evaluated using a listening test developed for noisy synthesized speech. The evaluation shows that a speaker-adaptive HMM-TTS system is robust to moderate environmental noise.

Conference of the International Speech Communication Association | 2016

The use of read versus conversational Lombard speech in spectral tilt modeling for intelligibility enhancement in near-end noise conditions

Emma Jokinen; Ulpu Remes; Paavo Alku

Intelligibility of speech in adverse near-end noise conditions can be enhanced with post-processing. Recently, a post-processing method based on statistical mapping of the spectral tilt of normal speech to that of Lombard speech was proposed. However, previous intelligibility improvement studies utilizing Lombard speech have mainly gathered data from read sentences, which might result in a less pronounced Lombard effect. Having a mild Lombard effect in the training data weakens the statistical normal-to-Lombard mapping of the spectral tilt, which in turn degrades the performance of intelligibility enhancement. Therefore, a database containing both conversational and read Lombard speech was recorded in several background noise conditions in this study. Statistical models for normal-to-Lombard mapping of the spectral tilt were then trained using the obtained conversational and read speech data and evaluated using an objective intelligibility metric. The results suggest that the conversational data contains a more pronounced Lombard effect and could be used to obtain better statistical models for intelligibility enhancement.

International Conference on Acoustics, Speech, and Signal Processing | 2015

Designing multichannel source separation based on single-channel source separation

A. Ramírez López; Nobutaka Ono; Ulpu Remes; Kalle J. Palomäki; Mikko Kurimo

In this paper, an extension of independent vector analysis (IVA), model-based IVA, is proposed for multichannel source separation. To obtain better source models, we introduce a single-channel source separation method and use its outputs as source variances in a time-frequency-variant Gaussian source model. The demixing matrices are estimated in the same way as in a state-of-the-art IVA method, auxiliary-function-based IVA (AuxIVA). Experimental evaluations show that the proposed approach is effective and improves the source separation performance of IVA. In addition, several post-filters aiming to realize a multichannel Wiener filter (MWF) are investigated. This setup proves to further increase the performance of IVA. The presented method shows potential to provide a general way to improve the separation performance from single-channel source separation to multichannel source separation.

International Conference on Acoustics, Speech, and Signal Processing | 2017

Dirichlet process mixture models for clustering i-vector data

Shreyas Seshadri; Ulpu Remes; Okko Räsänen

Non-parametric Bayesian methods have recently gained popularity in several research areas dealing with unsupervised learning. These models are capable of simultaneously learning the cluster models as well as their number based on properties of a dataset. The most commonly applied models use Dirichlet process priors and Gaussian models and are called Dirichlet process Gaussian mixture models (DPGMMs). Recently, von Mises-Fisher mixture models (VMMs) have also been gaining popularity in modeling high-dimensional unit-normalized features such as text documents and gene expression data. VMMs are potentially more efficient in modeling certain speech representations such as i-vector data when compared to the GMM-based models, as they work with unit-normalized features based on cosine distance. The current work investigates the applicability of Dirichlet process VMMs (DPVMMs) for i-vector-based speaker clustering and verification and finds that they outperform DPGMMs in both tasks. In addition, we introduce an implementation of the DPVMMs with variational inference that is publicly available for use.
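
The link between von Mises-Fisher components and cosine distance mentioned in the abstract above can be illustrated with a much simpler stand-in: hard assignment of unit-normalized vectors to the mean direction with the highest cosine similarity. This corresponds to a VMM with equal weights and a shared concentration parameter, and a fixed number of clusters. It is not the Dirichlet process model or the variational inference implementation from the paper; the function names and toy vectors are assumptions.

```python
import math

def unit_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def assign_by_cosine(vectors, mean_directions):
    """Assign each vector to the mean direction with the largest cosine similarity."""
    means = [unit_normalize(m) for m in mean_directions]
    labels = []
    for v in vectors:
        u = unit_normalize(v)
        # For unit vectors the dot product equals the cosine similarity.
        sims = [sum(a * b for a, b in zip(u, m)) for m in means]
        labels.append(max(range(len(means)), key=lambda k: sims[k]))
    return labels

# Toy 2-D "i-vectors" and two cluster directions (illustrative values only).
labels = assign_by_cosine(
    vectors=[[2.0, 0.1], [0.2, 3.0], [5.0, -0.5]],
    mean_directions=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the vectors are normalized first, their lengths play no role in the assignment; only their directions matter.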
