Publication


Featured research published by Jani Nurminen.


IEEE Transactions on Audio, Speech, and Language Processing | 2011

HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering

Tuomo Raitio; Antti Suni; Junichi Yamagishi; Hannu Pulakka; Jani Nurminen; Martti Vainio; Paavo Alku

This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural-sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the HMM framework and generated in the synthesis stage according to the text input. The glottal excitation is synthesized by interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural-sounding speech whose quality is clearly better than that of two HMM-based speech synthesis systems based on widely used vocoder techniques.
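The source-filter step described above can be illustrated with a minimal sketch, assuming a toy glottal pulse and placeholder all-pole filter coefficients rather than the natural glottal flow pulses and HMM-generated spectral features used in the paper:

```python
# Minimal source-filter synthesis sketch (NumPy/SciPy). The glottal pulse
# shape and the vocal tract poles below are illustrative placeholders, not
# the parameters used in the paper.
import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sampling rate in Hz
f0 = 120.0                          # fundamental frequency of the voiced segment
pulse_len = int(fs / f0)            # samples per pitch period

# Toy glottal flow pulse (a crude stand-in for a natural glottal pulse).
t = np.linspace(0.0, 1.0, pulse_len, endpoint=False)
pulse = np.sin(np.pi * t) ** 2

# Concatenate pulses to build roughly 0.5 s of voiced excitation.
excitation = np.tile(pulse, int(0.5 * fs / pulse_len))

# Stable all-pole vocal tract filter 1/A(z) built from two formant-like poles;
# in the real system the filter comes from the HMM-generated spectral features.
poles = np.array([0.95 * np.exp(1j * 0.2 * np.pi),
                  0.90 * np.exp(1j * 0.5 * np.pi)])
a = np.poly(np.concatenate([poles, poles.conj()])).real

speech = lfilter([1.0], a, excitation)   # filter the source through the vocal tract
print(speech.shape)
```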


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Voice Conversion Using Partial Least Squares Regression

Elina Helander; Tuomas Virtanen; Jani Nurminen; Moncef Gabbouj

Voice conversion can be formulated as finding a mapping function which transforms the features of the source speaker to those of the target speaker. Gaussian mixture model (GMM)-based conversion is commonly used, but it is subject to overfitting. In this paper, we propose to use partial least squares (PLS)-based transforms in voice conversion. To prevent overfitting, the degrees of freedom in the mapping can be controlled by choosing a suitable number of components. We propose a technique to combine PLS with GMMs, enabling the use of multiple local linear mappings. To further improve the perceptual quality of the mapping in regions where rapid transitions between GMM components produce audible artefacts, we propose to low-pass filter the component posterior probabilities. The conducted experiments show that the proposed technique results in better subjective and objective quality than the baseline joint density GMM approach. In speech quality conversion preference tests, the proposed method achieved a 67% preference score against the smoothed joint density GMM method and an 84% preference score against the unsmoothed joint density GMM method. In objective tests, the proposed method produced lower Mel-cepstral distortion than the reference methods.
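A minimal sketch of the PLS mapping idea, assuming synthetic aligned source/target frames and an illustrative component count; the GMM combination and posterior smoothing described in the paper are omitted:

```python
# Sketch of PLS-based feature mapping for voice conversion (scikit-learn).
# Synthetic aligned source/target frames stand in for real spectral features;
# the component count is illustrative, not tuned as in the paper.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_frames, dim = 500, 24

# Fake aligned training frames: target is a noisy linear map of the source.
X_src = rng.normal(size=(n_frames, dim))
Y_tgt = X_src @ rng.normal(scale=0.3, size=(dim, dim)) + 0.1 * rng.normal(size=(n_frames, dim))

# Limiting the number of latent components restricts the degrees of freedom
# of the mapping, which is what guards against overfitting on small sets.
pls = PLSRegression(n_components=8).fit(X_src, Y_tgt)

Y_conv = pls.predict(rng.normal(size=(10, dim)))   # converted (target-like) frames
print(Y_conv.shape)
```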


International Conference on Acoustics, Speech, and Signal Processing | 2007

A Novel Method for Prosody Prediction in Voice Conversion

Elina Helander; Jani Nurminen

Most published voice conversion schemes do not consider detailed prosody modeling but only control the F0 level and range. However, the detailed prosody can also carry a significant amount of speaker identity related information. This paper introduces a new method for converting the prosody in voice conversion. A syllable-based prosodic codebook is used to predict the converted F0 using not only the source contour but also linguistic information and segmental durations. The selection of the most suitable target contour is carried out using a trained classification and regression tree. The F0 contours in the codebook are represented in a transformed domain which allows compression and fast comparison. The performance of the prosodic conversion is evaluated in a real voice conversion system. The results indicate a significant improvement in speaker identity and naturalness when compared to a Gaussian mixture model (GMM) based pitch prediction approach.
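A minimal sketch of the codebook idea, assuming synthetic per-syllable features, a hypothetical random contour codebook, and a plain decision tree in place of the paper's trained CART over transformed-domain contours:

```python
# Sketch of codebook-based F0 prediction with a classification tree.
# Features, contours, and target indices are synthetic; the real system uses
# a trained CART over linguistic features, durations and the source contour,
# with codebook contours stored in a transformed domain.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n_syll, n_feats, cb_size, contour_len = 400, 6, 16, 20

codebook = 100.0 + 40.0 * rng.random((cb_size, contour_len))   # prototype F0 contours

features = rng.normal(size=(n_syll, n_feats))                  # per-syllable predictors
target_idx = rng.integers(0, cb_size, size=n_syll)             # best-matching entry per syllable

tree = DecisionTreeClassifier(max_depth=6).fit(features, target_idx)

# At conversion time: predict a codebook index per syllable, read out its contour.
pred_contours = codebook[tree.predict(rng.normal(size=(3, n_feats)))]
print(pred_contours.shape)    # (3, contour_len)
```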


International Conference on Acoustics, Speech, and Signal Processing | 2008

LSF mapping for voice conversion with very small training sets

Elina Helander; Jani Nurminen; Moncef Gabbouj

To make voice conversion usable in practical applications, the number of training sentences should be minimized. With traditional Gaussian mixture model (GMM) based techniques, small training sets lead to over-fitting and estimation problems. We propose a new approach for mapping line spectral frequencies (LSFs) representing the vocal tract. The idea is based on the inherent intra-frame correlations of LSFs. For each target LSF, a separate GMM is used, and only the source and target LSF elements that correlate best with the current LSF are used in training. The proposed method is evaluated both objectively and in listening tests, and it is shown that the method outperforms the conventional GMM approach especially with very small training sets.
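A minimal sketch of the correlation-based element selection, assuming synthetic aligned LSF frames and a plain linear regressor per target dimension in place of the per-dimension GMM used in the paper:

```python
# Sketch of per-dimension LSF mapping with correlation-based input selection.
# A plain linear regressor per target LSF stands in for the per-dimension GMM
# used in the paper; the aligned LSF frames are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_frames, lsf_dim, k_best = 300, 10, 3

src = np.sort(rng.random((n_frames, lsf_dim)), axis=1)   # toy source LSF frames
tgt = np.sort(rng.random((n_frames, lsf_dim)), axis=1)   # toy aligned target LSF frames

models, selected = [], []
for d in range(lsf_dim):
    # Pick the source dimensions that correlate best with target dimension d.
    corr = np.array([abs(np.corrcoef(src[:, j], tgt[:, d])[0, 1]) for j in range(lsf_dim)])
    best = np.argsort(corr)[-k_best:]
    selected.append(best)
    models.append(LinearRegression().fit(src[:, best], tgt[:, d]))

# Convert a new frame dimension by dimension using only the selected inputs.
frame = np.sort(rng.random((1, lsf_dim)), axis=1)
converted = np.array([models[d].predict(frame[:, selected[d]])[0] for d in range(lsf_dim)])
print(converted)
```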


International Conference on Acoustics, Speech, and Signal Processing | 2012

Local linear transformation for voice conversion

Victor Popa; Hanna Silén; Jani Nurminen; Moncef Gabbouj

Many popular approaches to spectral conversion involve linear transformations determined for particular acoustic classes and compute the converted result as a linear combination of different local transformations in an attempt to ensure a continuous conversion. These methods often produce over-smoothed spectra and parameter tracks. The proposed method computes an individual linear transformation for every feature vector based on a small neighborhood in the acoustic space, thus preserving local details. The method effectively reduces the over-smoothing by eliminating undesired contributions from acoustically remote regions. The method is evaluated in listening tests against the well-known Gaussian mixture model (GMM) based conversion, representative of the class of methods involving linear transformations. Perceptual results indicate a clear preference for the proposed scheme.
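A minimal sketch of the per-vector local transformation, assuming synthetic aligned frames and a simple k-nearest-neighbour least-squares fit:

```python
# Sketch of local linear transformation: a separate least-squares map is
# fitted for every input frame from its k nearest neighbours in the aligned
# training data. The source/target frames are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(3)
n_train, dim, k = 400, 12, 30

X_src = rng.normal(size=(n_train, dim))
Y_tgt = X_src @ rng.normal(scale=0.3, size=(dim, dim)) + 0.05 * rng.normal(size=(n_train, dim))

def convert_frame(x):
    """Convert one source frame with a transform fitted on its local neighbourhood."""
    nn = np.argsort(np.linalg.norm(X_src - x, axis=1))[:k]   # k nearest source frames
    Xn = np.hstack([X_src[nn], np.ones((k, 1))])             # add a bias column
    W, *_ = np.linalg.lstsq(Xn, Y_tgt[nn], rcond=None)       # local least-squares map
    return np.append(x, 1.0) @ W

print(convert_frame(rng.normal(size=dim)).shape)
```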


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Supervisory Data Alignment for Text-Independent Voice Conversion

Jianhua Tao; Meng Zhang; Jani Nurminen; Jilei Tian; Xia Wang

We propose new supervisory data alignment methods for text-independent voice conversion that do not require parallel training corpora. Phonetic information is used as a restriction during alignment for mapping the data from the source speaker onto the parameter space of the target speaker. Both linear and nonlinear methods are derived by considering alignment accuracy and topology preservation. For the linear alignment, we consider common phoneme clusters of the source and target space as benchmarks and adapt the source data vectors to the target space while maintaining the relative phonetic positions among neighboring clusters. In order to preserve the topological structure of the source parameter space and improve the stability of conversion and the accuracy of the phonetic mapping, a supervised self-organizing learning algorithm with a phonetic restriction is proposed that iteratively improves the alignment outcome of the previous step. Both the linear and nonlinear methods can also be applied in the cross-lingual case. Evaluation results show that the proposed methods improve alignment accuracy and stability for text-independent voice conversion in both intra-lingual and cross-lingual cases.
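A heavily simplified sketch of the linear alignment idea, assuming synthetic matched phoneme-cluster centroids; the relative-position preservation and the supervised self-organizing refinement from the paper are omitted:

```python
# Heavily simplified sketch of phoneme-cluster-based alignment: each source
# vector is shifted toward the target space by the offset between its nearest
# source-cluster centroid and the matched target-cluster centroid. All data
# are synthetic, and the paper's relative-position constraint and supervised
# self-organizing refinement are not implemented here.
import numpy as np

rng = np.random.default_rng(4)
dim, n_phonemes = 13, 8

# Matched per-phoneme centroids in the source and target feature spaces.
src_centroids = rng.normal(size=(n_phonemes, dim))
tgt_centroids = src_centroids + rng.normal(scale=0.5, size=(n_phonemes, dim))

def align(x):
    """Map a source frame toward the target space via its nearest phoneme cluster."""
    c = np.argmin(np.linalg.norm(src_centroids - x, axis=1))
    return x + (tgt_centroids[c] - src_centroids[c])

print(align(rng.normal(size=dim)).shape)
```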


International Symposium on Chinese Spoken Language Processing | 2004

On analysis of eigenpitch in Mandarin Chinese

Jilei Tian; Jani Nurminen

Prosody is an inherent supra-segmental feature of human speech that is used to express, e.g., attitude, emotion, intent and attention. Pitch is the most important of the prosodic features. For Mandarin Chinese speech, the pitch information is even more crucial because Mandarin is a tonal language in which the tone of each syllable is described by its pitch contour. In this paper, the concept of syllable-based eigenpitch is introduced and investigated using principal component analysis (PCA). The eigenpitch and the related eigenfeatures are analyzed, and it is shown that the tonal patterns are preserved in the eigenpitch representation. Furthermore, we show that the dimension of pitch in the eigenspace can be reduced while minimizing the energy loss of the original pitch contour. Finally, we briefly discuss the quantization properties of the eigenpitch representation. We also present experimental results obtained using a Mandarin speech database; they are in line with the theoretical reasoning and further demonstrate the usefulness of the proposed pitch modeling technique.
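A minimal sketch of eigenpitch analysis, assuming synthetic fixed-length syllable F0 contours in place of contours extracted from a Mandarin database:

```python
# Sketch of eigenpitch analysis: PCA on syllable-level F0 contours.
# The contours are synthetic; in the paper they come from a Mandarin speech
# database and carry the lexical tone patterns.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_syll, contour_len, n_components = 500, 30, 4

# Each row is one syllable's F0 contour resampled to a fixed length.
t = np.linspace(0.0, 1.0, contour_len)
contours = (180.0
            + 40.0 * np.outer(rng.normal(size=n_syll), np.sin(np.pi * t))
            + 5.0 * rng.normal(size=(n_syll, contour_len)))

pca = PCA(n_components=n_components).fit(contours)
codes = pca.transform(contours)          # low-dimensional eigenpitch coefficients
recon = pca.inverse_transform(codes)     # contours reconstructed from the codes

print("retained variance:", pca.explained_variance_ratio_.sum())
print("reconstruction RMSE:", np.sqrt(np.mean((recon - contours) ** 2)))
```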


International Conference on Acoustics, Speech, and Signal Processing | 2005

Optimal subset selection from text databases

Jilei Tian; Jani Nurminen; Imre Kiss

Speech and language processing techniques, such as automatic speech recognition (ASR), text-to-speech (TTS) synthesis, language understanding and translation, will play a key role in tomorrow's user interfaces. Many of these techniques employ models that must be trained using text data. We introduce a novel method for training set selection from text databases. The quality of the training subset is ensured using an objective function that effectively describes the coverage achieved with the strings in the subset. The validity of the subset selection technique is verified in an automatic syllabification task. The results clearly indicate that the proposed systematic selection approach maximizes the quality of the training set, which in turn improves the quality of the trained model. The presented idea can be used in a wide variety of language processing applications that require training with text databases.
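A minimal sketch of a greedy coverage-driven selection loop, assuming character-bigram coverage as a stand-in for the paper's objective function:

```python
# Sketch of greedy training-set selection from a text database. Coverage of
# character bigrams serves as a stand-in objective; the paper's objective
# over string units may differ.
def bigrams(sentence):
    s = sentence.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_select(sentences, subset_size):
    """Pick sentences that maximize the number of distinct covered units."""
    covered, chosen, remaining = set(), [], list(sentences)
    for _ in range(min(subset_size, len(remaining))):
        # Take the sentence that adds the most new units to the coverage set.
        best = max(remaining, key=lambda s: len(bigrams(s) - covered))
        chosen.append(best)
        covered |= bigrams(best)
        remaining.remove(best)
    return chosen, covered

corpus = ["this is a test", "speech synthesis needs data",
          "another example sentence", "text subset selection"]
subset, covered = greedy_select(corpus, 2)
print(subset, len(covered))
```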


Journal of Signal and Information Processing | 2011

A Study of Bilinear Models in Voice Conversion

Victor Popa; Jani Nurminen; Moncef Gabbouj

This paper presents a voice conversion technique based on bilinear models and introduces the concept of contextual modeling. The bilinear approach reformulates the spectral envelope representation from a line spectral frequency (LSF) feature vector into a two-factor parameterization corresponding to speaker identity and phonetic information, the so-called style and content factors. This decomposition offers a flexible representation suitable for voice conversion and facilitates the use of efficient training algorithms based on singular value decomposition. In the contextual approach, the bilinear models are trained on subsets of the training data that are selected on the fly at conversion time, depending on the characteristics of the feature vector to be converted. The performance of bilinear models and contextual modeling is evaluated in objective and perceptual tests by comparison with the popular GMM-based voice conversion method for several training data sizes and types.
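A minimal sketch of an SVD-fitted asymmetric bilinear model in the spirit of style/content factorization, assuming synthetic stacked observations with speakers as the style factor and phonetic units as the content factor:

```python
# Sketch of an asymmetric bilinear (style x content) model fitted with SVD.
# Speakers play the role of style and phonetic units the role of content;
# the observation matrix is synthetic.
import numpy as np

rng = np.random.default_rng(6)
n_speakers, n_contents, dim, J = 3, 40, 20, 5     # J = model dimensionality

# Column c stacks the feature vectors of content c from every speaker,
# giving a (n_speakers*dim) x n_contents observation matrix.
Y = rng.normal(size=(n_speakers * dim, n_contents))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
A = U[:, :J] * s[:J]        # stacked speaker-specific (style) factors
B = Vt[:J, :]               # content factors, one column per phonetic unit

# Reconstruction for speaker k and content c is A_k @ B[:, c].
k, c = 1, 7
A_k = A[k * dim:(k + 1) * dim, :]
print(np.linalg.norm(A_k @ B[:, c] - Y[k * dim:(k + 1) * dim, c]))
```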


International Conference on Acoustics, Speech, and Signal Processing | 2009

Phoneme cluster based state mapping for text-independent voice conversion

Meng Zhang; Jianhua Tao; Jani Nurminen; Jilei Tian; Xia Wang

This paper takes phonetic information into account for data alignment in text-independent voice conversion. Hidden Markov models are used for representing the phonetic structure of the training speech. States belonging to the same phoneme are grouped together to form a phoneme cluster. A state-mapped codebook based transformation is established using information on the corresponding phoneme clusters from the source and target speech together with a weighted linear transform. For each source vector, several of the nearest clusters are considered simultaneously during mapping in order to generate a continuous and stable transform. Experimental results indicate that the proposed use of phonetic information increases the similarity between the converted speech and the target speech. The proposed technique is applicable to both intra-lingual and cross-lingual voice conversion.
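A minimal sketch of the cluster-based mapping, assuming synthetic source/target centroid pairs and a simple distance-weighted centroid-offset shift in place of the weighted linear transforms learned in the paper:

```python
# Sketch of phoneme-cluster-based mapping: each source vector is converted
# with a distance-weighted blend over its nearest clusters' source/target
# centroid pairs. The paper learns weighted linear transforms per cluster;
# a simple centroid-offset shift is used here, on synthetic data.
import numpy as np

rng = np.random.default_rng(7)
dim, n_clusters, k = 13, 12, 3

src_centroids = rng.normal(size=(n_clusters, dim))
tgt_centroids = src_centroids + rng.normal(scale=0.4, size=(n_clusters, dim))

def convert(x):
    """Weighted mapping over the k nearest phoneme clusters."""
    d = np.linalg.norm(src_centroids - x, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-8)
    w /= w.sum()                                   # distance-based weights
    # Blending several clusters keeps the mapping continuous and stable.
    return x + (w[:, None] * (tgt_centroids[nn] - src_centroids[nn])).sum(axis=0)

print(convert(rng.normal(size=dim)).shape)
```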

Collaboration


Dive into Jani Nurminen's collaborations.

Top Co-Authors

Elina Helander, Tampere University of Technology
Moncef Gabbouj, Tampere University of Technology
Hanna Silén, Tampere University of Technology