
Publications


Featured research published by Jouni Pohjalainen.


Computer Speech & Language | 2015

Feature Selection Methods and Their Combinations in High-Dimensional Classification of Speaker Likability, Intelligibility and Personality Traits

Jouni Pohjalainen; Okko Räsänen; Serdar Kadioglu

This study focuses on feature selection in paralinguistic analysis and presents recently developed supervised and unsupervised methods for feature subset selection and feature ranking. Using the standard k-nearest-neighbors (kNN) rule as the classification algorithm, the feature selection methods are evaluated individually and in different combinations in seven paralinguistic speaker trait classification tasks. In each analyzed data set, the overall number of features highly exceeds the number of data points available for training and evaluation, making a well-generalizing feature selection process extremely difficult. The performance of feature sets on the feature selection data is observed to be a poor indicator of their performance on unseen data. The studied feature selection methods clearly outperform a standard greedy hill-climbing selection algorithm by being more robust against overfitting. When the selection methods are suitably combined with each other, the performance in the classification task can be further improved. In general, it is shown that automatic feature selection in paralinguistic analysis can reduce the overall number of features to a fraction of the original feature set size while still achieving comparable or even better performance than baseline support vector machine or random forest classifiers using the full feature set. The most typically selected features for recognition of speaker likability, intelligibility and five personality traits are also reported.
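
The kNN-plus-feature-ranking setup the study builds on can be sketched in a few lines. The snippet below is an illustrative filter-style ranking (univariate F-scores) evaluated with a kNN classifier, not the paper's specific selection methods; the function name and parameter values are assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def rank_and_evaluate(X, y, n_keep=50, k=5):
    """Rank features by a univariate F-score, keep the top n_keep,
    and evaluate the reduced set with the kNN rule."""
    scores, _ = f_classif(X, y)                  # one possible ranking criterion
    top = np.argsort(scores)[::-1][:n_keep]      # indices of the best features
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X[:, top], y, cv=5).mean()
    return top, acc
```

With far more features than training samples, as in the paper's data sets, cross-validated accuracy of the reduced set is a more honest estimate than accuracy measured on the selection data itself.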


IEEE Signal Processing Letters | 2010

Temporally Weighted Linear Prediction Features for Tackling Additive Noise in Speaker Verification

Rahim Saeidi; Jouni Pohjalainen; Tomi Kinnunen; Paavo Alku

Text-independent speaker verification under additive noise corruption is considered. In the popular mel-frequency cepstral coefficient (MFCC) front-end, the conventional Fourier-based spectrum estimation is substituted with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and compared to the FFT, which is normally used in computing MFCCs, and to conventional linear prediction. The effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations is also investigated. Experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. For factory noise at 0 dB SNR, the baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing step. The new features hold promise for noise-robust speaker verification.
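
Temporally weighted linear prediction replaces the ordinary least-squares LP criterion with a weighted one, which amounts to solving weighted normal equations per frame. The sketch below is a generic WLP solver under assumed variable names; with unit weights it reduces to conventional LP, and it does not include the stabilization discussed in the next paper.

```python
import numpy as np

def wlp(x, order=12, w=None):
    """Weighted LP: choose coefficients a to minimize sum_n w[n] * e[n]^2,
    where e[n] = x[n] - sum_k a[k] * x[n-k].  Stability of the resulting
    all-pole filter is not guaranteed by this plain variant."""
    N = len(x)
    if w is None:
        w = np.ones(N - order)                   # unit weights: conventional LP
    # Past-sample matrix: for each n in [order, N), column k holds x[n-1-k].
    P = np.column_stack([x[order - 1 - k:N - 1 - k] for k in range(order)])
    t = x[order:]
    a = np.linalg.solve(P.T @ (w[:, None] * P), P.T @ (w * t))
    return np.concatenate(([1.0], -a))           # all-pole denominator A(z)
```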


Speech Communication | 2009

Stabilised weighted linear prediction

Carlo Magi; Jouni Pohjalainen; Tomas Bäckström; Paavo Alku

Weighted linear prediction (WLP) is a method for computing all-pole models of speech by applying temporal weighting to the square of the residual signal. By using short-time energy (STE) as the weighting function, this algorithm was originally proposed as an improved linear predictive (LP) method that emphasizes those samples which fit the underlying speech production model well. The original formulation of WLP, however, did not guarantee stability of the all-pole models. Therefore, the current work revisits the concept of WLP by introducing a modified short-time energy function that always leads to stable all-pole models. This new method, stabilised weighted linear prediction (SWLP), is shown to yield all-pole models whose general performance can be adjusted by properly choosing the length of the STE window, a parameter denoted by M. The study compares the performance of SWLP, minimum variance distortionless response (MVDR), and conventional LP in spectral modelling of speech corrupted by additive noise. The comparisons were performed by computing, for each method, the logarithmic spectral differences between the all-pole spectra extracted from clean and noisy speech in different segmental signal-to-noise ratio (SNR) categories. The results showed that the proposed SWLP algorithm was the most robust method against zero-mean Gaussian noise, and the robustness was largest for SWLP with a small M value. These findings were corroborated by a small listening test in which the majority of the listeners assessed the quality of impulse-train-excited SWLP filters, extracted from noisy speech, to be perceptually closer to the original clean speech than the corresponding all-pole responses computed by MVDR. Finally, SWLP was compared to other short-time spectral estimation methods (FFT, LP, MVDR) in isolated word recognition experiments. Compared with the other methods, SWLP improved recognition accuracy already at moderate segmental SNR values for sounds corrupted by zero-mean Gaussian noise. For realistic factory noise with low-pass characteristics, SWLP improved the recognition results at segmental SNR levels below 0 dB.
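
The STE weighting at the heart of WLP/SWLP is simple to compute: the weight of sample n is the energy of the M preceding samples. The sketch below pairs with the wlp() sketch above; the extra step that makes SWLP's models provably stable is omitted, so this is only the plain STE-weighted variant.

```python
import numpy as np

def ste_weight(x, order, M=16, floor=1e-6):
    """Short-time energy weight w[n] = sum_{k=n-M}^{n-1} x[k]^2,
    evaluated for n = order..N-1 to match the wlp() sketch above."""
    c = np.concatenate(([0.0], np.cumsum(x ** 2)))   # c[n] = energy of x[0..n-1]
    n = np.arange(order, len(x))
    w = c[n] - c[np.maximum(n - M, 0)]
    return w + floor                                  # floor avoids zero weights in silence

# usage with the earlier sketch:
# a = wlp(frame, order=12, w=ste_weight(frame, order=12, M=16))
```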


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Evaluation of an Artificial Speech Bandwidth Extension Method in Three Languages

Hannu Pulakka; Laura Laaksonen; Martti Vainio; Jouni Pohjalainen; Paavo Alku

Quality and intelligibility of narrowband telephone speech can be improved by artificial bandwidth extension (ABE), which extends the speech bandwidth using only information available in the narrowband speech signal. This paper reports a three-language evaluation of an ABE method that has recently been launched in several of Nokia's mobile telephone models. The method extends the speech bandwidth to frequencies above the telephone band by first utilizing spectral folding and then modifying the magnitude spectrum of the extension band with spline curves. The performance of the method was evaluated by formal listening tests in American English, Russian, and Mandarin Chinese. The results of the listening tests indicate that ABE processing improved the subjective quality of coded narrowband speech in all three languages. Differences between bandwidth-extended American English test sentences and their original wideband counterparts were also evaluated using both an objective distance measure that simulates the characteristics of human hearing and a conventional spectral distortion measure. The average objective error was calculated for different categories of speech sounds; it was found to be smallest in nasals and semivowels and largest in fricatives.
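
Spectral folding itself is a one-line trick: inserting a zero between consecutive narrowband samples doubles the sampling rate and mirrors the 0-4 kHz spectrum into 4-8 kHz. The toy sketch below illustrates only this folding step with a flat gain on the extension band; the method's spline-based spectral shaping and other details are not reproduced.

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def fold_extend(nb, gain=0.3):
    """Toy bandwidth extension of 8 kHz speech to 16 kHz by spectral folding."""
    wb = resample_poly(nb, 2, 1)                 # properly interpolated baseband
    folded = np.zeros(2 * len(nb))
    folded[::2] = nb                             # zero insertion creates the 4-8 kHz image
    b, a = butter(6, 0.5, btype="high")          # keep only the folded band (> 4 kHz)
    return wb + gain * lfilter(b, a, folded)
```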


Journal of the Acoustical Society of America | 2013

Formant frequency estimation of high-pitched vowels using weighted linear prediction

Paavo Alku; Jouni Pohjalainen; Martti Vainio; Anne-Maria Laukkanen; Brad H. Story

All-pole modeling is a widely used formant estimation method, but its performance is known to deteriorate for high-pitched voices. In order to address this problem, several all-pole modeling methods robust to fundamental frequency have been proposed. This study compares five such previously known methods and introduces a new technique, Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME). WLP-AME utilizes temporally weighted linear prediction (LP) in which the square of the prediction error is multiplied by a parametric weighting function. The weighting downgrades the contribution of the main excitation of the vocal tract in optimizing the filter coefficients. Consequently, the resulting all-pole model is affected more by the characteristics of the vocal tract, leading to less biased formant estimates. In experiments with synthetic vowels created using a physical modeling approach, WLP-AME yielded improved formant frequencies for high-pitched sounds in comparison to the previously known methods (e.g., the relative error in the first formant of the vowel [a] decreased from 11% to 3% when conventional LP was replaced with WLP-AME). Experiments conducted on natural vowels indicate that the formants detected by WLP-AME changed in a more regular manner between repetitions at different pitches than those computed by conventional LP.
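
However the all-pole model is obtained (conventional LP, WLP, or WLP-AME), formant frequencies are typically read from the angles of the model polynomial's complex roots. A generic sketch, with illustrative thresholds:

```python
import numpy as np

def formants_from_allpole(A, fs=16000, max_bw=700.0):
    """Estimate formants from the roots of an all-pole polynomial A(z),
    e.g. the output of the wlp() sketch above."""
    r = np.roots(A)
    r = r[np.imag(r) > 0]                        # one root per conjugate pair
    freqs = np.angle(r) * fs / (2 * np.pi)       # pole angle -> frequency in Hz
    bws = -fs / np.pi * np.log(np.abs(r))        # distance from unit circle -> 3 dB bandwidth
    keep = (bws < max_bw) & (freqs > 90.0)       # drop weak or very low resonances
    return np.sort(freqs[keep])
```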


IEEE Signal Processing Letters | 2012

Regularized All-Pole Models for Speaker Verification Under Noisy Environments

Cemal Hanilçi; Tomi Kinnunen; Figen Ertaş; Rahim Saeidi; Jouni Pohjalainen; Paavo Alku

Regularization of linear prediction based mel-frequency cepstral coefficient (MFCC) extraction in speaker verification is considered. Commonly, MFCCs are extracted from the discrete Fourier transform (DFT) spectrum of speech frames. In this paper, the DFT spectrum estimate is replaced with the recently proposed regularized linear prediction (RLP) method. Regularization of the temporally weighted variants, weighted LP (WLP) and stabilized WLP (SWLP), which have earlier shown success in speech and speaker recognition, is also introduced. A novel type of double autocorrelation (DAC) lag windowing is also proposed to enhance robustness. Experiments on the NIST 2002 corpus indicate that the regularized all-pole methods (RLP, RWLP and RSWLP) yield large improvements in recognition accuracy under additive factory and babble noise conditions, in terms of both equal error rate (EER) and minimum detection cost function (MinDCF).
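
The general idea of regularizing the LP normal equations can be illustrated with simple diagonal loading of the autocorrelation matrix; the RLP paper's actual penalty differs, so treat this snippet as a sketch of the principle rather than the published method.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def loaded_lp(x, order=12, lam=1e-4):
    """Autocorrelation-method LP with generic diagonal loading
    (a stand-in for the specific regularizer used in RLP)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    c = r[:order].copy()
    c[0] *= 1.0 + lam                            # load the diagonal of the Toeplitz matrix
    a = solve_toeplitz(c, r[1:order + 1])        # solve (R + lam*r0*I) a = r
    return np.concatenate(([1.0], -a))
```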


Journal of the Acoustical Society of America | 2013

Detection of shouted speech in noise: Human and machine

Jouni Pohjalainen; Tuomo Raitio; Santeri Yrttiaho; Paavo Alku

High vocal effort has characteristic acoustic effects on speech. This study focuses on the utilization of this information by human listeners and a machine-based detection system in the task of detecting shouted speech in the presence of noise. Both female and male speakers read Finnish sentences using normal and shouted voice in controlled conditions, with the sound pressure level recorded. The speech material was artificially corrupted by noise and supplemented with pure noise. The human performance level was statistically evaluated by a listening test, where the subjects labeled noisy samples according to whether shouting was heard or not. A Bayesian detection system was constructed and statistically evaluated. Its performance was compared against that of human listeners, substituting different spectrum analysis methods in the feature extraction stage. Using features capable of taking into account the spectral fine structure (i.e., the fundamental frequency and its harmonics), the machine reached the detection level of humans even in the noisiest conditions. In the listening test, male listeners detected shouted speech significantly better than female listeners, especially with speakers making a smaller vocal effort increase for shouting.
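
A minimal version of such a Bayesian detector can be built from two Gaussian mixture models and a log-likelihood-ratio decision. The sketch below uses scikit-learn and assumed variable names; it stands in for the paper's system without reproducing its exact features or evaluation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_shout_detector(X_shout, X_normal, n_components=8):
    """X_shout / X_normal: (n_frames, n_features) training features.
    Returns a detector that averages frame log-likelihoods over an utterance."""
    g_s = GaussianMixture(n_components, covariance_type="diag").fit(X_shout)
    g_n = GaussianMixture(n_components, covariance_type="diag").fit(X_normal)

    def is_shout(X_utt, threshold=0.0):
        llr = g_s.score_samples(X_utt).mean() - g_n.score_samples(X_utt).mean()
        return llr > threshold                   # Bayes decision for equal priors

    return is_shout
```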


International Conference on Acoustics, Speech, and Signal Processing | 2011

Shout detection in noise

Jouni Pohjalainen; Paavo Alku; Tomi Kinnunen

For the task of detecting shouted speech in a noisy environment, this paper introduces a system based on mel-frequency cepstral coefficient (MFCC) feature extraction, unsupervised frame dropping and Gaussian mixture model (GMM) classification. The evaluation material consists of phonemically identical speech and shouting, as well as environmental noise at varying levels. The performance of the shout detection system is analyzed by varying the MFCC feature extraction with respect to 1) the feature vector length and 2) the spectrum estimation method. Regarding feature vector length, the best performance is obtained using 30 MFCCs, more than is conventionally used. In spectrum estimation, a scheme that combines a linear prediction spectrum envelope with spectral fine structure outperforms the conventional FFT.
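
For reference, extracting a 30-dimensional MFCC vector per frame, the length the paper found best, takes one call in a toolkit such as librosa (one convenient choice; the file name and frame parameters here are illustrative):

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=30,
                            n_fft=512, hop_length=160)   # 10 ms hop at 16 kHz
X = mfcc.T   # (n_frames, 30), usable with the GMM detector sketched above
```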


International Conference on Acoustics, Speech, and Signal Processing | 2012

Comparing spectrum estimators in speaker verification under additive noise degradation

Cemal Hanilçi; Tomi Kinnunen; Rahim Saeidi; Jouni Pohjalainen; Paavo Alku; Figen Ertaş; Johan Sandberg; Maria Hansson-Sandsten

Different short-term spectrum estimators for speaker verification under additive noise are considered. Conventionally, mel-frequency cepstral coefficients (MFCCs) are computed from the discrete Fourier transform (DFT) spectra of windowed speech frames. Recently, linear prediction (LP) and its temporally weighted variants have been substituted as the spectrum analysis method in speech and speaker recognition. In this paper, 12 different short-term spectrum estimation methods are compared for speaker verification under additive noise contamination. Experiments conducted on the NIST 2002 SRE corpus show that the spectrum estimation method has a large effect on recognition performance, and that the stabilized weighted LP (SWLP) and minimum variance distortionless response (MVDR) methods yield approximately 7 % and 8 % relative improvements in equal error rate (EER) over the standard DFT method at the -10 dB SNR level of factory and babble noise, respectively.
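
The EER metric used throughout these comparisons is the operating point where false-acceptance and false-rejection rates meet. A small helper, assuming verification scores and binary target labels:

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """Equal error rate, interpolated from the ROC curve."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))          # point where FPR is closest to FNR
    return (fpr[i] + fnr[i]) / 2.0
```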


International Conference on Acoustics, Speech, and Signal Processing | 2013

Speaker identification from shouted speech: Analysis and compensation

Cemal Hanilçi; Tomi Kinnunen; Rahim Saeidi; Jouni Pohjalainen; Paavo Alku; Figen Ertaş

Text-independent speaker identification is studied using neutral and shouted speech in Finnish to analyze the effect of vocal mode mismatch between training and test utterances. Standard mel-frequency cepstral coefficient (MFCC) features with a Gaussian mixture model (GMM) recognizer are used for speaker identification. The results indicate that speaker identification accuracy drops from perfect (100 %) to 8.71 % under vocal mode mismatch. Because of this dramatic degradation in recognition accuracy, we propose to use a joint-density GMM mapping technique for compensating the MFCC features. This mapping is trained on a disjoint emotional speech corpus to create a completely speaker- and speech-mode-independent emotion-neutralizing mapping. As a result of the compensation, the 8.71 % identification accuracy increases to 32.00 % without much degrading performance in the matched train-test conditions.
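
A joint-density GMM mapping of this kind fits a GMM on stacked source-target feature pairs and maps a new source vector to the posterior-weighted conditional mean E[y | x]. The sketch below assumes frame-aligned parallel features (alignment, e.g. by DTW, done beforehand) and illustrative parameter values; it is a generic JD-GMM mapping, not the paper's trained model.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def jdgmm_mapping(X_src, Y_tgt, n_components=4):
    """Fit a GMM on stacked [x; y] vectors; return a function mapping
    source features to the posterior-weighted conditional mean E[y | x]."""
    X_src, Y_tgt = np.asarray(X_src, float), np.asarray(Y_tgt, float)
    d = X_src.shape[1]
    g = GaussianMixture(n_components, covariance_type="full",
                        reg_covar=1e-4).fit(np.hstack([X_src, Y_tgt]))
    mu_x, mu_y = g.means_[:, :d], g.means_[:, d:]

    def transform(X):
        X = np.asarray(X, float)
        # Component posteriors under the x-marginal of the joint GMM.
        px = np.column_stack([
            w * multivariate_normal.pdf(X, mu_x[m], g.covariances_[m][:d, :d])
            for m, w in enumerate(g.weights_)])
        px /= px.sum(axis=1, keepdims=True)
        Y = np.zeros_like(X)
        for m in range(n_components):
            Sxx = g.covariances_[m][:d, :d]
            Syx = g.covariances_[m][d:, :d]
            cond = mu_y[m] + (X - mu_x[m]) @ np.linalg.solve(Sxx, Syx.T)
            Y += px[:, [m]] * cond               # weight by component posterior
        return Y

    return transform
```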

Collaboration


Jouni Pohjalainen's most frequent co-authors.

Tomi Kinnunen (University of Eastern Finland)

Rahim Saeidi (Radboud University Nijmegen)
