Network


Latest external collaborations at the country level.

Hotspot


Research topics where Antti Suni is active.

Publication


Featured research published by Antti Suni.


IEEE Transactions on Audio, Speech, and Language Processing | 2011

HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering

Tuomo Raitio; Antti Suni; Junichi Yamagishi; Hannu Pulakka; Jani Nurminen; Martti Vainio; Paavo Alku

This paper describes a hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering to generate natural-sounding synthetic speech. In the proposed method, speech is first decomposed into the glottal source signal and the model of the vocal tract filter through glottal inverse filtering, and thus parametrized into excitation and spectral features. The source and filter features are modeled individually in the HMM framework and generated in the synthesis stage according to the text input. The glottal excitation is synthesized by interpolating and concatenating natural glottal flow pulses, and the excitation signal is further modified according to the spectrum of the desired voice source characteristics. Speech is synthesized by filtering the reconstructed source signal with the vocal tract filter. Experiments show that the proposed system is capable of generating natural-sounding speech, and that its quality is clearly better than that of two HMM-based speech synthesis systems based on widely used vocoder techniques.
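The source-filter pipeline described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): an excitation signal is built by concatenating glottal pulses and then filtered through an all-pole vocal tract model. The pulse shape, f0 value, and formant parameters below are illustrative placeholders.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000  # sample rate (Hz)

# Placeholder glottal pulse shape; a real system would use pulses
# obtained by glottal inverse filtering of natural speech.
def glottal_pulse(period_samples):
    t = np.linspace(0.0, 1.0, period_samples, endpoint=False)
    return np.sin(np.pi * t) ** 2 * np.sin(2.0 * np.pi * t)

# Build the excitation by concatenating pitch-period-length pulses for a
# flat 120 Hz contour; an HMM system would generate f0 frame by frame.
period = int(fs / 120.0)
excitation = np.concatenate([glottal_pulse(period) for _ in range(100)])

# All-pole resonator for one formant (pole radius < 1, hence stable).
def formant_filter(freq_hz, bw_hz):
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2.0 * np.pi * freq_hz / fs
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

# Cascade a few illustrative formants as a stand-in for the vocal tract
# filter generated from spectral features.
speech = excitation
for freq, bw in [(500.0, 80.0), (1500.0, 120.0), (2500.0, 160.0)]:
    speech = lfilter([1.0], formant_filter(freq, bw), speech)
```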


International Conference on Acoustics, Speech, and Signal Processing | 2011

Utilizing glottal source pulse library for generating improved excitation signal for HMM-based speech synthesis

Tuomo Raitio; Antti Suni; Hannu Pulakka; Martti Vainio; Paavo Alku

This paper describes a source modeling method for hidden Markov model (HMM) based speech synthesis for improved naturalness. A speech corpus is first decomposed into the glottal source signal and the model of the vocal tract filter using glottal inverse filtering, and parametrized into excitation and spectral features. Additionally, a library of glottal source pulses is extracted from the estimated voice source signal. In the synthesis stage, the excitation signal is generated by selecting appropriate pulses from the library according to the target cost of the excitation features and a concatenation cost between adjacent glottal source pulses. Finally, speech is synthesized by filtering the excitation signal with the vocal tract filter. Experiments show that the naturalness of the synthetic speech is better than or equal to, and speaker similarity better than, that of a system using only a single glottal source pulse.
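The pulse selection step can be sketched as follows, assuming Euclidean distances over hypothetical pulse feature vectors; the greedy search and cost weights are simplifications of the target/concatenation cost scheme described above, not the authors' exact algorithm.

```python
import numpy as np

def select_pulses(library_feats, target_feats, w_target=1.0, w_concat=0.5):
    """Greedy pulse selection: for each target frame, pick the library
    pulse minimizing target cost plus concatenation cost to the previous
    pulse. A full system could use a Viterbi search over the sequence."""
    selected = []
    prev = None
    for target in target_feats:
        target_cost = np.linalg.norm(library_feats - target, axis=1)
        if prev is None:
            concat_cost = 0.0
        else:
            concat_cost = np.linalg.norm(
                library_feats - library_feats[prev], axis=1)
        best = int(np.argmin(w_target * target_cost + w_concat * concat_cost))
        selected.append(best)
        prev = best
    return selected

# Toy example: 50 library pulses and 10 target frames, 4-dim features.
rng = np.random.default_rng(0)
library = rng.normal(size=(50, 4))
targets = rng.normal(size=(10, 4))
print(select_pulses(library, targets))
```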


Journal of the Acoustical Society of America | 2005

Developing a speech intelligibility test based on measuring speech reception thresholds in noise for English and Finnish

Martti Vainio; Antti Suni; Hanna Järveläinen; Juhani Järvikivi; Ville-Veikko Mattila

A subjective test was developed suitable for evaluating the effect of mobile communications devices on sentence intelligibility in background noise. Originally, a total of 25 lists, each including 16 sentences, were developed in British English and Finnish to serve as test stimuli representative of adult language today. The sentences, produced by two male and two female speakers, were normalized for naturalness, length, and intelligibility in each language. The sentence sets were balanced with regard to the expected lexical and phonetic distributions in the given language. The sentence lists are intended for adaptive measurement of speech reception thresholds (SRTs) in noise. In the verification of the test stimuli, SRTs were measured for ten subjects in Finnish and nine subjects in English. Mean SRTs were -2.47 dB in Finnish and -1.12 dB in English, with standard deviations of 1.61 and 2.36 dB, respectively. The mean thresholds did not vary significantly between the lists or the talkers after two lists were removed from the Finnish set and one from the English set. Thus the numbers of lists were reduced from 25 to 23 and 24, respectively. The statistical power of the test increased when thresholds were averaged over several sentence lists. With three lists per condition, the test is able to detect a 1.5-dB difference in SRTs with a probability of about 90%.
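Adaptive SRT measurement of this kind is commonly implemented as a simple up-down staircase on the signal-to-noise ratio: the SNR is lowered after a correct response and raised after an incorrect one, converging near the 50% intelligibility point. The step size and the simulated listener below are illustrative assumptions, not the paper's exact procedure.

```python
import random

def measure_srt(respond, n_sentences=16, start_snr=0.0, step_db=2.0):
    """One-up/one-down staircase over a sentence list: decrease SNR after
    a correct response, increase after an incorrect one, then estimate
    the SRT as the mean SNR over the adaptive track (excluding the first
    few sentences, which mainly serve to approach the threshold)."""
    snr = start_snr
    track = []
    for _ in range(n_sentences):
        track.append(snr)
        snr += -step_db if respond(snr) else step_db
    return sum(track[4:]) / len(track[4:])

# Simulated listener: probability of a correct response rises with SNR
# around a true threshold of -2 dB (purely illustrative).
def simulated_listener(snr, threshold=-2.0):
    p = 1.0 / (1.0 + 10 ** (-(snr - threshold) / 4.0))
    return random.random() < p

print(f"Estimated SRT: {measure_srt(simulated_listener):.1f} dB")
```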


Journal of the Acoustical Society of America | 2010

Phonetic tone signals phonological quantity and word structure

Martti Vainio; Juhani Järvikivi; Daniel Aalto; Antti Suni

Many languages exploit suprasegmental devices in signaling word meaning. Tone languages exploit fundamental frequency whereas quantity languages rely on segmental durations to distinguish otherwise similar words. Traditionally, duration and tone have been taken as mutually exclusive. However, some evidence suggests that, in addition to durational cues, phonological quantity is associated with and co-signaled by changes in fundamental frequency in quantity languages such as Finnish, Estonian, and Serbo-Croat. The results from the present experiment show that the structure of disyllabic word stems in Finnish is indeed signaled tonally and that the phonological length of the stressed syllable is further tonally distinguished within the disyllabic sequence. The results further indicate that the observed association of tone and duration in perception is systematically exploited in speech production in Finnish.


Computer Speech & Language | 2014

Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise

Tuomo Raitio; Antti Suni; Martti Vainio; Paavo Alku

This paper studies the synthesis of speech over a wide vocal effort continuum and its perception in the presence of noise. Three types of speech are recorded and studied along the continuum: breathy, normal, and Lombard speech. Corresponding synthetic voices are created by training and adapting the statistical parametric speech synthesis system GlottHMM. Natural and synthetic speech along the continuum is assessed in listening tests that evaluate the intelligibility, quality, and suitability of speech in three different realistic multichannel noise conditions: silence, moderate street noise, and extreme street noise. The evaluation results show that the synthesized voices with varying vocal effort are rated similarly to their natural counterparts in terms of both intelligibility and suitability.


International Conference on Acoustics, Speech, and Signal Processing | 2013

Comparing glottal-flow-excited statistical parametric speech synthesis methods

Tuomo Raitio; Antti Suni; Martti Vainio; Paavo Alku

This paper studies the performance of glottal flow signal based excitation methods in statistical parametric speech synthesis. The current state of the art in excitation modeling is reviewed, and three excitation methods are selected for experiments. Two of the methods are based on the principal component analysis (PCA) decomposition of estimated glottal flow pulses. While the first one uses only the mean of the pulses, the second method uses 12 principal components in addition to the mean signal for modeling the glottal flow waveform. The third method utilizes a glottal flow pulse library from which pulses are selected according to target and concatenation costs. Subjective listening tests are carried out to determine the quality and similarity of the synthetic speech of one male and one female speaker. The results show that the PCA-based methods are rated best in both quality and similarity, but adding more components does not yield any improvements.
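A minimal sketch of the PCA decomposition step, with synthetic placeholder pulses standing in for pulses estimated by glottal inverse filtering: each pulse is approximated from the mean pulse plus weights on the first few principal components (12 in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder: 200 length-normalized glottal pulses of 64 samples each.
# In the paper these come from glottal inverse filtering of real speech.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
base = np.sin(np.pi * t) ** 2
pulses = base + 0.1 * rng.normal(size=(200, 64))

# PCA via SVD of the mean-removed pulse matrix.
mean_pulse = pulses.mean(axis=0)
centered = pulses - mean_pulse
_, _, components = np.linalg.svd(centered, full_matrices=False)

# Reconstruct a pulse from the mean plus its first 12 principal components.
k = 12
weights = centered[0] @ components[:k].T
approx = mean_pulse + weights @ components[:k]
print("reconstruction error:", np.linalg.norm(pulses[0] - approx))
```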


Speech Communication | 2016

Phase perception of the glottal excitation and its relevance in statistical parametric speech synthesis

Tuomo Raitio; Lauri Juvela; Antti Suni; Martti Vainio; Paavo Alku

Highlights: Phase perception of the glottal excitation is studied. A source-filter vocoder is used to modify the pitch-synchronous excitation phase pattern. Natural-phase, zero-phase, and random-phase excitations are compared. Various speakers and speaking styles are utilized in subjective listening tests. Results show that using natural phase information results in improved speech quality.

While the characteristics of the amplitude spectrum of the voiced excitation have been studied widely in both natural and synthetic speech, the role of the excitation phase has remained less explored. This contradicts findings from sound perception studies indicating that humans are not phase deaf. Especially in speech synthesis, phase information is often omitted for simplicity. This study investigates the impact of the phase information of the excitation signal of voiced speech and its relevance in statistical parametric speech synthesis. The experiments involve, firstly, converting the pitch-synchronously computed original phase spectra of the excitation waveforms (either glottal flow waveforms or residuals) to either zero phase, cyclostationary random phase, or random phase. Secondly, the quality of synthetic speech in each case is compared in subjective listening tests to the corresponding signal excited with the original, natural phase. Experiments are conducted with natural, vocoded, and synthetic speech using voice material from various speakers with varying speaking styles, such as breathy, normal, and Lombard speech. The results indicate that the phase spectrum of the voiced excitation has a perceptually relevant effect in natural, vocoded, and synthetic speech, and utilizing the phase information in speech synthesis leads to improved speech quality.
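The phase manipulations can be sketched with a short FFT example: the amplitude spectrum of an excitation frame is preserved while its phase spectrum is replaced with zeros or random values. This is a minimal illustration, not the paper's exact pitch-synchronous processing; the frame here is a toy signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def set_phase(frame, mode):
    """Keep the amplitude spectrum of a frame but replace its phase with
    zero phase or uniformly random phase; 'natural' returns it unchanged.
    (The paper's cyclostationary variant reuses one random phase pattern
    across pitch periods.)"""
    if mode == "natural":
        return frame
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    if mode == "zero":
        phase = np.zeros_like(magnitude)
    elif mode == "random":
        phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
    else:
        raise ValueError(mode)
    modified = magnitude * np.exp(1j * phase)
    return np.fft.irfft(modified, n=len(frame))

# Toy two-period excitation frame.
frame = np.sin(2 * np.pi * 2 * np.linspace(0, 1, 160, endpoint=False))
zero_phase = set_phase(frame, "zero")
random_phase = set_phase(frame, "random")
```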


Computer Speech & Language | 2017

Hierarchical representation and estimation of prosody using continuous wavelet transform

Antti Suni; Juraj Šimko; Daniel Aalto; Martti Vainio

Highlights: We introduce a wavelet-based representation system for speech prosody. A hierarchy emerges from f0, intensity, and duration. Prominences and boundaries are represented in one framework. The system allows for efficient analysis and annotation of prosodic events. The unsupervised prosodic labelling scheme is comparable with supervised methods.

Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide a means to chunk the speech stream into linguistically relevant units by assigning them relative saliences and demarcating them within utterance structures. Prominences and boundaries have been widely used both in basic research on prosody and in text-to-speech synthesis. However, there are no representation schemes that provide for both estimating and modelling them in a unified fashion. Here we present a unified unsupervised account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on the continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods.
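A minimal sketch of the scale-space idea, assuming a synthetic composite prosodic contour as input: a continuous wavelet transform (implemented directly here with Ricker wavelets) spreads the signal across temporal scales, and local maxima at a word-level scale serve as prominence candidates, with minima between them as boundary candidates.

```python
import numpy as np

def ricker(points, width):
    """Mexican hat (Ricker) wavelet, the second derivative of a Gaussian."""
    t = np.arange(points) - (points - 1) / 2.0
    a = t / width
    return (1.0 - a ** 2) * np.exp(-0.5 * a ** 2)

def cwt(signal, widths):
    """Continuous wavelet transform by convolving with scaled wavelets."""
    return np.array([
        np.convolve(signal, ricker(min(10 * w, len(signal)), w), mode="same")
        for w in widths
    ])

# Synthetic "prosodic" contour: slow phrase component plus word-level bumps.
n = 500
x = np.linspace(0, 1, n)
contour = (np.exp(-((x - 0.5) / 0.4) ** 2)
           + 0.5 * np.exp(-((x - 0.25) / 0.02) ** 2)
           + 0.8 * np.exp(-((x - 0.7) / 0.02) ** 2))

scalogram = cwt(contour, widths=np.arange(4, 64, 4))

# Local maxima of a word-scale row (width 16 samples here) suggest
# prominences; minima between them can serve as boundary candidates.
word_scale = scalogram[3]
peaks = [i for i in range(1, n - 1)
         if word_scale[i - 1] < word_scale[i] > word_scale[i + 1]]
print("prominence candidates at:", peaks[:5])
```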


Archive | 2015

Emphasis, Word Prominence, and Continuous Wavelet Transform in the Control of HMM-Based Synthesis

Martti Vainio; Antti Suni; Daniel Aalto

Speech prosody, especially intonation, is hierarchical in nature. That is, the temporal changes in, e.g., fundamental frequency are caused by different factors in the production of an utterance. The small changes due to segmental articulation—consonants and vowels—are different both in their temporal scope and magnitude when compared to word, phrase, and utterance level changes. Words represent perhaps the most important prosodic level in terms of signaling the utterance internal information structure as well as the information structure that relates an utterance to the discourse background and other utterances in the discourse. In this chapter, we present a modeling scheme for hidden Markov model (HMM)-based parametric speech synthesis using word prominence and continuous wavelet transform (CWT). In this scheme emphasis is treated as one of the extrema in the word prominence scale, which is modeled separately from other temporal scales (segmental, syllabic, phrasal, etc.) using a hierarchical decomposition and superpositional modeling based on CWT. In this chapter, we present results on both automatic labeling of word prominences and pitch contour modeling with an HMM-based synthesis system.
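The hierarchical decomposition and superpositional modeling can be illustrated with a simple sketch: an f0 contour is split into bands at a few temporal scales, the word-level band is scaled up to emphasize it, and the scales are summed back. Gaussian smoothing is used below as a stand-in for the CWT, and all signals are toy placeholders.

```python
import numpy as np

def gaussian_smooth(signal, sigma):
    """Smooth with a Gaussian kernel (simple stand-in for one CWT scale)."""
    radius = int(4 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="same")

def decompose(f0, sigmas=(2.0, 8.0, 32.0)):
    """Split an f0 contour into bands between successive smoothing scales
    (roughly: segmental, syllable/word, phrase) plus a residual baseline.
    The sum of all parts reconstructs the original contour."""
    bands, remainder = [], f0
    for sigma in sigmas:
        smooth = gaussian_smooth(remainder, sigma)
        bands.append(remainder - smooth)
        remainder = smooth
    bands.append(remainder)
    return bands

rng = np.random.default_rng(0)
f0 = 120 + rng.normal(size=300).cumsum()  # toy f0 contour (Hz)
bands = decompose(f0)

# Emphasize the (rough) word-level band and superimpose the scales again.
bands[1] = 1.5 * bands[1]
f0_emphasized = np.sum(bands, axis=0)
```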


International Conference on Acoustics, Speech, and Signal Processing | 2012

On measuring the intelligibility of synthetic speech in noise — Do we need a realistic noise environment?

Tuomo Raitio; Marko Takanen; Olli Santala; Antti Suni; Martti Vainio; Paavo Alku

Assessing the intelligibility of synthetic speech is important in creating synthetic voices for real-life applications, especially ones involving interfering noise. This raises the question of how to measure the intelligibility of synthetic speech so as to correctly simulate such conditions. Conventionally, this has been done using a simple listening test setup where diotic speech and noise are played to both ears with headphones. This is very different from a real noise environment, where speech and noise are spatially distributed. This paper addresses the question of whether a realistic noise environment should be used to test the intelligibility of synthetic speech. Three different test conditions, one with multichannel reproduction of noise and speech and two headphone setups, are evaluated. Tests are performed with natural and synthetic speech, including speech especially intended for noisy conditions. The results indicate a general trend across all setups but also some interesting differences.
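The conventional diotic setup mentioned above amounts to mixing identical speech-plus-noise signals into both ears at a chosen SNR. A minimal sketch with placeholder signals:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio matches
    snr_db, then mix. Played identically to both ears, this gives the
    conventional diotic test condition."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    mono = speech + gain * noise
    return np.stack([mono, mono])  # identical left/right channels

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)   # placeholder for a test sentence
noise = rng.normal(size=16000)    # placeholder for street noise
stereo = mix_at_snr(speech, noise, snr_db=-2.0)
```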

Collaboration


An overview of Antti Suni's collaborations.

Top Co-Authors

Oliver Watts

University of Edinburgh
