Oytun Türk
Boğaziçi University
Publications
Featured research published by Oytun Türk.
Computer Speech & Language | 2006
Oytun Türk; Levent M. Arslan
Differences in speaker characteristics, recording conditions, and signal processing algorithms affect output quality in voice conversion systems. This study focuses on formulating robust techniques for a codebook-mapping-based voice conversion algorithm. Three methods are used to improve voice conversion performance: confidence measures, pre-emphasis, and spectral equalization. Each method is analyzed and its implementation details are discussed. The first method employs confidence measures in the training stage to eliminate problematic pairs of source and target speech units that may result from misalignments, speaking-style differences, or pronunciation variations. Four confidence measures are developed, based on the spectral distance, fundamental frequency (f0) distance, energy distance, and duration distance between the source and target speech units. The second method addresses the importance of pre-emphasis in line-spectral-frequency (LSF) based vocal tract modeling and transformation. The last method, spectral equalization, reduces the differences between the source and target long-term spectra when the source and target recording conditions differ significantly. The voice conversion algorithm that employs the proposed techniques is compared with the baseline voice conversion algorithm in objective tests as well as three subjective listening tests. First, similarity to the target voice is evaluated in a subjective listening test, which shows that the proposed algorithm improves similarity to the target voice by 23.0%. In an ABX test, the proposed algorithm is preferred over the baseline algorithm 76.4% of the time. In the third test, the two algorithms are compared in terms of the subjective quality of the voice conversion output; the proposed algorithm improves subjective output quality by 46.8% in terms of mean opinion score (MOS).
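To make two of the ideas above concrete, here is a minimal Python sketch of pre-emphasis and of a spectral-distance confidence measure for discarding poorly matched source/target frame pairs. This is an illustration, not the paper's implementation; the filter coefficient and distance threshold are illustrative values.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def spectral_confidence(src_frames, tgt_frames, threshold=2.0):
    """Keep time-aligned frame pairs whose Euclidean spectral distance is
    below a threshold (the 2.0 default is illustrative, not from the paper).
    Returns a boolean mask over frame pairs."""
    distances = np.linalg.norm(src_frames - tgt_frames, axis=1)
    return distances < threshold
```

The paper additionally gates pairs on f0, energy, and duration distances; those follow the same mask-and-filter pattern.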
Intelligent Virtual Agents | 2008
Patrick Gebhard; Marc Schröder; Marcela Charfuelan; Christoph Endres; Michael Kipp; Sathish Pammi; Martin Rumpler; Oytun Türk
In this paper we present two virtual characters in an interactive poker game that uses RFID-tagged poker cards for the interaction. To support the game creation process, we have combined, in a unique way, models, methods, and technology currently investigated in the embodied conversational agent (ECA) research field. A powerful and easy-to-use multimodal dialog authoring tool is used to model game content and interaction. The poker characters rely on a sophisticated model of affect and a state-of-the-art speech synthesizer. During the game, the characters show consistent expressive behavior that reflects their individually simulated affect in speech and animation. As a result, users are provided with an engaging interactive poker experience.
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Oytun Türk; Marc Schröder
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques that reduce database collection and processing effort while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody, as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open-source, modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS output, transforming neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show a tradeoff between identification and naturalness: combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by copy synthesis, contributed to better identification than the approximate models.
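As an illustration of the "GMM-based prediction" mentioned above, here is a minimal sketch of classic joint-density GMM feature mapping: a GMM is fit on stacked source/target features and conversion takes the conditional expectation E[y | x]. This is a textbook-style stand-in, not MARY's actual implementation; component count and feature layout are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=8):
    """Fit a GMM on stacked [source | target] feature vectors (N x 2d)."""
    joint = np.hstack([src_feats, tgt_feats])
    return GaussianMixture(n_components, covariance_type='full').fit(joint)

def convert(gmm, x):
    """Map one source frame x (dim d) to a target-frame estimate via
    E[y | x] under the joint GMM."""
    d = x.shape[0]
    # Posterior responsibility of each component given the source frame.
    resp = np.zeros(gmm.n_components)
    for k in range(gmm.n_components):
        mu_x = gmm.means_[k][:d]
        S_xx = gmm.covariances_[k][:d, :d]
        resp[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mu_x, S_xx)
    resp /= resp.sum()
    # Responsibility-weighted conditional means.
    y_hat = np.zeros(d)
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k][:d], gmm.means_[k][d:]
        S = gmm.covariances_[k]
        S_xx, S_yx = S[:d, :d], S[d:, :d]
        y_hat += resp[k] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y_hat
```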
KI '08: Proceedings of the 31st Annual German Conference on Advances in Artificial Intelligence | 2008
Marc Schröder; Patrick Gebhard; Marcela Charfuelan; Christoph Endres; Michael Kipp; Sathish Pammi; Martin Rumpler; Oytun Türk
In this paper we present an interactive poker game in which one human user plays against two animated agents using RFID-tagged poker cards. The game serves as a showcase for how current AI technologies can provide new features to computer games. A powerful and easy-to-use multimodal dialog authoring tool is used to model game content and interaction. The poker characters rely on a sophisticated model of affect and a state-of-the-art speech synthesizer. Through the combination of these methods, the characters show consistent expressive behavior that enhances the naturalness of interaction in the game.
Signal Processing and Communications Applications Conference | 2004
Oytun Türk; Levent M. Arslan
Speech therapy focuses on methods for the treatment of speech and language disorders. Speech recognition methods are investigated for computer-assisted speech therapy in Turkish. Continuous-mixture hidden Markov models are employed for isolated-phoneme and isolated-word recognition tasks, with special care taken for the recognition of confusable words. A Turkish database is designed and collected from native speakers for the evaluations. Initial experiments indicate an 84.9% correct recognition rate for isolated phonemes and 94.2% for isolated words when the system is tested in speaker-independent mode. A correct recognition rate of 97.2% is achieved with speaker-dependent training for a list of Turkish words used in speech therapy. The recognition rate for word pairs that contain confusable Turkish phonemes is 88.0%. These speech recognition methods are used to develop a software tool for speech therapy that can be trained to adapt to the patient's voice.
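The standard pattern for isolated-word recognition with continuous-density HMMs is one model per vocabulary word, with classification by maximum log-likelihood. Here is a minimal sketch of that pattern; the hmmlearn library, MFCC features, and the 5-state topology are my assumptions, not details from the paper.

```python
import numpy as np
from hmmlearn import hmm

def train_word_models(training_data, n_states=5):
    """training_data: dict mapping word -> list of MFCC arrays (frames x dims).
    Trains one Gaussian HMM per vocabulary word."""
    models = {}
    for word, examples in training_data.items():
        X = np.vstack(examples)                 # stack all examples
        lengths = [len(e) for e in examples]    # per-example frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type='diag')
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, mfcc):
    """Return the word whose model gives the utterance the highest
    log-likelihood."""
    return max(models, key=lambda w: models[w].score(mfcc))
```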
Signal Processing and Communications Applications Conference | 2004
Oytun Türk; Levent M. Arslan
Several problems in the training and transformation stages of voice conversion algorithms reduce output quality. This study focuses on improving output quality in STASC-based voice conversion and proposes five new methods. Robust end-point detection, pre-emphasis, and spectral equalization are the three new methods employed in the training stage. The fourth method employs confidence measures to eliminate source and target HMM states that differ significantly in duration, vocal tract spectrum, pitch, and energy. The last method focuses on improving pitch detection: the optimal parameters of an autocorrelation-based pitch detector are determined separately for male and female speakers through detailed analysis, using f0 values obtained from electro-glottograph signals as the reference. The algorithm that employs the proposed methods is compared with STASC in subjective listening tests. The new methods increase similarity to the target voice by 23.0% and subjective quality by 28.8%.
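For reference, here is a minimal sketch of autocorrelation-based f0 estimation of the kind the paper tunes. The 50-500 Hz search range is illustrative; the paper's point is precisely that such parameters should be optimized separately for male and female speakers.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Return the f0 (Hz) whose lag maximizes the autocorrelation,
    searched between fs/f0_max and fs/f0_min samples."""
    frame = frame - frame.mean()
    # One-sided autocorrelation: index i corresponds to lag i.
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag
```

A practical detector would also gate unvoiced frames (e.g., by an autocorrelation-peak threshold), which is omitted here.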
Journal of the Acoustical Society of America | 2009
Oytun Türk; Levent M. Arslan
This paper focuses on the importance of source speaker selection for a weighted-codebook-mapping-based voice conversion algorithm. First, the dependency on source speakers is evaluated in a subjective listening test using 180 different source-target pairs from a database of 20 speakers. Subjective scores for similarity to the target speaker's voice and for quality are obtained. Statistical analysis of the scores confirms that performance depends on the source speaker for both male-to-male and female-to-female transformations. A source speaker selection algorithm is then devised for choosing among a set of source speaker candidates given a target speaker. For this purpose, an artificial neural network (ANN) is trained to learn the regression between a set of acoustical distance measures and the subjective scores, and the estimated scores are used to rank the source speakers. The average cross-correlation coefficient between rankings obtained from median subjective scores and rankings estimated by the algorithm is 0.84 for similarity and 0.78 for quality in male-to-male transformations. The results for female-to-female transformations are less reliable, with a cross-correlation value of 0.58 for both similarity and quality.
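The ranking idea above reduces to a small regression-then-sort pipeline. Here is a minimal sketch; the MLPRegressor, its size, and Spearman rank correlation (as a stand-in for the paper's cross-correlation of rankings) are my assumptions, not the paper's exact network or metric.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neural_network import MLPRegressor

def train_score_predictor(distance_feats, subjective_scores):
    """distance_feats: (n_pairs, n_measures) acoustic distances between
    candidate source and target; subjective_scores: listener ratings
    for the same source-target pairs."""
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)
    return net.fit(distance_feats, subjective_scores)

def rank_sources(net, candidate_feats):
    """Rank candidate source speakers best-first by predicted score."""
    return np.argsort(-net.predict(candidate_feats))

def ranking_agreement(predicted_rank, listener_rank):
    """Rank correlation between predicted and listener-derived rankings."""
    return spearmanr(predicted_rank, listener_rank).correlation
```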
Conference of the International Speech Communication Association | 2002
Oytun Türk; Levent M. Arslan
Conference of the International Speech Communication Association | 2005
Oytun Türk; Marc Schröder; Baris Bozkurt; Levent M. Arslan
Archive | 2003
Oytun Türk