Publications

Featured research published by Oliver Jokisch.


international conference on speech and computer | 2016

Quality Assessment of Two Fullband Audio Codecs Supporting Real-Time Communication

Michael Maruschke; Oliver Jokisch; Martin Meszaros; Franziska Trojahn; M. Hoffmann

Recent audio codecs enable high-quality signals up to fullband (20 kHz), which is usually associated with the maximum audible bandwidth. Following previous studies on speech coding assessment, this study surveys the music coding ability of two real-time codecs with fullband capability: the IETF-standardized Opus codec and the 3GPP-specified EVS codec. We tested both codecs with vocal, instrumental and mixed music signals. For evaluation, we predicted human assessments using the instrumental POLQA method, which was primarily designed for speech assessment. Additionally, we performed two listening tests as a reference with a total of 21 young adults. Opus and EVS show a similar music coding performance. The quality assessment depends mainly on the specific music characteristics and on the tested bitrates from 16.4 to 64 kbit/s. The POLQA measure and the listening results correlate, although the absolute ratings of the young listeners yield much lower MOS values.
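The correlation between instrumental POLQA scores and subjective listening results can be illustrated with a plain Pearson coefficient; the MOS lists below are hypothetical values for illustration, not data from the study.

```python
# Pearson correlation between instrumental and subjective MOS ratings.
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

polqa_mos = [3.1, 3.6, 4.0, 4.4]       # hypothetical instrumental scores
subjective_mos = [2.5, 2.9, 3.4, 3.8]  # hypothetical (lower) listener scores
r = pearson(polqa_mos, subjective_mos)
```

A strong positive r with an offset in absolute level matches the pattern the abstract describes: correlated rankings, but lower subjective MOS values.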


international conference on speech and computer | 2015

Review of the Opus Codec in a WebRTC Scenario for Audio and Speech Communication

Michael Maruschke; Oliver Jokisch; Martin Meszaros; Viktor Iaroshenko

The Internet Engineering Task Force (IETF), the open Internet standards-development body, considers the Opus codec a highly versatile audio codec for interactive voice and music transmission. In this review, we survey the dynamic functioning of the Opus codec within a Web Real-Time Communication (WebRTC) framework based on the Google Chrome browser. The codec behavior and the features effectively utilized during an active communication session are tested and analyzed under various test conditions. In the experiments, we verify the Opus performance and interactivity. Relevant codec parameters can easily be adapted in application development. In addition, WebRTC framework-coded speech achieves a MOS assessment similar to stand-alone Opus coding.


international conference on speech and computer | 2017

Acoustic Cues for the Perceptual Assessment of Surround Sound

Ingo Siegert; Oliver Jokisch; Alicia Flores Lotz; Franziska Trojahn; Martin Meszaros; Michael Maruschke

Speech and audio codecs are implemented in a variety of multimedia applications, and multichannel sound is offered by the first streaming and cloud-based services. Besides the objective of perceptual quality, coding-related research focuses on low bitrate and minimal latency. The IETF-standardized Opus codec provides high perceptual quality, low latency and the capability of coding multiple channels at various audio bandwidths up to fullband (20 kHz). In a previous perceptual study on Opus-processed 5.1 surround sound, uncompressed and degraded stimuli were rated on a five-point degradation category scale (DMOS) for six channels at total bitrates between 96 and 192 kbit/s. That study revealed that the perceived quality depends on the music characteristics. In the current study, we analyze spectral and music-feature differences between those five music stimuli at three coding bitrates and in uncompressed sound to identify objective causes for the perceptual differences. The results show that samples with annoying audible degradations involve higher spectral differences within the LFE channel as well as highly uncorrelated LSPs.


conference on computer as a tool | 2013

Runtime and speech quality survey of a voice conversion method

Oliver Jokisch; Yitagessu Birhanu; Rüdiger Hoffmann

Several methods for voice conversion (VC) have been established. Research aims at matching the characteristics of a target speaker while maintaining near-to-natural speech quality. This contribution summarizes listening experiments with four conversion methods, including the assessment of speech quality, listening effort and similarity to the target voice. The subjective evaluation of similarity is checked by an instrumental distance measure based on logarithmic spectral distortion. Practical applications of voice conversion require appropriate runtime performance and memory use. We select a conversion method based on vocal tract length normalization (VTLN) to demonstrate the runtime and quality trade-off. In the case example, we survey the quality assessment depending on different training constellations with varied data amounts and training times. Furthermore, we discuss the runtime performance of the selected conversion method under typical operating conditions. The experiments cover the influence of system resources, the setting of conversion parameters (warping factors) and different training constellations. The observed real-time factors of a non-optimized laboratory VC version are inappropriate for typical application scenarios.
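The real-time factor (RTF) reported in such runtime surveys is conventionally the ratio of processing time to audio duration, with an RTF of at most 1 indicating real-time capability; this minimal sketch is our illustration, not the paper's tooling.

```python
def real_time_factor(processing_seconds, audio_duration_seconds):
    """RTF = wall-clock processing time / duration of the processed audio.

    RTF <= 1.0 means the system keeps up with real time; a non-optimized
    laboratory system may well show RTF > 1.0.
    """
    return processing_seconds / audio_duration_seconds

# Hypothetical numbers: converting 10 s of speech takes 25 s of processing.
rtf = real_time_factor(25.0, 10.0)  # RTF of 2.5, i.e. not real-time capable
```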


international conference on speech and computer | 2018

QuARTCS: A Tool Enabling End-to-Any Speech Quality Assessment of WebRTC-Based Calls

Martin Meszaros; Franziska Trojahn; Michael Maruschke; Oliver Jokisch

Recently, the use of Web Real-Time Communication (WebRTC) technology in communication applications has increased significantly. Users of IP-based telephony expect excellent audio quality. However, in WebRTC-based audio calls, audio assessment is challenging due to the specific functioning principles of WebRTC, such as security requirements, the diversity of endpoints and varying client implementations.


SPECOM | 2018

Prosodic Plot of Dialogues: A Conceptual Framework to Trace Speakers' Role.

Vered Silber-Varod; Anat Lerner; Oliver Jokisch

In this paper we present a proof-of-concept (POC) study that aims to model a conceptual framework for analyzing the structure of dialogues. We demonstrate our approach on a specific research question: how is a speaker's role realized along the dialogue? To this end, we use a unified set of Map Task dialogues that are unique in the sense that each speaker participated twice, once as a follower and once as a leader, with the same interlocutor playing the other role. This pairwise setting enables us to compare prosodic differences in three facets: role, speaker and session. For this POC, we analyze a basic set of prosodic features: talk proportions, pitch and intensity. To create a comparable methodological framework for dialogues, we plotted the three prosodic features in ten equal-sized intervals along each session and used a simple distance measure between the resulting ten-dimensional vectors of each facet for each feature. The prosodic plots of these dialogues reveal the interactions and common behaviour across each facet on the one hand, and allow us to trace potential locations of extreme prosodic values, suggesting pivot points of each facet, on the other.
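The interval-plot-and-distance procedure described above can be sketched as follows; the functions and the frame-wise pitch tracks are hypothetical illustrations, not the study's code or data.

```python
import math

def prosodic_plot(values, n_intervals=10):
    """Mean of a frame-wise feature (e.g. pitch) in n equal-sized intervals."""
    n = len(values)
    plot = []
    for i in range(n_intervals):
        lo = i * n // n_intervals
        hi = (i + 1) * n // n_intervals
        chunk = values[lo:hi]
        plot.append(sum(chunk) / len(chunk))
    return plot

def euclidean(a, b):
    """Simple distance between two ten-dimensional prosodic-plot vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical pitch tracks for the same speaker in the two roles
leader_f0 = [120 + (i % 7) for i in range(100)]
follower_f0 = [110 + (i % 5) for i in range(100)]
d = euclidean(prosodic_plot(leader_f0), prosodic_plot(follower_f0))
```

Comparing such distances across the role, speaker and session facets is the kind of pairwise comparison the setting enables.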


international conference on speech and computer | 2017

Investigating Acoustic Correlates of Broad and Narrow Focus Perception by Japanese Learners of English

Gabor Pinter; Oliver Jokisch; Shinobu Mizuguchi

This work adds to the relatively short line of research concerning second-language prosody perception. Using a prominence-marking experiment, the study demonstrates that Japanese learners of English can perceptually discriminate between different focus scopes. Perceptual score profiles imply that narrowly focused words are identified and discriminated relatively easily, while differentiating between scopes of broad focus presents a greater challenge. An analysis of a range of acoustic cues indicates that perceptual scores correlate most strongly with F0-based features. While this result contradicts previous findings, we show that the divergence is attributable to the particular acoustic characteristics of the stimuli.


international conference on speech and computer | 2017

A Trainable Method for the Phonetic Similarity Search in German Proper Names

Oliver Jokisch; Horst-Udo Hain

Efficient methods for similarity search in word databases play a significant role in various applications such as the robust search or indexing of names and addresses, spell-checking algorithms or the monitoring of trademark rights. The underlying distance measures are associated with the similarity criteria of the users, and phonetic search algorithms have been well established for decades. Nonetheless, rule-based phonetic algorithms exhibit some weak points, e.g. their strong language dependency, the search overhead caused by tolerance or, vice versa, the risk of missing valid matches, which leads to a pseudo-phonetic functionality in some cases. In contrast, we suggest a novel, adaptive method for similarity search in words, which is based on a trainable grapheme-to-phoneme (G2P) converter that generates the most likely, and widely correct, pronunciations. Only as a second step is the similarity search in the phonemic reference data performed, involving a conventional string metric such as the Levenshtein distance (LD). The G2P algorithm achieves a string accuracy of up to 99.5% on a German pronunciation lexicon and can be trained for different languages or specific domains such as proper names. The similarity tolerance can easily be adjusted by parameters such as the admissible number or likelihood of pronunciation variants as well as by the phonemic or graphemic LD. As a proof of concept, we compare the G2P-based search method on a German surname database and a telephone book including first names, surnames and street names to similarity matches by the conventional Cologne phonetics (Kölner Phonetik, KP) algorithm.
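The second step of the proposed method, the LD-based search over G2P output, can be sketched as follows; the phonemic transcriptions in the toy lexicon are hypothetical SAMPA-like forms, not taken from the paper's data.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy lexicon: surname -> hypothetical phonemic form produced by a G2P step
lexicon = {"Meier": "maI6", "Mayer": "maI6", "Maier": "maI6",
           "Meister": "maIst6"}

def similar(query_phonemes, max_dist=1):
    """Names whose phonemic form lies within max_dist of the query (LD)."""
    return sorted(name for name, ph in lexicon.items()
                  if levenshtein(query_phonemes, ph) <= max_dist)
```

Raising `max_dist` plays the role of the adjustable similarity tolerance described above.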


GLU 2017 International Workshop on Grounding Language Understanding | 2017

Automatic Speaker's Role Classification With a Bottom-up Acoustic Feature Selection

Vered Silber-Varod; Anat Lerner; Oliver Jokisch

The objective of the current study is to automatically identify the role played by a speaker in a dialogue. Using machine learning procedures on acoustic features, we wish to automatically trace the footprints of this information in the speech signal. The acoustic features were selected from a large statistics-based set of 1,583 features. The analysis is carried out on interactive dialogues in a Map Task setting. The paper first describes the methodology for choosing the 100 most effective attributes among the 1,583 extracted features, and then presents the classification results for the same speaker in two different roles as well as a gender-based classification. Results show an average classification rate of 71% for the role the same speaker played, 65% for all women together and 65% for all men together.
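A bottom-up top-k feature selection of the kind described above can be sketched with a simple Fisher-style score; the criterion and the toy data are our assumptions, not the study's exact procedure over its 1,583 features.

```python
import math

def fisher_score(feature, labels):
    """|mean difference| between two classes, relative to the pooled spread."""
    a = [x for x, y in zip(feature, labels) if y == 0]
    b = [x for x, y in zip(feature, labels) if y == 1]
    mean = lambda v: sum(v) / len(v)
    var = lambda v, m: sum((x - m) ** 2 for x in v) / len(v)
    ma, mb = mean(a), mean(b)
    spread = math.sqrt(var(a, ma) + var(b, mb)) or 1.0  # avoid division by 0
    return abs(ma - mb) / spread

def select_top_k(feature_columns, labels, k):
    """Indices of the k best-scoring feature columns (e.g. 100 of 1,583)."""
    scores = [(fisher_score(col, labels), i)
              for i, col in enumerate(feature_columns)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]
```

With real data, `feature_columns` would hold one column per extracted acoustic feature and `labels` the speaker's role (leader vs. follower) per sample.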


spoken language technology workshop | 2012

Syllable-based prosodic analysis of Amharic read speech

Oliver Jokisch; Yitagessu Birhanu; Rüdiger Hoffmann

Amharic, the official language of Ethiopia, belongs to the under-resourced languages. Analyzing a new corpus of Amharic read speech, this contribution surveys syllable-based prosodic variations in f0, duration and intensity in order to develop suitable prosody models for speech synthesis and recognition. The article starts with a brief description of the Amharic script, the prosodic analysis methods, an accentuation experiment using resynthesis and a perceptual test. The main part summarizes stress-related analysis results for f0, duration and intensity and their interrelations. The quantitative variations of Amharic are comparable with the range found in well-examined languages. The observed modifications in syllable duration and mean f0 proved to be relevant for stress perception, as demonstrated in the perceptual test with resynthesis stimuli.

Collaboration

Dive into Oliver Jokisch's collaboration.

Top co-authors:

Alicia Flores Lotz (Otto-von-Guericke University Magdeburg)
Ingo Siegert (Otto-von-Guericke University Magdeburg)
Rüdiger Hoffmann (Dresden University of Technology)
Yitagessu Birhanu (Dresden University of Technology)
Anat Lerner (Open University of Israel)