Network


Matej Rojc's latest external collaborations at the country level.

Hotspot


Dive into the research topics where Matej Rojc is active.

Publication


Featured research published by Matej Rojc.


Speech Communication | 2007

Time and space-efficient architecture for a corpus-based text-to-speech synthesis system

Matej Rojc; Zdravko Kacic

This paper proposes a time- and space-efficient architecture for a text-to-speech (TTS) synthesis system. The proposed architecture can be used efficiently in unlimited-domain applications requiring multilingual or polyglot functionality. The integration of a queuing mechanism, heterogeneous graphs, and finite-state machines yields a powerful, reliable, and easily maintainable architecture for the TTS system. A flexible, language-independent framework efficiently integrates all the algorithms used within the scope of the TTS system. Heterogeneous relation graphs are used for representing linguistic information and constructing features. Finite-state machines are used for the time- and space-efficient representation of language resources, for time- and space-efficient lookup, and for separating language-dependent resources from the language-independent TTS engine. The queuing mechanism consists of several deque (double-ended queue) data structures and is responsible for activating all TTS engine modules that have to process the input text. In the proposed architecture, all modules use the same data structure for gathering linguistic information about the input text; all input and output formats are compatible; and the structure is modular, interchangeable, easily maintainable, and object oriented. The proposed architecture was successfully used in implementing the Slovenian PLATTOS corpus-based TTS system, as presented in this paper.
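
The queuing mechanism and shared utterance structure described above can be pictured with a minimal Python sketch. The module functions and all names here are illustrative assumptions, not the PLATTOS API:

```python
from collections import deque

def tokenize(utt):
    utt["tokens"] = utt["text"].split()
    return utt

def phonetize(utt):
    # placeholder "phonetization": real modules would consult FSM-based lexica
    utt["phones"] = [t.lower() for t in utt["tokens"]]
    return utt

class Pipeline:
    def __init__(self, modules):
        # one deque per module, echoing the paper's queuing mechanism
        self.modules = [(m, deque()) for m in modules]

    def process(self, text):
        utt = {"text": text}   # single shared structure used by all modules
        self.modules[0][1].append(utt)
        for i, (module, queue) in enumerate(self.modules):
            while queue:
                utt = module(queue.popleft())
                if i + 1 < len(self.modules):
                    self.modules[i + 1][1].append(utt)
        return utt

print(Pipeline([tokenize, phonetize]).process("Hello world"))
```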


Archive | 2011

Multilingual and Multimodal Corpus-Based Text-to-Speech System - PLATTOS -

Matej Rojc; Izidor Mlakar

Over the last decade, many TTS systems have been developed around the world that are more or less language-dependent and more or less time- and space-efficient (Campbell & Black, 1996; Holzapfel, 2000; Raitio et al., 2011; Sproat, 1998; Taylor et al., 1998). However, speech technology-based applications demand time- and space-efficient multilingual, polyglot, and multimodal TTS systems. Given these demands, and the need for a powerful, flexible, reliable, and easily maintainable multimodal text-to-speech synthesis system, a design pattern is presented that serves as a flexible, language-independent framework for efficiently pipelining all text-to-speech processing steps. The presented design pattern is based on a time- and space-efficient architecture in which finite-state machines (FSMs) and heterogeneous relation graphs (HRGs) are integrated into a common TTS engine through a so-called ‘‘queuing mechanism’’. FSMs provide a time- and space-efficient representation of language resources and are used to separate language-dependent parts from the language-independent TTS engine. The HRG structure, on the other hand, stores all linguistic and acoustic knowledge about the input sentence, represents very heterogeneous data, and supports the flexible feature construction needed by the various machine-learned models used in general TTS systems. In this way, all the algorithms in the presented TTS system use the same data structure for gathering linguistic information about the input text; all input and output formats between modules are compatible; and the structure is modular, interchangeable, easily maintainable, and object oriented (Rojc & Kačič, 2007). The general idea of corpus-based speech synthesis is to use a large speech corpus as the acoustic inventory and to create realistic-sounding, machine-generated speech from raw waveform segments that are concatenated directly, with little or no signal processing. Since only a speech corpus of limited size can be used, a compromise must normally be reached between the number of speech units in different prosodic contexts and the overall corpus size. The unit-selection algorithm, in turn, has to select the most suitable sequence of units from the acoustic inventory, where longer units should be favoured: using longer units reduces the number of concatenation points, resulting in more natural synthetic speech. The performance of the overall unit-selection algorithm for corpus-based synthesis, regarding quality and speed, depends on solving several issues, e.g., text corpus preparation and acoustic inventory construction.
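
As a rough illustration of the HRG idea, one shared utterance structure holding several relations over linked items, here is a minimal sketch; the class and relation names are assumptions, not the actual implementation:

```python
class Item:
    """One linguistic item; may be referenced from several relations."""
    def __init__(self, **features):
        self.features = features

class Relation:
    def __init__(self, name):
        self.name, self.items = name, []
    def append(self, item):
        self.items.append(item)
        return item

class HRG:
    """One utterance structure shared by every module in the pipeline."""
    def __init__(self):
        self.relations = {}
    def create_relation(self, name):
        return self.relations.setdefault(name, Relation(name))

utt = HRG()
words = utt.create_relation("Word")
sylls = utt.create_relation("Syllable")
w = words.append(Item(text="plattos", pos="NOUN"))
sylls.append(Item(text="plat", parent=w))   # syllables link back to their word
sylls.append(Item(text="tos", parent=w))
```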


language resources and evaluation | 2007

Annotating discourse markers in spontaneous speech corpora on an example for the Slovenian language

Darinka Verdonik; Matej Rojc; Marko Stabej

Speech-to-speech translation technology has difficulties processing elements of spontaneity in conversation. We propose a discourse marker attribute in speech corpora to help overcome some of these problems. There have already been some attempts to annotate discourse markers in speech corpora. However, since there is no consensus on which expressions count as discourse markers, we have to reconsider how to set a framework for annotating them, and, in order to better understand what we gain by introducing a discourse marker category, we have to analyse their characteristics and functions in discourse. This is especially important for languages such as Slovenian, where little or no research on discourse markers has been carried out. The aims of this paper are to present a scheme for annotating discourse markers, based on the analysis of a corpus of telephone conversations in the tourism domain in the Slovenian language, and to give some additional arguments, based on the characteristics and functions of discourse markers, that confirm their special status in conversation.
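
As an illustration of what a discourse-marker attribute might look like in practice, here is a toy annotator; the candidate list and attribute name are invented for illustration and do not reproduce the paper's annotation scheme:

```python
CANDIDATES = {"no", "ja", "mhm", "eee"}   # hypothetical marker candidates

def annotate(tokens):
    # attach a discourse-marker attribute to every token
    return [{"word": t, "discourse_marker": t in CANDIDATES} for t in tokens]

for token in annotate("ja saj to je res".split()):
    print(token)
```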


COST'10 Proceedings of the 2010 international conference on Analysis of Verbal and Nonverbal Communication and Enactment | 2010

Towards ECA's animation of expressive complex behaviour

Izidor Mlakar; Matej Rojc

Multimodal interfaces supporting ECAs enable the development of novel concepts for human-machine interaction interfaces and provide several communication channels, such as natural speech, facial expression, and different body gestures. This paper presents the synthesis of expressive behaviour within the realm of affective computing. By providing descriptions of different expressive parameters (e.g. temporal, spatial, power, and different degrees of fluidity) and the context of unplanned behaviour, it addresses the synthesis of expressive behaviour by enabling the ECA to visualize complex human-like body movements (e.g. expressions, emotional speech, hand and head gestures, gaze, and complex emotions). Movements performed by our ECA EVA are reactive, do not require extensive planning phases, and can be represented hierarchically as a set of different events. The animation concepts prevent the synthesis of unnatural movements even when two or more behavioural events influence the same segments of the body (e.g. speech with different facial expressions).
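
A minimal sketch of the conflict-avoidance idea, where overlapping behavioural events compete for body segments and the higher-priority event wins; the event format and segment names are assumptions, not EVA's actual interface:

```python
def merge_events(events):
    """events: list of (priority, {segment: pose}) tuples."""
    frame = {}
    for priority, poses in sorted(events, key=lambda e: e[0]):
        frame.update(poses)   # higher-priority events applied last, so they win
    return frame

speech = (1, {"jaw": "open", "lips": "rounded"})
smile  = (2, {"lips": "smile", "cheeks": "raised"})
print(merge_events([speech, smile]))   # lips taken from the smile event
```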


Applied Artificial Intelligence | 2007

A UNIFIED APPROACH TO GRAPHEME-TO-PHONEME CONVERSION FOR THE PLATTOS SLOVENIAN TEXT-TO-SPEECH SYSTEM

Matej Rojc; Zdravko Kacic

This article presents a new unified approach to modeling grapheme-to-phoneme conversion for the PLATTOS Slovenian text-to-speech system. A cascaded structure consisting of several successive processing steps is proposed for grapheme-to-phoneme conversion. The processing of foreign words and rules for post-processing phonetic transcriptions are also incorporated into the engine. The grapheme-to-phoneme conversion engine is flexible, efficient, and appropriate for multilingual text-to-speech systems. The grapheme-to-phoneme conversion process is described using finite-state machine formalism. The engine developed for the Slovenian language can be integrated into various applications, and even more efficiently into architectures based on finite-state machine formalisms. Provided the necessary language resources are available, the presented approach can also be used for other languages.
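
The cascade can be pictured with a small sketch: an exception lexicon is consulted first, then letter-to-sound rules apply greedily, with post-processing as a further step. The lexicon, rules, and symbols below are invented for illustration and are not the PLATTOS Slovenian rule set:

```python
LEXICON = {"plattos": "p l a t o s"}                 # step 1: exception lexicon
RULES = {"sh": "S", "ch": "tS", "a": "a", "b": "b"}  # longest match tried first

def g2p(word):
    if word in LEXICON:                  # step 1: lexicon lookup
        return LEXICON[word].split()
    phones, i = [], 0
    while i < len(word):                 # step 2: letter-to-sound rules
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += length
                break
        else:
            phones.append(word[i])       # fallback: identity mapping
            i += 1
    return phones                        # step 3 (post-processing rules) omitted

print(g2p("plattos"), g2p("shab"))
```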


Applied Artificial Intelligence | 2014

Describing and Animating Complex Communicative Verbal and Nonverbal Behavior Using Eva-Framework

Izidor Mlakar; Zdravko Kacic; Matej Rojc

Multimodal interfaces incorporating embodied conversational agents enable the development of novel concepts with regard to interaction-management tactics in responsive human–machine interfaces. Such interfaces provide several additional nonverbal communication channels, such as natural visualized speech, facial expression, and different body motions. In order to simulate reactive, humanlike communicative behavior and attitude, the realization of motion relies on different behavioral analyses and realization tactics. This article proposes a novel environment for “online” visual modeling of humanlike communicative behavior, named EVA-framework. In this study, we focus on visual speech and nonverbal behavior synthesis using hierarchical XML-based behavioral events and expressively adjustable motion templates. The main goal of the presented abstract motion notation scheme, named EVA-Script, is to enable the synthesis of unique and responsive behavior.
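
As a rough sketch of consuming a hierarchical XML behavioural event of the kind EVA-Script describes; the element and attribute names are assumptions, not the actual EVA-Script schema:

```python
import xml.etree.ElementTree as ET

event_xml = """
<behavior start="0.0">
  <speech text="hello" duration="0.6"/>
  <gesture template="beat" segment="right_arm" start="0.1" duration="0.4"/>
</behavior>"""

root = ET.fromstring(event_xml)
offset = float(root.get("start"))
for child in root:
    # child events inherit the parent's timeline offset
    start = offset + float(child.get("start", "0"))
    print(child.tag, dict(child.attrib), "starts at", start)
```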


Applied Artificial Intelligence | 2011

GRADIENT-DESCENT BASED UNIT-SELECTION OPTIMIZATION ALGORITHM USED FOR CORPUS-BASED TEXT-TO-SPEECH SYNTHESIS

Matej Rojc; Zdravko Kacic

This paper proposes a gradient-descent based unit-selection optimization algorithm for optimizing unit-cost function weights and for improving the overall performance of the unit-selection algorithm, as used in a corpus-based text-to-speech synthesis system. Complex multidimensional and fuzzy-logic based unit-cost functions are used in the presented unit-selection algorithm. The weights used by these unit-cost functions are usually defined by heuristics or by listening tests. This can be very laborious and time-consuming, and does not necessarily result in optimal performance of the unit-selection algorithm, because of the multidimensional unit-cost function space within which the features of different database candidates are evaluated. Using heuristics or listening tests is also rather rigid, especially when working with several different databases or voices, and it is especially difficult, within this scope, to set the unit-cost function weights so as to achieve overall optimal performance of the unit-selection algorithm. The proposed unit-selection optimization process consists of several steps. It is fully automatic, flexible, and fast enough to enable the development of a corpus-based text-to-speech (TTS) system that uses many different voices, without any heuristics or listening tests. This optimization process can also be helpful when evaluating the performance of unit-selection cost functions, and of the unit-selection algorithm itself. The obtained results suggest the values that the unit-selection cost-function weights should have in order to obtain smoother transitions between selected unit candidates after the unit-selection process. They also hint at the performance level that can be achieved with a given set of unit-cost function weights, and at the improvements that can be gained when additional or modified unit-cost functions are included within the unit-selection algorithm.
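
The optimization idea can be sketched as gradient descent on a loss that prefers smoother candidate sequences. The cost functions, loss, and data below are toy stand-ins for the paper's multidimensional fuzzy-logic cost functions, not the actual algorithm:

```python
def weighted_cost(w, unit):
    target_cost, concat_cost = unit
    return w[0] * target_cost + w[1] * concat_cost

def loss(w, good, bad, margin=0.5):
    # hinge-style loss: the smoother (good) candidate should cost less by a margin
    return max(0.0, margin + weighted_cost(w, good) - weighted_cost(w, bad))

def grad_step(w, good, bad, lr=0.1, eps=1e-5):
    grads = []
    for i in range(len(w)):              # numeric gradient, one weight at a time
        bumped = list(w)
        bumped[i] += eps
        grads.append((loss(bumped, good, bad) - loss(w, good, bad)) / eps)
    return [wi - lr * g for wi, g in zip(w, grads)]

w = [1.0, 1.0]                        # [target-cost weight, concat-cost weight]
good, bad = (0.3, 0.1), (0.1, 0.6)    # (target, concat) costs of two candidates
for _ in range(50):
    w = grad_step(w, good, bad)
print(w)                              # the concat weight grows: smooth joins win
```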


Engineering Applications of Artificial Intelligence | 2017

The TTS-driven affective embodied conversational agent EVA, based on a novel conversational-behavior generation algorithm

Matej Rojc; Izidor Mlakar; Zdravko Kacic

As a result of the convergence of different services delivered over the internet protocol, internet protocol television (IPTV) may be regarded as one of the most widespread user interfaces accepted by a highly diverse user domain. Every generation, from children to the elderly, can use IPTV for recreation, as well as for gaining social contact and stimulating the mind. However, technological advances in digital platforms go hand in hand with the complexity of their user interfaces, and thus induce technological disinterest and technological exclusion. Therefore, interactivity and affective content presentation are, from the perspective of advanced user interfaces, two key factors in any application incorporating human-computer interaction (HCI). Furthermore, the perception and understanding of the information (meaning) conveyed is closely interlinked with the visual cues and non-verbal elements that speakers generate throughout human-human dialogues. In this regard, co-verbal behavior adds information to the communicative act. It supports the speaker's communicative goal and allows a variety of other information to be added to his/her messages, including (but not limited to) psychological states, attitudes, and personality. In the present paper, we address complexity and technological disinterest through the integration of natural, human-like multimodal output that incorporates a novel combined data- and rule-driven co-verbal behavior generator able to extract features from unannotated, general text. The core of the paper discusses the processes that model and synchronize non-verbal features with verbal features, even when dealing with unknown context and/or limited contextual information. The proposed algorithm incorporates data-driven concepts (speech prosody, a repository of motor skills) and rule-based concepts (grammar, gesticon). The algorithm first classifies the communicative intent, then plans the co-verbal cues and their form within the gesture unit, generates temporally synchronized co-verbal cues, and finally realizes them in the form of human-like co-verbal movements. In this way, information can be represented as meaningfully and temporally synchronized co-verbal cues with accompanying synthesized speech, using the communication channels to which people are most accustomed. Highlights: automatic planning, designing, and recreation of co-verbal behavior for the Smart IPTV system UMB-SmartTV; procedures and algorithms for modeling the conversational dialog; a TTS- and data-driven expressive model for generating co-verbal behavior; semiotic classification of intent incorporating linguistic and prosodic cues; and visual prosody reflecting features of the speech signal and the context of the input text.
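
A minimal sketch of the four-stage flow named above (classify intent, plan cues, synchronize with speech timing, realize); all rules and timings here are illustrative assumptions, not the actual EVA generation algorithm:

```python
def classify_intent(text):
    return "question" if text.strip().endswith("?") else "statement"

def plan_cues(intent):
    return ["head_tilt", "brow_raise"] if intent == "question" else ["beat"]

def synchronize(cues, word_times):
    # align each cue with a word onset (here: simply spread them out)
    return [(cue, word_times[i % len(word_times)]) for i, cue in enumerate(cues)]

def realize(aligned):
    for cue, t in aligned:
        print(f"t={t:.2f}s -> trigger motion template '{cue}'")

text = "Is this the right channel?"
word_times = [0.0, 0.25, 0.5, 0.8, 1.1]   # toy TTS word onsets (seconds)
realize(synchronize(plan_cues(classify_intent(text)), word_times))
```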


Applied Artificial Intelligence | 2013

A NEW DISTRIBUTED PLATFORM FOR CLIENT-SIDE FUSION OF WEB APPLICATIONS AND NATURAL MODALITIES: A MULTIMODAL WEB PLATFORM

Izidor Mlakar; Matej Rojc

Web-based solutions and interfaces should be easy and intuitive to use, and should adapt to the natural and cognitive information-processing and presentation capabilities of humans. Today, human-controlled multimodal systems with multimodal interfaces are possible. They allow for a more natural and more advanced exchange of information between man and machine. The fusion of web-based solutions with natural modalities is therefore an effective solution for users who would like to access services and web content in a more natural way. This article presents a novel multimodal web platform (MWP) that enables flexible migration from traditionally closed, purpose-oriented multimodal systems to the wider scope offered by web applications. The MWP helps to overcome the problems of interoperability, compatibility, and integration that usually accompany migrations from standard (task-oriented) applications to web-based solutions and multiservice networks, thus enabling general web-based user interfaces to be enriched with several advanced natural modalities for communicating and exchanging information. The MWP is a system in which all modules are embedded within a generic network-based architecture. When using it, fusing user front ends with new modalities requires as little intervention in the code of the web application as possible. The fusion is implemented within the user front ends and keeps the web-application code and its functionalities intact.
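
Client-side fusion of this kind can be pictured as wrapping natural-modality events in a small envelope that the front end dispatches into the otherwise unmodified web application; the message format below is an assumption for illustration, not the MWP protocol:

```python
import json

def modality_event(modality, payload):
    # envelope handed from a modality module to the web front end
    return json.dumps({"modality": modality, "payload": payload})

# e.g. a speech-recognition result injected into a form field by the front end
print(modality_event("asr", {"target": "search_box", "text": "festival lent"}))
```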


International Journal of Speech Technology | 2003

“LentInfo” Information-Providing System for the Festival Lent Programme

Andrej Žgank; Matej Rojc

This paper presents “LentInfo”, a system providing information about the programme of the Festival Lent in Slovenia. The Festival Lent consists of different open-air theatre and music performances and draws more than 400,000 visitors per year. The application is based on a Hidden Markov Model (HMM) speech recogniser, while dialogue construction and management are done using the CSDP (Common Spoken Dialogue Platform) dialogue management system. The dialogue is represented as a finite-state structure and can be specified in a script using a simple syntax description. The dialogue manager is multi-application oriented, so it can easily be upgraded for new applications; if new concepts are needed, only new actions need to be added to the existing ones. Currently, prompt messages are prerecorded, but a speech synthesis system can also be included, depending on the needs of the application. Error recovery during the dialogue is done through user confirmation of the recognised input speech. Results are presented for tests performed in 2001 and are analysed according to phone type (fixed/mobile), signal-to-noise ratio, dialogue path, etc. Although some calls were made using mobile phones from noisy festival venues, the performance of the system decreased only slightly under these conditions.
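
A finite-state dialogue of this kind might be specified declaratively, with a confirmation state for error recovery; the syntax below is invented for illustration and is not the CSDP script format:

```python
DIALOGUE = {
    "start":   {"prompt": "Which venue are you interested in?",
                "next": "confirm"},
    "confirm": {"prompt": "Did you say {heard}?",
                "yes": "answer", "no": "start"},     # error-recovery loop
    "answer":  {"prompt": "Tonight at {heard}: jazz at 20:00.",
                "next": None},
}

state, heard = "start", "main stage"   # `heard` stands in for ASR output
while state:
    node = DIALOGUE[state]
    print(node["prompt"].format(heard=heard))
    state = node.get("next", node.get("yes"))   # toy run: user always confirms
```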

Collaboration


Dive into Matej Rojc's collaboration.
