Publications


Featured research published by Izidor Mlakar.


Archive | 2011

Multilingual and Multimodal Corpus-Based Text-to-Speech System - PLATTOS -

Matej Rojc; Izidor Mlakar

Over the last decade a lot of TTS systems have been developed around the world that are more or less language-dependent and more or less time- and space-efficient (Campbell & Black, 1996; Holzapfel, 2000; Raitio et al., 2011; Sproat, 1998; Taylor et al., 1998). However, speech technology-based applications demand time- and space-efficient multilingual, polyglot, and multimodal TTS systems. Due to these facts and the need for a powerful, flexible, reliable and easily maintainable multimodal text-to-speech synthesis system, a design pattern is presented that serves as a flexible and language-independent framework for efficiently pipelining all text-to-speech processing steps. The presented design pattern is based on a time- and space-efficient architecture, where finite-state machines (FSM) and heterogeneous relation graphs (HRG) are integrated into a common TTS engine through the so-called “queuing mechanism”. FSMs are a time- and space-efficient representation of language resources and are used to separate the language-dependent parts from the language-independent TTS engine. The HRG structure, on the other hand, is used for storing all linguistic and acoustic knowledge about the input sentence, for representing very heterogeneous data, and for the flexible feature constructions needed by the various machine-learned models used in general TTS systems. In this way, all the algorithms in the presented TTS system use the same data structure for gathering linguistic information about the input text, all input and output formats between modules are compatible, and the structure is modular, interchangeable, easily maintainable and object-oriented (Rojc & Kačič, 2007).

The general idea of corpus-based speech synthesis is the use of a large speech corpus as the acoustic inventory, creating realistic-sounding, machine-generated speech from raw waveform segments that are directly concatenated without any, or with only minimal, signal processing. Since only a limited-size speech corpus can be used, a compromise normally has to be reached between the number of speech units in different prosodic contexts and the overall corpus size. The unit selection algorithm, in turn, has to select the most suitable sequence of units from the acoustic inventory, where longer units should be favoured: with longer units the number of concatenation points is reduced, resulting in more natural synthetic speech. The performance of the overall unit selection algorithm for corpus-based synthesis, regarding quality and speed, depends on solving several issues, e.g. the preparation of the text corpus and the construction of the acoustic inventory.
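To make the unit-selection step concrete, the following minimal sketch shows a Viterbi-style search that trades a target cost against a concatenation cost and gives units that were adjacent in the corpus a free join, which implicitly favours longer contiguous stretches and fewer concatenation points. The feature names, costs, and weights are illustrative assumptions, not the actual PLATTOS implementation.

```python
# Minimal unit-selection sketch (hypothetical; not the actual PLATTOS algorithm).
# Each target position has several candidate units from the acoustic inventory;
# a Viterbi-style search picks the cheapest sequence of candidates.

def target_cost(target, unit):
    """Mismatch between the required prosodic context and a candidate unit."""
    return abs(target["pitch"] - unit["pitch"]) + abs(target["dur"] - unit["dur"])

def concat_cost(prev_unit, unit):
    """Cost of joining two units; units that were adjacent in the corpus join
    for free, which implicitly favours longer contiguous stretches."""
    if prev_unit["corpus_pos"] + 1 == unit["corpus_pos"]:
        return 0.0
    return abs(prev_unit["end_pitch"] - unit["pitch"]) + 1.0  # fixed join penalty

def select_units(targets, candidates):
    """candidates[i] is the list of inventory units matching targets[i]."""
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at position i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Trace the cheapest path back to the first position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

targets = [{"pitch": 200.0, "dur": 0.08}, {"pitch": 210.0, "dur": 0.10}]
candidates = [
    [{"pitch": 198.0, "dur": 0.07, "end_pitch": 205.0, "corpus_pos": 41},
     {"pitch": 230.0, "dur": 0.09, "end_pitch": 228.0, "corpus_pos": 97}],
    [{"pitch": 212.0, "dur": 0.10, "end_pitch": 208.0, "corpus_pos": 42},
     {"pitch": 209.0, "dur": 0.11, "end_pitch": 215.0, "corpus_pos": 7}],
]
print(select_units(targets, candidates))  # prefers the contiguous 41 -> 42 pair
```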


COST'10 Proceedings of the 2010 international conference on Analysis of Verbal and Nonverbal Communication and Enactment | 2010

Towards ECA's animation of expressive complex behaviour

Izidor Mlakar; Matej Rojc

Multimodal interfaces supporting ECAs enable the development of novel concepts regarding human-machine interaction interfaces and provide several communication channels, such as natural speech, facial expression, and different body gestures. This paper presents the synthesis of expressive behaviour within the realm of affective computing. By providing descriptions of different expressive parameters (e.g. temporal, spatial, power, and different degrees of fluidity) and the context of unplanned behaviour, it addresses the synthesis of expressive behaviour by enabling the ECA to visualize complex human-like body movements (e.g. expressions, emotional speech, hand and head gestures, gaze and complex emotions). Movements performed by our ECA EVA are reactive, do not require extensive planning phases, and can be represented hierarchically as a set of different events. The animation concepts prevent the synthesis of unnatural movements even when two or more behavioural events influence the same segments of the body (e.g. speech combined with different facial expressions).
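The idea of letting two or more behavioural events influence the same body segment without producing unnatural jumps can be pictured with a simple weighted-blending rule. The event structure and weights below are illustrative assumptions, not the actual EVA animation engine.

```python
# Illustrative sketch (not the actual EVA animation engine): when several
# behavioural events drive the same body segment, blend their contributions
# by normalized weights instead of letting them overwrite each other.

from collections import defaultdict

def blend_events(events):
    """events: list of dicts like
       {"segment": "head", "rotation": 12.0, "weight": 0.7}
    Returns one blended rotation value per body segment."""
    by_segment = defaultdict(list)
    for e in events:
        by_segment[e["segment"]].append(e)

    blended = {}
    for segment, segment_events in by_segment.items():
        total_w = sum(e["weight"] for e in segment_events)
        blended[segment] = sum(
            e["rotation"] * e["weight"] for e in segment_events
        ) / total_w
    return blended

# Example: a speech-driven head nod and a facial-expression event both
# influence the head; the result is a weighted compromise, not a jump.
print(blend_events([
    {"segment": "head", "rotation": 10.0, "weight": 0.6},
    {"segment": "head", "rotation": -4.0, "weight": 0.4},
    {"segment": "right_hand", "rotation": 30.0, "weight": 1.0},
]))
```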


Applied Artificial Intelligence | 2014

Describing and Animating Complex Communicative Verbal and Nonverbal Behavior Using Eva-Framework

Izidor Mlakar; Zdravko Kacic; Matej Rojc

Multimodal interfaces incorporating embodied conversational agents enable the development of novel concepts with regard to interaction management tactics in responsive human–machine interfaces. Such interfaces provide several additional nonverbal communication channels, such as natural visualized speech, facial expression, and different body motions. In order to simulate reactive humanlike communicative behavior and attitude, the realization of motion relies on different behavioral analyses and realization tactics and approaches. This article proposes a novel environment for “online” visual modeling of humanlike communicative behavior, named EVA-framework. In this study we focus on visual speech and nonverbal behavior synthesis by using hierarchical XML-based behavioral events and expressively adjustable motion templates. The main goal of the presented abstract motion notation scheme, named EVA-Script, is to enable the synthesis of unique and responsive behavior.
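For illustration only, a hierarchical XML behavioural event of the kind described above might be assembled as follows; every element and attribute name here is invented and does not reproduce the actual EVA-Script schema.

```python
# Hypothetical sketch of a hierarchical behavioural event, built as XML.
# Element and attribute names are invented; they do not reproduce EVA-Script.

import xml.etree.ElementTree as ET

event = ET.Element("behaviour_event", id="greeting", start="0.0s")

speech = ET.SubElement(event, "speech", text="Hello and welcome!")
ET.SubElement(speech, "prosody", pitch="+10%", rate="medium")

gesture = ET.SubElement(event, "gesture", template="wave_right_hand")
ET.SubElement(gesture, "expressivity", spatial="0.8", power="0.5", fluidity="0.7")
ET.SubElement(gesture, "timing", sync_with="speech", stroke="0.4s")

ET.SubElement(event, "face", template="smile", intensity="0.6")

print(ET.tostring(event, encoding="unicode"))
```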


Engineering Applications of Artificial Intelligence | 2017

The TTS-driven affective embodied conversational agent EVA, based on a novel conversational-behavior generation algorithm

Matej Rojc; Izidor Mlakar; Zdravko Kacic

As a result of the convergence of different services delivered over the internet protocol, internet protocol television (IPTV) may be regarded as one of the most widespread user interfaces accepted by a highly diverse user domain. Every generation, from children to the elderly, can use IPTV for recreation, as well as for gaining social contact and stimulating the mind. However, technological advances in digital platforms go hand in hand with the complexity of their user interfaces, and thus induce technological disinterest and technological exclusion. Therefore, interactivity and affective content presentation are, from the perspective of advanced user interfaces, two key factors in any application incorporating human-computer interaction (HCI). Furthermore, the perception and understanding of the information (meaning) conveyed is closely interlinked with the visual cues and non-verbal elements that speakers generate throughout human-human dialogues. In this regard, co-verbal behavior adds information to the communicative act: it supports the speaker's communicative goal and allows a variety of other information to be added to his/her messages, including (but not limited to) psychological states, attitudes, and personality. In the present paper, we address complexity and technological disinterest through the integration of natural, human-like multimodal output that incorporates a novel combined data- and rule-driven co-verbal behavior generator able to extract features from unannotated, general text. The core of the paper discusses the processes that model and synchronize non-verbal features with verbal features even when dealing with unknown context and/or limited contextual information. The proposed algorithm incorporates data-driven concepts (speech prosody, a repository of motor skills) and rule-based concepts (grammar, gesticon). The algorithm first classifies the communicative intent, then plans the co-verbal cues and their form within the gesture unit, generates temporally synchronized co-verbal cues, and finally realizes them in the form of human-like co-verbal movements. In this way, the information can be represented as meaningfully and temporally synchronized co-verbal cues with accompanying synthesized speech, using the communication channels to which people are most accustomed.

Highlights:
Automatic planning, design, and recreation of co-verbal behavior for the smart IPTV system UMB-SmartTV.
Procedures and algorithms for modeling the conversational dialog.
A TTS- and data-driven expressive model for generating co-verbal behavior.
Semiotic classification of intent incorporating linguistic and prosodic cues.
Visual prosody reflecting features of the speech signal and the context of the input text.
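The four stages named in the abstract (intent classification, cue planning, temporal synchronization, realization) can be pictured with the skeleton below. The function bodies, features, and cue names are placeholders, not the published UMB-SmartTV algorithm.

```python
# Skeleton of a four-stage co-verbal behaviour pipeline (placeholder logic,
# not the published algorithm): classify intent, plan cues, synchronize them
# with speech timing, then realize them as movements.

def classify_intent(sentence):
    """Very rough semiotic classification based on surface cues."""
    if sentence.endswith("?"):
        return "question"
    if any(w in sentence.lower() for w in ("must", "should", "now")):
        return "emphasis"
    return "statement"

def plan_cues(intent):
    """Map intent to abstract co-verbal cues (gesture unit plan)."""
    return {
        "question": ["raise_eyebrows", "open_palm"],
        "emphasis": ["beat_gesture", "head_nod"],
        "statement": ["idle_sway"],
    }[intent]

def synchronize(cues, word_timings):
    """Attach each cue to an anchor word's onset (here: the longest word)."""
    anchor = max(word_timings, key=lambda wt: len(wt[0]))
    return [{"cue": c, "start": anchor[1]} for c in cues]

def realize(timed_cues):
    """Stand-in for sending the cues to an animation engine."""
    for tc in timed_cues:
        print(f'{tc["start"]:.2f}s -> {tc["cue"]}')

sentence = "You should really try the new interface now."
timings = [("You", 0.0), ("should", 0.2), ("really", 0.5),
           ("try", 0.9), ("the", 1.1), ("new", 1.2),
           ("interface", 1.4), ("now", 1.9)]
realize(synchronize(plan_cues(classify_intent(sentence)), timings))
```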


Applied Artificial Intelligence | 2013

A New Distributed Platform for Client-Side Fusion of Web Applications and Natural Modalities: A Multimodal Web Platform

Izidor Mlakar; Matej Rojc

Web-based solutions and interfaces should be easy and intuitive to use, and should adapt to the natural and cognitive information-processing and presentation capabilities of humans. Today, human-controlled multimodal systems with multimodal interfaces are possible. They allow for a more natural and more advanced exchange of information between man and machine. The fusion of web-based solutions with natural modalities is therefore an effective solution for users who would like to access services and web content in a more natural way. This article presents a novel multimodal web platform (MWP) that enables flexible migration from traditionally closed and purpose-oriented multimodal systems to the wider scope offered by web applications. The MWP helps to overcome problems of interoperability, compatibility, and integration that usually accompany migrations from standard (task-oriented) applications to web-based solutions and multiservice networks, thus enabling general web-based user interfaces to be enriched with several advanced natural modalities for communicating and exchanging information. The MWP is a system in which all modules are embedded within a generic network-based architecture. When using it, the fusion of user front ends with new modalities requires as little intervention in the code of the web application as possible. The fusion is implemented within the user front ends and leaves the web-application code and its functionality intact.
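One way to picture the client-side fusion idea is a lightweight broker with which modality services register, while a thin front-end adapter routes content through them and the web application itself stays untouched. This is a conceptual sketch only; the class and method names are invented and do not describe the actual MWP implementation.

```python
# Conceptual sketch (not the actual MWP implementation): modality services
# register with a lightweight broker, and a thin front-end adapter routes
# input/output through them without touching the web application itself.

class ModalityBroker:
    def __init__(self):
        self._services = {}          # modality name -> handler callable

    def register(self, modality, handler):
        self._services[modality] = handler

    def dispatch(self, modality, payload):
        handler = self._services.get(modality)
        return handler(payload) if handler else None

broker = ModalityBroker()
broker.register("speech_synthesis", lambda text: f"<audio for: {text!r}>")
broker.register("speech_recognition", lambda audio: "recognized text")

# The web application's output is intercepted at the front end and enriched,
# while its own code and functionality stay intact.
page_text = "Welcome to the service portal."
print(broker.dispatch("speech_synthesis", page_text))
```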


International Journal of Advanced Robotic Systems | 2013

TTS-driven Synthetic Behaviour-generation Model for Artificial Bodies

Izidor Mlakar; Zdravko Kacic; Matej Rojc

Visual perception, speech perception and the understanding of perceived information are linked through complex mental processes. Gestures, as part of visual perception and synchronized with verbal information, are a key concept of human social interaction. Even when there is no physical contact (e.g., a phone conversation), humans still tend to express meaning through movement. Embodied conversational agents (ECAs), as well as humanoid robots, are visual recreations of humans and are thus expected to be able to perform similar behaviour in communication. The behaviour generation system proposed in this paper is able to specify expressive behaviour that strongly resembles the natural movement performed within social interaction. The system is TTS-driven and fused with the time- and space-efficient TTS engine called 'PLATTOS'. The visual content and its presentation are formulated based on several linguistic features extrapolated from arbitrary input text sequences and on prosodic features (e.g., pitch, intonation, stress, emphasis, etc.) predicted by several verbal modules in the system. According to the evaluation results, when using the proposed system the synchronized co-verbal behaviour can be recreated with a very high degree of naturalness by ECAs and humanoid robots alike.
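A rough sketch of how predicted prosodic features could drive co-verbal behaviour is given below: gesture strokes are aligned to stressed, high-pitch words and scaled by pitch prominence. The thresholds and feature names are assumptions for illustration, not the system fused with PLATTOS.

```python
# Illustrative sketch (not the actual TTS-fused system): use predicted
# prosodic features per word to decide where a gesture stroke falls and
# how large it should be.

def plan_strokes(words):
    """words: list of dicts with predicted prosody, e.g.
       {"word": "great", "start": 0.8, "pitch": 220.0, "stressed": True}
    Returns gesture strokes aligned to stressed, above-average-pitch words."""
    if not words:
        return []
    mean_pitch = sum(w["pitch"] for w in words) / len(words)
    strokes = []
    for w in words:
        if w["stressed"] and w["pitch"] > mean_pitch:
            amplitude = min(1.0, (w["pitch"] - mean_pitch) / mean_pitch + 0.5)
            strokes.append({"time": w["start"], "amplitude": round(amplitude, 2)})
    return strokes

print(plan_strokes([
    {"word": "this", "start": 0.0, "pitch": 180.0, "stressed": False},
    {"word": "is", "start": 0.2, "pitch": 175.0, "stressed": False},
    {"word": "great", "start": 0.4, "pitch": 240.0, "stressed": True},
    {"word": "news", "start": 0.8, "pitch": 210.0, "stressed": True},
]))
```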


COST'10 Proceedings of the 2010 international conference on Analysis of Verbal and Nonverbal Communication and Enactment | 2010

Developing multimodal web interfaces by encapsulating their content and functionality within a multimodal shell

Izidor Mlakar; Matej Rojc

Web applications are a widespread and widely used concept for presenting information. Their underlying architecture and standards, in many cases, limit their presentation/control capabilities to showing pre-recorded audio/video sequences. Highly dynamic text content, for instance, can only be displayed in its native form (as part of the HTML content). This paper provides concepts and answers that enable the transformation of dynamic web-based content into multimodal sequences generated by different multimodal services. Based on the encapsulation of the content in a multimodal shell, any text-based data can be transformed dynamically, and at interactive speeds, into multimodal visually synthesized speech. Techniques for the integration of multimodal input (e.g. vision and speech recognition) are also included. The concept of multimodality relies on mashup approaches rather than traditional integration. It can, therefore, transparently extend any type of web-based solution with no major changes to either the multimodal services or the enhanced web application.
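The multimodal-shell idea can be sketched as follows: the visible text of a page fragment is extracted and handed to an external multimodal service that returns visually synthesized speech, leaving the original web application unchanged. The service stub and data layout are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of the "multimodal shell" idea: dynamic text content is
# pulled out of the page and handed to an external multimodal service, so the
# original web application does not need to change.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text chunks of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def synthesize_multimodal(text):
    """Stub for a remote service producing visually synthesized speech."""
    return {"text": text, "audio": f"<audio:{len(text)} chars>", "visemes": "..."}

def shell_transform(html_fragment):
    parser = TextExtractor()
    parser.feed(html_fragment)
    return [synthesize_multimodal(chunk) for chunk in parser.chunks]

for seq in shell_transform("<p>Breaking news:</p><p>Service restored.</p>"):
    print(seq)
```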


International Journal of Mathematics and Computers in Simulation | 2011

EVA: expressive multipart virtual agent performing gestures and emotions

Izidor Mlakar; Matej Rojc


Computational Intelligence | 2009

Finite-state machine based distributed framework DATA for intelligent ambience systems

Matej Rojc; Izidor Mlakar


COST'11 Proceedings of the 2011 international conference on Cognitive Behavioural Systems | 2011

Form-Oriented annotation for building a functionally independent dictionary of synthetic movement

Izidor Mlakar; Zdravko Kacic; Matej Rojc
