Giampiero Salvi
Royal Institute of Technology
Publications
Featured research published by Giampiero Salvi.
international conference on computers for handicapped persons | 2004
Jonas Beskow; Inger Karlsson; Jo Kewley; Giampiero Salvi
SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English, and Swedish, and the first user trials have just started.
systems man and cybernetics | 2012
Giampiero Salvi; Luis Montesano; Alexandre Bernardino; José Santos-Victor
We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions, and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions, and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow us to incorporate context into the speech recognition task. We believe that the encouraging results with our approach may endow robots with a capacity to acquire language descriptors in their operating environment, as well as shed some light on how this challenging process develops in human infants.
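The temporal co-occurrence mechanism described in this abstract can be sketched as a simple counting model. This is a hypothetical illustration, not the authors' actual affordance-network implementation: words heard during an interaction episode are counted against the symbols (action, object, effect) describing that episode, and normalised counts approximate word-to-meaning association strengths.

```python
from collections import defaultdict

def learn_associations(episodes):
    """Count co-occurrences between spoken words and the symbolic
    descriptors (action, object, effect) observed in each episode,
    then normalise to estimate P(symbol | word)."""
    counts = defaultdict(lambda: defaultdict(int))
    word_totals = defaultdict(int)
    for words, symbols in episodes:
        for w in words:
            word_totals[w] += 1
            for s in symbols:
                counts[w][s] += 1
    return {w: {s: c / word_totals[w] for s, c in syms.items()}
            for w, syms in counts.items()}

# Toy episodes: (utterance words, perceived action/object/effect symbols)
episodes = [
    (["the", "robot", "taps", "the", "ball"], ["tap", "ball", "moves"]),
    (["tap", "the", "box"], ["tap", "box", "still"]),
    (["grasp", "the", "ball"], ["grasp", "ball", "lifted"]),
]
assoc = learn_associations(episodes)
print(assoc["ball"]["ball"])  # content word: strong, consistent association
print(assoc["the"]["tap"])    # function word: diluted across many symbols
```

Note how, without any grammatical analysis, content words accumulate sharp associations while function words such as "the" spread their counts over many symbols, mirroring the paper's observation that useful associations emerge from co-occurrence alone.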
international conference on robotics and automation | 2009
Verica Krunic; Giampiero Salvi; Alexandre Bernardino; Luis Montesano; José Santos-Victor
This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot's own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow us to incorporate context into the speech recognition task.
international conference on multimodal interfaces | 2013
Catharine Oertel; Giampiero Salvi
This paper is concerned with modelling individual engagement and group involvement, as well as their relationship, in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour in time. We use these features to define similarities between the subjects, and we compare this information with the engagement rankings the subjects expressed at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and we build a classifier that is able to distinguish between those classes with 71% accuracy.
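An entropy feature over gaze targets, of the kind named above, might be computed as follows. The exact feature definitions are in the paper; this is an assumed formulation: the Shannon entropy of the distribution of gaze targets within a time window, so that low entropy means gaze concentrated on one participant and high entropy means gaze spread over the group.

```python
import math
from collections import Counter

def gaze_entropy(gaze_targets):
    """Shannon entropy (in bits) of the distribution of gaze targets
    within a time window; higher values mean gaze is spread more
    evenly over the participants."""
    counts = Counter(gaze_targets)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Gaze fixed on one participant vs. spread evenly over four
focused = ["A"] * 10
spread = ["A", "B", "C", "D"] * 3
print(gaze_entropy(focused))  # 0.0 bits: fully concentrated
print(gaze_entropy(spread))   # 2.0 bits: uniform over four targets
```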
Eurasip Journal on Audio, Speech, and Music Processing | 2009
Giampiero Salvi; Jonas Beskow; Samer Al Moubayed; Björn Granström
This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).
intelligent robots and systems | 2014
Alessandro Pieropan; Giampiero Salvi; Karl Pauwels; Hedvig Kjellström
Humans are able to merge information from multiple perceptual modalities and formulate a coherent representation of the world. Our thesis is that robots need to do the same in order to operate robustly and autonomously in an unstructured environment. It has also been shown in several fields that multiple sources of information can complement each other, overcoming the limitations of a single perceptual modality. Hence, in this paper we introduce a data set of actions that includes both visual data (RGB-D video and 6DOF object pose estimation) and acoustic data. We also propose a method for recognizing and segmenting actions from continuous audio-visual data. The proposed method is employed for extensive evaluation of the descriptive power of the two modalities, and we discuss how they can be used jointly to infer a coherent interpretation of the recorded action.
Speech Communication | 2006
Giampiero Salvi
This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.
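The latency/accuracy trade-off studied here can be illustrated with a truncated-Viterbi sketch. This is a simplified, hypothetical illustration, not the paper's hybrid connectionist decoder: the label for each frame is committed after seeing only a fixed number of look-ahead frames, so a shorter look-ahead gives lower latency but less context for the transition model to correct local errors.

```python
import numpy as np

def truncated_viterbi(log_probs, log_trans, lookahead):
    """Emit a class label for each frame after seeing only `lookahead`
    future frames: run Viterbi over the available window and commit the
    label at the current frame by backtracking from the window's end.
    log_probs: (T, N) frame-wise log posteriors from the network.
    log_trans: (N, N) log transition matrix of the language model."""
    T, N = log_probs.shape
    output = []
    for t in range(T):
        end = min(t + lookahead, T - 1)
        # Forward pass over the window [0, end] (uniform initial prior)
        delta = log_probs[0].copy()
        back = np.zeros((end + 1, N), dtype=int)
        for u in range(1, end + 1):
            scores = delta[:, None] + log_trans
            back[u] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_probs[u]
        # Backtrack from the best end state down to frame t
        state = int(delta.argmax())
        for u in range(end, t, -1):
            state = int(back[u, state])
        output.append(state)
    return output

# Toy example: 2 phone classes, emissions switch halfway through
log_probs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(truncated_viterbi(log_probs, log_trans, lookahead=2))
```

For clarity this sketch re-decodes the window from scratch at every frame; a real low-latency decoder would update the forward scores incrementally instead.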
collaboration technologies and systems | 2013
Giovanni Saponaro; Giampiero Salvi; Alexandre Bernardino
In this paper, we propose a method to recognize human body movements and we combine it with the contextual knowledge of human-robot collaboration scenarios provided by an object affordances framework that associates actions with their effects and the objects involved in them. The aim is to equip humanoid robots with action prediction capabilities, allowing them to anticipate effects as soon as a human partner starts performing a physical action, thus enabling interactions between human and robot to be fast and natural. We consider simple actions that characterize a human-robot collaboration scenario with objects being manipulated on a table: inspired by automatic speech recognition techniques, we train a statistical gesture model in order to recognize those physical gestures in real time. Analogies and differences between the two domains are discussed, highlighting the requirements of an automatic gesture recognizer for robots in order to perform robustly and in real time.
non linear speech processing | 2006
Giampiero Salvi
This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network, since it is a measure of uncertainty. The advantage of this measure is its simplicity, as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to a neural-network-based procedure. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
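The core idea, entropy peaking at segment transitions, can be sketched in a few lines. This illustrates only the simplest of the decision methods mentioned (a fixed threshold on the entropy itself), with hypothetical toy posteriors rather than real recogniser output.

```python
import numpy as np

def frame_entropy(posteriors):
    """Per-frame Shannon entropy (bits) of class posterior probabilities.
    posteriors: (T, N) array with rows summing to one."""
    p = np.clip(posteriors, 1e-12, 1.0)  # guard against log(0)
    return -(p * np.log2(p)).sum(axis=1)

def predict_boundaries(posteriors, threshold):
    """Predict a segment boundary wherever the entropy curve has a
    local maximum exceeding a fixed threshold."""
    h = frame_entropy(posteriors)
    peaks = (h[1:-1] > h[:-2]) & (h[1:-1] >= h[2:]) & (h[1:-1] > threshold)
    return np.nonzero(peaks)[0] + 1  # +1: peaks indexed from frame 1

# Toy posteriors: confident within segments, uncertain at the transition
post = np.array([[0.95, 0.05],
                 [0.90, 0.10],
                 [0.50, 0.50],   # transition frame: maximum uncertainty
                 [0.10, 0.90],
                 [0.05, 0.95]])
print(predict_boundaries(post, threshold=0.9))  # boundary at frame 2
```

Precision and recall, as defined in the abstract, would then be computed by matching these predicted boundaries against reference boundaries within a 10 or 20 ms tolerance.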
non linear speech processing | 2005
Giampiero Salvi
The segment boundaries produced by the Synface low-latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system, as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.