
Publication


Featured research published by Gerasimos Potamianos.


European Signal Processing Conference | 2015

Multi-room speech activity detection using a distributed microphone network in domestic environments

Panagiotis Giannoulis; Alessio Brutti; Marco Matassoni; Alberto Abad; Athanasios Katsamanis; Miguel Matos; Gerasimos Potamianos; Petros Maragos

Domestic environments are particularly challenging for distant speech recognition: reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard speech processing algorithms. In this application scenario, a crucial task is the detection and localization of speech events generated by users within the various rooms. A specific challenge of multi-room environments is the inter-room interference that negatively affects speech activity detectors. In this paper, we present and compare different solutions for the multi-room speech activity detection task. The combination of a model-based room-independent speech activity detection module with a room-dependent inside/outside classification stage, based on specific features, provides satisfactory performance. The proposed methods are evaluated on a multi-room, multi-channel corpus, where spoken commands and other typical acoustic events occur in different rooms.
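
The two-stage structure described above (a room-independent speech/non-speech decision followed by a room-dependent inside/outside test) can be illustrated with a minimal sketch. The sketch below assumes one microphone signal per room as a NumPy array and uses simple frame energies in place of the paper's model-based detectors and features, so it only shows the shape of the pipeline, not the actual system.

```python
# Illustrative two-stage multi-room speech activity detection (NOT the
# paper's model-based system): stage 1 flags speech frames per room,
# stage 2 keeps only frames whose energy dominates the other rooms,
# approximating the inside/outside classification step.
import numpy as np

def frame_energies(signal, frame_len=400, hop=160):
    """Short-time log-energy per frame (assumes 25 ms / 10 ms frames at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def multi_room_sad(room_signals, sad_threshold=-2.0, dominance_margin=1.0):
    """room_signals: dict room_name -> 1-D array (one mic per room, >= 2 rooms).
    Returns dict room_name -> boolean per-frame speech decision.
    Thresholds are illustrative placeholders."""
    energies = {room: frame_energies(x) for room, x in room_signals.items()}
    n = min(len(e) for e in energies.values())
    decisions = {}
    for room, e in energies.items():
        e = e[:n]
        others = np.max(np.stack([energies[r][:n] for r in energies if r != room]), axis=0)
        # Stage 1: room-independent SAD; stage 2: inside/outside dominance test.
        decisions[room] = (e > sad_threshold) & (e > others + dominance_margin)
    return decisions
```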


International Conference on Acoustics, Speech, and Signal Processing | 2014

Robust far-field spoken command recognition for home automation combining adaptation and multichannel processing

Athanasios Katsamanis; Isidoros Rodomagoulakis; Gerasimos Potamianos; Petros Maragos; Antigoni Tsiami

The paper presents our approach to speech-controlled home automation. We focus on the detection and recognition of spoken commands preceded by a key-phrase, as recorded in a voice-enabled apartment by a set of multiple microphones installed in the rooms. For both problems, we investigate robust modeling, environmental adaptation, and multichannel processing to cope with (a) insufficient training data and (b) the far-field effects and noise in the apartment. The proposed integrated scheme is evaluated on a challenging and highly realistic corpus of simulated audio recordings and achieves an F-measure close to 0.70 for key-phrase spotting and word accuracy close to 98% for the command recognition task.
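
As a rough illustration of the key-phrase spotting setup and of the F-measure quoted above, the toy sketch below thresholds hypothetical per-frame key-phrase confidence scores into detected segments and scores them against reference segments. The scores, threshold, and overlap criterion are assumptions for illustration only; the paper's model-based spotting system is not reproduced here.

```python
# Toy key-phrase spotting and F-measure scoring (schematic only).
import numpy as np

def spot_keyphrase(scores, threshold=0.5, min_frames=10):
    """scores: per-frame key-phrase confidence in [0, 1].
    Returns a list of (start_frame, end_frame) detections."""
    active = np.asarray(scores) >= threshold
    detections, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start >= min_frames:
                detections.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        detections.append((start, len(active)))
    return detections

def f_measure(detections, references, overlap=0.5):
    """Segment-level F-measure: a segment counts as matched if it is
    overlapped by at least `overlap` of its own length."""
    def matched(seg, candidates):
        s, e = seg
        return any(max(0, min(e, ce) - max(s, cs)) >= overlap * (e - s)
                   for cs, ce in candidates)
    recall_hits = sum(matched(r, detections) for r in references)
    prec_hits = sum(matched(d, references) for d in detections)
    p = prec_hits / len(detections) if detections else 0.0
    r = recall_hits / len(references) if references else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```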


International Conference on Digital Signal Processing | 2013

Experiments on far-field multichannel speech processing in smart homes

Isidoros Rodomagoulakis; Panagiotis Giannoulis; Z.-I. Skordilis; Petros Maragos; Gerasimos Potamianos

In this paper, we examine three problems that arise in the modern, challenging area of far-field speech processing. The methods developed for each problem, namely (a) multi-channel speech enhancement, (b) voice activity detection, and (c) speech recognition, are potentially applicable to a distant speech recognition system for voice-enabled smart home environments. The results obtained on real and simulated data for these smart-home speech applications are quite promising, reflecting the improvements made to the employed signal processing methods.


Pervasive Technologies Related to Assistive Environments | 2012

Audio-visual speech recognition using depth information from the Kinect in noisy video conditions

Georgios Galatas; Gerasimos Potamianos; Fillia Makedon

In this paper we build on our recent work, where we successfully incorporated facial depth data of a speaker, captured by the Microsoft Kinect device, as a third data stream in an audio-visual automatic speech recognizer. In particular, we focus on whether the depth stream provides sufficient speech information to improve system robustness in noisy audio-visual conditions, thus studying system operation beyond the traditional scenarios where noise is applied to the audio signal alone. For this purpose, we consider four realistic visual modality degradations at various noise levels, and we conduct small-vocabulary recognition experiments on an appropriate, previously collected, audiovisual database. Our results demonstrate improved system performance due to the depth modality, as well as a considerable accuracy increase when using both the visual and depth modalities over audio-only speech recognition.
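
A common way to combine an audio, a planar-video, and a depth stream in multi-stream audio-visual ASR is to weight per-stream class log-likelihoods by stream exponents, giving noisier streams lower weight. The snippet below is a generic sketch of that idea with made-up scores and weights; it is not the paper's recognizer or its trained stream weights.

```python
# Schematic multi-stream decision fusion for audio-visual-depth ASR:
# per-stream class log-likelihoods are combined with exponent weights
# (illustrative values only).
import numpy as np

def fuse_streams(loglikes, weights):
    """loglikes: dict stream -> array of shape (n_classes,) of log-likelihoods.
    weights: dict stream -> non-negative stream exponent.
    Returns the index of the best-scoring class under the weighted sum."""
    combined = sum(weights[s] * loglikes[s] for s in sorted(loglikes))
    return int(np.argmax(combined))

# Example: three streams (audio, planar video, depth video) scoring 4 words.
scores = {
    "audio": np.log(np.array([0.10, 0.60, 0.20, 0.10])),
    "video": np.log(np.array([0.30, 0.30, 0.30, 0.10])),
    "depth": np.log(np.array([0.20, 0.50, 0.20, 0.10])),
}
weights = {"audio": 0.5, "video": 0.3, "depth": 0.2}  # noisier streams get lower weight
print(fuse_streams(scores, weights))  # -> 1
```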


International Conference on Acoustics, Speech, and Signal Processing | 2015

Multichannel speech enhancement using MEMS microphones

Z.-I. Skordilis; Antigoni Tsiami; Petros Maragos; Gerasimos Potamianos; Luca Spelgatti; Roberto Sannino

In this work, we investigate the efficacy of Micro Electro-Mechanical System (MEMS) microphones, a newly developed technology of very compact sensors, for multichannel speech enhancement. Experiments are conducted on real speech data collected using a MEMS microphone array. First, the effectiveness of the array geometry for noise suppression is explored, using a new corpus containing speech recorded in diffuse and localized noise fields with a MEMS microphone array configured in linear and hexagonal array geometries. Our results indicate superior performance of the hexagonal geometry. Then, MEMS microphones are compared to Electret Condenser Microphones (ECMs), using the ATHENA database, which contains speech recorded in realistic smart home noise conditions with hexagonal-type arrays of both microphone types. MEMS microphones exhibit performance similar to ECMs. Good performance, versatility in placement, small size, and low cost, make MEMS microphones attractive for multichannel speech processing.
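
For context, a minimal delay-and-sum beamformer over an assumed seven-element hexagonal geometry (six microphones on a ring plus one at the center) is sketched below. The array dimensions, sampling rate, and enhancement method are placeholders chosen for illustration, not the configurations evaluated in the paper.

```python
# Minimal delay-and-sum beamformer for a small microphone array
# (baseline sketch only; the paper's enhancement methods may differ).
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def hexagonal_array(radius=0.04):
    """Six microphones on a circle plus one at the center (example geometry)."""
    angles = np.arange(6) * np.pi / 3
    ring = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    return np.vstack([[0.0, 0.0], ring])  # shape (7, 2), coordinates in meters

def delay_and_sum(signals, mic_xy, source_xy, fs=16000):
    """signals: (n_mics, n_samples); mic_xy, source_xy in meters.
    Signals are aligned with integer-sample delays relative to the
    closest microphone, then averaged."""
    dists = np.linalg.norm(mic_xy - source_xy, axis=1)
    delays = np.round((dists - dists.min()) / SPEED_OF_SOUND * fs).astype(int)
    n = signals.shape[1] - delays.max()
    aligned = np.stack([signals[m, d:d + n] for m, d in enumerate(delays)])
    return aligned.mean(axis=0)
```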


4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA) | 2014

The Athena-RC system for speech activity detection and speaker localization in the DIRHA smart home

Panagiotis Giannoulis; Antigoni Tsiami; Isidoros Rodomagoulakis; Athanasios Katsamanis; Gerasimos Potamianos; Petros Maragos

We present our system for speech activity detection and speaker localization inside a smart home with multiple rooms equipped with microphone arrays of known geometry and placement. The smart home is developed as part of the DIRHA European funded project, providing both simulated and real data for system development and evaluation, under extremely challenging conditions of noise, reverberation, and speech overlap. Our proposed approach performs speech activity detection first, by employing multi-microphone decision fusion on traditional statistical models and acoustic features, within a Viterbi decoding framework, further assisted by signal energy- and model log-likelihood threshold-based heuristics. Then it performs speaker localization using traditional time-difference of arrival estimation between properly selected microphone pairs, further assisted by a dereverberation component. The system achieves very low detection errors, namely less than 4% (5%) for speech activity detection in the simulated (real) DIRHA corpus, and less than 10% (12%) for joint speech detection and speaker localization.
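
The time-difference-of-arrival step mentioned above is classically computed with the GCC-PHAT cross-correlation between a microphone pair; a minimal implementation is sketched below, with the sampling rate and maximum-delay window chosen only for illustration and without the paper's pair selection or dereverberation components.

```python
# GCC-PHAT time-difference-of-arrival estimation for one microphone pair,
# the classical building block behind TDOA-based speaker localization.
import numpy as np

def gcc_phat_tdoa(x, y, fs=16000, max_tau=0.001):
    """Returns the estimated time offset in seconds between two channels
    (positive when x arrives later than y), searched within +/- max_tau."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```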


Pervasive Technologies Related to Assistive Environments | 2011

Audio visual speech recognition in noisy visual environments

Georgios Galatas; Gerasimos Potamianos; Alexandros Papangelis; Fillia Makedon

Speech recognition is a natural means of interaction for a human with a smart assistive environment. In order for this interaction to be effective, such a system should attain a high recognition rate even under adverse conditions. Audio-visual speech recognition (AVSR) can be of help in such environments, especially in the presence of audio noise. However, the impact of visual noise on its performance has not been studied sufficiently in the literature. In this paper, we examine the effects of visual noise on AVSR, reporting experiments on the relatively simple task of connected digit recognition under moderate acoustic noise and a variety of types of visual noise. The latter can be caused by either faulty sensors or video signal transmission problems that can be found in smart assistive environments. Our AVSR system exhibits higher accuracy in comparison to an audio-only recognizer, and robust performance in most cases of noisy video signals considered.


Archive | 2017

The Handbook of Multimodal-Multisensor Interfaces: Foundations, User Modeling, and Common Modality Combinations - Volume 1

Sharon Oviatt; Björn W. Schuller; Philip R. Cohen; Daniel Sonntag; Gerasimos Potamianos; Antonio Krüger

The Handbook of Multimodal-Multisensor Interfaces provides the first authoritative resource on what has become the dominant paradigm for new computer interfaces: user input involving new media (speech, multi-touch, gestures, writing) embedded in multimodal-multisensor interfaces. These interfaces support smartphones, wearables, in-vehicle, robotic, and many other applications that are now highly competitive commercially. This edited collection is written by international experts and pioneers in the field. It provides a textbook for students, and a reference and technology roadmap for professionals working in this rapidly emerging area. Volume 1 of the handbook presents relevant theory and neuroscience foundations for guiding the development of high-performance systems. Additional chapters discuss approaches to user modeling, interface design that supports user choice, synergistic combination of modalities with sensors, and blending of multimodal input and output. They also provide an in-depth look at the most common multimodal-multisensor combinations, for example, touch and pen input, haptic and non-speech audio output, and speech co-processed with visible lip movements, gaze, gestures, or pen input. A common theme throughout is support for mobility and individual differences among users, including the world's rapidly growing population of seniors. These handbook chapters provide walk-through examples and video illustrations of different system designs and their interactive use. Common terms are defined, and information on practical resources is provided (e.g., software tools, data resources) for hands-on project work to develop and evaluate multimodal-multisensor systems. In the final chapter, experts exchange views on a timely and controversial challenge topic, and on how they believe multimodal-multisensor interfaces should be designed in the future to most effectively advance human performance.


Spoken Language Technology Workshop | 2016

Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view

Spyridon Thermos; Gerasimos Potamianos

Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, where depth video captures either a frontal or a profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth-view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it.
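
A minimal sketch of feature-level fusion followed by an SVM classifier, in the spirit of the approach described above, is given below using scikit-learn on synthetic data. The feature dimensions, labels, and train/test split are placeholders, not the corpus or features used in the paper.

```python
# Schematic SVM-based speech activity classification from fused
# audio, planar-video, and depth features (synthetic data only).
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_frames = 1000
audio_feat = rng.normal(size=(n_frames, 13))   # e.g., MFCC-like features
video_feat = rng.normal(size=(n_frames, 20))   # e.g., mouth-region appearance
depth_feat = rng.normal(size=(n_frames, 10))   # e.g., mouth-region depth statistics
labels = rng.integers(0, 2, size=n_frames)     # 1 = subject speaking, 0 = silent

# Feature-level fusion: concatenate the per-frame streams, then classify.
fused = np.hstack([audio_feat, video_feat, depth_feat])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(fused[:800], labels[:800])
print("held-out accuracy:", clf.score(fused[800:], labels[800:]))
```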


Computer Vision and Pattern Recognition | 2017

Deep Affordance-Grounded Sensorimotor Object Recognition

Spyridon Thermos; Georgios Th. Papadopoulos; Petros Daras; Gerasimos Potamianos

It is well established in cognitive neuroscience that human perception of objects constitutes a complex process, in which object appearance information is combined with evidence about the so-called object affordances, namely the types of actions that humans typically perform when interacting with them. This fact has recently motivated the sensorimotor approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted, surpassing current limitations of sensorimotor object recognition research. Specifically, the deep learning paradigm is introduced to the problem for the first time, developing a number of novel neuro-biologically and neuro-physiologically inspired architectures that utilize state-of-the-art neural networks for fusing the available information sources in multiple ways. The proposed methods are evaluated using a large RGB-D corpus, which was specifically collected for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information for object recognition, achieving a relative error reduction of up to 29% through its inclusion.
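
As a generic illustration of fusing an appearance stream with an affordance (sensorimotor) stream, the PyTorch sketch below encodes each stream separately and concatenates the encodings before classification. The layer sizes, feature dimensions, and fusion point are assumptions and do not correspond to the specific architectures proposed in the paper.

```python
# Illustrative two-stream fusion network for appearance + affordance cues
# (a generic late-fusion sketch, not the paper's architectures).
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, appearance_dim=512, affordance_dim=128, n_classes=20):
        super().__init__()
        self.appearance = nn.Sequential(nn.Linear(appearance_dim, 256), nn.ReLU())
        self.affordance = nn.Sequential(nn.Linear(affordance_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(256 + 64, n_classes)

    def forward(self, appearance_feat, affordance_feat):
        # Encode each stream separately, then fuse by concatenation.
        a = self.appearance(appearance_feat)
        b = self.affordance(affordance_feat)
        return self.classifier(torch.cat([a, b], dim=-1))

model = TwoStreamFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 128))  # batch of 4 objects
print(logits.shape)  # torch.Size([4, 20])
```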

Collaboration


Dive into Gerasimos Potamianos's collaborations.

Top Co-Authors

Petros Maragos, National Technical University of Athens
Athanasios Katsamanis, National and Kapodistrian University of Athens
Panagiotis Giannoulis, National Technical University of Athens
Antigoni Tsiami, National and Kapodistrian University of Athens
Isidoros Rodomagoulakis, National Technical University of Athens
Fillia Makedon, University of Texas at Arlington
Georgios Galatas, University of Texas at Arlington
Z.-I. Skordilis, National and Kapodistrian University of Athens
Youssef Mroueh, Massachusetts Institute of Technology