Alexey Karpov
Russian Academy of Sciences
Publications
Featured research published by Alexey Karpov.
Speech Communication | 2014
Laurent Besacier; Etienne Barnard; Alexey Karpov; Tanja Schultz
Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. In this paper, we propose a survey that focuses on automatic speech recognition (ASR) for these languages. The definition of under-resourced languages and the challenges associated with them are presented first. The main part of the paper is a literature review of the recent (last 8 years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested in initiating research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here apply to speech technology in general (text-to-speech synthesis, for instance).
Speech Communication | 2014
Alexey Karpov; Konstantin Markov; Irina S. Kipyatkova; Daria Vazhenina; Andrey Ronzhin
Speech is the most natural way of human communication, and in order to achieve convenient and efficient human-computer interaction, implementation of state-of-the-art spoken language technology is necessary. Research in this area has traditionally focused on several main languages, such as English, French, Spanish, Chinese, or Japanese, but other languages, particularly Eastern European languages, have received much less attention. Recently, however, research activity on speech technologies for Czech, Polish, Serbo-Croatian, and Russian has been steadily increasing. In this paper, we describe our efforts to build an automatic speech recognition (ASR) system for the Russian language with a large vocabulary. Russian is a synthetic and highly inflected language with a large number of roots and affixes. This greatly reduces the performance of ASR systems designed using traditional approaches. In our work, we have paid special attention to the specifics of the Russian language when developing the acoustic, lexical, and language models. A special software tool for pronunciation lexicon creation was developed. For the acoustic model, we investigated a combination of knowledge-based and statistical approaches to create several different phoneme sets, the best of which was determined experimentally. For the language model (LM), we introduced a new method that combines syntactical and statistical analysis of the training text data in order to build better n-gram models. Evaluation experiments were performed using two different Russian speech databases and an internally collected text corpus. Among the several phoneme sets we created, the one that achieved the fewest word-level recognition errors was the 47-phoneme set, which we therefore used in the subsequent language modeling evaluations. Experiments with a 204-thousand-word vocabulary ASR system were performed to compare standard statistical n-gram LMs and the language models created using our syntactico-statistical method. The results demonstrated that the proposed language modeling approach is capable of reducing word recognition errors.
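The syntactico-statistical combination is this paper's contribution and is not reproduced here; the standard statistical n-gram baseline it improves on can, however, be sketched minimally. The toy corpus and the add-alpha smoothing constant below are purely illustrative:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences,
    padding each sentence with <s> and </s> markers."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def bigram_prob(uni, bi, w1, w2, alpha=1.0):
    """Add-alpha smoothed bigram probability P(w2 | w1)."""
    vocab_size = len(uni)
    return (bi[(w1, w2)] + alpha) / (uni[w1] + alpha * vocab_size)

# Tiny illustrative corpus (two "sentences").
corpus = ["we recognize speech", "we recognize russian speech"]
uni, bi = train_bigram_counts(corpus)
p = bigram_prob(uni, bi, "we", "recognize")  # seen bigram: high probability
```

Real systems train such counts over hundreds of millions of words and back off to lower-order n-grams for unseen histories; smoothing is what keeps unseen word pairs from receiving zero probability.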
Pattern Recognition and Image Analysis | 2009
Alexey Karpov; A. L. Ronzhin
A multimodal interactive dialogue automaton (kiosk) for self-service is presented in the paper. The multimodal user interface allows people to interact with the kiosk by natural speech and gestures in addition to the standard input and output devices. The architecture of the kiosk contains key modules for speech processing and computer vision. An array of four microphones is applied for far-field capturing and recording of the user's speech commands; it allows the kiosk to detect voice activity, localize the sources of desired speech signals, and eliminate environmental acoustic noise. A noise-robust, speaker-independent recognition system is applied for automatic interpretation and understanding of continuous Russian speech. The distant speech recognizer uses a grammar of voice queries as well as garbage and silence models to improve recognition accuracy. A pair of portable video cameras is applied for vision-based detection and tracking of the user's head and body position inside the working area. A Russian-speaking talking head serves both for bimodal audio-visual speech synthesis and for improving communication intelligibility by turning the head toward an approaching client. The dialogue manager controls the flow of the dialogue and synchronizes sub-modules for input modality fusion and output modality fission. The experiments conducted with the multimodal kiosk were directed at cognitive and usability studies of human-computer interaction via different communication means.
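The paper does not give the array-processing algorithm in detail; delay-and-sum beamforming is one standard way a four-microphone array can enhance a desired speech source. The sketch below assumes a linear array, a 16 kHz sample rate, and far-field arrival — all illustrative values:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
SAMPLE_RATE = 16000     # Hz (assumed)

def steering_delays(mic_positions, angle_deg):
    """Per-microphone delays (in samples) for a far-field source
    arriving at the given angle to a linear array (0 deg = broadside).
    mic_positions are positions along the array axis, in meters."""
    theta = math.radians(angle_deg)
    delays = [x * math.sin(theta) / SPEED_OF_SOUND * SAMPLE_RATE
              for x in mic_positions]
    base = min(delays)
    return [round(d - base) for d in delays]  # non-negative integer delays

def delay_and_sum(channels, delays):
    """Shift each channel by its delay and average the overlapping part,
    reinforcing the source in the steered direction."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]

# Four mics spaced 5 cm apart, source 30 degrees off broadside.
delays = steering_delays([0.0, 0.05, 0.10, 0.15], 30.0)
```

Signals from the steered direction add coherently while diffuse noise averages out, which is what makes the far-field capture usable for the distant recognizer.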
ruSMART/NEW2AN'10 Proceedings of the Third conference on Smart Spaces and next generation wired, and 10th international conference on Wireless networking | 2010
Andrey Ronzhin; Victor Budkov; Alexey Karpov
Web-based collaboration using wireless devices with multimedia playback capabilities is a viable alternative to traditional face-to-face meetings. E-meetings are popular in business because of their cost savings. To enable quick and effective engagement in the meeting activity, a remote user should be able to perceive all events in the meeting room and have the same capabilities as the participants inside. The technological framework of the developed intelligent meeting room implements a multichannel audio-visual system for participant activity detection and automatically composes up-to-date multimedia content for the remote mobile user. The developed web-based application for remote user interaction with the equipment of the intelligent meeting room and for the organization of e-meetings was tested with Nokia mobile phones.
Pattern Recognition and Image Analysis | 2007
A. L. Ronzhin; Alexey Karpov
In the paper, we describe SIRIUS, a system for recognition of continuous Russian speech developed by the speech informatics group of SPIIRAS. The specific feature of this system is that language and speech are represented at the morphemic level. This allows one to significantly reduce the size of the recognition lexicon and to increase the processing rate. We describe the deployment of the Russian speech recognition system in the infotelecommunications domain for voice access to the Internet version of the electronic catalogue "Yellow Pages of Saint Petersburg", with the purpose of creating an automated call center for answering subscribers' calls. In the paper, we demonstrate the results of testing the system with speech samples recorded both in offices and under telephone conversation conditions.
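Why a morphemic representation shrinks the lexicon can be seen with a toy decomposition. SIRIUS uses a real morphological analysis of Russian; the English suffix inventory and word list below are hypothetical stand-ins chosen only to make the counting visible:

```python
# Hypothetical toy suffix inventory; the real system decomposes
# Russian words into stems and affixes.
SUFFIXES = ("ing", "ed", "s")

def split_morphemes(word):
    """Greedy stem+suffix split against the tiny suffix inventory.
    '+' marks a bound suffix unit in the morpheme lexicon."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[:-len(suf)], "+" + suf]
    return [word]

words = ["talk", "talks", "talked", "talking",
         "work", "works", "worked", "working"]
word_lexicon = set(words)                                   # 8 full-form entries
morph_lexicon = {m for w in words for m in split_morphemes(w)}  # 5 morpheme units
```

Eight inflected word forms collapse to five morpheme units; for a highly inflected language like Russian, where each stem can take many more endings, the reduction (and the resulting speed-up of lexical search) is far larger.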
international conference on speech and computer | 2015
Elena E. Lyakso; Olga V. Frolova; Evgeniya Dmitrieva; Aleksei Grigorev; Heysem Kaya; Albert Ali Salah; Alexey Karpov
We present the first child emotional speech corpus in Russian, called "EmoChildRu", which contains audio materials of 3-7 year old children. The database includes over 20,000 recordings (approx. 30 h) collected from 100 children. Recordings were carried out in three controlled settings that created different emotional states in the children: playing with a standard set of toys; repetition of words from a toy parrot in a game-store setting; and watching a cartoon and retelling the story. This corpus is designed to study how emotional state is reflected in the characteristics of voice and speech, and to support studies of the formation of emotional states in ontogenesis. A portion of the corpus is annotated for three emotional states (discomfort, neutral, comfort). Additional data include brain activity measurements (original EEG and evoked potential records), the results of adult listeners' analysis of child speech, questionnaires, and descriptions of dialogues. The paper reports two child emotional speech analysis experiments on the corpus: by adult listeners (humans) and by an automatic classifier (machine). Automatic classification results are very similar to human perception, although the accuracy is below 55% for both, showing the difficulty of child emotion recognition from speech under naturalistic conditions.
international conference on human computer interaction | 2011
Alexey Karpov; Andrey Ronzhin; Irina S. Kipyatkova
In this paper, we present a bi-modal user interface aimed both at assisting persons without hands or with physical disabilities of the hands or arms, and at contactless HCI for able-bodied users as well. A user can manipulate a virtual mouse pointer by moving his or her head and can verbally communicate with a computer, giving speech commands instead of using standard input devices. Speech is a very useful modality for referring to objects and actions on objects, whereas a head pointing gesture/motion is a powerful modality for indicating spatial locations. The bi-modal interface integrates a tri-lingual system for multi-channel audio signal processing and automatic recognition of voice commands in English, French, and Russian, as well as a vision-based head detection/tracking system. It processes natural speech and head pointing movements in parallel and fuses both information streams into a unified multimodal command, where each modality transmits its own semantic information: the head position indicates 2D head/pointer coordinates, while the speech signal yields control commands. Testing of the bi-modal user interface and comparison with contact-based pointing interfaces were performed following the methodology of ISO 9241-9.
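One plausible way to realize the fusion step described above is late fusion by temporal proximity: pair each recognized voice command with the head-pointer sample closest to it in time. The data shapes, command names, and the tolerance window below are illustrative, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class HeadPointer:
    x: float          # normalized screen coordinates in [0, 1]
    y: float
    timestamp: float  # seconds

@dataclass
class SpeechCommand:
    action: str       # e.g. "open" (command names are illustrative)
    timestamp: float

def fuse(pointer_track, command, window=0.5):
    """Late fusion: attach to the voice command the head-pointer sample
    closest in time, within a tolerance window (seconds)."""
    best = min(pointer_track,
               key=lambda p: abs(p.timestamp - command.timestamp))
    if abs(best.timestamp - command.timestamp) > window:
        return None  # no pointer evidence close enough in time
    return {"action": command.action, "x": best.x, "y": best.y}

track = [HeadPointer(0.2, 0.3, 1.0), HeadPointer(0.6, 0.7, 2.0)]
fused = fuse(track, SpeechCommand("open", 2.1))
```

Each modality contributes what it is best at: the pointer track supplies the 2D location, the speech command supplies the action, and the timestamp match binds them into one multimodal command.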
international conference on speech and computer | 2013
Irina S. Kipyatkova; Alexey Karpov
In this paper, a comparison of 2-, 3-, and 4-gram language models with various lexicon sizes is presented. The text data forming the training corpus were collected from recent Internet news sites; the total size of the corpus is about 350 million words (2.4 GB of data). The language models were built using recognition lexicons of 110K, 150K, 219K, and 303K words. For the evaluation of these models, characteristics such as perplexity, out-of-vocabulary (OOV) word rate, and n-gram hit rate were computed. Experimental results on continuous Russian speech recognition are also given in the paper.
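The two main evaluation metrics named above have standard definitions that can be sketched directly. For brevity the perplexity example uses a unigram model with add-one smoothing; n-gram models generalize it by replacing the unigram probability with a conditional one. The toy training and test data are illustrative:

```python
import math
from collections import Counter

def oov_rate(test_tokens, lexicon):
    """Fraction of test tokens not covered by the recognition lexicon."""
    oov = sum(1 for t in test_tokens if t not in lexicon)
    return oov / len(test_tokens)

def perplexity(test_tokens, unigram_counts):
    """Perplexity of an add-one-smoothed unigram model:
    exp of the negative mean log-probability of the test tokens."""
    total = sum(unigram_counts.values())
    vocab = len(unigram_counts)
    log_sum = 0.0
    for t in test_tokens:
        p = (unigram_counts[t] + 1) / (total + vocab)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(test_tokens))

train_counts = Counter("a b a b a".split())
ppl = perplexity(["a", "b", "c"], train_counts)   # "c" is unseen
oov = oov_rate(["a", "b", "c"], {"a", "b"})        # 1 of 3 tokens is OOV
```

Growing the lexicon from 110K to 303K words trades these two quantities against each other: the OOV rate drops, while the model has more words to distribute probability mass over.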
international conference on universal access in human-computer interaction | 2014
Alexey Karpov; Andrey Ronzhin
In this paper, we present a universal assistive technology with multimodal input and multimedia output interfaces. The conceptual model and the software-hardware architecture, with the levels and components of the universal assistive technology, are described. The architecture includes five main interconnected levels: computer hardware, system software, application software for digital signal processing, application software for human-computer interfaces, and software for assistive information technologies. The universal assistive technology offers several multimodal systems and interfaces to people with disabilities: an audio-visual Russian speech recognition system (AVSR), a "Talking head" synthesis system (text-to-audiovisual speech), a "Signing avatar" synthesis system (visual sign language synthesis), the ICANDO multimodal system (a hands-free PC control system), and the control system of an assistive smart space.
international conference on speech and computer | 2016
Vasilisa Verkhodanova; Alexander L. Ronzhin; Irina S. Kipyatkova; Denis Ivanko; Alexey Karpov; M. Železný
In this paper we present a software-hardware complex for the collection of audio-visual speech databases with a high-speed camera and a dynamic microphone. We describe the architecture of the developed software as well as some details of HAVRUS, the collected database of Russian audio-visual speech. The developed software provides synchronization and fusion of the audio and video channels, and accounts for a natural property of human speech: the asynchrony of the audio and visual speech modalities. The collected corpus comprises recordings of 20 native speakers of Russian and is intended for further research and experiments on audio-visual Russian speech recognition.
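The synchronization problem above amounts to mapping positions in one stream onto the other while compensating a modality offset. A minimal sketch, assuming a 44.1 kHz audio rate, a 200 fps high-speed camera, and a constant audio-visual offset (all of these rates and the constant-offset model are assumptions, not the paper's specification):

```python
AUDIO_RATE = 44100   # Hz (assumed audio sample rate)
VIDEO_FPS = 200      # frames/s (assumed high-speed camera rate)

def frame_for_sample(sample_index, av_offset_ms=0.0):
    """Index of the video frame co-occurring with an audio sample,
    after compensating a constant audio-visual offset (a positive
    offset means the visual stream leads the audio)."""
    t = sample_index / AUDIO_RATE - av_offset_ms / 1000.0
    return max(0, int(t * VIDEO_FPS))
```

In practice the audio-visual asynchrony of natural speech is not constant (visible articulation can precede the acoustic onset by tens of milliseconds), so a real system would estimate and apply a time-varying offset rather than one fixed value.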