
Publication


Featured research published by Milos Cernak.


IEEE Signal Processing Letters | 2013

A Simple Continuous Pitch Estimation Algorithm

Philip N. Garner; Milos Cernak; Petr Motlicek

Recent work in text-to-speech synthesis has pointed to the benefit of using a continuous pitch estimate; that is, one that records pitch even when voicing is not present. Such an approach typically requires interpolation. The purpose of this letter is to show that a continuous pitch estimate is available from a combination of otherwise well-known techniques. Further, in the case of an autocorrelation-based estimate, the continuity requirement negates the need for other heuristics to correct for common errors. An algorithm is suggested, illustrated, and demonstrated using a parametric vocoder.
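The core idea — an autocorrelation-based estimator that always returns a pitch value, even for weakly periodic frames — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the frame length, search range, and simple peak picking are assumptions made for the sketch.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame via autocorrelation peak picking.

    Always returns a pitch value (no voiced/unvoiced flag), so applying
    it frame-by-frame yields a continuous pitch track.
    """
    frame = frame - frame.mean()
    # one-sided autocorrelation, lags 0..len(frame)-1
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / fmax)            # shortest candidate period (lag)
    hi = int(fs / fmin)            # longest candidate period (lag)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

# synthetic 200 Hz tone sampled at 16 kHz, one 30 ms frame
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
f0 = autocorr_pitch(np.sin(2 * np.pi * 200 * t), fs)
```

A real system would add the continuity constraints and heuristics discussed in the letter; this sketch only shows the autocorrelation backbone.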


international conference on acoustics, speech, and signal processing | 2013

On the (UN)importance of the contextual factors in HMM-based speech synthesis and coding

Milos Cernak; Petr Motlicek; Philip N. Garner

This paper presents an evaluation of the contextual factors of HMM-based speech synthesis and coding systems. Two experimental setups are proposed, based on successive context addition from phonetic to full context. The aim was to investigate the impact of the individual contextual factors on speech quality. In that sense, important and unimportant (i.e., not having a significant impact on speech quality, also called weak) contextual factors were identified. The results imply that in speech coding an improvement in quality can be achieved with reconstruction of syllable contexts alone. The sentence and utterance contexts are unimportant on the decoder side, and it is not necessary to deal with them. Although in speech coding the wider context was not necessary, in speech synthesis the current syllable and utterance contexts are more important than the others (the previous and next word/phrase contexts).


international conference on acoustics, speech, and signal processing | 2015

Phonological vocoding using artificial neural networks

Milos Cernak; Blaise Potard; Philip N. Garner

We investigate a vocoder based on artificial neural networks using a phonological speech representation. Speech decomposition is based on phonological encoders, realised as neural network classifiers, that are trained for a particular language. The speech reconstruction process uses a Deep Neural Network (DNN) to map phonological feature posteriors to speech parameters - line spectra and glottal signal parameters - followed by LPC resynthesis. This DNN is trained on a target voice without transcriptions, in a semi-supervised manner. Both encoder and decoder are based on neural networks, and thus vocoding is achieved with a simple, fast forward pass. An experiment with French vocoding and a target male voice trained on a 21-hour audiobook is presented. An application of the phonological vocoder to low bit rate speech coding is shown, where the transmitted phonological posteriors are pruned and quantized. The vocoder with scalar quantization operates at 1 kbps, with potential for lower bit rates.
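The pruning and scalar quantization step can be sketched as below. The frame rate, class count, number of kept classes, and bit allocation are hypothetical illustrations, not the paper's actual coder configuration.

```python
import numpy as np

def prune_and_quantize(posteriors, keep=4, bits=3):
    """Keep the `keep` largest class posteriors per frame and
    scalar-quantize each kept value to `bits` bits.

    Returns the quantized sparse frames and the payload size in bits
    per frame (class indices + quantized values).
    """
    n_classes = posteriors.shape[1]
    idx_bits = int(np.ceil(np.log2(n_classes)))   # bits to address a class
    levels = 2 ** bits - 1
    quantized = np.zeros_like(posteriors)
    for f, frame in enumerate(posteriors):
        top = np.argsort(frame)[-keep:]           # prune: keep largest classes
        quantized[f, top] = np.round(frame[top] * levels) / levels
    bits_per_frame = keep * (idx_bits + bits)
    return quantized, bits_per_frame

rng = np.random.default_rng(0)
post = rng.random((100, 21))                      # 100 frames, 21 classes (toy data)
q, bpf = prune_and_quantize(post)
kbps = bpf * 100 / 1000.0                         # at an assumed 100 frames/s
```

Under these toy settings the payload is 32 bits/frame (3.2 kbps at 100 frames/s); the paper's 1 kbps operating point would correspond to a more aggressive pruning/quantization than this sketch.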


Speech Communication | 2016

On structured sparsity of phonological posteriors for linguistic parsing

Milos Cernak; Afsaneh Asaei; Hervé Bourlard

Highlights: (i) the phonological posterior is a sparse vector consisting of phonological class probabilities; (ii) it is estimated from a short speech segment using a deep neural network; (iii) segmental phonological posteriors convey supra-segmental information on linguistic events; (iv) a linguistic class is characterized by a codebook of binary phonological structures; (v) linguistic parsing is achieved with high accuracy using binary pattern matching.

The speech signal conveys information on different time scales, from the short (20–40 ms) or segmental scale, associated with phonological and phonetic information, to the long (150–250 ms) or supra-segmental scale, associated with syllabic and prosodic information. Linguistic and neurocognitive studies recognize the phonological classes at the segmental level as the essential and invariant representations used in speech temporal organization. In the context of speech processing, a deep neural network (DNN) is an effective computational method to infer the probability of individual phonological classes from a short segment of the speech signal. A vector of all phonological class probabilities is referred to as a phonological posterior. Only very few classes are present in a short-term speech signal; hence, the phonological posterior is a sparse vector. Although the phonological posteriors are estimated at the segmental level, we claim that they convey supra-segmental information. Specifically, we demonstrate that phonological posteriors are indicative of syllabic and prosodic events. Building on findings from converging linguistic evidence on the gestural model of Articulatory Phonology as well as the neural basis of speech perception, we hypothesize that phonological posteriors convey properties of linguistic classes at multiple time scales, and that this information is embedded in their support (index) of active coefficients. To verify this hypothesis, we obtain a binary representation of phonological posteriors at the segmental level, referred to as the first-order sparsity structure; higher-order structures are obtained by concatenation of first-order binary vectors. We then confirm that the classification of supra-segmental linguistic events, a problem known as linguistic parsing, can be achieved with high accuracy using simple binary pattern matching of first-order or higher-order structures.
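The binary pattern matching idea — take the support of the posterior as a binary vector, then classify by the nearest codeword — can be illustrated with a toy sketch. The five-class dimension, the threshold, and the codebook labels are invented for illustration; the paper's codebooks are learned from data.

```python
import numpy as np

def binarize(posteriors, threshold=0.5):
    """First-order sparsity structure: the support of active classes."""
    return (posteriors > threshold).astype(np.uint8)

def parse(structure, codebook):
    """Classify a binary structure by nearest codeword (Hamming distance)."""
    dists = [(int(np.count_nonzero(structure ^ code)), label)
             for label, code in codebook.items()]
    return min(dists)[1]

# toy codebook of binary phonological structures for two linguistic classes
codebook = {
    "stressed":   np.array([1, 1, 0, 1, 0], dtype=np.uint8),
    "unstressed": np.array([0, 1, 0, 0, 1], dtype=np.uint8),
}
obs = binarize(np.array([0.9, 0.8, 0.1, 0.2, 0.7]))   # support -> [1,1,0,0,1]
label = parse(obs, codebook)
```

Higher-order structures would simply concatenate several such binary vectors before the same matching step.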


international conference on acoustics, speech, and signal processing | 2006

Unit Selection Speech Synthesis in Noise

Milos Cernak

The paper presents an approach to unit selection speech synthesis in noise. The approach is based on a modification of the speech synthesis method originally published by A.W. Black and P. Taylor (1997), where the distance of a candidate unit from its cluster center is used as the unit selection cost. We found that adding a measure evaluating intelligibility to the unit cost may improve the overall understandability of speech in noise. The measure we chose for predicting speech intelligibility in noise is the speech intelligibility index (SII). The SII value for each unit in the speech corpus was calculated off-line, with pink noise used as the representative noise. Listening tests imply that such a simple modification of the unit cost in unit selection synthesis can improve the understandability of speech delivered under poor channel conditions.
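The modified unit cost can be sketched as the original cluster-distance cost plus an intelligibility penalty derived from the unit's precomputed SII. The linear combination and the weight `alpha` are assumptions for the sketch, not the paper's exact formulation.

```python
def unit_cost(cluster_distance, sii, alpha=0.5):
    """Combined selection cost for one candidate unit.

    cluster_distance: distance of the unit from its cluster center
                      (the original Black & Taylor cost).
    sii:              precomputed speech intelligibility index in [0, 1];
                      higher means more intelligible in noise.
    alpha:            hypothetical weight on the intelligibility penalty.
    """
    return cluster_distance + alpha * (1.0 - sii)

# between two otherwise equal candidates, the more intelligible unit
# (higher SII) receives the lower cost and is preferred by selection
cheap = unit_cost(1.0, sii=0.9)
costly = unit_cost(1.0, sii=0.4)
```

Since the SII of every unit is computed off-line against pink noise, this penalty adds no run-time cost to the Viterbi search over candidate units.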


text speech and dialogue | 2004

Slovak Speech Database for Experiments and Application Building in Unit-Selection Speech Synthesis

Milan Rusko; Marián Trnka; Sachia Darzágín; Milos Cernak

After years of hesitation, the conservative Slovak telecommunication market seems to have become conscious of the need for voice-driven services. In the last year, all three telecommunication operators have adopted our text-to-speech system Kempelen in their interactive voice response systems. Diphone concatenative synthesis has probably reached the frontier of its abilities, so the next step is to look for a synthesis method giving more intelligible and more natural synthesized speech with better prosody modelling. We have therefore decided to build a single-speaker speech database in Slovak for experiments and application building in unit-selection speech synthesis. To build such a database, we tried to exploit as much of the existing speech resources in Slovak as possible, to utilize the knowledge from previous projects, and to use the existing routines developed at our department. The paper describes the structure, recording, and annotation of this database, as well as first experiments with a unit-selection speech synthesizer.


Computer Speech & Language | 2017

Speech vocoding for laboratory phonology

Milos Cernak; Stefan Benus; Alexandros Lazaridis

Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing, and in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to make a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP - the most compact phonological speech representation - performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves intelligibility of 85% of the state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models for improvements of the current state-of-the-art applications.


conference of the international speech communication association | 2016

PhonVoc: A Phonetic and Phonological Vocoding Toolkit.

Milos Cernak; Philip N. Garner

Abstract: We present the PhonVoc toolkit, a cascaded deep neural network (DNN) composed of a speech analyser and synthesizer that use a shared phonetic and/or phonological speech representation. The free toolkit is distributed as open-source software under a BSD 3-Clause License, available at https://github.com/idiap/phonvoc with the pre-trained US English analysis and synthesis DNNs, and thus it is ready for immediate use. In a broader context, the toolkit implements training and testing of the analysis-by-synthesis heuristic model. It is thus designed for the wider speech community working in acoustic phonetics, laboratory phonology, and parametric speech coding. The toolkit interprets the phonetic posterior probabilities as a sequential scheme, whereas the phonological posterior-class probabilities are considered as parallel via K different phonological classes. A case study is presented on the LibriSpeech database and a LibriVox US English native female speaker. The phonetic and phonological vocoding yield comparable performance, improving speech quality by merging the phonetic and phonological speech representations. Index Terms: speech vocoding, deep neural networks


international conference on acoustics, speech, and signal processing | 2017

Multi-view representation learning via gcca for multimodal analysis of Parkinson's disease

Juan Camilo Vásquez-Correa; Juan Rafael Orozco-Arroyave; Raman Arora; Elmar Nöth; Najim Dehak; Heidi Christensen; Frank Rudzicz; Tobias Bocklet; Milos Cernak; Hamidreza Chinaei; Julius Hannink; Phani Sankar Nidadavolu; Maria Yancheva; Alyssa Vann; Nikolai Vogler

Information from different bio-signals such as speech, handwriting, and gait has been used to monitor the state of Parkinson's disease (PD) patients; however, all of these multimodal bio-signals may not always be available. We propose a method based on multi-view representation learning via generalized canonical correlation analysis (GCCA) for learning a representation of features extracted from handwriting and gait that can be used as a complement to speech-based features. Three different problems are addressed: classification of PD patients vs. healthy controls, prediction of the neurological state of PD patients according to the UPDRS score, and prediction of a modified version of the Frenchay dysarthria assessment (m-FDA). According to the results, the proposed approach is suitable for improving performance in the addressed problems, especially in the prediction of the UPDRS and m-FDA scores.
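A MAXVAR-style GCCA — finding a shared representation that maximizes total correlation with a linear projection of each view — can be sketched on synthetic data. The three views stand in for speech, handwriting, and gait features; the regularization, dimensions, and data are illustrative, and this is not the paper's implementation.

```python
import numpy as np

def gcca(views, dim=2, eps=1e-6):
    """MAXVAR-style generalized CCA.

    Sums the (regularized) projectors onto each view's column space and
    takes the top eigenvectors of the sum as the n x dim shared
    representation G.
    """
    n = views[0].shape[0]
    M = np.zeros((n, n))
    for X in views:
        Xc = X - X.mean(axis=0)
        # projector onto this view's column space (ridge-regularized)
        M += Xc @ np.linalg.solve(Xc.T @ Xc + eps * np.eye(X.shape[1]), Xc.T)
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return vecs[:, -dim:]                # top eigenvectors = shared space

rng = np.random.default_rng(1)
z = rng.standard_normal((200, 2))        # shared latent factor
views = [z @ rng.standard_normal((2, d)) + 0.1 * rng.standard_normal((200, d))
         for d in (10, 7, 5)]            # toy stand-ins for three modalities
G = gcca(views)
```

Once G is learned on subjects with all modalities, per-view mappings onto G can supply a stand-in representation when a modality is missing, which is the motivation sketched in the abstract.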


SLSP 2015 Proceedings of the Third International Conference on Statistical Language and Speech Processing - Volume 9449 | 2015

Residual-Based Excitation with Continuous F0 Modeling in HMM-Based Speech Synthesis

Tamás Gábor Csapó; Géza Németh; Milos Cernak

In statistical parametric speech synthesis, creaky voice can cause disturbing artifacts. The reason is that standard pitch tracking algorithms tend to erroneously measure F0 in regions of creaky voice. This pattern is learned during training of hidden Markov models (HMMs). In the synthesis phase, false voiced/unvoiced decisions caused by creaky voice result in audible quality degradation. In order to eliminate this phenomenon, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. In the proposed residual-based vocoder, the Maximum Voiced Frequency is used for mixed voiced and unvoiced excitation. As all parameters of the vocoder are continuous, the Multi-Space Distribution is not necessary when training the HMMs, which has been shown to be advantageous. Artifacts caused by creaky voice are eliminated with this speech synthesis system. A subjective listening test of English utterances has shown improvement over the traditional excitation.

Collaboration


Dive into Milos Cernak's collaborations.

Top Co-Authors

Afsaneh Asaei

Idiap Research Institute


Petr Motlicek

Idiap Research Institute


Elmar Nöth

University of Erlangen-Nuremberg
