Branislav M. Popovic
University of Novi Sad
Publications
Featured research published by Branislav M. Popovic.
Applied Intelligence | 2012
Branislav M. Popovic; Marko Janev; Darko Pekar; Niksa Jakovljevic; Milan Gnjatović; Milan Sečujski; Vlado Delić
The paper presents a novel split-and-merge algorithm for hierarchical clustering of Gaussian mixture models, which tends to improve on the locally optimal solution determined by the initial configuration. The algorithm is initialized with locally optimal parameters obtained using a baseline approach similar to k-means, and it approaches the global optimum of the target clustering function more closely by iteratively splitting and merging the clusters of Gaussian components produced by the baseline algorithm. The algorithm is further improved by introducing model selection in order to obtain the best possible trade-off between recognition accuracy and computational load in a Gaussian selection task within an actual recognition system. The proposed method is tested both on artificial data and in the framework of Gaussian selection performed within a real continuous speech recognition system, and in both cases an improvement over the baseline method is observed.
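To make the split-and-merge idea concrete, here is a minimal Python sketch that clusters Gaussian component means with plain k-means and then alternates one split (of the worst cluster) and one merge (of the two closest clusters) per round, keeping a candidate only if it lowers the distortion. The Euclidean distance over component means, the jitter size, and all function names are illustrative assumptions; the paper operates on full Gaussian mixtures with its own distance measure and adds model selection on top.

```python
import numpy as np

def assign(means, centers):
    """Assign each component mean to its nearest cluster center."""
    return np.argmin(((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)

def update(means, labels, centers):
    """Recompute centers, keeping the old one for any emptied cluster."""
    return np.stack([means[labels == k].mean(0) if np.any(labels == k) else centers[k]
                     for k in range(len(centers))])

def kmeans(means, centers, n_iter=20):
    for _ in range(n_iter):
        labels = assign(means, centers)
        centers = update(means, labels, centers)
    return labels, centers

def distortion(means, labels, centers):
    return float(((means - centers[labels]) ** 2).sum())

def split_and_merge(means, n_clusters, n_rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = means[rng.choice(len(means), n_clusters, replace=False)]
    labels, centers = kmeans(means, centers)                 # baseline solution
    best = distortion(means, labels, centers)
    for _ in range(n_rounds):
        # Merge the two closest centers, split the worst cluster by adding a
        # jittered copy of its center, then re-run k-means from that start.
        per_cluster = [float(((means[labels == k] - centers[k]) ** 2).sum())
                       for k in range(n_clusters)]
        worst = int(np.argmax(per_cluster))
        d = ((centers[:, None] - centers[None]) ** 2).sum(-1)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(np.argmin(d), d.shape)
        cand = centers.copy()
        cand[a] = (centers[a] + centers[b]) / 2              # merged center
        cand[b] = centers[worst] + rng.normal(0, 1e-2, means.shape[1])  # split
        cand_labels, cand = kmeans(means, cand)
        score = distortion(means, cand_labels, cand)
        if score < best:                                     # keep improvements only
            best, labels, centers = score, cand_labels, cand
    return labels, centers

# Toy usage: component means drawn around three true cluster locations.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(m, 0.1, (20, 2)) for m in ([0, 0], [1, 1], [0, 1])])
labels, centers = split_and_merge(X, n_clusters=3)
```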
international conference on speech and computer | 2015
Branislav M. Popovic; Stevan Ostrogonac; Edvin Pakoci; Niksa Jakovljevic; Vlado Delić
This paper presents a deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) system for Serbian, developed using the open-source Kaldi speech recognition toolkit. The DNNs are initialized using stacked restricted Boltzmann machines (RBMs) and trained with a cross-entropy objective and standard error backpropagation in order to provide posterior probability estimates for the hidden Markov model (HMM) states. In the baseline system, emission densities of HMM states are represented by Gaussian mixture models (GMMs). The recipes were modified to account for the particularities of the Serbian language in order to achieve optimal results. A corpus of approximately 90 hours of speech (21000 utterances) was used for training. Performance is compared on two different sets of utterances between the baseline GMM-HMM algorithm and various DNN settings.
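In a hybrid DNN-HMM system of this kind, the decoder needs emission likelihoods, so the network's softmax posteriors are typically divided by the state priors. The sketch below shows that conversion with a toy feed-forward network in plain NumPy; the ReLU hidden layers, layer sizes, and all names are assumptions (the paper pre-trains with RBMs), so treat this as an illustration of the posterior-to-likelihood step rather than the system itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_likelihoods(frames, weights, biases, state_priors):
    """Forward frames through a toy feed-forward net, then convert softmax
    posteriors p(state | frame) into the scaled log-likelihoods
    log p(frame | state) + const = log p(state | frame) - log p(state)
    that the HMM decoder consumes in a hybrid system."""
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)          # ReLU hidden layers (assumption)
    posteriors = softmax(h @ weights[-1] + biases[-1])
    return np.log(posteriors + 1e-10) - np.log(state_priors)

# Toy usage: 39-dimensional features, two hidden layers, 10 HMM states.
rng = np.random.default_rng(0)
dims = [39, 64, 64, 10]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(dims, dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
priors = np.full(10, 0.1)                       # uniform priors for the toy
print(scaled_likelihoods(rng.normal(size=(5, 39)), weights, biases, priors).shape)
```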
international conference on speech and computer | 2017
Branislav M. Popovic; Edvin Pakoci; Darko Pekar
This paper presents the results of large vocabulary speech recognition for the Serbian language, developed using the Eesen end-to-end framework. Eesen involves training a single deep recurrent neural network containing a number of bidirectional long short-term memory layers, which models the connection between the speech and a set of context-independent lexicon units. This approach reduces the amount of expert knowledge needed to develop competitive speech recognition systems. Training is based on connectionist temporal classification, while decoding allows the use of weighted finite-state transducers, which provides much faster and more efficient decoding in comparison to other similar systems. A corpus of approximately 215 h of audio data (about 171 h of speech and 44 h of silence, from 243 male and 239 female speakers) was employed for training (about 90%) and testing (about 10%). On a test set of more than 120000 words, a word error rate of 14.68% and a character error rate of 3.68% are achieved.
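The CTC output convention used by Eesen-style systems can be illustrated with best-path decoding: take the most likely unit per frame, collapse consecutive repeats, and drop the blank symbol. Eesen itself decodes with weighted finite-state transducers; this NumPy sketch, with made-up unit labels, only shows the core collapsing rule.

```python
import numpy as np

BLANK = 0  # CTC reserves one label as the "no output" blank symbol

def ctc_greedy_decode(log_probs, id_to_unit):
    """Best-path CTC decoding: frame-wise argmax, collapse consecutive
    repeats, then drop blanks. log_probs has shape (T frames, U units)."""
    path = log_probs.argmax(axis=1)
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [id_to_unit[p] for p in collapsed if p != BLANK]

# Toy usage: 6 frames over 4 units (0 = blank).
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(4), size=6))
print(ctc_greedy_decode(log_probs, {1: 'a', 2: 'b', 3: 'c'}))
```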
international conference on speech and computer | 2017
Edvin Pakoci; Branislav M. Popovic; Darko Pekar
This paper presents the results obtained using several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for the Serbian language, based on the deep neural network (DNN) framework implemented within the Kaldi speech recognition toolkit. This training approach allows parallelization using several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification of the stochastic gradient descent (SGD) optimization method. Acoustic models are trained over a fixed number of training epochs, with parameter averaging at the end. The paper discusses recognition using different language models trained with Kneser-Ney or Good-Turing smoothing, as well as several pruning parameter values. The results on a test set containing more than 120000 words and different utterance types are explored and compared to the reference results obtained with GMM-HMM speaker-adapted models on the same speech database. Online and offline recognition results are compared to each other as well. Finally, the effect of additional discriminative training using a language model prior to the DNN stage is explored.
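For readers unfamiliar with Kneser-Ney smoothing, the following sketch implements its interpolated bigram form in plain Python: discounted bigram counts are backed off to a continuation unigram that asks in how many distinct contexts a word appears. The discount value and function names are assumptions, and the paper's models are trigrams trained with toolkit implementations; this is just the core of the method.

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney for bigrams: discount the bigram count and
    redistribute the freed mass to a continuation unigram that measures in
    how many distinct contexts a word appears."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])                  # c(v)
    followers = Counter(v for v, w in bigrams)      # |{w : c(v, w) > 0}|
    contexts = Counter(w for v, w in bigrams)       # |{v : c(v, w) > 0}|
    n_types = len(bigrams)

    def prob(v, w):
        p_cont = contexts[w] / n_types              # continuation probability
        if history[v] == 0:
            return p_cont                           # unseen history: back off fully
        lam = discount * followers[v] / history[v]  # reserved backoff mass
        return max(bigrams[(v, w)] - discount, 0) / history[v] + lam * p_cont

    return prob

p = kneser_ney_bigram("the cat sat on the mat the cat ran".split())
print(p("the", "cat"))   # a frequent bigram keeps most of its raw probability
```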
international conference on speech and computer | 2016
Tijana Delic; Branislav Gerazov; Branislav M. Popovic; Milan Sečujski
One of the most recently proposed techniques for modeling the prosody of an utterance is the decomposition of its pitch, duration and/or energy contour into physiologically motivated units called atoms, based on matching pursuit. Since this model is based on the physiology of the production of sentence intonation, it is essentially language independent. However, the intonation of an utterance in a particular language is obviously under the influence of factors of a predominantly linguistic nature. In this research, restricted to the case of American English with prosody annotated using standard ToBI conventions, we have shown that, under certain mild constraints, the positive and negative atoms identified in the pitch contour coincide very well with high and low pitch accents and phrase accents of ToBI. By giving a linguistic interpretation of the atom decomposition model, this research enables its practical use in domains such as speech synthesis or cross-lingual prosody transfer.
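As an illustration of the decomposition itself, below is a minimal matching pursuit sketch over a dictionary consisting of one unit-norm kernel at all time shifts: at each step the shift with the largest absolute correlation is selected and the scaled atom is subtracted from the residual. The gamma-shaped kernel and its parameters are assumptions standing in for the physiologically motivated atoms discussed above; negative amplitudes correspond to the negative atoms.

```python
import numpy as np

def gamma_atom(length, k=6, theta=0.01, fs=200):
    """Unit-norm gamma kernel; an illustrative stand-in for the
    physiologically motivated atom shape (parameters are assumptions)."""
    t = np.arange(length) / fs
    a = t ** (k - 1) * np.exp(-t / theta)
    return a / np.linalg.norm(a)

def matching_pursuit(signal, atom, n_atoms=5):
    """Greedy matching pursuit with one atom shape at all shifts: pick the
    shift with the largest absolute correlation, subtract the scaled atom,
    repeat. Negative amplitudes are the 'negative atoms' discussed above."""
    residual = np.asarray(signal, dtype=float).copy()
    L, decomposition = len(atom), []
    for _ in range(n_atoms):
        corr = np.correlate(residual, atom, mode='valid')
        shift = int(np.abs(corr).argmax())
        amp = float(corr[shift])                 # atom is unit-norm
        residual[shift:shift + L] -= amp * atom
        decomposition.append((shift, amp))
    return decomposition, residual

# Toy usage: a "pitch contour" built from one positive and one negative atom.
atom = gamma_atom(100)
pitch = np.zeros(400)
pitch[50:150] += 0.8 * atom
pitch[220:320] -= 0.5 * atom
print(matching_pursuit(pitch, atom, n_atoms=2)[0])  # ~[(50, 0.8), (220, -0.5)]
```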
international conference on speech and computer | 2016
Edvin Pakoci; Branislav M. Popovic; Niksa Jakovljevic; Darko Pekar; Fathy Yassa
In this paper, a novel variant of an automatic phonetic segmentation procedure is presented, which is especially useful when data is scarce. The procedure uses the Kaldi speech recognition toolkit as its basis, combining and modifying several existing methods and Kaldi recipes. Both the specifics of model training and the alignment of test data are explained in detail. The effectiveness of artificially extending the initial amount of manually labeled material during training is examined as well. Experimental results show the high overall correctness of the proposed procedure in the given test environment. Several variants of the procedure are compared, and the use of speaker-adapted context-dependent triphone models trained without the expanded manually checked data is shown to produce the best results. A few ways to improve the procedure further, as well as future work, are also discussed.
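At the heart of any such segmentation procedure is forced alignment: assigning each frame to a phone in a known sequence under a monotonicity constraint. Kaldi does this over HMM state graphs; the bare-bones dynamic-programming sketch below, with made-up inputs, shows the same idea for whole phones.

```python
import numpy as np

def viterbi_align(log_likes, phone_seq):
    """Forced alignment by dynamic programming: each frame either stays on
    the current phone or advances to the next one, so the alignment is
    monotonic. log_likes[t, p] is the log-likelihood of phone id p at
    frame t; the return value is a phone index (into phone_seq) per frame."""
    T, N = len(log_likes), len(phone_seq)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = log_likes[0, phone_seq[0]]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            back[t, n] = n if stay >= move else n - 1
            score[t, n] = max(stay, move) + log_likes[t, phone_seq[n]]
    n, alignment = N - 1, [N - 1]            # backtrace from the last phone
    for t in range(T - 1, 0, -1):
        n = back[t, n]
        alignment.append(n)
    return alignment[::-1]                   # boundaries fall where the index steps

# Toy usage: 8 frames aligned to a 3-phone sequence.
rng = np.random.default_rng(0)
log_likes = np.log(rng.dirichlet(np.ones(3), size=8))
print(viterbi_align(log_likes, phone_seq=[0, 1, 2]))  # e.g. [0, 0, 1, 1, 1, 2, 2, 2]
```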
telecommunications forum | 2015
Branislav M. Popovic; Edvin Pakoci; Niksa Jakovljevic; Goran Kocis; Darko Pekar
This paper presents Voice Assistant, an Android-based personal assistant application for mobile phones that allows voice control in Serbian. A native interface is provided for a large vocabulary continuous speech recognition system based on the open-source Kaldi speech recognition toolkit. Several acoustic models were trained on a database of about 70000 utterances, with different amounts of noise added. The results are provided for a test database of about 4500 utterances and a test vocabulary of more than 14000 words.
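One plausible reading of training acoustic models with "different amounts of noise added" is mixing a noise signal into each clean utterance at a chosen signal-to-noise ratio. The sketch below shows that standard mixing step; the function name and the tiling of short noise signals are assumptions, not a description of the authors' pipeline.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into a clean utterance at a target SNR in dB,
    tiling (or cropping) the noise to the utterance length."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage: 1 s of white "speech" at 16 kHz mixed with noise at 10 dB SNR.
rng = np.random.default_rng(0)
noisy = add_noise(rng.normal(size=16000), rng.normal(size=8000), snr_db=10)
```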
international conference on speech and computer | 2014
Igor Stankovic; Branislav M. Popovic; Florian Focone
This paper describes the influence of different types of agent behaviour in a social human-virtual agent gesture interaction experiment. The interaction was described to participants as a game whose goal is to imitate the agent's slow upper-body movements, and in which new subtle movements can be proposed by both the participant and the agent. As we are interested only in body movements, a simple virtual agent was built and displayed at a local exhibition, and visitors were asked one by one to play the game. During the interaction, the agent's behaviour varied from subject to subject, and their responses were observed. Interesting observations were drawn from the experiment: it appears that even a small variation in the agent's behaviour and synchronization can lead to a significantly different feel of the game for the participants.
international conference on speech and computer | 2013
Branislav M. Popovic; Milan Sečujski; Vlado Delić; Marko Janev; Igor Stankovic
The paper presents the module for automatic morphological annotation within a text-to-speech synthesizer for Hebrew, based on an efficient combination of two approaches. The first approach involves the selection of lexemes from appropriate lexica, while the second involves automatic morphological analysis of the text input using a complex expert algorithm relying on a set of transformational rules and 6 types of scoring procedures. The module operates on a set of 30 part-of-speech tags with more than 3000 corresponding morphological categories. The paper discusses the advantages of the proposed method in the context of an extremely morphologically complex language such as Hebrew, with particular emphasis on the relative importance of individual scoring procedures. When all 6 scoring procedures are applied, an accuracy of 99.6% is achieved on a corpus of 3093 sentences (55046 words).
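The two-stage design can be sketched as follows: an exact lexicon lookup first, then rule-based candidate generation scored by several procedures, with the best-scoring candidate winning. Everything below (the English-like rules, the toy scorer, all names) is a made-up stand-in for the Hebrew-specific rules and the 6 scoring procedures of the actual module.

```python
def annotate(word, lexicon, rules, scorers):
    """Two-stage annotation: exact lexicon lookup first; otherwise generate
    candidates with transformational (suffix) rules and let the combined
    scoring procedures pick the winner."""
    if word in lexicon:                           # stage 1: lexicon selection
        return lexicon[word]
    candidates = []
    for strip, add, tag in rules:                 # stage 2: rule-based analysis
        if word.endswith(strip):
            stem = word[: len(word) - len(strip)] + add
            if stem in lexicon:
                candidates.append((stem, tag))
    if not candidates:
        return (word, "UNKNOWN")
    return max(candidates, key=lambda c: sum(s(word, c) for s in scorers))

# Toy usage with English-like stand-ins for the Hebrew rules and scorers.
lexicon = {"walk": ("walk", "VERB")}
rules = [("ing", "", "VERB.PRES-PART")]
scorers = [lambda w, c: len(c[0])]                # toy scorer: prefer long stems
print(annotate("walking", lexicon, rules, scorers))  # ('walk', 'VERB.PRES-PART')
```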
international conference on speech and computer | 2018
Branislav M. Popovic; Edvin Pakoci; Darko Pekar
In this paper, a number of language model training techniques are examined and utilized in a large vocabulary continuous speech recognition system for the Serbian language (more than 120000 words), namely the Mikolov and Yandex RNNLM implementations, TensorFlow-based GPU approaches, and the CUED-RNNLM approach. The baseline acoustic model is a chain sub-sampled time-delay neural network, trained using cross-entropy training and a sequence-level objective function on a database of about 200 h of speech. The baseline language model is a 3-gram model trained on the training part of the database transcriptions and the Serbian journalistic corpus (about 600000 utterances), using the SRILM toolkit and the Kneser-Ney smoothing method, with a pruning value of 10^-7 (previous best). The results are analyzed in terms of word and character error rates and the perplexity of a given language model on the training and validation sets. A relative improvement of 22.4% (best word error rate of 7.25%) is obtained in comparison to the baseline language model.
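Since perplexity is the quantity used to compare the language models, a short sketch of its computation may help: it is the exponential of the average negative log-probability the model assigns to held-out tokens. The function name and toy numbers are assumptions.

```python
import math

def perplexity(log_probs):
    """Perplexity over held-out tokens: exp of the mean negative natural-log
    probability the language model assigns to each token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning probability 0.1 to each of 4 tokens has perplexity 10.
print(perplexity([math.log(0.1)] * 4))   # -> 10.0
```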