Zdravko Kacic
University of Maribor
Publications
Featured research published by Zdravko Kacic.
International Journal of Speech Technology | 2003
Vladimir Hozjan; Zdravko Kacic
This paper presents and discusses an analysis of multilingual emotion recognition from speech with database-specific emotional features. Recognition was performed on the English, Slovenian, Spanish, and French InterFace emotional speech databases. The InterFace databases include several neutral speaking styles and six emotions: disgust, surprise, joy, fear, anger, and sadness. Speech features for emotion recognition were determined in two steps: first, low-level features were defined; second, high-level features were calculated from the low-level features. Low-level features comprise pitch, the derivative of pitch, energy, the derivative of energy, and the duration of speech segments. High-level features are statistical representations of the low-level features. Database-specific emotional features were selected from the high-level features that carry the most information about emotion in speech. Speaker-dependent and monolingual emotion recognisers were defined, as well as multilingual recognisers. Emotion recognition was performed using artificial neural networks. Recognition accuracy was highest for speaker-dependent emotion recognition, lower for monolingual recognition, and lowest for multilingual recognition. The database-specific emotional features are most suitable for multilingual emotion recognition: among the speaker-dependent, monolingual, and multilingual settings, the gap between recognition with all high-level features and recognition with database-specific emotional features is smallest for multilingual recognition, at 3.84%.
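The two-step feature pipeline described above (low-level contours, then high-level statistics) can be sketched as follows; the choice of statistics and the sample pitch track are illustrative, not values from the InterFace databases:

```python
import statistics

def high_level_features(contour):
    """High-level statistics computed from a low-level feature contour
    (e.g. a frame-wise pitch or energy track)."""
    return {
        "mean": statistics.mean(contour),
        "stdev": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

def delta(contour):
    """First-order derivative of a contour (frame-to-frame differences)."""
    return [b - a for a, b in zip(contour, contour[1:])]

# hypothetical pitch track in Hz
pitch = [120.0, 125.0, 131.0, 128.0, 122.0]
features = high_level_features(pitch)
features.update({"d_" + k: v for k, v in high_level_features(delta(pitch)).items()})
```

The resulting dictionary of statistics over the contour and its derivative is the kind of fixed-length vector a neural network classifier can consume.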
Speech Communication | 2007
Tomaž Rotovnik; Mirjam Sepesy Maucec; Zdravko Kacic
In this article, we focus on creating a large-vocabulary speech recognition system for the Slovenian language. Current state-of-the-art recognition systems use vocabularies of 20,000 to 100,000 words. These systems have mostly been developed for English, an uninflectional language. Slovenian, as a Slavic language, is highly inflectional, and its rich morphology presents a major problem in large-vocabulary speech recognition. Compared to English, Slovenian requires a vocabulary approximately 10 times larger for the same degree of text coverage. This difference in vocabulary size leads to a high rate of out-of-vocabulary (OOV) words, which directly degrades recognizer efficiency. We took the characteristics of inflectional languages into account when developing a new search algorithm that restricts sub-word units to their correct order and uses separate sub-word-based language models. This search algorithm combines the properties of sub-word-based models (reduced OOV rate) and word-based models (longer context). It also enables better search-space limitation for sub-word models. Using sub-word models, we increase recognizer accuracy while keeping the search space comparable to that of a standard word-based recognizer. Our methods were evaluated in experiments on the SNABI speech database.
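The effect of vocabulary choice on the OOV rate can be illustrated with a toy example; the word forms and the stem/ending split below are purely illustrative (diacritics omitted), not from the SNABI database:

```python
def oov_rate(vocabulary, test_tokens):
    """Fraction of running test tokens not covered by the vocabulary."""
    vocab = set(vocabulary)
    return sum(1 for tok in test_tokens if tok not in vocab) / len(test_tokens)

# a word vocabulary must list every inflected form separately,
# while a sub-word vocabulary covers them with a stem plus endings
word_vocab = ["hisa", "hise"]                # two of many inflected forms
subword_vocab = ["his", "a", "e", "i", "o"]  # stem plus endings
test_words = ["hisa", "hise", "hisi", "hiso"]
test_subwords = ["his", "a", "his", "e", "his", "i", "his", "o"]

word_oov = oov_rate(word_vocab, test_words)          # half the forms are unseen
subword_oov = oov_rate(subword_vocab, test_subwords) # full coverage
```

With the same underlying text, the word vocabulary misses the unseen inflections while the sub-word vocabulary covers them all, which is the trade-off the search algorithm exploits.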
International Journal of Speech Technology | 2000
Klára Vicsi; Peter Roach; Anne-Marie Öster; Zdravko Kacic; Peter Barczikay; Andras Tantos; Ferenc Csatári; Zsolt Bakcsi; Anna Sfakianaki
The development of an audiovisual pronunciation teaching and training method and software system is discussed in this article. The method is designed to help children with speech and hearing disorders gain better control over their speech production. The teaching method progresses from the preparation of individual sounds to the practice of sounds in sentences, for four languages: English, Swedish, Slovenian, and Hungarian. The system is a general, language-independent measuring tool and database editor. The database editor makes it possible to construct modules for all participant languages and for different sound groups. Two modules are under development for all languages: one for teaching vowels to hearing-impaired children and the other for correcting misarticulated fricative sounds. In the article we present the measuring methods, the distance-score calculations applied to the visualized speech spectra, and problems in the evaluation of the new multimedia tool.
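One plausible distance score between a reference spectrum and a learner's spectrum is a Euclidean distance over spectral bins; this is a sketch under that assumption, not the system's actual calculation:

```python
import math

def spectral_distance(ref_spectrum, test_spectrum):
    """Euclidean distance between two spectra sampled at the same
    frequency bins; a smaller score means the learner's production
    is closer to the reference."""
    return math.sqrt(sum((r - t) ** 2 for r, t in zip(ref_spectrum, test_spectrum)))

# hypothetical 3-bin spectra (magnitudes)
score = spectral_distance([1.0, 2.0, 2.0], [1.0, 2.0, 0.0])
```

Such a scalar score is easy to visualize back to the child as immediate feedback on how far a production is from the target.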
Signal Processing | 2007
Bojan Kotnik; Zdravko Kacic
This paper presents a noise-robust feature extraction algorithm, NRFE, using joint wavelet packet decomposition (WPD) and autoregressive (AR) modeling of the speech signal. In contrast to the short-time Fourier transform (STFT)-based time-frequency representation, wavelet packet decomposition can better represent non-stationary parts of the speech signal (e.g. consonants), while vowels are well described by an AR model, as in LPC analysis. The proposed Root-Log compression scheme is used when computing the wavelet packet parameters. The separately extracted WPD- and AR-based parameters are combined and then transformed using linear discriminant analysis (LDA) to produce a lower-dimensional output feature vector. Noise robustness is improved by the proposed wavelet-based denoising algorithm, which uses a modified soft-thresholding procedure and a time-frequency adaptive threshold. The proposed voice activity detector, based on the skewness-to-kurtosis ratio of the LPC residual signal, is used to perform frame dropping effectively. Speech recognition results on the Aurora 2 and Aurora 3 databases show overall relative performance improvements of 44.7% and 48.2%, respectively, over the baseline MFCC front-end.
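The skewness-to-kurtosis VAD criterion can be sketched with plain-Python moment estimates; the decision threshold below is illustrative, not a value from the paper:

```python
import math

def _standardized_moment(x, order):
    """Central moment of the given order, normalized by the
    (population) standard deviation."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum(((v - m) / s) ** order for v in x) / n

def skewness(x):
    return _standardized_moment(x, 3)

def kurtosis(x):
    return _standardized_moment(x, 4)

def is_speech_frame(residual_frame, threshold=0.1):
    """Flag a frame of the LPC residual as speech when the
    skewness-to-kurtosis ratio exceeds the (illustrative) threshold."""
    return abs(skewness(residual_frame)) / kurtosis(residual_frame) > threshold
```

The intuition is that a voiced-speech LPC residual is impulsive and asymmetric, so its skewness is large relative to its kurtosis, while near-Gaussian noise keeps the ratio small.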
Journal of the Acoustical Society of America | 2006
Vladimir Hozjan; Zdravko Kacic
This paper presents a rule-based method for determining emotion-dependent features, which are defined from high-level features derived from statistical measurements of the prosodic parameters of speech. Emotion-dependent features are selected from the high-level features using extraction rules. The ratio of emotional-expression similarity between two speakers is defined by counting and comparing the values of the emotion-dependent features present for the two speakers. Emotional speech from the InterFace databases was used to evaluate the proposed method, which was applied to the speech of five male and four female speakers in order to find similarities and differences among individual speakers. The speakers are actors who interpreted six emotions in four different languages. The results show that all the speakers share some universal signs regarding certain emotion-dependent features of emotional expression. Further analysis revealed that almost all speakers used unique sets of emotion-dependent features, and each speaker used unique values for the defined emotion-dependent features. The comparison among speakers shows that expressed emotions can be analyzed according to two criteria: the defined set of emotion-dependent features, and the values of those features.
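A simplified reading of the similarity ratio is the fraction of emotion-dependent features two speakers share with agreeing values; the feature names, values, and tolerance below are hypothetical:

```python
def similarity_ratio(features_a, features_b, tolerance=0.1):
    """Fraction of the combined emotion-dependent feature set that is
    present for both speakers with values agreeing within a relative
    tolerance (a simplified sketch of the comparison, not the paper's
    exact definition)."""
    shared = set(features_a) & set(features_b)
    if not shared:
        return 0.0
    close = sum(
        1 for f in shared
        if abs(features_a[f] - features_b[f])
        <= tolerance * max(abs(features_a[f]), abs(features_b[f]))
    )
    return close / len(set(features_a) | set(features_b))

# hypothetical emotion-dependent features for two speakers
speaker1 = {"pitch_mean": 1.8, "energy_std": 0.9}
speaker2 = {"pitch_mean": 1.9, "energy_std": 0.4}
ratio = similarity_ratio(speaker1, speaker2)
```

Here the two speakers share both features by name, but only one agrees in value, giving a ratio between the "same set" and "same values" criteria the paper distinguishes.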
Speech Communication | 2002
Janez Kaiser; Bogomir Horvat; Zdravko Kacic
In this paper, we propose a novel discriminative objective function for the estimation of hidden Markov model (HMM) parameters, based on the calculation of overall risk. For continuous speech recognition, the algorithm minimises the risk of misclassification on the training database and thus maximises recognition accuracy. The risk calculation is based on the Levenshtein distance between the correct transcription and the N-best recognised transcriptions, which counts the number of recognition errors: deleted, substituted, and inserted words. The proposed criterion is minimised using the extended Baum-Welch algorithm for the estimation of discrete HMM parameters and Normandin's extension of the algorithm for the estimation of continuous densities. We tested the performance of the proposed algorithm on two tasks: phoneme recognition on the TIMIT database and continuous speech recognition on the Resource Management (RM) database. Results show a consistent decrease in the recognition error rate compared to standard maximum likelihood estimation training. The achieved relative decreases in word error rate are 20.8% on the TIMIT phoneme recognition task, 39.6% on the RM task with context-independent HMMs, and 18.22% with context-dependent models.
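The word-level Levenshtein distance underlying the risk measure counts the minimum number of deletions, substitutions, and insertions between transcriptions; a minimal sketch:

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance: the minimum number of
    deletions, substitutions, and insertions needed to turn the
    hypothesis transcription into the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1]
```

Averaging this count over the N-best list of a training utterance gives the kind of per-utterance risk the objective function minimises.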
EURASIP Journal on Advances in Signal Processing | 2005
Damjan Vlaj; Bojan Kotnik; Bogomir Horvat; Zdravko Kacic
This paper presents a novel, computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When VAD algorithms are used in telecommunication systems, the required capacity of the speech transmission channel can be reduced by transmitting only the speech parts of the signal. A similar objective can be adopted in DSR systems, where the non-speech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with a so-called hangover criterion. Comparative tests are presented between the proposed MFB VAD algorithm and the three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) standards. These tests were made on the Aurora 2 database at different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD achieved relative improvements over all three standard VAD algorithms (G.723.1, G.729, and DSR VAD) at all SNRs.
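A minimal sketch of a VAD decision with a hangover criterion, assuming a single summed MFB energy per frame and illustrative threshold and hangover values:

```python
def vad_with_hangover(frame_energies, threshold, hangover=5):
    """Frame-wise speech/non-speech decisions from mel-filter-bank
    output energies.  The hangover keeps the speech flag raised for a
    few frames after the energy drops, so trailing low-energy speech
    (e.g. word endings) is not cut off."""
    decisions, hold = [], 0
    for energy in frame_energies:
        if energy > threshold:
            decisions.append(True)
            hold = hangover       # re-arm the hangover counter
        elif hold > 0:
            decisions.append(True)
            hold -= 1             # still inside the hangover period
        else:
            decisions.append(False)
    return decisions

# one loud frame followed by silence: the flag stays up for 2 extra frames
flags = vad_with_hangover([0, 10, 0, 0, 0, 0], threshold=5, hangover=2)
```

In a DSR front-end, frames flagged False would simply not have their parameters sent over the channel.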
Speech Communication | 2003
Bojan Imperl; Zdravko Kacic; Bogomir Horvat; Andrej Zgank
This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for defining a multilingual set of triphones is proposed. The clustering algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of explicit estimates of context similarity at the monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages. The algorithm was applied to the monolingual triphone sets of the language-specific recognisers for all languages. To evaluate the clustering algorithm, the performance of the multilingual triphone set was compared to that of a reference system composed of all three language-specific recognisers operating in parallel, and to that of the multilingual triphone set produced by a tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish, and German). Experiments have shown that the clustering algorithm significantly reduces the number of triphones with only minor degradation of the recognition rate.
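The indirect triphone distance can be sketched as a weighted sum of monophone-level context distances; the weights and the monophone distance table below are illustrative, not the paper's Houtgast-based estimates:

```python
def triphone_distance(t1, t2, mono_dist, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of monophone-level distances between the left
    context, centre phone, and right context of two triphones."""
    return sum(w * mono_dist[(a, b)] for w, a, b in zip(weights, t1, t2))

# illustrative symmetric monophone distance table
mono_dist = {
    ("a", "a"): 0.0, ("b", "b"): 0.0, ("t", "t"): 0.0,
    ("a", "b"): 1.0, ("b", "a"): 1.0,
}

# triphones as (left context, centre phone, right context):
# same centre phone and right context, different left context
d = triphone_distance(("a", "t", "b"), ("b", "t", "b"), mono_dist)
```

Agglomerative clustering then repeatedly merges the pair of triphones (or clusters) with the smallest such distance until a size or distance criterion is met, shrinking the multilingual triphone set.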
international conference on acoustics, speech, and signal processing | 2000
Bojan Imperl; Zdravko Kacic; Bogomir Horvat; Andrej Zgank
The paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. Two different approaches to defining a multilingual set of triphones, bottom-up and top-down, are investigated. A new clustering algorithm for defining the multilingual triphone set is proposed. The agglomerative (bottom-up) clustering algorithm is based on a distance measure for triphones defined as a weighted sum of explicit estimates of context similarity at the monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The second type of system uses tree-based (top-down) clustering with a common decision tree. The experiments were based on the SpeechDat II databases (Slovenian, Spanish, and German 1000 FDB SpeechDat II). Experiments have shown that the agglomerative clustering algorithm significantly reduces the number of triphones with only minor degradation of word accuracy.
Speech Communication | 2007
Matej Rojc; Zdravko Kacic
This paper proposes a time- and space-efficient architecture for a text-to-speech (TTS) synthesis system. The proposed architecture can be used efficiently in unlimited-domain applications requiring multilingual or polyglot functionality. The integration of a queuing mechanism, heterogeneous relation graphs, and finite-state machines gives a powerful, reliable, and easily maintainable TTS architecture. A flexible, language-independent framework efficiently integrates all the algorithms used within the scope of the TTS system. Heterogeneous relation graphs are used for representing linguistic information and constructing features. Finite-state machines are used for time- and space-efficient representation of language resources, for time- and space-efficient lookup, and for separating language-dependent resources from the language-independent TTS engine. The queuing mechanism consists of several deque (double-ended queue) data structures and is responsible for activating those TTS engine modules that still have to process the input text. In the proposed architecture, all modules use the same data structure for gathering linguistic information about the input text. All input and output formats are compatible; the structure is modular, interchangeable, easily maintainable, and object-oriented. The proposed architecture was successfully used to implement the Slovenian PLATTOS corpus-based TTS system, as presented in this paper.
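The queuing mechanism can be sketched as a deque of engine modules, each reading and writing one shared utterance structure; the module names here are hypothetical, not the actual PLATTOS components:

```python
from collections import deque

class Utterance:
    """Shared data structure all TTS modules read from and write to."""
    def __init__(self, text):
        self.text = text
        self.annotations = {}

def run_pipeline(utterance, modules):
    """Queue-driven engine sketch: modules that still have to process
    the input text wait in a deque and are activated in order."""
    queue = deque(modules)
    while queue:
        module = queue.popleft()
        module(utterance)
    return utterance

# hypothetical modules standing in for tokenization and phonetization
def tokenize(utt):
    utt.annotations["tokens"] = utt.text.split()

def phonetize(utt):
    utt.annotations["phones"] = [t.lower() for t in utt.annotations["tokens"]]

result = run_pipeline(Utterance("Hello world"), [tokenize, phonetize])
```

Because every module consumes and produces the same structure, modules stay interchangeable and a language is swapped by loading different resources, not a different engine.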