Damjan Vlaj | Researchain

Publication

Featured research published by Damjan Vlaj.

EURASIP Journal on Advances in Signal Processing | 2005

A computationally efficient mel-filter bank VAD algorithm for distributed speech recognition systems

Damjan Vlaj; Bojan Kotnik; Bogomir Horvat; Zdravko Kacic

This paper presents a novel, computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When VAD algorithms are used in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called hangover criterion. Comparative tests are reported between the proposed MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) standards. These tests were made on the Aurora 2 database at different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards (G.723.1 VAD, G.729 VAD, and DSR VAD) across all SNRs.
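The hangover idea mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's decision rule, so the energy-sum threshold in `raw_vad_decisions` and the parameter names `threshold` and `hangover_frames` are assumptions made for illustration only.

```python
def raw_vad_decisions(mfb_frames, threshold):
    """Assumed rule: a frame is speech if the sum of its filter-bank outputs exceeds a threshold."""
    return [sum(frame) > threshold for frame in mfb_frames]


def apply_hangover(decisions, hangover_frames):
    """Keep labelling frames as speech for `hangover_frames` frames after each speech frame."""
    smoothed = []
    remaining = 0  # hangover frames still to be labelled as speech
    for is_speech in decisions:
        if is_speech:
            smoothed.append(True)
            remaining = hangover_frames
        elif remaining > 0:
            smoothed.append(True)
            remaining -= 1
        else:
            smoothed.append(False)
    return smoothed


# Toy filter-bank outputs (two bands per frame), not real data.
frames = [[0.1, 0.1], [2.0, 1.5], [1.8, 1.9], [0.2, 0.1], [0.1, 0.1], [0.1, 0.0]]
raw = raw_vad_decisions(frames, threshold=1.0)
smoothed = apply_hangover(raw, hangover_frames=2)
print(raw)       # [False, True, True, False, False, False]
print(smoothed)  # [False, True, True, True, True, False]
```

Only frames that end up labelled as speech would then be sent over the channel; a longer hangover keeps more low-energy word endings at the cost of transmitting more nonspeech frames.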

International Journal of Speech Technology | 2003

Efficient Noise Robust Feature Extraction Algorithms for Distributed Speech Recognition (DSR) Systems

Bojan Kotnik; Damjan Vlaj; Bogomir Horvat

The evolution of robust speech recognition systems that maintain a high level of recognition accuracy in difficult and dynamically varying acoustic environments is becoming increasingly important as speech recognition technology becomes a more integral part of mobile applications. In a distributed speech recognition (DSR) architecture, the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. The terminal performs the feature parameter extraction, that is, the front-end of the speech recognition system. These features are transmitted over a data channel to the remote back-end recogniser. DSR provides particular benefits for mobile device applications, such as improved recognition performance compared to using the voice channel, and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into the DSR system is required to operate in real time and at the lowest possible computational cost. In this paper, two innovative front-end processing techniques for noise-robust speech recognition are presented and compared: time-domain based frame-attenuation (TD-FrAtt) and frequency-domain based frame-attenuation (FD-FrAtt). These techniques include different forms of frame-attenuation, an improvement of spectral subtraction based on minimum statistics, and a mel-cepstrum feature extraction procedure. Tests are performed using the Slovenian SpeechDat II fixed telephone database and the Aurora 2 database together with the HTK speech recognition toolkit. The results obtained are especially encouraging for mobile DSR systems with limited memory and processing power.
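Spectral subtraction with a minimum-based noise estimate can be sketched roughly as follows. This is a heavily simplified illustration, not the paper's improved method: the noise estimate is just the per-bin minimum over a short sliding window, without the smoothing and bias compensation of full minimum-statistics estimators. The names `window` and `floor` are assumptions.

```python
def min_noise_estimate(power_frames, window):
    """Per-bin noise estimate: minimum power over the last `window` frames (inclusive)."""
    estimates = []
    for t in range(len(power_frames)):
        recent = power_frames[max(0, t - window + 1): t + 1]
        estimates.append([min(column) for column in zip(*recent)])
    return estimates


def spectral_subtract(power_frames, window=3, floor=0.1):
    """Subtract the noise estimate from each bin, never going below `floor` times the input."""
    noise = min_noise_estimate(power_frames, window)
    return [
        [max(p - n, floor * p) for p, n in zip(frame, estimate)]
        for frame, estimate in zip(power_frames, noise)
    ]


# Toy power spectra (two bins per frame): a stationary background with one burst in bin 0.
frames = [[1.0, 2.0], [1.0, 2.0], [9.0, 2.0], [1.0, 2.0]]
cleaned = spectral_subtract(frames)
print(cleaned[2])  # [8.0, 0.2]
```

The spectral floor prevents negative or zero power values, which would otherwise break the logarithm in a later mel-cepstrum step.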

Digital Signal Processing | 2013

Acoustic classification and segmentation using modified spectral roll-off and variance-based features

Marko Kos; Zdravko Kacic; Damjan Vlaj

This paper presents novel features and an architecture for an automatic on-line acoustic classification and segmentation system. The system includes speech/non-speech segmentation (with the emphasis on accurate speech/music segmentation), gender segmentation, and speech bandwidth segmentation. This automatic segmentation system can be easily integrated into an automatic continuous speech recognition system, where information about individual acoustic segments can be used for acoustic model selection and adaptation, or as additional information for rich transcription output. Acoustic model adaptation can improve the speech recognition rate, and additional information in rich transcription can be useful when searching for certain events or circumstances (a male speaker talking over the phone line, etc.). For speech/non-speech segmentation we propose a new set of features based on energy variance in a narrow frequency sub-band, called EVFB (Energy Variance of Filter Bank). The proposed features also prove to be an efficient discriminator between speech and music. Segmentation cross-test results show that EVFB features are more robust than MFCC features. Two new features (modified spectral roll-off and high-frequency energy variance) are also proposed for speech bandwidth classification and segmentation. The results show good and robust performance by the automatic on-line acoustic segmentation system. All experiments and tests were performed on a radio broadcast database and the Slovenian BNSI Broadcast News database. Integrating the automatic on-line acoustic segmentation system into a continuous speech recognition system based on MFCC (mel-frequency cepstral coefficients) features requires only a small additional computational cost, because many of the proposed system's feature calculation procedures are shared with the MFCC feature calculation procedure.
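To show why a roll-off feature is useful for bandwidth classification, here is a sketch of the standard spectral roll-off. The abstract does not describe the paper's modification, so this is the textbook definition only; the `fraction` value of 0.85 and the assumption that bins span 0 Hz to half the sample rate are illustrative choices.

```python
def spectral_rolloff(power_spectrum, sample_rate, fraction=0.85):
    """Frequency (Hz) of the lowest bin where cumulative power reaches `fraction` of the total."""
    total = sum(power_spectrum)
    if len(power_spectrum) < 2 or total == 0:
        return 0.0
    # Assume bins are evenly spaced from 0 Hz to sample_rate / 2 inclusive.
    bin_width = (sample_rate / 2) / (len(power_spectrum) - 1)
    cumulative = 0.0
    for k, power in enumerate(power_spectrum):
        cumulative += power
        if cumulative >= fraction * total:
            return k * bin_width
    return (len(power_spectrum) - 1) * bin_width


# Toy spectra with 9 bins at 16 kHz sampling (1 kHz per bin).
narrowband = [4, 4, 4, 4, 0, 0, 0, 0, 0]   # energy only up to 3 kHz, telephone-like
wideband = [2, 2, 2, 2, 2, 2, 2, 2, 0]     # energy spread up to 7 kHz
print(spectral_rolloff(narrowband, 16000))  # 3000.0
print(spectral_rolloff(wideband, 16000))    # 6000.0
```

A simple threshold on the roll-off frequency would already separate these two toy cases; real recordings would need smoothing over several frames.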

International Conference on Systems, Signals and Image Processing | 2009

On-Line Speech/Music Segmentation for Broadcast News Domain

Marko Kos; Matej Grasic; Damjan Vlaj; Zdravko Kacic

This paper presents a novel feature group for on-line speech/music segmentation in the broadcast news domain. The features are based on the variance of mel-frequency cepstral coefficients (MFCCV). The idea behind the feature group is the energy variation in a narrow frequency sub-band, which is larger for speech than for music. A radio broadcast database was used to evaluate the discrimination and segmentation ability of the features. Results show that MFCCV features perform better than the classic MFCC features. The MFCCV features are a very convenient speech/music discriminator for automatic speech recognition systems that already use MFCC features, since only one additional calculation step is needed.
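The "one additional calculation step" can be pictured as taking the variance of each coefficient over a window of frames. The sketch below assumes MFCC vectors are already available; the window contents and the use of population variance are assumptions, since the abstract does not specify them.

```python
from statistics import pvariance


def coefficient_variances(frames):
    """Population variance of each coefficient across the frames of one window."""
    return [pvariance(column) for column in zip(*frames)]


# Toy two-coefficient vectors, not real MFCCs.
speech_like = [[1.0, 5.0], [3.0, 1.0], [1.0, 5.0], [3.0, 1.0]]  # strongly fluctuating
music_like = [[2.0, 3.0], [2.0, 3.0], [2.0, 3.0], [2.0, 3.0]]   # steady

print(coefficient_variances(speech_like))  # [1.0, 4.0]
print(coefficient_variances(music_like))   # [0.0, 0.0]
```

A classifier would then operate on these variance vectors instead of, or alongside, the raw coefficients.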

Computers & Electrical Engineering | 2012

Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria

Damjan Vlaj; Zdravko Kacic; Marko Kos

This paper introduces a nonlinear function into the frequency spectrum that improves the detection of vowels, diphthongs, and semivowels within the speech signal. The lower efficiency of consonant detection is addressed by implementing the hangover and hangbefore criteria. The paper also presents a procedure for faster definition of the optimal constants used by the hangover and hangbefore criteria. The nonlinearly modified frequency spectrum is used in the proposed GMM (Gaussian Mixture Model) based VAD (Voice Activity Detection) algorithm. Comparative tests between the proposed VAD algorithm and seven other VAD algorithms were made on the Aurora 2 database. The experiments were based on frame error detection and on speech recognition performance for two types of acoustic training modes (multi-condition and clean only). The lowest average percentage of frame errors was obtained by the proposed VAD algorithm, which also improved speech recognition performance for both types of acoustic training modes.

International Conference on Systems, Signals and Image Processing | 2009

Influence of Hangover and Hangbefore Criteria on Automatic Speech Recognition

Damjan Vlaj; Marko Kos; Matej Grasic; Zdravko Kacic

This paper presents the influence of hangover and hangbefore criteria on automatic speech recognition. A voice activity detection (VAD) algorithm is nowadays almost always part of an automatic speech recognition system. Hangover and hangbefore criteria can be integrated into the VAD algorithm after the basic VAD decision, and they can improve speech recognition results. The open question is how many frames should be used for each criterion. The duration of vowels, diphthongs, and semivowels determines how many frames must be detected as speech before deciding whether the hangover and hangbefore criteria are applied at all. Consonant frames have low spectral energy; the energy of unvoiced fricatives, unvoiced stops, and nasals is especially low. These frames are initially detected as silence, but with the hangover and hangbefore criteria they are again treated as speech. Speech recognition experiments show that the hangover and hangbefore criteria can improve speech recognition results, and that the hangbefore criterion plays a more important role in speech recognition than the hangover criterion.
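One possible reading of this scheme is sketched below: speech runs that are long enough are extended backward (hangbefore) and forward (hangover), while short runs are left untouched. The parameter names `hangbefore`, `hangover`, and `min_run`, and the interpretation of the minimum-duration condition, are assumptions; the abstract does not specify the exact rule.

```python
def apply_hang_criteria(decisions, hangbefore, hangover, min_run):
    """Extend speech runs of at least `min_run` frames by `hangbefore` frames
    before the run and `hangover` frames after it."""
    n = len(decisions)
    out = list(decisions)
    i = 0
    while i < n:
        if not decisions[i]:
            i += 1
            continue
        j = i
        while j < n and decisions[j]:
            j += 1  # decisions[i:j] is one contiguous speech run
        if j - i >= min_run:
            for k in range(max(0, i - hangbefore), i):
                out[k] = True
            for k in range(j, min(n, j + hangover)):
                out[k] = True
        i = j
    return out


F, T = False, True
raw = [F, F, F, T, T, T, F, F, F, T, F, F]
print(apply_hang_criteria(raw, hangbefore=2, hangover=1, min_run=3))
# [False, True, True, True, True, True, True, False, False, True, False, False]
```

Because hangbefore relabels frames that have already passed, an on-line system needs to buffer at least `hangbefore` frames before committing to a decision.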

International Conference on Systems, Signals and Image Processing | 2016

Tennis stroke detection and classification using miniature wearable IMU device

Marko Kos; Jernej Zenko; Damjan Vlaj; Iztok Kramberger

This paper presents work related to tennis stroke detection and classification. For arm movement acquisition, a miniature wearable IMU device positioned on the player's forearm (right above the wrist) is proposed and presented. The device uses a MEMS-based accelerometer and gyroscope with 6-DOF. Accelerometer data is used for reliable and accurate tennis stroke detection, while gyroscope data is extracted and processed for tennis stroke classification. The proposed system is able to detect and classify the three most common tennis strokes: forehand, backhand, and serve. Because of limited memory and processing power, the proposed algorithms for stroke detection and classification are quite simple, but are nonetheless capable of achieving a high classification rate. An overall tennis stroke classification accuracy of 98.1% was achieved.
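A simple accelerometer-based detector of the kind suited to a memory-limited device might look like the sketch below. This is not the paper's algorithm: the magnitude threshold, the `refractory` gap between detections, and the sample values are all assumptions for illustration.

```python
import math


def detect_strokes(accel_samples, threshold, refractory):
    """Return indices where acceleration magnitude reaches `threshold`,
    ignoring samples closer than `refractory` samples to the last detection."""
    events = []
    last = None
    for i, (ax, ay, az) in enumerate(accel_samples):
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        if magnitude >= threshold and (last is None or i - last >= refractory):
            events.append(i)
            last = i
    return events


# Toy 3-axis samples in arbitrary units: two bursts separated by rest.
samples = [(0, 0, 1), (0, 0, 1), (3, 4, 0), (6, 8, 0), (0, 0, 1),
           (0, 0, 1), (0, 0, 1), (0, 12, 5), (0, 0, 1)]
print(detect_strokes(samples, threshold=5.0, refractory=3))  # [2, 7]
```

The detector marks the onset of each burst rather than its peak; gyroscope samples around each detected index could then be passed to a separate classifier.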

Computers & Electrical Engineering | 2017

A speech-based distributed architecture platform for an intelligent ambience

Marko Kos; Matej Rojc; Andrej Žgank; Zdravko Kacic; Damjan Vlaj

In this paper, a speech-based platform for intelligent ambience and/or supportive environment applications is presented. The platform has a distributed architecture, which enables extended connectivity and support for multiple intelligent ambience services. The mobile unit Genesis is an integral part of the distributed platform, enabling interaction between several users and the environment. Furthermore, the platform's sophisticated client/server architecture incorporates robust speech recognition and text-to-speech synthesis engines for more natural human-machine interaction between users and the mobile unit Genesis. Both engines are multilingually oriented. Although the whole system is developed for the Slovenian language, it can be quickly adapted to other languages when appropriate language resources are available. With high speaker-independent speech recognition accuracy and low command-to-operation delay, Genesis proves to have good manoeuvrability and is easy to operate even for an inexperienced operator.

International Conference on Systems, Signals and Image Processing | 2016

Quick and efficient definition of hangbefore and hangover criteria for voice activity detection

Damjan Vlaj; Marko Kos; Zdravko Kacic

This paper presents a quick and efficient definition of hangbefore and hangover criteria for a Voice Activity Detection (VAD) algorithm. Hangbefore and hangover criteria are integrated into the VAD algorithm after the basic VAD decision step. The hangbefore criterion addresses the incorrect detection of unvoiced speech that occurs at the beginning and in the middle of a speech segment. The hangover criterion reduces the incorrect detection of unvoiced speech that occurs at the end and in the middle of a speech segment. There is always a dilemma over how many frames should be used for the hangover and hangbefore criteria. The main purpose of this study was to set up a procedure that quickly and reliably defines the number of frames for the hangbefore and hangover criteria. Our test results showed that the new quick procedure leads to final results very similar to those obtained with previously known procedures.

Conference of the International Speech Communication Association | 2001