Simon Dobrisek
University of Ljubljana
Publication
Featured research published by Simon Dobrisek.
International Journal of Speech Technology | 2003
Simon Dobrisek; Jerneja Gros; Bostjan Vesnicer; Nikola Pavešić
Blind and visually-impaired people face many problems in interacting with information retrieval systems. State-of-the-art spoken language technology offers the potential to overcome many of them. In the mid-nineties our research group decided to develop an information retrieval system suitable for Slovene-speaking blind and visually-impaired people. A voice-driven text-to-speech dialogue system was developed for reading Slovenian texts obtained from the Electronic Information System of the Association of Slovenian Blind and Visually Impaired Persons Societies. The evolution of the system is presented. The early version of the system was designed to deal explicitly with the Electronic Information System, where the available text corpora are stored in a plain text file format without any, or with just some, basic non-standard tagging. Further improvements to the system became possible with the decision to transfer the available corpora to the new web portal, exclusively dedicated to blind and visually-impaired users. The text files were reformatted into common HTML/XML pages, which comply with the basic recommendations set by the Web Accessibility Initiative. In the latest version of the system all the modules of the early version are being integrated into the user interface, which has some basic web-browsing functionalities and a text-to-speech screen-reader function that can also be controlled by the mouse.
International Journal of Speech Technology | 2003
Jerneja Gros; Simon Dobrisek; Janez Žibert; Nikola Pavešić
This paper presents the Slovene-language spoken resources that were acquired at the Laboratory of Artificial Perception, Systems and Cybernetics (LUKS) at the Faculty of Electrical Engineering, University of Ljubljana over the past ten years. The resources consist of:
• isolated-spoken-word corpora designed for phonetic research of the Slovene spoken language;
• read-speech corpora from dialogues relating to air flight information;
• isolated-word corpora, designed for studying the Slovene spoken diphthongs;
• Slovene diphone corpora used for text-to-speech synthesis systems;
• a weather forecast speech database, as an attempt to capture radio and television broadcast news in the Slovene language; and
• read- and spontaneous-speech corpora used to study the effects of the psychophysical conditions of the speakers on their speech characteristics.
All the resources are accompanied by relevant text transcriptions, lexicons and various segmentation labels. The read-speech corpora relating to the air flight information domain are also annotated prosodically and semantically. The words in the orthographic transcription were automatically tagged for their lemma and morphosyntactic description. Many of the mentioned speech resources are freely available for basic research purposes in speech technology and linguistics. In this paper we describe all the resources in more detail and give a brief description of their use in the spoken language technology products developed at LUKS.
Computer Speech & Language | 2013
R. Gajšek; F. Mihelic; Simon Dobrisek
In this article we present an efficient approach to modeling the acoustic features for the task of recognizing various paralinguistic phenomena. Instead of the standard scheme of adapting the Universal Background Model (UBM), represented by the Gaussian Mixture Model (GMM) that is normally used to model the frame-level acoustic features, we propose to represent the UBM by building a monophone-based Hidden Markov Model (HMM). We present two approaches: transforming the monophone-based segmented HMM-UBM to a GMM-UBM and proceeding with the standard adaptation scheme, or performing the adaptation directly on the HMM-UBM. Both approaches give better results than the standard adaptation scheme (GMM-UBM) in both the emotion recognition task and the alcohol detection task. Furthermore, with the proposed method we were able to achieve better results than the current state-of-the-art systems in both tasks.
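For readers unfamiliar with the baseline, the following is a minimal sketch of how a GMM-UBM is commonly adapted to new data. It assumes relevance-MAP adaptation of the component means, which is a widespread choice but is not stated in the abstract, and it works on one-dimensional frames for brevity. The function names and the `relevance` parameter are illustrative, not taken from the article.

```python
import math


def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of GMM means (assumed baseline, 1-D features)."""
    k_count = len(means)
    soft_counts = [0.0] * k_count   # zeroth-order statistics per component
    first_order = [0.0] * k_count   # first-order statistics per component
    for x in frames:
        lik = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for k in range(k_count):
            post = lik[k] / total   # responsibility of component k for frame x
            soft_counts[k] += post
            first_order[k] += post * x
    adapted = []
    for k in range(k_count):
        # Components that saw more data move further from the UBM mean.
        alpha = soft_counts[k] / (soft_counts[k] + relevance)
        data_mean = first_order[k] / soft_counts[k] if soft_counts[k] > 0 else means[k]
        adapted.append(alpha * data_mean + (1.0 - alpha) * means[k])
    return adapted


# Toy UBM with two components; all adaptation frames lie near the second one.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
adapted_means = map_adapt_means([3.0] * 20, ubm_weights, ubm_means, ubm_vars)
```

In this toy run only the second mean shifts noticeably toward the data, while the first stays close to its UBM value because it receives almost no soft counts.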
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2009
Simon Dobrisek; Janez Zibert; Nikola Pavesic; F. Mihelic
An edit-distance model that can be used for the approximate matching of contiguous and noncontiguous timed strings is presented. The model extends the concept of the weighted string-edit distance by introducing timed edit operations and by making the edit costs time dependent. Special attention is paid to the timed null symbols that are associated with the timed insertions and deletions. The usefulness of the presented model is demonstrated on the classification of phone-recognition errors using the TIMIT speech database.
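The idea of time-dependent edit costs can be illustrated with a small dynamic-programming sketch. The cost functions below are assumptions chosen for illustration only (a label mismatch plus a midpoint-offset penalty for substitutions, and a duration-proportional cost for insertions and deletions); they are not the cost model defined in the paper, and the names `sub_cost`, `indel_cost`, `alpha` and `beta` are hypothetical.

```python
from typing import List, Tuple

Timed = Tuple[str, float, float]  # (label, start time, end time)


def sub_cost(a: Timed, b: Timed, alpha: float = 1.0) -> float:
    """Illustrative substitution cost: label mismatch plus time misalignment."""
    label_cost = 0.0 if a[0] == b[0] else 1.0
    mid_a = (a[1] + a[2]) / 2.0
    mid_b = (b[1] + b[2]) / 2.0
    return label_cost + alpha * abs(mid_a - mid_b)


def indel_cost(a: Timed, beta: float = 1.0) -> float:
    """Illustrative insertion/deletion cost that grows with segment duration."""
    return beta * (a[2] - a[1])


def timed_edit_distance(x: List[Timed], y: List[Timed],
                        alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted edit distance between two timed strings via dynamic programming."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + indel_cost(x[i - 1], beta)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + indel_cost(y[j - 1], beta)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(x[i - 1], beta),             # deletion
                d[i][j - 1] + indel_cost(y[j - 1], beta),             # insertion
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1], alpha) # substitution
            )
    return d[n][m]


reference = [("a", 0.0, 1.0), ("b", 1.0, 2.0)]
recognized = [("a", 0.0, 1.0)]
same = timed_edit_distance(reference, reference)
one_deletion = timed_edit_distance(reference, recognized)
```

With these toy costs, deleting a one-second segment costs exactly its duration, so a missed phone is penalized in proportion to how much of the signal it covered.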
3rd International Workshop on Biometrics and Forensics (IWBF 2015) | 2015
Vitomir Struc; Janez Krizaj; Simon Dobrisek
The facial imagery available for forensic investigations is commonly of poor quality due to the unconstrained settings in which it was acquired. The captured faces are typically non-frontal, partially occluded and of low resolution, which makes the recognition task extremely difficult. In this paper we try to address this problem by presenting a novel framework for face recognition that combines diverse feature sets (Gabor features, local binary patterns, local phase quantization features and pixel intensities), probabilistic linear discriminant analysis (PLDA) and data fusion based on linear logistic regression. With the proposed framework a matching score for the given pair of probe and target images is produced by applying PLDA on each of the four feature sets independently - producing a (partial) matching score for each of the PLDA-based feature vectors - and then combining the partial matching results at the score level to generate a single matching score for recognition. We make two main contributions in the paper: i) we introduce a novel framework for face recognition that relies on probabilistic MOdels of Diverse fEature SeTs (MODEST) to facilitate the recognition process and ii) we benchmark it against the existing state of the art. We demonstrate the feasibility of our MODEST framework on the FRGCv2 and PaSC databases and present comparative results with state-of-the-art recognition techniques, which demonstrate the efficacy of our framework.
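The score-level fusion step described above can be sketched as follows. This is a minimal illustration that assumes the four partial matching scores are already available (in the paper they come from PLDA; here they are made-up toy numbers) and fits the fusion weights by plain gradient descent on the logistic loss. The function names, learning rate and epoch count are hypothetical choices, not the authors' settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit weights w and offset b so that sigmoid(w . s + b) models P(same identity)."""
    dim = len(scores[0])
    w = [0.0] * dim
    b = 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            for k in range(dim):
                grad_w[k] += err * s[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fuse(w, b, partial_scores):
    """Single fused matching score: weighted sum of partial scores plus offset."""
    return sum(wi * si for wi, si in zip(w, partial_scores)) + b


# Toy partial scores for four feature sets (invented for illustration).
genuine = [[2.0, 1.5, 1.8, 2.2], [1.7, 2.1, 1.4, 1.9], [2.3, 1.9, 2.0, 1.6]]
impostor = [[-1.8, -2.0, -1.5, -2.2], [-2.1, -1.4, -1.9, -1.7], [-1.5, -1.8, -2.2, -2.0]]
w, b = train_fusion(genuine + impostor, [1.0] * 3 + [0.0] * 3)
fused_match = fuse(w, b, [2.0, 2.0, 2.0, 2.0])
fused_nonmatch = fuse(w, b, [-2.0, -2.0, -2.0, -2.0])
```

The fused score is linear in the partial scores, so the learned weights can be read as the relative contribution of each feature set to the final decision.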
IEEE International Conference on Automatic Face and Gesture Recognition | 2013
Janez Krizaj; Vitomir Struc; Simon Dobrisek
The paper introduces a novel framework for 3D face recognition that capitalizes on region covariance descriptors and Gaussian mixture models. The framework presents an elegant and coherent way of combining multiple facial representations, while simultaneously examining all computed representations at various levels of locality. The framework first computes a number of region covariance matrices/descriptors from different sized regions of several image representations and then adopts the unscented transform to derive low-dimensional feature vectors from the computed descriptors. By doing so, it enables computations in the Euclidean space, and makes Gaussian mixture modeling feasible. In the last step a support vector machine classification scheme is used to make a decision regarding the identity of the modeled input 3D face image. The proposed framework exhibits several desirable characteristics, such as an inherent mechanism for data fusion/integration (through the region covariance matrices), the ability to examine the facial images at different levels of locality, and the ability to integrate domain-specific prior knowledge into the modeling procedure. We assess the feasibility of the proposed framework on the Face Recognition Grand Challenge version 2 (FRGCv2) database with highly encouraging results.
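A region covariance descriptor is the covariance matrix of per-pixel feature vectors collected over an image region. The sketch below assumes a very simple, hypothetical feature vector (column, row, pixel value) and a tiny toy image; the paper uses several image representations and its own feature choices, which are not reproduced here.

```python
def pixel_features(image, top, left, height, width):
    """Collect one feature vector [x, y, value] per pixel of the region (assumed features)."""
    feats = []
    for r in range(top, top + height):
        for c in range(left, left + width):
            feats.append([float(c), float(r), float(image[r][c])])
    return feats


def region_covariance(features):
    """Sample covariance matrix (d x d) of the d-dimensional feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in features:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (f[i] - mean[i]) * (f[j] - mean[j])
    return [[c / (n - 1) for c in row] for row in cov]


# Toy 2x2 "depth image"; the descriptor covers the whole image.
toy_image = [[1, 2],
             [3, 4]]
descriptor = region_covariance(pixel_features(toy_image, 0, 0, 2, 2))
```

The descriptor size depends only on the number of features per pixel, not on the region size, which is what allows regions of different sizes to be compared and combined.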
Computer Communications | 2003
Nikola Pavesic; Jerneja Gros; Simon Dobrisek
HOMER II is a voice-driven text-to-speech system developed for blind or visually impaired persons for reading Slovenian texts. Users can obtain texts from the Internet site of the Association of Slovenian Blind and Visually Impaired Persons Societies, from its Electronic Information System, where they can find daily newspapers, some novels and other information. The system consists of four main modules. The first module enables Internet communication, retrieves text to a local disk and converts it to a standard form. The input interface manages the keyboard entry and/or speaker-independent speech recognition. The output interface performs speech synthesis of a given text and in addition prints the same text magnified to the screen. The user dialog is responsible for the user-friendly communication and controls other tasks of the system. HOMER II was ported from Linux to the MS Windows 9x/ME/NT/2000 operating systems. For the best performance it uses multi-threading and other advantages of the 32-bit environment. Further versions of the HOMER system with even more advanced dialogue modules and some basic World Wide Web browsing functionality will represent an important tool in the distance learning and teaching process for impaired persons using academic networks.
IEEE International Conference on Automatic Face and Gesture Recognition | 2015
Tadej Justin; Vitomir Struc; Simon Dobrisek; Bostjan Vesnicer; Ivo Ipšić
The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Because a Diphone RecOgnition step and a sPeech SYnthesis step are used during the de-identification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent from each other, which ensures the highest level of input speaker de-identification. The proposed DROPSY-based de-identification approach is language dependent, text independent and capable of running in real-time due to the relatively simple computing methods used. When designing speaker de-identification technology two requirements are typically imposed on the de-identification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., with the HMM-based and the diphone TD-PSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.
International Journal of Advanced Robotic Systems | 2012
Janez Križaj; Vitomir Struc; Simon Dobrisek
This paper focuses on the use of Gaussian mixture models (GMMs) for 3D face verification. A special interest is taken in practical aspects of 3D face verification systems, where all steps of the verification procedure need to be automated and no meta-data, such as pre-annotated eye/nose/mouth positions, is available to the system. In such settings the performance of the verification system correlates heavily with the performance of the employed alignment (i.e., geometric normalization) procedure. We show that popular holistic as well as local recognition techniques, such as principal component analysis (PCA) or scale-invariant feature transform (SIFT)-based methods, considerably deteriorate in their performance when an “imperfect” geometric normalization procedure is used to align the 3D face scans and that in these situations GMMs should be preferred. Moreover, several possibilities to improve the performance and robustness of the classical GMM framework are presented and evaluated: i) explicit inclusion of spatial information during the GMM construction procedure, ii) implicit inclusion of spatial information during the GMM construction procedure and iii) on-line evaluation and possible rejection of local feature vectors based on their likelihood. We successfully demonstrate the feasibility of the proposed modifications on the Face Recognition Grand Challenge data set.
International Conference on Pattern Recognition | 2010
Vitomir Struc; Simon Dobrisek; Nikola Pavesic
Subspace projection techniques are known to be susceptible to the presence of partial occlusions in the image data. To overcome this susceptibility, we present in this paper a confidence weighting scheme that assigns weights to pixels according to a measure that quantifies the confidence that the pixel in question represents an outlier. With this procedure the impact of the occluded pixels on the subspace representation is reduced and robustness to partial occlusions is obtained. Next, the confidence weighting concept is improved by a local procedure for the estimation of the subspace representation. Both the global weighting approach and the local estimation procedure are assessed in face recognition experiments on the AR database, where encouraging results are obtained with partially occluded facial images.
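The effect of per-pixel weights on a subspace projection can be sketched as a weighted least-squares problem. The example below assumes the weights are already given (the paper derives them from an outlier-confidence measure that is not reproduced here) and uses a hypothetical two-vector basis over four "pixels". The helper names `solve` and `weighted_projection` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_projection(x, mean, basis, weights):
    """Coefficients c minimizing sum_p w_p * (x_p - mean_p - sum_k c_k * basis[k][p])**2."""
    k = len(basis)
    # Weighted normal equations: (U W U^T) c = U W (x - mean)
    a = [[sum(w * basis[i][p] * basis[j][p] for p, w in enumerate(weights))
          for j in range(k)] for i in range(k)]
    b = [sum(w * basis[i][p] * (x[p] - mean[p]) for p, w in enumerate(weights))
         for i in range(k)]
    return solve(a, b)


basis = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
mean = [0.0, 0.0, 0.0, 0.0]
occluded = [100.0, 3.0, 2.0, 3.0]   # clean signal would be [2, 3, 2, 3]; pixel 0 is corrupted

plain = weighted_projection(occluded, mean, basis, [1.0, 1.0, 1.0, 1.0])
robust = weighted_projection(occluded, mean, basis, [0.0, 1.0, 1.0, 1.0])
```

With uniform weights the corrupted pixel drags the first coefficient far from its clean value, whereas giving that pixel zero weight recovers the coefficients of the unoccluded signal.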