Cemil Demir
Scientific and Technological Research Council of Turkey
Publications
Featured research published by Cemil Demir.
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Cemil Demir; Murat Saraclar; Ali Taylan Cemgil
In this study, we describe a mixture-model-based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog and is modeled by a scaled conditional mixture model representing that jingle. The speech signal is modeled by a probabilistic model similar to the probabilistic interpretation of the Non-negative Matrix Factorization (NMF) model. The parameters of the speech model are estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models, which correspond to the Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures, respectively. Our experiments show that the proposed mixture model outperforms a standard NMF method in both speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures that impose temporal continuity between the jingle frames. Our tests with real data show that the method improves speech recognition performance.
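The two observation models lead to the two divergences through beta-divergence NMF. A minimal sketch of that correspondence under standard multiplicative updates, assuming magnitude/power spectrograms in a NumPy array (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def nmf_beta(V, rank, beta, n_iter=200, eps=1e-10):
    """Beta-divergence NMF with multiplicative updates.
    beta=1 -> Kullback-Leibler (Poisson observation model),
    beta=0 -> Itakura-Saito (complex Gaussian observation model)."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, T)) + eps   # time-varying activations
    for _ in range(n_iter):
        Vh = W @ H + eps
        W *= ((Vh ** (beta - 2) * V) @ H.T) / ((Vh ** (beta - 1)) @ H.T + eps)
        Vh = W @ H + eps
        H *= (W.T @ (Vh ** (beta - 2) * V)) / (W.T @ (Vh ** (beta - 1)) + eps)
    return W, H
```

Separation would then combine the speech and music reconstructions into a Wiener-style soft mask applied to the mixture spectrogram.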
Signal Processing and Communications Applications Conference | 2009
Cemil Demir; Mehmet Ugur Dogan
Using posterior-probability-based features to segment an audio signal into speech and music is a commonly used method. In this study, Hidden Markov Model (HMM) based acoustic models are used to calculate the posterior probabilities, with the states of context-independent phones as the modeling unit. Entropy and dynamism are computed from these posterior probabilities and used as features for speech-music discrimination. An HMM-based classifier using Viterbi decoding is implemented, and with these discriminative features the audio signal is segmented into speech and music. The tests show that the applied speech-music segmentation method decreases the word error rate and increases recognition speed.
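As a sketch of the two features, both can be computed directly from a per-frame posterior matrix; the shapes and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def entropy_dynamism(P, eps=1e-12):
    """P: (T, K) matrix of per-frame posteriors over K phone states.
    Entropy tends to be low for speech (one phone dominates per frame)
    and high for music; dynamism measures how quickly the posterior
    distribution changes between consecutive frames."""
    P = np.clip(P, eps, 1.0)
    entropy = -np.sum(P * np.log(P), axis=1)            # shape (T,)
    dynamism = np.sum(np.diff(P, axis=0) ** 2, axis=1)  # shape (T-1,)
    return entropy, dynamism
```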
IEEE Automatic Speech Recognition and Understanding Workshop | 2011
Cemil Demir; Ali Taylan Cemgil; Murat Saraclar
In this study, we analyze the gain estimation problem of the catalog-based single-channel speech-music separation method that we proposed previously. In that method, assuming a known catalog of the background music, we developed a generative model for the superposed speech and music spectrograms. We represent the speech spectrogram by a Non-Negative Matrix Factorization (NMF) model and the music spectrogram by a conditional Poisson Mixture Model (PMM). In this model, we assume that the background music is generated by repeating and changing the gain of a jingle in the music catalog. Although the separation performance of the method is satisfactory with known gain values, performance decreases when the gain value of the jingle is unknown and has to be estimated. In this paper, we address the gain estimation problem of the catalog-based method and propose three approaches to overcome it. The first uses a Gamma Markov Chain (GMC) probabilistic structure to impose correlation between the gain parameters across time frames; with the GMC, the gain parameter is estimated more accurately. The other approaches are maximum a posteriori (MAP) and piece-wise constant estimation (PCE) of the gain values. All three approaches improve the separation performance compared to the original method, with the GMC approach achieving the best performance.
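For intuition, one simple parameterization of such a chain couples each gain to the previous one through a Gamma distribution; this is an illustrative form, and the paper's exact parameterization may differ:

```latex
g_1 \sim \mathcal{G}(g_1;\, a,\, b/a), \qquad
g_t \mid g_{t-1} \sim \mathcal{G}(g_t;\, a,\, g_{t-1}/a), \quad t = 2, \dots, T
```

With this shape/scale convention, E[g_t | g_{t-1}] = g_{t-1}, so the chain favors slowly varying gains, and the shape parameter a controls how strongly consecutive frames are coupled (larger a means smoother gain trajectories).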
Signal Processing and Communications Applications Conference | 2011
Cemil Demir; Mehmet Ugur Dogan; A. Taylan Cemgil; Murat Saraclar
In this study, single-channel speech source separation is carried out to separate speech from background music, which degrades speech recognition performance, especially in broadcast news transcription systems. Since the separation uses a single observation of the source signals, the sources have to be modeled beforehand using training data, and Non-negative Matrix Factorization (NMF) methods are used for this modeling. To model the source signals, different training data sets containing different music and speech data are created, and the effect of these training data sets is analyzed. The performance of the methods is measured not only with separation metrics but also with speech recognition metrics.
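A minimal sketch of how pre-trained NMF dictionaries are typically used at separation time: the bases stay fixed and only the activations are fitted on the mixture (KL updates assumed here; names are illustrative, not the authors' code):

```python
import numpy as np

def separate(X_mag, W_speech, W_music, n_iter=100, eps=1e-10):
    """W_speech / W_music: basis matrices trained beforehand on the speech
    and music training sets; only activations H are estimated on the mixture."""
    W = np.concatenate([W_speech, W_music], axis=1)      # (F, Ks + Km)
    K, T = W.shape[1], X_mag.shape[1]
    H = np.random.default_rng(0).random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H + eps
        H *= (W.T @ (X_mag / Vh)) / (W.T.sum(axis=1, keepdims=True) + eps)
    Ks = W_speech.shape[1]
    V_s = W_speech @ H[:Ks]                              # speech reconstruction
    V_m = W_music @ H[Ks:]                               # music reconstruction
    mask = V_s / (V_s + V_m + eps)                       # Wiener-style soft mask
    return mask * X_mag                                  # estimated speech magnitude
```

The choice of training data matters precisely because it determines W_speech and W_music, which are never re-estimated on the test mixture.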
Signal Processing and Communications Applications Conference | 2014
M. Said Aydemir; Burak Aydin; Hamza Kaya; Ibrahim Karliaga; Cemil Demir
In this study, two different handwriting recognition systems for Ottoman and Turkish are developed, one using a Hidden Markov Model (HMM) and one using a Recurrent Neural Network (RNN). The systems are tested both on public datasets and on the Civil Registration and Nationality (CRN) dataset. As public datasets, the IFN/ENIT dataset, created for Arabic, is used because of the similarity between Ottoman and Arabic script, and the IAM dataset, which consists of Latin characters, is also tested. Because the CRN dataset is not suitable for direct use, contrast enhancement, line and background removal, conversion of 24-bit images to binary format, image resizing to a normalized font size, and skew detection and correction are applied as pre-processing steps. When the recognition results of the two systems are compared, the RNN-based system gives 8% higher accuracy than the HMM-based system.
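A sketch of such a pre-processing chain using OpenCV; this illustrates the kinds of steps listed above and is not the CRN pipeline itself (the angle handling follows a common deskew recipe, and OpenCV versions differ in the angle convention returned by minAreaRect):

```python
import cv2
import numpy as np

def preprocess(path, target_height=64):
    """Grayscale -> contrast stretch -> Otsu binarization -> deskew -> resize."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)  # contrast enhancement
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]                      # skew estimate
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_NEAREST)
    return cv2.resize(deskewed, (int(w * target_height / h), target_height))
```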
Signal Processing and Communications Applications Conference | 2014
Ahmet Afsin Akin; Cemil Demir
In this paper, we present SmoothLm, a language model compression and random access library. Like some previous work, the library uses Minimal Perfect Hash Functions (MPHF) to reach high compression rates. We improve a previous MPHF algorithm in terms of generation and query speed and name the result Multi-Level MPHF. We also present a mechanism that builds this MPHF structure on very large data sets quickly and with limited memory usage. SmoothLm generates lossy models and provides a quantization mechanism for probability values for extra compression. We use SmoothLm in our in-house speech recognition engine, and our experiments show that, with the correct parameters, neither the lossy model nor the quantization hurts performance. The library targets applications developed in Java, and the source code is available under a free license.
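The quantization option can be pictured as mapping float log-probabilities onto a small codebook and storing only bin indices next to the MPHF-addressed slots. A minimal sketch of that idea with linear binning (an assumption for illustration; SmoothLm's actual scheme may differ):

```python
import numpy as np

def quantize_logprobs(logprobs, bits=8):
    """Map float log-probabilities to 2**bits levels; store uint8 codes plus
    a small codebook. Dequantize with codebook[codes]."""
    lo, hi = float(np.min(logprobs)), float(np.max(logprobs))
    levels = (1 << bits) - 1
    span = max(hi - lo, 1e-12)
    codes = np.round((logprobs - lo) / span * levels).astype(np.uint8)
    codebook = lo + np.arange(levels + 1) / levels * span
    return codes, codebook
```

At 8 bits, each n-gram's probability costs one byte instead of a four-byte float, which is where much of the extra compression comes from.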
Signal Processing and Communications Applications Conference | 2012
Ahmet Afsin Akin; Cemil Demir; Mehmet Ugur Dogan
In this study, solutions to the out-of-vocabulary (OOV) word problem of automatic speech recognition (ASR) systems for agglutinative languages such as Turkish are examined, and an improvement is proposed. It has been shown that sub-word language models outperform word-based models by reducing the OOV ratio in languages with complex morphology. In this work, we propose improvements to both statistical and morphological sub-word language modeling techniques by applying language-dependent pre-processing to words before sub-word segmentation. In our tests, using the largest Turkish broadcast news corpus to date, the proposed models give better results than the baseline statistical and morphological sub-word language models.
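As an illustration of what language-dependent pre-processing before segmentation can look like, the toy rule below splits the apostrophe-marked suffixes of Turkish proper nouns so the stem stays one vocabulary unit; this is an assumed example of the rule family, not the paper's exact rule set:

```python
import re

def preprocess_turkish(word):
    """Turkish-aware lowercasing, then split an apostrophe-marked suffix:
    "Ankara'dan" -> ["ankara", "+dan"]."""
    word = word.replace("I", "ı").replace("İ", "i").lower()
    m = re.match(r"^(.+)'(\w+)$", word)
    if m:
        return [m.group(1), "+" + m.group(2)]
    return [word]

print(preprocess_turkish("Ankara'dan"))   # ['ankara', '+dan']
```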
Signal Processing and Communications Applications Conference | 2011
Erdem Ünal; Cemil Demir; Mehmet Ugur Dogan
Representing music for the purpose of matching queries is one of the popular subfields of information retrieval. In this paper, studies carried out for the project 'Music Tracking System for Royalty Rights Management', funded by the TÜBİTAK ARDEB 3501 grant program, are presented. For the technical representation of music in matching, the tonal music space theory and its background are explained. Short-time spectral analysis features are mapped onto the three-dimensional musical space for a meaningful representation. The symbolic representation is then integrated into a look-up table of N-gram blocks; this process is performed for each database entry, so the look-up table becomes an N-gram block representation of the entire database. When a query arrives, its symbolic sequence is searched in the look-up table and a match result is presented to the user. On a database of ten thousand music samples, full-clip and partial-clip query matching results and technical details about the performance of the system are presented.
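The look-up table amounts to an inverted index from N-gram blocks of symbols to positions in the catalog. A minimal sketch, assuming the tonal-space mapping upstream already yields a symbol sequence per clip (names are illustrative):

```python
from collections import defaultdict

def build_index(database, n=4):
    """database: {clip_id: [symbol, ...]}; every length-n block of symbols
    points back to (clip_id, position)."""
    index = defaultdict(list)
    for clip_id, symbols in database.items():
        for i in range(len(symbols) - n + 1):
            index[tuple(symbols[i:i + n])].append((clip_id, i))
    return index

def match(query, index, n=4):
    """Vote over all n-gram hits; the clip with the most votes wins."""
    votes = defaultdict(int)
    for i in range(len(query) - n + 1):
        for clip_id, _ in index.get(tuple(query[i:i + n]), ()):
            votes[clip_id] += 1
    return max(votes, key=votes.get) if votes else None
```

Partial-clip queries work naturally here, since any sufficiently long run of matching blocks accumulates votes for the source clip.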
Signal Processing and Communications Applications Conference | 2009
Coskun Mermer; Cemil Demir; Hamza Kaya; Mehmet Ugur Dogan
In this paper, we introduce a Turkish-to-English speech translation system trained with a parallel corpus recently developed at TÜBİTAK-UEKAE and we investigate the performance improvement obtained by using speaker adaptation. We describe the two sub-modules of the system, namely, the speech recognition and machine translation modules, and we investigate the relationship between their individual performances and the overall system performance using various metrics. Furthermore, we present the improvements in recognition error rate and recognition time obtained by applying speaker adaptation.
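The system is a cascade, which a few lines make concrete; the one-call interfaces below are hypothetical stand-ins, not the system's actual API:

```python
def translate_speech(audio, recognize, translate):
    """Cascade sketch: the recognizer's 1-best Turkish hypothesis feeds the
    translator, so recognition errors propagate into the translation."""
    hypothesis = recognize(audio)   # ASR module: typically evaluated with WER
    return translate(hypothesis)    # MT module: typically evaluated with BLEU
```

This error propagation is why the paper relates the modules' individual metrics to overall system performance.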