Dogan Can
University of Southern California
Publications
Featured research published by Dogan Can.
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Dogan Can; Murat Saraclar
This paper considers the problem of constructing an efficient inverted index for the spoken term detection (STD) task. More specifically, we construct a deterministic weighted finite-state transducer storing soft-hits in the form of (utterance ID, start time, end time, posterior score) quadruplets. We propose a generalized factor transducer structure which retains the time information necessary for performing STD. The required information is embedded into the path weights of the factor transducer without disrupting the inherent optimality. We also describe how to index all substrings seen in a collection of raw automatic speech recognition lattices using the proposed structure. Our STD indexing/search implementation is built upon the OpenFst Library and is designed to scale well to large problems. Experiments on Turkish and English data sets corroborate our claims.
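For intuition, here is a minimal Python sketch of an inverted index of soft-hits keyed by term; the paper's actual index is a deterministic weighted finite-state transducer built with OpenFst, and the terms, times, and scores below are invented for illustration.

```python
from collections import defaultdict

# Minimal sketch of a soft-hit inverted index, assuming the hits have
# already been extracted from ASR lattices. A flat dict keyed by term
# only illustrates the stored quadruplets; the paper's factor transducer
# additionally shares substring structure across entries.

index = defaultdict(list)

def add_hit(term, utt_id, start, end, posterior):
    """Store one (utterance ID, start time, end time, posterior) soft-hit."""
    index[term].append((utt_id, start, end, posterior))

def search(term, threshold=0.5):
    """Return soft-hits whose posterior score clears the threshold."""
    return [hit for hit in index[term] if hit[3] >= threshold]

add_hit("merhaba", "utt001", 1.20, 1.85, 0.92)
add_hit("merhaba", "utt042", 7.10, 7.60, 0.31)
print(search("merhaba"))  # [('utt001', 1.2, 1.85, 0.92)]
```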
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Ebru Arisoy; Dogan Can; Siddika Parlak; Hasim Sak; Murat Saraclar
This paper summarizes our recent efforts toward building a Turkish Broadcast News transcription and retrieval system. The agglutinative nature of Turkish leads to a high number of out-of-vocabulary (OOV) words, which in turn lowers automatic speech recognition (ASR) accuracy and compromises the performance of speech retrieval systems built on ASR output. A word-based ASR system is therefore not adequate for transcribing Turkish speech. To alleviate this problem, various sub-word recognition units are utilized. These units solve the OOV problem with moderate-size vocabularies and, in terms of recognition accuracy, perform even better than a 500K-word vocabulary. As a novel approach, the interaction between recognition units (words and sub-words) and discriminative training is explored. Sub-word models benefit from discriminative training more than word models do, especially in the discriminative language modeling framework. For speech retrieval, a spoken term detection system based on automata indexation is utilized. As with transcription, retrieval performance is measured under various schemes incorporating words and sub-words. The best results are obtained using a cascade of word and sub-word indexes together with term-specific thresholding.
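A toy sketch of the underlying OOV argument, using invented Turkish word forms and a hand-made segmentation rather than the paper's actual sub-word units:

```python
# Toy illustration of why sub-word units reduce OOV. Agglutinative word
# forms explode a word vocabulary; splitting words into smaller units
# keeps the vocabulary effectively closed.

def oov_rate(tokens, vocab):
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / len(tokens)

word_vocab = {"ev", "evler", "git"}            # finite word list
test = ["evlerimizden", "git", "evler"]        # inflected form is OOV

print(round(oov_rate(test, word_vocab), 2))    # 0.33: word model misses

# A sub-word lexicon covers unseen inflections by composition.
subword_vocab = {"ev", "ler", "imiz", "den", "git"}
segmented = ["ev", "ler", "imiz", "den", "git", "ev", "ler"]
print(oov_rate(segmented, subword_vocab))      # 0.0
```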
International Conference on Acoustics, Speech, and Signal Processing | 2009
Dogan Can; Erica Cooper; Abhinav Sethy; Christopher M. White; Bhuvana Ramabhadran; Murat Saraclar
The spoken term detection (STD) task aims to return relevant segments from a spoken archive that contain the query terms, whether or not they are in the system vocabulary. This paper focuses on pronunciation modeling for out-of-vocabulary (OOV) terms, which frequently occur in STD queries. The STD system described in this paper indexes word-level and sub-word-level lattices or confusion networks produced by an LVCSR system using weighted finite-state transducers (WFSTs). We investigate the inclusion of n-best pronunciation variants for OOV terms (obtained from letter-to-sound rules) into the search and present the results obtained by indexing confusion networks as well as lattices. Two observations are worth noting: phone indexes generated from sub-words represent OOVs well, and too many variants for the OOV terms degrade performance if pronunciations are not weighted.
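A hedged sketch of keeping a weighted n-best list of pronunciation variants for an OOV term; the variants and scores below are invented stand-ins for letter-to-sound output:

```python
import heapq

# Toy weighted n-best pronunciation list for an OOV query term. The real
# system derives variants and weights from trained letter-to-sound rules.

variants = {
    "rimsky": [(0.60, "R IH M S K IY"),
               (0.25, "R IH M Z K IY"),
               (0.10, "R IY M S K IY"),
               (0.05, "R AH M S K IY")],
}

def nbest(term, n=2):
    """Keep only the n most probable pronunciations, with their weights.
    Searching with many unweighted variants degrades precision."""
    return heapq.nlargest(n, variants[term])

print(nbest("rimsky"))
```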
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009
Dogan Can; Erica Cooper; Arnab Ghoshal; Martin Jansche; Sanjeev Khudanpur; Bhuvana Ramabhadran; Michael Riley; Murat Saraclar; Abhinav Sethy; Morgan Ulinski; Christopher M. White
Indexing and retrieval of speech content in various forms, such as broadcast news, customer care data, and on-line media, has gained a lot of interest for a wide range of applications, from customer analytics to on-line media search. For most retrieval applications, the speech content is first converted to a lexical or phonetic representation using automatic speech recognition (ASR). The first step in searching through indexes built on these representations is the generation of pronunciations for named entities and foreign-language query terms. This paper summarizes the work conducted during the 2008 JHU Summer Workshop by the Multilingual Spoken Term Detection team on mining the web for pronunciations and analyzing their impact on spoken term detection. We first present methods for using the vast amount of pronunciation information available on the Web in the form of IPA and ad-hoc transcriptions. We describe techniques for extracting candidate pronunciations from Web pages and associating them with orthographic words, filtering out poorly extracted pronunciations, normalizing IPA pronunciations to better conform to a common transcription standard, and generating phonemic representations from ad-hoc transcriptions. We then analyze the effectiveness of using these pronunciations to represent out-of-vocabulary (OOV) query terms in a spoken term detection (STD) system, comparing Web pronunciations against automated techniques for pronunciation generation as well as pronunciations generated by human experts. Our results cover a range of speech indexes based on lattices, confusion networks, and one-best transcriptions at both the word and word-fragment levels.
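As a rough sketch of the ad-hoc extraction step, assuming a simplified X (pronounced "Y") pattern that stands in for the workshop's actual extraction, filtering, and normalization pipeline:

```python
import re

# Minimal sketch of mining ad-hoc pronunciations from web text. The
# pattern, example text, and normalization are invented simplifications.

text = 'Nguyen (pronounced "win") founded the company; Saoirse (pronounced "SUR-sha") starred.'

# Capture word / ad-hoc respelling pairs of the form X (pronounced "Y").
pattern = re.compile(r'(\w+)\s*\(pronounced\s+"([^"]+)"\)')

for word, respelling in pattern.findall(text):
    # Downstream, respellings are mapped to phonemes and filtered;
    # here we only normalize case and syllable separators.
    print(word, "->", respelling.lower().replace("-", " "))
```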
Journal of Counseling Psychology | 2016
Dogan Can; Rebeca A. Marín; Panayiotis G. Georgiou; Zac E. Imel; David C. Atkins; Shrikanth Narayanan
The dissemination and evaluation of evidence-based behavioral treatments for substance abuse problems rely on the evaluation of counselor interventions. In Motivational Interviewing (MI), a treatment that directs the therapist to utilize a particular linguistic style, proficiency is assessed via behavioral coding, a time-consuming, nontechnological approach. Natural language processing techniques have the potential to scale up the evaluation of behavioral treatments such as MI. We present a novel computational approach to assessing components of MI, focusing on one specific counselor behavior, reflections, which are believed to be a critical MI ingredient. Using 57 sessions from 3 MI clinical trials, we automatically detected counselor reflections in a maximum entropy Markov modeling framework using the raw linguistic data derived from session transcripts. We achieved 93% recall, 90% specificity, and 73% precision. The results provide insight into the linguistic information used by coders to make ratings and demonstrate the feasibility of new computational approaches to scaling up the evaluation of behavioral treatments.
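A minimal sketch of the model family involved, a maximum entropy (logistic regression) classifier over utterance n-grams, here without the Markov dependence on the previous code; the utterances and labels are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy maximum entropy classifier for reflection detection. Real training
# uses coded transcripts from MI clinical trials, not these examples.

utterances = ["it sounds like you are worried about your drinking",
              "how many drinks did you have last week",
              "so you feel torn about making a change",
              "tell me about your week"]
labels = [1, 0, 1, 0]  # 1 = reflection, 0 = other

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(utterances)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))  # labels predicted for the training utterances
```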
Journal of Substance Abuse Treatment | 2015
Sarah Peregrine Lord; Dogan Can; Michael Yi; Rebeca A. Marín; Christopher W. Dunn; Zac E. Imel; Panayiotis G. Georgiou; Shrikanth Narayanan; Mark Steyvers; David C. Atkins
The current paper presents novel methods for collecting Motivational Interviewing Skill Code (MISC) data and accurately assessing the reliability of behavior codes at the level of the utterance. The MISC 2.1 was used to rate MI interviews from five randomized trials targeting alcohol and drug use. Sessions were coded at the utterance level. Utterance-based coding reliability was estimated using three methods and compared to traditional reliability estimates based on session tallies. Session-level reliability was generally higher than reliability computed from utterance-based codes, suggesting that typical methods for MISC reliability may be biased. These novel methods of MI fidelity data collection and reliability assessment provide rich data for therapist feedback and further analyses. Beyond implications for fidelity coding, utterance-level coding schemes may elucidate important elements of the counselor-client interaction that could inform theories of change and the practice of MI.
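For concreteness, one plausible utterance-level reliability statistic is Cohen's kappa between two coders; the codes below are invented, and the paper compares several estimators rather than this one alone:

```python
from collections import Counter

# Utterance-level agreement between two coders via Cohen's kappa:
# observed agreement corrected for agreement expected by chance.

coder_a = ["RE", "QU", "RE", "GI", "RE", "QU"]
coder_b = ["RE", "QU", "GI", "GI", "RE", "RE"]

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each coder's marginal code frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.478
```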
International Symposium on Computer and Information Sciences | 2009
Temucin Som; Dogan Can; Murat Saraclar
In this paper, we develop an HMM-based sliding video text recognizer and present our results on Turkish broadcast news for the hearing impaired. We use well-known speech recognition techniques to model and recognize sliding video text characters using a minimal amount of labeled data. The baseline system without any language modeling gives a word error rate of 2.2% on 138 minutes of test data. We then provide an analysis of character errors and employ a character-based language model to correct most of them. Finally, we decrease the amount of training data to a quarter, split the test data into halves, and investigate semi-supervised training. Word error rates after semi-supervised training are significantly lower than those after baseline training: we see a 40% relative reduction in word error rate (1.5% → 0.9%) over the test set.
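The 40% figure is a relative reduction; a quick check of the arithmetic reported in the abstract:

```python
# Relative word error rate reduction from semi-supervised training.
baseline_wer = 1.5   # % after baseline training
semisup_wer = 0.9    # % after semi-supervised training

relative_reduction = (baseline_wer - semisup_wer) / baseline_wer
print(f"{relative_reduction:.0%}")  # 40%
```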
Conference of the International Speech Communication Association | 2016
Bo Xiao; Dogan Can; James Gibson; Zac E. Imel; David C. Atkins; Panayiotis G. Georgiou; Shrikanth Narayanan
Manual annotation of human behaviors with domain-specific codes is a primary method of research and treatment fidelity evaluation in psychotherapy. However, manual annotation has a prohibitively high cost and does not scale to coding large amounts of psychotherapy session data. In this paper, we present a case study of modeling therapist language in addiction counseling and propose an automatic coding approach. The task objective is to label therapist utterances with domain-specific codes. We employ Recurrent Neural Networks (RNNs) to predict these behavioral codes from session transcripts. Experiments show that RNNs outperform the baseline method using maximum entropy models. The model with bi-directional Gated Recurrent Units and domain-specific word embeddings achieved the highest overall accuracy. We also briefly discuss client code prediction and compare with previous work.
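A minimal PyTorch sketch of the model family described, a bi-directional GRU over word embeddings that classifies each utterance into a behavioral code; the vocabulary size, dimensions, and number of codes are placeholders, not the paper's settings:

```python
import torch
import torch.nn as nn

# Bi-directional GRU utterance classifier. Domain-specific embeddings
# would be loaded into self.embed; here it is randomly initialized.

class UtteranceCoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=64, n_codes=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_codes)

    def forward(self, token_ids):          # (batch, seq_len)
        emb = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        _, h = self.gru(emb)               # h: (2, batch, hidden)
        # Concatenate the final forward and backward hidden states.
        utt = torch.cat([h[0], h[1]], dim=1)
        return self.out(utt)               # (batch, n_codes) logits

logits = UtteranceCoder()(torch.randint(0, 5000, (3, 12)))
print(logits.shape)  # torch.Size([3, 8])
```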
Empirical Methods in Natural Language Processing | 2015
Dogan Can; Shrikanth Narayanan
Efficient computation of n-gram posterior probabilities from lattices has applications in lattice-based minimum Bayes-risk decoding in statistical machine translation and in the estimation of expected document frequencies from spoken corpora. In this paper, we present an algorithm for computing the posterior probabilities of all n-grams in a lattice and constructing a minimal deterministic weighted finite-state automaton that associates each n-gram with its posterior for efficient storage and retrieval. Our algorithm builds upon the best known algorithm in the literature for computing n-gram posteriors from lattices and leverages the following observations to significantly improve the time and space requirements: i) the n-grams for which posteriors will be computed typically comprise all n-grams in the lattice up to a certain length, ii) the posterior of an n-gram that does not repeat on any path is equivalent to its expected count, and iii) there are efficient algorithms for computing n-gram expected counts from lattices. We present experimental results comparing our algorithm with the best known algorithm in the literature as well as with a baseline algorithm based on weighted finite-state automaton operations.
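A toy check of the posterior-versus-expected-count distinction the algorithm exploits, enumerating paths directly rather than operating on a lattice as the paper does; the paths and probabilities are invented:

```python
# An n-gram's posterior is the total probability of paths containing it
# at least once; its expected count weights each occurrence. The two
# coincide exactly when the n-gram never repeats on any path.

paths = [("a b a b".split(), 0.5), ("a b c".split(), 0.3), ("c c".split(), 0.2)]

def occurrences(path, gram):
    n = len(gram)
    return sum(tuple(path[i:i + n]) == gram for i in range(len(path) - n + 1))

def posterior(gram):
    return sum(p for path, p in paths if occurrences(path, gram) > 0)

def expected_count(gram):
    return sum(p * occurrences(path, gram) for path, p in paths)

print(posterior(("a", "b")), expected_count(("a", "b")))  # 0.8 vs 1.3
print(posterior(("c",)), expected_count(("c",)))          # 0.5 vs 0.7
```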
Signal Processing and Communications Applications Conference | 2009
Dogan Can; Murat Saraclar
In this paper, we present our Turkish Large Vocabulary Continuous Speech Recognition (LVCSR) system, which is based on open-source software (HTK, SRILM) and which utilizes 187 hours of Turkish broadcast news data as well as a 184-million-word text corpus collected from various Turkish news portals. Within this system, three different acoustic models optimizing the ML, MMI, and MPE criteria were developed, and the contribution of discriminative acoustic modeling to Turkish LVCSR was investigated. Recognition experiments utilizing a tri-gram language model with a 50K-word vocabulary give word error rates of 25.8% with ML, 24.3% with MMI, and 23.7% with MPE.
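For reference, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length; a small self-contained sketch with invented sentences:

```python
# Word error rate via Levenshtein distance over word sequences.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # Dynamic programming table for edit distance.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)

# One deletion ("da") and one substitution ("acik" -> "kapali") in 5 words.
print(wer("ankara da hava bugun acik", "ankara hava bugun kapali"))  # 0.4
```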