Harish Arsikere
University of California, Los Angeles
Publications
Featured research published by Harish Arsikere.
Journal of the Acoustical Society of America | 2012
Steven M. Lulich; John R. Morton; Harish Arsikere; Mitchell S. Sommers; Gary K. F. Leung; Abeer Alwan
This paper presents a large-scale study of subglottal resonances (SGRs) (the resonant frequencies of the tracheo-bronchial tree) and their relations to various acoustical and physiological characteristics of speakers. The paper presents data from a corpus of simultaneous microphone and accelerometer recordings of consonant-vowel-consonant (CVC) words embedded in a carrier phrase spoken by 25 male and 25 female native speakers of American English ranging in age from 18 to 24 yr. The corpus contains 17,500 utterances of 14 American English monophthongs, diphthongs, and the rhotic approximant [ɹ] in various CVC contexts. Only monophthongs are analyzed in this paper. Speaker height and age were also recorded. Findings include (1) normative data on the frequency distribution of SGRs for young adults, (2) the dependence of SGRs on height, (3) the lack of a correlation between SGRs and formants or the fundamental frequency, (4) a poor correlation of the first SGR with the second and third SGRs but a strong correlation between the second and third SGRs, and (5) a significant effect of vowel category on SGR frequencies, although this effect is smaller than the measurement standard deviations and therefore negligible for practical purposes.
IEEE Signal Processing Letters | 2014
Harish Arsikere; Steven M. Lulich; Abeer Alwan
This letter investigates the use of MFCCs and GMMs for (1) improving the state of the art in speaker height estimation, and (2) rapid estimation of subglottal resonances (SGRs) without relying on formant and pitch tracking (unlike our previous algorithm in [1]). The proposed system comprises a set of height-dependent GMMs modeling static and dynamic MFCC features, where each GMM is associated with a height value. Furthermore, since SGRs and height are correlated, each GMM is also associated with a set of SGR values (known a priori). Given a speech sample, speaker height and SGRs are estimated as weighted combinations of the values corresponding to the N most-likely GMMs. We assess the importance of using dynamic MFCC features and the weighted decision rule, and demonstrate the efficacy of our approach via experiments on height estimation (using TIMIT) and SGR estimation (using the Tracheal Resonance database).
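A minimal sketch of the weighted N-best decision rule described above, assuming per-height GMMs trained with scikit-learn on MFCC frames; the class granularity, GMM settings, and softmax-style weighting are illustrative assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of height estimation via height-dependent GMMs and a
# weighted N-best decision rule. Model settings and weighting are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_height_gmms(features_by_height, n_components=8):
    """Fit one GMM per height value on (frames x dims) MFCC matrices."""
    models = {}
    for height_cm, feats in features_by_height.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[height_cm] = gmm
    return models

def estimate_height(models, test_feats, n_best=5):
    """Weighted combination of the heights of the N most-likely GMMs."""
    heights = np.array(sorted(models))
    # average per-frame log-likelihood of the utterance under each GMM
    ll = np.array([models[h].score(test_feats) for h in heights])
    top = np.argsort(ll)[-n_best:]
    weights = np.exp(ll[top] - ll[top].max())   # softmax-style weights (assumed)
    weights /= weights.sum()
    return float(np.dot(weights, heights[top]))
```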
International Conference on Acoustics, Speech, and Signal Processing | 2012
Harish Arsikere; Gary K. F. Leung; Steven M. Lulich; Abeer Alwan
This paper presents an algorithm for automatically estimating speaker height. It is based on: (1) a recently proposed model of the subglottal system that explains the inverse relation observed between subglottal resonances and height, and (2) an improved version of our previous algorithm for automatically estimating the second subglottal resonance (Sg2). The improved Sg2 estimation algorithm was trained and evaluated on recently collected data from 30 and 20 adult speakers, respectively. Sg2 estimation error was reduced by 29%, on average, compared to the previous algorithm. The height estimation algorithm, employing the inverse relation between Sg2 and height, was trained on data from the above-mentioned 50 adults. It was evaluated on 563 adult speakers in the TIMIT corpus, and the mean absolute height estimation error was found to be less than 5.6 cm.
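A hedged sketch of the second stage, mapping an estimated Sg2 to height through an inverse relation fitted by least squares; the functional form and the toy numbers are placeholders, not the paper's calibrated model.

```python
# Illustrative inverse relation: height ~ a + b / Sg2, fitted by least squares.
# The form and all numbers below are assumptions for demonstration.
import numpy as np

def fit_inverse_relation(sg2_hz, height_cm):
    """Fit height ~ a + b / Sg2 on training speakers."""
    A = np.column_stack([np.ones_like(sg2_hz), 1.0 / sg2_hz])
    coeffs, *_ = np.linalg.lstsq(A, height_cm, rcond=None)
    return coeffs  # (a, b)

def predict_height(coeffs, sg2_hz):
    a, b = coeffs
    return a + b / sg2_hz

# toy usage with made-up values
coeffs = fit_inverse_relation(np.array([1300.0, 1400.0, 1500.0]),
                              np.array([185.0, 172.0, 160.0]))
print(predict_height(coeffs, 1450.0))
```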
Journal of the Acoustical Society of America | 2010
Steven M. Lulich; John R. Morton; Mitchell S. Sommers; Harish Arsikere; Yi‐Hui Lee; Abeer Alwan
Subglottal resonances have received increasing attention in recent studies of speech production, perception, and technology. They affect voice production, divide vowels and consonants into discrete categories, affect vowel perception, and are useful in automatic speech recognition. We present a new speech corpus of simultaneous microphone and (subglottal) accelerometer recordings of 25 adult male and 25 adult female speakers of American English (AE), between 22 and 25 years of age. Additional recordings of 50 gender‐balanced bilingual Spanish/AE speaking adults, as well as 100 child speakers of Spanish and AE, are under way. The AE adult corpus consists of 35 monosyllables (14 “hVd” and 21 “CVb” words, where C is [b, d, g], and V includes all AE monophthongs and diphthongs) in a phonetically neutral carrier phrase (“I said a ____ again”), with 10 repetitions of each word by each speaker, resulting in 17 500 individual microphone (and accelerometer) waveforms. Hand‐labeling of the target vowel in each utterance is currently under way. The corpus fills a gap in the literature on subglottal acoustics and will be useful for future studies in speech production, perception, and technology. It will be freely available to the speech research community. [Work supported in part by the NSF.]
Journal of the Acoustical Society of America | 2011
Harish Arsikere; Steven M. Lulich; Abeer Alwan
This letter focuses on the automatic estimation of the first subglottal resonance (Sg1). A database comprising speech and subglottal data of native American English speakers and bilingual Spanish/English speakers was used for the analysis. Data from 11 speakers (five males and six females) were used to derive an empirical relation among the first formant frequency, fundamental frequency, and Sg1. Using the derived relation, Sg1 was automatically estimated from voiced sounds in English and Spanish sentences spoken by 22 different speakers (11 males and 11 females). The error in estimating Sg1 was less than 50 Hz, on average.
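An illustrative least-squares fit of the kind of empirical relation described here, assuming a linear combination of the first formant and the fundamental frequency; the linear form is an assumption, and the coefficients of the published relation are not reproduced.

```python
# Sketch of deriving an empirical relation Sg1 ~ a*F1 + b*f0 + c by ordinary
# least squares on frames with measured subglottal data (form is assumed).
import numpy as np

def fit_sg1_relation(f1_hz, f0_hz, sg1_hz):
    """Fit Sg1 ~ a*F1 + b*f0 + c."""
    A = np.column_stack([f1_hz, f0_hz, np.ones_like(f1_hz)])
    coeffs, *_ = np.linalg.lstsq(A, sg1_hz, rcond=None)
    return coeffs  # (a, b, c)

def estimate_sg1(coeffs, f1_hz, f0_hz):
    a, b, c = coeffs
    return a * f1_hz + b * f0_hz + c
```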
International Conference on Acoustics, Speech, and Signal Processing | 2013
Harish Arsikere; Steven M. Lulich; Abeer Alwan
This paper proposes a non-linear frequency warping scheme for VTLN. It is based on mapping the subglottal resonances (SGRs) and the third formant frequency (F3) of a given utterance to those of a reference speaker. SGRs are used because they relate to formants in specific ways while remaining phonetically invariant, and F3 is used because it is somewhat correlated to vocal-tract length. Given an utterance, the warping parameters (SGRs and F3) are determined by obtaining initial estimates from the signal, and refining the estimates with respect to a speaker-independent model. For children (TIDIGITS), the proposed method yields statistically-significant word error rate (WER) reductions (up to 15%) relative to conventional VTLN (linear warping) when: (1) speakers show poor baseline performance, and/or (2) training data are limited. For adults (Wall Street Journal), the WER reduction relative to conventional VTLN is 4-5%. Comparison with other non-linear warping techniques is also reported.
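A sketch of a piecewise-linear frequency warp anchored at speaker-specific frequencies (Sg1, Sg2, F3) mapped to a reference speaker's values, as one might implement such a scheme; the anchor choice, the linear interpolation between anchors, and the toy numbers are assumptions for illustration, not the paper's exact warping function.

```python
# Illustrative piecewise-linear frequency warp between speaker anchors and
# reference anchors (e.g., Sg1, Sg2, F3). Anchor values below are made up.
import numpy as np

def warp_frequency_axis(freqs_hz, speaker_anchors, reference_anchors, nyquist=8000.0):
    """Map each analysis frequency through a piecewise-linear warp."""
    src = np.concatenate([[0.0], np.sort(speaker_anchors), [nyquist]])
    dst = np.concatenate([[0.0], np.sort(reference_anchors), [nyquist]])
    return np.interp(freqs_hz, src, dst)

# toy usage: warp a filterbank's centre frequencies
centres = np.linspace(0.0, 8000.0, 27)
warped = warp_frequency_axis(centres,
                             speaker_anchors=[600.0, 1450.0, 2900.0],    # Sg1, Sg2, F3
                             reference_anchors=[560.0, 1350.0, 2700.0])
```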
International Conference on Acoustics, Speech, and Signal Processing | 2011
Harish Arsikere; Steven M. Lulich; Abeer Alwan
This paper deals with the automatic estimation of the second subglottal resonance (Sg2) from natural speech spoken by adults, since our previous work focused only on estimating Sg2 from isolated diphthongs. A new database comprising speech and subglottal data of native American English (AE) speakers and bilingual Spanish/English speakers was used for the analysis. Data from 11 speakers (6 females and 5 males) were used to derive an empirical relation among the second and third formant frequencies (F2 and F3) and Sg2. Using the derived relation, Sg2 was automatically estimated from voiced sounds in English and Spanish sentences spoken by 20 different speakers (10 males and 10 females). On average, the error in estimating Sg2 was less than 100 Hz in at least 9 isolated AE vowels and less than 40 Hz in continuous speech consisting of English or Spanish sentences.
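A small sketch of applying a previously derived F2/F3-to-Sg2 relation frame by frame over voiced speech and pooling to an utterance-level estimate; the median pooling and the voicing mask are assumptions, and `sg2_from_formants` stands in for whatever fitted relation is used.

```python
# Frame-wise application of an assumed formant-to-Sg2 relation, pooled over
# the voiced frames of an utterance.
import numpy as np

def estimate_sg2_utterance(f2_track, f3_track, voiced_mask, sg2_from_formants):
    """Pool per-frame Sg2 estimates over voiced frames (median is an assumption)."""
    f2 = np.asarray(f2_track)[voiced_mask]
    f3 = np.asarray(f3_track)[voiced_mask]
    per_frame = sg2_from_formants(f2, f3)
    return float(np.median(per_frame))
```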
Journal of the Acoustical Society of America | 2010
Harish Arsikere; Yi‐Hui Lee; Steven M. Lulich; John R. Morton; Mitchell S. Sommers; Abeer Alwan
Subglottal resonances (SGRs) have recently been used in automatic speaker normalization (SN), leading to improvements in children’s speech recognition [Wang et al. (2009)]. It is hypothesized that human listeners use SGRs for SN as well. However, the suitability of SGRs for SN has not been adequately investigated. SGRs and formants from adult speakers of American English and Mexican Spanish were measured using a new speech corpus with simultaneous (subglottal) accelerometer recordings [Lulich et al. (2010)]. The corpus has been analyzed at a broad level to understand relations among SGRs, speaker height, native language and gender, and formant frequencies, as well as the variation of SGRs across vowels and speakers. It is shown that SGRs are roughly constant for a given speaker, regardless of their native spoken language, but differ from speaker to speaker. SGRs are therefore well suited for use in SN and perhaps in speaker identification. Preliminary analyses also show that SGRs are correlated with each other...
Conference of the International Speech Communication Association | 2016
Jinxi Guo; Gary Yeung; Deepak Muralidharan; Harish Arsikere; Amber Afshan; Abeer Alwan
Speaker verification in real-world applications sometimes deals with limited duration of enrollment and/or test data. MFCC-based i-vector systems have defined the state-of-the-art for speaker verification, but it is well known that they are less effective with short utterances. To address this issue, we propose a method to leverage the speaker specificity and stationarity of subglottal acoustics. First, we present a deep neural network (DNN) based approach to estimate subglottal features from speech signals. The approach involves training a DNN-regression model that maps the log filter-bank coefficients of a given speech signal to those of its corresponding subglottal signal. Cross-validation experiments on the WashU-UCLA corpus (which contains parallel recordings of speech and subglottal acoustics) show the effectiveness of our DNN-based estimation algorithm. The average correlation coefficient between the actual and estimated subglottal filter-bank coefficients is 0.9. A score-level fusion of MFCC and subglottal-feature systems in the i-vector PLDA framework yields statistically-significant improvements over the MFCC-only baseline. On the NIST SRE 08 truncated 10sec-10sec and 5sec-5sec core evaluation tasks, the relative reduction in equal error rate ranges between 6 and 14% for the conditions tested with both microphone and telephone speech.
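A minimal PyTorch sketch of the DNN-regression idea (speech log filter-bank frames mapped to subglottal log filter-bank frames); the layer sizes, context stacking, and training step are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
# Hedged sketch: regress subglottal log filter-bank frames from stacked
# speech log filter-bank frames with a small feed-forward network.
import torch
import torch.nn as nn

class SubglottalRegressor(nn.Module):
    def __init__(self, n_banks=40, context=5, hidden=512):
        super().__init__()
        in_dim = n_banks * (2 * context + 1)   # stacked context frames (assumed)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_banks),
        )

    def forward(self, x):
        return self.net(x)

def train_step(model, optimiser, speech_frames, subglottal_frames):
    """One MSE regression step on a minibatch of parallel frames."""
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(speech_frames), subglottal_frames)
    loss.backward()
    optimiser.step()
    return loss.item()
```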
Journal of the Acoustical Society of America | 2015
Steven M. Lulich; Harish Arsikere
This paper offers a re-evaluation of the mechanical properties of the tracheo-bronchial soft tissues and cartilage and uses a model to examine their effects on the subglottal acoustic input impedance. It is shown that the values for soft tissue elastance and cartilage viscosity typically used in models of subglottal acoustics during phonation are not accurate, and corrected values are proposed. The calculated subglottal acoustic input impedance using these corrected values reveals clusters of weak resonances due to soft tissues (SgT) and cartilage (SgC) lining the walls of the trachea and large bronchi, which can be observed empirically in subglottal acoustic spectra. The model predicts that individuals may exhibit SgT and SgC resonances to variable degrees, depending on a number of factors including tissue mechanical properties and the dimensions of the trachea and large bronchi. Potential implications for voice production and large pulmonary airway tissue diseases are also discussed.
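For context on how wall tissue mechanics enter the acoustics, a textbook lumped-element form of the wall impedance per unit area can be written as follows; this is a generic illustration, not the corrected parameter values or the specific model used in the paper.

```latex
Z_w(\omega) \;=\; R_w \;+\; j\!\left(\omega M_w - \frac{K_w}{\omega}\right),
\qquad
f_w \;=\; \frac{1}{2\pi}\sqrt{\frac{K_w}{M_w}},
```

where R_w is the wall resistance (viscous losses), M_w the wall mass, and K_w the wall elastance per unit area; near f_w, where the wall reactance vanishes, weak wall-related resonances can appear in the subglottal input impedance.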