Journal of the Acoustical Society of America | 2019

A deep neural network approach to investigate tone space in languages

Abstract

Phonological contrasts are usually signaled by multiple cues, and tonal languages typically involve multiple dimensions to distinguish between tones (e.g., duration, pitch contour, and voice quality). While the topic has been extensively studied, research has mostly relied on small datasets. This study employs a deep neural network (DNN) based speech recognizer trained on the AISHELL-1 speech corpus (Bu et al., 2017; 178 hours of read speech) to explore the tone space in Mandarin Chinese. A recent study shows that DNN models learn linguistically interpretable information to distinguish between vowels (Weber et al., 2016). Specifically, from a low-dimensional Bottleneck layer, the model learns features comparable to F1 and F2. In the current study, we propose a more complex Long Short-Term Memory (LSTM) model, with a Bottleneck layer implemented in the hidden layers, to account for variable duration, an important cue for tone discrimination. By interpreting the features learned in the Bottleneck layer, we explore which acoustic dimensions are involved in distinguishing tones. The large amount of data from the speech corpus also renders the results more convincing and provides additional insights not possible from studies with more limited datasets.
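
The abstract does not include an implementation, but the architecture it describes, an LSTM recognizer with a low-dimensional Bottleneck layer among the hidden layers, can be sketched as follows. This is a minimal PyTorch-style sketch under stated assumptions: the feature dimension, hidden sizes, the two-dimensional bottleneck (mirroring the F1/F2 result of Weber et al., 2016), and the tone-target inventory are hypothetical placeholders, not values from the paper.

    import torch
    import torch.nn as nn

    class BottleneckLSTM(nn.Module):
        """LSTM recognizer with a low-dimensional bottleneck in the hidden layers.

        The bottleneck activations (z) are the interpretable features such a
        study would inspect; all sizes below are illustrative assumptions.
        """

        def __init__(self, n_feats=40, n_hidden=512, n_bottleneck=2, n_targets=5):
            super().__init__()
            # Recurrent layers give the model access to variable-duration context,
            # the cue the abstract highlights for tone discrimination.
            self.encoder = nn.LSTM(n_feats, n_hidden, num_layers=2, batch_first=True)
            # Low-dimensional bottleneck: forces a compact, inspectable tone space.
            self.bottleneck = nn.Linear(n_hidden, n_bottleneck)
            self.decoder = nn.LSTM(n_bottleneck, n_hidden, batch_first=True)
            # Hypothetical target set, e.g., Mandarin tones 1-4 plus neutral tone.
            self.classifier = nn.Linear(n_hidden, n_targets)

        def forward(self, x):
            # x: (batch, time, n_feats) frame-level acoustic features
            h, _ = self.encoder(x)
            z = self.bottleneck(h)       # (batch, time, n_bottleneck): features to interpret
            h, _ = self.decoder(z)
            logits = self.classifier(h)  # per-frame tone scores
            return logits, z

After training with a frame-level classification objective on tone-labeled speech, plotting z for tokens of each tone would reveal whether the learned dimensions align with acoustic correlates such as F0 height and contour, analogous to the F1/F2 dimensions reported for vowels.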

Volume 145
Pages 1913-1913
DOI 10.1121/1.5101949
Language English
Journal Journal of the Acoustical Society of America
