ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 2021

Attention-Based Multi-Encoder Automatic Pronunciation Assessment

 
 

Abstract


Automatic pronunciation assessment plays an important role in Computer-Assisted Pronunciation Training (CAPT). Traditional methods for pronunciation assessment of read-aloud tasks rely on features derived from automatic speech recognition (ASR), and are therefore sensitive both to the accuracy of the ASR system and to the effectiveness of the features. Moreover, the representation capability of the features is limited by the mismatch between the optimization goals of the ASR task and the scoring task. In this paper, we propose an end-to-end (E2E) pronunciation scoring network based on an attention mechanism and a multi-encoder architecture consisting of audio and text encoders. The network, optimized within a multi-task learning (MTL) framework, provides scoring at the sentence level as well as detailed scoring at the word level. Because data for pronunciation scoring are scarce, we use ASR data and synthetic data to pre-train the network in two steps, and then fine-tune the network on the limited high-quality scoring data. Experimental results on a dataset recorded by Chinese English-as-a-second-language (ESL) learners and labeled by three experts demonstrate that the proposed model outperforms the baseline in terms of Pearson correlation coefficient (PCC).
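The evaluation metric named in the abstract, the Pearson correlation coefficient, measures how closely model scores track expert scores. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `pearson_correlation` and the score values are hypothetical and chosen only for illustration.

```python
import math


def pearson_correlation(predicted, reference):
    """Return the Pearson correlation coefficient of two equal-length score lists."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two score pairs")

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when either list is constant")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical sentence-level scores (illustrative only, not from the paper).
model_scores = [1.0, 2.0, 3.0, 4.0]
expert_scores = [2.0, 4.0, 6.0, 8.0]
pcc = pearson_correlation(model_scores, expert_scores)
```

Because PCC is invariant to linear rescaling, a model whose scores are a positive linear function of the expert scores reaches the maximum value of 1, even when the two score scales differ.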

Pages 7743-7747
DOI 10.1109/ICASSP39728.2021.9414451
Language English
