Runnan Li
Tsinghua University
Publication
Featured research published by Runnan Li.
autonomous infrastructure, management and security | 2018
Jingbei Li; Zhiyong Wu; Runnan Li; Mingxing Xu; Kehua Lei; Lianhong Cai
Computer-assisted language learning (CALL) has attracted increasing interest in language teaching and learning. In the computer-supported learning environment, both pronunciation correction and expression modulation have proven essential for contemporary learners. However, while mispronunciation detection and diagnosis (MDD) technologies have achieved significant success, speech expression evaluation still relies on expensive and resource-consuming manual assessment. In this paper, we propose a novel multi-modal, multi-scale neural network based approach for automatic speech expression evaluation in CALL. In particular, a multi-modal sparse autoencoder (MSAE) is first employed to make full use of both lexical and acoustic features, a recurrent autoencoder (RAE) is further employed to produce features at different time scales, and an attention-based multi-scale bi-directional long short-term memory (BLSTM) model is finally employed to score the speech expression. Experimental results using data collected from a realistic airline broadcast evaluation demonstrate the effectiveness of the proposed approach, which achieves human-level predictive ability with an acceptable rate of 70.4%.
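A minimal sketch of the kind of pipeline the abstract describes, not the authors' code: a fusion autoencoder over lexical and acoustic frame features followed by an attention-based BLSTM that produces an utterance-level expression score. The multi-scale RAE stage is omitted, and all feature dimensions and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionAutoEncoder(nn.Module):
    """Fuse per-frame acoustic and lexical features into one bottleneck code."""
    def __init__(self, acoustic_dim=40, lexical_dim=100, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(acoustic_dim + lexical_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, acoustic_dim + lexical_dim))

    def forward(self, acoustic, lexical):
        x = torch.cat([acoustic, lexical], dim=-1)
        code = self.encoder(x)
        recon = self.decoder(code)            # reconstruction target for AE training
        return code, recon

class AttentionBLSTMScorer(nn.Module):
    """Score an utterance from a sequence of fused frame codes."""
    def __init__(self, input_dim=64, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, codes):                 # codes: (batch, time, input_dim)
        h, _ = self.blstm(codes)              # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)    # attention weights over time
        context = (w * h).sum(dim=1)          # attention-weighted pooling
        return self.out(context).squeeze(-1)  # expression score per utterance

# Toy usage with random frames
ae = FusionAutoEncoder()
scorer = AttentionBLSTMScorer()
acoustic = torch.randn(2, 200, 40)            # 2 utterances, 200 frames each
lexical = torch.randn(2, 200, 100)
codes, _ = ae(acoustic, lexical)
print(scorer(codes).shape)                    # torch.Size([2])
```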
artificial intelligence methodology systems applications | 2018
Ziwei Zhu; Zhiyong Wu; Runnan Li; Yishuang Ning; Helen Meng
Acoustic models (AMs) based on recurrent neural networks (RNNs) with long short-term memory (LSTM) have achieved state-of-the-art performance in LVCSR. Their strong ability to capture context information makes the acoustic features extracted from LSTM more discriminative. Feature extraction is also crucial to query-by-example spoken term detection (QbyE-STD), especially at the frame level. In this paper, we explore frame-level recurrent neural network representations for QbyE-STD that are more robust than the original features. In addition, the designed model is lightweight, which suits the small-footprint requirements of mobile devices. First, we use a traditional recurrent autoencoder (RAE) to extract frame-level representations and a correspondence RAE to suppress non-semantic information. Then, we combine the two models to extract more discriminative features. Common tricks such as frame skipping are used to help the model learn more context information. Experiments and evaluations show that the proposed methods outperform conventional ones under the same computational requirements.
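A lightweight frame-level recurrent autoencoder sketched under assumed dimensions, not the paper's implementation. Training it to reconstruct its own input gives a plain RAE; feeding aligned frames from another utterance of the same term as the reconstruction target turns the same model into a correspondence RAE, in the spirit of the description above.

```python
import torch
import torch.nn as nn

class FrameRAE(nn.Module):
    """Small GRU autoencoder producing per-frame bottleneck representations."""
    def __init__(self, feat_dim=39, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, frames):                    # (batch, time, feat_dim)
        codes, _ = self.encoder(frames)           # frame-level representations
        recon, _ = self.decoder(codes)
        return codes, self.proj(recon)

def train_step(model, optimizer, frames, targets):
    """One step; targets == frames for a plain RAE, or aligned frames from a
    paired utterance for a correspondence RAE."""
    _, recon = model(frames)
    loss = nn.functional.mse_loss(recon, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = FrameRAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.randn(4, 120, 39)                  # e.g. MFCC frames
print(train_step(model, opt, frames, frames))     # plain RAE step
```

Frame skipping, mentioned in the abstract as a context trick, would simply subsample the input along the time axis (e.g. `frames[:, ::2]`) before encoding.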
acm multimedia | 2018
Runnan Li; Zhiyong Wu; Jia Jia; Jingbei Li; Wei Chen; Helen Meng
Human-computer conversational interactions are increasingly pervasive in real-world applications, such as chatbots and virtual assistants. The user experience can be enhanced through affective design of such conversational dialogs, especially by enabling the computer to understand the emotive state in the user's input and to generate an appropriate system response within the dialog turn. Such a system response may further influence the user's emotive state in the subsequent dialog turn. In this paper, we focus on the change in the user's emotive state across adjacent dialog turns, which we refer to as user emotive state change. We propose a multi-modal, multi-task deep learning framework to infer the user's emotive states and emotive state changes simultaneously. A multi-task learning convolutional fusion autoencoder is applied to fuse the acoustic and textual features into a robust representation of the user's input. A long short-term memory recurrent autoencoder is employed to extract sentence-level features of system responses, to better capture factors affecting user emotive states. A multi-task learned structured output layer is adopted to model user emotive state change, conditioned upon the emotive state of the user's input and the system response in the current dialog turn. Experimental results demonstrate the effectiveness of the proposed method.
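An illustrative simplification, not the authors' implementation: a multi-task head in which the user's emotive state is predicted first and the emotive state change is conditioned on that prediction together with a system-response embedding, echoing the structured output layer described above. The class counts, feature sizes, and module names are assumptions.

```python
import torch
import torch.nn as nn

class EmotiveStateChangeHead(nn.Module):
    def __init__(self, user_dim=128, response_dim=64, n_states=4, n_changes=3):
        super().__init__()
        self.state_clf = nn.Linear(user_dim, n_states)
        # Structured output: state-change prediction sees the user representation,
        # the system-response representation, and the predicted state distribution.
        self.change_clf = nn.Linear(user_dim + response_dim + n_states, n_changes)

    def forward(self, user_repr, response_repr):
        state_logits = self.state_clf(user_repr)
        state_probs = torch.softmax(state_logits, dim=-1)
        change_in = torch.cat([user_repr, response_repr, state_probs], dim=-1)
        return state_logits, self.change_clf(change_in)

head = EmotiveStateChangeHead()
user_repr = torch.randn(8, 128)        # fused acoustic+textual user-input features
resp_repr = torch.randn(8, 64)         # sentence-level system-response features
state_logits, change_logits = head(user_repr, resp_repr)
# Multi-task loss: sum of the two cross-entropy terms (random labels for the demo).
loss = (nn.functional.cross_entropy(state_logits, torch.randint(0, 4, (8,)))
        + nn.functional.cross_entropy(change_logits, torch.randint(0, 3, (8,))))
print(loss.item())
```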
international conference on acoustics, speech, and signal processing | 2017
Runnan Li; Zhiyong Wu; Xunying Liu; Helen M. Meng; Lianhong Cai
Recurrent neural networks (RNNs) and their bidirectional long short-term memory (BLSTM) variants are powerful sequence modelling approaches. Their inherently strong ability to capture long-range temporal dependencies allows BLSTM-RNN speech synthesis systems to produce higher-quality and smoother speech trajectories than conventional deep neural networks (DNNs). In this paper, we improve the conventional BLSTM-RNN based approach by introducing a multi-task learned structured output layer in which spectral parameter targets are conditioned on the pitch parameter prediction. Both objective and subjective experimental results demonstrate the effectiveness of the proposed technique.
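A minimal sketch of the structured output idea, assuming illustrative feature dimensions rather than the published configuration: a BLSTM acoustic model with a pitch head predicted from the hidden states, and a spectral head that is conditioned on both the hidden states and that pitch prediction.

```python
import torch
import torch.nn as nn

class StructuredOutputBLSTM(nn.Module):
    def __init__(self, ling_dim=300, hidden=256, spec_dim=60, pitch_dim=2):
        super().__init__()
        self.blstm = nn.LSTM(ling_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.pitch_head = nn.Linear(2 * hidden, pitch_dim)           # e.g. lf0 + v/uv
        # Spectral targets are conditioned on hidden states and pitch prediction.
        self.spec_head = nn.Linear(2 * hidden + pitch_dim, spec_dim)

    def forward(self, linguistic):                  # (batch, frames, ling_dim)
        h, _ = self.blstm(linguistic)
        pitch = self.pitch_head(h)
        spec = self.spec_head(torch.cat([h, pitch], dim=-1))
        return pitch, spec

model = StructuredOutputBLSTM()
x = torch.randn(2, 500, 300)                        # frame-level linguistic features
pitch, spec = model(x)
print(pitch.shape, spec.shape)                      # (2, 500, 2) (2, 500, 60)
```

Training would sum the pitch and spectral losses so both tasks shape the shared BLSTM layers.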
international conference on acoustics, speech, and signal processing | 2017
Yishuang Ning; Zhiyong Wu; Runnan Li; Jia Jia; Mingxing Xu; Helen M. Meng; Lianhong Cai
Bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs) have achieved state-of-the-art performance in many sequence processing problems given their capability of capturing contextual information. However, for languages with a limited amount of training data, it is still difficult to obtain a high-quality BLSTM model for emphasis detection, the aim of which is to recognize emphasized speech segments in natural speech. To address this problem, we propose a multilingual BLSTM (MTL-BLSTM) model in which the hidden layers are shared across different languages while the softmax output layer is language-dependent. The MTL-BLSTM can learn cross-lingual knowledge and transfer it to both languages to improve emphasis detection performance. Experimental results demonstrate that our method outperforms the comparison methods by 2–15.6% and 2.9–15.4% relative F1-measure on the English and Mandarin corpora, respectively.
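An illustrative sketch under assumed dimensions, not the published system: a BLSTM whose hidden layers are shared across languages while each language keeps its own output layer, matching the parameter-sharing scheme described above.

```python
import torch
import torch.nn as nn

class MultilingualBLSTM(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_classes=2,
                 languages=("english", "mandarin")):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # One emphasis-classification output layer per language.
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(2 * hidden, n_classes) for lang in languages})

    def forward(self, frames, language):
        h, _ = self.shared(frames)                 # shared cross-lingual layers
        return self.heads[language](h)             # per-frame emphasis logits

model = MultilingualBLSTM()
en_batch = torch.randn(4, 300, 40)
zh_batch = torch.randn(4, 300, 40)
# Alternating batches from both corpora updates the shared layers with
# cross-lingual knowledge while the output layers stay language-specific.
print(model(en_batch, "english").shape, model(zh_batch, "mandarin").shape)
```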
international symposium on chinese spoken language processing | 2016
Runnan Li; Zhiyong Wu; Helen M. Meng; Lianhong Cai
While both spectral and prosody transformation are important for voice conversion (VC), traditional methods have focused on the conversion of spectral features with less emphasis on prosody transformation. This paper presents a novel pitch transformation method for VC. As the correlation between spectral features and fundamental frequency in pitch perception has been established, a well-converted spectrum should benefit pitch transformation. Motivated by this, a multi-task learning (MTL) framework based on a deep bidirectional long short-term memory (DBLSTM) recurrent neural network (RNN) is proposed for pitch transformation in VC. The DBLSTM models the long- and short-term dependencies across speech frames for spectral conversion; the converted spectrum and the source pitch contour are then jointly modeled to generate the converted target pitch contour and voiced/unvoiced flag; these tasks are incorporated into the MTL framework so that each enhances the other. Experimental results indicate that the proposed method outperforms conventional approaches in pitch transformation.
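A minimal sketch with illustrative layer sizes, not the published system: a DBLSTM spectral converter whose output, concatenated with the source pitch contour, feeds a second recurrent branch that predicts the converted pitch contour and voiced/unvoiced flag, so the two tasks can be trained jointly as described above.

```python
import torch
import torch.nn as nn

class MTLVoiceConverter(nn.Module):
    def __init__(self, spec_dim=40, hidden=128):
        super().__init__()
        self.spec_net = nn.LSTM(spec_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.spec_out = nn.Linear(2 * hidden, spec_dim)
        # Pitch branch sees the converted spectrum and the source pitch (lf0).
        self.pitch_net = nn.LSTM(spec_dim + 1, hidden,
                                 batch_first=True, bidirectional=True)
        self.pitch_out = nn.Linear(2 * hidden, 2)        # converted lf0 + v/uv flag

    def forward(self, src_spec, src_lf0):                # (B, T, spec_dim), (B, T, 1)
        h, _ = self.spec_net(src_spec)
        conv_spec = self.spec_out(h)                     # spectral conversion task
        p, _ = self.pitch_net(torch.cat([conv_spec, src_lf0], dim=-1))
        pitch = self.pitch_out(p)                        # pitch transformation task
        return conv_spec, pitch

model = MTLVoiceConverter()
spec = torch.randn(2, 400, 40)
lf0 = torch.randn(2, 400, 1)
conv_spec, pitch = model(spec, lf0)
print(conv_spec.shape, pitch.shape)                      # (2, 400, 40) (2, 400, 2)
```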
international conference on multimedia and expo | 2018
Shaoguang Mao; Zhiyong Wu; Xu Li; Runnan Li; Xixin Wu; Helen M. Meng
international conference on acoustics, speech, and signal processing | 2018
Runnan Li; Zhiyong Wu; Yuchen Huang; Jia Jia; Helen Meng; Lianhong Cai
international conference on acoustics, speech, and signal processing | 2018
Shaoguang Mao; Zhiyong Wu; Runnan Li; Xu Li; Helen M. Meng; Lianhong Cai
conference of the international speech communication association | 2018
Ziwei Zhu; Zhiyong Wu; Runnan Li; Helen Meng; Lianhong Cai