Yishuang Ning
Tsinghua University
Publication
Featured research published by Yishuang Ning.
international conference on acoustics, speech, and signal processing | 2016
Xinyu Lan; Xu Li; Yishuang Ning; Zhiyong Wu; Helen M. Meng; Jia Jia; Lianhong Cai
Speech is bimodal in nature. There are close correlations between the acoustic speech signals and visual gestures such as lip movements, facial expressions and head motions. For speech-driven talking avatars, how to derive more representative acoustic features from which to predict more accurate and realistic visual gestures remains an open research problem. Inspired by the promising performance of low level descriptors (LLD) in speech emotion recognition, in this work we investigate the use of LLD features for the task of speech-driven talking avatar. Furthermore, visual gestures are correlated not only with contextual information from past and future acoustic features (e.g. anticipatory co-articulation phenomena) but also with textual information (e.g. textual hints for lip movement). To incorporate such information, we also propose to use a deep bidirectional long short-term memory (DBLSTM) network as the bottleneck feature extractor, which can combine LLD features with contextual information. Experimental results indicate that the proposed LLD-based DBLSTM bottleneck feature outperforms conventional spectrum-related features for the task of speech-driven talking avatar, and that more sophisticated contextual information can further improve the performance.
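As a rough illustration of the architecture described above, the following PyTorch sketch stacks bidirectional LSTM layers around a narrow bottleneck layer whose activations serve as the learned acoustic representation. All layer sizes, the LLD dimensionality and the visual parameter count are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class BottleneckBLSTM(nn.Module):
    """Stacked bidirectional LSTM with a narrow bottleneck layer.

    The bottleneck activations (not the final predictions) are taken as the
    learned acoustic representation for driving the talking avatar.
    """
    def __init__(self, lld_dim=130, hidden_dim=256, bottleneck_dim=64, visual_dim=30):
        super().__init__()
        self.encoder = nn.LSTM(lld_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.bottleneck = nn.Linear(2 * hidden_dim, bottleneck_dim)
        self.decoder = nn.LSTM(bottleneck_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, visual_dim)  # visual gesture parameters

    def forward(self, lld_frames):
        h, _ = self.encoder(lld_frames)        # (batch, frames, 2 * hidden_dim)
        z = torch.tanh(self.bottleneck(h))     # frame-level bottleneck features
        y, _ = self.decoder(z)
        return self.out(y), z                  # predictions and bottleneck features

# Example: a batch of 4 utterances, 200 frames each, 130-dim LLD vectors.
model = BottleneckBLSTM()
pred, bottleneck = model(torch.randn(4, 200, 130))
```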
Multimedia Tools and Applications | 2015
Zhiyong Wu; Yishuang Ning; Xiao Zang; Jia Jia; Fanbo Meng; Helen M. Meng; Lianhong Cai
Emphasis plays an important role in expressive speech synthesis, highlighting the focus of an utterance to draw the attention of the listener. As there are only a few emphasized words in a sentence, data limitation is one of the most important problems for emphatic speech synthesis. In this paper, we analyze contrastive (neutral versus emphatic) speech recordings considering different kinds of contexts, i.e. the relative locations of syllables with respect to the emphasized words. Based on the analysis, we propose a hidden Markov model (HMM) based method for emphatic speech synthesis with a limited amount of data. In this method, decision trees (DTs) are constructed with non-emphasis-related questions using both neutral and emphasis corpora. The data in each leaf node of the DTs are classified into 6 emphasis categories according to the emphasis-related questions. The data in the same emphasis category are grouped into one sub-node and used to train one HMM. As there might be no data for some specific emphasis categories in the leaf nodes of the DTs, a method based on cost calculation is proposed to select a suitable HMM in the same leaf node for predicting parameters. Furthermore, a compensation model is proposed to adjust the predicted parameters. We conduct a series of experiments to evaluate the performance of the approach. The experiments indicate that the proposed emphatic speech synthesis models improve the emphasis quality of synthesized speech while keeping a high degree of naturalness.
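The fallback described above (selecting a substitute HMM within the same leaf node when a given emphasis category has no data) can be sketched as follows. The abstract does not specify the cost function, so the numeric costs and the category encoding below are placeholders, not the paper's actual values.

```python
# Minimal sketch: each decision-tree leaf holds up to 6 emphasis-category HMMs;
# when the requested category has no trained model, the available category with
# the lowest substitution cost is used instead.

def select_hmm(leaf_hmms, target_category, cost):
    """leaf_hmms: dict mapping emphasis category -> trained HMM (or its parameters).
    cost: dict mapping (target, candidate) -> substitution cost."""
    if target_category in leaf_hmms:
        return leaf_hmms[target_category]
    best = min(leaf_hmms, key=lambda c: cost[(target_category, c)])
    return leaf_hmms[best]

# Example: categories 0..5 encode the syllable's position relative to the
# emphasized word; this leaf only has models for categories 0, 2 and 5.
cost = {(t, c): abs(t - c) for t in range(6) for c in range(6)}  # placeholder cost
leaf_hmms = {0: "hmm_0", 2: "hmm_2", 5: "hmm_5"}
print(select_hmm(leaf_hmms, 3, cost))   # -> "hmm_2"
```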
international conference on multimedia and expo | 2016
Xinxing Li; Jiashen Tian; Mingxing Xu; Yishuang Ning; Lianhong Cai
Dynamic music emotion prediction is crucial to emerging applications in music retrieval and recommendation. Considering the influence of temporal context and hierarchical structure on emotion in music, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) based multi-scale regression method. In this method, a post-processing component is applied to each individual DBLSTM output to further enhance temporal context processing, and a fusion component integrates the outputs of all DBLSTM models at different scales. In addition, we investigate how the difference in sequence length between the training and prediction phases affects the performance of DBLSTM. We conduct our experiments on the public database of the Emotion in Music task at MediaEval 2015, and the results show that our method achieves significant improvement compared with state-of-the-art methods.
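A minimal PyTorch sketch of the multi-scale idea: the per-frame features are pooled at several temporal scales, one BLSTM regresses emotion at each scale, and a linear fusion layer combines the per-scale outputs. The dimensions, scales and pooling/upsampling choices are illustrative assumptions, not the paper's exact components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBLSTMRegressor(nn.Module):
    """Several BLSTMs on differently smoothed copies of the feature sequence,
    fused into one valence/arousal prediction per frame."""
    def __init__(self, feat_dim=260, hidden=128, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.blstms = nn.ModuleList(
            [nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
             for _ in scales])
        self.heads = nn.ModuleList([nn.Linear(2 * hidden, 2) for _ in scales])
        self.fusion = nn.Linear(2 * len(scales), 2)  # combine per-scale outputs

    def forward(self, x):                            # x: (batch, frames, feat_dim)
        frames = x.size(1)
        outs = []
        for s, blstm, head in zip(self.scales, self.blstms, self.heads):
            xs = F.avg_pool1d(x.transpose(1, 2), s, stride=s).transpose(1, 2) if s > 1 else x
            h, _ = blstm(xs)
            y = head(h)                              # (batch, frames / s, 2)
            y = F.interpolate(y.transpose(1, 2), size=frames).transpose(1, 2)
            outs.append(y)
        return self.fusion(torch.cat(outs, dim=-1))

model = MultiScaleBLSTMRegressor()
pred = model(torch.randn(2, 60, 260))                # per-frame valence/arousal
```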
international conference on acoustics, speech, and signal processing | 2015
Yishuang Ning; Zhiyong Wu; Jia Jia; Fanbo Meng; Helen M. Meng; Lianhong Cai
This paper investigates the incorporation of hidden Markov model (HMM) based emphatic speech synthesis for audio exaggeration into an audio-visual speech synthesis framework for corrective feedback in computer-aided pronunciation training (CAPT). To improve the voice quality of the synthetic emphatic speech, this paper proposes a new method for HMM training. In this method, the contextual questions for decision tree building are extended by considering emphasis-related information. HMMs are then trained using a small-scale emphatic corpus together with a large-scale neutral corpus. The emphatic corpus is used to ensure the quality of the emphatic speech segments, whereas the neutral corpus is used to further improve the quality of both the non-emphatic speech segments and the emphatic ones. Finally, emphatic speech synthesis is achieved by extending Flite+hts_engine. Experimental results show that our method can synthesize emphatic speech with high quality and make the feedback more discriminatively perceptible.
artificial intelligence methodology systems applications | 2018
Ziwei Zhu; Zhiyong Wu; Runnan Li; Yishuang Ning; Helen Meng
Recurrent neural network (RNN) acoustic models (AMs) with long short-term memory (LSTM) have achieved state-of-the-art performance in LVCSR. Their strong ability to capture context information makes the acoustic features extracted from LSTMs more discriminative. Feature extraction is also crucial to query-by-example spoken term detection (QbyE-STD), especially at the frame level. In this paper, we explore frame-level recurrent neural network representations for QbyE-STD, which are more robust than the original features. In addition, the designed model is lightweight and suitable for the small-footprint requirements of mobile devices. Firstly, we use a traditional RAE to extract frame-level representations and a correspondence RAE to suppress non-semantic information. Then, we use the combination of the two models to extract more discriminative features. Some common tricks such as skipping frames are used to make the model learn more context information. Experiments and evaluations show that the performance of the proposed methods is superior to that of conventional ones under the same computational requirements.
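A rough PyTorch sketch of the feature extractors described above, assuming a recurrent autoencoder (RAE) trained to reconstruct its own input frames and a correspondence RAE trained to reconstruct aligned frames of another utterance of the same word; the final frame-level feature concatenates the two encoders' outputs. Dimensions, the GRU choice and the alignment source are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameRAE(nn.Module):
    """Recurrent autoencoder: encode a frame sequence, reconstruct a target
    sequence. Used two ways: target = the input itself (plain RAE), or
    target = aligned frames from another utterance of the same word
    (correspondence RAE), which suppresses non-semantic information."""
    def __init__(self, feat_dim=39, hidden=64):
        super().__init__()
        self.enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.dec = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, frames):
        h, _ = self.enc(frames)            # h: frame-level representations
        r, _ = self.dec(h)
        return self.out(r), h

rae, crae = FrameRAE(), FrameRAE()
x = torch.randn(8, 120, 39)                # e.g. MFCC frames
aligned = torch.randn(8, 120, 39)          # stand-in for aligned frames of the same word
loss = nn.functional.mse_loss(rae(x)[0], x) + \
       nn.functional.mse_loss(crae(x)[0], aligned)
# Final frame-level feature: concatenation of the two encoders' outputs.
feature = torch.cat([rae(x)[1], crae(x)[1]], dim=-1)
```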
international conference on acoustics, speech, and signal processing | 2017
Shumei Zhang; Jia Jia; Yishuang Ning
Social media has been rocking the world in recent years, which makes modeling social media content important. However, the heterogeneity of social media data is the main constraint. This paper focuses on inferring emotions from large-scale social media data. Tweets on social media platforms, which often contain heterogeneous information from different combinations of modalities, are utilized to construct a cross-media dataset. How to integrate cross-media information and handle missing modalities are the main challenges. To address these challenges, this paper proposes a Cross-media Auto-Encoder (CAE) that infers emotions from cross-media data by reconstructing missing modalities and integrating heterogeneous representations. In our experiments, we employ a dataset of 226,113 tweets, and our method outperforms several machine learning baselines (+11.11% in terms of F1-measure). Feature contribution analysis also verifies the importance of adopting cross-media features.
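The abstract does not give the CAE's internals, so the following is only a generic multimodal autoencoder sketch of the idea: per-modality encoders map into a shared space, and decoders can reconstruct a missing modality from whichever modalities are present. Dimensions and modality names are assumptions.

```python
import torch
import torch.nn as nn

class CrossMediaAE(nn.Module):
    """Generic sketch of a cross-media autoencoder: each modality (text, image,
    ...) is encoded into a shared space; decoders reconstruct every modality
    from that space, so a missing modality can be filled in from the present ones."""
    def __init__(self, dims, shared=128):
        super().__init__()
        self.enc = nn.ModuleDict({m: nn.Linear(d, shared) for m, d in dims.items()})
        self.dec = nn.ModuleDict({m: nn.Linear(shared, d) for m, d in dims.items()})

    def forward(self, inputs):                        # inputs: dict of present modalities
        z = torch.stack([torch.relu(self.enc[m](x))   # fuse whatever is available
                         for m, x in inputs.items()]).mean(dim=0)
        return {m: dec(z) for m, dec in self.dec.items()}, z

model = CrossMediaAE({"text": 300, "image": 512})
# A tweet with text only: the image features are reconstructed, and the shared
# representation z can feed a downstream emotion classifier.
recon, z = model({"text": torch.randn(16, 300)})
```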
international conference on acoustics, speech, and signal processing | 2017
Yishuang Ning; Zhiyong Wu; Runnan Li; Jia Jia; Mingxing Xu; Helen M. Meng; Lianhong Cai
Bidirectional long short-term memory (BLSTM) recurrent neural networks (RNNs) have achieved state-of-the-art performance in many sequence processing problems given their capability of capturing contextual information. However, for languages with a limited amount of training data, it is still difficult to obtain a high-quality BLSTM model for emphasis detection, the aim of which is to recognize the emphasized speech segments in natural speech. To address this problem, in this paper we propose a multilingual BLSTM (MTL-BLSTM) model in which the hidden layers are shared across different languages while the softmax output layer is language-dependent. The MTL-BLSTM can learn cross-lingual knowledge and transfer this knowledge to both languages to improve the emphasis detection performance. Experimental results demonstrate that our method outperforms the comparison methods by 2–15.6% and 2.9–15.4% in relative F1-measure on the English and Mandarin corpora, respectively.
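A minimal PyTorch sketch of the MTL-BLSTM layout described above: shared BLSTM layers with one language-dependent output layer per language, so both languages contribute to and benefit from the shared parameters. Feature dimensions and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultilingualBLSTM(nn.Module):
    """Shared BLSTM layers with a separate softmax output layer per language."""
    def __init__(self, feat_dim=40, hidden=128, languages=("english", "mandarin")):
        super().__init__()
        self.shared = nn.LSTM(feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Per-language output layer: 2 classes (emphasized / not emphasized).
        self.heads = nn.ModuleDict({lang: nn.Linear(2 * hidden, 2) for lang in languages})

    def forward(self, frames, lang):
        h, _ = self.shared(frames)
        return self.heads[lang](h)        # per-frame logits; softmax applied at loss time

model = MultilingualBLSTM()
logits_en = model(torch.randn(4, 300, 40), "english")
logits_zh = model(torch.randn(4, 300, 40), "mandarin")
```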
affective computing and intelligent interaction | 2015
Xixin Wu; Zhiyong Wu; Yishuang Ning; Jia Jia; Lianhong Cai; Helen M. Meng
Speech is widely used to express one's emotions, intentions, desires, etc. in social network communication, yielding abundant internet speech data with different speaking styles. Such data provides a good resource for social multimedia research. However, since different styles are mixed together in internet speech data, how to classify such data remains a challenging problem. In previous work, utterance-level statistics of acoustic features were utilized as features for classifying speaking styles, ignoring local context information. Long short-term memory (LSTM) recurrent neural networks (RNNs) have achieved exciting success in many research areas, such as speech recognition. They are able to retain context information over long durations, which is important for characterizing speaking styles. Training an LSTM, however, requires a huge amount of labeled data, and for internet speech data classification it is quite difficult to obtain such large-scale labeled data. On the other hand, publicly available data for other tasks (such as speech emotion recognition) offers a new possibility for exploiting LSTMs in this low-resource task. We adopt a retraining strategy to train the LSTM to recognize speaking styles: the network is trained on the emotion and speaking-style datasets sequentially without resetting its weights. Experimental results demonstrate that retraining improves both the training speed and the accuracy of the network in speaking style classification.
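A minimal PyTorch sketch of the retraining strategy, assuming the same network is simply trained on the two datasets in sequence without re-initialization. The loaders, class counts and hyperparameters below are synthetic stand-ins, not the paper's setup.

```python
import torch
import torch.nn as nn

class StyleLSTM(nn.Module):
    """Utterance-level classifier: LSTM over frames, prediction from the last state."""
    def __init__(self, feat_dim=40, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, x):
        _, (h, _) = self.lstm(x)
        return self.cls(h[-1])

def train(model, data, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Synthetic stand-ins for the two corpora (here both use 4 classes for simplicity;
# with different label sets the output layer would be swapped between stages).
emotion_data = [(torch.randn(8, 100, 40), torch.randint(0, 4, (8,))) for _ in range(5)]
style_data = [(torch.randn(8, 100, 40), torch.randint(0, 4, (8,))) for _ in range(5)]

model = StyleLSTM()
train(model, emotion_data, epochs=2)   # stage 1: emotion dataset
train(model, style_data, epochs=2)     # stage 2: speaking-style dataset, weights kept
```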
international conference on signal processing | 2014
Xiao Zang; Zhiyong Wu; Yishuang Ning; Helen M. Meng; Lianhong Cai
Labeling emphatic words in speech recordings plays an important role in building speech corpora for expressive speech synthesis. People generally pronounce some words more strongly than usual, making the speech more expressive and signaling the focus of the sentence. Contrastive word pairs are often pronounced with stronger prominence, and their presence modifies the meaning of the utterance in subtle but important ways. We used a subset of the Switchboard corpus to study the acoustic characteristics of contrastive word pairs and the differences between contrastive and non-contrastive words. Support vector machines (SVMs) are then used to automatically detect contrastive word pairs. We report detection results based on textual and acoustic features; adding acoustic features yields much better performance.
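A small scikit-learn sketch of the detection setup described above, using an SVM over concatenated textual and acoustic features. The feature layout and data here are synthetic stand-ins, not the paper's actual feature set or corpus.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative feature layout: a few textual cues (e.g. POS match, word repetition)
# concatenated with acoustic prominence cues (e.g. pitch range, energy, duration).
rng = np.random.default_rng(0)
X_textual = rng.random((200, 4))           # stand-in textual features per word pair
X_acoustic = rng.random((200, 6))          # stand-in acoustic features per word pair
X = np.hstack([X_textual, X_acoustic])
y = rng.integers(0, 2, 200)                # 1 = contrastive word pair, 0 = not

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))
```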
conference of the international speech communication association | 2015
Helen M. Meng; Jia Jia; Yishuang Ning; Zhiyong Wu; Xiaoyan Lou; Lianhong Cai