Paul Y. Chan
Agency for Science, Technology and Research
Publication
Featured research published by Paul Y. Chan.
Archive | 2010
Ling Cen; Minghui Dong; Haizhou Li; Zhu Liang Yu; Paul Y. Chan
Machine learning concerns the development of algorithms that allow machines to learn via inductive inference from observation data representing incomplete information about a statistical phenomenon. Classification, also referred to as pattern recognition, is an important task in machine learning, by which machines "learn" to automatically recognize complex patterns, to distinguish between exemplars based on their different patterns, and to make intelligent decisions. A pattern classification task generally consists of three modules: a data representation (feature extraction) module, a feature selection or reduction module, and a classification module. The first module aims to find invariant features that best describe the differences between classes. The second module, feature selection or reduction, reduces the dimensionality of the feature vectors used for classification. The classification module finds the actual mapping between patterns and labels based on the features. The objective of this chapter is to investigate machine learning methods for the automatic recognition of emotional states from human speech. It is well known that human speech conveys not only linguistic information but also paralinguistic information, i.e., implicit messages such as the emotional state of the speaker. Human emotions are the mental and physiological states associated with the feelings, thoughts, and behaviors of humans. The emotional states conveyed in speech play an important role in human-human communication, as they provide important information about speakers and their responses to the outside world. The same sentence expressed with different emotions can carry different meanings. It is therefore clearly important for a computer to be able to identify the emotional state expressed by a human subject so that personalized responses can be delivered accordingly.
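As an illustration of the three-module pipeline described above, the following minimal Python sketch extracts utterance-level acoustic features, applies feature selection, and trains a classifier; the specific features, selector and classifier (MFCC/F0/energy statistics, SelectKBest, SVM) are illustrative assumptions, not the chapter's exact configuration.

# Minimal sketch of the three-module pipeline: (1) feature extraction,
# (2) feature selection/reduction, (3) classification.
import numpy as np
import librosa
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def utterance_features(path, sr=16000):
    """Module 1: represent one utterance as a fixed-length acoustic feature vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral shape
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=500, sr=sr)       # pitch contour
    f0 = f0[~np.isnan(f0)] if np.any(~np.isnan(f0)) else np.zeros(1)
    rms = librosa.feature.rms(y=y)                              # energy contour

    def stats(m):
        return np.concatenate([m.mean(axis=-1).ravel(), m.std(axis=-1).ravel()])

    return np.concatenate([stats(mfcc), stats(f0[None, :]), stats(rms)])

# Modules 2 and 3: feature selection followed by a classifier.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=20)),   # dimensionality reduction
    ("svm", SVC(kernel="rbf", C=10.0)),         # pattern classifier
])
# X: one feature vector per utterance, y: emotion labels ("happy", "angry", ...)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)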
international conference on acoustics, speech, and signal processing | 2012
Ling Cen; Minghui Dong; Paul Y. Chan
In this paper, a template-based personalized singing voice synthesis method is proposed. It generates singing voices by converting the narrated lyrics of a song with the use of template recordings. The template voices are parallel speaking and singing voices recorded from professional singers, which are used to derive the transformation models for acoustic feature conversion. When converting a new instance of speech, its acoustic features are modified to approximate those of the actual singing voice based on the transformation models. Since the pitch contour of the synthesized singing is derived from an actual singing voice, it is more natural than modifying a step contour to implement pitch fluctuations such as overshoot and vibrato. Subjective tests show that nearly natural singing quality, with preservation of the speaker's timbre, can be achieved with our method.
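The core idea, deriving the synthetic pitch contour from an actual sung template rather than from rule-generated fluctuations, can be sketched as follows; the WORLD vocoder (pyworld) and the simple linear time-resampling of the template F0 are illustrative assumptions, and the paper's full transformation models are not reproduced here.

# Sketch: transplant the pitch contour of a real sung template onto narrated lyrics.
import numpy as np
import soundfile as sf
import pyworld as pw

def speech_to_singing_f0_transplant(speech_wav, template_sing_wav, out_wav):
    x, fs = sf.read(speech_wav)              # narrated lyrics (mono, float64)
    t_x, fs_t = sf.read(template_sing_wav)   # professional singing template (mono)

    # WORLD analysis of the spoken lyrics: F0, spectral envelope, aperiodicity.
    f0_sp, t_sp = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0_sp, t_sp, fs)
    ap = pw.d4c(x, f0_sp, t_sp, fs)

    # F0 of the sung template, resampled to the speech frame count
    # (a stand-in for the paper's learned alignment/transformation step).
    f0_tmpl, _ = pw.harvest(t_x, fs_t)
    f0_new = np.interp(np.linspace(0, 1, len(f0_sp)),
                       np.linspace(0, 1, len(f0_tmpl)), f0_tmpl)

    # Resynthesize: the speaker's timbre (sp, ap) with the singer's pitch contour.
    y = pw.synthesize(f0_new, sp, ap, fs)
    sf.write(out_wav, y, fs)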
international symposium on chinese spoken language processing | 2010
Ling Cen; Paul Y. Chan; Minghui Dong; Haizhou Li
Emotional speech is one of the key techniques towards natural and realistic conversation between humans and machines. Generating emotional speech by converting neutral speech is desirable, as it allows us to generate emotional speech from many existing text-to-speech systems. The GMM-based method is capable of synthesizing the desired spectrum, while the rule-based (RB) algorithm is effective in implementing the targeted prosodic features. Since spectral and prosodic features are the key factors that project the emotional effects of speech, in this paper we propose synthesizing emotional speech by applying a two-stage transformation that combines the GMM and RB methods. We synthesize happy, angry and sad speech and compare the proposed method with GMM linear transformation and RB transformation respectively. Listening tests show that the speech synthesized by the proposed method is perceived to best portray the targeted emotion.
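A minimal sketch of the GMM spectral-conversion stage is given below, assuming time-aligned pairs of neutral and emotional spectral feature vectors are available; it uses the standard joint-density GMM mapping by conditional expectation, and the rule-based prosody stage is not shown.

# Sketch of GMM-based spectral conversion (neutral -> emotional).
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

def train_gmm_mapping(X_src, Y_tgt, n_components=8):
    """Fit a joint-density GMM on stacked [source; target] spectral features."""
    Z = np.hstack([X_src, Y_tgt])                    # (frames, 2d)
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(Z)

def convert(gmm, X_src):
    """Map neutral spectra to emotional spectra by conditional expectation."""
    d = X_src.shape[1]
    Y_hat = np.zeros_like(X_src)
    # Responsibilities of each component given the source part only.
    resp = np.stack([
        w * multivariate_normal.pdf(X_src, m[:d], C[:d, :d])
        for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)
    ], axis=1)
    resp /= resp.sum(axis=1, keepdims=True) + 1e-12
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        A = C[d:, :d] @ np.linalg.inv(C[:d, :d])     # cross-covariance regression
        Y_k = m[d:] + (X_src - m[:d]) @ A.T
        Y_hat += resp[:, k:k + 1] * Y_k
    return Y_hat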
international conference on multimedia and expo | 2010
Tin Lay New; Minghui Dong; Paul Y. Chan; Xi Wang; Bin Ma; Haizhou Li
In this paper, a voice conversion system that converts spoken vowels into singing vowels is proposed. Given the spoken vowels and their musical score, the system generates singing vowels. The system modifies the speech parameters of fundamental frequency (F0), duration and spectral properties to produce the singing voice. The F0 contour is obtained using F0 fluctuation information from training singing voices and the music score. The duration of each spoken vowel is stretched or shortened according to the length of the corresponding musical note. To transform the speech spectrum into a singing spectrum, two approaches are employed: the first uses spectral mean shifting and variance scaling, and the second uses a weighted linear transformation. The system is tested on a database of 75 speech and 30 singing recordings of vowels. The results show that the proposed system is able to convert spoken vowels into singing vowels with a quality very close to the target singing voice.
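The first spectral-transformation approach, mean shifting and variance scaling, can be sketched in a few lines; the feature representation (e.g. mel-cepstra) is an assumption.

# Sketch: shift per-dimension means and scale variances of speech spectral
# features toward the global statistics of singing-voice training data.
import numpy as np

def mean_var_transform(speech_feats, speech_train_feats, singing_train_feats):
    """speech_feats: (frames, dims) features of the utterance to convert.
    *_train_feats: training features used only for their global statistics."""
    mu_s, sd_s = speech_train_feats.mean(0), speech_train_feats.std(0)
    mu_t, sd_t = singing_train_feats.mean(0), singing_train_feats.std(0)
    # Normalize to the speech statistics, then rescale to the singing statistics.
    return (speech_feats - mu_s) / (sd_s + 1e-8) * sd_t + mu_t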
international conference on digital signal processing | 2015
Paul Y. Chan; Minghui Dong; Yi Qian Lim; Ashleigh Toh; Elliot Chong; Mantita Yeo; Megan Chua; Haizhou Li
This paper presents our work on formant excursion in the human singing voice. In singing voice synthesis, numerous methods have been proposed to modify pitch and energy over time in order to achieve better expressiveness and naturalness [1]-[3]. Methods to modify the spectral envelope, however, remain conservative [4], [5]. An expressive singer, nevertheless, employs different techniques to modify his vocal spectra extensively throughout a song [6]. This motivates our study of formant excursion. We hypothesize that the level of semantic reliance on vowels limits the range of formant excursion, and we develop a method to find |Ξ|, a measure of isolated spectral distortion attributed to singing expressiveness, independent of the spectral differences inherent between speech and singing. With this, we are able to better parameterize spectral modifications in the singing voice towards a dynamic spectral model for singing synthesis.
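|Ξ| itself is defined in the paper and is not reproduced here. Purely as an illustration, spectral distortion between two aligned renditions of the same sung vowel can be quantified with mel-cepstral distortion, so that comparing an expressive rendition against a plain sung rendition (rather than against speech) leaves out the spectral differences inherent between speech and singing.

# Illustration only: mel-cepstral distortion between two aligned renditions.
import numpy as np
import librosa

def mel_cepstral_distortion(y_a, y_b, sr=16000, n_mfcc=25):
    A = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=n_mfcc)[1:]   # drop the energy term
    B = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=n_mfcc)[1:]
    # Align frames with DTW before measuring the distortion.
    _, wp = librosa.sequence.dtw(X=A, Y=B, metric="euclidean")
    diff = A[:, wp[:, 0]] - B[:, wp[:, 1]]
    return (10.0 / np.log(10)) * np.sqrt(2.0) * np.mean(np.linalg.norm(diff, axis=0))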
international conference on asian language processing | 2015
Gillian Chua; Qian Ci Chang; Ye Won Park; Paul Y. Chan; Minghui Dong; Haizhou Li
In speech, emotion is freely carried in the fundamental frequency, rate of speech, intensity and spectrum of the human voice, although the spectrum is largely dictated by the words spoken. In singing, however, fundamental frequency and rhythm are constrained by the melody of the song, while the spectrum is constrained by the song lyrics. Nevertheless, a great deal of emotion is carried in song, and the same song, with the same melody and rhythm constraints, may be expressed in a variety of emotions. This paper investigates how different emotions may be conveyed in the human singing voice given the constraints of song, and further identifies which subtle variations correspond to which emotions.
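A sketch of extracting the acoustic correlates discussed above (fundamental frequency, intensity, spectral shape and a rough voicing proxy) is given below, so that renditions of the same song sung with different emotions can be compared; the parameter choices are illustrative.

# Sketch: per-recording acoustic correlates for comparing emotional renditions.
import numpy as np
import librosa

def emotion_correlates(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
    f0v = f0[voiced] if voiced.any() else np.zeros(1)
    return {
        "f0_mean": np.nanmean(f0v),
        "f0_range": np.nanmax(f0v) - np.nanmin(f0v),
        "intensity_mean": rms.mean(),
        "intensity_var": rms.var(),
        "spectral_centroid_mean": centroid.mean(),
        "voiced_ratio": voiced.mean(),   # rough articulation/phrasing proxy
    }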
international symposium on chinese spoken language processing | 2010
Minghui Dong; Paul Y. Chan; Ling Cen; Haizhou Li
Proper alignment of a singing voice to its corresponding musical score is particularly important for singing voice analysis and processing in many applications. This paper proposes a method to align a singing voice with its MIDI melody. The MIDI is first converted into a synthesized music audio file. Spectral features are then calculated for both the singing voice and the MIDI synthesis. Next, dynamic time warping is used to perform forced alignment between the two feature sequences. Experiments show that the alignment results are promising.
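The pipeline can be sketched as follows; pretty_midi for MIDI synthesis and chroma features as the spectral representation are illustrative choices, not necessarily those of the paper.

# Sketch: synthesize the MIDI melody, extract features from both signals,
# then force-align them with dynamic time warping.
import numpy as np
import librosa
import pretty_midi

def align_singing_to_midi(singing_wav, midi_file, sr=16000, hop=512):
    y, _ = librosa.load(singing_wav, sr=sr)
    midi_audio = pretty_midi.PrettyMIDI(midi_file).synthesize(fs=sr)

    # Chroma features are robust to the timbre gap between voice and synthesis.
    C_voice = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    C_midi = librosa.feature.chroma_stft(y=midi_audio, sr=sr, hop_length=hop)

    # Forced alignment by dynamic time warping over the two feature sequences.
    _, wp = librosa.sequence.dtw(X=C_voice, Y=C_midi, metric="cosine")
    wp = np.flip(wp, axis=0)                      # put the path in time order
    return wp * hop / sr                          # (singing_time, midi_time) pairs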
international conference on multimedia and expo | 2011
Paul Y. Chan; Minghui Dong; Siu Wa Lee; Ling Cen
This paper presents our work on the automatic synthesis of vocal harmony. Existing approaches either allow dissonances (i.e. non-harmonious or clashing intervals) at various locations or require some musical ability on the part of the user. We have developed a method that automatically synthesizes vocal harmony even for ordinary singers with a poor sense of harmony and rhythm. We evaluate our method by means of spectrogram comparison as well as subjective listening tests. A spectrogram comparison of our method and two popular existing methods against the human voice shows that our method is least dissonant and most similar to natural human vocals. Subjective listening tests conducted separately for experts and non-experts in the field confirm that the vocal harmony synthesized using our method sounds best in terms of consonance, inter-syllable transition, naturalness and appeal.
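The paper's method is not reproduced here. Purely as a rough illustration of avoiding dissonances without user input, a harmony line can be generated by targeting the diatonic third above each melody note, snapped to the song's key, and pitch-shifting the vocal by the resulting interval; key detection and note segmentation are assumed.

# Rough illustration: diatonic-third harmony for one sung note segment.
import librosa

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]          # semitone offsets within one octave

def third_above(midi_note, key_root=0):
    """Return the scale tone two degrees above the note, within the major key."""
    degree = min(range(7), key=lambda i: abs(MAJOR_SCALE[i] - (midi_note - key_root) % 12))
    octave, target = divmod(degree + 2, 7)
    return key_root + MAJOR_SCALE[target] + 12 * (octave + (midi_note - key_root) // 12)

def harmonize_segment(y, sr, melody_midi_note, key_root=0):
    """Pitch-shift one sung note segment to its diatonic third above."""
    shift = third_above(melody_midi_note, key_root) - melody_midi_note
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=shift)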
international conference on asian language processing | 2011
Ling Cen; Minghui Dong; Paul Y. Chan
In this paper, an L1-regularized linear regression method is proposed to model the relationship between linguistic features and prosodic parameters in text-to-speech (TTS) synthesis. By formulating prosodic prediction as a convex problem, it can be solved using very efficient numerical methods. The performance is comparable to that of Classification and Regression Trees (CART), a widely used approach for prosodic prediction, while the computational load can be as low as 76% of that required by CART.
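A minimal sketch of the comparison is given below, with lasso regression standing in for the L1-regularized model and a regression tree standing in for CART; the feature encoding and the prosodic target (e.g. log duration) are assumptions.

# Sketch: L1-regularized linear regression vs. a regression tree for prosody prediction.
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# X: encoded linguistic features per unit, y: prosodic target (e.g. log duration).
def compare_models(X, y):
    lasso = Lasso(alpha=0.01)                 # convex; solved by efficient coordinate descent
    cart = DecisionTreeRegressor(max_depth=8) # stand-in for CART
    for name, model in [("lasso", lasso), ("cart", cart)]:
        score = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
        print(name, -score.mean())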
international symposium on chinese spoken language processing | 2010
Paul Y. Chan; Minghui Dong; Ling Cen; Haizhou Li
In this paper, we propose a psychoacoustic approach to enhancing speech intelligibility in noise. Based on the relationship between the short-term spectral movement of a sound and a listener's sensitivity towards it, we conjecture that humans rely greatly on Inter-Phoneme Spectral Gradients (IPSGs) to distinguish phonemes, especially when the short-term speech spectrum is masked by extremely high levels of noise. We then explain how the IPSG may most effectively be steepened, introducing the concept of Formant Contrast. The effectiveness of this process is validated with spectral analysis and listening tests, verifying our initial deduction. In this, we present a simple yet novel and effective method of improving speech intelligibility, especially in extremely high-noise environments.
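Formant Contrast is defined in the paper and is not reproduced here. Purely as a rough illustration, spectral peaks can be exaggerated frame by frame by raising each normalized magnitude spectrum to a power greater than one, which sharpens formant peaks and steepens the spectral gradients between adjacent frames.

# Rough illustration: per-frame spectral contrast enhancement.
import numpy as np
import librosa

def enhance_spectral_contrast(y, gamma=1.5, n_fft=512, hop=128):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)
    peak = mag.max(axis=0, keepdims=True) + 1e-8
    mag_sharp = peak * (mag / peak) ** gamma      # attenuate valleys, keep peaks
    return librosa.istft(mag_sharp * np.exp(1j * phase), hop_length=hop, length=len(y))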