Wai Chee Yau
RMIT University
Publications
Featured research published by Wai Chee Yau.
International Conference on Computer Graphics, Imaging and Visualisation | 2006
Wai Chee Yau; Dinesh Kumar; Sridhar Poosapadi Arjunan; Sanjay Kumar
This paper describes a new technique for recognizing speech using visual speech information. The video data of the speaker's mouth is represented using grayscale images called motion history images (MHI). The MHI is generated by applying accumulative image differencing to the frames of the video to implicitly represent the temporal information of the mouth movement. The MHIs are decomposed into wavelet sub-images using the discrete stationary wavelet transform (SWT). Three moment-based features (geometric moments, Zernike moments and Hu moments) are extracted from the SWT approximate sub-images. A multilayer perceptron (MLP) artificial neural network (ANN) with a back-propagation learning algorithm is used to classify the moment features. This paper evaluates and compares the image representation ability of the different moments. The initial experiments show that this method can classify English consonants with an error rate of less than 5%.
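A minimal sketch of the accumulative image differencing step described above, assuming the mouth video is available as a list of equally sized grayscale NumPy arrays; the difference threshold and decay duration are illustrative choices, not values reported in the paper:

    import numpy as np

    def motion_history_image(frames, diff_threshold=25, duration=None):
        """Accumulate thresholded frame differences into a single grayscale MHI."""
        if duration is None:
            duration = max(len(frames) - 1, 1)   # decay over the clip length
        mhi = np.zeros(frames[0].shape, dtype=np.float32)
        for prev, curr in zip(frames[:-1], frames[1:]):
            diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
            moving = diff > diff_threshold        # binary motion mask
            mhi = np.where(moving, duration, np.maximum(mhi - 1, 0))
        # scale to 0-255 so the most recent motion appears brightest
        return (255.0 * mhi / duration).astype(np.uint8)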
Computer Analysis of Images and Patterns | 2007
Wai Chee Yau; Dinesh Kumar; Hans Weghorn
This paper presents a novel visual speech recognition approach based on motion segmentation and hidden Markov models (HMM). The proposed method identifies utterances from mouth video, without evaluating voice signals. The facial movements in the video data are represented using 2D spatio-temporal templates (STT). The proposed technique combines the discrete stationary wavelet transform (SWT) and Zernike moments to extract rotation-invariant features from the STTs. HMMs are used as the speech classifier to model English phonemes. The preliminary results demonstrate that the proposed technique is suitable for phoneme classification with high accuracy.
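A rough sketch of the SWT-plus-Zernike feature step, using PyWavelets and mahotas as stand-ins for the authors' implementation; the wavelet, moment degree and radius are illustrative assumptions, and the template is assumed to be a 2D array with even dimensions as level-1 SWT requires:

    import pywt
    from mahotas.features import zernike_moments

    def stt_features(stt, wavelet="haar", degree=8):
        # level-1 stationary wavelet transform: one approximate + three detail sub-images
        (approx, (horiz, vert, diag)), = pywt.swt2(stt, wavelet, level=1)
        # Zernike moments of the approximate sub-image are rotation invariant
        radius = min(approx.shape) // 2
        return zernike_moments(approx, radius, degree=degree)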
Iberoamerican Congress on Pattern Recognition | 2008
Wai Chee Yau; Dinesh Kumar; Tharangini Chinnadurai
This paper presents a lip-reading technique to identify unspoken phones using support vector machines. The proposed system is based on temporal integration of the video data to generate spatio-temporal templates (STT). Sixty-four Zernike moments (ZM) are extracted from each STT. This work proposes a novel feature selection algorithm to reduce the dimensionality of the 64 ZM to 12 features. The proposed technique uses the shape of the probability curve as a goodness measure for optimal feature selection. The feature vectors are classified using non-linear support vector machines. Such a system could be invaluable when it is important to communicate without making a sound, such as giving passwords in public spaces.
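A sketch of the reduce-64-ZM-to-12-and-classify pipeline in scikit-learn; the paper's probability-curve goodness measure is not reproduced here, so a generic ANOVA F-score ranking is used purely as a stand-in:

    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def build_lipreading_classifier(n_features=12):
        return make_pipeline(
            StandardScaler(),
            SelectKBest(f_classif, k=n_features),   # keep the 12 best-ranked ZM
            SVC(kernel="rbf"),                      # non-linear SVM classifier
        )

    # usage with hypothetical data:
    # clf = build_lipreading_classifier()
    # clf.fit(zm_train, labels_train); clf.score(zm_test, labels_test)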
International Journal of Image and Graphics | 2008
Wai Chee Yau; Dinesh Kumar; Sridhar Poosapadi Arjunan
This paper presents a vision-based technique to identify unspoken phones using a small camera located on the headset of the speaker. The system is based on temporal integration of the video data to generate a motion history image (MHI). The paper proposes the use of global features to classify the MHI and compares the use of image moments with the Discrete Cosine Transform (DCT). A comparison between Zernike moments (ZM) and DCT indicates that while the classification accuracies of the two techniques are very comparable (96% for ZM and 94% for DCT) when there is no relative motion between the camera and the mouth, ZM is resilient to rotation of the camera and continues to give good results despite rotation, whereas DCT is sensitive to rotation. Based on the accuracy of the system and its resilience to movement artefacts such as rotation, the authors propose the use of such a system as a human-computer interface. Such a system could be invaluable when it is important to communicate without making a sound, such as giving passwords in an open office or in public spaces.
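A small sketch of the DCT side of the comparison: low-frequency 2D DCT coefficients are extracted from an MHI and from a rotated copy, which illustrates the rotation sensitivity the abstract mentions. The coefficient block size and rotation angle are illustrative assumptions, not values from the paper:

    import numpy as np
    from scipy.fft import dctn
    from scipy.ndimage import rotate

    def dct_features(mhi, block=8):
        coeffs = dctn(mhi.astype(float), norm="ortho")
        return coeffs[:block, :block].ravel()     # keep the low-frequency block

    # mhi = ...  # 2D grayscale motion history image
    # d0 = dct_features(mhi)
    # d1 = dct_features(rotate(mhi, angle=15, reshape=False))
    # np.linalg.norm(d0 - d1) tends to be large, reflecting DCT's sensitivity to rotation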
Advanced Video and Signal Based Surveillance | 2006
Wai Chee Yau; Dinesh Kumar; Sridhar Poosapadi Arjunan
This paper reports on a visual speech recognition method that is invariant to translation, rotation and scale. Dynamic features representing the mouth motion are extracted from the video data using a motion segmentation technique termed the motion history image (MHI). The MHI is generated by applying an accumulative image differencing technique to the sequence of mouth images. Invariant features are derived from the MHI using a feature extraction algorithm that combines the Discrete Stationary Wavelet Transform (SWT) and moments. A 2D SWT at level one is applied to decompose the MHI into one approximate and three detail sub-images. The feature descriptors consist of three moments (geometric moments, Hu moments and Zernike moments) computed from the SWT approximate image. The moment features are normalized to achieve the invariance properties. An artificial neural network (ANN) with a back-propagation learning algorithm is used to classify the moment features. Initial experiments were conducted to test the sensitivity of the proposed approach to rotation, translation and scaling of the mouth images, and yielded promising results.
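A minimal sketch of the Hu-moment part of the feature descriptor, computed from the SWT approximate image with OpenCV; the log-scaling shown is a common normalization convention and is an assumption rather than the authors' exact normalization:

    import cv2
    import numpy as np

    def hu_moment_features(approx_image):
        m = cv2.moments(approx_image.astype(np.float32))
        hu = cv2.HuMoments(m).flatten()   # 7 translation-, scale- and rotation-invariant values
        # log-scale so the widely varying magnitudes become comparable
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)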
IEEE Region 10 Conference | 2008
Wai Chee Yau; Sridhar Poosapadi Arjunan; Dinesh Kumar
This paper presents a silent speech recognition technique based on facial muscle activity and video, without evaluating any voice signals. This research examines the use of facial surface electromyogram (SEMG) to identify unvoiced vowels and a vision-based technique to classify unvoiced consonants. The moving root mean square (RMS) of the SEMG signals of four facial muscles is used to segment the signals and to identify the start and end of a silently spoken vowel. Visual features are extracted from the mouth video of a speaker silently uttering consonants using motion segmentation and image moment techniques. The SEMG features and visual features are classified using feedforward multilayer perceptron (MLP) neural networks. The preliminary results demonstrate that the proposed technique yields a high recognition rate for the classification of unvoiced vowels using SEMG features. Similarly, promising results are obtained in the identification of consonants using visual features. The results demonstrate that the system is easy to train for a new user and suggest that such a system works reliably for voiceless, simple speech-based commands in a human-computer interface when it is trained for that user.
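A sketch of the moving-RMS segmentation step for one SEMG channel, assuming a 1D sample array and a known sampling rate; the window length and activity threshold are illustrative assumptions, not values reported in the paper:

    import numpy as np

    def moving_rms(semg, fs, window_s=0.1):
        win = max(int(window_s * fs), 1)
        kernel = np.ones(win) / win
        return np.sqrt(np.convolve(semg.astype(float) ** 2, kernel, mode="same"))

    def segment_utterance(semg, fs, k=3.0):
        rms = moving_rms(semg, fs)
        baseline = np.median(rms)             # rest-level activity estimate
        active = rms > k * baseline           # samples above the activity threshold
        idx = np.flatnonzero(active)
        return (idx[0], idx[-1]) if idx.size else None   # start and end sample of the vowel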
International Conference on Enterprise Information Systems | 2007
Sridhar Poosapadi Arjunan; Hans Weghorn; Dinesh Kumar; Ganesh R. Naik; Wai Chee Yau
This research examines the evaluation of fSEMG (facial surface electromyogram) for recognizing speech utterances in the English and German languages. The raw sampling is performed without sensing any audio signal, and the system is designed for Human Computer Interaction (HCI) based on voice commands. An effective technique is presented which exploits the activity of the articulatory facial muscles and human factors for silent vowel recognition. The muscle signals are reduced to activity parameters by temporal integration, and the matching process is performed by an artificial back-propagation neural network that has to be trained for each individual user. In the experiments, different speaking styles and speeds and different languages were investigated. Cross-validation was used to convert a limited set of single-shot experiments into a broader statistical reliability test of the classification method. The experimental results show that this technique yields high recognition rates for all participants in both languages. These results also show that the system is easy to train for a human user, which suggests that the described recognition approach can work reliably for simple vowel-based commands in HCI, whether the user speaks one or more languages, as well as for people who suffer from certain speech disabilities.
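A sketch of the per-user classification step, using scikit-learn's MLPClassifier as a stand-in for the authors' back-propagation network; the hidden-layer size and the assumption that each utterance is summarized as one integrated activity value per muscle are illustrative choices:

    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    def train_user_model(activity_params, vowel_labels):
        """activity_params: (n_utterances, n_muscles) temporally integrated SEMG values."""
        model = make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
        )
        return model.fit(activity_params, vowel_labels)   # retrained for each individual user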
Computational Models of Complex Systems | 2014
Sridhar Poosapadi Arjunan; Wai Chee Yau; Dinesh Kumar
There is an urgent need for interfaces that directly employ the natural communication and manipulation skills of humans. Vision-based systems that can identify small actions and are suitable for communication applications will allow deployment for machine control by people with restricted limb movements, such as neuro-trauma patients. Because of the limited abilities of these people, it is also important that such systems have inbuilt intelligence, learn about the user, and reconfigure themselves appropriately. Patients who have suffered neuro-trauma often have restricted body and limb movements. In such cases, hand, arm and body movements may be impossible, so head activity and facial expression become important in designing human-computer interface (HCI) systems for machine control. Silent speech-based assistive technologies (AT) are important for users who have difficulty vocalizing, by providing the flexibility to control computers without making a sound. This chapter evaluates the feasibility of using facial muscle activity signals and mouth video to identify speech commands in the absence of voice signals. It investigates the classification power of mouth videos in identifying English vowels and consonants. This research also examines the use of non-invasive facial surface electromyogram (SEMG) to identify unvoiced English and German vowels based on the muscle activity, and to provide feedback to the visual system. The results suggest that video-based systems and facial muscle activity work reliably for simple speech-based commands for AT.
International Journal of Electronic Security and Digital Forensics | 2008
Wai Chee Yau; Dinesh Kumar; Hans Weghorn
This article presents a secure method for the identification of voiceless commands using mouth images, without evaluating sound signals. The main limitation of voice recognition technologies for internet applications is that the commands will be audible to other people in the vicinity. The proposed technique identifies the unspoken utterances using support vector machines. The proposed system is based on temporal integration of the video data to generate spatio-temporal templates (STT). Sixty-four Zernike moments are extracted from each STT. The experimental results demonstrate that the proposed system yields promising results in recognising English phonemes. The proposed technique is demonstrated to be invariant to global variations of illumination level. Such a system could be invaluable when it is important to communicate without making a sound, such as entering passwords for internet applications on mobile devices.
4th International Conference of Global e-Security | 2008
Wai Chee Yau; Dinesh Kumar; Hans Weghorn
Interest in voice recognition technologies for internet applications is growing due to the flexibility of speech-based communication. The major drawback of using sound for internet access with computers is that the commands will be audible to other people in the vicinity. This paper examines a secure and voiceless method for recognition of speech-based commands using video, without evaluating sound signals. The proposed approach represents mouth movements in the video data using 2D spatio-temporal templates (STT). Zernike moments (ZM) are computed from the STT and fed into support vector machines (SVM) to be classified into one of the utterances. The experimental results demonstrate that the proposed technique produces a high accuracy of 98% in a phoneme classification task. The proposed technique is demonstrated to be invariant to global variations of illumination level. Such a system is useful for securely interpreting user commands for internet applications on mobile devices.
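An end-to-end sketch of the STT-to-Zernike-moments-to-SVM pipeline described above, using mahotas and scikit-learn; the moment degree, radius and SVM kernel are assumptions, and the 98% figure comes from the paper, not from this code:

    import numpy as np
    from mahotas.features import zernike_moments
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    def zernike_features(stt, degree=12):
        radius = min(stt.shape) // 2
        return zernike_moments(stt, radius, degree=degree)

    def train_phoneme_classifier(stt_images, phoneme_labels):
        X = np.vstack([zernike_features(s) for s in stt_images])
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        return clf.fit(X, phoneme_labels)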